# Phase 2: Feature Engineering

This notebook extracts temporal, spatial, operational, sequence, and contextual features from the processed Phase 1 flight data for use in anomaly detection.

## Objectives
1. Extract temporal features (duration, phase times, taxi times)
2. Extract spatial features (distance, trajectory sinuosity, altitude patterns)
3. Extract operational features (runway/taxiway events, ground complexity)
4. Extract sequence features (event n-grams, state transitions)
5. Extract contextual features (airport norms, peer comparisons)
6. Validate and analyze feature distributions
7. Prepare features for machine learning

## Prerequisites
- Phase 1 must be completed
- Processed data files must exist in `data/processed/`
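
A quick check along these lines can confirm the prerequisites before the rest of the notebook runs. It is a minimal sketch: the helper name `missing_processed_files` is hypothetical, and the two file names are taken from the loader's log output in Step 1.

```python
from pathlib import Path

# File names as they appear in the loader's log output; adjust if yours differ.
EXPECTED_FILES = ("events_sorted.csv.gz", "flight_summary.csv.gz")


def missing_processed_files(processed_dir, expected=EXPECTED_FILES):
    """Return the expected Phase 1 outputs that are absent from processed_dir."""
    processed_dir = Path(processed_dir)
    return [name for name in expected if not (processed_dir / name).is_file()]
```

Calling `missing_processed_files("data/processed")` returns an empty list when both files are present.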


In [1]:
# Setup and Import Libraries
import sys
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Add project root to path
project_root = Path.cwd().parent if Path.cwd().name == 'notebooks' else Path.cwd()
sys.path.insert(0, str(project_root))

# Import feature engineering modules
from src.features import (
    load_processed_data,
    extract_all_features,
    prepare_features_for_ml,
    save_features,
    run_feature_engineering_pipeline
)
from src.utils.helpers import load_config, ensure_dir

print("✓ Libraries imported successfully")
print(f"Project root: {project_root}")


✓ Libraries imported successfully
Project root: c:\Users\aiish\OneDrive\Desktop\MSDA-SJSU\Fall 2025\Big Data\project


In [2]:
# Configuration and Directory Setup
config = load_config(project_root / 'config' / 'config.yaml')

# Setup directories
output_dir = project_root / config['output']['figures_dir']
ensure_dir(output_dir)

# Set visualization style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✓ Configuration loaded")
print(f"Figures will be saved to: {output_dir}")


✓ Configuration loaded
Figures will be saved to: c:\Users\aiish\OneDrive\Desktop\MSDA-SJSU\Fall 2025\Big Data\project\outputs\figures


## Step 1: Load Processed Data

Load the processed data from Phase 1 (events and flight summaries).


In [3]:
# Load processed data from Phase 1
processed_dir = project_root / config['data']['processed_dir']

try:
    events_df, flight_summary_df = load_processed_data(str(processed_dir))
    
    print(f"\nData Summary:")
    print(f"  Events: {len(events_df):,} records")
    print(f"  Flights: {len(flight_summary_df):,} unique flights")
    print(f"\nEvent columns: {list(events_df.columns)}")
    print(f"\nFlight summary columns: {list(flight_summary_df.columns)}")
    
except FileNotFoundError as e:
    print(f"❌ Error: {e}")
    print("\nPlease run Phase 1 notebook first to generate processed data.")
    raise


Loading events from c:\Users\aiish\OneDrive\Desktop\MSDA-SJSU\Fall 2025\Big Data\project\data\processed\events_sorted.csv.gz...
✓ Loaded 9,554,720 events
Loading flight summary from c:\Users\aiish\OneDrive\Desktop\MSDA-SJSU\Fall 2025\Big Data\project\data\processed\flight_summary.csv.gz...
✓ Loaded 3,833,918 flights

Data Summary:
  Events: 9,554,720 records
  Flights: 3,833,918 unique flights

Event columns: ['YEAR', 'MONTH', 'DAY_OF_MONTH', 'DAY_OF_WEEK', 'FL_DATE', 'OP_UNIQUE_CARRIER', 'ORIGIN', 'ORIGIN_CITY_NAME', 'ORIGIN_STATE_NM', 'DEST', 'DEST_CITY_NAME', 'DEST_STATE_NM', 'CRS_DEP_TIME', 'DEP_TIME', 'DEP_DELAY', 'DEP_DELAY_NEW', 'DEP_DEL15', 'DEP_DELAY_GROUP', 'DEP_TIME_BLK', 'TAXI_OUT', 'WHEELS_OFF', 'WHEELS_ON', 'TAXI_IN', 'CRS_ARR_TIME', 'ARR_TIME', 'ARR_DELAY', 'ARR_DELAY_NEW', 'ARR_DEL15', 'ARR_DELAY_GROUP', 'ARR_TIME_BLK', 'CANCELLED', 'CANCELLATION_CODE', 'DIVERTED', 'DUP', 'CRS_ELAPSED_TIME', 'ACTUAL_ELAPSED_TIME', 'AIR_TIME', 'FLIGHTS', 'DISTANCE', 'DISTANCE_GROUP', 'CA

## Step 2: Extract All Features

Extract features from all five categories:
1. **Temporal Features**: Duration, phase times, taxi times, time-of-day
2. **Spatial Features**: Distance, trajectory sinuosity, altitude patterns
3. **Operational Features**: Runway/taxiway events, ground complexity
4. **Sequence Features**: Event n-grams, state transitions
5. **Contextual Features**: Airport norms, peer comparisons
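
Of the features listed above, trajectory sinuosity is perhaps the least self-explanatory. A common definition, assumed here because the implementation in `src.features` is not shown, is the ratio of the path length actually travelled to the great-circle distance between the first and last points, so a perfectly direct path scores 1.0. The function names and the latitude/longitude inputs below are illustrative.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))


def trajectory_sinuosity(lats, lons):
    """Path length divided by the straight-line distance between endpoints."""
    lats = np.asarray(lats, dtype=float)
    lons = np.asarray(lons, dtype=float)
    if lats.size < 2:
        return np.nan
    # Sum the lengths of consecutive segments along the track.
    path = haversine_m(lats[:-1], lons[:-1], lats[1:], lons[1:]).sum()
    direct = haversine_m(lats[0], lons[0], lats[-1], lons[-1])
    # A closed loop has no meaningful ratio, so report NaN instead of dividing by zero.
    return path / direct if direct > 0 else np.nan
```

Values well above 1.0 indicate detours or holding patterns, which is what makes the ratio a candidate anomaly signal.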


In [4]:
# Extract all features
# This may take 10-30 minutes depending on data size
features_df = extract_all_features(
    events_df,
    flight_summary_df,
    feature_types=None  # None = extract all feature types
)

print("\nFeature extraction complete!")
print(f"Total features: {len(features_df.columns):,}")
print(f"Total flights: {len(features_df):,}")



FEATURE ENGINEERING PIPELINE

[1/5] Extracting temporal features...
No 'timestamp' column found in events_df; skipping temporal feature extraction.

[2/5] Extracting spatial features...
Extracting spatial features for all flights...


KeyboardInterrupt: 
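
The log above shows temporal extraction being skipped because `events_df` has no `timestamp` column. The event columns listed in Step 1 do include `FL_DATE` and `DEP_TIME`, so one possible remedy is to derive a timestamp before calling `extract_all_features`. The sketch below assumes `DEP_TIME` is a local clock time encoded as `hhmm` (for example 1430), which should be verified against the Phase 1 output. The helper name is hypothetical, and which event time the pipeline actually expects is not stated in this notebook.

```python
import pandas as pd


def departure_timestamp(df, date_col="FL_DATE", time_col="DEP_TIME"):
    """Combine a date column and an hhmm time column into a timestamp Series."""
    dates = pd.to_datetime(df[date_col], errors="coerce")
    hhmm = pd.to_numeric(df[time_col], errors="coerce")
    hours = hhmm // 100
    minutes = hhmm % 100
    # Adding the offset as a Timedelta lets a value of 2400 roll into the next day.
    offset = pd.to_timedelta(hours * 60 + minutes, unit="m")
    # Rows with a missing date or time come out as NaT.
    return dates + offset
```

A possible use is `events_df["timestamp"] = departure_timestamp(events_df)`. The result carries no time zone, so durations across airports in different zones would still need conversion.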

## Step 3: Feature Overview

Examine the extracted features and their basic statistics.
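
When features are grouped by name, substring matching can place one column in several categories. For instance, `ground_complexity_score` contains both `ground` and `ground_complexity`. If mutually exclusive counts are preferred, one option is to assign each column to the first category whose keywords match. The sketch below uses a hypothetical helper `categorize_columns` and a shortened, illustrative keyword table whose order sets the priority.

```python
# Illustrative keyword table; order matters because the first match wins.
CATEGORY_KEYWORDS = {
    "Operational": ["runway", "taxiway", "parking", "ground_complexity", "event_count"],
    "Contextual": ["zscore", "deviation", "percentile", "airport", "global"],
    "Sequence": ["gram", "transition", "sequence", "pattern"],
    "Spatial": ["distance", "trajectory", "altitude", "coordinate"],
    "Temporal": ["duration", "time", "hour", "day", "phase", "taxi", "ground"],
}


def categorize_columns(columns, keywords=CATEGORY_KEYWORDS):
    """Map each column to the first category with a matching keyword."""
    assigned = {category: [] for category in keywords}
    assigned["Other"] = []
    for col in columns:
        name = col.lower()
        for category, words in keywords.items():
            if any(word in name for word in words):
                assigned[category].append(col)
                break
        else:
            # No keyword matched in any category.
            assigned["Other"].append(col)
    return assigned
```

With this approach the per-category counts sum to the total number of columns, and the `Other` bucket shows which names the keyword table misses.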


In [None]:
# Display feature overview
print("Feature Overview:")
print("="*60)
print(f"Total features: {len(features_df.columns)}")
print(f"Total flights: {len(features_df)}")
print(f"\nFeature categories:")

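# Categorize features by substring match on the column name.
# Note: categories can overlap, e.g. 'ground_complexity_score' matches both
# Temporal ('ground') and Operational ('ground_complexity'), so counts may not sum to the total.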
feature_categories = {
    'Temporal': [col for col in features_df.columns if any(x in col.lower() for x in ['duration', 'time', 'hour', 'day', 'phase', 'taxi', 'ground'])],
    'Spatial': [col for col in features_df.columns if any(x in col.lower() for x in ['distance', 'lat', 'lon', 'trajectory', 'altitude', 'coordinate'])],
    'Operational': [col for col in features_df.columns if any(x in col.lower() for x in ['runway', 'taxiway', 'parking', 'ground_complexity', 'event_count'])],
    'Sequence': [col for col in features_df.columns if any(x in col.lower() for x in ['gram', 'transition', 'sequence', 'pattern'])],
    'Contextual': [col for col in features_df.columns if any(x in col.lower() for x in ['zscore', 'deviation', 'percentile', 'airport', 'global', 'time_deviation'])]
}

for category, cols in feature_categories.items():
    print(f"  {category}: {len(cols)} features")

print(f"\nFirst few rows:")
print(features_df.head())


## Step 4: Feature Statistics and Missing Values

Inspect missing values and summary statistics for the extracted features. Missing values are only reported here; they are handled in Step 7.
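
When inspecting missing values on a table with millions of rows, the fraction missing per feature is often easier to judge than a raw count. A minimal sketch, where the helper name `missing_report` and the 50% threshold are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd


def missing_report(df, drop_threshold=0.5):
    """Summarize missing values per column as counts and fractions."""
    report = pd.DataFrame({
        "missing_count": df.isnull().sum(),
        "missing_fraction": df.isnull().mean(),
    })
    # Flag columns that are mostly empty as candidates for removal.
    report["drop_candidate"] = report["missing_fraction"] > drop_threshold
    return report.sort_values("missing_fraction", ascending=False)
```

Calling `missing_report(features_df).head(20)` would list the sparsest features first.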


In [None]:
# Check for missing values
missing_stats = features_df.isnull().sum()
missing_stats = missing_stats[missing_stats > 0].sort_values(ascending=False)

if len(missing_stats) > 0:
    print("Missing Values:")
    print("="*60)
    print(missing_stats)
    print(f"\nTotal missing values: {missing_stats.sum()}")
    print(f"Features with missing values: {len(missing_stats)}")
    
    # Visualize missing values
    plt.figure(figsize=(12, 6))
    missing_stats.head(20).plot(kind='barh')
    plt.title('Top 20 Features with Missing Values')
    plt.xlabel('Missing Count')
    plt.tight_layout()
    plt.savefig(output_dir / 'feature_missing_values.png', dpi=300, bbox_inches='tight')
    plt.show()
else:
    print("✓ No missing values found!")


In [None]:
# Basic statistics for numeric features
numeric_cols = features_df.select_dtypes(include=[np.number]).columns
print(f"Numeric features: {len(numeric_cols)}")

# Display summary statistics
print("\nSummary Statistics (sample of key features):")
key_features = [
    'total_duration_seconds', 'n_events', 'total_distance_meters',
    'altitude_max', 'ground_complexity_score', 'sequence_complexity_score',
    'global_zscore_duration', 'airport_zscore_duration'
]

available_key_features = [f for f in key_features if f in features_df.columns]
if available_key_features:
    print(features_df[available_key_features].describe())


## Step 5: Feature Distributions

Visualize distributions of key features to understand their characteristics.
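
One point to keep in mind with heavy-tailed features: limiting the x-axis after plotting does not change the bin edges, so when a few extreme values stretch the range, most flights can end up in one or two visible bins. Filtering to the chosen quantile before binning avoids that. A sketch of the idea using `np.histogram`, with a hypothetical helper name:

```python
import numpy as np
import pandas as pd


def clipped_histogram(values, upper_q=0.99, bins=50):
    """Histogram of the non-missing values at or below the upper_q quantile."""
    values = pd.Series(values, dtype=float).dropna()
    if values.empty:
        return np.array([], dtype=int), np.array([])
    cutoff = values.quantile(upper_q)
    # Drop the extreme tail first so the bins cover only the plotted range.
    kept = values[values <= cutoff]
    counts, edges = np.histogram(kept, bins=bins)
    return counts, edges
```

In a plotting cell, the same effect comes from passing the filtered series to `hist` rather than relying on `set_xlim` alone.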


In [None]:
# Plot distributions of key temporal features
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Duration distribution
if 'total_duration_seconds' in features_df.columns:
    axes[0, 0].hist(features_df['total_duration_seconds'] / 3600, bins=50, edgecolor='black')
    axes[0, 0].set_xlabel('Duration (hours)')
    axes[0, 0].set_ylabel('Frequency')
    axes[0, 0].set_title('Flight Duration Distribution')
    axes[0, 0].set_xlim(0, features_df['total_duration_seconds'].quantile(0.99) / 3600)

# Number of events
if 'n_events' in features_df.columns:
    axes[0, 1].hist(features_df['n_events'], bins=50, edgecolor='black')
    axes[0, 1].set_xlabel('Number of Events')
    axes[0, 1].set_ylabel('Frequency')
    axes[0, 1].set_title('Events per Flight Distribution')
    axes[0, 1].set_xlim(0, features_df['n_events'].quantile(0.99))

# Ground time
if 'ground_time_seconds' in features_df.columns:
    axes[1, 0].hist(features_df['ground_time_seconds'] / 60, bins=50, edgecolor='black')
    axes[1, 0].set_xlabel('Ground Time (minutes)')
    axes[1, 0].set_ylabel('Frequency')
    axes[1, 0].set_title('Ground Time Distribution')
    axes[1, 0].set_xlim(0, features_df['ground_time_seconds'].quantile(0.99) / 60)

# Taxi time
if 'taxi_time_seconds' in features_df.columns:
    axes[1, 1].hist(features_df['taxi_time_seconds'] / 60, bins=50, edgecolor='black')
    axes[1, 1].set_xlabel('Taxi Time (minutes)')
    axes[1, 1].set_ylabel('Frequency')
    axes[1, 1].set_title('Taxi Time Distribution')
    axes[1, 1].set_xlim(0, features_df['taxi_time_seconds'].quantile(0.99) / 60)

plt.tight_layout()
plt.savefig(output_dir / 'feature_temporal_distributions.png', dpi=300, bbox_inches='tight')
plt.show()


In [None]:
# Plot distributions of key spatial features
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Total distance
if 'total_distance_meters' in features_df.columns:
    axes[0, 0].hist(features_df['total_distance_meters'] / 1000, bins=50, edgecolor='black')
    axes[0, 0].set_xlabel('Total Distance (km)')
    axes[0, 0].set_ylabel('Frequency')
    axes[0, 0].set_title('Total Distance Distribution')
    axes[0, 0].set_xlim(0, features_df['total_distance_meters'].quantile(0.99) / 1000)

# Max altitude (dividing by 1000 assumes altitude_max is in meters)
if 'altitude_max' in features_df.columns:
    axes[0, 1].hist(features_df['altitude_max'] / 1000, bins=50, edgecolor='black')
    axes[0, 1].set_xlabel('Max Altitude (km)')
    axes[0, 1].set_ylabel('Frequency')
    axes[0, 1].set_title('Maximum Altitude Distribution')
    axes[0, 1].set_xlim(0, features_df['altitude_max'].quantile(0.99) / 1000)

# Trajectory sinuosity
if 'trajectory_sinuosity' in features_df.columns:
    axes[1, 0].hist(features_df['trajectory_sinuosity'], bins=50, edgecolor='black')
    axes[1, 0].set_xlabel('Trajectory Sinuosity')
    axes[1, 0].set_ylabel('Frequency')
    axes[1, 0].set_title('Trajectory Sinuosity Distribution')
    axes[1, 0].set_xlim(1, features_df['trajectory_sinuosity'].quantile(0.99))

# Path efficiency
if 'path_efficiency' in features_df.columns:
    axes[1, 1].hist(features_df['path_efficiency'], bins=50, edgecolor='black')
    axes[1, 1].set_xlabel('Path Efficiency')
    axes[1, 1].set_ylabel('Frequency')
    axes[1, 1].set_title('Path Efficiency Distribution')
    axes[1, 1].set_xlim(0, 1)

plt.tight_layout()
plt.savefig(output_dir / 'feature_spatial_distributions.png', dpi=300, bbox_inches='tight')
plt.show()


## Step 6: Feature Correlation Analysis

Analyze pairwise correlations between numeric features to identify redundant ones (|r| > 0.8).
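
The pairwise scan for highly correlated features can be written without explicit loops by masking the upper triangle of the correlation matrix, which helps when the feature table is wide. The following is a sketch using the 0.8 threshold; `high_correlation_pairs` is a hypothetical helper, not part of `src.features`.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(df, threshold=0.8):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.select_dtypes(include=[np.number]).corr()
    # Keep only the strict upper triangle so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().reset_index()
    pairs.columns = ["feature_1", "feature_2", "correlation"]
    # NaN entries (masked cells or constant columns) fail this comparison and drop out.
    pairs = pairs[pairs["correlation"].abs() > threshold]
    order = pairs["correlation"].abs().sort_values(ascending=False).index
    return pairs.loc[order].reset_index(drop=True)
```

The result is a tidy table sorted by absolute correlation, which is convenient for deciding which member of each pair to drop.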


In [None]:
# Calculate correlation matrix for numeric features
numeric_features = features_df.select_dtypes(include=[np.number]).columns
correlation_matrix = features_df[numeric_features].corr()

# Find highly correlated feature pairs
high_corr_pairs = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        corr_val = correlation_matrix.iloc[i, j]
        if abs(corr_val) > 0.8:  # Threshold for high correlation
            high_corr_pairs.append((
                correlation_matrix.columns[i],
                correlation_matrix.columns[j],
                corr_val
            ))

if high_corr_pairs:
    print(f"Found {len(high_corr_pairs)} highly correlated feature pairs (|r| > 0.8):")
    print("="*60)
    for feat1, feat2, corr in sorted(high_corr_pairs, key=lambda x: abs(x[2]), reverse=True)[:20]:
        print(f"  {feat1} <-> {feat2}: {corr:.3f}")
else:
    print("✓ No highly correlated feature pairs found (|r| > 0.8)")


In [None]:
# Visualize correlation matrix for key features
key_features_for_corr = [
    'total_duration_seconds', 'n_events', 'total_distance_meters',
    'altitude_max', 'ground_complexity_score', 'sequence_complexity_score',
    'global_zscore_duration', 'airport_zscore_duration', 'trajectory_sinuosity'
]

available_corr_features = [f for f in key_features_for_corr if f in features_df.columns]

if len(available_corr_features) > 1:
    corr_subset = features_df[available_corr_features].corr()
    
    plt.figure(figsize=(10, 8))
    sns.heatmap(corr_subset, annot=True, fmt='.2f', cmap='coolwarm', center=0,
                square=True, linewidths=1, cbar_kws={"shrink": 0.8})
    plt.title('Feature Correlation Matrix (Key Features)')
    plt.tight_layout()
    plt.savefig(output_dir / 'feature_correlation_matrix.png', dpi=300, bbox_inches='tight')
    plt.show()


## Step 7: Prepare Features for Machine Learning

Handle missing values, encode categorical variables, and prepare features for ML models.
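
The internals of `prepare_features_for_ml` are not shown in this notebook. As a rough illustration of the steps named above, and not a description of that function, the sketch below fills numeric gaps with column medians and one-hot encodes categorical columns using plain pandas. The helper name `simple_ml_prep` and the choice of median imputation are assumptions.

```python
import numpy as np
import pandas as pd


def simple_ml_prep(df):
    """Median-impute numeric columns and one-hot encode categorical ones."""
    numeric = df.select_dtypes(include=[np.number])
    numeric = numeric.fillna(numeric.median())
    other = df.select_dtypes(exclude=[np.number])
    # get_dummies encodes object/string/category columns and passes the rest through;
    # dummy_na=True keeps missing categories as their own indicator column.
    encoded = pd.get_dummies(other, dummy_na=True, dtype=float)
    return pd.concat([numeric, encoded], axis=1)
```

Median imputation is a common default for skewed features such as durations, but a column that is entirely missing would stay missing and needs separate handling.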


In [None]:
# Prepare features for ML
X, y, feature_info = prepare_features_for_ml(features_df)

print(f"\nML-Ready Features:")
print(f"  Shape: {X.shape}")
print(f"  Feature names: {len(feature_info['feature_names'])}")
print(f"  Samples: {len(X)}")

# Display sample of prepared features
print(f"\nSample features (first 5 rows, first 10 columns):")
print(X.iloc[:5, :10])


## Step 8: Save Features

Save the extracted features for use in Phase 3 (Model Development).
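
How `save_features` writes its outputs is not shown here. For reference, pandas infers gzip compression from a `.csv.gz` suffix, so a round trip can be sketched as follows. The helper name and the contents of the JSON metadata are hypothetical.

```python
import json
from pathlib import Path

import pandas as pd


def save_and_reload(df, out_dir, stem="flight_features"):
    """Write df as gzip-compressed CSV plus a JSON sidecar, then read it back."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    csv_path = out_dir / f"{stem}.csv.gz"
    # The default compression="infer" picks gzip from the .gz suffix.
    df.to_csv(csv_path, index=False)
    info = {"feature_names": list(df.columns), "n_rows": int(len(df))}
    (out_dir / "feature_info.json").write_text(json.dumps(info, indent=2))
    return pd.read_csv(csv_path)
```

Reading the file back immediately is a cheap way to confirm that column names and dtypes survive the trip before Phase 3 depends on them.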


In [None]:
# Save features
features_dir = project_root / config['data']['features_dir']
ensure_dir(features_dir)

output_path = features_dir / 'flight_features.csv.gz'
save_features(features_df, str(output_path), X, feature_info)

print(f"\n✓ Features saved successfully!")
print(f"  Full features: {output_path}")
print(f"  ML-ready features: {features_dir / 'flight_features_ml_ready.csv.gz'}")
print(f"  Feature info: {features_dir / 'feature_info.json'}")


## Step 9: Feature Summary Report

Generate a summary report of all extracted features.


In [None]:
# Generate feature summary report
print("="*60)
print("FEATURE ENGINEERING SUMMARY REPORT")
print("="*60)

print(f"\n1. Dataset Overview:")
print(f"   - Total flights: {len(features_df):,}")
print(f"   - Total features extracted: {len(features_df.columns)}")
print(f"   - ML-ready features: {len(X.columns)}")

print(f"\n2. Feature Categories:")
for category, cols in feature_categories.items():
    print(f"   - {category}: {len(cols)} features")

print(f"\n3. Data Quality:")
print(f"   - Missing values: {features_df.isnull().sum().sum()}")
print(f"   - Features with missing values: {len(missing_stats)}")
print(f"   - Numeric features: {len(numeric_cols)}")
print(f"   - Categorical features: {len(features_df.select_dtypes(include=['object', 'category']).columns)}")

print(f"\n4. Key Statistics:")
if 'total_duration_seconds' in features_df.columns:
    print(f"   - Mean flight duration: {features_df['total_duration_seconds'].mean() / 3600:.2f} hours")
if 'n_events' in features_df.columns:
    print(f"   - Mean events per flight: {features_df['n_events'].mean():.1f}")
if 'total_distance_meters' in features_df.columns:
    print(f"   - Mean distance: {features_df['total_distance_meters'].mean() / 1000:.2f} km")

print(f"\n5. Next Steps:")
print(f"   - Features are ready for Phase 3: Model Development")
print(f"   - Use Isolation Forest, One-Class SVM, or other anomaly detection models")
print(f"   - Features saved to: {output_path}")

print("\n" + "="*60)
print("PHASE 2 COMPLETE! ✓")
print("="*60)
