# Flight Delay Prediction: Feature Engineering Pipeline

**Author:** Daniel Costa  
**Course:** W261 - Machine Learning at Scale  
**Institution:** UC Berkeley

---

## Overview

This notebook implements a PySpark feature engineering pipeline for flight delay prediction. The pipeline processes one year of flight data, drops rows with incomplete weather observations, and creates temporal and operational features designed to predict departure delays **2 hours before scheduled departure**.

### Key Features Engineered

| Feature | Description | Window |
|---------|-------------|--------|
| `origin_delays_4h` | Count of delays (>15 min) at origin airport | 4h to 2h before flight |
| `prev_flight_delay` | Binary flag if aircraft's previous flight was delayed | Previous flight |
| `prev_flight_delay_in_minutes` | Delay duration of aircraft's previous flight | Previous flight |
| `delay_origin_7d` | Sum of departure delays at origin | 7 days to 4h before |
| `delay_origin_carrier_7d` | Sum of delays by origin + carrier | 7 days to 4h before |
| `delay_route_7d` | Sum of delays on same route | 7 days to 4h before |
| `flight_count_24h` | Running count of the aircraft's flights so far that day (current flight included) | Same day, up to current flight |
| `AVG_ARR_DELAY_ORIGIN` | Rolling average arrival delay at origin | 7 days to 4h before |
| `AVG_TAXI_OUT_ORIGIN` | Rolling average taxi-out time at origin | 7 days to 4h before |

### Data Leakage Prevention

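The rolling-window features respect a **2-hour prediction buffer**: their windows end at least 2 hours before scheduled departure, so no information from within 2 hours of departure is used, simulating real-world prediction conditions. The previous-flight features (`prev_flight_delay`, `prev_flight_delay_in_minutes`) and `LANDING_TIME_DIFF_MINUTES` are built with `lag`/`lead` and do not apply this buffer, so they should be checked for leakage before modeling.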
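
To make the buffer concrete, here is a minimal plain-Python sketch of which past events a window spanning 4 hours to 2 hours before departure would admit. The function name `in_lookback_window` and the example timestamps are illustrative assumptions, not part of the pipeline; the inclusive bounds mirror how Spark's `rangeBetween` treats its start and end offsets.

```python
from datetime import datetime, timedelta

def in_lookback_window(event_time, departure_time,
                       start=timedelta(hours=4), end=timedelta(hours=2)):
    """Return True if event_time lies between `start` and `end` before departure (inclusive)."""
    return departure_time - start <= event_time <= departure_time - end

departure = datetime(2019, 6, 1, 14, 30)  # hypothetical scheduled departure

print(in_lookback_window(datetime(2019, 6, 1, 11, 0), departure))  # 3.5h before: inside the window
print(in_lookback_window(datetime(2019, 6, 1, 13, 0), departure))  # 1.5h before: inside the buffer, excluded
print(in_lookback_window(datetime(2019, 6, 1, 9, 0), departure))   # 5.5h before: too old, excluded
```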

---
## 1. Configuration & Setup

In [None]:
# =============================================================================
# CONFIGURATION
# =============================================================================

# Paths
GROUP_SECTION = "2"
GROUP_NUMBER = "2"
BASE_PATH = f"dbfs:/student-groups/Group_{GROUP_SECTION}_{GROUP_NUMBER}"
DATASET_NAME = "1_year_custom_joined"

INPUT_PATH = f"{BASE_PATH}/{DATASET_NAME}/graph_feature_splits"
OUTPUT_PATH = f"{BASE_PATH}/{DATASET_NAME}/feature_eng/training_splits"

# Train/Validation/Test Split Ratios
TRAIN_RATIO = 0.70
VALIDATION_RATIO = 0.10
TEST_RATIO = 0.20  # Implicit: 1 - TRAIN_RATIO - VALIDATION_RATIO

# Time Windows (in seconds)
SECONDS_2_HOURS = 2 * 60 * 60      # 7,200
SECONDS_4_HOURS = 4 * 60 * 60      # 14,400
SECONDS_7_DAYS = 7 * 24 * 60 * 60  # 604,800

# Delay Threshold (minutes)
DELAY_THRESHOLD_MINUTES = 15

# MLflow Configuration
EXPERIMENT_NAME = "/Shared/team_2_2/mlflow-baseline"
RANDOM_SEED = 42

In [None]:
# =============================================================================
# IMPORTS
# =============================================================================

from typing import List
from pyspark.sql import DataFrame, Window
from pyspark.sql.functions import (
    col, lit, when, count, sum as spark_sum, avg, lag, lead,
    concat, lpad, to_timestamp, floor, row_number, max as spark_max,
    coalesce
)
import pyspark.sql.functions as F
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml import Pipeline

import mlflow

print(f"MLflow version: {mlflow.__version__}")

In [None]:
# =============================================================================
# MLFLOW SETUP
# =============================================================================

spark.conf.set("spark.databricks.mlflow.trackMLlib.enabled", "true")

try:
    experiment = mlflow.get_experiment_by_name(EXPERIMENT_NAME)
    if experiment is None:
        experiment_id = mlflow.create_experiment(EXPERIMENT_NAME)
        print(f"Created new experiment with ID: {experiment_id}")
    else:
        print(f"Using existing experiment: {experiment.name}")
    mlflow.set_experiment(EXPERIMENT_NAME)
except Exception as e:
    print(f"Error with experiment setup: {e}")
    fallback_path = f"/Users/{dbutils.notebook.entry_point.getDbutils().notebook().getContext().userName().get()}/default"
    mlflow.set_experiment(fallback_path)

---
## 2. Helper Functions

Reusable functions for feature engineering to ensure consistency across train/validation/test sets.

In [None]:
# =============================================================================
# WEATHER COLUMNS (Required - rows with nulls will be dropped)
# =============================================================================

WEATHER_COLUMNS = [
    'HourlyDryBulbTemperature',
    'HourlyDewPointTemperature',
    'HourlyRelativeHumidity',
    'HourlyAltimeterSetting',
    'HourlyVisibility',
    'HourlyStationPressure',
    'HourlyWetBulbTemperature',
    'HourlyPrecipitation',
    'HourlyCloudCoverage',
    'HourlyCloudElevation',
    'HourlyWindSpeed'
]

In [None]:
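def add_utc_timestamp(df: DataFrame) -> DataFrame:
    """
    Create a timestamp column from FL_DATE and CRS_DEP_TIME.
    
    CRS_DEP_TIME is stored as an integer (e.g., 1430 for 2:30 PM),
    so we left-pad it to four digits and combine it with the flight date.
    
    Note: no timezone conversion is applied here, so the result is only
    in UTC if FL_DATE and CRS_DEP_TIME are already expressed in UTC.
    """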
    return df.withColumn(
        "utc_timestamp",
        to_timestamp(
            concat(
                col("FL_DATE"),
                lit(" "),
                lpad(col("CRS_DEP_TIME").cast("string"), 4, "0")
            ),
            "yyyy-MM-dd HHmm"
        )
    )
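
The zero-padding step matters because early-morning times such as 530 have fewer than four digits. A minimal plain-Python sketch of the same parsing logic follows; the function name `parse_scheduled_departure` and the sample values are hypothetical and only illustrate the idea.

```python
from datetime import datetime

def parse_scheduled_departure(fl_date: str, crs_dep_time: int) -> datetime:
    """Combine a 'YYYY-MM-DD' date with an HHMM integer into a datetime."""
    hhmm = str(crs_dep_time).zfill(4)  # 530 -> "0530", 5 -> "0005"
    return datetime.strptime(f"{fl_date} {hhmm}", "%Y-%m-%d %H%M")

print(parse_scheduled_departure("2019-06-01", 1430))  # 2019-06-01 14:30:00
print(parse_scheduled_departure("2019-06-01", 530))   # 2019-06-01 05:30:00
```

A value of 2400 is not a valid `%H%M` time and raises `ValueError` in this sketch, so rows with such values would need explicit handling.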

In [None]:
def add_departure_minutes(df: DataFrame) -> DataFrame:
    """
    Convert CRS_DEP_TIME (HHMM format) to minutes since midnight.
    
    Example: 1430 -> 14*60 + 30 = 870 minutes
    """
    return df.withColumn(
        "CRS_DEP_MINUTES",
        (floor(col("CRS_DEP_TIME") / 100) * 60 + (col("CRS_DEP_TIME") % 100))
    )

In [None]:
def add_rolling_count_feature(
    df: DataFrame,
    partition_cols: List[str],
    condition_col: str,
    threshold: float,
    new_col: str,
    window_start: int,
    window_end: int
) -> DataFrame:
    """
    Add a rolling count of records meeting a threshold condition.
    
    Args:
        df: Input DataFrame
        partition_cols: Columns to partition the window by
        condition_col: Column to apply threshold condition to
        threshold: Value threshold for counting
        new_col: Name for the new feature column
        window_start: Window start in seconds (negative = past)
        window_end: Window end in seconds (negative = past)
    
    Returns:
        DataFrame with new count feature
    """
    window = Window \
        .partitionBy(*partition_cols) \
        .orderBy(col("utc_timestamp").cast("long")) \
        .rangeBetween(window_start, window_end)
    
    return df.withColumn(
        new_col,
        count(when(col(condition_col) > threshold, 1)).over(window)
    )

In [None]:
def add_rolling_sum_feature(
    df: DataFrame,
    partition_cols: List[str],
    agg_col: str,
    new_col: str,
    window_start: int,
    window_end: int,
    default_value: float = 0
) -> DataFrame:
    """
    Add a rolling sum feature with null handling.
    
    Args:
        df: Input DataFrame
        partition_cols: Columns to partition the window by
        agg_col: Column to sum
        new_col: Name for the new feature column
        window_start: Window start in seconds (negative = past)
        window_end: Window end in seconds (negative = past)
        default_value: Value to use when sum is null
    
    Returns:
        DataFrame with new sum feature
    """
    window = Window \
        .partitionBy(*partition_cols) \
        .orderBy(col("utc_timestamp").cast("long")) \
        .rangeBetween(window_start, window_end)
    
    return df.withColumn(
        new_col,
        coalesce(spark_sum(agg_col).over(window), lit(default_value))
    )

In [None]:
def add_rolling_avg_feature(
    df: DataFrame,
    partition_cols: List[str],
    agg_col: str,
    new_col: str,
    window_start: int,
    window_end: int,
    default_value: float = 0
) -> DataFrame:
    """
    Add a rolling average feature with null handling.
    
    Args:
        df: Input DataFrame
        partition_cols: Columns to partition the window by
        agg_col: Column to average
        new_col: Name for the new feature column
        window_start: Window start in seconds (negative = past)
        window_end: Window end in seconds (negative = past)
        default_value: Value to use when average is null
    
    Returns:
        DataFrame with new average feature
    """
    window = Window \
        .partitionBy(*partition_cols) \
        .orderBy(col("utc_timestamp").cast("long")) \
        .rangeBetween(window_start, window_end)
    
    return df.withColumn(
        new_col,
        coalesce(avg(agg_col).over(window), lit(default_value))
    )
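
The three rolling helpers differ mainly in how an empty window behaves: a count over no rows is 0, whereas a sum or average over no rows is null in Spark SQL, which is why the latter two wrap the aggregate in `coalesce`. The plain-Python sketch below mimics those semantics under simplifying assumptions. The function `rolling_aggregate` and the sample data are hypothetical, `None` stands in for SQL null, and the count here tallies non-null values rather than applying a threshold.

```python
def rolling_aggregate(events, now, window_start, window_end, kind, default=0):
    """
    events: list of (timestamp_seconds, value) pairs for one partition.
    window_start / window_end: offsets in seconds relative to `now`
    (negative = past), both inclusive, as with rangeBetween.
    """
    values = [v for t, v in events
              if now + window_start <= t <= now + window_end and v is not None]
    if kind == "count":
        return len(values)   # empty window -> 0, no default needed
    if not values:
        return default       # stands in for coalesce(..., lit(default))
    if kind == "sum":
        return sum(values)
    if kind == "avg":
        return sum(values) / len(values)
    raise ValueError(f"unknown kind: {kind}")

HOUR = 3600
events = [(0, 20.0), (1 * HOUR, 0.0), (5 * HOUR, 45.0)]  # (seconds, delay in minutes)
now = 6 * HOUR

print(rolling_aggregate(events, now, -7 * 24 * HOUR, -4 * HOUR, "sum"))  # 20.0
print(rolling_aggregate(events, now, -7 * 24 * HOUR, -4 * HOUR, "avg"))  # 10.0
print(rolling_aggregate(events, now, -3 * HOUR, -2 * HOUR, "sum"))       # 0
```

One design consequence worth keeping in mind: a default of 0 makes "no history in the window" indistinguishable from "history with zero delay", which may or may not be acceptable for a given model.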

In [None]:
def add_previous_flight_delay_features(df: DataFrame) -> DataFrame:
    """
    Add features based on the aircraft's previous flight delay.
    
    Creates:
        - prev_flight_delay_in_minutes: Delay of previous flight (-1 if none)
        - prev_flight_delay: Binary flag (1 if previous flight delayed >15 min)
    """
    window = Window.partitionBy("TAIL_NUM").orderBy("utc_timestamp")
    
    return df \
        .withColumn(
            "prev_flight_delay_in_minutes",
            coalesce(lag("DEP_DELAY_NEW", 1).over(window), lit(-1))
        ) \
        .withColumn(
            "prev_flight_delay",
            when(col("prev_flight_delay_in_minutes") > DELAY_THRESHOLD_MINUTES, 1).otherwise(0)
        )
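
As a rough illustration of what `lag` with a `-1` fallback produces, the plain-Python sketch below walks one aircraft's flights in time order. The function `previous_flight_delays`, the dictionary keys, and the sample flights are hypothetical stand-ins for the Spark columns.

```python
DELAY_THRESHOLD_MINUTES = 15

def previous_flight_delays(flights):
    """
    flights: list of dicts with 'tail_num', 'ts' (sortable) and 'dep_delay'.
    Returns the flights sorted by (tail_num, ts) with both features added.
    """
    last_delay = {}
    out = []
    for f in sorted(flights, key=lambda f: (f["tail_num"], f["ts"])):
        prev = last_delay.get(f["tail_num"])          # None for the first flight
        prev_minutes = -1 if prev is None else prev   # mirrors coalesce(lag(...), lit(-1))
        out.append({
            **f,
            "prev_flight_delay_in_minutes": prev_minutes,
            "prev_flight_delay": int(prev_minutes > DELAY_THRESHOLD_MINUTES),
        })
        last_delay[f["tail_num"]] = f["dep_delay"]
    return out

flights = [
    {"tail_num": "N123", "ts": 1, "dep_delay": 30.0},
    {"tail_num": "N123", "ts": 2, "dep_delay": 0.0},
    {"tail_num": "N123", "ts": 3, "dep_delay": 5.0},
]
for row in previous_flight_delays(flights):
    print(row["ts"], row["prev_flight_delay_in_minutes"], row["prev_flight_delay"])
```

The first flight has no predecessor and gets `-1`; the second inherits the 30-minute delay and is flagged; the third sees a previous delay of 0 and is not flagged.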

In [None]:
def add_route_feature(df: DataFrame) -> DataFrame:
    """
    Create a route identifier by combining origin and destination.
    
    Example: ORIGIN='SFO', DEST='LAX' -> route='SFO-LAX'
    """
    return df.withColumn(
        "route",
        concat(col("ORIGIN"), lit("-"), col("DEST"))
    )

In [None]:
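def add_flight_count_feature(df: DataFrame) -> DataFrame:
    """
    Count the aircraft's flights so far on the same day.
    
    Because the window is ordered, Spark's default frame runs from the start
    of the partition to the current row, so this is a running count (including
    the current flight) rather than the day's total. It captures how far into
    its daily schedule the aircraft is.
    """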
    window = Window \
        .partitionBy("TAIL_NUM", "FL_DATE") \
        .orderBy(col("utc_timestamp").cast("long"))
    
    return df.withColumn(
        "flight_count_24h",
        count("*").over(window)
    )
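
Because the window above is ordered, `count("*")` is evaluated over a frame that ends at the current row, which yields a running count rather than a per-day total. The plain-Python sketch below contrasts the two, assuming distinct timestamps per aircraft; the function `running_and_total_counts` and the sample tuples are hypothetical.

```python
from collections import Counter

def running_and_total_counts(flights):
    """
    flights: list of (tail_num, fl_date, ts) tuples.
    Returns {flight: (running_count, daily_total)}.
    """
    totals = Counter((tail, day) for tail, day, _ in flights)
    seen = Counter()
    result = {}
    for tail, day, ts in sorted(flights):
        seen[(tail, day)] += 1
        result[(tail, day, ts)] = (seen[(tail, day)], totals[(tail, day)])
    return result

flights = [
    ("N123", "2019-06-01", 600),
    ("N123", "2019-06-01", 900),
    ("N123", "2019-06-01", 1200),
]
for flight, counts in running_and_total_counts(flights).items():
    print(flight, counts)
```

The running count goes 1, 2, 3 across the day while the daily total is 3 for every row. If the daily total is what is wanted, a window without an ordering clause would be one way to obtain it.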

In [None]:
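def add_turnaround_time_feature(df: DataFrame) -> DataFrame:
    """
    Calculate minutes between landing and the aircraft's next scheduled departure.
    
    Uses WHEELS_ON (actual landing time) and the next flight's CRS_DEP_TIME.
    Both are assumed to be HHMM integers on the same local clock, so they are
    converted to minutes since midnight before subtracting.
    Returns -999 for the last flight of the day (no next flight) or when
    WHEELS_ON is missing. Landings after midnight can yield negative values.
    """
    # Order each aircraft's flights chronologically within a day so that
    # lead() picks up the next scheduled departure on that day.
    window = Window \
        .partitionBy("TAIL_NUM", "FL_DATE") \
        .orderBy(col("utc_timestamp").cast("long"))
    
    def hhmm_to_minutes(c):
        # HHMM integer -> minutes since midnight (e.g., 1430 -> 870)
        c = c.cast("int")
        return floor(c / 100) * 60 + (c % 100)
    
    return df \
        .withColumn(
            "_next_scheduled_dep",
            lead("CRS_DEP_TIME", 1).over(window)
        ) \
        .withColumn(
            "LANDING_TIME_DIFF_MINUTES",
            coalesce(
                hhmm_to_minutes(col("_next_scheduled_dep")) - hhmm_to_minutes(col("WHEELS_ON")),
                lit(-999)
            )
        ) \
        .drop("_next_scheduled_dep")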
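
The turnaround described in the docstring can be sketched in plain Python under the assumption that both WHEELS_ON and CRS_DEP_TIME are HHMM integers on the same local clock and calendar day. The helper names `hhmm_to_minutes` and `turnaround_minutes` are hypothetical and used only for illustration.

```python
def hhmm_to_minutes(hhmm: int) -> int:
    """Convert an HHMM integer to minutes since midnight (1430 -> 870)."""
    return (hhmm // 100) * 60 + hhmm % 100

def turnaround_minutes(wheels_on, next_crs_dep_time, missing=-999):
    """Minutes from landing to the next scheduled departure, or `missing` if unknown."""
    if wheels_on is None or next_crs_dep_time is None:
        return missing
    return hhmm_to_minutes(next_crs_dep_time) - hhmm_to_minutes(wheels_on)

print(turnaround_minutes(1315, 1430))  # 75
print(turnaround_minutes(1315, None))  # -999
```

Subtracting the raw HHMM integers would give 1430 - 1315 = 115 rather than 75, which is why the conversion to minutes comes first.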

---
## 3. Feature Engineering Pipeline

Main function that applies all feature engineering transformations.

In [None]:
def engineer_features(df: DataFrame) -> DataFrame:
    """
    Apply all feature engineering transformations to a DataFrame.
    
    This function is applied identically to train, validation, and test sets
    to ensure consistency.
    
    Args:
        df: Input DataFrame with raw flight data
    
    Returns:
        DataFrame with all engineered features
    """
    print("Starting feature engineering pipeline...")
    
    # Drop rows with missing weather data
    df = df.dropna(subset=WEATHER_COLUMNS)
    print("  ✓ Dropped rows with missing weather data")
    
    # Basic time features
    df = add_departure_minutes(df)
    print("  ✓ Added CRS_DEP_MINUTES")
    
    # Route identifier
    df = add_route_feature(df)
    print("  ✓ Added route feature")
    
    # Previous flight delay features
    df = add_previous_flight_delay_features(df)
    print("  ✓ Added previous flight delay features")
    
    # Origin delays in 4h-2h window
    df = add_rolling_count_feature(
        df,
        partition_cols=["ORIGIN_AIRPORT_SEQ_ID"],
        condition_col="DEP_DELAY_NEW",
        threshold=DELAY_THRESHOLD_MINUTES,
        new_col="origin_delays_4h",
        window_start=-SECONDS_4_HOURS,
        window_end=-SECONDS_2_HOURS
    )
    print("  ✓ Added origin_delays_4h")
    
    # 7-day delay sums (with 4-hour buffer)
    df = add_rolling_sum_feature(
        df,
        partition_cols=["ORIGIN_AIRPORT_SEQ_ID"],
        agg_col="DEP_DELAY_NEW",
        new_col="delay_origin_7d",
        window_start=-SECONDS_7_DAYS,
        window_end=-SECONDS_4_HOURS
    )
    print("  ✓ Added delay_origin_7d")
    
    df = add_rolling_sum_feature(
        df,
        partition_cols=["ORIGIN_AIRPORT_SEQ_ID", "OP_UNIQUE_CARRIER"],
        agg_col="DEP_DELAY_NEW",
        new_col="delay_origin_carrier_7d",
        window_start=-SECONDS_7_DAYS,
        window_end=-SECONDS_4_HOURS
    )
    print("  ✓ Added delay_origin_carrier_7d")
    
    df = add_rolling_sum_feature(
        df,
        partition_cols=["route"],
        agg_col="DEP_DELAY_NEW",
        new_col="delay_route_7d",
        window_start=-SECONDS_7_DAYS,
        window_end=-SECONDS_4_HOURS
    )
    print("  ✓ Added delay_route_7d")
    
    # Flight count per aircraft per day
    df = add_flight_count_feature(df)
    print("  ✓ Added flight_count_24h")
    
    # Turnaround time
    df = add_turnaround_time_feature(df)
    print("  ✓ Added LANDING_TIME_DIFF_MINUTES")
    
    # 7-day rolling averages
    df = add_rolling_avg_feature(
        df,
        partition_cols=["ORIGIN_AIRPORT_SEQ_ID"],
        agg_col="ARR_DELAY",
        new_col="AVG_ARR_DELAY_ORIGIN",
        window_start=-SECONDS_7_DAYS,
        window_end=-SECONDS_4_HOURS
    )
    print("  ✓ Added AVG_ARR_DELAY_ORIGIN")
    
    df = add_rolling_avg_feature(
        df,
        partition_cols=["ORIGIN_AIRPORT_SEQ_ID"],
        agg_col="TAXI_OUT",
        new_col="AVG_TAXI_OUT_ORIGIN",
        window_start=-SECONDS_7_DAYS,
        window_end=-SECONDS_4_HOURS
    )
    print("  ✓ Added AVG_TAXI_OUT_ORIGIN")
    
    print("Feature engineering complete!")
    return df

---
## 4. Load Data

In [None]:
# Load pre-split datasets
print(f"Loading data from: {INPUT_PATH}")

train_df = spark.read.parquet(f"{INPUT_PATH}/train.parquet")
validation_df = spark.read.parquet(f"{INPUT_PATH}/validation.parquet")
test_df = spark.read.parquet(f"{INPUT_PATH}/test.parquet")

print(f"\nDataset sizes:")
print(f"  Train:      {train_df.count():,} rows")
print(f"  Validation: {validation_df.count():,} rows")
print(f"  Test:       {test_df.count():,} rows")

In [None]:
# Cache datasets for performance
train_df = train_df.cache()
validation_df = validation_df.cache()
test_df = test_df.cache()

print("Datasets cached.")

---
## 5. Apply Feature Engineering

In [None]:
print("=" * 60)
print("TRAINING SET")
print("=" * 60)
train_df = engineer_features(train_df)

In [None]:
print("=" * 60)
print("VALIDATION SET")
print("=" * 60)
validation_df = engineer_features(validation_df)

In [None]:
print("=" * 60)
print("TEST SET")
print("=" * 60)
test_df = engineer_features(test_df)

---
## 6. Validation & Quality Checks

In [None]:
def check_null_counts(df: DataFrame, name: str) -> None:
    """Display null counts for all columns in a DataFrame."""
    print(f"\nNull counts for {name}:")
    null_counts = df.select(
        [count(when(col(c).isNull(), c)).alias(c) for c in df.columns]
    )
    display(null_counts)


def verify_column_consistency(train: DataFrame, val: DataFrame, test: DataFrame) -> bool:
    """Verify that all datasets have the same columns."""
    train_cols = set(train.columns)
    val_cols = set(val.columns)
    test_cols = set(test.columns)
    
    if train_cols == val_cols == test_cols:
        print("✓ All datasets have consistent columns")
        return True
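    else:
        # Report columns missing from each dataset relative to the union, so
        # columns present in only two of the three datasets are also surfaced.
        all_cols = train_cols | val_cols | test_cols
        print("✗ Column mismatch detected!")
        print(f"  Missing from train: {all_cols - train_cols}")
        print(f"  Missing from validation: {all_cols - val_cols}")
        print(f"  Missing from test: {all_cols - test_cols}")
        return False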

In [None]:
# Verify column consistency across all datasets
verify_column_consistency(train_df, validation_df, test_df)

In [None]:
# Check for nulls in validation set (representative sample)
check_null_counts(validation_df, "Validation")

In [None]:
# Display sample of engineered features
ENGINEERED_FEATURE_COLS = [
    "ORIGIN", "DEST", "route", "FL_DATE", "CRS_DEP_TIME", "CRS_DEP_MINUTES",
    "origin_delays_4h", "prev_flight_delay", "prev_flight_delay_in_minutes",
    "delay_origin_7d", "delay_origin_carrier_7d", "delay_route_7d",
    "flight_count_24h", "LANDING_TIME_DIFF_MINUTES",
    "AVG_ARR_DELAY_ORIGIN", "AVG_TAXI_OUT_ORIGIN"
]

print("Sample of engineered features:")
display(train_df.select(ENGINEERED_FEATURE_COLS).limit(10))

---
## 7. Save Processed Datasets

In [None]:
# Save feature-engineered datasets
# Set SAVE_DATA = True when ready to write to DBFS
SAVE_DATA = False

if SAVE_DATA:
    print(f"Saving datasets to: {OUTPUT_PATH}")
    # checkpoint_dataset is not defined in this notebook, so write Parquet
    # directly under OUTPUT_PATH instead. mode("overwrite") replaces any
    # existing output at these paths.
    train_df.write.mode("overwrite").parquet(f"{OUTPUT_PATH}/train")
    validation_df.write.mode("overwrite").parquet(f"{OUTPUT_PATH}/validation")
    test_df.write.mode("overwrite").parquet(f"{OUTPUT_PATH}/test")
    print("\n✓ All datasets saved successfully!")
else:
    print("⚠ SAVE_DATA is False. Set SAVE_DATA = True to save datasets to DBFS.")

---
## 8. Summary

### Features Created

| Category | Feature | Description |
|----------|---------|-------------|
| **Time** | `CRS_DEP_MINUTES` | Scheduled departure as minutes since midnight |
| **Route** | `route` | Origin-Destination pair (e.g., "SFO-LAX") |
| **Aircraft History** | `prev_flight_delay` | Binary: was previous flight delayed >15 min? |
| **Aircraft History** | `prev_flight_delay_in_minutes` | Previous flight's delay duration |
| **Aircraft History** | `flight_count_24h` | Running count of this aircraft's flights so far that day |
| **Aircraft History** | `LANDING_TIME_DIFF_MINUTES` | Minutes from landing to the aircraft's next scheduled departure (-999 if none) |
| **Airport History** | `origin_delays_4h` | Count of delays >15 min at origin (4h to 2h before departure) |
| **Airport History** | `delay_origin_7d` | Sum of departure delays at origin (7 days to 4h before departure) |
| **Airport History** | `AVG_ARR_DELAY_ORIGIN` | Avg arrival delay at origin (7 days to 4h before departure) |
| **Airport History** | `AVG_TAXI_OUT_ORIGIN` | Avg taxi-out time at origin (7 days to 4h before departure) |
| **Carrier History** | `delay_origin_carrier_7d` | Sum of delays by carrier at origin (7 days to 4h before departure) |
| **Route History** | `delay_route_7d` | Sum of delays on route (7 days to 4h before departure) |

### Next Steps

The processed datasets are ready for model training. Proceed to the modeling notebook to:
1. Encode categorical features
2. Assemble feature vectors
3. Train and evaluate models