# Additional Feature Engineering for Neural Network Models

**Author:** Daniel Costa  
**Project:** Flight Delay Prediction (W261 Final Project)  
**Dataset:** US Domestic Flights (2015-2019)

---

## Overview

This notebook adds **feature engineering aimed at neural network models**. The new features capture three kinds of signal: periodic time patterns, short-term weather changes, and airport congestion.

### Features Created

| Category | Features | Description |
|----------|----------|-------------|
| **Cyclic Time** | `dep_hour_sin/cos`, `dow_sin/cos`, `doy_sin/cos` | Sine/cosine encoding of time to capture periodicity |
| **Weather Deltas** | `*_3h_change` (5 features) | 3-hour changes in weather conditions |
| **Origin Congestion** | `ground_flights_last_hour` | Rolling count of departures at origin |
| **Destination Congestion** | `arrivals_last_hour` | Rolling count of arrivals at destination |

### Why Cyclic Encoding?

Neural networks benefit from cyclic encoding of time features because it preserves the circular nature of time:
- Hour 23 is close to hour 0 (not 23 units apart)
- December is close to January
- Saturday is close to Sunday

```
sin/cos encoding:  hour 0 ≈ hour 24  ✓
linear encoding:   hour 0 ≠ hour 24  ✗
```
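
To make the comparison concrete, here is a minimal sketch in plain Python (no Spark required). The helpers `cyclic_encode` and `cyclic_distance` are hypothetical names used only for illustration; they are not part of this notebook's pipeline.

```python
import math


def cyclic_encode(value, period):
    """Map a periodic value onto the unit circle as a (sin, cos) pair."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def cyclic_distance(a, b, period):
    """Euclidean distance between the encodings of two values."""
    sin_a, cos_a = cyclic_encode(a, period)
    sin_b, cos_b = cyclic_encode(b, period)
    return math.hypot(sin_a - sin_b, cos_a - cos_b)


# Hours 23 and 0 are neighbors on the circle; hours 0 and 12 are opposite.
d_23_0 = cyclic_distance(23, 0, 24)
d_0_1 = cyclic_distance(0, 1, 24)
d_0_12 = cyclic_distance(0, 12, 24)

print(f"23 vs 0: {d_23_0:.3f}")
print(f"0 vs 1:  {d_0_1:.3f}")
print(f"0 vs 12: {d_0_12:.3f}")
```

In the encoded space, hour 23 is exactly as far from hour 0 as hour 1 is, while hour 12 sits at the maximum possible distance.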

---

## Table of Contents

1. [Setup & Configuration](#1-setup--configuration)
2. [Data Loading](#2-data-loading)
3. [Feature Engineering Functions](#3-feature-engineering-functions)
   - [3.1 Cyclic Time Features](#31-cyclic-time-features)
   - [3.2 Weather Delta Features](#32-weather-delta-features)
   - [3.3 Origin Congestion Features](#33-origin-congestion-features)
   - [3.4 Destination Congestion Features](#34-destination-congestion-features)
4. [Apply Feature Engineering](#4-apply-feature-engineering)
5. [Save Results](#5-save-results)

## 1. Setup & Configuration

In [0]:
# -----------------------------------------------------------------------------
# Imports
# -----------------------------------------------------------------------------
from pyspark.sql import Window
import pyspark.sql.functions as F
import numpy as np

# -----------------------------------------------------------------------------
# Configuration
# -----------------------------------------------------------------------------
# Data paths
BASE_PATH = "dbfs:/student-groups/Group_2_2"
DATASET_NAME = "5_year_custom_joined"
INPUT_DIR = f"{BASE_PATH}/{DATASET_NAME}/fe_graph_and_holiday/training_splits"
OUTPUT_DIR = f"{BASE_PATH}/{DATASET_NAME}/fe_graph_and_holiday_nnfeat/training_splits"

# Weather columns for delta features
WEATHER_COLUMNS = [
    "HourlyVisibility",
    "HourlyStationPressure", 
    "HourlyDryBulbTemperature",
    "HourlyWindSpeed",
    "HourlyPrecipitation"
]

# Time constants
SECONDS_PER_HOUR = 3600
HOURS_PER_DAY = 24
DAYS_PER_WEEK = 7
DAYS_PER_YEAR = 365


## 2. Data Loading

In [0]:
# Load train/validation/test splits
train_df = spark.read.parquet(f"{INPUT_DIR}/train.parquet/")
val_df = spark.read.parquet(f"{INPUT_DIR}/validation.parquet/")
test_df = spark.read.parquet(f"{INPUT_DIR}/test.parquet/")

print(f"Train:      {train_df.count():,} rows")
print(f"Validation: {val_df.count():,} rows")
print(f"Test:       {test_df.count():,} rows")

## 3. Feature Engineering Functions

### 3.1 Cyclic Time Features

Convert time features (hour, day of week, day of year) to sine/cosine pairs to preserve their circular nature for neural networks.

In [0]:
def add_time_features(df):
    """
    Add cyclic time features using sine/cosine encoding.
    
    Creates 6 new features:
    - dep_hour_sin, dep_hour_cos: Hour of departure (0-24 cycle)
    - dow_sin, dow_cos: Day of week (0-7 cycle)
    - doy_sin, doy_cos: Day of year (0-365 cycle)
    
    Args:
        df: PySpark DataFrame with CRS_DEP_MINUTES, DAY_OF_WEEK, utc_timestamp
        
    Returns:
        DataFrame with cyclic time features added
    """
    # Extract base time values
    df = df.withColumn("dep_hour", F.col("CRS_DEP_MINUTES") / 60.0)
    df = df.withColumn("day_of_year", F.dayofyear("utc_timestamp").cast("double"))

    # Hour of day: sin/cos encoding (24-hour cycle)
    df = df.withColumn(
        "dep_hour_sin", 
        F.sin(2 * F.lit(np.pi) * F.col("dep_hour") / HOURS_PER_DAY)
    )
    df = df.withColumn(
        "dep_hour_cos", 
        F.cos(2 * F.lit(np.pi) * F.col("dep_hour") / HOURS_PER_DAY)
    )

    # Day of week: sin/cos encoding (7-day cycle)
    df = df.withColumn(
        "dow_sin", 
        F.sin(2 * F.lit(np.pi) * F.col("DAY_OF_WEEK") / DAYS_PER_WEEK)
    )
    df = df.withColumn(
        "dow_cos", 
        F.cos(2 * F.lit(np.pi) * F.col("DAY_OF_WEEK") / DAYS_PER_WEEK)
    )

    # Day of year: sin/cos encoding (365-day cycle)
    df = df.withColumn(
        "doy_sin", 
        F.sin(2 * F.lit(np.pi) * F.col("day_of_year") / DAYS_PER_YEAR)
    )
    df = df.withColumn(
        "doy_cos", 
        F.cos(2 * F.lit(np.pi) * F.col("day_of_year") / DAYS_PER_YEAR)
    )
    
    return df
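
Two range details are worth checking for the encodings above. The sketch below mirrors the same formula in plain Python, assuming `DAY_OF_WEEK` takes the values 1 through 7 and that leap years (2016 falls in the dataset's range) produce a day-of-year of 366. The helper name `encode` is hypothetical.

```python
import math

DAYS_PER_WEEK = 7
DAYS_PER_YEAR = 365  # same fixed period as the configuration cell


def encode(value, period):
    """Return the (sin, cos) pair for a value with the given period."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


# Day of week: values 1..7 should land on seven distinct points.
dow_points = {
    tuple(round(x, 9) for x in encode(day, DAYS_PER_WEEK))
    for day in range(1, 8)
}

# Day of year: with a 365-day period, day 366 wraps onto day 1.
leap_day_366 = encode(366, DAYS_PER_YEAR)
jan_1 = encode(1, DAYS_PER_YEAR)

print(len(dow_points))
print(leap_day_366, jan_1)
```

The collision between day 366 and day 1 is probably harmless for this purpose, since the two dates are adjacent anyway, but it is a consequence of the fixed 365-day period rather than an exact calendar mapping.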

### 3.2 Weather Delta Features

Compute 3-hour changes in weather conditions to capture weather trends that may indicate developing delays.

In [0]:
def add_weather_deltas(df):
    """
    Add 3-hour weather change features for trend detection.
    
    Captures weather dynamics by computing the difference between current
    and 3-hour-ago values. Rapidly changing weather (e.g., dropping visibility,
    rising wind speed) often precedes delays.
    
    Creates 5 new features:
    - HourlyVisibility_3h_change
    - HourlyStationPressure_3h_change
    - HourlyDryBulbTemperature_3h_change
    - HourlyWindSpeed_3h_change
    - HourlyPrecipitation_3h_change
    
    Args:
        df: PySpark DataFrame with weather columns and utc_timestamp
        
    Returns:
        DataFrame with weather delta features added
    """
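    # NOTE: F.lag(col, 3) looks back three *rows* within each origin airport,
    # not three hours. The result is a true 3-hour change only if each
    # partition holds exactly one row per hour; with one row per flight, the
    # lookback spans however long the previous three rows cover.
    w = Window.partitionBy("ORIGIN_AIRPORT_SEQ_ID").orderBy("utc_timestamp")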
    
    for col_name in WEATHER_COLUMNS:
        lag_col = F.lag(col_name, 3).over(w)
        delta_col = F.col(col_name) - lag_col
        
        # Return None if lag value doesn't exist (avoids bias from defaulting to 0)
        df = df.withColumn(
            f"{col_name}_3h_change",
            F.when(lag_col.isNull(), F.lit(None)).otherwise(delta_col)
        )
    
    return df
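
`F.lag(col, 3)` is a row-based lookback, so whether it matches a 3-hour change depends on how rows are spaced in time. The plain-Python sketch below contrasts a row-based lag with a time-based lookup on made-up readings. The data and the helper names `row_lag_delta` and `time_lag_delta` are assumptions for illustration, not the notebook's method.

```python
from datetime import datetime, timedelta

# Hypothetical visibility readings attached to flights at one airport,
# sorted by time. Note the irregular gaps between rows.
rows = [
    (datetime(2019, 1, 1, 8, 0), 10.0),
    (datetime(2019, 1, 1, 11, 0), 9.0),
    (datetime(2019, 1, 1, 11, 20), 8.0),
    (datetime(2019, 1, 1, 11, 40), 6.0),
    (datetime(2019, 1, 1, 12, 0), 3.0),
]


def row_lag_delta(rows, i, offset=3):
    """Change versus the value `offset` rows earlier (row-based lag)."""
    if i < offset:
        return None  # no lag row available, so the delta is undefined
    return rows[i][1] - rows[i - offset][1]


def time_lag_delta(rows, i, hours=3):
    """Change versus the latest earlier reading taken at least `hours` ago."""
    cutoff = rows[i][0] - timedelta(hours=hours)
    earlier = [value for ts, value in rows[:i] if ts <= cutoff]
    if not earlier:
        return None
    return rows[i][1] - earlier[-1]


last = len(rows) - 1
print(row_lag_delta(rows, last))   # compares 12:00 with 11:00 (1 hour back)
print(time_lag_delta(rows, last))  # compares 12:00 with 08:00 (4 hours back)
```

When rows arrive at irregular intervals, the two approaches can disagree, which is worth keeping in mind when interpreting the `*_3h_change` columns.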

### 3.3 Origin Congestion Features

Count departing flights in the past hour at the origin airport to capture ground congestion.

In [0]:
def add_origin_congestion_features(df):
    """
    Add origin airport congestion feature.
    
    Counts the number of flights departing from the same origin airport
    in the previous hour. High congestion indicates potential ground delays,
    gate availability issues, and runway queuing.
    
    Creates 1 new feature:
    - ground_flights_last_hour: Count of departures in past 60 minutes
    
    Args:
        df: PySpark DataFrame with ORIGIN_AIRPORT_SEQ_ID and utc_timestamp
        
    Returns:
        DataFrame with origin congestion feature added
    """
    # Convert timestamp to seconds for range-based window
    df = df.withColumn("utc_ts_sec", F.col("utc_timestamp").cast("long"))

    # Rolling 1-hour window ending at current row
    w = (Window
         .partitionBy("ORIGIN_AIRPORT_SEQ_ID")
         .orderBy("utc_ts_sec")
         .rangeBetween(-SECONDS_PER_HOUR, 0))

    # Count flights in window (the range frame also includes other flights that
    # share the current timestamp), then subtract 1 to exclude the current row
    df = df.withColumn(
        "ground_flights_last_hour",
        F.count("utc_ts_sec").over(w) - 1
    )

    return df
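
The window logic can be sanity-checked outside Spark. The sketch below reproduces the same inclusive trailing range, `[t - 3600, t]`, on a sorted list of epoch seconds using `bisect`. The function name `trailing_counts` and the sample timestamps are hypothetical.

```python
from bisect import bisect_left, bisect_right

SECONDS_PER_HOUR = 3600


def trailing_counts(timestamps, window=SECONDS_PER_HOUR):
    """For each timestamp t, count the other events in [t - window, t].

    `timestamps` must be sorted ascending (epoch seconds).
    """
    counts = []
    for t in timestamps:
        lo = bisect_left(timestamps, t - window)  # first event >= t - window
        hi = bisect_right(timestamps, t)          # one past the last event <= t
        counts.append(hi - lo - 1)                # minus 1 drops the current event
    return counts


# Hypothetical departure times at one airport, in seconds
departures = [0, 1800, 3600, 3600, 7300]
print(trailing_counts(departures))  # [0, 1, 3, 3, 0]
```

The two flights at second 3600 each count the other, and the flight at second 0 is still inside their window because both bounds are inclusive.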

### 3.4 Destination Congestion Features

Count arriving flights in the past hour at the destination airport to capture airspace and arrival congestion.

In [0]:
def add_dest_congestion_features(df):
    """
    Add destination airport congestion feature.
    
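    Counts the flights bound for the same destination airport whose
    utc_timestamp falls in the previous hour. The window uses the same
    utc_timestamp column as the origin feature, so if that column holds
    departure times, this is a proxy for arrival congestion rather than a
    count based on actual arrival times. High arrival congestion can cause
    holding patterns, diversions, and cascading delays.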
    
    Creates 1 new feature:
    - arrivals_last_hour: Count of arrivals in past 60 minutes
    
    Args:
        df: PySpark DataFrame with DEST_AIRPORT_SEQ_ID and utc_timestamp
        
    Returns:
        DataFrame with destination congestion feature added
    """
    # Convert timestamp to seconds for range-based window
    df = df.withColumn("utc_ts_sec", F.col("utc_timestamp").cast("long"))
    
    # Rolling 1-hour window ending at current row
    w = (Window
         .partitionBy("DEST_AIRPORT_SEQ_ID")
         .orderBy("utc_ts_sec")
         .rangeBetween(-SECONDS_PER_HOUR, 0))
    
    # Count flights in window (the range frame also includes other flights that
    # share the current timestamp), then subtract 1 to exclude the current row
    df = df.withColumn(
        "arrivals_last_hour",
        F.count("utc_ts_sec").over(w) - 1
    )
    
    return df

## 4. Apply Feature Engineering

Apply all feature transformations to train, validation, and test sets.

In [0]:
def apply_all_features(df):
    """
    Apply all feature engineering transformations in sequence.
    
    Args:
        df: Input PySpark DataFrame
        
    Returns:
        DataFrame with all new features added
    """
    return (df
            .transform(add_time_features)
            .transform(add_weather_deltas)
            .transform(add_origin_congestion_features)
            .transform(add_dest_congestion_features))


# Apply to all splits
print("Applying feature engineering...")
train_df_fe = apply_all_features(train_df)
val_df_fe = apply_all_features(val_df)
test_df_fe = apply_all_features(test_df)
print("✓ Feature engineering complete")
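
Before saving, it can help to confirm that every expected column is present. The check below is a small sketch that works on any list of column names, such as `train_df_fe.columns`. The names `EXPECTED_NEW_COLUMNS` and `missing_columns` are hypothetical and not part of the original notebook.

```python
WEATHER_COLUMNS = [
    "HourlyVisibility",
    "HourlyStationPressure",
    "HourlyDryBulbTemperature",
    "HourlyWindSpeed",
    "HourlyPrecipitation",
]

# The 13 features this notebook is expected to add
EXPECTED_NEW_COLUMNS = (
    ["dep_hour_sin", "dep_hour_cos", "dow_sin", "dow_cos", "doy_sin", "doy_cos"]
    + [f"{name}_3h_change" for name in WEATHER_COLUMNS]
    + ["ground_flights_last_hour", "arrivals_last_hour"]
)


def missing_columns(columns, expected=EXPECTED_NEW_COLUMNS):
    """Return the expected column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in expected if name not in present]


# Example with a plain list; in the notebook, pass train_df_fe.columns instead
example_columns = ["dep_hour_sin", "dep_hour_cos", "arrivals_last_hour"]
print(missing_columns(example_columns))
```

An empty list means all 13 features were created; anything else names the columns to investigate.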


## 5. Save Results

Save the enriched datasets to parquet for use in neural network training.

In [0]:
# Create output directory
dbutils.fs.mkdirs(OUTPUT_DIR)

# Save each split
print("Saving datasets...")
train_df_fe.write.mode("overwrite").parquet(f"{OUTPUT_DIR}/train.parquet")
val_df_fe.write.mode("overwrite").parquet(f"{OUTPUT_DIR}/val.parquet")
test_df_fe.write.mode("overwrite").parquet(f"{OUTPUT_DIR}/test.parquet")

print(f"✓ Saved to: {OUTPUT_DIR}")


Checkpointed 5_year_custom_joined/fe_graph_and_holiday_nnfeat/training_splits/train
Checkpointed 5_year_custom_joined/fe_graph_and_holiday_nnfeat/training_splits/val
Checkpointed 5_year_custom_joined/fe_graph_and_holiday_nnfeat/training_splits/test


---

## Summary

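This notebook adds **13 new features** designed for neural network models. The intermediate helper columns `dep_hour`, `day_of_year`, and `utc_ts_sec` are not dropped, so they also remain in the saved output.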

| Feature | Description |
|---------|-------------|
| `dep_hour_sin`, `dep_hour_cos` | Cyclic encoding of departure hour |
| `dow_sin`, `dow_cos` | Cyclic encoding of day of week |
| `doy_sin`, `doy_cos` | Cyclic encoding of day of year |
| `HourlyVisibility_3h_change` | 3-hour change in visibility |
| `HourlyStationPressure_3h_change` | 3-hour change in pressure |
| `HourlyDryBulbTemperature_3h_change` | 3-hour change in temperature |
| `HourlyWindSpeed_3h_change` | 3-hour change in wind speed |
| `HourlyPrecipitation_3h_change` | 3-hour change in precipitation |
| `ground_flights_last_hour` | Origin airport departure congestion |
| `arrivals_last_hour` | Destination airport arrival congestion |

### Key Design Decisions

1. **Cyclic Encoding**: Sine/cosine pairs preserve the circular nature of time for neural networks
2. **Weather Deltas**: 3-hour window captures weather trends without requiring excessive lookback
3. **Congestion Windows**: 1-hour rolling windows balance recency with statistical stability
4. **Null Handling**: Missing lag values return `None` rather than 0 to avoid bias