# Flight Delay Prediction - Exploratory Data Analysis


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from pathlib import Path
import warnings

warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")


## Feature Selection Rationale

### Features We Rule Out

We exclude most features from the dataset for several key reasons:

- **Redundant features**: Temporal components (Year, Quarter, Month, DayOfWeek) can be derived from `FlightDate`; geographic identifiers have multiple representations (AirportID vs AirportSeqID, State vs StateFips) where one suffices; delay metrics have overlapping versions (DepDelay vs DepDelayMinutes, DepDel15 as binary indicator).

- **Data leakage risk**: Delay cause breakdowns (`CarrierDelay`, `WeatherDelay`, `NASDelay`, `SecurityDelay`, `LateAircraftDelay`) are post-hoc categorizations unavailable at prediction time. **Note**: While we exclude these from the prediction model, we can recover them later to validate our explainability (comparing model predictions against actual delay causes to ensure the model's reasoning aligns with ground truth).

- **Too specific or rare**: Diversion details (Div1Airport through Div5Airport) apply to <1% of flights; aircraft-level features (`Tail_Number`) are too granular and risk overfitting; code-share complexities add minimal predictive value.

- **Different purposes**: `Flights` is an aggregation count; `Duplicate` is a data quality flag; elapsed time metrics are less directly related to delays than delay minutes themselves.
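The redundancy argument above can be checked on a toy example. The sketch below is illustrative only: the three dates and delay values are made up, and it assumes the dataset encodes `DayOfWeek` as 1 = Monday through 7 = Sunday, that `DepDelayMinutes` is `DepDelay` with early departures set to 0, and that `DepDel15` flags delays of 15 minutes or more.

```python
import pandas as pd

# Toy frame standing in for the real data (values are invented for illustration).
flights = pd.DataFrame({
    "FlightDate": ["2022-01-03", "2022-07-16", "2022-12-25"],
    "DepDelay": [-4.0, 12.0, 47.0],
})
dates = pd.to_datetime(flights["FlightDate"])

# Temporal components are fully recoverable from FlightDate.
derived = pd.DataFrame({
    "Year": dates.dt.year,
    "Quarter": dates.dt.quarter,
    "Month": dates.dt.month,
    # pandas counts Monday=0 ... Sunday=6; adding 1 gives the assumed 1..7 encoding.
    "DayOfWeek": dates.dt.dayofweek + 1,
})

# Overlapping delay metrics follow from one another (under the assumptions above).
derived["DepDelayMinutes"] = flights["DepDelay"].clip(lower=0)
derived["DepDel15"] = (derived["DepDelayMinutes"] >= 15).astype(int)

print(derived)
```

If the derived columns match the originals on the real data, dropping the originals loses no information.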

### Why Selected Features Are Sufficient

The selected features provide sufficient information for delay prediction by capturing essential dimensions:

- **Temporal patterns**: `FlightDate` and scheduled times (`CRSDepTime`, `CRSArrTime`) capture time-of-day and seasonal effects
- **Airline and route characteristics**: `Airline`, `Flight_Number_Marketing_Airline`, and origin/destination identifiers capture carrier-specific and route-specific patterns
- **Geographic context**: Airport IDs and city/state names enable modeling of weather patterns, regional congestion, and regulatory differences
- **Operational factors**: `TaxiOut` and `TaxiIn` directly impact delays and indicate ground congestion
- **Status indicators**: `Cancelled` and `Diverted` capture extreme operational disruptions

This feature set balances predictive power with practical feasibility: all features are available at prediction time, avoid redundancy, and maintain interpretability for explainability analysis.
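As a rough sketch of how the temporal features could feed a model, the example below turns `FlightDate` and `CRSDepTime` into time-of-day and seasonal signals. It assumes `CRSDepTime` is stored as an `hhmm` integer (for example, `1435` meaning 14:35); the sample values and the new column names (`dep_hour`, `dep_minute`, `month`, `is_weekend`) are hypothetical.

```python
import pandas as pd

# Invented sample rows; only the column names FlightDate and CRSDepTime come from the dataset.
sample = pd.DataFrame({
    "FlightDate": ["2022-01-03", "2022-07-16"],
    "CRSDepTime": [605, 1435],
})
sample["FlightDate"] = pd.to_datetime(sample["FlightDate"])

# Split the assumed hhmm integer into hour and minute.
sample["dep_hour"] = sample["CRSDepTime"] // 100
sample["dep_minute"] = sample["CRSDepTime"] % 100

# Seasonal and weekly signals derived from the date.
sample["month"] = sample["FlightDate"].dt.month
sample["is_weekend"] = sample["FlightDate"].dt.dayofweek >= 5  # Saturday=5, Sunday=6

print(sample)
```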

In [None]:
# Feel free to modify the subset if you think something should be added or removed
column_subset = [
    "FlightDate",
    "Airline",
    "Flight_Number_Marketing_Airline",
    "Origin",
    "Dest",
    "Cancelled",
    "Diverted",
    "CRSDepTime",
    "DepTime",
    "DepDelayMinutes",
    "OriginAirportID",
    "OriginCityName",
    "OriginStateName",
    "DestAirportID",
    "DestCityName",
    "DestStateName",
    "TaxiOut",
    "TaxiIn",
    "CRSArrTime",
    "ArrTime",
    "ArrDelayMinutes",
]

max_rows_per_file = 100000  # We limit the number of rows to avoid memory issues

data_dir = 'data'
dfs = []
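for filename in sorted(os.listdir(data_dir)):  # sorted: os.listdir order is arbitrary, so this keeps the concatenation order reproducible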
    file_path = os.path.join(data_dir, filename)
    if filename.endswith('.csv'):
        df_temp = pd.read_csv(file_path)
        # Random sampling preserves data distribution and avoids temporal/systematic bias from taking first N rows
        if len(df_temp) > max_rows_per_file:
            df_temp = df_temp.sample(n=max_rows_per_file, random_state=42)
        dfs.append(df_temp)

df = pd.concat(dfs, ignore_index=True)
# Select the columns first, then copy, so only the subset is duplicated in memory
df_filtered = df[column_subset].copy()
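If memory remains a concern, one possible alternative is to have `pd.read_csv` load only the needed columns via `usecols`, after checking the file header for missing names. The sketch below is self-contained and uses a temporary toy CSV; the `wanted` list is a hypothetical stand-in for `column_subset`, and the file name and values are invented.

```python
import tempfile
from pathlib import Path

import pandas as pd

wanted = ["FlightDate", "Airline", "DepDelayMinutes"]  # stand-in for column_subset

with tempfile.TemporaryDirectory() as tmp:
    # Write a small toy file with one extra column that we do not want to load.
    csv_path = Path(tmp) / "toy_flights.csv"
    pd.DataFrame({
        "FlightDate": ["2022-01-03", "2022-01-04", "2022-01-05"],
        "Airline": ["A", "B", "A"],
        "DepDelayMinutes": [0.0, 12.0, 45.0],
        "Tail_Number": ["N1", "N2", "N3"],
    }).to_csv(csv_path, index=False)

    # Read only the header row to detect missing columns before the full load.
    header = pd.read_csv(csv_path, nrows=0).columns
    missing = [c for c in wanted if c not in header]

    # Load just the requested columns; the rest are never materialized.
    loaded = pd.read_csv(csv_path, usecols=wanted)

print(missing)
print(loaded.shape)
```

The trade-off is that columns outside the subset, such as the delay-cause breakdowns mentioned earlier, would need a separate read if they are wanted later for validation.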
