In [2]:
import os

import pandas as pd
import ipywidgets as widgets
from IPython.display import display
from datetime import datetime

from sklearn.base import BaseEstimator, TransformerMixin

# Airline Flight Delay Analysis and Prediction

## Collected Datasets

The core data used for this analysis comes from the [Airline Delay Analysis](https://www.kaggle.com/datasets/sherrytp/airline-delay-analysis?resource=download) dataset available on Kaggle. Data has been selected for 2019; the primary CSV contains over 10 million rows. Because of its size, the dataset has been reduced by random sampling.

To fetch the full dataset, go to the [provided link](https://www.kaggle.com/datasets/sherrytp/airline-delay-analysis?resource=download), download the '2019' CSV file, unzip it, and place it in the `data` directory of this project as `data/2019.csv`. Then run the following code blocks to select a sample size and to create and save a new sample.


In [139]:
sample_slider = widgets.IntSlider(value=100000, min=10000, max=1000000, step=10000, description='Sample Size:')
display(sample_slider)

IntSlider(value=100000, description='Sample Size:', max=1000000, min=10000, step=10000)

In [155]:
if os.path.exists('data/2019.csv'):
    df = pd.read_csv('data/2019.csv')
    df = df.sample(n=sample_slider.value, random_state=42)
    df.to_csv('data/2019_sample.csv', index=False)
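
The role of `random_state` in the sampling step can be illustrated on a small synthetic frame. This is only a sketch: the `FLIGHT_ID` column and the row counts are made up for illustration and do not come from the Kaggle data.

```python
import pandas as pd

# Synthetic stand-in for the full 2019 CSV (assumption: any DataFrame works here).
flights = pd.DataFrame({"FLIGHT_ID": range(1_000)})

# Sampling twice with the same seed selects the same rows,
# so a saved sample can be regenerated later.
sample_a = flights.sample(n=100, random_state=42)
sample_b = flights.sample(n=100, random_state=42)

same_rows = sample_a.index.equals(sample_b.index)
print(same_rows)
```

Sampling is done without replacement by default, so each row appears at most once in the sample.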

Additional data collected includes a list of all United States airport codes mapped to their respective latitude and longitude coordinates. This data is available [on GitHub](https://raw.githubusercontent.com/ip2location/ip2location-iata-icao/master/iata-icao.csv). 

In [4]:
if not os.path.exists('data/airports.csv'):
    airports_df = pd.read_csv('https://raw.githubusercontent.com/ip2location/ip2location-iata-icao/master/iata-icao.csv')
    airports_df.to_csv('data/airports.csv', index=False)


## Parsing and Cleaning Data

The collected and sampled data contains many columns that are unnecessary or redundant for this analysis. These columns will be removed:
- OP_CARRIER_FL_NUM - The flight number is arbitrarily chosen by the airline and has no bearing on the chance of flight delay.
- DEP_DELAY - This will usually be directly correlated with ARR_DELAY, but ARR_DELAY is more indicative of actual airline performance. For example, some airlines will fly faster and burn more fuel to "make up time" en route.
- TAXI_OUT, WHEELS_OFF, WHEELS_ON, TAXI_IN - These represent the time it takes to taxi the aircraft between the gate and the runway. This correlates either with the delay value itself or with the airport (as some airports have longer taxi times due to runway configurations and ground congestion). Airlines will have compensated for this in their flight time planning.
- ARR_TIME - The actual arrival time is only known once the flight has landed, so it is not available when a prediction would be made.
- AIR_TIME - Another measure of time, indicating how long the aircraft was in the air. By the time the aircraft is in the air, the root cause of any delay will have already occurred, so a prediction is no longer useful.
- CARRIER_DELAY, NAS_DELAY, WEATHER_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY - These columns attempt to separate out delay minutes by root cause. They could prove useful for a detailed analysis of flight delay causes, but their values are not known at the time a prediction would be useful.

In [165]:
df = pd.read_csv('data/2019_sample.csv')

# Drop columns that are uninformative or not known at prediction time
df = df.drop(columns = ['OP_CARRIER_FL_NUM', 'DEP_DELAY'])
df = df.drop(columns = ['TAXI_OUT', 'WHEELS_OFF', 'WHEELS_ON', 'TAXI_IN', 'ARR_TIME', 'AIR_TIME'])
df = df.drop(columns = ['CARRIER_DELAY', 'NAS_DELAY', 'WEATHER_DELAY', 'SECURITY_DELAY', 'LATE_AIRCRAFT_DELAY'])
# Drop any column whose name starts with 'Unnamed'
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]

# Keep only rows that have the fields required by the transformers and the target
df = df.dropna(subset=['FL_DATE', 'DEP_TIME', 'ARR_DELAY'])

The remaining columns will be transformed using custom preprocessing steps in a pipeline. This lets the input data stay in an intuitive format (date strings, clock times, airport codes) until the transformations are applied.
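
As a rough sketch of this idea, preprocessing steps can be chained with scikit-learn's `Pipeline`. The two step functions below (`add_dep_minutes` and `add_day_of_week`) are simplified, hypothetical stand-ins for the custom transformers defined later in this notebook, and the single input row is made up for illustration.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def add_dep_minutes(X: pd.DataFrame) -> pd.DataFrame:
    """Convert an 'hhmm' number into minutes since midnight (hypothetical helper)."""
    out = X.copy()
    out["DEP_MINUTES"] = (out["DEP_TIME"] // 100) * 60 + out["DEP_TIME"] % 100
    return out.drop(columns=["DEP_TIME"])


def add_day_of_week(X: pd.DataFrame) -> pd.DataFrame:
    """Add the weekday as 0 (Monday) to 6 (Sunday) (hypothetical helper)."""
    out = X.copy()
    out["DAY_OF_WEEK"] = pd.to_datetime(out["FL_DATE"]).dt.dayofweek
    return out


# Each step receives the DataFrame produced by the previous step.
pipeline = Pipeline([
    ("dep_minutes", FunctionTransformer(add_dep_minutes)),
    ("day_of_week", FunctionTransformer(add_day_of_week)),
])

# Illustrative input row in the raw, human-readable format.
raw = pd.DataFrame({"FL_DATE": ["2019-03-01"], "DEP_TIME": [1435.0]})
prepared = pipeline.fit_transform(raw)
print(prepared)
```

Because the steps live inside the pipeline, callers can keep passing dates and clock times in their original form.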



### Date/Time Transformer

This custom transformer takes in the provided data and performs the following transformations:
- Transform `FL_DATE` into `FL_DAY`, converting a string date in "yyyy-mm-dd" format to the number of days since January 1st of that year (January 1st itself is day 0)
- Transform `DEP_TIME` into `DEP_MINUTES`, converting a number formatted as "hhmm" into the number of minutes since midnight
- Transform `FL_DATE` into `DAY_OF_WEEK`, converting the date into the day of the week as an integer from 0 (Monday) to 6 (Sunday)

Note: This is also defined in `tranformers.py` for easy import into other project notebooks.

In [150]:
class DateTimeTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X_transformed = X.copy()
        X_transformed['FL_DAY'] = X['FL_DATE'].apply(self.days_since_year_start)
        X_transformed['DEP_MINUTES'] = X['DEP_TIME'].apply(self.minutes_since_midnight)
        X_transformed['DAY_OF_WEEK'] = X['FL_DATE'].apply(self.day_of_week)
        return X_transformed.drop(columns = ['FL_DATE', 'DEP_TIME'])

    def days_since_year_start(self, date_str: str) -> int:
        input_date = datetime.strptime(date_str, "%Y-%m-%d")
        january_first = datetime(input_date.year, 1, 1)
        days_difference = (input_date - january_first).days
        return days_difference

    def minutes_since_midnight(self, time_float: float) -> int:
        hours = int(time_float // 100)
        minutes = int(time_float % 100)
        total_minutes = hours * 60 + minutes
        return total_minutes
    
    def day_of_week(self, date_str: str) -> int:
        input_date = datetime.strptime(date_str, "%Y-%m-%d")
        return input_date.weekday()
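
The three conversions can be checked by hand on a few values. The sketch below restates them as standalone functions (the same arithmetic as the methods above, without the class) and applies them to example inputs chosen only for illustration.

```python
from datetime import datetime


def days_since_year_start(date_str: str) -> int:
    d = datetime.strptime(date_str, "%Y-%m-%d")
    return (d - datetime(d.year, 1, 1)).days


def minutes_since_midnight(time_float: float) -> int:
    return int(time_float // 100) * 60 + int(time_float % 100)


def day_of_week(date_str: str) -> int:
    return datetime.strptime(date_str, "%Y-%m-%d").weekday()


# January 1st is day 0, so 2019 (not a leap year) runs from 0 to 364.
print(days_since_year_start("2019-01-01"))  # 0
print(days_since_year_start("2019-12-31"))  # 364

# 14:35 is 14 * 60 + 35 minutes after midnight.
print(minutes_since_midnight(1435.0))  # 875

# March 1st, 2019 was a Friday, which weekday() reports as 4.
print(day_of_week("2019-03-01"))  # 4
```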

### Airport Latitude/Longitude Transformer

This transformer reads the `ORIGIN` and `DEST` columns, each of which holds an IATA airport code (e.g. ATL for Atlanta, LAX for Los Angeles), and creates new latitude and longitude columns `ORIGIN_LAT`, `ORIGIN_LON`, `DEST_LAT`, and `DEST_LON`.

Note: This is also defined in `tranformers.py` for easy import into other project notebooks.

In [144]:
class AirportLatLongTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        airports_df = pd.read_csv('data/airports.csv')
        self.airports_dict = airports_df[['iata', 'latitude', 'longitude']].set_index('iata').T.to_dict('list')
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X_transformed = X.copy()
        for col in ['ORIGIN', 'DEST']:
            lat_col = col + '_LAT'
            lon_col = col + '_LON'
            X_transformed[lat_col] = X[col].apply(lambda x: self.airports_dict[x][0])
            X_transformed[lon_col] = X[col].apply(lambda x: self.airports_dict[x][1])
        return X_transformed
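
The transformer above indexes a plain dictionary, so an airport code that is absent from `airports.csv` raises a `KeyError`. One possible alternative, sketched below, is `Series.map`, which leaves `NaN` for unknown codes so they can be inspected or dropped afterwards. The small airport table uses rounded, approximate coordinates, and `XXX` is a made-up code included only to show the missing case.

```python
import pandas as pd

# Illustrative stand-in for data/airports.csv (approximate coordinates).
airports = pd.DataFrame({
    "iata": ["ATL", "LAX"],
    "latitude": [33.64, 33.94],
    "longitude": [-84.43, -118.41],
}).set_index("iata")

# 'XXX' is a hypothetical code that is not in the lookup table.
flights = pd.DataFrame({
    "ORIGIN": ["ATL", "LAX", "XXX"],
    "DEST": ["LAX", "ATL", "ATL"],
})

for col in ["ORIGIN", "DEST"]:
    # Mapping against an indexed Series looks up each code; unknown codes become NaN.
    flights[f"{col}_LAT"] = flights[col].map(airports["latitude"])
    flights[f"{col}_LON"] = flights[col].map(airports["longitude"])

# Rows whose origin could not be resolved.
missing = flights[flights["ORIGIN_LAT"].isna()]
print(missing)
```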


### Preparing Arrival Delay Values

To use classification for delay identification, the arrival delay will be sorted into one of four categories based on the severity of the delay. Arrival delays under 15 minutes (including early arrivals) are considered on time, delays of at least 15 but under 45 minutes are considered minor, delays of at least 45 minutes but under 2 hours are considered major, and delays of 2 hours or more are considered severe.


In [166]:
# df = df['ARR_DELAY'].apply(lambda x: max(x, 0))

def categorize_delay(delay):
    if delay < 15:
        return 'NO_DELAY'
    elif delay < 45:
        return 'MINOR_DELAY'
    elif delay < 120:
        return 'MAJOR_DELAY'
    else:
        return 'SEVERE_DELAY'

# Transform into categories for making categorical predictions
df['DELAY_CATEGORY'] = df['ARR_DELAY'].apply(categorize_delay)
df = df.drop(columns = 'ARR_DELAY')
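
The same thresholds can also be written as half-open bins with `pandas.cut`, which makes the boundary behaviour explicit. This is a sketch of an alternative, not the method used in this notebook; the delay values are made up to exercise each boundary.

```python
import numpy as np
import pandas as pd

# Example delays in minutes, chosen to sit on either side of each threshold.
delays = pd.Series([-5, 14, 15, 44, 45, 119, 120, 300])

# right=False makes each bin include its left edge and exclude its right edge,
# matching the chained '<' comparisons in categorize_delay.
bins = [-np.inf, 15, 45, 120, np.inf]
labels = ["NO_DELAY", "MINOR_DELAY", "MAJOR_DELAY", "SEVERE_DELAY"]
categories = pd.cut(delays, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"delay": delays, "category": categories}))
```

The result is an ordered categorical column, which may be convenient if the categories later need to be sorted by severity.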


## Persisting Prepared Data

With the necessary transformers in place, the data can be saved and used in other notebooks for descriptive and predictive analysis.

In [169]:
df.to_csv('data/2019_prepared.csv', index=False)
print("Successfully saved prepared data")

Successfully saved prepared data
