# Analysis and Prediction of Flight Delays

Flight delays are most often caused by weather, air traffic control (ATC) limitations, or airline issues such as crew availability and mechanical problems. These causes are difficult to measure directly, but the basic attributes of a flight can reveal them indirectly through recurring patterns. For example, mid-afternoon departures in summer are more likely to encounter weather delays because of frequent afternoon thunderstorms; certain airlines or aircraft are more prone to mechanical or crew availability issues; and evening flights are more likely to be delayed by late inbound aircraft, as delays from earlier in the day accumulate. This notebook demonstrates these simple correlations and then provides a trained machine learning model to predict flight delays with reasonable accuracy.

## Airline Delay Data

The core data used for this analysis comes from the [Airline Delay Analysis](https://www.kaggle.com/datasets/sherrytp/airline-delay-analysis?resource=download) dataset available on Kaggle. Data has been selected for 2019, with the primary CSV containing over 10 million rows of data.

The data will be imported as a dataframe, and the following transformations will be applied:
- The 'FL_DATE' column, representing the flight date, will be reduced to the month of the year. This should capture seasonal fluctuations in flight delay patterns.
- The 'DEP_TIME' column, representing the departure time of each flight, will be reduced to the hour of the day. This should capture delay patterns that follow the daily operational cycle.
- The 'ARR_DELAY' column, the arrival delay of the flight in minutes, is the target variable and will be binned into the following categories:
    - 'NO_DELAY' - arrival delays of less than 15 minutes (including early arrivals), which most airlines consider an 'on-time' arrival
    - 'MINOR_DELAY' - delays of at least 15 but less than 45 minutes, which are not likely to cause significant disruptions
    - 'MAJOR_DELAY' - delays of at least 45 minutes but less than 2 hours, which may cause serious disruptions in passenger movement
    - 'SEVERE_DELAY' - delays of 2 hours or more, which are very likely to cause significant disruptions and costs in passenger movement
- The 'OP_UNIQUE_CARRIER' column, the airline carrier, will be included as a categorical feature, since certain carriers are more likely to experience delays.
- The 'ORIGIN' and 'DEST' columns, the airports the flight departs from and arrives at, will also be included as categorical features and should capture variations based on the flight's location.

In [2]:
import pandas as pd

df = pd.read_csv('data/2019.csv')

# Remove rows that are missing any field needed for the transformations below
df = df.dropna(subset=['FL_DATE', 'DEP_TIME', 'ARR_DELAY'])

# Transform from format 2019-11-17 -> 11
df['FL_MONTH'] = df['FL_DATE'].apply(lambda x: int(x[5:7]))

# Transform from format 1715 (read as 1715.0 by pandas) -> 17
# The modulo keeps any time recorded as 2400 (midnight) within hours 0-23
df['DEP_HOUR'] = df['DEP_TIME'].apply(lambda x: int(x // 100) % 24)

def categorize_delay(delay):
    if delay < 15:
        return 'NO_DELAY'
    elif delay < 45:
        return 'MINOR_DELAY'
    elif delay < 120:
        return 'MAJOR_DELAY'
    else:
        return 'SEVERE_DELAY'

# Transform into categories for making categorical predictions
df['DELAY_CATEGORY'] = df['ARR_DELAY'].apply(categorize_delay)

# Drop the remaining unused columns
keep_columns = ['FL_MONTH', 'DEP_HOUR', 'OP_UNIQUE_CARRIER', 'ORIGIN', 'DEST', 'DELAY_CATEGORY']
df = df[keep_columns]
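
Row-by-row `apply` calls can be slow on several million rows. As a minimal sketch, and not necessarily how this notebook should be run, the three transformations could also be written with vectorized pandas operations. The small `demo` frame and its values are made up for illustration, and `pd.cut` with `right=False` is used on the assumption that left-closed bins reproduce the thresholds in `categorize_delay`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, for illustration only (not taken from the dataset)
demo = pd.DataFrame({
    'FL_DATE': ['2019-11-17', '2019-07-04', '2019-01-02', '2019-03-09'],
    'DEP_TIME': [1715.0, 2400.0, 5.0, 930.0],
    'ARR_DELAY': [-8.0, 15.0, 119.0, 120.0],
})

# Month of the year, parsed from the ISO-formatted date string
demo['FL_MONTH'] = pd.to_datetime(demo['FL_DATE'], format='%Y-%m-%d').dt.month

# Hour of the day; the modulo keeps a 2400 (midnight) reading within 0-23
demo['DEP_HOUR'] = (demo['DEP_TIME'] // 100).astype(int) % 24

# Left-closed bins: [-inf, 15), [15, 45), [45, 120), [120, inf)
bins = [-np.inf, 15, 45, 120, np.inf]
labels = ['NO_DELAY', 'MINOR_DELAY', 'MAJOR_DELAY', 'SEVERE_DELAY']
demo['DELAY_CATEGORY'] = pd.cut(demo['ARR_DELAY'], bins=bins, labels=labels, right=False)

print(demo[['FL_MONTH', 'DEP_HOUR', 'DELAY_CATEGORY']])
```

Note that `pd.cut` returns a categorical column rather than plain strings, which may or may not suit later preprocessing steps.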

## Sampling

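The full dataset is larger than needed for exploration and model training, so a random sample is drawn from it. The slider below sets the sample size (10,000 to 1,000,000 rows, default 50,000), and a fixed random seed keeps the sample reproducible.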

In [3]:
import ipywidgets as widgets
from IPython.display import display

sample_slider = widgets.IntSlider(value=50000, min=10000, max=1000000, step=10000, description='Sample Size:')
display(sample_slider)


In [4]:
def sample_data(sample_size):
    return df.sample(n=sample_size, random_state=42)

sample_size = sample_slider.value

df_sampled = sample_data(sample_size)

df_sampled.head()

| | FL_MONTH | DEP_HOUR | OP_UNIQUE_CARRIER | ORIGIN | DEST | DELAY_CATEGORY |
|---|---|---|---|---|---|---|
| 1046527 | 2 | 21 | AA | PHL | AUS | SEVERE_DELAY |
| 6212992 | 11 | 6 | DL | SEA | SLC | NO_DELAY |
| 1649212 | 3 | 10 | YV | BDL | IAD | NO_DELAY |
| 2584035 | 5 | 8 | WN | ONT | SMF | NO_DELAY |
| 33601 | 1 | 20 | UA | IAH | ORD | MINOR_DELAY |
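
Because the sample is drawn uniformly at random, its delay-category proportions should track those of the full dataset, although rare classes such as 'SEVERE_DELAY' can drift in smaller samples. One way to check this is sketched below on synthetic data. The `full` frame, its class weights, and the helper name `category_shares` are assumptions made for illustration, not results from the real dataset.

```python
import numpy as np
import pandas as pd

def category_shares(full, sampled, column='DELAY_CATEGORY'):
    """Return each category's share in the full and sampled frames, plus the gap."""
    shares = pd.DataFrame({
        'full': full[column].value_counts(normalize=True),
        'sampled': sampled[column].value_counts(normalize=True),
    }).fillna(0.0)
    shares['abs_diff'] = (shares['full'] - shares['sampled']).abs()
    return shares

# Synthetic stand-in for the cleaned dataframe; the class weights are invented
rng = np.random.default_rng(42)
labels = ['NO_DELAY', 'MINOR_DELAY', 'MAJOR_DELAY', 'SEVERE_DELAY']
full = pd.DataFrame({
    'DELAY_CATEGORY': rng.choice(labels, size=100_000, p=[0.8, 0.1, 0.07, 0.03]),
})
sampled = full.sample(n=10_000, random_state=42)

shares = category_shares(full, sampled)
print(shares)
```

In the notebook itself, the call would presumably be `category_shares(df, df_sampled)`. A large value in `abs_diff` would suggest increasing the sample size.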


## Export Data

Once the data has been cleaned and sampled to a reasonable size, it can be exported to its own file for other notebooks in this project to reference.

In [7]:
df_sampled.to_csv('data/sampled_data.csv', index=False, header=True)
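
A downstream notebook can then load the exported file. The sketch below round-trips two example rows in the exported schema through a temporary CSV to show one way of doing this. Reading the string columns as `category` is an optional choice assumed here, not something the other notebooks in this project are known to do.

```python
import os
import tempfile

import pandas as pd

# Two example rows in the exported schema
original = pd.DataFrame({
    'FL_MONTH': [2, 11],
    'DEP_HOUR': [21, 6],
    'OP_UNIQUE_CARRIER': ['AA', 'DL'],
    'ORIGIN': ['PHL', 'SEA'],
    'DEST': ['AUS', 'SLC'],
    'DELAY_CATEGORY': ['SEVERE_DELAY', 'NO_DELAY'],
})

categorical_columns = ['OP_UNIQUE_CARRIER', 'ORIGIN', 'DEST', 'DELAY_CATEGORY']

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sampled_data.csv')
    original.to_csv(path, index=False, header=True)

    # Read the file back, storing the string columns as pandas categoricals
    reloaded = pd.read_csv(path, dtype={col: 'category' for col in categorical_columns})

print(reloaded.dtypes)
```

In this project the path would be `data/sampled_data.csv` rather than a temporary file.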

[Continue to Data Visualizations](visualizations.ipynb)