# Predicting Flight Delays For Domestic Airlines

### Authors: Ethan Bleier, Elijah Kramer, Roberto Palacios

The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics (BTS) maintains performance records of domestic flights. These statistics include several interesting variables: dates, taxi times, delays, origins, destinations, and departure and arrival times.

Using data analysis and various machine learning algorithms, this notebook plans to predict whether or not a flight will experience a delay. Specifically, we are interested in which predictors will play the biggest role in causing flight delays.

Our main dataset is taken from [Kaggle](https://www.kaggle.com/datasets/giovamata/airlinedelaycauses/data) and represents data from 2008. However, we expect many of the patterns found in this dataset to persist today.

### Python/environment setup

In [None]:
# TODO - check for unused imports

# Module imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import zipfile
import io
import requests

# sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

# scipy
from scipy.stats import zscore

In [None]:
sns.set_theme(context='notebook', style='whitegrid')
plt.rcParams['figure.figsize'] = 6, 4

### Reading in data

First, let's read in our main dataset. Because this file weighs in at almost 250MB and git struggles with large files, we let Kaggle provide the hosting - but this means some extra work to handle their .zip compression.

In [None]:
# Download .zip
csv_url = 'https://www.kaggle.com/api/v1/datasets/download/giovamata/airlinedelaycauses'
response = requests.get(csv_url, timeout=60)
response.raise_for_status()  # fail early if the download did not succeed

# Open and load .csv; the context manager closes the zip stream for us
with zipfile.ZipFile(io.BytesIO(response.content)) as csv_zip:
    with csv_zip.open('DelayedFlights.csv') as csv:
        df = pd.read_csv(csv)

Next, we need to read in our two supporting datasets, both provided by openflights ([https://openflights.org/](https://openflights.org/)) and hosted on GitHub.
Unfortunately, these aren't formatted as nicely as our main dataset - while they're still technically valid .csvs, pandas expects to find column names in row 0 and these don't contain any. So we pass `header=None` to ensure we don't lose row 0 or end up with unexpected column names.

In [None]:
port_df = pd.read_csv('https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat', header=None)
plane_df = pd.read_csv('https://raw.githubusercontent.com/jpatokal/openflights/master/data/airlines.dat', header=None)
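
As a minimal sketch of how `header=None` behaves, the example below reads a small header-less CSV from memory and attaches column names with `names=`. The two rows and the column names are made up for illustration; they are not the actual openflights schema.

```python
import io
import pandas as pd

# Two made-up rows in a header-less layout (not real openflights records)
raw = io.StringIO(
    '1,"Example Field","Springfield","EXF"\n'
    '2,"Sample Strip","Shelbyville","SMS"\n'
)

# Hypothetical column names, chosen only for this illustration
cols = ['AirportID', 'Name', 'City', 'IATA']

# header=None keeps row 0 as data; names= labels the columns
demo_df = pd.read_csv(raw, header=None, names=cols)
print(demo_df)
```

Without `names=`, pandas would label the columns 0, 1, 2, ... which works but is harder to read in later cells.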

### Data Preprocessing

Let's start with some initial cleanup for our main dataset. First, we have a number of columns that don't seem very useful to us or have a large portion of NaN values, so we drop these. We also know we plan to rely heavily on our `DepDelay` and `ArrDelay` variables, so we drop any rows that have these missing.

In [None]:
# Note: 'DayOfWeek' is kept because it is used as a model feature later on
df = df.drop(columns=['Year', 'Unnamed: 0', 'FlightNum', 'TailNum', 'Cancelled', 'DayofMonth', 'CancellationCode',
                      'CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay','LateAircraftDelay'])
df = df.dropna(subset=['DepDelay', 'ArrDelay'])

Lastly, let's generate some helper columns for us to use later.

In [None]:
# log10 is undefined for <= 0 values, so we set those NaN
with np.errstate(divide='ignore'):
    df['ArrDelayLog'] = np.where(df['ArrDelay'] > 0, np.log10(np.abs(df['ArrDelay'])), np.nan)
    df['DepDelayLog'] = np.where(df['DepDelay'] > 0, np.log10(np.abs(df['DepDelay'])), np.nan)

# Is flight actively delayed (not early or on-time)?
df['IsDelayed'] = df['ArrDelay'] > 0
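
The same helper columns can be sketched on toy data with a slightly different formulation: masking non-positive values to NaN before taking the log, so that `np.log10` never sees them and no `np.errstate` block is needed. The series below is made up; this is an equivalent alternative, not the cell used above.

```python
import numpy as np
import pandas as pd

# Toy delays in minutes: one early, one on time, two late
delays = pd.Series([-5.0, 0.0, 10.0, 100.0])

# Series.where keeps values where the condition holds and sets the rest to NaN,
# so log10 only ever receives positive numbers (or NaN, which stays NaN)
log_delays = np.log10(delays.where(delays > 0))

# A flight counts as delayed only when the delay is strictly positive
is_delayed = delays > 0
print(log_delays.tolist(), is_delayed.tolist())
```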

Now we can begin looking at the remaining fields in the dataframe.

In [None]:
df.info()

### Initial Data Exploration

While several factors may influence delays, we'd like to examine the constants in this situation rather than rely on chance incidents of weather or unexpected events. As such we'll first be looking at delays in relation to carriers.

In [None]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize = (14,6))

median_carrier_delay = df.groupby('UniqueCarrier')[['ArrDelay', 'DepDelay']].median().sort_values(by ='ArrDelay', ascending = False)
mean_carrier_delay = df.groupby('UniqueCarrier')[['ArrDelay', 'DepDelay']].mean().sort_values(by ='ArrDelay', ascending = False)

median_carrier_delay.plot.barh(ax = ax1, title = 'Median Delay By Carrier', xlabel = 'Median Delay in Minutes');
mean_carrier_delay.plot.barh(ax = ax2, title = 'Mean Delay By Carrier', xlabel = 'Mean Delay in Minutes');

ax1.set_xlim(0,60)
ax2.set_xlim(0,60)

plt.tight_layout()
plt.show()

This initial analysis suggests that delays are closely tied to carriers. The plot on the left considers the median amount of delay by carrier. This allows us to avoid skewing the data based on some of the larger outliers. When taking this approach, we can see that it is rare for the flights of any carrier to experience delays of over 35 minutes.

The right plot displays similar information but considers the mean delay of every carrier. Here, the result varies slightly from the previous plot, but the overall trends remain the same. The major difference likely comes from outlier flights that increase the mean for every carrier.
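
To make the effect of outliers concrete, here is a small sketch with made-up delay values. A single extreme delay pulls the mean far upward while leaving the median untouched, which is why the two plots can differ.

```python
import pandas as pd

# Made-up delays in minutes, with one extreme outlier
toy_delays = pd.Series([5, 10, 15, 20, 1200])

toy_mean = toy_delays.mean()      # pulled upward by the 1200-minute flight
toy_median = toy_delays.median()  # depends only on the middle value

print(f'mean={toy_mean:.0f}, median={toy_median:.0f}')
```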

Our dataset also contained a column that relates specifically to carrier delays, which could be used to elucidate some of these figures. However, over a third of these values are null, making further evaluation much more difficult. Instead, we'll explore the relationship between airports and delays.

### Delays At Airports

Our initial assumption is that airports play a role in increasing flight delays, so we'll measure this relationship by plotting the airports with the highest amount of delay.

In [None]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize = (14,6))

dest_by_median_delay = df.groupby('Dest')[['ArrDelay','DepDelay']].median().sort_values(by = 'DepDelay', ascending = False).head(10)
dest_by_mean_delay = df.groupby('Dest')[['ArrDelay','DepDelay']].mean().sort_values(by = 'DepDelay', ascending = False).head(10)

dest_by_median_delay.plot.barh(ax = ax1, title = 'Top 10 Airports by Median Delay', xlabel = 'Median Delay in Minutes');
dest_by_mean_delay.plot.barh(ax = ax2, title = 'Top 10 Airports by Mean Delay', xlabel = 'Mean Delay in Minutes');

ax1.set_xlim(0, 80)
ax2.set_xlim(0, 80)

plt.tight_layout()
plt.show()

Here we can see that airports are quite consistent with the amount of delay each flight experiences at their terminals. Once again, the median figures and mean figures are quite similar with some variance. However, we can see that smaller regional airports seem to be overrepresented in this examination. 

Regional airports may not provide the best measure in this case as their record may be skewed in one direction or another. To compensate for this, we will instead look at both the most and least visited airports to assess their levels of delay.

### Delays at High and Low Traffic Airports

First, we'll start by finding the top 10 and bottom 10 airports in terms of receiving flights. Then we'll find the median delay for each respective airport and plot the two together.

In [None]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize = (14,6))

top_ten_airports = df[df['Dest'].isin((df['Dest'].value_counts().head(10)).index)]
top_ten_airports.groupby('Dest')[['ArrDelay','DepDelay']].median().sort_values(by = 'DepDelay', ascending  = False).plot.barh(ax = ax1);

bottom_ten_airports = df[df['Dest'].isin((df['Dest'].value_counts().sort_values().head(10)).index)]
bottom_ten_airports.groupby('Dest')[['ArrDelay','DepDelay']].median().sort_values(by = 'DepDelay', ascending  = False).plot.barh(ax = ax2);

ax1.set_title('Delays At Top 10 Most Visited Airports')
ax1.set_xlabel('Median Delay in Minutes')

ax2.set_title('Delays At 10 Least Visited Airports')
ax2.set_xlabel('Median Delay in Minutes')

ax1.set_xlim(0, 60)
ax2.set_xlim(0, 60)

plt.show()
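
The selection step can also be sketched with `nlargest` and `nsmallest` on the counts, which some readers may find easier to follow than sorting and taking `head`. The destination codes below are made up, and this is an alternative phrasing rather than the code used in the cell above.

```python
import pandas as pd

# Made-up destination codes with different visit counts
toy = pd.DataFrame({'Dest': ['AAA'] * 5 + ['BBB'] * 3 + ['CCC'] * 2 + ['DDD']})
counts = toy['Dest'].value_counts()

busiest = counts.nlargest(2).index    # the two most visited destinations
quietest = counts.nsmallest(2).index  # the two least visited destinations

# Keep only the flights that land at one of the busiest destinations
busy_rows = toy[toy['Dest'].isin(busiest)]
print(list(busiest), list(quietest), len(busy_rows))
```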

One might expect busier airports to be more likely to suffer flight delays, but the data suggests otherwise: less trafficked airports appear to suffer equal or longer delays when compared to high-volume airports. The main difference lies in consistency. The 10 most visited airports have similar median delays, and their arrival and departure delays are roughly equal. Low-traffic airports, on the other hand, vary widely from one another and even differ in their own arrival and departure delays.

From this information, we surmise that traffic is not a significant factor in increasing the amount of time by which a flight is delayed. As such we should look at other key variables such as the distance of each flight and the time of year the flight takes place.

### Flight Delays By Distance

The dataset contains the distance traveled for each flight. By plotting this against the length of the arrival delay on each flight, we hope to get a clearer picture of how these two factors interact. However, to reduce the number of outliers outside the bounds of our plot, we take the base-10 logarithm of the delay values and use that in our scatterplot. We also exclude flights with non-positive arrival delays, since their logarithm is undefined and would produce NaN values.

In [None]:
# Filter on ArrDelay directly; comparing clock times breaks for flights that arrive after midnight
scatter_sample = df[df['ArrDelay'] > 0].sample(n = 1000, random_state = 42)

fig, (ax1, ax2) = plt.subplots(1,2, figsize = (14,6))

ax1.scatter(scatter_sample['ArrDelayLog'],scatter_sample['Distance'], alpha = 0.5)


ax1.set_ylim(0, 3000)
ax1.set_title('Scatterplot of Arrival Delay By Distance')
ax1.set_xlabel('Log10 of ArrivalDelay')
ax1.set_ylabel('Distance')

ax2.scatter(scatter_sample['DepDelayLog'],scatter_sample['Distance'], alpha = 0.5)

ax2.set_ylim(0, 3000)
ax2.set_title('Scatterplot of Departure Delay By Distance')
ax2.set_xlabel('Log10 of Departure Delay')
ax2.set_ylabel('Distance')

plt.show()

To reduce overplotting, we only plot a random sample of 1,000 flights, all with positive arrival delays so that the log10 values are defined. We then plot the arrival and departure delays of these flights against their distance traveled.

The results show that delay times are evenly distributed regardless of the distance traveled, with outliers being the natural result of unforeseen incidents occurring occasionally. Unfortunately, this means that distance will lack utility as a predictor when we perform machine learning.
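
One way to put a number on this visual impression is a correlation coefficient. The sketch below uses a made-up four-row frame in which delay does not change linearly with distance, so the Pearson coefficient comes out at zero; on the real data the same call would be `df['Distance'].corr(df['ArrDelay'])`. Note that Pearson correlation only captures linear association, so it complements the scatterplot rather than replacing it.

```python
import pandas as pd

# Made-up values: delays are high at both short and long distances
toy = pd.DataFrame({
    'Distance': [100, 200, 300, 400],
    'ArrDelay': [30, 10, 10, 30],
})

# Series.corr computes the Pearson coefficient by default
r = toy['Distance'].corr(toy['ArrDelay'])
print(f'Pearson r = {r:.2f}')
```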

Next, we'll assess how much of an impact the time of year has on flight delays.

### Flight Delays According to Month

We've previously established that increased traffic doesn't necessarily correlate to longer flight delays. As such, we can't assume that certain months will have higher average delay times due to increased volumes of passengers. Instead, we'll plot delays against months of the year to assess the patterns that may occur here.

In [None]:
months = ['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC']

med_month_delay = df.groupby('Month')['ArrDelay'].median()
mean_month_delay = df.groupby('Month')['ArrDelay'].mean()

# applies months labels after grouping
med_month_delay.index = med_month_delay.index.map(lambda x:months[x-1])
mean_month_delay.index = mean_month_delay.index.map(lambda x:months[x-1])

fig, (ax1, ax2) = plt.subplots(1,2, figsize = (14,6))

med_month_delay.sort_values().plot.barh(ax = ax1, title = 'Median Arrival Delay By Month', xlabel = 'Arrival Delay in Minutes')
mean_month_delay.sort_values().plot.barh(ax = ax2, title = 'Mean Arrival Delay By Month', xlabel = 'Arrival Delay in Minutes')

ax1.set_xlim(0, 50)
ax2.set_xlim(0, 50)

plt.show()

When looking at the arrival delay by month, we notice a trend in the winter and summer months. These months often see much heavier flight traffic than normal due to people taking vacations or visiting family. As a result, arrival delays seem to increase in those months and stay relatively low for the rest of the year. 

Initially, this seems to go against what we've previously established. However, when we were looking at high-traffic airports vs low ones, we didn't consider that those airports are accustomed to operating at their respective levels. When flight traffic is increased across the board for, say, the holidays, most airports would end up operating beyond their usual capacity.

With this, we've taken a look at the majority of the variables we believe would be useful in predicting flight delays but we would like to give the rest of our continuous variables a quick look before moving on.

### Arrival Delay Correlation Heatmap

We can assess our remaining continuous variables using a heatmap to gauge their relationships to our target variable.

In [None]:
contin_vars = ['DepDelay','ArrDelay','TaxiIn', 'TaxiOut', 'AirTime', 'DepTime', 'ArrTime']
sns.heatmap(df[contin_vars].corr(), cmap = 'PRGn', vmin = -1, vmax = 1)
plt.title('Heatmap of Continuous Variables in the Dataset')
plt.show()

We hoped to find a strong relationship between some of these variables and the amount of arrival delay. Instead, we found a few variables whose effect on arrival delay is marginal at best. Taxi times both on arrival and departure have the most noticeable impact on this heatmap, while departure time provides a small impact on delays.
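
A heatmap can be hard to read precisely, so it may help to list the correlations with the target directly. The sketch below does this on a made-up frame; on the real data the same chain would start from `df[contin_vars].corr()`.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    'ArrDelay': [10, 20, 30, 40, 50],
    'DepDelay': [12, 18, 33, 41, 49],
    'TaxiOut': [15, 9, 14, 10, 13],
})

# Take the target's column of the correlation matrix, drop its
# self-correlation, and order the rest by absolute strength
corr_with_target = (
    toy.corr()['ArrDelay']
    .drop('ArrDelay')
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_target)
```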

### Taxi Times

While the effect of taxi times is small, we can take a quick look at them so that we have an idea of how they might affect our data.

In [None]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize = (14,6))

df.groupby('UniqueCarrier')['TaxiIn'].median().sort_values(ascending  = False).plot.barh(ax = ax2, title = 'Taxi In Times By Carrier', xlabel = 'Median Time in Minutes')
df.groupby('UniqueCarrier')['TaxiOut'].median().sort_values(ascending = False).plot.barh(ax = ax1, title = 'Taxi Out Times By Carrier', xlabel = 'Median Time in Minutes')
ax1.set_xlim(0,20)
ax2.set_xlim(0,20)
plt.show()


Initially, taxi times appeared to have little effect on delay outcomes. However, when we group these times by carrier, it becomes clear that together they account for a sizeable fraction of delays. Assuming that taxi delays are included in the arrival delay statistics, the difference in taxi-out time between carriers OH and AQ almost entirely accounts for the difference in their overall median delay.

### Machine Learning

### Delay prediction

Next, let's see if we can *predict* whether a flight is going to be delayed ahead of time using a KNN classifier. If we choose good enough features, we should be able to.

First, though, we have a problem - even though KNN is a relatively fast algorithm, our dataset is so large that prediction becomes prohibitively slow. So, let's randomly sample a more reasonable number of rows.

In [None]:
df_reduced = df.sample(n=100000, random_state=0) # TODO: sample count
print(f'Using {df_reduced.shape[0]:,} rows, reduced from {df.shape[0]:,}')

Next, let's pick out some variables to use. We know we want to find out if a flight is going to be delayed, so anything with a clear relationship to that is probably going to be useful.

In [None]:
encoder = LabelEncoder()

# Pick out features
X = df_reduced[['UniqueCarrier', 'Dest', 'Month', 'DayOfWeek']].copy()

# Encode categorical variables
X['Dest'] = encoder.fit_transform(X['Dest'])
X['UniqueCarrier'] = encoder.fit_transform(X['UniqueCarrier'])
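
One caveat worth keeping in mind: `LabelEncoder` maps each category to an integer, and KNN's distance metric then treats those integers as if they were ordered quantities. A common alternative, sketched below on made-up data, is one-hot encoding with `pd.get_dummies`. This is not what the notebook does, and for a column like `Dest` with many distinct values it would add one column per airport.

```python
import pandas as pd

# Made-up carrier codes and months
toy_X = pd.DataFrame({
    'UniqueCarrier': ['AA', 'BB', 'AA'],
    'Month': [1, 6, 12],
})

# Each carrier becomes its own indicator column; Month is left unchanged
toy_encoded = pd.get_dummies(toy_X, columns=['UniqueCarrier'])
print(toy_encoded)
```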

Finally, we will perform a train/test split. If we use a `test_size` of 0.1, we should still have plenty of data left for training.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, df_reduced['IsDelayed'], test_size=0.1, random_state=0)

Now that our data is ready, we can train our KNN classifier. This should be very fast.

In [None]:
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

Perfect, now let's see what happens when we make a prediction off our test data. How does our classifier perform?

In [None]:
y_pred = knn.predict(X_test)
print(f'Accuracy: {np.mean(y_pred == y_test):.2f}')
print(f'Baseline accuracy: {(np.mean(y_test == True)):.2f}')
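
The baseline printed above is the share of delayed flights, which equals the accuracy of always predicting "delayed". That is the right reference point as long as delayed flights are the majority class. A slightly more general sketch, on made-up labels, takes whichever class is larger:

```python
import numpy as np

# Made-up labels: three delayed flights, one on time
toy_y = np.array([True, True, True, False])

positive_rate = toy_y.mean()

# Accuracy of always predicting the more common class
majority_baseline = max(positive_rate, 1 - positive_rate)
print(f'Majority-class baseline: {majority_baseline:.2f}')
```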

In [None]:
# Output some useful metrics
print(f'Precision: {precision_score(y_test, y_pred):.2f}')
print(f'Recall: {recall_score(y_test, y_pred):.2f}')
print(f'F1 Score: {f1_score(y_test, y_pred, average="binary"):.2f}')

mat = confusion_matrix(y_test, y_pred)
sns.heatmap(mat, annot=True, fmt='d', cmap='Blues', xticklabels=['Delay Predicted False', 'Delay Predicted True'], yticklabels=['Actual False', 'Actual True'])
plt.title('IsDelayed Confusion Matrix')
plt.show()
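
For reference, the three metrics can be computed by hand from the counts in the confusion matrix. The sketch below uses made-up labels and plain NumPy; the variable names are chosen for this illustration only.

```python
import numpy as np

# Made-up ground truth and predictions
y_true = np.array([1, 1, 1, 0, 0, 1], dtype=bool)
y_hat = np.array([1, 0, 1, 1, 0, 1], dtype=bool)

tp = np.sum(y_true & y_hat)    # delayed and predicted delayed
fp = np.sum(~y_true & y_hat)   # not delayed but predicted delayed
fn = np.sum(y_true & ~y_hat)   # delayed but predicted not delayed

precision = tp / (tp + fp)     # of the predicted delays, how many were real
recall = tp / (tp + fn)        # of the real delays, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f'precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}')
```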

### Linear Regression

In [None]:
predictors = ['Month', 'DepDelay', 'DepTime']
target = 'ArrDelay'
X = df[predictors].values
y = df[target].values
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 42)

# LinearRegression is already imported in the setup cell
reger = LinearRegression()
reger.fit(X_train, y_train)

y_pred = reger.predict(X_test)

In [None]:
mean_target = y_train.mean()
mse_baseline = ((mean_target - y_test) ** 2).mean()
rmse_baseline = np.sqrt(mse_baseline)

print(f'Baseline RMSE: {rmse_baseline:.1f}')

In [None]:
test_rmse = np.sqrt(np.mean((y_pred - y_test) ** 2))
print(f'Test RMSE: {test_rmse:.1f}')

In [None]:
sns.scatterplot(x = y_test, y = y_pred)
plt.plot([0, 2500], [0, 2500], color = 'grey', linestyle = 'dashed')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Arrival Delay: Actual by Predicted')
plt.show()

### Linear Regression

Classifying flights based on whether or not they will face arrival delays turns out to be less practical than expected because the vast majority of flights see some amount of delay. It makes much more sense for us to predict the amount of time by which a flight will be delayed. Specifically, we will look at arrival delays and try to predict them using some of the predictors we laid out in our exploration.

In [None]:
predictors = ['Month', 'DepDelay', 'DepTime', 'TaxiOut']
target = 'ArrDelay'
X = df[predictors].values
y = df[target].values
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 42)

reger = LinearRegression()
reger.fit(X_train, y_train)

y_pred = reger.predict(X_test)

We start by setting our predictor variables to the variables we've previously determined to be related to arrival delays. However, we omit the TaxiIn time, since by the time a flight taxis in after landing, the amount of delay it has experienced would already be evident. After fitting our data, we move on to assessing the accuracy of our regressor.
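
`LinearRegression` fits an ordinary least squares model, and the fitted coefficients are available afterwards as `reger.coef_` and `reger.intercept_`. To illustrate what those numbers mean, the sketch below solves the same kind of least squares problem with plain NumPy on synthetic, noise-free data. The data and variable names are made up for this example.

```python
import numpy as np

# Synthetic, noise-free data generated from y = 5 + 2*x1 + 3*x2
X_toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 5.0]])
y_toy = 5 + 2 * X_toy[:, 0] + 3 * X_toy[:, 1]

# Prepend a column of ones so the first coefficient acts as the intercept
A = np.column_stack([np.ones(len(X_toy)), X_toy])

# Least squares solution; the first returned element holds the coefficients
coef, *_ = np.linalg.lstsq(A, y_toy, rcond=None)
print(coef)  # intercept first, then one slope per predictor
```

On the real model, a large coefficient on `DepDelay` relative to the others would suggest that the departure delay carries most of the predictive signal, though the predictors are on different scales and should be compared with care.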

In [None]:
mean_target = y_train.mean()
mse_baseline = ((mean_target - y_test) ** 2).mean()
rmse_baseline = np.sqrt(mse_baseline)

print(f'Baseline RMSE: {rmse_baseline:.1f}')

In [None]:
test_rmse = np.sqrt(np.mean((y_pred - y_test) ** 2))
print(f'Test RMSE: {test_rmse:.1f}')
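
To make the comparison between the two numbers explicit, the sketch below wraps the RMSE calculation in a small helper and reports the relative improvement over the mean-prediction baseline. The arrays and the `rmse` helper are made up for this illustration.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equally sized arrays."""
    return np.sqrt(np.mean((predicted - actual) ** 2))

# Made-up delays in minutes
y_train_toy = np.array([10.0, 20.0, 30.0])
y_test_toy = np.array([10.0, 20.0, 30.0])
y_pred_toy = np.array([12.0, 18.0, 33.0])

# Baseline: always predict the training mean
baseline = rmse(y_test_toy, np.full_like(y_test_toy, y_train_toy.mean()))
model = rmse(y_test_toy, y_pred_toy)

# Fraction of the baseline error that the model removes
improvement = 1 - model / baseline
print(f'baseline={baseline:.2f}, model={model:.2f}, improvement={improvement:.0%}')
```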

Now we notice that our root mean squared error has dropped significantly, which we take to mean that we have supplied our model with sufficiently effective predictors. As such, we'll move on to visualizing our actual data vs. our predictions.

In [None]:
sns.scatterplot(x = y_test, y = y_pred)
plt.plot([0, 2500], [0, 2500], color = 'grey', linestyle = 'dashed')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Arrival Delay: Actual by Predicted')
plt.show()

Our regressor predicts the amount of arrival delay well using our current predictors; points near the dashed line are predictions that closely match the actual delay. Given that we did not exclude outliers from our data, we expect a few predictions to fall outside the range of the actual data.