# Predicting Arrival Status of Flights at SLC Airport: Supervised ML

# Introduction
Late flight arrivals are a hassle, not only for travelers, but also for the people waiting for them at their destination. Fortunately, the [US Bureau of Transportation Statistics](https://www.bts.gov/) publishes flight arrival and departure statistics going back to the 1980s. I combined these with distance information from [OpenFlights](https://openflights.org/) and weather information from [NOAA](https://www.ncdc.noaa.gov/cdo-web/search) to create a set of inputs for predicting whether or not a flight will arrive on time.

# 1 Loading Data

## 1.1 Importing Packages
For this project we will use four packages to import and plot the data (pandas, NumPy, pyplot, and glob), and ten classes and functions from scikit-learn to build and evaluate the machine learning model.

In [1]:
#import packages: pandas, numpy, pyplot, glob
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import glob 

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_curve, roc_auc_score, confusion_matrix, plot_confusion_matrix
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

## 1.2 Importing Data
### 1.2.1 Arrival Data
Begin by generating a list of the files from the Bureau of Transportation Statistics (BTS). We will create an empty list of dataframes, then populate that list with each dataframe as we import it. We can then concatenate the seven dataframes together, and replace the column names with more usable names.

In [2]:
#use glob to generate list of arrival detail files
arr_stat_list = glob.glob("Detailed*.csv")
arr_stat_list

['Detailed_Statistics_Arrivals_AA.csv',
 'Detailed_Statistics_Arrivals_AS.csv',
 'Detailed_Statistics_Arrivals_B6.csv',
 'Detailed_Statistics_Arrivals_DL.csv',
 'Detailed_Statistics_Arrivals_F9.csv',
 'Detailed_Statistics_Arrivals_UA.csv',
 'Detailed_Statistics_Arrivals_WN.csv']

In [3]:
#read in CSVs into dummy list
cols = [0, 1, 4, 5, 7, 8, 9]
df_list = []

#loop over the files directly so the cell still works if the number of carriers changes
for file in arr_stat_list:
    df_list.append(pd.read_csv(file, 
                engine = 'python',
                usecols = cols, 
                dtype = {'Origin Airport':'category'},
                skiprows = 7, 
                parse_dates = [[1, 3]],
                skipfooter = 1
                ))

In [4]:
#concatenate list into single df, reset index, drop original index, rename cols, 
#and make the carrier and origin columns categoricals
arr_data = pd.concat(df_list)
arr_data.reset_index(inplace = True)
arr_data = arr_data.drop("index", axis = 1)
arr_data.columns = ['scheduled_arr', 'carrier', 'origin', 'scheduled_elapsed', 'actual_elapsed', 'arr_delay']
arr_data[['carrier', 'origin']] = arr_data[['carrier', 'origin']].astype('category')

We will also read in the distances between KSLC and every other airport that is serviced from there. We can then merge this dataframe with the arrivals dataframe, so each flight has a distance associated with it.

In [5]:
#read in airport distance info
distance = pd.read_csv('SLC_routes.csv')
arr_data_dist = arr_data.merge(distance, how = 'left', left_on = 'origin', right_on = 'faa_code')
arr_data_dist.drop('faa_code', axis = 1)

| | scheduled_arr | carrier | origin | scheduled_elapsed | actual_elapsed | arr_delay | distance |
|---|---|---|---|---|---|---|---|
| 0 | 1988-01-01 11:02:00 | AA | ORD | 197 | 211 | 22 | 1245 |
| 1 | 1988-01-01 11:17:00 | AA | DFW | 160 | 170 | 9 | 987 |
| 2 | 1988-01-01 12:31:00 | AA | DFW | 154 | 174 | 21 | 987 |
| 3 | 1988-01-01 15:00:00 | AA | JAC | 52 | 50 | -3 | 204 |
| 4 | 1988-01-01 07:50:00 | AA | IDA | 48 | 0 | 0 | 188 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2043276 | 2019-12-31 11:30:00 | WN | MDW | 210 | 191 | -9 | 1254 |
| 2043277 | 2019-12-31 15:45:00 | WN | PHX | 100 | 86 | -16 | 507 |
| 2043278 | 2019-12-31 10:00:00 | WN | LAX | 110 | 118 | -1 | 589 |
| 2043279 | 2019-12-31 22:20:00 | WN | SAN | 110 | 110 | 6 | 626 |
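As a rough illustration of the merge above, here is a minimal sketch on toy data. The two small frames and the `XXX` airport code are made up for the example; only the column names `origin`, `faa_code`, and `distance` come from the notebook. It shows that a left merge keeps every flight, that `drop` returns a new frame rather than changing the original, and how one might count flights that found no matching route.

```python
import pandas as pd

# Toy stand-ins for the arrivals and route tables (values are illustrative only)
flights = pd.DataFrame({"origin": ["ORD", "DFW", "XXX"], "arr_delay": [22, 9, -3]})
routes = pd.DataFrame({"faa_code": ["ORD", "DFW"], "distance": [1245, 987]})

# A left merge keeps every flight, even when no route row matches
merged = flights.merge(routes, how="left", left_on="origin", right_on="faa_code")

# drop() returns a new frame, so the result must be assigned to keep the change
merged = merged.drop("faa_code", axis=1)

# Flights whose origin is missing from the route table end up with NaN distance
unmatched = merged["distance"].isna().sum()
print(merged)
print("flights without a distance:", unmatched)
```

Checking the unmatched count after a left merge is a cheap way to spot airports that are missing from the distance file.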


Lastly, we will deconstruct the datetime associated with each flight into separate columns for the rounded arrival hour, date, day name, day of year, scheduled hour, and year.

In [6]:
#add cols: 'arr_DateHour', 'date', 'day_name', 'day_of_year', 'scheduled_hour', and 'year'
arr_data_dist['arr_DateHour'] = arr_data_dist['scheduled_arr'].dt.round('H')
arr_data_dist['date'] = arr_data_dist['scheduled_arr'].dt.date
arr_data_dist['day_name'] = arr_data_dist['scheduled_arr'].dt.day_name()
arr_data_dist['day_of_year'] = arr_data_dist['scheduled_arr'].dt.dayofyear
arr_data_dist['date'] = pd.to_datetime(arr_data_dist['date'])
arr_data_dist['scheduled_hour'] = arr_data_dist['scheduled_arr'].dt.hour
arr_data_dist['year'] = arr_data_dist['scheduled_arr'].dt.year
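One detail of the datetime handling worth seeing on a small example: `dt.round` snaps to the nearest hour, while `dt.hour` and `dt.floor` keep the hour the timestamp falls in. The sketch below uses two made-up timestamps and the lowercase `'h'` alias, which I am assuming is accepted by the installed pandas version (recent releases prefer it over `'H'`).

```python
import pandas as pd

# Two illustrative scheduled arrival times
ts = pd.Series(pd.to_datetime(["1988-01-01 11:02:00", "1988-01-01 11:45:00"]))

rounded = ts.dt.round("h")  # nearest hour: 11:45 becomes 12:00
floored = ts.dt.floor("h")  # start of the hour: 11:45 stays at 11:00
hours = ts.dt.hour          # integer hour of day: 11 for both

print(pd.DataFrame({"scheduled": ts, "rounded": rounded, "floored": floored, "hour": hours}))
```

With rounding, a flight scheduled at 11:45 is matched to the 12:00 precipitation reading while its `scheduled_hour` stays 11, so the two columns can disagree by one hour.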

### 1.2.2 Precipitation Data
The precipitation data is spotty, mainly because most days have no precipitation. First, we will resample the data so that every hour is represented. Then we will fill all NAs with 0s. The resulting dataframe is ready to be merged with the arrivals data.

In [7]:
#read in precip data
precip = pd.read_csv('kslc_precip_data.csv',
                    usecols = [2, 3])
precip["DATE"] = pd.to_datetime(precip["DATE"])

#create a dt index, resample the data to hourly, and replace NaN with 0
precip.set_index('DATE', inplace = True)
precip_hour = precip.resample('H').asfreq()
precip_hour.fillna(0, inplace = True)
precip_hour.reset_index(inplace = True)
precip_hour.columns = ['DateHour', 'HourPrecip']
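To make the resampling step concrete, here is a small sketch with two made-up readings three hours apart. The column name `HPCP` is a placeholder I chose for the example, since the notebook selects the precipitation columns by position rather than by name.

```python
import pandas as pd

# Two illustrative hourly readings with a gap between them
sparse = pd.DataFrame(
    {
        "DATE": pd.to_datetime(["2019-12-31 01:00", "2019-12-31 04:00"]),
        "HPCP": [0.02, 0.10],  # hypothetical column name
    }
).set_index("DATE")

# asfreq() inserts a row for every missing hour (as NaN); fillna(0) marks them as dry
hourly = sparse.resample("h").asfreq().fillna(0).reset_index()
hourly.columns = ["DateHour", "HourPrecip"]
print(hourly)
```

The two input rows become four hourly rows, with zeros filling the hours that had no record.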

### 1.2.3 Weather Data
The weather data is daily, so it will be joined to the flights on the calendar date rather than on the hour. We drop the fastest one-minute wind column, and for all missing average wind values, we will just use the overall mean avg_wind value. We can assume that a null value for snowfall or water on the ground is 0. Lastly, we will fill null T_avg values with the mean of T_max and T_min.

In [8]:
#read in weather data
weather = pd.read_csv('kslc_daily_weather_data.csv')
weather["date"] = pd.to_datetime(weather["date"])
weather.drop('wind_fastest_1min', axis = 1, inplace = True)

#fillna on the various missing values with reasonable substitutes
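#assign the filled columns back instead of using inplace on a column selection,
#which newer pandas versions warn about
weather["avg_wind"] = weather["avg_wind"].fillna(weather["avg_wind"].mean())
weather['water_equiv_on_grd'] = weather['water_equiv_on_grd'].fillna(0)
weather['snowfall'] = weather['snowfall'].fillna(0)
weather["tavg"] = weather["tavg"].fillna((weather.tmax + weather.tmin) / 2)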
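The temperature fill works because `fillna` accepts a Series and aligns it on the index, so each missing `tavg` receives the midpoint from its own row. A minimal sketch with made-up values, keeping the notebook's column names:

```python
import numpy as np
import pandas as pd

# Illustrative daily rows; the second row is missing tavg, the first is missing snowfall
weather_demo = pd.DataFrame(
    {
        "tavg": [17.0, np.nan],
        "tmax": [27, 40],
        "tmin": [7, 20],
        "snowfall": [np.nan, 1.2],
    }
)

# Row-wise midpoint of tmax and tmin, used only where tavg is missing
midpoint = (weather_demo["tmax"] + weather_demo["tmin"]) / 2
weather_demo["tavg"] = weather_demo["tavg"].fillna(midpoint)

# Missing snowfall is treated as no snowfall
weather_demo["snowfall"] = weather_demo["snowfall"].fillna(0)
print(weather_demo)
```

Existing `tavg` values are left untouched; only the gaps take the midpoint.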

## 1.3 Merge Data
Now that the precipitation data is hourly and the daily weather data is complete, we can merge each of them into the arrivals frame: precipitation on the hour the flight was scheduled to arrive, and weather on the date.

In [9]:
#merge arr_data_dist and precip_hour on 'arr_DateHour' and 'DateHour'
arr_precip_merge = arr_data_dist.merge(precip_hour, how = 'left', left_on = 'arr_DateHour', right_on = 'DateHour')

In [10]:
#merge arr_precip_merge and weather on 'date'
SLC_arrival_merge = arr_precip_merge.merge(weather, how = 'left', on = 'date')

In [12]:
SLC_arrival_merge.columns

Index(['scheduled_arr', 'carrier', 'origin', 'scheduled_elapsed',
       'actual_elapsed', 'arr_delay', 'faa_code', 'distance', 'arr_DateHour',
       'date', 'day_name', 'day_of_year', 'scheduled_hour', 'year', 'DateHour',
       'HourPrecip', 'avg_wind', 'precip', 'snowfall', 'tavg', 'tmax', 'tmin',
       'water_equiv_on_grd'],
      dtype='object')
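A left merge silently multiplies rows if the right-hand table has duplicate keys. One way to guard against that, sketched here on toy data, is the `validate` argument of `DataFrame.merge`; the frames and values below are illustrative, not taken from the real files.

```python
import pandas as pd

# Several flights can share a date, but each date should appear once in the weather table
flights_demo = pd.DataFrame(
    {
        "date": pd.to_datetime(["1988-01-01", "1988-01-01", "1988-01-02"]),
        "arr_delay": [22, 9, -3],
    }
)
daily_demo = pd.DataFrame(
    {"date": pd.to_datetime(["1988-01-01", "1988-01-02"]), "tavg": [17.0, 20.0]}
)

# validate="many_to_one" raises a MergeError if the weather table repeats a date
merged_demo = flights_demo.merge(daily_demo, how="left", on="date", validate="many_to_one")

# The merged frame should have exactly one row per flight
print(len(flights_demo), len(merged_demo))
```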

We will make a separate dataframe that contains only the columns that will feed the model: the numerical and categorical inputs, plus arr_delay, from which the target is derived.

In [13]:
#make a new df with only the info that will be used in the ML model, but keep categoricals for now
SLC_arr_ml_cat = SLC_arrival_merge[['carrier', 
                                    'scheduled_elapsed', 
                                    'distance', 
                                    'year',
                                    'day_name', 
                                    'day_of_year', 
                                    'scheduled_hour', 
                                    'HourPrecip', 
                                    'avg_wind',
                                    'precip', 
                                    'snowfall',
                                    'tavg',
                                    'tmax',
                                    'tmin', 
                                    'water_equiv_on_grd',
                                    'arr_delay']].copy()  #copy so later column assignments do not trigger SettingWithCopyWarning

In [None]:
#fillna the HourPrecip col with the mean for the column
SLC_arr_ml_cat['HourPrecip'] = SLC_arr_ml_cat['HourPrecip'].fillna(SLC_arr_ml_cat['HourPrecip'].mean())


In [15]:
SLC_arr_ml_cat.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2043281 entries, 0 to 2043280
Data columns (total 16 columns):
 #   Column              Dtype   
---  ------              -----   
 0   carrier             category
 1   scheduled_elapsed   int64   
 2   distance            int64   
 3   year                int64   
 4   day_name            object  
 5   day_of_year         int64   
 6   scheduled_hour      int64   
 7   HourPrecip          float64 
 8   avg_wind            float64 
 9   precip              float64 
 10  snowfall            float64 
 11  tavg                float64 
 12  tmax                int64   
 13  tmin                int64   
 14  water_equiv_on_grd  float64 
 15  arr_delay           int64   
dtypes: category(1), float64(6), int64(8), object(1)
memory usage: 251.4+ MB


In [None]:
SLC_arr_ml_cat['ontime'] = (SLC_arr_ml_cat['arr_delay'] <= 0)

In [18]:
SLC_arr_ml_cat.head()

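| | carrier | scheduled_elapsed | distance | year | day_name | day_of_year | scheduled_hour | HourPrecip | avg_wind | precip | snowfall | tavg | tmax | tmin | water_equiv_on_grd | arr_delay | ontime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AA | 197 | 1245 | 1988 | Friday | 1 | 11 | 0.0 | 9.17 | 0.0 | 0.0 | 17.0 | 27 | 7 | 0.5 | 22 | False |
| 1 | AA | 160 | 987 | 1988 | Friday | 1 | 11 | 0.0 | 9.17 | 0.0 | 0.0 | 17.0 | 27 | 7 | 0.5 | 9 | False |
| 2 | AA | 154 | 987 | 1988 | Friday | 1 | 12 | 0.0 | 9.17 | 0.0 | 0.0 | 17.0 | 27 | 7 | 0.5 | 21 | False |
| 3 | AA | 52 | 204 | 1988 | Friday | 1 | 15 | 0.0 | 9.17 | 0.0 | 0.0 | 17.0 | 27 | 7 | 0.5 | -3 | True |
| 4 | AA | 48 | 188 | 1988 | Friday | 1 | 7 | 0.0 | 9.17 | 0.0 | 0.0 | 17.0 | 27 | 7 | 0.5 | 0 | True |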
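The `ontime` label is a threshold on `arr_delay`: zero or negative delay counts as on time. The sketch below reuses the five delays shown in the preview and also computes the share of on-time flights, which is a useful baseline to have in mind before judging a classifier. The variable names are mine.

```python
import pandas as pd

# The five arr_delay values from the preview rows
delays = pd.Series([22, 9, 21, -3, 0], name="arr_delay")

# A flight is on time when it arrives at or before the scheduled time
ontime_demo = delays <= 0

# Share of on-time flights; a model that always predicts the majority class
# reaches an accuracy of max(share, 1 - share)
share_ontime = ontime_demo.mean()
print(ontime_demo.tolist(), share_ontime)
```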


Export this dataframe to CSV, both as a checkpoint and as a way to easily load it in other models.

In [19]:
SLC_arr_ml_cat.to_csv("KSLC_arrivals_tidy.csv")

# 2 Data Preparation
Reimport the data for further cleaning. Drop the old index, and separate out the numerical and categorical columns. Scale all the numerical columns so they have a mean of 0 and standard deviation of 1. Then remerge the numerical and categorical columns.

In [None]:
#read in KSLC arrivals data
df = pd.read_csv("KSLC_arrivals_tidy.csv")

In [None]:
#drop the old index
df.drop("Unnamed: 0", axis = 1, inplace = True)

In [None]:
#create separate dfs with the numeric and categorical variables and scale the numeric df
num_cols = ['scheduled_elapsed', 'distance', 'year','day_of_year', 
                 'scheduled_hour', 'HourPrecip',
                 'snowfall', 'water_equiv_on_grd']
cat_cols = ['carrier', 'day_name', 'ontime']

df_numeric = df[num_cols]
df_cat = df[cat_cols]

npa_num_scaled = scale(df_numeric)
df_num_scaled = pd.DataFrame(npa_num_scaled)
df_num_scaled.columns = num_cols
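As a quick check of what `scale` does, the sketch below standardizes two columns taken from the preview rows and confirms that each ends up with mean 0 and standard deviation 1. Passing `columns` and `index` when rebuilding the dataframe is my own addition; it keeps the labels aligned with the source frame.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import scale

# Two numeric columns using the values from the first five flights
df_numeric_demo = pd.DataFrame(
    {
        "scheduled_elapsed": [197, 160, 154, 52, 48],
        "distance": [1245, 987, 987, 204, 188],
    }
)

# scale() returns a NumPy array, so rebuild the dataframe with the original labels
df_scaled_demo = pd.DataFrame(
    scale(df_numeric_demo),
    columns=df_numeric_demo.columns,
    index=df_numeric_demo.index,
)

# scale() uses the population standard deviation, hence ddof=0 in the check
print(df_scaled_demo.mean().round(6).tolist())
print(df_scaled_demo.std(ddof=0).round(6).tolist())
```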

In [None]:
#concat the numeric and categorical dfs
df_scaled = pd.concat([df_num_scaled, df_cat], axis = 1)

Get dummy variables (one-hot encoded vectors) from the categoricals (airline and day of week). The Boolean arrival status column, ontime, is not a string or category column, so it passes through unchanged.

In [None]:
#create dummy variables for "carrier" and "day_name"
df_dummies = pd.get_dummies(df_scaled, drop_first = True)
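A small sketch of what `pd.get_dummies` does to a mixed frame, using made-up rows. With no `columns` argument it encodes only object, string, and category columns, and `drop_first = True` drops the first level of each encoded column to avoid a redundant indicator.

```python
import pandas as pd

# Illustrative frame: one numeric, one categorical, and one Boolean column
df_demo = pd.DataFrame(
    {
        "distance": [1.0, -0.5, 0.2],
        "carrier": ["AA", "DL", "WN"],
        "ontime": [False, True, True],
    }
)

dummies_demo = pd.get_dummies(df_demo, drop_first=True)

# 'carrier' becomes carrier_DL and carrier_WN; AA is the dropped baseline level.
# 'distance' and 'ontime' are left as they were.
print(dummies_demo.columns.tolist())
```

A flight with zeros in both `carrier_DL` and `carrier_WN` is therefore an AA flight.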

In [None]:
#randomly sample 500k flights from df_dummies
df_dummies_sample = df_dummies.sample(500000, random_state = 21)

Separate out the independent variables from the target value.

In [None]:
#separate out the features and target
X = df_dummies_sample.drop('ontime', axis = 1)
y = df_dummies_sample['ontime']

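Perform principal component analysis (PCA) to project the features onto 8 uncorrelated components. PCA does not pick 8 of the original variables; each component is a linear combination of all of them. For example, distance and expected flight time largely provide the same information to the ML model, so PCA can fold most of their shared variation into a single component.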

In [None]:
#perform PCA on the features only; fitting on df_dummies_sample would leak the target 'ontime' into the components
pca = PCA(n_components = 8)
pca.fit(X)
X_transformed = pca.transform(X)
print(X_transformed.shape)
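To illustrate how PCA treats redundant features, here is a sketch on synthetic data in which one column is nearly a copy of another, standing in for distance and scheduled flight time. The data and variable names are invented for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(21)

# Synthetic features: 'elapsed' is almost a copy of 'distance', 'hour' is independent
distance_demo = rng.normal(size=1000)
elapsed_demo = distance_demo + 0.05 * rng.normal(size=1000)
hour_demo = rng.normal(size=1000)
X_demo = np.column_stack([distance_demo, elapsed_demo, hour_demo])

# Two components are enough because the first two columns carry the same signal
pca_demo = PCA(n_components=2)
X_demo_t = pca_demo.fit_transform(X_demo)

# Fraction of the total variance captured by each retained component
print(X_demo_t.shape)
print(pca_demo.explained_variance_ratio_)
```

Inspecting `explained_variance_ratio_` on the real features is one way to judge whether 8 components retain enough of the variance.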

Instantiate a logistic regression model, and fit it using a train-test split of the dataset.

In [None]:
#instantiate a LogisticRegression classifier called logreg
logreg = LogisticRegression()

In [None]:
#create a train and test split
X_train, X_test, y_train, y_test = train_test_split(X_transformed, y, test_size = 0.2, random_state = 42)

In [None]:
#create a fit based on the training data
logreg.fit(X_train, y_train)
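An alternative arrangement, sketched here on synthetic data rather than the flight data, is to put the scaling, PCA, and classifier into a single pipeline with `make_pipeline`. The scaler and PCA are then fitted on the training rows only, so nothing about the test rows influences the transformation. `make_classification` and `StandardScaler` are my substitutions for the example, not part of the notebook's workflow.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the flight features and on-time labels
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=21)

# Split first, so the test rows stay unseen during fitting
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# Each step is fitted on the training data and then applied to the test data
pipe = make_pipeline(StandardScaler(), PCA(n_components=8), LogisticRegression())
pipe.fit(X_tr, y_tr)

accuracy = pipe.score(X_te, y_te)
print(f"held-out accuracy: {accuracy:.3f}")
```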

Run the test data through the logistic model, and compare the predicted outputs with the actual outputs using a confusion matrix.

In [None]:
y_pred = logreg.predict(X_test)

In [None]:
conf = confusion_matrix(y_test, y_pred)
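For reading the matrix: scikit-learn puts the true labels on the rows and the predicted labels on the columns, with labels in sorted order, so for Boolean targets `False` (late) comes first. A minimal sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Illustrative labels: True means on time, False means late
y_true_demo = [False, False, True, True, True]
y_pred_demo = [False, True, True, True, False]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

# For a binary problem, ravel() yields the four cells in this order
tn, fp, fn, tp = cm_demo.ravel()
print(cm_demo)
print("late predicted late:", tn, "| late predicted on time:", fp)
print("on time predicted late:", fn, "| on time predicted on time:", tp)
```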

Plot the confusion matrix, shading each cell by the fraction of flights in that true class that received each prediction (rows are normalized to sum to 1).

In [None]:
plot_confusion_matrix(logreg, X_test, y_test, display_labels = ['late', 'on time'], normalize = 'true', cmap = 'Blues')
plt.savefig('cm.png', dpi = 300)

Print the classification report, which shows the precision and recall for the model.

In [None]:
print(classification_report(y_test, y_pred))
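The precision and recall in the report can be traced back to the confusion matrix counts. The sketch below does this for the positive class (on time) with made-up labels and checks the hand calculation against scikit-learn's `precision_score` and `recall_score`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Illustrative labels: True means on time, False means late
y_true_pr = [False, False, True, True, True]
y_pred_pr = [False, True, True, True, False]

tn_pr, fp_pr, fn_pr, tp_pr = confusion_matrix(y_true_pr, y_pred_pr).ravel()

# Precision: of the flights predicted on time, how many really were
precision_manual = tp_pr / (tp_pr + fp_pr)
# Recall: of the flights that really were on time, how many were predicted so
recall_manual = tp_pr / (tp_pr + fn_pr)

print(precision_manual, precision_score(y_true_pr, y_pred_pr))
print(recall_manual, recall_score(y_true_pr, y_pred_pr))
```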

Plot the Receiver Operating Characteristic (ROC) curve, and include the area under the curve as an annotation on the plot.

In [None]:
#compute predicted probabilities for the positive class (on time)
y_pred_prob = logreg.predict_proba(X_test)[:,1]

#get roc-auc
roc_auc = roc_auc_score(y_test, y_pred_prob)

#create annotation for plot
annot = 'Area Under Curve: {:.3f}'.format(roc_auc)

# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.annotate(annot, (0.5, 0.25), c = "DarkRed")
plt.show()