# Challenge for Startup.ml

Erin Craig
4/2/2016

I want to predict arrival delays using day of month, day of week, departure airport, destination airport, and airline ID.

In [23]:
import pandas
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
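# Newer scikit-learn releases no longer ship sklearn.cross_validation or sklearn.lda
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA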
from sklearn import linear_model, datasets

## Introduction and data

The goal of this document is to use the US Dept. of Transportation on-time arrival data for non-stop domestic flights by major air carriers to predict arrival delays. We will build a binary classification model for predicting arrival delays. 

The data can be found [here](http://transtats.bts.gov/Tables.asp?DB_ID=120&DB_Name=Airline%20On-Time%20Performance%20Data&DB_Short_Name=On-Time). I chose to use data from January 2016 and January 2015.

In this notebook, we try three classification models: logistic regression, a random forest, and LDA. They all behave similarly, with test AUCs between roughly 0.64 and 0.66. Looking forward, I would like to download 12 months of data. My intuition is that the month of the year is meaningful because, combined with the departure airport, it is an indicator of weather.

## Read and clean

First, read the flights data (from January 2015 and 2016), and add a binary column indicating whether or not there was a delay. Then, take the categorical variables (airport ids, airline ids), and convert them into dummy variables.

In [9]:
flights2016 = pandas.read_csv("./data/834057395_T_ONTIME.csv")
flights2015 = pandas.read_csv("./data/834086792_T_ONTIME.csv")

flights2016.drop(
    ['UNIQUE_CARRIER', 'CARRIER', 'ORIGIN_AIRPORT_SEQ_ID', 'ORIGIN_CITY_MARKET_ID', 'DEST_AIRPORT_SEQ_ID',
     'DEST_CITY_MARKET_ID', 'ARR_DELAY_NEW', 'ARR_DEL15', 'ARR_DELAY_GROUP', 'FL_DATE', 'MONTH', 'Unnamed: 18'], 
axis = 1, inplace = True)

flights2015['YEAR'] = 2015

flights = pandas.concat([flights2016, flights2015])

# Create a binary column: delay or not?
flights["ArrDelayQ"] = flights["ARR_DELAY"] > 0

# Turn origin airports, airlines, and destination airports into dummy variables
airport_ids = pandas.get_dummies(flights['ORIGIN_AIRPORT_ID'])
airline_ids = pandas.get_dummies(flights['AIRLINE_ID'])
dest_airport_ids = pandas.get_dummies(flights['DEST_AIRPORT_ID'])
# Multiply destination ids by 10 so their column names
# do not clash with the origin airport ids:
dest_airport_ids.columns = dest_airport_ids.columns * 10

flights = pandas.concat([flights, airport_ids], axis=1)   
flights = pandas.concat([flights, airline_ids], axis=1)   
flights = pandas.concat([flights, dest_airport_ids], axis=1)   
flights.drop(['ORIGIN_AIRPORT_ID', 'AIRLINE_ID', 'DEST_AIRPORT_ID', 'ARR_DELAY', 'Unnamed: 5'], inplace=True, axis=1)
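
The cell above keeps origin and destination columns apart by multiplying the destination ids by 10. A possible alternative, shown here only as a sketch on a made-up toy table (the airport ids and the `ORIGIN`/`DEST` prefixes are illustrative assumptions, not part of the original analysis), is to let `get_dummies` label the columns through its `prefix` argument:

```python
import pandas as pd

# Toy stand-in for the flights table; these rows are made up for illustration.
toy = pd.DataFrame({
    "ORIGIN_AIRPORT_ID": [10135, 10136, 10135],
    "DEST_AIRPORT_ID": [10136, 10135, 10140],
})

# prefix= tags each dummy column with its role, so an airport that appears
# as both origin and destination yields two distinct column names.
origin_dummies = pd.get_dummies(toy["ORIGIN_AIRPORT_ID"], prefix="ORIGIN")
dest_dummies = pd.get_dummies(toy["DEST_AIRPORT_ID"], prefix="DEST")

encoded = pd.concat([origin_dummies, dest_dummies], axis=1)
print(list(encoded.columns))
```

With string column names like these, the role of each dummy column stays readable when inspecting the data frame later.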

In [11]:
flights.head()

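| | DAY_OF_MONTH | DAY_OF_WEEK | YEAR | ArrDelayQ | 10135 | 10136 | 10140 | 10141 | 10146 | 10155 | ... | 154110 | 154120 | 154970 | 155820 | 156070 | 156240 | 158410 | 159190 | 159910 | 162180 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 3 | 2016 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 7 | 4 | 2016 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 8 | 5 | 2016 | True | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 9 | 6 | 2016 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 10 | 7 | 2016 | True | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |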


## Classification: logistic regression

We use scikit-learn's feature selection, followed by logistic regression. However, note that some feature selection has already been done by hand: I began by selecting

* day of month
* day of week
* airline
* destination airport
* departure airport

scikit-learn's feature selection occurs after my pre-selection.

In [24]:
training, test = train_test_split(flights, test_size = 0.5)

X = training[[i for i in training.columns if i != 'ArrDelayQ']].values
y = training['ArrDelayQ'].values
Xtest = test[[i for i in test.columns if i != 'ArrDelayQ']].values
ytest = test['ArrDelayQ'].values

clf = Pipeline([
  ('feature_selection', SelectFromModel(LinearSVC())),
  ('classification', linear_model.LogisticRegression(C=1e5))
])
clf.fit(X, y)

Pipeline(steps=[('feature_selection', SelectFromModel(estimator=LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0),
        prefit=False, thresho...ty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False))])
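
To check how many columns the selection step actually keeps, one option is to read the boolean mask returned by `get_support()` on the fitted selector. The sketch below runs on synthetic data from `make_classification` rather than on the flights table, so the counts it prints say nothing about the real data. The names `X_demo`, `y_demo` and `demo_clf` are placeholders introduced here.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Synthetic stand-in for the flights matrix: 20 columns, 5 of them informative.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0)

demo_clf = Pipeline([
    ('feature_selection', SelectFromModel(LinearSVC(random_state=0))),
    ('classification', LogisticRegression()),
])
demo_clf.fit(X_demo, y_demo)

# get_support() returns a boolean mask over the input columns:
# True where the selector kept the column, False where it dropped it.
mask = demo_clf.named_steps['feature_selection'].get_support()
print("kept %d of %d features" % (mask.sum(), mask.size))
```

On the real pipeline the same call would be `clf.named_steps['feature_selection'].get_support()`.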

In [25]:
# Column 1 of predict_proba holds the predicted probability of a delay
probs = clf.predict_proba(Xtest)
fpr, tpr, _ = roc_curve(ytest, probs[:,1])
roc_auc = auc(fpr, tpr)
print(roc_auc)

0.638203552593
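
The AUC above is computed in two steps, `roc_curve` followed by `auc`. scikit-learn also offers `roc_auc_score`, which should give the same number in a single call. The sketch below compares the two on synthetic labels and scores (generated here only for illustration, not taken from the flights data):

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Synthetic labels and scores, only to illustrate the two equivalent calls.
rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, size=200)
# Scores loosely correlated with the labels.
scores = 0.3 * y_true + rng.rand(200)

# Two-step route, as in the cell above.
fpr, tpr, _ = roc_curve(y_true, scores)
auc_from_curve = auc(fpr, tpr)

# One-step route.
auc_direct = roc_auc_score(y_true, scores)

print(auc_from_curve, auc_direct)
```

The two-step route is still useful when the ROC curve itself is going to be plotted.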


## Classification: random forest with feature selection 

We use scikit-learn's feature selection, followed by its random forest classifier.

In [12]:
training, test = train_test_split(flights, test_size = 0.5)

X = training[[i for i in training.columns if i != 'ArrDelayQ']].values
y = training['ArrDelayQ'].values
Xtest = test[[i for i in test.columns if i != 'ArrDelayQ']].values
ytest = test['ArrDelayQ'].values

clf = Pipeline([
  ('feature_selection', SelectFromModel(LinearSVC())),
  ('classification', RandomForestClassifier(n_estimators=50, criterion = "entropy"))
])
clf.fit(X, y)

Pipeline(steps=[('feature_selection', SelectFromModel(estimator=LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0),
        prefit=False, thresho...n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

In [14]:
# Column 1 of predict_proba holds the predicted probability of a delay
probs = clf.predict_proba(Xtest)
fpr, tpr, _ = roc_curve(ytest, probs[:,1])
roc_auc = auc(fpr, tpr)
print(roc_auc)

0.655441680777


## Classification: LDA

I am curious how my random forest classifier compares to linear discriminant analysis. So let's find out!

We use scikit-learn's feature selection, followed by its linear discriminant analysis classifier.

In [20]:
training, test = train_test_split(flights, test_size = 0.5)

X = training[[i for i in training.columns if i != 'ArrDelayQ']].values
y = training['ArrDelayQ'].values
Xtest = test[[i for i in test.columns if i != 'ArrDelayQ']].values
ytest = test['ArrDelayQ'].values

clf = Pipeline([
  ('feature_selection', SelectFromModel(LinearSVC())),
  ('classification', LDA(tol = .000001))
])
clf.fit(X, y)

Pipeline(steps=[('feature_selection', SelectFromModel(estimator=LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0),
        prefit=False, threshold=None)), ('classification', LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,
              solver='svd', store_covariance=False, tol=1e-06))])

In [22]:
# Column 1 of predict_proba holds the predicted probability of a delay
probs = clf.predict_proba(Xtest)
fpr, tpr, _ = roc_curve(ytest, probs[:,1])
roc_auc = auc(fpr, tpr)
print(roc_auc)

0.6370832771


## Conclusions and next steps

Because these models had such similar results, I suspect that we need a different data set to see better results. But I also think this is a good start! 
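
One caveat on the comparison: each section above draws its own random train/test split, so part of the difference between the AUCs may come from the split rather than from the model. A minimal sketch of a like-for-like comparison follows. It assumes a scikit-learn version that provides `sklearn.model_selection`, and it uses synthetic data in place of the flights table. It fixes `random_state` and scores all three classifiers on the same split (feature selection is left out to keep it short):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the flights data.
X_all, y_all = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0)

# One split, with a fixed seed, shared by every model.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.5, random_state=0)

models = {
    'logistic regression': LogisticRegression(),
    'random forest': RandomForestClassifier(n_estimators=50, random_state=0),
    'LDA': LinearDiscriminantAnalysis(),
}

aucs = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # Score with the predicted probability of the positive class.
    aucs[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(name, round(aucs[name], 3))
```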

Regarding the model choice, I would be curious to try again with SVM. I do not have a strong intuition that it would dramatically outperform the models we have so far, but it would be fun to try.

Regarding the data, I wonder whether month of the year might influence arrival delays. I suspect that it would: the combination of location and month gives a nod toward weather (Cleveland, OH in January probably has more delays than it does in July). This would just involve downloading 12 months of data and concatenating (as above for 2015 and 2016).
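
A rough sketch of that concatenation step is below. The helper name `load_months` is hypothetical, and the two in-memory CSVs stand in for real monthly downloads (real use would pass file paths instead). Note that the `MONTH` column, which is dropped in the cleaning cell above, would have to be kept for this idea to work.

```python
import io

import pandas as pd


def load_months(sources):
    """Read one CSV per month and stack them into a single DataFrame."""
    frames = [pd.read_csv(src) for src in sources]
    # ignore_index=True gives the combined table a fresh 0..n-1 index.
    return pd.concat(frames, ignore_index=True)


# In-memory stand-ins for two monthly files; the values are made up.
jan = io.StringIO("MONTH,DAY_OF_MONTH,ARR_DELAY\n1,6,-3.0\n1,7,12.0\n")
jul = io.StringIO("MONTH,DAY_OF_MONTH,ARR_DELAY\n7,6,0.0\n")

all_months = load_months([jan, jul])
print(all_months.shape)
```

`MONTH` could then be turned into dummy variables in the same way as the airport and airline ids.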

It would be fun to know whether individual pilot/copilot pairs influence timeliness of flights. I suspect that they would influence delays very little, but it would be fun to know. Given the data, I could use lasso or ridge regression, or scikit-learn's model selection.