In [1]:
import pandas as pd
import numpy as np
import scipy
import random
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn import ensemble
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.utils import resample
import statsmodels.formula.api as smf
from sklearn import neighbors
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor
%matplotlib inline

# Predicting Flight Lateness

I will build a model that predicts how late a flight will arrive. I have data on flights across the USA from 2007 and 2008. I will train the model on the 2007 data to predict how many minutes late a flight will be, and then test it on the 2008 data.

A plane is only considered late if it arrives more than 30 minutes after its scheduled arrival time.

## Importing Data

First, I must import the data. Only the first 10,000 rows of each file are read.

In [2]:
flights07 = pd.read_csv("flights07.csv", nrows=10000)
flights08 = pd.read_csv("flights08.csv", nrows=10000)

In [3]:
print(flights07.shape)
display(flights07.head())

(10000, 29)


Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
0,2007,1,1,1,1232.0,1225,1341.0,1340,WN,2891,...,4,11,0,,0,0,0,0,0,0
1,2007,1,1,1,1918.0,1905,2043.0,2035,WN,462,...,5,6,0,,0,0,0,0,0,0
2,2007,1,1,1,2206.0,2130,2334.0,2300,WN,1229,...,6,9,0,,0,3,0,0,0,31
3,2007,1,1,1,1230.0,1200,1356.0,1330,WN,1355,...,3,8,0,,0,23,0,0,0,3
4,2007,1,1,1,831.0,830,957.0,1000,WN,2278,...,3,9,0,,0,0,0,0,0,0


In [4]:
# delete column of null values
del flights07['CancellationCode']

In [5]:
# drop nulls
flights07 = flights07.dropna()
flights07 = flights07.reset_index(drop=True)

In [6]:
display(flights07.head())

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,Distance,TaxiIn,TaxiOut,Cancelled,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
0,2007,1,1,1,1232.0,1225,1341.0,1340,WN,2891,...,389,4,11,0,0,0,0,0,0,0
1,2007,1,1,1,1918.0,1905,2043.0,2035,WN,462,...,479,5,6,0,0,0,0,0,0,0
2,2007,1,1,1,2206.0,2130,2334.0,2300,WN,1229,...,479,6,9,0,0,3,0,0,0,31
3,2007,1,1,1,1230.0,1200,1356.0,1330,WN,1355,...,479,3,8,0,0,23,0,0,0,3
4,2007,1,1,1,831.0,830,957.0,1000,WN,2278,...,479,3,9,0,0,0,0,0,0,0


In [7]:
# delete column of null values
del flights08['CancellationCode']

In [8]:
# drop nulls
flights08 = flights08.dropna()
flights08 = flights08.reset_index(drop=True)

In [9]:
print(flights08.shape)
display(flights08.head())

(3675, 28)


Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,Distance,TaxiIn,TaxiOut,Cancelled,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
0,2008,1,3,4,1829.0,1755,1959.0,1925,WN,3920,...,515,3.0,10.0,0,0,2.0,0.0,0.0,0.0,32.0
1,2008,1,3,4,1937.0,1830,2037.0,1940,WN,509,...,1591,3.0,7.0,0,0,10.0,0.0,0.0,0.0,47.0
2,2008,1,3,4,1644.0,1510,1845.0,1725,WN,1333,...,828,6.0,8.0,0,0,8.0,0.0,0.0,0.0,72.0
3,2008,1,3,4,1452.0,1425,1640.0,1625,WN,675,...,1489,7.0,8.0,0,0,3.0,0.0,0.0,0.0,12.0
4,2008,1,3,4,1323.0,1255,1526.0,1510,WN,4,...,838,4.0,9.0,0,0,0.0,0.0,0.0,0.0,16.0


## Creating the Outcome 

Now that the data is imported, I need to prepare my target variable. As I said before, a plane is only considered late if it arrives more than 30 minutes after its scheduled arrival.  Fortunately, there is a column that tells how many minutes late (or early) a flight was compared to its scheduled arrival time.

In [10]:
# initialize a feature dataframe that is the same as the 2007 flights dataframe minus a few columns
features = flights07.drop(['Year','DepTime','CRSDepTime','ArrTime','CRSArrTime','FlightNum','TailNum'],axis=1)
# create a binary feature telling if the plane was late
features['late'] = np.where(flights07['ArrDelay']>30,1,0)
# create a continuous feature telling how many minutes past the 30-minute threshold the plane arrived
features['arrlatetime'] = np.where(features['late']==1,features['ArrDelay']-30,0)

display(features.head())

Unnamed: 0,Month,DayofMonth,DayOfWeek,UniqueCarrier,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,...,TaxiOut,Cancelled,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay,late,arrlatetime
0,1,1,1,WN,69.0,75,54.0,1.0,7.0,SMF,...,11,0,0,0,0,0,0,0,0,0.0
1,1,1,1,WN,85.0,90,74.0,8.0,13.0,SMF,...,6,0,0,0,0,0,0,0,0,0.0
2,1,1,1,WN,88.0,90,73.0,34.0,36.0,SMF,...,9,0,0,3,0,0,0,31,1,4.0
3,1,1,1,WN,86.0,90,75.0,26.0,30.0,SMF,...,8,0,0,23,0,0,0,3,0,0.0
4,1,1,1,WN,86.0,90,74.0,-3.0,1.0,SMF,...,9,0,0,0,0,0,0,0,0,0.0
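As a small worked example of the outcome defined above, the sketch below applies the same 30-minute rule to a hypothetical three-flight sample (the `sample` frame is made up for illustration; its `ArrDelay` values mirror rows shown in the output). It uses `clip` instead of `np.where`, which should give the same result under the assumption that `ArrDelay` has no missing values.

```python
import pandas as pd

# Hypothetical three-flight sample; the ArrDelay values mirror rows shown above
sample = pd.DataFrame({"ArrDelay": [-3.0, 26.0, 34.0]})

# Binary flag: 1 when the flight arrived more than 30 minutes behind schedule
sample["late"] = (sample["ArrDelay"] > 30).astype(int)

# Minutes beyond the 30-minute threshold; zero for flights that are not late
sample["arrlatetime"] = (sample["ArrDelay"] - 30).clip(lower=0)

print(sample)
```

A flight that arrives 34 minutes behind schedule is therefore recorded as 4 minutes late, while early flights and flights less than 30 minutes behind schedule are recorded as 0.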


In [11]:
# initialize a feature dataframe that is the same as the 2008 flights dataframe minus a few columns
features08 = flights08.drop(['Year','DepTime','CRSDepTime','ArrTime','CRSArrTime','FlightNum','TailNum'],axis=1)
# create a binary feature telling if the plane was late
features08['late'] = np.where(flights08['ArrDelay']>30,1,0)
# create a continuous feature telling how many minutes past the 30-minute threshold the plane arrived
features08['arrlatetime'] = np.where(features08['late']==1,features08['ArrDelay']-30,0)

display(features.head())

Unnamed: 0,Month,DayofMonth,DayOfWeek,UniqueCarrier,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,...,TaxiOut,Cancelled,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay,late,arrlatetime
0,1,1,1,WN,69.0,75,54.0,1.0,7.0,SMF,...,11,0,0,0,0,0,0,0,0,0.0
1,1,1,1,WN,85.0,90,74.0,8.0,13.0,SMF,...,6,0,0,0,0,0,0,0,0,0.0
2,1,1,1,WN,88.0,90,73.0,34.0,36.0,SMF,...,9,0,0,3,0,0,0,31,1,4.0
3,1,1,1,WN,86.0,90,75.0,26.0,30.0,SMF,...,8,0,0,23,0,0,0,3,0,0.0
4,1,1,1,WN,86.0,90,74.0,-3.0,1.0,SMF,...,9,0,0,0,0,0,0,0,0,0.0


# Creating Models

Now that I have my outcome made and my feature dataframe set up, I can begin to model. This problem calls for regression modeling, and I will create models based on different regression techniques.

In [34]:
# initiate model variables 
Xtrain = features.drop(['arrlatetime','ArrDelay','LateAircraftDelay'],axis=1)
Xtrain = pd.get_dummies(Xtrain)
Ytrain = features['arrlatetime']

In [35]:
Xtest = features08.drop(['arrlatetime','ArrDelay','LateAircraftDelay'],axis=1)
Xtest = pd.get_dummies(Xtest)
Ytest = features08['arrlatetime']

In [36]:
print('Training Set Size:')
print(Xtrain.shape)
print('\nTesting Set Size:')
print(Xtest.shape)

Training Set Size:
(9886, 144)

Testing Set Size:
(3675, 146)


Unfortunately, the difference in the number of feature columns between the training and testing sets will not work for any of the models I will make. Before I can fit anything, I need to remove the 2 extra feature columns from the testing set.

In [37]:
train_col = Xtrain.columns
test_col = Xtest.columns
# define a helper function that returns the items in the first list that are not in the second
def Diff(li1, li2):
    return (list(set(li1) - set(li2)))

diff = Diff(test_col,train_col)
print('Columns in Test Dummy Set not in Train Set:')
print(diff)

Columns in Test Dummy Set not in Train Set:
['Origin_SFO', 'Dest_SFO']


In [38]:
Xtest = Xtest.drop(diff, axis=1)
print('Training Set Size:')
print(Xtrain.shape)
print('\nTesting Set Size:')
print(Xtest.shape)

Training Set Size:
(9886, 144)

Testing Set Size:
(3675, 144)
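Dropping the extra columns works here because the only mismatch is two test-only columns. A more general alternative is to reindex the test dummies to the training columns, which also handles categories that appear only in the training set. The sketch below uses small made-up frames (`train_raw`, `test_raw`) rather than the flight data, and is one possible approach, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical miniature train/test frames with different Origin categories
train_raw = pd.DataFrame({"Origin": ["SMF", "LAX", "SMF"], "Distance": [389, 479, 479]})
test_raw = pd.DataFrame({"Origin": ["SFO", "LAX"], "Distance": [515, 828]})

X_train = pd.get_dummies(train_raw)
X_test = pd.get_dummies(test_raw)

# Reindex to the training columns: test-only columns (Origin_SFO) are dropped,
# and training columns absent from the test set (Origin_SMF) are added as zeros
X_test_aligned = X_test.reindex(columns=X_train.columns, fill_value=0)

print(list(X_train.columns))
print(list(X_test_aligned.columns))
```

After reindexing, both frames have the same columns in the same order, which is what the scikit-learn estimators used below expect.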


In [39]:
# initialize a dataframe to hold all the model scores
scores = pd.DataFrame()

Now that both data sets have the same number of columns, I can begin to model.

## Random Forest Regression

Random Forest Regression builds many decision trees, each trained on a slightly different bootstrap sample of the same data set, and averages their predictions to reduce overfitting. Random forests are flexible and powerful, but they are computationally intensive, and they can still overfit on some data sets.
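To make the averaging idea concrete, here is a minimal sketch on synthetic data (the arrays `X` and `y` are made up for illustration). It compares the mean of the individual trees' predictions with the forest's own prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] * 3 + rng.normal(scale=0.5, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# Each fitted tree was trained on its own bootstrap sample of the rows
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])

# The forest's prediction is the mean of the individual tree predictions
manual_average = per_tree.mean(axis=0)
print(np.allclose(manual_average, forest.predict(X)))
```

Averaging many trees that each saw different data is what smooths out the errors of any single tree.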

In [40]:
# call RF regressor
rfc = ensemble.RandomForestRegressor()
# fit model
rfc.fit(Xtrain,Ytrain)
# set prediction
Y_predrfc = rfc.predict(Xtest)

In [20]:
# cross-validated R-squared on each data set (scores are fractions, not percentages)
cvscores = cross_val_score(rfc, Xtrain, Ytrain)
cvscoret = cross_val_score(rfc, Xtest, Ytest)
print("Random Forest Training Set:")
print("\nTraining R-squared:")
print(rfc.score(Xtrain,Ytrain))
print('\nCross Validation Score:')
print('{} +/- {}'.format(round(cvscores.mean(),2),round(cvscores.std()*2,2)))
print("\nRandom Forest Testing Set:")
print('\nTesting R-squared:')
print(rfc.score(Xtest,Ytest))
print('\nCross Validation Score:')
print('{} +/- {}'.format(round(cvscoret.mean(),2),round(cvscoret.std()*2,2)))

scores['Random Forest'] = [rfc.score(Xtrain,Ytrain),rfc.score(Xtest,Ytest)]

Random Forest Training Set:

Training R-squared:
0.9946151084435568

Cross Validation Score:
0.96 +/- 0.01

Random Forest Testing Set:

Testing R-squared:
0.9034444517644261

Cross Validation Score:
0.96 +/- 0.04


The Random Forest models the data fairly well. The training R-squared is excellent (0.99), while the testing R-squared (0.90) is noticeably lower, which suggests some overfitting and is unfortunately a little too low for real-world use.

## Boosted Decision Trees

The next model I will use is Boosted Decision Trees, here built with AdaBoost. Boosting fits decision trees one after another: once a tree is fit, the training samples it predicted poorly are given more weight, so the next tree concentrates on those errors, until a stopping point is reached. Like random forests, the fitted ensemble can report feature importances.
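The sketch below illustrates the iterative nature of boosting on synthetic data (the arrays `X` and `y` and the settings `n_estimators=50` and `random_state=0` are assumptions for illustration, not values from this notebook). It tracks the ensemble's R-squared after each boosting round with `staged_predict` and prints the aggregated feature importances.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

booster = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
                            n_estimators=50, random_state=0)
booster.fit(X, y)

# staged_predict yields the ensemble prediction after each boosting round
staged_r2 = [r2_score(y, pred) for pred in booster.staged_predict(X)]
print("R-squared after round 1:", round(staged_r2[0], 3))
print("R-squared after final round:", round(staged_r2[-1], 3))

# Importances are aggregated over the fitted trees, as with a random forest
print(booster.feature_importances_)
```

The last staged score matches what `booster.score(X, y)` reports, since the final stage uses every fitted tree.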

In [21]:
# call Boosted Tree regressor
bdt = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4))
# fit model
bdt.fit(Xtrain,Ytrain)
# set prediction
Y_predbdt = bdt.predict(Xtest)

In [22]:
# cross-validated R-squared on each data set (scores are fractions, not percentages)
cvscores = cross_val_score(bdt, Xtrain, Ytrain)
cvscoret = cross_val_score(bdt, Xtest, Ytest)
print("Boosted Decision Trees Training Set:")
print("\nTraining R-squared:")
print(bdt.score(Xtrain,Ytrain))
print('\nCross Validation Score:')
print('{} +/- {}'.format(round(cvscores.mean(),2),round(cvscores.std()*2,2)))
print("\nBoosted Decision Trees Testing Set:")
print('\nTesting R-squared:')
print(bdt.score(Xtest,Ytest))
print('\nCross Validation Score:')
print('{} +/- {}'.format(round(cvscoret.mean(),2),round(cvscoret.std()*2,2)))

scores['Boosted Decision Trees'] = [bdt.score(Xtrain,Ytrain),bdt.score(Xtest,Ytest)]

Boosted Decision Trees Training Set:

Training R-squared:
0.9567823482488664

Cross Validation Score:
0.9 +/- 0.05

Boosted Decision Trees Testing Set:

Testing R-squared:
0.9046641812688981

Cross Validation Score:
0.92 +/- 0.02


The boosted trees model the data about as well as the Random Forest. Their training R-squared is lower (0.96 versus 0.99), but their testing R-squared is essentially the same (0.90), so they show slightly fewer signs of overfitting.

## KNN Regression

This next model treats each observation as a point in n-dimensional space, where n is the number of features. To predict a new value, it finds the nearest training points in that space, where the number of neighbors used is set by the value 'k', and averages their outcomes. Because I use distance weighting, closer neighbors count more toward the predicted value.
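As a rough illustration of that procedure, the sketch below computes a distance-weighted prediction by hand for one new point and compares it with scikit-learn's result. The data is synthetic, and the names `X_train`, `y_train`, and `x_new` are made up for the example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 3))
y_train = rng.normal(size=50)
x_new = rng.normal(size=(1, 3))
k = 10

# Euclidean distance from the new point to every training point
dists = np.linalg.norm(X_train - x_new, axis=1)
nearest = np.argsort(dists)[:k]

# Inverse-distance weights: closer neighbors count for more
weights = 1.0 / dists[nearest]
manual_pred = np.sum(weights * y_train[nearest]) / np.sum(weights)

knn_demo = KNeighborsRegressor(n_neighbors=k, weights="distance").fit(X_train, y_train)
print(manual_pred, knn_demo.predict(x_new)[0])
```

Because the prediction depends on raw distances, features measured on large scales dominate the result unless the data is rescaled first.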

In [23]:
# call KNN regressor
knn = neighbors.KNeighborsRegressor(10, weights='distance')
# fit model
knn.fit(Xtrain,Ytrain)
# set prediction
Y_predknn = knn.predict(Xtest)

In [24]:
# cross-validated R-squared on each data set (scores are fractions, not percentages)
cvscores = cross_val_score(knn, Xtrain, Ytrain)
cvscoret = cross_val_score(knn, Xtest, Ytest)
print("KNN Regression Training Set:")
print("\nTraining R-squared:")
print(knn.score(Xtrain,Ytrain))
print('\nCross Validation Score:')
print('{} +/- {}'.format(round(cvscores.mean(),2),round(cvscores.std()*2,2)))
print("\nKNN Regression Testing Set:")
print('\nTesting R-squared:')
print(knn.score(Xtest,Ytest))
print('\nCross Validation Score:')
print('{} +/- {}'.format(round(cvscoret.mean(),2),round(cvscoret.std()*2,2)))

scores['KNN Regression'] = [knn.score(Xtrain,Ytrain),knn.score(Xtest,Ytest)]

KNN Regression Training Set:

Training R-squared:
1.0

Cross Validation Score:
0.73 +/- 0.08

KNN Regression Testing Set:

Testing R-squared:
0.6225512789328291

Cross Validation Score:
0.84 +/- 0.04


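While the KNN Regression perfectly reproduces the original data set, that is largely an artifact of distance weighting: each training point is its own nearest neighbor at distance zero, so it is predicted exactly. The model is clearly overfit, as seen in its abysmal testing R-squared value.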
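A perfect training score from a distance-weighted KNN model says little about how well it generalizes. The sketch below fits KNN to pure noise (synthetic `X` and `y`, so there is nothing real to learn) and compares the training R-squared under uniform and distance weighting.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data: the target is pure noise, unrelated to the features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = rng.normal(size=200)

train_scores = {}
for weighting in ("uniform", "distance"):
    model = KNeighborsRegressor(n_neighbors=10, weights=weighting).fit(X, y)
    # Score on the same rows the model was fit to
    train_scores[weighting] = model.score(X, y)
    print(weighting, round(train_scores[weighting], 3))
```

With distance weighting, every training row is matched to itself at distance zero, so the training score is perfect even though the target is random.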

## OLS Regression

Ordinary Least Squares (OLS) Regression fits a linear function of the features by minimizing the sum of squared differences between the predicted and observed outcomes. It can be quite powerful and accurate, but at the tradeoff of needing heavy data preparation to reach maximum performance.
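The following sketch shows the least-squares idea on synthetic data (the coefficients 2.0, 1.5, and -0.7 are made up for the example). It solves the least-squares problem directly with NumPy and compares the result with `LinearRegression`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known linear relationship, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 2.0 + 1.5 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Prepend a column of ones so the intercept is estimated with the slopes
design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution: the coefficients minimizing the sum of squared residuals
beta, *_ = np.linalg.lstsq(design, y, rcond=None)

ols = LinearRegression().fit(X, y)
print(beta)
print(ols.intercept_, ols.coef_)
```

Both approaches recover essentially the same intercept and slopes, since they solve the same minimization problem.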

In [25]:
#call OLSR function
regr = linear_model.LinearRegression()
#fit model
regr.fit(Xtrain,Ytrain)
#set prediction
Y_predregr = regr.predict(Xtest)

In [26]:
# cross-validated R-squared on each data set (scores are fractions, not percentages)
cvscores = cross_val_score(regr, Xtrain, Ytrain)
cvscoret = cross_val_score(regr, Xtest, Ytest)
print("OLS Regression Training Set:")
print("\nTraining R-squared:")
print(regr.score(Xtrain,Ytrain))
print('\nCross Validation Score:')
print('{} +/- {}'.format(round(cvscores.mean(),2),round(cvscores.std()*2,2)))
print("\nOLS Regression Testing Set:")
print('\nTesting R-squared:')
print(regr.score(Xtest,Ytest))
print('\nCross Validation Score:')
print('{} +/- {}'.format(round(cvscoret.mean(),2),round(cvscoret.std()*2,2)))

scores['OLS Regression'] = [regr.score(Xtrain,Ytrain),regr.score(Xtest,Ytest)]

OLS Regression Training Set:

Training R-squared:
0.6854585088278161

Cross Validation Score:
0.61 +/- 0.06

OLS Regression Testing Set:

Testing R-squared:
-1.2305696592835882e+17

Cross Validation Score:
-73797.08 +/- 208336.47


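OLS did not do a good job of modeling the data. A negative testing R-squared means the predictions are worse than simply predicting the mean outcome, and a value this extreme suggests unstable coefficients, likely from highly collinear dummy columns. As previously stated, OLS requires substantial data preparation to reach peak performance.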
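For context on the negative score: R-squared is computed as 1 minus the ratio of the squared prediction error to the squared deviation of the outcomes from their mean, so it has no lower bound. The sketch below uses made-up values (`y_true`, `wild_pred`) to show a score of zero and a strongly negative score.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical outcomes, for illustration only
y_true = np.array([0.0, 0.0, 4.0, 10.0])

# Always predicting the mean of y_true scores exactly 0
mean_pred = np.full_like(y_true, y_true.mean())
r2_mean = r2_score(y_true, mean_pred)
print(r2_mean)

# Predictions far from the truth make the squared error exceed the
# spread of y_true around its mean, so the score drops below 0
wild_pred = np.array([1000.0, -1000.0, 500.0, -500.0])
r2_wild = r2_score(y_true, wild_pred)
print(r2_wild)
```

A handful of wildly wrong predictions is enough to push the score far below zero, which is consistent with the testing result above.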

# Conclusion

Now that I have tried several different models to predict flight lateness, I can decide which model worked best.

In [27]:
scores['Stat'] = ['Training R-squared','Testing R-squared']
scores = scores.set_index('Stat')
display(scores)

Stat,Random Forest,Boosted Decision Trees,KNN Regression,OLS Regression
Training R-squared,0.994615,0.956782,1.0,0.6854585
Testing R-squared,0.903444,0.904664,0.622551,-1.23057e+17


Random Forest Regression and Boosted Decision Tree Regression clearly work best. Though they work somewhat differently, they model the data about equally well.

KNN Regression most accurately predicts the data it is fit to, but it overfits badly. If I were to tune the model further, I could reduce its overfitting in exchange for training set accuracy, but it would probably not work as well as the Random Forest or Boosted Trees models.

If I so desired, I could probably get OLS to perform about the same as, or perhaps even better than, the Random Forest and Boosted Trees models. I would do so by making sure the data adhered to the four principal assumptions of OLS: low collinearity between features, near-linear relationships between features and outcome, normally distributed errors, and equal variance of the errors across outcome values.
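As one illustration of the collinearity assumption, the sketch below checks whether a dummy-encoded design matrix has full column rank. It uses a made-up `carriers` frame and a hypothetical helper `design_rank`; it is a simple diagnostic sketch, not a complete assumption check and not something run on the flight data here.

```python
import numpy as np
import pandas as pd

# Hypothetical carrier column, for illustration only
carriers = pd.DataFrame({"UniqueCarrier": ["WN", "AA", "UA", "WN", "AA", "UA"]})

def design_rank(dummies):
    # Add an intercept column of ones (LinearRegression fits an intercept by default)
    design = np.column_stack([np.ones(len(dummies)), dummies.to_numpy(dtype=float)])
    # Return the number of columns and the matrix rank
    return design.shape[1], int(np.linalg.matrix_rank(design))

# Full one-hot encoding: the dummy columns sum to the intercept column
full_cols, full_rank = design_rank(pd.get_dummies(carriers))

# Dropping one level per category removes that exact linear dependence
reduced_cols, reduced_rank = design_rank(pd.get_dummies(carriers, drop_first=True))

print(full_cols, full_rank)
print(reduced_cols, reduced_rank)
```

When the rank is lower than the number of columns, the least-squares coefficients are not uniquely determined, which is one way collinear features can destabilize an OLS fit.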

While many models can be made to work at a similar level, the ability of Random Forests and Boosted Trees to predict data without heavy tuning makes them extremely desirable models.

# References

http://stat-computing.org/dataexpo/2009/the-data.html