In [1]:
import pandas as pd
import numpy as np
import scipy
import random
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn import ensemble
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.utils import resample
import statsmodels.formula.api as smf

%matplotlib inline

# Predicting Flight Lateness

I will build a model that predicts whether or not a plane will arrive on time. I have data on flights across the USA from 2007 and 2008. I will train a model on the 2007 data to predict how late a flight will be, and then test its accuracy on the 2008 data.

A plane is only considered late if it arrives more than 30 minutes after its scheduled arrival time.

## Importing Data

First, I must import the data. To keep things fast, I read only the first 10,000 rows of each file.

In [2]:
flights07 = pd.read_csv("flights07.csv", nrows=10000)
flights08 = pd.read_csv("flights08.csv", nrows=10000)

In [3]:
display(flights07.head())

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
0,2007,1,1,1,1232.0,1225,1341.0,1340,WN,2891,...,4,11,0,,0,0,0,0,0,0
1,2007,1,1,1,1918.0,1905,2043.0,2035,WN,462,...,5,6,0,,0,0,0,0,0,0
2,2007,1,1,1,2206.0,2130,2334.0,2300,WN,1229,...,6,9,0,,0,3,0,0,0,31
3,2007,1,1,1,1230.0,1200,1356.0,1330,WN,1355,...,3,8,0,,0,23,0,0,0,3
4,2007,1,1,1,831.0,830,957.0,1000,WN,2278,...,3,9,0,,0,0,0,0,0,0


In [4]:
# delete column of null values
del flights07['CancellationCode']

In [5]:
# drop nulls
flights07 = flights07.dropna()
flights07 = flights07.reset_index(drop=True)

In [6]:
display(flights07.head())

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,Distance,TaxiIn,TaxiOut,Cancelled,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
0,2007,1,1,1,1232.0,1225,1341.0,1340,WN,2891,...,389,4,11,0,0,0,0,0,0,0
1,2007,1,1,1,1918.0,1905,2043.0,2035,WN,462,...,479,5,6,0,0,0,0,0,0,0
2,2007,1,1,1,2206.0,2130,2334.0,2300,WN,1229,...,479,6,9,0,0,3,0,0,0,31
3,2007,1,1,1,1230.0,1200,1356.0,1330,WN,1355,...,479,3,8,0,0,23,0,0,0,3
4,2007,1,1,1,831.0,830,957.0,1000,WN,2278,...,479,3,9,0,0,0,0,0,0,0


## Creating the Outcome 

Now that the data is imported, I need to prepare my target variable. As I said before, a plane is only considered late if it arrives more than 30 minutes after its scheduled arrival.  Fortunately, there is a column that tells how many minutes late (or early) a flight was compared to its scheduled arrival time.

In [7]:
display(flights07['ArrDelay'].head())

0     1.0
1     8.0
2    34.0
3    26.0
4    -3.0
Name: ArrDelay, dtype: float64

In [8]:
# initiate a feature dataframe that is the same as the 2007 flights dataframe minus a few columns
features = flights07.drop(['Year','DepTime','CRSDepTime','ArrTime','CRSArrTime','FlightNum','TailNum'],axis=1)
# create a binary feature telling if the plane was late
features['late'] = np.where(flights07['ArrDelay']>30,1,0)
# create a continuous feature telling how many minutes past the 30-minute threshold the plane arrived
features['arrlatetime'] = np.where(features['late']==1,features['ArrDelay']-30,0)

display(features.head())

Unnamed: 0,Month,DayofMonth,DayOfWeek,UniqueCarrier,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,...,TaxiOut,Cancelled,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay,late,arrlatetime
0,1,1,1,WN,69.0,75,54.0,1.0,7.0,SMF,...,11,0,0,0,0,0,0,0,0,0.0
1,1,1,1,WN,85.0,90,74.0,8.0,13.0,SMF,...,6,0,0,0,0,0,0,0,0,0.0
2,1,1,1,WN,88.0,90,73.0,34.0,36.0,SMF,...,9,0,0,3,0,0,0,31,1,4.0
3,1,1,1,WN,86.0,90,75.0,26.0,30.0,SMF,...,8,0,0,23,0,0,0,3,0,0.0
4,1,1,1,WN,86.0,90,74.0,-3.0,1.0,SMF,...,9,0,0,0,0,0,0,0,0,0.0
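The labeling rule can be checked by hand against the five `ArrDelay` values displayed earlier. The sketch below is plain Python rather than the vectorized `np.where` calls used in the notebook, and the helper name `label_arrival` is hypothetical, introduced only for this illustration.

```python
def label_arrival(arr_delay_minutes, threshold=30):
    """Return (late, minutes_over) for a single flight.

    late is 1 when the arrival delay exceeds the threshold, else 0.
    minutes_over is the delay beyond the threshold, or 0.0 when not late.
    """
    late = 1 if arr_delay_minutes > threshold else 0
    minutes_over = arr_delay_minutes - threshold if late else 0.0
    return late, minutes_over


# ArrDelay values from the first five rows of the 2007 data
arr_delays = [1.0, 8.0, 34.0, 26.0, -3.0]
labels = [label_arrival(d) for d in arr_delays]
for delay, (late, over) in zip(arr_delays, labels):
    print(delay, late, over)
```

Only the third flight (34 minutes behind schedule) is labeled late, with 4 minutes past the threshold, which matches the `late` and `arrlatetime` columns in the table. A flight exactly 30 minutes behind schedule is not counted as late because the comparison is strict.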


# Creating Models

Now that I have my outcome made and my feature dataframe set up, I can begin to model. Because the outcome is a continuous number of minutes, this problem calls for regression modeling, and I will create models based on different regression techniques.

In [9]:
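# initiate model variables
# note: 'late' is derived from ArrDelay and stays in X, so it carries information about the outcome
X = features.drop(['arrlatetime','ArrDelay','LateAircraftDelay'],axis=1)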
X = pd.get_dummies(X)
Y = features['arrlatetime']

## OLS Regression

Ordinary Least Squares (OLS) Regression fits a line (or, with several features, a hyperplane) that minimizes the sum of squared differences between the predicted and observed outcomes. It can be powerful and accurate, but it often needs heavy data preparation to reach its best performance.
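To make the "line of best fit" idea concrete, here is a minimal single-feature sketch in plain Python. It is not what `linear_model.LinearRegression` does internally for many features; the function names and the toy numbers are assumptions made up for illustration, not values from the flight data.

```python
def fit_simple_ols(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(ys, preds):
    """Share of the variance in ys explained by preds."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# made-up toy data that lies exactly on y = 2x + 1
dep_delay = [0.0, 10.0, 20.0, 30.0, 40.0]
late_minutes = [1.0, 21.0, 41.0, 61.0, 81.0]

slope, intercept = fit_simple_ols(dep_delay, late_minutes)
preds = [slope * x + intercept for x in dep_delay]
print(slope, intercept, r_squared(late_minutes, preds))
```

The `r_squared` helper computes the same quantity that `regr.score(X, Y)` reports for a regressor: 1 minus the ratio of residual to total sum of squares.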

In [10]:
#call OLSR function
regr = linear_model.LinearRegression()
#fit model
regr.fit(X,Y)
#set prediction
Y_predregr = regr.predict(X)

In [11]:
print('R-squared for OLS Regression:')
print(regr.score(X,Y))
cvscore = cross_val_score(regr, X, Y)
print('\nCross Validation Score:')
# scores are R-squared values on a 0-1 scale, not percentages
print('{} +/- {}'.format(round(cvscore.mean(),2),round(cvscore.std()*2,2)))

R-squared for OLS Regression:
0.6854585088278161

Cross Validation Score:
0.61 +/- 0.06


OLS did not model the data very well: its cross-validated R-squared is only about 0.61. As noted above, OLS needs substantial data preparation to reach peak performance. Before committing to any of that, I will try a different model.

## Random Forest Regression

Random Forest Regression is a modeling technique that builds many decision trees, each fit to a slightly different sample of the same data set, and averages their predictions to reduce overfitting. Random Forests are flexible and powerful in their own right, but they come with the tradeoff of being computationally intensive, and they can still overfit on certain datasets.
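The core idea, fitting many trees to resampled data and averaging them, can be sketched in plain Python. This is a deliberately simplified illustration and not scikit-learn's `RandomForestRegressor`: each "tree" here is a single split on one feature, and the random feature selection that a full random forest also performs is omitted, so strictly speaking this is bagging. All function names and the toy data are assumptions made up for the example.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree: returns (threshold, left_mean, right_mean)."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue  # a split must leave points on both sides
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:
        # every x in the sample is identical: predict the overall mean
        mean_y = sum(ys) / len(ys)
        return (xs[0], mean_y, mean_y)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Fit each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    """Average the predictions of all stumps."""
    return sum(predict_stump(s, x) for s in forest) / len(forest)


# made-up toy data: the outcome jumps from 0 to 10 once x passes 4
xs = list(range(10))
ys = [0.0] * 5 + [10.0] * 5

forest = fit_forest(xs, ys)
low = predict_forest(forest, 0)
high = predict_forest(forest, 9)
print(low, high)
```

Because every stump sees a different bootstrap sample, individual stumps can disagree, and averaging them smooths out the quirks of any single sample.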

In [12]:
# call RF regressor
rfc = ensemble.RandomForestRegressor()
# fit model
rfc.fit(X,Y)
# set prediction
Y_predrfc = rfc.predict(X)

In [13]:
print('R-squared for Random Forest Regression:')
print(rfc.score(X,Y))
cvscore = cross_val_score(rfc, X, Y)
print('\nCross Validation Score:')
# scores are R-squared values on a 0-1 scale, not percentages
print('{} +/- {}'.format(round(cvscore.mean(),2),round(cvscore.std()*2,2)))

R-squared for Random Forest Regression:
0.994382228840546

Cross Validation Score:
0.96 +/- 0.02


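Random Forest appears to work best for modeling the data. Its training R-squared is much higher, at greater than 0.99, which is entirely acceptable. Additionally, its cross-validation score of 0.96 sits close to the training score, so it shows a smaller gap between training and cross-validated performance than the OLS model (0.69 versus 0.61) and appears less overfit.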

# Conclusion

If I so desired, I could get the OLS model to perform closer to the Random Forest model. I would do so by making sure the data adheres to the four main assumptions of OLS: low collinearity between features, near-linear relationships between features and outcome, normally distributed errors, and equal error variance across outcome values.
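As one example of what that checking could look like, the sketch below flags highly correlated feature pairs, which addresses only the first assumption (collinearity). The column names are borrowed from the flight data, but the values are made up for illustration, and the helper names and the 0.9 cutoff are assumptions rather than part of the original analysis.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def collinear_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| at or above the threshold."""
    names = sorted(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, round(r, 3)))
    return pairs


# made-up values: AirTime tracks Distance closely, TaxiOut does not
columns = {
    'Distance': [389, 479, 600, 800, 1000],
    'AirTime': [54, 74, 85, 110, 135],
    'TaxiOut': [11, 6, 9, 8, 9],
}

pairs = collinear_pairs(columns)
print(pairs)
```

A pair flagged this way is a candidate for dropping one of the two features or combining them before refitting the OLS model.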

However, doing so hardly seems worth it. The Random Forest model fits well with little sign of overfitting, so I would gladly use it instead.

# References

http://stat-computing.org/dataexpo/2009/the-data.html