<a id=toc></a>
# MSDS 7333 - Final Project: Analyzing Airline Flight Delays Using GraphLab Create

### Investigators
- [Matt Baldree](mailto:mbaldree@smu.edu?subject=lab14)
- [Ben Brock](mailto:bbrock@smu.edu?subject=lab14)
- [Tom Elkins](mailto:telkins@smu.edu?subject=lab14)
- [Austin Kelly](mailto:ajkelly@smu.edu?subject=lab14)


<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:5px;'>
    <h3>Instructions</h3>
    <p>Work with the airline data set (use R or Python to manage out-of-core).</p>
     <p>Answer the following questions by using the split-apply-combine technique</p>
    <ol>
        <li>Which airports are most likely to be delayed flying out of or into?</li>
        <li>Which flights with same origin and destination are most likely to be delayed?</li>
        <li>Can you regress how delayed a flight will be before it is delayed?</li>
        <li>What are the most important features for this regression?
            <ul>
            <li>Remember to properly cross-validate models.
            <li>Use meaningful evaluation criteria.
            <li>Create at least one new feature variable for the regression.
            </ul>
            
    </ol> 
            

    <p>Report Sections:</p>
    <ol>
        <li><a href="#introduction">Introduction</a> <b>(5 points)</b></li>
        <li><a href="#background">Background</a> <b>(10 points)</b></li>
        <li><a href="#methods">Methods</a> <b>(30 points)</b></li>
        <li><a href="#results">Results</a> <b>(30 points)</b></li>
        <li><a href="#conclusion">Conclusion</a> <b>(5 points)</b></li>
        <li><a href="#biblio">Bibliography and Citation</a> <b>(5 points)</b></li>
        <li><a href="#code">Code</a> <b>(5 points)</b></li>
    </ol>
     <p>Other Grading Criteria:</p>
    <ol>
        <li>Grammar and Organization <b>(10 points)</b></li>
    </ol>
</div>

<a id='introduction'></a>
## 1 - Introduction
<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<h3>Introduction (<b>5 points total</b>)</h3>
</div>

<div style="color:red">
<h3><b>Talk about how we are expanding upon the third question from the most recent case study&darr;</b></h3>
</div>

For this case study, we are tasked with acquiring and combining 22 years of airline data. Once the data is downloaded, it is parsed and appended to a single data frame from which we can compute summary statistics. With this much data, conventional in-memory methods for aggregation and calculation become impractical.

The data in question totals just over 123.5 million records, about 14 gigabytes of **uncompressed** CSV.

That's a lot of data.

In order to not only handle the data but also perform calculations over the data frame, we need to use more than a single core of the 4-core processors in our machines. Once more than one core is used, we venture into the realm of parallel computing. Parallel computing allows for a simple idea: break the data into roughly equal parts and process all of the parts at the same time. Many titans of industry use platforms such as the Hadoop Distributed File System (HDFS) to manage massive amounts of data relatively quickly on clusters of commodity servers. When a massive data file comes through (in our case, 12-14 GB), instead of using a single core to process all of the data, we use three cores to process 4-5 GB of data _each_, leaving a spare core (the master) to manage the three workers.
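
The split, process, and combine idea described above can be sketched with the Python 3 standard library. This is a minimal illustration, not the code used in this project: the helper names (`split_evenly`, `partial_delay_stats`, `combine`) and the toy `dep_delays` list are hypothetical. A thread pool is used only to keep the sketch portable; real CPU-bound work on a 14 GB file would need worker processes or a framework such as Dask.

```python
from concurrent.futures import ThreadPoolExecutor


def split_evenly(rows, n_parts):
    """Split a list into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(rows), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


def partial_delay_stats(chunk):
    """Apply step: return (row count, sum of delays) for one chunk."""
    return len(chunk), sum(chunk)


def combine(partials):
    """Combine step: merge the per-chunk results into one overall mean."""
    total_rows = sum(n for n, _ in partials)
    total_delay = sum(s for _, s in partials)
    return total_delay / float(total_rows)


# Toy departure delays in minutes (illustrative values only).
dep_delays = [11, -1, 11, -1, 19, -2, -2, 1, 14, -1, 31, -4]

chunks = split_evenly(dep_delays, 3)  # three workers, as in the text
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(partial_delay_stats, chunks))

mean_delay = combine(partials)
print("mean departure delay: %.2f minutes" % mean_delay)
```

Each worker returns a count and a sum rather than a mean, because partial means cannot be averaged correctly when chunk sizes differ.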

For this case study, we were met with many roadblocks, such as software compatibility with our hardware and version management. We found it quite difficult to manage older versions of R alongside the newest version of Python in the same Jupyter notebook. To minimize these roadblocks, our team used the Python 3.4 package [Dask](https://dask.pydata.org/en/latest/) along with Python 2.7's [GraphLab Create](https://turi.com/). Once these processes were executed in their entirety, we decided to cross-validate our findings by building an equivalent JavaScript environment to test them independently.

[&uarr; ToC](#toc)

<a id="background"></a>
## 2 - Background

<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<h3>Background (<b>10 points total</b>)</h3>
</div>

<div style="color:red">
<h3><b>Reiterate and rephrase this section.&darr;</b></h3>
</div>

The dataset our group acquired comprises just over 123 million records with 29 attributes. The attributes are described in this table:

### Variable descriptions of original data set
|Item|Name|Description|
|:--:|:--|:--|
|1|	Year	|1987-2008|
|2|	Month	|1-12|
|3|	DayofMonth	|1-31|
|4|	DayOfWeek	|1 (Monday) - 7 (Sunday)|
|5|	DepTime	|actual departure time (local, hhmm)|
|6|	CRSDepTime	|scheduled departure time (local, hhmm)|
|7|	ArrTime	|actual arrival time (local, hhmm)|
|8|	CRSArrTime	|scheduled arrival time (local, hhmm)|
|9|	UniqueCarrier	|unique carrier code|
|10|	FlightNum	|flight number|
|11|	TailNum	|plane tail number|
|12|	ActualElapsedTime	|in minutes|
|13|	CRSElapsedTime	|in minutes|
|14|	AirTime	|in minutes|
|15|	ArrDelay	|arrival delay, in minutes|
|16|	DepDelay	|departure delay, in minutes|
|17|	Origin	|origin IATA airport code|
|18|	Dest	|destination IATA airport code|
|19|	Distance	|in miles|
|20|	TaxiIn	|taxi in time, in minutes|
|21|	TaxiOut	|taxi out time, in minutes|
|22|	Cancelled	|was the flight cancelled?|
|23|	CancellationCode	|reason for cancellation (A = carrier, B = weather, C = NAS, D = security)|
|24|	Diverted	|1 = yes, 0 = no|
|25|	CarrierDelay	|in minutes|
|26|	WeatherDelay	|in minutes|
|27|	NASDelay	|in minutes|
|28|	SecurityDelay	|in minutes|
|29|	LateAircraftDelay	|in minutes|

The three most-important (and required) questions are:

(click on each question to navigate to the section of the notebook)

<div style="color:red">
<h3><b>Determine which  links we need and which we don't&darr;</b></h3>
</div>

- [Q1. What airports have the most delayed departures and arrivals?](#Question1)
- [Q2. What flights are most frequently delayed with same origin and destination?](#Question2)
- [Q3. Can you predict a flight's delayed time in minutes?](#Question3)

While these questions may seem obvious, it is important to state clearly what we intend to explore so that we answer the right questions.

First and foremost, we want to investigate (using basic aggregation functions) which airports are the main culprits for delayed departures and which are most subject to late arrivals. A flight is considered delayed if it departs or arrives more than 15 minutes after its scheduled time. Something to investigate at a later date (when adequate resources are available) is whether late departures influence late arrivals more than late arrivals affect late departures.

The second question asks whether specific routes are plagued by delays. With so many unique routes, it will be interesting to see whether any one route stands out from the rest. Because each record in our data is a single flight leg, we will not see many "entire" routes: a trip such as New York to Los Angeles is _typically_ made up of multiple sub-routes. Thus, we focus on the sub-routes that make up the longer routes. This does not mean that the longer routes are excluded, however (we will see an example of this later).
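
The two aggregations described above, delay rate per airport and delay rate per route, follow the same split-apply-combine pattern and differ only in the grouping key. The sketch below is a plain-Python illustration under stated assumptions: the function `delay_rate_by_key` and the four flight records are hypothetical, and the airport codes are placeholders rather than values from our data.

```python
from collections import defaultdict

DELAY_THRESHOLD_MIN = 15  # delayed means more than 15 minutes late


def delay_rate_by_key(flights, key, delay_field):
    """Split flights by key(flight), count delayed rows, return fraction delayed."""
    totals = defaultdict(int)
    delayed = defaultdict(int)
    for flight in flights:
        group = key(flight)
        totals[group] += 1
        if flight[delay_field] > DELAY_THRESHOLD_MIN:
            delayed[group] += 1
    return {group: delayed[group] / float(totals[group]) for group in totals}


# Hypothetical records using the column names from the table above.
flights = [
    {"Origin": "DFW", "Dest": "ORD", "DepDelay": 31, "ArrDelay": 26},
    {"Origin": "DFW", "Dest": "ORD", "DepDelay": -4, "ArrDelay": -11},
    {"Origin": "ORD", "Dest": "DFW", "DepDelay": 16, "ArrDelay": 14},
    {"Origin": "ORD", "Dest": "LGA", "DepDelay": 15, "ArrDelay": 40},
]

# Q1: delayed departures by origin airport, delayed arrivals by destination.
departure_rates = delay_rate_by_key(flights, lambda f: f["Origin"], "DepDelay")
arrival_rates = delay_rate_by_key(flights, lambda f: f["Dest"], "ArrDelay")

# Q2: delayed arrivals by (origin, destination) route.
route_rates = delay_rate_by_key(
    flights, lambda f: (f["Origin"], f["Dest"]), "ArrDelay")
```

Note that a delay of exactly 15 minutes does not count as delayed under the "more than 15 minutes" rule.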

With all of the data at our disposal, we will explore whether we can predict just _how_ delayed a flight will be based on the many factors involved. While numerous factors may influence our predictions, several factors outside the scope of this study must be treated as confounding variables. One such variable is the weather at the locations involved. As difficult as it may be to predict the delay of a particular flight from the day of the week and the carrier, it is far more difficult to predict the exact snowfall and wind speed that ultimately ground unsuspecting travelers.

[&uarr; ToC](#toc)

<a id="methods"></a>
## 3 - Methods

<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<h3>Methods (<b>30 points total</b>)</h3>
</div>

<div style="color:red">
<h3><b>Rephrase this section and add links to each of the different models &darr;</b></h3>
</div>

<a id=Question3></a>
## Q3. Can you predict a flight's delayed time in minutes?

The goal of this section is to create a model that predicts these flight delays with reasonable success. Some confounding variables, such as the weather, cannot be added. Weather generally cannot be forecast accurately more than about a week out, and "weather" itself covers several underlying variables, such as the average wind speed for the day in a region or the specific precipitation experienced. Until the weather can be predicted accurately, no model of flight delays will be perfect.

However, some highly influential variables are not explicitly included in the data, one being holidays and the days that surround them. To capture the effect of holidays on delays, we will create a variable named `hdays` that indicates how many days lie between the flight date and the nearest holiday. This is only one of potentially many variables that would serve our purpose well, but more time would be needed to adequately explore the other options.
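
One way the `hdays` variable could be computed is sketched below. The holiday list and the function name `days_to_nearest_holiday` are hypothetical and cover only a few 2003 holidays; a full implementation would need a holiday calendar for every year from 1987 to 2008, including the neighboring years' holidays for flights near the start or end of a year.

```python
from datetime import date

# Hypothetical holiday list for a single year (assumption for illustration).
HOLIDAYS_2003 = [
    date(2003, 1, 1),    # New Year's Day
    date(2003, 7, 4),    # Independence Day
    date(2003, 11, 27),  # Thanksgiving
    date(2003, 12, 25),  # Christmas Day
]


def days_to_nearest_holiday(year, month, day, holidays):
    """Return the absolute number of days from a flight date to the closest holiday."""
    flight_date = date(year, month, day)
    return min(abs((flight_date - holiday).days) for holiday in holidays)


# Year, Month, and DayofMonth are separate columns in the data set.
hdays = days_to_nearest_holiday(2003, 6, 29, HOLIDAYS_2003)
print("hdays = %d" % hdays)
```

Using the absolute difference treats the days before and after a holiday the same; a signed difference would be an alternative if the two turn out to behave differently.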

<h2>Predicting Airline On-Time Performance using Turi's GraphLab Create</h2>

Here, we build on the work by Prof. Larson discussed in the `Split_Apply_Combine in R and Python.ipynb` notebook.

- One-hot encoded data set: the `airline_encoded_data` SFrame



In [1]:
import graphlab
import graphlab as gl

## Get an airline_encoded_data SFrame for all of the airline data from 1987 to 2008

#### Note: It takes about 5 minutes to execute the code below.

In [2]:
%time airline_encoded_data = gl.SFrame('data/'+ 'AirlineDataAll.csv')
airline_encoded_data.shape

This non-commercial license of GraphLab Create for academic use is assigned to bbrock@smu.edu and will expire on August 02, 2018.


[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: C:\Users\BENBRO~1\AppData\Local\Temp\graphlab_server_1502419640.log.0


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[long,long,long,long,long,long,long,long,long,long,long,long,long,str,long,long,long,long,long,str,str,long,long,long,str,str,str,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


Wall time: 4min 51s


(123534969, 29)

In [3]:
airline_encoded_data.shape

(123534969, 29)

In [4]:
%time airline_encoded_data.head()

Wall time: 398 ms


Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum
1987,10,14,3,741,730,912,849,7,1451,9722
1987,10,15,4,729,730,903,849,7,1451,9722
1987,10,17,6,741,730,918,849,7,1451,9722
1987,10,18,7,729,730,847,849,7,1451,9722
1987,10,19,1,749,730,922,849,7,1451,9722
1987,10,21,3,728,730,848,849,7,1451,9722
1987,10,22,4,728,730,852,849,7,1451,9722
1987,10,23,5,731,730,902,849,7,1451,9722
1987,10,24,6,744,730,908,849,7,1451,9722
1987,10,25,7,729,730,851,849,7,1451,9722

ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled
91,79,,23,11,172,202,447,,,0
94,79,,14,-1,172,202,447,,,0
97,79,,29,11,172,202,447,,,0
78,79,,-2,-1,172,202,447,,,0
93,79,,33,19,172,202,447,,,0
80,79,,-1,-2,172,202,447,,,0
84,79,,3,-2,172,202,447,,,0
91,79,,13,1,172,202,447,,,0
84,79,,19,14,172,202,447,,,0
82,79,,2,-1,172,202,447,,,0

CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
4,0,,,,,
4,0,,,,,
4,0,,,,,
4,0,,,,,
4,0,,,,,
4,0,,,,,
4,0,,,,,
4,0,,,,,
4,0,,,,,
4,0,,,,,


In [8]:
list_of_features = ['TaxiIn', 
                    'TaxiOut', 
                    'CarrierDelay', 
                    'WeatherDelay', 
                    'NASDelay', 
                    'SecurityDelay', 
                    'LateAircraftDelay',
                    'ActualElapsedTime']

In [9]:
airline_encoded_data = airline_encoded_data.dropna(list_of_features)

In [10]:
airline_encoded_data.shape

(33540215, 29)

In [11]:
airline_encoded_data.head()

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum
2003,6,29,7,1756,1725,1904,1838,28,781
2003,6,30,1,1721,1725,1827,1838,28,781
2003,6,1,7,736,740,1004,1001,28,782
2003,6,2,1,736,740,1015,1001,28,782
2003,6,3,2,737,740,956,1001,28,782
2003,6,4,3,739,740,957,1001,28,782
2003,6,5,4,734,740,948,1001,28,782
2003,6,6,5,739,740,949,1001,28,782
2003,6,7,6,730,740,940,1001,28,782
2003,6,8,7,733,740,938,1001,28,782

TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut
13444,128,133,103,26,31,207,33,862,8,17
6483,126,133,110,-11,-4,207,33,862,4,12
259,148,141,126,3,-4,20,114,925,7,15
7997,159,141,133,14,-4,20,114,925,9,17
4922,139,141,120,-5,-3,20,114,925,7,12
5012,138,141,116,-4,-1,20,114,925,7,15
10881,134,141,115,-13,-6,20,114,925,4,15
5493,130,141,115,-12,-1,20,114,925,5,10
7295,130,141,113,-21,-10,20,114,925,6,11
11413,125,141,115,-23,-7,20,114,925,2,8

Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
0,4,0,0,26,0,0,0
0,4,0,0,0,0,0,0
0,4,0,0,0,0,0,0
0,4,0,0,0,0,0,0
0,4,0,0,0,0,0,0
0,4,0,0,0,0,0,0
0,4,0,0,0,0,0,0
0,4,0,0,0,0,0,0
0,4,0,0,0,0,0,0
0,4,0,0,0,0,0,0


<div style="color:red">
<h3><b>Discuss why these are important &darr;</b></h3>
</div>

# Create New Feature 'IS_DELAYED'

Our goal here was to create a categorical output variable. Turi's GraphLab Create models require a target variable, so a new categorical output column was created on the `airline_encoded_data` SFrame:

- `airline_encoded_data['is_delayed']`

An arriving flight is considered delayed if it arrives more than 15 minutes late, and a departing flight is considered delayed if it departs more than 15 minutes late. Either definition yields a valid categorical target.

First, let's construct a binary target variable. In this example, we predict whether a flight is delayed, with 1 (or True) indicating the flight is delayed and 0 (or False) indicating it is not.

In our case, to make `airline_encoded_data['is_delayed']` binary, we execute one of the following statements. If both are run in sequence, the later assignment overwrites the earlier one:

- `airline_encoded_data['is_delayed'] = airline_encoded_data['ArrDelay'] > 15`, or
- `airline_encoded_data['is_delayed'] = airline_encoded_data['DepDelay'] > 15`


<div style="color:red">
<h3><b>Discuss why these are important &uarr;</b></h3>
</div>

In [None]:
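# Make sure the target is discrete (binary).
# Note: the second assignment overwrites the first, so the target used
# downstream is the departure-based definition (DepDelay > 15).
airline_encoded_data['is_delayed'] = airline_encoded_data['ArrDelay'] > 15
airline_encoded_data['is_delayed'] = airline_encoded_data['DepDelay'] > 15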

We split the data into training and test subsets. 

In [23]:
# split the data randomly, keeping 80% for training and the rest for testing
(train, test) = airline_encoded_data.random_split(0.8)
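
What a random 80/20 split does can be sketched in plain Python. This is an illustration under assumptions, not GraphLab Create's implementation: the function `random_split_rows` and the seed value are hypothetical. Fixing a seed is one way to get the same train and test rows on every run, which matters when models trained in different sessions are compared.

```python
import random


def random_split_rows(rows, fraction, seed):
    """Send each row to the first set with probability `fraction`, else to the second."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second


rows = list(range(1000))
train_rows, test_rows = random_split_rows(rows, 0.8, seed=42)
print("train: %d rows, test: %d rows" % (len(train_rows), len(test_rows)))
```

Because each row is assigned independently, the split is only approximately 80/20, which is consistent with the test set of 6,705,862 rows out of 33,540,215 reported later in this notebook.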

## Baseline approach: Linear Regression Modeling

## Select the custom_airline_features

Based on the investigators' knowledge of the subject matter, the custom airline features believed to be significant are the following:

- Month
- Day of Month
- Day of Week
- Departure Time
- CRS Departure Time
- Arrival Time
- CRS Arrival Time
- Unique Carrier
- Flight Number
- Tail Number

In [21]:
custom_airline_features = ['Month',
                           'DayofMonth', 
                           'DayOfWeek', 
                           'DepTime', 
                           'CRSDepTime', 
                           'ArrTime', 
                           'CRSArrTime', 
                           'UniqueCarrier', 
                           'FlightNum', 
                           'TailNum']

# Linear Regression


### Note:  https://turi.com/learn/userguide/supervised-learning/linear-regression.html

Austin, use this as a guide in the write-up. I will add more tomorrow. This is a placeholder for now.


Per Matt Baldree's comments:

Since the target variable, `airline_encoded_data['is_delayed']`, is a binary categorical variable with values 0 or 1, ordinary least squares (OLS) is no longer a suitable estimator: with a binary response the errors cannot have constant variance, so OLS is inefficient, and its fitted values are not constrained to the interval from 0 to 1. Because `is_delayed` is a binary response variable, we instead use logistic regression to model the probability that a flight will be delayed. Using Turi's GraphLab Create API, we use the `graphlab.logistic_classifier` model to predict whether a flight is delayed.


- https://turi.com/products/create/docs/generated/graphlab.logistic_classifier.LogisticClassifier.html
- https://onlinecourses.science.psu.edu/stat504/node/149
- https://onlinecourses.science.psu.edu/stat501/node/374


**I propose that we delete approach 1.**


<div style="color:red">
<h3><b>Linear Regression is not applicable ONLY Logistic Regression is! &uarr;</b></h3>
</div>

In [71]:
%time model = graphlab.linear_regression.create(train, target = 'is_delayed', features = custom_airline_features)

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



Wall time: 2min 18s


In [72]:
print model.get('coefficients').topk('value')

+---------------+-------+--------------------+-------------------+
|      name     | index |       value        |       stderr      |
+---------------+-------+--------------------+-------------------+
|   DayOfWeek   |  None |  0.00112888981078  | 3.78746451684e-05 |
|    DepTime    |  None | 0.000799119061006  | 6.29639266067e-07 |
|   DayofMonth  |  None | 0.000660258637362  | 8.57949481577e-06 |
| UniqueCarrier |  None | 0.000565958993793  | 8.02785513679e-06 |
|   CRSArrTime  |  None | 4.95917574517e-05  | 3.68145813696e-07 |
|   FlightNum   |  None | 2.88970167198e-06  | 4.02726510174e-08 |
|    TailNum    |  None | -1.9592011493e-08  | 1.93787402877e-08 |
|    ArrTime    |  None | -6.09670073353e-05 | 3.03444379758e-07 |
|   CRSDepTime  |  None | -0.000625952085893 | 6.40909534241e-07 |
|     Month     |  None | -0.00148475651716  | 2.22636269218e-05 |
+---------------+-------+--------------------+-------------------+
[10 rows x 4 columns]



In [73]:
print model.get('coefficients').topk('value',reverse=True)

+---------------+-------+--------------------+-------------------+
|      name     | index |       value        |       stderr      |
+---------------+-------+--------------------+-------------------+
|  (intercept)  |  None |  -0.0415920000331  | 0.000405282639296 |
|     Month     |  None | -0.00148475651716  | 2.22636269218e-05 |
|   CRSDepTime  |  None | -0.000625952085893 | 6.40909534241e-07 |
|    ArrTime    |  None | -6.09670073353e-05 | 3.03444379758e-07 |
|    TailNum    |  None | -1.9592011493e-08  | 1.93787402877e-08 |
|   FlightNum   |  None | 2.88970167198e-06  | 4.02726510174e-08 |
|   CRSArrTime  |  None | 4.95917574517e-05  | 3.68145813696e-07 |
| UniqueCarrier |  None | 0.000565958993793  | 8.02785513679e-06 |
|   DayofMonth  |  None | 0.000660258637362  | 8.57949481577e-06 |
|    DepTime    |  None | 0.000799119061006  | 6.29639266067e-07 |
+---------------+-------+--------------------+-------------------+
[10 rows x 4 columns]



In [74]:
# Number of feature columns
print "Number of features: %s"   % model['num_features']

Number of features: 10


In [75]:
# Number of coefficients in the model
print "Number of coefficients in the model : %s" % model['num_coefficients']

Number of coefficients in the model : 11


In [76]:
# Number of features (including expanded lists and dictionaries)
print "Number of unpacked features : %s " % model['num_unpacked_features']

Number of unpacked features : 10 


In [78]:
# Save predictions to an SArray
predictions = model.predict(test)

In [79]:
# Evaluate the model and save the results into a dictionary
results = model.evaluate(test)

In [80]:
print results

{'max_error': 2.52661610587084, 'rmse': 0.38050752172443986}
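
One way to put the RMSE above in context is to compare it with a model that always predicts the overall delay rate. For a 0/1 target with delay rate p, that constant prediction has an RMSE of sqrt(p(1 - p)). The sketch below takes the delayed and not-delayed counts from the `roc_curve` table printed by the logistic regression evaluation later in this notebook, and it assumes the linear model was evaluated on that same test split.

```python
import math


def rmse(targets, predictions):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return math.sqrt(sum(squared) / float(len(squared)))


# Counts of delayed (p) and not-delayed (n) test rows, copied from the
# roc_curve output of the logistic classifier's model.evaluate(test).
n_delayed = 1345974
n_not_delayed = 5359888

delay_rate = n_delayed / float(n_delayed + n_not_delayed)

# RMSE of always predicting the delay rate for a 0/1 target.
baseline_rmse = math.sqrt(delay_rate * (1.0 - delay_rate))

reported_rmse = 0.38050752172443986  # from model.evaluate(test) above

print("delay rate    : %.4f" % delay_rate)
print("baseline RMSE : %.4f" % baseline_rmse)
print("reported RMSE : %.4f" % reported_rmse)
```

Under these assumptions, the linear model's RMSE of about 0.381 is only modestly below the constant baseline of about 0.401. The reported `max_error` of about 2.53 also shows that at least one prediction falls well outside the 0 to 1 range, since no prediction inside that range can miss a 0/1 target by more than 1.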


<div style="color:red">
<h3><b>Discuss these results &uarr;</b></h3>
</div>

##  Approach 2: Logistic Regression Modeling

### NOTE:   https://turi.com/learn/userguide/supervised-learning/logistic-regression.html

Austin, use this as a guide in the write-up. I will add more tomorrow. This is a placeholder for now.

We start with a simple yet powerful logistic regression model to predict whether a flight will be delayed.

In [81]:
%time model = graphlab.logistic_classifier.create(train, target = 'is_delayed', features = custom_airline_features)

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



Wall time: 1min 41s


## Evaluate the Logistic Regression Model

In [82]:
# Evaluate the model and save the results into a dictionary
print model.evaluate(test)

{'f1_score': 0.663910682314313, 'auc': 0.9567648088104059, 'recall': 0.49755493048156946, 'precision': 0.9973818012849726, 'log_loss': 0.339373096367494, 'roc_curve': Columns:
	threshold	float
	fpr	float
	tpr	float
	p	int
	n	int

Rows: 100001

Data:
+-----------+----------------+----------------+---------+---------+
| threshold |      fpr       |      tpr       |    p    |    n    |
+-----------+----------------+----------------+---------+---------+
|    0.0    |      1.0       |      1.0       | 1345974 | 5359888 |
|   1e-05   | 0.999642902986 | 0.988830393455 | 1345974 | 5359888 |
|   2e-05   | 0.999642902986 | 0.988827421629 | 1345974 | 5359888 |
|   3e-05   | 0.999642902986 | 0.98882519276  | 1345974 | 5359888 |
|   4e-05   | 0.999642902986 | 0.988822963891 | 1345974 | 5359888 |
|   5e-05   | 0.999642902986 | 0.988822963891 | 1345974 | 5359888 |
|   6e-05   | 0.999642902986 | 0.988819992065 | 1345974 | 5359888 |
|   7e-05   | 0.999642716415 | 0.988817763196 | 1345974 | 5359888 |
| 

In [83]:
model.get('coefficients')   # get the weights

name,index,class,value,stderr
(intercept),,1,-3.20339550318,0.00326865759365
Month,,1,-0.00767687949441,0.000170328694882
DayofMonth,,1,0.00330861640912,6.60649583003e-05
DayOfWeek,,1,0.00643665379704,0.000290866980591
DepTime,,1,0.0246941668579,1.44759873417e-05
CRSDepTime,,1,-0.0239628520927,1.46165924854e-05
ArrTime,,1,-1.49558418681e-06,2.38279893282e-06
CRSArrTime,,1,0.000106316680141,2.79159740634e-06
UniqueCarrier,,1,0.00307879915123,6.16274451957e-05
FlightNum,,1,1.0072278144e-05,3.08077610523e-07
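
To see how these weights turn into a prediction, the sketch below applies the standard logistic function to the weighted sum of one row's features. The row is the first record printed after `dropna` earlier in this notebook. Two assumptions are made: the coefficients apply to the raw feature values, and the `TailNum` weight, which is not among the ten rows shown, is small enough to ignore for illustration.

```python
import math

# Weights copied from model.get('coefficients') above (class 1 = delayed).
# TailNum is omitted because its weight is not in the rows shown (assumption).
coefficients = {
    "(intercept)": -3.20339550318,
    "Month": -0.00767687949441,
    "DayofMonth": 0.00330861640912,
    "DayOfWeek": 0.00643665379704,
    "DepTime": 0.0246941668579,
    "CRSDepTime": -0.0239628520927,
    "ArrTime": -1.49558418681e-06,
    "CRSArrTime": 0.000106316680141,
    "UniqueCarrier": 0.00307879915123,
    "FlightNum": 1.0072278144e-05,
}

# First row printed after dropna earlier in the notebook.
row = {
    "Month": 6, "DayofMonth": 29, "DayOfWeek": 7,
    "DepTime": 1756, "CRSDepTime": 1725,
    "ArrTime": 1904, "CRSArrTime": 1838,
    "UniqueCarrier": 28, "FlightNum": 781,
}


def sigmoid(z):
    """Logistic function: maps any real score to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def delay_probability(row, coefficients):
    """Weighted sum of the features plus intercept, passed through the sigmoid."""
    score = coefficients["(intercept)"]
    for name, value in row.items():
        score += coefficients[name] * value
    return sigmoid(score)


p_delayed = delay_probability(row, coefficients)
print("P(is_delayed = 1) = %.3f" % p_delayed)
```

The weights on `DepTime` and `CRSDepTime` are nearly equal in size and opposite in sign, so the score is driven largely by the gap between actual and scheduled departure time, which is closely related to the departure delay used to define the target.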


### Making Predictions

Predictions with a GraphLab Create classifier are made by using the `classify()` method, which provides a one-stop shop for all that you need from a classifier:

- A class prediction
- The probability/confidence associated with that class prediction

In the following example, the first prediction was class 0 with an 84.5% probability.

In [84]:
# Save class predictions and their probabilities to an SFrame
predictions = model.classify(test)
print predictions

+-------+----------------+
| class |  probability   |
+-------+----------------+
|   0   | 0.844730608912 |
|   0   | 0.928220771797 |
|   0   | 0.927374259688 |
|   0   | 0.927672774909 |
|   0   | 0.963834207836 |
|   0   | 0.960816172997 |
|   0   | 0.894976931714 |
|   0   | 0.903081132374 |
|   0   | 0.90741645003  |
|   0   | 0.902842377616 |
+-------+----------------+
[6705862 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


<div style="color:red">
<h3><b>Discuss these results &uarr;</b></h3>
</div>

GraphLab Create's logistic regression model can return more than one type of prediction. Here we request class labels:

In [85]:
class_predictions = model.predict(test, output_type = "class")  # Class

# Evaluating Results

We can also evaluate our predictions by comparing them to the known labels. The results are evaluated using two metrics:

* Classification Accuracy: Fraction of test set examples with correct class label predictions.
* Confusion Matrix: Cross-tabulation of predicted and actual class labels.

The accuracy of the model is 89.89%. The confusion matrix is listed below.

In [86]:
result = model.evaluate(test)
print "Accuracy         : %s " % result['accuracy']
print "Confusion Matrix : \n%s " % result['confusion_matrix']

Accuracy         : 0.89888906154 
Confusion Matrix : 
+--------------+-----------------+---------+
| target_label | predicted_label |  count  |
+--------------+-----------------+---------+
|      0       |        1        |   1758  |
|      1       |        1        |  669696 |
|      1       |        0        |  676278 |
|      0       |        0        | 5358130 |
+--------------+-----------------+---------+
[4 rows x 3 columns]
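
The metrics reported by `model.evaluate(test)` can be re-derived from the four confusion matrix counts, which is a useful check on how to read them. The sketch below uses only the counts printed above; the variable names are ours.

```python
# Counts copied from the confusion matrix above.
true_negative = 5358130   # target 0, predicted 0
false_positive = 1758     # target 0, predicted 1
false_negative = 676278   # target 1, predicted 0
true_positive = 669696    # target 1, predicted 1

total = true_negative + false_positive + false_negative + true_positive

accuracy = (true_positive + true_negative) / float(total)
precision = true_positive / float(true_positive + false_positive)
recall = true_positive / float(true_positive + false_negative)
f1_score = 2 * precision * recall / (precision + recall)

# Accuracy of always predicting "not delayed" (the majority class).
majority_baseline = (true_negative + false_positive) / float(total)

print("accuracy          : %.4f" % accuracy)
print("precision         : %.4f" % precision)
print("recall            : %.4f" % recall)
print("f1 score          : %.4f" % f1_score)
print("majority baseline : %.4f" % majority_baseline)
```

Read together, the numbers say that the model is almost never wrong when it predicts a delay (precision near 1.0) but finds only about half of the delayed flights (recall near 0.5). Because roughly 80% of the test flights are not delayed, the 89.89% accuracy is about ten points above what always predicting "not delayed" would achieve.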
 


<div style="color:red">
<h3><b>Discuss these results &uarr;</b></h3>
</div>

<h1>Non-linear regression: Traditional Matrix Factorization</h1>

Our task is to predict whether a flight will be delayed before the delay occurs. Delays are affected by airport load, weather, plane type, carrier, and many other parameters. Let us try regular matrix factorization.

The factorization recommender took approximately 15 minutes to execute.  The final results are listed below:

- Training RMSE 0.189621495987
- Validation RMSE 0.19030777841

## Reference 

- https://github.com/turi-code/userguide/blob/master/recommender/choosing-a-model.md

- https://turi.com/products/create/docs/generated/graphlab.recommender.factorization_recommender.create.html?highlight=factorization_recommender

<div style="color:red">
<h3><b>Discuss this section</b></h3>
</div>

In [87]:
# Warning, this could take some time to run!!!!
#   14 minutes and 22 seconds
# Train a matrix factorization model with default parameters
%time model = graphlab.recommender.factorization_recommender.create(train, user_id="FlightNum", item_id="Dest", target="is_delayed", side_data_factorization=False)

# check out the results of training and validation
print 'Training RMSE', model.get('training_rmse')
print 'Validation RMSE', graphlab.evaluation.rmse(test['is_delayed'], model.predict(test))

Wall time: 14min 22s
Training RMSE 0.189621495987
Validation RMSE 0.19030777841


# Let's remove the potential bottlenecks which could cause delays

<div style="color:red">
<h3><b>Discuss why these are bottlenecks </b></h3>
</div>


## NOTE:
We executed the boosted decision trees model two times: (1) without removing the potential bottlenecks from the train and test data sets, and (2) after removing them. We consider these factors to be bottlenecks because they may cause, or directly record, a delay to the flight. `DepDelay` in particular is the column from which `is_delayed` is computed. We ran the model without removing them first so that we could compare the two runs later.

Potential bottlenecks:
- AirTime
- ArrDelay
- DepDelay
- ArrTime

## Non-linear regression: Boosted decision trees

## Execute the Boosted Trees Regression Model with the Train and Test Data Set Not Modified

## Do exercise without removing the columns from the train and test data


It took almost 5 hours to complete the boosted trees regression model.    

Below are the results of the model.

- Training RMSE 9.18903822367e-07
- Validation RMSE 9.18951323831e-07


# Reference:   

-  https://turi.com/products/create/docs/generated/graphlab.boosted_trees_regression.BoostedTreesRegression.html?highlight=boosted_trees_regression
- avesbiodiv.mncn.csic.es/estadistica/bt1.pdf

In [26]:
# This could take some time to run  ==> It took about 4 hours and 45 minutes to 
# execute this section of code on a Windows 10 64 Bit HP Envy.
#
# DO YOU REALLY WANT TO RE-EXECUTE THIS CODE?
#
# Train a boosted trees regression model with max_iterations=50
%time model = graphlab.boosted_trees_regression.create(train, target="is_delayed", max_iterations=50)

# check out the results of training and validation
print 'Training RMSE', model.get('training_rmse')
print 'Validation RMSE', graphlab.evaluation.rmse(test['is_delayed'], model.predict(test))

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



Wall time: 4h 45min 17s
Training RMSE 9.18903822367e-07
Validation RMSE 9.18951323831e-07


## Feature Importance Analysis

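Feature importance is dominated by DepDelay: it is the only feature in the rows shown with a non-zero count (104), while every WeatherDelay row listed has a count of 0.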

In [28]:
print model.get_feature_importance()

+--------------+-------+-------+
|     name     | index | count |
+--------------+-------+-------+
|   DepDelay   |  None |  104  |
| WeatherDelay |   3   |   0   |
| WeatherDelay |   56  |   0   |
| WeatherDelay |  102  |   0   |
| WeatherDelay |   88  |   0   |
| WeatherDelay |   39  |   0   |
| WeatherDelay |   9   |   0   |
| WeatherDelay |   61  |   0   |
| WeatherDelay |  181  |   0   |
| WeatherDelay |   2   |   0   |
+--------------+-------+-------+
[7544 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
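
A training RMSE this close to zero is what one would expect when a feature determines the target exactly. Since `is_delayed` was defined as `DepDelay > 15`, a single threshold on `DepDelay` reproduces it. The sketch below illustrates this with hypothetical delay values; `threshold_rule` stands in for a one-split tree and is not part of GraphLab Create.

```python
import math

# Hypothetical departure delays; is_delayed is derived as in the notebook.
dep_delays = [31, -4, -4, -3, -1, -6, 16, 15, 120, 0]
is_delayed = [1 if d > 15 else 0 for d in dep_delays]


def threshold_rule(dep_delay):
    """A one-split 'tree' on the leaked column: predict 1 when DepDelay > 15."""
    return 1 if dep_delay > 15 else 0


predictions = [threshold_rule(d) for d in dep_delays]
squared_errors = [(t - p) ** 2 for t, p in zip(is_delayed, predictions)]
leak_rmse = math.sqrt(sum(squared_errors) / float(len(squared_errors)))

print("RMSE with DepDelay available: %s" % leak_rmse)
```

This is why the first run should be read as a sanity check rather than as a usable model: `DepDelay` is only known once the flight has already departed late.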


## Let's remove the potential bottlenecks which could cause delays

In [24]:
train.remove_columns(['AirTime','ArrDelay','DepDelay','ArrTime'])
test.remove_columns(['AirTime','ArrDelay','DepDelay','ArrTime'])

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime
2003,6,29,7,1756,1725,1838,28,781,13444,128
2003,6,1,7,736,740,1001,28,782,259,148
2003,6,2,1,736,740,1001,28,782,7997,159
2003,6,5,4,734,740,1001,28,782,10881,134
2003,6,7,6,730,740,1001,28,782,7295,130
2003,6,9,1,737,740,1001,28,782,435,132
2003,6,11,3,736,740,1001,28,782,12555,140
2003,6,4,3,1054,1100,1346,28,782,10986,97
2003,6,13,5,1056,1055,1343,28,782,10192,131
2003,6,17,2,1052,1055,1343,28,782,4464,122

CRSElapsedTime,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay
133,207,33,862,8,17,0,4,0,0
141,20,114,925,7,15,0,4,0,0
141,20,114,925,9,17,0,4,0,0
141,20,114,925,4,15,0,4,0,0
141,20,114,925,6,11,0,4,0,0
141,20,114,925,4,9,0,4,0,0
141,20,114,925,7,12,0,4,0,0
106,112,15,622,3,12,0,4,0,0
108,112,15,622,3,42,0,4,0,0
108,112,15,622,4,27,0,4,0,0

WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay,is_delayed
26,0,0,0,1
0,0,0,0,0
0,0,0,0,0
0,0,0,0,0
0,0,0,0,0
0,0,0,0,0
0,0,0,0,0
0,0,0,0,0
0,23,0,1,1
0,0,0,0,0


## Re-Execute the Boosted Trees Regression Model With the Updated Train and Test Data Set


This time, we execute the boosted trees regression with the bottlenecks removed, so that we can compare the results with and without the potential bottlenecks.

Again, this model took approximately 4 hours to complete.

The results of this model are listed below:
- Training RMSE 0.0168041735888
- Validation RMSE 0.0171184349095

<div style="color:red">
<h3><b>What is duplicated here? &uarr;</b></h3>
</div>

In [25]:
# This could take some time to run  ==> It took about 3 hours and 50 minutes to 
# execute this section of code on a Windows 10 64 Bit HP Envy.
#
# DO YOU REALLY WANT TO RE-EXECUTE THIS CODE?
#
# Train a boosted trees regression model with max_iterations=50
%time model = graphlab.boosted_trees_regression.create(train, target="is_delayed", max_iterations=50)

# check out the results of training and validation
print 'Training RMSE', model.get('training_rmse')
print 'Validation RMSE', graphlab.evaluation.rmse(test['is_delayed'], model.predict(test))

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



Wall time: 3h 50min 55s
Training RMSE 0.0168041735888
Validation RMSE 0.0171184349095


# Check which model is better

## Feature Importance Analysis

The important features in this model are WeatherDelay, LateAircraftDelay, NASDelay, CarrierDelay, and SecurityDelay.

In [26]:
print model.get_feature_importance()

+-------------------+-------+-------+
|        name       | index | count |
+-------------------+-------+-------+
|    WeatherDelay   |   0   |  233  |
| LateAircraftDelay |   0   |  156  |
|      NASDelay     |   0   |  156  |
|    CarrierDelay   |   0   |  138  |
|   SecurityDelay   |   0   |   90  |
|      NASDelay     |   5   |   64  |
|      NASDelay     |   3   |   53  |
|      NASDelay     |   1   |   45  |
| LateAircraftDelay |   6   |   42  |
| LateAircraftDelay |   10  |   41  |
+-------------------+-------+-------+
[5578 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
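
The importance table has one row per (name, index) pair, so the same column can appear several times. Summing the counts by name gives a per-column view. The sketch below does this for the ten rows shown above only; the full table has 5,578 rows, so the totals and even the ordering could change once every row is included.

```python
from collections import defaultdict

# (name, index, count) rows copied from the head of get_feature_importance() above.
importance_rows = [
    ("WeatherDelay", 0, 233),
    ("LateAircraftDelay", 0, 156),
    ("NASDelay", 0, 156),
    ("CarrierDelay", 0, 138),
    ("SecurityDelay", 0, 90),
    ("NASDelay", 5, 64),
    ("NASDelay", 3, 53),
    ("NASDelay", 1, 45),
    ("LateAircraftDelay", 6, 42),
    ("LateAircraftDelay", 10, 41),
]

# Sum the split counts for each column name.
count_by_name = defaultdict(int)
for name, _index, count in importance_rows:
    count_by_name[name] += count

# Rank columns by their total count, largest first.
ranked = sorted(count_by_name.items(), key=lambda item: item[1], reverse=True)
for name, count in ranked:
    print("%-18s %d" % (name, count))
```

All five of these columns record delay minutes by cause, so, like the columns removed earlier, they may not be known before a flight is actually delayed.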


<div style="color:red">
<h3><b>Add conclusion here&uarr;</b></h3>
</div>

<div style="color:red">
<h3><b>Add References and whatnot. Let's tie everything up here</b></h3>
</div>