
## Gradient Descent Lab

Week 7 | 1.2

---

In this lab you will investigate airline delays between October 2015 and January 2016. The airline data was downloaded from the [Bureau of Transportation Statistics](http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236).

The data is split into csvs for each month. There is also a `codebook.csv` that describes the fields in the data.

---

### Load packages and data

In [2]:
import numpy as np
import scipy 
import seaborn as sns
import pandas as pd

import patsy

import matplotlib
import matplotlib.pyplot as plt

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

plt.style.use('fivethirtyeight')

The csvs are located in the `flight_delays` folder. They are initially compressed since they are hundreds of MB in size.

    DSI-SF-4/datasets/flight_delays/oct2015.csv.zip
    DSI-SF-4/datasets/flight_delays/nov2015.csv.zip
    DSI-SF-4/datasets/flight_delays/dec2015.csv.zip
    DSI-SF-4/datasets/flight_delays/jan2015.csv.zip
    DSI-SF-4/datasets/flight_delays/codebook.csv
    
Please take a look at the codebook to identify what information each column contains.

The CSVs may end with a final column that contains only NaN values; this column can be dropped.
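A minimal sketch of dropping that trailing all-NaN column, assuming pandas. The tiny inline CSV is a made-up stand-in for one of the monthly files, and only a few of the real column names are used.

```python
import io
import pandas as pd

# Tiny stand-in for a monthly file: every row ends with a trailing comma,
# which pandas reads as an extra, entirely empty "Unnamed: N" column.
raw = io.StringIO(
    "MONTH,UNIQUE_CARRIER,DEP_DELAY,\n"
    "10,DL,-3.0,\n"
    "10,AA,22.0,\n"
)
df = pd.read_csv(raw)

# Drop only the columns in which every value is missing.
df = df.dropna(axis=1, how='all')
print(df.columns.tolist())
```

`pd.read_csv` can usually read a single-file `.zip` archive directly, since it infers the compression from the file extension, so unzipping by hand may not be necessary.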

---

### 1. Predict whether a flight was delayed 15 minutes or more (`DEP_DEL15`) using `SGDClassifier` with one month of data

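The `SGDClassifier` class in sklearn solves classification problems using stochastic gradient descent. Because the weights are updated from one sample at a time rather than from the full dataset at once, this is useful for datasets that are quite large.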

The `SGDClassifier` is very general and flexible, and can be customized with a variety of keyword arguments.

- `loss`: `['log', ...]`
    - The `'log'` loss fits a logistic regression classifier (scikit-learn 1.1 and later spell it `'log_loss'`). This is what I expect you'll use, but there are many other options, most of which we haven't covered.
- `penalty`: `['none','l1','l2','elasticnet']`
    - The regularization penalty applied to the coefficients. `'l1'` and `'l2'` are the Lasso and Ridge penalties, and `'elasticnet'` is a combination of the two.
- `alpha`
    - The regularization strength to be used with the chosen penalty. Same as in Lasso and Ridge.
- `l1_ratio`
    - The mix of the Lasso and Ridge penalties when `'elasticnet'` is chosen as the penalty (0 is pure Ridge, 1 is pure Lasso).
- `n_iter`
    - The number of training "epochs" over the data. This is the number of passes that the gradient descent algorithm will make over the data to iteratively fit the weights (defaults to 5). Newer scikit-learn releases replace this argument with `max_iter`.
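The keyword arguments above can be sketched as follows. This is only an illustration on synthetic data, not the lab's solution: `X_demo`, `y_demo`, and the chosen values are assumptions, and the version check exists only because the name of the logistic loss differs between scikit-learn releases.

```python
import numpy as np
import sklearn
from sklearn.linear_model import SGDClassifier

# The logistic loss is spelled 'log' in older scikit-learn releases and
# 'log_loss' from 1.1 onward, so pick the name from the installed version.
major, minor = (int(part) for part in sklearn.__version__.split('.')[:2])
log_loss_name = 'log_loss' if (major, minor) >= (1, 1) else 'log'

# Synthetic, roughly standardized data: 200 rows, 3 features, binary target.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

clf = SGDClassifier(
    loss=log_loss_name,    # logistic regression
    penalty='elasticnet',  # mix of the L1 and L2 penalties
    alpha=0.0001,          # regularization strength
    l1_ratio=0.15,         # share of the penalty that is L1
    max_iter=1000,         # upper bound on epochs (n_iter in old releases)
    tol=1e-3,              # stop early once the loss stops improving
    random_state=0,
)
clf.fit(X_demo, y_demo)
print(clf.coef_)
```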

Like the other models you have been using, `SGDClassifier` can be used in tandem with grid searching to tune its hyperparameters.

It is up to you how you want to define the model as far as predictors. You should:

1. Perform any necessary cleaning of the data.
2. Do any feature engineering you think is interesting. For example, you may create a variable indicating whether a day is a holiday or the eve of a holiday.
3. Choose predictors that seem relevant/interesting. Be careful not to include predictors that include information about the target variable you wouldn't expect to have for future data.
4. Fit a model using stochastic gradient descent. You may want to find the optimal parameters for your model, but since this data is large don't go overboard on the search space.
5. Validate your model. Explain, ideally visually, how well your model performs over baseline (or not).
6. What can you interpret about airline delays from your model?
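The holiday feature mentioned in the list could be built roughly like this. The holiday dates and the tiny `flights` frame are assumptions for illustration; only the `FL_DATE` column name comes from the data.

```python
import pandas as pd

# Hypothetical holiday list for the study window; extend as needed.
holidays = pd.to_datetime(['2015-11-26', '2015-12-25', '2016-01-01'])

# Made-up flight dates standing in for the FL_DATE column.
flights = pd.DataFrame({'FL_DATE': pd.to_datetime(
    ['2015-11-25', '2015-11-26', '2015-11-27', '2015-12-24'])})

# 1 if the flight date is a holiday, else 0.
flights['IS_HOLIDAY'] = flights['FL_DATE'].isin(holidays).astype(int)

# 1 if the following day is a holiday (the flight is on a holiday eve), else 0.
flights['IS_HOLIDAY_EVE'] = (
    (flights['FL_DATE'] + pd.Timedelta(days=1)).isin(holidays).astype(int)
)
print(flights)
```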

**Don't forget to standardize if you're using regularization!**
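One way to combine standardization with a small grid search is to put the scaler and the classifier in a `Pipeline`, so that scaling is refit inside every cross-validation fold. The sketch below uses synthetic data and, to stay independent of the scikit-learn version, the default hinge loss instead of the logistic loss; the grid values are arbitrary.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Two synthetic features on very different scales.
X_demo = np.column_stack([rng.normal(0, 1, 300), rng.normal(0, 1000, 300)])
y_demo = (X_demo[:, 0] + X_demo[:, 1] / 1000 > 0).astype(int)

pipe = Pipeline([
    ('scale', StandardScaler()),  # refit on the training part of each fold
    ('sgd', SGDClassifier(penalty='l2', max_iter=1000, tol=1e-3,
                          random_state=0)),
])

# A deliberately small search space, since each fit is expensive on large data.
param_grid = {'sgd__alpha': [1e-4, 1e-2]}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```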


In [17]:
oct2015 = pd.read_csv('/Users/austinwhaley/Desktop/DSI-SF-4-austinmwhaley/datasets/flight_delays/oct2015.csv')

In [18]:
oct2015.head()

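| Unnamed: 0 | YEAR | MONTH | DAY_OF_MONTH | DAY_OF_WEEK | FL_DATE | UNIQUE_CARRIER | TAIL_NUM | ORIGIN | ORIGIN_CITY_NAME | ORIGIN_STATE_NM | ... | CANCELLATION_CODE | ACTUAL_ELAPSED_TIME | AIR_TIME | DISTANCE | CARRIER_DELAY | WEATHER_DELAY | NAS_DELAY | SECURITY_DELAY | LATE_AIRCRAFT_DELAY | Unnamed: 29 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015 | 10 | 22 | 4 | 2015-10-22 | DL | N975DL | DSM | Des Moines, IA | Iowa | ... | NaN | 123.0 | 101.0 | 743.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2015 | 10 | 22 | 4 | 2015-10-22 | DL | N354NW | MCI | Kansas City, MO | Missouri | ... | NaN | 68.0 | 52.0 | 393.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2015 | 10 | 22 | 4 | 2015-10-22 | DL | N354NW | MSP | Minneapolis, MN | Minnesota | ... | NaN | 85.0 | 64.0 | 393.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2015 | 10 | 22 | 4 | 2015-10-22 | DL | N922DX | ATL | Atlanta, GA | Georgia | ... | NaN | 81.0 | 64.0 | 432.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 2015 | 10 | 22 | 4 | 2015-10-22 | DL | N922DX | IND | Indianapolis, IN | Indiana | ... | NaN | 87.0 | 63.0 | 432.0 | NaN | NaN | NaN | NaN | NaN | NaN |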


In [19]:
oct2015 = oct2015[['MONTH','UNIQUE_CARRIER', 'ORIGIN', 'DEST', 'DEP_TIME', 'DEP_DELAY']]

In [20]:
from sklearn.preprocessing import StandardScaler

In [21]:
oct2015.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 486165 entries, 0 to 486164
Data columns (total 6 columns):
MONTH             486165 non-null int64
UNIQUE_CARRIER    486165 non-null object
ORIGIN            486165 non-null object
DEST              486165 non-null object
DEP_TIME          483826 non-null float64
DEP_DELAY         483826 non-null float64
dtypes: float64(2), int64(1), object(3)
memory usage: 22.3+ MB


In [22]:
oct2015 = oct2015.dropna()

In [23]:
oct2015.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 483826 entries, 0 to 486164
Data columns (total 6 columns):
MONTH             483826 non-null int64
UNIQUE_CARRIER    483826 non-null object
ORIGIN            483826 non-null object
DEST              483826 non-null object
DEP_TIME          483826 non-null float64
DEP_DELAY         483826 non-null float64
dtypes: float64(2), int64(1), object(3)
memory usage: 25.8+ MB


In [24]:
X = oct2015[['MONTH','UNIQUE_CARRIER', 'ORIGIN', 'DEST', 'DEP_TIME']]

In [25]:
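# NOTE: DEP_DELAY is the delay in minutes, so SGDClassifier treats every distinct
# value as its own class; the target named in the task is the binary DEP_DEL15.
y = oct2015['DEP_DELAY']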
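Section 1 asks for a binary target (delayed 15 minutes or more), whereas `DEP_DELAY` holds minutes. A minimal sketch of deriving such a 0/1 target from the minutes column, using made-up delay values and assuming the same "15 minutes or more" rule as the section heading:

```python
import pandas as pd

# Hypothetical departure delays in minutes (negative = early departure).
dep_delay = pd.Series([-5.0, 0.0, 14.0, 15.0, 42.0], name='DEP_DELAY')

# 1 if the flight left 15 or more minutes late, else 0.
y_binary = (dep_delay >= 15).astype(int)
print(y_binary.tolist())
```

If the `DEP_DEL15` column is kept when selecting columns, it can presumably be used as the target directly instead.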

In [26]:
X = pd.get_dummies(X)

In [27]:
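SS = StandardScaler()
# fit_transform returns a new scaled array; because the result is not assigned,
# X itself stays unscaled in the cells below.
SS.fit_transform(X)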

array([[ 0.        ,  0.67815893, -0.43369375, ..., -0.04395572,
        -0.01122917, -0.01913029],
       [ 0.        , -0.85230589, -0.43369375, ..., -0.04395572,
        -0.01122917, -0.01913029],
       [ 0.        , -1.28693318, -0.43369375, ..., -0.04395572,
        -0.01122917, -0.01913029],
       ..., 
       [ 0.        , -1.37344667, -0.43369375, ..., -0.04395572,
        -0.01122917, -0.01913029],
       [ 0.        , -1.43730186, -0.43369375, ..., -0.04395572,
        -0.01122917, -0.01913029],
       [ 0.        ,  0.24559148, -0.43369375, ..., -0.04395572,
        -0.01122917, -0.01913029]])
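For the scaled values to reach the model, the output of the scaler has to be assigned. A common pattern, sketched here on synthetic data with hypothetical variable names, is to split first, fit the scaler on the training rows only, and reuse its statistics on the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_demo = rng.normal(loc=50, scale=10, size=(100, 2))
y_demo = rng.randint(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, train_size=0.75, random_state=0)

scaler = StandardScaler()
# fit_transform returns a new array, so keep the result.
X_train_std = scaler.fit_transform(X_train)
# Reuse the training means and standard deviations on the test rows.
X_test_std = scaler.transform(X_test)

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
```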

In [28]:
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75)

In [30]:
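SGDC = SGDClassifier()
SGDC.fit(X_train, y_train)
# cross_val_score clones and refits the estimator on folds of the data it is given,
# so these scores come from models trained on X_test folds, not from the fit above.
scores = cross_val_score(SGDC, X_test, y_test, cv=5, verbose=2, n_jobs=-1)

print(scores, '\n')
print(np.mean(scores))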

[CV]  ................................................................
[CV]  ................................................................
[CV]  ................................................................
[CV]  ................................................................
[CV] ................................................. , total= 4.3min
[CV]  ................................................................
[CV] ................................................. , total= 5.9min
[CV] ................................................. , total= 5.9min
[CV] ................................................. , total= 5.8min
[CV] ................................................. , total= 4.1min


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  8.5min finished


[ 0.01792804  0.00090722  0.01249948  0.00020754  0.00457895] 

0.0072242448646
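One possible way to validate against a baseline, sketched on synthetic data rather than the flight data: cross-validate on the training rows, score the fitted model once on the held-out rows, and compare with always predicting the most common class. All names and the synthetic target are assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(400, 4))
# Imbalanced 0/1 target, loosely mimicking "most flights are not delayed".
y_demo = (X_demo[:, 0] - X_demo[:, 1] > 0.8).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, train_size=0.75, random_state=0)

model = SGDClassifier(max_iter=1000, tol=1e-3, random_state=0)

# Cross-validate on the training rows only; each fold refits a fresh copy.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Fit once on all training rows, then score on the untouched test rows.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

# Baseline: always predict the most common training class.
majority = np.bincount(y_train).argmax()
baseline_accuracy = np.mean(y_test == majority)

print(cv_scores.mean(), test_accuracy, baseline_accuracy)
```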


---

### 2. Fit your model on all 4 months of data.

Now that you have determined a good model for the data, concatenate all four of the months and fit the model on all of the data.

You may need to re-clean the data if concatenation causes trouble.

1. How does your model perform on all of the data?
2. Do your findings from the one month of data change with 4 months of data?
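Concatenating the months can be sketched with `pd.concat`. The four small frames below are made-up stand-ins for the loaded CSVs, using only column names that appear in the data.

```python
import pandas as pd

# Hypothetical per-month frames standing in for the four loaded CSVs.
monthly_frames = [
    pd.DataFrame({'MONTH': [10, 10], 'ORIGIN': ['ATL', 'DSM'],
                  'DEP_DELAY': [5.0, 30.0]}),
    pd.DataFrame({'MONTH': [11], 'ORIGIN': ['MSP'], 'DEP_DELAY': [-2.0]}),
    pd.DataFrame({'MONTH': [12], 'ORIGIN': ['ATL'], 'DEP_DELAY': [17.0]}),
    pd.DataFrame({'MONTH': [1], 'ORIGIN': ['IND'], 'DEP_DELAY': [0.0]}),
]

# Stack the rows and rebuild a clean 0..n-1 index.
all_months = pd.concat(monthly_frames, ignore_index=True)

# Dummy-code after concatenating so every month shares the same columns.
X_all = pd.get_dummies(all_months[['MONTH', 'ORIGIN']],
                       columns=['MONTH', 'ORIGIN'])
print(all_months.shape, X_all.shape)
```

Creating the dummy variables after concatenation avoids mismatched columns when an airport or carrier appears in some months but not others.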

---

### 3. Fit a regression using `SGDRegressor` on the actual number of minutes a flight is delayed (`DEP_DELAY`)

I recommend going back to just 1 month if you are going to do any grid searching for optimal model hyperparameters.

The `SGDRegressor` object is for the most part the same as its classification counterpart. The primary difference is that you change the loss function to a regression loss function instead of a classification one.

1. How does your regression perform? Explain the metric and process you used to evaluate the model.
2. What is your interpretation of the regression results? How do your results compare with the results from your classification model?
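A minimal sketch of fitting and evaluating an `SGDRegressor`, using synthetic data in place of the flight data. It relies on the default squared-error loss, and the choice of R² and mean absolute error as metrics is an assumption, not a requirement of the lab.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(500, 3))
# Hypothetical "delay in minutes": a linear signal plus noise.
y_demo = (10 + 8 * X_demo[:, 0] - 3 * X_demo[:, 1]
          + rng.normal(scale=2, size=500))

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, train_size=0.75, random_state=0)

# Standardize using training statistics only.
scaler = StandardScaler().fit(X_train)

reg = SGDRegressor(penalty='l2', alpha=1e-4, max_iter=1000, tol=1e-3,
                   random_state=0)
reg.fit(scaler.transform(X_train), y_train)

pred = reg.predict(scaler.transform(X_test))
r2 = r2_score(y_test, pred)              # share of variance explained
mae = mean_absolute_error(y_test, pred)  # average error in minutes
print(r2, mae)
```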