In [4]:
import seaborn as sns
sns.set()

In [1]:
import numpy as np
import pandas as pd


# Part 2: Classification

Next we'll look at delay and cancellation data for airline flights from 2008.  We'll only consider flights from two airline carriers: Southwest Airlines and American Airlines.  Our goal is to build a classifier that predicts which carrier each flight belongs to.

Let's start by loading in the data.

In [33]:
pd.set_option('display.max_columns', None)

In [35]:
flight_data = pd.read_csv("./ml_flight_data.csv", index_col=0)

print(type(flight_data), flight_data.shape)

flight_data.head()

<class 'pandas.core.frame.DataFrame'> (20000, 23)


Unnamed: 0,Carrier,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Distance,TaxiIn,TaxiOut,Cancelled,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
0,Southwest,10,1,3,17.4,17.333333,18.95,19.083333,273.0,285.0,259.0,-8.0,4.0,1848,7.0,7.0,0,0,,,,,
1,Southwest,6,3,2,6.916667,7.0,7.35,7.416667,86.0,85.0,71.0,-4.0,-5.0,460,2.0,13.0,0,0,,,,,
2,Southwest,7,3,4,8.566667,8.416667,10.966667,10.916667,84.0,90.0,72.0,3.0,9.0,487,4.0,8.0,0,0,,,,,
3,American,2,22,5,,10.666667,,13.75,,125.0,,,,733,,,1,0,,,,,
4,American,3,29,6,17.733333,17.333333,19.933333,19.333333,192.0,180.0,149.0,36.0,24.0,987,34.0,9.0,0,0,6.0,0.0,12.0,0.0,18.0


A brief description of the columns can be found [here](http://stat-computing.org/dataexpo/2009/the-data.html).  For convenience, we've converted times from `hhmm` format to hours and dropped some columns.  All features can be treated as continuous.

To start, we'll select the appropriate columns to get our feature and label sets, `X` and `y`.  Depending on your strategy for model selection and evaluation, you may wish to further split these into test and training sets using scikit-learn's [`train_test_split`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

In [36]:
X = flight_data.drop('Carrier', axis=1)
y = flight_data['Carrier']
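If you do decide to hold out a test set, a minimal sketch looks like the following. It uses a small made-up frame (`demo`) in place of `flight_data`, and the `test_size`, `stratify`, and `random_state` choices are assumptions for illustration rather than settings prescribed by this assignment.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for flight_data: 10 rows, two features, two carriers.
demo = pd.DataFrame({
    "Carrier": ["Southwest", "American"] * 5,
    "Distance": np.arange(10) * 100,
    "DepDelay": [4.0, np.nan, 9.0, -5.0, 24.0, 0.0, np.nan, 3.0, 1.0, 7.0],
})
X_demo = demo.drop("Carrier", axis=1)
y_demo = demo["Carrier"]

# Hold out 20% of the rows; stratify keeps the carrier balance in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)
print(X_train.shape, X_test.shape)
```

Stratifying on the label is worth considering whenever the two carriers are not equally represented, since it keeps the class proportions comparable between the training and test sets.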

## Question 5: Logistic Regression Model

We'll start by building a classifier that uses Logistic Regression.  Practically speaking, this could be very similar to the models that we built before, but we have a new challenge to deal with: our data has missing values.

In a different situation, we might want to drop rows or columns, but this often means throwing out good data, and we may not have the luxury of ignoring incomplete observations when the time comes to apply our model.  Here we'll assign a value to each missing field based on the other values in the same column.  We suggest using the [`SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) transformer from scikit-learn's  `impute` module.  

Note that `SimpleImputer` is fairly limited: it fills each column with its mean, median, or most frequent value, or with a fixed constant.  To apply different strategies to different columns, we can combine several imputers in a `ColumnTransformer`; anything more elaborate requires a custom transformer.  Another strategy to be aware of is filling in random values, which is a middle ground between imputation and dropping rows/columns.
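To make the imputation strategies concrete, here is a small sketch on a made-up array (`toy` is not the flight data). It compares median and constant filling, two of the strategies `SimpleImputer` supports.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data with missing entries (values are made up).
toy = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [8.0, 60.0],
])

# Median imputation: each NaN becomes the median of its own column.
median_filled = SimpleImputer(strategy="median").fit_transform(toy)

# Constant imputation: every NaN becomes the same fill_value.
zero_filled = SimpleImputer(strategy="constant", fill_value=0).fit_transform(toy)

print(median_filled)
print(zero_filled)
```

The medians are learned in `fit` and reused in `transform`, so inside a pipeline the values used to fill the test data come from the training data only.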

In [34]:
# pd.set_option('display.max_rows', 500)
# pd.set_option('display.max_columns', 500)

In [176]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Impute per column group: complete columns pass through, missing delay
# causes become 0, and every other gap gets a 9999 sentinel.
features = ColumnTransformer([
    ('complete_cols', 'passthrough',
     ['Month', 'DayofMonth', 'DayOfWeek', 'CRSDepTime', 'CRSArrTime',
      'CRSElapsedTime', 'Distance', 'Cancelled', 'Diverted']),
    ('Fill_specific_delay', SimpleImputer(strategy='constant', fill_value=0),
     ['CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay',
      'LateAircraftDelay']),
    ('Fill_all_other_nan_values', SimpleImputer(strategy='constant', fill_value=9999),
     ['DepTime', 'ArrTime', 'ActualElapsedTime', 'AirTime',
      'ArrDelay', 'DepDelay', 'TaxiIn', 'TaxiOut'])
])

# Impute, standardize, then fit the classifier.
pipe = Pipeline([
    ('features', features),
    ('scaling', StandardScaler()),
    ('logistic_regressor', LogisticRegression(C=22, max_iter=10000))
], verbose=True)

pipe.fit(X, y)

log_est = pipe

[Pipeline] .......... (step 1 of 3) Processing features, total=   0.0s
[Pipeline] ........... (step 2 of 3) Processing scaling, total=   0.0s
[Pipeline]  (step 3 of 3) Processing logistic_regressor, total=   3.5s


Once you have a working pipeline, try modifying it to improve performance.

* Consider the [`StandardScaler`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) transformer.  Is it appropriate to use it here?   
* You can use [`GridSearchCV`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to do hyperparameter selection.  `LogisticRegression` has a hyperparameter, `C`, which is the inverse of the regularization strength (smaller values mean stronger regularization), and `SimpleImputer` has an argument, `strategy`, that controls how imputed values are calculated.
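The grid search suggested above can be sketched as follows. The data here is synthetic, the step names `imputer` and `clf` are hypothetical, and the grid values are arbitrary starting points rather than recommended settings for the flight data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data with some missing entries (not the flight data).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = np.where(X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0, "Southwest", "American")
X_toy[rng.random(X_toy.shape) < 0.1] = np.nan  # knock out about 10% of values

toy_pipe = Pipeline([
    ("imputer", SimpleImputer()),
    ("scaling", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Keys follow the <step name>__<parameter> convention.
param_grid = {
    "imputer__strategy": ["mean", "median"],
    "clf__C": [0.01, 0.1, 1, 10],
}

search = GridSearchCV(toy_pipe, param_grid, cv=5)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the imputer and scaler sit inside the pipeline, they are refit on each training fold, so the cross-validation scores are not contaminated by statistics from the held-out fold.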

In [177]:
# best_params_ exists only on a fitted GridSearchCV; a Pipeline step exposes get_params().
log_est.named_steps['logistic_regressor'].get_params()

## Question 6: Feature Importance

One way to gain insight into our machine learning models is to look at how much influence each variable has on their predictions.  In a broad sense, variables with more influence are more important because they have more predictive power.  For example, with linear or logistic regression we can measure importance by looking at the coefficients learned by the model.  If our features were normalized to begin with, then the coefficients with the greatest absolute value correspond to the most predictive features.  

In this question, we'll use the `.coef_` attribute of `LogisticRegression` to get the coefficients from the estimator we built in the previous step.  Keep in mind that we need to do feature scaling to get a fair comparison of the coefficients, so you may need to modify your pipeline to include `StandardScaler`.

Then write a function that returns a list of the five most important features, together with their coefficients.  Depending on how you built your estimator, you may need to do some digging to access the `LogisticRegression` component.  The `best_estimator_` attribute of [`GridSearchCV`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) and the `named_steps` attribute of [`Pipeline`](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) may come in handy.  
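As a rough sketch of the "digging" described above, the helper below unwraps a fitted `GridSearchCV` (if there is one), reaches into the pipeline, and ranks coefficients by absolute value. The function name `top_coefficients`, the step name `clf`, and the synthetic data are all assumptions made for illustration; they are not part of the assignment's starter code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def top_coefficients(estimator, feature_names, step="clf", k=5):
    """Return the k (name, coefficient) pairs with the largest |coefficient|."""
    # Unwrap a fitted GridSearchCV to its refit pipeline, if present.
    fitted = getattr(estimator, "best_estimator_", estimator)
    coefs = fitted.named_steps[step].coef_[0]
    order = np.argsort(np.abs(coefs))[::-1][:k]
    return [(feature_names[i], float(coefs[i])) for i in order]


# Synthetic data: the label depends strongly on f0, weakly on f1, not on f2.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(300, 3))
y_toy = (3 * X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0).astype(int)

toy_pipe = Pipeline([("scaling", StandardScaler()), ("clf", LogisticRegression())])
search = GridSearchCV(toy_pipe, {"clf__C": [0.1, 1, 10]}, cv=3).fit(X_toy, y_toy)

print(top_coefficients(search, ["f0", "f1", "f2"], k=2))
```

The same helper accepts a plain fitted `Pipeline`, since `getattr` falls back to the estimator itself when there is no `best_estimator_` attribute.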

In [149]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same column groups as before, but fill every gap with the column median.
features = ColumnTransformer([
    ('complete_cols', 'passthrough',
     ['Month', 'DayofMonth', 'DayOfWeek', 'CRSDepTime', 'CRSArrTime',
      'CRSElapsedTime', 'Distance', 'Cancelled', 'Diverted']),
    ('Fill_specific_delay', SimpleImputer(strategy='median'),
     ['CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay',
      'LateAircraftDelay']),
    ('Fill_all_other_nan_values', SimpleImputer(strategy='median'),
     ['DepTime', 'ArrTime', 'ActualElapsedTime', 'AirTime',
      'ArrDelay', 'DepDelay', 'TaxiIn', 'TaxiOut'])
])

# Scaling puts the coefficients on a comparable footing.
pipe = Pipeline([
    ('features', features),
    ('scaling', StandardScaler()),
    ('logistic_regressor',
     LogisticRegression(C=20, max_iter=1000, penalty='l2', solver='saga'))
], verbose=True)

pipe.fit(X, y)

log_est = pipe

[Pipeline] .......... (step 1 of 3) Processing features, total=   0.0s
[Pipeline] ........... (step 2 of 3) Processing scaling, total=   0.0s
[Pipeline]  (step 3 of 3) Processing logistic_regressor, total=   2.3s


In [150]:
import numpy as np

In [181]:
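# Caution: `features` emits columns in transformer order (complete columns,
# then delay causes, then the rest), which differs from the order of X.columns,
# so the names paired with the coefficients here may be misaligned.
colnames = X.columns
coefs = log_est.named_steps['logistic_regressor'].coef_[0]

kk = list(zip(colnames, coefs))

# Five coefficients with the largest absolute value.
sorted(kk, key=lambda x: np.abs(x[1]), reverse=True)[:5]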

[('CarrierDelay', 86.02351499936037),
 ('Diverted', -58.41910113076498),
 ('LateAircraftDelay', -50.74883504920496),
 ('SecurityDelay', -39.034976333180595),
 ('Cancelled', 39.02556465198696)]
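One caveat about the pairing above: a `ColumnTransformer` emits its output columns in the order its transformers are listed, not in the order of the input frame, so zipping `coef_` with `X.columns` can attach coefficients to the wrong names. A minimal sketch of one way to keep names aligned, using a tiny made-up frame and hypothetical list names (`complete_cols`, `delay_cols`, `other_cols`):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Tiny made-up frame; its column order deliberately differs from the transformer order.
toy = pd.DataFrame({
    "DepTime": [17.4, np.nan, 8.5],
    "Distance": [1848, 733, 487],
    "CarrierDelay": [np.nan, 6.0, np.nan],
})

complete_cols = ["Distance"]
delay_cols = ["CarrierDelay"]
other_cols = ["DepTime"]

toy_features = ColumnTransformer([
    ("complete_cols", "passthrough", complete_cols),
    ("fill_delay", SimpleImputer(strategy="constant", fill_value=0), delay_cols),
    ("fill_other", SimpleImputer(strategy="median"), other_cols),
])
out = toy_features.fit_transform(toy)

# Output columns follow the transformer order, so build the names the same way.
output_names = complete_cols + delay_cols + other_cols

print(list(toy.columns))  # input order
print(output_names)       # order of the columns in `out`
print(out[0])
```

Applied to the flight pipeline, the same idea would mean concatenating the three column lists passed to `features` and zipping that list, rather than `X.columns`, with the coefficients.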

In [1]:
column_names = list(flight_data.drop('Carrier', axis=1))

# Rank by absolute value so large negative coefficients are not dropped.
top_5 = sorted(kk, key=lambda x: abs(x[1]), reverse=True)[:5]


NameError: name 'flight_data' is not defined