## Scikit-learn Classification

In [None]:
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import sklearn
# sklearn.cross_validation was removed from scikit-learn; model_selection replaces it
from sklearn import linear_model, metrics, svm, ensemble
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_fscore_support, accuracy_score
from sklearn.model_selection import train_test_split, cross_val_score, ShuffleSplit
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder


In [None]:


flights = pd.read_csv('../data/nycflights13/flights.csv.gz')
weather = pd.read_csv('../data/nycflights13/weather.csv.gz')
airports = pd.read_csv('../data/nycflights13/airports.csv.gz')

df_withweather = pd.merge(flights, weather, how='left', on=['year','month', 'day', 'hour', 'origin'])
df = pd.merge(df_withweather, airports, how='left', left_on='dest', right_on='faa')

df = df.dropna()


In [None]:
df

In [None]:

pred = 'dep_delay'
features = ['month', 'day', 'dep_time', 'arr_time', 'carrier', 'dest', 'air_time', 'distance',
            'lat', 'lon', 'alt', 'dewp', 'humid', 'wind_speed', 'wind_gust',
            'precip', 'pressure', 'visib']

# Copy so that the column assignments below modify our own frame, not a view of df
features_v = df[features].copy()
pred_v = df[pred]

# A flight counts as late if it departs at least this many minutes behind schedule
how_late_is_late = 15.0

# carrier is not a number, so transform it into a number
features_v['carrier'] = pd.factorize(features_v['carrier'])[0]

# dest is not a number, so transform it into a number
features_v['dest'] = pd.factorize(features_v['dest'])[0]

scaler = StandardScaler()
scaled_features_v = scaler.fit_transform(features_v)

features_train, features_test, pred_train, pred_test = train_test_split(
    scaled_features_v, pred_v, test_size=0.30, random_state=0)


### Doing the classification with Logistic Regression

We will first attempt to do classification with logistic regression.  This likely will not give us the best results because logistic regression is a linear model.


In [None]:
# Perform logistic regression for classification

clf_lr = sklearn.linear_model.LogisticRegression(penalty='l2',
                                                 class_weight='balanced')
logistic_fit = clf_lr.fit(features_train,
                          np.where(pred_train >= how_late_is_late, 1, 0))

predictions = clf_lr.predict(features_test)

In [None]:
# Summary Report

# Confusion Matrix
cm_lr = confusion_matrix(np.where(pred_test >= how_late_is_late,1,0), 
                         predictions)
print("Confusion matrix")
print(pd.DataFrame(cm_lr))

# Get precision, recall, and F1 for the positive (late) class
report_lr = precision_recall_fscore_support(
    list(np.where(pred_test >= how_late_is_late, 1, 0)),
    list(predictions), average='binary')

# Print the metrics
print("\nprecision = %0.2f, recall = %0.2f, F1 = %0.2f, accuracy = %0.2f"
      % (report_lr[0], report_lr[1], report_lr[2],
         accuracy_score(list(np.where(pred_test >= how_late_is_late, 1, 0)),
                        list(predictions))))



An accuracy of 67% is not particularly good.  A bigger concern, however, is the relatively low precision and F1 score, which indicate that our model is better at predicting negatives (not late) than positives (late).

That said, predicting flight delays from the data we have is not easy.
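
As a reminder of how these scores relate to the confusion matrix, here is a minimal sketch. The four counts are hypothetical, chosen only for illustration, and are not the results of the model above.

```python
# Hypothetical confusion-matrix counts, chosen only for illustration
# (rows = actual class, columns = predicted class; class 1 = late).
tn, fp = 50, 30   # actual on time: predicted on time / predicted late
fn, tp = 5, 15    # actual late:    predicted on time / predicted late

precision = tp / (tp + fp)   # of the flights predicted late, the fraction actually late
recall = tp / (tp + fn)      # of the flights actually late, the fraction we caught
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("precision = %0.2f, recall = %0.2f, F1 = %0.2f, accuracy = %0.2f"
      % (precision, recall, f1, accuracy))
```

With counts like these, accuracy can look acceptable even though most of the flights flagged as late were in fact on time.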

### Another Attempt: A Random Forest Classifier

The low precision should concern us. It means that many of the flights our model flags as late actually left on time, so we are worse at identifying late flights (which matter more) than on-time ones.  An unbalanced training set is a likely cause.

Perhaps a Random Forest Classifier could help us.  Let's try that.  We'll use 40 trees. Each tree is trained on a bootstrap sample of the training rows (bagging) and considers a random subset of the features at each split, so we'll get 40 different trees. Let's see what that does to our precision.
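
Before fitting, it can help to check how unbalanced the labels are. The sketch below counts both classes for a small array of hypothetical delay values (the `delays` array is made up for illustration). On the real data, the same lines would be applied to `pred_train`.

```python
import numpy as np

# Hypothetical departure delays in minutes (illustrative values only).
delays = np.array([-3.0, 0.0, 5.0, 42.0, -1.0, 17.0, 2.0, 8.0, -5.0, 60.0])
how_late_is_late = 15.0

# Same labelling rule as in the notebook: 1 = late, 0 = not late.
labels = np.where(delays >= how_late_is_late, 1, 0)

# Count how many examples fall into each class.
counts = np.bincount(labels, minlength=2)
late_fraction = counts[1] / counts.sum()

print("not late: %d, late: %d, late fraction: %0.2f"
      % (counts[0], counts[1], late_fraction))
```

If the late fraction turns out to be small, one option to try is passing `class_weight='balanced'` to `RandomForestClassifier`, as we did for logistic regression.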




In [None]:
# Fit a Random Forest model for classification
# 40 trees.

clf_rf = sklearn.ensemble.RandomForestClassifier(n_estimators=40)
rf_fit = clf_rf.fit(features_train,
                    np.where(pred_train >= how_late_is_late, 1, 0))

predictions_rf = clf_rf.predict(features_test)

In [None]:
# Summary Report

# Confusion Matrix
cm_rf = confusion_matrix(np.where(pred_test >= how_late_is_late,1,0), 
                         predictions_rf)
print("Confusion matrix")
print(pd.DataFrame(cm_rf))

# Get precision, recall, and F1 for the positive (late) class
report_rf = precision_recall_fscore_support(
    list(np.where(pred_test >= how_late_is_late, 1, 0)),
    list(predictions_rf), average='binary')

# Print the metrics. Accuracy must be scored on the random forest's own
# predictions (predictions_rf), not on the logistic regression's.
print("\nprecision = %0.2f, recall = %0.2f, F1 = %0.2f, accuracy = %0.2f"
      % (report_rf[0], report_rf[1], report_rf[2],
         accuracy_score(list(np.where(pred_test >= how_late_is_late, 1, 0)),
                        list(predictions_rf))))




### Evaluating the Results

Our accuracy is about the same, but precision is now quite good at 82%.  The low recall should give us pause, though: the model still misses many of the flights that are actually late, which suggests that it needs further tuning.
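
One possible tuning knob, sketched below, is the decision threshold applied to the predicted probabilities. This is an illustration on synthetic data from `make_classification`, not on the flight data. The names `X`, `y`, and `clf` and the two thresholds are assumptions made for the example. Lowering the threshold can only keep or raise recall, usually at some cost in precision.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the flight data (roughly 20% positives).
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.8, 0.2],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)

clf = RandomForestClassifier(n_estimators=40, random_state=0)
clf.fit(X_train, y_train)

# Predicted probability of the positive class (column 1) for each test row.
proba_late = clf.predict_proba(X_test)[:, 1]

# Compare precision and recall at two illustrative thresholds.
recalls = {}
for threshold in (0.5, 0.3):
    predicted = (proba_late >= threshold).astype(int)
    recalls[threshold] = recall_score(y_test, predicted)
    print("threshold = %0.1f, precision = %0.2f, recall = %0.2f"
          % (threshold,
             precision_score(y_test, predicted, zero_division=0),
             recalls[threshold]))
```

Which threshold is appropriate depends on how costly a missed late flight is compared with a false alarm.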

### Feature Importances

Random forest classifiers can also tell us how important each feature is.  scikit-learn derives these scores from how much each feature reduces impurity in the splits where it is used, averaged over all 40 trees, and we can use them to evaluate which features appear to be the most predictive.

In [None]:
# Print the feature ranking
print("Feature ranking:")

indices = np.argsort(clf_rf.feature_importances_)[::-1]
# Spread of each feature's importance across the individual trees
std = np.std([tree.feature_importances_ for tree in clf_rf.estimators_],
             axis=0)

for f in range(features_train.shape[1]):
    # Look up the name by the ranked index, not by the loop position
    print("%d. feature %d: %s (%f)" % (f + 1, indices[f], features[indices[f]],
                                       clf_rf.feature_importances_[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(features_train.shape[1]), clf_rf.feature_importances_[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(features_train.shape[1]), indices)
plt.xlim([-1, features_train.shape[1]])
plt.show()

### Evaluating Feature Importances

What does the relative weight of feature importances say about
our dataset? Are there any features we might want to omit?
