# Model Training

# 1. Initialization

In [2]:
import numpy as np
import pandas as pd
import time
from IPython.display import Image

# Undersampling and Oversampling for class imbalance
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

# Classifiers
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
import lightgbm as lgb

# Metrics & Cross-Validation
from sklearn.model_selection import GridSearchCV, train_test_split
import joblib  # sklearn.externals.joblib is no longer shipped with recent scikit-learn releases
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             fbeta_score, make_scorer, classification_report, confusion_matrix)

from utils import *

import warnings
warnings.filterwarnings('ignore')

# 2. Preprocessing

Normally, in this section we would process and transform the dataset in different ways so that it is *in shape* and ready for the different algorithms to train on.

However, for this particular dataset and problem, few steps are needed. This stems in part from the work we did in the *EDA* notebook and from the fact that the majority of the available features are categorical. In addition, our previous run of PCA showed that, although the first two PCs explain over 97% of the total variability and samples do align in certain columns when projected onto these PCs (as depicted in section 6 of the *EDA* notebook, and in the Tableau workbook available in Tableau Public), they don't separate the classes of our target variable. The results of PCA are therefore not useful for the classification task.

Therefore, in this section we will only create a One Hot Encoding version of the dataset and treat the imbalanced nature of the classes in the target variable.

Also, because most of the features are categorical, there's no need to scale or normalize their values. Once we've run OHE, we will simply have many binary features.
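
As an illustration of what One Hot Encoding does, here is a minimal sketch using `pd.get_dummies` on a toy frame. The column names and values are hypothetical, and this is not necessarily how the `create_one_hot_encoding` helper from `utils` is implemented.

```python
import pandas as pd

# Hypothetical toy frame; column names are illustrative, not the CAS schema.
toy = pd.DataFrame({
    "crashSeverity": ["N", "M", "N", "F"],
    "weather": ["Fine", "Rain", "Fine", "Fog"],
    "light": ["Day", "Dark", "Day", "Day"],
})

# Encode every categorical feature, but leave the target column untouched.
feature_cols = [c for c in toy.columns if c != "crashSeverity"]
toy_ohe = pd.get_dummies(toy, columns=feature_cols, dtype=int)

print(toy_ohe.columns.tolist())
print(toy_ohe.head())
```

Each categorical column is replaced by one 0/1 column per category, which is why no scaling is needed afterwards.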

In [3]:
# Create One Hot Encoding

crash_data = pd.read_csv('data/Crash_Analysis_System_CAS_data_clean.csv', keep_default_na=False)
features_catalog = pd.read_table('data/features_description.tsv')
data_ohe = create_one_hot_encoding(crash_data, features_catalog)

In order to treat the imbalanced nature of the classes in the target variable `crashSeverity`, we will train every model using three different variations of the dataset.

The first variation will be the complete dataset. In this case, we will use all samples.

The second variation will be an undersampled version of the dataset. In this variation, only the class with the fewest samples (class `F` in this case) will remain as it is. We will remove random samples from every other class so that all classes are left with the same number of samples as class `F`.

The third variation will be an oversampled version of the dataset, where all classes will have as many samples as the majority class. To achieve this, we will produce synthetic samples using the *SMOTE* technique. However, due to the large number of samples already present in the dataset for classes `N` and `M`, we will first take a random sample of both classes to reduce their size and then run *SMOTE* on all the classes.

For the undersampling and oversampling steps, we will use the `imblearn` library.
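
To make the idea behind *SMOTE* concrete, the following is a simplified NumPy sketch of its core step: picking a minority sample, choosing one of its nearest minority neighbours, and interpolating between the two. The function name `smote_like_samples` and the toy points are hypothetical, and this is not `imblearn`'s implementation.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)

    # Pairwise Euclidean distances; a point is never its own neighbour.
    dists = np.linalg.norm(X_minority[:, None, :] - X_minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)   # sample each synthetic row starts from
    pick = rng.integers(0, k, size=n_new)   # which of its k neighbours to move toward
    neigh = neighbours[base, pick]
    gap = rng.random((n_new, 1))            # position along the connecting segment
    return X_minority[base] + gap * (X_minority[neigh] - X_minority[base])

# Hypothetical minority-class points (corners of the unit square).
X_f = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(X_f, n_new=6, k=2)
print(synthetic)
```

Note that interpolating between one-hot encoded rows yields fractional values in columns that are otherwise binary. `imblearn` also offers `SMOTENC` for data with categorical features, which may be worth considering for a dataset like this one.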

## Original Variation

In [4]:
# Full dataset variation

y = data_ohe['crashSeverity']
X = data_ohe.drop('crashSeverity', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=seed)
pd.Series(y).value_counts()

N    444945
M    140863
S     34913
F      5855
Name: crashSeverity, dtype: int64
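
The degree of imbalance can be read directly from the counts above. A small sketch, with the counts copied from the output:

```python
# Class counts copied from the value_counts() output above.
counts = {"N": 444945, "M": 140863, "S": 34913, "F": 5855}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
for label, share in shares.items():
    print(f"{label}: {share:.2%}")

# How many majority-class samples exist for each minority-class sample.
imbalance_ratio = counts["N"] / counts["F"]
print(f"N-to-F ratio: {imbalance_ratio:.1f}")
```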

## Undersampled Variation

In [5]:
# Undersampled variation
# ----------------------

# Generate variation
# rus = RandomUnderSampler(random_state=seed)
# X_under, y_under = rus.fit_resample(X, y)  # fit_resample replaces the deprecated fit_sample
# undersampled_ohe = pd.DataFrame(X_under, columns=X.columns)
# undersampled_ohe['crashSeverity'] = y_under
# undersampled_ohe.to_csv('data/undersampled_ohe.csv', index=False)


# Load variation
undersampled_ohe = pd.read_csv('data/undersampled_ohe.csv', keep_default_na=False)
y_under = undersampled_ohe['crashSeverity']
X_under = undersampled_ohe.drop('crashSeverity', axis=1)

# -----------------------------------------------------------------

X_train_under, X_test_under, y_train_under, y_test_under = train_test_split(X_under, y_under,test_size=.25,
                                                                            random_state=seed)

# Print class frequencies for variation
pd.Series(y_under).value_counts()

N    5855
S    5855
M    5855
F    5855
Name: crashSeverity, dtype: int64
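
Random undersampling itself is simple enough to sketch with pandas alone. The toy frame below is hypothetical, and the snippet only illustrates the idea; the notebook relies on `RandomUnderSampler` for the real dataset.

```python
import pandas as pd

# Hypothetical imbalanced toy frame.
toy = pd.DataFrame({
    "crashSeverity": ["N"] * 6 + ["M"] * 4 + ["S"] * 3 + ["F"] * 2,
    "feature": range(15),
})

# Size of the smallest class: every class is cut down to this size.
n_min = toy["crashSeverity"].value_counts().min()

# Draw n_min random rows from each class, without replacement.
toy_under = toy.groupby("crashSeverity").sample(n=n_min, random_state=0)
print(toy_under["crashSeverity"].value_counts())
```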

## Oversampled Variation

In [6]:
# Oversampled Variation
# ---------------------


# Generate variation
# ------------------

# Drop 80% of samples with class N and 40% of samples with class M to build an oversampled variation of reduced size.
# random_sample_N = data_ohe[data_ohe['crashSeverity'] == 'N'].sample(frac=.8, random_state=seed).index
# random_sample_M = data_ohe[data_ohe['crashSeverity'] == 'M'].sample(frac=.4, random_state=seed).index
# remove_index = random_sample_N.append(random_sample_M)
# y_rs = y.drop(remove_index)
# X_rs = X.drop(remove_index)


# Create oversample variation
# sm = SMOTE(random_state=seed)
# X_over, y_over = sm.fit_resample(X_rs, y_rs)  # fit_resample replaces the deprecated fit_sample
# oversampled_ohe = pd.DataFrame(X_over, columns=X.columns)
# oversampled_ohe['crashSeverity'] = y_over
# oversampled_ohe.to_csv('data/oversampled_ohe.csv', index=False)


# Load variation
oversampled_ohe = pd.read_csv('data/oversampled_ohe.csv', keep_default_na=False)
y_over = oversampled_ohe['crashSeverity']
X_over = oversampled_ohe.drop('crashSeverity', axis=1)

# -----------------------------------------------------------------

X_train_over, X_test_over, y_train_over, y_test_over = train_test_split(X_over, y_over, test_size=.25,
                                                                        random_state=seed)

pd.Series(y_over).value_counts()

M    88989
N    88989
F    88989
S    88989
Name: crashSeverity, dtype: int64
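
The figure of 88989 samples per class is consistent with the drop fractions in the commented code. The sketch below reproduces the arithmetic, assuming that `sample(frac=...)` rounds the number of dropped rows to the nearest integer.

```python
# Class counts of the full dataset and the fractions dropped before SMOTE.
counts = {"N": 444945, "M": 140863, "S": 34913, "F": 5855}
dropped_frac = {"N": 0.8, "M": 0.4}

# Rows left in each class after the random drop (assumes rounding to nearest).
remaining = {
    label: n - round(dropped_frac.get(label, 0.0) * n)
    for label, n in counts.items()
}

# SMOTE brings every class up to the size of the largest remaining class.
target = max(remaining.values())
synthetic_needed = {label: target - n for label, n in remaining.items()}

print(remaining)
print(synthetic_needed)
print(f"Synthetic share of class F: {synthetic_needed['F'] / target:.1%}")
```

Under these assumptions most of the rows of class `F` in the oversampled variation are synthetic, which is worth keeping in mind when reading the per-class scores for this variation.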

# 3. Benchmark

## Model Evaluation & Performance Metrics

The simplest metric that comes to mind is *Accuracy*: the ratio of correctly predicted crashes among all samples. This includes both *True Positives* and *True Negatives*. Since we are in a multiclass case, the class-level metrics are computed in a *One vs All* fashion for each class and then averaged, weighted by the number of samples in each class.

However, accuracy alone is rarely enough to evaluate the performance of a model. In fact, it usually falls short of describing how precise the model's predictions are.

This is why we will use F1 score as the main metric to evaluate the performance of the different models we will train.

The F1 score, in turn, is a combination of precision and recall. It is a special case of the $F_\beta$ score, whose parameter $\beta$ allows us to choose the relevance of one metric over the other. When $\beta$ is equal to 1, the score is the harmonic mean of precision and recall. Optimizing it therefore pushes our models to reduce both *FP* and *FN*.

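For a given class, with *TP*, *FP* and *FN* counted one vs all:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_\beta = (1 + \beta^2)\,\frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$

With $\beta = 1$ this reduces to $F_1 = \frac{2\,TP}{2\,TP + FP + FN}$, so false positives and false negatives are penalized equally.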
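
The definitions of precision, recall and $F_\beta$ can be checked numerically with a few lines of plain Python. The counts below are hypothetical and only serve to show how $\beta$ shifts the balance between the two metrics.

```python
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """Compute precision, recall and F-beta from one-vs-all counts of a single class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta ** 2
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta

# Hypothetical counts for one class.
p, r, f1 = precision_recall_fbeta(tp=30, fp=10, fn=20)
_, _, f2 = precision_recall_fbeta(tp=30, fp=10, fn=20, beta=2.0)

print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f} F2={f2:.3f}")
```

With `beta=2` recall weighs more than precision, so the score moves toward the recall value; with `beta=1` both errors count the same.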

## Original Variation

First, we train a Naive Bayes classifier to use as a benchmark. As such, we won't be doing any hyperparameter tuning just yet. Its performance will be used as a baseline that we will work to improve.

In [7]:
clf_NB_benchmark = MultinomialNB()

In [8]:
# Train NB benchmark model with original variation
clf_NB_benchmark.fit(X_train, y_train)

# Make prediction with test set
predictions_NB = clf_NB_benchmark.predict(X_test)

# Print results and save to explore further in Tableau
NB_original = structure_and_print_results('NB Benchmark', 'Original', y_test, predictions_NB, digits=5)

Model accuracy:  0.63390235183
             precision    recall  f1-score   support

          F    0.03122   0.47848   0.05862      1417
          M    0.52733   0.20320   0.29336     35231
          N    0.79478   0.81232   0.80346    111164
          S    0.14984   0.13123   0.13992      8832

avg / total    0.69136   0.63390   0.64458    156644



## Undersampled Variation

In [9]:
# Fit model, make prediction with test set and save results

clf_NB_benchmark.fit(X_train_under, y_train_under)
predictions_NB = clf_NB_benchmark.predict(X_test_under)
NB_under = structure_and_print_results('NB Benchmark', 'Undersampled', y_test_under, predictions_NB, digits=5)

Model accuracy:  0.419812126388
             precision    recall  f1-score   support

          F    0.41261   0.64778   0.50412      1465
          M    0.33037   0.20027   0.24937      1483
          N    0.48574   0.68468   0.56831      1443
          S    0.36013   0.15301   0.21477      1464

avg / total    0.39668   0.41981   0.38306      5855



## Oversampled Variation

In [10]:
# Fit model, make prediction with test set and save results

clf_NB_benchmark.fit(X_train_over, y_train_over)
predictions_NB = clf_NB_benchmark.predict(X_test_over)
NB_over = structure_and_print_results('NB Benchmark', 'Oversampled', y_test_over, predictions_NB, digits=5)

Model accuracy:  0.429614896223
             precision    recall  f1-score   support

          F    0.42566   0.68104   0.52388     22175
          M    0.33268   0.20579   0.25429     22124
          N    0.49005   0.68195   0.57029     22317
          S    0.38287   0.15005   0.21560     22373

avg / total    0.40793   0.42961   0.39099     88989



## Conclusion

In [10]:
# Save performance metrics (used in tableau workbook, Viz shown below)

# NB_metrics = pd.concat([NB_original, NB_under, NB_over])
# NB_metrics.to_csv('data/NB_benchmark_metrics.csv', index=False)

By training a NB classifier on all three variations of the dataset we can clearly see the large effect that balancing the classes of the target variable has on precision, recall and F1 score.

Both undersampling and oversampling (which result in the same number of samples for every class) have a detrimental effect on all metrics when we look at the results averaged over classes. However, if we look closer at the class level we see a completely different story (from the Tableau workbook):

<img src="images/NB_benchmark_metrics.png" />

The first thing we notice is that, for every metric and dataset variation, the values vary greatly among classes. In other words, each metric shows significant variance across target classes. This means that a good overall value, such as the averaged F1 score of 0.645 for the original variation, could be due to some classes having a much higher value than others.
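
The effect of the averaging scheme can be verified from the report for the original variation. The sketch below recomputes the support-weighted F1 (the `avg / total` row) and compares it with the unweighted mean over classes; the values are copied from the output above.

```python
# (f1-score, support) per class, copied from the NB report for the original variation.
report = {
    "F": (0.05862, 1417),
    "M": (0.29336, 35231),
    "N": (0.80346, 111164),
    "S": (0.13992, 8832),
}

total_support = sum(support for _, support in report.values())

# Weighted average: each class contributes in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support

# Macro average: every class counts the same, regardless of its size.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

print(f"weighted F1 = {weighted_f1:.5f}")
print(f"macro F1    = {macro_f1:.5f}")
```

The weighted value is roughly twice the macro value, which shows how much of the averaged score comes from class `N` alone.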

The second thing to notice is that the full variation of the dataset presents values that are heavily skewed toward class `N`. This is expected, since this class accounts for more than 70% of all samples. In fact, the imbalanced nature of the dataset will prevent us from achieving a good F1 score that is also balanced.

As a final comment, and following the previous one, we highlight the fact that both the under and oversampled variations show very similar metrics, although the oversampled variation is a little bit better.

Moreover, this is natural. Since both datasets have the same number of samples for each class, the NB classifier will estimate very similar conditional distributions for each class with either variation, thus producing nearly identical F1 scores, apart from the tiny differences we see above.

We expect that this will also be the case for the other classifiers and will use both variations to be able to pick the one that produces the best results.

# 4. Classifiers

In this section we will implement different algorithms and search for the one that performs best. In order to choose the best performing algorithm, we will carefully analyze the same metrics from the previous section and the different hyperparameter values to use. To optimize these hyperparameters we will use a mixed approach, combining grid search with hand-picked values.

Also, considering that most of the features in the dataset are categorical and that it is not clear which ones are the most meaningful or relevant, we will focus on decision trees as the base (weak) learner and train two ensemble models: one for bagging and one for boosting. Specifically, we will train and optimize a Random Forest and an ADABoost model.

Our approach will be first to train a model with all default values for each variation to get a baseline. Then we will do a grid search using the variation that produced the best results and with relevant hyperparameter values to find the optimal combination. The grids of parameters and values will be determined by analyzing the results of the baseline model. Finally, we will further tune some parameters manually.

## 4.1 Random Forest

Random Forest is a great and powerful tool. It leverages the power of decision trees and adds different levels of randomness to overcome their tendency to overfit. This randomness happens when bootstrapping the instances used to train each tree in the ensemble (which makes it a bagging algorithm) and again when a random sample of features is considered at each split of each tree. This generates a strong learner able to produce more accurate and stable predictions than any of the weak learners it is made of.

Despite all the above, Random Forest is sensitive to class imbalance, so we will train a baseline for each variation and optimize the best performing one through grid search.
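
The two sources of randomness described above (bootstrapped rows and a random subset of features at each split) can be sketched by hand with plain decision trees. This is a simplified illustration on synthetic data, not scikit-learn's `RandomForestClassifier`, which among other things averages class probabilities instead of taking a hard majority vote.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data, used only for illustration.
X_toy, y_toy = make_classification(n_samples=300, n_features=10, n_informative=5,
                                   n_classes=3, random_state=0)
rng = np.random.default_rng(0)

trees = []
for i in range(15):
    # Bagging: each tree sees a bootstrap sample (rows drawn with replacement).
    idx = rng.integers(0, len(X_toy), size=len(X_toy))
    # Feature randomness: only sqrt(n_features) candidates are considered per split.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_toy[idx], y_toy[idx])
    trees.append(tree)

# Majority vote across trees for every sample.
votes = np.stack([tree.predict(X_toy) for tree in trees])  # shape (n_trees, n_samples)
forest_pred = np.array([np.bincount(col, minlength=3).argmax() for col in votes.T])

print("Training accuracy of the hand-made forest:", (forest_pred == y_toy).mean())
```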

### 4.1.1 Baseline Models

#### Original Variation

In [11]:
clf_RF_base = RandomForestClassifier(random_state=seed, n_jobs=-1)

In [12]:
# Train RF baseline with original variation
# clf_RF_base.fit(X_train, y_train)
# joblib.dump(clf_RF_base, 'saved_models/clf_RF_base_original.joblib', compress=3)

# Load RF baseline with original variation
clf_RF_base = joblib.load('saved_models/clf_RF_base_original.joblib')

# -----------------------------------------------------------------

# Make predictions with test set and save results
pred_RF_base_original = clf_RF_base.predict(X_test)
RF_base_original_metrics = structure_and_print_results('RF Baseline', 'Original',
                                                       y_test, pred_RF_base_original, digits=5)

# -----------------------------------------------------------------

# Explore properties of forest to choose values for gridsearch
depths = []
for tree in clf_RF_base.estimators_:
    depths.append(tree.tree_.max_depth)
print('Depths: ', depths)

Model accuracy:  0.716427057532
             precision    recall  f1-score   support

          F    0.12270   0.02823   0.04590      1417
          M    0.44586   0.32559   0.37635     35231
          N    0.78281   0.89944   0.83708    111164
          S    0.25419   0.08243   0.12449      8832

avg / total    0.67125   0.71643   0.68612    156644

Depths:  [94, 89, 93, 95, 105, 93, 106, 99, 100, 101]
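
The pattern of commenting out the training lines and loading a saved model can also be wrapped in a small helper. The sketch below uses a hypothetical function name, `fit_or_load`, and toy data; it is one possible way to organize the caching, not part of the original notebook.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def fit_or_load(model, path, X, y):
    """Return the model stored at `path`, fitting and saving it first if it is missing."""
    if os.path.exists(path):
        return joblib.load(path)
    model.fit(X, y)
    joblib.dump(model, path, compress=3)
    return model

# Toy data and a temporary directory, used only for illustration.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "clf_RF_toy.joblib")
    first = fit_or_load(RandomForestClassifier(n_estimators=10, random_state=0),
                        path, X_toy, y_toy)   # trains and saves
    second = fit_or_load(RandomForestClassifier(n_estimators=10, random_state=0),
                         path, X_toy, y_toy)  # loads the saved model

    # Same exploration of tree depths as in the cell above.
    depths = [tree.tree_.max_depth for tree in second.estimators_]

print("Depths: ", depths)
```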


#### Undersampled Variation

In [13]:
# Train RF baseline with undersampled variation
# clf_RF_base.fit(X_train_under, y_train_under)
# joblib.dump(clf_RF_base, 'saved_models/clf_RF_base_under.joblib', compress=3)

# Load RF baseline with undersampled variation
clf_RF_base = joblib.load('saved_models/clf_RF_base_under.joblib')

# -----------------------------------------------------------------

# Make predictions with test set and save results
pred_RF_base_under = clf_RF_base.predict(X_test_under)
RF_base_under_metrics = structure_and_print_results('RF Baseline', 'Undersampled',
                                                    y_test_under, pred_RF_base_under, digits=5)

# -----------------------------------------------------------------

# Explore properties of forest to choose values for gridsearch
depths = []
for tree in clf_RF_base.estimators_:
    depths.append(tree.tree_.max_depth)
print('Depths: ', depths)

Model accuracy:  0.428351836038
             precision    recall  f1-score   support

          F    0.45924   0.58840   0.51586      1465
          M    0.33978   0.30816   0.32320      1483
          N    0.51884   0.53430   0.52646      1443
          S    0.36443   0.28552   0.32018      1464

avg / total    0.41996   0.42835   0.42075      5855

Depths:  [71, 66, 60, 56, 51, 57, 56, 59, 59, 62]


#### Oversampled Variation

In [14]:
# Train RF baseline with oversampled variation
# clf_RF_base.fit(X_train_over, y_train_over)
# joblib.dump(clf_RF_base, 'saved_models/clf_RF_base_over.joblib', compress=3)

# Load RF baseline with oversampled variation
clf_RF_base = joblib.load('saved_models/clf_RF_base_over.joblib')

# -----------------------------------------------------------------

# Make predictions with test set and save results
pred_RF_base_over = clf_RF_base.predict(X_test_over)
RF_base_over_metrics = structure_and_print_results('RF Baseline', 'Oversampled',
                                                   y_test_over, pred_RF_base_over, digits=5)

# -----------------------------------------------------------------

# Explore properties of forest to choose values for gridsearch
depths = []
for tree in clf_RF_base.estimators_:
    depths.append(tree.tree_.max_depth)
print('Depths: ', depths)

Model accuracy:  0.687871534684
             precision    recall  f1-score   support

          F    0.90320   0.95179   0.92686     22175
          M    0.49369   0.51632   0.50475     22124
          N    0.60390   0.62508   0.61431     22317
          S    0.76015   0.65856   0.70572     22373

avg / total    0.69036   0.68787   0.68794     88989

Depths:  [90, 80, 82, 84, 80, 81, 83, 102, 91, 91]


#### Conclusion

In [None]:
# Save performance metrics (used in tableau workbook, Viz shown below)

# RF_metrics = pd.concat([RF_base_original_metrics, RF_base_under_metrics, RF_base_over_metrics])
# RF_metrics.to_csv('data/RF_baseline_metrics.csv', index=False)

<img src="images/RF_baseline_metrics.png" />

The clear winner is the oversampled variation. This seems logical. Although the full variation has a very similar averaged F1 score, it has a huge imbalance in the classes we aim to predict, so the performance is good for the majority class but poor for classes `F` and `S`, which are the two minority classes. In fact, this is a characteristic of Random Forest, which is quite sensitive to class imbalance.

On the other hand, the under and oversampled variations have a more balanced performance across all classes. But since the oversampled variation has more samples, the model trained with this variation is able to produce better predictions. This is also a characteristic of Random Forest: the more samples available for training, the better the model becomes.

Therefore, we will do a grid search for the oversampled variation.

### 4.1.2 Grid Search

Considering that the trees of all three baseline models have depths ranging from about 50 to a little over 100, we will search for the value of `max_depth` in the following list: `[70, 80, 90, 100, 110, 120]`.
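
The depth ranges can be summarized directly from the lists printed by the baseline cells. The values below are copied from those outputs.

```python
# Tree depths printed by the three baseline Random Forests.
depths = {
    "Original": [94, 89, 93, 95, 105, 93, 106, 99, 100, 101],
    "Undersampled": [71, 66, 60, 56, 51, 57, 56, 59, 59, 62],
    "Oversampled": [90, 80, 82, 84, 80, 81, 83, 102, 91, 91],
}

# Minimum, maximum and mean depth per variation.
summary = {name: (min(d), max(d), sum(d) / len(d)) for name, d in depths.items()}
for name, (lo, hi, mean) in summary.items():
    print(f"{name:<13} min={lo:>3} max={hi:>3} mean={mean:.1f}")
```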

Apart from `max_depth`, we will also include `n_estimators` with values `[20, 50, 100, 200]` and `min_samples_split` with values `[2, 50, 100, 500]`.
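
Before launching the search it helps to know how many models it will train. A quick count, assuming three-fold cross-validation as referred to in the analysis of the results:

```python
from itertools import product

parameters = {
    "max_depth": [70, 80, 90, 100, 110, 120],
    "min_samples_split": [2, 50, 100, 500],
    "n_estimators": [20, 50, 100, 200],
}

# Every combination of the three lists is one candidate model.
combinations = list(product(*parameters.values()))

n_folds = 3  # assumption: three-fold cross-validation
n_fits = len(combinations) * n_folds

print(f"{len(combinations)} candidates x {n_folds} folds = {n_fits} fits")
```

This excludes the final refit of the best candidate on the whole training set.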

In [15]:
# Perform grid search
# -------------------

# parameters = {
#     'max_depth': [70, 80, 90, 100, 110, 120],
#     'min_samples_split': [2, 50, 100, 500],
#     'n_estimators': [20, 50, 100, 200]
# }
# clf_RF_gridsearch = RandomForestClassifier(random_state=seed, n_jobs=-1)
# scorer = make_scorer(fbeta_score, beta=1, average='weighted')
# grid_obj_RF_over = GridSearchCV(clf_RF_gridsearch, parameters, scoring=scorer, verbose=4)
# grid_obj_RF_over = grid_obj_RF_over.fit(X_train_over, y_train_over)
# joblib.dump(grid_obj_RF_over, 'saved_models/grid_obj_RF_over.joblib', compress=3)

# -----------------------------------------------------------------

# Load grid search results
# ------------------------

grid_obj_RF_over = joblib.load('saved_models/grid_obj_RF_over.joblib')

In [11]:
grid_obj_RF_over.best_params_

{'max_depth': 70, 'min_samples_split': 2, 'n_estimators': 200}

In [None]:
# Save results (used in tableau workbook, Viz shown below)

# grid_results = pd.DataFrame(grid_obj_RF_over.cv_results_)
# grid_results.to_csv('data/grid_RF_over_results.csv', index=False)

Looking at the CV results for the averaged F1 score of the three-fold validation, we find values between 0.5659 and 0.7086, with a maximum at `N Estimators` = `200`, `Min Samples Split` = `2` and `Max Depth` = `70`.

It's interesting to note that for each combination of `N Estimators` and `Max Depth` we considered, the highest scores are for `Min Samples Split` = `2`. This becomes evident by looking at the color stripes in the viz below, and of course at the actual score values.

At the same time (and as the second viz below shows), if we fix `Min Samples Split` at `2` and look at the scores for the different combinations of `N Estimators` and `Max Depth`, we see that for a fixed value of `N Estimators` the scores almost always decrease as `Max Depth` increases. Also, as `N Estimators` increases, the line of scores as a function of `Max Depth` moves up. This means that, within the grid, the higher the `N Estimators` and the lower the `Max Depth`, the better. Since both best values sit at the edge of the grid (the largest `N Estimators` and the smallest `Max Depth`), values beyond those edges might score slightly higher.

Nevertheless, after doing the search, the optimal combination only produced an uplift of just under 3 points compared to the baseline model on the training set.

At this point, considering the similar results we've had with all the models trained so far, we start to suspect that the available features may not carry enough predictive value, at least not in the way we expected. Perhaps treating the problem as multiple binary classification tasks could yield better results. Nevertheless, we are above our benchmark models.

<img src="images/grid_RF_over_results.png" />

<img src="images/grid_RF_min_samples_split_2_test_scores.png" />

In [16]:
# Predict with test set 
predictions_best_RF_over = grid_obj_RF_over.best_estimator_.predict(X_test_over)
RF_grid_over_metrics = structure_and_print_results('RF Grid Search', 'Oversampled',
                                                       y_test_over, predictions_best_RF_over, digits=5)

Model accuracy:  0.722752250278
             precision    recall  f1-score   support

          F    0.94940   0.96284   0.95607     22175
          M    0.53125   0.51632   0.52368     22124
          N    0.61807   0.68468   0.64967     22317
          S    0.80208   0.72690   0.76264     22373

avg / total    0.72531   0.72275   0.72310     88989



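Surprisingly, when we make predictions on the test set (for the oversampled variation) we get an F1 score of 0.7231, which is better than the mean cross-validation score of 0.7086 we got during the search. Part of this gap is expected: each cross-validation model is fit on only two thirds of the training set, whereas the best estimator is refit on all of it.

Now our belief that the dataset lacks predictive information, at least in the way we expected, becomes stronger. We could also say that this Random Forest has generalized well to the available data, although it was not able to achieve an F1 score higher than 72.3% on either the training or the testing set. One caveat: the oversampled dataset was generated with *SMOTE* before the train/test split, so the test set contains synthetic samples interpolated from points that may belong to the training set, and the scores for this variation are probably optimistic.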

### 4.1.3 Summary

First we trained one baseline model on each variation of the dataset, using the default hyperparameter values.

The model trained on the original dataset produced an averaged F1 score that, when looked at the class level, was uneven due to class imbalance.

On the other hand, the model trained on the undersampled variation of the dataset produced F1 scores that were much more balanced at the class level. However, the averaged F1 score was too low.

But the model trained on the oversampled variation of the dataset was able to produce an averaged F1 score even better than the one produced with the original dataset, together with class-level F1 scores much more balanced than those of the original dataset. Also, for each class, the F1 score was a good balance between Precision and Recall (this was also the case for the undersampled variation).

Therefore, we chose this last variation for further optimization through a grid search over possible hyperparameter values.

We created a grid with a few values for three hyperparameters: `n_estimators`, `max_depth` and `min_samples_split`. The search produced a *best estimator* with parameters `n_estimators=200`, `max_depth=70` and `min_samples_split=2`, with an averaged F1 score of 0.7086 in cross-validation on the training set and 0.7231 on the testing set.

Although the grid search was able to produce a model that performs better than the baseline (and the benchmark from section 3), the improvement was minimal.

Given the size of the dataset, a test score better than the training score, the composition of the forest and the overall performance, we are inclined to believe that the model was able to generalize well to the available data, but also that this data might not carry enough predictive information for the model to predict the severity of a crash in a multiclass classification framework.

An alternative path for developing a model/application able to make such predictions could be to develop and integrate an individual binary classifier for each class.

## 4.2 ADABoost

ADABoost is a boosting algorithm. It is an ensemble of weak learners trained sequentially, where each learner (or stump) is trained with a special focus on the samples that were wrongly classified by the previous weak learners, thus boosting the performance achieved so far.

These characteristics make ADABoost a good candidate to improve the performance we got so far. However, the algorithm is also sensitive to noise, outliers and class imbalance. It could also be affected by the curse of dimensionality, given that the One Hot Encoded version of the dataset has 287 features.
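
The "special focus on misclassified samples" comes from a reweighting step performed after each weak learner. The sketch below shows one such round in the style of SAMME, the multiclass variant of the algorithm. The function name `samme_round` and the toy weights are hypothetical, and the exact update used by a given scikit-learn version may differ.

```python
import math

def samme_round(weights, misclassified, n_classes, learning_rate=1.0):
    """One SAMME-style boosting round: return the learner weight and new sample weights.

    Assumes the weighted error is strictly between 0 and 1.
    """
    # Weighted error of the current weak learner.
    err = sum(w for w, miss in zip(weights, misclassified) if miss) / sum(weights)
    # Say of this learner in the final vote, scaled by the learning rate.
    alpha = learning_rate * (math.log((1 - err) / err) + math.log(n_classes - 1))
    # Misclassified samples get heavier; correctly classified ones keep their weight.
    boosted = [w * math.exp(alpha) if miss else w
               for w, miss in zip(weights, misclassified)]
    total = sum(boosted)
    return alpha, [w / total for w in boosted]

# Five samples with uniform weights; the first one was misclassified.
weights = [0.2] * 5
alpha, new_weights = samme_round(weights, [True, False, False, False, False], n_classes=4)

print(f"alpha = {alpha:.4f}")
print("new weights:", [round(w, 4) for w in new_weights])
```

After normalization the misclassified sample carries most of the weight, so the next stump is pushed to classify it correctly. A smaller `learning_rate` shrinks `alpha` and makes this shift more gradual.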

### 4.2.1 Baseline Models

#### Original Variation

In [17]:
clf_ADA_base = AdaBoostClassifier(random_state=seed)

In [18]:
# Train ADABoost baseline with original variation
# clf_ADA_base.fit(X_train, y_train)
# joblib.dump(clf_ADA_base, 'saved_models/clf_ADA_base_original.joblib', compress=3)

# Load ADABoost baseline with original variation
clf_ADA_base = joblib.load('saved_models/clf_ADA_base_original.joblib')

# -----------------------------------------------------------------

# Make predictions with test set and save results
pred_ADA_base_original = clf_ADA_base.predict(X_test)
ADA_base_original_metrics = structure_and_print_results('ADA Baseline', 'Original',
                                                       y_test, pred_ADA_base_original, digits=5)

Model accuracy:  0.744363014223
             precision    recall  f1-score   support

          F    0.24074   0.00917   0.01768      1417
          M    0.56679   0.22449   0.32160     35231
          N    0.76454   0.97367   0.85653    111164
          S    0.41408   0.04993   0.08912      8832

avg / total    0.69557   0.74436   0.68536    156644



#### Undersampled Variation

In [19]:
# Train ADABoost baseline with undersampled variation
clf_ADA_base.fit(X_train_under, y_train_under)
joblib.dump(clf_ADA_base, 'saved_models/clf_ADA_base_under.joblib', compress=3)

# Load ADABoost baseline with undersampled variation
clf_ADA_base = joblib.load('saved_models/clf_ADA_base_under.joblib')

# -----------------------------------------------------------------

# Make predictions with test set and save results
pred_ADA_base_under = clf_ADA_base.predict(X_test_under)
ADA_base_under_metrics = structure_and_print_results('ADA Baseline', 'Undersampled',
                                                    y_test_under, pred_ADA_base_under, digits=5)

Model accuracy:  0.456532877882
             precision    recall  f1-score   support

          F    0.47589   0.61297   0.53580      1465
          M    0.38853   0.20094   0.26489      1483
          N    0.49375   0.68399   0.57350      1443
          S    0.40765   0.33470   0.36759      1464

avg / total    0.44110   0.45653   0.43441      5855



#### Oversampled Variation

In [20]:
# Train ADABoost baseline with oversampled variation
# clf_ADA_base.fit(X_train_over, y_train_over)
# joblib.dump(clf_ADA_base, 'saved_models/clf_ADA_base_over.joblib', compress=3)

# Load ADABoost baseline with oversampled variation
clf_ADA_base = joblib.load('saved_models/clf_ADA_base_over.joblib')

# -----------------------------------------------------------------

# Make predictions with test set and save results
pred_ADA_base_over = clf_ADA_base.predict(X_test_over)
ADA_base_over_metrics = structure_and_print_results('ADA Baseline', 'Oversampled',
                                                    y_test_over, pred_ADA_base_over, digits=5)

Model accuracy:  0.532661340166
             precision    recall  f1-score   support

          F    0.64125   0.70304   0.67073     22175
          M    0.43069   0.27762   0.33761     22124
          N    0.54362   0.72846   0.62261     22317
          S    0.45888   0.42069   0.43895     22373

avg / total    0.51856   0.53266   0.51757     88989



#### Conclusion

In [59]:
# Save performance metrics (used in tableau workbook, Viz shown below)

# ADA_metrics = pd.concat([ADA_base_original_metrics, ADA_base_under_metrics, ADA_base_over_metrics])
# ADA_metrics.to_csv('data/ADA_baseline_metrics.csv', index=False)

<img src="images/ADA_baseline_metrics.png" />

Here we see something similar to the Random Forest above. The original variation has an averaged F1 score of 0.685. However, the image above clearly shows that it is mostly due to the F1 score for class `N`. All other classes have a much lower F1 score. Therefore, we won't be tuning ADABoost for this variation.

On the other hand, the other two variations show F1 scores that are much more balanced between classes, although also much lower. What's more, for each class, the F1 score is also a very good balance between Precision and Recall. And because the oversampled variation has better overall results, we will tune ADABoost on this dataset, as tuning could give a major improvement over this initially low value.

### 4.2.2 Grid Search

ADABoost has two main parameters we can use in a grid search: `N Estimators` and `Learning Rate`. The number of estimators determines how many weak learners we want in the ensemble. The learning rate is a factor that scales the contribution of each learner as it is added to the ensemble.

We could also add the `Max Depth` of the weak learners to the grid, but that goes somewhat against the idea of having learners that are weak, so we will leave the default value of `1`.

For the `n_estimators` parameter we will use the values `[100, 200, 300, 400, 500]`, and for `learning_rate` the values `[0.3, 0.5, 1, 1.3, 1.5]`.
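
As a reduced-size illustration of this kind of search, the sketch below runs `GridSearchCV` with a weighted F1 scorer on synthetic data. The grid is deliberately tiny and the data is hypothetical, so the numbers it prints say nothing about the crash dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data, used only for illustration.
X_toy, y_toy = make_classification(n_samples=300, n_features=10, n_informative=5,
                                   n_classes=3, random_state=0)

# Small grid so the example runs in seconds.
parameters = {"learning_rate": [0.5, 1.0], "n_estimators": [10, 20]}

# Same metric as in the notebook: F1 averaged over classes, weighted by support.
scorer = make_scorer(f1_score, average="weighted")

search = GridSearchCV(AdaBoostClassifier(random_state=0), parameters,
                      scoring=scorer, cv=3)
search.fit(X_toy, y_toy)

print(search.best_params_, round(search.best_score_, 3))
```

Passing the scorer through the `scoring` keyword and setting `cv` explicitly avoids depending on positional arguments and defaults, which have changed between scikit-learn versions.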

In [21]:
# Perform grid search with oversampled variation
# ----------------------------------------------

# parameters = {
#     'learning_rate': [.3, .5, 1, 1.3, 1.5],
#     'n_estimators': [100, 200, 300, 400, 500]
# }
# clf_ADA_gridsearch = AdaBoostClassifier(random_state=seed)
# scorer = make_scorer(fbeta_score, beta=1, average='weighted')
# grid_obj_ADA_over = GridSearchCV(clf_ADA_gridsearch, parameters, scoring=scorer, verbose=4)
# grid_obj_ADA_over = grid_obj_ADA_over.fit(X_train_over, y_train_over)
# joblib.dump(grid_obj_ADA_over, 'saved_models/grid_obj_ADA_over.joblib', compress=3)

# -----------------------------------------------------------------

# Load grid search results for oversampled variation
# --------------------------------------------------

grid_obj_ADA_over = joblib.load('saved_models/grid_obj_ADA_over.joblib')

In [119]:
grid_obj_ADA_over.best_params_

{'learning_rate': 1, 'n_estimators': 500}

In [None]:
# Save results (used in tableau workbook, Viz shown below)

# grid_results = pd.DataFrame(grid_obj_ADA_over.cv_results_)
# grid_results.to_csv('data/grid_ADA_over_results.csv', index=False)

Looking at the CV results for the averaged F1 score of the three-fold validation, we find values between 0.50564 and 0.60365, with a maximum at `N Estimators` = `500` and `Learning Rate` = `1`.

In this case, the interesting part is that for each value of `n_estimators`, the best test score is achieved with a learning rate of 1. In a way, this makes sense: the algorithm itself already assigns weights to correctly and incorrectly classified samples, so that the next weak learner is trained to improve on the misclassifications of the previous one.

<img src="images/grid_ADA_over_results.png" />

<img src="images/grid_ADA_test_scores.png" />

In [22]:
predictions_best_ADA_over = grid_obj_ADA_over.best_estimator_.predict(X_test_over)
ADA_grid_over_metrics = structure_and_print_results('ADA Grid Search', 'Oversampled',
                                                       y_test_over, predictions_best_ADA_over, digits=5)

Model accuracy:  0.608974142872
             precision    recall  f1-score   support

          F    0.75603   0.77267   0.76426     22175
          M    0.47706   0.50899   0.49251     22124
          N    0.61592   0.70305   0.65661     22317
          S    0.58601   0.45175   0.51020     22373

avg / total    0.60879   0.60897   0.60583     88989



### 4.2.3 Summary

In this section we trained a few ADABoost models following the same steps and analysis we used for the Random Forest above.

The baseline models behaved similarly to the Random Forest baseline models across variations. However, the baseline model trained with the oversampled variation produced an F1 score of 0.518, a value much lower than the 0.688 produced by its Random Forest counterpart.

However, because of the uneven class-level F1 scores for the original variation, we chose to do a grid search using the oversampled one, as we believed the search could greatly improve the score for the latter. But this was not the case. The search produced a *best estimator* that, although it gave better predictions than its baseline sibling, had an F1 score that was still too low, even lower than the score of the baseline model trained with the original variation.

Unfortunately, the model trained with the oversampled variation produced a very low accuracy and averaged F1 score.