# Supervised Learning Models

In this notebook, we present a series of experiments to compare different models that rely on supervised learning for a classification task. 

We will rely on a dataset that contains instances of **UK Traffic Accidents** to attempt to **classify an accident as severe or non-severe given the relevant features in the dataset**. For more details on the preparation of the dataset used here consult the [dataset_prep notebook](dataset_prep.ipynb).

Before evaluating the different models, we will introduce the metrics that will be used for this purpose. 
We will also define the concepts of cross validation and grid search, which will be used to evaluate the studied models as well.

## Classification Task Metrics

In this section we present the metrics that will be utilized to compare the different models for classification.

**Note:** For more details on the definition of the metrics presented below, refer to the scikit-learn documentation: [sklearn.metrics Documentation](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics)

To better understand these metrics, we are going to consider a very simple example of a classification task. Let's say we have a model trained to identify whether or not the animal in a given picture is a dog. The model is evaluated with 100 pictures, of which 40 contain a dog and 60 do not. Of the 40 pictures with a dog, the algorithm correctly classifies 35 as containing a dog. Of the 60 that do not contain a dog, the algorithm incorrectly classifies 10 as containing a dog.

* 100 pictures
  * 40 have a dog.
    * Model predicts 35 as having a dog.
    * Model predicts 5 as not having a dog.
  * 60 do not have a dog.
    * Model predicts 10 as having a dog.
    * Model predicts 50 as not having a dog.


### Classification Accuracy

Accuracy is the simplest and most widely used metric. It represents the ratio of correct predictions to the total number of predictions and is normally presented as a percentage:

$$Accuracy (\%) = 100 * \frac{correct\_predictions}{total\_predictions}$$

For our example scenario, the accuracy would be calculated as below:

$$Accuracy (\%) = 100 * \frac{35 + 50}{100}$$
$$Accuracy = 85\%$$
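The same calculation can be sketched in plain Python. The label lists below are an assumption made for illustration: they simply rebuild the 100 pictures of the example, using 1 for *dog* and 0 for *non-dog*.

```python
# Rebuild the 100 labels of the dog example (1 = dog, 0 = non-dog).
# Order: 35 dogs predicted as dog, 5 dogs predicted as non-dog,
#        10 non-dogs predicted as dog, 50 non-dogs predicted as non-dog.
y_true = [1] * 35 + [1] * 5 + [0] * 10 + [0] * 50
y_pred = [1] * 35 + [0] * 5 + [1] * 10 + [0] * 50

# A prediction is correct when it matches the truth label
correct_predictions = sum(t == p for t, p in zip(y_true, y_pred))
total_predictions = len(y_true)

accuracy = 100 * correct_predictions / total_predictions
print(f"Accuracy = {accuracy:.1f}%")  # Accuracy = 85.0%
```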

### Confusion Matrix

A matrix or table that helps visualize the accuracy of the classification by comparing the *'Truth'* values against the predictions made by the model, one along the rows and the other along the columns. This is easier to understand with the data of our example scenario.

In the rows, we present the count of predictions made by the model for each of the categories. In the columns, we present the *'Truth'* count for each of them.

|                        | non-dog Truth | dog Truth |
|------------------------|---------------|-----------|
| **non-dog prediction** |       50      |     5     |
| **dog  prediction**    |       10      |     35    |

A Confusion Matrix from a good classifier will tend to have high numbers in the main diagonal (correct classifications) and low numbers elsewhere. 
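As a sketch, the table can be reproduced with `sklearn.metrics.confusion_matrix`. The label arrays are rebuilt from the example (1 = dog, 0 = non-dog) and are an assumption made for illustration. Note that scikit-learn places the *'Truth'* values in the rows and the predictions in the columns, so its output is the transpose of the table shown here.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# 40 pictures with a dog followed by 60 without (1 = dog, 0 = non-dog)
y_true = np.array([1] * 40 + [0] * 60)
# Predictions in the same order: 35 hits, 5 misses, 10 false alarms, 50 correct rejections
y_pred = np.array([1] * 35 + [0] * 5 + [1] * 10 + [0] * 50)

# scikit-learn convention: rows are the true classes, columns are the predictions
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# Transpose to match the layout of the table (rows = predictions, columns = truth)
cm_doc_layout = cm.T
print(cm_doc_layout)
```

Passing `labels=[0, 1]` fixes the order of the categories, so the first row and column always correspond to *non-dog*.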

### Sensitivity, Recall or True Positive Rate

Helps determine the ability of the model to correctly identify *'True'* values in binary classifications. It can be calculated as the ratio of *'True'* values classified as such (true_positives) to the total number of *'True'* values (true_positives and false_negatives).

Normally presented as a percentage:

$$Sensitivity (\%) = 100 * \frac{true\_positives}{true\_positives + false\_negatives}$$

For our example scenario, the sensitivity would be calculated as below:

$$Sensitivity (\%) = 100 * \frac{35}{35 + 5}$$
$$Sensitivity = 87.5\%$$


### Specificity or True Negative Rate

Helps determine the ability of the model to correctly identify *'False'* values in binary classifications. It can be calculated as the ratio of *'False'* values classified as such (true_negatives) to the total number of *'False'* values (true_negatives and false_positives).

Normally presented as a percentage:

$$Specificity (\%) = 100 * \frac{true\_negatives}{true\_negatives + false\_positives}$$

For our example scenario, the specificity would be calculated as below:

$$Specificity (\%) = 100 * \frac{50}{50 + 10}$$
$$Specificity = 83.3\%$$
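Both rates can be checked with a few lines of Python. This is a minimal sketch that starts directly from the four counts of the dog example; the variable names mirror the ones used in the formulas.

```python
# Counts taken from the dog example
true_positives, false_negatives = 35, 5    # the 40 pictures with a dog
true_negatives, false_positives = 50, 10   # the 60 pictures without a dog

# Share of the actual dogs that the model found
sensitivity = 100 * true_positives / (true_positives + false_negatives)
# Share of the actual non-dogs that the model rejected
specificity = 100 * true_negatives / (true_negatives + false_positives)

print(f"Sensitivity = {sensitivity:.1f}%")  # Sensitivity = 87.5%
print(f"Specificity = {specificity:.1f}%")  # Specificity = 83.3%
```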


### Precision

It represents the ratio of results correctly identified as *'True'* values to the total number of values classified as *'True'* by the model.

Normally presented as a percentage:

$$Precision (\%) = 100 * \frac{true\_positives}{true\_positives + false\_positives}$$

For our example scenario, the precision would be calculated as below:

$$Precision (\%) = 100 * \frac{35}{35 + 10}$$
$$Precision = 77.8\%$$

### F1 Score

It combines Precision and Recall into a single metric by taking their harmonic mean.

Normally presented as a percentage. With the precision and recall already expressed as percentages:

$$F1(\%) = 2 * \frac{precision * recall}{precision + recall}$$

For our example scenario, the F1-score would be calculated as below:

$$F1 (\%) =  2 * \frac{77.8 * 87.5}{77.8 + 87.5}$$
$$F1 = 82.4\%$$
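As a cross-check, the precision, recall and F1 values of the example can be obtained with the corresponding functions in `sklearn.metrics`. The label arrays are rebuilt from the example (1 = dog, 0 = non-dog) and are an assumption made for illustration. These functions return fractions between 0 and 1, so they are multiplied by 100 here.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# 40 pictures with a dog followed by 60 without (1 = dog, 0 = non-dog)
y_true = np.array([1] * 40 + [0] * 60)
y_pred = np.array([1] * 35 + [0] * 5 + [1] * 10 + [0] * 50)

# By default the positive class is the label 1 (dog)
precision = 100 * precision_score(y_true, y_pred)
recall = 100 * recall_score(y_true, y_pred)
f1 = 100 * f1_score(y_true, y_pred)

print(f"Precision = {precision:.1f}%")  # Precision = 77.8%
print(f"Recall    = {recall:.1f}%")     # Recall    = 87.5%
print(f"F1        = {f1:.1f}%")         # F1        = 82.4%
```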

## Cross Validation and Grid Search

Below we define these two concepts, which will be used while evaluating our learning models. For more information, consult the following resources, on which the definitions are based:

[Cross Validation and Grid Search for Model Selection in Python](https://stackabuse.com/cross-validation-and-grid-search-for-model-selection-in-python/)\
[Cross Validation and Grid Search](https://amueller.github.io/ml-training-intro/slides/03-cross-validation-grid-search.html#1)\
[Cross-Validation and Hyperparameter Tuning: How to Optimise your Machine Learning Model](https://towardsdatascience.com/cross-validation-and-hyperparameter-tuning-how-to-optimise-your-machine-learning-model-13f005af9d7d)

### Cross Validation

With the standard hold-out method, we select a fixed portion of the dataset for testing while the rest is used for training. One potential problem with this approach is that our evaluation depends heavily on that particular split into training and testing buckets, so we have less confidence that the model will perform well when the split changes.

To deal with this potential problem, cross validation techniques propose evaluating the model with different partitions of the dataset to obtain a better understanding of its ability to generalize. 

The most common technique of this type is known as *K-Fold Cross-Validation*. The process followed with this technique is explained below:

1. Divide the dataset into $k$ groups or folds of (approximately) equal size.
2. Use $k-1$ of the groups to train the algorithm.
3. Use the remaining group (not used for training) to validate the trained model. Obtain the accuracy and other metrics as needed.
4. Repeat steps 2 and 3, rotating the selection until each of the $k$ groups has been used exactly once for validation.
5. Average the accuracy and the other metrics over all the runs. We can also compute their standard deviations to see how stable the model is against these changes.

The image shown below represents this process.

![cross_validation](images/cross_validation.svg "K-Fold Cross Validation")
<center>Image Source: Wikipedia</center>
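The K-fold steps described in this section can be sketched with scikit-learn's `KFold` splitter. The synthetic dataset from `make_classification`, the choice of $k=5$ and the KNN model are assumptions made only to keep the example self-contained; they are not the configuration used for the accidents dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption, not the accidents dataset)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Step 1: divide the dataset into k = 5 folds
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_accuracies = []
for train_idx, val_idx in kfold.split(X):
    # Step 2: train on the k-1 remaining folds
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X[train_idx], y[train_idx])
    # Step 3: validate on the fold that was left out
    fold_accuracies.append(model.score(X[val_idx], y[val_idx]))
    # Step 4: the loop rotates the validation fold

# Step 5: summarize the runs
mean_accuracy = np.mean(fold_accuracies)
std_accuracy = np.std(fold_accuracies)
print(f"Accuracy: {mean_accuracy:.3f} +/- {std_accuracy:.3f}")
```

For the common case, `sklearn.model_selection.cross_val_score` wraps this loop in a single call; the explicit version is shown here only to make the individual steps visible.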

### Grid Search

Most supervised learning models have hyper-parameters that are set before training and that affect the performance of the resulting predictions. For example, we need to choose an appropriate number of neighbors for a *KNN* model, a regularization value for an *SVM* model, a maximum depth for a *Decision Tree* model, etc. With grid search, we configure an automated procedure that sweeps over every combination of the hyper-parameter values we define, searching for the combination that yields the best results. 

For example, let's say we have a model with 2 training hyper-parameters called *A* and *B*. *A* is a positive integer with no upper bound and *B* is a categorical value with 5 possible values. Let's say we are interested in trying all the possible values for *B* and only the following values for *A*:

$$A=\{10, 30, 50, 70, 90\}$$

Since we have 5 values for *A* and 5 values for *B*, the grid search will train and evaluate our algorithm with $5 \times 5 = 25$ different combinations in an attempt to find the best among them. Because the model is retrained for every combination, this type of technique can significantly increase the training time.
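The size of this grid can be verified by enumerating the combinations with `itertools.product`. The category names used for *B* are hypothetical placeholders, since the example does not name them.

```python
from itertools import product

a_values = [10, 30, 50, 70, 90]
# Hypothetical names for the 5 categories of B
b_values = ["b1", "b2", "b3", "b4", "b5"]

# Every pairing of one value of A with one value of B
grid = [{"A": a, "B": b} for a, b in product(a_values, b_values)]

print(len(grid))  # 25
print(grid[0])    # {'A': 10, 'B': 'b1'}
```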

It is worth mentioning that although cross-validation and grid-search are independent and can be used by themselves, they are particularly powerful when used together.
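A minimal sketch of the two techniques working together is shown below, using scikit-learn's `GridSearchCV`, which scores every combination of the grid with K-fold cross-validation. The synthetic data, the KNN model and the specific parameter values are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption, not the accidents dataset)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 5 values x 2 values = 10 combinations to evaluate
param_grid = {
    "n_neighbors": [1, 3, 5, 7, 9],
    "weights": ["uniform", "distance"],
}

# Each combination is scored with 5-fold cross-validation
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])
print("Combinations evaluated:", n_candidates)
print("Best parameters:", search.best_params_)
print("Best mean CV accuracy: {:.3f}".format(search.best_score_))
```

With 10 combinations and 5 folds, the model is trained 50 times, plus one final fit on the full data with the best parameters, which illustrates why the training time grows quickly.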

## Learning Models Evaluation

In this section we present the results of evaluating different learning models for a classification task. More specifically, we will use the already processed **UK Accidents Dataset** to attempt to classify an accident as severe or non-severe given the relevant features in the dataset. 

Below we start by importing the main Python packages that will be used and by loading the dataset, identifying the feature columns and the target column.

In [1]:
# Start by importing relevant python modules
import numpy as np
import pandas as pd
import sklearn
from matplotlib import pyplot as plt
%matplotlib inline

In [2]:
# Load accidents Dataset
df_accidents = pd.read_csv('dataset/uk_accidents_for_sev_prediction.csv')

# Separate features and target columns
features = df_accidents.drop('Severe_Accident', axis=1)
target = df_accidents['Severe_Accident']

In [None]:
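from sklearn import model_selection
# Hold out 20% of the instances for testing; the seed makes the split reproducible
X_train, X_test, y_train, y_test = model_selection.train_test_split(features, target, test_size=0.2, random_state=8)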

In [None]:
from sklearn import neighbors

training_accuracy = []
test_accuracy = []
neighbors_settings = range(1, 11)
for n_neighbors in neighbors_settings:
    # build the classification model
    clf = neighbors.KNeighborsClassifier(n_neighbors=n_neighbors)
    clf.fit(X_train, y_train)
    # store the training set accuracy
    training_accuracy.append(clf.score(X_train, y_train))
    # store the generalization (test set) accuracy
    test_accuracy.append(clf.score(X_test, y_test))
plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
plt.ylabel("Accuracy")
plt.xlabel("n_neighbors")
plt.legend()

In [None]:
from sklearn import linear_model

In [None]:
logreg = linear_model.LogisticRegression(solver='liblinear').fit(X_train, y_train)
print("Training set score: {:.3f}".format(logreg.score(X_train, y_train)))
print("Test set score: {:.3f}".format(logreg.score(X_test, y_test)))

In [None]:
from sklearn import naive_bayes

In [None]:
nbg = naive_bayes.GaussianNB().fit(X_train, y_train)
print("Training set score: {:.3f}".format(nbg.score(X_train, y_train)))
print("Test set score: {:.3f}".format(nbg.score(X_test, y_test)))

In [None]:
from sklearn import tree

In [None]:
b_tree = tree.DecisionTreeClassifier(max_depth=8, random_state=0)
b_tree.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(b_tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(b_tree.score(X_test, y_test)))

In [None]:
print("Feature importances:\n{}".format(b_tree.feature_importances_))

In [None]:
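def plot_feature_importances(model):
    """Plot a horizontal bar chart of the features with importance above 0.001."""
    imps = []
    imp_names = []
    # Column names come from the global `features` DataFrame
    for imp, imp_name in zip(model.feature_importances_, features.columns):
        if imp > 0.001: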
            imps.append(imp)
            imp_names.append(imp_name)
        
    n_features = len(imp_names)
    plt.barh(range(n_features), imps, align='center')
    plt.yticks(np.arange(n_features), imp_names)
    plt.xlabel("Feature importance")
    plt.ylabel("Feature")
    
plot_feature_importances(b_tree)

In [None]:
from sklearn import ensemble

In [None]:
# A fixed random_state makes the forest reproducible between runs
rf = ensemble.RandomForestClassifier(n_estimators=10, random_state=0).fit(X_train, y_train)
print("Training set score: {:.3f}".format(rf.score(X_train, y_train)))
print("Test set score: {:.3f}".format(rf.score(X_test, y_test)))
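The cells above report accuracy only. As a sketch of how the other metrics introduced in this notebook could be obtained for any fitted model, the hypothetical helper `evaluate_classifier` below derives them from the confusion matrix. It is demonstrated on synthetic data from `make_classification`, which is an assumption made to keep the example self-contained; with the notebook's own variables it could be called as `evaluate_classifier(rf, X_test, y_test)`, assuming the target is encoded as 0 and 1.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split


def evaluate_classifier(model, X_eval, y_eval):
    """Return the metrics of this notebook as fractions (hypothetical helper)."""
    # For labels [0, 1], ravel() yields the counts in the order tn, fp, fn, tp
    tn, fp, fn, tp = confusion_matrix(
        y_eval, model.predict(X_eval), labels=[0, 1]
    ).ravel()
    return {
        "accuracy": float((tp + tn) / (tp + tn + fp + fn)),
        # Guard against empty denominators when a class is never seen or predicted
        "sensitivity": float(tp / (tp + fn)) if (tp + fn) else 0.0,
        "specificity": float(tn / (tn + fp)) if (tn + fp) else 0.0,
        "precision": float(tp / (tp + fp)) if (tp + fp) else 0.0,
        "f1": float(2 * tp / (2 * tp + fp + fn)) if (2 * tp + fp + fn) else 0.0,
    }


# Synthetic, imbalanced stand-in data (assumption, not the accidents dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=8, stratify=y_demo
)

demo_model = LogisticRegression(solver="liblinear").fit(Xd_train, yd_train)
results = evaluate_classifier(demo_model, Xd_test, yd_test)
for name, value in results.items():
    print("{}: {:.3f}".format(name, value))
```

Looking at sensitivity and precision next to accuracy is especially useful if one of the classes is much less frequent than the other, since a model can then reach a high accuracy while missing most instances of the rare class.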