# Tutorial 4.1 - Model Evaluation

The performance metrics that you choose to evaluate your machine learning algorithms are very important.

The choice of performance metrics influences how the performance of machine learning algorithms is measured and compared. It also influences how you weigh the importance of different characteristics in the results and, ultimately, which algorithm you choose.

**Expected Tasks**

- Follow the performance metrics showcase.
- Play with the different performance metric variants. 

**Learning Objectives**

- Choosing the right performance metric according to your task and data. 
- Computing a performance metric in scikit-learn, given a certain evaluation method. 

**Notes**

Performance metrics are demonstrated in this notebook using small code recipes in Python and scikit-learn.

More information on the performance metrics supported by scikit-learn is available on the page [Model evaluation: quantifying the quality of predictions](https://scikit-learn.org/stable/modules/model_evaluation.html).

In [None]:
# Traditional packages
import matplotlib.pyplot as plt
import seaborn as sn
import pandas as pd
import numpy as np
import math

%matplotlib inline

# Supporting packages
from sklearn import datasets, metrics, model_selection

# Model packages
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.dummy import DummyClassifier

## Data Generation

We already discussed why generating datasets is useful for different purposes, such as regression and classification. Now we can see how to do this for classification, and use this sample dataset to showcase how you can compute the different performance metrics covered in the lecture.

Please note that information on synthetic data generation can be found in the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html).

In [None]:
X, y = datasets.make_classification(n_samples=100, n_features=10, n_redundant=0, n_informative=10, class_sep=0.5)

In [None]:
X.shape, y.shape

For a better understanding, we just show some examples for the features... 

In [None]:
X[0, :10]

...and for the target variable. 

In [None]:
y[:10]

## Performance Metrics for Classification

Classification problems are perhaps the most common type of machine learning problem and, as such, there is a wide range of performance metrics that can be used to evaluate predictions for these problems. On the other hand, having so many performance metrics requires you to carefully understand, motivate, and discuss the ones you consider most relevant for your problem, according to the data and the model you selected.

In this tutorial, we will review how to compute and report:

- Accuracy.
- Balanced Accuracy. 
- Precision.
- Recall. 
- F-Measure.
- Area Under the ROC Curve (AUC).
- Confusion Matrix.
- Classification Report.


*Note that your results may vary given the stochastic nature of the evaluation procedure or the differences in numerical precision, based on your device.* 

As you have seen in the lecture, you may need to compute performance metrics on a training-test split or through a re-sampling method. In this tutorial, due to time constraints, we will see how to compute performance metrics for an example training-test split and for an example cross-validation strategy, both on a RandomForest classifier. Please note that you can easily adapt our showcase to the other cases.

In [None]:
kfold_cv = model_selection.KFold(n_splits=10, shuffle=True, random_state=0)

In [None]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2,  random_state=0)

### Accuracy

The accuracy represents the percentage of correct predictions your model has made. This is one of the most common and intuitive evaluation metrics for classification problems. However, it is also the most misused. It is really only suitable when there is an equal number of observations in each class (which is rarely the case) and when all predictions and prediction errors are equally important (which is also rarely the case). Below is an example of calculating accuracy.

More details on the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html). 
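
Before applying the metric to a trained model, it may help to see the definition at work on a handful of labels. The snippet below is a minimal sketch on hand-made toy labels (the `toy_*` names and values are illustrative assumptions, not part of the tutorial's dataset); it compares a manual computation with accuracy_score().

```python
import numpy as np
from sklearn import metrics

# Hand-made toy labels (illustrative only, not the tutorial's dataset)
toy_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
toy_pred = np.array([0, 0, 1, 0, 1, 0, 1, 1])

# Accuracy is the fraction of positions where prediction and truth agree
toy_accuracy_manual = np.mean(toy_true == toy_pred)
toy_accuracy_sklearn = metrics.accuracy_score(toy_true, toy_pred)

print("Manual accuracy:  %.3f" % toy_accuracy_manual)
print("sklearn accuracy: %.3f" % toy_accuracy_sklearn)
```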

In [None]:
# Example of a cross-validation scenario 
results = model_selection.cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=kfold_cv, scoring='accuracy')
print("Accuracy: %.3f (%.3f)" % (results.mean(), results.std()))

# Example for a training-test split
model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy: %.3f" % (metrics.accuracy_score(y_test, y_pred)))

### Balanced Accuracy 

The accuracy is the total number of correct predictions divided by the total number of predictions made for a dataset. As a performance measure, accuracy is inappropriate for imbalanced classification problems. The main reason is that the number of examples from the majority class (or classes) may be far higher than the number of examples in the minority class. Hence, if we have a dataset with 90% of the samples from the majority class, a model that always predicts the majority class can achieve an accuracy score of 90%.

The balanced accuracy is an alternative metric to accuracy in binary and multiclass classification problems, suitable in case of imbalanced datasets. It is defined as the average of the recall obtained on each class, i.e., the fraction of samples of each class that are correctly classified, averaged over the classes.

More details on the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html). 
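
The 90% scenario described above can be reproduced in a few lines. This is a minimal sketch with hypothetical labels (90 samples of class 0 and 10 of class 1, all `toy_*` names are assumptions made for illustration) and a "model" that always predicts the majority class.

```python
import numpy as np
from sklearn import metrics

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1
toy_true = np.array([0] * 90 + [1] * 10)
# A "model" that always predicts the majority class
toy_pred = np.zeros_like(toy_true)

toy_acc = metrics.accuracy_score(toy_true, toy_pred)
toy_bal_acc = metrics.balanced_accuracy_score(toy_true, toy_pred)
# Recall computed separately for class 0 and class 1
toy_recalls = metrics.recall_score(toy_true, toy_pred, average=None)

print("Accuracy:          %.3f" % toy_acc)
print("Balanced accuracy: %.3f" % toy_bal_acc)
print("Per-class recall: ", toy_recalls)
```

The accuracy looks good only because class 0 dominates; the balanced accuracy is the mean of the two per-class recalls, so a class that is never predicted pulls it down.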

In [None]:
# Example of a cross-validation scenario 
results = model_selection.cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=kfold_cv, scoring='balanced_accuracy')
print("Balanced Accuracy: %.3f (%.3f)" % (results.mean(), results.std()))

# Example for a training-test split
model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Balanced Accuracy: %.3f" % (metrics.balanced_accuracy_score(y_test, y_pred)))

### Confusion Matrix

The confusion matrix is a handy presentation of the predictions of a model for two or more classes. The table presents the predicted labels on the x-axis and the true labels on the y-axis. Each cell counts the samples with a given true label that received a given predicted label; in the plot below the counts are normalized by row (normalize='true'), so each cell shows the fraction of samples of that true class. For example, a binary classifier can predict 0 or 1, and each prediction may actually have been a 0 or 1. Predictions for 0 that were actually 0 appear in the cell for True label=0 and Predicted label=0, whereas predictions for 0 that were actually 1 appear in the cell for True label=1 and Predicted label=0, and so on.

More details on the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html). 
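
To make the cell layout concrete, the sketch below builds the matrix for a few hand-made labels (the `toy_*` values are illustrative assumptions). In scikit-learn, rows correspond to true labels and columns to predicted labels, so for a binary problem ravel() yields the counts in the order TN, FP, FN, TP.

```python
import numpy as np
from sklearn import metrics

# Hand-made toy labels (illustrative only)
toy_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
toy_pred = np.array([0, 0, 1, 0, 1, 0, 1, 1])

# Rows are true labels, columns are predicted labels
toy_cm = metrics.confusion_matrix(toy_true, toy_pred)
toy_tn, toy_fp, toy_fn, toy_tp = toy_cm.ravel()

# Row-normalized version: each row sums to 1
toy_cm_norm = metrics.confusion_matrix(toy_true, toy_pred, normalize='true')

print(toy_cm)
print("TN=%d FP=%d FN=%d TP=%d" % (toy_tn, toy_fp, toy_fn, toy_tp))
print(toy_cm_norm)
```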

In [None]:
model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
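# plot_confusion_matrix() was removed in scikit-learn 1.2; ConfusionMatrixDisplay.from_estimator() replaces it
metrics.ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, display_labels=np.arange(0, 2), cmap=plt.cm.Blues, normalize='true');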

### Precision and Recall

An alternative to accuracy in the case of imbalanced problems is given by the precision and recall metrics, which are useful when there is an imbalance in the observations between the two (or more) classes. For instance, consider a dataset with many examples of no event (class 0) and only a few examples of an event (class 1). In this case, the large number of class 0 examples means that we are less interested in the ability of the model to predict class 0 correctly, i.e. in a high number of true negatives. The key point is that neither precision nor recall uses the true negatives: both are concerned only with how well the minority class, class 1, is predicted.

$P = \frac{TP}{TP + FP}$

$R = \frac{TP}{TP + FN}$ 

More details on [precision](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score) and [recall](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html) are provided in the scikit-learn documentation.
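
The two formulas can be checked by hand on a small example. The sketch below uses hand-made labels (all `toy_*` names and values are illustrative assumptions), reads TP, FP and FN from the confusion matrix, and compares the result with precision_score() and recall_score().

```python
import numpy as np
from sklearn import metrics

# Hand-made toy labels with class 1 as the (minority) positive class
toy_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
toy_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

toy_tn, toy_fp, toy_fn, toy_tp = metrics.confusion_matrix(toy_true, toy_pred).ravel()

# P = TP / (TP + FP) and R = TP / (TP + FN); the true negatives are never used
toy_precision_manual = toy_tp / (toy_tp + toy_fp)
toy_recall_manual = toy_tp / (toy_tp + toy_fn)

toy_precision = metrics.precision_score(toy_true, toy_pred)
toy_recall = metrics.recall_score(toy_true, toy_pred)

print("Precision: manual %.3f, sklearn %.3f" % (toy_precision_manual, toy_precision))
print("Recall:    manual %.3f, sklearn %.3f" % (toy_recall_manual, toy_recall))
```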

In [None]:
# Example of a cross-validation scenario 
results = model_selection.cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=kfold_cv, scoring='precision')
print("Precision: %.3f (%.3f)" % (results.mean(), results.std()))

# Example for a training-test split
model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Precision: %.3f" % (metrics.precision_score(y_test, y_pred)))

In [None]:
# Example of a cross-validation scenario 
results = model_selection.cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=kfold_cv, scoring='recall')
print("Recall: %.3f (%.3f)" % (results.mean(), results.std()))

# Example for a training-test split
model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Recall: %.3f" % (metrics.recall_score(y_test, y_pred)))

You may decide to use precision or recall on your imbalanced classification problem. Maximizing precision will minimize the number of false positives, whereas maximizing recall will minimize the number of false negatives. Therefore:

- Precision might be more appropriate when minimizing false positives is the focus.
- Recall might be more appropriate when minimizing false negatives is the focus.

Sometimes, you may want good predictions of the positive class overall, that is, both high precision and high recall. This can be challenging, as increases in recall often come at the expense of decreases in precision. In imbalanced datasets, the goal is to improve recall without hurting precision. These goals, however, often conflict: increasing the true positives for the minority class typically also increases the number of false positives, which reduces precision.

You might wonder what to do when you have a multi-class problem. The average parameter of the metric functions (e.g., precision and recall) is required for multiclass/multilabel targets and determines the type of averaging performed on the data:

- 'binary': only report results for the class specified by pos_label. This is applicable only if targets are binary.
- 'micro': calculate metrics globally by counting the total true positives, false negatives and false positives.
- 'macro': calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
- 'weighted': calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters 'macro' to account for label imbalance; it can result in an F-score that is not between precision and recall.
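
The averaging options can be compared on a small three-class example. This is a minimal sketch on hand-made labels (the `toy_*` names and values are assumptions for illustration); it also recomputes the 'macro' and 'weighted' averages from the per-class precisions returned with average=None.

```python
import numpy as np
from sklearn import metrics

# Hand-made three-class toy labels (illustrative only)
toy_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
toy_pred = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 0])

# average=None returns one precision value per class
toy_per_class = metrics.precision_score(toy_true, toy_pred, average=None)
# Support: number of true instances of each class
toy_support = np.bincount(toy_true)

toy_micro = metrics.precision_score(toy_true, toy_pred, average='micro')
toy_macro = metrics.precision_score(toy_true, toy_pred, average='macro')
toy_weighted = metrics.precision_score(toy_true, toy_pred, average='weighted')

print("Per class:", toy_per_class)
print("Micro:    %.3f" % toy_micro)
print("Macro:    %.3f (unweighted mean: %.3f)" % (toy_macro, toy_per_class.mean()))
print("Weighted: %.3f (support-weighted mean: %.3f)"
      % (toy_weighted, np.average(toy_per_class, weights=toy_support)))
```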

### F-Measure or F1

F-Measure or F1 score provides a way to combine both precision and recall into a single measure that captures both properties. Taken alone, neither precision nor recall tells the whole story: you may have a very high precision and a very low recall or, alternatively, a very low precision with a very high recall. F1 expresses both performance metrics with a single score. Once precision and recall have been calculated for a binary or multiclass classification problem, the two scores can be combined into the F-Measure, which is their harmonic mean.

$F_1 = \frac{2 \cdot P \cdot R}{P + R}$

More details on the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html). 
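
As a quick sanity check of the formula, the sketch below combines precision and recall by hand and compares the result with f1_score(). The labels are hand-made toy values (all `toy_*` names are illustrative assumptions).

```python
import numpy as np
from sklearn import metrics

# Hand-made toy labels with class 1 as the positive class
toy_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
toy_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

toy_p = metrics.precision_score(toy_true, toy_pred)
toy_r = metrics.recall_score(toy_true, toy_pred)

# Harmonic mean of precision and recall
toy_f1_manual = 2 * toy_p * toy_r / (toy_p + toy_r)
toy_f1 = metrics.f1_score(toy_true, toy_pred)

print("P=%.3f R=%.3f" % (toy_p, toy_r))
print("F1: manual %.3f, sklearn %.3f" % (toy_f1_manual, toy_f1))
```

Because the harmonic mean is dominated by the smaller of the two values, F1 stays low whenever either precision or recall is low.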

In [None]:
# Example of a cross-validation scenario 
results = model_selection.cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=kfold_cv, scoring='f1')
print("F1: %.3f (%.3f)" % (results.mean(), results.std()))

# Example for a training-test split
model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("F1: %.3f" % (metrics.f1_score(y_test, y_pred)))

### Area Under ROC Curve 

The AUC represents the model’s ability to discriminate between positive and negative classes. An area of 1.0 represents a model that made all predictions perfectly. An area of 0.5 represents a model as good as random. A ROC Curve is a plot of the true positive rate against the false positive rate for a given set of probability predictions at different thresholds used to map the probabilities to class labels. The AUC score thus indicates the Area Under the Receiver Operating Characteristic Curve (ROC AUC) computed from prediction scores.

More details on the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html). 
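
The relation between the ROC curve and the AUC can be illustrated on four samples. The sketch below uses hand-made labels and scores (the `toy_*` names and values are illustrative assumptions) and checks that integrating the curve returned by roc_curve() with metrics.auc() gives the same value as roc_auc_score().

```python
import numpy as np
from sklearn import metrics

# Hand-made toy labels and predicted scores for the positive class
toy_true = np.array([0, 0, 1, 1])
toy_scores = np.array([0.1, 0.4, 0.35, 0.8])

# One (FPR, TPR) point per decision threshold
toy_fpr, toy_tpr, toy_thresholds = metrics.roc_curve(toy_true, toy_scores)

# Area under the curve, computed in two equivalent ways
toy_auc_from_curve = metrics.auc(toy_fpr, toy_tpr)
toy_auc_direct = metrics.roc_auc_score(toy_true, toy_scores)

print("FPR:", toy_fpr)
print("TPR:", toy_tpr)
print("AUC from curve: %.3f, roc_auc_score: %.3f" % (toy_auc_from_curve, toy_auc_direct))
```

Note that the inputs are continuous scores rather than hard 0/1 predictions, so that several thresholds can be evaluated.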

In [None]:
# Example of a cross-validation scenario 
results = model_selection.cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=kfold_cv, scoring='roc_auc')
print("AUC: %.3f (%.3f)" % (results.mean(), results.std()))

# Example for a training-test split
model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)
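y_pred = model.predict(X_test)
# ROC AUC needs scores rather than hard labels: use the predicted probability of class 1
y_score = model.predict_proba(X_test)[:, 1]
print("AUC: %.3f" % (metrics.roc_auc_score(y_test, y_score)))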

The area under the curve is the approximate integral under the ROC Curve. For more information on ROC Curves and ROC AUC, see [this tutorial](https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html). Here, we show you a short example on how to plot it. 

In [None]:
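# Use the predicted probability of class 1 so the curve has more than one threshold
fpr, tpr, _ = metrics.roc_curve(y_test, model.predict_proba(X_test)[:, 1])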

plt.figure()
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], 'red', linestyle='--', linewidth=1)  # diagonal = random guessing
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.grid()
plt.show()

The dashed red diagonal corresponds to a classifier that guesses at random; the further the curve bends towards the top-left corner, the better the model separates the two classes.

### Classification Report
Scikit-learn provides a convenience report when working on classification to give you a quick idea of the accuracy of a model using a number of measures.

The classification_report() function displays the precision, recall, f1-score and support for each class.

In [None]:
pd.DataFrame(metrics.classification_report(y_test, y_pred, output_dict=True))

### Dummy Classifiers

Scikit-learn provides handy classifiers that make predictions using simple rules. These classifiers are useful as simple baselines to compare with other (real) classifiers. For instance, you could define a classifier that generates predictions by respecting the training set’s class distribution or a classifier that always predicts the most frequent label in the training set.

More details on the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html). 

In [None]:
model = DummyClassifier(random_state=0, strategy='most_frequent')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy: %.3f" % (metrics.accuracy_score(y_test, y_pred)))

## Performance Metrics for Regression

In this section, we will review three of the most common performance metrics for evaluating predictions on regression machine learning problems:

- Mean Absolute Error.
- (Root) Mean Squared Error.
- R^2.

We will use here a synthetic dataset for regression. More information on synthetic data generation can be found in the [ScikitLearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html).  

In [None]:
X, y = datasets.make_regression(n_samples=100, n_features=10, noise=0.4)

In [None]:
X.shape, y.shape

Again, for a better understanding, we just show some examples for the features... 

In [None]:
X[0, :10]

In [None]:
y[:10]

For convenience, we scale target values between 0 and 1. 

In [None]:
y = (y - y.min()) / (y.max() - y.min())

In [None]:
y[:10]

As in the classification case, we will see how to compute performance metrics for an example training-test split and for an example cross-validation strategy, both on a RandomForest regressor. Please note that you can easily adapt our showcase to the other cases.

In [None]:
kfold_cv = model_selection.KFold(n_splits=10, shuffle=True, random_state=0)

In [None]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2,  random_state=0)

### Mean Absolute Error

The Mean Absolute Error (or MAE) is the average of the absolute differences between predictions and actual values. The measure gives an idea of the magnitude of the error, but no idea of the direction (e.g. over or under predicting).

More details on the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html).
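
The definition translates directly into NumPy. The sketch below uses a few hand-made target values and predictions (the `toy_*` names and values are illustrative assumptions) and compares a manual computation with mean_absolute_error().

```python
import numpy as np
from sklearn import metrics

# Hand-made toy targets and predictions (illustrative only)
toy_true = np.array([3.0, -0.5, 2.0, 7.0])
toy_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Average of the absolute differences; the sign of each error is discarded
toy_mae_manual = np.mean(np.abs(toy_true - toy_pred))
toy_mae = metrics.mean_absolute_error(toy_true, toy_pred)

print("MAE: manual %.3f, sklearn %.3f" % (toy_mae_manual, toy_mae))
```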

In [None]:
# Example of a cross-validation scenario 
results = model_selection.cross_val_score(RandomForestRegressor(random_state=0), X, y, cv=kfold_cv, scoring='neg_mean_absolute_error')
# The 'neg_mean_absolute_error' scorer returns the negated MAE, so flip the sign before reporting
print("MAE: %.3f (%.3f)" % (-results.mean(), results.std()))

# Example for a training-test split
model = RandomForestRegressor(random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("MAE: %.3f" % (metrics.mean_absolute_error(y_test, y_pred)))

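A value of 0 indicates no error, i.e. perfect predictions. Please note that the 'neg_mean_absolute_error' scorer used by the cross_val_score() function returns the negated MAE, so that higher scores are always better; flip the sign to recover the MAE.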

### Root Mean Squared Error (RMSE)
The Root Mean Squared Error (RMSE) is a quadratic scoring rule which measures the average magnitude of the error. Expressing the formula in words, the differences between the predictions and the true values are each squared and then averaged over the sample. Finally, the square root of the average is taken. Since the errors are squared before they are averaged, the RMSE gives a relatively high weight to large errors. This means the RMSE is most useful when large errors are particularly undesirable.

More details on the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html).
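
The verbal formula above can be written out step by step. This is a minimal sketch on hand-made values (the `toy_*` names and values are illustrative assumptions); it also prints the MAE of the same predictions to show how the single larger error weighs more heavily in the RMSE.

```python
import numpy as np
from sklearn import metrics

# Hand-made toy targets and predictions (illustrative only)
toy_true = np.array([3.0, -0.5, 2.0, 7.0])
toy_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Square the differences, average them, then take the square root
toy_rmse_manual = np.sqrt(np.mean((toy_true - toy_pred) ** 2))
toy_rmse = np.sqrt(metrics.mean_squared_error(toy_true, toy_pred))
toy_mae = metrics.mean_absolute_error(toy_true, toy_pred)

print("RMSE: manual %.3f, sklearn %.3f" % (toy_rmse_manual, toy_rmse))
print("MAE on the same predictions: %.3f" % toy_mae)
```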

In [None]:
# Example of a cross-validation scenario 
results = model_selection.cross_val_score(RandomForestRegressor(random_state=0), X, y, cv=kfold_cv, scoring='neg_root_mean_squared_error')
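# The 'neg_root_mean_squared_error' scorer returns the negated RMSE, so flip the sign before reporting
print("RMSE: %.3f (%.3f)" % (-results.mean(), results.std()))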

# Example for a training-test split
model = RandomForestRegressor(random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
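# Taking the square root of the MSE works across scikit-learn versions (the squared argument is not available in recent releases)
print("RMSE: %.3f" % (np.sqrt(metrics.mean_squared_error(y_test, y_pred))))

As with the MAE, the cross-validation scorer returns the negated value so that higher scores are always better. To obtain the Mean Squared Error (MSE) instead, call metrics.mean_squared_error() without taking the square root, or use the 'neg_mean_squared_error' scorer.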

### R^2 Metric
The R^2 (or R Squared) metric provides an indication of the goodness of fit of a set of predictions to the actual values. In statistical literature, this measure is called the coefficient of determination. The best possible value is 1 (perfect fit); a model that always predicts the mean of the target scores 0, and models that do worse than that can obtain negative values.

More details on the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html). 
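
The range of the metric is easier to see on a small example. The sketch below uses hand-made values (all `toy_*` names and values are illustrative assumptions) and computes R^2 as one minus the ratio between the residual sum of squares and the total sum of squares, for a reasonable prediction, for a constant prediction equal to the mean of the targets, and for a deliberately poor prediction.

```python
import numpy as np
from sklearn import metrics

# Hand-made toy targets (illustrative only)
toy_true = np.array([3.0, -0.5, 2.0, 7.0])

toy_good = np.array([2.5, 0.0, 2.0, 8.0])         # close to the targets
toy_mean = np.full_like(toy_true, toy_true.mean())  # always predicts the mean
toy_bad = toy_true[::-1]                          # targets in reversed order

def toy_r2_manual(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

toy_r2_good = metrics.r2_score(toy_true, toy_good)
toy_r2_mean = metrics.r2_score(toy_true, toy_mean)
toy_r2_bad = metrics.r2_score(toy_true, toy_bad)

print("Good prediction: %.3f (manual %.3f)" % (toy_r2_good, toy_r2_manual(toy_true, toy_good)))
print("Mean prediction: %.3f" % toy_r2_mean)
print("Poor prediction: %.3f" % toy_r2_bad)
```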

In [None]:
# Example of a cross-validation scenario 
results = model_selection.cross_val_score(RandomForestRegressor(random_state=0), X, y, cv=kfold_cv, scoring='r2')
print("R2: %.3f (%.3f)" % (results.mean(), results.std()))

# Example for a training-test split
model = RandomForestRegressor(random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("R2: %.3f" % (metrics.r2_score(y_test, y_pred)))

## Summary

In this tutorial, you discovered performance metrics that you can use to evaluate your machine-learning models, covering both classification and regression cases. Furthermore, you have seen how to compute them in scikit-learn. Please remember that the choice of performance metrics depends on the data and the problem at hand. Therefore, you should be able to justify why you picked a given set of performance metrics.