# ML Algorithm Performance Metrics

The metrics you choose to evaluate your ML algorithms are very important. The choice of metrics influences how the performance of ML algorithms is measured and compared, how you weight the importance of different characteristics in the results, and ultimately which algorithm you select.

We will discover how to select and use different ML performance metrics using (for the moment, still) only Python with *scikit-learn*.

## Algorithm Evaluation Metrics

We will see various algorithm evaluation metrics and demonstrate them for both classification and regression type ML problems. In this section we focus on how to evaluate and compare the algorithms themselves, so we need more models, and to do so we need more input datasets:

* For CLASSIFICATION metrics, the Pima Indians onset of diabetes dataset is used as demonstration. This is a binary classification problem where all of the input variables are numeric.
* For REGRESSION metrics, we introduce the Boston House Price dataset and we use it as demonstration. This is a regression problem where all of the input variables are also numeric.


We do not focus on modelling itself in this notebook, so we use ***Logistic Regression*** for the classification problem and ***Linear Regression*** for the regression problem.

A 10-fold CV test harness is used to demonstrate each metric (because this is a likely scenario in which you will employ different algorithm evaluation metrics).
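To make the 10-fold harness concrete, here is a minimal sketch of how `KFold` partitions the rows. It assumes the 768 rows of the Pima Indians dataset and uses placeholder features, since only the row count matters here; the names `X_demo`, `kfold_demo` and `test_sizes` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 768                      # number of rows in the Pima Indians dataset
X_demo = np.zeros((n_samples, 1))    # placeholder features: only the row count matters

kfold_demo = KFold(n_splits=10, shuffle=True, random_state=7)

# Each row lands in the held-out (test) part of exactly one fold
test_sizes = [len(test_idx) for _, test_idx in kfold_demo.split(X_demo)]
print("Held-out rows per fold:", test_sizes)
print("Total held-out rows   :", sum(test_sizes))
```

Every fold holds out roughly one tenth of the data, and the fold sizes differ by at most one row.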

More about ML algorithm performance metrics supported by scikit-learn can be found [here](http://scikit-learn.org/stable/modules/model_evaluation.html) on the page "Model evaluation: quantifying the quality of predictions".

*CAVEAT. A caveat in these recipes is the `cross_val_score` function from `sklearn.model_selection` (more [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html)), used to report the performance in each recipe. It does allow the use of the different scoring metrics that will be discussed, but all scores are reported so that larger is better (the largest score is best). Some evaluation metrics (like mean squared error) are naturally losses (the smallest value is best) and as such are reported as negative by the `cross_val_score()` function. This is important to note, because some scores will be reported as negative that by definition can never be negative. In other words, sklearn (for historical API reasons) always tries to maximize scores, so loss functions (like MSE) have to be negated.*
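The sign convention described in the caveat can be checked on a small example. This is only a sketch on synthetic data from `make_regression` (an assumption, not one of the course datasets); the names `X_syn`, `y_syn`, `neg_mse` and `mse` are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, used only to illustrate the sign convention
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=7)

cv_demo = KFold(n_splits=5, shuffle=True, random_state=7)
neg_mse = cross_val_score(LinearRegression(), X_syn, y_syn,
                          cv=cv_demo, scoring='neg_mean_squared_error')

# The scorer returns -MSE, so every fold score is <= 0; flip the sign to read it as an error
mse = -neg_mse
print("Scores as returned  :", neg_mse)
print("Mean squared errors :", mse)
```

Sorting models by the returned score in descending order therefore still puts the model with the smallest error first.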

# CLASSIFICATION Metrics

## 0. Import the data

In [None]:
import pandas as pd

url = 'https://raw.githubusercontent.com/dbonacorsi/AMLBas2324/main/pima-indians-diabetes.data.csv'

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv(url, names=names)
data

Classification problems are perhaps the most common type of ML problem and as such there is a myriad of metrics that can be used to evaluate predictions for these problems. In this section we will review 5 classification metrics.


Here they are:
* [CLAS-1] Classification Accuracy
* [CLAS-2] Logarithmic Loss
* [CLAS-3] Area Under ROC Curve
* [CLAS-4] Confusion Matrix
* [CLAS-5] Classification Report

## [CLAS-1] Classification Accuracy

Classification accuracy is **the number of correct predictions made as a ratio of all predictions made**.

This is the most common evaluation metric for classification problems, and it is also often the most misused.

**It is really only suitable when there are an equal number of observations in each class** (which is rarely the case) and when all predictions and prediction errors are equally important (which is often not the case).
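A small sketch of why accuracy can mislead on imbalanced classes. The labels below are made up for illustration (95 negatives and 5 positives), and the "model" simply predicts the majority class every time.

```python
import numpy as np

# Hypothetical imbalanced labels: 95 negatives, 5 positives
y_true_imb = np.array([0] * 95 + [1] * 5)

# A trivial "model" that always predicts the majority class
y_pred_majority = np.zeros_like(y_true_imb)

accuracy_majority = (y_true_imb == y_pred_majority).mean()
positives_found = int(((y_pred_majority == 1) & (y_true_imb == 1)).sum())

print("Accuracy            :", accuracy_majority)
print("Positives identified:", positives_found)
```

The accuracy looks high although not a single positive case is identified, which is why accuracy alone is a poor guide when classes are unbalanced.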


Below is an example of calculating classification accuracy.

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
#
from sklearn.linear_model import LogisticRegression

In [None]:
array = data.values
X = array[:,0:8]
Y = array[:,8]

In [None]:
seed = 7

We do k-fold CV.

In [None]:
kfold = KFold(n_splits=10, random_state=seed, shuffle=True)
model = LogisticRegression(solver='lbfgs', max_iter=5000)

In [None]:
# Cross Validation Classification Accuracy
scoring = 'accuracy'                                             # <---
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("Accuracy: %.3f (%.3f)" % (results.mean(), results.std()))

You can see that the accuracy ratio is reported: we built a model that is approximately 78% accurate.

## <font color='red'>Exercise 1</font>

Measure the time it takes to run the previous cell, by running k-fold CV with different k's, and compare timing and accuracies obtained.

## <font color='green'>Solution 1</font>

In [None]:
# enter your code here

## [CLAS-2] Logarithmic Loss (aka "logloss")

Logarithmic loss (or logloss) is **a performance metric for evaluating the predictions of probabilities of membership to a given class**.

The scalar probability between 0 and 1 can be seen as a measure of confidence for a prediction by an algorithm. Predictions that are correct or incorrect are rewarded or punished proportionally to the confidence of the prediction.
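The reward and punishment by confidence can be seen by computing the binary logloss by hand. This is a minimal sketch of the standard formula, $-\frac{1}{N}\sum_i \left[y_i \log p_i + (1-y_i)\log(1-p_i)\right]$; the helper `binary_log_loss` and the probabilities are illustrative, not part of the notebook.

```python
import numpy as np

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary logloss; p_pred is the predicted probability of class 1."""
    y = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 to avoid log(0)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# One sample whose true class is 1, predicted with different confidence levels
confident_right = binary_log_loss([1], [0.9])
hesitant_right = binary_log_loss([1], [0.6])
confident_wrong = binary_log_loss([1], [0.1])

print("confident and right:", round(confident_right, 3))
print("hesitant and right :", round(hesitant_right, 3))
print("confident and wrong:", round(confident_wrong, 3))
```

A confident wrong prediction costs far more than a hesitant correct one, so logloss discourages overconfident probabilities.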


Below is an example of calculating logloss for Logistic regression predictions on the Pima Indians onset of diabetes dataset.


In [None]:
# Cross Validation Classification LogLoss
scoring = 'neg_log_loss'                      #<---
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("Logloss: %.3f (%.3f)" % (results.mean(), results.std()))

Smaller logloss is better, with 0 representing a perfect logloss. The measure is negated by the `neg_log_loss` scorer so that larger is better when using the `cross_val_score()` function (see the documentation), which is why the value printed above is negative.


> Go to the slides for more examples and explanations.


## [CLAS-3] Area Under ROC Curve

Area under ROC Curve (or AUC for short) is **a performance metric for binary classification problems**.

ROC can be broken down into **sensitivity** and **specificity**. A binary classification problem is really a trade-off between sensitivity and specificity.
* Sensitivity is the true positive rate (TPR), also called the recall. It is the fraction of instances from the positive class that were actually predicted correctly.
* Specificity is also called the true negative rate (TNR). It is the fraction of instances from the negative class that were actually predicted correctly.
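Both quantities are simple ratios of confusion-matrix counts. The sketch below uses made-up counts, and the helper name `sensitivity_specificity` is illustrative.

```python
def sensitivity_specificity(tn, fp, fn, tp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate: correct among actual positives
    specificity = tn / (tn + fp)   # true negative rate: correct among actual negatives
    return sensitivity, specificity

# Hypothetical counts: 50 actual positives, 100 actual negatives
sens, spec = sensitivity_specificity(tn=90, fp=10, fn=20, tp=30)
print("Sensitivity (TPR):", sens)
print("Specificity (TNR):", spec)
```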

The AUC represents a model’s ability to discriminate between positive and negative classes. An area of 1.0 represents a model that made all predictions perfectly. An area of 0.5 represents a model that is as good as random.
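One way to read the AUC is as the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The sketch below computes it by comparing all positive/negative pairs; the helper `pairwise_auc` and the scores are made up for illustration, and this is not how scikit-learn computes the value internally.

```python
import numpy as np

def pairwise_auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    wins = (pos > neg).sum()
    ties = (pos == neg).sum()
    return float((wins + 0.5 * ties) / (pos.size * neg.size))

# Hypothetical predicted probabilities for positive and negative instances
auc_demo = pairwise_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
auc_perfect = pairwise_auc([0.9, 0.8], [0.2, 0.1])

print("AUC (one positive ranked below a negative):", round(auc_demo, 3))
print("AUC (perfect separation)                  :", auc_perfect)
```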

The example below provides a demonstration of calculating AUC.

In [None]:
# Cross Validation Classification ROC AUC
scoring = 'roc_auc'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("AUC: %.3f (%.3f)" % (results.mean(), results.std()))

You can see the AUC is relatively close to 1 and greater than 0.5, suggesting some skill in the predictions.

## [CLAS-4] Confusion Matrix

The confusion matrix is **a handy (and more informative) presentation of the accuracy of a model with two or more classes**.

The table presents predictions on the x-axis and actual outcomes on the y-axis. The cells of the table are the numbers of predictions made by an ML algorithm. See the lectures for some examples.

Below is an example of calculating a confusion matrix for a set of predictions by a Logistic Regression on the Pima Indians onset of diabetes dataset.

*E.g. a ML algorithm can predict 0 or 1 and each prediction may actually have been a 0 or 1. Predictions for 0 that were actually 0 appear in the cell for prediction = 0 and actual = 0, whereas predictions for 0 that were actually 1 appear in the cell for prediction = 0 and actual = 1. And so on.*
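The cell layout can be made explicit by attaching headings to the array returned by `confusion_matrix`, whose rows are the actual classes and whose columns are the predicted classes. The labels below are made up for illustration; `cm_demo` and `cm_table` are illustrative names.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels for six instances
y_actual = [0, 0, 0, 1, 1, 0]
y_predicted = [0, 1, 0, 1, 0, 0]

cm_demo = confusion_matrix(y_actual, y_predicted, labels=[0, 1])

# Rows are actual classes, columns are predicted classes
cm_table = pd.DataFrame(cm_demo,
                        index=['actual 0', 'actual 1'],
                        columns=['predicted 0', 'predicted 1'])
print(cm_table)
```

Correct predictions sit on the diagonal; the off-diagonal cells are the false positives (top right) and false negatives (bottom left).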

There are (at least) 2 different ways to compute the confusion matrix.
*   NOTE: to compare these two approaches and avoid mistakes, we need to re-execute (or just write again, for clarity) some of the cells above. This is unnecessary if you use just one method, of course.


### First method

The first is not to rely on `cross_val_score` at all: it has no option to use a confusion matrix as its scoring function, because a confusion matrix is a table rather than a single score. So one way is not to do CV at all: opt for a static train/test split, then use `confusion_matrix` directly on the test-set predictions.

In [None]:
from sklearn.model_selection import train_test_split      # <---
from sklearn.metrics import confusion_matrix              # <---

In [None]:
test_size = 0.33

# Confusion matrix on a static train/test split (no cross validation here)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
model = LogisticRegression(solver='lbfgs', max_iter=500)
model.fit(X_train, Y_train)
predicted = model.predict(X_test)
matrix1 = confusion_matrix(Y_test, predicted)              # <---
print(matrix1)


Although the array is printed without headings, you can see that the majority of the predictions fall on the diagonal line of the matrix (which are correct predictions).

### Second method

The second is to keep doing k-fold CV, but to drop `cross_val_score` in favour of `cross_val_predict`, which returns the prediction made for each sample when it was in the held-out fold.


In [None]:
from sklearn.model_selection import KFold                 # <---
from sklearn.model_selection import cross_val_predict     # <---
from sklearn.metrics import confusion_matrix              # <---

In [None]:
kfold = KFold(n_splits=4, random_state=seed, shuffle=True)
model = LogisticRegression(solver='lbfgs', max_iter=300)

predicted = cross_val_predict(model, X, Y, cv=kfold)    # <--- NOTE: no 'scoring'
matrix2 = confusion_matrix(Y, predicted)
print(matrix2)

Same as above: the majority of the predictions fall on the diagonal line of the matrix. Good.

Let's make a couple of plots. We discuss later.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sn
#
df_cm = pd.DataFrame(matrix1)
plt.figure(figsize = (10,7))
sn.heatmap(df_cm, annot=True, cmap="YlOrRd", fmt="d")

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sn
#
df_cm = pd.DataFrame(matrix2)
plt.figure(figsize = (10,7))
sn.heatmap(df_cm, annot=True, cmap="YlOrRd", fmt="d")

## <font color='red'>Exercise 2</font>

The 2 matrices are not the same, though, are they? So: are the 2 results, content-wise, the same? Or comparable?


## <font color='green'>Solution 2</font>

In [None]:
# write your code here

> Go to the slides for a hint on the solution, and check the code below..


In [None]:
### compare first and second method by changing matrix1 and matrix2 below
import math

overall_size = 768
test_size = math.ceil(overall_size * 0.33)   # train_test_split rounds the test-set size up
train_size = overall_size - test_size

print("Overall set size :", overall_size)
print("Training set size :", train_size)
print("Test set size :", test_size)


In [None]:
tn, fp, fn, tp = matrix1.ravel()   #<--- here: do the same for matrix1 and matrix2

print(tn+fp+fn+tp)

In [None]:
tn, fp, fn, tp = matrix2.ravel()   #<--- here: do the same for matrix1 and matrix2

print(tn+fp+fn+tp)

## [CLAS-5] Classification Report

There is also **a convenience report provided by the scikit-learn library** when working on classification problems, to give you a quick idea of the accuracy of a model using a number of measures. The `classification_report()` function displays the precision, recall, F1-score and support for each class.
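As a reminder of what the report columns mean, the per-class measures can be derived from confusion-matrix counts. This is a minimal sketch with made-up counts; the helper `precision_recall_f1` is illustrative and is not the scikit-learn implementation.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1-score for one class, from its counts."""
    precision = tp / (tp + fp)   # of all instances predicted as this class, how many are right
    recall = tp / (tp + fn)      # of all instances actually in this class, how many are found
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
    return precision, recall, f1

# Hypothetical counts for the positive class
prec, rec, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print("precision: %.3f  recall: %.3f  f1-score: %.3f" % (prec, rec, f1))
```

The support column of the report is simply the number of actual instances of each class, here `tp + fn`.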

The example below demonstrates the report on the binary classification problem.

In [None]:
from sklearn.metrics import classification_report               # <---

In [None]:
test_size = 0.33
seed = 7

# Classification report on a static train/test split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size,
    random_state=seed)
model = LogisticRegression(solver='lbfgs', max_iter=5000)
model.fit(X_train, Y_train)
predicted = model.predict(X_test)
report = classification_report(Y_test, predicted)               # <---
print(report)

You can see good precision and recall for the algorithm.

# REGRESSION Metrics

In the regression examples, we will use the Boston house price dataset, which you can find (original source) [here](https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data); for your convenience it is already in the GitHub repo of the course.

## 0. Import the data

In [None]:
import pandas as pd

url = 'https://raw.githubusercontent.com/dbonacorsi/AML2324Bas/main/housing.data.csv'

names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
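data = pd.read_csv(url, sep=r'\s+', names=names)   # whitespace-separated file

# Redefine X and Y for the housing data, otherwise the cells below
# would still use the Pima Indians arrays defined earlier
array = data.values
X = array[:,0:13]
Y = array[:,13]

data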


Here we will review 3 of the most common metrics for evaluating predictions on regression ML problems:
* [REGR-1] Mean Absolute Error
* [REGR-2] Mean Squared Error
* [REGR-3] $R^2$



## [REGR-1] Mean Absolute Error

The Mean Absolute Error (or MAE) is **the average of the absolute differences between predictions and actual values**.

It gives an idea of how wrong the predictions were. The measure gives an idea of the magnitude of the error, but no idea of the direction (e.g. over or under predicting).
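A minimal sketch of the computation on a handful of made-up values, showing that the MAE keeps the magnitude of the errors while the direction is lost; `y_obs_mae` and `y_hat_mae` are illustrative names.

```python
import numpy as np

# Hypothetical actual values and predictions
y_obs_mae = np.array([3.0, -0.5, 2.0, 7.0])
y_hat_mae = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_hat_mae - y_obs_mae          # signed errors: negative means under-prediction
mae_demo = np.mean(np.abs(errors))      # average magnitude, direction discarded
mean_signed_error = np.mean(errors)     # over- and under-predictions partly cancel here

print("MAE              :", mae_demo)
print("Mean signed error:", mean_signed_error)
```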



The example below demonstrates calculating mean absolute error on the house dataset.

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

In [None]:
# Cross Validation Regression MAE
kfold = KFold(n_splits=10, random_state=7, shuffle=True)
model = LinearRegression()
#
scoring = 'neg_mean_absolute_error'                                  # <---
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("MAE: %.3f (%.3f)" % (results.mean(), results.std()))

A value of 0 indicates no error or perfect predictions. Like logloss, this metric is negated by the `cross_val_score()` function when the `neg_mean_absolute_error` scorer is used.

## [REGR-2] Mean Squared Error

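The Mean Squared Error (or MSE) is the average of the squared differences between predictions and actual values. It is much like the MAE in that **it provides a rough idea of the magnitude of error**, but squaring penalizes large errors more heavily.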

Taking the square root of the mean squared error converts the units back to the original units of the output variable and can be meaningful for description and presentation. This is called the ***Root Mean Squared Error*** (or ***RMSE***); in Italian: (radice quadrata dell') errore quadratico medio.
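The relation between MSE and RMSE can be sketched on a few made-up values; `y_obs_mse` and `y_hat_mse` are illustrative names.

```python
import numpy as np

# Hypothetical actual values and predictions
y_obs_mse = np.array([3.0, -0.5, 2.0, 7.0])
y_hat_mse = np.array([2.5, 0.0, 2.0, 8.0])

mse_demo = np.mean((y_obs_mse - y_hat_mse) ** 2)   # in squared units of the target
rmse_demo = np.sqrt(mse_demo)                      # back in the units of the target

print("MSE :", mse_demo)
print("RMSE:", round(float(rmse_demo), 4))
```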


The example below provides a demonstration of calculating MSE.

In [None]:
kfold = KFold(n_splits=10, random_state=7, shuffle=True)
model = LinearRegression()
#
scoring = 'neg_mean_squared_error'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("MSE: %.3f (%.3f)" % (results.mean(), results.std()))

This metric too is negated, so that larger values are better.

Of course, remember to take the absolute value before taking the square root if you are interested in calculating the RMSE.
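A minimal sketch of that conversion, assuming an array of per-fold scores shaped like the output of `cross_val_score` with `scoring='neg_mean_squared_error'`. The values below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical per-fold scores as returned with scoring='neg_mean_squared_error'
neg_mse_scores = np.array([-16.0, -25.0, -36.0])

# Flip the sign first, then take the square root fold by fold
rmse_per_fold = np.sqrt(-neg_mse_scores)

print("RMSE per fold:", rmse_per_fold)
print("RMSE: %.3f (%.3f)" % (rmse_per_fold.mean(), rmse_per_fold.std()))
```

Note that the mean of the per-fold RMSE values is in general not the same as the square root of the mean MSE. scikit-learn also offers a `neg_root_mean_squared_error` scorer that returns the negated per-fold RMSE directly.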

## [REGR-3] $R^2$ metric

The $R^2$ (or R Squared) metric provides **an indication of the goodness of fit of a set of predictions to the actual values**.


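In statistical literature this measure is called the coefficient of determination (in Italian: "coefficiente di determinazione"). It is typically a value between 0 and 1, for no fit and perfect fit respectively; it can also be negative when a model predicts worse than always predicting the mean of the actual values.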
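The definition $R^2 = 1 - SS_{res}/SS_{tot}$ can be sketched by hand on a few made-up values. The helper `r2_by_hand` is illustrative; note that predictions worse than the mean of the actual values push the result below 0.

```python
import numpy as np

def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
    return float(1.0 - ss_res / ss_tot)

y_obs_r2 = [3.0, -0.5, 2.0, 7.0]                     # hypothetical actual values (mean 2.875)

r2_good = r2_by_hand(y_obs_r2, [2.5, 0.0, 2.0, 8.0])   # predictions close to the actual values
r2_mean = r2_by_hand(y_obs_r2, [2.875] * 4)            # always predicting the mean
r2_bad = r2_by_hand(y_obs_r2, [7.0, 2.0, -0.5, 3.0])   # predictions worse than the mean

print("close predictions :", round(r2_good, 3))
print("mean predictor    :", round(r2_mean, 3))
print("poor predictions  :", round(r2_bad, 3))
```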


The example below provides a demonstration of calculating the mean $R^2$ for a set of predictions.

In [None]:
# Cross Validation Regression R^2
kfold = KFold(n_splits=10, random_state=7, shuffle=True)
model = LinearRegression()
#
scoring = 'r2'                                                     # <---
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("R^2: %.3f (%.3f)" % (results.mean(), results.std()))

To read the result: a mean $R^2$ closer to zero than to 1 (say, less than 0.5) indicates that the predictions have a poor fit to the actual values, while a value closer to 1 indicates a good fit.

## Summary

What we did:

* We discovered metrics that you can use to evaluate your ML algorithms. We learned about 3 classification metrics (Accuracy, LogLoss and AUC) and 2 convenience methods for classification prediction results (Confusion Matrix and Classification Report), as well as 3 metrics for regression problems (MAE, MSE, $R^2$).