# Model Evaluation

### Notebook Content 

### Evaluation Metrics for Regression
1. Explained-Variance-Score
2. Max-error
3. Mean-Absolute-Error
4. Mean-Squared-Error
5. Mean-Squared-Logarithmic-Error
6. Median-Absolute-Error
7. R²-score
8. Mean-Percentage-Error
9. Mean-Absolute-Percentage-Error
10. Weighted-Mean-Absolute-Percentage-Error
11. Tips-for-Metric-Selection
12. Cross-Validation
13. StratifiedKFold

# Evaluation Metrics for Regression
The **sklearn.metrics module implements several loss, score, and utility functions to measure regression performance**. Some of them have been enhanced to handle the **multioutput case**: mean_squared_error, mean_absolute_error, explained_variance_score and r2_score. These functions have a **multioutput keyword argument that specifies how the scores or losses for the individual targets should be averaged**. The default is 'uniform_average', which computes a uniformly weighted mean over outputs.<br> 
If an ndarray of shape (n_outputs,) is passed, its entries are interpreted as weights and the corresponding weighted average is returned. If multioutput='raw_values' is specified, the individual scores or losses are returned unaltered in an array of shape (n_outputs,).
<br><br>The **r2_score and explained_variance_score accept an additional value 'variance_weighted'** for the multioutput parameter. This option weights each individual score by the variance of the corresponding target variable, so the result quantifies the globally captured unscaled variance. If the target variables are on different scales, this score puts more importance on explaining the higher-variance variables well. In scikit-learn releases before 0.19, 'variance_weighted' was the default for r2_score; since then the default has been 'uniform_average' for r2_score as well.<br><br>
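
As a minimal sketch of the multioutput options described above, the snippet below scores a toy two-target problem. The arrays and the weights `[0.3, 0.7]` are made-up values chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Toy data with two targets per sample (illustrative values only)
y_true = np.array([[0.5, 1.0], [-1.0, 1.0], [7.0, -6.0]])
y_pred = np.array([[0.0, 2.0], [-1.0, 2.0], [8.0, -5.0]])

# One loss per target, returned without averaging
per_target = mean_absolute_error(y_true, y_pred, multioutput='raw_values')

# Default behaviour: plain mean of the per-target losses
uniform = mean_absolute_error(y_true, y_pred, multioutput='uniform_average')

# Array of weights: weighted mean of the per-target losses
weighted = mean_absolute_error(y_true, y_pred, multioutput=[0.3, 0.7])

# r2_score additionally accepts variance weighting
r2_var = r2_score(y_true, y_pred, multioutput='variance_weighted')

print('per target:', per_target)
print('uniform average:', uniform)
print('weighted average:', weighted)
print('variance-weighted R2:', r2_var)
```

Here the per-target errors are 0.5 and 1.0, so the uniform average is 0.75 and the weighted average is 0.3 × 0.5 + 0.7 × 1.0 = 0.85.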

<a class="list-group-item list-group-item-action" data-toggle="list" href="#Code" role="tab" aria-controls="messages">Go to Code<span class="badge badge-primary badge-pill"> </span></a>

## Explained Variance Score
The explained_variance_score **computes the explained variance regression score**. **Explained variation measures the proportion to which a mathematical model accounts for the variation (dispersion)** of a given data set. Often, variation is quantified as variance; then, the more specific term explained variance can be used.<br><br>The best possible score is 1.0, lower values are worse.

If $\hat{y}$ is the estimated target output, $y$ the corresponding (correct) target output, and $Var$ is the variance (the square of the standard deviation), then the explained variance is estimated as follows:

$explained\_{}variance(y, \hat{y}) = 1 - \frac{Var\{ y - \hat{y}\}}{Var\{y\}}$<br><br>
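
The formula can be checked by hand with NumPy. The sketch below uses made-up toy values and also illustrates one consequence of the definition: adding a constant offset to every prediction leaves $Var\{y - \hat{y}\}$ unchanged, so the explained variance does not change even though the predictions got worse.

```python
import numpy as np
from sklearn.metrics import explained_variance_score, r2_score

# Toy values chosen for illustration only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Explained variance computed directly from the formula
residual = y_true - y_pred
ev_manual = 1 - np.var(residual) / np.var(y_true)
ev_sklearn = explained_variance_score(y_true, y_pred)

# Shift every prediction by a constant: the residual variance is unchanged,
# so explained variance stays the same, while R2 penalizes the bias
y_shifted = y_pred + 1.0
ev_shifted = explained_variance_score(y_true, y_shifted)
r2_shifted = r2_score(y_true, y_shifted)

print('manual:', ev_manual, 'sklearn:', ev_sklearn)
print('after shift -> explained variance:', ev_shifted, 'R2:', r2_shifted)
```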
<a class="list-group-item list-group-item-action" data-toggle="list" href="#Code" role="tab" aria-controls="messages">Go to Code<span class="badge badge-primary badge-pill"> </span></a>

## Max error
The max_error function computes the **maximum residual error**, a metric that captures the worst-case error between the predicted value and the true value. For a perfectly fitted single-output regression model, max_error would be 0 on the training set. Although this is highly unlikely in the real world, the metric shows the largest error the fitted model makes.

If $\hat{y}_i$ is the predicted value of the $i$-th sample, and $y_i$ is the corresponding true value, then the max error is defined as:

$\text{Max Error}(y, \hat{y}) = \max_{i}(| y_i - \hat{y}_i |)$<br><br>
<a class="list-group-item list-group-item-action" data-toggle="list" href="#Code" role="tab" aria-controls="messages">Go to Code<span class="badge badge-primary badge-pill"> </span></a>


## Mean Absolute Error
The mean_absolute_error function computes the mean absolute error, a risk metric corresponding to the expected value of the absolute error loss or $l1$-norm loss. In statistics, the mean absolute error (MAE) is a **measure of errors between paired observations** expressing the same phenomenon.

If $\hat{y}_i$ is the predicted value of the $i$-th sample, and $y_i$ is the corresponding true value, then the mean absolute error (MAE) estimated over $n_{\text{samples}}$ is defined as:

$\text{MAE}(y, \hat{y}) = \frac{1}{n_{\text{samples}}} \sum_{i=0}^{n_{\text{samples}}-1} \left| y_i - \hat{y}_i \right|$.<br><br>
<a class="list-group-item list-group-item-action" data-toggle="list" href="#Code" role="tab" aria-controls="messages">Go to Code<span class="badge badge-primary badge-pill"> </span></a>

## Mean Squared Error
The mean_squared_error function computes mean square error, a risk metric **corresponding to the expected value of the squared (quadratic) error or loss**.

If $\hat{y}_i$ is the predicted value of the $i$-th sample, and $y_i$ is the corresponding true value, then the mean squared error (MSE) estimated over $n_{\text{samples}}$ is defined as

$\text{MSE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples} - 1} (y_i - \hat{y}_i)^2.$<br><br>
<a class="list-group-item list-group-item-action" data-toggle="list" href="#Code" role="tab" aria-controls="messages">Go to Code<span class="badge badge-primary badge-pill"> </span></a>


## Mean Squared Logarithmic Error
The mean_squared_log_error function computes a **risk metric corresponding to the expected value of the squared logarithmic (quadratic) error or loss.**

If $\hat{y}_i$ is the predicted value of the $i$-th sample, and $y_i$ is the corresponding true value, then the mean squared logarithmic error (MSLE) estimated over $n_{\text{samples}}$ is defined as:

$\text{MSLE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples} - 1} (\log_e (1 + y_i) - \log_e (1 + \hat{y}_i) )^2.$
where $\log_e (x)$ means the natural logarithm of $x$. <br>
**This metric is best used when targets grow exponentially, such as population counts** or the average sales of a commodity over a span of years. Note that this metric penalizes an under-predicted estimate more heavily than an over-predicted estimate.
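
The asymmetry mentioned above can be seen with a single sample. This is a minimal sketch with a made-up true value of 100 and predictions that miss by 50 in either direction.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

y_true = np.array([100.0])
too_low = np.array([50.0])    # under-prediction by 50
too_high = np.array([150.0])  # over-prediction by 50

under = mean_squared_log_error(y_true, too_low)
over = mean_squared_log_error(y_true, too_high)

# Manual check of the MSLE formula for the under-prediction
manual_under = np.mean((np.log1p(y_true) - np.log1p(too_low)) ** 2)

print('MSLE when predicting too low:', under)
print('MSLE when predicting too high:', over)
```

The same absolute miss costs more when the prediction is too low, because the logarithm compares ratios rather than differences.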
<br><br><a class="list-group-item list-group-item-action" data-toggle="list" href="#Code" role="tab" aria-controls="messages">Go to Code<span class="badge badge-primary badge-pill"> </span></a>

## Median Absolute Error
The median_absolute_error is particularly interesting because **it is robust to outliers**. The loss is calculated by taking the **median of all absolute differences between the target and the prediction**.<br><br>
If $\hat{y}_i$ is the predicted value of the $i$-th sample and $y_i$ is the corresponding true value, then the median absolute error (MedAE) estimated over $n_{\text{samples}}$ is defined as

$\text{MedAE}(y, \hat{y}) = \text{median}(\mid y_1 - \hat{y}_1 \mid, \ldots, \mid y_n - \hat{y}_n \mid).$
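
To illustrate the robustness claim, the following sketch corrupts one prediction in a small made-up sample and compares the median absolute error with the mean absolute error.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, median_absolute_error

# Toy values chosen for illustration only
y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0, 4.5])

# Replace one prediction with a wildly wrong value to simulate an outlier
y_pred_outlier = y_pred.copy()
y_pred_outlier[3] = 70.0

medae_clean = median_absolute_error(y_true, y_pred)
medae_outlier = median_absolute_error(y_true, y_pred_outlier)
mae_clean = mean_absolute_error(y_true, y_pred)
mae_outlier = mean_absolute_error(y_true, y_pred_outlier)

print('MedAE clean / with outlier:', medae_clean, medae_outlier)
print('MAE   clean / with outlier:', mae_clean, mae_outlier)
```

The median stays at 0.5 in both cases, while the mean jumps from 0.5 to 12.9.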

<a class="list-group-item list-group-item-action" data-toggle="list" href="#Code" role="tab" aria-controls="messages">Go to Code<span class="badge badge-primary badge-pill"> </span></a>

## R² score
The r2_score function **computes the coefficient of determination**, usually denoted as R².<br><br>
It represents the **proportion of variance (of y) that has been explained by the independent variables** in the model. It provides an **indication of goodness of fit** and therefore a measure of how well unseen samples are likely to be predicted by the model, through the proportion of explained variance.<br><br>
Because this variance is dataset dependent, **R² may not be meaningfully comparable across different datasets**. <br>The best possible score is 1.0, and the score can be negative (because a model can be arbitrarily worse than a constant predictor). A constant model that always predicts the expected value of y, disregarding the input features, would get an R² score of 0.0.

If $\hat{y}_i$ is the predicted value of the $i$-th sample and $y_i$ is the corresponding true value for total $n$ samples, the estimated R² is defined as:

$R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$ <br>
where $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ and $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \epsilon_i^2$ <br>
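
A minimal sketch of these properties, using made-up toy values: R² computed from the formula matches r2_score, a constant prediction of the mean scores 0, and a sufficiently bad prediction scores below 0.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy values chosen for illustration only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# R2 from the definition: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Constant model that always predicts the mean of y
r2_constant = r2_score(y_true, np.full_like(y_true, y_true.mean()))

# Deliberately poor predictions (the true values in reverse order)
r2_bad = r2_score(y_true, y_true[::-1].copy())

print('manual:', r2_manual, 'sklearn:', r2_score(y_true, y_pred))
print('constant model:', r2_constant)
print('poor model:', r2_bad)
```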


<a class="list-group-item list-group-item-action" data-toggle="list" href="#Code" role="tab" aria-controls="messages">Go to Code<span class="badge badge-primary badge-pill"> </span></a>

<a class="list-group-item list-group-item-action" data-toggle="list" href="#Model-Evaluation" role="tab" aria-controls="messages">Go to top<span class="badge badge-primary badge-pill"></span></a>


## Mean Percentage Error
In statistics, the mean percentage error (MPE) is the computed average of percentage errors by which forecasts of a model differ from actual values of the quantity being forecast.

The formula for the mean percentage error is:
$\text{MPE} = \frac{100\%}{n} \sum_{t=1}^{n} \frac{a_t - f_t}{a_t}$ <br>
where $a_t$ is the actual value of the quantity being forecast, $f_t$ is the forecast, and $n$ is the number of different times for which the variable is forecast.

Because actual rather than absolute values of the forecast errors are used in the formula, positive and negative forecast errors can offset each other; as a result the formula can be used as a measure of the bias in the forecasts.

A disadvantage of this measure is that it is undefined whenever a single actual value is zero.<br>
The function definition of MPE is given below:

In [1]:
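import numpy as np  # needed here because the main import cell comes later

def MPE(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    # Mean of signed relative errors, returned as a fraction (multiply by 100 for percent)
    return np.mean((y_true - y_pred) / y_true)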

<a class="list-group-item list-group-item-action" data-toggle="list" href="#Model-Evaluation" role="tab" aria-controls="messages">Go to top<span class="badge badge-primary badge-pill"></span></a>


## Mean Absolute Percentage Error
The MAPE is a simple average of absolute percentage errors. It is calculated as follows:
$\text{MAPE} = \frac{1}{n} \sum_{t=1}^{n} \left| \frac{A_t - F_t}{A_t} \right|$ <br>
where $A_t$ is the actual value, $F_t$ is the forecast value and $n$ is the number of forecasts. The MAPE is also sometimes reported as a percentage, which is the above equation multiplied by 100. Each difference between $A_t$ and $F_t$ is divided by the actual value $A_t$. <br>
It **cannot be used if there are zero values** (which sometimes happens for example in demand data) because there would be a division by zero.<br>
For forecasts which are too low the percentage error cannot exceed 100%, but **for forecasts which are too high there is no upper limit to the percentage error**.<br>

The function definition of MAPE is given below:

In [2]:
def MAPE(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    # Mean of absolute relative errors, returned as a percentage; undefined if y_true contains 0
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
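
The asymmetry between forecasts that are too low and too high can be checked with a single made-up observation. The helper `mape_percent` below is a hypothetical stand-alone copy of the percentage calculation, written only for this sketch.

```python
import numpy as np

def mape_percent(y_true, y_pred):
    """MAPE in percent (hypothetical helper); assumes y_true contains no zeros."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

actual = [100.0]

# The lowest possible non-negative forecast gives an error of exactly 100%
worst_under = mape_percent(actual, [0.0])

# A forecast that is too high can exceed 100% by any amount
large_over = mape_percent(actual, [500.0])

print('forecast 0   ->', worst_under, '%')
print('forecast 500 ->', large_over, '%')
```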

<a class="list-group-item list-group-item-action" data-toggle="list" href="#Model-Evaluation" role="tab" aria-controls="messages">Go to top<span class="badge badge-primary badge-pill"></span></a>


## Weighted Mean Absolute Percentage Error
The WMAPE (Weighted Mean Absolute Percentage Error) is a variant of MAPE in which the errors are weighted by sales volume. As the name suggests, it is a **measure that gives greater importance to faster selling products**, and thus overcomes one of the potential drawbacks of MAPE. There is a very simple way to calculate WMAPE: add together the absolute errors at the detailed level, then express the total of the errors as a percentage of total sales. This method of calculation has the additional benefit that it is robust to individual instances where the base is zero, thus overcoming the divide-by-zero problem that often occurs with MAPE.

WMAPE is a highly useful measure and is becoming increasingly popular both in corporate KPIs and for operational use. It is easily calculated and gives a concise forecast accuracy measurement that can be used to summarise performance at any detailed level across any grouping of products and/or time periods. If a measure of accuracy is required, it is calculated as 100 - WMAPE.
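
The calculation described above can be sketched in a few lines. The function name `wmape` and the sales figures are assumptions made for this illustration; the notebook itself does not define a WMAPE function.

```python
import numpy as np

def wmape(y_true, y_pred):
    """Total absolute error as a percentage of total actuals (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    total_actual = np.sum(np.abs(y_true))
    if total_actual == 0:
        # Only undefined when ALL actual values are zero
        raise ValueError("WMAPE is undefined when the actual values sum to zero")
    return np.sum(np.abs(y_true - y_pred)) / total_actual * 100

# Made-up sales data; the first product sold nothing, which would break MAPE
sales = [0, 10, 200]
forecast = [1, 12, 190]

score = wmape(sales, forecast)
accuracy = 100 - score

print('WMAPE:', score)
print('Accuracy:', accuracy)
```

The absolute errors add up to 13 against total sales of 210, so the WMAPE is about 6.19 even though one actual value is zero.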

Now, to show how to compute the different error metrics, we will perform Random Forest Regression on a dataset that can be downloaded from the link below:

https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009<br>
We will start by importing the necessary libraries and the required dataset.

<a class="list-group-item list-group-item-action" data-toggle="list" href="#Model-Evaluation" role="tab" aria-controls="messages">Go to top<span class="badge badge-primary badge-pill"></span></a>


In [3]:
# Importing the libraries
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Silence noisy library warnings so the notebook output stays readable
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=RuntimeWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [4]:
data = pd.read_csv('dataset/winequality-red.csv')

In [5]:
# Forward-fill missing values (fillna(method='ffill') is deprecated in recent pandas)
data = data.ffill()

Our next step is to divide the data into “attributes” and “labels”. The X variable contains all the attributes/features and the y variable contains the labels.

In [6]:
X = data[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates','alcohol']]
y = data['quality']

Next, we assign 80% of the data to the training set and 20% to the test set using the code below.

In [7]:
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Now we will run random forest regression on the training set.

In [8]:
# Import RF Regressor
from sklearn.ensemble import RandomForestRegressor
# Instantiate model with 1000 decision trees
rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)
# Train the model on training data
rf.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=1000, n_jobs=None, oob_score=False,
                      random_state=42, verbose=0, warm_start=False)

In [9]:
# Now let's do prediction on test data.

y_pred = rf.predict(X_test) 

### _Code_
Now we can measure the error of our model using the different regression metrics explained above. 

In [10]:
#Error Calculations
from sklearn import metrics
#Explained Variance Score
print("_____________________________________")
print('Explained variance Score:', metrics.explained_variance_score(y_test, y_pred, multioutput='raw_values'))
# 'raw_values' returns an array with one score per target; keep the single score as a scalar
explained_variance_score = metrics.explained_variance_score(y_test, y_pred, multioutput='raw_values')[0]
#Max Error
print("_____________________________________")
print('Max Error:', metrics.max_error(y_test, y_pred))
max_error=metrics.max_error(y_test, y_pred)
#Mean Absolute Error
print("_____________________________________")
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
mean_absolute_error=metrics.mean_absolute_error(y_test, y_pred)
#Mean Squared Error
print("_____________________________________")
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred)) 
mean_squared_error=metrics.mean_squared_error(y_test, y_pred)
#Root Mean Squared Error
print("_____________________________________")
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
root_mean_squared_error=np.sqrt(metrics.mean_squared_error(y_test, y_pred))
#Mean Squared Log Error
print("_____________________________________")
print('Mean Squared Log Error:', metrics.mean_squared_log_error(y_test, y_pred))
mean_squared_log_error=metrics.mean_squared_log_error(y_test, y_pred)
#Median Absolute Error
print("_____________________________________")
print('Median Absolute Error:', metrics.median_absolute_error(y_test, y_pred))
median_absolute_error=metrics.median_absolute_error(y_test, y_pred)
#Mean Percentage Error
print("_____________________________________")
print('Mean Percentage Error:', MPE(y_test, y_pred))
mean_percentage_error=MPE(y_test, y_pred)
#Mean Absolute Percentage Error
print("_____________________________________")
print('Mean Absolute Percentage Error:', MAPE(y_test, y_pred))
mean_absolute_percentage_error=MAPE(y_test, y_pred)

_____________________________________
Explained variance Score: [0.45337001]
_____________________________________
Max Error: 2.615
_____________________________________
Mean Absolute Error: 0.40428437500000003
_____________________________________
Mean Squared Error: 0.318432278125
_____________________________________
Root Mean Squared Error: 0.5642980401569724
_____________________________________
Mean Squared Log Error: 0.008019925336798525
_____________________________________
Median Absolute Error: 0.3089999999999997
_____________________________________
Mean Percentage Error: -0.024683098958333333
_____________________________________
Mean Absolute Percentage Error: 7.658587425595238


In [11]:
#Dataframe of errors given by all the metrics explained above
#(stored in error_data so that the wine DataFrame in `data` is not overwritten)
error_data={"Error Metric":['Explained variance Score', 'Max Error', 'Mean Absolute Error', 'Mean Squared Error', 'Root Mean Squared Error','Mean Squared Log Error','Median Absolute Error','Mean Percentage Error','Mean Absolute Percentage Error'],
      "Score":["%.4f" % explained_variance_score, "%.4f" % max_error, "%.4f" % mean_absolute_error, "%.4f" % mean_squared_error, "%.4f" % root_mean_squared_error, "%.4f" % mean_squared_log_error, "%.4f" % median_absolute_error, "%.4f" % mean_percentage_error, "%.4f" % mean_absolute_percentage_error]}
error_df = pd.DataFrame(error_data)

In [12]:
error_df

| |Error Metric|Score|
|---|------------|-----|
|0|Explained variance Score|0.4534|
|1|Max Error|2.615|
|2|Mean Absolute Error|0.4043|
|3|Mean Squared Error|0.3184|
|4|Root Mean Squared Error|0.5643|
|5|Mean Squared Log Error|0.008|
|6|Median Absolute Error|0.309|
|7|Mean Percentage Error|-0.0247|
|8|Mean Absolute Percentage Error|7.6586|


<a class="list-group-item list-group-item-action" data-toggle="list" href="#Model-Evaluation" role="tab" aria-controls="messages">Go to top<span class="badge badge-primary badge-pill"></span></a>


## Tips for Metric Selection
- The **MAE is also the most intuitive of the metrics**, since it only looks at the absolute difference between the data and the model’s predictions<br>


- MAE does not indicate underperformance or overperformance of the model (whether or not the model under or overshoots actual data). Each residual contributes proportionally to the total amount of error, meaning that larger errors will contribute linearly to the overall error<br> 


- A **small MAE** suggests the model is good at prediction, while a **large MAE** suggests that the model may have trouble in certain areas. An MAE of 0 means that your model is a perfect predictor of the outputs (but this will almost never happen)<br><br>


- While the MAE is easily interpretable, using the absolute value of the residual is often not as desirable as squaring this difference. Depending on how you want the model to treat outliers, or extreme values, in the data, you may want to bring more attention to these outliers or downplay them. The issue of outliers can play a major role in which error metric you use<br>


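- Because the MSE squares each difference, it is expressed in squared units and **cannot be compared directly with the MAE**: it exceeds the MAE when the errors are typically larger than 1 and falls below it when they are smaller (in this notebook the MSE is 0.3184 while the MAE is 0.4043). **One can only compare a model’s error metrics to those of a competing model**<br>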

- The effect of the square term in the MSE equation is most apparent with the presence of outliers in our data. **While each residual in MAE contributes proportionally to the total error, the error grows quadratically in MSE**<br><br>


- RMSE is the square root of the MSE. Because the **MSE is squared, its units do not match that of the original output**. Researchers will often use **RMSE to convert the error metric** back into similar units, making interpretation easier<br>


- Because the errors are squared before they are averaged, **RMSE gives a relatively high weight to large errors**, so RMSE is most useful when large errors are particularly undesirable.<br><br>


- Just as MAE is the average magnitude of error produced by your model, the **MAPE is how far the model’s predictions are off** from their corresponding outputs on average, expressed relative to the actual values. Like MAE, MAPE also has a clear interpretation since percentages are easier for people to conceptualize. **Both MAPE and MAE are more robust to the effects of outliers than MSE**<br>


- Many of MAPE’s weaknesses actually stem from its use of division. MAPE is **undefined for data points where the value is 0**. Similarly, the MAPE can grow unexpectedly large if the actual values are exceptionally small themselves<br>


- Finally, the MAPE is **biased towards predictions** that are systematically less than the actual values themselves. For non-negative predictions, the percentage error of an under-prediction can never exceed 100%, whereas the error of an over-prediction has no upper limit, so minimizing MAPE tends to favour models that under-predict. 

The table given below can also be helpful in regard to metric understanding:


|Acronym|Full Name|Residual Operation|Robust To Outliers|
|-------|---------|------------------|------------------|
|MAE|Mean Absolute Error|Absolute Value|Yes|
|MSE|Mean Squared Error|Square|No|
|RMSE|Root Mean Squared Error|Square|No|
|MAPE|Mean Absolute Percentage Error|Absolute Value|Yes|
|MPE|Mean Percentage Error|N/A|Yes|
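
As a quick numerical check of how MAE, MSE and RMSE relate, the sketch below scores two made-up sets of predictions against a target of zeros: one where every error is below 1 and one that contains a single large error.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.zeros(4)
scenarios = {
    "small errors": np.array([0.5, -0.5, 0.5, -0.5]),   # every error below 1
    "one outlier": np.array([0.5, -0.5, 0.5, -10.0]),   # one large error
}

results = {}
for name, y_pred in scenarios.items():
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)  # back in the units of the target
    results[name] = {"mae": mae, "mse": mse, "rmse": rmse}
    print(f"{name}: MAE={mae:.4f}  MSE={mse:.4f}  RMSE={rmse:.4f}")
```

With small errors the MSE (0.25) is below the MAE (0.5). After one large error is introduced, the MAE grows to 2.875 while the MSE grows to 25.1875, which shows the quadratic weight given to the outlier.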


<a class="list-group-item list-group-item-action" data-toggle="list" href="#Model-Evaluation" role="tab" aria-controls="messages">Go to top<span class="badge badge-primary badge-pill"></span></a>


# Cross Validation
Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.
Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data. That is, to use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.

**K-Folds Cross Validation**
<br>In K-Folds Cross Validation we split our data into k different subsets (or folds). In each round we use k-1 subsets to train the model and hold out the remaining subset as validation data, rotating until every fold has been held out once. We then average the model's scores over the folds to estimate its performance and finalize our model. After that we test it against the test set.

In [13]:
from sklearn.model_selection import KFold
kf = KFold(n_splits = 5, shuffle = True, random_state = 2)
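for train_index, test_index in kf.split(X):
    # Split train-test by position; each iteration overwrites the previous split,
    # so after the loop these variables hold only the last of the 5 folds
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]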

In [14]:
rf = RandomForestRegressor(n_estimators = 100, random_state = 42)
#Train the model on training data
rf.fit(X_train, y_train)
y_pred_kfold = rf.predict(X_test)
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred_kfold)) 
mean_squared_error=metrics.mean_squared_error(y_test, y_pred_kfold)


Mean Squared Error: 0.39817868338557993
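
The loop above leaves only the final split in `X_train` and `X_test`, so the reported MSE comes from a single fold. One way to obtain a score for every fold and average them is `cross_val_score`. The sketch below uses synthetic data from `make_regression` (an assumption, so that it runs without the wine CSV) and a smaller forest to keep it fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the wine data (illustrative assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=2)
model = RandomForestRegressor(n_estimators=50, random_state=42)

# scikit-learn scorers follow "higher is better", so MSE is returned negated
neg_mse = cross_val_score(model, X_demo, y_demo, cv=kf_demo, scoring="neg_mean_squared_error")
fold_mse = -neg_mse

print('MSE per fold:', fold_mse)
print('Mean MSE over folds:', fold_mse.mean())
```

Passing the notebook's `X` and `y` instead of the synthetic arrays should give the averaged estimate for the wine data.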


<a class="list-group-item list-group-item-action" data-toggle="list" href="#Model-Evaluation" role="tab" aria-controls="messages">Go to top<span class="badge badge-primary badge-pill"></span></a>


# StratifiedKFold
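StratifiedKFold is a variation of KFold that returns stratified folds: each fold preserves approximately the same proportion of each target class as the complete dataset. It splits the data into n_splits parts and uses each part once as a test set. By default it does not shuffle the data; with shuffle=True the data is shuffled once before splitting. Because stratification needs discrete class labels, it can be applied here only because the wine quality scores are integers.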
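
To see what stratification does to the folds, here is a minimal sketch on a made-up label vector with a 75% / 25% class balance. It counts the classes that land in each test fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 20 toy samples: 15 of class 0 and 5 of class 1 (illustrative values)
y_demo = np.array([0] * 15 + [1] * 5)
X_demo = np.arange(20).reshape(-1, 1)

skf_demo = StratifiedKFold(n_splits=5)

fold_class_counts = []
for train_idx, test_idx in skf_demo.split(X_demo, y_demo):
    # Number of class-0 and class-1 samples in this test fold
    counts = np.bincount(y_demo[test_idx], minlength=2)
    fold_class_counts.append(counts.tolist())

print(fold_class_counts)
```

Every test fold receives three samples of class 0 and one of class 1, matching the class ratio of the full label vector.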


**Difference in KFold and ShuffleSplit output**

KFold will divide your data set into prespecified number of folds, and every sample must be in one and only one fold. A fold is a subset of your dataset.

ShuffleSplit will randomly sample your entire dataset during each iteration to generate a training set and a test set. The test_size and train_size parameters control how large the test and training sets should be for each iteration. Since you are sampling from the entire dataset during each iteration, values selected during one iteration could be selected again during another iteration.

Summary: ShuffleSplit works iteratively, KFold just divides the dataset into k folds.

**Difference when doing validation**

In KFold, during each round you will use one fold as the test set and all the remaining folds as your training set. However, in ShuffleSplit, during each round n you should only use the training and test set from iteration n. As your data set grows, cross validation time increases, making ShuffleSplit a more attractive alternative. If you can train your algorithm with a certain percentage of your data as opposed to using all k-1 folds, ShuffleSplit is an attractive option.
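
The difference between the two splitters can be inspected directly on a tiny made-up dataset of ten samples. This sketch collects the test indices produced by each splitter; the sizes and random seed are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

X_demo = np.arange(10).reshape(-1, 1)

# KFold: the test folds partition the dataset, so every sample is tested exactly once
kfold_test = [test for _, test in KFold(n_splits=5).split(X_demo)]
kfold_all = np.sort(np.concatenate(kfold_test))

# ShuffleSplit: each iteration draws a fresh random test set from the whole dataset
ss = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
ss_test = [test for _, test in ss.split(X_demo)]
ss_counts = np.bincount(np.concatenate(ss_test), minlength=10)

print('KFold test folds:', [t.tolist() for t in kfold_test])
print('ShuffleSplit test sets:', [t.tolist() for t in ss_test])
print('Times each sample was tested by ShuffleSplit:', ss_counts.tolist())
```

With ShuffleSplit the per-sample counts need not all equal 1: a sample may appear in several test sets or in none.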

In [15]:
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, random_state=None)
# X is the feature set and y is the target
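for train_index, test_index in skf.split(X,y):
    # Split train-test by position; each iteration overwrites the previous split,
    # so after the loop these variables hold only the last of the 5 folds
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]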

In [16]:
rf = RandomForestRegressor(n_estimators = 100, random_state = 42)
# Train the model on training data
rf.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=42, verbose=0, warm_start=False)

In [17]:
y_pred_kfold = rf.predict(X_test)
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred_kfold)) 
mean_squared_error=metrics.mean_squared_error(y_test, y_pred_kfold)


Mean Squared Error: 0.4155485893416928
