# Regression example

## Metrics for model evaluation



Once a regression model has been fitted to a dataset, how do we evaluate it? There are many metrics we can use. Here we will use the mean absolute error (MAE), the mean squared error (MSE) together with its square root (RMSE), and R squared (R^2).

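We are going to use a classic dataset called "Wine", which contains the results of a chemical analysis of wines derived from three different cultivars (varieties). Note that the target is a class label (0, 1 or 2); this example treats it as a numeric value so that regression metrics can be computed.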

**Recipe Objective**:
* Step 1 - Import the library
* Step 2 - Setting up the Data
* Step 3 - Training model and calculating Metrics

### Step 1: Import the library

In [None]:
from sklearn import datasets
from sklearn import tree, model_selection
from sklearn.model_selection import train_test_split
from numpy import mean
from numpy import absolute
from numpy import sqrt

We have imported datasets, tree, model_selection and train_test_split from scikit-learn, plus mean, absolute and sqrt from NumPy, which is everything needed for the exercise.



### Step 2: Setting up the data

In the next step we will load the built-in "Wine" dataset, storing the data in "X" and the labels in "y". Once this is done, we will split both "X" and "y" into training and test sets.

Then, we will use "KFold", which divides the dataset into "K" folds, taking one of them as the validation set and the remaining folds as the training set, and repeating this for "K" different combinations (learn more at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html).
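To make the fold mechanics concrete, here is a minimal sketch on a toy array of six samples. The names `X_toy`, `kf` and `folds` are illustrative and not part of the exercise, and shuffling is left off so the split is easy to read.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 6 samples, 1 feature (illustrative values only).
X_toy = np.arange(6).reshape(-1, 1)

# 3 folds, no shuffling, so each fold is a contiguous block of indices.
kf = KFold(n_splits=3)

folds = []
for fold_number, (train_idx, val_idx) in enumerate(kf.split(X_toy), start=1):
    folds.append((train_idx.tolist(), val_idx.tolist()))
    print(f"Fold {fold_number}: train={train_idx.tolist()} validation={val_idx.tolist()}")
```

Each sample lands in the validation set exactly once, and in the training set for the other two folds.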


In [None]:
#The wine dataset is a classic and very easy multi-class classification dataset.
#This copy of the UCI ML Wine dataset was downloaded and modified to fit a standard format from: https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

#DATA:
seed = 42
dataset = datasets.load_wine()
X = dataset.data #data
y = dataset.target #labels (3 wine cultivars/varieties)

#PRINT DATA:
print("X........................")
print(X)
print("y........................")
print(y)

#dimension
print(X.shape)



X........................
[[1.423e+01 1.710e+00 2.430e+00 ... 1.040e+00 3.920e+00 1.065e+03]
 [1.320e+01 1.780e+00 2.140e+00 ... 1.050e+00 3.400e+00 1.050e+03]
 [1.316e+01 2.360e+00 2.670e+00 ... 1.030e+00 3.170e+00 1.185e+03]
 ...
 [1.327e+01 4.280e+00 2.260e+00 ... 5.900e-01 1.560e+00 8.350e+02]
 [1.317e+01 2.590e+00 2.370e+00 ... 6.000e-01 1.620e+00 8.400e+02]
 [1.413e+01 4.100e+00 2.740e+00 ... 6.100e-01 1.600e+00 5.600e+02]]
y........................
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
(178, 13)


Now split the data into training and test sets and obtain the k folds:

k-fold cross-validation is a technique used to evaluate the performance of machine learning models. It works by splitting your data into k folds (usually of equal size) and performing the following steps:

For each fold:
* Train the model on k-1 folds of data (excluding the current fold).
* Evaluate the model's performance on the remaining fold (held-out set).

Finally, average the performance metrics across all folds.
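The steps above can be written out by hand, which is roughly what a cross-validation helper does internally. This is a sketch under assumptions: it uses 5 folds, a fixed `random_state=0`, and the hypothetical names `X_demo`, `y_demo`, `model_demo` and `fold_scores`, none of which are part of the exercise.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

X_demo, y_demo = load_wine(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kf.split(X_demo):
    # Train a fresh model on the k-1 training folds.
    model_demo = DecisionTreeRegressor(random_state=0)
    model_demo.fit(X_demo[train_idx], y_demo[train_idx])

    # Evaluate on the held-out fold.
    predictions = model_demo.predict(X_demo[val_idx])
    fold_scores.append(mean_absolute_error(y_demo[val_idx], predictions))

# Average the metric across all folds.
print("MAE per fold:", np.round(fold_scores, 3))
print("Mean MAE:", np.mean(fold_scores))
```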

In [None]:
#split data into train (75%) and test (25%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25) # 75% training, 25% test

#obtain 10 folds for cross-validation; shuffle=True shuffles the data before splitting it into folds ("shuffle the cards before dealing")
kfold = model_selection.KFold(n_splits=10, random_state=seed, shuffle=True) #10 splits of the data

### Step 3: Training model and calculating metrics


Here we will be using DecisionTreeRegressor as our model:

*   tree.DecisionTreeRegressor is a decision-tree algorithm from the scikit-learn library used for regression tasks. It builds a tree-like structure to predict continuous target values based on the input features.
*   Easy to interpret: The tree structure provides a clear understanding of how the model arrives at its predictions.
*   Handles complex relationships: It can learn complex non-linear relationships between the features and the target variable.
*   Robust to outliers: It is generally less sensitive to outliers in the data than some other regression methods.
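As a rough illustration of how a regression tree handles a non-linear relationship, the sketch below fits a shallow tree to synthetic sine data. The data, the `max_depth=3` setting and all variable names are assumptions made for this illustration only. A tree predicts one constant value per leaf, so a depth-3 tree can output at most 8 distinct values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear data: y = sin(x) on [0, 2*pi] (illustrative only).
rng = np.random.default_rng(0)
x_demo = np.sort(rng.uniform(0, 2 * np.pi, 200)).reshape(-1, 1)
y_demo = np.sin(x_demo).ravel()

# A shallow tree approximates the curve with a step function.
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=0)
shallow_tree.fit(x_demo, y_demo)
predictions = shallow_tree.predict(x_demo)

n_levels = len(np.unique(predictions))
train_r2 = shallow_tree.score(x_demo, y_demo)
print("Distinct predicted values:", n_levels)
print("Training R^2:", train_r2)
```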






In [None]:
model = tree.DecisionTreeRegressor()

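Now we will calculate different **metrics**. We will use the cross-validation score to compute them (https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation), printing the mean and standard deviation of the scores over the 10 folds. scikit-learn's error scorers return negated values (hence the "neg_" prefix, so that a higher score is always better), which is why we take the absolute value of the results.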

##### **Calculating Mean Absolute Error (MAE)**

Mean Absolute Error is the average of the absolute differences between the actual values and the predicted values. A model with a lower MAE performs better than a model with a higher MAE.

$MAE=\frac{1}{N}\cdot \sum_{i=1}^{N} |y_{i}-\hat{y}_{i}|$
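As a quick check of the formula, the following sketch computes the MAE by hand on four made-up values and compares it with scikit-learn's `mean_absolute_error`. The arrays `y_true` and `y_pred` are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([0, 1, 2, 1])
y_pred = np.array([0, 2, 2, 0])

# Absolute errors are 0, 1, 0, 1, so the MAE is (0 + 1 + 0 + 1) / 4 = 0.5.
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print("Manual MAE: ", mae_manual)
print("sklearn MAE:", mae_sklearn)
```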

In [None]:
scoring = "neg_mean_absolute_error" #Calculating Mean Absolute Error for the train set
results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)

print("Mean Absolute Error:each kfold:.......................")
print(absolute(results))
print("Mean of the Mean Absolute Error------------------------")
print("Mean Absolute Error: ", mean(absolute(results)))
print("Standard Deviation: ", results.std())

Mean Absolute Error:each kfold:.......................
[0.         0.14285714 0.07142857 0.15384615 0.23076923 0.23076923
 0.15384615 0.07692308 0.07692308 0.07692308]
Mean of the Mean Absolute Error------------------------
Mean Absolute Error:  0.12142857142857141
Standard Deviation:  0.07046904116411021


##### **Calculating Mean Squared Error (MSE)**



Mean Squared Error is the average of the squared differences between the actual and predicted values. Because the errors are squared, large errors weigh more heavily than in the MAE.


$MSE=\frac{1}{N}\cdot \sum_{i=1}^{N} ( y_{i}-\hat{y}_{i})^2$
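To see what the squaring does, the sketch below compares two made-up sets of predictions with the same total absolute error: one spreads it over two small mistakes, the other concentrates it in a single larger one. All arrays and helper names are assumptions for this illustration.

```python
import numpy as np

y_true = np.array([0, 1, 2, 1])
pred_spread = np.array([0, 2, 2, 0])  # two errors of size 1
pred_single = np.array([0, 1, 0, 1])  # one error of size 2

def mae(y, y_hat):
    """Mean of the absolute differences."""
    return np.mean(np.abs(y - y_hat))

def mse(y, y_hat):
    """Mean of the squared differences."""
    return np.mean((y - y_hat) ** 2)

print("spread: MAE =", mae(y_true, pred_spread), "MSE =", mse(y_true, pred_spread))
print("single: MAE =", mae(y_true, pred_single), "MSE =", mse(y_true, pred_single))
```

Both sets have an MAE of 0.5, but the MSE is 0.5 for the first and 1.0 for the second, because the single error of size 2 contributes 4 once squared.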

In [None]:
scoring = "neg_mean_squared_error" #Calculating Mean squared error
results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
print()
print("Mean Squared Error: ", mean(absolute(results)))
print("Standard Deviation: ", results.std())


Mean Squared Error:  0.09890109890109891
Standard Deviation:  0.07660846179311374


There is also the Root Mean Squared Error (RMSE), which is simply the square root of the MSE, so it is expressed in the same units as the target variable. Like the MSE, it penalizes large errors more heavily than the MAE does, which makes it more sensitive to outliers.
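One possible way to obtain the RMSE with cross-validation is to take the square root of each fold's MSE. The block below is a self-contained sketch that mirrors the setup of this exercise; the names `X_demo`, `X_tr`, `cv` and `model_demo` and the fixed `random_state` values are assumptions added so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
cv = KFold(n_splits=10, shuffle=True, random_state=42)
model_demo = DecisionTreeRegressor(random_state=42)

# The scorer returns the negated MSE, so flip the sign before the square root.
neg_mse = cross_val_score(
    model_demo, X_tr, y_tr, cv=cv, scoring="neg_mean_squared_error"
)
mse_per_fold = -neg_mse
rmse_per_fold = np.sqrt(mse_per_fold)

print("Root Mean Squared Error: ", rmse_per_fold.mean())
print("Standard Deviation: ", rmse_per_fold.std())
```

Note that the mean of the per-fold RMSE values is in general not the same as the square root of the mean MSE; either can be reported as long as the choice is stated.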

##### **Calculating R squared value**



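The R² score, also known as the coefficient of determination, measures how well a model fits a given dataset. It indicates how close the predicted values are to the actual values. The best possible value is 1; a model that always predicts the mean of the actual values scores 0, and a model that does worse than that gets a negative score.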

$R^2=1-\frac{\sum_{i=1}^{N}( y_{i}-\hat{y}_{i})^2}{\sum_{i=1}^{N}( y_{i}-\bar{y})^2}$
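The formula can be checked on a few made-up values. The sketch below uses a hypothetical helper `r_squared` and three invented sets of predictions (perfect, mean-only, and deliberately poor), and compares the result with scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([0.0, 1.0, 2.0, 1.0])

perfect = y_true.copy()                         # every prediction is exact
mean_only = np.full_like(y_true, y_true.mean())  # always predicts the mean
poor = np.array([2.0, 1.0, 0.0, 1.0])           # worse than predicting the mean

for name, y_pred in [("perfect", perfect), ("mean only", mean_only), ("poor", poor)]:
    print(name, r_squared(y_true, y_pred), r2_score(y_true, y_pred))
```

The three cases give 1.0, 0.0 and -3.0 respectively, so the score is not bounded below by 0.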

In [None]:
scoring = "r2" #Calculating R squared value
results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
print()
print("R squared val: ", results.mean())
print("Standard Deviation: ", results.std())


R squared val:  0.8075236912132073
Standard Deviation:  0.17957907480493432


Partially based on material available online; for teaching purposes only.

See in: https://www.projectpro.io/recipes/use-regression-metrics-in-python