# Introduction to model evaluation and best model selection

In this Model_selection directory we learn how to evaluate the regression models covered so far and how to select the best one. We have covered six regression models in total: (i) simple linear regression, (ii) multiple linear regression, (iii) polynomial linear regression, (iv) support vector regression, (v) decision tree regression, and (vi) random forest regression.

The code for each of these models has been made as general as possible, so that applying a model to a future dataset requires only minimal changes. In fact, the code templates are generic enough that we just have to change the name of the dataset file!

At the end of each template, we evaluate the performance of the model trained on the given dataset, so that the models can be compared directly and the best one chosen. With the code templates, we can therefore select the best model for a new dataset quickly and efficiently.

So how should we select the best model? The simple answer: try all the models and select the one with the best performance, as measured by $R^2$ or adjusted $R^2$. The $R^2$ metric is given by:

$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$ ,

where $SS_{res} = \sum_{i=1}^n (y_i - \hat{y}_i)^2$ is the residual sum of squares and $SS_{tot} = \sum_{i=1}^n (y_i - \bar{y})^2$ is the total sum of squares. Here $y_i$ is the actual (true) value of the dependent variable, $\hat{y}_i$ is the value predicted by the regression model, and $\bar{y}$ is the mean of all the true values of the dependent variable.

$R^2$ measures the goodness of fit of a regression model: the closer it is to 1, the closer the predictions are to the true values.
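As a minimal sketch of the formula above, the snippet below computes $R^2$ directly from $SS_{res}$ and $SS_{tot}$ on a tiny made-up sample. The helper names `r2_by_hand` and `adjusted_r2` and the sample values are illustrative assumptions, not part of the templates. The adjusted $R^2$ uses the standard formula $1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, with $n$ observations and $p$ features, which is not written out in this notebook.

```python
def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, following the definitions above."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((yi - yhat) ** 2 for yi, yhat in zip(y_true, y_pred))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r2(r2, n_samples, n_features):
    """Standard adjusted R^2; penalises models that use more features."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Made-up values for illustration only (not taken from Data.csv).
y_true = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.5, 5.5, 7.0, 9.5]

r2 = r2_by_hand(y_true, y_hat)
adj = adjusted_r2(r2, n_samples=len(y_true), n_features=1)
print("R^2:", r2)
print("Adjusted R^2:", adj)
```

On real data, `r2_by_hand(y_test, y_pred)` should agree with `sklearn.metrics.r2_score(y_test, y_pred)` up to floating-point error.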

Simple linear regression (SLR) is not included here, because we now work with a real-world dataset that has several features, and SLR applies only when there is a single feature.

The code templates need only minimal changes provided that:

(i) the dataset is in CSV format,

(ii) the features are in the leftmost columns and the dependent variable is in the last (rightmost) column,

(iii) the dataset has no missing values (missing values can be handled with the data preprocessing tools we learnt earlier) and no categorical data.
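Requirement (iii) can be checked quickly before running a template. The sketch below is one possible way to do it with pandas; the helper `check_template_requirements` and the column `site` are hypothetical names introduced here for illustration, and the two small frames stand in for a dataset loaded with `pd.read_csv`.

```python
import pandas as pd


def check_template_requirements(df):
    """Return a list of problems; an empty list means requirement (iii) holds."""
    problems = []
    # Count missing values over the whole frame.
    n_missing = int(df.isnull().sum().sum())
    if n_missing > 0:
        problems.append(f"{n_missing} missing value(s)")
    # Any non-numeric column is treated as categorical data here.
    non_numeric = list(df.select_dtypes(exclude="number").columns)
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric}")
    return problems


# Small illustrative frames; the last column plays the role of the dependent variable.
good = pd.DataFrame({"AT": [14.96, 25.18], "V": [41.76, 62.96], "PE": [463.26, 444.37]})
bad = pd.DataFrame({"AT": [14.96, None], "site": ["a", "b"], "PE": [463.26, 444.37]})

print(check_template_requirements(good))
print(check_template_requirements(bad))
```

Requirement (ii) cannot be verified automatically, since only the column order tells the templates which column is the dependent variable.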

# About the Dataset

We will now use a real-world dataset, **'Data.csv'**, which has 4 features and 1 dependent variable. It has almost 10,000 observations (each row is a single observation). All the features are numerical (not categorical) and there are no missing values. The dataset is a classic example from the UCI Machine Learning Repository (its website hosts many datasets on which we can practice) and is called "Combined Cycle Power Plant". The dependent variable $y$ is the energy output (PE). We don't have to understand the physics behind the dataset. The features are ambient temperature (AT), exhaust vacuum (V), ambient pressure (AP) and relative humidity (RH). These 4 features form the matrix of features $X$.

# Multiple Linear Regression

## Importing the libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [None]:
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values   # Matrix of features X consists of all the columns except the last one.
y = dataset.iloc[:, -1].values   # Dependent variable y is the last column.

In [None]:
dataset

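|      | AT    | V     | AP      | RH    | PE     |
|------|-------|-------|---------|-------|--------|
| 0    | 14.96 | 41.76 | 1024.07 | 73.17 | 463.26 |
| 1    | 25.18 | 62.96 | 1020.04 | 59.08 | 444.37 |
| 2    | 5.11  | 39.40 | 1012.16 | 92.14 | 488.56 |
| 3    | 20.86 | 57.32 | 1010.24 | 76.64 | 446.48 |
| 4    | 10.82 | 37.50 | 1009.23 | 96.62 | 473.90 |
| ...  | ...   | ...   | ...     | ...   | ...    |
| 9563 | 16.65 | 49.69 | 1014.01 | 91.00 | 460.03 |
| 9564 | 13.19 | 39.18 | 1023.67 | 66.78 | 469.62 |
| 9565 | 31.32 | 74.33 | 1012.92 | 36.48 | 429.57 |
| 9566 | 24.48 | 69.45 | 1013.86 | 62.39 | 435.74 |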


In [None]:
dataset.isnull().sum()

# No missing values in the dataset.

AT    0
V     0
AP    0
RH    0
PE    0
dtype: int64

In [None]:
X

array([[  14.96,   41.76, 1024.07,   73.17],
       [  25.18,   62.96, 1020.04,   59.08],
       [   5.11,   39.4 , 1012.16,   92.14],
       ...,
       [  31.32,   74.33, 1012.92,   36.48],
       [  24.48,   69.45, 1013.86,   62.39],
       [  21.6 ,   62.52, 1017.23,   67.87]])

In [None]:
y

array([463.26, 444.37, 488.56, ..., 429.57, 435.74, 453.28])

## Splitting the dataset into the Training set and Test set

We split the dataset into a training set and a test set. Since we want to select the best model, we need a test set on which each model's performance is evaluated.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

## Training the Multiple Linear Regression model on the Training set

In [None]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

## Predicting the Test set results

In [None]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
# Print predicted (left column) and actual (right column) values side by side.
print(np.concatenate((y_pred.reshape(-1, 1), y_test.reshape(-1, 1)), axis=1))

[[431.43 431.23]
 [458.56 460.01]
 [462.75 461.14]
 ...
 [469.52 473.26]
 [442.42 438.  ]
 [461.88 463.28]]


## Evaluating the Model Performance

On the **scikit-learn** website, the **API** reference lists all the modules together with the functions and classes they contain. The **sklearn.metrics** module contains the evaluation metrics for the ML models. There are many regression metrics, but we will use the $R^2$ metric via the **r2_score** function from that module (i.e., **sklearn.metrics.r2_score**). It takes two main arguments, the arrays $y_{true}$ and $y_{pred}$, which here are the true and predicted values for the test set only.

We evaluate every regression model by its $R^2$ value on the test set; the model with the highest $R^2$ value is the best regression model.

In [None]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

0.9325315554761303

The same steps are followed for every regression model in order to find the one with the highest $R^2$ value.

**Results: Choose the best regression model**

The regression models and their corresponding $R^2$ values are given below:

     MLR: 0.9325
     PLR: 0.9458
     SVR: 0.9481
     DTR: 0.9229
     RFR: 0.9616

We see that the RFR model has the highest $R^2$ value, so RFR is the big winner on this dataset!

So, to choose the best ML model, simply try all the models; the one with the highest $R^2$ value is the best model.
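The "try all the models" recipe can be sketched as a single loop. The version below is a simplified illustration rather than the templates themselves: it runs on synthetic data from `make_regression` so that it is self-contained, and the hyperparameters (`degree=2`, `kernel="rbf"`, `n_estimators=10`) are assumptions. The SVR pipeline here scales only the features, which may differ from the SVR template.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption); replace with X, y loaded from Data.csv.
X, y = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# One entry per model; the hyperparameters are illustrative choices.
models = {
    "MLR": LinearRegression(),
    "PLR": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    "SVR": make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    "DTR": DecisionTreeRegressor(random_state=0),
    "RFR": RandomForestRegressor(n_estimators=10, random_state=0),
}

# Fit each model on the training set and score it on the test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

best = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: {score:.4f}")
print("Best model:", best)
```

The scores printed by this sketch come from the synthetic data and will not match the values reported above for Data.csv.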