# Regression Toolkit

We now have a complete Regression Toolkit that gives us many options for future machine learning problems, so we can turn to using it efficiently. We first look at how to evaluate the performance of a Regression Model, and then at how to select the best Regression Model for a given dataset.

Performance Evaluation of a Regression Model can be done using:

1.   $R^2$ Measure
2.   Adjusted $R^2$ Measure



## Intuition behind $R^2$ Measure

Taking Simple Linear Regression as an example, the model finds the best-fit line by minimizing the sum of squared deviations between the observed values and the line. This sum is called the Sum of Squares of Residuals, $SS_{res}$.

![R-Squared Measure - Intuition](07-Regr-Performance-Evaluation-01.PNG)

To find the $R^2$ Measure, we draw an average line (a horizontal line at the average salary across all observations). The sum of squared deviations from this average line is the Total Sum of Squares, $SS_{tot}$. Although the average line is just a horizontal trend line, we can think of it as a model fitted to our dataset, just not the best one.

![R-Squared Measure - Intuition](07-Regr-Performance-Evaluation-02.PNG)

The Regression Model fits the line that makes $SS_{res}$ as small as possible.
The $R^2$ value tells us how good this fitted line is compared to the average line, by comparing $SS_{res}$ with $SS_{tot}$. In the ideal scenario, if $SS_{res} = 0$ (i.e. the fitted line goes through every point in the dataset), $R^2 = 1$. The closer $R^2$ gets to 1, the better the model fits the data.
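
The comparison described above is conventionally written as $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$. The following is a minimal sketch in plain Python, assuming that standard definition. The salary values and predictions are made up for illustration and do not come from the dataset used here; `r_squared` is a hypothetical helper name.

```python
def r_squared(y_true, y_pred):
    """Return R^2 = 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_y = sum(y_true) / len(y_true)
    # Sum of squared deviations from the model's predictions
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    # Sum of squared deviations from the average line
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical values, for illustration only
y_true = [45.0, 50.0, 60.0, 80.0, 110.0]
y_fit = [44.0, 52.0, 61.0, 78.0, 111.0]            # predictions of some fitted model
y_avg = [sum(y_true) / len(y_true)] * len(y_true)  # the "average line" as a model

r2_fit = r_squared(y_true, y_fit)       # close to 1: far better than the average line
r2_avg = r_squared(y_true, y_avg)       # 0: no better than the average line
r2_perfect = r_squared(y_true, y_true)  # 1: SS_res is zero
```

In practice, `sklearn.metrics.r2_score` computes the same quantity.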



## Intuition behind Adjusted $R^2$ Measure

Let's say we already have a Regression Model with two features. We now want to add more variables to our model to make it better.

![Adjusted R-Squared Measure - Intuition](07-Regr-Performance-Evaluation-03.PNG)

Adding a new variable either increases $R^2$ or leaves it unchanged; $R^2$ never decreases.
Even if the added variable provides no real improvement to the model beyond some random correlation, $R^2$ may increase. Because of this bias in $R^2$ (it never decreases, regardless of actual improvement), $R^2$ alone cannot tell us whether added variables are actually helping the model. So, we need another measure of the goodness of fit of a model. This is where Adjusted $R^2$ comes into the picture.

![R-Squared Measure - Intuition](07-Regr-Performance-Evaluation-04.PNG)

Adjusted $R^2$ has a Penalization Factor: it penalizes you for adding independent variables that don't help your model. Here $p$ denotes the number of regressors (i.e. independent variables) in the model. Adding a regressor pulls Adjusted $R^2$ in two directions: the increase in $p$ pushes it down, while the accompanying increase in $R^2$ pushes it up. If the new variable doesn't help the model, the increase in $R^2$ is minimal and its effect is smaller than that of the increase in $p$, so Adjusted $R^2$ decreases, penalizing the added variable. If, on the other hand, the new variable helps the model a lot, $R^2$ increases significantly and its effect outweighs that of the increase in $p$, so Adjusted $R^2$ increases, overwhelming the Penalization Factor.

Thus Adjusted $R^2$ is a very good metric that helps in understanding whether you are adding good variables to a model or not.
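
The trade-off can be made concrete with the commonly used definition $\text{Adj } R^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $n$ is the number of observations and $p$ the number of regressors. The sketch below assumes that definition; the $R^2$ values and sample size are invented for illustration, and `adjusted_r_squared` is a hypothetical helper name.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for n observations and p regressors (requires n > p + 1)."""
    if n <= p + 1:
        raise ValueError("need more observations than regressors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Hypothetical numbers, for illustration only
n = 50
base = adjusted_r_squared(0.900, n, p=2)    # model with two features
weak = adjusted_r_squared(0.901, n, p=3)    # third feature barely raises R^2
strong = adjusted_r_squared(0.950, n, p=3)  # third feature raises R^2 a lot

print(f"base={base:.4f} weak={weak:.4f} strong={strong:.4f}")
```

With these numbers, the weak variable lowers Adjusted $R^2$ below the two-feature baseline even though $R^2$ itself went up, while the strong variable raises it.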

## Problem Statement

We have a dataset containing salary data of employees of different position levels (1 to 10) in a company. Here,


*   Independent Variable/Feature = Level
*   Dependent/Target Variable = Salary

Here, Salary and Level have a non-linear relationship. We want to build a model that can predict the salary given the level of an employee.

**Important Note:** Random Forest Regression works well with complex datasets that have multiple features, but it is not well suited to 2D datasets with only one Feature and one Target Variable. Here, we use a simple dataset with only one feature so that both X and y can be visualized in a 2D plot. Hence this dataset is not ideal for showing the merits of Random Forest Regression.

## Importing the libraries

## Importing the dataset

## Taking care of missing data

Here we have no missing data in the dataset.

**Important Note:** If missing data accounts for less than 1% of the dataset, we can discard the affected observations. In all other cases, we have to replace the missing data. Missing values can be replaced with the mean, the median, the most frequent value or a constant using `SimpleImputer` from `sklearn.impute`. Other options include `IterativeImputer`, `KNNImputer` and `MissingIndicator`.
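
As a rough sketch of mean imputation with `SimpleImputer`, consider the small feature matrix below. It is hypothetical (the dataset in this section has no missing values) and is only meant to show the call pattern.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix; np.nan marks the missing entries
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [np.nan, 30.0],
              [4.0, 50.0]])

# Replace each missing value with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X)

print(X_filled)
```

Other values accepted by `strategy` are `"median"`, `"most_frequent"` and `"constant"`.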



## Encoding categorical data

If any of the columns contain categorical data, you should apply Encoding to convert them into numerical data. Use

*   One Hot Encoding if you know there is no ordered relationship in your categorical variable (e.g. Country, State etc.)
*   Label Encoding if there is an ordered relationship (e.g. Position Levels in a company, Purchase Decisions etc.)
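
The two cases above might look as follows in scikit-learn. This is a sketch under assumptions: the country and position names are invented, and `OrdinalEncoder` with an explicit category order is swapped in for the ordered case, because `LabelEncoder` is intended for target labels and assigns codes in sorted order rather than by rank.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Unordered category: one binary column per country (hypothetical values)
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])
onehot = OneHotEncoder()
country_encoded = onehot.fit_transform(countries).toarray()

# Ordered category: integer codes that follow the stated rank (hypothetical values)
positions = np.array([["Manager"], ["Analyst"], ["Director"]])
ordinal = OrdinalEncoder(categories=[["Analyst", "Manager", "Director"]])
position_encoded = ordinal.fit_transform(positions)

print(country_encoded)
print(position_encoded.ravel())
```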



In the dataset, you can see that the 'Position' column is effectively encoded in the 'Level' column, so no explicit encoding is required. Since the 'Position' column is therefore redundant, it is not included in the Feature Matrix X.

## Splitting the dataset into the Training set and Test set

Here, we want to predict the salary for a given Employee Level (treated as a continuous real value that varies from 1 to 10). For accurate predictions, we want the model to see the data for all levels from 1 to 10. Hence, in this scenario, we do not split the dataset.

## Feature Scaling

Since predictions from a Decision Tree Regression Model or a Random Forest Regression Model result from successive splits of the dataset, neither model requires Feature Scaling. Hence, it is not applied.
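
One way to check this claim is to fit the same tree on raw and on standardized features and compare the predictions. The sketch below does so on invented level and salary values (not the actual dataset), with `max_depth=2` chosen arbitrarily to keep the tree small.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: levels 1 to 10 and made-up salaries (in thousands)
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = np.array([45, 50, 60, 80, 110, 150, 200, 300, 500, 1000], dtype=float)

X_scaled = StandardScaler().fit_transform(X)

tree_raw = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
tree_scaled = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_scaled, y)

# Splits depend only on the ordering of feature values, which scaling preserves
same_predictions = np.allclose(tree_raw.predict(X), tree_scaled.predict(X_scaled))
print(same_predictions)
```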

## Training the Random Forest Regression model on the whole dataset

## Predicting a new result
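
A minimal sketch of training on the whole dataset and predicting a single new value is shown below. The salary values, the choice of `n_estimators=10`, the fixed `random_state` and the queried level of 6.5 are all assumptions made for illustration and are not taken from the original dataset or code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical data: levels 1 to 10 and made-up salaries (in thousands)
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = np.array([45, 50, 60, 80, 110, 150, 200, 300, 500, 1000], dtype=float)

# Train on the whole dataset (no train/test split, no feature scaling)
regressor = RandomForestRegressor(n_estimators=10, random_state=0)
regressor.fit(X, y)

# predict expects a 2D array of shape (n_samples, n_features)
new_level = np.array([[6.5]])
predicted_salary = regressor.predict(new_level)
print(predicted_salary)
```

Note the double brackets in `[[6.5]]`: even a single query has to be passed as a 2D array with one row and one column.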

## Visualising the Random Forest Regression results (higher resolution)

Here, you can see that the number of steps between consecutive position levels has increased to 2, compared to 1 in the Decision Tree Regression Model. This is because the Random Forest Regression Model averages the predictions of a forest of trees, each of which places its steps at slightly different points.
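
The difference in step count can also be checked numerically by counting how many distinct values each model predicts over a fine grid of levels. This is a sketch on invented data with an assumed forest of 10 trees; the exact count for the forest depends on the data and on the random seed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: levels 1 to 10 and made-up salaries (in thousands)
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = np.array([45, 50, 60, 80, 110, 150, 200, 300, 500, 1000], dtype=float)

tree = DecisionTreeRegressor(random_state=0).fit(X, y)
forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Same kind of fine grid that a higher-resolution plot would use
X_grid = np.arange(1.0, 10.0, 0.01).reshape(-1, 1)

# Each distinct predicted value corresponds to one flat step in the plot
tree_levels = np.unique(tree.predict(X_grid)).size
forest_levels = np.unique(forest.predict(X_grid)).size
print(tree_levels, forest_levels)
```

A fully grown single tree yields one flat step per training level, whereas the averaged forest typically yields more, smaller steps.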