# Regression Toolkit

We now have a complete Regression Toolkit that gives us many options for future machine learning problems. The next step is to learn how to use this toolkit efficiently.

To select the Regression Model best suited to a real-world dataset, we first need to evaluate the performance of each of our Regression Models. Based on the performance metrics, we can then select the best Regression Model for that dataset.

Performance Evaluation of a Regression Model can be done using:

1.   $R^2$ Measure
2.   Adjusted $R^2$ Measure



## Intuition behind $R^2$ Measure

Taking Simple Linear Regression as an example, we can see how the model finds the best fit line by minimizing the sum of squares of deviations. This sum of squares is denoted as Sum of Squares of Residuals, $SS_{res}$.

![R-Squared Measure - Intuition](07-Regr-Performance-Evaluation-01.PNG)

To find the $R^2$ Measure, an average line (a horizontal line corresponding to the average salary across all observations) is drawn. The sum of squares of deviations from this average line is denoted as the Total Sum of Squares, $SS_{tot}$. The average line is only a horizontal trend line, but we can think of it as a very simple model fitted to our dataset, although not the best one.

![R-Squared Measure - Intuition](07-Regr-Performance-Evaluation-02.PNG)

The Regression Model fits the line that makes $SS_{res}$ as small as possible.
The $R^2$ value, defined as $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$, tells us how good our fitted line is compared to the average line. In the ideal scenario, if $SS_{res} = 0$ (i.e. the fitted line goes through all points in the dataset), $R^2 = 1$. The closer $R^2$ gets to 1, the better the model fits the data.
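
The relationship between $SS_{res}$, $SS_{tot}$ and $R^2$ can be sketched in a few lines of Python. This is a minimal illustration using the standard definition $R^2 = 1 - SS_{res}/SS_{tot}$. The function name `r_squared` and the sample values are assumptions made up for this example, not part of the toolkit.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Return R^2 = 1 - SS_res / SS_tot for observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # deviations from the fitted line
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # deviations from the average line
    return 1.0 - ss_res / ss_tot

# Made-up salary-style values, for illustration only.
y = np.array([30.0, 35.0, 42.0, 50.0, 61.0])

perfect = r_squared(y, y)                           # fitted line hits every point
baseline = r_squared(y, np.full_like(y, y.mean()))  # "model" that predicts the average
print(perfect, baseline)
```

A perfect fit gives $R^2 = 1$, while a model that only predicts the average gives $R^2 = 0$, which matches the interpretation of $R^2$ as an improvement over the average line.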



## Intuition behind Adjusted $R^2$ Measure

Let's say we already have a Regression Model with two features. We now want to add more variables to our model to make it better.

![Adjusted R-Squared Measure - Intuition](07-Regr-Performance-Evaluation-03.PNG)

Adding a new variable can either increase $R^2$ or leave it unchanged, but $R^2$ will never decrease. The reason is that the model minimizes $SS_{res}$: in the worst case it can give the new variable a coefficient of zero, so $SS_{res}$ cannot grow.
Even if the added variable provides no real improvement beyond some random correlation with the dependent variable, $R^2$ may still increase. Because of this bias in $R^2$ (it never decreases, regardless of actual improvement), we cannot tell from $R^2$ alone whether the added variables are actually helping the model. So we need another metric to measure the goodness of fit of a model. This is where Adjusted $R^2$ comes into the picture.
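
Before turning to Adjusted $R^2$, the bias described above can be checked numerically. The sketch below fits ordinary least squares with NumPy on synthetic data, once with two informative features and once with an extra feature that is pure noise. The helper `fit_r_squared`, the coefficients, and the data are all assumptions made up for this illustration.

```python
import numpy as np

def fit_r_squared(X, y):
    """Fit ordinary least squares with an intercept and return the training R^2."""
    A = np.column_stack([np.ones(len(y)), X])     # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares coefficients
    residuals = y - A @ coef
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))                       # two informative features
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)
noise = rng.normal(size=(n, 1))                   # unrelated to y by construction

r2_two = fit_r_squared(X, y)
r2_three = fit_r_squared(np.column_stack([X, noise]), y)
print(f"2 features: {r2_two:.4f}, plus noise feature: {r2_three:.4f}")
```

On the training data, the second value should not be lower than the first, even though the third feature carries no information about `y`.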

![R-Squared Measure - Intuition](07-Regr-Performance-Evaluation-04.PNG)

Adjusted $R^2$ has a penalization factor: it penalizes you for adding independent variables that don't help your model. Adding regressors (i.e. independent variables) pulls Adjusted $R^2$ in two directions. The increase in $p$, the number of regressors, lowers it, while the accompanying increase in $R^2$ raises it. If the new variable doesn't help the model, the increase in $R^2$ is minimal and is outweighed by the effect of the larger $p$, so Adjusted $R^2$ decreases, penalizing the added variable. If, on the other hand, the new variable helps the model a lot, $R^2$ increases significantly and outweighs the effect of the larger $p$, so Adjusted $R^2$ increases and overcomes the penalty.

Thus Adjusted $R^2$ is a useful metric for judging whether the variables you add to a model are actually improving it.
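
The trade-off between $R^2$ and $p$ can be made concrete with the commonly used formula $\text{Adj } R^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$. Here $n$ is assumed to denote the number of observations and $p$ the number of regressors. The function name and the $R^2$ values below are hypothetical numbers chosen only to illustrate the two cases.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    r2: ordinary R^2, n: number of observations, p: number of regressors.
    """
    if n - p - 1 <= 0:
        raise ValueError("need more observations than regressors plus one")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Hypothetical values: a model with 2 regressors, then a third one is added.
n = 50
before = adjusted_r_squared(0.900, n, p=2)
weak = adjusted_r_squared(0.901, n, p=3)    # new variable barely raises R^2
strong = adjusted_r_squared(0.950, n, p=3)  # new variable raises R^2 a lot

print(f"before: {before:.4f}, weak: {weak:.4f}, strong: {strong:.4f}")
```

With these numbers, the weak variable lowers Adjusted $R^2$ even though $R^2$ itself went up slightly, while the strong variable raises it.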

## Model Selection

After going through the intuition behind Performance Evaluation of Regression Models, our focus is now on Model Selection. How do we know which regression model to choose for a particular problem/dataset?

We are going to work on a real-world dataset with 9000+ samples and multiple features. We will first evaluate the performance of each Regression Model we have already discussed, and then select the one best suited to the dataset.
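
The evaluate-then-select workflow can be sketched as follows. This is only an illustration under stated assumptions: the two candidates (a linear and a quadratic least-squares fit) are stand-ins for the toolkit's models, the data is synthetic, and scoring on a held-out test split is one common way to run the comparison.

```python
import numpy as np

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def linear_features(X):
    return np.column_stack([np.ones(len(X)), X])

def quadratic_features(X):
    return np.column_stack([np.ones(len(X)), X, X ** 2])

def evaluate(make_features, X_train, y_train, X_test, y_test):
    """Fit least squares on the training split, return R^2 on the test split."""
    coef, *_ = np.linalg.lstsq(make_features(X_train), y_train, rcond=None)
    return r_squared(y_test, make_features(X_test) @ coef)

# Synthetic stand-in data with a non-linear term (not the real dataset).
rng = np.random.default_rng(42)
X = rng.uniform(-2, 2, size=(500, 2))
y = 1.5 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.3, size=500)

# Simple hold-out split: first 400 rows for training, the rest for testing.
split = 400
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

candidates = {"linear": linear_features, "quadratic": quadratic_features}
scores = {
    name: evaluate(make_features, X_train, y_train, X_test, y_test)
    for name, make_features in candidates.items()
}
best = max(scores, key=scores.get)  # candidate with the highest test R^2
print(scores, "->", best)
```

The same loop extends to any number of candidates: each one is fitted on the same training split and scored on the same test split, so the scores are directly comparable.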


## Combined Cycle Power Plant Dataset

This is an open dataset from the Machine Learning Repository of the Center for Machine Learning and Intelligent Systems at the University of California, Irvine [(Dataset Link)](https://archive.ics.uci.edu/ml/datasets/combined+cycle+power+plant). The dataset contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011). During this period, the power plant was set to work at full load. The features are the hourly average ambient variables Ambient Temperature (AT), Ambient Pressure (AP), Relative Humidity (RH), and Exhaust Vacuum (V), which are used to predict the net hourly electrical energy output (PE) of the plant.

The variables and their ranges are:
- Ambient Temperature (AT) in the range 1.81°C to 37.11°C
- Exhaust Vacuum (V) in the range 25.36 to 81.56 cm Hg
- Ambient Pressure (AP) in the range 992.89 to 1033.30 millibar
- Relative Humidity (RH) in the range 25.56% to 100.16%
- Net hourly electrical energy output (PE), the target variable, in the range 420.26 to 495.76 MW

The averages are taken from various sensors located around the plant that record the ambient variables every second. The variables are given without normalization.

Given the attributes, our objective is to predict Net hourly electrical energy output (PE).

**Important Note 1:** This dataset contains no missing data and no categorical data. Hence, the corresponding Data Preprocessing steps (taking care of missing data and encoding categorical data) are not required before training the models.

**Important Note 2:**

"Regression-Toolkit-Templates" folder contains templates of all the Regression Models we have discussed (except Simple Linear Regression which will not be useful for a dataset with multiple features).

"Regression-Toolkit-Example" folder contains all the above models trained with the Combined Cycle Power Plant Dataset.