## Evaluation Metrics

### R Squared

Measures how well the model fits the data compared with the baseline of always predicting the average (the "average line").

${\bf E}_{LS} = \sum (  y_{i} - \check{y}_{i} )^{2}$

where $y_{i}$ are the actual results and $\check{y}_{i}$ are the predicted results.

${\bf E}_{TOT} = \sum (  y_{i} - y_{avg} )^{2}$

where $y_{avg}$ is the average of the actual results.

${\bf R}^{2} = 1 - \frac{{\bf E}_{LS}}{{\bf E}_{TOT}} $

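${\bf R}^{2} = 1 \implies$ perfect predictor (${\bf E}_{LS} = 0$).

The further ${\bf R}^{2}$ falls below 1, the worse the predictions.

**Note:** ${\bf R}^{2}$ can be negative. This happens when the model fits worse than the average line, that is, when ${\bf E}_{LS} > {\bf E}_{TOT}$.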
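As a minimal sketch of the definition above, the following pure-Python function computes R squared as one minus the ratio of the model's squared error to the squared error of the average line. The function name `r_squared` and the toy numbers are illustrative assumptions, not part of the original notebook.

```python
def r_squared(y_actual, y_predicted):
    """Return 1 - E_LS / E_TOT for two equal-length sequences."""
    y_avg = sum(y_actual) / len(y_actual)
    # Squared error of the model's predictions
    e_ls = sum((y - y_hat) ** 2 for y, y_hat in zip(y_actual, y_predicted))
    # Squared error of always predicting the average
    e_tot = sum((y - y_avg) ** 2 for y in y_actual)
    return 1 - e_ls / e_tot


y_actual = [1.0, 2.0, 3.0, 4.0]  # toy data, made up for illustration

perfect = r_squared(y_actual, [1.0, 2.0, 3.0, 4.0])   # exact predictions
baseline = r_squared(y_actual, [2.5, 2.5, 2.5, 2.5])  # always predict the average
worse = r_squared(y_actual, [4.0, 3.0, 2.0, 1.0])     # worse than the average line

print(perfect, baseline, worse)  # prints: 1.0 0.0 -3.0
```

The third case shows how a negative value arises: the reversed predictions have a larger squared error than the average line does.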

### Adjusted R Squared

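When multiple independent variables are present, ${\bf R}^{2}$ alone cannot tell us whether a newly added variable is helping the model, because ${\bf R}^{2}$ never decreases when a variable is added.

Adjusted ${\bf R}^{2}$ penalizes adding variables that don't help.

$\text{Adj } {\bf R}^{2} =  1 - (1-{\bf R}^{2})\frac{n-1}{n-p-1}$

where $p$ is the number of independent variables and $n$ is the sample size. When $p=0$, the penalty factor is 1 and ${\bf R}^{2}$ is recovered.

So when more variables are added, this metric won't increase unless the new variable reduces the unexplained fraction $1-{\bf R}^{2}$ by enough to offset the larger penalty factor $\frac{n-1}{n-p-1}$. Going from $p$ to $p+1$ variables, $1-{\bf R}^{2}$ must fall to less than $\frac{n-p-2}{n-p-1}$ of its previous value.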
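The formula can be checked numerically with a short sketch. The helper name `adjusted_r_squared` is hypothetical, and the sample size `n = 50` is an assumption inferred from the dataset file name `50_Startups.csv`; the R squared value is the one reported for the five-variable model later in this section.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R squared for sample size n and p independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


n = 50  # assumed sample size, inferred from the file name 50_Startups.csv

# With p = 0 the penalty factor (n - 1) / (n - p - 1) equals 1.
no_penalty = adjusted_r_squared(0.9, n, 0)

# R squared reported for the model with all 5 variables.
adj_5 = adjusted_r_squared(0.9507524843355148, n, 5)

print(no_penalty)
print(adj_5)
```

Under that assumption, `adj_5` agrees with the adjusted R squared that statsmodels reports for the same model (about 0.945156).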

### Revisiting the problem used for Multiple Linear Regression

In [17]:
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Importing the libraries
import numpy as np
import pandas as pd

np.set_printoptions(precision=3)
np.set_printoptions(suppress=True) # Otherwise prints in scientific format


# Importing the dataset
dataset = pd.read_csv('50_Startups.csv')

# Features are every column except the last; the target is Profit (column 4)
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

# Encoding categorical data: one-hot encode State (column 3).
# OneHotEncoder(categorical_features=...) was removed in scikit-learn 0.22,
# so ColumnTransformer is used instead. The encoded columns come first and
# the remaining columns are passed through unchanged.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
transformer = ColumnTransformer([('state', OneHotEncoder(), [3])],
                                remainder='passthrough', sparse_threshold=0)
X = np.array(transformer.fit_transform(X), dtype=float)


# Avoiding dummy variable trap
X = X[:, 1:]

print("Partial Data")
dataset[:5]

Partial Data

| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


#### All variables included

In [18]:
import statsmodels.api as sm  # OLS lives in statsmodels.api, not statsmodels.formula.api

# Prepend a column of ones so the model has an intercept term
X_b = np.append(arr = np.ones((len(X),1)), values = X, axis=1)
# Columns: 0 intercept, 1-2 State dummies, 3 R&D Spend, 4 Administration, 5 Marketing Spend
X_opt = X_b[:,[0,1,2,3,4,5]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
print("Num Variables: 5", "\nR squared :", regressor_OLS.rsquared,"\nAdj R squared :", regressor_OLS.rsquared_adj)

Num Variables: 5 
R squared : 0.9507524843355148 
Adj R squared : 0.945156175737278


#### Eliminating variables by backward elimination

Significance level: 0.05

In [19]:
import statsmodels.api as sm  # OLS lives in statsmodels.api, not statsmodels.formula.api

X_b = np.append(arr = np.ones((len(X),1)), values = X, axis=1)
# Drop column 2 (a State dummy); column 0 is the intercept
X_opt = X_b[:,[0,1,3,4,5]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
print("Num Variables: 4", "\nR squared :", regressor_OLS.rsquared,"\nAdj R squared :", regressor_OLS.rsquared_adj)

Num Variables: 4 
R squared : 0.9507522991055133 
Adj R squared : 0.94637472569267


In [20]:
import statsmodels.api as sm  # OLS lives in statsmodels.api, not statsmodels.formula.api

X_b = np.append(arr = np.ones((len(X),1)), values = X, axis=1)
# Drop column 1 as well (the remaining State dummy)
X_opt = X_b[:,[0,3,4,5]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
print("Num Variables: 3", "\nR squared :", regressor_OLS.rsquared,"\nAdj R squared :", regressor_OLS.rsquared_adj)

Num Variables: 3 
R squared : 0.9507459940683246 
Adj R squared : 0.9475337762901719


In [21]:
import statsmodels.api as sm  # OLS lives in statsmodels.api, not statsmodels.formula.api

X_b = np.append(arr = np.ones((len(X),1)), values = X, axis=1)
# Drop column 4 (Administration); R&D Spend and Marketing Spend remain
X_opt = X_b[:,[0,3,5]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
print("Num Variables: 2", "\nR squared :", regressor_OLS.rsquared,"\nAdj R squared :", regressor_OLS.rsquared_adj)

Num Variables: 2 
R squared : 0.9504503015559763 
Adj R squared : 0.9483418037498477


In [22]:
import statsmodels.api as sm  # OLS lives in statsmodels.api, not statsmodels.formula.api

X_b = np.append(arr = np.ones((len(X),1)), values = X, axis=1)
# Drop column 5 (Marketing Spend); only R&D Spend remains
X_opt = X_b[:,[0,3]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
print("Num Variables: 1", "\nR squared :", regressor_OLS.rsquared,"\nAdj R squared :", regressor_OLS.rsquared_adj)

Num Variables: 1 
R squared : 0.9465353160804393 
Adj R squared : 0.9454214684987817


**Analysis:** 

As variables are eliminated, R squared either decreases or stays the same, so it is highest when all variables are included. Adjusted R squared, in contrast, is highest with 2 variables. Strictly following backward elimination with a significance level of 0.05 would nevertheless have left only 1 variable.
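The comparison can be made explicit with a small sketch that tabulates the figures printed by the cells above and picks the best model under each metric. The variable names are illustrative; the numbers are copied from the outputs in this section.

```python
# (number of variables, R squared, adjusted R squared) as printed above,
# listed in the order the variables were eliminated
results = [
    (5, 0.9507524843355148, 0.945156175737278),
    (4, 0.9507522991055133, 0.94637472569267),
    (3, 0.9507459940683246, 0.9475337762901719),
    (2, 0.9504503015559763, 0.9483418037498477),
    (1, 0.9465353160804393, 0.9454214684987817),
]

best_by_r2 = max(results, key=lambda row: row[1])[0]
best_by_adj_r2 = max(results, key=lambda row: row[2])[0]

# R squared should never go up as variables are removed
r2_values = [row[1] for row in results]
r2_never_increases = all(a >= b for a, b in zip(r2_values, r2_values[1:]))

print("Best by R squared:", best_by_r2)               # 5
print("Best by adjusted R squared:", best_by_adj_r2)  # 2
print("R squared never increases:", r2_never_increases)  # True
```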

### Takeaway

Use the **p-value** to find out which variable to eliminate.

Use the **adjusted R squared** value to find out if your model is improving.