# EVALUATION OF MODEL PERFORMANCE
## METHOD 3: COEFFICIENT OF DETERMINATION, $R^{2}$<br/>
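In short, $R^2$ compares the model's squared error with the spread of the data itself: it ranges from no error at all (perfect) to an error as large as the variance of the data (horrible, no better than always predicting the mean).<p/>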
## $R^2=1-\frac{SSE}{SST}$
<p/>
Where:<br/>
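<B>SSE:</B> SUM OF SQUARED ERRORS $\sum_{i=1}^n (y_{i,\text{real}}-y_{i,\text{predicted}})^2$<BR/>
<B>SST:</B> TOTAL SUM OF SQUARES $\sum_{i=1}^n (y_{i,\text{real}}-\bar{y})^2$, where $\bar{y}$ is the mean of the real values. This is basically the variance $S^2$ (computed with denominator $n$) multiplied by $n$, the number of samples<p/>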
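
The relation between SST and the variance can be checked numerically. The following is a minimal sketch using only the Python standard library; the values in `y_real` are made up for illustration, and the variance is assumed to be the population variance (denominator $n$), not the sample variance (denominator $n-1$).

```python
import statistics

# Hypothetical target values, chosen only for illustration.
y_real = [24.0, 21.6, 34.7, 33.4, 36.2]
n = len(y_real)
mean_y = sum(y_real) / n

# SST: total sum of squared deviations from the mean.
sst = sum((y - mean_y) ** 2 for y in y_real)

# Population variance (denominator n) multiplied by n.
sst_from_variance = statistics.pvariance(y_real) * n

# The two numbers agree up to floating-point rounding.
print(sst, sst_from_variance)
```

If the sample variance is used instead (`statistics.variance`), the factor is $n-1$ rather than $n$.
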
Like the other performance metrics, the coefficient of determination shows how well the model predicts the dependent variable by comparing its predictions against the real values. $R^2$ is commonly read as the fraction of the variance of the dependent variable that the model explains, with 1.0 = 100% (perfect explanation) and 0 = 0% (not able to explain it at all).<br>
-The max value for $R^2$ is 1.0, which occurs when there is no error <b>(perfect prediction)</b>.<br/>
-$R^2$ has no minimum value: it is unbounded below and can be a large negative number when the model predicts worse than the mean of the data.<br/>
-Values between 0 and 1 indicate how well the model performs with respect to the variance of the data itself. When the mean of squared errors is equal to the variance, (SSE/n)=(SST/n), $R^2=0$, which means that the typical error is as wide as the standard deviation, which is horrible.
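
The three regimes listed above can be reproduced with a toy example. This is a minimal sketch in plain Python; the helper name `r_squared` and the numbers are hypothetical and only serve to illustrate the formula $R^2=1-\frac{SSE}{SST}$.

```python
def r_squared(y_real, y_pred):
    """Return 1 - SSE/SST for two equal-length sequences of numbers."""
    mean_y = sum(y_real) / len(y_real)
    sse = sum((yr - yp) ** 2 for yr, yp in zip(y_real, y_pred))
    sst = sum((yr - mean_y) ** 2 for yr in y_real)
    return 1 - sse / sst

y_real = [1.0, 2.0, 3.0, 4.0, 5.0]  # mean is 3.0, SST is 10.0

# Perfect prediction: SSE = 0.
perfect = r_squared(y_real, y_real)
# Always predicting the mean: SSE = SST.
mean_only = r_squared(y_real, [3.0] * 5)
# Predictions in reverse order: SSE = 40, four times SST.
worse = r_squared(y_real, [5.0, 4.0, 3.0, 2.0, 1.0])

print(perfect, mean_only, worse)  # prints 1.0 0.0 -3.0
```

The last case shows why the score has no lower bound: the worse the predictions, the larger SSE grows relative to SST.
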

## Load dataset into pandas dataframe

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression as LR
from sklearn.model_selection import train_test_split

In [2]:
cNames=['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV']
# Columns are separated by runs of whitespace; the raw string avoids an invalid-escape warning
df=pd.read_csv('housing.data',sep=r'\s+',names=cNames)

In [3]:
df.head()

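| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |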


## Split the data into training data and testing data
We are going to use the average number of rooms per dwelling (`RM`) as the only feature in this case, with the median home value (`MEDV`) as the target.

In [4]:
x=df['RM'].values
y=df['MEDV'].values
X=x.reshape(-1,1)

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

## Create and fit the model

In [6]:
lr_model=LR()

In [7]:
lr_model.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [8]:
y_train_pred=lr_model.predict(X_train)
y_test_pred=lr_model.predict(X_test)

## Coefficient of Determination using scikit-learn

In [9]:
from sklearn.metrics import r2_score

In [17]:
r2_train=r2_score(y_train,y_train_pred)
r2_test=r2_score(y_test,y_test_pred)
print('Coefficient of Determination with train data=%f\nCoefficient of determination with test data=%f'%(r2_train,r2_test))

Coefficient of Determination with train data=0.497080
Coefficient of determination with test data=0.423944


## Coefficient of Determination manually using the above formula

In [11]:
# Mean of the real values in each split
d_mean_train=np.mean(y_train)
d_mean_test=np.mean(y_test)
# SSE: sum of squared differences between real and predicted values
SSE_train=np.sum((y_train-y_train_pred)**2)
SSE_test=np.sum((y_test-y_test_pred)**2)
# SST: sum of squared differences between real values and their mean
SST_train=np.sum((y_train-d_mean_train)**2)
SST_test=np.sum((y_test-d_mean_test)**2)

In [12]:
r2_train=1-(SSE_train/SST_train)
r2_test=1-(SSE_test/SST_test)
print('Coefficient of Determination with train data=%f\nCoefficient of determination with test data=%f'%(r2_train,r2_test))

Coefficient of Determination with train data=0.497080
Coefficient of determination with test data=0.423944
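
The manual steps above could be wrapped in a reusable helper. The following is a minimal sketch, not part of the original notebook: the name `r2_manual` is hypothetical, the data is synthetic (made-up numbers standing in for the `RM` and `MEDV` columns so that the block runs without `housing.data`), and the line is fitted with `np.polyfit` instead of scikit-learn.

```python
import numpy as np

def r2_manual(y_real, y_pred):
    """Coefficient of determination, 1 - SSE/SST, for 1-D arrays."""
    y_real = np.asarray(y_real, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sse = np.sum((y_real - y_pred) ** 2)
    sst = np.sum((y_real - y_real.mean()) ** 2)
    return 1.0 - sse / sst

# Synthetic stand-in data (assumption: a noisy linear relation).
rng = np.random.default_rng(0)
x = rng.uniform(4.0, 8.0, size=100)
y = 9.0 * x - 34.0 + rng.normal(0.0, 5.0, size=100)

# Least-squares line fitted with NumPy.
slope, intercept = np.polyfit(x, y, deg=1)
y_pred = slope * x + intercept

r2 = r2_manual(y, y_pred)
print('Coefficient of Determination on synthetic data=%f' % r2)
```

For a straight-line fit with an intercept, evaluated on the same data it was fitted to, this value equals the squared correlation between `x` and `y`, which gives a quick sanity check on the helper.
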
