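Linear regression is often described as assuming that the input variables have a Gaussian
distribution; strictly, the normality assumption applies to the residuals, although roughly
Gaussian inputs often help in practice. It is also assumed that the input variables are relevant
to the output variable and that they are not highly correlated with each other (a problem called
collinearity).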

Three of the most common metrics for evaluating predictions on regression
machine learning problems are: <br/> <br/>
Mean Absolute Error <br/>
Mean Squared Error <br/>
$R^2$

# Mean Absolute Error

The Mean Absolute Error (or MAE) is the average of the absolute differences between predictions
and actual values. The measure gives an
idea of the magnitude of the error, but no idea of the direction (e.g. over- or under-predicting).
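To make the definition concrete, here is a minimal from-scratch sketch of the calculation. The toy values and the helper name `mae_from_scratch` are illustrative assumptions, not part of the housing example.

```python
# Hypothetical toy values, not taken from the housing dataset
actual = [3.0, 5.0, 2.5, 7.0]
predicted = [2.5, 5.0, 4.0, 8.0]

def mae_from_scratch(y_true, y_pred):
    """Average of |actual - predicted| over all samples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

mae = mae_from_scratch(actual, predicted)
print("MAE: %.3f" % mae)  # absolute errors are 0.5, 0.0, 1.5, 1.0
```

In practice, `sklearn.metrics.mean_absolute_error` computes the same quantity.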

## Calculating mean absolute error on the Boston house price dataset

In [7]:
# Cross Validation Regression MAE
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
filename = 'housing.csv'
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
# Columns are whitespace-separated; sep=r'\s+' replaces the deprecated delim_whitespace=True
dataframe = read_csv(filename, sep=r'\s+', names=names)
array = dataframe.values

In [8]:
X = array[:,0:13]  # first 13 columns are the input variables
Y = array[:,13]    # last column (MEDV) is the output variable

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html <br/>
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html <br/>
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html <br/>
https://scikit-learn.org/stable/modules/model_evaluation.html

In [9]:
# random_state only takes effect with shuffle=True; recent scikit-learn versions raise an error otherwise
kfold = KFold(n_splits=10)
model = LinearRegression()
scoring = 'neg_mean_absolute_error'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("MAE: %.3f (%.3f)" % (results.mean(), results.std()))

MAE: -4.005 (2.084)




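The score is negative because scikit-learn negates error metrics so that a higher score is
always better; the MAE here is about 4.005. When summarizing performance measures, it is good
practice to summarize the distribution of the measures, in this case assuming a Gaussian
distribution of performance (a reasonable working assumption) and recording the mean and
standard deviation.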

In [10]:
results

array([-2.20686845, -2.89680909, -2.78673044, -4.59847835, -4.10986504,
       -3.56469238, -2.66966723, -9.65637767, -5.02272517, -2.53725254])
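The per-fold scores above are negated MAE values. As a small sketch, flipping the sign recovers the usual positive MAE per fold; the variable names here are illustrative, and the standard library's `pstdev` is used because it matches the population standard deviation that NumPy's `std()` computes by default.

```python
from statistics import mean, pstdev

# Per-fold negated MAE scores copied from the output above
neg_mae_scores = [-2.20686845, -2.89680909, -2.78673044, -4.59847835, -4.10986504,
                  -3.56469238, -2.66966723, -9.65637767, -5.02272517, -2.53725254]

# Flip the sign to get the conventional (positive) MAE for each fold
mae_scores = [-s for s in neg_mae_scores]

mae_mean = mean(mae_scores)
mae_std = pstdev(mae_scores)  # population standard deviation
print("MAE: %.3f (%.3f)" % (mae_mean, mae_std))
```

The eighth fold (about 9.66) is well above the rest, which is why the standard deviation is large relative to the mean.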

# Mean Squared Error

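The Mean Squared Error (or MSE) is much like the mean absolute error in that it provides a
gross idea of the magnitude of error. Because the differences are squared before averaging,
large errors are penalized more heavily than small ones. Taking the square root of the mean
squared error converts the units back to the original units of the output variable and can be
meaningful for description and presentation. This is called the Root Mean Squared Error (or RMSE).
As with MAE, scikit-learn reports the negated value under the 'neg_mean_squared_error' scoring name.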
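The relationship between MSE and RMSE can be sketched from scratch as follows. The toy values and the helper name `mse_from_scratch` are assumptions for illustration only.

```python
import math

# Hypothetical toy values, not taken from the housing dataset
actual = [3.0, 5.0, 2.5, 7.0]
predicted = [2.5, 5.0, 4.0, 8.0]

def mse_from_scratch(y_true, y_pred):
    """Average of (actual - predicted)^2 over all samples."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

mse = mse_from_scratch(actual, predicted)
rmse = math.sqrt(mse)  # back in the units of the output variable
print("MSE: %.3f, RMSE: %.3f" % (mse, rmse))
```

Applied to a cross-validation result, the same conversion would take the square root of the positive MSE, not of the negated score.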

In [12]:
# Cross Validation Regression MSE
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
filename = 'housing.csv'
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
# Columns are whitespace-separated; sep=r'\s+' replaces the deprecated delim_whitespace=True
dataframe = read_csv(filename, sep=r'\s+', names=names)
array = dataframe.values
X = array[:,0:13]
Y = array[:,13]
# random_state only takes effect with shuffle=True; recent scikit-learn versions raise an error otherwise
kfold = KFold(n_splits=10)
model = LinearRegression()
scoring = 'neg_mean_squared_error'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("MSE: %.3f (%.3f)" % (results.mean(), results.std()))

MSE: -34.705 (45.574)




# $R^2$ Metric

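The $R^2$
(or R Squared) metric provides an indication of the goodness of fit of a set of predictions
to the actual values. In statistical literature this measure is called the coefficient of determination.
A value of 1 indicates a perfect fit and a value of 0 indicates a model that does no better than
always predicting the mean of the output variable. On held-out data the value can also be
negative, when the model does worse than that baseline.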
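A from-scratch sketch of the usual definition, $R^2 = 1 - SS_{res} / SS_{tot}$, shows how the value behaves for good, baseline, and poor predictions. All values and names below are illustrative assumptions.

```python
def r2_from_scratch(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1 - ss_res / ss_tot

# Hypothetical toy values, not taken from the housing dataset
actual = [3.0, 5.0, 2.5, 7.0]
good_pred = [2.5, 5.0, 4.0, 8.0]   # close to the actual values
mean_pred = [4.375] * 4            # always predicts the mean of `actual`
bad_pred = [7.0, 2.0, 6.0, 3.0]    # worse than predicting the mean

r2_good = r2_from_scratch(actual, good_pred)
r2_mean = r2_from_scratch(actual, mean_pred)
r2_bad = r2_from_scratch(actual, bad_pred)
print("good: %.3f, mean baseline: %.3f, bad: %.3f" % (r2_good, r2_mean, r2_bad))
```

The last case comes out negative, which is how a cross-validated mean can be low while the standard deviation across folds is large.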

In [13]:
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
filename = 'housing.csv'
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
# Columns are whitespace-separated; sep=r'\s+' replaces the deprecated delim_whitespace=True
dataframe = read_csv(filename, sep=r'\s+', names=names)
array = dataframe.values
X = array[:,0:13]
Y = array[:,13]
# random_state only takes effect with shuffle=True; recent scikit-learn versions raise an error otherwise
kfold = KFold(n_splits=10)
model = LinearRegression()
scoring = 'r2'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("R^2: %.3f (%.3f)" % (results.mean(), results.std()))

R^2: 0.203 (0.595)




The mean $R^2$ of 0.203 is closer to zero than to one, so the predictions fit the actual
values poorly. The large standard deviation (0.595) also shows that the fit varies considerably
from fold to fold.

R-squared (or $R^2$) measures the degree to which your input variables explain the variation of your output variable. So, __if R-squared is 0.8, it means 80% of the variation in the output variable is explained by the input variables__. In simple terms, the higher the R-squared, the more variation is explained by your input variables and hence the better your model. <br/>

However, the problem with R-squared is that, on the data used to fit the model, it will either stay the same or increase when more variables are added, even if they have no relationship with the output variable. This is where “adjusted R-squared” comes to help. __Adjusted R-squared penalizes you for adding variables which do not improve your existing model.__ <br/>

Hence, if you are building a linear regression on multiple variables, it is advisable to use adjusted R-squared to judge the goodness of the model. If you only have one input variable, R-squared and adjusted R-squared are nearly the same, differing only by a small factor that shrinks as the sample size grows. <br/>

Typically, the more non-significant variables you add to the model, the larger the gap between R-squared and adjusted R-squared becomes.
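The penalty can be sketched with the standard adjusted R-squared formula, $1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $n$ is the number of samples and $p$ the number of input variables. The function name and the sample values below are hypothetical and chosen only to show the effect.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Hypothetical numbers: the same R-squared of 0.8 on 100 samples,
# obtained with a growing number of input variables
r2 = 0.8
adj_1 = adjusted_r2(r2, n_samples=100, n_features=1)
adj_13 = adjusted_r2(r2, n_samples=100, n_features=13)
adj_50 = adjusted_r2(r2, n_samples=100, n_features=50)
print("p=1: %.3f, p=13: %.3f, p=50: %.3f" % (adj_1, adj_13, adj_50))
```

For a fixed R-squared, each additional variable lowers the adjusted value, so a variable has to improve the fit by enough to offset the penalty.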