In [1]:
import sklearn as sl
import warnings

warnings.filterwarnings("ignore")

### Business Problem Definition

We are going to build a predictive model that estimates house prices from a series of characteristics of houses in a neighborhood in Boston, USA. Dataset: https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html

### Performance Evaluation

https://scikit-learn.org/stable/modules/model_evaluation.html

The metrics you choose to evaluate model performance influence how that performance is measured and how it compares with models built using other algorithms.

### *Metrics for Regression Algorithms*

- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- Mean Absolute Error (MAE)
- R Squared (R²)
- Adjusted R Squared (Adjusted R²)
- Mean Squared Percentage Error (MSPE)
- Mean Absolute Percentage Error (MAPE)
- Root Mean Squared Logarithmic Error (RMSLE)

$$MSE = \frac {1}{n} \sum_{i=1}^n (y_{i} - \hat{y}_{i})^2$$

$$RMSE = \sqrt {\frac {1}{n} \sum_{i=1}^n (y_{i} - \hat{y}_{i})^2} = \sqrt {MSE}$$

$$MAE = \frac {1}{n} \sum_{i=1}^n |y_{i} - \hat{y}_{i}|$$

$$R^2 = 1 - \frac {MSE (model)}{MSE (baseline)}$$
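
The list above also names several metrics for which no formula is given. The definitions below are the commonly used ones and are an assumption here, since the source does not state them. Conventions vary: some authors multiply MSPE and MAPE by 100 to report a percentage. Here $p$ denotes the number of predictor variables, a symbol not used elsewhere in this notebook.

$$R^2_{adj} = 1 - (1 - R^2) \frac {n - 1}{n - p - 1}$$

$$MSPE = \frac {1}{n} \sum_{i=1}^n \left( \frac {y_{i} - \hat{y}_{i}}{y_{i}} \right)^2$$

$$MAPE = \frac {1}{n} \sum_{i=1}^n \left| \frac {y_{i} - \hat{y}_{i}}{y_{i}} \right|$$

$$RMSLE = \sqrt {\frac {1}{n} \sum_{i=1}^n \left( \log(1 + y_{i}) - \log(1 + \hat{y}_{i}) \right)^2}$$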

**Mean Squared Error (MSE)**

It is perhaps the simplest and most common metric for regression assessment, but also probably the least useful. The MSE measures the average squared error of our predictions. For each point, it calculates the squared difference between the prediction and the actual value of the target variable, and then averages those values.

The higher the value, the worse the model. This value will never be negative, since we are squaring individual prediction errors, but it would be zero for a perfect model.
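
To make the definition concrete, here is a minimal sketch that computes the MSE (and its square root, the RMSE) by hand. The values in `y_true` and `y_pred` are toy numbers chosen for illustration, not taken from the Boston data, and the helper `mse` is a hypothetical name rather than a library function.

```python
import math

# Toy values for illustration only (not from the Boston dataset).
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]


def mse(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


mse_value = mse(y_true, y_pred)      # (0.25 + 0 + 2.25 + 1) / 4
rmse_value = math.sqrt(mse_value)    # back in the units of the target

print(f"MSE: {mse_value}")
print(f"RMSE: {round(rmse_value, 3)}")
print(f"MSE of a perfect model: {mse(y_true, y_true)}")
```

Because each error is squared, the MSE is expressed in squared units of the target, which is one reason the RMSE is often reported alongside it.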

In [2]:
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

file = 'aux/datasets/boston-houses.csv'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, sep = r'\s+', names = columns)

x = data.values[:, 0:13]
y = data.values[:, 13]

x_training, x_test, y_training, y_test = train_test_split(x, y, test_size = 0.33, random_state = 5)

model = LinearRegression()
model.fit(x_training, y_training)

y_predictions = model.predict(x_test)

mse = mean_squared_error(y_test, y_predictions)

print(f"MSE: {round(mse, 3)}")

MSE: 28.53


**Mean Absolute Error (MAE)**

It is the average of the absolute differences between predicted and actual values, which gives an idea of how wrong our predictions are, in the same units as the target. A value of 0 indicates that there is no error, and the prediction is perfect.
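
As a rough illustration of how the MAE behaves compared with the MSE, the sketch below scores two sets of toy predictions: one where every prediction is slightly off, and one where a single prediction is far off. All values and the helper names `mae` and `mse` are illustrative assumptions.

```python
def mae(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Toy values for illustration only.
y_true = [10.0, 12.0, 14.0, 16.0]
y_close = [11.0, 11.0, 15.0, 15.0]     # every prediction is off by 1
y_outlier = [10.0, 12.0, 14.0, 26.0]   # three exact, one off by 10

print(f"All off by 1  -> MAE: {mae(y_true, y_close)}, MSE: {mse(y_true, y_close)}")
print(f"One off by 10 -> MAE: {mae(y_true, y_outlier)}, MSE: {mse(y_true, y_outlier)}")
```

The single large error raises the MAE from 1.0 to 2.5 but the MSE from 1.0 to 25.0, which shows that the MAE is less sensitive to outliers.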

In [3]:
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression

file = 'aux/datasets/boston-houses.csv'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, sep = r'\s+', names = columns)

x = data.values[:, 0:13]
y = data.values[:, 13]

x_training, x_test, y_training, y_test = train_test_split(x, y, test_size = 0.33, random_state = 5)

model = LinearRegression()
model.fit(x_training, y_training)

y_predictions = model.predict(x_test)

mae = mean_absolute_error(y_test, y_predictions)

print(f"MAE: {round(mae, 3)}")

MAE: 3.455


**R Squared (R²)**

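This metric provides an indication of how accurate the predictions are in relation to the observed values. Also called the coefficient of determination, it is at most 1, with 1 being the ideal. A value of 0 means the model does no better than always predicting the mean of the target, and the value can be negative for a model that does worse than that baseline.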
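
The sketch below computes R² by hand for three sets of toy predictions, taking the baseline to be a model that always predicts the mean of the observed values (the usual convention, assumed here). The helper name `r2` and all values are illustrative.

```python
def r2(actual, predicted):
    """1 minus the ratio of the model's squared error to the mean-baseline's squared error."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Toy values for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
baseline = [2.5, 2.5, 2.5, 2.5]               # always predicts the mean of y_true
reversed_predictions = [4.0, 3.0, 2.0, 1.0]   # systematically wrong

print(f"Perfect model:  {r2(y_true, y_true)}")
print(f"Mean baseline:  {r2(y_true, baseline)}")
print(f"Reversed model: {r2(y_true, reversed_predictions)}")
```

The three scores are 1.0, 0.0 and -3.0, so a negative R² is a sign that the model is worse than simply predicting the mean.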

In [4]:
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

file = 'aux/datasets/boston-houses.csv'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, sep = r'\s+', names = columns)

x = data.values[:, 0:13]
y = data.values[:, 13]

x_training, x_test, y_training, y_test = train_test_split(x, y, test_size = 0.33, random_state = 5)

model = LinearRegression()
model.fit(x_training, y_training)

y_predictions = model.predict(x_test)

r2 = r2_score(y_test, y_predictions)

print(f"R2: {round(r2, 3)}")

R2: 0.696


### Regression Algorithms

### *Linear Regression*

It assumes that the errors follow a normal distribution, that the variables are relevant to the model, and that they are not collinear, that is, not highly correlated with one another.
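
One simple way to look for collinearity is to compute the pairwise Pearson correlation between features and flag pairs above some cutoff. The sketch below does this on made-up data. The feature names, the values, and the 0.9 `threshold` are all arbitrary assumptions for illustration, and a pairwise check like this will not catch every form of multicollinearity.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical features: "b" is roughly 2 * "a", "c" is unrelated to both.
features = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.8],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
}

threshold = 0.9  # arbitrary cutoff chosen for this example

# Collect every pair of features whose absolute correlation exceeds the cutoff.
flagged = [
    (name_1, name_2)
    for name_1, name_2 in combinations(features, 2)
    if abs(pearson(features[name_1], features[name_2])) > threshold
]

print(f"Highly correlated pairs: {flagged}")
```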

In [5]:
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

file = 'aux/datasets/boston-houses.csv'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, sep = r'\s+', names = columns)

x = data.values[:, 0:13]
y = data.values[:, 13]

x_training, x_test, y_training, y_test = train_test_split(x, y, test_size = 0.33, random_state = 5)

model = LinearRegression()
model.fit(x_training, y_training)

y_predictions = model.predict(x_test)

mse = mean_squared_error(y_test, y_predictions)

print(f"MSE with LinearRegression: {round(mse, 3)}")

MSE with LinearRegression: 28.53


### *Ridge Regression*

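It is an extension of linear regression in which the loss function is modified to limit model complexity: a penalty proportional to the sum of the squared coefficients (L2 regularization) is added, which shrinks the coefficients toward zero.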
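
To see the effect of the penalty in isolation, consider the simplest possible case: one feature and no intercept. Minimizing `sum((y - w*x)^2) + alpha * w^2` then has the closed-form solution used below. This is a sketch under that simplifying assumption, not how `Ridge` is implemented internally. The function name `ridge_coefficient` and the data are illustrative.

```python
def ridge_coefficient(xs, ys, alpha):
    """Minimizer of sum((y - w*x)^2) + alpha * w^2 for one feature, no intercept."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + alpha)


# Toy data lying exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# alpha = 0 recovers ordinary least squares; larger alpha shrinks the coefficient.
for alpha in (0.0, 1.0, 10.0, 30.0):
    print(f"alpha={alpha}: w={round(ridge_coefficient(xs, ys, alpha), 3)}")
```

As `alpha` grows, the coefficient moves from 2.0 toward zero but never reaches it exactly.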

In [6]:
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Ridge

file = 'aux/datasets/boston-houses.csv'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, sep = r'\s+', names = columns)

x = data.values[:, 0:13]
y = data.values[:, 13]

x_training, x_test, y_training, y_test = train_test_split(x, y, test_size = 0.33, random_state = 5)

model = Ridge()
model.fit(x_training, y_training)

y_predictions = model.predict(x_test)

mse = mean_squared_error(y_test, y_predictions)

print(f"MSE with Ridge: {round(mse, 3)}")

MSE with Ridge: 29.294


### *Lasso Regression*

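Lasso (Least Absolute Shrinkage and Selection Operator) Regression is a modification of linear regression. As in Ridge Regression, the loss function is modified to limit model complexity, but the penalty is the sum of the absolute values of the coefficients (L1 regularization), which can drive some coefficients to exactly zero.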
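
The "selection" part of the name can be illustrated in the one-feature, no-intercept case. Minimizing `(1 / (2n)) * sum((y - w*x)^2) + alpha * |w|`, the objective scaling that scikit-learn documents for `Lasso`, leads to a soft-thresholding rule. The code below is a sketch under that simplifying assumption, and the names `soft_threshold` and `lasso_coefficient` are illustrative.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, returning 0.0 if it is within the threshold."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coefficient(xs, ys, alpha):
    """Minimizer of (1/(2n)) * sum((y - w*x)^2) + alpha * |w| for one feature, no intercept."""
    n = len(xs)
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    mean_xx = sum(x * x for x in xs) / n
    return soft_threshold(mean_xy, alpha) / mean_xx


# Toy data lying exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

for alpha in (0.0, 1.5, 15.0, 20.0):
    print(f"alpha={alpha}: w={round(lasso_coefficient(xs, ys, alpha), 3)}")
```

Unlike the squared penalty, the absolute-value penalty sets the coefficient to exactly 0.0 once `alpha` is large enough (15.0 or more for this data).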

In [7]:
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Lasso

file = 'aux/datasets/boston-houses.csv'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, sep = r'\s+', names = columns)

x = data.values[:, 0:13]
y = data.values[:, 13]

x_training, x_test, y_training, y_test = train_test_split(x, y, test_size = 0.33, random_state = 5)

model = Lasso()
model.fit(x_training, y_training)

y_predictions = model.predict(x_test)

mse = mean_squared_error(y_test, y_predictions)

print(f"MSE with Lasso: {round(mse, 3)}")

MSE with Lasso: 33.395


### *ElasticNet Regression*

It is a form of regularized regression that combines the properties of Ridge and Lasso regression. The objective is to limit model complexity by penalizing both the sum of the squared coefficients (L2, as in Ridge) and the sum of their absolute values (L1, as in Lasso).
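
The sketch below shows how the two penalties combine in the one-feature, no-intercept case. It assumes the objective scikit-learn documents for `ElasticNet`: `(1 / (2n)) * sum((y - w*x)^2) + alpha * l1_ratio * |w| + 0.5 * alpha * (1 - l1_ratio) * w^2`. The function names and the data are illustrative, and this closed form is not how the library solves the general problem.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, returning 0.0 if it is within the threshold."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net_coefficient(xs, ys, alpha, l1_ratio):
    """One-feature, no-intercept minimizer of the elastic net objective."""
    n = len(xs)
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    mean_xx = sum(x * x for x in xs) / n
    l1_strength = alpha * l1_ratio          # Lasso-style part of the penalty
    l2_strength = alpha * (1 - l1_ratio)    # Ridge-style part of the penalty
    return soft_threshold(mean_xy, l1_strength) / (mean_xx + l2_strength)


# Toy data lying exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# l1_ratio = 1.0 is a pure L1 penalty, l1_ratio = 0.0 is a pure L2 penalty.
for l1_ratio in (1.0, 0.5, 0.0):
    w = elastic_net_coefficient(xs, ys, alpha=1.5, l1_ratio=l1_ratio)
    print(f"l1_ratio={l1_ratio}: w={round(w, 3)}")
```

The `l1_ratio` parameter controls the mix: values near 1 behave like Lasso, values near 0 like Ridge.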

In [8]:
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import ElasticNet

file = 'aux/datasets/boston-houses.csv'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, sep = r'\s+', names = columns)

x = data.values[:, 0:13]
y = data.values[:, 13]

x_training, x_test, y_training, y_test = train_test_split(x, y, test_size = 0.33, random_state = 5)

model = ElasticNet()
model.fit(x_training, y_training)

y_predictions = model.predict(x_test)

mse = mean_squared_error(y_test, y_predictions)

print(f"MSE with ElasticNet: {round(mse, 3)}")

MSE with ElasticNet: 33.273


### *KNN*

In [9]:
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

file = 'aux/datasets/boston-houses.csv'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, sep = r'\s+', names = columns)

x = data.values[:, 0:13]
y = data.values[:, 13]

x_training, x_test, y_training, y_test = train_test_split(x, y, test_size = 0.33, random_state = 5)

model = KNeighborsRegressor()
model.fit(x_training, y_training)

y_predictions = model.predict(x_test)

mse = mean_squared_error(y_test, y_predictions)

print(f"MSE with KNeighborsRegressor: {round(mse, 3)}")

MSE with KNeighborsRegressor: 47.706


### *CART*

In [10]:
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

file = 'aux/datasets/boston-houses.csv'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, sep = r'\s+', names = columns)

x = data.values[:, 0:13]
y = data.values[:, 13]

x_training, x_test, y_training, y_test = train_test_split(x, y, test_size = 0.33, random_state = 5)

model = DecisionTreeRegressor()
model.fit(x_training, y_training)

y_predictions = model.predict(x_test)

mse = mean_squared_error(y_test, y_predictions)

print(f"MSE with DecisionTreeRegressor: {round(mse, 3)}")

MSE with DecisionTreeRegressor: 30.957


### *SVM*

In [11]:
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR

file = 'aux/datasets/boston-houses.csv'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, sep = r'\s+', names = columns)

x = data.values[:, 0:13]
y = data.values[:, 13]

x_training, x_test, y_training, y_test = train_test_split(x, y, test_size = 0.33, random_state = 5)

model = SVR()
model.fit(x_training, y_training)

y_predictions = model.predict(x_test)

mse = mean_squared_error(y_test, y_predictions)

print(f"MSE with SVR: {round(mse, 3)}")

MSE with SVR: 79.235
