# Introduction to Supervised Machine Learning (SML) - Regression

Welcome to this introduction to machine learning (ML). In this session we cover the following topics:

1. Generalizing from and validating ML models
2. The bias-variance trade-off
3. Out-of-sample testing and cross-validation workflows
4. Implementing ML workflows in the Python (scikit-learn) ecosystem

In [None]:
# loading essential libraries

import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm

sns.set(style="darkgrid", color_codes=True)

## Data Description

We will load a standard dataset, the BostonHousing dataset. It comes as a dataframe with 506 observations on 14 variables: 13 features plus the outcome `medv` in the last column:

* `crim`	per capita crime rate by town
* `zn`	proportion of residential land zoned for lots over 25,000 sq.ft
* `indus`	proportion of non-retail business acres per town
* `chas`	Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) (deselected in this case)
* `nox`	nitric oxides concentration (parts per 10 million)
* `rm`	average number of rooms per dwelling
* `age`	proportion of owner-occupied units built prior to 1940
* `dis`	weighted distances to five Boston employment centres
* `rad`	index of accessibility to radial highways
* `tax`	full-value property-tax rate per USD 10,000
* `ptratio`	pupil-teacher ratio by town
* `b`	1000(B - 0.63)^2 where B is the proportion of blacks by town
* `lstat`	percentage of the population with lower status
* `medv`	median value of owner-occupied homes in USD 1000's (our outcome to predict)

Source: Harrison, D. and Rubinfeld, D.L. "Hedonic prices and the demand for clean air", J. Environ. Economics & Management, vol.5, 81-102, 1978.

These data have been taken from the [UCI Repository Of Machine Learning Databases](ftp://ftp.ics.uci.edu/pub/machine-learning-databases)

In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv')

In [None]:
# Preview only: rename() returns a copy, so `data` keeps the `medv` column used below.
data.rename(columns={"medv": "y"})

In [None]:
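# Legacy alternative: load_boston was removed in scikit-learn 1.2, so this only runs on older versions.
#X, y = load_boston(return_X_y=True)
#load_boston()['feature_names']
#data = pd.DataFrame(load_boston()['data'], columns=load_boston()['feature_names'])
#data['y'] = load_boston()['target']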

## EDA

In [None]:
data.head()

In [None]:
data.describe().T


In this exercise, we will predict `medv` (median value of owner-occupied homes in USD 1000's). In the real world, such a model could be used to predict developments in housing prices, e.g. to inform policy makers or potential investors. When I have only one target outcome, I prefer to name it `y`. This simple naming convention makes it easier to reuse code across datasets.

Let's take a look at the correlation matrix and the distributions of the variables.

In [None]:
corr = data.corr()

In [None]:
cmap = sns.diverging_palette(230, 20, as_cmap=True)
mask = np.triu(np.ones_like(corr, dtype=bool))

sns.heatmap(corr, mask=mask, cmap=cmap)

In [None]:
sns.pairplot(data, corner=True, diag_kind='kde')
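
A heatmap shows every pairwise correlation at once, but for prediction the column for the outcome is often the most useful part. The following is a minimal sketch of ranking features by the strength of their correlation with the outcome. It uses a small synthetic dataframe (`toy`, with made-up values and a hypothetical `noise` column) rather than the real housing data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (made-up values, for illustration only).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "rm": rng.normal(6, 1, n),
    "lstat": rng.normal(12, 5, n),
    "noise": rng.normal(0, 1, n),
})
toy["medv"] = 5 * toy["rm"] - 0.8 * toy["lstat"] + rng.normal(0, 2, n)

# Correlation of each feature with the outcome, strongest (by magnitude) first.
target_corr = toy.corr()["medv"].drop("medv")
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)
print(ranked)
```

The sign tells the direction of the linear association, while the absolute value is what determines the ranking.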

## Preprocessing

We preprocess the data by standardising the features, and then split it into a train and a test set using the default settings of `train_test_split` (25% of the observations are held out for testing).

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
X = data.drop('medv', axis=1)
y = data['medv']

In [None]:
X_scaled = StandardScaler().fit_transform(X)

In [None]:
# Split the standardised features and the outcome into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
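
Standardising the full dataset before splitting means the scaler's means and variances are computed partly from test observations. On a dataset like this the effect is usually small, but a common way to avoid it is to put the scaler and the model in one pipeline, so that the scaler is fit on the training rows only. Here is a minimal sketch, assuming synthetic data from `make_regression` in place of the housing data (`X_demo`, `y_demo` and `pipe` are illustrative names):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the housing data.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# fit() learns the scaling from X_tr only; score() reuses it on X_te.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)
test_r2 = pipe.score(X_te, y_te)
print(f"Test R^2: {test_r2:.3f}")

# The scaler's means are the training-set means, not the full-data means.
scaler_mean = pipe.named_steps["standardscaler"].mean_
print(np.allclose(scaler_mean, X_tr.mean(axis=0)))
```

A pipeline like this can also be passed to `GridSearchCV`, in which case the scaler is refit inside every cross-validation fold.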

## Fitting SML models

Here, we are going to try out 3 different models:

1. OLS model (Baseline)
2. Elastic net (still parametric, but its regularisation may help with feature selection)
3. Random forest (tree-based ensemble model)

These models were chosen for no particular reason other than to demonstrate models of increasing complexity and with an increasing number of hyperparameter tuning options.

In [None]:
from sklearn.linear_model import LinearRegression, ElasticNet
from sklearn.ensemble import RandomForestRegressor

In [None]:
model_ols = LinearRegression()
model_el = ElasticNet()
model_rf = RandomForestRegressor(n_estimators=25)

In [None]:
model_ols.fit(X_train, y_train)
model_el.fit(X_train, y_train)
model_rf.fit(X_train, y_train)

In [None]:
# .score() returns the coefficient of determination (R^2) on the test set.
print(f"Model OLS {model_ols.score(X_test, y_test)}")
print(f"Model EL {model_el.score(X_test, y_test)}")
print(f"Model RF {model_rf.score(X_test, y_test)}")
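
For scikit-learn regressors, `.score` returns the coefficient of determination R^2, i.e. one minus the ratio of the residual sum of squares to the total sum of squares. A small worked example with made-up numbers shows that the manual computation agrees with `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up observations and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares: 1.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares: 20.0
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)                 # 0.925
print(r2_score(y_true, y_pred))  # same value
```

An R^2 of 1 means perfect predictions, 0 means no better than always predicting the mean, and negative values are possible on test data.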

## Hyperparameter tuning

Hyperparameter tuning is performed using 5-fold cross-validation (the default of `GridSearchCV`).

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

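# scorer = make_scorer(r2_score)
# MSE is a loss: greater_is_better=False negates it, so that GridSearchCV,
# which maximises the score, selects the lowest error.
scorer = make_scorer(mean_squared_error, greater_is_better=False)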
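
`GridSearchCV` always picks the parameter combination with the highest score, so a loss such as the mean squared error has to be turned into a "higher is better" quantity. The sketch below, on a tiny made-up dataset, illustrates how `make_scorer` with `greater_is_better=False` does this by flipping the sign:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error

# Tiny made-up dataset, for illustration only.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0.0, 1.0, 2.0, 4.0])
model = LinearRegression().fit(X_demo, y_demo)

# A scorer is called as scorer(estimator, X, y) and returns one number.
loss_scorer = make_scorer(mean_squared_error, greater_is_better=False)

mse = mean_squared_error(y_demo, model.predict(X_demo))
score = loss_scorer(model, X_demo, y_demo)
print(mse, score)  # the score is the negated MSE
```

The built-in scoring string `'neg_mean_squared_error'` follows the same sign convention and can be passed to `scoring=` directly.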

### OLS

Since OLS has no hyperparameters, no tuning is necessary.
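
Even without tuning, cross-validation can still give the baseline an out-of-sample performance estimate that does not depend on one particular train/test split. Here is a minimal sketch with `cross_val_score`, assuming synthetic data from `make_regression` (`X_demo` and `y_demo` are illustrative names):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem standing in for the housing data.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# 5 folds: each observation is used for testing exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")

print(scores)                        # one R^2 per fold
print(scores.mean(), scores.std())   # average and spread across folds
```

The spread across folds gives a rough sense of how much a single test-set score can vary by chance.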

### Elastic Net

In [None]:
parameters_el = {'alpha':[0.1, 0.5, 1.0],
                 'l1_ratio':[0.1, 0.5, 0.75]}

In [None]:
# Perform grid search on the regressor using 'scorer' as the scoring method.
grid_obj = GridSearchCV(model_el, parameters_el, scoring=scorer)

In [None]:
grid_obj

In [None]:
# Tune on the training data only, so that the test set stays unseen.
grid_fit = grid_obj.fit(X_train, y_train)

In [None]:
# Get the best estimator found by the grid search.
best_reg = grid_fit.best_estimator_

# Refit the selected model on the training set.
best_reg.fit(X_train, y_train)

In [None]:
best_reg.score(X_test, y_test)
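
Besides `best_estimator_`, a fitted grid search also exposes `best_params_` and `cv_results_`, which show how each parameter combination performed. The sketch below uses the same elastic net grid as above, but on synthetic data from `make_regression` and with the built-in `'neg_mean_squared_error'` scoring string. The names `search` and `results` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem standing in for the housing data.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

param_grid = {"alpha": [0.1, 0.5, 1.0], "l1_ratio": [0.1, 0.5, 0.75]}
search = GridSearchCV(ElasticNet(), param_grid, scoring="neg_mean_squared_error", cv=5)
search.fit(X_demo, y_demo)

# One row per parameter combination (3 x 3 = 9), best-ranked first.
columns = ["param_alpha", "param_l1_ratio", "mean_test_score", "rank_test_score"]
results = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_score")

print(search.best_params_)
print(results.head())
```

Looking at the whole table rather than only the winner shows whether the best combination sits at the edge of the grid, which would suggest widening the search.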

### Random Forest

In [None]:
model_rf = RandomForestRegressor()

In [None]:
parameters_rf = {'bootstrap': [True, False],
                 'max_depth': [10, 20, None],
                 'min_samples_split': [2, 5, 10],
                 'n_estimators': [25, 50]}

In [None]:
# Perform grid search on the regressor using 'scorer' as the scoring method.
grid_obj = GridSearchCV(model_rf, parameters_rf, scoring=scorer)

In [None]:
# Tune on the training data only, so that the test set stays unseen.
grid_fit = grid_obj.fit(X_train, y_train)

In [None]:
# Get the best estimator found by the grid search.
best_reg = grid_fit.best_estimator_

# Refit the selected model on the training set.
best_reg.fit(X_train, y_train)

In [None]:
# Model performance on TRAIN data
best_reg.score(X_train, y_train)

In [None]:
# Model performance on TEST data
best_reg.score(X_test, y_test)
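
Comparing the train and test scores is a direct way to see the bias-variance trade-off: a flexible model can fit the training data almost perfectly while doing noticeably worse out of sample. As a rough illustration, the sketch below grows a single decision tree (a simpler stand-in for the random forest) to different depths on synthetic data from `make_regression`. The names `X_demo`, `y_demo` and `gaps` are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic, noisy regression problem standing in for the housing data.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

gaps = {}
for depth in [1, 3, None]:  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_r2 = tree.score(X_tr, y_tr)
    test_r2 = tree.score(X_te, y_te)
    gaps[depth] = (train_r2, test_r2)
    print(f"max_depth={depth!s:>4}  train R^2={train_r2:.3f}  test R^2={test_r2:.3f}")
```

The fully grown tree memorises the training set, so its train score says little about how it generalises. The test score is the one to compare across models.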