![car](http://www.auctionsamerica.com/images/lots/AF15/AF15_r0339_01.jpg)

# Introduction to Data Science
# Assignment 2: Building a machine learning pipeline with scikit-learn

### In this assignment you will use some widely used sklearn classes to make implementing data science much easier.

### Learning goals:
- Understand the difference between transformer and estimator classes in sklearn
- Learn how to build a pipeline
- Understand the importance of preprocessing: train-test split, scaling, one-hot encoding
- Understand how to implement regularization
- Learn how to use cross validation and tune hyperparameters

# Import packages

In [None]:
import numpy as np
import pandas as pd

from IPython.display import Image
import matplotlib.pyplot as plt
%matplotlib inline

# Load data
The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes.

Attribute Information:
    1. mpg:           continuous
    2. cylinders:     multi-valued discrete
    3. displacement:  continuous
    4. horsepower:    continuous
    5. weight:        continuous
    6. acceleration:  continuous
    7. model year:    multi-valued discrete
    8. origin:        multi-valued discrete
    9. car name:      string (unique for each instance)

In [None]:
cars_df = pd.read_csv('../data/mtcars.csv', sep=';')

cars_df.head()

In [None]:
# Simple check for linear relations using a correlation matrix
# (restricted to numeric columns, since 'car name' is a string)
cars_df.select_dtypes(include='number').corr()

In [None]:
# Get first intuition for the data
cars_df.describe()

Note that:
- there are no missing values
- cylinders, model year and origin are discrete variables, so we may want to use **one hot encoding** for these features
- weight in particular is on a very different scale than the other features, so we want to use **scaling**

In [None]:
# Set y
y = cars_df['mpg'].values

# drop the dependent variable and the unique car name to get the features
X = cars_df.drop(['mpg', 'car name'], axis=1)

# set which features we would like to use for one hot encoding and which for scaling
features_ohe = ['cylinders', 'origin', 'model year']
features_scaling = ['acceleration', 'weight', 'horsepower', 'displacement']

## Understanding the scikit-learn estimator API

In scikit-learn there are two kinds of classes: the **transformers** and the **estimators**.

**Transformers**

A transformer class has two essential methods: ```fit``` and ```transform```. The ```fit``` method learns the parameters from the training data, and the ```transform``` method uses those parameters to transform the data.

In [None]:
Image("../images/sklearnapi1.png", width=500)
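To make the ```fit``` / ```transform``` contract concrete, here is a minimal sketch of a hand-written transformer. The class name `MeanCenterer` and the toy arrays are hypothetical and not part of the assignment; the sketch only assumes the `BaseEstimator` and `TransformerMixin` base classes from `sklearn.base`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: subtracts the per-column training mean."""

    def fit(self, X, y=None):
        # learn the parameters (column means) from the training data
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        # apply the learned parameters; nothing is re-learned here
        return np.asarray(X, dtype=float) - self.mean_


# made-up data: two features, two training rows, one new row
toy_train = np.array([[1.0, 10.0], [3.0, 30.0]])
toy_new = np.array([[2.0, 40.0]])

centerer = MeanCenterer().fit(toy_train)
print(centerer.mean_)               # means learned from toy_train
print(centerer.transform(toy_new))  # toy_new shifted by the training means
```

The important point is that `transform` on new data reuses the means learned during `fit`; it never looks at the statistics of the new data.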

**Estimators**

The estimators, such as the one we built in the first Hands-on, are very similar to the transformers. As you may recall, we also used the ```fit``` method to learn the parameters of a model when we trained the estimator for regression. The other essential method of an estimator is ```predict```, which provides predictions for instances.

In [None]:
Image("../images/sklearnapi2.png", width=500)
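As a small illustration of ```fit``` and ```predict```, the sketch below trains a linear regression on made-up data that follows `y = 2x + 1` exactly. The variable names and values are assumptions for illustration only, not the cars data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# made-up data following y = 2x + 1 exactly
toy_X = np.array([[0.0], [1.0], [2.0], [3.0]])
toy_y = 2.0 * toy_X.ravel() + 1.0

model = LinearRegression()
model.fit(toy_X, toy_y)                      # learns coef_ and intercept_
print(model.coef_, model.intercept_)

toy_pred = model.predict(np.array([[4.0]]))  # prediction for an unseen instance
print(toy_pred)
```

After `fit`, the learned parameters are stored on the instance in attributes ending with an underscore, which is the same convention the transformers use.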

## Data preprocessing:  Split train and test set
The first step is to split the data into a train and a test set. Sklearn makes this easy for us with the ```train_test_split``` function. When using sklearn, it is very useful to check the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

Note that ```train_test_split``` is not a transformer or an estimator, so we don't have to think about the ```fit```, ```transform``` or ```predict``` methods.
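A minimal sketch of how the function behaves, using made-up arrays rather than the cars data. The `test_size=0.25` and `random_state=0` values are illustrative choices, not recommendations for the assignment.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 12 made-up observations with 2 features; row i is [2i, 2i + 1] and its target is i
toy_X = np.arange(24).reshape(12, 2)
toy_y = np.arange(12)

# test_size is the fraction held out; random_state makes the shuffle repeatable
Xa, Xb, ya, yb = train_test_split(toy_X, toy_y, test_size=0.25, random_state=0)

print(Xa.shape, Xb.shape)
# features and targets are shuffled together, so rows stay aligned
print(np.array_equal(Xa[:, 0], 2 * ya))
```

Passing features and targets in one call keeps each row paired with its own target after shuffling.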

In [None]:
from sklearn.model_selection import train_test_split

# enter parameters of train_test_split (see example in documentation)
X_train, X_test, y_train, y_test = train_test_split(<CODE HERE>)

print(X_train.shape)
print(X_test.shape)

## Data preprocessing: Scaling

In [None]:
def plot_boxplot(array, feature_names, title=None):
    """function to visualize the range of values of features"""
    fig = plt.figure(figsize=(12,6))
    ax = fig.add_subplot(111)
    plt.boxplot(array, vert=False);
    plt.grid(True)
    plt.xticks(size=16);
    plt.yticks(size=16);
    plt.xlabel('feature values', fontsize=18)
    ax.set_yticklabels(feature_names)
    plt.title(title, fontsize=22)
    plt.show()

# Visualize range of values of features
plot_boxplot(X_train[features_scaling].values, feature_names=features_scaling, title='Value ranges of raw features')

It is very easy to see that the features are on quite different scales. Let's use sklearn's transformers [StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) and [MinMaxScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) to see how this will change the feature values.
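To see what the two scalers compute with their default settings, the sketch below compares their output on a made-up array with the same transformations written by hand: standardization subtracts the column mean and divides by the column standard deviation, while min-max scaling maps each column onto the range 0 to 1. The toy values are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# made-up values: two features on very different scales
toy = np.array([[1.0, 1000.0], [2.0, 3000.0], [4.0, 2000.0]])

standardized = StandardScaler().fit_transform(toy)
minmaxed = MinMaxScaler().fit_transform(toy)

# the same results computed by hand, column by column
manual_std = (toy - toy.mean(axis=0)) / toy.std(axis=0)
manual_mm = (toy - toy.min(axis=0)) / (toy.max(axis=0) - toy.min(axis=0))

print(np.allclose(standardized, manual_std))
print(np.allclose(minmaxed, manual_mm))
```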

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Start with standardization. Create an instance of the standard scaler class
stsc = StandardScaler()
# use fit to let the instance calculate the parameters to transform the features
stsc.fit(X_train[features_scaling])
# use transform to return the transformation
X_train_stsc = stsc.transform(<CODE HERE>)

In [None]:
# Visualize the standardized features
plot_boxplot(X_train_stsc, features_scaling, title='Value ranges of standardized features')

In [None]:
# Now use MinMaxScaling to check the difference. First create an instance.
mmsc = <CODE HERE>
# use the fit_transform method to return the transformation
# fit_transform is a method of sklearn transformer classes to directly fit the parameters and return the transformation
X_train_mmsc = <CODE HERE>

plot_boxplot(X_train_mmsc, features_scaling, title='Value ranges of min-max scaled features')

## One Hot Encoding
Now that we have taken care of the continuous features, it is time to focus on the categorical features. [Here](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) is the documentation for the OneHotEncoder.
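As a rough sketch of what the encoder produces, the example below encodes a single made-up column of cylinder counts. With default settings the result is a sparse matrix, so the sketch calls `.toarray()` to display it.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# made-up cylinder counts, one categorical feature
toy_cat = np.array([[4], [6], [8], [4]])

encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy_cat)  # sparse matrix by default

print(encoder.categories_)  # the categories learned during fit
print(encoded.toarray())    # one column per learned category
```

With default settings, transforming data that contains a category not seen during `fit` raises an error; the documented `handle_unknown='ignore'` option is one way to deal with that.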

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Create instance of OneHotEncoder class
ohe = <CODE HERE>
# Return the one-hot encoded features
X_train_ohe = <CODE HERE>

# print shape to check how many new features will be used
print(X_train_ohe.shape)

X_train_ohe

In [None]:
# Get total preprocessed X_train by concatenating the scaled features with the one hot encoded features.
# You can choose to use the standardized version or the minmax version
X_train_pp = <CODE HERE>

# print features of the first observation
X_train_pp[0]
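One detail to keep in mind when concatenating: a dense NumPy array and a sparse matrix cannot be stacked directly with NumPy. The sketch below shows one possible approach on made-up stand-in arrays, converting the sparse part to dense first. The variable names are hypothetical.

```python
import numpy as np
from scipy import sparse

# stand-ins for scaled features (dense) and encoder output (sparse)
scaled_part = np.array([[0.5, -1.0], [1.5, 0.0]])
ohe_part = sparse.csr_matrix([[1.0, 0.0], [0.0, 1.0]])

# convert the sparse part to a dense array, then stack column-wise
combined = np.hstack([scaled_part, ohe_part.toarray()])
print(combined.shape)
print(combined[0])
```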

## Train model
The data is preprocessed! We are going to train a linear regression model. Here is the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# create an instance of LinearRegression
lr = <CODE HERE>
# fit a linear model
lr.<CODE HERE>

# get predictions on training set
y_train_pred = lr.<CODE HERE>

# calculate mean squared error of training set
mse_train = mean_squared_error(<CODE HERE>)
print('MSE train set: {:.3f}'.format(mse_train))

# print the weights (coef_ in documentation) of the linear model
<CODE HERE>

## Predict on test set
With sklearn it is now very easy to repeat the same steps for the test set. We already fitted our scaler, one hot encoder and our model on the training set. Now we just have to use these fitted objects to get predictions for unseen data.
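The sketch below illustrates, on made-up numbers, why the test set should only be passed through ```transform``` and never through ```fit```: re-fitting on the test data would use different parameters than the ones the model was trained with.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up train and test values for a single feature
toy_train = np.array([[10.0], [20.0], [30.0]])
toy_test = np.array([[40.0], [50.0]])

scaler = StandardScaler().fit(toy_train)

# correct: reuse the training mean and standard deviation
test_scaled = scaler.transform(toy_test)

# for contrast only: re-fitting on the test set gives different values
test_refit = StandardScaler().fit_transform(toy_test)

print(test_scaled.ravel())
print(test_refit.ravel())
```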

In [None]:
# get scaled features for test set
X_test_stsc = stsc.<CODE HERE>
# get ohe features for test set
X_test_ohe = ohe.<CODE HERE>
# concatenate features
X_test_pp = <CODE HERE>

# get predictions on test set with train model
y_test_pred = lr.<CODE HERE>

# calculate mean squared error
mse_test = mean_squared_error(<CODE HERE>)
print('MSE test set: {:.3f}'.format(mse_test))

# Fitting hyperparameters

So it seems there is some overfitting involved, but very little. One way to reduce this problem is with regularization. For linear regression, sklearn has two different estimators for L1 and L2 regularization: [Lasso](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html) and [Ridge](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html), respectively. We will use L2 regularization (Ridge) here.
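To get a feeling for what L2 regularization does, the sketch below fits an unregularized and a regularized linear model on synthetic data and compares the size of the weight vectors. The data, the seed and `alpha=10.0` are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# synthetic data: 30 observations, 5 features, known weights plus noise
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(30, 5))
toy_y = toy_X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=30)

ols = LinearRegression().fit(toy_X, toy_y)
ridge = Ridge(alpha=10.0).fit(toy_X, toy_y)

# L2 regularization shrinks the weight vector towards zero
print(np.linalg.norm(ols.coef_))
print(np.linalg.norm(ridge.coef_))
```

A larger `alpha` means a stronger penalty on large weights and therefore more shrinkage.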

### Cross validation
When tuning a hyperparameter, such as the regularization parameter, we want to evaluate it using cross validation. Creating the folds in the training set can easily be done using [KFold](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html).

In [None]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# create instance of kfold
k_fold = KFold(n_splits=3, shuffle=True)
# create instance of linear regression with L2 regularization (Ridge)
# choose appropriate alpha parameter(regularization parameter) like 1 or 10
lr_l2 = <CODE HERE>
# create list to track cross validation mean squared errors
cv_mse = []

# k_fold.split() returns the indices of the train and cross validation set for each fold
for train_indices, cv_indices in k_fold.split(X_train_pp):
    # fit model on train
    lr_l2.fit(X_train_pp[train_indices], y_train[train_indices])
    # get predictions for fold
    y_pred_cv = lr_l2.predict(X_train_pp[cv_indices])
    # keep track of mean squared error on cross validation set
    cv_mse.append(mean_squared_error(y_train[cv_indices], y_pred_cv))

print('MSE of each cross validation fold:\n{}'.format(cv_mse))

# calculate mean of cross validation mean squared errors
mse_cv_l2 = sum(cv_mse) / len(cv_mse)
print('MSE CV: {:.3f}'.format(mse_cv_l2))
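The manual loop above can also be expressed with sklearn's `cross_val_score` helper. The sketch below is one possible way to compare several values of `alpha` on synthetic data; the data, the seed and the list of alphas are assumptions for illustration. Note that the `'neg_mean_squared_error'` scorer returns negated MSE values, so the sign is flipped back.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# synthetic data: 60 observations, 4 features, known weights plus noise
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(60, 4))
toy_y = toy_X @ np.array([1.0, 0.5, 0.0, -1.0]) + rng.normal(scale=0.3, size=60)

# a fixed random_state makes the folds identical for every alpha
folds = KFold(n_splits=3, shuffle=True, random_state=0)

cv_results = {}
for alpha in [0.1, 1.0, 10.0, 100.0]:
    # scores are negated MSE values, so flip the sign back
    scores = cross_val_score(Ridge(alpha=alpha), toy_X, toy_y,
                             cv=folds, scoring='neg_mean_squared_error')
    cv_results[alpha] = -scores.mean()

for alpha, mse in cv_results.items():
    print('alpha={:<6} CV MSE={:.3f}'.format(alpha, mse))
```

Fixing `random_state` in `KFold` keeps the comparison between alphas fair, because every candidate is evaluated on the same folds.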

Now that we can tune the hyperparameter without 'seeing' the test set, we can use the tuned hyperparameter to check how well the model does on the test set.

In [None]:
# fit linear regression with best regularization parameter on entire training set
lr_l2.fit(X_train_pp, y_train)
# get predictions
y_train_pred_l2 = lr_l2.predict(X_train_pp)
y_test_pred_l2 = lr_l2.predict(X_test_pp)
# get mean squared errors of data sets
mse_train_l2 = mean_squared_error(y_train, y_train_pred_l2)
mse_test_l2 = mean_squared_error(y_test, y_test_pred_l2)
print('MSE train: {:}\nMSE CV: {:}\nMSE test: {:}'.format(mse_train_l2, mse_cv_l2, mse_test_l2))

**Play around with the regularization parameter a little**

**What happens with the train, cross validation and test mean squared errors?**

**What can you conclude from this?**