#Advanced Regression with Azure Databricks

###Initial configuration

In this section we perform some imports and initial configurations to make sure everything is properly prepared for the next steps.

We are also using one of the popular Machine Learning modules in the data science world, scikit-learn.


![](http://scikit-learn.org/stable/_static/scikit-learn-logo-small.png)
**Scikit-learn is a widely used library for Machine Learning in Python**
- Contains simple and efficient tools for data mining and data analysis
- Accessible to everybody, and reusable in various contexts
- Built on NumPy, SciPy, and matplotlib (the "big three")
- Open source, commercially usable - BSD license

Run the next cell to import and configure the required modules.

In [4]:
# Do the most standard imports for DS:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns

# We all have high-resolution displays now. Make sure we exploit that.
%config InlineBackend.figure_format = 'retina' 

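# Adjust some colors and fonts to make our plots easier to navigate and understand.
# Newer Matplotlib releases renamed the seaborn styles, so pick whichever name is available:
style_name = 'seaborn-colorblind' if 'seaborn-colorblind' in plt.style.available else 'seaborn-v0_8-colorblind'
plt.style.use(style_name)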
plt.rcParams['axes.axisbelow'] = True
mpl.rcParams['axes.titlesize'] = 20
mpl.rcParams['axes.labelsize'] = 16
mpl.rcParams['xtick.labelsize'] = 14
mpl.rcParams['ytick.labelsize'] = 14
mpl.rcParams['font.size'] = 16   # 10
mpl.rcParams['legend.fontsize'] = 14
# Tell Pandas to only show us two decimals
pd.set_option('display.precision', 2)

# Some necessary sklearn imports that we will need later
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn import metrics

### Optional code, but quite useful for this lab context! ###
# Ignore warnings from scikit-learn?
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings("ignore", category=DataConversionWarning)

**IMPORTANT**

If this is the first notebook you run from this lab, make sure you run the steps to import the data as indicated in the <a href="$./01 Model Training Selection Evaluation">introductory notebook</a> of this lab.

Next, let's load the dataset for this lab.
Be sure to update the table name "usedcars\_clean\_#####" (replace ##### with the suffix that makes the name unique within your environment).

In [6]:
df_clean = spark.sql("SELECT * FROM usedcars_clean_#####")
df = df_clean.toPandas()

### Polynomial Regression

**First, a quick overview of polynomial regression**. Feel free to skip this if you want - the important part is that we can use **feature engineering** to add polynomials of features we already have in order to let the model learn non-linear behaviour in the features. This way, we **make a more advanced model out of a simple one**. This is an often-used "trick" in ML.

**Quick recap of linear and polynomial regression**:


Given \\( \mathbf{x} \\), a column vector, and \\( \mathbf{y} \\), the target vector, you can perform polynomial regression by appending polynomials of \\( \mathbf{x} \\). For example, consider

$$ \mathbf{x} = \begin{bmatrix} 2 \\\\[0.3em] -1 \\\\[0.1em] \frac{1}{3} \end{bmatrix} $$

Using just this vector in linear regression implies the model:

$$ y = \alpha_1 x $$ 

We can add columns that are powers of the vector above, which represent adding polynomials to the regression. Below we show this for polynomials up to power 3:

$$ \mathbf{x} = \begin{bmatrix} 2 & 4 & 8 \\\\[0.3em] -1 & 1 & -1 \\\\[0.1em] \frac{1}{3} & \frac{1}{3^2} & \frac{1}{3^3} \end{bmatrix} $$

This is our new data matrix that we use in sklearn's linear regression, and it represents the model:

$$ y = \alpha_1 x + \alpha_2x^2 + \alpha_3x^3$$
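
As a minimal sketch of the recap above, the following NumPy snippet builds the matrix of powers from the example vector and fits the three coefficients by least squares. The target values and the coefficients `alpha_true` are made up for illustration; they do not come from the lab dataset.

```python
import numpy as np

# The example column vector x from the recap above
x = np.array([2.0, -1.0, 1.0 / 3.0])

# Stack x, x^2 and x^3 as columns: this is the new data matrix
X_powers = np.column_stack([x**p for p in (1, 2, 3)])
print(X_powers)

# Hypothetical targets generated from assumed coefficients (illustration only)
alpha_true = np.array([1.0, -2.0, 0.5])
y = X_powers @ alpha_true

# Least squares recovers alpha_1, alpha_2, alpha_3 of
# y = alpha_1 * x + alpha_2 * x^2 + alpha_3 * x^3
alpha_fit, *_ = np.linalg.lstsq(X_powers, y, rcond=None)
print(alpha_fit)
```

The model stays linear in the coefficients, which is why an ordinary linear solver is enough even though the features are non-linear in `x`.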

**Add "Age squared" as feature**

Let's add the square of `Age` to our dataset:

In [10]:
# Make a copy of our "original" data, so that we avoid later confusion
df_poly = df.copy(deep=True)
# When we do a mathematical operation on a Pandas Series object, like the "Age"-column, Pandas will automatically do the operation on each element in that Series (since it is built on NumPy, and normally acts as NumPy would)
df_poly['Age**2'] = df['Age']**2

Our new dataframe `df_poly` should now contain a column with Age squared. Let's inspect and see if that is true for 3 randomly sampled rows from the dataframe:

In [12]:
df_poly.sample(3)

Good! We can now use the "`Age**2`"-column in training/fitting our linear regression model just like we would any other feature:

In [14]:
# Import the cross-validation function:
from sklearn.model_selection import cross_validate, cross_val_predict

# Shuffle the rows of the dataframe, since cross_validate with cv=5 uses K-Fold splits that do not shuffle for us:
df_poly = df_poly.sample(frac=1)
# Select the columns/features from the Pandas dataframe that we want to use in the model:
X_poly = np.array(df_poly[['Age', 'Age**2']])
# X_poly = np.array(df_poly[['Age']])
y_poly = np.array(df_poly['Price'])

# Create a linear regression model that we can train:
model = LinearRegression()
# Print some information about the linear model and its parameters:
print(model)

# Train the model using CV and multiple scoring on the data we have prepared:
cv_results = cross_validate(model, # Provide our model to the CV-function
                            X_poly, # Provide all the features (in real life only the training-data)
                            y_poly, # Provide all the "correct answers" (in real life only the training-data)
                            scoring=('r2', 'neg_mean_squared_error', 'neg_mean_absolute_error'), 
                            cv=5 # Cross-validate using 5-fold (K-Fold method) cross-validation splits
                           )

MAE  = -cv_results['test_neg_mean_absolute_error']
RMSE = np.sqrt(-cv_results['test_neg_mean_squared_error'])
R2   = cv_results['test_r2']

print('MAE:', MAE)
print('RMSE:', RMSE)
print('R2:', R2)

print('Average R^2:\t {:.2f} (+/- {:.2f})'.format(R2.mean(), R2.std()))
print('Average MAE:\t {:.2f} (+/- {:.2f})'.format(MAE.mean(), MAE.std()))
print('Average RMSE:\t {:.2f} (+/- {:.2f})'.format(RMSE.mean(), RMSE.std()))

**Questions:**
- How are the results we got now relative to what we got earlier, when we only used `Age`? 
- Do you agree that it is easier to compare these performance scores to the ones we got previously, now that we use cross-validation to get more "stable" performance scores?
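
The three scores printed by the cell above can also be computed by hand. scikit-learn's `neg_` scoring strings return negated errors so that a higher score is always better, which is why the cell flips the sign. The sketch below uses small made-up arrays (not the lab data) and compares the manual formulas with `sklearn.metrics`.

```python
import numpy as np
from sklearn import metrics

# Made-up true and predicted prices (illustration only)
y_true = np.array([10000.0, 12000.0, 9000.0, 15000.0])
y_hat = np.array([11000.0, 11500.0, 9500.0, 14000.0])

errors = y_true - y_hat
mae = np.mean(np.abs(errors))            # mean absolute error
rmse = np.sqrt(np.mean(errors**2))       # root mean squared error
r2 = 1.0 - np.sum(errors**2) / np.sum((y_true - y_true.mean())**2)

# Each pair should agree
print(mae, metrics.mean_absolute_error(y_true, y_hat))
print(rmse, np.sqrt(metrics.mean_squared_error(y_true, y_hat)))
print(r2, metrics.r2_score(y_true, y_hat))
```

MAE and RMSE are in the same units as the target (here: price), while R^2 is unitless.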

This time we also want to make predictions using our model, so that we can inspect our results visually. We therefore use the function `cross_val_predict`, which first trains the model just like the `cross_validate` function does, but then also stores the price (`y`) that each of the 5 models (trained on 5 different splits of the data, 5-fold) predicted for the test data it did not see during training.
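
To make this concrete, here is a minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` are invented for illustration). It reproduces the out-of-fold predictions with an explicit loop, assuming the default unshuffled K-Fold splitting that an integer `cv` gives for a regressor.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 80, size=(50, 1))   # synthetic "age" values
y_demo = 20000 - 150 * X_demo[:, 0] + rng.normal(0, 500, size=50)

# One out-of-fold prediction per row
y_oof = cross_val_predict(LinearRegression(), X_demo, y_demo, cv=5)

# The same thing by hand: fit on 4 folds, predict the held-out fold
y_manual = np.empty_like(y_demo)
for train_idx, test_idx in KFold(n_splits=5).split(X_demo):
    fold_model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    y_manual[test_idx] = fold_model.predict(X_demo[test_idx])

print(y_oof.shape, np.allclose(y_oof, y_manual))
```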

The end result is that we get predictions for all of our datapoints in the dataframe, from 5 different examples of our model. We then plot the "price vs. age" for the original data in blue, and the predicted data from the polynomial model in red:

In [16]:
y_pred = cross_val_predict(model, 
                            X_poly,
                            y_poly,
                            cv=5
                           )

fig, ax = plt.subplots(figsize= (15, 6))

ax.scatter(X_poly[:,0], y_poly)
ax.plot(X_poly[:,0], y_pred, 'ro')
plt.title('Price of used cars as a function of age')
plt.ylabel('Price [$]')
plt.xlabel('Age [Months]')
plt.grid()
display(fig)

Did this look like you expected? Can you see that the model, even if it is a linear model, learned to predict values that partially follow a second-degree polynomial? How and why did this work, and how is this possible? Feel free to ask us, or move on if you are short on time.

If you have time later: Try to *remove* the original feature `Age` from the input to the model. How does it score now? What does this tell us?
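
One way to approach these questions is a small experiment on synthetic data, sketched below. The quadratic relationship and its coefficients are assumptions chosen for illustration, not properties of the used-cars dataset. The model is linear in its coefficients, so once `age**2` is supplied as a column it can fit a curve.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(1)
age = rng.uniform(0, 80, size=200)
# Assumed quadratic relationship plus noise (illustration only)
price = 20000 - 300 * age + 2.0 * age**2 + rng.normal(0, 200, size=200)

X_lin = age.reshape(-1, 1)                 # only the original feature
X_quad = np.column_stack([age, age**2])    # original plus engineered feature

r2_lin = LinearRegression().fit(X_lin, price).score(X_lin, price)
quad_model = LinearRegression().fit(X_quad, price)
r2_quad = quad_model.score(X_quad, price)

print('R^2 with Age only:       ', round(r2_lin, 4))
print('R^2 with Age and Age**2: ', round(r2_quad, 4))
print('Fitted coefficients:     ', quad_model.coef_)
```

The coefficient on the squared column should land close to the assumed value of 2.0, which the straight-line model has no way to represent.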

**Adding an engineered feature like we just did is very powerful in ML, and is often absolutely necessary in order to achieve good results.**

**Summary so far...**

So far...
- We have looked at linear regression, and how it works
- We have tried linear regression with only one feature (so-called *simple* linear regression) and with several features
- We have investigated how an SGD regressor was trained through iterations, and we learned that this concept is at the heart of most ML algorithms
- We have tried **engineering a new polynomial feature** in our dataset, and have used this in the linear regression model to do *polynomial regression*

Not bad. Training or building a model that can represent the data we have, like we have done already, is at the heart of Machine Learning. The number of techniques that are available to us can, however, be quite daunting - there are a huge number of models and techniques available to us in libraries, but we need to learn how to use them in order for them to be useful. 

What we have seen so far are examples of linear regression models. These are part of a larger family called **Generalized Linear Models (GLM)**, which all have in common that they try to find the (multi-dimensional) "line" that fits our data the best in some way.
[This page (click here)](http://scikit-learn.org/stable/modules/linear_model.html) gives a good overview of the GLMs in scikit-learn.

We will now move on to some "more exciting" models.

### Regression - Decision Trees

Decision trees are enormously popular in ML, for many reasons. We will look at them a bit more closely when we start looking at classification, but for now, see how easy it is to run our now familiar code but using a decision tree instead. 

For later, if you are interested:
[Read more about Decision Trees in the excellent overview given by scikit-learn](http://scikit-learn.org/stable/modules/tree.html)

In [21]:
from sklearn import tree
df.sample()

# Shuffle the rows of the dataframe, since cross_validate with cv=5 uses K-Fold splits that do not shuffle for us:
df = df.sample(frac=1)
# Select the columns/features from the Pandas dataframe that we want to use in the model:
X = np.array(df[['Age', 'KM', 'Weight']])
# X = np.array(df[['Age']])
y = np.array(df['Price'])

# Create a decision tree regressor that we can train.
# The split criterion is left at its default (mean squared error), since its name differs between scikit-learn versions ('mse' vs. 'squared_error'):
model = tree.DecisionTreeRegressor(max_depth=1)
# Print some information about the model and its parameters:
print(model)

# Train the model using CV and multiple scoring on the data we have prepared:
cv_results = cross_validate(model, # Provide our model to the CV-function
                            X, # Provide all the features (in real life only the training-data)
                            y, # Provide all the "correct answers" (in real life only the training-data)
                            scoring=('r2', 'neg_mean_squared_error', 'neg_mean_absolute_error'), 
                            cv=5 # Cross-validate using 5-fold (K-Fold method) cross-validation splits
                           )

MAE  = -cv_results['test_neg_mean_absolute_error']
RMSE = np.sqrt(-cv_results['test_neg_mean_squared_error'])
R2   = cv_results['test_r2']

print('\n-------------- Scores ---------------')
print('Average R^2:\t {:.2f} (+/- {:.2f})'.format(R2.mean(), R2.std()))
print('Average MAE:\t {:.2f} (+/- {:.2f})'.format(MAE.mean(), MAE.std()))
print('Average RMSE:\t {:.2f} (+/- {:.2f})'.format(RMSE.mean(), RMSE.std()))

Not a good score, right? That is not surprising, **since the tree is currently only allowed to make a single decision!** Try changing the argument `max_depth`, and see if you can find the optimal number below 20.

The decision tree is fast and simple to use, and it works with no scaling of the input. 
However, we will later see that it must be used with care -- the results above are not the whole story.
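
The `max_depth` exercise above can also be done with a simple loop. The following sketch runs on synthetic data (a noisy sine curve, invented for illustration) rather than the used-cars table, and uses `cross_val_score` as a shortcut for a single metric.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, size=300)

# Cross-validated R^2 for every depth from 1 to 19
depth_scores = {}
for depth in range(1, 20):
    tree_model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    scores = cross_val_score(tree_model, X_demo, y_demo, cv=5, scoring='r2')
    depth_scores[depth] = scores.mean()

best_depth = max(depth_scores, key=depth_scores.get)
print('Best depth:', best_depth, 'with mean R^2:', round(depth_scores[best_depth], 3))
```

On the lab data, `X` and `y` from the cell above would take the place of `X_demo` and `y_demo`.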

### Regression (and pipelines) - Support Vector Machines (SVM)

We continue with the same code as previously, but now with a Linear Support Vector Machine: a very popular go-to model for both regression and classification. Let's see what we get with the following code:

In [25]:
from sklearn import svm
# Shuffle the rows of the dataframe, since cross_validate with cv=5 uses K-Fold splits that do not shuffle for us:
df = df.sample(frac=1)
# Select the columns/features from the Pandas dataframe that we want to use in the model:
X = np.array(df[['Age', 'KM', 'Weight']])
# X = np.array(df[['Age']])
y = np.array(df['Price'])

# Create a linear support vector regressor that we can train:
model = svm.LinearSVR(C=1000)

# Print some information about the model and its parameters:
print(model)

# Train the model using CV and multiple scoring on the data we have prepared:
cv_results = cross_validate(model, # Provide our model to the CV-function
                            X, # Provide all the features (in real life only the training-data)
                            y, # Provide all the "correct answers" (in real life only the training-data)
                            scoring=('r2', 'neg_mean_squared_error', 'neg_mean_absolute_error'), 
                            cv=5 # Cross-validate using 5-fold (K-Fold method) cross-validation splits
                           )

MAE  = -cv_results['test_neg_mean_absolute_error']
RMSE = np.sqrt(-cv_results['test_neg_mean_squared_error'])
R2   = cv_results['test_r2']

print('\n-------------- Scores ---------------')
print('Average R^2:\t {:.2f} (+/- {:.2f})'.format(R2.mean(), R2.std()))
print('Average MAE:\t {:.2f} (+/- {:.2f})'.format(MAE.mean(), MAE.std()))
print('Average RMSE:\t {:.2f} (+/- {:.2f})'.format(RMSE.mean(), RMSE.std()))

Not good, not good at all... What happened?

It turns out that the default inner workings of this type of model work better with normalized/scaled input. There are several ways to scale the input data, but the easiest and most basic way is to use sklearn's `StandardScaler`, which we have done a couple of times already.

A "scaler" in sklearn is built to behave the same way as any other model:

1. We create it.
2. We use the `.fit()` function to have it look at the input data we want to scale. The scaler then decides how it has to scale the different features in order to have them all scaled the way we asked for.
3. We then call the `.transform()` function on the scaler in order for it to actually scale the data.

However, if we transform/normalize the data before we train the model, we also have to separately transform the test data before we use it to score our model, and if we scale the target as well, we have to inverse transform the predictions to get them back in the units we are used to.
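
The create, fit, and transform steps described above can be sketched as follows. The feature matrix is synthetic, with three columns on very different scales standing in for `Age`, `KM` and `Weight`; the value ranges are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Synthetic features on very different scales (illustration only)
X_demo = np.column_stack([
    rng.uniform(1, 80, 200),
    rng.uniform(1000, 200000, 200),
    rng.uniform(1000, 1500, 200),
])
X_train, X_test = train_test_split(X_demo, test_size=0.2, random_state=0)

scaler = StandardScaler()                  # 1. create the scaler
scaler.fit(X_train)                        # 2. learn mean and std from the training data only
X_train_s = scaler.transform(X_train)      # 3. scale the training data
X_test_s = scaler.transform(X_test)        # reuse the training statistics on the test data

print(X_train_s.mean(axis=0).round(6), X_train_s.std(axis=0).round(6))

# inverse_transform undoes the scaling
X_back = scaler.inverse_transform(X_train_s)
print(np.allclose(X_back, X_train))
```

Fitting the scaler on the training part only keeps information about the test rows from leaking into the preprocessing.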

sklearn has a great system for doing these things, and more, automatically. It's called a pipeline. Have a look at the code below, and see if you can understand how it is set up. Then run the code and see if the performance of our model is more in the range we expected from a decent model:

In [27]:
# Shuffle the rows of the dataframe, since cross_validate with cv=5 uses K-Fold splits that do not shuffle for us:
df = df.sample(frac=1)
# Select the columns/features from the Pandas dataframe that we want to use in the model:
X = np.array(df[['Age', 'KM', 'Weight']])
# X = np.array(df[['Age']])
y = np.array(df['Price'])

# Create a linear support vector regressor that we can train:
model = svm.LinearSVR(C=1000)

# Print some information about the model and its parameters:
print(model)

from sklearn.pipeline import make_pipeline
reg_pipeline = make_pipeline(StandardScaler(), model)

# Train the model using CV and multiple scoring on the data we have prepared:
cv_results = cross_validate(reg_pipeline, # Provide our model to the CV-function
                            X, # Provide all the features (in real life only the training-data)
                            y, # Provide all the "correct answers" (in real life only the training-data)
                            scoring=('r2', 'neg_mean_squared_error', 'neg_mean_absolute_error'), 
                            cv=5 # Cross-validate using 5-fold (K-Fold method) cross-validation splits
                           )

MAE  = -cv_results['test_neg_mean_absolute_error']
RMSE = np.sqrt(-cv_results['test_neg_mean_squared_error'])
R2   = cv_results['test_r2']

print('\n-------------- Scores ---------------')
print('Average R^2:\t {:.2f} (+/- {:.2f})'.format(R2.mean(), R2.std()))
print('Average MAE:\t {:.2f} (+/- {:.2f})'.format(MAE.mean(), MAE.std()))
print('Average RMSE:\t {:.2f} (+/- {:.2f})'.format(RMSE.mean(), RMSE.std()))

In short, we:
1. Use the `make_pipeline` function to create a "pipeline" which contains a scaler and a model, in that order
2. Use the pipeline *just like we would use a pure model*

Pipelines are powerful and great tools, usable for many purposes. When we give the pipeline to the `cross_validate` function, it automatically takes care of the separate scaling of each of the 5 "experiments" it runs internally. Because only the features are scaled and the target is left untouched, the scores it returns are in normal units.
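
To see what the pipeline does inside cross-validation, the sketch below compares it with a hand-written loop that refits the scaler on each training fold. It runs on synthetic data, and `Ridge` is swapped in for the lab's `LinearSVR` only because it is deterministic and sensitive to feature scale, which makes the two results directly comparable.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Synthetic features on different scales (illustration only)
X_demo = rng.normal(size=(100, 3)) * np.array([1.0, 1000.0, 50.0])
y_demo = X_demo @ np.array([3.0, 0.002, -0.1]) + rng.normal(0, 0.5, size=100)

kfold = KFold(n_splits=5)

# Pipeline version: scaling happens inside each fold automatically
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=kfold, scoring='r2')

# Manual version: fit the scaler on the training fold, reuse it on the test fold
manual_scores = []
for train_idx, test_idx in kfold.split(X_demo):
    scaler = StandardScaler().fit(X_demo[train_idx])
    fold_model = Ridge(alpha=1.0).fit(scaler.transform(X_demo[train_idx]), y_demo[train_idx])
    manual_scores.append(fold_model.score(scaler.transform(X_demo[test_idx]), y_demo[test_idx]))

print(np.allclose(pipe_scores, manual_scores))
```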

### Comparing models

See if you can run and understand parts of the code below, if you have time. See how easy it is to try different models in `sklearn`?

In [31]:
import time
from sklearn import svm
from sklearn import tree
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate, cross_val_predict
from sklearn.kernel_ridge import KernelRidge
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
# Shuffle the rows of the dataframe, since cross_validate with cv=5 uses K-Fold splits that do not shuffle for us:
df = df.sample(frac=1)
# Select the columns/features from the Pandas dataframe that we want to use in the model:
X = np.array(df[['Age', 'KM', 'Weight']])
# X = np.array(df[['Age']])
y = np.array(df['Price'])

# Choose and initialize the model:
model = []
model.append(('NuSVR_rbf', svm.NuSVR(kernel='rbf', C=5, nu=0.5)))
model.append(('SVR_rbf\t', svm.SVR(kernel='rbf', C=10000, epsilon=0.1, gamma='auto')))
model.append(('SVR_poly', svm.SVR(kernel='poly', degree=3, gamma='auto', C=100, epsilon=0.1)))
model.append(('SVR_linear', svm.LinearSVR(C=1000)))
model.append(('KRR\t', KernelRidge(kernel='rbf', alpha=0.001, gamma=1)))
model.append(('tree\t', tree.DecisionTreeRegressor()))  # default split criterion is mean squared error
model.append(('GBR\t' , GradientBoostingRegressor(n_estimators=70, max_depth=5, verbose=False)))
model.append(('RF\t', RandomForestRegressor(n_estimators=20)))
model.append(('linreg\t', LinearRegression()))


print('\n-------------- Scores ---------------')
print('Model:\t\ttime:\tR^2:\tsqrt(MSE)\tMAE')
for i in range(len(model)):
    print('%s\t' % model[i][0], end='')
    # Initialize regression pipeline with scaling and model
    reg_pipeline = make_pipeline(StandardScaler(), model[i][1])
    
    t0 = time.time()
    # Train the model using CV and multiple scoring on the data we have prepared:
    cv_results = cross_validate(reg_pipeline, # Provide our model to the CV-function
                                X, # Provide all the features (in real life only the training-data)
                                y, # Provide all the "correct answers" (in real life only the training-data)
                                scoring=('r2', 'neg_mean_squared_error', 'neg_mean_absolute_error'), 
                                cv=5 # Cross-validate using 5-fold (K-Fold method) cross-validation splits
                               )
    print('%.3fs\t' % (time.time() - t0), end='')

    MAE  = -cv_results['test_neg_mean_absolute_error']
    RMSE = np.sqrt(-cv_results['test_neg_mean_squared_error'])
    R2   = cv_results['test_r2']
    
    # Print score:
    print('{0:.4f}'.format(R2.mean()), end='')
    print('\t{0:.2f}'.format(RMSE.mean()), end='')
    print('\t\t{0:.2f}'.format(MAE.mean()))
print('\nPerformance is given wrt. the cross-validation hold-out dataset (the validation dataset).')

If you want to return to this lab after the workshop is over, feel free to:
- Try to adjust the various models in the code cell above, and see if you can find the "optimal model" by hand-tuning some parameters to the models.
- If you are adept at `sklearn` already: Implement a grid search with `GridSearchCV` for one or several models above to automatically scan for the best model parameters.
- Use a more advanced model than the `SGDRegressor` in the "Learning Curves" you made earlier, or change the parameters of the current model. How do the learning curves, per iteration, change?
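
As a starting point for the grid-search suggestion, here is a minimal sketch using `GridSearchCV` on a scaler-plus-tree pipeline. The data is synthetic and the parameter grid is an arbitrary choice for illustration. Parameters of a pipeline step are addressed as `<step name>__<parameter>`, where `make_pipeline` names each step after its lowercased class name.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(0, 0.3, size=200)

pipe = make_pipeline(StandardScaler(), DecisionTreeRegressor(random_state=0))

# Hypothetical grid: 4 depths x 3 leaf sizes = 12 candidates
param_grid = {
    'decisiontreeregressor__max_depth': [2, 4, 6, 8],
    'decisiontreeregressor__min_samples_leaf': [1, 5, 10],
}

search = GridSearchCV(pipe, param_grid, scoring='neg_mean_absolute_error', cv=5)
search.fit(X_demo, y_demo)

print('Best parameters:', search.best_params_)
print('Best cross-validated MAE:', -search.best_score_)
```

After fitting, `search.best_estimator_` is the pipeline refitted on all the data with the best parameters.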

###Evaluation and Validation - Overfitting

Let's have a look at the Decision tree that we used earlier. Please run the code below, where the decision tree is used with default parameters, look at the results, and then continue below.

In [35]:
# Shuffle the rows of the dataframe, since cross_validate with cv=5 uses K-Fold splits that do not shuffle for us:
df = df.sample(frac=1)
# Select the columns/features from the Pandas dataframe that we want to use in the model:
X = np.array(df[['Age', 'KM', 'Weight']])
# X = np.array(df[['Age']])
y = np.array(df['Price'])

# Create a decision tree regressor with default parameters (no depth limit, mean squared error criterion):
model = tree.DecisionTreeRegressor()
# Print some information about the model and its parameters:
print(model)

# Train the model using CV and multiple scoring on the data we have prepared:
cv_results = cross_validate(model, # Provide our model to the CV-function
                            X, # Provide all the features (in real life only the training-data)
                            y, # Provide all the "correct answers" (in real life only the training-data)
                            scoring=('r2', 'neg_mean_squared_error', 'neg_mean_absolute_error'), 
                            cv=5, # Cross-validate using 5-fold (K-Fold method) cross-validation splits
                            return_train_score=True # Also keep the train-scores, which we inspect in the next cell
                           )

MAE  = -cv_results['test_neg_mean_absolute_error']
RMSE = np.sqrt(-cv_results['test_neg_mean_squared_error'])
R2   = cv_results['test_r2']

print('\n-------------- Scores on test-dataset ---------------')
print('Average R^2:\t {:.2f} (+/- {:.2f})'.format(R2.mean(), R2.std()))
print('Average MAE:\t {:.2f} (+/- {:.2f})'.format(MAE.mean(), MAE.std()))
print('Average RMSE:\t {:.2f} (+/- {:.2f})'.format(RMSE.mean(), RMSE.std()))

Looks alright? Again, the decision tree is fast and simple to use, and it works with no scaling of the input. Let us however make a **very important point**: Until now, with the exception of the "learning curves" we looked at for the `SGDRegressor`, we have only looked at the **test-scores**, based on the test-dataset. Let's change this in the cell below, by calling the `train_`-version of the scores from the cross-validator:

In [37]:
MAE  = -cv_results['train_neg_mean_absolute_error']
RMSE = np.sqrt(-cv_results['train_neg_mean_squared_error'])
R2   = cv_results['train_r2']

print('\n-------------- Scores on TRAIN-dataset ---------------')
print('Average R^2:\t {:.2f} (+/- {:.2f})'.format(R2.mean(), R2.std()))
print('Average MAE:\t {:.2f} (+/- {:.2f})'.format(MAE.mean(), MAE.std()))
print('Average RMSE:\t {:.2f} (+/- {:.2f})'.format(RMSE.mean(), RMSE.std()))

What do you observe above? **This is some serious overfitting.** The decision tree has been given no limit on its depth, so it makes branches until **it has a leaf (end-node on the tree) for almost every datapoint in the train-dataset that we are using to train it**.

We therefore get extremely good scores on the train-dataset when we use the model, while the scores on the test-dataset are clearly worse. This gap shows that the model has memorized the training data rather than learned patterns that generalize: **the model is too complex considering the amount of data we have available**.

Exercise: Try limiting the depth of the tree, and re-check the test- and train-scores. Did it help? Do you think the depth-limited model is a better model overall?
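
As a hedged starting point for the exercise, the sketch below compares an unlimited tree with a depth-limited one on synthetic data (a noisy sine curve, invented for illustration). It looks at the gap between train and test R^2 rather than at either score alone.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.5, size=300)

results = {}
for name, depth in [('unlimited', None), ('max_depth=3', 3)]:
    res = cross_validate(DecisionTreeRegressor(max_depth=depth, random_state=0),
                         X_demo, y_demo, scoring='r2', cv=5,
                         return_train_score=True)   # needed to get the train-scores
    train_r2 = res['train_score'].mean()
    test_r2 = res['test_score'].mean()
    results[name] = (train_r2, test_r2)
    print('{}:\ttrain R^2 {:.2f}\ttest R^2 {:.2f}\tgap {:.2f}'.format(
        name, train_r2, test_r2, train_r2 - test_r2))
```

A large gap between the two scores is the signature of overfitting; a small gap with a reasonable test score suggests the model generalizes.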

###Evaluation and Validation - Learning curves

Earlier we looked at learning curves where we assessed some scores as the model went through training iterations. Another very interesting thing to look at is how our model scores as we **introduce it to more and more data**. Run the cell below to get some code into the system that will help us plot this:

In [41]:
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit


def plot_learning_curve(ax, estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 20)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    ax : matplotlib.axes.Axes
        The axes to draw the learning curves on.

    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
          - None, to use the default 3-fold cross-validation,
          - integer, to specify the number of folds.
          - An object to be used as a cross-validation generator.
          - An iterable yielding train/test splits.

        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.

        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """

    ax.set_title(title)
    if ylim is not None:
        ax.set_ylim(*ylim)
    ax.set_xlabel("Training examples")
    ax.set_ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    ax.grid()

    ax.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    ax.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    ax.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    ax.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    ax.legend(loc="best")
    return

We now add two engineered features to our dataset, so that we have more to play with:

In [43]:
df['Age2'] = df['Age'].values**2
df['KMlog'] = np.log(df['KM'].values)

We now load a variety of features into different X's, and to get you started we've done the code already. We also do test-train splits, even if we later will do cross-validation: this time we do cross-validation properly, more or less.

In [45]:
# Select the columns/features from the Pandas dataframe that we want to use in the model:
x2D = np.array(df[['Age', 'KM']])
x5D = np.array(df[['Age', 'KM', 'HP', 'CC', 'Weight']])
x7D = np.array(df[['Age', 'KM', 'HP', 'CC', 'Weight', 'Age2', 'KMlog']])

y = np.array(df['Price'])

# Do a test-train-split like we did previously:
X0_train, X0_test, y0_train, y0_test = train_test_split(x2D, y, train_size=0.8)
X1_train, X1_test, y1_train, y1_test = train_test_split(x5D, y, train_size=0.8)
X2_train, X2_test, y2_train, y2_test = train_test_split(x7D, y, train_size=0.8)
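
The idea behind splitting first and cross-validating afterwards can be sketched as follows, on synthetic data invented for illustration: cross-validation runs on the training part only, and the held-out test part is scored once at the very end.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(400, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(0, 1.0, size=400)

# Hold out 20% that no model-selection step gets to see
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.8, random_state=0)

# Model selection and tuning: cross-validate on the training part only
demo_model = LinearRegression()
cv_r2 = cross_val_score(demo_model, X_tr, y_tr, cv=5, scoring='r2')

# Final check: fit on all training data and score the untouched test part once
test_r2 = demo_model.fit(X_tr, y_tr).score(X_te, y_te)

print('CV R^2: {:.3f} (+/- {:.3f})'.format(cv_r2.mean(), cv_r2.std()))
print('Hold-out test R^2: {:.3f}'.format(test_r2))
```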

We then choose a model. `LinearRegression` is not really a model that trains iteratively at all (it is fitted in a single least-squares solve), so please replace it with something more 'trainable' after you've had a look at the curves the first time.

In [47]:
# Create a linear regression model that we can train:
linreg = LinearRegression()
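
One way to read the remark about `LinearRegression` and training: its coefficients can be obtained directly from a single least-squares solve, with no iterations. The sketch below checks this on synthetic data (invented for illustration) by comparing `np.linalg.lstsq` with the fitted estimator.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 3))
y_demo = 4.0 + X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.1, size=100)

# Add a column of ones so that the intercept is estimated as well
A = np.column_stack([np.ones(len(X_demo)), X_demo])
beta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

linreg_demo = LinearRegression().fit(X_demo, y_demo)

print(np.allclose(beta[0], linreg_demo.intercept_))
print(np.allclose(beta[1:], linreg_demo.coef_))
```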

Here we go! Run the code below, and see if you can play with the plotting range (max/min) and the model. Can you get some insight into model complexity and data quantity?

In [49]:
# Cross validation with 100 iterations to get smoother mean test and train
# score curves, each time with 20% data randomly selected as a validation set.
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
fig, ax = plt.subplots(3, figsize=(20, 25))

plot_learning_curve(ax[0], linreg, "Learning Curves 2D (Linear Regression)", X0_train, y0_train, ylim=(0.77, 0.97), cv=cv, n_jobs=4)

plot_learning_curve(ax[1], linreg, "Learning Curves 5D (Linear Regression)", X1_train, y1_train, ylim=(0.8, 0.9), cv=cv, n_jobs=4)

plot_learning_curve(ax[2], linreg, "Learning Curves 7D (Linear Regression)", X2_train, y2_train, ylim=(0.8, 0.9), cv=cv, n_jobs=4)

display(fig)
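
To help read the plots, this sketch calls `learning_curve` directly on synthetic data (invented for illustration) and prints what it returns. The `plot_learning_curve` helper defined earlier averages these score arrays over `axis=1`, that is, over the cross-validation splits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, learning_curve

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 2))
y_demo = X_demo @ np.array([2.0, -1.0]) + rng.normal(0, 0.5, size=200)

cv_demo = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
sizes, train_scores, test_scores = learning_curve(
    LinearRegression(), X_demo, y_demo, cv=cv_demo,
    train_sizes=np.linspace(0.1, 1.0, 5))

print(sizes)                      # absolute number of training samples per tick
print(train_scores.shape)         # one row per tick, one column per CV split
print(test_scores.mean(axis=1))   # the mean validation curve that gets plotted
```

The largest training size is 80% of the rows, because each split sets aside 20% for validation.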

###Higher order polynomial regression

In [51]:
# Make a copy of our "original" data, so that we avoid later confusion
df_poly = df.copy(deep=True)
# When we do a mathematical operation on a Pandas Series object, like the "Age"-column, Pandas will automatically do the operation on each element in that Series (since it is built on NumPy, and normally acts as NumPy would)
df_poly['Age**2'] = df['Age']**2

# Shuffle the rows of the dataframe, since cross_validate with cv=5 uses K-Fold splits that do not shuffle for us:
df_poly = df_poly.sample(frac=1)
# Select the columns/features from the Pandas dataframe that we want to use in the model:
X_poly = np.array(df_poly[['Age', 'Age**2']])
# X_poly = np.array(df_poly[['Age']])
y_poly = np.array(df_poly['Price'])

# Create a linear regression model that we can train:
linreg_poly = LinearRegression()
# Print some information about the linear model and its parameters:
print(linreg_poly)

# Train the model using CV and multiple scoring on the data we have prepared:
cv_results = cross_validate(linreg_poly, # Provide our model to the CV-function
                            X_poly, # Provide all the features (in real life only the training-data)
                            y_poly, # Provide all the "correct answers" (in real life only the training-data)
                            scoring=('r2', 'neg_mean_squared_error', 'neg_mean_absolute_error'), 
                            cv=5 # Cross-validate using 5-fold (K-Fold method) cross-validation splits
                           )

MAE  = -cv_results['test_neg_mean_absolute_error']
RMSE = np.sqrt(-cv_results['test_neg_mean_squared_error'])
R2   = cv_results['test_r2']


print('Average R^2:\t {:.2f} (+/- {:.2f})'.format(R2.mean(), R2.std()))
print('Average MAE:\t {:.2f} (+/- {:.2f})'.format(MAE.mean(), MAE.std()))
print('Average RMSE:\t {:.2f} (+/- {:.2f})'.format(RMSE.mean(), RMSE.std()))

y_pred = cross_val_predict(linreg_poly, 
                            X_poly,
                            y_poly,
                            cv=5
                           )

fig, ax = plt.subplots(1, 1, figsize= (20, 15))
ax.scatter(df_poly['Age'], y_poly)
ax.plot(df_poly['Age'], y_pred, 'ro')
ax.set_title('Price of used cars as a function of age')
ax.set_ylabel('Price [$]')
ax.set_xlabel('Age [Months]')
display(fig)

This concludes the model training, selection, and evaluation lab.

In this lab you investigated techniques for training, selecting, and evaluating machine learning models, with a focus on two "classical" categories of models: regression and classification.