# Basic Regression with Azure Databricks

### Initial configuration

In this section we perform some imports and initial configurations to make sure everything is properly prepared for the next steps.

We are also using one of the popular Machine Learning modules in the data science world, scikit-learn.


![](http://scikit-learn.org/stable/_static/scikit-learn-logo-small.png)
**Scikit-learn is a widely used library for Machine Learning in Python**
- Contains simple and efficient tools for data mining and data analysis
- Accessible to everybody, and reusable in various contexts
- Built on NumPy, SciPy, and matplotlib (the "big three")
- Open source, commercially usable - BSD license

In [4]:
# Do the most standard imports for DS:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# We all have high-resolution displays now. Make sure we exploit that.
%config InlineBackend.figure_format = 'retina' 

# Adjust some colors and fonts to make our plots easier to navigate and understand:
plt.style.use('seaborn-colorblind')
plt.rcParams['axes.axisbelow'] = True
mpl.rcParams['axes.titlesize'] = 20
mpl.rcParams['axes.labelsize'] = 16
mpl.rcParams['xtick.labelsize'] = 14
mpl.rcParams['ytick.labelsize'] = 14
mpl.rcParams['font.size'] = 16   # 10
mpl.rcParams['legend.fontsize'] = 14
# Tell Pandas to only show us two decimals
pd.set_option('display.precision', 2)

# Some necessary sklearn imports that we will need later
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn import metrics

### Optional code, but quite useful for this lab context! ###
# Ignore warnings from scikit-learn?
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings("ignore", category=DataConversionWarning)

**IMPORTANT**

If this is the first notebook you run from this lab, make sure you run the steps to import the data as indicated in the <a href="$./01 Model Training Selection Evaluation">introductory notebook</a> of this lab.

Next, let's load the dataset for this lab.
Be sure to update the table name  "usedcars\_clean\_#####" (replace ##### to make the name unique within your environment).

In [6]:
df_clean = spark.sql("SELECT * FROM usedcars_clean_#####")
df = df_clean.toPandas()

### A gentle start: Linear Regression

Let's investigate how the age of a used car influences its sales price. We start by assigning the age in months to `X` and the price information to `y`:

In [9]:
X = df['Age']
y = df['Price']

It is then easy to make a scatterplot showing how the datapoints available to us relate to each other. We will now use `matplotlib` "directly" through its easy interface called `pyplot`, here called upon using the shortcut `plt`. Using `plt` allows you to build a figure through simple lines of code, one line at a time.

PS: If the figure does not fit your screen, try passing a `figsize=(width, height)` argument to `plt.subplots()` below. The first number is the width, the second number is the height.

In [11]:
fig, ax = plt.subplots()

# Populate the figure
plt.scatter(X, y)

# Set various labels
plt.title('Price of used cars as function of age')
plt.ylabel('Price [$]')
plt.xlabel('Age [Months]')

# Extras?
plt.grid() # Turn plot-grid on

# Show figure
display(fig)

Remember that Databricks notebooks have similar capabilities; all you need is to provide a Spark dataframe and the Plot Options (which are already configured for the next cell).

In [13]:
display(df_clean)

In general, "doing Machine Learning" on some data is not well defined: We have to agree on what the *goal* with the Machine Learning or analysis is.

For now, let's say that "doing Machine Learning" on the used car dataset means to build a model **using historic car price data**, that can **predict/guess a price for a used car, given some information about a used car**.

Linear regression is familiar to most, either from high school, MS Excel or similar. Many models assume that data is linear in some fashion, which often is true. Linear regression is the most straight-forward way to make a linear model, in the shape of the familiar equation \\(y = ax + b\\).

While this topic can seem a bit trivial to some, it is a good introduction to some parts of the general method for doing data science. 
For us, linear regression will also serve as an introduction to `scikit-learn` and much more...

**1D Linear Regression: price vs. age**

We will soon try some linear regression on our data, but first we will introduce an important concept in ML: splitting our data into training-data and testing-data. If you know this from before, feel free to fast-forward a bit.

**Splitting our dataset into training-data and testing-data**


Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a sufficiently advanced model that would just repeat the labels of the samples that it has just seen would have a *perfect score* but would fail to predict anything useful on yet-unseen data. This situation is called *overfitting* (or simply *cheating*!).
To avoid it when working with (supervised) machine learning we **always** want to split the data available to us into at least two categories:
- **Training** data: The data we will use to "teach" the algorithm. This is the data that the machine/algorithm will learn from.
- **Test** data: We keep this data "secret", and will not share it with the algorithm during the learning phase. After the system has been trained, we use this data to *test* the performance of the trained system.

We choose to keep 20% of the data available to us for testing, meaning that we have to remove this amount of data from our dataset *before* we start training/teaching our system. `sklearn` has a convenient function for this: the `train_test_split` function:

In [17]:
X = df['Age']
y = df['Price']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
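To make the idea of the split concrete, here is a minimal sketch of the same shuffle-and-slice done by hand with NumPy. The toy `ages` and `prices` arrays and the seed are assumptions made up for illustration; they are not part of the lab dataset.

```python
import numpy as np

# Hypothetical toy data: 10 car ages [months] and prices [$]
ages = np.arange(10, 110, 10)
prices = 20000 - 150 * ages

rng = np.random.default_rng(seed=0)      # seeded so the shuffle is repeatable
shuffled = rng.permutation(len(ages))    # random order of the row indices

n_train = int(0.8 * len(ages))           # 80% of the rows go to training
train_idx, test_idx = shuffled[:n_train], shuffled[n_train:]

ages_train, ages_test = ages[train_idx], ages[test_idx]
prices_train, prices_test = prices[train_idx], prices[test_idx]

print(len(ages_train), len(ages_test))
```

`train_test_split` does this kind of shuffling and slicing in one call, which is how `X_train`, `X_test`, `y_train` and `y_test` were produced above.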

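We will have to reshape these vectors a little to make `sklearn` happy. `sklearn` expects the features as a two-dimensional array with one row per datapoint and one column per feature, so the `reshape(-1,1)` call turns each one-dimensional vector into a single-column array.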

In [19]:
X_train

In [20]:
X_train = X_train.values.reshape(-1,1)
X_test = X_test.values.reshape(-1,1)

print('Number of datapoints (cars) contained in the X_train vector:', len(X_train))
print('Number of datapoints (cars) contained in the X_test vector:', len(X_test))
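As a small illustration of what `reshape(-1,1)` does, the sketch below uses a made-up array of four ages (an assumption for the example, not lab data) and prints its shape before and after.

```python
import numpy as np

ages = np.array([23, 36, 48, 60])   # hypothetical ages in months
print(ages.shape)                   # (4,)

ages_column = ages.reshape(-1, 1)   # -1 lets NumPy infer the number of rows
print(ages_column.shape)            # (4, 1)
```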

**Doing the linear regression (finally)**

We are now ready to use `sklearn` for some linear regression, using **stochastic gradient descent**:

In [22]:
# Create the model using sklearn (don't worry about the parameters for now).
# The default loss is the squared error, so we do not set it explicitly:
model = SGDRegressor(verbose=0, eta0=0.0003, max_iter=3000)

# Train/fit the model to the train-part of the dataset:
model.fit(X_train, y_train)

That's it, we've created and trained our model. Working with `sklearn` is quite nice.

We can now use our `sklearn` model to predict a used car price if we give it an age in months. We then use the `predict`-function, which is a function belonging to the model we have called `model`:

In [24]:
# Enter an age [months]
car_age  = 42

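# Have the model, which is trained on 80% of our dataset, make a "prediction" for a car price.
# predict() expects a 2D array of shape (n_samples, n_features), hence the double brackets:
car_cost = model.predict(np.array([[car_age]]))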

# Print the result:
print('\nI, the model, predict that a used car which is', car_age, 'months old, will cost ', car_cost[0],'$.\n')

Great. Feel free to try a couple of different ages.

- Does the prediction change if you run the code-cell with the prediction (the cell above) twice?
- Does it change if you re-create and re-train the model?
    - (This is because it is a *stochastic* gradient descent model. If we used the exact linear regression model we used yesterday, the result would have to be the same, as long as the training data stayed the same.)
    
The current model is very simple. For our one-dimensional `Age` input it is just:   \\( f(x) = ax + b\\)

We continue by extracting the \\(a\\) and the \\(b\\) from our trained model:

In [26]:
# Get the linear regression coefficients:
a = model.coef_[0]
b = model.intercept_[0]

print('a = ', a)
print('b = ', b)
print('\nModel equation:   y = {:.2f}x + {:.2f}'.format(a,b))
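For comparison, a straight-line fit also has an exact least-squares solution. The sketch below computes it on made-up, noise-free toy data (the arrays and the "true" line are assumptions for illustration) and checks it against `np.polyfit`, which solves the same problem.

```python
import numpy as np

# Hypothetical noise-free toy data following price = -150 * age + 20000
age = np.array([12.0, 24.0, 36.0, 48.0, 60.0])
price = -150.0 * age + 20000.0

# Closed-form least-squares estimates for y = a*x + b
a_exact = np.sum((age - age.mean()) * (price - price.mean())) / np.sum((age - age.mean()) ** 2)
b_exact = price.mean() - a_exact * age.mean()

# np.polyfit with degree 1 solves the same least-squares problem
a_fit, b_fit = np.polyfit(age, price, deg=1)

print('y = {:.2f}x + {:.2f}'.format(a_exact, b_exact))
```

An SGD-trained model should land near these values, but it is not guaranteed to match them exactly.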

This model is *very* transparent, and has very good explainability - that's for sure.

We can now use these two numbers together with the function for a straight line from above \\(y = f(x)\\) in order to look at the results:

In [28]:
fig, ax = plt.subplots()

# Plot the training datapoints:
plt.scatter(X_train, y_train)

# Plot the straight line:
x_plot = np.linspace(X_test.min(), X_test.max()) # Create a lot of x-axis-values
y_plot = a*x_plot + b # Calculate the value for y
plt.plot(x_plot, y_plot , color='red', linewidth=2.5) # Plot the linear model

# Set various labels
plt.title('Price of used cars as function of age')
plt.ylabel('Price [$]')
plt.xlabel('Age [Months]')
# Extras?
plt.grid() # Turn plot-grid on
# Show figure
display(fig)

Looks good, right? While this model might be too simple, it doesn't look all too bad considering the plot above. The reason for this might be that the age of the car actually is very important to the price of the car, in a fairly linear fashion.

**Put the model to work: Predict and evaluate its performance**

When we earlier asked the model to "predict" a single price for a car, we had no way of knowing whether the answer was "correct", since we didn't have anything to compare our model's claims to. This is what the **test**-dataset is for.

We can now try to predict the price of every car in the **test**-part of our dataset, the ages of which we called `X_test`. Remember that the model **has not seen these datapoints before**, so it could not possibly have used these datapoints to optimize its model.

In [31]:
# X_test is a list of car ages in months
# Make "predictions" for car prices for each of these ages:
y_pred = model.predict(X_test)

Let's compare these price predictions to the actual prices:

In [33]:
# Make a dataframe that shows both y_test and y_pred, and show it
df_prediction = pd.DataFrame([np.asarray(y_test), y_pred, np.asarray(y_test) - y_pred],
                     index=['Actual cost of used car:','Predicted cost of used car:', 'Error'])
df_prediction

**Questions:**
- What do you think of the performance?
- Given the table above, how would you assess the performance of the model? Or in other words: Is it "good enough" to assess the performance of a model through a few sample predictions?

Comparing the actual price with the predicted price for the test-dataset manually only gives us insight into single predictions. Let's rather plot the actual prices from the test-dataset in the same plot as the train-dataset and the model:

In [35]:
fig, ax = plt.subplots()

### Populate the figure
# Plot the training datapoints:
plt.scatter(X_train, y_train, label='train-data')
# Plot the straight line:
x_plot = np.linspace(X_test.min(), X_test.max()) 
plt.plot(x_plot, (a*x_plot + b) , color='red', linewidth=2.5, label='model')
# Plot the test-data:
plt.scatter(X_test, y_test, color='orange', edgecolors='black', label='test-data')

# Set various labels
plt.title('Price of used cars as function of age')
plt.ylabel('Price [$]')
plt.xlabel('Age [Months]')
# Extras?
plt.grid() # Turn plot-grid on
plt.legend()
# Show figure
display(fig)

The **error for each price-prediction** for the cars in our test-dataset can now be extracted from the plot above as the vertical distance between an orange point and the red model-line. Let's plot this error directly:

In [37]:
fig, ax = plt.subplots()

# Make a list of all the errors in the test-dataset:
errors = y_pred - y_test

### Populate the figure
# Plot the test-data:
plt.scatter(X_test, errors, color='red', edgecolors='black', label='test-data prediction error')

# Set various labels
plt.ylabel('Model Prediction Error [$]')
plt.xlabel('Age [Months]')
# Extras?
plt.grid() # Turn plot-grid on
plt.legend()
plt.ylim((-12500, 12500))
# Show figure
display(fig)

We now see the error in dollars between the predicted price and the actual price for our test-dataset. **These errors, in different forms, are actually what the model uses to improve itself when it trains.** If we took the absolute value of each of the errors in the plot above, and then took the average of all those error-values, we would be left with what is called the **Mean Absolute Error (MAE)**.

Optional: Have a quick look at [the `sklearn` website for the MAE](http://scikit-learn.org/stable/modules/model_evaluation.html#mean-absolute-error) and also check out the section for the [**Mean Squared Error (MSE)**](http://scikit-learn.org/stable/modules/model_evaluation.html#mean-squared-error). 



The following code extracts these metrics, and also \\(R^2\\) (quickly explained below in the blue box), from our linear regression (gradient descent) model, doing exactly what we did above for the test-dataset and the equations given on the webpages you had a look at. What do you think of the performance? Did you see something in the error-plot above that might explain why the RMSE is larger than the MAE?

In [39]:
MAE  =          metrics.mean_absolute_error( y_test, y_pred)
RMSE = np.sqrt( metrics.mean_squared_error(  y_test, y_pred) )
R2   =          metrics.r2_score(            y_test, y_pred)

print('MAE:\t {:.2f}$'.format(MAE))
print('RMSE:\t {:.2f}$'.format(RMSE))
print('R^2:\t {:.2f}'.format(R2))
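The two error metrics can also be computed by hand, which makes it easier to see why a few large errors pull the RMSE above the MAE. This is a minimal sketch on made-up prices (the arrays are assumptions for illustration); the last prediction is deliberately far off.

```python
import numpy as np

# Hypothetical actual and predicted prices [$]; the last prediction is far off
y_true = np.array([10000.0, 12000.0, 9000.0, 15000.0])
y_hat = np.array([10500.0, 11500.0, 9500.0, 11000.0])

errors = y_hat - y_true
mae = np.mean(np.abs(errors))          # average absolute error
rmse = np.sqrt(np.mean(errors ** 2))   # squaring weights large errors more heavily

print('MAE:  {:.2f}$'.format(mae))
print('RMSE: {:.2f}$'.format(rmse))
```

With three errors of 500$ and one of 4000$, the MAE is 1375$, while the RMSE is noticeably larger because the single large error dominates the squared sum.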

**Assessing the performance and error of our models is a critical part of Machine Learning.**

**Optional read**

**\\(R^2\\)**: The coefficient \\(R^2\\) is defined as \\(1 - u/v\\), where \\(u\\) is the residual sum of squares `((y_true - y_pred) ** 2).sum()` and \\(v\\) is the total sum of squares `((y_true - y_true.mean()) ** 2).sum()`. The best possible score is 1.0, and the score can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an \\(R^2\\) score of 0.0.
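The definition can be checked with a few lines of NumPy. The helper `r2_by_hand` and the toy prices below are assumptions for illustration; in the lab itself `metrics.r2_score` does this work.

```python
import numpy as np

def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - u/v, following the definition above."""
    u = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    v = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - u / v

y_true = np.array([10000.0, 12000.0, 9000.0, 15000.0])   # hypothetical prices [$]

perfect = r2_by_hand(y_true, y_true)                                  # exact predictions
constant = r2_by_hand(y_true, np.full_like(y_true, y_true.mean()))    # always predict the mean
worse = r2_by_hand(y_true, np.zeros_like(y_true))                     # always predict 0$

print(perfect, constant, worse)
```

The three cases illustrate the range described above: a perfect model scores 1.0, the constant mean model scores 0.0, and a model worse than the mean scores below zero.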

### Learning Curves - Model tuning

We have now seen...
- How we use `train_test_split`
- How we use a linear model in `sklearn`
- What the model has learned, and what it means to assess the **error** of the test-dataset in prediction

In short, we have seen some **results** of doing linear regression. But how did the `SGDRegressor` model learn in the first place? In short, and as you might know, it learned through *iterations*, using a method called **Stochastic Gradient Descent (SGD)**.
In simplified terms this means that the model went through the following process:

First start out with a guess for both \\(a\\) and \\(b\\) in \\(y = ax + b\\). Then:
1. Use this temporary model to do some predictions on the train-dataset.
2. Use the *error* from those predictions to adjust both \\(a\\) and \\(b\\).
3. Go back to step one.

Each such data-driven improvement to the model is called an *iteration* or *epoch* -- the details of exactly what defines these terms for different models is outside the scope of this lab. For now, this means that it is time to see how our SGD model learns, so that we can make it perform better and/or faster!
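The loop described above can be sketched in plain NumPy. This is a simplified illustration, not how `SGDRegressor` is implemented: the toy data, the fixed `learning_rate` and the number of epochs are all assumptions chosen so that the example converges.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical toy data on a small scale so a fixed step size works
x = np.linspace(0.0, 1.0, 50)
y = 3.0 * x + 2.0                 # the "true" line: a = 3, b = 2

a, b = 0.0, 0.0                   # initial guess
learning_rate = 0.1               # assumed step size, chosen for this toy data

for epoch in range(200):          # one epoch = one pass over the training data
    for i in rng.permutation(len(x)):       # visit the samples in random order
        error = (a * x[i] + b) - y[i]       # step 1: predict and measure the error
        a -= learning_rate * error * x[i]   # step 2: adjust a and b against the error
        b -= learning_rate * error
    # step 3: go back to step one (next epoch)

print('a = {:.3f}, b = {:.3f}'.format(a, b))
```

On real data such as the car ages and prices, the features are on a much larger scale, which is one reason the lab uses a very small `eta0`.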

In [43]:
# Choose features for training, and the target feature for predictions:
X = df['Age']
y = df['Price']

# Shuffle the order of the rows in the dataframe, and split into train and test:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

# Do the following if we only have one feature in our X
X_train = X_train.values.reshape(-1,1)
X_test = X_test.values.reshape(-1,1)

Good. We are now ready to again train the model, like before, but *in small chunks at a time*. The code below creates an `SGDRegressor` like before, but tells it to continue training where it left off each time we call the `.fit()` function. It's all explained in the code cell itself (but don't spend time understanding the code if you are new to all of this - the main point comes later):

In [45]:
# Define some lists that can store our performance scores as we train our model:
MAE_train_list = []
RMSE_train_list = []
R2_train_list = []
MAE_test_list = []
RMSE_test_list = []
R2_test_list = []
a_list = []
b_list = []
# Also make a list to store how many iterations were made at the point of scoring:
iterations = []

# How many iterations do we want to work through between each time we store new model scores?
iterations_per_loop = 100

# Create a model! Use the keyword "warm_start=True" to make the model continue where it left off each time we call the .fit() function again.
# The default loss is the squared error, so we do not set it explicitly:
model = SGDRegressor(verbose=0, eta0=0.0003, max_iter=iterations_per_loop, warm_start=True)

# The argument to range() decides how many iterations we train our model in total, through
#     iterations_per_loop * 'argument to range'. By default we then have 100*30=3000 iterations
for i in range(30):
    # Train the model with "iterations_per_loop" iterations
    model.fit(X_train, y_train)
    
    # Calculate predictions for the train dataset
    y_pred = model.predict(X_train)
    MAE_train_list.append(           metrics.mean_absolute_error( y_train, y_pred) )
    RMSE_train_list.append( np.sqrt( metrics.mean_squared_error(  y_train, y_pred)))
    R2_train_list.append(            metrics.r2_score(            y_train, y_pred) )
    
    # Calculate predictions for the test dataset
    y_pred = model.predict(X_test)
    MAE_test_list.append(           metrics.mean_absolute_error( y_test, y_pred) )
    RMSE_test_list.append( np.sqrt( metrics.mean_squared_error(  y_test, y_pred)))
    R2_test_list.append(            metrics.r2_score(            y_test, y_pred) )
    
    # Get the linear regression coefficients, to use later:
    a_list.append(model.coef_[0])
    b_list.append(model.intercept_[0])
    
    # Store the number of iterations done so far, to use as the x-axis in plotting
    iterations.append(iterations_per_loop*(i + 1))

If you understand the code above that's great, but if not don't worry - the important part is that we can now plot how the model improved over time, as we let it iterate more and more times to find a better solution (linear fit). The code below shows this, for both the train-dataset and the test-dataset. Please run this code, have a look at the plots and see if you can answer the questions below.

In [47]:
fig = plt.figure()

# Plot the first figure
plt.subplot(211)
plt.plot(iterations, MAE_train_list, marker='o', label='MAE train')
plt.plot(iterations, MAE_test_list, marker='o', label='MAE test')
plt.plot(iterations, RMSE_train_list, marker='o', label='RMSE train')
plt.plot(iterations, RMSE_test_list, marker='o', label='RMSE test')
plt.ylabel('Mean Error [$]\n(lower is better)')
plt.grid() # Turn plot-grid on
plt.legend()

# Plot the second figure
plt.subplot(212)
plt.plot(iterations, R2_train_list, marker='o', label='$R^2$ train')
plt.plot(iterations, R2_test_list, marker='o', label='$R^2$ test')
plt.ylabel('$R^2$ value\n(higher is better, 1.0 is maximum)')
plt.xlabel('Iterations')
plt.grid() # Turn plot-grid on
plt.legend()

# Show figure
plt.tight_layout()
display(fig)

**Questions:**
- Do you feel like the above plot makes sense, considering the fact that the model is trying to improve itself with every iteration?
- Both plots show that the model performs better when it is doing predictions on the train-dataset than on the test-dataset (agree?). Did you expect this?
- Considering the plots above, do you think it would help the error in our model if we let it train longer than we did?

What we observe in the plot is that the model is moving towards the minimum for the training error in the error-landscape, using gradient descent. We will visualize this in the next section.

### Cost functions - Navigating the error landscape

The following code defines a little function that calculates a squared-error cost function (half the mean squared error) at several points in space. 
Run it to show the "error landscape" we saw in the introductory presentation, and try to answer the questions below. Don't worry about the details of the code shown.

In [51]:
def compute_cost(X, y, theta):
    ''' Calculate half the mean squared error, divided by 10^6, given theta = [b, a] from y = ax + b '''
    return np.sum(np.square(np.matmul(X, theta) - y)) / (2 * len(y)) / 1000000

# Prepare X and y
X_train_mod = np.column_stack((np.ones(len(X_train)), (X_train)))
y_train_mod = (y_train.values.reshape(-1, 1))[:,0]

# Compute the error everywhere in the plot window, so that we can show contours
Xs, Ys = np.meshgrid(np.linspace(-0, 30000, 51), np.linspace(-500, 300, 51))
Zs = np.array([compute_cost(X_train_mod, y_train_mod, [t0, t1]) for t0, t1 in zip(np.ravel(Xs), np.ravel(Ys))])
Zs = np.reshape(Zs, Xs.shape)

fig, ax = plt.subplots()
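# Least-squares solution; the columns of X_train_mod are [1, Age], so the result is [b, a]
b_best, a_best = np.linalg.lstsq(X_train_mod, y_train_mod, rcond=None)[0]
plt.plot(b_best, a_best, 'g.', markersize=35)
plt.title('Error landscape for used car Price as a function of Age, linear regression SGD\n[Error shown has been divided by $10^6$]')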
plt.xlabel('Coefficient: b [$]')
plt.ylabel('Coefficient: a [$/month]')
CS = plt.contour(Xs, Ys, Zs, levels=[2,5,10,20,30,50,70,90, 120, 200, 300], cmap=mpl.cm.rainbow)
plt.plot(b_list, a_list, marker='o', color='red')

plt.clabel(CS, inline=1, fontsize=16)
display(fig)

Quite beautiful, right? The plot shows
- The "error landscape" through contour lines for constant errors
- The "perfect answer", meaning the best linear fit (analytically solvable, since the error landscape is convex for a linear model and squared errors) marked as a green dot
- The path that our `SGDRegressor` took with every batch of iterations earlier.

**Questions:**
- Can you see that the red path taken by our model is moving towards the perfect solution?
- Can you see that the distance in the error landscape between two consecutive batches of iterations becomes smaller and smaller as we get closer to the final solution? This is because our model is configured to make smaller and smaller steps as it trains.
- Can you see that the red path is "jumping around" the green dot at the end of the training? What part of the learning curve we saw earlier does this correspond to?

### Linear Regression with 5 features

Linear regression can be used for more than the regular "straight line in x-y-plot" that we are used to. It can actually be used in any number of dimensions, even if it is difficult to visualize.

Let's try to use more than just car age to make our linear model for the car price. We will, for now, use the "perfect solution" to the linear problem, the analytical solution given by `sklearn`'s `LinearRegression()`. This model does not "train" in the traditional sense, but we use it now in order to keep the pace up in this lab.

In [55]:
# Select the columns/features from the Pandas dataframe that we want to use in the model:
features_to_use = ['Age', 'KM', 'HP', 'CC', 'Weight']
x5D = np.array(df[ features_to_use ])
y5D = np.array(df['Price'])

# Do a test-train-split like we did previously:
X_train, X_test, y_train, y_test = train_test_split(x5D, y5D, train_size=0.8)

# Create a linear regression model that we can train:
model = LinearRegression()
# Train the model on the data we have prepared:
model.fit(X_train, y_train)

We have now selected our data, split it into train and test parts, created a linear model in `sklearn` and trained/fitted the model. Let's see how well it worked:

In [57]:
# Create a vector (list of data) that contains our "predicted" car prices based on the 5 features in the test dataset:
y_pred = model.predict(X_test)

MAE  = metrics.mean_absolute_error(y_test, y_pred)
RMSE = np.sqrt( metrics.mean_squared_error(y_test, y_pred) )
R2   = metrics.r2_score(y_test, y_pred)

print('MAE:\t {:.2f}$'.format(MAE))
print('RMSE:\t {:.2f}$'.format(RMSE))
print('R^2:\t {:.2f}'.format(R2))

Works quite well, right?

**Questions:**
- What did we assume (hope for) for ALL the features we now included in our *linear* model? (Hint: Linearity)
- Is that assumption justifiable?

To help answer the above two questions we can see what our model looks like in each of the 5 dimensions (the 5 features). 
(To do this we have to do a little trick, namely to make the model predict the car price in each of the feature dimensions, given the mean value of the features in the dimensions we are not plotting for. Don't worry about this if you didn't understand it, or ask us!)

While we admit that the Python code below could be done in a more elegant manner, it gets the job done. Don't worry about the code details, but inspect the 5 figures that should appear when you run the cell. 
**Can you see how the model "works" in each dimension? Was the assumption of linearity equally good for each feature? This insight can also be used for more complex models.**

In [59]:

fig, ax = plt.subplots(len(features_to_use), figsize=(16, 16))

for i, feature in enumerate( features_to_use ):
    # Generate an array that holds the mean values for each feature in the train-dataset:
    train_means_array = (np.ones((len(X_train),1))*X_train.mean(axis=0))

    # Get the current feature, chosen by the loop at the top of this cell, from our training dataset
    X_plot_train = X_train[:,i]
    train_means_array[:,i] = (X_plot_train)
    y_pred_train = model.predict(train_means_array)
    
    ### Populate the figure
    # Plot the training datapoints:
    ax[i].scatter(X_plot_train, y_train, label='train-data')
    ax[i].plot(X_plot_train, y_pred_train , color='red', linewidth=2.5, label='trained model')

    # Set various labels
    ax[i].set_ylabel('Price [$]')
    ax[i].set_xlabel('Feature: {}'.format(feature))
    # Extras?
    ax[i].grid() # Turn plot-grid on
    ax[i].legend()

# Show figure
display(fig)
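The "trick" used in the cell above can be seen more easily on a tiny example. The sketch below uses a made-up two-feature model (the `predict` function and its coefficients are assumptions, standing in for a trained model) and holds one feature at its mean while the other keeps its real values.

```python
import numpy as np

# Hypothetical model with two features: price = 20000 - 150*age - 0.02*km
def predict(features):
    return 20000.0 - 150.0 * features[:, 0] - 0.02 * features[:, 1]

X = np.array([[12.0, 20000.0],
              [36.0, 60000.0],
              [60.0, 100000.0]])

# Start from a copy of the data where every row holds the feature means
held_at_mean = np.tile(X.mean(axis=0), (len(X), 1))

# Restore the real values for the one feature we want to inspect (age)
held_at_mean[:, 0] = X[:, 0]

print(predict(held_at_mean))
```

The predictions now vary only with age, so plotting them against age shows the model's behaviour along that single dimension.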


What result did you get? While we do not always expect this model to benefit from adding more features (why?), there is something we have to emphasize at this point:

**Every time you split the data into test/train and re-fit (train) the model, the score (MAE/RMSE) changes. With the SGDRegressor, the score also changes with every new re-fit (training) of the model, even with the same data.**

Why?
1. Because the train-test split is splitting the data into test- and train-datasets **randomly**. This makes the results quite random, since the model might or might not be "lucky" with the test-dataset.
2. Because the training of the model is not deterministic for most models, meaning that the model does not always end up looking the same after training on the same data.

*This fact makes it difficult to compare different models in their performance.* We will therefore have a look at a technique that helps with this as well as another important problem in ML: **cross validation**
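Separately from cross validation, the randomness of the split itself can be pinned down with the `random_state` parameter of `train_test_split`. This is a minimal sketch on made-up arrays (`X_demo` and `y_demo` are assumptions, as is the seed value 42); it is not used elsewhere in this lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(-1, 1)   # hypothetical feature column
y_demo = np.arange(20)                  # hypothetical target

# The same random_state gives the same split on every call
_, X_test_1, _, y_test_1 = train_test_split(X_demo, y_demo, train_size=0.8, random_state=42)
_, X_test_2, _, y_test_2 = train_test_split(X_demo, y_demo, train_size=0.8, random_state=42)

print(np.array_equal(X_test_1, X_test_2))   # True
```

A fixed seed makes runs repeatable, but it does not remove the "lucky split" problem: the score still depends on which single split the seed happens to produce.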

### Fighting randomly varying results - Cross validation

There are two major issues in ML that are interlinked:
- If we adjust our ML models' parameters (we have not done this in this lab so far) based on our test-dataset scores, we are in fact 'leaking' information about the test-dataset(s) into the model.
- If we base our performance measure on the model's performance on a single test-dataset, we might just be "lucky" or "unlucky" with the model performance we report.

**Cross validation can help us make better decisions on the performance of our models, by letting us train the model several times on different data and give us statistics on the total performance**.

In the basic approach for cross validation, called k-fold CV, the training set is split into k smaller sets. The following procedure is followed for each of the k “folds”:

- A model is trained using k-1 of the folds as training data;
- the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).

K-fold cross-validation reports the performance measure as the average of the values computed in the process. The obvious implication is a higher need for computational resources. At the same time we do not waste much data (which does happen when a fixed test set is set aside), and this helps a lot with small datasets (*learn more from [scikit-learn's excellent documentation](http://scikit-learn.org/stable/modules/cross_validation.html)*).
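To show how the folds partition the data, here is a minimal sketch that builds the train and test indices for 5 folds by hand. The sample count of 10 is an assumption for illustration, and the rows are not shuffled here; `sklearn` provides its own splitters for real use.

```python
import numpy as np

n_samples, k = 10, 5
indices = np.arange(n_samples)
folds = np.array_split(indices, k)    # k roughly equal chunks of row indices

for i, test_idx in enumerate(folds):
    # Everything that is not in the current fold is used for training
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print('fold', i, 'train:', train_idx, 'test:', test_idx)
```

Every row index appears in exactly one test fold, so each datapoint is used for testing once and for training k-1 times.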


PS: We will in most of this lab use cross validation as a way to *train our model several times and each time test it against unseen data*. While this is correct use of cross validation **one would normally also test the performance against a test-dataset**. We will mainly skip this part, to keep this lab simple and since our dataset is limited in size.

**Do a test-run of cross-validation using linear regression**

Please run the following code after reading the comments in the code-cell and the comments below it.

In [64]:
# Import the cross-validation function:
from sklearn.model_selection import cross_validate, cross_val_predict

# Shuffle the rows of the dataframe, since we do not use the train_test_split function (and it normally does it for us):
df = df.sample(frac=1)
# Select the columns/features from the Pandas dataframe that we want to use in the model:
features_to_use = ['Age', 'KM', 'HP', 'CC', 'Weight']
X = np.array(df[ features_to_use ])
y = np.array(df['Price'])

# Create a linear regression model that we can train:
model = LinearRegression()
# Print some information about the linear model and its parameters:
print(model)

### NEW CODE: ###
# Train the model using CV and multiple scoring on the data we have prepared:
cv_results = cross_validate(model, # Provide our model to the CV-function
                            X, # Provide all the features (in real life only the training-data)
                            y, # Provide all the "correct answers" (in real life only the training-data)
                            scoring=('r2', 'neg_mean_squared_error', 'neg_mean_absolute_error'), 
                            cv=5 # Cross-validate using 5-fold (K-Fold method) cross-validation splits
                           )

**What did we just do?**
- We did not split the data prior to calling the cross-validation function, but rather gave it all the data we have (this is NOT normally the best way to do things, but we do it now for simplicity).
- The CV-function took the model and the data, and split into 5 separate "experiments":
    1. Each of the 5 experiments is just like what we did previously: they all call the `.fit(X,y)` function on the model
    2. Each experiment has 80% of the data as training data, and 20% of the data as test data
    3. This way, all the data is test data once in one of the experiments
    4. After the training is done in all experiments, the test data available in each experiment is tested using the `.predict(X_test)` functionality that we used before
    5. The scores (RMSE, MAE and R^2) are calculated for each of the experiments
- All results are stored in the variable `cv_results`

We can now get the different scores/results, and see how they changed between each of the 5 experiments:

In [66]:
MAE  = -cv_results['test_neg_mean_absolute_error']
RMSE = np.sqrt(-cv_results['test_neg_mean_squared_error'])
R2   = cv_results['test_r2']

print('MAE:', MAE)
print('RMSE:', RMSE)
print('R2:', R2)

See that there are 5 sets of results for each error measure?

We can then get the mean and the standard deviation of the different scores/results:

In [68]:
print('Average R^2:\t {:.2f} (+/- {:.2f})'.format(R2.mean(), R2.std()))
print('Average MAE:\t {:.2f} (+/- {:.2f})'.format(MAE.mean(), MAE.std()))
print('Average RMSE:\t {:.2f} (+/- {:.2f})'.format(RMSE.mean(), RMSE.std()))

Great. We now have scores that are more reliable, and that include some rough measure of their uncertainty.

**Questions:**
- Is the overall code in the cells above more complicated for us to write than before?
- Do you agree that a split into 5-fold CV makes sense?

You can now proceed to the next notebook in this lab - <a href="$./03 Classification with Azure Databricks">Create a classification model with Azure Databricks</a>.