#Classification with Azure Databricks

###Initial configuration

Run the next cell to import and configure the required modules.

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns

%config InlineBackend.figure_format = 'retina' 

plt.style.use('seaborn-colorblind')
plt.rcParams['axes.axisbelow'] = True
mpl.rcParams['axes.titlesize'] = 20
mpl.rcParams['axes.labelsize'] = 16
mpl.rcParams['xtick.labelsize'] = 14
mpl.rcParams['ytick.labelsize'] = 14
mpl.rcParams['font.size'] = 16   # 10

from sklearn import metrics

# Ignore warnings from scikit-learn?
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings("ignore", category=DataConversionWarning)

**IMPORTANT**

If this is the first notebook you run from this lab, make sure you run the steps to import the data as indicated in the <a href="$./01 Model Training Selection Evaluation">introductory notebook</a> of this lab.

Next, let's load the dataset for this lab.
Be sure to update the table name "usedcars\_clean\_#####" (replace ##### so that the name is unique within your environment).

In [5]:
df_clean = spark.sql("SELECT * FROM usedcars_clean_#####")
df = df_clean.toPandas()

# Shuffle the datarows randomly, to be sure that the ordering of rows is somewhat random:
df = df.sample(frac=1)

Even though we got familiar with the dataset in the previous exercise, it is always a good idea to quickly inspect the data we have loaded. This acts as a fail-safe in case we accidentally loaded the wrong dataset, and it is also a good reference for further development.

In [7]:
df.info()

If the previous output indicated that the Pandas dataframe now contains 1436 entries, we should be good to go. But let's also check a quick sample from the dataframe:

In [9]:
df.sample(3)

###Supervised Learning - Classification introduction

In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. 

**To get started, we define a classification problem by asking:
*"I only have $12,000. Given features describing a used car, can I afford it?"***

We start by labelling each car as affordable or not, based on whether its price is below our limit. As input features we use only the age and mileage of each car. We take a look at the data and then draw a scatterplot in which red marks the cars we cannot afford.

In [12]:
limit = 12000
print('Price limit set for classification: ${}'.format(limit))

# X contains ALL rows, but only the "Age" and "KM" columns:
X = np.array(df[['Age', 'KM']])  # we only take the first two features.

# We make `y` a vector (a list of numbers) that is 1 if we can afford a car, and 0 if not
y = 1*np.array(df['Price']<limit)

#Let's print the first rows of X and y, just so that we're sure of what we're dealing with:
print('X:\n', X[0:10,:])
print('y:\n', y[0:10])

Looking at the above data, you'll see that `y` is `1` if we can afford a car (it typically is quite old, and has a high mileage). Let's make a scatterplot:

In [14]:
# Plot
fig, ax = plt.subplots(figsize=(16, 8))

plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', cmap=plt.cm.bwr_r, alpha=0.5)
plt.xlabel('Age [months]')
plt.ylabel('Driven distance [km]')

display(fig)


That's it - is the plot understandable?

Some of the models we will look at in the following require the input data to be scaled, so let's create a scaled version of the `X` we have already chosen. We do not have to do anything to `y` this time, since it is already just 0 or 1.

In [16]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
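
To make the scaling step concrete, here is a minimal sketch on a few made-up age/mileage rows (the numbers and the `_demo` names are illustrative and not taken from the dataset). It suggests that `StandardScaler` subtracts each column's mean and divides by each column's standard deviation, so both features end up on a comparable scale.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows: [Age in months, KM]. Illustrative values only.
X_demo = np.array([[12.0, 15000.0],
                   [36.0, 60000.0],
                   [60.0, 120000.0],
                   [72.0, 180000.0]])

scaler_demo = StandardScaler()
X_demo_scaled = scaler_demo.fit_transform(X_demo)

# The same result computed by hand: (x - column mean) / column standard deviation
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print("Column means after scaling:", X_demo_scaled.mean(axis=0).round(6))
print("Column stds after scaling: ", X_demo_scaled.std(axis=0).round(6))
print("Matches the manual formula:", np.allclose(X_demo_scaled, X_manual))
```

Without this step, the mileage column (tens of thousands) would dominate the age column (tens) in any model that is sensitive to distances or to the size of the coefficients.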

###Classification - Logistic Regression

One of the workhorses for classification has long been **Logistic Regression**, also called the **logit or MaxEnt** classifier. The name is confusing because the method is used for classification, not regression: it passes the output of a linear model through a logistic function, and we normally use the 50% probability line as the division between the two classes.

Let's get started by creating the model and training it on the scaled data we just prepared.

In [19]:
from sklearn import linear_model
# Create a linear model for Logistic Regression
clf = linear_model.LogisticRegression(C=1)

# Fit the logistic regression model to the scaled data.
clf.fit(X_scaled, y)
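
The "logistic function on top of a linear model" idea can be checked numerically. The sketch below uses synthetic stand-in data (not the used-car dataset) and hypothetical `_demo` names. It recomputes the class probability by hand from the fitted coefficients and compares it with what scikit-learn reports.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, already-scaled stand-in data (NOT the used-car dataset)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

clf_demo = LogisticRegression(C=1).fit(X_demo, y_demo)

# Linear part: w . x + b
linear_score = X_demo @ clf_demo.coef_.ravel() + clf_demo.intercept_[0]
# The logistic (sigmoid) function squashes the score into a probability
prob_manual = 1.0 / (1.0 + np.exp(-linear_score))

prob_sklearn = clf_demo.predict_proba(X_demo)[:, 1]
print("Manual sigmoid matches predict_proba:", np.allclose(prob_manual, prob_sklearn))

# The predicted class flips where the probability crosses 50 %
pred_from_prob = (prob_manual > 0.5).astype(int)
print("Class is 1 where p > 0.5:", np.array_equal(clf_demo.predict(X_demo), pred_from_prob))
```

Because the probability is 50% exactly where the linear score is zero, the decision boundary of this model is a straight line in the (scaled) feature space.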

With the classifier trained on all our data we are ready to visualize the results. Please run the following code cell without worrying about the (messy) contents, and proceed below.

In [21]:
# This function DOES NOT HAVE TO BE UNDERSTOOD at this point :)
def plot_classification(clf, X, X_org, xlabel, y, ylabel, probplot):
    '''
    Take a classifier, scaled and original data as input, and show a very specialized plot
    indicating the data, its class and the classifiers decision boundary.
    '''
    from matplotlib.colors import ListedColormap
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    h = 1000  # number of mesh points along each axis
    x_min, x_max = X[:, 0].min() - .0, X[:, 0].max() + .0
    y_min, y_max = X[:, 1].min() - .0, X[:, 1].max() + .0
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, h), np.linspace(y_min, y_max, h))
    # Evaluate the classifier on every mesh point. Depending on `probplot` and on what the
    # classifier supports, Z holds class probabilities (in %), decision scores, or class labels.
    predict_proba = False
    if probplot:
        if hasattr(clf, "predict_proba"):
            Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]*100
            predict_proba = True
        elif hasattr(clf, "decision_function"):
            Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
        else:
            Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    else:
        Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    # Put the result into a color plot
    cm_bright = ListedColormap(['#FF0000', '#0000FF'])
    Z = Z.reshape(xx.shape)
    # Create the figure
    fig, ax = plt.subplots(1, figsize=(16, 8))
    # Re-do the mesh in original coordinates, as a hack to show the original axes
    x_min, x_max = X_org[:, 0].min() - .0, X_org[:, 0].max() + .0
    y_min, y_max = X_org[:, 1].min() - .0, X_org[:, 1].max() + .0
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, h), np.linspace(y_min, y_max, h))
    # Plot the decision boundary or the decision probability function as a background
    plt.contourf(xx, yy, Z, cmap=plt.cm.RdBu, alpha=0.8)
    if predict_proba:
        plt.colorbar()
    # Plot also the training points
    ax.scatter(X_org[:, 0], X_org[:, 1], c=y, edgecolors='k', cmap=cm_bright, alpha=0.6)
    # Set the labels, view-limits and show the plot
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    display(fig)

Let's use the plotting function defined above to show us **where in the two-dimensional age-mileage-space the classifier has decided to draw its decision boundary**. We'll explain a bit more after the plot.

In [23]:
plot_classification(clf, X_scaled, X, 'Age [Months]', y, 'Mileage [KM]', False)

The background color of this plot is now showing what category the classifier will give as output, if the input is provided as a used car's age and mileage (red means can't afford, blue means can afford).
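
The background coloring can be understood with a much smaller sketch than `plot_classification`: build a grid of points covering the feature space, ask the classifier for a prediction at every grid point, and reshape the answers back into the shape of the grid. The data and the `_demo` names below are synthetic stand-ins, not the notebook's variables.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with two features
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)
clf_demo = LogisticRegression().fit(X_demo, y_demo)

# Build an n-by-n grid that covers the range of both features
n = 50
xx, yy = np.meshgrid(np.linspace(X_demo[:, 0].min(), X_demo[:, 0].max(), n),
                     np.linspace(X_demo[:, 1].min(), X_demo[:, 1].max(), n))

# Flatten the grid into a list of (feature 1, feature 2) points and predict each one
grid_points = np.c_[xx.ravel(), yy.ravel()]          # shape (n*n, 2)
Z = clf_demo.predict(grid_points).reshape(xx.shape)  # back to shape (n, n)

print("Grid points:", grid_points.shape, "-> class map:", Z.shape)
# Z can then be handed to plt.contourf(xx, yy, Z) to color the background.
```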

**Questions:**
- If we provide the now trained classifier with the following car information: [Age=40, KM=100000]
    - Will the model predict that we can afford that car?
- Can we take the classifier model we now have trained and assess its performance? Why not? (Hint: Train/Test datasets...)
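
One way to check the first question is to ask the model directly. A new data point has to pass through the same scaler that was fitted on the training data before it reaches the classifier. The sketch below uses synthetic stand-in data, a made-up labelling rule and hypothetical names (`scaler_demo`, `clf_demo`). In this notebook you would use the fitted `scaler` and `clf` instead.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the [Age, KM] features and the affordability label
rng = np.random.RandomState(0)
age = rng.uniform(1, 80, size=300)
km = rng.uniform(1000, 200000, size=300)
X_demo = np.column_stack([age, km])
# Made-up rule for illustration: older, high-mileage cars count as affordable
y_demo = (age + km / 5000 > 50).astype(int)

scaler_demo = StandardScaler().fit(X_demo)
clf_demo = LogisticRegression(C=1).fit(scaler_demo.transform(X_demo), y_demo)

# A single new car: [Age, KM]. Note the 2D shape (one row, two columns).
new_car = np.array([[40, 100000]])
new_car_scaled = scaler_demo.transform(new_car)

pred = clf_demo.predict(new_car_scaled)[0]
proba = clf_demo.predict_proba(new_car_scaled)[0]
print("Predicted class (1 = affordable):", pred)
print("Probabilities [not affordable, affordable]:", proba.round(3))
```

Feeding the unscaled `[40, 100000]` straight into a model trained on scaled data would give a meaningless answer, which is why the fitted scaler is kept around.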

Most models that try to classify something also report their confidence in their classifications. These confidence numbers can be a bit difficult to interpret correctly, but they still provide valuable information. Run the following code cell to show in colors how certain the classifier is in predicting whether we can afford a used car.

In [25]:
plot_classification(clf, X_scaled, X, 'Age [Months]', y, 'Mileage [KM]', True)

Can you now see the way this method works? It has created a third dimension, in this plot the z-axis coming out of your screen, and there it has fitted a function that looks like "Niagara Falls": a hill with a non-linear slope, but with a straight edge.

###Classification: Support Vector Machines

While we are looking at simplified visuals of how a classifier might make its decisions when given all the data we have, let's also look at a not-so-linear example. The following cell shows a **Support Vector Machine Classifier** in action, based on the Gaussian-like *Radial Basis Function (RBF)* kernel. These models have a lot of variations and tunable parameters, but we will not go into these details here.


<div class="alert alert-block alert-info">

  If you have time, or for later:<br>
Here is a link to the scikit-learn overview article on Support Vector Machines: [(click here)](http://scikit-learn.org/stable/modules/svm.html)
</div>

In [29]:
# Re-prepare the inputs, in case we have run the cells in this notebook out-of-order:
X = np.array(df[['Age', 'KM']])  # we only take the first two features.
X_scaled = StandardScaler().fit_transform(X) # Use scaled/normalized X-values
y = 1*np.array(df['Price']<limit)

In [30]:
### Running this cell might take some 20 seconds ###
from sklearn import svm

# Create a Support Vector Machine Classifier with a Radial Basis Function (RBF) kernel:
clf = svm.SVC(kernel='rbf', C=1, gamma=2)

# Fit the data.
clf.fit(X_scaled, y)

# Plot the datapoints, colored by category, with a background showing a not-explained measure for the certainty
plot_classification(clf, X_scaled, X, 'Age [Months]', y, 'Mileage [KM]', True)

Even in two dimensions it is clear that methods like this SVC with an RBF kernel can be powerful tools with fewer limitations than the standard logistic regression we saw earlier. (However, there are many ways to make models more complicated and able to fit non-linear behaviour in the data; feature engineering is one of them. There is a jungle of models out there, and choosing the right ones can be very challenging.)


PS: In terms of the "Niagara Falls" third-dimension explanation we gave above, this method kind of makes three-dimensional Gaussian "hats" and puts them everywhere in the plot above in an overlapping fashion.
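
Each of these "hats" is one evaluation of the RBF kernel, k(a, b) = exp(-gamma * ||a - b||^2), and the SVC decision function is a weighted sum of such bumps centred on the support vectors. As a minimal sketch, the snippet below evaluates the kernel by hand on a few made-up points and compares it with scikit-learn's `rbf_kernel`. The points are illustrative, and `gamma = 2` mirrors the value used for the SVC above.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

gamma = 2.0  # same value as in the SVC cell above

# One "centre" and three made-up points at increasing distance from it
a = np.array([[0.0, 0.0]])
b = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 1.0]])

# k(a, b) = exp(-gamma * squared Euclidean distance)
sq_dist = ((a - b) ** 2).sum(axis=1)
k_manual = np.exp(-gamma * sq_dist)
k_sklearn = rbf_kernel(a, b, gamma=gamma).ravel()

print("Squared distances:", sq_dist)           # 0, 0.25 and 2
print("Kernel values:    ", k_manual.round(4))  # roughly 1.0, 0.6065, 0.0183
print("Matches sklearn:  ", np.allclose(k_manual, k_sklearn))
```

A larger `gamma` makes each hat narrower, so the decision boundary can wrap more tightly around individual data points.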

###Classification - Decision Tree Classifier

The familiar decision tree is very popular for classification problems, both two-class and multi-class. It is fast and typically performs decently, but it is the models *based on multiple decision trees*, such as **Random Forest, boosted trees, XGBoost etc.**, that are common in ML today.

Single decision trees are still popular, however, partly because we can easily plot them in full and inspect them. Please run the code below, have a look at the plot and see if you can answer the questions below it.

**Overfitting**

In [35]:
# Re-prepare the inputs, in case we have run the cells in this notebook out-of-order:
X = np.array(df[['Age', 'KM']])  # we only take the first two features.
y = 1*np.array(df['Price']<limit)

from sklearn import tree

# Create a Decision Tree Classifier with default settings:
clf = tree.DecisionTreeClassifier()


# Fit the data. NB: Now we do NOT use the scaled data. We could, but it is not necessary with decision trees!
clf.fit(X, y)

# Plot the datapoints, colored by category, with a background showing how it would classify a datapoint in that coordinate
plot_classification(clf, X, X, 'Age [Months]', y, 'Mileage [KM]', False)

A decision tree classifier with default settings has no limits on the depth of the tree. Take a close look at the figure above, and try to answer the following questions.

**Questions:**
- Does the classifier seem to do a good job in predicting the right class for each datapoint?
- This classifier was trained on all the data available. Is there a way to judge its performance?
- Do you think the decision boundaries shown in the figure are representative in answering what age-mileage we have to accept in order to afford a car?
- The above result is a good example of *overfitting*. Just to re-cap a little from the previous lab: What does this mean? Is this a good thing?
- What would you do with the classifier in order to reduce this overfitting?

**Exercise:**
Adjust the maximum depth of the tree (hint: the parameter name is max_depth).


Did this make the concept of overfitting more clear to you? We'd love to hear your feedback. And of course, ask us if you have any questions.
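
To see the effect in numbers, one option is to hold back part of the data and compare the accuracy on the training part with the accuracy on the held-back part for several tree depths. The sketch below does this on synthetic `make_moons` data with hypothetical `_demo` names. It is an illustration under those assumptions, not a result for the used-car dataset.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy two-class data (stand-in for the real features)
X_demo, y_demo = make_moons(n_samples=500, noise=0.35, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.8, random_state=0)

results = {}
for depth in [1, 3, 5, None]:  # None means "no depth limit", the default
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    results[depth] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"max_depth={str(depth):>4}  train accuracy={results[depth][0]:.2f}  "
          f"test accuracy={results[depth][1]:.2f}")
```

A tree without a depth limit can memorize the training data, so a large gap between the two accuracies is a typical sign of overfitting.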

###Classification performance evaluation

We will now have a quick look at performance evaluation for classifiers. Until now we have not done a `train-test-split`, making it impossible to assess the performance of our classifier. We will therefore do the split from now on.

Performance metrics like **Precision, Recall and F1-score** can be a bit tricky getting used to, but see if you can run the code below and get some information from the "classification report".

**Without cross-validation**

In [40]:
# X contains ALL rows, but only the "Age" and "KM" columns:
X = np.array(df[['Age', 'KM']])  # we only take the first two features.
X_scaled = StandardScaler().fit_transform(X)

# We make `y` a vector (a list of numbers) that is 1 if we can afford a car, and 0 if not
y = 1*np.array(df['Price']<limit)

from sklearn.model_selection import train_test_split

# Use the regular train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

# Create a classifier model:
clf = tree.DecisionTreeClassifier(max_depth=3)

# Fit the decision tree on the training data only.
clf.fit(X_train, y_train)

from sklearn.metrics import classification_report

# Get predictions from the model, from the test-dataset:
y_pred = clf.predict(X_test)

# Print a classification report, using sklearn:
print("Classification report:\n%s\n"
      % (classification_report(y_test, y_pred, target_names=['Not Affordable', 'Affordable'])))

What do you think about the scores? They are overall very good, but notice that the category "Affordable" is quite underrepresented. It is often smart to let the model compensate for this imbalance, but we will not look at this in this lab.
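
If the columns of the report feel abstract, it can help to compute them by hand once. The sketch below uses a tiny made-up set of labels and predictions (not output from the model above) and derives precision, recall and F1 for the positive class from the confusion-matrix counts.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Tiny made-up example: 1 = "Affordable", 0 = "Not Affordable"
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision_demo = tp / (tp + fp)   # of the cars predicted affordable, how many really are?
recall_demo = tp / (tp + fn)      # of the truly affordable cars, how many did we find?
f1_demo = 2 * precision_demo * recall_demo / (precision_demo + recall_demo)

print(f"tp={tp} fp={fp} fn={fn} tn={tn}")                                     # tp=3 fp=1 fn=1 tn=5
print(f"precision={precision_demo:.2f} recall={recall_demo:.2f} f1={f1_demo:.2f}")  # all 0.75
```

The same three numbers appear in the "Affordable" row of a classification report for these labels; the "Not Affordable" row is obtained by treating class 0 as the positive class instead.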

Let's have a look at the visual side of the classification, showing the "confidence" of the classifier on a percentage scale:

In [42]:
plot_classification(clf, X, X, 'Age [Months]', y, 'Mileage [KM]', True)

We will now use the test-dataset with the model to get some scoring data that can be used for some insight into the classification evaluation. First we make the model give us uncertainty estimates for what category it predicts for all the data in the test-dataset:

In [44]:
# Class-probability estimates for the test set (column 1 is the probability of class 1, "Affordable"):
y_score = clf.predict_proba(X_test)

# Do some necessary imports for this section:
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score

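# Average precision summarizes the precision-recall curve, so it needs the predicted
# probabilities for the positive class, not the hard 0/1 predictions:
average_precision = average_precision_score(y_test, y_score[:, 1])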
print('Average precision-recall score: {0:0.2f}'.format(average_precision))

We can now plot the **Precision-Recall curve**, another insight into our classification performance. It shows how precision in the classifications develops as a function of the recall score, and can for some datasets tell us about the data quality, model parameter consequences and much more. For now we'll have a look, and then move on.

In [46]:
# Get the precision and recall scores that are necessary:
precision, recall, _ = precision_recall_curve(y_test, y_score[:,1])

# Make the plot:
fig, ax = plt.subplots(figsize=(12,6))
plt.step(recall, precision, alpha=0.2, where='post')
plt.fill_between(recall, precision, step='post', alpha=0.2)
ax.set_xlabel('Recall')
ax.set_ylabel('Precision')
ax.set_title('2-class Precision-Recall curve: AP={0:0.2f}'.format(average_precision))

display(fig)
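
Each point on a precision-recall curve corresponds to one choice of probability threshold. As a minimal sketch with four made-up labels and scores (not the model output above), the snippet below picks two thresholds by hand and shows how precision and recall trade off.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up true labels and predicted probabilities for the positive class
y_true_demo = np.array([0, 0, 1, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8])

pr_points = {}
for threshold in [0.35, 0.5]:
    # Everything scoring at or above the threshold is predicted positive
    y_hat = (scores_demo >= threshold).astype(int)
    pr_points[threshold] = (precision_score(y_true_demo, y_hat),
                            recall_score(y_true_demo, y_hat))
    print(f"threshold={threshold:.2f}  precision={pr_points[threshold][0]:.2f}  "
          f"recall={pr_points[threshold][1]:.2f}")
# threshold 0.35 -> precision 0.67, recall 1.00
# threshold 0.50 -> precision 1.00, recall 0.50
```

Raising the threshold makes the model more cautious: fewer false positives (higher precision) at the cost of missing more true positives (lower recall).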

Receiver Operating Characteristic curves (ROC curves), and the area under them (ROC AUC), are also well-known instruments for gaining insight into classification data, models and algorithms.
[Some info on ROC on sklearn - click here if you have time.](http://scikit-learn.org/stable/modules/model_evaluation.html#receiver-operating-characteristic-roc)

In [48]:
# Get the ROC data:
fpr, tpr, _ = roc_curve(y_test, y_score[:,1])
roc_auc = auc(fpr, tpr)

# Plot a standard ROC plot:
fig, ax = plt.subplots(figsize=(6,6))
ax.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)'%roc_auc)
ax.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title('Receiver Operating Characteristic')
plt.legend(loc='lower right')
display(fig)
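
One intuitive reading of the ROC AUC is the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below checks this on four made-up labels and scores (not the model output above), using hypothetical `_demo` names.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, roc_auc_score

# Made-up true labels and predicted scores for the positive class
y_true_demo = np.array([0, 0, 1, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8])

# Area under the ROC curve, computed two ways with scikit-learn
fpr_demo, tpr_demo, _ = roc_curve(y_true_demo, scores_demo)
auc_from_curve = auc(fpr_demo, tpr_demo)
auc_direct = roc_auc_score(y_true_demo, scores_demo)

# Pairwise interpretation: how often does a positive outrank a negative?
pos = scores_demo[y_true_demo == 1]
neg = scores_demo[y_true_demo == 0]
auc_pairs = np.mean([p > n for p in pos for n in neg])

print(auc_from_curve, auc_direct, auc_pairs)  # all three are 0.75
```

An AUC of 0.5 corresponds to the dashed diagonal in the plot (random ranking), and 1.0 means every positive example is ranked above every negative one.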

After looking at some (not fully explained) performance scores, it's time to look at the actual tree and its decisions. The following code imports a multi-purpose library called **ELI5** and then plots the decision tree with some options.

In [50]:
import eli5

# target_names follow the class order: class 0 = not affordable, class 1 = affordable
x = eli5.show_weights(clf, feature_names=['Age', 'KM'], target_names=['NOT Affordable', 'Affordable'],
                     filled=True)

displayHTML("<html><body>" + x.data + "</body></html>")

Quite nice! Can you understand the essentials of the tree? It always puts "True" results to the left of a split. The top of each box shows the evaluation it runs to make the decision, and the color and bottommost text in each box tells us whether the model will predict that we can afford a car, given information about age and mileage of a car.

On the top left we see a nice feature of the ELI5-library (search for the library online if you have time!), namely a summary of what *weight* the tree puts on a given feature. In this case it tells us that age is MUCH more important than mileage in deciding whether we can afford a car!

In [52]:
chosen_datapoint = X_test[3]
print('The chosen datapoint from X_test is:', chosen_datapoint)
print('\nOutput from ELI5:')
x = eli5.show_prediction(clf, chosen_datapoint, show=eli5.formatters.fields.WEIGHTS, show_feature_values=True)
displayHTML("<html><body>" + x.data + "</body></html>")

Finally, we use ELI5 to try to "explain" WHY a given datapoint (a car: its age and mileage) was put in a certain category. Can you interpret its output? What do you think "``<BIAS>``" means?

**Using more features (no cross-validation)**

We will now try to:
1. Use more features from the dataset in our model
2. Use a "random forest" model
3. Extract the feature importance from the model, an often valuable piece of information

See if you can follow the code below. Are you happy with the score? We do this in a simplified manner, without cross-validation. Can you trust the results? It is an open question - we don't know the answer. Maybe we are overfitting our model? What plots would you plot to get some insight into this, and other "traps" the model might have fallen into?

In [56]:
df_ohe = df.copy(deep=True)
df_ohe['FuelType'] = df_ohe['FuelType'].astype('category')
df_ohe['MetColor'] = df_ohe['MetColor'].astype('category')
df_ohe['Automatic'] = df_ohe['Automatic'].astype('category')
df_ohe['Doors'] = df_ohe['Doors'].astype('category')
df_ohe = pd.get_dummies(df_ohe)

df_ohe.sample()

limit = 12000

ml_features_in_use = ['Age', 'KM', 'Weight', 'HP', 'CC', 'FuelType_cng', 'FuelType_diesel', 'FuelType_petrol']
X = np.array(df_ohe[ml_features_in_use])
y = 1*np.array(df['Price']<limit)

print('Price limit set for classification: ${}'.format(limit))
print('Using the following features: ', ml_features_in_use)

from sklearn.ensemble import RandomForestClassifier

# Use the regular train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

# Create a model:
clf = RandomForestClassifier(n_estimators=20, max_depth=3)

# Fit the random forest on the training data only.
clf.fit(X_train, y_train)

# Get predictions from the model, from the test-dataset:
y_pred = clf.predict(X_test)

# Print a classification report, using sklearn:
print("Classification report:\n%s\n"
      % (classification_report(y_test, y_pred, target_names=['Not Affordable', 'Affordable'])))
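
The feature names `FuelType_cng`, `FuelType_diesel` and `FuelType_petrol` used above come from one-hot encoding. As a minimal sketch on a made-up four-row frame (`df_demo` is hypothetical, not the lab data), `pd.get_dummies` replaces a categorical column with one indicator column per category and leaves numeric columns untouched.

```python
import pandas as pd

# Made-up miniature frame with one numeric and one categorical column
df_demo = pd.DataFrame({
    'Age': [23, 40, 61, 12],
    'FuelType': ['petrol', 'diesel', 'cng', 'petrol'],
})
df_demo['FuelType'] = df_demo['FuelType'].astype('category')

# One indicator column per category; the original FuelType column is dropped
df_demo_ohe = pd.get_dummies(df_demo)

print(df_demo_ohe.columns.tolist())
# ['Age', 'FuelType_cng', 'FuelType_diesel', 'FuelType_petrol']
print(df_demo_ohe)
```

Each row has exactly one of the `FuelType_*` indicators set, which lets models that expect numeric input work with the category.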

**Investigate feature importances**

We are at the end of the modelling and evaluation lab. We end with a great little feature of some models, and especially decision trees and their cousins (random forest etc.): Checking the "variable importance". We will not go into detail, but check out the code below, and see if you agree with what the model tells us are the "most important features" for deciding if we can afford a car with a set of properties.

Feel free to go back and change the model, or add/remove features. The number of estimators (`n_estimators`) and the depth (`max_depth`) can have a big impact on the feature importances, since they strongly affect the model. If you have time, add age^2 and other engineered features, and see if the model thinks these additions are "important"!

In [59]:
clf.feature_importances_

In [60]:
def plot_model_var_imp( model, X):
    imp = pd.DataFrame(
        model.feature_importances_ ,
        columns = ['Importance'] ,
        index = X.columns
    )
    imp = imp.sort_values( [ 'Importance' ] , ascending = True )
    
    fig, ax = plt.subplots(figsize=(12,6))
    imp['Importance'].plot(kind='barh')
    display(fig)

In [61]:
plot_model_var_imp(clf, df_ohe[ml_features_in_use])
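
To build some trust in these numbers, it can help to try them on data where the answer is known. The sketch below uses synthetic data with hypothetical column names: the label depends only on the `signal` column, and the two `noise_*` columns are pure noise. It suggests that the importances sum to 1 and that the informative column ranks first.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: only 'signal' carries information about the label
rng = np.random.RandomState(0)
X_demo = pd.DataFrame({
    'signal': rng.normal(size=500),
    'noise_a': rng.normal(size=500),
    'noise_b': rng.normal(size=500),
})
y_demo = (X_demo['signal'] > 0).astype(int)

forest_demo = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0)
forest_demo.fit(X_demo, y_demo)

# Pair each importance with its column name and sort, most important first
importances = pd.Series(forest_demo.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)

print(importances)
print("Sum of importances:", importances.sum())
```

Keep in mind that these impurity-based importances describe what the fitted model relied on, which is not necessarily the same as what drives the price in reality.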

**Classification using part-cross-validation**

For completeness, we also do the classification using cross-validation (with all the data, as we have done a few times before). Can you see that the naming of the scores can get a bit confusing? This is quite often the case. The code below works well, but some parts of the scoring are missing when we compare the two outputs. If you have time, see if you can match up the results from the 5 cross-validated runs with the "report".

In [63]:
# Import the cross-validation function:
from sklearn.model_selection import cross_validate, cross_val_predict

# Shuffle the rows of the dataframe, since cross-validation with an integer `cv` does not shuffle for us:
df = df.sample(frac=1)

limit = 12000
print('Price limit set for classification: ${}'.format(limit))

# Select the columns/features from the Pandas dataframe that we want to use in the model:
X = np.array(df[['Age', 'KM']])  # we only take the first two features.
y = 1*np.array(df['Price']<limit)  # 1 if we can afford the car, 0 if not (same convention as before)

# Create a decision tree classifier that we can train:
clf = tree.DecisionTreeClassifier(max_depth=3)

# Print some information about the model and its parameters:
print(clf)

# Train the model using CV and multiple scoring on the data we have prepared:
cv_results = cross_validate(clf, # Provide our model to the CV-function
                            X, # Provide all the features (in real life only the training-data)
                            y, # Provide all the "correct answers" (in real life only the training-data)
                            scoring=('f1', 'precision', 'recall', 'accuracy'), 
                            cv=5 # 5-fold cross-validation (scikit-learn stratifies the folds by class for classifiers)
                           )

F1_pos   = cv_results['test_f1']
P_pos   = cv_results['test_precision']
R_pos   = cv_results['test_recall']
A   = cv_results['test_accuracy']

print('\n-------------- Scores ---------------')
print('Average F1:\t {:.2f} (+/- {:.2f})'.format(F1_pos.mean(), F1_pos.std()))
print('Average Precision (y positive):\t {:.2f} (+/- {:.2f})'.format(P_pos.mean(), P_pos.std()))
print('Average Recall (y positive):\t {:.2f} (+/- {:.2f})'.format(R_pos.mean(), R_pos.std()))
print('Average Accuracy:\t {:.2f} (+/- {:.2f})'.format(A.mean(), A.std()))

# Get class predictions for all data, each made while the row was in a test fold, using cross_val_predict:
y_pred = cross_val_predict(clf, 
                            X,
                            y,
                            cv=5
                           )

from sklearn.metrics import classification_report
print("Classification report:\n%s\n"
      % (classification_report(y, y_pred, target_names=['Not Affordable', 'Affordable'])))
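
When matching the two outputs, it may help to know what `cross_validate` actually returns. The sketch below runs it on synthetic data (`make_classification`, with hypothetical `_demo` names) and lists the keys and the per-fold values. The `test_precision`, `test_recall` and `test_f1` entries refer to the positive class (label 1) only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-feature, two-class data (stand-in for the real features)
X_demo, y_demo = make_classification(n_samples=300, n_features=2, n_redundant=0,
                                     n_informative=2, random_state=0)

cv_demo = cross_validate(DecisionTreeClassifier(max_depth=3, random_state=0),
                         X_demo, y_demo,
                         scoring=('f1', 'precision', 'recall', 'accuracy'),
                         cv=5)

# One array per metric, with one value per fold
print(sorted(cv_demo.keys()))
for key in ('test_f1', 'test_precision', 'test_recall', 'test_accuracy'):
    print(f"{key:>15}: {cv_demo[key].round(2)}  mean={cv_demo[key].mean():.2f}")
```

The averages over the five folds need not equal the numbers in a report built from `cross_val_predict`, because the report pools all predictions first and computes each metric once.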

### Comparison of several scikit-learn classifiers

If you have time: The following code is from sklearn's website. If you are interested and have time, it is some fun code to play around with in order to build some intuition for how the most popular models work and react to different parameters and data. Enjoy!

In [66]:
# Code source: Gaël Varoquaux
#              Andreas Müller
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause
mpl.rcParams.update(mpl.rcParamsDefault)

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

h = .02  # step size in the mesh

names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Gaussian Process",
         "Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
         "Naive Bayes", "QDA"]

classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    GaussianProcessClassifier(1.0 * RBF(1.0)),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis()]

X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,
                           random_state=1, n_clusters_per_class=1)
rng = np.random.RandomState(2)
X += 2 * rng.uniform(size=X.shape)
linearly_separable = (X, y)

datasets = [make_moons(noise=0.3, random_state=0),
            make_circles(noise=0.2, factor=0.5, random_state=1),
            linearly_separable
            ]


figure, ax = plt.subplots(len(datasets), len(classifiers) + 1, figsize=(27, 9))

# iterate over datasets
for ds_cnt, ds in enumerate(datasets):
    print('ds_cnt: {0}'.format(ds_cnt))
    # preprocess dataset, split into training and test part
    X, y = ds
    X = StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = \
        train_test_split(X, y, test_size=.4, random_state=42)

    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))

    # just plot the dataset first
    cm = plt.cm.RdBu
    cm_bright = ListedColormap(['#FF0000', '#0000FF'])

    if ds_cnt == 0:
        ax[ds_cnt,0].set_title("Input data")
    # Plot the training points
    ax[ds_cnt,0].scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright,
               edgecolors='k')
    # and testing points
    ax[ds_cnt,0].scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6,
               edgecolors='k')
    ax[ds_cnt,0].set_xlim(xx.min(), xx.max())
    ax[ds_cnt,0].set_ylim(yy.min(), yy.max())
    ax[ds_cnt,0].set_xticks(())
    ax[ds_cnt,0].set_yticks(())
    
    i = 1
    # iterate over classifiers
    for name, clf in zip(names, classifiers):
        
        clf.fit(X_train, y_train)
        score = clf.score(X_test, y_test)

        # Plot the decision boundary. For that, we will assign a color to each
        # point in the mesh [x_min, x_max]x[y_min, y_max].
        if hasattr(clf, "decision_function"):
            Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
        else:
            Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]

        # Put the result into a color plot
        Z = Z.reshape(xx.shape)
        
        ax[ds_cnt,i].contourf(xx, yy, Z, cmap=cm, alpha=.8)

        # Plot also the training points
        ax[ds_cnt,i].scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright,
                   edgecolors='k')
        # and testing points
        ax[ds_cnt,i].scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright,
                   edgecolors='k', alpha=0.6)

        ax[ds_cnt,i].set_xlim(xx.min(), xx.max())
        ax[ds_cnt,i].set_ylim(yy.min(), yy.max())
        ax[ds_cnt,i].set_xticks(())
        ax[ds_cnt,i].set_yticks(())
        if ds_cnt == 0:
            ax[ds_cnt,i].set_title(name)
        ax[ds_cnt,i].text(xx.max() - .3, yy.min() + .3, ('%.2f' % score).lstrip('0'),
                size=15, horizontalalignment='right')
        
        i += 1

plt.tight_layout()
display(figure)

You can now move to the next notebook in the lab - <a href="$./04 Advanced Regression with Azure Databricks">Advanced Regression with Azure Databricks</a>.