# Lecture 6

## Classification
In this lecture we investigate another common supervised learning problem: classification.

Unlike the previous lecture, where we had to predict a real-valued function, this time we are asked to predict
categorical values.



# Setup

First, let's make sure this notebook works well in both Python 2 and 3, import a few common modules, ensure Matplotlib plots figures inline, and prepare a function to save the figures:

In [None]:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

# Common imports
import numpy as np
import os
import pickle

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
DATA_DIR = os.path.join(PROJECT_ROOT_DIR, 'data')
IMG_DIR = os.path.join(PROJECT_ROOT_DIR, 'img')
CHAPTER_ID = "classification"


def save_fig(fig_id, tight_layout=True):
    # Figures are stored under IMG_DIR/CHAPTER_ID, created on first use
    fig_dir = os.path.join(IMG_DIR, CHAPTER_ID)
    if not os.path.isdir(fig_dir):
        os.makedirs(fig_dir)
    path = os.path.join(fig_dir, fig_id + ".png")
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format='png', dpi=300)

# MNIST

The MNIST dataset is probably the most famous dataset in ML. It is a set of 70,000 small images of 
handwritten digits collected from high school students and employees of the US Census Bureau.

Each image is labeled with the digit it represents.

Since it is such a well-known dataset, sklearn provides helper functions to
download it and load it into memory.

**Warning**: `fetch_mldata()` is deprecated since Scikit-Learn 0.20. You should use `fetch_openml()` instead. However, it returns the unsorted MNIST dataset, whereas `fetch_mldata()` returned the dataset sorted by target (the training set and the test set were sorted separately). In general, this is fine, but if you want to get the exact same results as before, you need to sort the dataset using the following function:

In [None]:
def sort_by_target(mnist):
    reorder_train = np.array(sorted([(target, i) for i, target in enumerate(mnist.target[:60000])]))[:, 1]
    reorder_test = np.array(sorted([(target, i) for i, target in enumerate(mnist.target[60000:])]))[:, 1]
    mnist.data[:60000] = mnist.data[reorder_train]
    mnist.target[:60000] = mnist.target[reorder_train]
    mnist.data[60000:] = mnist.data[reorder_test + 60000]
    mnist.target[60000:] = mnist.target[reorder_test + 60000]

In [None]:
def maybe_download_mnist():
    # Download MNIST once and cache it as a pickle in DATA_DIR
    if not os.path.isdir(DATA_DIR):
        os.mkdir(DATA_DIR)
    pickle_path = os.path.join(DATA_DIR, 'mnist.pickle')
    if os.path.isfile(pickle_path):
        # cached copy available: load it instead of downloading again
        with open(pickle_path, 'rb') as f:
            return pickle.load(f)
    try:
        from sklearn.datasets import fetch_openml
        # note: on recent Scikit-Learn versions you may need as_frame=False
        # here to get NumPy arrays instead of a pandas DataFrame
        mnist = fetch_openml('mnist_784', version=1, cache=True)
        mnist.target = mnist.target.astype(np.int8) # fetch_openml() returns targets as strings
        sort_by_target(mnist) # fetch_openml() returns an unsorted dataset
    except ImportError:
        from sklearn.datasets import fetch_mldata
        mnist = fetch_mldata('MNIST original')
    with open(pickle_path, 'wb') as f:
        pickle.dump(mnist, f)
    return mnist
mnist = maybe_download_mnist()

Take a look at the object. 

It has a dict-like structure. You can access the data, stored
as a 2-dimensional NumPy array, with ``mnist.data``.



In [None]:
print([k for k in mnist])

Take a look at the shape of your dataset. 


In [None]:
mnist.data.shape

In [None]:
X, y = mnist["data"], mnist["target"]
X.shape

In [None]:
y.shape

In [None]:
28*28

There are 70,000 images with 784 features each. Each feature represents the intensity of
a pixel and ranges from 0 to 255. 
Each row of the data array is a 28x28 image flattened into a vector.

You can plot a sample by reshaping its vector back into the original 28x28 matrix and passing it to
``matplotlib.pyplot.imshow``.

In [None]:
some_digit = X[36000]

def draw(sample, save_path = None):
    # reshape the flat 784-vector back into a 28x28 image
    sample_image = sample.reshape(28, 28)
    plt.imshow(sample_image, cmap = mpl.cm.binary,
           interpolation="nearest")
    plt.axis("off")
    if save_path is not None:
        # save_fig() builds the full path itself
        save_fig(save_path)

draw(some_digit)

In [None]:
# EXTRA
def plot_digits(instances, images_per_row=10, **options):
    size = 28
    images_per_row = min(len(instances), images_per_row)
    images = [instance.reshape(size,size) for instance in instances]
    n_rows = (len(instances) - 1) // images_per_row + 1
    row_images = []
    n_empty = n_rows * images_per_row - len(instances)
    images.append(np.zeros((size, size * n_empty)))
    for row in range(n_rows):
        rimages = images[row * images_per_row : (row + 1) * images_per_row]
        row_images.append(np.concatenate(rimages, axis=1))
    image = np.concatenate(row_images, axis=0)
    plt.imshow(image, cmap = mpl.cm.binary, **options)
    plt.axis("off")

In [None]:
plt.figure(figsize=(9,9))
example_images = np.r_[X[:12000:600], X[13000:30600:600], X[30600:60000:590]]
plot_digits(example_images, images_per_row=10)
plt.show()

**Wait**. Remember the last lecture? What should you do with the dataset at this stage?

In [None]:
# split

In [None]:
# shuffle
import numpy as np
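
One possible way to fill in the two cells above is sketched below on a tiny synthetic array, so that it runs on its own. The helper name `split_and_shuffle` and the toy data are assumptions made for illustration. For MNIST the split point would be 60,000, the same boundary used by `sort_by_target`; shuffling the training part matters because the rows are sorted by digit.

```python
import numpy as np

def split_and_shuffle(X, y, n_train, seed=42):
    """Use the first n_train rows for training (shuffled), the rest for testing."""
    X_train, X_test = X[:n_train], X[n_train:]
    y_train, y_test = y[:n_train], y[n_train:]
    rng = np.random.RandomState(seed)
    shuffle_index = rng.permutation(n_train)   # one permutation for both X and y
    return X_train[shuffle_index], X_test, y_train[shuffle_index], y_test

# Toy stand-in for MNIST: 10 samples with 3 features each, labels 0..9
X_toy = np.arange(30).reshape(10, 3)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = split_and_shuffle(X_toy, y_toy, n_train=8)
print(X_tr.shape, X_te.shape)
print(y_tr)
```

With the notebook's variables this would read `X_train, X_test, y_train, y_test = split_and_shuffle(X, y, 60000)`.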

# Training a Binary Classifier
Before addressing the entire problem, let's focus on a tiny version of it.

Imagine you are asked to train a *binary classifier*, which should be only capable of 
detecting 5s.

Let's create the new target vectors for this problem.

In [None]:
y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)

some_digit = X[36000]


Okay, now let’s pick a classifier and train it. A good place to start is a **Stochastic
Gradient Descent (SGD)** classifier, using Scikit-Learn’s ``SGDClassifier`` class. This classifier
has the advantage of being capable of handling very large datasets efficiently.

**Note**: a few hyperparameters will have a different default value in future versions of Scikit-Learn, so a warning is issued if you do not set them explicitly. This is why we set `max_iter=5` and `tol=-np.infty`, to get the same results as in the book while avoiding the warnings. On recent versions, `tol=None` has the same effect (no early stopping): `np.infty` no longer exists in NumPy 2.0, and recent Scikit-Learn versions may reject negative tolerances.

In [None]:
from sklearn.linear_model import SGDClassifier
# train it yourself

In [None]:
draw(some_digit)
sgd_clf.predict([some_digit])

In [None]:
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")

## Performance Measures for Classification Problems

In lecture 5 we saw two common ways of measuring the performance of a model,
i.e., the RMSE and the MAE.
 
**Question**: How do they apply in this context? Do they appear to you as a reasonable choice?

*Answer*: 

----

## Measuring Accuracy with Cross-Validation
First of all, the accuracy is defined as the ratio between the number of correct predictions 
and the number of samples taken into account. 

More formally:

$$
ACCURACY = \frac{\#\,\text{correct predictions}}{\#\,\text{predictions}}
$$

### Implementing Cross-Validation
Occasionally you will need more control over the cross-validation process than what
cross_val_score() and similar functions provide. In these cases, you can implement
cross-validation yourself; it is actually fairly straightforward.

The StratifiedKFold class performs stratified sampling 
to produce folds that contain a representative ratio of each class. 

At each iteration the code creates a clone of the classifier, trains that clone on the training folds, 
and makes predictions on the test fold. Then it counts the number of correct predictions and
outputs the ratio of correct predictions.

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

# stratified splitting of the training samples
# (no random_state: it only has an effect when shuffle=True, and recent
# Scikit-Learn versions raise an error if it is set without shuffling)
skfolds = StratifiedKFold(n_splits=3)

# your code here
# Hint1 : call split and iterate over the 3 folds created by skfolds
# Hint2 : clone the classifier
# Hint3 : call fit and then predict
# Hint4 : compute and print the accuracy of the current classifier
#for train_index, test_index in skfolds.split(X_train, y_train_5):
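
The loop described by the hints could look like the sketch below. It is not the reference solution: it runs on a synthetic imbalanced dataset (`X_demo`, `y_demo`, generated with `make_classification`) rather than on `X_train` and `y_train_5`, and it uses an `SGDClassifier` with default settings; all of these are assumptions made to keep the example self-contained.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced binary problem standing in for (X_train, y_train_5)
X_demo, y_demo = make_classification(n_samples=600, n_features=20,
                                     weights=[0.9, 0.1], random_state=42)
clf = SGDClassifier(random_state=42)
skfolds = StratifiedKFold(n_splits=3)

fold_accuracies = []
for train_index, test_index in skfolds.split(X_demo, y_demo):
    clone_clf = clone(clf)                        # fresh, unfitted copy
    clone_clf.fit(X_demo[train_index], y_demo[train_index])
    y_pred = clone_clf.predict(X_demo[test_index])
    n_correct = np.sum(y_pred == y_demo[test_index])
    fold_accuracies.append(n_correct / len(y_pred))

print(fold_accuracies)
```

Cloning inside the loop ensures that each fold is evaluated by a model that has never seen that fold's test samples.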


Of course, for the lazier among us, sklearn provides a utility function:

In [None]:
from sklearn.model_selection import cross_val_score
cross_val_scores = cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
cross_val_scores.mean()

95%! Great, right?

**Question**: Can we be satisfied with this result?

Let's implement another classifier: 
a dummy classifier that always predicts
``False``.

In [None]:
from sklearn.base import BaseEstimator
class Never5Classifier(BaseEstimator):
    def fit(self, X, y=None):
        pass
    def predict(self, X):
        return np.zeros((len(X), 1), dtype=bool)

In [None]:
never_5_clf = Never5Classifier()
cross_val_score(never_5_clf, X_train, y_train_5, cv=3, scoring="accuracy")

*Answer*: draw your conclusions. What just happened?

Accuracy is not the preferred performance measure for classifiers, 
in particular when you have skewed data, i.e., some classes are much more frequent than others.
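
The effect of skew can be reproduced on a toy example. The labels below are made up: one positive out of ten, which is roughly the proportion of 5s among the digits. The function name `accuracy` is an assumption for illustration.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Toy labels: 1 positive ("5") and 9 negatives ("not-5")
y_true = np.array([True] + [False] * 9)
y_never = np.zeros(len(y_true), dtype=bool)   # always predicts "not-5"

print(accuracy(y_true, y_never))
```

A classifier that never detects a single positive still scores 90% accuracy here, which is why accuracy alone says little on skewed data.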

## Confusion Matrix

A much better way to evaluate the performance of a classifier is to look at the confusion
matrix.

The general idea is to count the number of times instances of class A are
classified as class B.

For example, to know the number of times the classifier confused
images of 5s with 3s, you would look at the row for class 5 and the column for class 3 of the confusion
matrix.

To compute the confusion matrix, sklearn once again provides a helper function.
In this case we are going to use the ``confusion_matrix`` function from ``sklearn.metrics``.

In order to use it you first need a set of predictions.

In [None]:
from sklearn.model_selection import cross_val_predict

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

In [None]:
def display_confusion_matrix(matrix, classes):
    # tab-separated table: header row with the predicted classes,
    # then one row per actual class
    s = "\t"+"\t".join(classes)
    for i, row in enumerate(matrix):
        s += "\n{}\t{}".format(classes[i], "\t".join([str(x) for x in row]))
    return s

from sklearn.metrics import confusion_matrix
print(display_confusion_matrix(confusion_matrix(y_train_5, y_train_pred), ['not-5', '5']))

Each row represents an actual class, while each column represents a predicted class.

Since we are dealing with binary classification, we can devise 4 categories:
* True Positive - samples predicted as positive that are actually positive (right classification)
* True Negative - samples predicted as negative that are actually negative (right classification)
* False Positive - samples predicted as positive that are actually negative (wrong classification)
* False Negative - samples predicted as negative that are actually positive (wrong classification)
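
The four categories can be counted directly from two boolean arrays. This is a minimal sketch on made-up labels; the helper name `binary_confusion_counts` is an assumption, and in practice ``confusion_matrix`` does this work for you.

```python
import numpy as np

def binary_confusion_counts(y_true, y_pred):
    """Return (TN, FP, FN, TP) for boolean true labels and predictions."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = int(np.sum(y_true & y_pred))     # predicted positive, actually positive
    tn = int(np.sum(~y_true & ~y_pred))   # predicted negative, actually negative
    fp = int(np.sum(~y_true & y_pred))    # predicted positive, actually negative
    fn = int(np.sum(y_true & ~y_pred))    # predicted negative, actually positive
    return tn, fp, fn, tp

y_true_demo = [True, True, True, False, False, False, False, False]
y_pred_demo = [True, True, False, True, False, False, False, False]
tn, fp, fn, tp = binary_confusion_counts(y_true_demo, y_pred_demo)

# Rows = actual class, columns = predicted class (negative first)
print(np.array([[tn, fp], [fn, tp]]))
```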

A perfect binary classifier would have only true positives and true negatives, so its confusion matrix would have nonzero values only on its main diagonal.

In [None]:
y_train_perfect_predictions = y_train_5

In [None]:
confusion_matrix(y_train_5, y_train_perfect_predictions)

---
![img/confmatrix.png](img/confmatrix.png)

## Precision and Recall
The confusion matrix is the building block for two other measures:

* Precision - it is given by: 
    $$PRECISION = \frac{TP}{TP+FP}$$
    
    It measures how many of the samples
    predicted as positive are actually positive.
* Recall - it is given by: $$ RECALL = \frac{TP}{TP+FN}$$

    It measures how many of the actual positive samples are captured by the classifier.
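
As a quick worked example of the two formulas, the sketch below computes both measures from made-up counts. The function names and the numbers are assumptions for illustration only.

```python
def precision(tp, fp):
    """TP / (TP + FP); defined as 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def recall(tp, fn):
    """TP / (TP + FN); defined as 0.0 when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

# Made-up counts: 30 hits, 10 false alarms, 30 missed positives
tp_demo, fp_demo, fn_demo = 30, 10, 30
print(precision(tp_demo, fp_demo))   # 30 / 40
print(recall(tp_demo, fn_demo))      # 30 / 60
```

Here three out of four positive predictions are correct, yet only half of the actual positives are found.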
    
Let's apply these measures to your trained classifier.

In [None]:
from sklearn.metrics import precision_score, recall_score

precision_score(y_train_5, y_train_pred)

In [None]:
recall_score(y_train_5, y_train_pred)

See? The model is far from being as good as the >90% accuracy suggested before.

Precision and recall can be combined into a single metric, the so-called $F_1$ score, defined as their harmonic mean:

$$
F_1 = \frac{2}{\frac{1}{precision} + \frac{1}{recall}} = 2 \times \frac{precision \times recall}{precision + recall} = \frac{TP}{TP + \frac{FN+FP}{2}}
$$

The $F_1$ score favors classifiers that have similar precision and recall. 
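
To see why, compare two hypothetical classifiers with the same arithmetic mean of precision and recall. The values below are made up, and `f1` is an illustrative helper, not a sklearn function.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

balanced = f1(0.8, 0.8)   # precision and recall are equal
lopsided = f1(1.0, 0.6)   # same arithmetic mean (0.8), but unbalanced

print(balanced, lopsided)
```

The harmonic mean gives more weight to the lower of the two values, so the unbalanced classifier ends up with the lower score.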

**Question**: Giving the same importance to both precision and recall, is it always the most reasonable choice?

*Answer*....

**Question**: How are precision and recall correlated with each other?

*Answer*....


In [None]:
from sklearn.metrics import f1_score
f1_score(y_train_5, y_train_pred)


## Precision/Recall Tradeoff

For each sample, ``SGDClassifier`` makes its decision based on a decision function. This function
assigns a score to the sample, and the final decision depends on whether or not
this score exceeds a certain threshold.

**Question**: How does this threshold affect recall and precision?

*Answer*

---
Let's see what happens when you increase or decrease the threshold.

With the ``decision_function`` method you can access the score that 
your classifier assigns to the sample given as input.

In [None]:
y_scores = sgd_clf.decision_function([some_digit])
y_scores

In [None]:
threshold = 0
y_some_digit_pred = (y_scores > threshold)

In [None]:
y_some_digit_pred

Now you can play around with thresholds

In [None]:
threshold = 200000
y_some_digit_pred = (y_scores > threshold)
y_some_digit_pred

If you want to obtain the scores instead of the predicted classes for each data point, you can call
``cross_val_predict`` with ``method="decision_function"``:

In [None]:
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")
y_scores.shape

Instead of manually setting the threshold you can use the ``precision_recall_curve`` function of sklearn.


In [None]:
from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
print(precisions.shape, recalls.shape, thresholds.shape)


In [None]:
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision", linewidth=2)
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall", linewidth=2)
    plt.xlabel("Threshold", fontsize=16)
    plt.legend(loc="upper left", fontsize=16)
    plt.ylim([0, 1])

plt.figure(figsize=(8, 4))
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.xlim([-700000, 700000])

plt.show()

Now you can simply select the threshold value that gives you the best precision/recall
tradeoff for your task.
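
Picking a threshold from the arrays returned by ``precision_recall_curve`` can be sketched as follows. The helper `threshold_for_precision` and the hand-made arrays are assumptions for illustration; the arrays only mimic the shapes of the real output, where `precisions` and `recalls` have one more entry than `thresholds`.

```python
import numpy as np

def threshold_for_precision(precisions, recalls, thresholds, target):
    """Return (threshold, recall) at the lowest threshold reaching the target
    precision, or None if the target is never reached."""
    precisions = np.asarray(precisions)[:-1]   # last entry has no threshold
    recalls = np.asarray(recalls)[:-1]
    reached = precisions >= target
    if not reached.any():
        return None
    first = int(np.argmax(reached))            # index of the first True
    return thresholds[first], recalls[first]

# Hand-made arrays shaped like the output of precision_recall_curve
thresholds_demo = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
precisions_demo = np.array([0.50, 0.60, 0.80, 0.92, 1.00, 1.00])
recalls_demo = np.array([1.00, 0.95, 0.85, 0.60, 0.20, 0.00])

t90, recall_at_t90 = threshold_for_precision(
    precisions_demo, recalls_demo, thresholds_demo, target=0.90)
print(t90, recall_at_t90)
```

Returning the recall together with the threshold makes the cost of the chosen precision level visible immediately.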

In [None]:
y_train_pred_90 = (y_scores > 70000)

In [None]:
precision_score(y_train_5, y_train_pred_90)

In [None]:
recall_score(y_train_5, y_train_pred_90)

Another way to select a good tradeoff between precision and recall is to plot precision directly against recall.

In [None]:
def plot_precision_vs_recall(precisions, recalls):
    plt.plot(recalls, precisions, "b-", linewidth=2)
    plt.xlabel("Recall", fontsize=16)
    plt.ylabel("Precision", fontsize=16)
    plt.axis([0, 1, 0, 1])

plt.figure(figsize=(8, 6))
plot_precision_vs_recall(precisions, recalls)

plt.show()

Now if someone presents you with a 90 percent precision classifier, you should also ask 
for its recall. In fact, a high-precision classifier is not very useful if its recall is too low!

## ROC curves
The receiver operating characteristic (ROC) curve is another common tool used
with binary classifiers.

The ROC curve plots the true positive rate (another name for recall) against the false positive rate.
The latter measure, FPR, is the ratio of negative samples that are wrongly classified as positive.

FPR is equal to 1 minus the true negative rate, which is the ratio of negative instances that are correctly classified as negative.

In other words, the ROC curve plots recall (sensitivity) against 1 - specificity.

As a recap:
* True positive rate (recall / sensitivity)
    $$TPR = \frac{TP}{TP+FN} = 1 - FNR$$

* False positive rate 
    $$
        FPR = \frac{FP}{FP+TN} = 1 - TNR
    $$
* True negative rate (specificity)
    $$
        TNR = \frac{TN}{TN+FP} = 1 - FPR
    $$
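
The rates in the recap can be checked on made-up counts. The helper `rates_from_counts` and the numbers are assumptions for illustration.

```python
def rates_from_counts(tp, fp, tn, fn):
    """Return (TPR, FPR, TNR) computed from the four confusion-matrix counts."""
    tpr = tp / (tp + fn)   # recall / sensitivity
    fpr = fp / (fp + tn)   # share of negatives wrongly flagged as positive
    tnr = tn / (tn + fp)   # specificity
    return tpr, fpr, tnr

tpr_demo, fpr_demo, tnr_demo = rates_from_counts(tp=80, fp=30, tn=870, fn=20)
print(tpr_demo, fpr_demo, tnr_demo)
```

Each set of counts, obtained at one threshold, gives a single point (FPR, TPR) of the ROC curve.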

To plot the ROC curve, you first need to compute the TPR and FPR for various threshold values, using the ``roc_curve`` function.

In [None]:
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

In [None]:
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate', fontsize=16)
    plt.ylabel('True Positive Rate', fontsize=16)

plt.figure(figsize=(8, 6))
plot_roc_curve(fpr, tpr)
plt.show()

Once again, the ROC curve shows that there is a tradeoff: the higher the recall, the more false positives the classifier produces. 

The dotted line represents the ROC curve associated with a purely random classifier. 
One may intuitively say that the performance of a model improves as the distance between the curves increases.

Another way of comparing two classifiers is to look at the area under the ROC curve. This metric is called
AUC. An ideal classifier has an AUC equal to 1, whereas a purely random classifier has an AUC equal to 0.5.
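
The area can be approximated with the trapezoidal rule over the points of the curve. The sketch below is a plain NumPy illustration with a hypothetical helper `auc_trapezoid`; it is not the sklearn implementation.

```python
import numpy as np

def auc_trapezoid(fpr, tpr):
    """Area under a curve given by points sorted by increasing FPR."""
    fpr = np.asarray(fpr, dtype=float)
    tpr = np.asarray(tpr, dtype=float)
    widths = np.diff(fpr)                       # horizontal step of each segment
    mean_heights = (tpr[1:] + tpr[:-1]) / 2.0   # average height of each segment
    return float(np.sum(widths * mean_heights))

random_auc = auc_trapezoid([0, 1], [0, 1])          # diagonal of a random classifier
perfect_auc = auc_trapezoid([0, 0, 1], [0, 1, 1])   # curve hugging the top-left corner
print(random_auc, perfect_auc)
```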

sklearn provides a function for computing this area.


In [None]:
from sklearn.metrics import roc_auc_score

roc_auc_score(y_train_5, y_scores)

### Review

You have seen a number of different ways of measuring the performance of a classifier. 

You are probably wondering how to choose the correct one.
As a general rule of thumb, the precision/recall curve is preferred whenever the positive class is rare or when you care more about false positives than false negatives. Otherwise, you should prefer
the ROC curve and the AUC score.

If you look at the AUC score, you see a value really close to 1. 
This result suggests that your model works pretty well. However, if you look at the PR curve you can see that
there is still room for improvement: the good ROC performance is mostly due to the fact that the 
dataset is strongly unbalanced towards the negative class.

**Exercise**

Train a ``RandomForestClassifier`` (Hint: pass ``predict_proba`` as the ``method`` argument).

Display the ROC curve, precision and recall.

In [None]:
from sklearn.ensemble import RandomForestClassifier
# train it: define `forest_clf`, then get the prediction probabilities
# y_probas_forest = ...  # hint: cross_val_predict(..., method="predict_proba")

In [None]:
y_scores_forest = y_probas_forest[:, 1] # score = proba of positive class
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5,y_scores_forest)

In [None]:
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, "b:", linewidth=2, label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.legend(loc="lower right", fontsize=16)

plt.show()

In [None]:
roc_auc_score(y_train_5, y_scores_forest)

In [None]:
y_train_pred_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3)
precision_score(y_train_5, y_train_pred_forest)

In [None]:
recall_score(y_train_5, y_train_pred_forest)

# Multiclass classification

So far we have tackled only binary classification problems. There are other kinds of
classification problems, though. One of these is the *multiclass* classification problem.

Instead of having to choose between only a pair of labels, you need to assign one label among a number
of them.

Some algorithms, such as the Random Forest classifier or the naive Bayes classifiers, are capable of handling
multiple classes directly.

However, there are several strategies that allow you to extend the binary classification framework to
this context.

For instance, regarding the MNIST dataset, one may think of using several binary classifiers. 
This strategy can be implemented according to two main approaches:

1. One vs All (OVA) - if you are asked to predict a class among N different ones, you can train N different binary classifiers. Each of them is trained to detect a single class against all the others.
2. One vs One (OVO) - given N different classes, you train N x (N-1)/2 different binary classifiers, one for every possible pair of classes.

For most binary classification algorithms the OVA approach is the one to be preferred.
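
For the ten MNIST digits, the number of binary classifiers required by each strategy can be counted explicitly. The variable names below are illustrative.

```python
from itertools import combinations

classes = list(range(10))                    # the digits 0..9

# One vs All: one "class c against the rest" task per class
ova_tasks = [(c, "rest") for c in classes]

# One vs One: one task per unordered pair of distinct classes
ovo_tasks = list(combinations(classes, 2))

print(len(ova_tasks), len(ovo_tasks))
```

OVO trains many more classifiers, but each one only sees the training samples of the two classes it has to distinguish.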

sklearn automatically detects when you need to solve a multiclass classification problem.
In the following cell we train the same classifier defined above; this time we pass the entire 
training targets to the ``fit`` function.

In [None]:
sgd_clf.fit(X_train, y_train)
sgd_clf.predict([some_digit])

Under the hood,
Scikit-Learn actually trained 10 binary classifiers, got their decision scores for the
image, and selected the class with the highest score.

To convince yourself about the actual strategy adopted by sklearn you can 
compute the scores associated with the ``some_digit``.

In [None]:
some_digit_scores = sgd_clf.decision_function([some_digit])
some_digit_scores

Each score reflects how confident the classifier is that the given sample belongs to the corresponding class (it is a decision score, not a probability).

The class is assigned by computing the argmax of this vector of scores.

In [None]:
np.argmax(some_digit_scores)

**Note**: Be careful, this 5 is actually the index of the maximum element, not its class. Here the two coincide only because the classes are the digits 0 to 9, stored in order.

The classes on which the classifier was trained can be accessed via the ``classes_`` attribute of the estimator.

In [None]:
sgd_clf.classes_

In [None]:
sgd_clf.classes_[5]
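
The difference between the argmax index and the predicted class becomes visible as soon as the classes are not simply 0, 1, 2, and so on. The arrays below are made up for illustration: a hypothetical classifier trained only on the digits 3, 5 and 8.

```python
import numpy as np

classes_demo = np.array([3, 5, 8])           # plays the role of clf.classes_
scores_demo = np.array([[-1.2, 0.7, -3.4]])  # one row of decision scores

best_index = int(np.argmax(scores_demo))     # position of the highest score
predicted_class = int(classes_demo[best_index])

print(best_index, predicted_class)
```

The index is 1 while the predicted class is 5, so the lookup through the class array is what turns a position into a label.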

If you want to force sklearn to use the opposite strategy, OVO, you can use the
``OneVsOneClassifier``.
You need to pass a regular binary classifier and then call the usual fit method.

In [None]:
from sklearn.multiclass import OneVsOneClassifier
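# tol=None disables the stopping criterion (same effect as the old tol=-np.infty,
# which no longer works with NumPy 2.0)
ovo_clf = OneVsOneClassifier(SGDClassifier(max_iter=5, tol=None, random_state=42))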
ovo_clf.fit(X_train, y_train)
ovo_clf.predict([some_digit])

In [None]:
len(ovo_clf.estimators_)

Now let's try with the **RandomForestClassifier**.

In [None]:
forest_clf.fit(X_train, y_train)
forest_clf.predict([some_digit])

In [None]:
forest_clf.predict_proba([some_digit])

In [None]:
cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")

It gets over 84% on all test folds. If you used a random classifier, you would get 10%
accuracy, so this is not such a bad score, but you can still do much better.

We have seen that scaling the data usually helps the training process.
Now, scale the data and then train the same ``SGDClassifier``.


In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline
#your code here
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")

## Error Analysis

Here, we will assume that you have found a promising model and
you want to find ways to improve it. One way to do this is to analyze the types of
errors it makes.

You need to make predictions using the ``cross_val_predict`` function and then
call the ``confusion_matrix()`` function, just like
you did earlier:

In [None]:
y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
conf_mx = confusion_matrix(y_train, y_train_pred)
conf_mx

In [None]:
import seaborn as sns
def plot_confusion_matrix(matrix):
    fig = plt.figure(figsize=(8,8))
    ax = fig.add_subplot(111)
    sns.heatmap(matrix, 
                annot=False,  
                cmap="YlGnBu")
            
plot_confusion_matrix(conf_mx)

Let’s focus the plot on the errors. First, you need to divide each value in the confusion
matrix by the number of images in the corresponding class, so you can compare error
rates instead of absolute numbers of errors (which would make abundant classes look
unfairly bad):

In [None]:
row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums

Now let’s fill the diagonal with zeros to keep only the errors, and let’s plot the result:

In [None]:
np.fill_diagonal(norm_conf_mx, 0)
plot_confusion_matrix(norm_conf_mx)

The columns for classes 8 and 9 are quite bright, which tells you that many images get misclassified as
8s or 9s. 

Similarly, the rows for classes 8 and 9 are also quite bright, telling you that 8s
and 9s are often confused with other digits. 

Conversely, some rows are pretty dark,
such as row 1: this means that most 1s are classified correctly.

# Multilabel classification

Until now each instance has always been assigned to just one class. In some cases you
may want your classifier to output multiple classes for each instance. For example,
consider a face-recognition classifier: what should it do if it recognizes several people
on the same picture?

Such a classification system that outputs multiple
binary labels is called a multilabel classification system.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)

This code creates a ``y_multilabel`` array containing two target labels for each digit
image: the first indicates whether or not
the digit is large (7, 8, or 9) and the second
indicates whether or not it is odd. The next lines create a ``KNeighborsClassifier``
and train it on this multiple-target array.
Now you can make a prediction, and notice
that it outputs two labels:

In [None]:
knn_clf.predict([some_digit])

And it gets it right! The digit 5 is indeed not large (``False``) and odd (``True``).

There are many ways to evaluate a multilabel classifier, and selecting the right metric
really depends on your project. For example, one approach is to measure the $F_1$ score
for each individual label (or any other binary classifier metric discussed earlier), then
simply compute the average score. This code computes the average $F_1$ score across all
labels:

In [None]:
y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3, n_jobs=-1)
f1_score(y_multilabel, y_train_knn_pred, average="macro")

This assumes that all labels are equally important, which may not be the case.

One simple option is to give each label a weight equal to its support (i.e., the number of instances with that
target label). To do this, simply set average="weighted" in the preceding code.
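
The difference between the two averaging schemes can be illustrated with made-up per-label scores and supports. The numbers below are assumptions; in practice ``f1_score`` computes both from the labels and predictions.

```python
import numpy as np

# Hypothetical per-label F1 scores and supports for two labels
per_label_f1 = np.array([0.9, 0.6])
support = np.array([300, 100])     # number of positive instances per label

macro_f1 = float(per_label_f1.mean())                          # every label counts equally
weighted_f1 = float(np.average(per_label_f1, weights=support)) # frequent labels count more

print(macro_f1, weighted_f1)
```

The weighted average moves towards the score of the label with more instances, while the macro average treats both labels alike.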

## KNN classifier

## SVM

## Decision Tree

## Logistic Regression

## 1. An MNIST Classifier With Over 97% Accuracy

The KNN classifier appears to be the one with the best performance.
Try to push its performance further with automatic hyperparameter tuning.
The grid below searches over two hyperparameters: ``n_neighbors`` (k, the number of neighbors to take into account) and ``weights`` (how the neighbors' votes are weighted).

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = [{'weights': ["uniform", "distance"], 'n_neighbors': [3, 4, 5]}]

knn_clf = KNeighborsClassifier()
grid_search = GridSearchCV(knn_clf, param_grid, cv=5, verbose=3, n_jobs=-1)
grid_search.fit(X_train, y_train)

## 2. Testing and Selecting the best Model
Now, you are required to build an entire pipeline. 
You need to select a number of models, tune each of them, and then report in a table the following
information:

1. Name of the estimator
2. Best configuration
3. Accuracy
4. Precision
5. Recall
6. AUC score

Of course, these results must be obtained on the test set.