# <font color='red'>Project MNIST - Phase 1: understanding metrics</font>

# <font color='blue'>A. Set-up, import the data, explore the data</font>


Import modules.

In [0]:
#numpy
import numpy as np

#sklearn
import sklearn

# matplotlib
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

Import the data.

In [0]:
#NOTE: this might take a while..

from sklearn.datasets import fetch_openml
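# as_frame=False returns NumPy arrays; recent sklearn versions would otherwise
# return a pandas DataFrame, and X[0] below would not work
mnist = fetch_openml('mnist_784', version=1, as_frame=False)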
mnist.keys()

Prepare feature matrix `X` and label vector `y`. 

In [0]:
X, y = mnist["data"], mnist["target"]

Explore them.

In [0]:
X.shape

In [0]:
y.shape

In [0]:
28*28

In [0]:
X[0]

In [0]:
# this cell is just to display some digits, in one way.
# Do not pay too much attention to how this is done.

#plot the first image in the dataset
some_digit = X[0]
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image)

# try changing i in X[i] to display other numbers

In [0]:
# this cell is just to display some digits, in a different way.
# Do not pay too much attention to how this is done.

some_digit = X[0]
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image, cmap = mpl.cm.binary, interpolation="nearest")
plt.axis("off")
plt.show()

# try changing i in X[i] to display other numbers

As expected, the label for this is:

In [0]:
y[0]

Which is a string:

In [0]:
type(y[0])

The classifiers below are easier to use with numeric labels, so cast all of them to integers:

In [0]:
y = y.astype(np.uint8)
y[0]

In [0]:
type(y[0])

# <font color='blue'>B. Train-test splitting</font>

Apply a static train-test split: the first 60000 images for training, the remaining ones for testing.

In [0]:
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

In [0]:
X_train.shape

In [0]:
X_test.shape

In [0]:
y_train.shape

In [0]:
y_test.shape

( *Note: you may want to do much better than this - see later..* )
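One possible way to do better is a shuffled, stratified split, which keeps the class proportions equal in the train and test sets. The cell below is only a minimal sketch: `X_demo` and `y_demo` are hypothetical synthetic stand-ins (not the MNIST arrays), used so that the cell runs in a second.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for (X, y): 1000 samples, 10 classes, 100 samples each.
rng = np.random.RandomState(42)
X_demo = rng.rand(1000, 20)
y_demo = np.repeat(np.arange(10), 100)

# Shuffle and split, keeping the class proportions equal in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print(X_tr.shape, X_te.shape)
print(np.bincount(y_te))  # number of test samples per class
```

On the real data you would pass `X` and `y` instead of the demo arrays.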

# <font color='blue'>C. Build a binary classifier: a "5-detector"</font>

Create the label vectors (for both the train and test set):

In [0]:
y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)

In [0]:
y_train_5.shape

In [0]:
y_train

In [0]:
y_train_5   #this has become a label vector for a binary classifier (a 5-detector)

Pick a Stochastic Gradient Descent classifier, a decent choice if you need to handle very large datasets efficiently.

In [0]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=42)
sgd_clf.fit(X_train, y_train_5)

# max_iter and tol are set explicitly to the default values of newer sklearn versions, to avoid warnings.
# No need to pay too much attention to this.

I know that X[0] is a 5, X[1] is a 0, X[2] is a 4:

In [0]:
print("y[0] =", y[0])
print("y[1] =", y[1])
print("y[2] =", y[2])

Let's check if the classifier we built above works for these 3 examples:

In [0]:
sgd_clf.predict([X[0]]) # X[0] is a 5 --> TRUE

In [0]:
sgd_clf.predict([X[1]]) # X[1] is a 0, so NOT a 5 --> FALSE

In [0]:
sgd_clf.predict([X[2]]) # X[2] is a 4, so NOT a 5 --> FALSE

OK, it works!

So, how accurate is this model?

# <font color='blue'>D. Compute the accuracy of the model</font>

We opted for a train-test split, so the quickest way is to ask for the score on the test set directly:

In [0]:
sgd_clf.score(X_test, y_test_5)

( *Actually, k-fold cross-validation, as discussed in the last lecture, would be similar or better - and would also give you insight into the variance of the score. Let's try it out.* )

Use the `cross_val_score()` function in sklearn to evaluate your SGDClassifier model with k-fold cross-validation, with k=3. This means k trainings: the data is split into k folds and, for each fold in turn, a model is trained on the remaining k-1 folds and its predictions are evaluated on the held-out fold.

In [0]:
y_5 = (y == 5)
y_5.shape

In [0]:
%%time
from sklearn.model_selection import cross_val_score
print(cross_val_score(sgd_clf, X, y_5, cv=3, scoring="accuracy"))

# this triggers the training of cv=3 models, so it might take a while to complete!
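To see what k-fold cross-validation does under the hood, the loop can also be written by hand. This is just a sketch on a small synthetic dataset (`X_demo`, `y_demo` are hypothetical stand-ins created with `make_classification`, not MNIST), assuming stratified folds without shuffling, which is what `cross_val_score()` uses for a classifier when `cv` is an integer.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical small dataset, so that the cell runs quickly.
X_demo, y_demo = make_classification(n_samples=600, n_features=20, random_state=42)
clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=42)

skf = StratifiedKFold(n_splits=3)
fold_scores = []
for train_idx, test_idx in skf.split(X_demo, y_demo):
    fold_clf = clone(clf)                                  # fresh, untrained copy
    fold_clf.fit(X_demo[train_idx], y_demo[train_idx])     # train on k-1 folds
    y_fold_pred = fold_clf.predict(X_demo[test_idx])       # predict on held-out fold
    fold_scores.append(np.mean(y_fold_pred == y_demo[test_idx]))  # accuracy

# Same evaluation done by sklearn in one call, for comparison.
reference = cross_val_score(clf, X_demo, y_demo, cv=3, scoring="accuracy")
print(fold_scores)
print(reference)
```

The two printed sets of accuracies should agree, since both use the same folds and the same `random_state`.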

If you arrived here, hold on a second: think about what you did, and ask yourself whether the accuracy you got here is really as solid as it looks, before moving on.
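A useful sanity check is to compare against a trivial baseline that always predicts the most frequent class. The sketch below uses hypothetical synthetic labels (`y_demo`) in which 10% of the samples are positive, an assumption chosen to mimic a "one digit out of ten" detector; it is not the MNIST data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical imbalanced problem: 1000 samples, 10% positives.
# The features are irrelevant for this baseline, so they are all zeros.
X_demo = np.zeros((1000, 1))
y_demo = np.zeros(1000, dtype=bool)
y_demo[::10] = True

# A "classifier" that ignores the input and always answers "not positive".
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X_demo, y_demo, cv=3, scoring="accuracy")
print(baseline_scores)
```

If a model that never detects anything already reaches about 90% accuracy, a high accuracy alone says little on an imbalanced problem.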

# <font color='blue'>E. Find a better performance metric for the model</font>

Use `cross_val_predict()` and `confusion_matrix()`:
* `cross_val_predict()` is similar to `cross_val_score()`: it performs k-fold CV, but instead of the evaluation scores it returns the predictions made on each held-out fold
* then, give the target classes (`y_5`) and the predicted classes (`y_pred`) to `confusion_matrix()`

In [0]:
from sklearn.model_selection import cross_val_predict

y_pred = cross_val_predict(sgd_clf, X, y_5, cv=3)

# this triggers the training of cv=3 models, so it might take a while to complete!

In [0]:
type(y_pred)

In [0]:
y_pred.shape

In [0]:
y_pred

In [0]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_5, y_pred)

If you arrived here, hold on a second, and wait for an explanation in class!

In [0]:
!wget http://bonacor.web.cern.ch/bonacor/SC_AA1920/images/ConfusionMatrix.png
from IPython.display import Image
Image(filename='/content/ConfusionMatrix.png')
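To read the output of `confusion_matrix()`, it can help to try it on a handful of labels you can count by hand. The arrays below (`y_true_demo`, `y_pred_demo`) are made-up values for illustration only. In sklearn, rows are the true classes and columns the predicted classes, so for a binary problem the layout is `[[TN, FP], [FN, TP]]`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up example: 3 actual positives, 5 actual negatives.
y_true_demo = np.array([True, True, True, False, False, False, False, False])
y_pred_demo = np.array([True, True, False, True, False, False, False, False])

cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# Flatten row by row: true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = cm.ravel()
print("TN =", tn, " FP =", fp, " FN =", fn, " TP =", tp)
```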

Other possible metrics are **precision** and **recall**, and often also the **F1** score:

In [0]:
from sklearn.metrics import precision_score
precision_score(y_5, y_pred)

In [0]:
from sklearn.metrics import recall_score
recall_score(y_5, y_pred)

In [0]:
from sklearn.metrics import f1_score
f1_score(y_5, y_pred)
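These three scores can also be computed by hand from the confusion matrix counts: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 is the harmonic mean of the two. The sketch below checks this on made-up labels (`y_true_demo`, `y_pred_demo` are illustrative values, not MNIST predictions).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up example: 3 actual positives, 5 actual negatives.
y_true_demo = np.array([True, True, True, False, False, False, False, False])
y_pred_demo = np.array([True, True, False, True, True, False, False, False])

tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)   # fraction of predicted positives that are correct
recall = tp / (tp + fn)      # fraction of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(precision, precision_score(y_true_demo, y_pred_demo))
print(recall, recall_score(y_true_demo, y_pred_demo))
print(f1, f1_score(y_true_demo, y_pred_demo))
```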

In [0]:
# get the decision score of each instance: a value proportional to the signed
# distance from the decision boundary (not a class probability)
y_scores = cross_val_predict(sgd_clf, X, y_5, cv=3,
                             method="decision_function")

In [0]:
y_scores

In [0]:
y_5
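Decision scores become class predictions once you pick a threshold: a linear classifier such as `SGDClassifier` predicts the positive class when the score is above 0, and you are free to use another threshold. The sketch below uses made-up scores and labels (`scores_demo`, `y_true_demo`) to show how raising the threshold trades recall for precision.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up decision scores and true labels, for illustration only.
scores_demo = np.array([-3.0, -1.0, -0.5, 0.5, 1.0, 2.0, 4.0])
y_true_demo = np.array([False, False, True, False, True, True, True])

results = {}
for threshold in (0.0, 1.5):
    y_pred_at_t = scores_demo > threshold   # positive if the score exceeds the threshold
    results[threshold] = (
        precision_score(y_true_demo, y_pred_at_t),
        recall_score(y_true_demo, y_pred_at_t),
    )
    print("threshold =", threshold, " precision, recall =", results[threshold])
```

In this toy example the higher threshold removes the false positive but also misses one more true positive.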

In [0]:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_5, y_scores)

In [0]:
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--') # dashed diagonal
    plt.axis([0, 1, 0, 1])                                    
    plt.xlabel('False Positive Rate (Fall-Out)', fontsize=16) 
    plt.ylabel('True Positive Rate (Recall)', fontsize=16)    
    plt.grid(True)                                            
    
plt.figure(figsize=(8, 6))                         
plot_roc_curve(fpr, tpr)
plt.show()

In [0]:
from sklearn.metrics import roc_auc_score
roc_auc_score(y_5, y_scores)
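The AUC is literally the area under the ROC curve plotted above, so it can also be obtained by integrating the `(fpr, tpr)` points. A minimal check on made-up scores and labels (`scores_demo`, `y_true_demo` are illustrative values only), using `sklearn.metrics.auc()` for the integration:

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Made-up decision scores and true labels, for illustration only.
y_true_demo = np.array([0, 0, 1, 0, 1, 1, 1])
scores_demo = np.array([-3.0, -1.0, -0.5, 0.5, 1.0, 2.0, 4.0])

fpr_demo, tpr_demo, _ = roc_curve(y_true_demo, scores_demo)

area = auc(fpr_demo, tpr_demo)   # trapezoidal integration of the ROC points
print(area, roc_auc_score(y_true_demo, scores_demo))
```

A perfect classifier has an AUC of 1, while the dashed diagonal of a purely random one corresponds to 0.5.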

When comparing your work with others', you can state that your AUC score is roughly 96%. It is a good result!

# Want to do more?

You could achieve even better results via an ensemble method. Try e.g. `RandomForestClassifier`.
* documentation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
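As a possible starting point, here is a sketch of how a random forest could be plugged into the same ROC/AUC workflow. A `RandomForestClassifier` has no `decision_function()`, so the sketch uses `predict_proba()` and takes the probability of the positive class as the score. The data (`X_demo`, `y_demo`) is a hypothetical synthetic, imbalanced dataset from `make_classification`, used only to keep the cell fast; on the real data you would pass `X` and `y_5`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

# Hypothetical imbalanced binary problem (about 10% positives).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, weights=[0.9, 0.1], random_state=42
)

forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)

# One row per instance, one column per class: [P(negative), P(positive)].
y_probas = cross_val_predict(forest_clf, X_demo, y_demo, cv=3, method="predict_proba")
y_scores_forest = y_probas[:, 1]   # score = probability of the positive class

forest_auc = roc_auc_score(y_demo, y_scores_forest)
print(forest_auc)
```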