# <font color='red'>Classification (part 1)</font>

The task is to teach a machine to recognize handwritten digits (MNIST).

# **Practice C1**: Set-up

In [1]:
#numpy
import numpy as np
#sklearn
import sklearn
# matplotlib
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

# **Practice C2**: Import the data and briefly inspect it

In [None]:
from sklearn.datasets import fetch_openml
# as_frame=False returns NumPy arrays, so that X[0] and .reshape() below work
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
mnist.keys()

In [None]:
X, y = mnist["data"], mnist["target"]

Familiarize yourself with the content of X and y.

In [None]:
X.shape

In [None]:
y.shape

In [None]:
28*28

In [None]:
X[0]

In [None]:
some_digit = X[0]
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image)

In [None]:
some_digit = X[0]
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image, cmap = mpl.cm.binary, interpolation="nearest")
plt.axis("off")
plt.show()


As expected, the label for this is:

In [None]:
y[0]

Which is a string:

In [None]:
type(y[0])

We definitely want to cast all these labels to integers:

In [None]:
y = y.astype(np.uint8)
y[0]

In [None]:
type(y[0])

# **Practice C3**: Train-test splitting

In [None]:
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

In [None]:
X_train.shape

In [None]:
X_test.shape

In [None]:
y_train.shape

In [None]:
y_test.shape

# **Practice C4**: Build a binary classifier: a "5-detector"

Create the label vectors (for both the train and test set):

In [None]:
y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)

In [None]:
y_train_5.shape

In [None]:
y_train

In [None]:
y_train_5  

In [None]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=42)
sgd_clf.fit(X_train, y_train_5)

# note: max_iter and tol are set explicitly to the values that became the defaults in later versions of sklearn.

I know that X[0] is a 5, X[1] is a 0, X[2] is a 4:

In [None]:
print("y[0] =", y[0])
print("y[1] =", y[1])
print("y[2] =", y[2])

Let's check if the classifier we built above works for these 3 examples:

In [None]:
sgd_clf.predict([X[0]]) # X[0] is a 5

In [None]:
sgd_clf.predict([X[1]]) # X[1] is a 0, so NOT a 5

In [None]:
sgd_clf.predict([X[2]]) # X[2] is a 4, so NOT a 5

OK, it works!

So, how accurate is this model?

# **Practice C5**: Compute the accuracy

We opted for a train-test split, so the quickest way is to ask for the score on the test set:

In [None]:
sgd_clf.score(X_test, y_test_5)

Actually, k-fold cross-validation would be as good or better, and it would also give you insight into the variance of the score. Let's try it out.

Use the `cross_val_score()` function in sklearn to evaluate your SGDClassifier model using k-fold cross-validation with k=3. This runs k trainings: the data passed in is split into k folds, and each fold in turn is used to make and evaluate predictions with a model trained on the remaining k-1 folds.

In [None]:
y_5 = (y == 5)
y_5.shape

In [None]:
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X, y_5, cv=3, scoring="accuracy")

# this triggers the training of 3 models (cv=3), so it might take a while to complete!
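
What `cross_val_score()` does with `cv=3` can be sketched by hand. The following is a minimal pure-Python illustration, not scikit-learn's implementation: `k_fold_indices` and `cross_val_accuracy` are hypothetical helpers, the labels are made up, and the "model" is just a majority-class predictor so that the example stays self-contained. Note also that scikit-learn stratifies the folds for classifiers, which this sketch does not do.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_accuracy(labels, k):
    """Return one accuracy per fold, training on the other k-1 folds each time."""
    folds = k_fold_indices(len(labels), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        # "Training": remember the most frequent label in the training folds.
        train_labels = [labels[j] for j in train_idx]
        majority = max(set(train_labels), key=train_labels.count)
        # "Evaluation": fraction of held-out labels equal to that prediction.
        correct = sum(labels[j] == majority for j in test_idx)
        scores.append(correct / len(test_idx))
    return scores


# Made-up binary labels (True = "is a 5"), 9 samples.
toy_labels = [False, False, True, False, False, False, True, False, False]
fold_scores = cross_val_accuracy(toy_labels, k=3)
mean_score = sum(fold_scores) / len(fold_scores)
print(fold_scores, mean_score)
```

The spread between the per-fold scores is what gives a rough idea of the variance of the estimate, which a single train-test split cannot provide.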

*Really?! Such a high accuracy at first attempt?! Are you sure this is reliable?*
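
One reason to be suspicious: only about one image in ten is a 5, so a "classifier" that never predicts 5 already scores close to 90% accuracy. The sketch below illustrates this with made-up labels (one positive in every ten), not with MNIST itself; `never_five_predictions` is a hypothetical name.

```python
# Made-up labels: 1000 samples, every tenth one is a "5" (True).
toy_is_five = [(i % 10 == 0) for i in range(1000)]

# A "classifier" that ignores its input and always answers "not a 5".
never_five_predictions = [False] * len(toy_is_five)

n_correct = sum(p == t for p, t in zip(never_five_predictions, toy_is_five))
baseline_accuracy = n_correct / len(toy_is_five)

# It never finds a single 5, so its recall on the positive class is zero.
n_fives = sum(toy_is_five)
n_fives_found = sum(p and t for p, t in zip(never_five_predictions, toy_is_five))
baseline_recall = n_fives_found / n_fives

print(f"accuracy = {baseline_accuracy:.2f}, recall = {baseline_recall:.2f}")
```

Accuracy alone is therefore a weak metric when the classes are this imbalanced.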

# **Practice C6**: Confusion matrix: looking for a better performance metric

In [None]:
from sklearn.model_selection import cross_val_predict

y_pred = cross_val_predict(sgd_clf, X, y_5, cv=3)

# this triggers the training of 3 models (cv=3), so it might take a while to complete!

In [None]:
type(y_pred)

In [None]:
y_pred.shape

In [None]:
y_pred

In [None]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_5, y_pred)
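
For a binary problem, scikit-learn's `confusion_matrix` puts the actual classes on the rows and the predicted classes on the columns, so the layout is `[[TN, FP], [FN, TP]]`. A minimal pure-Python sketch with made-up labels reproduces that layout by counting; the helper `binary_confusion_counts` is hypothetical and not part of scikit-learn.

```python
def binary_confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for boolean labels and predictions."""
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    return [[tn, fp], [fn, tp]]


# Made-up example: 8 samples, 3 of them are actual 5s.
toy_true = [True, True, True, False, False, False, False, False]
toy_pred = [True, True, False, True, False, False, False, False]

toy_matrix = binary_confusion_counts(toy_true, toy_pred)
print(toy_matrix)
```

A perfect classifier would have non-zero entries only on the main diagonal.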

Other possible metrics are **precision** and **recall**, and often also the **F1** score:

In [None]:
from sklearn.metrics import precision_score
precision_score(y_5, y_pred)

In [None]:
from sklearn.metrics import recall_score
recall_score(y_5, y_pred)

In [None]:
from sklearn.metrics import f1_score
f1_score(y_5, y_pred)
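
These three scores follow directly from the confusion-matrix counts: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 is their harmonic mean. The sketch below uses made-up counts for illustration, not the results of this notebook.

```python
# Made-up confusion-matrix counts for illustration only.
tp, fp, fn = 30, 10, 20

precision = tp / (tp + fp)  # of everything flagged as a 5, the share that really is one
recall = tp / (tp + fn)     # of all real 5s, the share that was found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, f1)
```

Because the harmonic mean is dominated by the smaller value, F1 is high only when precision and recall are both high.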

In [None]:
# create an array of decision scores (not probabilities): the higher the score, the more confident the "5" prediction
y_scores = cross_val_predict(sgd_clf, X, y_5, cv=3,
                             method="decision_function")

In [None]:
y_scores

In [None]:
y_5
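
The values in `y_scores` become predictions only once a threshold is applied; `SGDClassifier.predict` uses a threshold of 0 on the decision function. Raising the threshold can only lower recall (or leave it unchanged) and generally improves precision. The sketch below shows this on made-up scores; `predict_with_threshold` and `precision_recall` are hypothetical helpers.

```python
def predict_with_threshold(scores, threshold):
    """Label a sample positive when its score exceeds the threshold."""
    return [s > threshold for s in scores]


def precision_recall(y_true, y_pred):
    """Return (precision, recall) for boolean labels and predictions."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    return tp / (tp + fp), tp / (tp + fn)


# Made-up decision scores and true labels (True = "is a 5").
toy_scores = [-3.0, -1.0, -0.5, 0.5, 1.0, 2.0, 4.0]
toy_true = [False, False, True, False, True, True, True]

low = precision_recall(toy_true, predict_with_threshold(toy_scores, 0.0))
high = precision_recall(toy_true, predict_with_threshold(toy_scores, 1.5))
print("threshold 0.0:", low)
print("threshold 1.5:", high)
```

Each threshold gives one (false positive rate, true positive rate) pair, and sweeping the threshold over all scores traces out the ROC curve.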

In [None]:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_5, y_scores)

In [None]:
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--') # dashed diagonal
    plt.axis([0, 1, 0, 1])                                    
    plt.xlabel('False Positive Rate (Fall-Out)', fontsize=16) 
    plt.ylabel('True Positive Rate (Recall)', fontsize=16)    
    plt.grid(True)                                            
    
plt.figure(figsize=(8, 6))                         
plot_roc_curve(fpr, tpr)
plt.show()

In [None]:
from sklearn.metrics import roc_auc_score
roc_auc_score(y_5, y_scores)

The AUC score should come out at roughly 96-97%, which you can check against others' results. It is a good result!
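
The AUC has a direct interpretation: it is the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one, with ties counting one half. The pure-Python sketch below computes it that way on made-up scores. `pairwise_auc` is a hypothetical helper, and it is quadratic in the number of samples, so it is meant for illustration rather than for the full MNIST set.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, t in zip(scores, y_true) if t]
    neg = [s for s, t in zip(scores, y_true) if not t]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correctly ranked pair
    return wins / (len(pos) * len(neg))


# Made-up true labels (True = "is a 5") and decision scores.
toy_true = [False, False, True, False, True, True, True]
toy_scores = [-3.0, -1.0, -0.5, 0.5, 1.0, 2.0, 4.0]

toy_auc = pairwise_auc(toy_true, toy_scores)
print(toy_auc)
```

A score of 0.5 corresponds to the dashed diagonal in the ROC plot (random ranking), and 1.0 to a perfect ranking.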