**Hello World Example**

_This notebook is heavily based on Chapter 3 of Aurélien Géron: Hands-On Machine Learning with Scikit-Learn & TensorFlow, 1st ed._
_You will need to install jupyter (otherwise you would not see this), matplotlib and scikit-learn for this notebook._

# Setup

First, let's import a few common modules, make sure Matplotlib plots figures inline, and prepare a function to save the figures:

In [1]:
# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

# Where to save the figures
PROJECT_ROOT_DIR = "."
NB_ID = "sevens"

# create the directory if it does not exist
os.makedirs(os.path.join(PROJECT_ROOT_DIR, "images", NB_ID), exist_ok = True)
        
def save_fig(fig_id, tight_layout=True):
    path = os.path.join(PROJECT_ROOT_DIR, "images", NB_ID, fig_id + ".png")
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format='png', dpi=300)

# MNIST

MNIST is the "hello world" example of machine learning. scikit-learn can load the data directly from https://www.openml.org.

In [2]:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784',version=1,cache=True,return_X_y=False)

Find out the format of the data.

In [3]:
print(type(mnist))
print(mnist)

<class 'sklearn.utils.Bunch'>
{'data':        pixel1  pixel2  pixel3  pixel4  pixel5  pixel6  pixel7  pixel8  pixel9  \
0         0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
1         0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
2         0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
3         0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
4         0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
...       ...     ...     ...     ...     ...     ...     ...     ...     ...   
69995     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
69996     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
69997     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
69998     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
69999     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0 

This appears to be scikit-learn's dictionary-like `Bunch` object, with a data table and a target (label) vector. Let's look at the dataset's description.

In [4]:
print(mnist.DESCR)

**Author**: Yann LeCun, Corinna Cortes, Christopher J.C. Burges  
**Source**: [MNIST Website](http://yann.lecun.com/exdb/mnist/) - Date unknown  
**Please cite**:  

The MNIST database of handwritten digits with 784 features, raw data available at: http://yann.lecun.com/exdb/mnist/. It can be split in a training set of the first 60,000 examples, and a test set of 10,000 examples  

It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting. The original black and white (bilevel) images from NIST were size normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. the images were centered in a 28x28 image b

In [5]:
print(type(mnist.data))
print(type(mnist.target))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


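These are pandas objects (a DataFrame and a Series), not NumPy arrays. The rest of the notebook uses NumPy-style indexing such as `X[i, :]`, so let's convert them to NumPy arrays and look at their respective sizes.

In [6]:
# Use the notion of y=f(X)
# Convert to NumPy arrays so that positional indexing like X[i, :] works
X, y = mnist["data"].to_numpy(), mnist["target"].to_numpy()
print('Data size: {0} x {1} and label size {2}'.format(X.shape[0],X.shape[1],y.shape[0]))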

Data size: 70000 x 784 and label size 70000
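A pandas DataFrame does not accept NumPy-style positional indexing such as `X[i, :]`; square brackets on a DataFrame look up columns. The following is a minimal sketch on a small made-up table (the column names and values are illustrative, not taken from MNIST) showing two ways to get a row by position:

```python
import numpy as np
import pandas as pd

# Small stand-in for the MNIST table: 3 rows, 4 "pixel" columns (made-up values)
df = pd.DataFrame(np.arange(12.0).reshape(3, 4),
                  columns=["pixel1", "pixel2", "pixel3", "pixel4"])

# NumPy-style indexing is treated as a column lookup and fails;
# the exact exception type depends on the pandas version
try:
    df[1, :]
    numpy_style_failed = False
except Exception:
    numpy_style_failed = True

row_by_position = df.iloc[1, :].to_numpy()   # positional access on the DataFrame
arr = df.to_numpy()                          # or convert once to a NumPy array
row_from_array = arr[1, :]                   # now NumPy indexing works

print(numpy_style_failed)
print(row_by_position, row_from_array)
```

Converting once with `to_numpy()` keeps the later cells simple, at the cost of losing the column names.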


There are 70,000 example images with 784 pixels each. The images are square:

In [7]:
import math
print('The images are of size: {0} x {0}'.format(math.sqrt(X.shape[1])))

The images are of size: 28.0 x 28.0
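Since 784 = 28 × 28, each row of the data can be reshaped into a square image. As a small sketch (using a made-up vector of consecutive numbers instead of real pixel values), this shows how a position in the flat vector maps to a row and column of the image:

```python
import math
import numpy as np

n_pixels = 784
side = math.isqrt(n_pixels)          # integer square root
assert side * side == n_pixels       # only holds for square images

flat = np.arange(n_pixels)           # stand-in for one row of X
image = flat.reshape(side, side)     # row-major: one image row per 28 values

# Pixel (row r, column c) of the image is element r * side + c of the flat vector
r, c = 3, 5
print(image.shape, image[r, c] == flat[r * side + c])
```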


In [8]:
# convert the y values from strings to integers
y = y.astype(np.int32)

We can use Matplotlib's pyplot to plot an image in grayscale. Let's find the images labeled 7.

In [12]:
index7 = np.where(y==7)[0]
print('There are {0} images that are labeled 7'.format(index7.shape[0]))

# randomly pick an image
pick = index7[np.random.choice(index7.shape[0])]
# np.asarray is a no-op for a NumPy array and converts a pandas DataFrame,
# which does not accept X[pick,:]
digit_7 = np.asarray(X)[pick,:]
# reshape into a matrix
digit_7_image = digit_7.reshape(28, 28)
plt.imshow(digit_7_image, cmap = matplotlib.cm.binary,
           interpolation="nearest")
# plt.axis("on")
# print to file
save_fig("digit_7_plot")
plt.show()

There are 7293 images that are labeled 7
Saving figure digit_7_plot

Functions by Aurélien Géron to plot sample images in a matrix. Note that the indices are hard-coded, meaning that he is relying on the ordering of the dataset.

In [None]:
def plot_digit(data):
    image = data.reshape(28, 28)
    plt.imshow(image, cmap = matplotlib.cm.binary,
               interpolation="nearest")
    plt.axis("off")

In [None]:
# EXTRA
def plot_digits(instances, images_per_row=10, **options):
    size = 28
    images_per_row = min(len(instances), images_per_row)
    images = [instance.reshape(size,size) for instance in instances]
    n_rows = (len(instances) - 1) // images_per_row + 1
    row_images = []
    n_empty = n_rows * images_per_row - len(instances)
    images.append(np.zeros((size, size * n_empty)))
    for row in range(n_rows):
        rimages = images[row * images_per_row : (row + 1) * images_per_row]
        row_images.append(np.concatenate(rimages, axis=1))
    image = np.concatenate(row_images, axis=0)
    plt.imshow(image, cmap = matplotlib.cm.binary, **options)
    plt.axis("off")
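The tiling logic in `plot_digits` can be looked at on its own, without any plotting. The sketch below uses the same steps in a hypothetical helper, `tile_images`, and applies it to tiny made-up 2×2 "images" so the resulting shape is easy to check:

```python
import numpy as np

def tile_images(instances, images_per_row=10, size=28):
    """Arrange flattened size*size images into one 2-D grid array."""
    images_per_row = min(len(instances), images_per_row)
    images = [inst.reshape(size, size) for inst in instances]
    n_rows = (len(instances) - 1) // images_per_row + 1   # ceiling division
    n_empty = n_rows * images_per_row - len(instances)
    # One blank block pads the last row to the full width
    images.append(np.zeros((size, size * n_empty)))
    rows = [np.concatenate(images[r * images_per_row:(r + 1) * images_per_row], axis=1)
            for r in range(n_rows)]
    return np.concatenate(rows, axis=0)

# Seven made-up 2x2 images, five per row: two rows, the second padded with zeros
demo = np.arange(7 * 4, dtype=float).reshape(7, 4)
grid = tile_images(demo, images_per_row=5, size=2)
print(grid.shape)
```

With seven images and five per row, the grid has two image rows, and the last three slots of the second row are filled with zeros.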

In [None]:
# randomly pick 10 images of each number
num_per_row = 10
examples = np.zeros((10*num_per_row,X.shape[1]))

for i in range(10):
    index = np.where(y==i)[0]
    # print('There are {0} images that are labeled {1}'.format(index.shape[0],i))
    pick = index[np.random.choice(index.shape[0],num_per_row)]
    examples[i*num_per_row:(i+1)*num_per_row,:] = X[pick,:]

plt.figure(figsize=(9,9))
plot_digits(examples, images_per_row = num_per_row)
save_fig("more_digits_plot")
plt.show()

# Test and Training Data

Split the data into training and test sets after reshuffling. We simply take the first 60,000 shuffled samples for training and the remaining 10,000 for testing. (The data seems to be shuffled already, but let's make sure.)

In [None]:
shuffle_index = np.random.permutation(70000)
# reshuffle the data and use 60000 samples as training and 10000 as test
X_train, X_test = X[shuffle_index[:60000],:], X[shuffle_index[60000:],:]
y_train, y_test = y[shuffle_index[:60000]], y[shuffle_index[60000:]]
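The important property of this split is that the same permutation indexes both the data and the labels, so rows and labels stay aligned and no sample lands in both sets. A minimal sketch on made-up data (10 samples instead of 70,000; all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_train = 10, 8                        # stand-ins for 70000 and 60000
X_demo = np.arange(n_samples * 2).reshape(n_samples, 2)
y_demo = X_demo[:, 0] // 2                        # label i belongs to row i

idx = rng.permutation(n_samples)
train_idx, test_idx = idx[:n_train], idx[n_train:]
X_tr, y_tr = X_demo[train_idx], y_demo[train_idx]
X_te, y_te = X_demo[test_idx], y_demo[test_idx]

n_overlap = len(set(train_idx.tolist()) & set(test_idx.tolist()))
still_aligned = np.array_equal(X_tr[:, 0] // 2, y_tr)
print(n_overlap, still_aligned)
```

scikit-learn's `train_test_split` offers the same functionality in one call; the manual version is shown here because it matches the cell above.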


# Binary classifier

Set up a binary classifier for 7 versus not 7. We create new labels that are True when the image is a 7 and False otherwise.

In [None]:
y_train_7 = (y_train == 7)
y_test_7 = (y_test == 7)
print('We have a training set with {0} images of 7, and a test set with {1}'.format(sum(y_train_7),sum(y_test_7)))

In [None]:
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression

bin_clf = LogisticRegression(max_iter=10, tol=0.1, solver='liblinear')
# sgd_clf = SGDClassifier(max_iter=5, random_state=42)
bin_clf.fit(X_train, y_train_7)

Let's pick a 7 out of the training set and see what our classifier predicts.

In [None]:
index7 = np.where(y_train==7)[0]
# randomly pick an image again
pick = index7[np.random.choice(index7.shape[0])]
digit_7 = X_train[pick,:]
# reshape into a matrix
digit_7_image = digit_7.reshape(28, 28)
plt.imshow(digit_7_image, cmap = matplotlib.cm.binary,
           interpolation="nearest")
# plt.axis("on")
plt.show()

print('Classifier prediction is {0} that the image is a 7'.format(bin_clf.predict([digit_7])))

Let's calculate the accuracy.

In [None]:
from sklearn.metrics import accuracy_score

y_pred = bin_clf.predict(X_train)
print('Classifier accuracy on the training set is {0} '.format(accuracy_score(y_train_7, y_pred)))

In [None]:
X_train.shape, y_train_7.shape, y_pred.shape, sum(y_pred==y_train_7)/60000


In [None]:
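from sklearn.base import BaseEstimator
class Never7Classifier(BaseEstimator):
    def fit(self, X, y=None):
        # nothing to learn; return self, as scikit-learn estimators do
        return self
    def predict(self, X):
        # always predict "not 7", one boolean per sample
        return np.zeros(len(X), dtype=bool)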

Our data set is not balanced: only about 10% of the images are 7s. A simple classifier that never predicts a 7 therefore still reaches roughly 90% accuracy.

In [None]:
never_7_clf = Never7Classifier()
y_pred_never = never_7_clf.predict(X_train)
print('Classifier accuracy on the training set is {0} '.format(accuracy_score(y_train_7, y_pred_never)))
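The accuracy of such a constant classifier is simply the share of the majority class. A small sketch with made-up labels (exactly 10% positives, chosen for round numbers) makes this explicit:

```python
import numpy as np

y_true = np.zeros(1000, dtype=bool)
y_true[:100] = True                      # 10% positives (made-up data)
y_never = np.zeros_like(y_true)          # always predict "not positive"

accuracy = np.mean(y_never == y_true)    # fraction of correct predictions
positive_rate = np.mean(y_true)
print(accuracy, 1 - positive_rate)
```

This is why accuracy alone is a weak measure on imbalanced data.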

Let's calculate precision, recall and the F1 score. It is easy to crunch the numbers ourselves, but it's built in.

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

print('Classifier scores on training set: ')
print('Precision is {0} '.format(precision_score(y_train_7, y_pred)))
print('Recall is {0} '.format(recall_score(y_train_7, y_pred)))
print('F1 is {0} '.format(f1_score(y_train_7, y_pred)))

print( 'And the confusion matrix: \n {0}'.format(confusion_matrix(y_train_7, y_pred) ))
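All three scores follow from the four counts of the binary confusion matrix. The sketch below computes them with plain NumPy on a handful of made-up labels, using the same layout as scikit-learn's `confusion_matrix` (rows are true classes, columns are predicted classes):

```python
import numpy as np

# Made-up labels and predictions for ten samples
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0], dtype=bool)
y_hat = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0], dtype=bool)

tp = np.sum(y_true & y_hat)      # true positives
fp = np.sum(~y_true & y_hat)     # false positives
fn = np.sum(y_true & ~y_hat)     # false negatives
tn = np.sum(~y_true & ~y_hat)    # true negatives

conf = np.array([[tn, fp],
                 [fn, tp]])      # rows: true class, columns: predicted class

precision = tp / (tp + fp)       # how many predicted positives are correct
recall = tp / (tp + fn)          # how many actual positives are found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

print(conf)
print(precision, recall, f1)
```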

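We can check these scores by hand from the confusion matrix counts. The numbers below are hard-coded from one run (53386 true negatives, 360 false positives, 637 false negatives, 5617 true positives) and will differ if the split or the classifier changes.

In [None]:
# accuracy, precision, recall, F1
(53386+5617)/60000, 5617/(5617+360), 5617/(5617+637), 2*(5617/(5617+360)*5617/(5617+637))/ (5617/(5617+360) + 5617/(5617+637))

Just to be clear, let's look at the confusion matrix for the never-7 classifier.

In [None]:
print(confusion_matrix(y_train_7, y_pred_never))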

In [None]:
from sklearn.model_selection import cross_val_score
cross_val_score(bin_clf, X_train, y_train_7, cv=5, scoring="accuracy")

This is how Aurélien Géron suggests we could implement cross-validation with stratified sampling ourselves. Note that it produces the same scores as `cross_val_score` above.

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=5)

# We could shuffle randomly, but then the results would not be exactly the same
# skfolds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for train_index, test_index in skfolds.split(X_train, y_train_7):
    clone_clf = clone(bin_clf)
    X_train_folds = X_train[train_index]
    y_train_folds = (y_train_7[train_index])
    X_test_fold = X_train[test_index]
    y_test_fold = (y_train_7[test_index])

    clone_clf.fit(X_train_folds, y_train_folds)
    # use a separate name so the training-set predictions in y_pred are not overwritten
    y_fold_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_fold_pred == y_test_fold)
    print(n_correct / len(y_fold_pred))
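Stratified sampling means that every fold keeps roughly the same class proportions as the full set. A small sketch on made-up labels (10 positives out of 100, so the numbers divide evenly) checks this for `StratifiedKFold`:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels: 10% positives, like the share of 7s in the data
y_demo = np.array([True] * 10 + [False] * 90)
X_demo = np.zeros((len(y_demo), 1))      # features are irrelevant for the split

fold_ratios = []
for _, test_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    # share of positives in this test fold
    fold_ratios.append(float(y_demo[test_idx].mean()))

print(fold_ratios)
```

A plain `KFold` without shuffling would put all ten positives into the first fold of this example, which is what stratification avoids.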

In [None]:
from sklearn.model_selection import cross_val_predict

y_scores = cross_val_predict(bin_clf, X_train, y_train_7, cv=5,
                             method="decision_function")
n, bins, patches = plt.hist(y_scores, 100)

plt.xlabel('Decision score of each training sample')
plt.ylabel('Count')


In [None]:
from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_train_7, y_scores)
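Each point on the precision/recall curve corresponds to one threshold applied to the decision scores. As a sketch with made-up scores and labels, raising the threshold here trades recall for precision:

```python
import numpy as np

# Made-up decision scores and true labels for seven samples
scores = np.array([-2.0, -1.0, -0.5, 0.2, 0.8, 1.5, 2.5])
labels = np.array([False, False, True, False, True, True, True])

def precision_recall_at(threshold):
    """Precision and recall when predicting positive for scores above threshold."""
    predicted = scores > threshold
    tp = np.sum(predicted & labels)
    fp = np.sum(predicted & ~labels)
    fn = np.sum(~predicted & labels)
    return float(tp / (tp + fp)), float(tp / (tp + fn))

print(precision_recall_at(0.0))   # (0.75, 0.75)
print(precision_recall_at(1.0))   # (1.0, 0.5)
```

`precision_recall_curve` repeats this computation for every distinct score value, which is what the plots below show.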

In [None]:
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision", linewidth=2)
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall", linewidth=2)
    plt.xlabel("Threshold", fontsize=16)
    plt.legend(loc="upper left", fontsize=16)
    plt.ylim([0, 1])

plt.figure(figsize=(8, 4))
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
save_fig("precision_recall_vs_threshold_plot")
plt.show()

In [None]:
def plot_precision_vs_recall(precisions, recalls):
    plt.plot(recalls, precisions, "b-", linewidth=2)
    plt.xlabel("Recall", fontsize=16)
    plt.ylabel("Precision", fontsize=16)
    plt.axis([0, 1, 0, 1])

plt.figure(figsize=(8, 6))
plot_precision_vs_recall(precisions, recalls)
save_fig("precision_vs_recall_plot")
plt.show()

# ROC curves

In [None]:
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_train_7, y_scores)

In [None]:
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate', fontsize=16)
    plt.ylabel('True Positive Rate', fontsize=16)

plt.figure(figsize=(8, 6))
plot_roc_curve(fpr, tpr)
save_fig("roc_curve_plot")
plt.show()
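A common single-number summary of a ROC curve is the area under it (AUC). The notebook does not compute it, so the following is only a sketch on four made-up samples. It also shows one way to read the value: the fraction of (positive, negative) pairs in which the positive sample gets the higher score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and scores
y_demo = np.array([0, 0, 1, 1])
s_demo = np.array([0.1, 0.4, 0.35, 0.8])

auc = float(roc_auc_score(y_demo, s_demo))

# Same quantity by hand: compare every positive score with every negative score
pos = s_demo[y_demo == 1]
neg = s_demo[y_demo == 0]
pairwise = float(np.mean(pos[:, None] > neg[None, :]))

print(auc, pairwise)
```

On the notebook's data the equivalent call would be `roc_auc_score(y_train_7, y_scores)`.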

# Multiclass classification

The scikit-learn documentation states: "The classes SGDClassifier and SGDRegressor provide functionality to fit linear models for classification and regression using different (convex) loss functions and different penalties. E.g., with loss="log", SGDClassifier fits a logistic regression model, while with loss="hinge" it fits a linear support vector machine (SVM)." (Since scikit-learn 1.1 the logistic loss is named `loss="log_loss"`.)

In [None]:
from sklearn.linear_model import SGDClassifier

# We could stop by setting the maximum iteration if this takes too long
# sgd_clf = SGDClassifier(max_iter=10, random_state=42)
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train)


In [None]:
print('Classifier prediction is that the image belongs to class {0}'.format(sgd_clf.predict([digit_7])))

In [None]:
# Warning: If you don't set maximum iterations, this will take a long time. 
# Setting maximum iteration to 10 will cause convergence warnings.
from sklearn.multiclass import OneVsOneClassifier
ovo_clf = OneVsOneClassifier(SGDClassifier(max_iter=10, random_state=42))
ovo_clf.fit(X_train, y_train)

In [None]:
print('Classifier prediction is that the image belongs to class {0}'.format(ovo_clf.predict([digit_7])))

In [None]:
print('Number of classifiers needed for one-vs-one 10*9/2={0}'.format(len(ovo_clf.estimators_)))
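The count follows from the one-vs-one scheme training one binary classifier per unordered pair of classes. A short sketch that enumerates the pairs for the ten digits:

```python
from itertools import combinations

classes = range(10)
pairs = list(combinations(classes, 2))   # every unordered pair of digits

print(len(pairs))                        # 45
print(pairs[:3])                         # [(0, 1), (0, 2), (0, 3)]
```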

In [None]:
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train, cv=5)


In [None]:
conf_mx = confusion_matrix(y_train, y_train_pred)
print('Confusion matrix: \n{0}'.format(conf_mx))

In [None]:
def plot_confusion_matrix(matrix):
    """If you prefer color and a colorbar"""
    fig = plt.figure(figsize=(8,8))
    ax = fig.add_subplot(111)
    cax = ax.matshow(matrix)
    fig.colorbar(cax)

In [None]:
# plt.matshow(conf_mx, cmap=plt.cm.gray)
plot_confusion_matrix(conf_mx)
save_fig("confusion_matrix_plot", tight_layout=False)
plt.show()

Make the errors more visible by normalizing each row to ratios and zeroing out the diagonal.

In [None]:
row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums
np.fill_diagonal(norm_conf_mx, 0)
# plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plot_confusion_matrix(norm_conf_mx)
save_fig("confusion_matrix_errors_plot", tight_layout=False)
plt.show()
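The effect of the normalization is easier to follow on a small matrix. The sketch below applies the same two steps to a made-up 3×3 confusion matrix and then looks up the largest remaining entry, which is the most frequent confusion:

```python
import numpy as np

# Made-up confusion matrix: rows are true classes, columns are predictions
conf_demo = np.array([[8, 1, 1],
                      [2, 6, 2],
                      [0, 5, 5]])

row_sums = conf_demo.sum(axis=1, keepdims=True)
norm_demo = conf_demo / row_sums          # each row now sums to 1
np.fill_diagonal(norm_demo, 0)            # drop the correct predictions

# Position of the largest error rate: (true class, predicted class)
worst = np.unravel_index(np.argmax(norm_demo), norm_demo.shape)
print(norm_demo)
print(int(worst[0]), int(worst[1]))
```

Here half of the samples of class 2 are predicted as class 1, which stands out only after the diagonal is removed.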

# Confusion matrix samples
Plot examples of the confusion between 5 and 8.

In [None]:
cl_a, cl_b = 5, 8
X_aa = X_train[(y_train == cl_a) & (y_train_pred == cl_a)]
X_ab = X_train[(y_train == cl_a) & (y_train_pred == cl_b)]
X_ba = X_train[(y_train == cl_b) & (y_train_pred == cl_a)]
X_bb = X_train[(y_train == cl_b) & (y_train_pred == cl_b)]

plt.figure(figsize=(8,8))
plt.subplot(221); plot_digits(X_aa[:25], images_per_row=5)
plt.subplot(222); plot_digits(X_ab[:25], images_per_row=5)
plt.subplot(223); plot_digits(X_ba[:25], images_per_row=5)
plt.subplot(224); plot_digits(X_bb[:25], images_per_row=5)
save_fig("error_analysis_digits_plot")
plt.show()