## Chapter 3 – Classification

_This notebook contains sample code adapted from chapter 3._

### Setup

First, let's make sure this notebook works well in both Python 2 and 3, import a few common modules, ensure Matplotlib plots figures inline, and prepare a function to save the figures:

In [1]:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

from time import time

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "classification"

def save_fig(fig_id, tight_layout=True):
    path = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID, fig_id + ".png")
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format='png', dpi=300)

### Load MNIST Dataset

In [2]:
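# Note: fetch_mldata was deprecated in scikit-learn 0.20 and removed in 0.22,
# so this cell needs an older scikit-learn release to run as written.
from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original')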

In [3]:
X, y = mnist["data"], mnist["target"]
X.shape

(70000, 784)

In [11]:
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

In [12]:
import numpy as np

shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]

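Scale the data. Note that the `fit_transform` call below is commented out, so the models in this notebook are trained on the raw, unscaled pixel values.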

In [13]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
#X_train = scaler.fit_transform(X_train)
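
For reference, standardization subtracts each feature's mean and divides by its standard deviation. The NumPy sketch below illustrates the idea on a made-up 3x3 array (`X_toy` and `standardize` are hypothetical names, not part of the notebook). It mirrors what `StandardScaler` is documented to do, including leaving constant features at zero, but it is not the library's implementation.

```python
import numpy as np

def standardize(X):
    """Column-wise (x - mean) / std, using the population standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)        # ddof=0, i.e. population standard deviation
    std[std == 0.0] = 1.0      # constant columns (e.g. always-black pixels) stay at 0
    return (X - mean) / std

# Made-up data: the third column is constant on purpose.
X_toy = np.array([[0.0, 10.0, 5.0],
                  [2.0, 20.0, 5.0],
                  [4.0, 30.0, 5.0]])
X_toy_scaled = standardize(X_toy)
print(X_toy_scaled.mean(axis=0))   # every column mean is (approximately) 0
```

If scaling is enabled, the scaler fitted on the training set should also be applied to the test set with `scaler.transform(X_test)`, so that both sets go through the same transformation.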

### A Binary classifier: 5 or not 5

Try out several models on a simpler problem: binary classification.

First, set up the training and test labels for the 5-or-not-5 classifier. The input features (X_train and X_test) remain the same.

In [14]:
y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)
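
As a small illustration of what this comparison produces, the sketch below applies the same expression to a made-up label array (`y_toy` is hypothetical). It assumes the labels are numeric, as the `y_train == 5` comparison above implies.

```python
import numpy as np

y_toy = np.array([5.0, 0.0, 4.0, 5.0, 9.0])   # made-up digit labels
y_toy_5 = (y_toy == 5)                         # element-wise comparison gives a boolean array

print(y_toy_5)          # [ True False False  True False]
print(y_toy_5.mean())   # fraction of positive labels: 0.4
```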

### 1. Logistic Regression Classifier 

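Logistic regression is a linear classification model: it passes a weighted sum of the input features through the logistic (sigmoid) function to estimate the probability of the positive class.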

In [8]:
from sklearn.linear_model import LogisticRegression
logit_clf = LogisticRegression(solver = 'lbfgs') # the default solver='liblinear' is very slow

In [9]:
start_time = time()
logit_clf.fit(X_train, y_train_5)
print('Time elapsed: %.2fs' % (time()-start_time))

Time elapsed: 6.00s


Use 3-fold cross-validation to evaluate the model

In [19]:
from sklearn.model_selection import cross_val_score
cross_val_score(logit_clf, X_train, y_train_5, cv=3, scoring="accuracy")

array([0.9742 , 0.97315, 0.97415])
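
Accuracy needs a reference point on a skewed dataset like this one. The confusion matrix reported later in this notebook shows 54,039 + 540 = 54,579 non-5 images among the 60,000 training images, so a classifier that always answers "not 5" already reaches about 91% accuracy. The arithmetic, as a quick sketch:

```python
# Counts taken from the training confusion matrix shown later in this notebook.
n_train = 60000
n_not_5 = 54039 + 540    # true negatives + false positives = all actual non-5s

baseline_accuracy = n_not_5 / float(n_train)   # score of "always predict not 5"
print("%.5f" % baseline_accuracy)              # 0.90965
```

Against that baseline, the roughly 97.4% cross-validated accuracy is a smaller gain than it first appears, which is one reason to also look at the confusion matrix, precision, and recall.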

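Get a prediction for every training instance with cross_val_predict. Each prediction comes from a model that did not see that instance during training, so the test data stays untouched.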

In [20]:
from sklearn.model_selection import cross_val_predict
y_train_pred_logit_clf = cross_val_predict(logit_clf, X_train, y_train_5, cv=3)
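
The idea behind these out-of-fold predictions can be sketched in plain NumPy. This is an illustration only, not scikit-learn's implementation: the helper names and the toy threshold model are made up, and the folds here are simple contiguous blocks, whereas scikit-learn uses stratified folds for classifiers.

```python
import numpy as np

def out_of_fold_predictions(X, y, n_folds, fit, predict):
    """Return one prediction per sample, each made by a model that never saw that sample."""
    n = len(X)
    preds = np.empty(n, dtype=bool)
    for test_idx in np.array_split(np.arange(n), n_folds):
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False                   # hold this fold out
        model = fit(X[train_mask], y[train_mask])      # train on the other folds
        preds[test_idx] = predict(model, X[test_idx])  # predict only the held-out fold
    return preds

# Hypothetical toy model: threshold halfway between the two class means.
def fit_threshold(X, y):
    return (X[y].mean() + X[~y].mean()) / 2.0

def predict_threshold(threshold, X):
    return X[:, 0] > threshold

X_toy = np.array([[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]])
y_toy = np.array([False, False, False, True, True, True])
oof_pred = out_of_fold_predictions(X_toy, y_toy, 3, fit_threshold, predict_threshold)
print(oof_pred)
```

Because every prediction is made on data the model was not trained on, scores computed from these predictions estimate generalization without touching the test set.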

Print out the confusion matrix C. Rows are actual classes and columns are predicted classes, so C[0,0] counts true negatives, C[0,1] false positives, C[1,0] false negatives, and C[1,1] true positives.

In [21]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_train_5, y_train_pred_logit_clf)

array([[54039,   540],
       [ 1030,  4391]])

Get the precision score, TP / (TP + FP):

In [22]:
from sklearn.metrics import precision_score, recall_score

precision_score(y_train_5, y_train_pred_logit_clf)

0.8904887446765362

And the recall score, TP / (TP + FN):

In [23]:
recall_score(y_train_5, y_train_pred_logit_clf)

0.8099981553218963
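
Both scores can be recomputed by hand from the confusion matrix shown above, which is a useful sanity check. The sketch below hard-codes that matrix rather than calling scikit-learn:

```python
import numpy as np

# Cross-validated confusion matrix of the logistic regression model (copied from above).
cm = np.array([[54039,   540],
               [ 1030,  4391]])
tn, fp, fn, tp = cm.ravel()    # row-major order: C[0,0], C[0,1], C[1,0], C[1,1]

precision = tp / float(tp + fp)   # of all predicted 5s, how many really are 5s
recall = tp / float(tp + fn)      # of all real 5s, how many were found
print("precision=%.4f recall=%.4f" % (precision, recall))   # precision=0.8905 recall=0.8100
```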

Now test the model on the test set:

In [24]:
y_pred_logit_clf = logit_clf.predict(X_test)

In [25]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test_5, y_pred_logit_clf)

array([[9034,   74],
       [ 147,  745]])

Check the precision and recall scores of our model on the test data

In [26]:
from sklearn.metrics import precision_score, recall_score

precision_score(y_test_5, y_pred_logit_clf)

0.9096459096459096

And the recall score

In [27]:
recall_score(y_test_5, y_pred_logit_clf)

0.8352017937219731

### 2. Stochastic Gradient Descent Classifier 

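The Stochastic Gradient Descent (SGD) classifier is also a linear classification model. With its default hinge loss, SGDClassifier trains a linear SVM, updating the weights one training instance at a time.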

In [30]:
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(max_iter=10, random_state=42, verbose=2)
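
To make the training loop less of a black box, here is a minimal NumPy sketch of stochastic gradient descent on the L2-regularized hinge loss, which is the default loss and penalty of `SGDClassifier`. It is a simplification under stated assumptions: labels are encoded as -1/+1, the learning rate `eta` is constant (scikit-learn's default schedule decreases it over time), and `sgd_hinge_fit` and the toy data are hypothetical.

```python
import numpy as np

def sgd_hinge_fit(X, y, n_epochs=10, eta=0.1, alpha=0.0001, seed=42):
    """Fit weights w and bias b by SGD on the L2-regularized hinge loss.

    y must hold -1 / +1 labels. Hypothetical helper with a constant learning rate.
    """
    rng = np.random.RandomState(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):     # visit the samples in random order
            margin = y[i] * (np.dot(X[i], w) + b)
            w *= 1.0 - eta * alpha            # gradient step on the L2 penalty
            if margin < 1.0:                  # misclassified or inside the margin
                w += eta * y[i] * X[i]        # gradient step on the hinge loss
                b += eta * y[i]
    return w, b

# Made-up, linearly separable 2-D points.
X_toy = np.array([[-2.0, -1.0], [-1.0, -2.0], [1.0, 2.0], [2.0, 1.0]])
y_toy = np.array([-1, -1, 1, 1])
w, b = sgd_hinge_fit(X_toy, y_toy)
pred = np.where(np.dot(X_toy, w) + b >= 0, 1, -1)
print(pred)   # matches y_toy on this separable toy set
```

Each update looks at a single instance, which is why one epoch over 60,000 images takes only a fraction of a second in the log below.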

In [31]:
start_time = time()
sgd_clf.fit(X_train, y_train_5)
print('Time elapsed: %.2fs' % (time()-start_time))

-- Epoch 1
Norm: 8032.56, NNZs: 628, Bias: 66.772136, T: 60000, Avg. loss: 95455.342160
Total training time: 0.06 seconds.
-- Epoch 2
Norm: 4807.31, NNZs: 635, Bias: 74.527302, T: 120000, Avg. loss: 13014.075360
Total training time: 0.13 seconds.
-- Epoch 3
Norm: 3687.66, NNZs: 640, Bias: 79.844842, T: 180000, Avg. loss: 7957.697351
Total training time: 0.19 seconds.
-- Epoch 4
Norm: 2991.65, NNZs: 643, Bias: 82.888418, T: 240000, Avg. loss: 5508.076104
Total training time: 0.25 seconds.
-- Epoch 5
Norm: 2529.77, NNZs: 644, Bias: 86.117183, T: 300000, Avg. loss: 4319.862264
Total training time: 0.31 seconds.
-- Epoch 6
Norm: 2269.00, NNZs: 644, Bias: 88.405820, T: 360000, Avg. loss: 3412.976325
Total training time: 0.37 seconds.
-- Epoch 7
Norm: 2019.09, NNZs: 646, Bias: 90.057966, T: 420000, Avg. loss: 2882.621686
Total training time: 0.43 seconds.
-- Epoch 8
Norm: 1815.92, NNZs: 647, Bias: 91.549346, T: 480000, Avg. loss: 2472.263598
Total training time: 0.49 seconds.
-- Epoch 9
Norm

Cross-validate the model, this time with 5 folds:

In [34]:
cross_val_score(sgd_clf, X_train, y_train_5, cv=5, scoring="accuracy")

-- Epoch 1
Norm: 9410.59, NNZs: 629, Bias: 18.138697, T: 47999, Avg. loss: 110146.413705
Total training time: 0.05 seconds.
-- Epoch 2
Norm: 5622.21, NNZs: 636, Bias: 26.757462, T: 95998, Avg. loss: 16898.938446
Total training time: 0.10 seconds.
-- Epoch 3
Norm: 4367.50, NNZs: 644, Bias: 30.721698, T: 143997, Avg. loss: 9554.779476
Total training time: 0.15 seconds.
-- Epoch 4
Norm: 3575.42, NNZs: 648, Bias: 34.023480, T: 191996, Avg. loss: 6663.896820
Total training time: 0.20 seconds.
-- Epoch 5
Norm: 3105.98, NNZs: 649, Bias: 36.371776, T: 239995, Avg. loss: 5115.181526
Total training time: 0.25 seconds.
-- Epoch 6
Norm: 2747.58, NNZs: 654, Bias: 38.543547, T: 287994, Avg. loss: 4300.794214
Total training time: 0.30 seconds.
-- Epoch 7
Norm: 2497.68, NNZs: 655, Bias: 39.983873, T: 335993, Avg. loss: 3582.435063
Total training time: 0.36 seconds.
-- Epoch 8
Norm: 2280.11, NNZs: 660, Bias: 41.271082, T: 383992, Avg. loss: 3052.730456
Total training time: 0.41 seconds.
-- Epoch 9
Norm

array([0.9592534 , 0.96625   , 0.963     , 0.97058333, 0.96791399])

In [35]:
y_train_pred_sgd_clf = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

-- Epoch 1
Norm: 10440.23, NNZs: 608, Bias: -4.608426, T: 40000, Avg. loss: 128674.988413
Total training time: 0.04 seconds.
-- Epoch 2
Norm: 6527.51, NNZs: 626, Bias: 2.083937, T: 80000, Avg. loss: 20590.947113
Total training time: 0.08 seconds.
-- Epoch 3
Norm: 5026.96, NNZs: 633, Bias: 4.523462, T: 120000, Avg. loss: 11249.075564
Total training time: 0.12 seconds.
-- Epoch 4
Norm: 4149.25, NNZs: 641, Bias: 7.821485, T: 160000, Avg. loss: 7689.361463
Total training time: 0.17 seconds.
-- Epoch 5
Norm: 3570.46, NNZs: 643, Bias: 9.721195, T: 200000, Avg. loss: 6264.955825
Total training time: 0.21 seconds.
-- Epoch 6
Norm: 3139.24, NNZs: 646, Bias: 11.070565, T: 240000, Avg. loss: 4955.525992
Total training time: 0.25 seconds.
-- Epoch 7
Norm: 2839.98, NNZs: 646, Bias: 12.228031, T: 280000, Avg. loss: 4304.237245
Total training time: 0.29 seconds.
-- Epoch 8
Norm: 2610.63, NNZs: 651, Bias: 13.632419, T: 320000, Avg. loss: 3600.135134
Total training time: 0.33 seconds.
-- Epoch 9
Norm: 

In [36]:
confusion_matrix(y_train_5, y_train_pred_sgd_clf)

array([[52467,  2112],
       [ 1104,  4317]])

Compared with the logistic regression model, the SGD model produces many more false positives (2,112 versus 540). The logistic regression matrix is repeated here for comparison:

In [37]:
confusion_matrix(y_train_5, y_train_pred_logit_clf)

array([[54039,   540],
       [ 1030,  4391]])
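
The gap is easy to quantify from the two matrices. A short sketch, with both matrices copied from the outputs above:

```python
import numpy as np

cm_sgd = np.array([[52467, 2112], [1104, 4317]])
cm_logit = np.array([[54039, 540], [1030, 4391]])

fp_sgd = cm_sgd[0, 1]        # C[0,1] holds the false positives
fp_logit = cm_logit[0, 1]
fp_ratio = fp_sgd / float(fp_logit)
print("SGD: %d false positives, logistic regression: %d (%.1fx)" % (fp_sgd, fp_logit, fp_ratio))

# Precision = TP / (TP + FP), so the extra false positives lower it directly.
precision_sgd = cm_sgd[1, 1] / float(cm_sgd[1, 1] + cm_sgd[0, 1])
print("SGD precision: %.4f" % precision_sgd)   # SGD precision: 0.6715
```

The true positive counts are close (4,317 versus 4,391), so the additional false positives account for most of the drop in precision.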

In [38]:
precision_score(y_train_5, y_train_pred_sgd_clf)

0.6714885674288381

In [39]:
recall_score(y_train_5, y_train_pred_sgd_clf)

0.7963475373547316

Try the trained model on our test data

In [40]:
y_pred_sgd_clf = sgd_clf.predict(X_test)

In [41]:
confusion_matrix(y_test_5, y_pred_sgd_clf)

array([[8965,  143],
       [ 196,  696]])

In [42]:
precision_score(y_test_5, y_pred_sgd_clf)

0.8295589988081049

In [43]:
recall_score(y_test_5, y_pred_sgd_clf)

0.7802690582959642

### 3. Decision Tree Classifier

In [44]:
from sklearn.tree import DecisionTreeClassifier
dt_clf = DecisionTreeClassifier(random_state=42)

In [45]:
start_time = time()
dt_clf.fit(X_train, y_train_5)
print('Time elapsed: %.2fs' % (time()-start_time))

Time elapsed: 21.61s


In [46]:
cross_val_score(dt_clf, X_train, y_train_5, cv=3, scoring="accuracy")

array([0.9696 , 0.9718 , 0.97475])

In [47]:
y_train_pred_dt_clf = cross_val_predict(dt_clf, X_train, y_train_5, cv=3)

In [48]:
confusion_matrix(y_train_5, y_train_pred_dt_clf)

array([[53730,   849],
       [  828,  4593]])

In [49]:
precision_score(y_train_5, y_train_pred_dt_clf)

0.8439911797133407

In [50]:
recall_score(y_train_5, y_train_pred_dt_clf)

0.8472606530160487

Try the model on the test data

In [51]:
y_pred_dt_clf = dt_clf.predict(X_test)

In [52]:
confusion_matrix(y_test_5, y_pred_dt_clf)

array([[8980,  128],
       [ 126,  766]])

In [53]:
precision_score(y_test_5, y_pred_dt_clf)

0.8568232662192393

In [54]:
recall_score(y_test_5, y_pred_dt_clf)

0.8587443946188341

### 4. Random Forest Classifier

In [55]:
from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier(random_state=42)

In [56]:
start_time = time()
forest_clf.fit(X_train, y_train_5)
print('Time elapsed: %.2fs' % (time()-start_time))

Time elapsed: 4.65s


In [57]:
cross_val_score(forest_clf, X_train, y_train_5, cv=3, scoring="accuracy")

array([0.9832, 0.9839, 0.9836])

In [58]:
y_train_pred_forest_clf = cross_val_predict(forest_clf, X_train, y_train_5, cv=3)

In [59]:
confusion_matrix(y_train_5, y_train_pred_forest_clf)

array([[54501,    78],
       [  908,  4513]])

In [60]:
precision_score(y_train_5, y_train_pred_forest_clf)

0.9830102374210412

In [61]:
recall_score(y_train_5, y_train_pred_forest_clf)

0.8325032281866814

Try the model on the test data

In [62]:
y_pred_forest_clf = forest_clf.predict(X_test)

In [63]:
confusion_matrix(y_test_5, y_pred_forest_clf)

array([[9094,   14],
       [ 154,  738]])

In [64]:
precision_score(y_test_5, y_pred_forest_clf)

0.9813829787234043

In [65]:
recall_score(y_test_5, y_pred_forest_clf)

0.827354260089686
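
With four models evaluated so far, it can help to line up their test-set scores. The sketch below recomputes precision and recall from the test confusion matrices printed in the sections above. The dictionary and helper names are made up for illustration.

```python
import numpy as np

# Test-set confusion matrices copied from the outputs above.
test_cms = {
    "logistic regression": [[9034, 74], [147, 745]],
    "SGD":                 [[8965, 143], [196, 696]],
    "decision tree":       [[8980, 128], [126, 766]],
    "random forest":       [[9094, 14], [154, 738]],
}

def precision_recall(cm):
    """Return (precision, recall) for a 2x2 confusion matrix."""
    tn, fp, fn, tp = np.asarray(cm).ravel()
    return tp / float(tp + fp), tp / float(tp + fn)

summary = {name: precision_recall(cm) for name, cm in test_cms.items()}
for name in sorted(summary):
    p, r = summary[name]
    print("%-20s precision=%.3f recall=%.3f" % (name, p, r))
```

On these numbers the random forest has by far the highest precision (about 0.98), while the decision tree has the highest recall (about 0.86).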

### 5. KNN (K Nearest Neighbor) Classifier

In [66]:
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_jobs=-1, n_neighbors=3)
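
For intuition about what `n_neighbors=3` means, here is a brute-force NumPy sketch of k-nearest-neighbor voting on made-up 2-D points. It is not how scikit-learn implements the search (the library can use tree-based indexes), and `knn_predict` and the toy arrays are hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Predict a boolean label for each query by majority vote among its k nearest training points."""
    # Squared Euclidean distance between every query and every training point,
    # shape (n_query, n_train).
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]   # indices of the k closest training points
    votes = y_train[nearest]                     # boolean array, shape (n_query, k)
    return votes.sum(axis=1) * 2 > k             # True if more than half the neighbors are True

X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([False, False, False, True, True, True])
queries = np.array([[0.5, 0.5], [5.5, 5.5]])
knn_pred = knn_predict(X_toy, y_toy, queries, k=3)
print(knn_pred)
```

Nearly all of the work happens at prediction time, since each query is compared against the stored training set. That can make cross-validation on 60,000 images slow.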

In [67]:
start_time = time()
knn_clf.fit(X_train, y_train_5)
print('Time elapsed: %.2fs' % (time()-start_time))

Time elapsed: 34.92s


In [None]:
cross_val_score(knn_clf, X_train, y_train_5, cv=3, scoring="accuracy")

In [None]:
y_train_pred_knn_clf = cross_val_predict(knn_clf, X_train, y_train_5, cv=3)

In [None]:
confusion_matrix(y_train_5, y_train_pred_knn_clf)

In [None]:
precision_score(y_train_5, y_train_pred_knn_clf)

In [None]:
recall_score(y_train_5, y_train_pred_knn_clf)

Try the model on the test data

In [None]:
y_pred_knn_clf = knn_clf.predict(X_test)

In [None]:
confusion_matrix(y_test_5, y_pred_knn_clf)

In [None]:
from sklearn.metrics import precision_score, recall_score
precision_score(y_test_5, y_pred_knn_clf)

In [None]:
recall_score(y_test_5, y_pred_knn_clf)