### 9.
Load the MNIST dataset (introduced in Chapter 3) and split it into a training set and a test set (take the first 60,000 instances for training, and the remaining 10,000 for testing). Train a Random Forest classifier on the dataset and time how long it takes, then evaluate the resulting model on the test set. Next, use PCA to reduce the dataset’s dimensionality, with an explained variance ratio of 95%. Train a new Random Forest classifier on the reduced dataset and see how long it takes. Was training much faster? Next, evaluate the classifier on the test set. How does it compare to the previous classifier?

In [1]:
from sklearn.datasets import fetch_openml
import numpy as np

mnist = fetch_openml('mnist_784', version=1)
X, y = mnist["data"], mnist["target"].astype(np.int64)

# separate into training and test sets
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

In [5]:
import time  # used to measure how long training takes
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier()
t1 = time.time()
rnd_clf.fit(X_train, y_train)
t2 = time.time()

print("{}: {:.1f} seconds".format(rnd_clf.__class__.__name__, t2 - t1))

RandomForestClassifier: 41.0 seconds


In [6]:
from sklearn.metrics import accuracy_score

y_pred = rnd_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.9701

In [10]:
# Let's try again, but first reducing the dimensionality with PCA.
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_train)
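
Before training on the reduced data, it can help to check how many dimensions PCA actually kept. The sketch below is only illustrative: it assumes scikit-learn's small bundled digits dataset (`load_digits`) as a stand-in for MNIST so that it runs without a download, and the names `X_small` and `pca_small` are hypothetical. The same two attributes can be read from the notebook's own `pca` object.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in data (assumption): 8x8 digit images bundled with scikit-learn,
# used here instead of MNIST so the sketch runs offline.
X_small, _ = load_digits(return_X_y=True)

# A float in (0, 1) asks PCA to keep enough components to reach that
# fraction of the total variance.
pca_small = PCA(n_components=0.95)
X_small_reduced = pca_small.fit_transform(X_small)

# n_components_ is the number of components PCA actually kept.
n_kept = pca_small.n_components_
# Summing the per-component ratios gives the total variance preserved.
variance_kept = pca_small.explained_variance_ratio_.sum()

print("{} -> {} features".format(X_small.shape[1], n_kept))
print("Explained variance kept: {:.3f}".format(variance_kept))
```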

In [11]:
# timing the training on the reduced dataset

rnd_clf_reduced = RandomForestClassifier()
t1 = time.time()
rnd_clf_reduced.fit(X_reduced, y_train)
t2 = time.time()

print("Reduced Dataset - {}: {:.1f} seconds".format(rnd_clf_reduced.__class__.__name__, t2 - t1))

Reduced Dataset - RandomForestClassifier: 101.7 seconds


Yikes, training took more than twice as long (101.7 seconds vs. 41.0 seconds). PCA was supposed to speed up training, but here it slowed it down. As we saw in the book, performing PCA does not always translate into faster training: it depends on the dataset, the model, and the training algorithm, all combined.

In [13]:
X_test_reduced = pca.transform(X_test)

y_pred = rnd_clf_reduced.predict(X_test_reduced)
accuracy_score(y_test, y_pred)

0.948

In the end, PCA made things worse for us: the model took longer to train and its accuracy dropped (94.8% vs. 97.01%).

Note: it is common for accuracy to drop when using PCA, since some useful signal is lost during the reduction.
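
One way to see the lost signal is to project the data back to the original space and measure what does not come back. This is a minimal sketch, again assuming the bundled digits dataset as an offline stand-in for MNIST; the variable names are hypothetical and the numbers it prints will differ from what MNIST would give.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in data (assumption): scikit-learn's bundled digits, not MNIST.
X_small, _ = load_digits(return_X_y=True)

pca_small = PCA(n_components=0.95)
X_small_reduced = pca_small.fit_transform(X_small)

# Map the compressed data back to the original 64-dimensional space.
X_small_recovered = pca_small.inverse_transform(X_small_reduced)

# Mean squared distance between the original and recovered pixels.
reconstruction_mse = np.mean((X_small - X_small_recovered) ** 2)

# Fraction of the total variance that the dropped components carried.
total_ss = np.sum((X_small - X_small.mean(axis=0)) ** 2)
lost_fraction = np.sum((X_small - X_small_recovered) ** 2) / total_ss
kept_fraction = pca_small.explained_variance_ratio_.sum()

print("Reconstruction MSE: {:.3f}".format(reconstruction_mse))
print("Variance lost: {:.3f}".format(lost_fraction))
```

The lost fraction is simply one minus the explained variance that PCA kept, so with `n_components=0.95` it stays below 5%. Whether that discarded part mattered for classification is what the accuracy comparison tells us.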

Let's try with another model, instead of Random Forest, and see how it performs:

In [20]:
from sklearn.linear_model import LogisticRegression

log_clf = LogisticRegression(multi_class="multinomial", solver="lbfgs")
t1 = time.time()
log_clf.fit(X_train, y_train)
t2 = time.time()

print("{}: {:.1f} seconds".format(log_clf.__class__.__name__, t2 - t1))

LogisticRegression: 18.3 seconds


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
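
The warning above suggests two remedies: raising `max_iter` or scaling the data. The sketch below combines both in a pipeline. It is an illustration under assumptions rather than a rerun of this notebook: it uses the bundled digits dataset as an offline stand-in for MNIST, the value `max_iter=1000` and the 75/25 split are arbitrary choices, and the names are hypothetical.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): scikit-learn's bundled digits, not MNIST.
X_small, y_small = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_small, y_small, test_size=0.25, random_state=42)

# Standardize the features, then fit with a higher iteration limit.
# max_iter=1000 is an arbitrary illustrative value, not a recommendation.
scaled_log_clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="lbfgs", max_iter=1000),
)
scaled_log_clf.fit(X_tr, y_tr)

small_accuracy = accuracy_score(y_te, scaled_log_clf.predict(X_te))
# n_iter_ reports how many iterations the solver actually ran.
n_iter_used = scaled_log_clf[-1].n_iter_

print("Accuracy: {:.4f}".format(small_accuracy))
print("Iterations used:", n_iter_used)
```

If the reported iteration count is below the limit, the solver stopped because it converged rather than because it ran out of iterations.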


In [21]:
y_pred = log_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.9255

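Softmax regression trained faster than our previous Random Forest (18.3 seconds vs. 41.0 seconds) but performed worse (92.55% vs. 97.01% accuracy). Note that the solver also stopped at its iteration limit before converging, as the warning above shows. Let's run the same model, but with the PCA-reduced dataset, and see how it performs: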

In [25]:
log_clf_reduced = LogisticRegression(multi_class="multinomial", solver="lbfgs")
t1 = time.time()
log_clf_reduced.fit(X_reduced, y_train)
t2 = time.time()

print("{}: {:.1f} seconds".format(log_clf_reduced.__class__.__name__, t2 - t1))

LogisticRegression: 4.7 seconds


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [26]:
y_pred = log_clf_reduced.predict(X_test_reduced)
accuracy_score(y_test, y_pred)

0.9201

That was almost 4 times faster (4.7 seconds vs. 18.3 seconds)! The accuracy dropped slightly (from 92.55% to 92.01%), but that was expected. Depending on the application, this tradeoff can be well worth it.
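
To put the four runs side by side, the arithmetic can be sketched as follows. The timings and accuracies are copied from the cell outputs above; the `results` dictionary and the `summarize` helper are hypothetical names introduced only for this illustration.

```python
# (training time in seconds, test accuracy) copied from the outputs above.
results = {
    "RandomForest": {"full": (41.0, 0.9701), "pca": (101.7, 0.948)},
    "LogisticRegression": {"full": (18.3, 0.9255), "pca": (4.7, 0.9201)},
}


def summarize(full, pca):
    """Return (speedup factor, accuracy change) of PCA relative to full data."""
    full_time, full_acc = full
    pca_time, pca_acc = pca
    return full_time / pca_time, pca_acc - full_acc


summary = {name: summarize(r["full"], r["pca"]) for name, r in results.items()}

for name, (speedup, delta) in summary.items():
    print("{}: {:.2f}x speedup, accuracy change {:+.4f}".format(
        name, speedup, delta))
```

A speedup below 1 means PCA slowed training down, which is what happened for the Random Forest. For Logistic Regression the factor is about 3.89, at a cost of roughly half a percentage point of accuracy. These timings come from single runs, so they should be read as rough indications only.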