In [53]:
import numpy as np
import pandas as pd

In [54]:
from sklearn.datasets import fetch_openml

In [55]:
mnist = fetch_openml('mnist_784', version=1)

In [56]:
mnist.keys()

dict_keys(['data', 'target', 'frame', 'categories', 'feature_names', 'target_names', 'DESCR', 'details', 'url'])

In [57]:
x, y = mnist['data'], mnist['target']

In [58]:
x.shape

(70000, 784)

In [59]:
y.shape

(70000,)

In [60]:
X_train = mnist['data'][:60000]
y_train = mnist['target'][:60000]

In [61]:
X_test = mnist['data'][60000:]
y_test = mnist['target'][60000:]

In [62]:
from sklearn.ensemble import RandomForestClassifier

In [63]:
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)

In [64]:
import time

t0 = time.time()
rnd_clf.fit(X_train,y_train)
t1 = time.time()

In [65]:
print('Training took {:.2f}s'.format(t1-t0))

Training took 66.33s


In [66]:
from sklearn.metrics import accuracy_score

y_pred = rnd_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.9705

Next, use PCA to reduce the dataset's dimensionality, with an explained variance ratio of 95%.

In [67]:
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)
X_train_reduced = pca.fit_transform(X_train)
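
A quick way to see what `n_components=0.95` does is to inspect the fitted PCA object. The sketch below is not part of the run above: it assumes scikit-learn's small bundled digits dataset (`load_digits`) so that it runs in seconds without a download, and the variable names are only illustrative. On MNIST the same attributes, `n_components_` and `explained_variance_ratio_`, report how many of the 784 dimensions were kept.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Small stand-in for MNIST: 1,797 images of 8x8 pixels (64 features).
X_digits, _ = load_digits(return_X_y=True)

# A float in (0, 1) asks PCA for the fewest components whose cumulative
# explained variance ratio exceeds that threshold.
pca_digits = PCA(n_components=0.95)
X_digits_reduced = pca_digits.fit_transform(X_digits)

n_kept = pca_digits.n_components_
variance_kept = pca_digits.explained_variance_ratio_.sum()
print(f"kept {n_kept} of {X_digits.shape[1]} dimensions, "
      f"preserving {variance_kept:.1%} of the variance")
```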

Train a new Random Forest classifier on the reduced dataset and see how long it takes. Was training much faster?

In [68]:
rnd_clf2 = RandomForestClassifier(n_estimators=100, random_state=42)
t0 = time.time()
rnd_clf2.fit(X_train_reduced, y_train)
t1 = time.time()

In [69]:
print('Training took {:.2f}s'.format(t1-t0))

Training took 160.96s


Next, evaluate the classifier on the test set. How does it compare to the previous classifier?

In [70]:
X_test_reduced = pca.transform(X_test)
y_pred = rnd_clf2.predict(X_test_reduced)
accuracy_score(y_test,y_pred)

0.9481

Training is actually about 2.4 times slower now (160.96s versus 66.33s). It is common for performance to drop slightly when reducing dimensionality, because we do lose some useful signal in the process. However, the drop is rather severe in this case: accuracy fell from 97.05% to 94.81%. So PCA really did not help here: it slowed down training and reduced performance.
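
One plausible factor in the slowdown (a hypothesis, not something measured in this notebook) is that raw pixel columns contain few distinct values, many of them zero, while every PCA component is a dense column of distinct real numbers, which gives each tree far more candidate split thresholds to evaluate per feature. The sketch below illustrates that difference on the small bundled digits dataset, used here as an assumed stand-in for MNIST; the helper name `mean_unique_per_feature` is hypothetical.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_digits, _ = load_digits(return_X_y=True)
X_digits_reduced = PCA(n_components=0.95).fit_transform(X_digits)

def mean_unique_per_feature(X):
    """Average number of distinct values found in each column of X."""
    return np.mean([np.unique(column).size for column in X.T])

raw_unique = mean_unique_per_feature(X_digits)
pca_unique = mean_unique_per_feature(X_digits_reduced)
print(f"raw pixels:     {raw_unique:.1f} distinct values per feature")
print(f"PCA components: {pca_unique:.1f} distinct values per feature")
```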

Let's see if it helps when using softmax regression:

In [71]:
from sklearn.linear_model import LogisticRegression

log_clf = LogisticRegression(multi_class="multinomial", solver="lbfgs", random_state=42)
t0 = time.time()
log_clf.fit(X_train, y_train)
t1 = time.time()

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [72]:
print("Training took {:.2f}s".format(t1 - t0))

Training took 27.60s


In [73]:
y_pred = log_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.9255
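
The convergence warning printed during training suggests raising `max_iter` or scaling the data. A minimal sketch of both remedies follows, again on the bundled digits dataset as an assumed stand-in so that it runs quickly. The `max_iter=1000` value and the 80/20 split are illustrative choices, not settings tuned for MNIST, and the timings and accuracies reported in this notebook were not produced this way.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_digits, y_digits = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_digits, y_digits, test_size=0.2, random_state=42)

# Standardize each feature, then fit softmax regression with a larger
# iteration budget than the default of 100.
scaled_log_clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="lbfgs", max_iter=1000, random_state=42),
)
scaled_log_clf.fit(X_tr, y_tr)

scaled_accuracy = accuracy_score(y_te, scaled_log_clf.predict(X_te))
print(f"test accuracy: {scaled_accuracy:.4f}")
```

Wrapping the scaler and the classifier in one pipeline means the test set is scaled with statistics learned from the training set only.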

Okay, so softmax regression actually trained faster than the random forest classifier in this run (27.60s versus 66.33s), although the solver stopped at its iteration limit before converging, and it performs worse on the test set (92.55% versus 97.05% accuracy). But that's not what we are interested in right now: we want to see how much PCA can help softmax regression. Let's train the softmax regression model using the reduced dataset:

In [74]:
log_clf2 = LogisticRegression(multi_class="multinomial", solver="lbfgs", random_state=42)
t0 = time.time()
log_clf2.fit(X_train_reduced, y_train)
t1 = time.time()

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [75]:
print("Training took {:.2f}s".format(t1 - t0))

Training took 9.05s


Nice! Reducing dimensionality led to a roughly 3× speedup (27.60s down to 9.05s). Let's check the model's accuracy:

In [77]:
y_pred = log_clf2.predict(X_test_reduced)
accuracy_score(y_test, y_pred)

0.9201

A very slight drop in performance (92.55% to 92.01%), which might be a reasonable price to pay for a roughly 3× speedup, depending on the application.

So there you have it: PCA can give you a formidable speedup... but not always!
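
The timings and accuracies printed above can be turned into ratios directly. The numbers below are copied from this notebook's outputs; the variable names are only illustrative.

```python
# Training times (seconds) and test accuracies copied from the outputs above.
rf_time, rf_pca_time = 66.33, 160.96
log_time, log_pca_time = 27.60, 9.05
rf_acc, rf_pca_acc = 0.9705, 0.9481
log_acc, log_pca_acc = 0.9255, 0.9201

rf_slowdown = rf_pca_time / rf_time      # > 1 means PCA made training slower
log_speedup = log_time / log_pca_time    # > 1 means PCA made training faster

print(f"Random forest: {rf_slowdown:.2f}x slower, "
      f"accuracy change {rf_pca_acc - rf_acc:+.4f}")
print(f"Softmax:       {log_speedup:.2f}x faster, "
      f"accuracy change {log_pca_acc - log_acc:+.4f}")
```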