## Aim:

Load the MNIST dataset and split it into a training set
and a test set. Train a Random Forest classifier on the dataset and time how
long it takes, then evaluate the resulting model on the test set. 


Next, use different dimensionality reduction techniques to
reduce the dataset’s dimensionality.
Train a new Random Forest classifier on the reduced dataset and see how long it
takes. Was training much faster? Next, evaluate the classifier on the test set. How
does it compare to the previous classifier?

In [1]:
import pandas as pd
import numpy as np

#Importing MNIST dataset
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1)
mnist.target = mnist.target.astype(np.uint8)

In [2]:
type(mnist)

sklearn.utils.Bunch

In [3]:
from sklearn.model_selection import train_test_split

X = mnist["data"]
y = mnist["target"]

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=12)

In [4]:
print("Shape of training features:",X_train.shape)
print("Shape of training labels:",y_train.shape)
print("Shape of testing features:",X_test.shape)
print("Shape of testing labels:",y_test.shape)

Shape of training features: (56000, 784)
Shape of training labels: (56000,)
Shape of testing features: (14000, 784)
Shape of testing labels: (14000,)


## RF without dimensionality reduction

In [5]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import time


RF=RandomForestClassifier(random_state=123)
t0 = time.time()
RF.fit(X_train,y_train)
t1 = time.time()

RF_score=cross_val_score(RF,X_train,y_train,scoring='f1_macro',cv=5)
print("F1 score for RF without dimensionality reduction is:",RF_score.mean())
print("Training took {:.2f}s".format(t1 - t0))

F1 score for RF without dimensionality reduction is: 0.9668587063036626
Training took 43.13s
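
The cells in this notebook repeat the same `time.time()` pattern around each `fit` call. As a minimal sketch, the pattern could be wrapped in a small helper. The name `timed` is hypothetical and not part of the original notebook, and `time.perf_counter()` is assumed here as a monotonic alternative to `time.time()`.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Usage with a cheap stand-in for a call such as RF.fit(X_train, y_train)
total, seconds = timed(sum, range(1_000_000))
print("Call took {:.2f}s".format(seconds))
```

With the notebook's variables, the equivalent call would be `timed(RF.fit, X_train, y_train)`.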


## Dimensionality Reduction

### 1. PCA

In [6]:
from sklearn.decomposition import PCA

#PCA with an explained variance ratio of 95%
pca = PCA(n_components=0.95,random_state=123)
X_train_reduced = pca.fit_transform(X_train)
RF=RandomForestClassifier(random_state=123)
t0 = time.time()
RF.fit(X_train_reduced,y_train)
t1 = time.time()
# Note: t1 - t0 measures RF training on the PCA-reduced data, not the PCA step itself

RF_score=cross_val_score(RF,X_train_reduced,y_train,scoring='f1_macro',cv=5)
print("F1 score for RF with PCA is:",RF_score.mean())
print("Training took {:.2f}s".format(t1 - t0))

F1 score for RF with PCA is: 0.944492061782839
Training took 93.56s


In [7]:
pca.n_components_

154
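
Passing a float such as `n_components=0.95` asks PCA to keep the smallest number of components whose cumulative explained variance ratio reaches that threshold. The sketch below illustrates the idea by hand. It assumes the small `load_digits` dataset bundled with scikit-learn as a stand-in for MNIST so that it runs in seconds, and the variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in dataset (assumption): 8x8 digit images instead of MNIST
X_small, _ = load_digits(return_X_y=True)

# Fit PCA with all components, then accumulate the explained variance ratios
full_pca = PCA(svd_solver="full").fit(X_small)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%
d = int(np.argmax(cumulative >= 0.95)) + 1

# Let PCA pick the count from the same threshold, for comparison
auto_pca = PCA(n_components=0.95, svd_solver="full").fit(X_small)
print(d, auto_pca.n_components_)
```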

The feature count dropped from 784 to 154, but training is actually more than twice as slow (93.56s versus 43.13s)! How can that be? Dimensionality reduction does not always lead to faster training: it depends on the dataset, the model and the training algorithm. If you try a softmax classifier (multiclass logistic regression) instead of a random forest classifier, you will find that training time is reduced by a factor of 3 when using PCA.
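
One way to check the softmax claim yourself is to time the same classifier on the full and the PCA-reduced features. This is only a sketch under stated assumptions: it uses the bundled `load_digits` dataset as a quick stand-in for MNIST, the helper `fit_seconds` is hypothetical, and the timings it prints will not match the MNIST figures quoted above.

```python
import time
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Stand-in dataset (assumption): swap in X_train / y_train for the real experiment
X_small, y_small = load_digits(return_X_y=True)
X_small_reduced = PCA(n_components=0.95, random_state=123).fit_transform(X_small)

def fit_seconds(X, y):
    """Train a multiclass logistic regression model and return the wall time."""
    clf = LogisticRegression(max_iter=1000)
    start = time.perf_counter()
    clf.fit(X, y)
    return time.perf_counter() - start

full_time = fit_seconds(X_small, y_small)
reduced_time = fit_seconds(X_small_reduced, y_small)
print("All features: {:.2f}s, PCA features: {:.2f}s".format(full_time, reduced_time))
```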

### 2. MDS

In [22]:
"""
The measure of goodness of fit in multidimensional scaling is called S(caled)-Stress.
Stress varies between 0 and 1, with values near 0 indicating a better fit.
PCA is equivalent to classical MDS using the Euclidean distance metric,
so if this is the type of MDS you want to perform, you can use PCA instead.
"""
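
The equivalence between PCA and classical MDS on Euclidean distances can be checked numerically. The sketch below is an illustration on random data, not part of the original notebook: it computes PCA coordinates from an SVD of the centered data, computes classical MDS coordinates by double-centering the squared distance matrix, and compares the two up to the sign of each axis.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 5))  # synthetic data (assumption)

# PCA coordinates: project the centered data onto the top 2 principal axes
X_centered = X_demo - X_demo.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
pca_coords = U[:, :2] * S[:2]

# Classical MDS: double-center the squared Euclidean distance matrix
sq_dists = ((X_demo[:, None, :] - X_demo[None, :, :]) ** 2).sum(axis=-1)
n = X_demo.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ sq_dists @ J

# Top 2 eigenpairs of B give the MDS embedding
eigvals, eigvecs = np.linalg.eigh(B)
top = np.argsort(eigvals)[::-1][:2]
mds_coords = eigvecs[:, top] * np.sqrt(eigvals[top])

# Each axis is only defined up to sign, so compare absolute values
same_up_to_sign = np.allclose(np.abs(pca_coords), np.abs(mds_coords), atol=1e-6)
print("PCA and classical MDS agree up to sign:", same_up_to_sign)
```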

### 3. Isomap

In [11]:
"""
We will not run Isomap on the full training set, as it is computationally expensive
(source: https://www.cs.ubc.ca/~tmm/courses/533-07/slides/hidim.donovan-4x4.pdf).
The code below is left disabled for reference:

from sklearn import manifold
t0 = time.time()
X_train_reduced = manifold.Isomap().fit_transform(X_train)
t1 = time.time()
print("Reduction with Isomap took {:.2f}s".format(t1 - t0))
"""
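
If you still want a feel for Isomap, one option is to run it on a small subsample. The sketch below is a hedged illustration rather than the notebook's method: it uses the bundled `load_digits` dataset as a stand-in for MNIST, and the subsample size of 300 and `n_neighbors=10` are arbitrary choices.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap

# Stand-in dataset (assumption); with MNIST you would subsample X_train instead
X_small, _ = load_digits(return_X_y=True)
X_subset = X_small[:300]  # arbitrary subsample size

# Embed the subsample into 2 dimensions
iso = Isomap(n_neighbors=10, n_components=2)
X_subset_iso = iso.fit_transform(X_subset)
print(X_subset_iso.shape)
```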

### 4. t-SNE

In [9]:
"""
t-SNE is used mostly for visualization.
With the default barnes_hut method, requesting 4 or more components raises:
ValueError: 'n_components' should be inferior to 4 for the barnes_hut algorithm as it relies on quad-tree or oct-tree.
"""
from sklearn.manifold import TSNE
t0 = time.time()
X_train_reduced_SNE = TSNE().fit_transform(X_train)
t1 = time.time()
print("Reduction with t-SNE took {:.2f}s".format(t1 - t0))


Reduction with t-SNE took 8783.98s
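
A common way to cut the t-SNE runtime is to work on a subsample and to compress the features with PCA first. The sketch below shows that pattern under stated assumptions: the bundled `load_digits` dataset stands in for MNIST, and the subsample size (300) and the intermediate PCA size (30) are arbitrary illustrative values.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Stand-in dataset (assumption); with MNIST you would subsample X_train instead
X_small, _ = load_digits(return_X_y=True)
X_subset = X_small[:300]

# Step 1: compress to a moderate number of dimensions with PCA
X_subset_pca = PCA(n_components=30, random_state=123).fit_transform(X_subset)

# Step 2: run t-SNE on the compressed subsample
X_subset_tsne = TSNE(n_components=2, random_state=123).fit_transform(X_subset_pca)
print(X_subset_tsne.shape)
```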


In [10]:
RF_SNE=RandomForestClassifier(random_state=123)
t0 = time.time()
RF_SNE.fit(X_train_reduced_SNE,y_train)
t1 = time.time()


RF_score=cross_val_score(RF_SNE,X_train_reduced_SNE,y_train,scoring='f1_macro',cv=5)
print("F1 score for RF with t-SNE is:",RF_score.mean())
print("Training took {:.2f}s".format(t1 - t0))

F1 score for RF with t-SNE is: 0.9712626232902455
Training took 19.05s


### 5. UMAP

In [17]:
import umap
t0 = time.time()
um=umap.UMAP(random_state=123)
X_train_reduced = um.fit_transform(X_train)
t1 = time.time()
print("Reduction with UMAP took {:.2f}s".format(t1 - t0))

Reduction with UMAP took 142.10s


In [18]:
RF=RandomForestClassifier(random_state=123)
t0 = time.time()
RF.fit(X_train_reduced,y_train)
t1 = time.time()


RF_score=cross_val_score(RF,X_train_reduced,y_train,scoring='f1_macro',cv=5)
print("F1 score for RF with UMAP is:",RF_score.mean())
print("Training took {:.2f}s".format(t1 - t0))

F1 score for RF with UMAP is: 0.9641087867528005
Training took 19.95s


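t-SNE took by far the longest to compute (8783.98s for the reduction alone), although it gave the highest cross-validated F1 score (0.9713). We will use the model trained on the UMAP embedding instead: the reduction took only 142.10s, the F1 score is still high (0.9641), and the fitted UMAP model can transform the test set, whereas scikit-learn's TSNE has no transform method for new data.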
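
Evaluating on the test set requires a reducer that can be fitted on the training data and then applied to unseen samples. The short check below shows which scikit-learn reducers expose a `transform` method. It is a quick illustration only; `has_transform` is a name introduced here.

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Record whether each reducer class can project new, unseen samples
has_transform = {cls.__name__: hasattr(cls, "transform") for cls in (PCA, TSNE)}
for name, supported in has_transform.items():
    print(name, "has transform:", supported)
```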

### Test Set

In [19]:
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

# Transforming the test features with the UMAP model fitted on the training set
X_test_reduced_umap = um.transform(X_test)

yhat=RF.predict(X_test_reduced_umap)
print("F1 score for test set is:",f1_score(y_test,yhat,average='macro'))
print("Accuracy score for test set is:",accuracy_score(y_test,yhat))
print(classification_report(y_test,yhat))

F1 score for test set is: 0.9580496682877884
Accuracy score for test set is: 0.9582857142857143
              precision    recall  f1-score   support

           0       0.97      0.99      0.98      1358
           1       0.95      0.99      0.97      1596
           2       0.98      0.93      0.95      1403
           3       0.95      0.96      0.95      1450
           4       0.98      0.94      0.96      1327
           5       0.94      0.94      0.94      1249
           6       0.97      0.98      0.98      1367
           7       0.96      0.96      0.96      1459
           8       0.96      0.92      0.94      1413
           9       0.93      0.96      0.94      1378

    accuracy                           0.96     14000
   macro avg       0.96      0.96      0.96     14000
weighted avg       0.96      0.96      0.96     14000
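
To answer the comparison question from the aim, the baseline classifier and a reduced-feature classifier can be scored on the same held-out test set. The sketch below shows one way to do that with a scikit-learn `Pipeline`, so the reducer is fitted on the training split only. It is an illustration under assumptions: it uses PCA rather than UMAP, the bundled `load_digits` dataset rather than MNIST, and the variable names are hypothetical, so its scores are not comparable to the results above.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Stand-in dataset (assumption): 8x8 digit images instead of MNIST
X_small, y_small = load_digits(return_X_y=True)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_small, y_small, test_size=0.2, random_state=12
)

# Baseline on all features versus PCA followed by the same classifier
baseline = RandomForestClassifier(random_state=123)
reduced = make_pipeline(
    PCA(n_components=0.95, random_state=123),
    RandomForestClassifier(random_state=123),
)

# Fit each model on the training split and score it on the held-out split
scores = {}
for name, model in {"baseline": baseline, "pca+rf": reduced}.items():
    model.fit(Xs_train, ys_train)
    scores[name] = f1_score(ys_test, model.predict(Xs_test), average="macro")
    print("Test F1 ({}): {:.4f}".format(name, scores[name]))
```

Because the pipeline applies the fitted PCA inside `predict`, the test features never need to be transformed by hand.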

