In this project I analyze the MNIST data set: I fit a Random Forest on all 784 pixel variables, run PCA on the dataset to find the Principal Components, and then fit another Random Forest on those Principal Components. The two approaches are then compared in terms of time taken and test performance. Finally, I identify the design flaw in this experiment and re-run it under the regular train and test regimen.

I would suggest that Management use a PCA + ML model for image analysis because it achieved higher Precision and Recall than the pure ML model. The pure ML model ran faster, but for an image classification task Precision and Recall matter more, and we want to classify on the Principal Components that capture most of the variance in the data.

In [1]:
# Importing pandas, numpy and standard-library helpers
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

import pandas as pd
import numpy as np
import os
import time

# Setting the seed value for reproducible results
RANDOM_SEED = 42

In [2]:
# Importing MNIST

import scipy.io
mnist = scipy.io.loadmat('mnist-original.mat')
X, y = mnist['data'].T.astype(int), mnist['label'].T.astype(int)

In [3]:
# Creating Train and Test sets

y = np.ravel(y,order='C').astype(int)

X_train = X[:60000]
y_train = y[:60000]

X_test = X[60000:]
y_test = y[60000:]

### Random Forest

#### Building the Model

In [4]:
# Scaling data with Pipeline

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline

# Fitting a Random Forest and timing the fit
# Note: RandomForestRegressor is used here, so the digit labels are treated as numeric targets

from sklearn.ensemble import RandomForestRegressor

start_time = time.time()

scaler = StandardScaler()

rf = RandomForestRegressor(max_features = 'sqrt', bootstrap = True,
                           random_state=RANDOM_SEED,
                           n_estimators=10)

pipeline = make_pipeline(scaler, rf)

pipeline.fit(X_train.astype(float), y_train)

print("--- Total Time: %s seconds ---" % round((time.time() - start_time), 2))

--- Total Time: 5.65 seconds ---
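Because MNIST labels are categories rather than quantities, a classifier is the more conventional estimator for this task. The following is a minimal sketch and not the run reported in this notebook: it uses scikit-learn's small built-in digits dataset (`load_digits`) as a stand-in for MNIST so that it is self-contained, and the variable names are illustrative. No scores are quoted because they would not be comparable with the results reported here.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

RANDOM_SEED = 42

# Stand-in data (assumption): 8x8 digit images, not the 28x28 MNIST images
X_small, y_small = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_small, y_small, test_size=0.2,
    random_state=RANDOM_SEED, stratify=y_small)

# Same forest settings as Model 1, but as a classifier
clf = RandomForestClassifier(max_features='sqrt', bootstrap=True,
                             random_state=RANDOM_SEED,
                             n_estimators=10)
clf.fit(X_tr, y_tr)

# predict() returns class labels directly, so no cast to int is needed
y_pred = clf.predict(X_te)
macro_f1 = f1_score(y_te, y_pred, average='macro')
print("Macro F1 on the stand-in data: {:.3f}".format(macro_f1))
```

With a classifier the predictions are always valid digit labels, which makes the Precision, Recall and F1 scores easier to interpret.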


#### Predicting on Test Data

In [5]:
y_test_predict = pipeline.predict(X_test.astype(float)).astype(int)

In [6]:
y_test_predict

array([0, 0, 0, ..., 8, 8, 8])

In [7]:
y_test

array([0, 0, 0, ..., 9, 9, 9])
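The trailing 8s where the true labels are 9s are consistent with what happens when averaged regression outputs are truncated by `.astype(int)`. A small illustration follows, with hypothetical per-tree outputs (the numbers are made up for the example and are not taken from the fitted forest):

```python
from collections import Counter
from statistics import mean

# Hypothetical outputs of 10 trees for a single image of a "9"
tree_outputs = [9, 9, 9, 8, 9, 9, 9, 9, 9, 9]

# A regression forest averages the outputs of its trees
averaged = mean(tree_outputs)

# Casting to int truncates toward zero, so one dissenting tree changes the label
truncated = int(averaged)

# Rounding lands on the right label here, but still treats labels as numbers
rounded = int(round(averaged))

# A classifier picks a class instead of averaging label values;
# a simple majority vote stands in for that here
voted = Counter(tree_outputs).most_common(1)[0][0]

print(averaged, truncated, rounded, voted)
```

In this example a single tree voting 8 is enough to turn the truncated prediction into 8, while the majority vote stays at 9.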

#### Evaluating the model

In [8]:
# Importing Precision, Recall and F1 scores

from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import f1_score

print("Precision Score: {:.3f}".format(precision_score(y_test,
                                                       y_test_predict,
                                                       average='macro')))

print("Recall Score: {:.3f}".format(recall_score(y_test,
                                                 y_test_predict,
                                                 average='macro')))

print("F1 Score: {:.3f}".format(f1_score(y_test,
                                         y_test_predict,
                                         average='macro')))

Precision Score: 0.624
Recall Score: 0.582
F1 Score: 0.577


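The Precision and Recall are both above 0.5 for this model and the F1 score is 0.577. For a roughly balanced 10-class problem, where random guessing would score around 0.1, this indicates that the model performs well above chance on the test data, although there is clear room for improvement.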
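To make the `average='macro'` setting concrete, here is a small sketch that computes the macro scores by hand from a confusion matrix. The labels are hypothetical and only serve the illustration. It also shows why the macro F1 can come out lower than both the macro Precision and the macro Recall, as it does above: it is the mean of the per-class F1 scores, not the harmonic mean of the two macro averages.

```python
import numpy as np

# Hypothetical 3-class labels, for illustration only
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2, 0, 1])

# Confusion matrix: rows are true classes, columns are predicted classes
n_classes = 3
cm = np.zeros((n_classes, n_classes), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

true_positives = np.diag(cm)
precision_per_class = true_positives / cm.sum(axis=0)  # per predicted class
recall_per_class = true_positives / cm.sum(axis=1)     # per true class
f1_per_class = (2 * precision_per_class * recall_per_class
                / (precision_per_class + recall_per_class))

# Macro averaging: unweighted mean over classes
macro_precision = precision_per_class.mean()
macro_recall = recall_per_class.mean()
macro_f1 = f1_per_class.mean()

print("Precision: {:.3f}".format(macro_precision))
print("Recall: {:.3f}".format(macro_recall))
print("F1: {:.3f}".format(macro_f1))
```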

### Principal Component Analysis

In [9]:
# Importing PCA

from sklearn.decomposition import PCA

# Keeping enough components to explain 95% of the variance

pca = PCA(n_components = 0.95)

print(pca)

# Start Time

start_time = time.time()

# Fitting the entire dataset

X2D = pca.fit_transform(X.astype(float))

# End Time

print("\n--- Total Time: %s seconds ---" % round((time.time() - start_time),2))

print("\nNumber of Principal components: %s" % pca.n_components_)

PCA(copy=True, iterated_power='auto', n_components=0.95, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

--- Total Time: 12.74 seconds ---

Number of Principal components: 154
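Passing a fraction such as `n_components = 0.95` asks PCA to keep the smallest number of components whose cumulative explained variance reaches that fraction. The sketch below reproduces the idea with plain NumPy on synthetic data. The data, the variable names and the use of `np.linalg.svd` are assumptions made for illustration, and this is not scikit-learn's internal code.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (assumption): 20 features with steadily shrinking spread
X_demo = rng.normal(size=(500, 20)) * np.linspace(10, 0.1, 20)

# PCA works on centered data
X_centered = X_demo - X_demo.mean(axis=0)

# Squared singular values are proportional to the variance per component
_, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
explained_ratio = singular_values ** 2 / np.sum(singular_values ** 2)
cumulative = np.cumsum(explained_ratio)

# Smallest k whose cumulative explained variance reaches 95%
k = int(np.searchsorted(cumulative, 0.95)) + 1

# Project onto the first k principal directions
X_reduced = X_centered @ Vt[:k].T

print("Components kept:", k, "out of", X_demo.shape[1])
```

On MNIST the same rule is what reduces the 784 pixel variables to the 154 components reported above.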


### Random Forest Classifier using identified Principal Components

#### Creating Test Train datasets using reduced dimensions

In [10]:
print(X2D.shape)

(70000, 154)


In [11]:
# Creating Train and Test sets

X2D_train = X2D[:60000]

X2D_test = X2D[60000:]

#### Building the Model

In [12]:
# Fitting a Random Forest on the 154 components
# Note: max_features = 154 considers every component at each split, unlike 'sqrt' in Model 1

start_time = time.time()

rf2 = RandomForestRegressor(max_features = 154,
                           random_state=RANDOM_SEED,
                           n_estimators=10,
                           bootstrap = True)

pipeline2 = make_pipeline(scaler, rf2)

pipeline2.fit(X2D_train.astype(float), y_train)

print("--- Total Time: %s seconds ---" % round((time.time() - start_time), 2))

--- Total Time: 124.81 seconds ---


#### Predicting on Test Data

In [13]:
y_test_predict2 = pipeline2.predict(X2D_test.astype(float)).astype(int)

#### Evaluating the model

In [14]:
# Importing Precision, Recall and F1 scores

from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import f1_score

print("Precision Score: {:.3f}".format(precision_score(y_test,
                                                       y_test_predict2,
                                                       average='macro')))

print("Recall Score: {:.3f}".format(recall_score(y_test,
                                                 y_test_predict2,
                                                 average='macro')))

print("F1 Score: {:.3f}".format(f1_score(y_test,
                                         y_test_predict2,
                                         average='macro')))

Precision Score: 0.648
Recall Score: 0.603
F1 Score: 0.600


The Precision and Recall are both above 0.6 for this model and the F1 score is 0.600. This indicates that the model performs well above chance on the test data.

### Comparison of Performances between Model 1 (Random Forest with all components) and Model 2 (Random Forest with PCA)

The F1 Score of Model 2 is 0.600, which is better than Model 1's 0.577, and the Precision and Recall scores are also better for Model 2. Therefore, Model 2 performs better than Model 1 on the test data.

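Model 1 took 5.65 seconds whereas Model 2 took (12.74 + 124.81) = 137.55 seconds, which is roughly 24 times longer. Therefore, Model 2 is more time consuming than Model 1. Note that the two forests were not configured identically: Model 1 used max_features = 'sqrt' while Model 2 used max_features = 154, so part of the gap comes from Model 2 evaluating every component at each split rather than from PCA itself.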
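The arithmetic behind this comparison can be checked in a few lines. The timings are the ones printed in the cells above. The features-per-split figures follow from the `max_features` settings used for the two forests, with `'sqrt'` resolving to the integer part of the square root of the number of input features. The variable names are illustrative.

```python
import math

# Timings reported in the notebook output (seconds)
time_model1 = 5.65
time_pca = 12.74
time_rf_on_pca = 124.81

time_model2 = time_pca + time_rf_on_pca
slowdown = time_model2 / time_model1

# Candidate features examined at each split
features_per_split_model1 = math.isqrt(784)  # max_features='sqrt' on 784 pixels
features_per_split_model2 = 154              # max_features=154 on 154 components

print("Model 2 total: {:.2f} s".format(time_model2))
print("Slowdown: {:.1f}x".format(slowdown))
print("Features per split:", features_per_split_model1,
      "vs", features_per_split_model2)
```

Model 2 examines 154 candidate features at each split against 28 for Model 1, so the timing gap reflects the forest settings as well as the cost of PCA.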

### Identification of Design Flaw and Rerun of the Experiment

The design flaw in the above experiment is that PCA was fit on the full dataset (Train and Test together), whereas the Random Forest was trained on the Train set only. This lets information from the Test set leak into the features used for training. PCA should instead be fit on the Train set only, and the Test set should then be transformed with that fitted PCA.
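One way to rule out this kind of leakage is to put PCA inside the same pipeline as the model, so that calling `fit` only ever sees training data. The following is a minimal sketch under stated assumptions: it uses scikit-learn's built-in digits dataset (`load_digits`) as a small stand-in for MNIST, a `RandomForestClassifier` rather than the regressor used in this notebook, and illustrative variable names.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

RANDOM_SEED = 42

# Stand-in data (assumption): 8x8 digit images instead of MNIST
X_small, y_small = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_small, y_small, test_size=0.2,
    random_state=RANDOM_SEED, stratify=y_small)

# PCA, scaling and the forest are fitted together on the train split only
leak_free = make_pipeline(
    PCA(n_components=0.95),
    StandardScaler(),
    RandomForestClassifier(n_estimators=10, random_state=RANDOM_SEED))
leak_free.fit(X_tr, y_tr)

# predict() applies the already-fitted PCA and scaler to the test split
y_pred = leak_free.predict(X_te)
n_kept = leak_free.named_steps['pca'].n_components_
print("Components kept:", n_kept)
```

Keeping every fitted step inside one pipeline also means the timing of `fit` covers PCA and the forest together, which simplifies the comparison with a model trained on the raw variables.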

#### PCA

In [15]:
# Start Time

start_time = time.time()

# Fitting PCA on the Train set only, then transforming the Test set

X2D_train = pca.fit_transform(X_train.astype(float))

X2D_test = pca.transform(X_test.astype(float))

# End Time

print("\n--- Total Time: %s seconds ---" % round((time.time() - start_time),2))

print("\nNumber of Principal components: %s" % pca.n_components_)


--- Total Time: 11.03 seconds ---

Number of Principal components: 154


#### Random Forest

In [16]:
# Fitting a Random Forest on the 154 components (PCA fitted on the Train set only)

start_time = time.time()

rf3 = RandomForestRegressor(max_features = 154,
                           random_state=RANDOM_SEED,
                           n_estimators=10,
                           bootstrap = True)

pipeline2 = make_pipeline(scaler, rf3)

pipeline2.fit(X2D_train.astype(float), y_train)

print("--- Total Time: %s seconds ---" % round((time.time() - start_time), 2))

--- Total Time: 122.37 seconds ---


#### Predicting on Test Data

In [17]:
y_test_predict3 = pipeline2.predict(X2D_test.astype(float)).astype(int)

#### Evaluating the model

In [18]:
# Importing Precision, Recall and F1 scores

from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import f1_score

print("Precision Score: {:.3f}".format(precision_score(y_test,
                                                       y_test_predict3,
                                                       average='macro')))

print("Recall Score: {:.3f}".format(recall_score(y_test,
                                                 y_test_predict3,
                                                 average='macro')))

print("F1 Score: {:.3f}".format(f1_score(y_test,
                                         y_test_predict3,
                                         average='macro')))

Precision Score: 0.654
Recall Score: 0.611
F1 Score: 0.609


The Precision and Recall are both above 0.6 for this model and the F1 score is 0.609, which is slightly better than both models in the previous experiment. The time taken, (11.03 + 122.37) = 133.40 seconds, is also slightly less than the 137.55 seconds of Model 2.

### Management Problem

Based on the F1 Score, Precision and Recall performance measures, I would recommend using PCA before ML methods even though PCA + ML took longer than ML alone. PCA compresses the 784 pixel variables into 154 components while retaining 95% of the variance, and in this experiment that led to better test scores. Note that our analysis used a very small number of estimators (10) in the Random Forest. Increasing the number of estimators (Decision Trees) would increase the time taken, and it is possible that the pure ML model could then take more time than the PCA + ML model.