<a href="https://colab.research.google.com/github/dvisionst/PCA_Exercise/blob/main/PCA_Exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PCA Exercise
- Jose Flores
- 23 August 2022

In [1]:
# Imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.decomposition import PCA
from sklearn.datasets import fetch_openml
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, classification_report,
                             ConfusionMatrixDisplay)

In [2]:
# load the dataset
mnist = fetch_openml('mnist_784')
# view the shape of the dataset
mnist.data.shape


(70000, 784)

## 2. Prepare the Data

Prepare the data for modeling.  Scale and apply PCA to your data, while retaining 95% of the variance.  Be sure not to leak information.
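One way to read "do not leak information": the scaling statistics and the principal components should be learned from the training split only, then applied unchanged to the test split. The sketch below illustrates this on scikit-learn's small bundled digits dataset (`load_digits`, an assumption made here so the example runs without downloading MNIST); the `demo_` and `X_tr`/`X_te` names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Small stand-in for MNIST (8x8 digit images, 64 features); an assumption for this sketch.
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Learn the scaling statistics and the principal components from the training split only.
demo_scaler = StandardScaler().fit(X_tr)
demo_pca = PCA(n_components=0.95).fit(demo_scaler.transform(X_tr))

# Apply the already-fitted transformers to the test split; nothing is refit here.
X_te_reduced = demo_pca.transform(demo_scaler.transform(X_te))

# The scaler's means come from the training rows, not from the full dataset.
print(np.allclose(demo_scaler.mean_, X_tr.mean(axis=0)))
print(X_te_reduced.shape)
```

Wrapping the same steps in `make_pipeline`, as the cells below do, gives this behavior automatically, because `fit` only sees the data passed to it.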

In [3]:
# Storing the features matrix and target vector in X & y variables
X = mnist.data
y = mnist.target

# Train test split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [4]:
# Instantiating a standard scaler
scaler = StandardScaler()

# Creating a PCA model that keeps enough components to retain 95% of the variance
pca = PCA(n_components=.95)
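
When `n_components` is a float between 0 and 1, PCA keeps the smallest number of components whose cumulative explained-variance ratio exceeds that fraction. A minimal sketch of that selection, using scikit-learn's bundled digits data as an assumed stand-in for MNIST (the variable names are illustrative):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_demo, _ = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X_demo)

# Fit all components and accumulate the share of variance they explain.
full_pca = PCA().fit(X_scaled)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Number of components needed for the running total to pass 95%.
k_manual = int(np.searchsorted(cumulative, 0.95, side="right")) + 1

# Let PCA make the same choice from the float argument.
auto_pca = PCA(n_components=0.95).fit(X_scaled)

print(k_manual, auto_pca.n_components_)
```

After fitting, the number of components actually kept is available as `n_components_` on the PCA object.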


## 3. Create 2 KNN models

a. One that uses the PCA transformed data to predict which number each image shows.

In [12]:
# Instantiating the KNN model
knn = KNeighborsClassifier()

In [22]:
%%time
# Creating a pipeline that scales the data, applies PCA, then classifies with KNN
knn_pca_pipe = make_pipeline(scaler, pca, knn)
# Fitting the pipeline on the training data set
knn_pca_pipe.fit(X_train, y_train)

CPU times: user 13 s, sys: 1.3 s, total: 14.3 s
Wall time: 8.08 s


Pipeline(steps=[('standardscaler', StandardScaler()),
                ('pca', PCA(n_components=0.95)),
                ('kneighborsclassifier', KNeighborsClassifier())])

b. One that uses the original data, without the PCA transformation (but, remember you still need to scale the data!)

In [15]:
%%time
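# Creating a KNN pipeline without PCA and fitting it on the training data set.
# Fresh StandardScaler and KNeighborsClassifier instances are used so that this
# fit does not overwrite the fitted steps inside knn_pca_pipe.
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
knn_pipe.fit(X_train, y_train)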

CPU times: user 348 ms, sys: 3.09 ms, total: 351 ms
Wall time: 352 ms


Pipeline(steps=[('standardscaler', StandardScaler()),
                ('kneighborsclassifier', KNeighborsClassifier())])
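
`make_pipeline` stores the estimator objects it receives rather than copies of them, so two pipelines built from the same `scaler` or `knn` variable share fitted state: fitting one refits the step inside the other. The sketch below shows the effect on scikit-learn's bundled digits data (an assumed stand-in for MNIST) and one way around it with `sklearn.base.clone`; the `shared_knn` and `pipe_` names are illustrative.

```python
from sklearn.base import clone
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_digits(return_X_y=True)

shared_knn = KNeighborsClassifier()
pipe_pca = make_pipeline(StandardScaler(), PCA(n_components=0.95), shared_knn)
pipe_raw = make_pipeline(StandardScaler(), shared_knn)

# Both pipelines hold the very same classifier object.
same_object = pipe_pca.steps[-1][1] is pipe_raw.steps[-1][1]

pipe_pca.fit(X_demo, y_demo)
features_after_pca_fit = shared_knn.n_features_in_   # number of PCA components
pipe_raw.fit(X_demo, y_demo)
features_after_raw_fit = shared_knn.n_features_in_   # all original features

# clone() returns an unfitted copy with the same parameters, so state is not shared.
independent_pipe = make_pipeline(StandardScaler(), clone(shared_knn))
independent = independent_pipe.steps[-1][1] is not shared_knn

print(same_object, features_after_pca_fit, features_after_raw_fit, independent)
```

After the second `fit` in this sketch, calling `pipe_pca.predict` would pass PCA-reduced data to a classifier trained on the full feature set, which is why each pipeline is better given its own estimator instances.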

## 4. Evaluate and compare the models

Use separate cells to make predictions using each model.  Include the cell magic command: `%%time` at the top of your cells when making predictions to see which model can create predictions faster, the one trained on PCA data or the one trained on non-PCA data.  Evaluate both models using multiple appropriate metrics.



`%%time` prints, beneath the cell, the CPU time and the wall-clock time the cell took to run.
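
`%%time` reports two kinds of measurement: CPU time (processor time consumed by the process) and wall time (elapsed real time). Outside a notebook, roughly the same two numbers can be collected with the standard library. This is only a sketch, and the helper name `time_call` is hypothetical.

```python
import time

def time_call(func, *args, **kwargs):
    """Run func once and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # elapsed real time
    cpu_start = time.process_time()    # CPU time used by this process
    result = func(*args, **kwargs)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, wall_seconds, cpu_seconds

# Example: time a small pure-Python computation.
total, wall, cpu = time_call(sum, range(1_000_000))
print(f"Wall time: {wall:.3f} s, CPU time: {cpu:.3f} s")
```

CPU time can exceed wall time when the work runs on several cores at once, as in the prediction cells in this notebook (for example, 44.3 s of CPU time against 29.4 s of wall time).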

In [24]:
%%time
# Making predictions with the PCA KNN model and displaying processing time
knn_pca_pred = knn_pca_pipe.predict(X_test)

CPU times: user 43.2 s, sys: 1.11 s, total: 44.3 s
Wall time: 29.4 s


In [17]:
%%time
# Timing the KNN-only model making predictions on the test data
knn_pred = knn_pipe.predict(X_test)

CPU times: user 1min 18s, sys: 1.3 s, total: 1min 19s
Wall time: 49.3 s


In [16]:
# Checking accuracy of model without PCA

knn_acc = round(knn_pipe.score(X_test, y_test), 3)
knn_acc

0.944

In [26]:
# Checking the accuracy of the KNN with PCA
knn_pca_acc = round(knn_pca_pipe.score(X_test, y_test), 3)
knn_pca_acc

0.948

In [28]:
# Creating a classification report for KNN with PCA to see 
#  how it performs in other metrics

test_report_pca = classification_report(y_test, knn_pca_pred)
print("Classification Report for Testing set With PCA\n", test_report_pca)

Classification Report for Testing set With PCA
               precision    recall  f1-score   support

           0       0.96      0.98      0.97      1714
           1       0.96      0.99      0.97      1977
           2       0.95      0.94      0.94      1761
           3       0.94      0.94      0.94      1806
           4       0.94      0.94      0.94      1587
           5       0.95      0.93      0.94      1607
           6       0.96      0.98      0.97      1761
           7       0.94      0.93      0.94      1878
           8       0.97      0.90      0.93      1657
           9       0.91      0.93      0.92      1752

    accuracy                           0.95     17500
   macro avg       0.95      0.95      0.95     17500
weighted avg       0.95      0.95      0.95     17500
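
To compare reports programmatically rather than by eye, `classification_report` can return a dictionary through its `output_dict=True` argument. A small sketch with made-up labels (not the MNIST predictions):

```python
from sklearn.metrics import classification_report

# Toy string labels; these values are invented for illustration.
y_true_demo = ["0", "1", "1", "2"]
y_pred_demo = ["0", "1", "2", "2"]

report = classification_report(y_true_demo, y_pred_demo, output_dict=True)

# Per-class entries are keyed by label; summary rows by their printed names.
print(report["1"]["recall"])
print(report["accuracy"])
print(report["macro avg"]["recall"])
```

The same keys (`"accuracy"`, `"macro avg"`, `"weighted avg"`) appear as the summary rows of the printed report.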



In [19]:
# Creating a classification report for the model without PCA to see 
#  how it performs in other metrics
test_report = classification_report(y_test, knn_pred)
print('Classification Report for Testing Set\n', test_report)

Classification Report for Testing Set
               precision    recall  f1-score   support

           0       0.96      0.98      0.97      1714
           1       0.95      0.99      0.97      1977
           2       0.95      0.93      0.94      1761
           3       0.93      0.94      0.94      1806
           4       0.94      0.93      0.94      1587
           5       0.94      0.93      0.94      1607
           6       0.96      0.97      0.97      1761
           7       0.94      0.93      0.93      1878
           8       0.97      0.89      0.93      1657
           9       0.90      0.92      0.91      1752

    accuracy                           0.94     17500
   macro avg       0.94      0.94      0.94     17500
weighted avg       0.94      0.94      0.94     17500
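
The `macro avg` row is the unweighted mean of the ten per-class scores, so it can be re-derived from the printed tables. A quick check using the recall columns of the two reports above (the values are copied from the printed, already rounded output, so the result is approximate):

```python
from statistics import mean

# Per-class recall for digits 0-9, copied from the two reports above.
recall_with_pca = [0.98, 0.99, 0.94, 0.94, 0.94, 0.93, 0.98, 0.93, 0.90, 0.93]
recall_without_pca = [0.98, 0.99, 0.93, 0.94, 0.93, 0.93, 0.97, 0.93, 0.89, 0.92]

macro_with_pca = mean(recall_with_pca)
macro_without_pca = mean(recall_without_pca)

print(round(macro_with_pca, 2), round(macro_without_pca, 2))
```

The `weighted avg` row differs only in that each class is weighted by its support before averaging.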



## 5. Answer the following questions in text:

a. Which model performed the best on the test set?

b. Which model was the fastest at making predictions?

- a:

The KNN model with PCA performed slightly better than the model without it: 94.8% accuracy versus 94.4%. The margin is very thin, and it can be argued that the two models perform the same. However, the model with PCA also has better macro averages for precision, recall, and F1 score: 0.95 for each, against 0.94 for the model without PCA.
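
To put the margin in concrete terms, the gap between the two accuracies can be converted into a number of test images. This is a rough figure, since both accuracies were rounded to three decimals before printing.

```python
n_test = 17500           # test-set size, from the support column of the reports
acc_with_pca = 0.948     # rounded accuracy of the PCA pipeline
acc_without_pca = 0.944  # rounded accuracy of the pipeline without PCA

# Approximate number of additional images the PCA pipeline classified correctly.
extra_correct = round((acc_with_pca - acc_without_pca) * n_test)
print(extra_correct)
```

That is about 70 of the 17,500 test images, or 0.4 percentage points.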


- b:

The model with PCA was faster: it made predictions in 29.4 seconds of wall time, while the model without PCA took 49.3 seconds.
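
The size of the speed difference follows directly from the two wall times reported by `%%time`. Each figure comes from a single timed run, so the ratio is indicative rather than a careful benchmark.

```python
wall_without_pca = 49.3  # seconds, prediction without PCA
wall_with_pca = 29.4     # seconds, prediction with PCA

speedup = wall_without_pca / wall_with_pca          # how many times faster
time_saved = 1 - wall_with_pca / wall_without_pca   # fraction of wall time saved

print(f"{speedup:.2f}x faster, {time_saved:.0%} less wall time")
```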

Overall, the model that used PCA made predictions significantly faster than the model without it, taking roughly 40% less wall time (29.4 s versus 49.3 s).
It did so without losing predictive performance: the PCA model slightly outperformed the non-PCA model in accuracy and matched or exceeded it on every per-class metric in the classification reports.
For this data set, at least, adding PCA speeds up prediction without sacrificing accuracy.