
# Digit Classification
- Andrea Cohen
- 01.30.23

## Task:
  - to perform PCA to speed up a classification algorithm on a high-dimensional dataset
  - to answer the following questions:
    - Which model performed the best on the test set?
    - Which model was the fastest at making predictions?

## Data Source:
  - https://en.wikipedia.org/wiki/MNIST_database

## Load the Data

In [None]:
#import libraries
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
from sklearn import set_config
set_config(display='diagram')

In [None]:
#load the data
mnist = fetch_openml('mnist_784')
# view the shape of the dataset
mnist.data.shape

(70000, 784)

  - There are 70000 rows (images) and 784 columns (features, one per pixel of a 28x28 image).
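
Each row is a 28x28 grayscale image flattened into 784 values. The sketch below illustrates that layout with a synthetic row rather than a real MNIST image, assuming row-major pixel order; the names `flat_row` and `image` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np

# Synthetic stand-in for one row of the dataset: 784 pixel values
flat_row = np.arange(784)

# Reshape the flat row back into a 28x28 grid (row-major order assumed)
image = flat_row.reshape(28, 28)

print(image.shape)
# The pixel at grid row 1, column 0 is element 28 of the flat row
print(image[1, 0] == flat_row[28])
```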

### Access the X data and the target

In [None]:
X = pd.DataFrame(mnist.data)
y = np.array(mnist.target)

### Inspect the data

In [None]:
# check for duplicated rows
X.duplicated().sum()

0

  - There are 0 duplicates.

In [None]:
# check for missing values
X.isna().sum().sum()

0

  - There are 0 missing values.

## Prepare the Data

### Split the data into training and testing sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

### Scale the data and apply PCA to the data, while retaining 95% of the variance, using a pipeline

In [None]:
scaler = StandardScaler()
pca = PCA(n_components = .95, random_state = 42)
transformer = make_pipeline(scaler, pca)
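
Passing a float between 0 and 1 as `n_components` asks PCA to keep the smallest number of components whose cumulative explained variance reaches that fraction. As a rough sketch of how the result could be inspected, the example below fits the same kind of pipeline on scikit-learn's small bundled digits dataset (8x8 images, 64 features) as a stand-in for MNIST, so the component count it reports will not match the MNIST run. The names `demo_transformer` and `X_small` are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in dataset (assumption: used here only to keep the demo fast)
X_small, _ = load_digits(return_X_y=True)

# Same structure as the transformer above: scale, then keep 95% of the variance
demo_transformer = make_pipeline(
    StandardScaler(), PCA(n_components=0.95, random_state=42)
)
X_reduced = demo_transformer.fit_transform(X_small)

# make_pipeline names each step after its lowercased class name
fitted_pca = demo_transformer.named_steps["pca"]
n_kept = fitted_pca.n_components_
variance_kept = fitted_pca.explained_variance_ratio_.sum()

print(f"{X_small.shape[1]} features -> {n_kept} components")
print(f"explained variance retained: {variance_kept:.3f}")
```

Running the same two attribute lookups on the fitted MNIST pipeline would show how many of the 784 features remain after the transformation.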

## KNN Model using the PCA-transformed data to predict which number each image shows

In [None]:
%%time
#put the PCA pipeline in another pipeline with a KNN Classifier
#create a modeling pipeline
knn_pca_pipe = make_pipeline(transformer, KNeighborsClassifier())
knn_pca_pipe.fit(X_train, y_train)
#make predictions
knn_pca_pipe_predictions = knn_pca_pipe.predict(X_test)

CPU times: user 1min 14s, sys: 8.05 s, total: 1min 22s
Wall time: 1min 4s


## KNN Model using the original data, without the PCA transformation

In [None]:
%%time
#create a modeling pipeline
#use a fresh scaler so this pipeline does not share a step with the PCA pipeline
knn_orig_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
knn_orig_pipe.fit(X_train, y_train)
#make predictions
knn_orig_pipe_predictions = knn_orig_pipe.predict(X_test)

CPU times: user 1min 43s, sys: 1.86 s, total: 1min 45s
Wall time: 1min 8s


## Evaluate and compare the models

In [None]:
#print classification reports for the testing data
print('KNN Classification Report for Testing Data Using PCA Transformation')
print(classification_report(y_test, knn_pca_pipe_predictions))
print('KNN Classification Report for Testing Data Without PCA Transformation')
print(classification_report(y_test, knn_orig_pipe_predictions))

KNN Classification Report for Testing Data Using PCA Transformation
              precision    recall  f1-score   support

           0       0.96      0.98      0.97      1714
           1       0.96      0.99      0.97      1977
           2       0.95      0.94      0.94      1761
           3       0.94      0.94      0.94      1806
           4       0.94      0.94      0.94      1587
           5       0.95      0.93      0.94      1607
           6       0.96      0.98      0.97      1761
           7       0.94      0.93      0.94      1878
           8       0.97      0.90      0.93      1657
           9       0.91      0.93      0.92      1752

    accuracy                           0.95     17500
   macro avg       0.95      0.95      0.95     17500
weighted avg       0.95      0.95      0.95     17500

KNN Classification Report for Testing Data Without PCA Transformation
              precision    recall  f1-score   support

           0       0.96      0.98      0.97     

## Which model performed the best on the test set?

  - The KNN model with PCA-transformed data performed better on the test set. Its accuracy was 95%, and its macro-average precision, recall (sensitivity), and f1-score were each 95%.
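
To make the macro averages concrete, the sketch below computes per-class precision and recall from a confusion matrix and then takes their unweighted means. It uses a tiny hand-made set of labels, not the MNIST predictions, and cross-checks the result against scikit-learn's `precision_score` and `recall_score`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (illustrative only, not taken from the models above)
y_true_demo = ["0", "0", "1", "1", "1", "2"]
y_pred_demo = ["0", "1", "1", "1", "2", "2"]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=["0", "1", "2"])
true_positives = np.diag(cm)

# Precision: of everything predicted as a class, the share that was correct
precision_per_class = true_positives / cm.sum(axis=0)
# Recall: of everything truly in a class, the share that was found
recall_per_class = true_positives / cm.sum(axis=1)

# Macro average: unweighted mean over classes
macro_precision = precision_per_class.mean()
macro_recall = recall_per_class.mean()

print(f"macro precision: {macro_precision:.3f}")
print(f"macro recall:    {macro_recall:.3f}")

# Cross-check against scikit-learn's own macro averaging
sk_precision = precision_score(y_true_demo, y_pred_demo, average="macro")
sk_recall = recall_score(y_true_demo, y_pred_demo, average="macro")
```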

## Which model was the fastest at making predictions?

  - The KNN model with PCA-transformed data was faster at fitting and making predictions combined. Its total CPU time was 1 minute 22 seconds, 23 seconds less than the KNN model with the original data (1 minute 45 seconds).

In [None]:
%%time
knn_pca_pipe_predictions = knn_pca_pipe.predict(X_test)

CPU times: user 56.4 s, sys: 1.88 s, total: 58.3 s
Wall time: 42 s


In [None]:
%%time
knn_orig_pipe_predictions = knn_orig_pipe.predict(X_test)

CPU times: user 1min 43s, sys: 2.57 s, total: 1min 46s
Wall time: 1min 10s


  - The KNN model with PCA-transformed data made its predictions in 58.3 seconds of total CPU time, 47.7 seconds less than the KNN model with the original data (1 minute 46 seconds). In wall time, it took 42 seconds versus 1 minute 10 seconds.
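
Outside a notebook, where `%%time` is unavailable, prediction time can be measured on its own with `time.perf_counter`. The following is a minimal sketch under stated assumptions: `time_predict` is a hypothetical helper, and the small bundled digits dataset stands in for MNIST, so the timings it prints say nothing about the MNIST results above.

```python
import time

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def time_predict(model, X):
    """Hypothetical helper: return (predictions, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    predictions = model.predict(X)
    return predictions, time.perf_counter() - start


# Small stand-in dataset (assumption: used only to keep the demo fast)
X_small, y_small = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_small, y_small, random_state=42)

models = {
    "with PCA": make_pipeline(
        StandardScaler(),
        PCA(n_components=0.95, random_state=42),
        KNeighborsClassifier(),
    ),
    "without PCA": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

timings = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)  # fitting is deliberately left out of the timing
    preds, seconds = time_predict(model, X_te)
    timings[name] = seconds
    print(f"{name}: {seconds:.4f} s to predict {len(preds)} rows")
```

Because only the `predict` call sits between the two clock readings, this measures wall-clock prediction time alone, unlike a `%%time` cell that also includes fitting.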