<a href="https://colab.research.google.com/github/andreacohen7/digit-classification/blob/main/Digit_Classification_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Digit Classification
- Andrea Cohen
- 01.30.23

## Task:
  - to perform PCA to speed up a classification algorithm on a high-dimensional dataset
  - to answer the following questions:
    - Which model performed the best on the test set?
    - Which model was the fastest at making predictions?

## Data Source:
  - https://en.wikipedia.org/wiki/MNIST_database

## Load the Data

In [1]:
#import libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
from sklearn import set_config
set_config(display='diagram')

In [2]:
#load the data
mnist = fetch_openml('mnist_784')
# view the shape of the dataset
mnist.data.shape

(70000, 784)

  - There are 70000 rows (images) and 784 columns (dimensions).

### Access the X data and the target

In [3]:
X = pd.DataFrame(mnist.data)
y = np.array(mnist.target)

### Inspect the data

In [4]:
# check for duplicated rows
X.duplicated().sum()

0

  - There are 0 duplicates.

In [5]:
# check for missing values
X.isna().sum().sum()

0

  - There are 0 missing values.

## Prepare the Data

### Split the data into training and testing sets

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

### Scale the data and apply PCA to the data, while retaining 95% of the variance, using a pipeline

In [7]:
scaler = StandardScaler()
pca = PCA(n_components = .95, random_state = 42)
transformer = make_pipeline(scaler, pca)

## KNN Model using the PCA-transformed data to predict which number each image shows

### KNN with default parameters

In [8]:
%%time
#put the PCA pipeline in another pipeline with a KNN Classifier
#create a modeling pipeline
knn_pca_pipe = make_pipeline(transformer, KNeighborsClassifier())
knn_pca_pipe.fit(X_train, y_train)
#make predictions
knn_pca_pipe_predictions = knn_pca_pipe.predict(X_test)

CPU times: user 1min 6s, sys: 6.42 s, total: 1min 13s
Wall time: 51 s


### Tuning hyperparameters using GridSearchCV and pipelines

#### Choose parameters for tuning

In [9]:
knn_pca_pipe.get_params()

{'memory': None,
 'steps': [('pipeline', Pipeline(steps=[('standardscaler', StandardScaler()),
                   ('pca', PCA(n_components=0.95, random_state=42))])),
  ('kneighborsclassifier', KNeighborsClassifier())],
 'verbose': False,
 'pipeline': Pipeline(steps=[('standardscaler', StandardScaler()),
                 ('pca', PCA(n_components=0.95, random_state=42))]),
 'kneighborsclassifier': KNeighborsClassifier(),
 'pipeline__memory': None,
 'pipeline__steps': [('standardscaler', StandardScaler()),
  ('pca', PCA(n_components=0.95, random_state=42))],
 'pipeline__verbose': False,
 'pipeline__standardscaler': StandardScaler(),
 'pipeline__pca': PCA(n_components=0.95, random_state=42),
 'pipeline__standardscaler__copy': True,
 'pipeline__standardscaler__with_mean': True,
 'pipeline__standardscaler__with_std': True,
 'pipeline__pca__copy': True,
 'pipeline__pca__iterated_power': 'auto',
 'pipeline__pca__n_components': 0.95,
 'pipeline__pca__random_state': 42,
 'pipeline__pca__svd_sol

#### Create a parameter grid dictionary

In [10]:
pipe_param_grid1 = {'kneighborsclassifier__n_neighbors': [1, 3, 5, 7],
              'kneighborsclassifier__weights': ['distance', 'uniform']}

#### Instantiate the GridSearchCV Class.

In [11]:
knn_pca_pipe_gs = GridSearchCV(knn_pca_pipe, pipe_param_grid1)

#### Fit the GridSearchCV on the Training Data.

In [12]:
%%time
knn_pca_pipe_gs.fit(X_train, y_train)

CPU times: user 23min 55s, sys: 2min 22s, total: 26min 18s
Wall time: 17min 15s


#### Find the best parameters

In [13]:
print('Best KNN-PCA Parameters:')
print(knn_pca_pipe_gs.best_params_)
best_knn_pca_pipe = knn_pca_pipe_gs.best_estimator_
print(f'Accuracy of best KNN-PCa model is: {best_knn_pca_pipe.score(X_test, y_test)}')

Best KNN-PCA Parameters:
{'kneighborsclassifier__n_neighbors': 3, 'kneighborsclassifier__weights': 'distance'}
Accuracy of best KNN-PCa model is: 0.9502285714285714


### Final tuned KNN Model using PCA-transformed data

In [14]:
%%time
knn_pca_tuned = KNeighborsClassifier(n_neighbors = 3, weights = 'distance')
knn_pca_tuned_pipe = make_pipeline(scaler, knn_pca_tuned)
knn_pca_tuned_pipe.fit(X_train, y_train)
test_preds_knn_pca_tuned = knn_pca_tuned_pipe.predict(X_test)

CPU times: user 1min 26s, sys: 5.93 s, total: 1min 32s
Wall time: 58.6 s


## KNN Model using the original data, without the PCA transformation

In [15]:
%%time
#create a modeling pipeline
knn_orig_pipe = make_pipeline(scaler, KNeighborsClassifier())
knn_orig_pipe.fit(X_train, y_train)
#make predictions
knn_orig_pipe_predictions = knn_orig_pipe.predict(X_test)

CPU times: user 1min 33s, sys: 5.5 s, total: 1min 38s
Wall time: 1min 4s


### Tuning hyperparameters using GridSearchCV and pipelines

#### Choose parameters for tuning

In [16]:
knn_orig_pipe.get_params()

{'memory': None,
 'steps': [('standardscaler', StandardScaler()),
  ('kneighborsclassifier', KNeighborsClassifier())],
 'verbose': False,
 'standardscaler': StandardScaler(),
 'kneighborsclassifier': KNeighborsClassifier(),
 'standardscaler__copy': True,
 'standardscaler__with_mean': True,
 'standardscaler__with_std': True,
 'kneighborsclassifier__algorithm': 'auto',
 'kneighborsclassifier__leaf_size': 30,
 'kneighborsclassifier__metric': 'minkowski',
 'kneighborsclassifier__metric_params': None,
 'kneighborsclassifier__n_jobs': None,
 'kneighborsclassifier__n_neighbors': 5,
 'kneighborsclassifier__p': 2,
 'kneighborsclassifier__weights': 'uniform'}

#### Create a parameter grid dictionary

In [17]:
pipe_param_grid2 = {'kneighborsclassifier__n_neighbors': [1, 3, 5, 7],
              'kneighborsclassifier__weights': ['distance','uniform']}

#### Instantiate the GridSearchCV Class.

In [18]:
knn_orig_pipe_gs = GridSearchCV(knn_orig_pipe, pipe_param_grid2)

#### Fit the GridSearchCV on the Training Data.

In [19]:
%%time
knn_orig_pipe_gs.fit(X_train, y_train)

CPU times: user 28min 41s, sys: 1min 54s, total: 30min 35s
Wall time: 19min 35s


#### Find the best parameters

In [20]:
print('Best KNN Parameters:')
print(knn_orig_pipe_gs.best_params_)
best_knn_orig_pipe = knn_orig_pipe_gs.best_estimator_
print(f'Accuracy of best KNN model is: {best_knn_orig_pipe.score(X_test, y_test)}')

Best KNN Parameters:
{'kneighborsclassifier__n_neighbors': 3, 'kneighborsclassifier__weights': 'distance'}
Accuracy of best KNN model is: 0.9466857142857142


### Final tuned KNN Model using original data

In [21]:
%%time
knn_orig_tuned = KNeighborsClassifier(n_neighbors = 3, weights = 'distance')
knn_orig_tuned_pipe = make_pipeline(scaler, knn_orig_tuned)
knn_orig_tuned_pipe.fit(X_train, y_train)
test_preds_knn_orig_tuned = knn_orig_tuned_pipe.predict(X_test)

CPU times: user 1min 25s, sys: 5.7 s, total: 1min 31s
Wall time: 58.8 s


## Evaluate and compare the models

In [22]:
#print classification reports for the testing data
print('KNN Classification Report for Testing Data Using PCA Transformation')
print(classification_report(y_test, test_preds_knn_pca_tuned))
print('KNN Classification Report for Testing Data Without PCA Transformation')
print(classification_report(y_test, test_preds_knn_orig_tuned))

KNN Classification Report for Testing Data Using PCA Transformation
              precision    recall  f1-score   support

           0       0.97      0.98      0.97      1714
           1       0.96      0.99      0.97      1977
           2       0.96      0.93      0.94      1761
           3       0.94      0.94      0.94      1806
           4       0.95      0.94      0.94      1587
           5       0.94      0.93      0.94      1607
           6       0.96      0.98      0.97      1761
           7       0.93      0.93      0.93      1878
           8       0.96      0.91      0.93      1657
           9       0.90      0.92      0.91      1752

    accuracy                           0.95     17500
   macro avg       0.95      0.95      0.95     17500
weighted avg       0.95      0.95      0.95     17500

KNN Classification Report for Testing Data Without PCA Transformation
              precision    recall  f1-score   support

           0       0.97      0.98      0.97     

## Which model performed the best on the test set?

  - Both KNN models (with the PCA-transformed data performed the same on the test set.  They were 95% accurate for making correct predictions.  The macro average precision (specificity) was 95%, the macro average recall (sensitivity) was 95%, and the macro average f1-score was 95%.


## Which model was the fastest at making predictions?

  - The KNN model with the original data was faster at fitting and making predictions. The total CPU time was 1 minute, 31 seconds, which was 1 second faster than the KNN model with the PCA-transformed data.

In [25]:
%%time
test_preds_knn_pca_tuned = knn_pca_tuned_pipe.predict(X_test)

CPU times: user 1min 26s, sys: 6.39 s, total: 1min 32s
Wall time: 1min 9s


In [26]:
%%time
test_preds_knn_orig_tuned = knn_orig_tuned_pipe.predict(X_test)

CPU times: user 1min 24s, sys: 5.85 s, total: 1min 30s
Wall time: 57.8 s


  - The KNN model with the original data was 2 seconds faster at making predictions than the KNN model with the PCA-transformed data.