# Principal Component Analysis on MNIST Data

Building on notebook 15, we apply PCA to the MNIST data without image-processing or deep-learning techniques.

## Data

MNIST ("Modified National Institute of Standards and Technology") is the de facto “hello world” dataset of computer vision. Since its release in 1999, this classic dataset of handwritten images has served as the basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike.

The [MNIST database of handwritten digits](http://yann.lecun.com/exdb/mnist/index.html) has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. The CSV file loaded in this notebook contains 42,000 labelled rows, each holding a label and 784 pixel values.

It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings(action='ignore')

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.model_selection import train_test_split

### Load the data

In [4]:
# Forward slashes work on Windows, macOS, and Linux alike
data = pd.read_csv('Data/16-mnist-train.csv')

In [6]:
data.shape

(42000, 785)

In [7]:
data.head()

,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### X / y Split

In [9]:
# Separate the target (label) from the 784 pixel features
y = data['label']
X = data.drop(['label'], axis=1)

### Train / Test Split

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=100)

### Standardizing the Data

In [11]:
scaler = StandardScaler()

# Fit on training set only
scaler.fit(X_train)

# Apply transform to both the training set and the test set
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
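
To make the "fit on training set only" step concrete, here is a minimal sketch on synthetic data (the arrays `X_tr` and `X_te` are hypothetical stand-ins, not the MNIST frames). It checks that `transform` reuses the training mean and standard deviation, and that a constant column, such as an always-zero border pixel, is left at zero rather than divided by zero.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 100 training rows, 20 test rows, 5 features
X_tr = rng.normal(loc=10.0, scale=3.0, size=(100, 5))
X_te = rng.normal(loc=10.0, scale=3.0, size=(20, 5))

# Make the first column constant, like a pixel that is 0 in every image
X_tr[:, 0] = 0.0
X_te[:, 0] = 0.0

# Statistics are learned from the training rows only
scaler = StandardScaler().fit(X_tr)

# transform() applies the *training* mean and scale to whatever it is given
manual = (X_te - scaler.mean_) / scaler.scale_
same_as_manual = np.allclose(scaler.transform(X_te), manual)

# Training columns end up with mean ~0; the constant column stays at 0
train_means = scaler.transform(X_tr).mean(axis=0)

print(same_as_manual)
print(scaler.scale_[0])  # scale used for the zero-variance column
```

Reusing the training statistics on the test set keeps information about the test data from leaking into preprocessing.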

### Create PCA with 95% PVE (proportion of variance explained)

In [12]:
# Keep the smallest number of components that together explain 95% of the variance
pca = PCA(n_components=0.95)

### Fit PCA on training data

In [13]:
# Fit PCA on the training set
pca.fit(X_train)

PCA(copy=True, iterated_power='auto', n_components=0.95, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

### Get principal components used

In [15]:
# Get n components
pca.n_components_

312
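
The component count reported above comes from the cumulative explained-variance ratio. The sketch below illustrates that selection rule on synthetic data (the matrix `X_demo` and the names `full`, `auto`, and `k` are hypothetical, chosen for illustration): it fits all components, accumulates the ratios, and finds the smallest count whose total exceeds 0.95.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 500 samples, 50 correlated features
latent = rng.normal(size=(500, 10))
mixing = rng.normal(size=(10, 50))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(500, 50))

# Fit every component, then accumulate the explained-variance ratios
full = PCA(svd_solver="full").fit(X_demo)
cumulative = np.cumsum(full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds 0.95
k = int(np.searchsorted(cumulative, 0.95, side="right") + 1)

# Passing a float in (0, 1) asks PCA to make the same choice itself
auto = PCA(n_components=0.95, svd_solver="full").fit(X_demo)

print(k, auto.n_components_)
```

Plotting `cumulative` against the component index is a common way to judge whether a threshold other than 0.95 would be a better trade-off.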

### Transform the train and test sets

In [16]:
# Transform train and test set
X_train = pca.transform(X_train)
X_test = pca.transform(X_test)
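
The scale, fit, and transform steps above can also be chained so that each step is fitted on the training data only. This is a minimal sketch rather than the notebook's own method: it assumes scikit-learn's bundled 8x8 `load_digits` data as a small stand-in for the MNIST CSV, and the names `pipe`, `n_kept`, and `test_accuracy` are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the small bundled digits dataset stands in for the MNIST CSV
X_d, y_d = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_d, y_d, test_size=0.25, random_state=100
)

# Each step is fitted on the training data; the test data is only transformed
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("clf", LogisticRegression(solver="lbfgs", max_iter=1000)),
])
pipe.fit(X_tr, y_tr)

n_kept = pipe.named_steps["pca"].n_components_
test_accuracy = pipe.score(X_te, y_te)
print(n_kept, test_accuracy)
```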

### Models
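- Random Forest (the code below uses `RandomForestRegressor`, which treats the digit labels as continuous numbers; `RandomForestClassifier` is the usual choice for a classification task like this one)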
- Logistic Regression
    - Solver: lbfgs
        - Limited-memory BFGS (L-BFGS or LM-BFGS) is an optimization algorithm in the family of quasi-Newton methods that approximates the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm using a limited amount of computer memory. It is a popular algorithm for parameter estimation in machine learning.

In [17]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression

In [18]:
lr = LogisticRegression(solver = 'lbfgs')
rf = RandomForestRegressor(random_state=10)

### Fit the Models

In [19]:
# Logistic Regression
lr.fit(X_train, y_train)

# Random Forest
rf.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=10, verbose=0, warm_start=False)

### Predict

In [23]:
lr.predict(X_train[0:10])

array([5, 1, 9, 5, 8, 1, 4, 3, 2, 7], dtype=int64)

In [25]:
# The regressor returns continuous values, not digit classes
rf.predict(X_train[0:10])

array([4.6, 1.6, 9. , 5. , 7.7, 5.9, 4. , 3.2, 2.4, 7. ])
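
The fractional values above follow from how a regression forest combines its trees. The sketch below uses made-up per-tree outputs (`tree_votes` is hypothetical, not taken from the fitted model) to contrast averaging with voting. Plain majority voting is a simplification: scikit-learn's classifier averages class probabilities across trees.

```python
import numpy as np

# Hypothetical per-tree predictions for one image (10 trees, as with n_estimators=10)
tree_votes = np.array([5, 5, 5, 5, 5, 5, 9, 9, 3, 8])

# A regression forest averages the trees' numeric outputs
regression_output = tree_votes.mean()

# A classification forest settles on a single class instead
# (simplified here to the most common vote)
classification_output = int(np.bincount(tree_votes).argmax())

print(regression_output)      # 5.9
print(classification_output)  # 5
```

An average such as 5.9 is not a valid digit, and it implies that a 9 is "closer" to a 5 than a 3 is to an 8, an ordering that has no meaning for class labels.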

### Evaluate Performance

In [27]:
# For a classifier, score() returns mean accuracy on the given data
lr_score = lr.score(X_test, y_test)
print(lr_score)

0.9113333333333333


In [28]:
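# For a regressor, score() returns R^2 (coefficient of determination), not accuracy,
# so this number is not directly comparable to the logistic regression score
rf_score = rf.score(X_test, y_test)
print(rf_score)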

0.8146554638127181
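
For a like-for-like comparison, both models need to be classifiers scored with the same metric. The following is a minimal sketch under stated assumptions: it uses scikit-learn's bundled 8x8 `load_digits` data as a small stand-in for the MNIST CSV, swaps in `RandomForestClassifier`, and introduces the names `lr_clf`, `rf_clf`, `lr_acc`, and `rf_acc` for illustration. Its numbers will differ from the scores reported above.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumption: the small bundled digits dataset stands in for the MNIST CSV
X_d, y_d = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_d, y_d, test_size=0.25, random_state=100
)

# Both models are classifiers, so both predict digit classes
lr_clf = LogisticRegression(solver="lbfgs", max_iter=5000).fit(X_tr, y_tr)
rf_clf = RandomForestClassifier(random_state=10).fit(X_tr, y_tr)

# Same metric for both: fraction of test images labelled correctly
lr_acc = accuracy_score(y_te, lr_clf.predict(X_te))
rf_acc = accuracy_score(y_te, rf_clf.predict(X_te))

print(f"logistic regression accuracy: {lr_acc:.3f}")
print(f"random forest accuracy:       {rf_acc:.3f}")
```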
