<h1 align="center"> PCA + Logistic Regression (MNIST) </h1>
* https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60
* https://github.com/mGalarnyk/Python_Tutorials/blob/master/Sklearn/PCA/PCA_to_Speed-up_Machine_Learning_Algorithms.ipynb
* https://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.
<br>
It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.

Parameters | Number
--- | ---
Classes | 10
Samples per class | ~7000
Samples total | 70000
Dimensionality | 784
Features | integer values from 0 to 255

The MNIST database of handwritten digits is available on the following website: [MNIST Dataset](http://yann.lecun.com/exdb/mnist/)

In [1]:
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.model_selection import train_test_split
import pandas as pd

## Download and Load the Data

In [3]:
# fetch_mldata has been removed from scikit-learn; fetch_openml replaces it.
# Pass the data_home parameter to choose where the data is downloaded.
# as_frame=False returns NumPy arrays; the labels come back as strings ('0' to '9').
mnist = fetch_openml('mnist_784', version=1, as_frame=False)

In [None]:
mnist

In [None]:
# These are the images
mnist.data.shape

In [None]:
# These are the labels
mnist.target.shape

## Splitting Data into Training and Test Sets

In [None]:
# test_size: what proportion of original data is used for test set
train_img, test_img, train_lbl, test_lbl = train_test_split(
    mnist.data, mnist.target, test_size=1/7.0, random_state=0)

In [None]:
print(train_img.shape)

In [None]:
print(train_lbl.shape)

In [None]:
print(test_img.shape)

In [None]:
print(test_lbl.shape)
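
With 70,000 images, `test_size=1/7.0` sets aside 10,000 of them for testing and leaves 60,000 for training. The sketch below shows the same arithmetic on a small stand-in array, so it runs without downloading MNIST. The array `X`, the labels `y`, and their sizes are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for MNIST: 70 rows instead of 70,000, 4 features instead of 784.
X = np.arange(70 * 4).reshape(70, 4)
y = np.arange(70) % 10

# A float test_size is a proportion of the data; the test count is rounded up.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/7.0, random_state=0)

# An integer test_size is an absolute number of test samples.
X_train_i, X_test_i, y_train_i, y_test_i = train_test_split(
    X, y, test_size=10, random_state=0)

print(X_train.shape, X_test.shape)
print(X_train_i.shape, X_test_i.shape)
```

Passing an integer (for example `test_size=10000` on the full dataset) is one way to avoid relying on a floating-point proportion.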

## Standardizing the Data

Since PCA yields a feature subspace that maximizes the variance along the axes, it makes sense to standardize the data, especially if the features were measured on different scales.

Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data.

Notebook going over the importance of feature scaling: http://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Fit on training set only.
scaler.fit(train_img)

# Apply transform to both the training set and the test set.
train_img = scaler.transform(train_img)
test_img = scaler.transform(test_img)
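
As a rough illustration of what the scaler does, the sketch below checks that `transform` applies z = (x - mean) / scale using statistics learned from the training rows only. The toy arrays are made up. The constant third column is meant to mimic a pixel that is always 0, for which `StandardScaler` uses a scale of 1 rather than dividing by zero.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: the last column is constant, like an always-black border pixel.
train = np.array([[0.0, 10.0, 0.0],
                  [2.0, 20.0, 0.0],
                  [4.0, 60.0, 0.0]])
test = np.array([[6.0, 30.0, 0.0]])

# fit learns mean_ and scale_ from the training rows only.
scaler = StandardScaler().fit(train)

# transform applies z = (x - mean_) / scale_ with the training statistics.
manual = (test - scaler.mean_) / scaler.scale_
scaled_test = scaler.transform(test)
scaled_train = scaler.transform(train)

print(scaled_train.mean(axis=0))  # approximately 0 for every column
print(scaled_train.std(axis=0))   # 1 for the varying columns, 0 for the constant one
print(np.allclose(manual, scaled_test))
```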

## PCA to Speed up Machine Learning Algorithms (Logistic Regression)

<b>Step 0:</b> Import and use PCA. After PCA, you will apply a machine learning algorithm of your choice to the transformed data.

In [None]:
from sklearn.decomposition import PCA

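Make an instance of the model. Passing .95 tells scikit-learn to keep the smallest number of principal components needed to retain 95% of the variance.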

In [None]:
pca = PCA(.95)

Fit PCA on training set. <b>Note: you are fitting PCA on the training set only</b>

In [None]:
pca.fit(train_img)

In [None]:
pca.n_components_
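
One way to see where `n_components_` comes from is to reproduce the choice by hand from the cumulative explained variance. The sketch below assumes scikit-learn's small bundled `load_digits` dataset (8x8 images, 64 features) as a stand-in for MNIST so that it runs quickly and offline. The scaler is fit on all rows here only because there is no train/test split in this illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: load_digits stands in for MNIST in this sketch.
X, _ = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Full PCA: one explained-variance ratio per component, in decreasing order.
full = PCA(svd_solver='full').fit(X)
cumulative = np.cumsum(full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds 0.95.
k = int(np.searchsorted(cumulative, 0.95, side='right')) + 1

# Letting PCA pick the number of components from the variance target.
pca = PCA(n_components=0.95, svd_solver='full').fit(X)

print(k, pca.n_components_)
print(pca.explained_variance_ratio_.sum())
```

The two printed counts should agree, and the retained components together should explain a little more than 95% of the variance.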

Apply the mapping (transform) to <b>both</b> the training set and the test set. 

In [None]:
train_img = pca.transform(train_img)
test_img = pca.transform(test_img)

<b>Step 1: </b> Import the model you want to use

In sklearn, all machine learning models are implemented as Python classes

In [None]:
from sklearn.linear_model import LogisticRegression

<b>Step 2:</b> Make an instance of the Model

In [None]:
# all parameters not specified are set to their defaults
# liblinear, the default solver before scikit-learn 0.22, is very slow on this data,
# so the solver is set explicitly to 'lbfgs' (the default since 0.22)
logisticRegr = LogisticRegression(solver='lbfgs')

<b>Step 3:</b> Train the model on the data. The fitted model stores the information learned from the data.

The model learns the relationship between x (digits) and y (labels).

In [None]:
logisticRegr.fit(train_img, train_lbl)
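
The scaling, PCA, and logistic regression steps can also be chained in a scikit-learn `Pipeline`, which fits every step on the training data only and reapplies the fitted transforms at prediction time. This is a sketch of an alternative arrangement, not what the notebook itself does. It assumes the small bundled `load_digits` dataset instead of MNIST, and the step names and `max_iter=1000` are choices made for this example.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: load_digits (8x8 images) stands in for MNIST to keep the sketch fast.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/7.0, random_state=0)

# Step names ('scale', 'pca', 'clf') are arbitrary labels chosen for this example.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(0.95)),
    ('clf', LogisticRegression(solver='lbfgs', max_iter=1000)),
])

# fit runs fit_transform on each transformer, then fits the classifier,
# all on the training data only.
pipe.fit(X_train, y_train)

# predict applies the already fitted scaler and PCA before classifying.
predictions = pipe.predict(X_test)
print(pipe.named_steps['pca'].n_components_)
print(predictions[:10])
```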

<b>Step 4:</b> Predict the labels of new data (new images)

This uses the information the model learned during training.

In [None]:
# Returns a NumPy Array
# Predict for One Observation (image)
logisticRegr.predict(test_img[0].reshape(1,-1))

In [None]:
# Predict for Multiple Observations (images) at Once
logisticRegr.predict(test_img[0:10])
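
The `reshape(1, -1)` call is needed because `predict` expects a 2D array of shape (n_samples, n_features), while indexing a single row returns a 1D array. The sketch below shows the shapes involved using a made-up array named `demo_img`, not the real transformed test set.

```python
import numpy as np

# Hypothetical stand-in for the transformed test set: 10 samples, 5 features.
demo_img = np.arange(50, dtype=float).reshape(10, 5)

one = demo_img[0]                     # 1D: a bare feature vector
one_2d = demo_img[0].reshape(1, -1)   # 2D: one sample, feature count inferred by -1
one_slice = demo_img[0:1]             # slicing also keeps the array 2D
batch = demo_img[0:10]                # 2D: ten samples

print(one.shape, one_2d.shape, one_slice.shape, batch.shape)
# (5,) (1, 5) (1, 5) (10, 5)
```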

## Measuring Model Performance

Accuracy is the fraction of correct predictions: correct predictions / total number of data points.

It is computed here on the test set, so it measures how the model performs on new data.

In [None]:
score = logisticRegr.score(test_img, test_lbl)
print(score)
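
For a classifier, `score` returns the mean accuracy on the data it is given. As a small worked example of the formula, the sketch below computes accuracy by hand and with `sklearn.metrics.accuracy_score`. The labels and predictions are invented for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions for five images.
y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 1, 1, 2, 1])

# correct predictions / total number of data points: 4 correct out of 5
manual = np.mean(y_pred == y_true)
library = accuracy_score(y_true, y_pred)

print(manual, library)
```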