# Principal Component Analysis (PCA)
__________________________________________________________________________________________________________
### Principal Component Analysis (PCA) is a linear dimensionality reduction technique that extracts information from a high-dimensional space by projecting the data onto a lower-dimensional subspace.

### It tries to preserve the directions along which the data varies the most and to discard the directions along which it varies the least.

### Dimensions are nothing but features that represent the data. 

### For example, a 28 x 28 image has 784 picture elements (pixels); these are the dimensions, or features, that together represent the image.
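
As a minimal sketch of what "784 features" means in practice, the snippet below flattens a made-up 28 x 28 image into a single feature row. The random pixel values are an assumption for illustration only; they are not MNIST data.

```python
import numpy as np

# A hypothetical 28 x 28 grayscale image (random pixel values, illustration only).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28), dtype=np.uint8)

# Flattening turns the 2D grid of pixels into one row of 784 features.
features = image.reshape(-1)
print(features.shape)  # (784,)
```

A dataset of such images is then a matrix with one row per image and 784 columns.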
__________________________________________________________________________________________________________

### One important thing to note about PCA is that it is an Unsupervised dimensionality reduction technique.

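### Because it relies only on the correlations between features, it needs no supervision (labels). The reduced representation often makes groups of similar data points easier to see, although PCA itself does not assign points to clusters.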

__________________________________________________________________________________________________________

### Wikipedia Definition:

#### PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components.

* Note: ___Features, Dimensions, and Variables___ all refer to the same thing. You will find them used interchangeably.
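
The definition above can be illustrated with a small NumPy sketch. It builds two correlated variables from synthetic data (an assumption, not part of the original tutorial), finds the principal axes from the eigenvectors of the covariance matrix, and checks that the projected variables are uncorrelated. scikit-learn's `PCA` may compute the same quantities differently internally; this only shows the idea.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two correlated variables: y is roughly 2*x plus noise (synthetic data).
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(scale=0.5, size=500)
X = np.column_stack([x, y])

# Center the data, then eigendecompose the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues come back in ascending order

# Reorder so the axis with the largest variance comes first.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# The orthogonal transformation: project the centered data onto the principal axes.
Z = Xc @ eigvecs

corr_before = np.corrcoef(X, rowvar=False)[0, 1]
corr_after = np.corrcoef(Z, rowvar=False)[0, 1]
print("correlation before:", corr_before)
print("correlation after: ", corr_after)
```

The columns of `Z` are the principal components of this toy dataset: the original variables are strongly correlated, while the transformed ones are uncorrelated up to floating-point error.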


__________________________________________________________________________________________________________

<img src='iris.JPG'>

__________________________________________________________________________________________________________

# Where can PCA be applied?
* ___Data Visualization:___

    ___When working on any data related problem, the challenge in today's world is the sheer volume of data, and the variables/features that define that data.___
    
    ___To solve a problem where data is the key, you need extensive data exploration like finding out how the variables are correlated or understanding the distribution of a few variables.___ 
    
    ___With a large number of variables or dimensions along which the data is distributed, visualization becomes challenging or even impossible.___

    ___PCA helps here: it projects the data into a lower dimension, allowing you to visualize it in 2D or 3D space with the naked eye.___



* ___Speeding Up Machine Learning (ML) Algorithms:___ 

    ___Since PCA's main idea is dimensionality reduction, you can use it to shorten your machine learning algorithm's training and testing time when your data has many features and the algorithm learns too slowly.___
    
    
### At an abstract level, you take a dataset with many features and simplify it by keeping a few principal components, each of which is a linear combination of the original features.
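
A minimal sketch of this idea, assuming scikit-learn's small bundled digits dataset (`load_digits`, 8 x 8 images) as a stand-in for MNIST so that no download is needed:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# 1797 images of 8 x 8 pixels, already flattened to 64 features each.
digits = load_digits()
X = digits.data

# Keep only the first two principal components.
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X)

print(X.shape, "->", X_2d.shape)  # (1797, 64) -> (1797, 2)

# Each component is a weight vector over all 64 original pixels.
print(pca_2d.components_.shape)   # (2, 64)
```

The two columns of `X_2d` can be used as x and y coordinates in a scatter plot colored by `digits.target`, which is the visualization use case described above.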

__________________________________________________________________________________________________________


# What is a Principal Component?

* Principal components are the key to PCA; they represent what's underneath the hood of your data. 
<br>
<br>
* In layman's terms, when the data is projected from a higher-dimensional space into a lower dimension (assume three dimensions), those three dimensions are the three principal components that capture (or hold) most of the variance (information) of your data.
<br>
<br>

* Principal components have both direction and magnitude. 
<br>
<br>

    - Direction represents the principal axis along which the data is most spread out, that is, has the most variance.
    - Magnitude signifies the amount of variance that the principal component captures when the data is projected onto that axis.
    <br>
<br>

* Each principal component is a straight line (a direction in feature space), and the first principal component captures the most variance in the data. 
<br>
<br>

* Each subsequent principal component is orthogonal to all the previous ones and captures less variance.
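
The properties listed above can be checked on a fitted scikit-learn `PCA` object. This is a sketch on the bundled `load_digits` dataset (an assumption, chosen to avoid a download): `components_` holds the directions and `explained_variance_` holds the magnitudes.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data
pca = PCA(n_components=5, svd_solver="full").fit(X)

directions = pca.components_          # shape (5, 64): one unit vector per component
magnitudes = pca.explained_variance_  # variance captured along each direction

# Orthogonality: dot products between different components are ~0,
# and each component has unit length, so the Gram matrix is the identity.
gram = directions @ directions.T
print(np.allclose(gram, np.eye(5)))

# The captured variance never increases from one component to the next.
print(np.all(np.diff(magnitudes) <= 0))
```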


# Implementation: PCA + Logistic Regression (MNIST)

The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. 
It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.


<img src='mnist.JPG'>

In [1]:
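# Note: fetch_mldata was deprecated in scikit-learn 0.20 and removed in 0.22;
# newer releases provide fetch_openml instead.
from sklearn.datasets import fetch_mldata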
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.model_selection import train_test_split
import pandas as pd

## Download and Load the Data

### Check the folder structure

Take the following folder structure as an example, assuming this notebook is saved as "tutorial/notebook.py":


  working_directory   
       │   
       ├── dataset   
       │        └── mldata
       │                  └──  mnist-original.mat  # MNIST handwritten digits dataset   
       │   
       └── tutorial   
                 └── notebook.py    # the jupyter notebook in which we load MNIST

## Download and store the dataset locally
* Please download the file named "mnist-original.mat" from the following page.
    https://github.com/amplab/datascience-sp14/blob/master/lab7/mldata/mnist-original.mat

* To manually download the file, click the "download" button.

In [8]:
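# Folder that contains the "mldata" sub-folder with mnist-original.mat
data_path = "dataset"

# fetch_mldata reads <data_home>/mldata/mnist-original.mat if it is already there
mnist = fetch_mldata('MNIST original', data_home=data_path)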



In [9]:
mnist

{'DESCR': 'mldata.org dataset: mnist-original',
 'COL_NAMES': ['label', 'data'],
 'target': array([0., 0., 0., ..., 9., 9., 9.]),
 'data': array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=uint8)}

In [10]:
# Images
mnist.data.shape

(70000, 784)

In [12]:
# labels
mnist.target.shape

(70000,)
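
On scikit-learn releases where `fetch_mldata` is no longer available, a possible substitute is `fetch_openml`. The helper below is a hedged sketch, not the method this notebook was run with: the function name `load_mnist_openml` is hypothetical, the call needs a network connection the first time, and the row order and label dtype may differ from the `mnist-original.mat` file used above, so downstream numbers may not match exactly.

```python
from sklearn.datasets import fetch_openml


def load_mnist_openml(data_home="dataset"):
    """Download (or load from the local cache) MNIST from OpenML as NumPy arrays.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    mnist = fetch_openml("mnist_784", version=1, as_frame=False,
                         data_home=data_home)
    # OpenML stores the labels as strings ('0'..'9'); cast them to floats
    # so they look like the mnist.target array shown above.
    return mnist.data, mnist.target.astype(float)
```

Calling `images, labels = load_mnist_openml()` should give arrays of shape `(70000, 784)` and `(70000,)`, matching the shapes printed above.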

## Splitting Data into Training and Test Sets

In [13]:
# test_size: what proportion of original data is used for test set
train_img, test_img, train_lbl, test_lbl = train_test_split(
    mnist.data, mnist.target, test_size=1/7.0, random_state=0)

In [15]:
print(train_img.shape)

(60000, 784)


In [16]:
print(train_lbl.shape)

(60000,)


In [17]:
print(test_img.shape)

(10000, 784)


In [18]:
print(test_lbl.shape)

(10000,)
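
The 60,000 / 10,000 split follows directly from `test_size=1/7.0` applied to 70,000 rows. The same arithmetic can be seen on a tiny made-up array of 70 rows (the array contents are an assumption for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 70 hypothetical samples with 3 features each, plus dummy labels.
X = np.arange(70 * 3).reshape(70, 3)
y = np.arange(70) % 10

# One seventh of the rows goes to the test set, the rest to the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/7.0, random_state=0)

print(X_train.shape, X_test.shape)  # (60, 3) (10, 3)
```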


# Standardizing the Data

Since PCA yields a feature subspace that maximizes the variance along the axes, it makes sense to standardize the data, especially if the features were measured on different scales.

Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data.

Notebook going over the importance of feature Scaling: http://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py

In [19]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Fit on training set only.
scaler.fit(train_img)

# Apply transform to both the training set and the test set.
train_img = scaler.transform(train_img)
test_img = scaler.transform(test_img)
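
To make the step above concrete, here is a small sketch of what `StandardScaler` computes, on a made-up 3 x 3 training matrix (an assumption for illustration). The mean and standard deviation come from the training rows only and are then reused for the test row. The third column is constant, like a pixel that is zero in every training image; `StandardScaler` leaves such a column unscaled instead of dividing by zero.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[0.0, 10.0, 5.0],
                  [2.0, 20.0, 5.0],
                  [4.0, 30.0, 5.0]])
test = np.array([[1.0, 40.0, 5.0]])

# Fit on the training data only, as in the cell above.
scaler = StandardScaler().fit(train)

# Manual equivalent: subtract the training mean, divide by the training std.
mean = train.mean(axis=0)
std = train.std(axis=0)   # population standard deviation (ddof=0)
std[std == 0] = 1.0       # constant column: avoid dividing by zero
manual_test = (test - mean) / std

print(scaler.transform(test))
print(manual_test)
```

Both printed rows should agree, which shows that the test set is scaled with statistics learned from the training set rather than its own.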

# PCA to Speed up Machine Learning Algorithms (Logistic Regression)

## Step 0: Import and use PCA. After PCA you will apply a machine learning algorithm of your choice to the transformed data

In [20]:
from sklearn.decomposition import PCA

Make an instance of the Model

In [21]:
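# A float between 0 and 1 tells scikit-learn to keep the smallest number of
# components whose cumulative explained variance exceeds that fraction (here 95%).
pca = PCA(.95)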

Fit PCA on training set. 
* Note: you are fitting PCA on the training set only

In [22]:
pca.fit(train_img)

PCA(copy=True, iterated_power='auto', n_components=0.95, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

In [23]:
pca.n_components_

330
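
How `n_components_` relates to the 95% target can be checked through the cumulative explained variance ratio. The sketch below uses the bundled `load_digits` dataset as a small stand-in (an assumption), so the component count it prints will differ from the 330 obtained on MNIST.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_digits().data)

pca = PCA(n_components=0.95).fit(X)
ratios = pca.explained_variance_ratio_   # fraction of variance per kept component
cumulative = np.cumsum(ratios)

print("components kept:", pca.n_components_)
print("variance retained:", cumulative[-1])
print("with one component fewer:", cumulative[-2])
```

The retained variance is at least 0.95, and dropping the last kept component would fall to or below that threshold, which is why PCA stops where it does.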

Apply the mapping (transform) to __both__ the training set and the test set.

In [24]:
train_img = pca.transform(train_img)
test_img = pca.transform(test_img)

* Step 1: Import the model you want to use

    In sklearn, all machine learning models are implemented as Python classes

In [25]:
from sklearn.linear_model import LogisticRegression

* Step 2: Make an instance of the Model

In [26]:
# all parameters not specified are set to their defaults
# the default solver in the scikit-learn version used here (liblinear) is very
# slow on this data, so we switch to solver='lbfgs'
logisticRegr = LogisticRegression(solver='lbfgs')

* Step 3: Training the model on the data, storing the information learned from the data

    Model is learning the relationship between x (digits) and y (labels)

In [27]:
logisticRegr.fit(train_img, train_lbl)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

* Step 4: Predict the labels of new data (new images)

    Uses the information the model learned during the model training process

In [28]:
# Returns a NumPy Array
# Predict for One Observation (image)
logisticRegr.predict(test_img[0].reshape(1,-1))

array([1.])

In [29]:
# Predict for Multiple Observations (images) at Once
logisticRegr.predict(test_img[0:10])

array([1., 9., 2., 2., 7., 1., 8., 3., 3., 7.])

# Measuring Model Performance

* __accuracy (fraction of correct predictions): correct predictions / total number of data points__


* __Basically, how the model performs on new data (test set)__

In [31]:
score = logisticRegr.score(test_img, test_lbl)
print(score)

0.9199
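
The accuracy definition above can be verified by hand. This sketch repeats the same steps (scale, PCA, logistic regression) on the bundled `load_digits` dataset, which is an assumption made to keep the example small; `max_iter=1000` is also an added setting so that the solver converges. It then compares `score` with the fraction of correct predictions.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/7.0, random_state=0)

# Fit the scaler and PCA on the training set only, then transform both sets.
scaler = StandardScaler().fit(X_train)
pca = PCA(0.95).fit(scaler.transform(X_train))
Z_train = pca.transform(scaler.transform(X_train))
Z_test = pca.transform(scaler.transform(X_test))

clf = LogisticRegression(solver="lbfgs", max_iter=1000).fit(Z_train, y_train)

# Accuracy = correct predictions / total number of data points.
predictions = clf.predict(Z_test)
manual_accuracy = np.mean(predictions == y_test)

print(manual_accuracy, clf.score(Z_test, y_test))
```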


# Number of Components, Variance, Time Table

In [32]:
pd.DataFrame(data = [[1.00, 784, 48.94, .9158],
                     [.99, 541, 34.69, .9169],
                     [.95, 330, 13.89, .92],
                     [.90, 236, 10.56, .9168],
                     [.85, 184, 8.85, .9156]], 
             columns = ['Variance Retained',
                      'Number of Components', 
                      'Time (seconds)',
                      'Accuracy'])

| | Variance Retained | Number of Components | Time (seconds) | Accuracy |
|---|---|---|---|---|
| 0 | 1.00 | 784 | 48.94 | 0.9158 |
| 1 | 0.99 | 541 | 34.69 | 0.9169 |
| 2 | 0.95 | 330 | 13.89 | 0.9200 |
| 3 | 0.90 | 236 | 10.56 | 0.9168 |
| 4 | 0.85 | 184 | 8.85 | 0.9156 |
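
A table like this one could be produced with a loop along the following lines. This is a sketch under stated assumptions, not the code behind the numbers above: it uses the bundled `load_digits` dataset, times only the logistic regression fit (the notebook does not say what its timings include), and uses `None` to mean "no PCA". The timings and accuracies it prints will therefore differ from the MNIST figures.

```python
import time

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/7.0, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

rows = []
for variance in [None, 0.99, 0.95, 0.90, 0.85]:  # None = keep all features
    if variance is None:
        Z_train, Z_test = X_train, X_test
    else:
        pca = PCA(variance).fit(X_train)
        Z_train, Z_test = pca.transform(X_train), pca.transform(X_test)

    # Time only the model fit on the (possibly reduced) training data.
    start = time.perf_counter()
    clf = LogisticRegression(solver="lbfgs", max_iter=1000).fit(Z_train, y_train)
    elapsed = time.perf_counter() - start

    rows.append((variance, Z_train.shape[1], elapsed, clf.score(Z_test, y_test)))

for row in rows:
    print(row)  # (variance retained, number of components, seconds, accuracy)
```

Passing `rows` to `pd.DataFrame` with the column names used above gives the same layout as the table.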
