[View in Colaboratory](https://colab.research.google.com/github/duakaran96/ML-AcadView/blob/master/PCA_on_digits.ipynb)

# Principal Component Analysis

[Principal component analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. 
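
To make the definition concrete, here is a minimal NumPy sketch of the idea, assuming two synthetic correlated variables (the data and all variable names are illustrative, not part of this project). It projects the centered data onto orthonormal directions obtained from an SVD and checks that the resulting variables are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two strongly correlated synthetic variables (illustrative data only).
x = rng.normal(size=500)
data = np.column_stack([x, 2.0 * x + rng.normal(scale=0.5, size=500)])

# Center the data; PCA is defined on mean-centered observations.
centered = data - data.mean(axis=0)

# Rows of vt are the orthonormal principal directions.
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

# Project the observations onto the principal directions.
components = centered @ vt.T

# Covariance of the projected data: the off-diagonal entries are ~0,
# so the new variables are linearly uncorrelated.
cov = np.cov(components, rowvar=False)
print(np.round(cov, 6))
```

The diagonal of the printed matrix holds the variance captured by each component, in decreasing order.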

In this project we'll take the *digits* dataset that ships with the **scikit-learn** library and try to reduce its number of dimensions while losing as little of the information in the data as possible.

## Imports

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### The code below will load the digits dataset

In [0]:
from sklearn.datasets import load_digits

In [0]:
digits = load_digits()

### Separating the features and target into X and y

In [0]:
X = digits.data
y = digits.target
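
As a quick sanity check on what `X` holds, the sketch below compares the flattened feature matrix with the 8×8 image grids that `load_digits` also exposes through its `images` attribute. The name `flattened` is introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

print(digits.images.shape)  # one 8x8 grid per sample
print(digits.data.shape)    # the same grids flattened to 64 values

# Flattening each image row by row reproduces the feature matrix.
flattened = digits.images.reshape(len(digits.images), -1)
print(np.array_equal(flattened, digits.data))
```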

### Scaling the features on a standard scale using StandardScaler

In [0]:
from sklearn.preprocessing import StandardScaler

In [0]:
scaler = StandardScaler()

In [7]:
scaler.fit(X)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [0]:
X = scaler.transform(X)
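
For intuition, the scaling step can be approximated by hand. The sketch below is a simplified stand-in for `StandardScaler`, not its actual implementation: the helper `standardize` and the toy array are assumptions made for illustration. It subtracts each column's mean and divides by its population standard deviation, leaving constant columns unscaled so that no division by zero occurs.

```python
import numpy as np

def standardize(features):
    """Scale each column to zero mean and unit variance (population std)."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    # Constant columns have std == 0; leave them unscaled to avoid dividing by zero.
    std[std == 0] = 1.0
    return (features - mean) / std

# Toy data: the first column is constant, like an always-blank pixel would be.
toy = np.array([[0.0, 1.0, 5.0],
                [0.0, 3.0, 7.0],
                [0.0, 5.0, 9.0]])
scaled = standardize(toy)
print(scaled)
```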

## Train Test Split

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40, random_state=101)
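
The sizes of the two sets follow from simple arithmetic. The sketch below assumes that the test count is rounded up and that the remaining rows go to the training set, which is consistent with the shapes printed later in this notebook.

```python
import math

n_samples = 1797   # number of images in the digits dataset
test_size = 0.40   # same fraction as in the split above

# Assumption: the test count is rounded up and the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # prints: 1078 719
```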

## Building and Training Our Logistic Regression Model

In [0]:
from sklearn.linear_model import LogisticRegression

In [0]:
log = LogisticRegression()

In [29]:
import time
start = time.time()

log.fit(X_train, y_train)

end = time.time()
print(end - start)

0.14958953857421875


We passed our data directly to the logistic regression model; the fitting time, in seconds, is printed above.

Our data has 64 dimensions, one for each pixel of an 8×8 digit image.

Let's take a look at the **predictions** and the **accuracy score**.

In [0]:
predictions = log.predict(X_test)

In [0]:
from sklearn.metrics import accuracy_score

In [32]:
print('Accuracy Score when directly doing logistic regression', accuracy_score(y_test, predictions))

Accuracy Score when directly doing logistic regression 0.9707927677329624
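
The accuracy score is simply the fraction of test samples whose predicted label matches the true label. A minimal sketch of that computation, using a hypothetical helper `accuracy` and toy labels rather than the predictions above:

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Toy labels, not taken from the digits run above.
toy_true = [3, 8, 8, 1]
toy_pred = [3, 8, 9, 1]
print(accuracy(toy_true, toy_pred))  # 3 of 4 correct -> 0.75
```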


Now we'll apply **dimensionality reduction** and extract **principal components** in order to reduce the fitting time of our algorithm and its memory usage.

## Importing PCA from the decomposition module of sklearn

In [0]:
from sklearn.decomposition import PCA

We want the principal components to retain 99% of the variance in our data.

We fit the PCA on the training data only. This is the general practice, because in real-life scenarios the test data is not available beforehand.
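
One way to enforce the train-only rule for every preprocessing step is a scikit-learn pipeline. The sketch below is an alternative arrangement, not the flow used in this notebook: it reloads the raw data under new names (`raw`, `raw_train`, and so on, all introduced here) and, unlike the cells above, also fits the scaler on the training rows only. `max_iter=1000` is an assumption made to give the solver room to converge.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

raw = load_digits()
raw_train, raw_test, label_train, label_test = train_test_split(
    raw.data, raw.target, test_size=0.40, random_state=101
)

# Every step is fitted on the training rows only when .fit() is called.
# max_iter=1000 is an assumption, not a setting used elsewhere in this notebook.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.99),
    LogisticRegression(max_iter=1000),
)
model.fit(raw_train, label_train)

# The test rows are only transformed and scored, never used for fitting.
n_kept = model.named_steps["pca"].n_components_
test_accuracy = model.score(raw_test, label_test)
print(n_kept, test_accuracy)
```

Because the scaler sees slightly different data here, the number of components and the accuracy may differ a little from the values reported in this notebook.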

In [0]:
pca = PCA(.99)

In [35]:
pca.fit(X_train)

PCA(copy=True, iterated_power='auto', n_components=0.99, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)
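
Passing a fraction such as `.99` asks PCA to choose the number of components from the explained variance. The sketch below illustrates the idea on synthetic data with plain NumPy: it keeps the smallest number of components whose cumulative share of the variance reaches the target. This is an approximation of the selection rule for illustration; scikit-learn's exact behaviour at the boundary may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 10 features with very different scales (illustrative only).
scales = np.array([10, 8, 6, 4, 2, 1, 0.5, 0.2, 0.1, 0.05])
data = rng.normal(size=(300, 10)) * scales

centered = data - data.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)

# Variance explained by each principal component, as a fraction of the total.
explained_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative share reaches the target.
target = 0.99
n_keep = int(np.argmax(cumulative >= target)) + 1

print(n_keep, round(float(cumulative[n_keep - 1]), 4))
```

On a fitted scikit-learn `PCA` object, the per-component shares are available as `explained_variance_ratio_` and the chosen count as `n_components_`.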

## Transforming our training and testing data

In [0]:
X_train = pca.transform(X_train)
X_test = pca.transform(X_test)
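
With the default settings (no whitening), the transform amounts to subtracting the mean learned from the training data and projecting onto the learned principal directions. The sketch below checks this on random stand-in data; `toy_pca`, `toy_train` and `toy_test` are names introduced here so that the notebook's own variables are left untouched.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
toy_train = rng.normal(size=(200, 5))  # stand-in for the training features
toy_test = rng.normal(size=(50, 5))    # stand-in for the test features

toy_pca = PCA(n_components=3).fit(toy_train)

# Manual projection: subtract the training mean, then project onto the
# principal directions learned from the training data.
manual = (toy_test - toy_pca.mean_) @ toy_pca.components_.T

matches = np.allclose(manual, toy_pca.transform(toy_test))
print(matches)
```

Note that the test data is centered with the training mean, so no statistic is ever computed from the test set.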

In [37]:
print(X_train.shape)
print(X_test.shape)

(1078, 48)
(719, 48)


In [0]:
log1 = LogisticRegression()

In [39]:
start = time.time()

log1.fit(X_train, y_train)

end = time.time()
print(end - start)

0.1451256275177002


The fitting time dropped only slightly, from about 0.150 s to about 0.145 s. The change is small because the dataset we are using is small and already cleanly preprocessed, so fitting on all 64 dimensions without PCA takes very little time to begin with. A difference this small may also be within the run-to-run noise of a single timing measurement.
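
When two timings are this close, repeating the measurement gives a more trustworthy comparison. The sketch below uses a hypothetical helper, `best_of`, that runs a callable several times with `time.perf_counter` and reports the fastest run; the workload is a stand-in so that the snippet runs on its own.

```python
import time

def best_of(func, repeats=5):
    """Run func several times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)

# Stand-in workload; in this notebook it could instead be something like
# lambda: LogisticRegression().fit(X_train, y_train)
elapsed = best_of(lambda: sum(i * i for i in range(100_000)))
print(f"{elapsed:.6f} s")
```

Taking the minimum over several runs filters out delays caused by other processes on the machine.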


In [0]:
predictions1 = log1.predict(X_test)

In [41]:
print('Accuracy Score when doing logistic regression after applying PCA: ', accuracy_score(y_test, predictions1))

Accuracy Score when doing logistic regression after applying PCA:  0.9680111265646731


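Accuracy also barely changes: 0.9708 without PCA versus 0.9680 with it, a difference of two test images out of 719. So applying PCA to our features before fitting is a reasonable approach here: it cuts the feature count from 64 to 48, which saves memory and some fitting time, at very little cost in accuracy.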
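
The memory side of this trade-off can be estimated directly from the array shapes. The sketch below assumes 64-bit floating-point features and uses placeholder arrays of the same shapes as the training data before and after PCA.

```python
import numpy as np

n_train = 1078  # training rows, as in the shapes printed earlier

full = np.zeros((n_train, 64))     # float64 features before PCA
reduced = np.zeros((n_train, 48))  # float64 features after PCA

ratio = reduced.nbytes / full.nbytes
print(full.nbytes, reduced.nbytes, ratio)  # prints: 551936 413952 0.75
```

Keeping 48 of 64 columns stores 75% of the original bytes, a saving of about 138 KB on this training set.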

## Conclusion

Principal component analysis is a useful technique when dealing with high-dimensional data, since *dimensionality reduction* helps mitigate *the curse of dimensionality*.

Moreover, it can reduce both training time and memory consumption.