## Principal Component Analysis - PCA

Principal component analysis on the Iris dataset.
PCA is predominantly used as a dimensionality reduction technique in domains such as facial recognition, computer vision, and image compression. It is also used to find patterns in high-dimensional data in fields such as finance, data mining, bioinformatics, and psychology.

- [x] Speed up the training process
- [x] Data visualization
- [x] Dimensionality reduction

[What are the Pros and cons of the PCA?](https://www.i2tutorials.com/what-are-the-pros-and-cons-of-the-pca/)

In [63]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

In [45]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

In [36]:
df = pd.read_csv("../Data/iris.csv")

In [37]:
df.head()

| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [38]:
df.shape

(150, 6)

In [39]:
df.describe()

| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm |
|---|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 75.5 | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 43.445368 | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 1.0 | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 38.25 | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 75.5 | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 112.75 | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 150.0 | 7.9 | 4.4 | 6.9 | 2.5 |


## Standardization

PCA is affected by the scale of the features, so you need to scale the features in your data before applying PCA. `StandardScaler` standardizes the dataset by rescaling each feature to zero mean and unit variance.
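As a rough check of what the scaling does, the sketch below reproduces one standardized value by hand. It uses the `SepalLengthCm` mean and standard deviation reported by `df.describe()` and assumes `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas reports the sample standard deviation (`ddof=1`). The variable names are illustrative only.

```python
import numpy as np

# Values copied from the df.describe() output (SepalLengthCm column).
mean = 5.843333
sample_std = 0.828066      # pandas reports the sample std (ddof=1)
n = 150                    # number of rows in the dataset

# StandardScaler divides by the population std (ddof=0), so convert.
population_std = sample_std * np.sqrt((n - 1) / n)

# z-score of the first row's SepalLengthCm value (5.1).
z = (5.1 - mean) / population_std
print(round(z, 4))
```

The result should agree, up to rounding, with the first `SepalLengthCm` entry of `scaled_df.head()`.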

In [40]:
y = df.Species
X = df.drop(["Species", "Id"], axis = 1)

In [41]:
X.head()

| | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [42]:
col_names = X.columns

# create the scaler object
scaler = StandardScaler()

scaled_df = scaler.fit_transform(X)
scaled_df = pd.DataFrame(scaled_df, columns=col_names)

In [44]:
scaled_df.head()

| | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm |
|---|---|---|---|---|
| 0 | -0.900681 | 1.032057 | -1.341272 | -1.312977 |
| 1 | -1.143017 | -0.124958 | -1.341272 | -1.312977 |
| 2 | -1.385353 | 0.337848 | -1.398138 | -1.312977 |
| 3 | -1.506521 | 0.106445 | -1.284407 | -1.312977 |
| 4 | -1.021849 | 1.26346 | -1.341272 | -1.312977 |


In [46]:
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(scaled_df)
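
For intuition about what `fit_transform` computes, here is a minimal NumPy sketch of PCA via the singular value decomposition. It runs on hypothetical random data (`X_demo`) standing in for `scaled_df`, and it is not scikit-learn's actual implementation: in particular, the signs of the components may differ from what `PCA` returns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 50 samples, 4 features (a stand-in for scaled_df).
X_demo = rng.normal(size=(50, 4))

# PCA operates on mean-centered data.
X_centered = X_demo - X_demo.mean(axis=0)

# Rows of Vt are the principal directions, ordered by decreasing singular value.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Project onto the first two principal directions.
n_components = 2
scores = X_centered @ Vt[:n_components].T   # shape (50, 2)

# Variance along each direction, as a fraction of the total variance.
explained_variance = S**2 / (X_centered.shape[0] - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

print(scores.shape)
print(explained_variance_ratio)
```

The projected columns are uncorrelated with each other, which is why the two components can be read as independent axes of variation.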

In [58]:
principalDf = pd.DataFrame(data = principalComponents, columns = ['pca1', 'pca2'])

In [59]:
finalDf = pd.concat([principalDf, y], axis=1)

In [60]:
finalDf.head()

| | pca1 | pca2 | Species |
|---|---|---|---|
| 0 | -2.264542 | 0.505704 | Iris-setosa |
| 1 | -2.086426 | -0.655405 | Iris-setosa |
| 2 | -2.36795 | -0.318477 | Iris-setosa |
| 3 | -2.304197 | -0.575368 | Iris-setosa |
| 4 | -2.388777 | 0.674767 | Iris-setosa |


In [68]:
# The explained_variance_ratio_ attribute shows that the first principal component
# explains 72.77% of the variance and the second explains 23.03%.
# Together, the two components retain 95.80% of the total variance.
pca.explained_variance_ratio_

array([0.72770452, 0.23030523])
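
The combined figure can be checked with a cumulative sum over the reported ratios. This small sketch hard-codes the two values from the output above rather than reading them from the fitted `pca` object, so it runs on its own.

```python
import numpy as np

# Ratios copied from the pca.explained_variance_ratio_ output.
ratios = np.array([0.72770452, 0.23030523])

# Running total of variance retained as components are added.
cumulative = np.cumsum(ratios)
retained = cumulative[-1]

# Share of variance discarded by keeping only two components.
lost = 1.0 - retained

print(cumulative)
print(round(lost, 4))
```

Roughly 4.2% of the variance in the standardized features is therefore dropped by the two-component projection.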

__author__ = "