# Principal Components Analysis

Principal component analysis (PCA) is a technique for dimensionality reduction, which is the process of reducing the number of predictor variables in a dataset.
More specifically, PCA is an unsupervised type of feature extraction, where original variables are combined and reduced to their most important and descriptive components.

The goal of PCA is to identify patterns in a dataset and distill the variables down to their most important features, so that the data is simplified without losing important traits.
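To make "combining and reducing variables" concrete, here is a minimal sketch of one textbook formulation of PCA: eigendecomposition of the covariance matrix. The toy data and all variable names are assumptions made for illustration, and this is not meant to mirror how any particular library implements PCA.

```python
import numpy as np

# Toy data (an assumption for illustration): 200 points in 3-D whose
# variation lies mostly along a single direction.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
data = np.column_stack([t, 2 * t, -t]) + 0.1 * rng.normal(size=(200, 3))

# 1. Center each variable.
centered = data - data.mean(axis=0)

# 2. Eigendecompose the covariance matrix of the variables.
cov = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 3. Sort directions by decreasing variance (eigh returns ascending order).
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# 4. Keep the top direction and project the data onto it.
scores = centered @ eigenvectors[:, :1]

# Share of total variance captured by each direction.
variance_share = eigenvalues / eigenvalues.sum()
print(variance_share.round(3))
```

Here three correlated variables collapse into a single column of `scores` that retains almost all of the variance, which is the kind of simplification described above.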

# Import Libraries

**Import the usual libraries for pandas and plotting. You can import sklearn later on.**

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import plotly.express as px

warnings.filterwarnings("ignore")
%matplotlib inline

# Get the Data

**Use the following function to load the iris dataset into a pandas DataFrame (here via the loader bundled with plotly express).**

In [5]:
df = px.data.iris()
df.head()

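| | sepal_length | sepal_width | petal_length | petal_width | species | species_id |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 1 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 1 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 1 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 1 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 1 |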


# Standardize the Data
PCA is affected by scale, so you need to scale the features in your data before applying it. Use `StandardScaler` to standardize the dataset's features to unit scale (mean = 0 and variance = 1), which many machine learning algorithms need in order to perform well. To see the negative effect that unscaled data can have, look at the section of the scikit-learn documentation on the effects of not standardizing your data.
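As a rough illustration of what standardizing to unit scale means, the sketch below z-scores each column by hand with NumPy. The measurements are hypothetical values invented for the example; the population standard deviation (`ddof=0`) is used because that is the convention `StandardScaler` follows.

```python
import numpy as np

# Hypothetical measurements on very different scales (illustrative values only).
raw = np.array([
    [5.1, 3500.0],
    [4.9, 3000.0],
    [6.3, 3300.0],
    [5.8, 2700.0],
])

# z-score each column: subtract the column mean, then divide by the column
# standard deviation. ddof=0 (population std) matches StandardScaler.
mean = raw.mean(axis=0)
std = raw.std(axis=0, ddof=0)
scaled = (raw - mean) / std

print(scaled.mean(axis=0).round(6))  # approximately 0 for every column
print(scaled.std(axis=0).round(6))   # 1 for every column
```

After this step both columns contribute on equal footing, whereas before it the second column's variance would dwarf the first.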


**Standardize the features of `df` and save it into a numpy matrix `X`**

*Hint*
```
from sklearn.preprocessing import StandardScaler
features = ["sepal_length","sepal_width","petal_length","petal_width"]
```



In [0]:
from sklearn.preprocessing import StandardScaler
features = ["sepal_length","sepal_width","petal_length","petal_width"]

scaler = StandardScaler()
X = scaler.fit_transform(df.loc[:,features])

# PCA Projection to 2D
The original data has 4 columns (sepal length, sepal width, petal length, and petal width). In this section, the code projects the original 4-dimensional data into 2 dimensions. Note that after dimensionality reduction there usually isn't a particular meaning assigned to each principal component; the new components are just the two main dimensions of variation.
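The projection itself is a matrix multiplication: each principal component is a weighted combination of the four standardized features. The sketch below checks this against scikit-learn's own `transform`. It assumes scikit-learn's bundled `load_iris` data as a stand-in for the DataFrame used in this notebook, and the `_demo` names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled iris data stands in for the DataFrame
# used in this notebook.
X_demo = StandardScaler().fit_transform(load_iris().data)

pca_demo = PCA(n_components=2).fit(X_demo)

# components_ has shape (2, 4): one row of feature weights per component.
manual = (X_demo - pca_demo.mean_) @ pca_demo.components_.T

# The manual projection agrees with the library's transform.
print(np.allclose(manual, pca_demo.transform(X_demo)))
```

Inspecting the rows of `components_` is one way to see how much each original feature contributes to a component, even when the component has no tidy interpretation.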


**Initialize the PCA with 2 components, fit and transform the matrix `X`, then create a dataframe with 3 columns, the 2 principal components and the class from `df`**

**Hint**


```
from sklearn.decomposition import PCA

px.scatter(df, x="pc1", y="pc2", color="species",hover_data=['petal_width'])
```



In [0]:
from sklearn.decomposition import PCA


In [0]:
pca = PCA(n_components=2,random_state=102)

In [18]:
df_pca = pd.DataFrame(pca.fit_transform(X),columns = ['pc1','pc2'])
df_pca.head()

| | pc1 | pc2 |
|---|---|---|
| 0 | -2.264542 | 0.505704 |
| 1 | -2.086426 | -0.655405 |
| 2 | -2.36795 | -0.318477 |
| 3 | -2.304197 | -0.575368 |
| 4 | -2.388777 | 0.674767 |


In [0]:
df = pd.concat([df,df_pca],axis=1)

In [25]:
px.scatter(df, x="sepal_length", y="sepal_width", color="species",hover_data=['petal_width'])

In [27]:
px.scatter(df, x="pc1", y="pc2", color="species")

In [28]:
pca.explained_variance_

array([2.93035378, 0.92740362])

In [29]:
pca.explained_variance_ratio_

array([0.72770452, 0.23030523])
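The two attributes above are closely related: `explained_variance_` holds the variance along each kept component (computed with an n - 1 denominator), and `explained_variance_ratio_` divides those values by the total variance of the data. The following sketch checks that relationship on synthetic data; the data and names are assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (an assumption for illustration): 100 samples, 4 correlated features.
rng = np.random.default_rng(1)
data = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))

pca_demo = PCA(n_components=2).fit(data)

# Eigenvalues of the sample covariance matrix (n - 1 denominator), largest first.
eigenvalues = np.linalg.eigvalsh(np.cov(data, rowvar=False))[::-1]

# The top eigenvalues are the explained variances ...
print(np.allclose(pca_demo.explained_variance_, eigenvalues[:2]))
# ... and dividing by the total variance gives the ratios.
print(np.allclose(pca_demo.explained_variance_ratio_, eigenvalues[:2] / eigenvalues.sum()))
```

The same arithmetic applies to the output above. The four standardized features of the 150 iris rows each have variance 150/149 under the n - 1 denominator, so the total variance is about 4.027, and 2.93035378 / 4.027 ≈ 0.7277.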

In [0]:
ggdata = pd.DataFrame(pca.explained_variance_ratio_,columns = ['evr'])
ggdata['cummulative_evr'] = ggdata.evr.cumsum()

In [34]:
ggdata

| | evr | cummulative_evr |
|---|---|---|
| 0 | 0.727705 | 0.727705 |
| 1 | 0.230305 | 0.95801 |


In [47]:
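# Note: this fit uses the raw (unscaled) features, so the ratios differ from the standardized fit above.
PCA(4,random_state=101).fit(df.loc[:,features]).explained_variance_ratio_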

array([0.92461621, 0.05301557, 0.01718514, 0.00518309])

In [48]:
np.cumsum(PCA(4,random_state=101).fit(df.loc[:,features]).explained_variance_ratio_)

array([0.92461621, 0.97763178, 0.99481691, 1.        ])
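The ratios in the last two cells come from fitting PCA on `df.loc[:,features]`, the raw measurements, rather than on the standardized matrix `X`. That is why the first component claims about 92% of the variance here, compared with about 73% earlier. A small side-by-side sketch makes the effect of scaling visible. It assumes scikit-learn's bundled `load_iris` data as a stand-in for `df[features]`, so the printed values may not match the notebook's digit for digit.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled iris data stands in for df[features].
raw = load_iris().data
scaled = StandardScaler().fit_transform(raw)

raw_ratio = PCA(n_components=4).fit(raw).explained_variance_ratio_
scaled_ratio = PCA(n_components=4).fit(scaled).explained_variance_ratio_

# On raw data the first component tends to be dominated by the features with
# the largest spread, so it claims a larger share than on standardized data.
print(raw_ratio.round(4))
print(scaled_ratio.round(4))
```

Which version to report depends on whether the features' original units should influence the components; the standardized fit treats all four measurements equally.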

In [0]:
components=4
evr = PCA(components,random_state=101).fit(df.loc[:,features]).explained_variance_ratio_

In [0]:
ggdata  = pd.DataFrame(evr,columns = ['evr'])
ggdata['cevr'] = ggdata['evr'].cumsum()

In [0]:
ggdata['number_components'] = list(range(1,components+1))

In [59]:
px.line(ggdata,x='number_components',y='evr')

In [0]:
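# A float between 0 and 1 asks PCA to keep enough components to explain more than that fraction of the variance.
pca = PCA(0.9)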

In [61]:
pca.fit(df.loc[:,features])

PCA(copy=True, iterated_power='auto', n_components=0.9, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

In [64]:
pca.n_components

0.9

In [65]:
pca.n_components_

1
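Here `n_components` simply echoes the argument that was passed in (0.9), while `n_components_` reports how many components were actually kept. Because the first component of the unscaled data already explains about 92% of the variance, one component is enough. The selection rule can be sketched as follows; `components_for_variance` is a hypothetical helper written for this example, not a scikit-learn function.

```python
import numpy as np

def components_for_variance(ratios, target):
    """Smallest number of leading components whose cumulative explained
    variance ratio exceeds `target` (hypothetical helper, not a library API)."""
    cumulative = np.cumsum(ratios)
    return int(np.searchsorted(cumulative, target, side="right") + 1)

# Ratios copied from the unscaled four-component fit shown earlier.
ratios = [0.92461621, 0.05301557, 0.01718514, 0.00518309]

print(components_for_variance(ratios, 0.90))  # 1
print(components_for_variance(ratios, 0.95))  # 2
print(components_for_variance(ratios, 0.99))  # 3
```

Raising the target to 0.95 would require two components on this data, and 0.99 would require three.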