# Principal Components Analysis

Principal component analysis (PCA) is a technique for dimensionality reduction, which is the process of reducing the number of predictor variables in a dataset.
More specifically, PCA is an unsupervised type of feature extraction, where original variables are combined and reduced to their most important and descriptive components.

The goal of PCA is to identify patterns in a dataset and then distill the variables down to their most important features, so that the data is simplified without losing important traits.

# Import Libraries

**Import the usual libraries for pandas and plotting. You can import sklearn later on.**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import plotly.express as px

warnings.filterwarnings("ignore")
%matplotlib inline

## Get the Data

**Use the following function to load the iris dataset into a pandas DataFrame (here it comes from the plotly express data loader)**

In [None]:
df = px.data.iris()
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,species_id
0,5.1,3.5,1.4,0.2,setosa,1
1,4.9,3.0,1.4,0.2,setosa,1
2,4.7,3.2,1.3,0.2,setosa,1
3,4.6,3.1,1.5,0.2,setosa,1
4,5.0,3.6,1.4,0.2,setosa,1


# Standardize the Data
PCA is affected by scale, so you need to scale the features in your data before applying PCA. Use `StandardScaler` to standardize the dataset's features onto unit scale (mean = 0 and variance = 1), which many machine learning algorithms need in order to perform well. If you want to see the negative effect that not scaling your data can have, the scikit-learn documentation has a section on the effects of not standardizing your data.
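To make the effect of standardization concrete, here is a minimal sketch on a small made-up array (the `toy` data is hypothetical and not part of the iris dataset). It only illustrates what `StandardScaler` does to each column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): two features on very different scales.
toy = np.array([
    [1.0, 1000.0],
    [2.0, 2000.0],
    [3.0, 3000.0],
    [4.0, 4000.0],
])

# fit_transform learns each column's mean and standard deviation,
# then returns (value - mean) / std for every entry.
toy_scaled = StandardScaler().fit_transform(toy)

print(toy_scaled.mean(axis=0))  # approximately 0 for each column
print(toy_scaled.std(axis=0))   # approximately 1 for each column
```

After scaling, both columns carry the same values even though the raw numbers differed by a factor of 1000, so neither feature dominates the variance just because of its units.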


**Standardize the features of `df` and save it into a numpy matrix `X`**

*Hint*
```
from sklearn.preprocessing import StandardScaler
features = ["sepal_length","sepal_width","petal_length","petal_width"]
```



In [None]:
from sklearn.preprocessing import StandardScaler
features = ["sepal_length","sepal_width","petal_length","petal_width"]
# your code here

# PCA Projection to 2D
The original data has 4 columns (sepal length, sepal width, petal length, and petal width). In this section, the code projects the original 4-dimensional data into 2 dimensions. I should note that after dimensionality reduction, there usually isn't a particular meaning assigned to each principal component. The new components are just the two main dimensions of variation.
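The projection described above could look roughly like the following sketch. To keep it self-contained it loads iris through `sklearn.datasets.load_iris` rather than `px.data.iris()`, which is an assumption of this example; the names `X_demo`, `pca_demo`, and `pca_df` are illustrative only.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in for the notebook's data: scikit-learn's bundled iris copy.
iris = load_iris()
X_demo = StandardScaler().fit_transform(iris.data)  # shape (150, 4)

# Project the 4 standardized features onto the 2 main directions of variation.
pca_demo = PCA(n_components=2)
components = pca_demo.fit_transform(X_demo)  # shape (150, 2)

# Collect the two components plus the class label in one DataFrame.
pca_df = pd.DataFrame(components, columns=["pc1", "pc2"])
pca_df["species"] = iris.target_names[iris.target]
print(pca_df.head())
```

A frame shaped like `pca_df` can then be passed to a scatter plot with `pc1` and `pc2` on the axes and `species` as the color.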


**Initialize the PCA with 2 components, fit and transform the matrix `X`, then create a dataframe with 3 columns, the 2 principal components and the class from `df`**

**Hint**


```
from sklearn.decomposition import PCA

px.scatter(df, x="pc1", y="pc2", color="species",hover_data=['petal_width'])
```



In [None]:
from sklearn.decomposition import PCA
# your code here

Create a scatter plot of `sepal_length` versus `sepal_width`, using `species` as the color aesthetic

# Question: What is the explained variance per principal component?

In [None]:
# your code here

array([2.93035378, 0.92740362])

In [None]:
# your code here

array([0.72770452, 0.23030523])

In [None]:
# your code here

Unnamed: 0,explained_variance,explained_variance_cumulative
0,0.727705,0.727705
1,0.230305,0.95801
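One way to obtain numbers of this kind is sketched below. It assumes scikit-learn's bundled iris copy instead of `px.data.iris()`, so the printed values may differ slightly from the outputs shown above; `summary` is an illustrative name.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_demo = StandardScaler().fit_transform(load_iris().data)
pca_demo = PCA(n_components=2).fit(X_demo)

# Variance captured by each component, in the units of the scaled data.
print(pca_demo.explained_variance_)
# The same quantity expressed as a fraction of the total variance.
print(pca_demo.explained_variance_ratio_)

# Table with the ratio per component and its running total.
summary = pd.DataFrame({
    "explained_variance": pca_demo.explained_variance_ratio_,
    "explained_variance_cumulative": np.cumsum(pca_demo.explained_variance_ratio_),
})
print(summary)
```

Note that `explained_variance_` and `explained_variance_ratio_` answer slightly different questions: the first is an absolute variance, the second is normalized so that all components of a full PCA sum to 1.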


# Question: What are the explained variance and the cumulative explained variance per principal component if we fit a PCA with 4 components?

In [None]:
# your code here

# Question: Could you plot the explained variance ratio per component and decide what's the optimal number of components?

In [None]:
# your code here
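A plot of this kind is often called a scree plot. The sketch below is one possible layout, again assuming scikit-learn's bundled iris copy as a stand-in for `df`; the bar-plus-line design and the variable names are choices of this example, not a prescribed answer.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_demo = StandardScaler().fit_transform(load_iris().data)
pca_full = PCA(n_components=4).fit(X_demo)

ratios = pca_full.explained_variance_ratio_
cumulative = np.cumsum(ratios)
component_ids = np.arange(1, len(ratios) + 1)

# Bars show the share of each component, the line shows the running total.
fig, ax = plt.subplots()
ax.bar(component_ids, ratios, label="per component")
ax.plot(component_ids, cumulative, marker="o", color="black", label="cumulative")
ax.set_xticks(component_ids)
ax.set_xlabel("principal component")
ax.set_ylabel("explained variance ratio")
ax.legend()
```

A common heuristic is to look for the "elbow" where the bars flatten out, or for the point where the cumulative line crosses a chosen threshold.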

# Question: How many components does the PCA model need in order to capture at least 90% of the explained variance?

In [None]:
# your code here

0.9
1
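Two possible ways to find the number of components for a variance threshold are sketched here, assuming scikit-learn's bundled iris copy as a stand-in for `df`. Keep in mind that `np.argmax` returns a zero-based position, so the count of components is that position plus one.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_demo = StandardScaler().fit_transform(load_iris().data)
threshold = 0.9

# Option 1: fit all components, then find the first cumulative ratio >= threshold.
cumulative = np.cumsum(PCA().fit(X_demo).explained_variance_ratio_)
n_needed = int(np.argmax(cumulative >= threshold)) + 1  # +1: argmax is zero-based

# Option 2: let scikit-learn choose. A float n_components in (0, 1) is read as
# a target fraction of variance; the "full" SVD solver is set explicitly here.
pca_auto = PCA(n_components=threshold, svd_solver="full").fit(X_demo)

print(n_needed, pca_auto.n_components_)
```

Both options should agree on the number of components; the second is shorter, while the first makes the cumulative values available for inspection.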


# Question: What is the formula of the most important component? And of the second most important?
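Each principal component is a linear combination of the standardized features, and a fitted scikit-learn `PCA` exposes the weights in its `components_` attribute. The sketch below prints those weights as formulas; it assumes scikit-learn's bundled iris copy, and `loadings` is an illustrative name.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
X_demo = StandardScaler().fit_transform(load_iris().data)
pca_demo = PCA(n_components=2).fit(X_demo)

# Each row of components_ holds the weights of one component on the scaled features.
loadings = pd.DataFrame(pca_demo.components_, columns=features, index=["pc1", "pc2"])

# Print every component as a weighted sum of the (standardized) features.
for name, weights in loadings.iterrows():
    terms = " ".join(f"{w:+.3f}*{feature}" for feature, w in weights.items())
    print(f"{name} = {terms}")
```

The sign of a whole component is arbitrary, so a formula with every weight flipped describes the same direction of variation.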