# Principal Component Analysis in Python

https://plot.ly/ipython-notebooks/principal-component-analysis/#shortcut--pca-in-scikitlearn

## Principal Component Analysis in 3 Simple Steps
PCA is a simple yet popular and useful transformation that is used in numerous applications, such as stock market predictions and the analysis of gene expression data. In this tutorial, we will see that PCA is not just a "black box", and we are going to unravel its internals in 3 basic steps.

## A Summary of the PCA Approach
1. Standardize the data.
2. Obtain the eigenvectors and eigenvalues from the covariance matrix or correlation matrix, or perform Singular Value Decomposition.
3. Sort the eigenvalues in descending order and choose the $k$ eigenvectors that correspond to the $k$ largest eigenvalues, where $k$ is the number of dimensions of the new feature subspace ($k \le d$).
4. Construct the projection matrix $W$ from the selected $k$ eigenvectors.
5. Transform the original dataset $X$ via $W$ to obtain a $k$-dimensional feature subspace $Y$.
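
The steps listed above can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions: the toy matrix `X_toy`, the choice `k = 2`, and all variable names are hypothetical and do not come from the tutorial, and the covariance-matrix route is used rather than the SVD alternative.

```python
import numpy as np

# Hypothetical toy data: 6 samples, d = 3 features (not the Iris data).
X_toy = np.array([
    [2.5, 2.4, 0.5],
    [0.5, 0.7, 1.9],
    [2.2, 2.9, 0.4],
    [1.9, 2.2, 0.8],
    [3.1, 3.0, 0.1],
    [2.3, 2.7, 0.6],
])
k = 2  # dimensionality of the new feature subspace, k <= d

# Step 1: standardize each feature to zero mean and unit variance.
X_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# Step 2: eigendecomposition of the covariance matrix.
# The covariance matrix is symmetric, so np.linalg.eigh is applicable.
cov = np.cov(X_std, rowvar=False)
eig_vals, eig_vecs = np.linalg.eigh(cov)

# Step 3: sort eigenvalues (and matching eigenvectors) in descending order.
order = np.argsort(eig_vals)[::-1]
eig_vals = eig_vals[order]
eig_vecs = eig_vecs[:, order]

# Step 4: projection matrix W (d x k) from the top-k eigenvectors.
W = eig_vecs[:, :k]

# Step 5: project the standardized data onto the k-dimensional subspace.
Y = X_std @ W

print(Y.shape)
```

The columns of `eig_vecs` are the eigenvectors, which is why the sort reorders columns rather than rows.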

## Preparing the Iris Dataset
The Iris dataset contains measurements for 150 iris flowers from three different species.

The three classes in the Iris dataset are:

1. Iris-setosa (n = 50)
2. Iris-versicolor (n = 50)
3. Iris-virginica (n = 50)

And the four features of the Iris dataset are:

1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm


## Loading the Dataset

In order to load the Iris data directly from the UCI repository, we are going to use the superb pandas library.

In [2]:
import pandas as pd

df = pd.read_csv(
    filepath_or_buffer='http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
    header=None,
    sep=',')

df.columns = ['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
df.dropna(how='all', inplace=True)  # drops the empty line at file-end

df.tail()

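|     | sepal_len | sepal_wid | petal_len | petal_wid | class          |
|-----|-----------|-----------|-----------|-----------|----------------|
| 145 | 6.7       | 3.0       | 5.2       | 2.3       | Iris-virginica |
| 146 | 6.3       | 2.5       | 5.0       | 1.9       | Iris-virginica |
| 147 | 6.5       | 3.0       | 5.2       | 2.0       | Iris-virginica |
| 148 | 6.2       | 3.4       | 5.4       | 2.3       | Iris-virginica |
| 149 | 5.9       | 3.0       | 5.1       | 1.8       | Iris-virginica |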
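
The same loading pattern can be tried without network access by reading from an in-memory buffer. The sketch below is an illustration only: the string `raw` holds three of the rows shown above followed by a blank line, and the names `raw` and `df_demo` are hypothetical.

```python
import io

import pandas as pd

# Stand-in for the remote file: three rows plus a trailing blank line.
raw = (
    "6.5,3.0,5.2,2.0,Iris-virginica\n"
    "6.2,3.4,5.4,2.3,Iris-virginica\n"
    "5.9,3.0,5.1,1.8,Iris-virginica\n"
    "\n"
)

# read_csv accepts any file-like object as well as a URL or a path.
df_demo = pd.read_csv(io.StringIO(raw), header=None, sep=',')
df_demo.columns = ['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']

# With the default skip_blank_lines=True the blank line is already skipped,
# so dropping all-NaN rows acts as a safeguard here.
df_demo.dropna(how='all', inplace=True)

print(df_demo)
```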


In [4]:
# split data table into data X and class labels y

X = df.iloc[:,0:4].values
y = df.iloc[:,4].values

Our Iris dataset is now stored in the form of a $150 \times 4$ matrix, where the columns are the different features and each row represents a separate flower sample. Each sample row $\mathbf{x}$ can be pictured as a 4-dimensional vector.
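
To make the shapes concrete, here is a small sketch of the same split on a three-row stand-in table. The frame `df_small` and the names `X_small`, `y_small`, and `x0` are hypothetical; with the full dataset, the feature matrix would have 150 rows instead of 3.

```python
import pandas as pd

# Hypothetical stand-in for df: three rows taken from the tail shown earlier.
df_small = pd.DataFrame(
    [[6.5, 3.0, 5.2, 2.0, 'Iris-virginica'],
     [6.2, 3.4, 5.4, 2.3, 'Iris-virginica'],
     [5.9, 3.0, 5.1, 1.8, 'Iris-virginica']],
    columns=['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class'],
)

# Same slicing as above: first four columns are features, the fifth is the label.
X_small = df_small.iloc[:, 0:4].values
y_small = df_small.iloc[:, 4].values

# One sample row is a 4-dimensional feature vector.
x0 = X_small[0]

print(X_small.shape, y_small.shape, x0)
```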

## Exploratory Visualization
To get a feeling for how the 3 different flower classes are distributed along the 4 different features, let us visualize them via histograms.
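
Independently of any plotting library, the numbers behind such per-class histograms can be sketched with NumPy. The values and labels below are made up for illustration and are not the real Iris measurements; the names `values`, `labels`, `bin_edges`, and `counts` are hypothetical.

```python
import numpy as np

# Hypothetical measurements of a single feature, three samples per class.
values = np.array([5.1, 4.9, 5.0, 6.4, 6.9, 6.5, 6.3, 7.1, 6.7])
labels = np.array(
    ['Iris-setosa'] * 3 + ['Iris-versicolor'] * 3 + ['Iris-virginica'] * 3
)

# Shared bin edges so the per-class histograms are directly comparable.
bin_edges = np.linspace(values.min(), values.max(), num=5)  # 4 bins

# Count how many samples of each class fall into each bin.
counts = {}
for name in np.unique(labels):
    counts[name], _ = np.histogram(values[labels == name], bins=bin_edges)

for name, c in counts.items():
    print(name, c)
```

Using one set of bin edges for all classes keeps the bars aligned when the counts are later drawn on the same axes.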

In [6]:
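# Requires the plotly package to be installed (for example via pip install plotly).
# Note: from plotly 4 onward, the plotly.plotly module lives in the separate
# chart-studio package as chart_studio.plotly.
import plotly.plotly as py
from plotly.graph_objs import *
import plotly.tools as tls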

In [None]:
# plotting histograms

traces = []

legend = {0:False, 1:False, 2:False, 3:True}

