# 00 - Principal Component Analysis
[Great PCA explanation](http://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues)

In [2]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D

from codefiles.datagen import random_xy, x_plus_noise, data_3d
from codefiles.dataplot import plot_principal_components, plot_3d, plot_2d
# %matplotlib inline
%matplotlib notebook

## PCA with Random 2D Data
We start with completely random data: generate a 2D dataset of 100 points.

In [4]:
data_random = random_xy(num_points=100)
plot_2d(data_random)


Initialize PCA. Recall that we don't need a target column, since PCA is an unsupervised technique.

In [5]:
pca_random = PCA()

Now, let's fit it to the random data.

In [6]:
pca_random.fit(data_random)

PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

Next, inspect the variance explained by each principal component. Does one axis carry significantly more variance than the other?

In [7]:
pca_random.explained_variance_

array([ 1.08901852,  0.96524561])

As we can see, there is not a huge difference in variance between the two axes, which is expected for uncorrelated random data. If we increase `num_points` in `random_xy()`, the two values move closer together.
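To check the claim about `num_points`, here is a minimal sketch. It does not use `random_xy()` (its implementation is not shown here). Instead it assumes the data are two independent standard-normal columns, which is only a guess at what `random_xy()` produces. The helper `variance_gap` is a hypothetical name introduced for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)


def variance_gap(num_points):
    """Difference between the two explained variances for random 2D data.

    Assumption: two independent standard-normal columns stand in for
    random_xy(), whose real distribution is not shown in this notebook.
    """
    data = rng.standard_normal(size=(num_points, 2))
    variances = PCA().fit(data).explained_variance_
    # explained_variance_ is sorted in decreasing order, so the gap is >= 0
    return variances[0] - variances[1]


for n in (100, 10_000, 100_000):
    print(n, variance_gap(n))
```

With more points, the sample covariance gets closer to the identity matrix, so the gap between the two values should shrink toward zero.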

## Correlated 2D Data
We'll now look at a correlated dataset. See [this animated GIF illustrating PCA](https://i.stack.imgur.com/lNHqt.gif).

In [8]:
# Correlated data
data_correlated = x_plus_noise(slope=1)
plot_2d(data_correlated)


Initialize PCA and fit it to the correlated data.

In [None]:
pca_correlated = PCA()
pca_correlated.fit(data_correlated)

In [None]:
pca_correlated.explained_variance_

Now one principal component has a significantly larger variance than the other. We can put this to use: for example, if we have or need to, we can keep only one dimension of the data without losing much information.
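The idea of keeping only one dimension can be sketched as follows. The data here are a hypothetical stand-in for `x_plus_noise(slope=1)`, built as `y = x + noise` with a noise level chosen arbitrarily. The real helper may generate its points differently.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Assumed stand-in for x_plus_noise(slope=1): y = x + small Gaussian noise
x = rng.standard_normal(200)
y = 1.0 * x + 0.1 * rng.standard_normal(200)
data = np.column_stack([x, y])

# Keep only the first principal component
pca_1d = PCA(n_components=1)
projected = pca_1d.fit_transform(data)           # shape (200, 1)
restored = pca_1d.inverse_transform(projected)   # back to shape (200, 2)

# Fraction of the total variance kept by the single component
print(pca_1d.explained_variance_ratio_)
# Mean squared error introduced by dropping the second component
print(np.mean((data - restored) ** 2))
```

When the explained variance ratio is close to 1, the one-dimensional projection keeps almost all of the information, and the reconstruction error stays small.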

**Hint**: try `x_plus_noise()` with `slope=-1` and compare the results.
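One way to explore the hint is sketched below. The function `fit_line_data` is a hypothetical replacement for `x_plus_noise()`, and its `noise` and `seed` parameters are assumptions made for illustration. It compares the direction of the first principal component for both slopes through the `components_` attribute.

```python
import numpy as np
from sklearn.decomposition import PCA


def fit_line_data(slope, num_points=500, noise=0.1, seed=0):
    """Fit PCA on y = slope * x + Gaussian noise.

    Hypothetical stand-in for x_plus_noise(); not the notebook's helper.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(num_points)
    y = slope * x + noise * rng.standard_normal(num_points)
    return PCA().fit(np.column_stack([x, y]))


pca_pos = fit_line_data(slope=1)
pca_neg = fit_line_data(slope=-1)

# First row of components_ is the direction of largest variance
print(pca_pos.components_[0], pca_pos.explained_variance_)
print(pca_neg.components_[0], pca_neg.explained_variance_)
```

Under these assumptions the explained variances stay roughly the same for both slopes, while the first component changes direction: its two coordinates share a sign for `slope=1` and have opposite signs for `slope=-1`.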