# Principal Component Analysis

Real-world data with many dimensions can often be reduced to fewer dimensions without losing much of its variance. The reduction is only worthwhile if it makes the downstream analysis (modeling and prediction) simpler.

Principal component analysis (PCA) searches for the axis that preserves the maximum amount of variance when the data is projected onto it. Each subsequent axis is chosen orthogonal to the previous ones and preserves the most remaining variance.
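As a rough illustration of this definition, the sketch below finds the axis of maximum variance for a small toy dataset using NumPy's SVD. The toy data, the generator seed, and the names `data`, `first_axis`, and `var_along_axis` are assumptions made for this example only; they are not part of the notebook that follows.

```python
import numpy as np

# Hypothetical toy data: 200 points stretched along the direction (1, 1)
rng = np.random.default_rng(0)
t = rng.normal(size=200)
data = np.column_stack([t, t]) + 0.1 * rng.normal(size=(200, 2))

# PCA works on centered data
centered = data - data.mean(axis=0)

# Rows of Vt are unit-length directions, ordered by decreasing singular value
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
first_axis = Vt[0]

# Variance of the data after projecting onto the first axis
projected = centered @ first_axis
var_along_axis = projected.var(ddof=1)
total_var = centered.var(axis=0, ddof=1).sum()

print("first axis:", first_axis)
print("fraction of variance preserved:", var_along_axis / total_var)
```

The first axis should point roughly along (1, 1), up to sign, because that is the direction in which the toy data is most spread out.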


In [1]:
import numpy as np

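We generate a 3-D dataset using NumPy's random number generator. The third coordinate is a noisy linear combination of the first two, so the points lie close to a plane.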

In [2]:
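np.random.seed(42)  # fix the seed so the results are reproducible
m = 50              # number of samples
w1, w2 = 0.1, 0.2   # weights that define the third coordinate
noise = 0.08        # noise scale

# Random angles that parameterize a curve in the first two coordinates
angles = np.random.rand(m) * 3 * np.pi / 2 - 0.5
X = np.empty((m, 3))
X[:, 0] = np.cos(angles) + np.sin(angles) / 2 + noise * np.random.randn(m) / 2
X[:, 1] = np.sin(angles) * 0.7 + noise * np.random.randn(m) / 2
# Third coordinate: noisy linear combination of the first two
X[:, 2] = X[:, 0] * w1 + X[:, 1] * w2 + noise * np.random.randn(m)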

In [3]:
X[:5]

array([[ 0.80741216,  0.67140388,  0.23321879],
       [-1.03352726, -0.48182307, -0.09514592],
       [-0.89073528,  0.10559841, -0.19655251],
       [-0.32817664,  0.49892524,  0.08173809],
       [ 1.02985081,  0.14745588,  0.15326688]])

Project the data onto 2-D using the PCA class from sklearn.

In [4]:
from sklearn.decomposition import PCA

In [5]:
pca = PCA(n_components=2)
X2D = pca.fit_transform(X)

In [6]:
X2D[:5]

array([[-0.67694888, -0.32205002],
       [ 1.39205431,  0.40816528],
       [ 1.14596704, -0.09949586],
       [ 0.47929492, -0.39518628],
       [-0.77095747,  0.24484606]])
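To make the projection step less of a black box, the sketch below reproduces it by hand from two attributes of the fitted object, `mean_` and `components_`. It assumes the default, non-whitened PCA and rebuilds the dataset so that the block runs on its own; `X2D_manual` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Rebuild the dataset from the cells above so this block is self-contained
np.random.seed(42)
m = 50
w1, w2 = 0.1, 0.2
noise = 0.08
angles = np.random.rand(m) * 3 * np.pi / 2 - 0.5
X = np.empty((m, 3))
X[:, 0] = np.cos(angles) + np.sin(angles) / 2 + noise * np.random.randn(m) / 2
X[:, 1] = np.sin(angles) * 0.7 + noise * np.random.randn(m) / 2
X[:, 2] = X[:, 0] * w1 + X[:, 1] * w2 + noise * np.random.randn(m)

pca = PCA(n_components=2)
X2D = pca.fit_transform(X)

# Manual projection: center with the stored mean, then take dot products
# with the two principal axes (the rows of components_)
X2D_manual = (X - pca.mean_) @ pca.components_.T

print(np.allclose(X2D, X2D_manual))
```

Each row of `components_` is one principal axis expressed in the original 3-D coordinates, so the 2-D coordinates are the centered points' coordinates along those axes.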

To see how much information is lost, we map X2D back into 3-D and compute the mean squared distance between the reconstructed points and the original X.

In [7]:
X3D_recon = pca.inverse_transform(X2D)

In [8]:
X3D_recon[:5]

array([[ 0.80721491,  0.67097429,  0.23514824],
       [-1.02247989, -0.4577623 , -0.20321149],
       [-0.90348859,  0.07782218, -0.07179946],
       [-0.32808349,  0.49912811,  0.08082695],
       [ 1.03100039,  0.14995962,  0.14202171]])

In [9]:
np.mean(np.sum(np.square(X3D_recon - X), axis=1))

0.005126201193575271

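The reconstruction error is small: the mean squared distance is about 0.005. This is an absolute value in the squared units of the data, not a percentage of the variance.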

The fitted PCA object stores related information: explained_variance_ratio_ gives the fraction of the total variance that lies along each principal component.

In [10]:
pca.explained_variance_ratio_

array([0.8469243 , 0.14723277])

In [11]:
1 - sum(pca.explained_variance_ratio_)

0.005842925909389063
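About 0.58% of the variance is lost by dropping the third component. This fraction and the reconstruction error computed earlier are two views of the same quantity: dividing the mean squared reconstruction error by the mean squared distance of the points from their mean should give the same fraction. The sketch below checks this with a plain NumPy SVD instead of sklearn's PCA, so it is an independent cross-check under that substitution rather than a reproduction of sklearn's internals. The names `lost_from_mse` and `lost_from_ratio` are introduced here for illustration.

```python
import numpy as np

# Same data generation as in the cells above
np.random.seed(42)
m = 50
w1, w2 = 0.1, 0.2
noise = 0.08
angles = np.random.rand(m) * 3 * np.pi / 2 - 0.5
X = np.empty((m, 3))
X[:, 0] = np.cos(angles) + np.sin(angles) / 2 + noise * np.random.randn(m) / 2
X[:, 1] = np.sin(angles) * 0.7 + noise * np.random.randn(m) / 2
X[:, 2] = X[:, 0] * w1 + X[:, 1] * w2 + noise * np.random.randn(m)

# PCA via SVD of the centered data
mean = X.mean(axis=0)
Xc = X - mean
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
W2 = Vt[:2].T  # first two principal axes as columns, shape (3, 2)

# Project to 2-D, then map back to 3-D
X_recon = (Xc @ W2) @ W2.T + mean

# Reconstruction error relative to the total spread around the mean
mse = np.mean(np.sum(np.square(X_recon - X), axis=1))
total = np.mean(np.sum(np.square(Xc), axis=1))
lost_from_mse = mse / total

# Fraction of variance not captured by the first two components
variance_ratio = s**2 / np.sum(s**2)
lost_from_ratio = 1 - variance_ratio[:2].sum()

print(lost_from_mse, lost_from_ratio)
```

The two printed values should agree up to floating-point rounding, since the squared singular values measure how much of the spread lies along each axis.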