### Principal Components Analysis

PCA is a technique that reduces the variables in a dataset to a smaller set that retains most of the information in the originals. In doing so, it turns a set of correlated variables into a smaller set of uncorrelated variables.

PCA finds variables that share variance and creates a new variable that represents that shared variance. Some of the initial variance is lost in the process, which is important to keep in mind: if we are building an explainable model, we may want to examine exactly which variance is being left out. By the same token, reduction techniques can help the end user by surfacing only the few most important things to focus on (say, the top 3). Fewer features may also reduce overfitting. The goal is to explain the maximum amount of variance with the fewest components.
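To make the idea of "lost variance" concrete, here is a minimal sketch using plain numpy on synthetic data (the variables `a` and `b` are made up for illustration and are not part of the Boston data). Two correlated variables are replaced by one new variable, and the variance that cannot be recovered is measured by reconstructing the original data from that single variable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two strongly correlated variables: b is a plus a little noise (synthetic data).
a = rng.normal(size=500)
b = a + 0.2 * rng.normal(size=500)
data = np.column_stack([a, b])

# Center the data, then find the direction of greatest shared variance.
mean = data.mean(axis=0)
centered = data - mean
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))

# eigh returns eigenvalues in ascending order, so the last column is the top component.
top = eigenvectors[:, -1]
scores = centered @ top  # the single new variable that replaces a and b

# Rebuild the two original variables from the single new one.
reconstructed = np.outer(scores, top) + mean
residual_var = ((data - reconstructed) ** 2).sum() / (len(data) - 1)

retained = eigenvalues[-1] / eigenvalues.sum()
print(f"variance retained: {retained:.3f}")
print(f"variance lost:     {residual_var / eigenvalues.sum():.3f}")
```

The variance lost in the reconstruction equals the eigenvalue of the discarded component, which is why the explained variance ratio is a direct measure of how much information a reduced representation keeps.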

In [107]:
import numpy as np
import pandas as pd
import sklearn.datasets as ds
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

In [145]:
# Note: load_boston was removed in scikit-learn 1.2; this cell needs an earlier version.
boston = ds.load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = pd.DataFrame(boston.target, columns=['MEDV'])

Column Descriptions:

- CRIM    - per capita crime rate by town
- ZN      - proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS   - proportion of non-retail business acres per town
- CHAS    - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
- NOX     - nitric oxides concentration (parts per 10 million)
- RM      - average number of rooms per dwelling
- AGE     - proportion of owner-occupied units built prior to 1940
- DIS     - weighted distances to five Boston employment centres
- RAD     - index of accessibility to radial highways
- TAX     - full-value property-tax rate per $10,000
- PTRATIO - pupil-teacher ratio by town
- B       - 1000(Bk - 0.63)^2 where Bk is the proportion of Black residents by town
- LSTAT   - percentage of the population of lower status
- MEDV    - median value of owner-occupied homes in $1000s

The Boston dataset supports two prototypical prediction tasks: nitric oxides concentration (NOX) and median home price (MEDV).

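PCA works from means and covariances alone, so it summarizes the data most completely when the variables are roughly normally distributed and linearly related. It does not strictly require normality, but heavy skew or outliers can distort the components.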

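Below is a preview of PCA using sklearn. The explained variance drops off quickly: the first component accounts for about 81% of the variance and the second for about 16%. Note that X is not standardized here, so features measured on large scales dominate the components.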

In [146]:
pca = PCA(n_components=5)
pca.fit(X);
print(pca.explained_variance_ratio_)

[0.80582318 0.16305197 0.02134861 0.00695699 0.00129995]
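A common way to read these ratios is cumulatively. The sketch below takes the five values printed above and finds the smallest number of components that reaches a chosen share of the variance. The 95% threshold is an arbitrary cutoff picked for illustration, not a rule.

```python
import numpy as np

# Explained-variance ratios copied from the printed output above.
ratios = np.array([0.80582318, 0.16305197, 0.02134861, 0.00695699, 0.00129995])

# Running total of variance explained by the first 1, 2, 3, ... components.
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.95  # hypothetical cutoff, chosen only for illustration
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

print(cumulative.round(4))
print("components needed:", n_keep)
```

With these values the first two components already pass the threshold. scikit-learn's `PCA` also accepts a fraction between 0 and 1 for `n_components` and selects the count in a similar way, though the exact behavior depends on the solver and version.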


Here we keep only 2 components, which together retain about 97% of the variance.

In [147]:
pca = PCA(n_components=2)
pca.fit(X);
print(pca.explained_variance_ratio_)

[0.80582318 0.16305197]


Below is a slice of the Pearson correlation matrix showing how CRIM (per capita crime rate by town) correlates with ZN (proportion of residential land zoned for lots over 25,000 sq.ft.) and with INDUS (proportion of non-retail business acres per town).

In [149]:
X.corr(method='pearson').iloc[:3,:1]

|       | CRIM      |
|-------|-----------|
| CRIM  | 1.0       |
| ZN    | -0.200469 |
| INDUS | 0.406583  |


A correlation matrix is a covariance matrix in which each covariance has been divided by the product of the two variables' standard deviations.

$$var(x)=\frac{\sum(x_i-\bar{x})^2}{n}$$

$$cov(x,y)=\frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{n}$$
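As a quick check of how variance, covariance and correlation relate, the sketch below computes each quantity by hand with the $n$ denominator and compares the results with numpy's built-ins. The arrays `x` and `y` are synthetic and exist only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic pair of related variables, used only for illustration.
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)
n = len(x)

# Variance and covariance written out with the n denominator.
var_x = ((x - x.mean()) ** 2).sum() / n
var_y = ((y - y.mean()) ** 2).sum() / n
cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / n

# Correlation is the covariance divided by the product of the standard deviations.
corr_xy = cov_xy / np.sqrt(var_x * var_y)

print("by hand :", round(corr_xy, 6))
print("numpy   :", round(np.corrcoef(x, y)[0, 1], 6))
```

The choice between $n$ and $n-1$ in the denominator cancels out in the correlation, so the hand-computed value agrees with `np.corrcoef` either way.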

In [151]:
scaler = StandardScaler()

# StandardScaler computes (x - mean) / std, using each variable's own mean and std.
# fit_transform learns those statistics, then centers and scales the data,
# so every variable ends up with a mean of 0 and a std of 1.
x = scaler.fit_transform(X)
xt = x.T  # np.cov treats each row as a variable by default (rowvar=True)
Cx = np.cov(xt)  # divides by n - 1, so the diagonal is n / (n - 1), slightly above 1
print('First two rows of a covariance matrix: \n \n', Cx[:2])

First two rows of a covariance matrix: 
 
 [[ 1.0019802  -0.20086619  0.40738853 -0.05600226  0.42180532 -0.21968085
   0.35343273 -0.38042191  0.62674377  0.5839183   0.29051973 -0.38582644
   0.4565237 ]
 [-0.20086619  1.0019802  -0.53488527 -0.04278127 -0.51762669  0.31260839
  -0.57066514  0.66572388 -0.31256554 -0.31518622 -0.39245415  0.17586788
  -0.41381239]]
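The diagonal entries above are 1.0019802 rather than exactly 1. The sketch below reproduces that effect on synthetic data, under the assumption that the scaling divides by the population standard deviation (as `StandardScaler` does) while `np.cov` divides by $n-1$ by default. The array `raw` is made up for illustration; only the row count of 506 mirrors the Boston data.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 506  # same number of rows as the Boston data

# Synthetic two-feature data on very different scales.
raw = rng.normal(loc=[10.0, -3.0], scale=[5.0, 0.5], size=(n, 2))

# Standardize with the population std (ddof=0).
z = (raw - raw.mean(axis=0)) / raw.std(axis=0)

cov_sample = np.cov(z, rowvar=False)                 # divides by n - 1
cov_population = np.cov(z, rowvar=False, bias=True)  # divides by n

print("diagonal with n - 1:", cov_sample[0, 0])
print("n / (n - 1):        ", n / (n - 1))
```

With the $n$ denominator the covariance matrix of the standardized data is exactly the correlation matrix of the raw data, so the two differ only by the constant factor $n/(n-1)$.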


The first step in PCA is to center the data matrix X. In this case our matrix X is the dataframe X with n features. Centering is done by subtracting each feature's mean from every value of that feature, so that the components describe variation around the mean rather than the location of the mean itself.

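Next, calculate the covariance matrix of the centered data and derive its eigenvalues and eigenvectors. The eigenvectors give the directions of the principal components, and the eigenvalues give the variance along each direction. Finally, apply the linear transformation by projecting the centered data onto the leading eigenvectors.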
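The steps can be sketched end to end with plain numpy. This is an illustrative walk-through on synthetic data (the `data` array and the choice of `k = 2` are assumptions made for the example), not the code path scikit-learn uses.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data: 300 samples, 3 features, the first two strongly correlated.
base = rng.normal(size=(300, 1))
data = np.hstack([
    base + 0.3 * rng.normal(size=(300, 1)),
    2.0 * base + 0.3 * rng.normal(size=(300, 1)),
    rng.normal(size=(300, 1)),
])

# Step 1: center each feature on its own mean.
centered = data - data.mean(axis=0)

# Step 2: covariance matrix of the centered features.
cov = np.cov(centered, rowvar=False)

# Step 3: eigenvalues and eigenvectors; eigh is meant for symmetric matrices
# and returns eigenvalues in ascending order, so sort them descending.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Step 4: the linear transformation, projecting onto the top k eigenvectors.
k = 2
projected = centered @ eigenvectors[:, :k]

explained_variance_ratio = eigenvalues / eigenvalues.sum()
print(explained_variance_ratio.round(3))
print(np.cov(projected, rowvar=False).round(6))
```

The covariance matrix of `projected` is diagonal, with the top eigenvalues on the diagonal: the new variables are uncorrelated, and each one's variance is the eigenvalue of its component.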

In [None]:
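# Eigenvalues and eigenvectors of the covariance matrix Cx computed earlier
eigenvalues, eigenvectors = np.linalg.eigh(Cx)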

In [67]:
This particular implementation uses Singular Value Decomposition (SVD).
The input data is centered but not scaled for each feature before applying the SVD.

It uses the LAPACK implementation of the full SVD or a randomized truncated
SVD by the method of Halko et al. 2009, depending on the shape of the input
data and the number of components to extract.
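The link between the SVD route and the covariance route can be checked directly. The sketch below is a plain numpy illustration on synthetic data, not scikit-learn's internal implementation: it centers the data without scaling, takes a full SVD, and compares the resulting variances with the eigenvalues of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic data with correlated features (random mixing of independent columns).
data = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))

# Center each feature, but do not scale it.
centered = data - data.mean(axis=0)

# Full SVD of the centered data: centered = U @ diag(S) @ Vt.
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

# Squared singular values divided by (n - 1) are the variances along each component.
explained_variance = S**2 / (len(data) - 1)

# The same variances from the eigenvalues of the covariance matrix (sorted descending).
eigenvalues = np.linalg.eigvalsh(np.cov(centered, rowvar=False))[::-1]

print(explained_variance.round(4))
print(eigenvalues.round(4))
```

The rows of `Vt` play the role of the eigenvectors, so the SVD yields the same components without ever forming the covariance matrix explicitly.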


Init signature:
PCA(
    n_components=None,
    copy=True,
    whiten=False,
    svd_solver='auto',
    tol=0.0,
    iterated_power='auto',
    random_state=None,
)
Docstring:
Principal component analysis (PCA)

Linear dimensionality reduction using Singular Value Decomposition of the
data to project it to a lower dimensional space. The input data is centered
but not scaled for each feature before applying the SVD.

It uses the LAPACK implementation of the full SVD or a randomized truncated
SVD by the method of Halko et al. 2009, depend