### PCA

#### Theory

PCA is solving the equation 𝐴𝑥⃗ =𝜆𝑥⃗ , for some scalar 𝜆.  Another way of writing this equation to make the linear algebra clearer is (𝐴−𝜆𝐼)𝑥⃗ =0

IOW: "What is the set of vectors which, when I multiply them by my matrix of data, results in that same set of vectors just moved in space by a scalar distance"

The idea is that we want to transform our data into a set of mutually-orthoganal vectors which still capture most of the linear variance in the data.  Why?

- Dimensionally reduction.  As dimensions increase, observations get further apart and distance calculations, which a lot of techinuqes rely on, get less effective.  Picture trying to cluster 100 data points in 2 dimensions.  Now picture clustering 100 pts in 500.  By reducing dimensions, we're able to avoid the complications that come with spreading points further and further out.  

- By making the features in our data orthogonal, we're ensuring that values for observations in one dimension aren't dependent on information in another dimension.  This is a large advantage in constructing models, since information in a given feature is soley about that feature, and not incorporating information from other features; models have an easier time distinguishing what information is coming from which feature.  (Think back to regression and why multicollinearity is a problem.) When we go to predict off of orthogonal features, our techniques are able to make better predictions.  

𝑥⃗ is a matrix of "eigenvectors", which we interpret as "principal components", the vectors along which there is linear variance in the data.  

The first principal component is the vector along which the data varies the most.  The second principal component is the vector along which the data varies the second most and is orthogonal to the first component.  The third princpal component is the vector along which the data varies third most and is orthogonal to the first two, etc. 

![pca](img/pca.png)

Eigenvalues, 𝜆, are stored in a matrix, the diagonal of which contain the eigenvalues: the specific values which apply to the equation in the first line. 

#### Code

In [80]:
import pandas as pd
import numpy as np

from sklearn.datasets import load_breast_cancer

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.decomposition import PCA

from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import classification_report

import matplotlib.pyplot as plt
%matplotlib inline

In [104]:
med = load_breast_cancer()

In [112]:
X = med.data
y = med.target
columns = med.feature_names
df = pd.DataFrame(X)
df.columns = columns
df['Target'] = y
df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,Target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


As a demonstration of the power of PCA, let's only use a subset of this dataset for prediction

In [113]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=666
)

In [114]:
X_train = pd.DataFrame(X_train)
X_test = pd.DataFrame(X_test)

In [131]:
X_train_reduce = X_train.iloc[:, [x for x in range(0,10)]]
X_test_reduce = X_test.iloc[:, [x for x in range(0,10)]]

### Fitting a PCA Model

In [132]:
ss = StandardScaler()

pca = PCA(n_components=10)

X_train_red_sc = ss.fit_transform(X_train_reduce)
X_test_red_sc = ss.transform(X_test_reduce)

X_train_red_sc_pca = pca.fit_transform(X_train_red_sc)
X_test_red_sc_pca = pca.transform(X_test_red_sc)

X_train_red_sc_pca[0:10]

array([[-1.99053793e+00, -1.70143967e+00, -2.47597397e-01,
         1.47112470e-01,  1.93148609e-01,  1.26430575e-02,
         6.84490107e-02,  1.22296147e-01, -7.67397537e-02,
        -2.66818546e-03],
       [-7.63064730e-01,  1.11867073e-01, -7.83578371e-01,
         1.71429573e-01,  2.92828059e-01,  2.66144861e-01,
         1.34270451e-01,  1.82316074e-01, -9.04842935e-02,
         1.22503707e-02],
       [-1.26847786e+00, -2.25054068e+00,  1.85928194e+00,
         6.29731566e-01,  4.06740285e-01, -1.05887082e-01,
         1.44728864e-02,  6.24524220e-02, -7.83116799e-02,
         3.78208270e-03],
       [-1.79707915e+00,  6.00103394e-01, -1.21427501e+00,
         5.08872029e-02, -1.20241649e-01,  1.60625908e-01,
        -4.25335614e-02, -1.06401046e-01,  6.49602012e-02,
        -4.95611384e-04],
       [-2.10405333e+00, -9.68323155e-01, -6.55193987e-01,
         1.12438700e-01,  1.04281049e-01, -3.82307884e-01,
        -4.72516804e-02,  2.51976937e-01, -6.55108712e-02,
         4.

### Inspecting the Explained Variance of the Principal Components

Remember that PCA decomposes the original dataset into principal components which attempt to encapsulate the maximum amount of information as defined by the maximum variance across observations. With this, it is useful to investigate how much variance in the dataset is accounted for in the first $n$ components. 

While you will have the same number of principal components as you have original features to account for all of the variance in a dataset (assuming you don't have redundant features), the first few principal components will typically account for the vast majority of the variance.

sci-kit learn makes this very easy using the expalined_variance_ratio_ attribute of the instantiated PCA model object.

In [133]:
pca.explained_variance_ratio_

array([5.44962598e-01, 2.47715572e-01, 8.93516414e-02, 5.28058038e-02,
       3.93907594e-02, 1.21594806e-02, 8.61669610e-03, 3.73925749e-03,
       1.23093317e-03, 2.72580647e-05])

In [134]:
pca.explained_variance_ratio_.cumsum()

array([0.5449626 , 0.79267817, 0.88202981, 0.93483562, 0.97422637,
       0.98638586, 0.99500255, 0.99874181, 0.99997274, 1.        ])

Looks like we can grab a lot of linear variance in the data with only the first four components

In [129]:
X_train_red_sc_pca = pd.DataFrame(X_train_red_sc_pca)
X_train_red_sc_pca_4 = X_train_red_sc_pca.iloc[:, 0:5]

X_test_red_sc_pca = pd.DataFrame(X_test_red_sc_pca)
X_test_red_sc_pca_4 = X_test_red_sc_pca.iloc[:, 0:5]

### PCA rubber-to-the-road metric comparison

In [130]:
knn = KNeighborsClassifier()
knn_pca = KNeighborsClassifier()

knn.fit(X_train_red_sc, y_train)
preds = knn.predict(X_test_red_sc)

knn_pca.fit(X_train_red_sc_pca_4, y_train)
preds_pca = knn_pca.predict(X_test_red_sc_pca_4)

print(classification_report(y_test, preds))
print(classification_report(y_test, preds_pca))

              precision    recall  f1-score   support

           0       0.88      0.93      0.90        56
           1       0.95      0.92      0.94        87

    accuracy                           0.92       143
   macro avg       0.92      0.92      0.92       143
weighted avg       0.92      0.92      0.92       143

              precision    recall  f1-score   support

           0       0.88      0.93      0.90        56
           1       0.95      0.92      0.94        87

    accuracy                           0.92       143
   macro avg       0.92      0.92      0.92       143
weighted avg       0.92      0.92      0.92       143



### For the interested, here's how to calculate principal components manually

What follows is indebted to Sebastian Raschka (http://sebastianraschka.com/Articles/2015_pca_in_3_steps.html#pca-vs-lda).

Unnamed: 0,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,...,PC21,PC22,PC23,PC24,PC25,PC26,PC27,PC28,PC29,PC30
0,1160.142574,-293.917544,48.578398,-8.711975,32.000486,1.265415,0.931337,0.148167,0.745463,0.589359,...,0.021189,0.000241,0.002528,0.011560,0.005773,0.001377,-0.001982,0.001293,0.001989,0.000704
1,1269.122443,15.630182,-35.394534,17.861283,-4.334874,-0.225872,-0.046037,0.200804,-0.485828,-0.084035,...,0.005237,0.021069,0.001565,0.006968,-0.006978,0.001411,-0.000083,-0.001347,0.000686,-0.001061
2,995.793889,39.156743,-1.709753,4.199340,-0.466529,-2.652811,-0.779745,-0.274026,-0.173874,-0.186994,...,-0.009865,-0.002394,-0.004125,-0.004007,0.000709,-0.003781,0.000178,0.000018,-0.000775,0.000405
3,-407.180803,-67.380320,8.672848,-11.759867,7.115461,1.299436,-1.267304,-0.060555,-0.330639,-0.144155,...,0.011169,0.007063,0.001537,0.007003,-0.010261,-0.002899,0.000016,0.001369,-0.002139,-0.001657
4,930.341180,189.340742,1.374801,8.499183,7.613289,1.021160,-0.335522,0.289109,0.036087,-0.138502,...,-0.009916,0.010269,0.002204,0.002764,0.002455,0.001665,0.003290,0.000273,0.001783,0.000327
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,1414.126684,110.222492,40.065944,6.562240,-5.102856,-0.395424,-0.786751,0.037082,-0.452530,-0.235185,...,-0.017214,0.007864,-0.002317,-0.002384,-0.003637,-0.008211,0.002418,0.001234,-0.000078,-0.000455
565,1045.018854,77.057589,0.036669,-4.753245,-12.417863,-0.059637,0.449831,0.509154,-0.449986,0.493247,...,0.011219,-0.001905,-0.003028,-0.007931,0.002905,-0.002519,0.000212,0.001006,-0.000621,-0.000741
566,314.501756,47.553525,-10.442407,-9.771881,-6.156213,-0.870726,-2.166493,-0.442279,-0.097398,-0.144667,...,-0.003362,-0.002249,-0.001248,-0.003927,-0.000921,0.000573,-0.001325,0.000025,0.000484,-0.000285
567,1124.858115,34.129225,-19.742087,-23.660881,3.565133,4.086390,-1.705401,-0.359964,0.385030,0.615467,...,-0.006130,-0.010804,0.005841,0.001127,-0.002646,0.001862,0.002698,0.001235,-0.000809,0.001217


In [None]:
# We'll start by producing the covariance matrix for the columns of X_tr_sc.

cov_mat = np.cov(X_scaled, rowvar=False)
cov_mat.shape

In [None]:
np.linalg.eig(cov_mat)

In [None]:
# Let's assign the results of eig(cov_mat) to a double of variables.

eigvals, eigvecs = np.linalg.eig(cov_mat)

In [None]:
# The columns of "eigvecs" are the eigenvectors!

eigvecs

In [None]:
# The eigenvectors of the covariance matrix are our principal components.
# Let's look at the first three.

pcabh = np.vstack([row[:3].reshape(1, 3) for row in eigvecs])

Now, to transform our data points into the space defined by the principal components, we simply need to compute the dot-product of X_scaled with those principal components.

Why? Think about what this matrix product looks like:

We take a row of X_scaled and multiply it by a column of pcabh, pairwise. The row of X_scaled represents the values for the columns in the original space. The column of pcabh represents the weights we need on each of the original columns in order to transform a value into principal-component space. And so the product of these two matrices will be each row, transformed into principal-component space!

In [None]:
X_scaled.dot(pcabh)

In [None]:
# Naturally, sklearn has a shortcut for this!

pca = PCA(n_components=3)                       # Check out how `n_components` works

X_new = pca.fit_transform(X_scaled)

In [None]:
# Let's check out the explained variance

pca.explained_variance_

In [None]:
# The ratio is often more informative

pca.explained_variance_ratio_

In [20]:
# We can also check out the Principal Components themselves

pca.components_

array([[ 0.36138659, -0.08452251,  0.85667061,  0.3582892 ],
       [ 0.65658877,  0.73016143, -0.17337266, -0.07548102],
       [-0.58202985,  0.59791083,  0.07623608,  0.54583143],
       [-0.31548719,  0.3197231 ,  0.47983899, -0.75365743]])