<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# PCA: Extra Practice

_Authors: Joseph Nelson (DC), Matt Brems (DC)_

---

In this lab, we will practice the PCA Process:

- Load data.
- Standardize it (center each feature at 0 and scale it to unit variance).
- Compute the covariance matrix of the standardized data.
- Compute eigenvalues/eigenvectors.
- Decide how much explained variance you want in your final model. Select the number of principal components needed to explain that amount of variance.
- Keep the eigenvectors whose eigenvalues account for that variance.
- Go back and multiply the standardized data by the eigenvectors of the selected principal components.
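
The steps above can be sketched end to end in NumPy. This is a minimal illustration on synthetic data, not the votes dataset; the names `data`, `standardized`, `n_keep`, and `projected` are made up for this sketch, and the 90% threshold is an arbitrary choice.

```python
import numpy as np

# Synthetic stand-in data (an assumption for illustration, not the votes dataset):
# 200 samples, 3 features, where feature 1 is strongly correlated with feature 0.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
data = np.hstack([
    base,
    base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Standardize: zero mean and unit (sample) variance per feature.
standardized = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

# Covariance matrix of the standardized data (features are columns).
cov = np.cov(standardized, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order,
# so reorder both outputs to put the largest eigenvalue first.
eig_vals, eig_vecs = np.linalg.eigh(cov)
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# Keep the fewest components whose cumulative explained variance reaches the target.
explained = eig_vals / eig_vals.sum()
n_keep = int(np.argmax(np.cumsum(explained) >= 0.90)) + 1

# Project the standardized data onto the kept eigenvectors.
projected = standardized @ eig_vecs[:, :n_keep]
print(n_keep, projected.shape)
```

Because two of the three synthetic features are nearly copies of each other, fewer components than features are enough to reach the threshold, which is the situation where PCA pays off.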

PCA is most useful when the features are correlated with one another, because it can then summarize several related features with a few components.
A dataset of entirely uncorrelated features will see little benefit from PCA.

Now, to the lab!

# Congressional Voting Data

You're working for a political watchdog that wants to track and analyze the voting behavior of various politicians. Specifically, we want to understand how the political affiliation of a member of the House of Representatives affects their voting record. You're given a dataset with party affiliations as well as voting records for a variety of key bills.

Your task is to perform PCA to determine the principal components of this dataset so that your data science team can perform a clustering analysis to learn how political affiliation is related to voting.

[Congressional Voting Dataset](./datasets/votes.csv)

Bill Index|Bill (vote options)
----------|----
V1.  |handicapped-infants: 2 (y,n)
V2.  |water-project-cost-sharing: 2 (y,n)
V3.  |adoption-of-the-budget-resolution: 2 (y,n)
V4.  |physician-fee-freeze: 2 (y,n)
V5.  |el-salvador-aid: 2 (y,n)
V6.  |religious-groups-in-schools: 2 (y,n)
V7.  |anti-satellite-test-ban: 2 (y,n)
V8.  |aid-to-nicaraguan-contras: 2 (y,n)
V9.  |mx-missile: 2 (y,n)
V10. |immigration: 2 (y,n)
V11. |synfuels-corporation-cutback: 2 (y,n)
V12. |education-spending: 2 (y,n)
V13. |superfund-right-to-sue: 2 (y,n)
V14. |crime: 2 (y,n)
V15. |duty-free-exports: 2 (y,n)
V16. |export-administration-act-south-africa: 2 (y,n)

### 1. Load Packages

In [3]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import math
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA 
from sklearn import metrics


%matplotlib inline

votes_file = '../datasets/votes.csv'

### 2. Preprocessing

After you've downloaded the data from the repository, go ahead and load it with pandas and handle any preprocessing that it may need.

- Convert all columns to numeric values
- Decide how to deal with NaN values
- Standardize numeric values

In [4]:
votes = pd.read_csv(votes_file, index_col=0)
votes.head()

Unnamed: 0,Class,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16
1,republican,n,y,n,y,y,y,n,n,n,y,,y,y,y,n,y
2,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,
3,democrat,,y,y,,y,y,n,n,n,n,y,n,y,y,n,n
4,democrat,n,y,y,n,,y,n,n,n,n,y,n,y,n,n,y
5,democrat,y,y,y,n,y,y,n,n,n,n,y,,y,y,y,y


In [5]:
# Proportion of missing values (NaN) in each column

votes.isnull().sum() / len(votes)

Class    0.000000
V1       0.027586
V2       0.110345
V3       0.025287
V4       0.025287
V5       0.034483
V6       0.025287
V7       0.032184
V8       0.034483
V9       0.050575
V10      0.016092
V11      0.048276
V12      0.071264
V13      0.057471
V14      0.039080
V15      0.064368
V16      0.239080
dtype: float64

In [6]:
# The values must be numeric before we can compute a correlation matrix.
# Map "y" to 1 and "n" to -1; missing votes (NaN) become 0 further down.
# Encode the party as 'democrat' = 1 and 'republican' = 0.

votemap = {'y': 1, 'n': -1, 'democrat': 1, 'republican': 0}

# Loop over every column (Class and V1 to V16) and apply the mapping
for col in votes:
    votes[col] = votes[col].map(votemap)

# Missing votes have no entry in votemap and stay NaN; treat them as 0
votes.fillna(0, inplace=True)

In [7]:
# Verify it worked
votes.head()

Unnamed: 0,Class,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16
1,0,-1.0,1.0,-1.0,1.0,1.0,1.0,-1.0,-1.0,-1.0,1.0,0.0,1.0,1.0,1.0,-1.0,1.0
2,0,-1.0,1.0,-1.0,1.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,1.0,1.0,-1.0,0.0
3,1,0.0,1.0,1.0,0.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,1.0,1.0,-1.0,-1.0
4,1,-1.0,1.0,1.0,-1.0,0.0,1.0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,1.0,-1.0,-1.0,1.0
5,1,1.0,1.0,1.0,-1.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,1.0,0.0,1.0,1.0,1.0,1.0


Next, let's define the `X` and `y` variables:

In [8]:
X = votes.drop('Class', axis = 1)
y = votes.Class

The next step requires standardized values, so let's do that now. **There are two different methods below.**

**Note: this data is already close to standardized, because every feature lies on the same scale between -1 and 1. The columns do not yet have exactly zero mean and unit variance, so standardizing still changes the numbers.**

In [9]:
# Method 1: StandardScaler returns a NumPy array directly
x_stand = StandardScaler().fit_transform(X)
x_stand

array([[-0.90578391,  1.05291969, -1.22638627, ...,  0.85158008,
        -0.90252246,  0.7169386 ],
       [-0.90578391,  1.05291969, -1.22638627, ...,  0.85158008,
        -0.90252246, -0.65090478],
       [ 0.11498293,  1.05291969,  0.83735851, ...,  0.85158008,
        -0.90252246, -2.01874815],
       ...,
       [-0.90578391, -0.00731194, -1.22638627, ...,  0.85158008,
        -0.90252246,  0.7169386 ],
       [-0.90578391, -1.06754357, -1.22638627, ...,  0.85158008,
        -0.90252246,  0.7169386 ],
       [-0.90578391,  1.05291969, -1.22638627, ...,  0.85158008,
         0.14161922, -2.01874815]])

In [10]:
# Method 2: Standardize with pandas; this returns a new DataFrame and leaves X unchanged

X_2 = (X - X.mean()) / X.std()
# pandas applies each column's mean and standard deviation to every row of that column
X_2.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16
1,-0.904742,1.051709,-1.224976,1.19045,1.007227,0.763019,-1.158688,-1.179384,-1.027468,0.997768,0.278564,1.197357,1.010112,0.850601,-0.901484,0.716114
2,-0.904742,1.051709,-1.224976,1.19045,1.007227,0.763019,-1.158688,-1.179384,-1.027468,-1.016288,-0.784377,1.197357,1.010112,0.850601,-0.901484,-0.650156
3,0.114851,1.051709,0.836395,0.165013,1.007227,0.763019,-1.158688,-1.179384,-1.027468,-1.016288,1.341505,-0.89862,1.010112,0.850601,-0.901484,-2.016426
4,-0.904742,1.051709,0.836395,-0.860424,-0.009348,0.763019,-1.158688,-1.179384,-1.027468,-1.016288,1.341505,-0.89862,1.010112,-1.222292,-0.901484,0.716114
5,1.134444,1.051709,0.836395,-0.860424,1.007227,0.763019,-1.158688,-1.179384,-1.027468,-1.016288,1.341505,0.149369,1.010112,0.850601,1.184397,0.716114
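
The two outputs above differ slightly (for example -0.9058 versus -0.9047 in the first cell). The reason is that `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas `.std()` defaults to the sample standard deviation (`ddof=1`). The sketch below illustrates this on a small made-up frame; `toy` and its values are hypothetical and only there to keep the example self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy votes in the same -1/0/1 encoding (hypothetical values, not the real dataset).
toy = pd.DataFrame({'V1': [-1, -1, 1, 1, 1], 'V2': [1, 1, -1, 0, 1]}, dtype=float)

scaled_sklearn = StandardScaler().fit_transform(toy)   # divides by std with ddof=0
scaled_pandas = (toy - toy.mean()) / toy.std()          # divides by std with ddof=1

# The two results differ only by the constant factor sqrt(n / (n - 1)).
n = len(toy)
print(np.allclose(scaled_sklearn, scaled_pandas.to_numpy() * np.sqrt(n / (n - 1))))

# Asking pandas for the population standard deviation reproduces StandardScaler.
scaled_pandas_pop = (toy - toy.mean()) / toy.std(ddof=0)
print(np.allclose(scaled_sklearn, scaled_pandas_pop.to_numpy()))
```

The factor shrinks toward 1 as the number of rows grows, and a constant rescaling of every column leaves the correlation matrix unchanged, so either method works for the PCA that follows.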


### 3. Covariance matrix

- Compute the covariance matrix (for standardized data, this is the same as the correlation matrix)
- Compute the eigenvectors and eigenvalues (try using `np.linalg`!)

In [11]:
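# .corr() is a pandas method that computes the pairwise correlation of columns, excluding NA/null values
X_2CM = X_2.corr()

# The principal components are the eigenvectors of the covariance matrix.
# For standardized data, the covariance matrix equals the correlation matrix.
X_2CM.head()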

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16
V1,1.0,0.023232,0.39768,-0.421307,-0.3691,-0.402215,0.362022,0.399039,0.33984,-0.086469,0.105277,-0.41112,-0.34836,-0.370628,0.201782,0.11657
V2,0.023232,1.0,-0.054237,0.076274,0.133882,0.149569,-0.203465,-0.103966,-0.190123,-0.122931,0.188786,-0.019364,0.223338,-0.016535,-0.11094,-0.09144
V3,0.39768,-0.054237,1.0,-0.725232,-0.651244,-0.43196,0.579655,0.69815,0.603294,0.022112,0.218328,-0.645382,-0.526661,-0.585085,0.47833,0.311423
V4,-0.421307,0.076274,-0.725232,1.0,0.753347,0.47629,-0.580509,-0.694025,-0.639042,0.04436,-0.282151,0.690901,0.593952,0.647853,-0.538417,-0.270164
V5,-0.3691,0.133882,-0.651244,0.753347,1.0,0.624175,-0.694744,-0.827431,-0.782799,0.009348,-0.146776,0.634723,0.645797,0.695011,-0.558103,-0.274914


In [12]:
# Correlation matrix of the original, unscaled variables

X_3 = X.corr()
X_3.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16
V1,1.0,0.023232,0.39768,-0.421307,-0.3691,-0.402215,0.362022,0.399039,0.33984,-0.086469,0.105277,-0.41112,-0.34836,-0.370628,0.201782,0.11657
V2,0.023232,1.0,-0.054237,0.076274,0.133882,0.149569,-0.203465,-0.103966,-0.190123,-0.122931,0.188786,-0.019364,0.223338,-0.016535,-0.11094,-0.09144
V3,0.39768,-0.054237,1.0,-0.725232,-0.651244,-0.43196,0.579655,0.69815,0.603294,0.022112,0.218328,-0.645382,-0.526661,-0.585085,0.47833,0.311423
V4,-0.421307,0.076274,-0.725232,1.0,0.753347,0.47629,-0.580509,-0.694025,-0.639042,0.04436,-0.282151,0.690901,0.593952,0.647853,-0.538417,-0.270164
V5,-0.3691,0.133882,-0.651244,0.753347,1.0,0.624175,-0.694744,-0.827431,-0.782799,0.009348,-0.146776,0.634723,0.645797,0.695011,-0.558103,-0.274914


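The correlation matrices of the standardized and unstandardized data look identical, and mathematically they are: correlation does not change when a column is shifted or rescaled. An exact `==` comparison still reports differences, because floating-point rounding makes some entries differ in their last few digits.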

In [13]:
# Exact equality fails on most entries even though the printed values match

X_3 == X_2CM

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16
V1,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
V2,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False
V3,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False
V4,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False
V5,False,False,False,False,True,False,True,False,False,False,False,False,False,False,False,False
V6,False,False,False,False,False,True,True,False,False,False,False,False,False,False,False,False
V7,False,False,False,False,True,True,True,False,False,False,False,False,False,False,False,False
V8,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False
V9,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False
V10,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False


#### Now, let's check the eigenvalues: 

In [14]:
# np.linalg.eig returns the eigenvalues and the eigenvectors (as columns); the eigenvalues are not sorted
eig_vals, eig_vecs = np.linalg.eig(X_2CM)
eig_vals1, eig_vecs1 = np.linalg.eig(X_3)
print(eig_vals)
print(eig_vals1)

[7.40236313 1.42718114 1.13099577 0.86295103 0.80236765 0.75097129
 0.13309252 0.21540746 0.24037679 0.5749521  0.30529835 0.33103556
 0.52533026 0.47190968 0.39281719 0.43295007]
[7.40236313 1.42718114 1.13099577 0.86295103 0.80236765 0.75097129
 0.13309252 0.21540746 0.24037679 0.5749521  0.30529835 0.33103556
 0.52533026 0.47190968 0.39281719 0.43295007]
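
Note that the eigenvalues above are not in sorted order. For a symmetric matrix such as a correlation matrix, `np.linalg.eigh` is an alternative that always returns real eigenvalues in ascending order. The sketch below uses a small made-up symmetric matrix, `corr`, as a stand-in for the lab's correlation matrix.

```python
import numpy as np

# A small symmetric matrix standing in for the correlation matrix (hypothetical values).
corr = np.array([
    [1.0, 0.6, -0.3],
    [0.6, 1.0, 0.2],
    [-0.3, 0.2, 1.0],
])

# eigh assumes a symmetric (Hermitian) input and returns real eigenvalues in ascending order.
vals, vecs = np.linalg.eigh(corr)

# Reverse both so the largest eigenvalue comes first; columns of vecs are the eigenvectors.
vals = vals[::-1]
vecs = vecs[:, ::-1]

print(vals)

# The eigenvalues of a correlation matrix sum to its trace, i.e. the number of features.
print(np.isclose(vals.sum(), np.trace(corr)))
```

The same property holds for the lab's output: the 16 eigenvalues printed above add up to 16, the number of features.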


#### And the eigenvectors: 

In [15]:
eig_vecs

array([[-1.87809330e-01, -1.81722119e-01, -1.55275769e-01,
         5.53363824e-01,  3.95918006e-01,  4.94792150e-01,
        -6.20394464e-02,  1.06978258e-02, -5.85601301e-02,
         1.72339485e-01,  5.45292676e-02,  1.61018808e-01,
         1.17913902e-01, -3.23472153e-01, -1.21564992e-01,
         2.24123205e-02],
       [ 5.38655080e-02, -6.10752223e-01,  1.37255837e-01,
         4.08772744e-01, -1.11659131e-01, -5.32813707e-01,
        -3.65539275e-04,  1.00665084e-01,  9.38662181e-02,
         7.18097848e-02, -1.10895527e-01,  7.01216422e-02,
        -1.22742773e-01, -7.37275759e-03,  4.54129677e-02,
         2.88598767e-01],
       [-2.93251619e-01, -8.58088375e-02,  1.83349755e-01,
         2.74308742e-02,  3.39469693e-02,  2.87636785e-02,
        -1.84410234e-01,  1.37147698e-01, -1.66809452e-01,
        -2.70406816e-01, -6.13057352e-01, -4.33207839e-01,
         3.69655169e-01,  3.77171990e-02, -1.00605249e-01,
        -6.95114324e-02],
       [ 3.10693742e-01,  1.35055455e

**Great!** We have some eigenvectors and eigenvalues, but what do they mean?
These eigenvalues and eigenvectors are not computed from our original dataframe directly; they come from the square feature-by-feature correlation matrix.

Each eigenvalue is paired with one eigenvector (the matching column of `eig_vecs`). The number of eigenpairs right now is equal to our original number of features.

An eigenvector is a direction in feature space, and its eigenvalue is the variance of the standardized data along that direction. Formally, for the correlation matrix `C`, an eigenvector `v` and its eigenvalue `λ` satisfy `C v = λ v`.

The eigenvalues are used to calculate the explained variance, which is useful in deciding how many principal components to keep.
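
To make the link between eigenvalues and variance concrete, here is a minimal check on synthetic data (an assumption for illustration; `Z`, `C`, `lam`, and `v` are names invented for this sketch). It verifies the defining eigenpair relation and shows that projecting the standardized data onto an eigenvector gives a variance equal to the matching eigenvalue.

```python
import numpy as np

# Random correlated data as a stand-in (an assumption; any standardized dataset works).
rng = np.random.default_rng(42)
raw = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))
Z = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

C = np.corrcoef(Z, rowvar=False)
eig_vals_demo, eig_vecs_demo = np.linalg.eigh(C)

# Take the eigenpair with the largest eigenvalue (eigh sorts ascending, so it is last).
lam = eig_vals_demo[-1]
v = eig_vecs_demo[:, -1]

# Defining property of an eigenpair: C v = lambda v.
print(np.allclose(C @ v, lam * v))

# The sample variance of the data projected onto v equals the eigenvalue.
print(np.isclose((Z @ v).var(ddof=1), lam))
```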

#### To find the principal components, find the eigenpairs, and sort them from highest to lowest.

In [16]:
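# Pair each eigenvalue with its eigenvector (column i of eig_vecs)
value_vector_pairs = [[eig_vals[i], eig_vecs[:, i]] for i in range(len(eig_vals))]
# Sort on the eigenvalue only; comparing the arrays themselves would raise an error on tied eigenvalues
value_vector_pairs.sort(key=lambda pair: pair[0], reverse=True)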

What are principal components?

In [17]:
value_vector_pairs


[[7.40236312952388,
  array([-0.18780933,  0.05386551, -0.29325162,  0.31069374,  0.32982429,
          0.26111754, -0.29052391, -0.32161235, -0.30007697,  0.01129995,
         -0.06852963,  0.28732779,  0.27557143,  0.28569924, -0.24614685,
         -0.13753029])],
 [1.4271811422429181,
  array([-0.18172212, -0.61075222, -0.08580884,  0.13505545, -0.03476445,
         -0.08521068,  0.18229466,  0.04488362,  0.14629295,  0.38175157,
         -0.50627296,  0.15733647, -0.08472336,  0.14338653, -0.02048274,
          0.21972732])],
 [1.130995770243853,
  array([-0.15527577,  0.13725584,  0.18334976, -0.10192309,  0.01447065,
          0.31285733,  0.01316406,  0.06646765, -0.00834646,  0.63236536,
          0.44793651, -0.04989752,  0.11856934,  0.13103759, -0.04185746,
          0.41748102])],
 [0.8629510277123037,
  array([ 0.55336382,  0.40877274,  0.02743087,  0.10210521,  0.07099146,
         -0.11745149,  0.06972849,  0.02339107,  0.02011706,  0.10313823,
         -0.41333262, -0.0

### 4. Understanding the principal components

#### 4.A. Calculate the explained variance. 

> Explained variance is the eigenvalue divided by the sum of all eigenvalues.
  **These should sum to 1!**

In [18]:
EVSum = sum(eig_vals)

# Explained variance ratio = eigenvalue divided by the sum of all eigenvalues
var_exp = [(i / EVSum) for i in sorted(eig_vals, reverse=True)]

# The ratios of all eigenpairs should add up to 1, as in 100% of explained variance.

In [19]:
var_exp

[0.4626476955952424,
 0.08919882139018237,
 0.07068723564024079,
 0.05393443923201897,
 0.05014797803849963,
 0.04693570568674697,
 0.035934506470415684,
 0.03283314119663071,
 0.029494355201732433,
 0.027059379593370236,
 0.02455107412021299,
 0.020689722479974804,
 0.019081147096158196,
 0.015023549598944306,
 0.013462966051801244,
 0.008318282607828043]

In [20]:
sum(var_exp)

0.9999999999999997

#### 4.B. Calculate the explained variance and the cumulative explained variance (see `np.cumsum`)

In [22]:
# np.cumsum returns the cumulative (running) sum of the elements along a given axis.
cum_var_exp = np.cumsum(var_exp)
cum_var_exp

# The last entry is 1: taken together, all the eigenpairs explain all of the variance.

array([0.4626477 , 0.55184652, 0.62253375, 0.67646819, 0.72661617,
       0.77355188, 0.80948638, 0.84231952, 0.87181388, 0.89887326,
       0.92342433, 0.94411405, 0.9631952 , 0.97821875, 0.99168172,
       1.        ])

#### 4.C. Suppose we require 90% explained variance. How many eigenvectors should we keep? 

- Hint: Use the cumulative sum
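
One way to read the answer off the cumulative sum is to find the first position where it reaches the threshold. The sketch below copies the `cum_var_exp` values printed in 4.B so that it runs on its own; `n_components_90` is a name chosen for this example.

```python
import numpy as np

# Cumulative explained variance copied from the output of 4.B above.
cum_var_exp = np.array([
    0.4626477, 0.55184652, 0.62253375, 0.67646819, 0.72661617,
    0.77355188, 0.80948638, 0.84231952, 0.87181388, 0.89887326,
    0.92342433, 0.94411405, 0.9631952, 0.97821875, 0.99168172,
    1.0,
])

# argmax returns the index of the first True; add 1 to turn it into a count.
n_components_90 = int(np.argmax(cum_var_exp >= 0.90)) + 1
print(n_components_90)  # 11
```

Ten components reach about 89.9%, just short of the target, so the eleventh is needed to cross 90%.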

### 5. Now, repeat the process with sklearn.

[sklearn.decomposition.PCA documentation](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)

In [23]:
from sklearn.decomposition import PCA

# Ten components explain about 89.9% of the variance (see the cumulative sum in 4.B)
pca = PCA(n_components=10)
skl_pca = pca.fit_transform(X_2)
skl_pca

array([[ 3.49236025,  0.08607588, -1.37104206, ...,  0.59099576,
         0.21114272, -0.06781472],
       [ 3.730348  ,  0.61701382,  0.94909933, ...,  0.15029384,
         0.05639808,  0.16833472],
       [ 2.05574679,  2.82392803,  0.13849764, ..., -0.84693551,
         0.35413561, -0.41179291],
       ...,
       [ 3.33971434,  0.74628917, -0.42819691, ...,  0.89081185,
         0.08251796, -0.4308882 ],
       [ 2.50383963, -1.74407999, -0.04065769, ...,  0.54598294,
        -0.40164608, -0.20759022],
       [ 3.68429367,  0.16971379,  0.28952655, ..., -0.1711483 ,
        -0.50490591,  0.3397782 ]])

In [24]:
pca.components_

array([[-0.18780933,  0.05386551, -0.29325162,  0.31069374,  0.32982429,
         0.26111754, -0.29052391, -0.32161235, -0.30007697,  0.01129995,
        -0.06852963,  0.28732779,  0.27557143,  0.28569924, -0.24614685,
        -0.13753029],
       [ 0.18172212,  0.61075222,  0.08580884, -0.13505545,  0.03476445,
         0.08521068, -0.18229466, -0.04488362, -0.14629295, -0.38175157,
         0.50627296, -0.15733647,  0.08472336, -0.14338653,  0.02048274,
        -0.21972732],
       [ 0.15527577, -0.13725584, -0.18334976,  0.10192309, -0.01447065,
        -0.31285733, -0.01316406, -0.06646765,  0.00834646, -0.63236536,
        -0.44793651,  0.04989752, -0.11856934, -0.13103759,  0.04185746,
        -0.41748102],
       [-0.55336382, -0.40877274, -0.02743087, -0.10210521, -0.07099146,
         0.11745149, -0.06972849, -0.02339107, -0.02011706, -0.10313823,
         0.41333262,  0.06282039, -0.14603368, -0.00404089,  0.39109413,
        -0.36213577],
       [ 0.39591801, -0.11165913,  0

In [25]:
pca.explained_variance_ratio_

array([0.4626477 , 0.08919882, 0.07068724, 0.05393444, 0.05014798,
       0.04693571, 0.03593451, 0.03283314, 0.02949436, 0.02705938])
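
The explained variance ratios match the manual calculation in section 4, and the rows of `pca.components_` match the sorted eigenvectors, except that some rows have the opposite sign. That is expected: an eigenvector is only defined up to sign. The sketch below reproduces the comparison on synthetic data (an assumption for illustration; `Z`, `pca_demo`, and `pca_90` are hypothetical names). It also shows that `PCA` accepts a float between 0 and 1 for `n_components` to request a share of variance rather than a fixed count.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic standardized data as a stand-in for X_2 (an assumption for illustration).
rng = np.random.default_rng(1)
raw = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))
Z = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

# Manual route: eigenpairs of the covariance matrix, largest eigenvalue first.
vals, vecs = np.linalg.eigh(np.cov(Z, rowvar=False))
vals, vecs = vals[::-1], vecs[:, ::-1]

# scikit-learn route; rows of components_ are the principal axes.
pca_demo = PCA(n_components=3).fit(Z)

# Compare after removing the sign ambiguity.
same_axes = np.allclose(np.abs(pca_demo.components_), np.abs(vecs[:, :3].T))
same_ratio = np.allclose(pca_demo.explained_variance_ratio_, vals[:3] / vals.sum())
print(same_axes, same_ratio)

# A float in (0, 1) asks for the fewest components that exceed that share of variance.
pca_90 = PCA(n_components=0.90, svd_solver='full').fit(Z)
print(pca_90.n_components_)
```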