<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# PCA: Extra Practice

_Authors: Joseph Nelson (DC), Matt Brems (DC)_

---

In this lab, we will practice the steps of a principal component analysis (PCA):

- Loading the data.
- Standardizing the data (centering each feature at zero and scaling it to unit variance).
- Computing the covariance matrix of the standardized original data.
- Computing the eigenvalues/eigenvectors.
- Deciding how much explained variance you want in your final model and selecting the number of principal components that are necessary to explain that amount of variance.  
- Keeping the eigenvectors needed to explain that amount of variance.
- Going back and multiplying the standardized data by the eigenvectors of the selected principal components.
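The final step in the list above is a single matrix product. The following is a minimal sketch on synthetic data, not the voting data; the array names (`raw`, `X`, `W`, `Z`) and the choice of `k = 2` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (an assumption for illustration): 100 rows, 3 features,
# with the third feature built mostly from the first.
raw = rng.normal(size=(100, 3))
raw[:, 2] = raw[:, 0] + 0.2 * raw[:, 2]

# Standardize: zero mean and unit variance per column.
X = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# Eigen-decomposition of the covariance matrix of the standardized data.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigenvalues)[::-1]  # indices of eigenvalues, largest first

k = 2                                  # number of components to keep (illustrative)
W = eigenvectors[:, order[:k]]         # projection matrix, shape (3, k)

# Project the standardized data onto the selected eigenvectors.
Z = X @ W                              # shape (100, k)
print(Z.shape)
```

Each column of `Z` is one principal component score, and its sample variance equals the corresponding eigenvalue.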

PCA is most useful when features are correlated with one another, because it combines related features into a smaller number of components.

A data set of entirely uncorrelated features will not benefit much from a PCA.
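To make this concrete, the sketch below compares two synthetic data sets, one with strongly correlated features and one with independent features. The data and the helper `explained_variance_ratio` are hypothetical and exist only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Two strongly correlated features: the second is the first plus small noise.
x = rng.normal(size=n)
correlated = np.column_stack([x, x + 0.1 * rng.normal(size=n)])

# Two independent features.
uncorrelated = rng.normal(size=(n, 2))


def explained_variance_ratio(data):
    """Eigenvalues of the covariance matrix of the standardized data,
    largest first, expressed as fractions of their total."""
    standardized = (data - data.mean(axis=0)) / data.std(axis=0)
    eigenvalues = np.linalg.eigvalsh(np.cov(standardized, rowvar=False))
    return eigenvalues[::-1] / eigenvalues.sum()  # eigvalsh returns ascending order


ratio_correlated = explained_variance_ratio(correlated)
ratio_uncorrelated = explained_variance_ratio(uncorrelated)

print("correlated:  ", ratio_correlated)
print("uncorrelated:", ratio_uncorrelated)
```

In the correlated case, nearly all of the variance lands in the first component, so the second can be dropped with little loss. In the uncorrelated case, the two components explain roughly equal shares, so dropping either one discards about half of the variance.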

# Congressional Voting Data

You're working for a political watchdog who wants to track and analyze the voting behavior of various politicians. Specifically, they want to understand how the political affiliation of House of Representatives members affects voting records. You're given a data set with affiliations and voting records for a variety of key bills.

Your task is to perform a PCA to determine the principal components of the data set. Then, your data science team can perform a clustering analysis to learn how political affiliation relates to voting.

[Congressional Voting Data Set](./datasets/votes.csv)

Bill Index|Bill (Vote Options)
----------|----
V1.  |handicapped-infants: 2 (y,n)
V2.  |water-project-cost-sharing: 2 (y,n)
V3.  |adoption-of-the-budget-resolution: 2 (y,n)
V4.  |physician-fee-freeze: 2 (y,n)
V5.  |el-salvador-aid: 2 (y,n)
V6.  |religious-groups-in-schools: 2 (y,n)
V7.  |anti-satellite-test-ban: 2 (y,n)
V8.  |aid-to-nicaraguan-contras: 2 (y,n)
V9.  |mx-missile: 2 (y,n)
V10. |immigration: 2 (y,n)
V11. |synfuels-corporation-cutback: 2 (y,n)
V12. |education-spending: 2 (y,n)
V13. |superfund-right-to-sue: 2 (y,n)
V14. |crime: 2 (y,n)
V15. |duty-free-exports: 2 (y,n)
V16. |export-administration-act-south-africa: 2 (y,n)

### 1) Load packages.

```python
import math
import os

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Path to the congressional voting data, relative to this notebook.
votes_file = './datasets/votes.csv'
```

### 2) Preprocess the data.

After you've downloaded the data from the repository, load the set with Pandas and handle any necessary preprocessing.

- Convert all columns to numeric values.
- Decide how to deal with NaN values.
- Standardize numeric values.
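One possible approach to these three steps is sketched below on a small hypothetical frame. The column names (`party`, `V1`, `V2`), the `y`/`n` encoding, and the use of NaN for missing votes are assumptions; check them against the actual contents of `votes.csv` before reusing any of this. Mean imputation is only one of several reasonable ways to handle missing votes.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for votes.csv; the real columns and encodings may differ.
toy = pd.DataFrame({
    "party": ["democrat", "republican", "democrat", "republican"],
    "V1": ["y", "n", "y", np.nan],
    "V2": ["n", "n", "y", "y"],
})

vote_columns = ["V1", "V2"]

# Convert votes to numbers: y -> 1.0, n -> 0.0. Any other value becomes NaN.
votes = toy[vote_columns].apply(lambda col: col.map({"y": 1.0, "n": 0.0}))

# One simple choice for missing votes: fill with the column mean.
votes = votes.fillna(votes.mean())

# Center each column at zero and scale it to unit variance.
X = StandardScaler().fit_transform(votes)

print(X.mean(axis=0))  # approximately zero for every column
print(X.std(axis=0))   # approximately one for every column
```

The party label is left out of `X` so that the principal components are driven by the votes alone.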

### 3) Compute the eigenpairs.

- Compute the covariance matrix.
- Compute the eigenvectors and eigenvalues using `np.linalg`.
- Sort by descending eigenvalues to find the principal components.
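The steps above can be sketched as follows on synthetic standardized data. The array `raw` is a hypothetical stand-in for the preprocessed votes. This sketch uses `np.linalg.eigh`, which is intended for symmetric matrices such as a covariance matrix; `np.linalg.eig` is another option, but neither should be relied on to return eigenvalues in descending order, so the explicit sort is kept.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: 200 rows, 4 features, with some induced correlation.
raw = rng.normal(size=(200, 4))
raw[:, 1] += raw[:, 0]
X = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# Covariance matrix of the standardized data (columns are features).
cov = np.cov(X, rowvar=False)

# eigh returns real eigenvalues in ascending order for symmetric input.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Reorder so the largest eigenvalue (first principal component) comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # column i pairs with eigenvalues[i]

print(eigenvalues)
```

Note that eigenvectors are stored as columns, so the same `order` must be applied to the columns of `eigenvectors` to keep each eigenpair together.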

### 4) Understand the principal components.

#### 4.A) Calculate the explained variance. 

> The explained variance of a principal component is its eigenvalue divided by the sum of all eigenvalues.
> **These ratios should sum to one!**
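As a small worked example of this ratio, the sketch below uses made-up eigenvalues rather than values computed from the voting data.

```python
import numpy as np

# Hypothetical eigenvalues, already sorted in descending order.
eigenvalues = np.array([4.0, 2.0, 1.5, 0.5])

# Each component's share of the total variance.
explained_variance = eigenvalues / eigenvalues.sum()

print(explained_variance)
print(explained_variance.sum())
```

Here the eigenvalues sum to 8, so the first component explains 4/8 = 50 percent of the variance, the second 25 percent, and so on.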

#### 4.B) Calculate the cumulative explained variance (see `np.cumsum`).

#### 4.C) Suppose we require 90 percent explained variance. How many eigenvectors should we keep?

- Hint: Use the cumulative sum.
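The cumulative sum and the threshold check can be sketched as follows. The explained variance values are hypothetical, and the names `threshold` and `n_components` are chosen for illustration.

```python
import numpy as np

# Hypothetical explained variance ratios, sorted in descending order.
explained_variance = np.array([0.5, 0.25, 0.1875, 0.0625])

# Running total of variance explained by the first 1, 2, 3, ... components.
cumulative = np.cumsum(explained_variance)  # [0.5, 0.75, 0.9375, 1.0]

threshold = 0.90

# argmax on a boolean array returns the index of the first True value;
# adding 1 converts that zero-based index into a component count.
n_components = int(np.argmax(cumulative >= threshold)) + 1

print(n_components)  # 3
```

Two components reach only 75 percent, so three are needed to pass the 90 percent mark in this example.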

### 5) Now, repeat the process with scikit-learn.

See the [`sklearn.decomposition.PCA` documentation](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).
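A minimal sketch of the scikit-learn workflow follows, again on synthetic data standing in for the standardized votes. It also checks the library's explained variance ratios against the manual eigenvalue calculation. Passing a float between 0 and 1 as `n_components` together with `svd_solver="full"` asks `PCA` to keep just enough components to exceed that fraction of explained variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Hypothetical data: 300 rows, 5 features, with two correlated pairs.
raw = rng.normal(size=(300, 5))
raw[:, 1] += 2 * raw[:, 0]
raw[:, 3] -= raw[:, 2]

X = StandardScaler().fit_transform(raw)

# Fit with all components kept and project the data onto them.
pca = PCA()
scores = pca.fit_transform(X)

# Manual ratios from the covariance eigenvalues, largest first.
eigenvalues = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
manual_ratio = eigenvalues / eigenvalues.sum()
print(np.allclose(pca.explained_variance_ratio_, manual_ratio))

# Keep only as many components as needed to pass 90 percent explained variance.
pca_90 = PCA(n_components=0.90, svd_solver="full")
reduced = pca_90.fit_transform(X)
print(pca_90.n_components_, reduced.shape)
```

The component vectors in `pca.components_` may differ in sign from the ones returned by `np.linalg`, since an eigenvector multiplied by -1 is still an eigenvector; the explained variance ratios are unaffected.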