<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Principal Component Analysis

_Authors: Justin Pounders, Matt Brems, Noelle Brown_

### LEARNING OBJECTIVES
By the end of the lesson, students should be able to:
1. Differentiate between feature selection and feature extraction.
2. Describe the PCA algorithm.
3. Implement PCA in `scikit-learn`.
4. Calculate and interpret proportion of explained variance.
5. Identify use cases for PCA.

Before working through this notebook, let's cover some foundations of dimensionality reduction in this separate [intuition deck](https://docs.google.com/presentation/d/1acHgUvkR-qx1SxM5rA8dikHNqMlZFM3t4DNKWZXhYbA/edit#slide=id.g10cc392b88f_1_0).

In [1]:
# Import our libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Import from sklearn.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Set a random seed.
np.random.seed(42)

### Introduction of Problem

Today, we're going to be using the [wine quality](http://www3.dsi.uminho.pt/pcortez/wine/) dataset by Cortez, Cerdeira, Almeida, Matos and Reis.

Specifically, we are going to use physicochemical properties of the wine in order to **predict the *quality* of the wine.**

In [2]:
# Read in the wine quality datasets.
df_red = pd.read_csv('../datasets/winequality-red.csv', sep=';')
df_white = pd.read_csv('../datasets/winequality-white.csv', sep=';')

# Stack datasets together. (They have the same column names!)
df = pd.concat([df_red, df_white])

# Check out head of our dataframe.
print(df.shape)
df.head()

(6497, 12)


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


### Fit a multiple linear regression model in `sklearn`.

In [3]:
# Set y to be the quality column.
y = df['quality']

# Set X as all other columns.
X = df.drop(columns=['quality'])

# How much missing data do we have? --> none for this dataset
X.isnull().sum()

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
dtype: int64

[sklearn polynomial features recap](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html)

- if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2]
- accordingly, degree-3 would incorporate cubed features as follows: [1,a,b,a^2,ab,b^2,a^3,a^2b,ab^2,b^3]

In [6]:
# let's examine with a small subset of X before applying on entire 'X' for model training
sample_X = df[['fixed acidity','volatile acidity']].head()

trans = PolynomialFeatures(degree=3)
poly_sample_X = trans.fit_transform(sample_X)
poly_sample_X

array([[1.000000e+00, 7.400000e+00, 7.000000e-01, 5.476000e+01,
        5.180000e+00, 4.900000e-01, 4.052240e+02, 3.833200e+01,
        3.626000e+00, 3.430000e-01],
       [1.000000e+00, 7.800000e+00, 8.800000e-01, 6.084000e+01,
        6.864000e+00, 7.744000e-01, 4.745520e+02, 5.353920e+01,
        6.040320e+00, 6.814720e-01],
       [1.000000e+00, 7.800000e+00, 7.600000e-01, 6.084000e+01,
        5.928000e+00, 5.776000e-01, 4.745520e+02, 4.623840e+01,
        4.505280e+00, 4.389760e-01],
       [1.000000e+00, 1.120000e+01, 2.800000e-01, 1.254400e+02,
        3.136000e+00, 7.840000e-02, 1.404928e+03, 3.512320e+01,
        8.780800e-01, 2.195200e-02],
       [1.000000e+00, 7.400000e+00, 7.000000e-01, 5.476000e+01,
        5.180000e+00, 4.900000e-01, 4.052240e+02, 3.833200e+01,
        3.626000e+00, 3.430000e-01]])

In [4]:
# To show off the strength of PCA, we're going to make many, many more features.
pf = PolynomialFeatures(degree = 3)

# Fit and transform our X data using Polynomial Features.
X_new = pf.fit_transform(X) # Fit to data, then transform it

# How many features do we have now?
print(X_new.shape)

# How many features did we start out with?
print(X.shape)

(6497, 364)
(6497, 11)
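
The jump from 11 to 364 columns follows a counting rule. Assuming the default `PolynomialFeatures` settings (bias column and interaction terms included), $p$ input features expanded to degree $d$ give $\binom{p+d}{d}$ columns. A quick check, where `n_poly_features` is a hypothetical helper name used only for illustration:

```python
from math import comb

def n_poly_features(p, degree):
    """Column count of a full polynomial expansion, bias term included."""
    return comb(p + degree, degree)

print(n_poly_features(2, 3))   # the 2-feature sample above: 10 columns
print(n_poly_features(11, 3))  # all 11 wine features: 364 columns
```

This matches the 10 columns in `poly_sample_X` and the 364 columns in `X_new`.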


In [7]:
# Train/test split our data.
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size = 0.33, random_state = 42)

In [8]:
# Instantiate and fit a linear regression model.
lm = LinearRegression()
lm.fit(X_train, y_train)

# Score on training set. (We'll use the default R^2 scoring metric)
print(f'Training Score: {round(lm.score(X_train, y_train),4)}.')

# Score on testing set.
print(f'Testing Score: {round(lm.score(X_test, y_test),4)}.')

Training Score: 0.4563.
Testing Score: -0.8865.


<details><summary>Check: What is the problem with this?</summary>
    
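- Our model performs poorly. The training $R^2$ is only 0.4563, and the testing $R^2$ is negative (-0.8865), meaning the model does worse on unseen data than always predicting the mean quality. It is badly overfit and has not generalized!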
- We have a lot of columns relative to our number of rows! (If you have $n$ rows and you're fitting a linear model, it's often advised to keep your number of columns below $\sqrt{n}$.)
    - By this rule of thumb, we are far off: after creating polynomial features we have 364 columns, while $\sqrt{6497} \approx 81$.
    
</details>

<details><summary>Check: How can we overcome this problem?</summary>

- We can drop features from our model. (However, we need to make sure we're not dropping model-critical features! Choosing which features to drop can also be time-consuming and/or require subject-matter expertise.)
- Maybe we can combine features together so that we can get the benefits of most/all of our features. <b>(This is what PCA will do.)</b>
</details>

### Dimensionality Reduction

[Dimensionality reduction](https://www.analyticsvidhya.com/blog/2015/07/dimension-reduction-methods/) refers to (approximately) reducing the number of features we use in our model.

<details><summary>Dimensionality reduction has a number of advantages:</summary>

- Increases computational efficiency when fitting models.
- Can help with addressing a multicollinearity problem.
- Makes visualization simpler (or feasible).
</details>

<details><summary>Dimensionality reduction can suffer from some drawbacks, though:</summary>

- We've invested our time and money into collecting information... why do we want to get rid of it?
</details>

### Is there a way to get the advantages of dimensionality reduction while minimizing the drawbacks?

Dimensionality reduction can generally be broken down into one of two categories:

<img src="../images/dim_red.png" alt="drawing" width="550"/>

- **Feature Selection**
    - We drop variables from our model. In other words, we *pick and choose* which of the original features to keep.
- **Feature Extraction**
    - In feature extraction, we take our existing features and *combine them together* in a particular way. We can then drop some of these "new" variables, but the variables we keep are still a combination of the old variables!
    - This allows us to still reduce the number of features in our model **but** we can keep all of the most important pieces of the original features!
    - Let's consider this example:

<img src="../images/feast.png" alt="drawing" width="550"/>

### $$
\begin{eqnarray*}
X_1, \ldots, X_p &\Rightarrow& Z_1, \ldots, Z_p \\
\\
\text{most important: }Z_1 &=& w_{1,1}X_1 + w_{1,2}X_2 + \cdots + w_{1,p}X_p \\
\text{slightly less important: }Z_2 &=& w_{2,1}X_1 + w_{2,2}X_2 + \cdots + w_{2,p}X_p \\
&\vdots&\\
\text{least important: }Z_p &=& w_{p,1}X_1 + w_{p,2}X_2 + \cdots + w_{p,p}X_p \\
\end{eqnarray*}
$$

- Here, we take our original features in `X` and combine them in a certain way to create `Z`.
- We don't usually care about the values of weights captured in `w` here. They aren't very meaningful and we don't try to interpret them.
- You can think of $Z_1$ as a "<b>high performance</b>" predictor, where $Z_1$ has all of the <b>best pieces</b> of our original predictors $X_1$ through $X_p$.
- As we move down the list toward $Z_p$, the variables will consist of the more "redundant" parts of our $X$ variables and become less and less meaningful.
- You can think of $Z_p$ as a "<b>low performance</b>" predictor.
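
To make the weights $w_{k,j}$ concrete, here is a minimal sketch on made-up data (`X_toy` and `pca_toy` are hypothetical names, not part of this lesson's dataset). It assumes scikit-learn's `PCA`, which stores the weights in `components_` and subtracts the column means stored in `mean_` before combining features:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Hypothetical toy data: 200 rows, 4 correlated features.
X_toy = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))

pca_toy = PCA().fit(X_toy)
W = pca_toy.components_          # row k holds w_{k,1}, ..., w_{k,p}

# Each Z_k is a weighted sum of the mean-centered X columns.
Z_manual = (X_toy - pca_toy.mean_) @ W.T
Z_sklearn = pca_toy.transform(X_toy)

print(np.allclose(Z_manual, Z_sklearn))
```

So every new column in `Z` really is a linear combination of all the original columns, as in the equations above.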

<details><summary>Based on the above, if we're going to keep three of our new predictors, which three would they be?</summary>
    
- The first three: $Z_1$, $Z_2$, and $Z_3$.
- This is how we do feature extraction.
    - We take our old features $X_1$, $X_2$, $X_3$, and $X_4$.
    - We turn them into new features $Z_1$, $Z_2$, $Z_3$, and $Z_4$.
    - The new features are combinations of our old features.
    - If we drop some new features, we're doing dimensionality reduction, but we also keep parts of every old feature!
</details>

Dimensionality reduction can be used as an exploratory/unsupervised learning method or as a pre-processing step for supervised learning later.

**Principal component analysis** is one algorithm for doing **feature extraction**.

<details><summary>How would you describe the difference between feature selection and feature extraction?</summary>

- Feature selection is a process of dropping original features from our model.
- Feature extraction is a process of transforming our original features into "new" features, then dropping some of the "new" features from our model.
</details>
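
A small sketch of the contrast, using hypothetical random data (`X_demo`) and PCA as the extraction method. The choice of columns `[0, 2]` for selection is arbitrary and only for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))      # hypothetical data with 4 features

# Feature selection: keep two of the original columns, discard the rest.
X_selected = X_demo[:, [0, 2]]

# Feature extraction: build new columns from all four, then keep the first two.
pca_demo = PCA(n_components=2).fit(X_demo)
Z_extracted = pca_demo.transform(X_demo)

print(X_selected.shape, Z_extracted.shape)
# Each kept component has a weight for every one of the 4 original features.
print(pca_demo.components_.shape)
```

Both results have two columns, but only the extracted columns draw on all four original features.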

## Principal Component Analysis

### Big picture, what is PCA doing?
1. We are going to look at how all of the $X$ variables relate to one another and summarize these relationships.
2. Then, we will take this summary and look at which combinations of our $X$ variables are most important/which relationships are most meaningful.
3. We can also quantify how important each combination is and rank these combinations.

Once we've taken our original $X$ data and transformed it into $Z$, we can then drop the columns of $Z$ that are "least important."

Imagine you are this [Whale shark](https://en.wikipedia.org/wiki/Whale_shark):

<img src="../images/whaleshark.png" alt="drawing" width="500"/>

And you want a snack. Which way would you tilt your head to eat the most krill at once?
- anti-clockwise to maximize outcome

<img src="../images/krill.png" alt="drawing" width="500"/>

Above artwork by [@allison_horst](https://twitter.com/allison_horst)

<img src="../images/pca.gif" alt="drawing" width="600"/>

[Source](https://rpubs.com/jormerod/594859).

**Visually...**

> Think of our data floating out in $p$-dimensional space. Each observation is a dot and you can imagine this massive cloud of dots that exists somewhere. PCA is a way to rotate this cloud of dots (formally, a [coordinate transformation](http://farside.ph.utexas.edu/teaching/336k/Newtonhtml/node153.html)). The old axes are the original $X_1$, $X_2$, $\ldots$ features. **The new axes are the principal components from PCA**.

The principal components are the most concise, informative descriptors of our data as a whole.
- What does this mean?
- If we wanted to take our full data set and condense it into one dimension (think like our $X$ axis), we'd only use $Z_1$.
- If we wanted to take our full data set and condense it into two dimensions (think like our $X$ and $Y$ axes), we'd use $Z_1$ and $Z_2$. *(like pc1 and pc2 on the site below)*

You can head to [this site](http://setosa.io/ev/principal-component-analysis/) that allows us to visualize what PCA does to original data. Play around with the 2D data.

- PCA finds a *new coordinate system* in which every point has a new (x,y) value. The axes don't actually mean anything physical; they're combinations of height and weight called "principal components" that are chosen to give one axis lots of variation.
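
The "rotation" idea can be checked numerically. This is a sketch on a made-up 2D cloud (`X_2d` is hypothetical, not the wine data): the new axes should be orthonormal, and the rotated columns should be uncorrelated with most of the spread in the first one.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Hypothetical 2D cloud with strongly correlated columns.
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + 0.3 * rng.normal(size=500)
X_2d = np.column_stack([x1, x2])

pca_2d = PCA().fit(X_2d)
Z_2d = pca_2d.transform(X_2d)

# Orthonormal axes: the transformation is a rotation (possibly with a reflection).
print(np.allclose(pca_2d.components_ @ pca_2d.components_.T, np.eye(2)))

# Covariance of the rotated data: off-diagonal entries vanish, first variance is largest.
print(np.round(np.cov(Z_2d, rowvar=False), 3))
```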

---

### Principal Components

- We are looking for new *directions*.

**These new *directions* are the "principal components."**

> Applying PCA to your data *transforms* your original data columns (variables) onto the new principal component axes.


### Two important notes:

1. Train/test split **before** applying PCA!
2. Standardize our data **before** applying PCA!
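
Why standardize? PCA ranks directions by variance, so a feature measured on a large scale can dominate simply because of its units. A minimal sketch with two hypothetical, independent features on very different scales:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Hypothetical features on very different scales (think grams vs. tonnes).
small = rng.normal(scale=1.0, size=300)
large = rng.normal(scale=1000.0, size=300)
X_scales = np.column_stack([small, large])

raw_ratio = PCA().fit(X_scales).explained_variance_ratio_
X_scaled = StandardScaler().fit_transform(X_scales)
scaled_ratio = PCA().fit(X_scaled).explained_variance_ratio_

print(np.round(raw_ratio, 4))     # the large-scale column dominates the first component
print(np.round(scaled_ratio, 4))  # after scaling, neither column dominates
```

In a real workflow the scaler would be fit on the training split only, which is why the train/test split comes first.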

In [9]:
# Instantiate our StandardScaler.
ss = StandardScaler()

# Standardize X_train.
X_train = ss.fit_transform(X_train)

# Standardize X_test.
X_test = ss.transform(X_test)

In [10]:
# Import PCA.
from sklearn.decomposition import PCA

#### (BONUS) Why decomposition?
The way PCA works "under the hood" is it takes one matrix and **decomposes** that matrix into multiple matrices.

Written out, we might take some matrix $\mathbf{A}$ and break it down into multiple matrices like this:

$$
\begin{eqnarray*}
\mathbf{A} &=& \mathbf{P}\mathbf{D}\mathbf{P}^{-1}
\end{eqnarray*}
$$

Check out [the Wikipedia article](https://en.wikipedia.org/wiki/Matrix_decomposition) for a list of ways to decompose matrices.
- A specific method of decomposition commonly used for PCA is known as the [eigendecomposition](https://en.wikipedia.org/wiki/Eigendecomposition_of_a_matrix) or spectral decomposition of a matrix. However, eigendecomposition requires [diagonalizable](https://en.wikipedia.org/wiki/Diagonalizable_matrix) matrices. To generalize this to non-square/non-diagonalizable matrices, we more commonly use the [Singular Value Decomposition](https://en.wikipedia.org/wiki/Singular_value_decomposition) (SVD) for PCA. [PCA in Sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) uses SVD.
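
As a sketch of how the two decompositions relate, using NumPy on hypothetical random data (`X_c`): the eigenvalues of the covariance matrix equal the squared singular values of the centered data divided by $n - 1$. This only illustrates the relationship; it is not a reproduction of scikit-learn's internals.

```python
import numpy as np

rng = np.random.default_rng(3)
X_c = rng.normal(size=(50, 3))
X_c = X_c - X_c.mean(axis=0)            # center the columns first

# Eigendecomposition of the (symmetric) covariance matrix: A = P D P^{-1}.
A = np.cov(X_c, rowvar=False)
eigvals, P = np.linalg.eigh(A)          # eigenvalues come back in ascending order
D = np.diag(eigvals)
print(np.allclose(A, P @ D @ np.linalg.inv(P)))

# SVD of the centered data recovers the same eigenvalues without forming A.
singular_values = np.linalg.svd(X_c, compute_uv=False)
eig_from_svd = singular_values**2 / (X_c.shape[0] - 1)
print(np.allclose(np.sort(eigvals)[::-1], eig_from_svd))
```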

In [11]:
# Instantiate PCA.
pca = PCA(random_state = 42)

In [13]:
# Fit PCA on the training data.
pca.fit(X_train)

PCA(random_state=42)

In [14]:
# Transform PCA on the scaled training data.
Z_train = pca.transform(X_train)

In [16]:
# Let's check out the resulting data.
pd.DataFrame(Z_train).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,354,355,356,357,358,359,360,361,362,363
0,12.948627,-3.258955,-0.832581,-2.804569,4.013368,2.54378,-7.019037,0.210613,-1.044385,4.588738,...,-1.75303e-07,2.244862e-07,-3.355503e-07,1.695424e-07,-2.723171e-08,-4.47986e-07,5.305204e-07,2.465473e-07,-5.421411e-08,-7.46838e-15
1,2.65672,-5.133174,1.321977,7.870713,7.259017,1.834894,-4.380642,1.323332,-0.326929,0.671756,...,-6.582953e-08,-7.814499e-08,-1.37356e-07,-1.330862e-07,6.570108e-09,7.538702e-10,-6.368494e-08,-7.070962e-08,2.419514e-08,2.008061e-16
2,-0.99022,-3.640704,-4.845163,-1.805145,-4.065424,-7.852356,3.627241,3.246181,5.454453,1.442888,...,2.341384e-07,2.386694e-07,-1.500054e-07,1.673883e-07,-1.234804e-07,1.920235e-07,-1.052504e-07,1.918555e-07,-1.796324e-08,-1.6022840000000003e-17
3,-4.978165,37.883902,-1.540496,-18.391432,-18.798863,39.65114,19.141686,-18.185576,0.84882,-16.846516,...,-2.379348e-07,-3.991949e-07,4.642472e-07,-8.555283e-08,1.090973e-08,-6.307144e-08,8.913501e-08,-5.217746e-08,3.277049e-08,2.3058780000000003e-17
4,-1.351178,-8.745763,-0.088128,-6.09701,2.148219,1.394471,-1.591332,2.14814,1.898963,1.393446,...,7.049872e-08,1.449276e-08,2.896338e-08,2.498999e-08,-7.445622e-09,-3.288498e-08,-8.061787e-08,1.99167e-10,-1.138442e-08,6.281998000000001e-17


In [13]:
# Let's check out the resulting data. - descriptive stats on above output
pd.DataFrame(Z_train).describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,354,355,356,357,358,359,360,361,362,363
count,4352.0,4352.0,4352.0,4352.0,4352.0,4352.0,4352.0,4352.0,4352.0,4352.0,...,4352.0,4352.0,4352.0,4352.0,4352.0,4352.0,4352.0,4352.0,4352.0,4352.0
mean,1.146958e-16,-1.595946e-16,-6.122553e-17,4.693958e-18,8.783313e-17,-6.959302000000001e-17,6.046022000000001e-17,1.299002e-16,1.050528e-16,-8.316468000000001e-17,...,-1.247476e-17,-5.682199e-17,-2.2785330000000002e-17,1.44571e-17,-3.035692e-17,5.544647e-18,-8.241894e-18,1.6497130000000002e-17,5.3115080000000005e-17,4.0666e-26
std,10.59077,8.173139,6.290206,5.899809,4.755777,4.187379,3.663482,3.51957,2.952624,2.382871,...,3.32442e-07,2.824236e-07,2.473689e-07,1.937117e-07,1.745561e-07,1.617731e-07,1.377643e-07,9.470085e-08,2.940241e-08,3.08479e-16
min,-26.52547,-12.74783,-24.69224,-47.87008,-22.55302,-20.58849,-28.71649,-18.18558,-14.23366,-23.76801,...,-3.023989e-06,-2.779914e-06,-1.989357e-06,-1.760353e-06,-1.692596e-06,-1.716397e-06,-1.619285e-06,-8.943172e-07,-2.324601e-07,-7.46838e-15
25%,-7.399924,-5.048388,-4.000986,-3.001194,-2.38384,-2.356163,-2.189302,-2.283286,-1.858204,-1.434927,...,-1.378026e-07,-9.565181e-08,-1.013028e-07,-7.476632e-08,-6.683678e-08,-6.46355e-08,-4.998106e-08,-3.541248e-08,-1.156609e-08,-7.859317000000001e-17
50%,-1.77981,-1.784668,-0.6808596,-0.06803642,-0.0481993,-0.07350413,-0.1303354,-0.1860893,0.1402668,-0.07198744,...,1.628219e-08,1.037889e-08,-9.718861e-09,3.965882e-09,3.880132e-10,-1.324977e-09,6.129282e-10,2.191272e-09,4.248704e-10,1.950242e-17
75%,6.595688,2.77039,3.760017,3.103933,2.411031,2.287913,1.959506,2.180386,1.909862,1.35825,...,1.393233e-07,1.054804e-07,1.014314e-07,7.356252e-08,6.753657e-08,5.911348e-08,4.85796e-08,3.957595e-08,1.240289e-08,1.17273e-16
max,95.24253,114.9057,80.74751,133.8557,76.0941,39.65114,35.59624,56.28123,19.36733,24.58697,...,4.492273e-06,4.635209e-06,3.338466e-06,2.044531e-06,2.028879e-06,1.779281e-06,1.906753e-06,1.092489e-06,4.183708e-07,1.523538e-15


In [17]:
# Similarly, don't forget to transform the test data!
Z_test = pca.transform(X_test)
pd.DataFrame(Z_test).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,354,355,356,357,358,359,360,361,362,363
0,9.103072,3.439415,-11.783759,13.01693,1.184315,-3.159621,3.414925,8.538476,-2.584179,3.018793,...,-9.18292e-08,-1.561775e-08,2.473053e-08,-3.549909e-07,2.055061e-07,1.42839e-07,2.897451e-07,2.545958e-08,-2.549097e-08,-7.503008e-16
1,-2.235547,3.500906,7.817154,-7.445563,-3.417477,-8.154706,1.531397,-2.076225,1.191075,0.573568,...,-1.511403e-07,7.566446e-08,8.39073e-08,1.703526e-07,-1.228489e-07,6.259637e-08,-4.645473e-08,8.151851e-08,-2.441678e-08,-5.379277e-17
2,3.353077,-4.055033,-0.026266,1.417186,-5.566875,-2.302657,0.953539,4.012457,1.944812,-1.971464,...,-1.313394e-07,-2.411871e-07,-4.764512e-08,-7.260178e-08,1.005425e-07,-3.23687e-08,-4.27361e-08,4.756427e-09,4.493932e-09,-3.051075e-17
3,18.626931,1.017445,-1.74643,-0.796755,-1.895322,-0.293004,-1.728781,1.354999,-4.237255,-2.503114,...,-1.367549e-07,-7.396131e-08,-4.645654e-08,-1.358216e-07,-2.208545e-08,2.778198e-08,-1.780352e-09,-1.933329e-08,-9.8213e-09,-9.857787e-17
4,23.180523,2.107191,9.862115,-3.257994,-1.340473,0.833409,-2.528985,-4.532641,4.218397,0.151709,...,-2.094121e-08,8.671944e-08,1.682491e-07,1.688573e-07,-4.166592e-08,5.811428e-08,-1.148048e-07,-9.872134e-09,2.203536e-09,1.878234e-16


### So, big picture, what is PCA doing, technically?
Well, we're transforming our data. Specifically:

The mathematical terms associated with PCA are shown in **bold**.


1. We are going to look at how all of the $X$ variables relate to one another, then summarize these relationships. (This is done with the **covariance matrix**.)
2. Then, we will take this summary and look at which combinations of our $X$ variables are most important. (We will decompose our covariance matrix into its **eigenvectors**, a linear algebra concept that lets us identify the most important "directions" in our data. These directions are our principal components!)
3. We can also see exactly how important each combination is, then rank these combinations. (With each eigenvector, we get an **eigenvalue**. This eigenvalue is a number that tells us how important each "direction" or principal component is.)
    - Want a better understanding of eigenvectors and eigenvalues? [Check this 3Blue1Brown video out!](https://www.youtube.com/watch?v=PFDu9oVAE-g)
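
The three steps above can be sketched by hand with NumPy and compared against scikit-learn. The data (`X_m`) is hypothetical, and the comparison uses absolute values because each direction is only defined up to a sign flip:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X_m = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # hypothetical correlated data
X_m = X_m - X_m.mean(axis=0)

# 1. Summarize relationships between variables with the covariance matrix.
cov = np.cov(X_m, rowvar=False)

# 2. Eigenvectors give the directions; 3. eigenvalues rank them.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]        # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project the centered data onto the ranked directions.
Z_manual = X_m @ eigvecs

pca_m = PCA().fit(X_m)
print(np.allclose(eigvals, pca_m.explained_variance_))
print(np.allclose(np.abs(Z_manual), np.abs(pca_m.transform(X_m)), atol=1e-6))
```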

Remember that one of our goals with PCA is to do **dimensionality reduction** (a.k.a. get rid of unnecessary features).

To summarize, we can:
- measure how important each principal component is using the eigenvalue, 
- rank the columns of `Z_train` by their eigenvalues,
- then drop the columns with small eigenvalues (little importance) but keep the columns with big eigenvalues (very important).
    - In `sklearn`, when transformed by PCA, the columns will <b>already be sorted by their eigenvalues</b> from biggest to smallest! The first column will be the most important, the second column will be the next most important, and so on.

#### But how do we decide on how many features to discard?

A useful measure is the **proportion of explained variance**, which is calculated from the **eigenvalues**. 

The proportion of explained variance tells us what share of the total information (variance) is captured by each principal component.

### $$ \text{proportion of explained variance of }PC_k = \bigg(\frac{\text{eigenvalue of } PC_k}{\sum_{i=1}^p\text{eigenvalue of } PC_i}\bigg)$$

Rather than write out "$\text{eigenvalue of } PC_k$", we usually just write $\lambda_k$.

If I want to calculate the proportion of explained variance by retaining $PC_1$ and $PC_2$, I would calculate this as:

### $$ \text{proportion of explained variance of } PC_1 \text{ and } PC_2 = \bigg(\frac{\lambda_1 + \lambda_2}{\sum_{i=1}^p \lambda_i} \bigg)$$
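
A minimal sketch of the formula on hypothetical data (`X_ev`). It assumes all components are kept, so that the eigenvalues in scikit-learn's `explained_variance_` sum to the total variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
X_ev = rng.normal(size=(150, 4)) @ rng.normal(size=(4, 4))   # hypothetical data

pca_ev = PCA().fit(X_ev)
lambdas = pca_ev.explained_variance_           # eigenvalues, largest first

# Proportion for each component: lambda_k divided by the sum of all lambdas.
proportions = lambdas / lambdas.sum()
print(np.allclose(proportions, pca_ev.explained_variance_ratio_))

# Proportion retained by keeping PC_1 and PC_2 together.
first_two = (lambdas[0] + lambdas[1]) / lambdas.sum()
print(round(first_two, 3))
```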

In [15]:
# Pull the proportion of explained variance from the fitted PCA's explained_variance_ratio_ attribute.
var_exp = pca.explained_variance_ratio_
print(f'Explained variance (first 20 components): {np.round(var_exp[:20],3)}') # examine only 1st 20 variances, rounded to 3 decimals

print('')

# Generate the cumulative explained variance.
cum_var_exp = np.cumsum(var_exp)
print(f'Cumulative explained variance (first 20 components): {np.round(cum_var_exp[:20],3)}')

Explained variance (first 20 components): [0.309 0.184 0.109 0.096 0.062 0.048 0.037 0.034 0.024 0.016 0.013 0.011
 0.009 0.006 0.004 0.003 0.003 0.003 0.003 0.002]

Cumulative explained variance (first 20 components): [0.309 0.493 0.602 0.698 0.76  0.808 0.845 0.879 0.903 0.919 0.932 0.943
 0.952 0.958 0.962 0.966 0.969 0.972 0.975 0.977]


<details><summary>Check: If I wanted to explain at least 80% of the variability in my data with principal components, what is the smallest number of principal components that I would need to keep? </summary>

- Only six! 
- I could keep $Z_1, Z_2, \ldots, Z_6$ in my model, and this would explain 80.8% of the variability in my $X$ data. (based on the 6th value of cum_var_exp hitting ~80%)
</details>
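
The same lookup can be done in code. This sketch copies the 20 cumulative values printed above into an array; `n_components_for` is a hypothetical helper name, and it assumes the threshold is actually reached somewhere in the array.

```python
import numpy as np

# Cumulative explained variance for the first 20 components, copied from the output above.
cum_var = np.array([0.309, 0.493, 0.602, 0.698, 0.76, 0.808, 0.845, 0.879, 0.903, 0.919,
                    0.932, 0.943, 0.952, 0.958, 0.962, 0.966, 0.969, 0.972, 0.975, 0.977])

def n_components_for(cum_var, threshold):
    """Smallest number of leading components whose cumulative proportion reaches threshold."""
    return int(np.argmax(cum_var >= threshold) + 1)

print(n_components_for(cum_var, 0.80))  # 6
print(n_components_for(cum_var, 0.90))  # 9
```

scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components` and will choose the number of components needed to explain that share of the variance.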

## Let's compare our PCA'ed performance to our original performance!

#### Original performance:

<img src="../images/lr_performance.png" alt="drawing" width="800"/>

#### Principal Component Regression performance:

In [18]:
# Note: PCA is instantiated and fit as its own step, separate from instantiating and fitting the regression model.
# Instantiate PCA with 10 components.
pca = PCA(n_components = 10, random_state = 42)

# Fit PCA to training data.
pca.fit(X_train)

PCA(n_components=10, random_state=42)

In [19]:
# Instantiate linear regression model.
lm = LinearRegression()

# Transform X_train and X_test into Z_train and Z_test.
Z_train = pca.transform(X_train)
Z_test = pca.transform(X_test)

# Fit on Z_train.
lm.fit(Z_train, y_train)

# Score on training and testing sets.
print(f'Training Score: {round(lm.score(Z_train, y_train),4)}')
print(f'Testing Score: {round(lm.score(Z_test, y_test),4)}')

Training Score: 0.2902
Testing Score: 0.2639
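
The scale, PCA, and regression steps can also be chained so that the scaler and PCA are always fit on the training split only. This is one possible arrangement using scikit-learn's `make_pipeline`, shown on synthetic stand-in data from `make_regression` rather than the wine dataset, so its scores are not comparable to the ones above.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the wine dataset.
X_syn, y_syn = make_regression(n_samples=500, n_features=30, n_informative=10,
                               noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.33, random_state=42)

# Scaler and PCA are fit on the training split, then reused unchanged on the test split.
pcr = make_pipeline(StandardScaler(),
                    PCA(n_components=10, random_state=42),
                    LinearRegression())
pcr.fit(X_tr, y_tr)

print(f'Training Score: {round(pcr.score(X_tr, y_tr), 4)}')
print(f'Testing Score: {round(pcr.score(X_te, y_te), 4)}')
```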


**Two assumptions that PCA makes:**
1. **Linearity:** PCA detects and controls for linear relationships, so we assume that the data does not hold nonlinear relationships (or that we don't care about these nonlinear relationships).
    - We are using our covariance matrix to determine important "directions," and covariance is a measure of the linear relationship between variables!
    - There are other types of feature extraction like [t-SNE](https://lvdmaaten.github.io/tsne/) and [PPA](https://towardsdatascience.com/interesting-projections-where-pca-fails-fe64ddca73e6), though we won't formally cover those in a global lesson.
    
    
2. **Large variances define importance:** If data is spread in a direction, that direction is important! If there is little spread in a direction, that direction is not very important.
    - That aligns with what we saw [here](http://setosa.io/ev/principal-component-analysis/).
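
To see the linearity assumption bite, consider a made-up example: points evenly spaced on a circle. The data is really described by one variable (the angle), but that relationship is nonlinear, so PCA finds no direction that stands out.

```python
import numpy as np
from sklearn.decomposition import PCA

# Points evenly spaced on a circle: one underlying variable (the angle),
# related nonlinearly to the x and y coordinates.
theta = np.linspace(0, 2 * np.pi, 400, endpoint=False)
X_circle = np.column_stack([np.cos(theta), np.sin(theta)])

ratio_circle = PCA().fit(X_circle).explained_variance_ratio_
print(np.round(ratio_circle, 3))
```

Each component explains about half of the variance, so dropping either one would lose half of the spread even though a single number (the angle) fully describes each point.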

### Potential Use Cases for PCA
- Situations where $p \not\ll n$. (Situations where $p$ is not substantially smaller than $n$.)
- Situations in which there are variables with high multicollinearity. (Can be traditional models or models with highly correlated inputs by design, like images.)
- Situations in which there are many variables, even without explicit multicollinearity.

### Interview Questions

<details><summary>Explain PCA to me.</summary>

- Principal component analysis is a method of dimensionality reduction that **identifies important relationships** in our data, **transforms the existing data** based on these relationships, and then **quantifies the importance** of these relationships so we can keep the most important relationships and drop the others!

<details><summary>How can I remember the above?</summary>

Matt's "Three Signposts:"
- Covariance Matrix
- Eigenvectors
- Eigenvalues
</details>
</details>

<details><summary>In what cases would I not use PCA?</summary>

- Since PCA distorts the interpretability of our features, we should not use PCA if our goal is to interpret the output of our model.
- If we have relatively few features as inputs, PCA is unlikely to have a large positive impact on our model.
</details>