## Principal Component Analysis - What is PCA doing?
1. We are going to look at how all of the $X$ variables relate to one another and summarize these relationships.
2. Then, we will take this summary and look at which combinations of our $X$ variables are most important.
3. We can also quantify how important each combination is and rank these combinations.

Once we've taken our original $X$ data and transformed it into $Z$, we can then drop the columns of $Z$ that are "least important."

The principal components are the most concise, informative descriptors of our data as a whole.
- What does this mean?
- If we wanted to condense our full data set into one dimension (a single axis, like a number line), we'd only use $Z_1$.
- If we wanted to condense our full data set into two dimensions (two axes, like a scatter plot), we'd use $Z_1$ and $Z_2$.
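
The three steps above can be sketched with plain NumPy. This is a minimal illustration on made-up data (the toy `X` below is hypothetical, not the wine data), and it uses an eigen-decomposition of the covariance matrix, which is one common way to compute PCA. scikit-learn uses an SVD internally, so treat this as a sketch of the idea rather than the library's implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy data: 200 rows, 3 columns, two of them strongly correlated.
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Step 1: summarize how the X variables relate to one another.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# Step 2: find the combinations (eigenvectors) of the X variables.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Step 3: rank the combinations by importance (largest eigenvalue first).
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Transform X into Z, then keep only the most important column.
Z = X_centered @ eigenvectors
Z_1 = Z[:, :1]

print('Share of variance per component:',
      np.round(eigenvalues / eigenvalues.sum(), 3))
```

Here the eigenvalues play the role of "how important each combination is": each one equals the variance of the corresponding column of `Z`.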

In [1]:
# Import our libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Import from sklearn.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Set a random seed.
np.random.seed(42)

## Test Code using multiple linear regression
Use the physicochemical properties of wine to predict quality.

In [2]:
# Read in the wine quality datasets.
df_red = pd.read_csv('./datasets/winequality-red.csv', sep=';')
df_white = pd.read_csv('./datasets/winequality-white.csv', sep=';')

# Stack datasets together. (They have the same column names!)
df = pd.concat([df_red, df_white])

# Check out head of our dataframe.
df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [3]:
# Set y to be the quality column.
y = df['quality']

# Set X as all other columns.
X = df.drop(columns=['quality'])

# How much missing data do we have?
X.isnull().sum()

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
dtype: int64

In [4]:
# To show off the strength of PCA, we're going to make many, many more features.
pf = PolynomialFeatures(degree = 3)

# Fit and transform our X data using Polynomial Features.
X_new = pf.fit_transform(X)

# How many features do we have now?
print(X_new.shape)

# How many features did we start out with?
print(X.shape)

(6497, 364)
(6497, 11)
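
The jump from 11 to 364 columns can be checked by hand. With `degree = 3` and the default settings, `PolynomialFeatures` produces every monomial of total degree 0 through 3 in the 11 inputs (including the constant term), and the number of such monomials is a binomial coefficient. A quick sketch of that arithmetic:

```python
from math import comb

n_inputs = 11   # original columns in X
degree = 3      # degree passed to PolynomialFeatures

# Number of monomials of total degree <= 3 in 11 variables,
# counting the constant (bias) column.
n_poly_features = comb(n_inputs + degree, degree)
print(n_poly_features)
```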


In [5]:
# Train/test split our data.
X_train, X_test, y_train, y_test = train_test_split(X_new,
                                                    y,
                                                    test_size = 0.33,
                                                    random_state = 42)

In [6]:
# Instantiate and fit a linear regression model.
lm = LinearRegression()
lm.fit(X_train, y_train)

# Score on training set. (We'll use R^2 for the score today.)
print(f'Training Score: {round(lm.score(X_train, y_train),4)}.')

# Score on testing set.
print(f'Testing Score: {round(lm.score(X_test, y_test),4)}.')

Training Score: 0.4563.
Testing Score: -0.8845.


Evaluation of multiple linear regression in this case
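- We've clearly overfit our model to the data! The training $R^2$ is 0.4563, but the testing $R^2$ is -0.8845; a negative $R^2$ means the model does worse on unseen data than simply predicting the mean quality.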
- We have a lot of columns relative to our number of rows! (If you have $n$ rows and you're fitting a linear model, it's often advised to keep your number of columns below $\sqrt{n}$.)
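
To make the rule of thumb concrete, here is the arithmetic for this data set, using the shape of `X_new` printed earlier and the 0.33 test fraction from the train/test split. This is only a sketch: it assumes the test-set size is rounded up, which is consistent with the 4,352 training rows reported later in the notebook.

```python
import math

n_rows, n_cols = 6497, 364          # shape of X_new printed earlier
n_test = math.ceil(0.33 * n_rows)   # rows held out for testing (assumed rounded up)
n_train = n_rows - n_test           # rows left for fitting

suggested_max_cols = math.sqrt(n_train)
print(f'Training rows: {n_train}')
print(f'Suggested column limit (sqrt of training rows): {suggested_max_cols:.1f}')
print(f'Actual columns: {n_cols}')
```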

How to overcome:
- We can drop features from our model. (However, this loses any benefit we'd get from keeping those features! It can also be time-consuming and/or require subject-matter expertise.)
- Maybe we can combine features together so that we can get the benefits of most/all of our features. (This is what PCA will do.)

## Purpose: reducing the number of features/dimensions
---
Advantages of dimensionality reduction:
- Increases computational efficiency when fitting models.
- Can help with addressing a multicollinearity problem.
- Makes visualization simpler (or feasible).

Disadvantages:
- We've invested our time and money into collecting information... why do we want to get rid of it?

## 2 categories of Dimensionality Reduction:
- **Feature Selection**
    - We drop variables from our model.
- **Feature Extraction**
    - In feature extraction, we take our existing features and combine them together in a particular way. We can then drop some of these "new" variables, but the variables we keep are still a combination of the old variables!
    - This allows us to still reduce the number of features in our model **but** we can keep all of the most important pieces of the original features!


## Feature Selection vs Feature Extraction
- Feature selection is a process of dropping original features from our model.
- Feature extraction is a process of transforming our original features into "new" features, then dropping some of the "new" features from our model.
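
The difference can be seen side by side in a small sketch. The array `X_toy` and the choice of which columns to keep are hypothetical, chosen only for illustration; PCA stands in here as one example of feature extraction.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 4))   # hypothetical data with 4 features

# Feature selection: keep two original columns, drop the other two.
X_selected = X_toy[:, :2]

# Feature extraction: build new features from all four, then keep two.
pca_toy = PCA(n_components=2, random_state=42)
Z_toy = pca_toy.fit_transform(X_toy)

# Each row of components_ holds the weights on all four original features.
print(X_selected.shape, Z_toy.shape)
print(np.round(pca_toy.components_, 2))
```

Both results have two columns, but the selected columns ignore two of the original features entirely, while each extracted column carries a weight for all four.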

## If I'm going to keep three of my new predictors, which three would I keep?
- The first three: $Z_1$, $Z_2$, and $Z_3$.
- This is how we do feature extraction.
    - We take our old features $X_1$, $X_2$, $X_3$, and $X_4$.
    - We turn them into new features $Z_1$, $Z_2$, $Z_3$, and $Z_4$.
    - The new features are combinations of our old features.
    - If we drop new features, we're doing dimensionality reduction, but we also keep parts of every old feature!

## PCA CODE!

In [7]:
# Instantiate our StandardScaler.
ss = StandardScaler()

# Standardize X_train.
X_train = ss.fit_transform(X_train)

# Standardize X_test.
X_test = ss.transform(X_test)
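
PCA ranks combinations by variance, so columns measured on large scales would dominate unless we standardize first. Note also that the scaler is fit on the training data only and then reused on the test data. A minimal sketch of what those two calls compute, on hypothetical arrays:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Hypothetical train/test arrays with columns on very different scales.
train = rng.normal(loc=[10.0, 1000.0], scale=[1.0, 200.0], size=(80, 2))
test = rng.normal(loc=[10.0, 1000.0], scale=[1.0, 200.0], size=(20, 2))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)   # learns mean and std from train
test_scaled = scaler.transform(test)         # reuses the training mean and std

# Same result by hand (StandardScaler uses the population std, ddof=0).
train_mean = train.mean(axis=0)
train_std = train.std(axis=0)
test_by_hand = (test - train_mean) / train_std
```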

In [8]:
# Import PCA.
from sklearn.decomposition import PCA

In [9]:
# Instantiate PCA.
pca = PCA(random_state = 42)

In [10]:
# Fit PCA on the training data.
pca.fit(X_train)

PCA(random_state=42)

In [11]:
# Transform the training data with the fitted PCA.
Z_train = pca.transform(X_train)

In [12]:
# Let's check out the resulting data.
pd.DataFrame(Z_train).describe()

,0,1,2,3,4,5,6,7,8,9,...,354,355,356,357,358,359,360,361,362,363
count,4352.0,4352.0,4352.0,4352.0,4352.0,4352.0,4352.0,4352.0,4352.0,4352.0,...,4352.0,4352.0,4352.0,4352.0,4352.0,4352.0,4352.0,4352.0,4352.0,4352.0
mean,1.791867e-16,-3.73935e-16,7.142979000000001e-17,-9.724656e-17,1.0893040000000001e-17,3.8368000000000004e-17,-6.969507e-17,-3.265362e-18,6.265413e-17,-1.712274e-16,...,-2.9880820000000005e-17,-2.098214e-17,1.9866250000000003e-17,-6.994047000000001e-17,-2.0086150000000003e-17,-1.386594e-18,-1.17247e-18,-1.097543e-17,4.6076410000000007e-17,5.1455169999999996e-26
std,10.59077,8.173139,6.290206,5.899809,4.755777,4.187379,3.663482,3.51957,2.952624,2.382871,...,3.32442e-07,2.824236e-07,2.473689e-07,1.937117e-07,1.745561e-07,1.617731e-07,1.377643e-07,9.470085e-08,2.940241e-08,3.08479e-16
min,-26.52547,-12.74783,-24.69224,-47.87008,-22.55302,-20.58849,-28.71649,-18.18558,-14.23366,-23.76801,...,-3.023989e-06,-2.779914e-06,-1.989357e-06,-1.760353e-06,-1.692596e-06,-1.716397e-06,-1.619285e-06,-8.943172e-07,-2.324601e-07,-7.46838e-15
25%,-7.399924,-5.048388,-4.000986,-3.001194,-2.38384,-2.356163,-2.189302,-2.283286,-1.858204,-1.434927,...,-1.378026e-07,-9.56518e-08,-1.013028e-07,-7.476632e-08,-6.683678e-08,-6.46355e-08,-4.998106e-08,-3.541248e-08,-1.156609e-08,-7.859317000000001e-17
50%,-1.77981,-1.784668,-0.6808596,-0.06803642,-0.0481993,-0.07350413,-0.1303354,-0.1860893,0.1402668,-0.07198744,...,1.628219e-08,1.037889e-08,-9.718862e-09,3.96588e-09,3.880116e-10,-1.324978e-09,6.129292e-10,2.191275e-09,4.248726e-10,1.950242e-17
75%,6.595688,2.77039,3.760017,3.103933,2.411031,2.287913,1.959506,2.180386,1.909862,1.35825,...,1.393233e-07,1.054804e-07,1.014314e-07,7.356252e-08,6.753657e-08,5.911348e-08,4.857961e-08,3.957595e-08,1.240289e-08,1.17273e-16
max,95.24253,114.9057,80.74751,133.8557,76.0941,39.65114,35.59624,56.28123,19.36733,24.58697,...,4.492273e-06,4.635209e-06,3.338466e-06,2.044531e-06,2.028879e-06,1.779281e-06,1.906753e-06,1.092489e-06,4.183708e-07,1.523538e-15
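
The `std` row shrinks from left to right because the components are ordered by how much variance they capture, and the near-zero values in the last columns indicate components that carry almost no information. A small sketch on made-up data (`X_small` is hypothetical) showing that this row is the square root of PCA's `explained_variance_`:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
# Hypothetical data: 5 correlated features.
X_small = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))

pca_small = PCA(random_state=42).fit(X_small)
Z_small = pd.DataFrame(pca_small.transform(X_small))

# pandas' std uses ddof=1, which matches how PCA reports explained_variance_.
std_row = Z_small.describe().loc['std'].to_numpy()
print(np.round(std_row, 3))
print(np.round(np.sqrt(pca_small.explained_variance_), 3))
```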


In [13]:
# Don't forget to transform the test data!
Z_test = pca.transform(X_test)

In [14]:
# Pull the explained variance attribute.
var_exp = pca.explained_variance_ratio_
print(f'Explained variance (first 20 components): {np.round(var_exp[:20],3)}')

print('')

# Generate the cumulative explained variance.
cum_var_exp = np.cumsum(var_exp)
print(f'Cumulative explained variance (first 20 components): {np.round(cum_var_exp[:20],3)}')

Explained variance (first 20 components): [0.309 0.184 0.109 0.096 0.062 0.048 0.037 0.034 0.024 0.016 0.013 0.011
 0.009 0.006 0.004 0.003 0.003 0.003 0.003 0.002]

Cumulative explained variance (first 20 components): [0.309 0.493 0.602 0.698 0.76  0.808 0.845 0.879 0.903 0.919 0.932 0.943
 0.952 0.958 0.962 0.966 0.969 0.972 0.975 0.977]
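
One way to use the cumulative values is to pick the smallest number of components that reaches a target share of variance. The sketch below copies the first ten (rounded) cumulative values from the output above; the 0.90 threshold is a hypothetical choice, not a recommendation.

```python
import numpy as np

# Cumulative explained variance for the first 10 components,
# copied (rounded) from the printed output.
cum_var = np.array([0.309, 0.493, 0.602, 0.698, 0.76,
                    0.808, 0.845, 0.879, 0.903, 0.919])

threshold = 0.90  # hypothetical target; pick whatever suits the problem

# Index of the first entry that reaches the threshold, plus 1 to get a count.
n_needed = int(np.argmax(cum_var >= threshold)) + 1
print(n_needed)
```

scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components` (with `svd_solver='full'`), in which case it picks the number of components needed to exceed that share of explained variance.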


### PCA Performance

In [15]:
# Instantiate PCA with 10 components.
pca = PCA(n_components = 10, random_state = 42)

# Fit PCA to training data.
pca.fit(X_train)

PCA(n_components=10, random_state=42)

In [16]:
# Instantiate linear regression model.
lm = LinearRegression()

# Transform X_train and X_test into Z_train and Z_test.
Z_train = pca.transform(X_train)
Z_test = pca.transform(X_test)

# Fit on Z_train.
lm.fit(Z_train, y_train)

# Score on training and testing sets.
print(f'Training Score: {round(lm.score(Z_train, y_train),4)}')
print(f'Testing Score: {round(lm.score(Z_test, y_test),4)}')

Training Score: 0.2902
Testing Score: 0.2639
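
The scale, reduce, and fit steps can also be chained into a single object so that the test data always passes through the same fitted transformations. This is a sketch using scikit-learn's `make_pipeline` on made-up regression data (`X_demo` and `y_demo` are hypothetical stand-ins for the wine features and quality), so its scores will not match the ones above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Hypothetical regression data standing in for the wine features.
X_demo = rng.normal(size=(500, 30))
y_demo = X_demo[:, :3].sum(axis=1) + rng.normal(scale=0.5, size=500)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

# Scale, reduce to 10 components, then fit the regression in one object.
pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=10, random_state=42),
                     LinearRegression())
pipe.fit(Xd_train, yd_train)

print(f'Training Score: {round(pipe.score(Xd_train, yd_train), 4)}')
print(f'Testing Score: {round(pipe.score(Xd_test, yd_test), 4)}')
```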
