## Principal Component Analysis - What is PCA doing?
1. We are going to look at how all of the $X$ variables relate to one another and summarize these relationships.
2. Then, we will take this summary and look at which combinations of our $X$ variables are most important.
3. We can also quantify how important each combination is and rank these combinations.

Once we've taken our original $X$ data and transformed it into $Z$, we can then drop the columns of $Z$ that are "least important."
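The three steps above can be sketched by hand with NumPy. This is a minimal illustration on made-up toy data, not the wine data, and it assumes that the "summary" is the covariance matrix and that the "combinations" are its eigenvectors. All variable names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 rows, 3 columns.
# The second column is almost a multiple of the first.
x1 = rng.normal(size=200)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

# Step 1: summarize how the columns relate to one another (covariance matrix).
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# Step 2: find the combinations of columns (eigenvectors of the covariance matrix).
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Step 3: rank the combinations from most to least important (largest eigenvalue first).
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Transform X into Z, then drop the least important column of Z.
Z = X_centered @ eigenvectors
Z_reduced = Z[:, :2]

# Share of the total variance carried by each column of Z.
explained = eigenvalues / eigenvalues.sum()
print(explained.round(3))
print(Z_reduced.shape)
```

Each eigenvalue equals the variance of the matching column of $Z$, which is why sorting by eigenvalue gives the importance ranking.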

The principal components are the most concise, informative descriptors of our data as a whole.
- What does this mean?
- If we wanted to take our full data set and condense it into one dimension (think of a single axis on a plot), we'd only use $Z_1$.
- If we wanted to take our full data set and condense it into two dimensions (think of the horizontal and vertical axes of a scatter plot), we'd use $Z_1$ and $Z_2$.
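As a rough sketch of condensing a data set to one or two dimensions, the snippet below uses scikit-learn's `PCA` on synthetic data. The data (five columns driven mostly by two hidden factors) is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Hypothetical data: 5 observed columns driven mostly by 2 hidden factors.
hidden = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 5))
X = hidden @ mixing + rng.normal(scale=0.05, size=(300, 5))

# Fit PCA and transform X into Z (columns come back ordered by importance).
pca = PCA()
Z = pca.fit_transform(X)

# One-dimensional summary: keep only Z_1.
Z_1d = Z[:, :1]

# Two-dimensional summary: keep Z_1 and Z_2.
Z_2d = Z[:, :2]

# Proportion of the total variance each component explains.
print(pca.explained_variance_ratio_.round(3))
```

Because the toy data has only two real sources of variation, almost all of the explained variance should land in the first two components.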

In [1]:
# Import our libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Import from sklearn.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Set a random seed.
np.random.seed(42)

## Test Code using multiple linear regression
Use the physicochemical properties of wine to predict quality.

In [4]:
# Read in the wine quality datasets.
df_red = pd.read_csv('./datasets/winequality-red.csv', sep=';')
df_white = pd.read_csv('./datasets/winequality-white.csv', sep=';')

# Stack datasets together. (They have the same column names!)
df = pd.concat([df_red, df_white])

# Check out head of our dataframe.
df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [5]:
# Set y to be the quality column.
y = df['quality']

# Set X as all other columns.
X = df.drop(columns=['quality'])

# How much missing data do we have?
X.isnull().sum()

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
dtype: int64

In [6]:
# To show off the strength of PCA, we're going to make many, many more features.
pf = PolynomialFeatures(degree = 3)

# Fit and transform our X data using Polynomial Features.
X_new = pf.fit_transform(X)

# How many features do we have now?
print(X_new.shape)

# How many features did we start out with?
print(X.shape)

(6497, 364)
(6497, 11)


In [7]:
# Train/test split our data.
X_train, X_test, y_train, y_test = train_test_split(X_new,
                                                    y,
                                                    test_size = 0.33,
                                                    random_state = 42)

In [8]:
# Instantiate and fit a linear regression model.
lm = LinearRegression()
lm.fit(X_train, y_train)

# Score on training set. (We'll use R^2 for the score today.)
print(f'Training Score: {round(lm.score(X_train, y_train),4)}.')

# Score on testing set.
print(f'Testing Score: {round(lm.score(X_test, y_test),4)}.')

Training Score: 0.4563.
Testing Score: -0.8845.


Evaluation of multiple linear regression in this case
- We've clearly overfit our model to the data! The testing $R^2$ is negative, which means the model does worse on the test set than always predicting the average quality would.
- We have a lot of columns relative to our number of rows! (If you have $n$ rows and you're fitting a linear model, it's often advised to keep your number of columns below $\sqrt{n}$.)
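To make the "too many columns" point concrete, here is a small back-of-the-envelope check using only the shapes printed earlier (6497 rows, 11 original columns, degree 3). The variable names are made up for this sketch. The count formula assumes the default `PolynomialFeatures` settings, which include the constant (bias) column.

```python
import math

n_rows = 6497      # rows in the combined wine data
n_original = 11    # original feature columns
degree = 3         # degree passed to PolynomialFeatures

# Number of polynomial terms of degree <= 3 in 11 variables,
# counting the constant term: C(11 + 3, 3).
n_poly = math.comb(n_original + degree, degree)

# Rule-of-thumb ceiling on the number of columns: sqrt(n).
rule_of_thumb = math.sqrt(n_rows)

print(n_poly)                   # 364
print(round(rule_of_thumb, 1))  # 80.6
```

With 364 columns against a guideline of roughly 80, the model has more than four times the suggested number of features.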

How to overcome:
- We can drop features from our model. (However, this loses any benefit we'd get from keeping those features! It can also be time-consuming and/or require subject-matter expertise.)
- Maybe we can combine features together so that we can get the benefits of most/all of our features. (This is what PCA will do.)

## Purpose: reduce the number of features/dimensions
---
Advantages of dimensionality reduction:
- Increases computational efficiency when fitting models.
- Can help with addressing a multicollinearity problem.
- Makes visualization simpler (or feasible).
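The multicollinearity point can be illustrated with a small sketch. It builds hypothetical data in which two columns are nearly copies of each other, then compares the correlation matrix before and after PCA. The data and names are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Hypothetical data: column b is nearly a copy of column a.
a = rng.normal(size=500)
b = a + rng.normal(scale=0.2, size=500)
c = rng.normal(size=500)
X = np.column_stack([a, b, c])

# Standardize, then transform into principal components.
X_scaled = StandardScaler().fit_transform(X)
Z = PCA().fit_transform(X_scaled)

# Correlations among the original columns: a and b are highly correlated.
corr_X = np.corrcoef(X_scaled, rowvar=False)
print(corr_X.round(2))

# Correlations among the new columns: off-diagonal entries are essentially zero.
corr_Z = np.corrcoef(Z, rowvar=False)
print(corr_Z.round(2))
```

The principal components are uncorrelated with one another by construction, so a model fit on $Z$ does not face the correlated-predictor problem present in $X$.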

Disadvantages:
- We've invested our time and money into collecting information... why do we want to get rid of it?

## 2 categories of Dimensionality Reduction:
- **Feature Selection**
    - We drop variables from our model.
- **Feature Extraction**
    - In feature extraction, we take our existing features and combine them together in a particular way. We can then drop some of these "new" variables, but the variables we keep are still a combination of the old variables!
    - This allows us to still reduce the number of features in our model **but** we can keep all of the most important pieces of the original features!


## Feature Selection vs Feature Extraction
- Feature selection is a process of dropping original features from our model.
- Feature extraction is a process of transforming our original features into "new" features, then dropping some of the "new" features from our model.

## If I'm going to keep three of my new predictors, which three would I keep?
- The first three: $Z_1$, $Z_2$, and $Z_3$.
- This is how we do feature extraction.
    - We take our old features $X_1$, $X_2$, $X_3$, and $X_4$.
    - We turn them into new features $Z_1$, $Z_2$, $Z_3$, and $Z_4$.
    - The new features are combinations of our old features.
    - If we drop new features, we're doing dimensionality reduction, but we also keep parts of every old feature!
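A hedged sketch of the contrast, using four made-up columns named `X1` to `X4` (the data is synthetic and the names are only for illustration). Feature selection drops `X4` outright. Feature extraction keeps three new columns whose loadings draw on all four old columns.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Hypothetical data: four correlated features X1..X4.
base = rng.normal(size=(400, 4))
X = pd.DataFrame(base @ rng.normal(size=(4, 4)),
                 columns=['X1', 'X2', 'X3', 'X4'])

# Feature selection: drop an original column; its information is gone.
X_selected = X.drop(columns=['X4'])

# Feature extraction: build new features, then keep only the first three.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=3)
Z = pd.DataFrame(pca.fit_transform(X_scaled), columns=['Z1', 'Z2', 'Z3'])

# Each row shows how one new feature is built from the four old features.
loadings = pd.DataFrame(pca.components_, index=Z.columns, columns=X.columns)
print(loadings.round(2))
```

Both approaches end with three columns, but every row of `loadings` has an entry for each of `X1` through `X4`, so the kept components still carry part of every original feature.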

## PCA CODE!

In [11]:
# Instantiate our StandardScaler.
ss = StandardScaler()

# Standardize X_train.
X_train = ss.fit_transform(X_train)

# Standardize X_test.
X_test = ss.transform(X_test)

In [12]:
# Import PCA.
from sklearn.decomposition import PCA