![](img/563_banner.png)

# Lecture 3: Introduction to Principal Component Analysis (PCA)

UBC Master of Data Science program, 2021-22

Instructor: Varada Kolhatkar

## Lecture plan and learning outcomes

### Lecture plan 

- Introduction (~5 mins)
- Summary of the pre-watch videos (~15 mins)
- In-class activities and Q&A (~10 mins)
- Break (~5 mins)
- PCA applications (~30 mins)
- Final comments, summary, and reflection (~10 mins)


### Imports 

In [None]:
import os
import random
import sys

import numpy as np
import pandas as pd

sys.path.append("code/.")
import ipywidgets as widgets
import matplotlib.pyplot as plt
import mglearn
import seaborn as sns
from IPython.display import display
from ipywidgets import interact, interactive
from plotting_functions import *
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn import cluster, datasets, metrics
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from support_functions import *
from yellowbrick.cluster import SilhouetteVisualizer

plt.rcParams["font.size"] = 16
# plt.style.use("seaborn")
%matplotlib inline
pd.set_option("display.max_colwidth", 0)

### Learning outcomes <a name="lo"></a>

From this lecture, students are expected to be able to:

- Explain some issues caused by high-dimensional data and the need for dimensionality reduction.
- Explain the intuition behind Principal Component Analysis (PCA). 
- Describe the role and shapes of four matrices $X$, $W$, $Z$, and $\hat{X}$ in the context of dimensionality reduction techniques;
- Explain how to get $Z$ from $X$ and $W$. 
- Explain how to get $\hat{X}$ from $Z$ and $W$. 
- State the loss function for PCA. 
- Explain the difference between PCA and linear regression. 
- Explain how PCA can be used in data compression, better representation, and visualization.  
- Use `sklearn.decomposition.PCA` to perform Principal Component Analysis. 
- Use sklearn's `inverse_transform` to get reconstructions. 

<br><br><br><br>

## Dimensionality reduction: Motivation and introduction [[video](https://youtu.be/r-DwXpg1YDI)]

- Suppose you're shown the picture below and you are told that this is **Eva**. 
- Do you have to remember every pixel in the image to recognize other pictures of Eva? 

![](img/eva-tree.jpg)

- For example, if you are asked which one is Eva in the following pictures, it'll be fairly easy for you to identify her just based on some high-level features. 

![](img/hello-bmjs.png)

 

- Just remembering important features such as the shape of the eyes, nose, and mouth, and the shape and colour of the hair suffices to tell her apart from other people. 
- Can we learn such high-level features or **the summary** of the given raw features with machine learning models?
- Yes! With dimensionality reduction techniques! 

- As data scientists, given a dataset we either want to understand some phenomenon or build predictive models. 
- Very often the data we work with is clouded, complex, unclear, or even redundant.
- But in reality the underlying phenomenon we are trying to understand or the relationship between variables in the data is much simpler. 

### Toy example: nutritional value of pizzas

- Suppose we want to analyze nutritional value of pizzas of different brands. 
- Here is a [toy dataset](https://www.kaggle.com/shishir349/can-pizza-be-healthy) for this problem. 

In [None]:
pizza_df = pd.read_csv("data/pizza.csv")
pizza_df.head()

In [None]:
X_pizza = pizza_df.drop(columns=["id", "brand"])
y_pizza = pizza_df["brand"]
X_pizza.head()

In [None]:
X_pizza.shape

We have features such as the amounts of moisture, protein, fat, ash, sodium, and carbohydrates, and the calories per 100 grams of the sample.     

Let's examine correlations between different variables. 

In [None]:
corr_heatmat(X_pizza.corr(), w=6, h=3)
plt.show();

- There is redundancy in the data; many features are correlated.  
- Can we **summarize** these features in some meaningful way so that the data is cleaner and less redundant?

### What do we mean by "summarizing" the data? 

- Can we just discard some redundant features? 
- We have seen some (not very satisfactory) feature selection methods to identify least important features in a greedy way and throw away such features.
- This week we are going to look at a class of more sophisticated approaches for this, which are typically referred to as **dimensionality reduction**.  

### What is dimensionality reduction? 

**Dimensionality reduction** is the task of summarizing data or reducing a dataset in high dimension (e.g., 1000) to low dimension (e.g., 10) **while retaining the most "important" characteristics of the data.** 

Like feature selection, **dimensionality reduction** reduces the number of features. But
- We will not be just dropping columns as we did in feature selection. 
- The idea of (linear) dimensionality reduction is to project high dimensional data to low dimensional space  while retaining the most "important" characteristics of the data. 
- We can also **reconstruct** the original data (with some error) from this transformed data.

### How do we reduce the dimensions?

- The techniques we are going to look at this week summarize the data by creating new features which are **linear combinations of the original features**. 
- Example: 
$$\text{new_feature} = 0.44 \times fat + 0.47 \times ash - 0.42 \times carb \dots$$
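To make the idea of a linear combination concrete, here is a minimal sketch. The feature values in `X_toy` are made up for illustration, and only the three coefficients shown in the example above are used (the remaining terms are elided there).

```python
import numpy as np

# Hypothetical, already-scaled values for three features of two pizzas
# (columns: fat, ash, carb). These numbers are made up for illustration.
X_toy = np.array([[1.2, 0.8, -1.0],
                  [-0.5, -0.3, 0.9]])

# Coefficients of the new feature, taken from the example above
w = np.array([0.44, 0.47, -0.42])

# Each new feature value is a weighted sum of the original feature values
new_feature = X_toy @ w
print(new_feature)
```

Each row of `X_toy` is summarized by a single number, so three columns become one.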

### Dimensionality reduction toy example

- Let's apply a popular dimensionality reduction technique called Principal Component Analysis (PCA) using [`sklearn`'s `PCA`](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) on our nutritional value of pizzas toy example. 
- Learning a PCA model and transforming data is similar to applying preprocessing transformations in `sklearn`. 
- You can learn a PCA model and transform the data using `fit` and `transform` methods, respectively. 

In [None]:
n_components = (
    2  # summarize the data with only two features (typically called components)
)
pipe_pca = make_pipeline(
    StandardScaler(), PCA(n_components=n_components)
)  # scaling before PCA is crucial. We'll see the reason later.
Z = pipe_pca.fit_transform(X_pizza)  # transform the data
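As a rough illustration of why scaling matters, the sketch below builds two hypothetical correlated features on very different scales and compares the first principal direction with and without standardization. The data and the helper `first_component` are assumptions for this example, not part of the lecture code.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Two hypothetical correlated features on very different scales
small = rng.normal(size=n)
large = 1000 * (small + rng.normal(size=n))
X_toy = np.column_stack([small, large])


def first_component(X):
    """Return the first principal direction of X via SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]


w_raw = first_component(X_toy)

# Standardize each column to zero mean and unit variance, then repeat
X_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
w_scaled = first_component(X_std)

print("unscaled:", np.round(np.abs(w_raw), 3))
print("scaled:  ", np.round(np.abs(w_scaled), 3))
```

Without scaling, the first component points almost entirely along the large-scale feature; after scaling, both features contribute equally.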

What does the data look like after dimensionality reduction? 

In [None]:
component_labels = ["PCA" + str(i + 1) for i in range(n_components)]
pd.DataFrame(Z, columns=component_labels, index=X_pizza.index).head()

- We have reduced dimensionality from original 7 features to 2 features. 
- The two new features can be thought of as the **summary** of the original features. 

### What has it learned?

- It has learned the "most informative" linear combinations of the features. 
- Each new feature (principal component) has a coefficient associated with each of the original features and the value of the new feature is a linear combination of the original features.   

In [None]:
W = pipe_pca.named_steps["pca"].components_
pd.DataFrame(W, columns=X_pizza.columns, index=component_labels)

$$\text{PCA1} = 0.064709 \times \text{mois} + 0.378761 \times \text{prot} + \dots - 0.424914 \times \text{carb} + 0.244487 \times \text{cal}$$

$$\text{PCA2} = -0.628276 \times \text{mois} - 0.628276 \times \text{prot} + \dots + 0.320312 \times \text{carb} + 0.567458 \times \text{cal}$$

In [None]:
np.round(Z[0, :], 4)  # transformed values for the 0th example

In [None]:
x0_scaled = pipe_pca.named_steps["standardscaler"].transform(X_pizza)[0, :]
np.round((np.dot(x0_scaled, W[0, :]), np.dot(x0_scaled, W[1, :])), 4)

In [None]:
plot_pca_w_vectors(W, component_labels, X_pizza.columns, width=800, height=800)

### How good is the summary? 

- We can look at how much information from the original dataset these two newly created dimensions have captured. 

In [None]:
pipe_pca.named_steps["pca"].explained_variance_ratio_.sum()

We are capturing 92.31% of the variance (our measure of "information") in the data using only two of these newly created features!  

Let's look at how much "information" we can capture with different number of components. 

In [None]:
n_components = len(X_pizza.columns)
pipe_pca = make_pipeline(StandardScaler(), PCA(n_components=n_components))
pipe_pca.fit(X_pizza)

In [None]:
df = pd.DataFrame(
    data=np.cumsum(pipe_pca["pca"].explained_variance_ratio_) * 100,  # ratios -> percentages
    columns=["variance_explained (%)"],
    index=range(1, n_components + 1),
)
df.index.name = "n_components"

In [None]:
simple_bar_plot(
    x=df.index.tolist(),
    y=df["variance_explained (%)"],
    x_title="n_components",
    y_title="variance explained (%)",
)

So the first two components summarize most of the information (92.31%) in the data!! 

### Common use cases for dimensionality reduction 

Overall this idea of summarizing the data in a meaningful way is going to be super useful and there are tons of applications for this. 

- **Data compression**    
- **Feature extraction** in a machine learning pipeline 
    - Last week, we created PCA representation of face images before passing it to K-Means clustering. 
- **Visualization**
    - Last week, we carried out dimensionality reduction using PCA to visualize our high dimensional data.
- **Anomaly detection**
- ...

### Dimensionality reduction techniques 

We'll talk about the following linear dimensionality reduction techniques. 

- [Principal Component Analysis (PCA)](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) (in detail)
- [TruncatedSVD or Latent Semantic Analysis (LSA)](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) (brief discussion)
- [Non-negative matrix factorization (NMF)](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html) (brief discussion)

All these techniques can be viewed as applying transformations or "change of basis". 

<br><br><br><br>

## Principal Component Analysis (PCA): Intuition and terminology [[video](https://youtu.be/33TRSSuzALw)]

- PCA has been around for more than 100 years and is one of the most widely used dimensionality reduction techniques. 
- Examples:  
    - [The Big Five personality traits](https://en.wikipedia.org/wiki/Big_Five_personality_traits) (extroversion, agreeableness, conscientiousness, neuroticism, openness to experience) were discovered using PCA. 

### Hockey-stick curve of global warming 

The famous hockey-stick curve of global warming was created by applying PCA on various temperature-related time series (tree rings, ice cores, etc.). [Source](https://www.wsj.com/articles/SB110834031507653590).

![](img/climate-change-hockey.gif)



### PCA intuition

- PCA summarizes the data by finding linear combinations of features. 
- In fact, PCA finds the **best linear combinations** of the original features so that 
    - the first component has the most information 
    - the second component has the second most information 
    - and so on 
- What do we mean by finding the best linear combinations?    

### Do we need two dimensions to describe these data points? 

- Let's create some synthetic data with two dimensions: $x_1$ and $x_2$   

In [None]:
np.random.seed(42)
x1 = np.random.randn(10)
x2 = 2 * x1
X = np.stack([x1, x2], axis=1)
plt.scatter(X[:, 0], X[:, 1], linewidths=4)
plt.plot(x1, x2, c="k", linewidth=0.5)
plt.xlabel("x1")
plt.ylabel("x2");

### Data with some noise

In [None]:
np.random.seed(42)
x1 = np.random.randn(10)
x2 = x1 + np.random.randn(10) / 3
X = pd.DataFrame(data=np.stack([x1, x2], axis=1), columns=["x1", "x2"])
X_scaled = StandardScaler().fit_transform(X)

In [None]:
plt.scatter(
    X_scaled[:, 0], X_scaled[:, 1], c=X_scaled[:, 0], linewidths=3, cmap="viridis"
);

We are using colour just to keep track of different points. It doesn't particularly mean anything here. 

In [None]:
data = pd.DataFrame(X_scaled, columns=["x1", "x2"])
data

### Feature selection scenario

- What would happen if we drop column x1? 

In [None]:
plot_feature_selection(data, 15, 6, drop=0)

All points are projected on the x2 axis. 

### Feature selection scenario

- What would happen if we drop column x2? 

In [None]:
plot_feature_selection(data, 15, 6, drop=1)

All points are projected on the x1 axis. 

### PCA idea

- How about finding an optimal line going through these points and projecting our data points on this line? 

In [None]:
def f(angle):
    plot_pca_model_search(data, alpha=angle, w=8, h=6)

In [None]:
interactive(
    f,
    angle=widgets.IntSlider(min=0, max=180, step=1, value=0),
)

- Reconstruction error: sum of the squared distances between original points and projected points.  
- PCA picks the direction which gives the smallest **reconstruction error**. 

- Interestingly, this is the same as **maximizing the variance of the projected points**.
- So PCA learns a linear model (e.g., lines, planes, or hyperplanes) which minimizes the "reconstruction error" or maximizes the variance of the projected points.     
- We'll look at the actual loss function later. 
- First, let's understand some terminology. 
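The equivalence between the two views can be checked numerically. The sketch below is a small illustration on hypothetical centered 2D data: it tries every whole-degree direction, then compares the direction with the smallest reconstruction error to the direction with the largest projected variance.

```python
import numpy as np

rng = np.random.default_rng(42)
x1 = rng.normal(size=50)
x2 = x1 + rng.normal(size=50) / 3
X_toy = np.column_stack([x1, x2])
X_toy = X_toy - X_toy.mean(axis=0)  # center the data

angles = np.deg2rad(np.arange(0, 180))
errors, variances = [], []
for a in angles:
    w = np.array([np.cos(a), np.sin(a)])  # unit vector for this direction
    z = X_toy @ w                         # 1D coordinates along the line
    X_proj = np.outer(z, w)               # projected points back in 2D
    errors.append(np.sum((X_toy - X_proj) ** 2))
    variances.append(np.var(z))

best_by_error = int(np.argmin(errors))
best_by_variance = int(np.argmax(variances))
print(best_by_error, best_by_variance)
```

For centered data, the total squared length of the points is fixed, so whatever is not captured as variance along the line shows up as reconstruction error. That is why both criteria pick the same angle.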

In [None]:
pca = PCA(n_components=1).fit(data)
plot_pca_reconstructions(data, pca)

In [None]:
n = 12
d = 3

x1 = np.linspace(0, 5, n) + np.random.randn(n) * 0.05
x2 = -x1 * 0.1 + np.random.randn(n) * 2
x3 = x1 * 0.7 + np.random.randn(n) * 3

X = np.concatenate((x1[:, None], x2[:, None], x3[:, None]), axis=1)
X = X - np.mean(X, axis=0)
plot_interactive_3d(X)

In [None]:
pca = PCA(n_components=2)
pca.fit(X)
plot_2d_1k(X, pca)

### PCA input/output

- Let's bring back our nutritional value of pizza dataset.  

In [None]:
X_pizza

In [None]:
X_pizza_scaled = StandardScaler().fit_transform(X_pizza)  # scale the data

### PCA input

In [None]:
from sklearn.decomposition import PCA

n_components = 2  # k = 2
pca = PCA(n_components=n_components)
pca.fit(X_pizza_scaled);

- A scaled matrix $X$ with $d$ dimensions (features) and $n$ examples
- Number of components $k$
   - We need to specify how many components we want to keep ($k$).
- In our case, $n=300$, $d =7$, and $k=2$   

### PCA output

Two matrices: $Z$ and $W$

- Projected data ($Z$)
- Basis vectors ($W$)

In [None]:
# Projected data Z
Z = pca.transform(X_pizza_scaled)  # transform the data
component_labels = ["PC" + str(i + 1) for i in range(n_components)]
pd.DataFrame(Z, columns=component_labels, index=X_pizza.index)

In [None]:
Z.shape

In [None]:
# Basis vectors W
W = pca.components_
pd.DataFrame(W, columns=X_pizza.columns, index=component_labels)

In [None]:
W.shape

### Output of PCA: transformed data $Z$ 

- Suppose the original data matrix $X$ has $n$ rows and $d$ columns, and we specify the number of components as $k$. 
- $Z$: Each row of $Z$ is a set of "part weights" or "factor loadings" or "features"
    $$Z = \begin{bmatrix}
        z_{11} & \ldots & z_{1k}\\ 
        z_{21} & \ldots & z_{2k}\\ 
        & \vdots &\\
        z_{n1} & \ldots & z_{nk}
        \end{bmatrix}_{n \times k}
    $$

- It has $n$ rows and $k$ columns in contrast to $d$ columns in the original data (usually $k \ll d$). 

### Output of PCA: Basis vectors $W$ 
- $W$: Each row of $W$ is a set of **factors**, **principal components**, **parts**, or **basis vectors**. 
    $$W = \begin{bmatrix}
            w_{11} & \ldots & w_{1d}\\ 
            w_{21} & \ldots & w_{2d}\\ 
            & \vdots &\\
            w_{k1} & \ldots & w_{kd}
            \end{bmatrix}_{k \times d}
    $$

- We can access $W$ using `components_` attribute of the `PCA` object. 
- $W$ has $k$ rows, one for each component. 
- Each row has a coefficient or weight associated with all $d$ features. 
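The shapes of $W$ and $Z$, and how $Z$ is obtained from $X$ and $W$, can be sketched on hypothetical random data. The names `X_toy`, `pca_toy`, and `W_toy` are used only for this illustration; `mean_` is the per-feature mean that `PCA` learns during `fit`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(20, 5))  # hypothetical data: n=20, d=5

pca_toy = PCA(n_components=2).fit(X_toy)
W_toy = pca_toy.components_  # shape (k, d) = (2, 5)

# transform() centers the data with the learned mean, then projects onto rows of W
Z_manual = (X_toy - pca_toy.mean_) @ W_toy.T  # shape (n, k) = (20, 2)
Z_sklearn = pca_toy.transform(X_toy)

print(W_toy.shape, Z_manual.shape, np.allclose(Z_manual, Z_sklearn))
```

When the data has already been standardized, the learned mean is essentially zero, so $Z = XW^T$.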

### Interpretation of the coefficients

- You can interpret these coefficients similar to linear regression. 
    - Higher magnitude of the coefficient means the feature has a strong effect on the corresponding principal component. 
    - Positive coefficient means the feature and the principal component are positively correlated.
    - Negative coefficient means the feature and the principal component are negatively correlated. 

In [None]:
plot_pca_w_vectors(W, component_labels, X_pizza.columns, width=800, height=800)

### Reconstruction 

- In dimensionality reduction, unlike feature selection, we are not exactly throwing away features; we can reconstruct $X$ (with some error) by multiplying $Z$ and $W$ matrices. 

In [None]:
pd.DataFrame(Z, columns=component_labels, index=X_pizza.index)

In [None]:
pd.DataFrame(W, columns=X_pizza.columns, index=component_labels)

### $\hat{X}$ in the example above 
We can reconstruct examples by multiplying $Z$ and $W$. 

In [None]:
X_pizza_hat = Z @ W

In [None]:
pd.DataFrame(X_pizza_hat, columns=X_pizza.columns).round(4)

### Reconstruction using `inverse_transform`
- We can also get the reconstructed data using the `inverse_transform` method of the PCA object. 

In [None]:
X_pizza_hat = pca.inverse_transform(Z)
pd.DataFrame(X_pizza_hat, columns=X_pizza.columns).round(4)
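A small sketch of what `inverse_transform` does, on hypothetical data that is deliberately not centered: it multiplies $Z$ and $W$ and then adds back the per-feature means stored in `mean_`. For standardized data such as the scaled pizza features, those means are essentially zero, which is why `Z @ W` gives the same result there.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X_toy = rng.normal(loc=5.0, size=(30, 4))  # hypothetical uncentered data

pca_toy = PCA(n_components=2).fit(X_toy)
Z_toy = pca_toy.transform(X_toy)
W_toy = pca_toy.components_

# Reconstruction by hand: multiply Z and W, then add back the feature means
X_hat_manual = Z_toy @ W_toy + pca_toy.mean_
X_hat_sklearn = pca_toy.inverse_transform(Z_toy)

print(np.allclose(X_hat_manual, X_hat_sklearn))
```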

### More formally
- We can get $\hat{X}_{n \times d}$ (reconstructed $X$) by matrix multiplication of $Z_{n \times k}$ and $W_{k \times d}$. 
$$\hat{X}_{n \times d} = Z_{n \times k}W_{k \times d} = \begin{bmatrix}
        z_{11} & \ldots & z_{1k}\\ 
        z_{21} & \ldots & z_{2k}\\ 
        & \vdots &\\
        z_{n1} & \ldots & z_{nk}
        \end{bmatrix}_{n \times k} \times 
        \begin{bmatrix}
            w_{11} & \ldots & w_{1d}\\ 
            w_{21} & \ldots & w_{2d}\\ 
            & \vdots &\\
            w_{k1} & \ldots & w_{kd}
            \end{bmatrix}_{k \times d}$$
- For instance, you can reconstruct an example $\hat{x_{i}}$ as follows:  

$$\hat{x_{i}} = \begin{bmatrix} z_{i1}w_{11} + z_{i2}w_{21} + \dots + z_{ik}w_{k1} \\ z_{i1}w_{12} + z_{i2}w_{22} + \dots + z_{ik}w_{k2}\\ \vdots\\ z_{i1}w_{1d} + z_{i2}w_{2d} + \dots + z_{ik}w_{kd}\end{bmatrix}_{d \times 1}$$

### Reconstruction error

- How good is the reconstructed data? 
- Are we able to accurately reconstruct the original data? 
- Let's compare our reconstructions to the original scaled data. 

In [None]:
pd.DataFrame(X_pizza_scaled).head(5)  # original scaled data

In [None]:
pd.DataFrame(X_pizza_hat).head(5)  # reconstructions

- Let's calculate squared distances between original data and reconstructed data. 

In [None]:
def reconstruction_error(X, X_hat):
    error = np.sum((np.array(X) - np.array(X_hat)) ** 2, axis=1)
    return error

In [None]:
recon_df = pd.DataFrame(reconstruction_error(X_pizza_scaled, X_pizza_hat))

- As we can see, the reconstruction error is different for different examples. 

In [None]:
recon_df.head(10)

- One way to summarize these distances is by looking at the mean or median reconstruction error. 

In [None]:
recon_df.median()

### Interim summary
- Principal Component Analysis (PCA) is one of the widely used dimensionality reduction techniques. 
- The overall idea is to project high dimensional data onto a lower dimensional space to get a new representation. 
- It applies a linear transformation on the data and so it's a **linear dimensionality reduction** technique. 
- As input it takes number of components and scaled data and as output it results in two matrices: the transformed data matrix $Z$ and the weight matrix $W$.  
- It's possible to reconstruct the original data (with some error) by multiplying $Z$ and $W$ matrices. 

<br><br><br><br>

## PCA loss function 

In the previous section we looked at
- the intuition behind PCA
- the input and output of PCA
- how it approximates data matrix $X$ by matrix-matrix product $ZW$ or in other words how it approximates each example $x_i$ by the matrix-vector product $W^Tz_i$  

- You will find a number of views and explanations for PCA.     
- One way to view PCA is that it learns the hyperplane that minimizes the reconstruction error in the least squares sense. 
- Let's get an intuition for PCA loss function. 


- Let's generate some 2D data. 

In [None]:
np.random.seed(42)
feat1 = np.random.randn(10)
feat2 = feat1 + np.random.randn(10) / 3
X = pd.DataFrame(data=np.stack([feat1, feat2], axis=1), columns=["feat1", "feat2"])
X_scaled = StandardScaler().fit_transform(X)
data = pd.DataFrame(X_scaled, columns=["feat1", "feat2"])
data.head()

In [None]:
pca = PCA(n_components=1, random_state=42)
pca.fit(data)
Z = pca.transform(data)  # transformed data
pd.DataFrame(Z, columns=["PC1"])

### PCA reconstructions

In [None]:
X_hat = pca.inverse_transform(Z)
pd.DataFrame(X_hat, columns=["recon_feat1", "recon_feat2"])

Let's compare our reconstructions (`X_hat`) and original scaled data (`X_scaled`). 

In [None]:
plot_pca_model_reconstructions(data, pca)

- In our case $d = 2$ and $k = 1$. The green line corresponds to $W_{k\times d}$ in the new 1D coordinate system.
- `X_hat` are reconstructions. 
- PCA learns an optimal line, plane, or hyperplane so that reconstruction error is minimized. 
- Goal: Minimize the sum of squared distances between blue points (original) and red points (reconstructions).  

### PCA objective function

In PCA we minimize the sum of squared error between elements of $X$ and elements of $ZW$: 

$$f(W,Z) = \sum_{i=1}^{n} \lVert{W^Tz_i - x_i}\rVert^2_2 $$

- $W^Tz_i \rightarrow$ reconstructed example 
- $x_i \rightarrow$ original example 

**Idea**: What are the best two matrices $W$ and $Z$ we can come up with so that when we multiply them we get a matrix that's closest to the original data?   
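The objective can be evaluated directly for any candidate $W$ and $Z$. The sketch below uses hypothetical centered data and a helper `pca_loss` (an assumed name, not a library function). It compares the loss for the first principal component, obtained here from NumPy's SVD, with the loss for an arbitrary unit-length direction.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 3))
X_toy = X_toy - X_toy.mean(axis=0)  # centered data, n=50, d=3


def pca_loss(X, W, Z):
    """Sum over examples of the squared distance between W^T z_i and x_i."""
    return np.sum((Z @ W - X) ** 2)


_, s, Vt = np.linalg.svd(X_toy, full_matrices=False)
W_pca = Vt[:1]  # k=1: first principal component, shape (1, 3)
Z_pca = X_toy @ W_pca.T

# A different unit-length direction for comparison
W_other = np.array([[1.0, 0.0, 0.0]])
Z_other = X_toy @ W_other.T

loss_pca = pca_loss(X_toy, W_pca, Z_pca)
loss_other = pca_loss(X_toy, W_other, Z_other)
print(loss_pca, loss_other)
```

The loss for the principal component is never larger than the loss for the other direction, and it equals the sum of the squared singular values of the components that were dropped.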

### PCA vs. linear regression

- Minimizing squared error might remind you of linear regression.
- **BUT THEY ARE NOT THE SAME.**

- In case of linear regression, 
    - We minimize the squared error between true `y` and predicted `y`. 
    - We only care about the vertical distance because our goal is to predict `y` which we represent on the y-axis.
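- In case of PCA, 
    - There is no `y`; we minimize the squared orthogonal (perpendicular) distance between each point and its projection on the learned line or hyperplane. 
- Unlike in regression, we are also learning the features $z_i$ in PCA. 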

In [None]:
plot_pca_regression(data, 8, 6, error_type="both")
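To see that the two methods really fit different lines, here is a minimal sketch on hypothetical noisy 2D data. It compares the least-squares regression slope (vertical errors) with the slope of the first principal direction (orthogonal errors), both computed with plain NumPy.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = x + rng.normal(size=200)  # hypothetical noisy relationship
x, y = x - x.mean(), y - y.mean()

# Linear regression of y on x: minimizes vertical squared distances
slope_reg = (x @ y) / (x @ x)

# PCA: first principal direction minimizes orthogonal squared distances
X_toy = np.column_stack([x, y])
_, _, Vt = np.linalg.svd(X_toy, full_matrices=False)
w = Vt[0]
slope_pca = w[1] / w[0]  # the ratio is unaffected by the sign of w

print(round(slope_reg, 3), round(slope_pca, 3))
```

With positively correlated, noisy data, the PCA line is steeper than the regression line because it treats both coordinates symmetrically instead of attributing all the error to `y`.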

### PCA algorithm 

- We want to find a transformation of the data such that we do not lose much information. In other words, 
    - we want the projections to be as close as possible to the original data
- We can use optimization algorithms to find the solution. 
- But there is a better and faster way using linear algebra! 

The standard PCA is as follows: 

- Center $X$ (subtract mean). (In practice, also scale to unit variance, i.e., apply `StandardScaler`.) 
- Compute **singular value decomposition (SVD)** of the data matrix to get principal components ($W$) and corresponding singular values which are associated with the **variance of each of the principal components**. 
- Drop principal components with smallest singular values for dimensionality reduction.  

### Singular Value Decomposition (SVD)

- Singular value decomposition decomposes the given real-valued matrix $X_{n \times d}$ into three matrices: 

$$X_{n \times d} = U_{n \times n}S_{n\times d}V^T_{d \times d}$$
- $U_{n \times n}$ is an orthogonal matrix (not necessarily calculated in practice)
- $S_{n\times d}$ is a diagonal matrix containing singular values; the squared singular values are proportional to the variance along each of the principal components.
- $V^T_{d \times d}$ is an orthogonal matrix which contains component vectors. 
    - For dimensionality reduction we drop rows of $V^T$. 
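The steps above can be sketched with `np.linalg.svd` on hypothetical data. This is an illustration of the idea rather than a description of how `sklearn` implements it. Note that `full_matrices=False` requests the "thin" SVD, so `U` here is $n \times d$ rather than $n \times n$.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))  # hypothetical data
n, d = X_toy.shape
k = 2

# 1. Center the data
Xc = X_toy - X_toy.mean(axis=0)

# 2. SVD of the centered data; singular values come sorted in decreasing order
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# 3. Keep the first k rows of V^T as principal components
W_toy = Vt[:k]        # shape (k, d)
Z_toy = Xc @ W_toy.T  # shape (n, k)

# Variance along each component is s**2 / (n - 1)
component_variance = s**2 / (n - 1)
print(W_toy.shape, Z_toy.shape, np.round(component_variance[:k], 3))
```

The rows of `W_toy` are orthonormal, and the sample variance of each column of `Z_toy` matches the corresponding entry of `component_variance`.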


- Another popular view of PCA is that it maximizes the **variance** of the projected (transformed) points. 
- We search for the direction of highest variance.  
    - This direction is called the **first principal component**.     
    - The next direction with highest variance is the **second principal component**.
    - And so on ...

Let's look at an example. 

In [None]:
# source: Introduction to Machine Learning with Python book
mglearn.plots.plot_pca_illustration()
plt.show()


- Find the direction of maximum variance (Component 1). The direction with maximum variance contains most information in the data. 
- Find second direction which contains most information while being orthogonal (at right angle) to the first direction. (Component 2) 
- The head or tail of the arrows does not matter. They can point in any direction. 
- The directions found by this process are called **principal components**, as they are the main directions of variance in the data. 
- There are usually as many components as the original features.
- In dimensionality reduction, we consider the first $k \ll d$ most important components. 

- The top right plot shows the same data but rotated so that the first component aligns with the x-axis and the second component with the y-axis. 
- The mean is subtracted from the data before rotation so that the data is centered around zero. 
- In the bottom left plot, we are reducing dimensions from two dimensions to one dimension. 
- We are keeping the most interesting direction with maximum variance (component 1). 
- In the bottom right plot we undo the rotation and add the mean back to the data. This is our reconstruction of the data. 

### General idea 

- We want to find a transformation such that the transformed features are statistically uncorrelated. 
- We find new axes/basis vectors (rows of $W$) which are mutually orthogonal.
- Each axis has an associated eigenvalue, which corresponds to the variance in that direction; the axes are ordered so that these eigenvalues decrease monotonically. 
- Since the eigenvalues decrease monotonically, once we are in the new space, we can drop the axes with the smallest eigenvalues for dimensionality reduction.   

- Line up the variance of the data along these axes. 
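A quick numerical check of the "uncorrelated" claim, on hypothetical correlated data: after expressing the data in the new basis, the covariance matrix of the transformed features is diagonal, with the variances sorted in decreasing order.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))  # correlated features
Xc = X_toy - X_toy.mean(axis=0)

_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z_toy = Xc @ Vt.T  # data in the new basis (all components kept)

cov_Z = np.cov(Z_toy, rowvar=False)
print(np.round(cov_Z, 6))  # off-diagonal entries are numerically zero
```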

### (Optional) Uniqueness of PCA solution

- SVD gives us a solution with the following constraints to make it close to unique.     
    - Normalization: we enforce that $\lVert w_c \rVert = 1$
    - Orthogonality: we enforce that $w_c^T w_{c'} = 0$ for all $c \neq c'$
    - Sequential fitting: 
        - We first fit $w_1$ ("first principal component") giving a line.
        - Then fit $w_2$ given $w_1$ ("second principal component") giving a plane.
        - Then we fit $w_3$ given $w_1$ and $w_2$ ("third principal component") giving a hyperplane and so on 

Even with all this, the solution is only unique up to sign changes. 
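The sign ambiguity can be illustrated with a short sketch on hypothetical data: flipping the sign of a component flips the sign of the corresponding column of $Z$, but the reconstruction $ZW$ stays the same.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(10, 3))
Xc = X_toy - X_toy.mean(axis=0)

_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
W_toy = Vt[:2]
Z_toy = Xc @ W_toy.T

# Flip the sign of the first component and recompute the coordinates
W_flipped = W_toy.copy()
W_flipped[0] *= -1
Z_flipped = Xc @ W_flipped.T

# The coordinates change sign, but the reconstruction is identical
print(np.allclose(Z_toy @ W_toy, Z_flipped @ W_flipped))
```

This is why the signs of PCA components can differ between runs or library versions without the fit being any worse.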

<br><br><br><br>

## ❓❓ Questions for you
iClicker cloud join link: https://join.iclicker.com/MA16T

### Select all of the following statements which are **True** (iClicker)

- (A) Each principal component of PCA is $d$ dimensional, where $d$ is the dimensionality of the original data.   
- (B) You can think of transformed data $Z$ as the co-ordinates of $X$ in the new basis.
- (C) When $k=d$, $Z$ will be exactly the same as $X$.  
- (D) When $k=d$, $\hat{X}$ will be exactly the same as $X$.
- (E) In PCA, it's best to pick $k$ as the largest value so that we capture most of the variance in the data.
<br><br><br><br>

```{admonition} Exercise 3.1

- (A) Each principal component of PCA is $d$ dimensional, where $d$ is the dimensionality of the original data.   
- (B) You can think of transformed data $Z$ as the co-ordinates of $X$ in the new basis.
- (C) When $k=d$, $Z$ will be exactly the same as $X$.  
- (D) When $k=d$, $\hat{X}$ will be exactly the same as $X$.
- (E) In PCA, it's best to pick $k$ as the largest value so that we capture most of the variance in the data.

```

```{admonition} Exercise 3.1: V's Solutions!
:class: tip, dropdown

- (A) True
- (B) True
- (C) False
- (D) True
- (E) False
```

### Select all of the following statements which are **True** (iClicker)

- (A) If you are reducing dimensionality from $d$ dimensions to $k$ dimensions using PCA, where $k\leq d$, then your model is a $k$ dimensional hyperplane. 
- (B) In PCA, it's possible to select how many components you want to keep by looking at reconstructions.
- (C) In PCA, it's possible to identify the most dominant features associated with each principal component.
- (D) In PCA, the first principal component is always the one with highest variance in the data.
- (E) Since PCA and linear regression both minimize squared error, PCA can be thought of as an unsupervised alternative for linear regression. 
<br><br><br><br>

```{admonition} Exercise 3.2

- (A) If you are reducing dimensionality from $d$ dimensions to $k$ dimensions using PCA, where $k\leq d$, then your model is a $k$ dimensional hyperplane. 
- (B) In PCA, it's possible to select how many components you want to keep by looking at reconstructions.
- (C) In PCA, it's possible to identify the most dominant features associated with each principal component.
- (D) In PCA, the first principal component is always the one with highest variance in the data.
- (E) Since PCA and linear regression both minimize squared error, PCA can be thought of as an unsupervised alternative for linear regression. 

```

```{admonition} Exercise 3.2: V's Solutions!
:class: tip, dropdown

- (A) True
- (B) True
- (C) True
- (D) True
- (E) False
```

### Discussion questions
- Why is PCA causing loss of information? Is it possible to use PCA without loss of information?

<br><br><br><br>

## PCA applications 

In [None]:
import matplotlib as mpl

mpl.rcParams.update(mpl.rcParamsDefault)
plt.rcParams["image.cmap"] = "gray"

### Data 

- We'll be working with `sklearn`'s [Labeled Faces in the Wild dataset](https://scikit-learn.org/0.16/datasets/labeled_faces.html). 
- The dataset has images of celebrities from the early 2000s downloaded from the internet. 

> Credit: This example is based on the example from [here](https://learning.oreilly.com/library/view/introduction-to-machine/9781449369880/ch03.html).

In [None]:
from sklearn.datasets import fetch_lfw_people

people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)

fig, axes = plt.subplots(1, 5, figsize=(10, 8), subplot_kw={"xticks": (), "yticks": ()})
for target, image, ax in zip(people.target, people.images, axes.ravel()):
    ax.imshow(image)
    ax.set_title(people.target_names[target])
plt.show()

In [None]:
people.data.shape

In [None]:
image_shape = (87, 65)

- There are 3,023 images stored as arrays of 5655 pixels (87 by 65), belonging to 62 different people. 
- The data is skewed. Let's make the data less skewed by taking only 20 images of each person. 

In [None]:
mask = np.zeros(people.target.shape, dtype=bool)  # np.bool was removed in recent NumPy versions
for target in np.unique(people.target):
    mask[np.where(people.target == target)[0][:20]] = 1

X_people = people.data[mask]
y_people = people.target[mask]
# scale the grayscale values to be between 0 and 1
# instead of 0 and 255 for better numeric stability
X_people = X_people / 255.0

### PCA for feature extraction

- An important application of PCA is feature extraction. 
- Sometimes instead of using the raw data, it's useful to come up with a more interpretable and compact representation of the data. 
- For instance, in case of images, instead of looking at individual pixels, it's useful to look at important components. 

- Let's apply PCA on the data with `n_components=100`.
- We'll look at how to pick `n_components` in the next lecture.  

In [None]:
n_components = 100
pca = PCA(n_components=n_components, random_state=123)
pca.fit(X_people)

How much variance are we covering with 100 components? 

In [None]:
pca.explained_variance_ratio_.sum()

In [None]:
Z = pca.transform(X_people)  # Transform the data
W = pca.components_  # principal components
X_hat = pca.inverse_transform(Z)  # reconstructions

What will be the shapes of $Z$, $W$, and $\hat{X}$?

In [None]:
X_people.shape 

In [None]:
Z.shape  

In [None]:
W.shape

In [None]:
X_hat.shape

### Data compression 

- One way to think of PCA is that it's a data compression algorithm. 
- If we store only $Z$ and $W$ instead of $X$, are we going to save space? Discuss with your neighbour.  
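
After discussing, one way to check your answer is to count the stored numbers. The sketch below assumes $n = 1240$ images (62 people with 20 images each), $d = 5655$ pixels, and $k = 100$ components; it also counts the $d$-dimensional mean vector that `sklearn` needs for reconstruction. Keep in mind that this compression is lossy, since $ZW$ only approximates $X$.

```python
# Assumed sizes: 62 people x 20 images, 87 x 65 pixels, 100 components
n, d, k = 1240, 5655, 100

values_X = n * d  # numbers needed to store the raw data
values_ZW = n * k + k * d + d  # Z, W, and the mean vector used for centring

print("Storing X:", values_X)
print("Storing Z, W, and the mean:", values_ZW)
print("Compression factor:", round(values_X / values_ZW, 1))
```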

### Components learned by PCA

In [None]:
W

- We won't quite be able to examine coefficients associated with all features and make sense of them because each feature in the original data just represents a pixel and there are a lot of them. 
- But we can show principal components as images and can look for semantic themes in them. 
- Let's examine the first few components learned by PCA as images.

In [None]:
fig, axes = plt.subplots(2, 5, figsize=(10, 6), subplot_kw={"xticks": (), "yticks": ()})
for i, (component, ax) in enumerate(zip(W, axes.ravel())):
    ax.imshow(component.reshape(image_shape), cmap="viridis")
    ax.set_title("{}. component".format(i + 1))  # 1-based, to match the text below

- The components encode some semantic aspects of images. 
- It's not always easy to interpret what semantic aspect each component is roughly capturing but we could make some guesses. 
- The first component is probably encoding the contrast between the face and the background. 
- The second component is probably encoding the difference in lighting between the left and right halves of the face. 

### Original images vs. Reconstructed images 

- We can reconstruct the images from the transformed data $Z$ and components $W$. 
- Let's compare original images with the reconstructed images. 

In [None]:
plot_orig_reconstructed_faces(X_people[40:], X_hat[40:])

Decent reconstruction given that we are using only 100 components!! 

### How many components? 
- We can decide this based on how much variance is covered by how many components (next lecture).
- Or we can look at reconstructions of faces with a varying number of components and pick $k$ based on the application. 
- Below we are reconstructing 3 faces with a varying number of components in the range 10 to 500. 
- As we can see, with 100 components we are already capturing recognizable faces (compared to 5655 features in the original image). 

In [None]:
mglearn.plots.plot_pca_faces(X_people, X_people, image_shape)
plt.show();
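
The same trade-off can be put into numbers by measuring the reconstruction error for several values of $k$. The sketch below is an illustration under assumptions, not part of the faces analysis: it uses `sklearn`'s small bundled digits dataset (64 pixels per image) as a stand-in so that it runs without a download, and the list of $k$ values is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_digits = load_digits().data  # 1797 images with 64 pixels each (stand-in data)

errors = {}
for k in [2, 8, 16, 32, 64]:
    pca_k = PCA(n_components=k, svd_solver="full").fit(X_digits)
    X_hat_k = pca_k.inverse_transform(pca_k.transform(X_digits))
    # Mean squared difference between original and reconstructed pixels
    errors[k] = np.mean((X_digits - X_hat_k) ** 2)
    print(f"k = {k:2d}: mean squared reconstruction error = {errors[k]:.3f}")
```

The error shrinks as $k$ grows and is essentially zero when $k = d$, which matches what we see visually in the face reconstructions.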

<br><br>

### Image lookup application

- In social media and security applications a common task is to determine whether a new test face belongs to a face already in the database or not. 
- A reasonable way to solve this problem is by using 1-nearest neighbour. 
- Usually there are only a few examples of each person in the image databases and usual classification models where we consider each person as a separate class may not work very well.
- Let's try $k$-NN with $k = 1$. 

Since we are working with a supervised learning application, let's split the data into train and validation splits. 

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(
    X_people, y_people, stratify=y_people, random_state=0
)
print(X_train.shape)

### $1$-NN with raw image representation

- Let's try 1-NN on the dataset. 

In [None]:
from sklearn.neighbors import KNeighborsClassifier

pipe = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=1),
)
pipe.fit(X_train, y_train)

In [None]:
print("Train set accuracy of 1-nn: {:.2f}".format(pipe.score(X_train, y_train)))
print("Valid set accuracy of 1-nn: {:.2f}".format(pipe.score(X_valid, y_valid)))

- Considering that there are 62 balanced classes, this score is not too bad: a baseline that always predicts the same class would get only about $1/62 \approx 1.6\%$ accuracy.  
- Currently, we are measuring similarity between images by comparing pixel values at the same locations.
- This is not a great way to calculate similarity between images: shifting a face by a few pixels changes the distance a lot even though the image looks the same to us. 
- PCA would extract meaningful parts and could probably help here. 
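
To make the baseline concrete, here is a minimal sketch with `DummyClassifier`. It does not use the faces data; it only assumes the same label layout (62 classes with 20 examples each), and the `*_dummy` names are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Assumed label layout: 62 classes with 20 examples each
y_dummy = np.repeat(np.arange(62), 20)
X_dummy = np.zeros((len(y_dummy), 1))  # DummyClassifier ignores the features

X_tr, X_va, y_tr, y_va = train_test_split(
    X_dummy, y_dummy, stratify=y_dummy, random_state=0
)

# Always predict the most frequent training class
dummy = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_acc = dummy.score(X_va, y_va)
print(f"Baseline validation accuracy: {baseline_acc:.3f}")
```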

### $1$-NN with PCA representation  

In [None]:
n_components = 90
pipe_pca = make_pipeline(
    StandardScaler(),
    PCA(n_components=n_components, whiten=True, random_state=123),
    KNeighborsClassifier(n_neighbors=1),
)

pipe_pca.fit(X_train, y_train)
print(
    "Variance Explained by %d principal components: %0.4f percent"
    % (n_components, sum(pipe_pca.named_steps["pca"].explained_variance_ratio_) * 100)
)

### Train/validation results with PCA representation 

In [None]:
print(
    "Train set accuracy of PCA + 1-nn: {:.2f}".format(pipe_pca.score(X_train, y_train))
)
print(
    "Valid set accuracy of PCA + 1-nn: {:.2f}".format(pipe_pca.score(X_valid, y_valid))
)

- Seems like we are overfitting in both cases but our validation results improved when we added PCA.  

<br><br>

### PCA for visualization

- One of the most common applications of PCA is visualizing high dimensional data. 
- Suppose we want to visualize 20-dimensional [countries of the world data](https://www.kaggle.com/fernandol/countries-of-the-world). 
- The dataset has country names linked to population, area size, GDP, literacy percentage, birthrate, mortality, net migration etc.

In [None]:
df = pd.read_csv("data/countries_of_the_world.csv")
df.head()

In [None]:
df.info()

In [None]:
X = df.drop(columns=["Country", "Region"])

Let's replace commas with periods in columns with type `object`. 

In [None]:
def convert_values(value):
    value = str(value)
    value = value.replace(",", ".")
    return float(value)


for col in X.columns:
    if X[col].dtype == object:
        X[col] = X[col].apply(convert_values)

In [None]:
X.head()


- We have missing values.
- The features are on different scales. 
- Let's create a pipeline with `SimpleImputer` and `StandardScaler`. 

In [None]:
n_components = 2
pipe = make_pipeline(SimpleImputer(), StandardScaler(), PCA(n_components=n_components))
pipe.fit(X)
X_pca = pipe.transform(X)

In [None]:
print(
    "Variance Explained by the first %d principal components: %0.3f percent"
    % (n_components, sum(pipe.named_steps["pca"].explained_variance_ratio_) * 100)
)

- The explained variance by the first two PCA components is $43.58\%$.  
- Good to know! 

For each example, let's get other information from the original data. 

In [None]:
pca_df = pd.DataFrame(X_pca, columns=["pc1", "pc2"], index=X.index)
pca_df["Country"] = df["Country"]
pca_df["Population"] = X["Population"]
pca_df["GDP"] = X["GDP ($ per capita)"]
pca_df["Crops"] = X["Crops (%)"]
pca_df["Infant mortality"] = X["Infant mortality (per 1000 births)"]
pca_df["Birthrate"] = X["Birthrate"]
pca_df["Literacy"] = X["Literacy (%)"]
pca_df["Net migration"] = X["Net migration"]
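# Marker size in the plot cannot be NaN, so fill missing GDP values with the mean GDP.
# Other columns keep their missing values instead of being filled with a GDP figure.
pca_df["GDP"] = pca_df["GDP"].fillna(pca_df["GDP"].mean())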
pca_df.head()

In [None]:
import plotly.express as px

fig = px.scatter(
    pca_df,
    x="pc1",
    y="pc2",
    color="Country",
    size="GDP",
    hover_data=[
        "Population",
        "Infant mortality",
        "Literacy",
        "Birthrate",
        "Net migration",
    ],
)
fig.show()

### Interpreting `pc1` and `pc2`
- Can we interpret the two dimensions?
    - Recall that each row in $W$ matrix is a principal component. 
    - Each principal component has a coefficient associated with each feature in our original dataset. 
    - We can interpret the components by looking at the features whose coefficients are relatively large (in magnitude) for each component. 

In [None]:
fig.show()

In [None]:
component_labels = ["PC " + str(i + 1) for i in range(n_components)]
W = pipe.named_steps["pca"].components_
plot_pca_w_vectors(W, component_labels, X.columns)
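
If the plotting helper is not available, the same information can be read from a table of coefficients. The sketch below is an illustration under assumptions: it uses the iris dataset as a stand-in for the countries data, and the names `pipe_iris`, `W_iris`, and `loadings` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris = load_iris(as_frame=True).data  # stand-in dataset with 4 features

pipe_iris = make_pipeline(StandardScaler(), PCA(n_components=2))
pipe_iris.fit(X_iris)

# Rows are principal components, columns are the original features
W_iris = pipe_iris.named_steps["pca"].components_
loadings = pd.DataFrame(W_iris, index=["PC 1", "PC 2"], columns=X_iris.columns)

# For each component, list the features with the largest coefficients in magnitude
for pc in loadings.index:
    top = loadings.loc[pc].abs().sort_values(ascending=False).head(3)
    print(pc, "->", list(top.index))
```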

###  Dimensionality reduction to reduce overfitting in supervised setting

- You will often see dimensionality reduction used as a preprocessing step in a supervised learning setup. 
- More features mean a higher risk of overfitting. 
- If we reduce the number of dimensions, it may reduce overfitting and computational cost. 

### Dimensionality reduction for anomaly detection

- A common application of dimensionality reduction is anomaly (outlier) detection. For example:
    - Detecting fraudulent transactions.  
    - Detecting irregular activity in video frames.  
- It's hard to find good anomaly detection datasets. A popular one is [The KDD Cup ‘99 dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_kddcup99.html).     
![](img/pca_anomaly_detection.png)    
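
The lecture does not spell out a specific method here. One common approach, assumed for this sketch, is to fit PCA on mostly normal data and flag points with a large reconstruction error, since such points do not lie close to the learned subspace. The synthetic data, the 99% quantile threshold, and the helper name `reconstruction_error` are all illustrative choices.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Synthetic "normal" data lying close to a 2-D plane inside a 10-D space
basis = rng.normal(size=(2, 10))
X_normal = rng.normal(size=(500, 2)) @ basis + 0.05 * rng.normal(size=(500, 10))

# Synthetic anomalies that do not follow the planar structure
X_anomalies = rng.normal(scale=3.0, size=(5, 10))

pca_ad = PCA(n_components=2).fit(X_normal)


def reconstruction_error(pca, X):
    """Squared distance between each row of X and its PCA reconstruction."""
    X_hat = pca.inverse_transform(pca.transform(X))
    return np.sum((X - X_hat) ** 2, axis=1)


# Illustrative cut-off: the 99th percentile of errors on the normal data
threshold = np.quantile(reconstruction_error(pca_ad, X_normal), 0.99)
flags = reconstruction_error(pca_ad, X_anomalies) > threshold
print(flags)
```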

<br><br>

## Final comments, summary, and reflection 

### Take-home message

- **Dimensionality reduction** is the task of mapping a high-dimensional dataset to a lower-dimensional one **while retaining the most "important" characteristics of the data.** 
- PCA is one of the most widely used linear dimensionality reduction techniques. 
- Given data matrix $X_{n \times d}$ and number of components $k \leq d$, PCA outputs transformed data $Z_{n \times k}$ and weight matrix $W_{k \times d}$.
- When going from a higher-dimensional space to a lower-dimensional space, PCA tries to preserve the structure of the points by keeping the directions of largest variance, so that we do not lose the important properties of the data. 
- Because PCA is a projection, points that are nearby in high dimensions are still nearby in low dimensions. 

- PCA reduces the dimensionality by learning a $k$-dimensional subspace of the original $d$-dimensional space.
- To represent $k$-dimensional subspace we need $k$ basis vectors. 
    - Each basis vector or a principal component is $d$ dimensional. 
    - The basis vectors or principal components are the rows of $W$. 
        - So PCA learns $k$ basis vectors which define the transformations.        
    - The representation of each example in the new basis is a row of $Z$; the columns of $Z$ are the $k$ new features.    

- These principal components are basis vectors in the direction of maximum variance.
- The basis vectors are orthogonal to each other. 
- Once we have $W$, we can obtain our transformed data points by projecting the (centred) data onto these components. 
    - $Z_{(n\times k)} = X_{(n\times d)}W^T_{(d\times k)}$
- We can also apply the **inverse transformation** to approximately recover $X$ from $Z$, where each reconstructed example is a weighted sum of the components:
    - $X_{(n\times d)} \approx Z_{(n\times k)}W_{(k\times d)}$ 
    - If $k=d$, then $\approx$ becomes $=$ (i.e., you can precisely recover $X$).    

- In PCA, we minimize the squared reconstruction error, i.e., the sum of squared differences between the elements of $X$ and the elements of $ZW$: $\lVert ZW - X \rVert_F^2$. 
- The goal is to find the two best matrices such that when we multiply them we get a matrix that's closest to the data. 
- A common way to learn PCA is using singular value decomposition (SVD). 
- When we apply SVD to the centred data, $X = U\Sigma V^T$, keeping the top $k$ singular vectors gives the two best matrices $W=V^T$ and $Z=U\Sigma$. 
- Although PCA and linear regression seem very similar cosmetically, they are two different algorithms. 
- We can access the variance explained by each component using `sklearn`'s `explained_variance_` and `explained_variance_ratio_`. 
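
The SVD connection above can be checked numerically. This is a minimal sketch on synthetic data; the sizes and the names `X_svd`, `W_svd`, `Z_svd`, and `pca_ref` are assumptions for illustration. Individual components may differ from `sklearn`'s by a sign flip, so the comparison uses the reconstruction, which is not affected by signs.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
n, d, k = 50, 20, 5  # illustrative sizes
X_svd = rng.normal(size=(n, d))

# PCA via SVD of the centred data: X - mu = U Sigma V^T
mu = X_svd.mean(axis=0)
U, S, Vt = np.linalg.svd(X_svd - mu, full_matrices=False)
W_svd = Vt[:k]  # k x d, the top k rows of V^T
Z_svd = U[:, :k] * S[:k]  # n x k, equal to (X - mu) @ W_svd.T

pca_ref = PCA(n_components=k, svd_solver="full").fit(X_svd)
X_hat_ref = pca_ref.inverse_transform(pca_ref.transform(X_svd))

# Reconstructions agree, and the variances come from the singular values
print(np.allclose(Z_svd @ W_svd + mu, X_hat_ref))
print(np.allclose(S[:k] ** 2 / (n - 1), pca_ref.explained_variance_))
```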

### PCA applications

- PCA is useful in a number of applications. Some examples include
    - Visualization 
    - Feature extraction
    - Anomaly detection

### Reflection (~4 mins)

- PCA is a difficult concept to teach and to learn. 
- Go to this [Google doc](https://docs.google.com/document/d/1RXXXn7WOdU2uxKJv5eDAyZJ9DB1WMzucnBSVX0XpYJE/edit#heading=h.ylf9fhkpc6ik) and answer the following questions. 
    - What is your takeaway from this lesson? 
    - What concept from this lesson are you still struggling with? 

<br><br><br><br>

## Resources

- [Introduction to Machine Learning with Python book Chapter 3](https://learning.oreilly.com/library/view/introduction-to-machine/9781449369880/ch03.html)
- [PCA visualization](https://setosa.io/ev/principal-component-analysis/)
- [StatQuest PCA video](https://www.youtube.com/watch?v=FgakZw6K1QQ&feature=youtu.be)

<br><br><br><br>