# **Principal Component Analysis 1**

Principal Component Analysis (PCA) is an **unsupervised learning technique** - our goal is to identify patterns in data, rather than predict outputs 
* We use *unlabeled* data
* Since we have features and no labels, we want to instead identify patterns within those features

Specifically: Given a high-dimensional dataset (many features), can we find a simpler representation to summarize the main patterns of the data?

### PCA performs **Dimensionality Reduction**
* Transform high dimensional data down to low dimensions 
* Re-express the data in terms of fewer features, without losing (much) info 

There are two equivalent ways of framing PCA: 
1) Finding directions of **maximum variability** in the data 
2) Finding the low dimensional (rank) matrix factorization that **best approximates the data**

Note that simply keeping the individual attributes with the highest variance is *not* what PCA does. That approach limits us to working with attributes individually
* It cannot resolve collinearity, and we cannot combine features as a result 

Instead, PCA constructs **principal components** that capture the most variance in the data by utilizing **linear combinations of features**


#### **Procedure**

To perform PCA on a matrix:
1) **Center** the data matrix by subtracting the mean of each attribute column 
2) Use Singular Value Decomposition (SVD) to find all principal components efficiently!
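
The two steps above can be sketched in NumPy. This is only a minimal illustration: the toy matrix `X_raw` and its mixing weights are made up for the example, and `np.linalg.svd` is one common way to compute the SVD, not necessarily the one used in lecture.

```python
import numpy as np

# Hypothetical toy data: 100 observations, 3 features (not from the notes).
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X_raw = rng.normal(size=(100, 3)) @ mixing

# Step 1: center each attribute column by subtracting its mean.
X = X_raw - X_raw.mean(axis=0)

# Step 2: SVD of the centered matrix.
# NumPy returns the singular values as a 1-D array and V already transposed.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

print(U.shape, s.shape, Vt.shape)  # (100, 3) (3,) (3, 3)
```

Passing `full_matrices=False` asks for the compact form, so `U` has as many columns as there are singular values.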

### **Singular Value Decomposition**

To make a long story short, SVD describes a matrix $X$'s decomposition into three matrices: 

$$X = USV^T$$

Let's break down each of these terms one by one 

#### $U$ 
* $U$ can be thought of as the "rotation" matrix 
* The columns are **orthonormal**

#### $S$ 
* $S$ can be thought of as the scaling operation 
* Contains $r$ **non-zero singular values**, where $r$ is the rank of $X$ 
* Diagonal values (singular values $s_1$, $s_2$, ... $s_r$) are **non-negative** and ordered from largest to smallest: $s_1 \ge s_2 \ge ... \ge s_r > 0$

#### $V^T$
* The rows of $V^T$ (equivalently, the columns of $V$) give the directions of the principal components 
* So, the first $k$ columns of $V$ (different from $V^T$) contain the directions of the first $k$ principal components
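
As a quick sanity check, the properties listed for $U$, $S$, and $V^T$ can be verified numerically. The sketch below assumes a small random centered matrix (hypothetical data) and uses a tolerance of `1e-10` to decide which singular values count as non-zero.

```python
import numpy as np

# Hypothetical centered data: 50 observations, 4 features.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
S = np.diag(s)  # build the diagonal matrix S from the 1-D array of singular values

# Columns of U and of V are orthonormal: U^T U = I and V^T V = I.
u_orthonormal = np.allclose(U.T @ U, np.eye(4))
v_orthonormal = np.allclose(Vt @ Vt.T, np.eye(4))

# Singular values are non-negative and sorted from largest to smallest.
s_nonnegative = bool(np.all(s >= 0))
s_sorted = bool(np.all(s[:-1] >= s[1:]))

# The number of non-zero singular values matches the rank of X.
num_nonzero = int(np.sum(s > 1e-10))
rank = int(np.linalg.matrix_rank(X))

# Multiplying the three pieces back together recovers X.
reconstructs = np.allclose(U @ S @ Vt, X)

print(u_orthonormal, v_orthonormal, s_nonnegative, s_sorted, num_nonzero == rank, reconstructs)
```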

### **Capturing Variance** 

* To preserve as much information as we can about our dataset, we want our new features to capture the variance of the original data

* Say you're only allowed to use **one** linear combination of the features to represent this two-feature data - what line would you draw?


<img src="PCA1.png" alt="Image Alt Text" width="500" height="300">

* You can think of this first line as the **first** Principal Component

Now, where should the second line go?

<img src="PCA1_2.png" alt="Image Alt Text" width="500" height="300">

* You can think of this second line as the **second** Principal Component

Essentially, we've created a **new coordinate system** to describe our data

We most often use $2$ PCs when conducting PCA 
* This is so we can easily visualize this on a scatterplot 
* PC1 goes on the $x$ axis 
* PC2 is on the $y$ axis 




### **Spinning Around**

* The lines we've drawn define a **new coordinate system** to describe our data 
* The direction that contains the most variance in our data is along one axis; the direction that contains the next most variance is along the other axis 
* We've **rotated** the coordinate system we originally used to represent the data 

$$XV = US$$

* $X$ represents our original data 
* $V$ rotates our data into the new coordinate system 
* $US$ gives the principal components 
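
The identity $XV = US$ can be checked directly. This is a rough sketch on made-up two-feature data; the names `pcs_from_rotation` and `pcs_from_svd` are just labels chosen for the example.

```python
import numpy as np

# Hypothetical correlated two-feature data, centered.
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2)) @ np.array([[3.0, 1.0],
                                         [1.0, 1.0]])
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

# Rotate the data into the new coordinate system.
pcs_from_rotation = X @ V

# Scale the columns of U by the singular values.
# Broadcasting U * s multiplies column i of U by s[i], same as U @ np.diag(s).
pcs_from_svd = U * s

print(np.allclose(pcs_from_rotation, pcs_from_svd))  # True
```

Either side gives the data expressed along the PC1 and PC2 axes, so column 0 is what would go on the $x$ axis of a PC scatterplot and column 1 on the $y$ axis.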




### **Dimensionality Reduction**
* Remember that we constructed the first Principal Component to capture the most variance in the data, the second PC to capture the second most, and so on 

* This means that if we were to represent our dataset using only the first few PCs, we'd still be able to capture most of the variance in the data 

$$(US)_{[:,:n]} V^{\text{T}}_{[:n,:]} = \text{Rank n approximation of } X$$

* Taking the first $n$ columns of $US$ is equivalent to taking the first $n$ Principal Components 
* We multiply by the first $n$ rows of our transposed rotation matrix 
* We then get an approximation of the original data by using only $n$ dimensions 

This is what we call **dimensionality reduction** - we can make pretty good approximations of the original data by only using a few features
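
Here is a minimal sketch of the rank $n$ approximation formula in NumPy. The data is hypothetical: five features that are built from two hidden directions plus a little noise, so two PCs should be enough to approximate it well.

```python
import numpy as np

# Hypothetical data: 5 features driven by 2 latent directions plus small noise.
rng = np.random.default_rng(3)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

n = 2  # number of PCs to keep (an assumption for this example)

# First n columns of US times the first n rows of V^T.
X_approx = (U * s)[:, :n] @ Vt[:n, :]

# Relative error of the approximation, measured with the Frobenius norm.
rel_error = np.linalg.norm(X - X_approx) / np.linalg.norm(X)
print(X_approx.shape, rel_error)
```

`X_approx` has the same shape as `X`, but it only has rank $n$: every row is a combination of the first $n$ principal directions.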


<br>

How do we know how many PCs to use? (What value of $n$ should we be using?)
* $S$ is the singular value matrix: it measures how much variance each PC captures / contributes 

<img src="PCA1_3.png" alt="Image Alt Text" width="400" height="250">

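We denote the variance captured by the $i$th PC as: 

$$\text{variance captured by } i\text{th PC} = \frac{(i\text{th singular value})^2}{n}$$

Here $n$ is the number of data points (rows of $X$), not the number of PCs kept in the rank $n$ approximation above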
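
To see where this formula comes from, we can compare it against the variance of each column of $US$ computed directly. The sketch below reads the denominator as the number of rows of $X$ (called `n_rows` here, a name chosen for the example) and uses hypothetical data.

```python
import numpy as np

# Hypothetical centered data: 120 observations, 3 features with different spreads.
rng = np.random.default_rng(4)
X = rng.normal(size=(120, 3)) * np.array([3.0, 1.0, 0.5])
X = X - X.mean(axis=0)
n_rows = X.shape[0]

U, s, Vt = np.linalg.svd(X, full_matrices=False)
pcs = U * s  # principal components, i.e. the columns of US

# Variance from the formula: (ith singular value)^2 / number of rows.
var_from_singular_values = s ** 2 / n_rows

# Variance computed directly from each PC column (NumPy divides by n by default).
var_from_pcs = pcs.var(axis=0)

# The PCs together carry the same total variance as the original features.
total_var_features = X.var(axis=0).sum()

print(np.allclose(var_from_singular_values, var_from_pcs))              # True
print(np.isclose(var_from_singular_values.sum(), total_var_features))   # True
```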

A **scree plot** displays the variance captured by each PC. 
* We use however many PCs it takes to capture the bulk of the variance in the data 
* Describes the **variance ratio** captured by each principal component, with the largest ratio first 
* Helps us determine visually the number of dimensions needed to describe the data reasonably
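
The numbers behind a scree plot can be computed from the singular values alone. This is a sketch on hypothetical data, and the `0.90` cutoff is an arbitrary assumption for illustration; in practice we usually just eyeball where the plot levels off.

```python
import numpy as np

# Hypothetical centered data: 4 features with very different spreads.
rng = np.random.default_rng(5)
X = rng.normal(size=(150, 4)) * np.array([5.0, 2.0, 0.5, 0.1])
X = X - X.mean(axis=0)

# compute_uv=False returns only the singular values.
s = np.linalg.svd(X, compute_uv=False)

# Fraction of the total variance captured by each PC (largest first).
variance_ratio = s ** 2 / np.sum(s ** 2)

# Running total: variance captured by the first 1, 2, 3, ... PCs.
cumulative = np.cumsum(variance_ratio)

# Smallest number of PCs whose cumulative ratio reaches the (assumed) threshold.
threshold = 0.90
num_pcs = int(np.searchsorted(cumulative, threshold) + 1)

print(variance_ratio)
print(cumulative)
print(num_pcs)
```

Plotting `variance_ratio` against the PC number (1, 2, 3, ...) gives the scree plot itself.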

<img src="https://ds100.org/course-notes/pca_2/images/scree_plot.png" alt="Image Alt Text" width="400" height="300">


### **Biplots**

* Biplots superimpose the directions onto the plot of PC1 vs. PC2
* The vector $j$ corresponds to the direction for feature $j$ (e.g. $v_{1j}$, $v_{2j}$)
* In DATA 100, we plot the direction itself 

In the below diagram, we are able to interpret how features correlate with the principal components shown: positively, negatively, or not much 

<img src="https://ds100.org/course-notes/pca_2/images/slide17_2.png" alt="Image Alt Text" width="600" height="450">

* The direction of each arrow shows how that feature contributes to PC1 and PC2
* Let's consider feature 3, with the purple arrow labeled $520$

* We want to break up its components into $v_1$ and $v_2$, assuming $v_1$ corresponds to the PC $1$ axis and $v_2$ to the PC $2$ axis

* Since $v_1$ is positive, a linear increase in this feature would correspond to a linear increase in PC $1$ (this feature and PC $1$ are positively correlated)

* Since $v_2$ is negative, a linear increase in this feature would correspond to a linear decrease in PC $2$ (this feature and PC $2$ are negatively correlated) 
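
The arrow coordinates in a biplot can be read straight off $V^T$. The sketch below uses three made-up features (two that move together and one that moves the opposite way) and checks that the sign of each arrow's PC1 coordinate matches the sign of that feature's correlation with PC1. Note that the overall sign of each PC returned by the SVD is arbitrary, so only the relative signs between features are meaningful.

```python
import numpy as np

# Hypothetical data: features 0 and 1 move together, feature 2 moves opposite.
rng = np.random.default_rng(6)
base = rng.normal(size=200)
X = np.column_stack([
    base + 0.1 * rng.normal(size=200),
    2 * base + 0.1 * rng.normal(size=200),
    -base + 0.1 * rng.normal(size=200),
])
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
pcs = U * s

# Arrow for feature j: (entry j of the first row of V^T, entry j of the second row).
arrows = Vt[:2, :].T  # shape (3, 2): one (PC1, PC2) pair per feature

# Correlation of each original feature with PC1.
corr_with_pc1 = np.array([np.corrcoef(X[:, j], pcs[:, 0])[0, 1] for j in range(3)])

for j in range(3):
    print(f"feature {j}: arrow = {arrows[j]}, corr with PC1 = {corr_with_pc1[j]:.3f}")
```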




