# Introduction

**Feature Selection vs. Feature Extraction**

- Feature **Selection:**
    - Select a **subset** of the original feature set
    - Filter methods, wrapper methods, embedded methods

- Feature **Extraction:**
    - Applies a **transformation** to project features into a lower-dimensional space
    - PCA (Principal Component Analysis), LDA (Linear Discriminant Analysis), etc.
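
The difference between the two approaches can be illustrated with a minimal NumPy sketch. The data, the chosen column indices, and the matrix `W` are arbitrary assumptions for illustration, not the output of any particular selection or extraction method.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))        # 6 samples, 4 original features

# Feature selection: keep a subset of the original columns unchanged.
selected = X[:, [0, 2]]

# Feature extraction: build new features as combinations of all columns.
# W is a hypothetical 4 x 2 transformation (PCA or LDA would learn it).
W = rng.normal(size=(4, 2))
extracted = X @ W
```

Both results have two columns, but only `selected` contains values that appear verbatim in `X`.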

**Feature Extraction**

**Definition:**  
    Transforms data from a high-dimensional space $R^d$ to a lower-dimensional space ($R^{d'}$, where $d' \lt d$) while preserving maximal information.

**Feature Extraction algorithms:**
- **Unsupervised:**
    - **Goal:**
        - Minimize the information loss (reconstruction error)
    - **Methods:**
        - **PCA** (Principal Component Analysis)
        - **ICA** (Independent Component Analysis)
        - **SVD** (Singular Value Decomposition)
        - MDS (Multidimensional Scaling)
        - CCA (Canonical Correlation Analysis)

- **Supervised:**
    - **Goal:**
        - Maximize class separability in the projected space.
    - **Methods:**
        - **LDA** (Linear Discriminant Analysis), *also known as Fisher’s Discriminant Analysis (FDA)*

**Key Benefits:**
- **Visualization:**  
    Projection of high-dimensional data onto lower-dimensional space

- **Data Compression:**  
    Efficient storage, communication, and retrieval.

- **Helps Avoid Overfitting:**
    - Eliminates redundant/noisy features
    - Improves model generalization by reducing features

**Linear Transformation:**

Projects original data $x \in R^d$ to $x' \in R^{d'}$ via:
$$x' = A^Tx$$

```math
\begin{aligned}
\begin{bmatrix} 
x_1' \\ \vdots \\ x_{d'}' 
\end{bmatrix} 
&= 
\begin{bmatrix} 
a_{11} & \cdots & a_{d1} \\ 
\vdots & \ddots & \vdots \\ 
a_{1d'} & \cdots & a_{dd'} 
\end{bmatrix}
\begin{bmatrix} 
x_1 \\ \vdots \\ x_d 
\end{bmatrix} \\
\end{aligned}
```

Where:
- $A^T \in R^{d' \times d}$: Projection matrix

- $a_j = \begin{bmatrix} a_{1j} \\ \vdots \\ a_{dj} \end{bmatrix}$: The $j$-th column of $A$ (the $j$-th row of $A^T$)

- $x \in R^d$: Original feature vector

- $x' \in R^{d'}$: Reduced feature vector ($d' \ll d$)

Each new feature is a linear combination of the original features:
$$x_j' = a_j^T x, \quad \forall j = 1, \ldots, d'$$
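
A minimal NumPy sketch of this projection follows. The dimensions and the random matrix `A` are assumptions chosen only for illustration; it checks that $x' = A^T x$ agrees with computing each coordinate as $a_j^T x$.

```python
import numpy as np

d, d_prime = 4, 2
rng = np.random.default_rng(0)

# Hypothetical projection matrix A (d x d'); its columns a_j define the new axes.
A = rng.normal(size=(d, d_prime))
x = rng.normal(size=d)            # one original feature vector in R^d

x_new = A.T @ x                   # x' = A^T x, a vector in R^{d'}

# Each new coordinate is the inner product of one column of A with x.
x_new_by_column = np.array([A[:, j] @ x for j in range(d_prime)])
```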

# PCA

## Introduction

**Principal Component Analysis (PCA)**  
*(Also known as the Karhunen-Loève (KL) transform)*

**Goal:**  
    Reduce the dimensionality of the data while preserving as much of its variation as possible.

**Equivalently:**  
    Find the orthogonal projection that minimizes the squared reconstruction error of the original data.

**Assumption:**  
- The data is **mean-centered**:
    $$\mu_x = \frac{1}{N} \sum_{i=1}^N x^{(i)} = 0_{d \times 1}$$

**Core Idea:**  
PCA projects the data onto a **lower-dimensional linear subspace** such that:
- **Interpretation 1:**
    - The variance of the projected data is maximized.
- **Interpretation 2:**
    - The sum of squared distances from the data points to the subspace is minimized.

These two interpretations are **equivalent** because:
- Maximizing the variance of projections (red) $\iff$ Minimizing reconstruction error (blue).

    <div style="text-align:center">
    <img src="../assets/pythagoras.png" alt="Pythagoras Example">
    </div>
    
    By the **Pythagoras theorem**:
    $$||\text{red}||^2 + ||\text{blue}||^2 = ||\text{green}||^2$$

    The distance from each mean-centered data point to the origin (green) does not depend on the chosen direction, so it is fixed. Maximizing the projection variance (red) therefore necessarily minimizes the reconstruction error (blue).
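
The identity can be checked numerically. The sketch below uses synthetic data and an arbitrary unit direction `a` (both assumptions for illustration) and sums the three squared lengths over all points.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.3])
X = X - X.mean(axis=0)                 # mean-center the data

a = rng.normal(size=3)
a /= np.linalg.norm(a)                 # arbitrary unit direction

proj = X @ a                           # signed projection lengths (red)
residual = X - np.outer(proj, a)       # component orthogonal to a (blue)

red = np.sum(proj ** 2)
blue = np.sum(residual ** 2)
green = np.sum(X ** 2)                 # squared distances to the origin
```

Because `green` does not depend on `a`, any direction that increases `red` must decrease `blue` by the same amount.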

**Principal Components (PCs):**

A set of **orthonormal** vectors ordered by the fraction of total information (variance) they capture:
- The first PC maximizes the variance of the projected data.
- Subsequent PCs (orthogonal to prior PCs) capture the next highest variance.

**Mathematically:**  
PCA performs an orthogonal projection that **maximizes the variance** of the projected data.  
The **PCs** are the **eigenvectors** of the data’s covariance matrix, **ranked by their corresponding eigenvalues**  
*(which indicate explained variance).*

## Algorithm

**PCA: Steps**

**Input:**
- $X \in R^{N \times d}$ (data matrix with N data points and d dimensions)

**Output:**
- $X' \in R^{N \times k}$ (transformed data with reduced dimensions)

> Compute the mean of each feature: $\mu = \frac{1}{N} \sum_{i=1}^N x^{(i)}$  
> Subtract the mean from each data point (center the data): $\tilde{X} \leftarrow X - \mu$  
> Compute the covariance matrix: $C = \frac{1}{N} \tilde{X}^{T} \tilde{X}$  
> Compute the eigenvalues $[\lambda_1, \lambda_2, \cdots,\lambda_d]$ and eigenvectors $[v_1, v_2, \cdots, v_d]$ of $C$  
> Select the top $k$ eigenvectors corresponding to the largest eigenvalues: $A \leftarrow [v_1, v_2, \cdots, v_k]$  
> Transform the data into the new subspace: $X' \leftarrow \tilde{X}A$  

Where:
- $v_i$ is the $i$-th PC
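
The steps above can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation; the function name `pca` and the random test data are assumptions. It uses `np.linalg.eigh`, which is intended for symmetric matrices and returns eigenvalues in ascending order, so the results are re-sorted in descending order.

```python
import numpy as np

def pca(X, k):
    """Project the rows of X (N x d) onto the top-k principal components."""
    mu = X.mean(axis=0)                          # mean of each feature
    X_centered = X - mu                          # center the data
    C = X_centered.T @ X_centered / X.shape[0]   # covariance matrix (d x d)
    eigvals, eigvecs = np.linalg.eigh(C)         # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]            # re-sort in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    A = eigvecs[:, :k]                           # top-k eigenvectors as columns
    return X_centered @ A, A, eigvals

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X_reduced, A, eigvals = pca(X, k=2)
```

For large $d$, computing the full eigendecomposition can be expensive; that trade-off is outside the scope of this sketch.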

**Two Key Interpretations of PCA**

- **Maximum Variance Subspace:**
    - PCA identifies orthogonal vectors $a$ that maximize the variance of projected data:
    $$\max_{a} \frac{1}{N} \sum_{n=1}^N (a^T x^{(n)})^2$$

- **Minimum Reconstruction Error:**
    - PCA finds vectors $a$ that minimize the mean squared error (MSE) when reconstructing the original data from its projections:
    $$\min_{a} \frac{1}{N} \sum_{n=1}^N ||x^{(n)} - (a^Tx^{(n)})a||^2$$

## Maximum Variance Subspace

### **The Principal Component as Eigenvector**

**Goal:**  
We demonstrate that the first principal component (PC) corresponds to the eigenvector of the covariance matrix associated with its largest eigenvalue.

**Key Definitions**

**Mean vector:**
$$
\mu_x = 
\begin{bmatrix}
\mu_1 \\
\vdots \\
\mu_d
\end{bmatrix} =
\begin{bmatrix}
E(x_1) \\
\vdots \\
E(x_d)
\end{bmatrix}
$$

**Covariance matrix:**
$$
\Sigma = E[(x - \mu_x)(x - \mu_x)^T]
$$

**Sample Estimates:**

Given data points $\{x^{(i)}\}_{i=1}^N$:

1. Sample mean:
$$
\hat{\mu} = \frac{1}{N} \sum_{i=1}^N x^{(i)}
$$

2. Mean-centered data matrix:
$$
\tilde{X} = 
\begin{bmatrix}
\tilde{x}^{(1)} \\
\vdots \\
\tilde{x}^{(N)}
\end{bmatrix} =
\begin{bmatrix}
x^{(1)} - \hat{\mu} \\
\vdots \\
x^{(N)} - \hat{\mu}
\end{bmatrix}
$$

3. Sample covariance matrix:
$$
\hat{\Sigma} = \frac{1}{N} \sum_{i=1}^N (x^{(i)} - \hat{\mu})(x^{(i)} - \hat{\mu})^T = \frac{1}{N} \tilde{X}^T \tilde{X}
$$

*Note:* All subsequent analysis assumes mean-centered data ($x \equiv \tilde{x}$).
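
As a quick check of the sample estimates, the sketch below computes the covariance in both forms given above and compares them with `np.cov`. The synthetic data is an assumption for illustration; `bias=True` is needed so that NumPy normalizes by $N$ rather than $N - 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) + np.array([1.0, -2.0, 0.5])

mu_hat = X.mean(axis=0)                 # sample mean
X_tilde = X - mu_hat                    # mean-centered data matrix

# Sum-of-outer-products form and matrix form of the sample covariance.
Sigma_sum = sum(np.outer(x, x) for x in X_tilde) / len(X)
Sigma_mat = X_tilde.T @ X_tilde / len(X)

# NumPy's estimate with 1/N normalization; columns are the variables.
Sigma_np = np.cov(X, rowvar=False, bias=True)
```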

**Projection Geometry**

For a unit vector $a$ ($\|a\| = 1$), the projection of $x$ onto $a$ is:
$$
\|x\| \cos \theta = \|x\| \frac{a^T x}{\|x\| \|a\|} = a^T x
$$

<div style="text-align:center">
  <img src="../assets/pythagoras_max_variance.png" alt="Visual example">
</div>
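
A small numeric instance of the projection formula, with a vector and unit direction chosen arbitrarily for illustration:

```python
import numpy as np

x = np.array([3.0, 4.0])
a = np.array([1.0, 0.0])          # unit vector, ||a|| = 1

cos_theta = (a @ x) / (np.linalg.norm(x) * np.linalg.norm(a))
scalar_projection = np.linalg.norm(x) * cos_theta   # ||x|| cos(theta)
```

Here $\|x\| = 5$ and $\cos \theta = 0.6$, so the projection length is $3$, which is exactly $a^T x$.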

**Variance Maximization Derivation**

The variance of the projected data onto the direction $a$ is:
$$\text{var}(x') = \text{var}(a^Tx) = \frac{1}{N} \sum_{n=1}^N (a^Tx^{(n)})^2$$
$$= \frac{1}{N} \|Xa\|^2 = \frac{1}{N} a^T X^T X a = a^T \left(\frac{1}{N} X^T X\right) a$$

Let $R_x = \frac{1}{N}X^TX$ (sample covariance matrix) and $\|a\| = 1$. We solve:

$$\operatorname*{arg\,max}_{a} \; a^T R_x a$$
$$\text{s.t.} \quad \|a\|^2 = a^T a = 1 \implies 1 - a^T a = 0$$

Lagrangian Formulation:
$$L(a, \lambda) = a^T R_x a + \lambda (1- a^Ta)$$

Taking derivatives:
$$
\frac{\partial}{\partial a} \left( a^T R_x a + \lambda (1 - a^Ta) \right) = 2 R_x a - 2 \lambda a = 0
$$
This simplifies to:
$$R_x a = \lambda a$$

**Key Result:**
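- $a$ is an eigenvector of the sample covariance matrix $R_x = \frac{1}{N} X^T X$ with eigenvalue $\lambda$.
- At such a stationary point the objective equals $a^T R_x a = \lambda a^T a = \lambda$, so the variance is maximized by the eigenvector with the largest eigenvalue: this is the first PC.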
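
This result can be sanity-checked numerically. The sketch below (synthetic data and 1000 random unit directions, both assumptions for illustration) compares the projected variance along the top eigenvector with the variance along random directions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))
X = X - X.mean(axis=0)
R = X.T @ X / len(X)                       # sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(R)       # ascending eigenvalues
a_best = eigvecs[:, -1]                    # eigenvector of the largest eigenvalue
var_best = a_best @ R @ a_best             # a^T R a for the first PC

# Projected variance a^T R a for many random unit vectors.
candidates = rng.normal(size=(1000, 4))
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)
var_random = np.einsum("ij,jk,ik->i", candidates, R, candidates)
```

No random direction should exceed `var_best`, which itself equals the largest eigenvalue.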

### **Optimal Principal Components: Eigenvalue Perspective**

For transformed data $x' = A^Tx$, the covariance matrix becomes:
$$R_{x'} = E[x' {x'}^T] = E[A^T x x^T A] = A^T E[x x^T] A = A^T R_x A$$

When $A = [a_1, \ldots, a_d]$, where $a_1, \ldots, a_d$ are orthonormal eigenvectors of $R_x$ (so that $A^T A = A A^T = I$):
$$R_x A = A \Lambda \implies R_x = A \Lambda A^T$$
*where $\Lambda$ is the diagonal eigenvalue matrix.*

Substitute into $R_{x'}$:
$$R_{x'} = A^T R_x A = A^T (A \Lambda A^T) A = \Lambda$$
This yields the critical property:
$$E[{x'}_i{x'}_j] = 0, \quad \forall i \neq j \quad (i, j = 1, \ldots, d)$$

**Key Insights:**
- The principal component transformation decorrelates the features: the covariance matrix of the transformed data is diagonal, so all linear (second-order) redundancy between features is removed.
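
The decorrelation property can be verified numerically. In this sketch (synthetic correlated data, assumed for illustration), the covariance of the transformed data should be diagonal, with the eigenvalues of $R_x$ on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 3))   # correlated features
X = X - X.mean(axis=0)
R_x = X.T @ X / len(X)

eigvals, A = np.linalg.eigh(R_x)       # columns of A: orthonormal eigenvectors
X_new = X @ A                          # rows are x'^T = (A^T x)^T
R_new = X_new.T @ X_new / len(X)       # covariance of the transformed data

off_diagonal = R_new - np.diag(np.diag(R_new))
```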

## Minimum Reconstruction Error

**Mean Squared Error Approximation**

Let $A = [a_1, \ldots, a_{d'}]$ contain the top $d'$ eigenvectors of $R_x$ (those corresponding to the largest eigenvalues), with $d' < d$.

**Key Property:**
- Minimizes the Mean Squared Error (MSE) between original data $x$ and its reconstruction $\hat{x} = Ax'$

**Eigenvalues as Variance Explanations**

The $j$-th largest eigenvalue of $R_x$ is the variance on the $j$-th PC:

$$
\text{var}({x'}_j) = E[{x'}_j {x'}_j]  = E[x_j'^2] = E[{a_j}^T x x^T a_j] = {a_j}^T E[x x^T] a_j
$$

Since $R_x a_j = \lambda_j a_j$:
$$
\text{var}({x'}_j) = {a_j}^T R_x a_j = {a_j}^T \lambda_j a_j = \lambda_j
$$

*Note:* Eigenvalues are ordered $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$

**Reconstruction Error Derivation**

The expected reconstruction error for a $d'$-dimensional projection:
$$
J(A) = E[\|x - \hat{x}\|^2] = E[\|x - Ax'\|^2]
$$

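Since the eigenvectors $a_1, \ldots, a_d$ form an orthonormal basis, $x = \sum_{j=1}^{d} x_j' a_j$ and $\hat{x} = Ax' = \sum_{j=1}^{d'} x_j' a_j$, so only the discarded components contribute to the error: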

$$
= E \left[ \left\| \sum_{j=d'+1}^{d} x_j'a_j \right\|^2 \right]
$$

By orthonormality ($a_j^T a_k = 0$ for $j \neq k$ and $a_j^T a_j = 1$), this simplifies to:

$$
= E \left[ \sum_{j=d'+1}^{d} \sum_{k=d'+1}^{d} x_j'a_j^T a_k x_k' \right] = E \left[ \sum_{j=d'+1}^{d} x_j'^2 \right]
$$

Finally, this equals the sum of the remaining eigenvalues:

$$
= \sum_{j=d'+1}^{d} E \left[ x_j'^2 \right] = \sum_{j=d'+1}^{d} \lambda_j
$$

**Key Result:**  
- To minimize reconstruction error $J(A)$:
    - Retain PCs with largest eigenvalues (maximum variance)
    - Discard PCs with smallest eigenvalues
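
The derivation can be confirmed on synthetic data. The sketch below (random data and $d' = 2$ are assumptions for illustration) reconstructs the data from the top $d'$ PCs and compares the mean squared reconstruction error with the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)
R_x = X.T @ X / len(X)

eigvals, eigvecs = np.linalg.eigh(R_x)
order = np.argsort(eigvals)[::-1]              # sort in descending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

d_prime = 2
A = eigvecs[:, :d_prime]                       # top d' eigenvectors
X_hat = (X @ A) @ A.T                          # reconstruction x_hat = A x'

mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))   # empirical J(A)
discarded = eigvals[d_prime:].sum()               # sum of remaining eigenvalues
```

The two quantities agree up to floating-point error because the covariance uses the same $\frac{1}{N}$ normalization as the empirical mean.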

# PCA on Faces

# SVD

# ICA