# AIO Q4 — Principal Component Analysis (PCA)

## What is PCA?

**Principal Component Analysis (PCA)** is a dimensionality reduction technique that finds the directions of maximum variance in high-dimensional data. Think of it as finding the "most important" axes to represent your data with fewer dimensions while losing as little information as possible.

**Why does this matter?**
- **Visualization:** Reduce 100-dimensional data to 2D for plotting
- **Noise reduction:** Remove low-variance dimensions that are mostly noise
- **Compression:** Store data more efficiently
- **Feature extraction:** Find meaningful combinations of original features

---

## Notation

Throughout this problem, we use the following notation:

| Symbol | Meaning |
|--------|---------|
| $n$ | number of data points |
| $d$ | dimension of each data point |
| $k$ | number of principal components to retain |
| $X \in \mathbb{R}^{n \times d}$ | data matrix (each row is a data point) |
| $\Sigma \in \mathbb{R}^{d \times d}$ | covariance matrix |
| $\lambda_1 \ge \lambda_2 \ge ... \ge \lambda_d$ | eigenvalues in descending order |
| $v_1, v_2, ..., v_d$ | corresponding eigenvectors |
| $I_d$ | $d \times d$ identity matrix |

---

## Key Definitions

**Covariance Matrix:** A matrix $\Sigma$ where entry $\Sigma_{ij}$ measures how much features $i$ and $j$ vary together. For centered data $\tilde{X}$: $\Sigma = \frac{1}{n}\tilde{X}^T\tilde{X}$
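As a quick numerical illustration of this definition, the sketch below builds $\Sigma$ from a small random data matrix and compares it with NumPy's `np.cov`. The toy data and variable names are assumptions made for the example, not part of the problem set.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))                      # toy data: n = 6 points, d = 3 features

mu = X.mean(axis=0)                              # mean vector, shape (3,)
X_centered = X - mu                              # broadcasting subtracts mu from every row, shape (6, 3)
Sigma = X_centered.T @ X_centered / X.shape[0]   # covariance with 1/n normalisation, shape (3, 3)

# np.cov treats rows as variables by default, hence rowvar=False;
# bias=True selects the 1/n normalisation used in this document.
Sigma_np = np.cov(X, rowvar=False, bias=True)    # shape (3, 3)
print(np.allclose(Sigma, Sigma_np))
```

Note that `np.cov` divides by $n-1$ unless `bias=True` is passed, so the two results would differ by a factor of $n/(n-1)$ without that flag.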

**Frobenius Norm:** For a matrix $A$, the Frobenius norm is: $\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2} = \sqrt{\text{tr}(A^T A)}$
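The two expressions for the Frobenius norm can be checked on a small matrix. The matrix below is an arbitrary example chosen for illustration.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

fro_entries = np.sqrt(np.sum(A ** 2))        # square root of the sum of squared entries
fro_trace = np.sqrt(np.trace(A.T @ A))       # sqrt(tr(A^T A))
fro_numpy = np.linalg.norm(A, ord="fro")     # NumPy's built-in Frobenius norm

print(fro_entries, fro_trace, fro_numpy)     # all three should agree
```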

**Trace:** The trace of a square matrix is the sum of its diagonal entries: $\text{tr}(A) = \sum_{i=1}^{d} A_{ii}$

**Orthonormal Vectors:** A set of vectors $\{v_1, ..., v_k\}$ is orthonormal if each vector has unit length ($\|v_i\|_2 = 1$) and all pairs are perpendicular ($v_i^T v_j = 0$ for $i \neq j$). Equivalently, $V^T V = I$ where $V$ has these vectors as columns.
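One way to see the condition $V^T V = I$ in practice is to generate orthonormal columns with a reduced QR factorisation and inspect the Gram matrix. This is only a sketch; the sizes and the use of QR are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(5, 3))
V, _ = np.linalg.qr(M)          # reduced QR: V has shape (5, 3) with orthonormal columns

gram = V.T @ V                  # shape (3, 3), expected to equal I_3
outer = V @ V.T                 # shape (5, 5), a projection matrix of rank 3

print(np.allclose(gram, np.eye(3)))
print(np.allclose(outer, np.eye(5)))
```

When there are fewer vectors than dimensions, $V^T V = I$ holds but $V V^T$ is not the identity, which is why projecting onto $k < d$ components can lose information.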

**Principal Components:** The eigenvectors of the covariance matrix $\Sigma$, ordered by their eigenvalues. The first principal component $v_1$ points in the direction of maximum variance in the data.
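The claim that $v_1$ points in the direction of maximum variance can be checked numerically: the variance of the data projected onto $v_1$ should equal $\lambda_1$, and no other unit direction should exceed it. The correlated toy data below is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
t = rng.normal(size=200)
X = np.column_stack([t, 0.5 * t + 0.1 * rng.normal(size=200)])   # correlated 2D data, shape (200, 2)

Xc = X - X.mean(axis=0)                    # centred data, shape (200, 2)
Sigma = Xc.T @ Xc / Xc.shape[0]            # covariance matrix, shape (2, 2)

eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigh returns eigenvalues in ascending order
v1 = eigvecs[:, -1]                        # eigenvector of the largest eigenvalue, shape (2,)
lambda1 = eigvals[-1]

var_along_v1 = np.var(Xc @ v1)             # variance of the 1D projection onto v1

u = rng.normal(size=2)
u /= np.linalg.norm(u)                     # a random unit direction for comparison
var_along_u = np.var(Xc @ u)

print(lambda1, var_along_v1, var_along_u)
```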



## Setup Code

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Random seed for reproducibility
np.random.seed(42)

# Load Iris dataset
iris = load_iris()
iris_data = iris.data  # Shape: (150, 4)
iris_labels = iris.target  # Shape: (150,)
iris_feature_names = iris.feature_names  # ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
iris_target_names = iris.target_names  # ['setosa', 'versicolor', 'virginica']
```

## Q1 — Covariance Matrix Fundamentals (6 points)

Consider a dataset $X \in \mathbb{R}^{n \times d}$ where each row $x_i^T$ represents a data point.

**Part (a):** Write the expression for the mean vector $\hat{\mu}$ (the average of all rows of $X$) and the centered data matrix $\tilde{X} \in \mathbb{R}^{n \times d}$ (data with mean subtracted). (2 points)

```
Answer (a):


```

**Part (b):** Prove that the sample covariance matrix can be written as: (2 points)

$$\Sigma = \frac{1}{n}\tilde{X}^T\tilde{X}$$

```
Answer (b):


```

## Q2 — Computing Principal Components by Hand (7 points)

Consider the following dataset of three points in $\mathbb{R}^2$, one point per row. Treat the data as already centered (zero mean), so do not subtract the mean before computing the covariance.

$$X = \begin{bmatrix} 1 & 2 \\ 1.5 & 3 \\ 6 & 12 \end{bmatrix}$$

The **principal components** are the eigenvectors of the covariance matrix $\Sigma = \frac{1}{n}X^T X$, ordered by decreasing eigenvalue.

**Part (a):** What is the first principal component vector $v_1$? (2 points)

```
Answer (a):


```

**Part (b):** What is the second principal component $v_2$? (2 points)

```
Answer (b):


```

**Part (c):** Project the data onto the first principal component by computing $X v_1$ (compressing from 2D to 1D). What is the 1D representation of each point? (2 points)

```
Answer (c):


```

**Part (d):** Will this representation be lossy, or will it perfectly preserve the data? Explain why. (2 points)

```
Answer (d):


```

## Q3 — Structured Covariance Matrix (4 points) — *Optional/Bonus*

Consider a covariance matrix $\Sigma$ of the form:

$$\Sigma = \gamma I_p + aa^T$$

where $\gamma > 0$ is a scalar, $I_p$ is the $p \times p$ identity matrix, and $a$ is a vector of dimension $p$.


**Part (a):** Show that $a$ is an eigenvector of $\Sigma$. What is its eigenvalue? (2 points)

```
Answer (a):


```

**Part (b):** Show that if $b$ is any vector such that $a^T b = 0$, then $b$ is also an eigenvector of $\Sigma$. What is the eigenvalue corresponding to $b$? (2 points)

```
Answer (b):


```


## Q4 — SVD Properties and Derivation (15 points) — *Theory*

Consider any matrix $A \in \mathbb{R}^{n \times d}$.

**Part (a):** Let $v$ be a unit norm eigenvector of $A^T A$ with eigenvalue $\lambda$. Prove that $\lambda \geq 0$ and that $\|Av\|_2 = \sqrt{\lambda}$. (2 points)

```
Answer (a):


```

**Part (b):** Let $v_1, v_2$ be two eigenvectors of $A^T A$ that are orthogonal to each other and correspond to non-zero eigenvalues. Prove that $Av_1$ and $Av_2$ are also orthogonal to each other. (2 points)

```
Answer (b):


```

**Part (c):** Let $V \in \mathbb{R}^{d \times d}$ contain $d$ orthonormal eigenvectors of $A^T A$ as its columns and let $\Lambda \in \mathbb{R}^{d \times d}$ be the diagonal matrix of their corresponding eigenvalues. Assume $A^T A$ is full rank. Prove that $U = AV\Lambda^{-1/2}$ has orthonormal columns. (2 points)

*Hint: Apply parts (a) and (b).*
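Before writing the proofs, it can help to sanity-check the statements of parts (a) and (c) numerically. The sketch below does this on a random matrix. It is not a proof, and the matrix size is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(8, 4))                  # n = 8, d = 4; full column rank with probability 1

lam, V = np.linalg.eigh(A.T @ A)             # lam: (4,), V: (4, 4) with orthonormal columns

# Part (a): eigenvalues are non-negative and ||A v_i||_2 = sqrt(lambda_i)
norms = np.linalg.norm(A @ V, axis=0)        # column norms of AV, shape (4,)

# Part (c): U = A V Lambda^{-1/2}; dividing by sqrt(lam) rescales each column of AV
U = (A @ V) / np.sqrt(lam)                   # shape (8, 4)
gram = U.T @ U                               # shape (4, 4), expected to be I_4

print(np.allclose(norms, np.sqrt(lam)), np.allclose(gram, np.eye(4)))
```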

```
Answer (c):


```


## Q5 — Optimal Low-Rank Approximation (18 points) — *Theory*

In this problem, you will prove that PCA provides the **optimal low-rank approximation** by showing that projecting onto the top $k$ principal components minimizes reconstruction error.

Consider any matrix $A \in \mathbb{R}^{n \times d}$.

**Convention:** Throughout this problem, let $Z \in \mathbb{R}^{d \times k}$ be a matrix with orthonormal columns (i.e., $Z^T Z = I_k$). All min/max operations are over such matrices $Z$.
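To make the objective concrete, the following sketch evaluates the reconstruction error $\|A - AZZ^T\|_F^2$ for the top-$k$ eigenvectors of $A^T A$ and for a batch of random orthonormal $Z$. The helper names `random_orthonormal` and `recon_error` are hypothetical, introduced only for this illustration; the experiment gives numerical evidence for the claim, not a proof.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, k = 10, 5, 2
A = rng.normal(size=(n, d))

def random_orthonormal(d, k, rng):
    """Hypothetical helper: a d x k matrix with orthonormal columns via reduced QR."""
    Q, _ = np.linalg.qr(rng.normal(size=(d, k)))
    return Q

def recon_error(A, Z):
    """Squared Frobenius reconstruction error ||A - A Z Z^T||_F^2."""
    return np.linalg.norm(A - A @ Z @ Z.T, ord="fro") ** 2

# eigh sorts eigenvalues in ascending order, so the top-k eigenvectors are the last k columns
_, V = np.linalg.eigh(A.T @ A)
Z_top = V[:, -k:]                            # shape (5, 2)

err_top = recon_error(A, Z_top)
err_random = min(recon_error(A, random_orthonormal(d, k, rng)) for _ in range(100))

print(err_top, err_random)                   # err_top should not exceed err_random
```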

**Part (a):** Prove that: 

$$\arg\min_{Z} \|A - AZZ^T\|_F^2 = \arg\max_{Z} \text{tr}(Z^T A^T A Z)$$

```
Answer (a):


```

**Part (b):** Let $V \in \mathbb{R}^{d \times d}$ have orthonormal columns. Prove that for any $Z \in \mathbb{R}^{d \times k}$ with orthonormal columns, $V^T Z$ also has orthonormal columns. Further prove that any such $Z$ can be written as $Z = VU$ for some $U \in \mathbb{R}^{d \times k}$ with orthonormal columns. (2 points)

```
Answer (b):


```

**Part (c):** Writing $A^T A = V\Lambda V^T$ in its eigendecomposition, use part (b) to prove that: (2 points)

$$\max_{Z} \text{tr}(Z^T A^T A Z) = \max_{Z} \text{tr}(Z^T \Lambda Z)$$

```
Answer (c):


```

**Part (d):** Prove that for $Z \in \mathbb{R}^{d \times k}$ with orthonormal columns: $\text{tr}(ZZ^T) = k$ and $0 \leq (ZZ^T)_{i,i} \leq 1$ for all $i \in [d]$. (2 points)

```
Answer (d):


```

**Part (e):** Use parts (c) and (d) to show that $\max_{Z} \text{tr}(Z^T A^T A Z) = \sum_{i=1}^k \lambda_i(A^T A)$. Conclude that the optimal $Z$ has as its columns the top $k$ eigenvectors of $A^T A$. (2 points)

```
Answer (e):


```

## Q6 — Implementing BasicPCA (15 points)

In this problem, you will implement the class `BasicPCA` using eigendecomposition of the covariance matrix.

Implement the class according to the specifications below. Include shape comments after each operation.

Attributes:

* `n_components`: Number of principal components to retain
* `mean_`: Mean vector of training data, shape `(d,)`
* `components_`: Principal component vectors, shape `(n_components, d)` where each row is a principal component
* `explained_variance_`: Variance explained by each component, shape `(n_components,)`
* `explained_variance_ratio_`: Proportion of total variance explained by each component, shape `(n_components,)`

Method `__init__`:
* Inputs:
  * `n_components`: Integer specifying number of components
* Outputs:
  * None
* What to do:
  * Initialise all attributes

Method `fit`:
* Inputs:
  * `X`: Training data with shape `(n,d)`, where n is number of samples and d is number of features
* Outputs:
  * `self`
* What to do:
  * Compute mean vector and centre the data
  * Compute covariance matrix
  * Compute eigendecomposition using `np.linalg.eigh` (for symmetric matrices)
  * Sort eigenvalues and eigenvectors in descending order
  * Store the top `n_components` eigenvectors as principal components (as rows)
  * Store corresponding eigenvalues as explained variance
  * Compute explained variance ratios
  * After each operation, add a comment giving the shape of the resulting array
  * DO NOT use any loop for main computations
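One detail in the steps above that is easy to miss is that `np.linalg.eigh` returns eigenvalues in ascending order, so both the eigenvalues and the eigenvector columns need to be reordered together. The sketch below shows the steps as plain functions-free code on toy data. It is one possible arrangement, not the required class layout, and the data and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 4)) * np.array([3.0, 2.0, 1.0, 0.5])   # toy data, shape (50, 4)
n_components = 2

Xc = X - X.mean(axis=0)                         # centred data, (50, 4)
cov = Xc.T @ Xc / Xc.shape[0]                   # covariance matrix, (4, 4)

eigvals, eigvecs = np.linalg.eigh(cov)          # ascending order; eigvecs[:, i] pairs with eigvals[i]
order = np.argsort(eigvals)[::-1]               # indices for descending order, (4,)
eigvals = eigvals[order]                        # (4,)
eigvecs = eigvecs[:, order]                     # (4, 4), columns reordered to match

components = eigvecs[:, :n_components].T        # (2, 4), each row is a principal component
explained_variance = eigvals[:n_components]     # (2,)
explained_variance_ratio = explained_variance / eigvals.sum()   # (2,)
```

Dividing by the sum of all $d$ eigenvalues (rather than only the retained ones) keeps the ratios interpretable as fractions of the total variance, which equals $\text{tr}(\Sigma)$.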


Method `transform`:
* Inputs:
  * `X`: Data to transform, shape `(n,d)`
* Outputs:
  * `X_transformed`: Projected data, shape `(n,n_components)`
* What to do:
  * Centre the input data using the stored mean
  * Project onto principal components

Method `fit_transform`:
* Inputs:
  * `X`: Data to transform, shape `(n,d)`
* Outputs:
  * `X_transformed`: Projected data, shape `(n,n_components)`
* What to do:
  * Call fit and transform in sequence

Method `inverse_transform`:
* Inputs:
  * `X_transformed`: Data in PC space, shape `(n,n_components)`
* Outputs:
  * `X_reconstructed`: Reconstructed data in original space, shape `(n,d)`
* What to do:
  * Project back from PC space to original space
  * Add back the mean
  * Add shape comments
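The shapes involved in `transform` and `inverse_transform` can be traced with a stand-in for `components_`. In the sketch below, `W` is any matrix with orthonormal rows (built with QR, not from PCA), which is an assumption made so the example stays independent of the class you implement.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, k = 20, 4, 2
X = rng.normal(size=(n, d))
mean = X.mean(axis=0)                              # (4,)

# Stand-in for components_: k orthonormal rows taken from a reduced QR factorisation
Q, _ = np.linalg.qr(rng.normal(size=(d, k)))       # (4, 2)
W = Q.T                                            # (2, 4)

X_transformed = (X - mean) @ W.T                   # (20, 2)
X_reconstructed = X_transformed @ W + mean         # (20, 4)

# With k = d the rows form a full orthonormal basis, so the round trip is exact
Q_full, _ = np.linalg.qr(rng.normal(size=(d, d)))  # (4, 4)
X_exact = ((X - mean) @ Q_full) @ Q_full.T + mean  # (20, 4)

print(np.allclose(X_exact, X), np.allclose(X_reconstructed, X))
```

With $k < d$ the reconstruction generally differs from the input, but transforming the reconstruction again returns the same low-dimensional coordinates.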





```python
# code here
```

## Q7 — Implementing SVDPCA (15 points)

In this problem, you will build your own PCA class `SVDPCA` using Singular Value Decomposition (SVD) computed via eigendecomposition.

Implement the class according to the specifications below. Include shape comments after each operation.

Attributes:

* `n_components`: Number of principal components to retain
* `mean_`: Mean vector of training data, shape `(d,)`
* `components_`: Principal component vectors, shape `(n_components, d)` where each row is a principal component
* `singular_values_`: Singular values from SVD, shape `(n_components,)`
* `explained_variance_`: Variance explained by each component, shape `(n_components,)`
* `explained_variance_ratio_`: Proportion of total variance explained by each component, shape `(n_components,)`

Method `__init__`:
* Inputs:
  * `n_components`: Integer specifying number of components
* Outputs:
  * None
* What to do:
  * Initialise all attributes

Method `fit`:
* Inputs:
  * `X`: Training data with shape `(n,d)`, where n is number of samples and d is number of features
* Outputs:
  * `self`
* What to do:
  * Compute mean vector and centre the data
  * Manually compute the SVD using eigendecomposition:
    * For the centred data matrix $X\in\mathbb{R}^{n\times d}$, we want $X = U \Sigma V^T$ (in this question $\Sigma$ denotes the diagonal matrix of singular values, not the covariance matrix)
    * Use the fact that the columns of $V$ are eigenvectors of $X^TX$ and the diagonal of $\Sigma^2$ contains the eigenvalues of $X^TX$
    * Compute $A=X^TX$ (shape: $d\times d$)
    * Compute the eigendecomposition of $A$ using `np.linalg.eigh`: $A = V\Lambda V^T$
    * Sort eigenvalues and eigenvectors in descending order
    * Extract singular values: $\sigma_i = \sqrt{\lambda_i}$ for $i = 1, ..., d$
    * The right singular vectors are the columns of $V$ (eigenvectors of $X^TX$)
    * Note: Principal components are the rows of $V^T$, i.e. the columns of $V$
  * Store the first `n_components` rows of $V^T$ as principal components
  * Store corresponding singular values
  * Compute explained variance from singular values using: variance $= \sigma_i^2/n$
  * Compute explained variance ratios
  * Do NOT use any loop for main computation
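The relationship between the eigendecomposition of $X^TX$ and the SVD can be checked against `np.linalg.svd` on toy data. The sketch below is an illustration under assumed data; the `np.clip` call is an added safeguard against tiny negative eigenvalues from round-off and is not something the specification asks for.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(30, 4))
Xc = X - X.mean(axis=0)                          # centred data, (30, 4)
n = Xc.shape[0]

lam, V = np.linalg.eigh(Xc.T @ Xc)               # lam: (4,), V: (4, 4), ascending order
order = np.argsort(lam)[::-1]                    # indices for descending order
lam, V = lam[order], V[:, order]                 # reorder eigenvalues and columns together

sigma = np.sqrt(np.clip(lam, 0.0, None))         # singular values, (4,)
explained_variance = sigma ** 2 / n              # (4,), eigenvalues of (1/n) Xc^T Xc

# Reference values from NumPy's SVD; rows of Vt_ref match columns of V up to sign
_, s_ref, Vt_ref = np.linalg.svd(Xc, full_matrices=False)

print(np.allclose(sigma, s_ref))
```

Because $\sigma_i^2 / n = \lambda_i(X^TX)/n$, the explained variances obtained this way coincide with the eigenvalues of the covariance matrix $\frac{1}{n}X^TX$ used in `BasicPCA`.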

Method `transform`:
* Inputs:
  * `X`: Data to transform, shape `(n,d)`
* Outputs:
  * `X_transformed`: Projected data, shape `(n,n_components)`
* What to do:
  * Centre the input data using the stored mean
  * Project onto principal components

Method `fit_transform`:
* Inputs:
  * `X`: Data to transform, shape `(n,d)`
* Outputs:
  * `X_transformed`: Projected data, shape `(n,n_components)`
* What to do:
  * Call fit and transform in sequence

Method `inverse_transform`:
* Inputs:
  * `X_transformed`: Data in PC space, shape `(n,n_components)`
* Outputs:
  * `X_reconstructed`: Reconstructed data in original space, shape `(n,d)`
* What to do:
  * Project back from PC space to original space
  * Add back the mean
  * Add shape comments


```python
# code here
```

## Q8 — Testing PCA on Iris Dataset (12 points)

In this problem, you will test both PCA implementations on the Iris dataset and verify their correctness.

**Note:** The Iris dataset has already been loaded in the setup code:
- `iris_data`: shape (150, 4) — the feature matrix
- `iris_labels`: shape (150,) — the class labels
- `iris_feature_names`: list of 4 feature names
- `iris_target_names`: list of 3 class names

---

**Part (a): Implement the test function** (4 points)

Define a function `test_pca_on_iris()` that:
- Sets `n_components = 2`
- Fits both `BasicPCA` and `SVDPCA` on the iris data
- Transforms the iris data using both models
- Prints: original data shape, transformed data shapes, explained variance ratios, total variance explained
- Does NOT use any loops for main computations

---

**Part (b): Compare the implementations** (3 points)

In the same function, verify both implementations produce equivalent results by computing:
1. Maximum absolute difference between principal components (accounting for sign flips)
2. Maximum absolute difference between explained variance ratios
3. Maximum absolute difference between transformed data (accounting for sign flips)
4. Print whether methods agree within tolerance 1e-10

*Hint: Components can be flipped in sign. Check both |v1 - v2| and |v1 + v2| and take the minimum.*
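The sign-flip comparison in the hint can be done without loops by taking the elementwise minimum of the two candidate errors per vector. The helper `max_abs_diff_up_to_sign` below is hypothetical, written only to illustrate the idea on a tiny example.

```python
import numpy as np

def max_abs_diff_up_to_sign(A, B, axis):
    """Hypothetical helper: max |A - B| after allowing a sign flip per vector.

    axis is the axis along which each vector's entries run
    (axis=1 for vectors stored as rows, axis=0 for vectors stored as columns).
    """
    diff_same = np.max(np.abs(A - B), axis=axis)      # per-vector error if signs agree
    diff_flip = np.max(np.abs(A + B), axis=axis)      # per-vector error if signs are opposite
    return np.max(np.minimum(diff_same, diff_flip))   # best case per vector, worst over vectors

comps_a = np.array([[0.6, 0.8], [-0.8, 0.6]])         # two components stored as rows
comps_b = comps_a * np.array([[1.0], [-1.0]])         # same components, second row flipped

naive = np.max(np.abs(comps_a - comps_b))             # large, because of the flipped row
robust = max_abs_diff_up_to_sign(comps_a, comps_b, axis=1)
print(naive, robust)
```

For components stored as rows use `axis=1`; for transformed data of shape `(n, n_components)`, each column corresponds to one component, so `axis=0` would apply.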

---

**Part (c): Reconstruction error** (3 points)

In the same function, compute and print:
1. Reconstruction error for both models: $\|X - X_{\text{reconstructed}}\|_F^2$
2. Verify reconstruction errors match between models
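As a reference point for what value to expect, the squared Frobenius reconstruction error of a rank-$k$ PCA equals $n$ times the sum of the discarded covariance eigenvalues (with the $\frac{1}{n}$ normalisation used here). The sketch below checks this on toy data using plain NumPy rather than the classes from Q6 and Q7; the data is an assumption.

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(40, 4)) * np.array([2.0, 1.5, 1.0, 0.5])   # toy data, (40, 4)
k = 2

mean = X.mean(axis=0)                               # (4,)
Xc = X - mean                                       # (40, 4)
n = Xc.shape[0]

lam, V = np.linalg.eigh(Xc.T @ Xc / n)              # ascending eigenvalues of the covariance
W = V[:, ::-1][:, :k].T                             # top-k components as rows, (2, 4)

X_rec = (Xc @ W.T) @ W + mean                       # reconstruction, (40, 4)
err = np.linalg.norm(X - X_rec, ord="fro") ** 2     # squared Frobenius reconstruction error

# Cross-check: n times the sum of the d - k smallest eigenvalues
err_from_eigs = n * lam[:-k].sum()
print(err, err_from_eigs)
```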

After implementing, call: `test_pca_on_iris()`


```python
# code here
```

## Q9 - Visualising PCA on Iris (12 points)

In this part you will visualise the PCA results on the Iris dataset.



---

**Part (a): Create a scatter plot function**

Define a function `plot_pca_results(X_transformed, labels, title, explained_variance_ratio)` that:

Inputs:
* `X_transformed`: Transformed data in 2D, shape `(n, 2)`
* `labels`: Class labels, shape `(n,)`
* `title`: String for plot title
* `explained_variance_ratio`: Array of explained variance ratios for the 2 components, shape `(2,)`

Outputs:
* None 

What to do inside this function:
1. Create a figure with size `(10, 8)`
2. Create a scatter plot with `X_transformed[:, 0]` on x-axis and `X_transformed[:, 1]` on y-axis
3. Color points by their class label (use 3 different colors for 3 classes)
4. Add labels: "PC1 (X.X% variance)" for x-axis, "PC2 (X.X% variance)" for y-axis, where X.X is the percentage from `explained_variance_ratio`
5. Add a legend showing "Setosa", "Versicolor", "Virginica"
6. Add grid for better readability
7. Set the title
8. Use `plt.show()` to display the plot
9. Do not use any loop for plotting (matplotlib can handle arrays directly)
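One way to colour by class and still get a three-entry legend without a loop is to pass the labels to `c=` and build the legend handles with the scatter object's `legend_elements()` method. The sketch below uses random stand-in data and made-up variance ratios; it also selects the `Agg` backend so it runs without a display, which you would omit in a notebook.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit in a notebook
import matplotlib.pyplot as plt

rng = np.random.default_rng(9)
points = rng.normal(size=(30, 2))            # stand-in for X_transformed, (30, 2)
labels = np.repeat([0, 1, 2], 10)            # stand-in for class labels, (30,)
class_names = ["Setosa", "Versicolor", "Virginica"]
ratio = np.array([0.7, 0.2])                 # made-up explained variance ratios

fig, ax = plt.subplots(figsize=(10, 8))
scatter = ax.scatter(points[:, 0], points[:, 1], c=labels, cmap="viridis")
handles, _ = scatter.legend_elements()       # one legend handle per distinct label value
ax.legend(handles, class_names)
ax.set_xlabel(f"PC1 ({ratio[0] * 100:.1f}% variance)")
ax.set_ylabel(f"PC2 ({ratio[1] * 100:.1f}% variance)")
ax.grid(True)
```

The order of `handles` follows the sorted unique label values, so `class_names` must be listed in the same order as the integer labels.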


---

**Part (b): Visualise explained variance**

Define a function `plot_explained_variance(explained_variance_ratio)` that:

Inputs:
* `explained_variance_ratio`: Array of explained variance ratios for all 4 components, shape `(4,)`

Outputs:
* None

What to do:

1. Create a figure with size `(10, 6)`
2. Compute cumulative explained variance: `cumulative_variance = np.cumsum(explained_variance_ratio)`
3. Create a bar plot showing explained variance ratio for each component (use component numbers 1, 2, 3, 4 on x-axis)
4. On the same plot, add a line plot showing cumulative explained variance with markers
5. Label x-axis as "Principal Component"
6. Label y-axis as "Explained Variance Ratio"
7. Add a legend distinguishing between "Individual Explained Variance" and "Cumulative Explained Variance"
8. Add grid with transparency (alpha=0.3)
9. Set title as "PCA Explained Variance Analysis"
10. Use `plt.tight_layout()` to prevent label overlap
11. Use `plt.show()` to display the plot (do NOT save to file)
12. Do not use any loop for plotting
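The bar-plus-line layout described above can be drawn on a single axes object by calling `bar` and `plot` with the same x positions. The sketch below uses made-up ratios and leaves out the function wrapper and the `plt.show()` call; the `Agg` backend line is only there so it runs without a display.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit in a notebook
import matplotlib.pyplot as plt

ratio = np.array([0.6, 0.25, 0.1, 0.05])      # made-up explained variance ratios, (4,)
cumulative = np.cumsum(ratio)                 # running total, (4,)
idx = np.arange(1, ratio.size + 1)            # component numbers 1..4

fig, ax = plt.subplots(figsize=(10, 6))
ax.bar(idx, ratio, alpha=0.7, label="Individual Explained Variance")
ax.plot(idx, cumulative, marker="o", color="tab:red", label="Cumulative Explained Variance")
ax.set_xticks(idx)                            # integer ticks at 1, 2, 3, 4
ax.set_xlabel("Principal Component")
ax.set_ylabel("Explained Variance Ratio")
ax.set_title("PCA Explained Variance Analysis")
ax.legend()
ax.grid(alpha=0.3)
fig.tight_layout()
```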


---

**Part (c): Call the functions**

After implementing both functions, create and display the plots.



```python
# code here
```