# Principal Component Analysis

PCA is a **dimensionality reduction technique** that transforms data into a new coordinate system, where the **most significant variations** are captured in the **principal components**.

Given a dataset $X$ of shape $(n \times d)$:
- $n$ = number of samples
- $d$ = number of features

PCA finds a new set of **orthogonal axes**, ordered so that each one captures as much of the remaining variance as possible.

---

## **Steps of PCA**

### **Step 1: Mean Centering**
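We **center** the data by subtracting each feature's mean, so that variance is measured around the mean and the principal directions describe the spread of the data rather than its offset from the origin: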

$$
X_{\text{centered}} = X - \mu
$$

where $\mu$ is the vector of per-feature means, subtracted from every row of $X$:

$$
\mu_j = \frac{1}{n} \sum_{i=1}^{n} X_{ij}, \quad j = 1, \dots, d
$$
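
As a minimal sketch of the centering step, assuming NumPy and a small illustrative array (the variable names `mu` and `X_centered` are chosen here to mirror the symbols above):

```python
import numpy as np

# Small illustrative dataset: 4 samples (rows), 3 features (columns).
X = np.array([[2.5, 2.4, 1.2],
              [0.5, 0.7, 0.8],
              [2.2, 2.9, 1.5],
              [1.9, 2.2, 1.3]])

mu = X.mean(axis=0)      # per-feature mean, shape (d,)
X_centered = X - mu      # broadcasting subtracts mu from every row

print("mu:", mu)
print("column means after centering:", X_centered.mean(axis=0))
```

After centering, every column mean should be zero up to floating-point error.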

---

### **Step 2: Compute Covariance Matrix**
The **covariance matrix** captures relationships between features:

$$
C = \frac{1}{n} X_{\text{centered}}^T X_{\text{centered}}
$$

where:

$$
C_{ij} = \frac{1}{n} \sum_{k=1}^{n} (X_{ki} - \mu_i)(X_{kj} - \mu_j)
$$

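- **Diagonal entries $C_{ii}$** → Variance of feature $i$.
- **Large positive or negative $C_{ij}$** → Features $i$ and $j$ tend to vary together, in the same or in opposite directions.
- **$C_{ij}$ near zero** → Little linear relationship between the two features.

Covariance depends on the scale of the features, so its magnitude alone is not a measure of correlation strength. Some implementations (including NumPy's `np.cov` by default) divide by $n - 1$ instead of $n$; this rescales the eigenvalues but leaves the eigenvectors unchanged.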
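
The covariance formula can be checked numerically. The sketch below assumes NumPy and the $\frac{1}{n}$ normalization used in the formula; `bias=True` is the `np.cov` option that selects that normalization.

```python
import numpy as np

X = np.array([[2.5, 2.4, 1.2],
              [0.5, 0.7, 0.8],
              [2.2, 2.9, 1.5],
              [1.9, 2.2, 1.3]])

n = X.shape[0]
X_centered = X - X.mean(axis=0)

# Covariance matrix from the formula, with 1/n normalization.
C = (X_centered.T @ X_centered) / n

# np.cov divides by n - 1 by default; bias=True switches to 1/n.
# rowvar=False tells NumPy that columns (not rows) are the variables.
C_numpy = np.cov(X, rowvar=False, bias=True)

print(C)
print("matches np.cov:", np.allclose(C, C_numpy))
```

The result is a symmetric $(d \times d)$ matrix whose diagonal holds the per-feature variances.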

---

### **Step 3: Compute Eigenvalues & Eigenvectors**
To find the **principal directions**, we solve the eigenvalue equation:

$$
C v = \lambda v
$$

where:
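- $v$ is an **eigenvector** (the direction of a principal component).
- $\lambda$ is the corresponding **eigenvalue** (the variance of the data along $v$).

Because $C$ is symmetric, its eigenvalues are real and non-negative and its eigenvectors can be chosen to be orthonormal. The eigenvectors **define the new axes** and the eigenvalues **indicate their importance**.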
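
One way to solve this equation numerically is `np.linalg.eigh`, NumPy's solver for symmetric matrices. The sketch below is illustrative: it builds $C$ with the $\frac{1}{n}$ normalization and then checks the eigenvalue equation for each pair.

```python
import numpy as np

X = np.array([[2.5, 2.4, 1.2],
              [0.5, 0.7, 0.8],
              [2.2, 2.9, 1.5],
              [1.9, 2.2, 1.3]])

X_centered = X - X.mean(axis=0)
C = (X_centered.T @ X_centered) / X.shape[0]

# eigh returns real eigenvalues in ascending order and the matching
# orthonormal eigenvectors as the COLUMNS of the second result.
eigenvalues, eigenvectors = np.linalg.eigh(C)

# Check the defining equation C v = lambda v for every pair.
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(C @ v, lam * v)

# Columns are orthonormal, and the eigenvalues sum to the total variance.
orthonormal = np.allclose(eigenvectors.T @ eigenvectors, np.eye(C.shape[0]))
total_variance_matches = np.isclose(eigenvalues.sum(), np.trace(C))
print(orthonormal, total_variance_matches)
```

Eigenvectors are only defined up to sign, so different solvers may return $v$ or $-v$ for the same direction.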

---

### **Step 4: Select Top $k$ Principal Components**
Sort eigenvectors **by descending eigenvalues**:

$$
\lambda_1 \geq \lambda_2 \geq \lambda_3 \geq \dots \geq \lambda_d
$$

Select the **top $k$ eigenvectors** and stack them as columns, forming the **projection matrix** $W$ of shape $(d \times k)$:

$$
W = [v_1, v_2, \dots, v_k]
$$
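
The sorting and selection can be sketched as follows, assuming NumPy and $k = 2$. The `explained_ratio` variable is an illustrative addition that expresses each eigenvalue as a fraction of the total variance, which is one common way to choose $k$.

```python
import numpy as np

X = np.array([[2.5, 2.4, 1.2],
              [0.5, 0.7, 0.8],
              [2.2, 2.9, 1.5],
              [1.9, 2.2, 1.3]])

X_centered = X - X.mean(axis=0)
C = (X_centered.T @ X_centered) / X.shape[0]
eigenvalues, eigenvectors = np.linalg.eigh(C)   # ascending order

# Reorder so that the largest eigenvalue comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]           # reorder columns to match

k = 2
W = eigenvectors[:, :k]                         # projection matrix, shape (d, k)

# Fraction of the total variance carried by each component.
explained_ratio = eigenvalues / eigenvalues.sum()
print("W shape:", W.shape)
print("explained variance ratio:", explained_ratio)
```

Note that the eigenvalues and the eigenvector columns must be reordered with the same index array, otherwise the pairs no longer match.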

---

### **Step 5: Project Data onto New Axes**
Transform the dataset into the new lower-dimensional space:

$$
X_{\text{reduced}} = X_{\text{centered}} \cdot W
$$

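Each **new feature** (a column of $X_{\text{reduced}}$, which has shape $(n \times k)$) is a **linear combination** of the original features, weighted by the entries of the corresponding eigenvector.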
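
To illustrate the projection, the sketch below (assuming NumPy, $k = 2$, and the $\frac{1}{n}$ covariance normalization) projects the centered data and checks a useful property: the covariance of the projected data is diagonal, with the top $k$ eigenvalues on the diagonal.

```python
import numpy as np

X = np.array([[2.5, 2.4, 1.2],
              [0.5, 0.7, 0.8],
              [2.2, 2.9, 1.5],
              [1.9, 2.2, 1.3]])

X_centered = X - X.mean(axis=0)
C = (X_centered.T @ X_centered) / X.shape[0]

eigenvalues, eigenvectors = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

k = 2
W = eigenvectors[:, :k]

# Project: (n, d) @ (d, k) -> (n, k)
X_reduced = X_centered @ W

# The new features are uncorrelated: their covariance matrix is
# diagonal, and its diagonal holds the top-k eigenvalues.
C_reduced = np.cov(X_reduced, rowvar=False, bias=True)
print("reduced shape:", X_reduced.shape)
print("diagonal covariance:", np.allclose(C_reduced, np.diag(eigenvalues[:k])))
```

The signs of the columns of `X_reduced` depend on the sign convention of the eigenvector solver, so a column may appear negated relative to another implementation.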

```python
import numpy as np
```

```python
class PCA:
    def __init__(self, n_components):
        # n_components: Number of principal components to keep.
        self.n_components = n_components
        self.mean = None
        self.components = None

    def fit(self, X):
        # X: Input data of shape (n_samples, n_features).
        # Step 1: Center the data (subtract the per-feature mean; no scaling)
        self.mean = np.mean(X, axis=0)
        X_centered = X - self.mean

        # Step 2: Compute the covariance matrix.
        # np.cov divides by n - 1 by default; this rescales the eigenvalues
        # but does not change the eigenvectors.
        covariance_matrix = np.cov(X_centered, rowvar=False)

        # Step 3: Compute eigenvalues & eigenvectors.
        # eigh is meant for symmetric matrices and always returns real values
        # (eig can return complex values due to round-off). Eigenvector signs
        # are arbitrary, so projected columns may come out negated.
        eigenvalues, eigenvectors = np.linalg.eigh(covariance_matrix)

        # Step 4: Sort eigenvectors by descending eigenvalue
        sorted_indices = np.argsort(eigenvalues)[::-1]
        eigenvectors = eigenvectors[:, sorted_indices]

        # Keep only the top 'n_components' eigenvectors (stored as columns)
        self.components = eigenvectors[:, :self.n_components]

    def transform(self, X):
        """
        Project X onto the principal components.

        X: Input data of shape (n_samples, n_features).

        Returns:
        X_projected: Transformed data of shape (n_samples, n_components).
        """
        # Step 5: Center with the mean learned in fit(), then project
        X_centered = X - self.mean
        return np.dot(X_centered, self.components)
```

```python
if __name__ == "__main__":
    X = np.array([[2.5, 2.4, 1.2],
                  [0.5, 0.7, 0.8],
                  [2.2, 2.9, 1.5],
                  [1.9, 2.2, 1.3]])

    pca = PCA(n_components=2)
    pca.fit(X)

    X_reduced = pca.transform(X)

    print("Original Data:\n", X)
    print("\nReduced Data:\n", X_reduced)
```


```
Original Data:
 [[2.5 2.4 1.2]
 [0.5 0.7 0.8]
 [2.2 2.9 1.5]
 [1.9 2.2 1.3]]

Reduced Data:
 [[ 0.73247336  0.33377376]
 [-1.89939483 -0.01731078]
 [ 0.95531195 -0.28290054]
 [ 0.21160952 -0.03356244]]
```
