## Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a technique for reducing the number of variables in a dataset while retaining as much of its variance as possible. It does this by finding new, mutually orthogonal directions, called principal components, ordered by how much of the data's variance each one captures.

---

## Covariance Matrix

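The covariance matrix is a square, symmetric matrix that shows how each pair of variables in the data changes together. Its diagonal entries are the variances of the individual variables, and its off-diagonal entries are the pairwise covariances.

* If two variables tend to increase or decrease together, their covariance is positive.
* If one variable tends to increase while the other decreases, their covariance is negative.
* If the variables have no linear relationship, their covariance is close to zero. Zero covariance does not rule out a nonlinear dependence.

PCA works directly from this matrix: the directions it finds are determined by how the variables co-vary.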
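The three cases above can be checked numerically. The sketch below uses small made-up arrays (`x`, `y_up`, `y_down`, and `y_curve` are illustrative and not taken from any dataset) together with NumPy's `np.cov`, whose off-diagonal entry is the covariance of the two inputs.

```python
import numpy as np

# Illustrative data (an assumption for this sketch, not from the text).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2 * x            # rises with x
y_down = -2 * x         # falls as x rises
y_curve = (x - 3) ** 2  # depends on x, but not linearly


def covariance(a, b):
    """Sample covariance of two 1-D arrays (divides by n - 1)."""
    return np.cov(a, b)[0, 1]


print("rises together:  ", covariance(x, y_up))     # positive
print("moves oppositely:", covariance(x, y_down))   # negative
print("symmetric curve: ", covariance(x, y_curve))  # zero up to rounding
```

Note that `y_curve` is completely determined by `x`, yet its covariance with `x` is zero: covariance only measures linear association.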

---

## Eigenvectors, Eigenvalues, and Explained Variance

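* **Eigenvectors** of the covariance matrix give the directions of the principal components. The eigenvector with the largest eigenvalue points along the direction in which the data varies the most, and each subsequent eigenvector is orthogonal to the ones before it.

* **Eigenvalues** tell us how much of the data's variance lies along each eigenvector.

* **Explained variance** is the share of the total variance that each principal component accounts for: its eigenvalue divided by the sum of all eigenvalues, usually quoted as a percentage.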

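In PCA, we center the data, compute the covariance matrix, find its eigenvectors and eigenvalues, sort them by decreasing eigenvalue, and keep the top $k$ eigenvectors as the principal components. Projecting the centered data onto those $k$ directions gives a lower-dimensional representation that preserves as much variance as $k$ dimensions can.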
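The whole procedure fits in a few lines of NumPy. The following is a minimal sketch, not a reference implementation: `pca_from_covariance` is a hypothetical helper name, and the random demo data is an assumption made only so the block runs on its own.

```python
import numpy as np


def pca_from_covariance(X, n_components):
    """Hypothetical helper: PCA via eigendecomposition of the covariance matrix."""
    X_centered = X - X.mean(axis=0)        # subtract each feature's mean
    C = np.cov(X_centered, rowvar=False)   # features are columns
    # eigh is meant for symmetric matrices; it returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(C)
    order = np.argsort(eigenvalues)[::-1]  # largest variance first
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    components = eigenvectors[:, :n_components]
    explained_ratio = eigenvalues[:n_components] / eigenvalues.sum()
    scores = X_centered @ components       # projected data
    return scores, components, explained_ratio


# Demo on synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.0, 0.1]])
X_demo = rng.normal(size=(100, 3)) @ mixing

scores, components, ratio = pca_from_covariance(X_demo, n_components=2)
print(scores.shape)  # (100, 2)
print(ratio)         # share of variance kept by each component
```

`np.linalg.eigh` is used rather than `np.linalg.eig` because the covariance matrix is symmetric, which guarantees real eigenvalues and orthonormal eigenvectors.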

---

### Summary Table

| Term               | Definition                                                         |
| ------------------ | ------------------------------------------------------------------ |
| PCA                | Technique to reduce data dimensions while preserving information   |
| Covariance Matrix  | Matrix showing how variables vary together                         |
| Eigenvectors       | Directions of maximum variance in data (principal components)      |
| Eigenvalues        | Amount of variance explained by each eigenvector                   |
| Explained Variance | Percentage of total variance explained by each principal component |


### Dataset (2 features, 3 points):

$$
X = \begin{bmatrix}
2 & 0 \\
0 & 2 \\
3 & 3 \\
\end{bmatrix}
$$

---

### Step 1: Center the data (subtract mean of each feature)

Mean of each feature:

$$
\bar{x}_1 = \frac{2 + 0 + 3}{3} = \frac{5}{3} \approx 1.67, \quad \bar{x}_2 = \frac{0 + 2 + 3}{3} = \frac{5}{3} \approx 1.67
$$

Centered data:

$$
X_c = \begin{bmatrix}
2 - 1.67 & 0 - 1.67 \\
0 - 1.67 & 2 - 1.67 \\
3 - 1.67 & 3 - 1.67 \\
\end{bmatrix} \approx
\begin{bmatrix}
0.33 & -1.67 \\
-1.67 & 0.33 \\
1.33 & 1.33 \\
\end{bmatrix}
$$
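The centering step can be reproduced with NumPy broadcasting. This short sketch uses the dataset above; the variable names `feature_means` and `X_c` are chosen here for illustration.

```python
import numpy as np

# Dataset from the worked example: 3 points (rows), 2 features (columns).
X = np.array([[2.0, 0.0],
              [0.0, 2.0],
              [3.0, 3.0]])

feature_means = X.mean(axis=0)  # one mean per column
X_c = X - feature_means         # broadcasting subtracts the mean from every row

print(np.round(feature_means, 2))
print(np.round(X_c, 2))
```

After centering, each column of `X_c` averages to zero, which is what the covariance formula in the next step assumes.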

---

### Step 2: Calculate covariance matrix $C$

$$
C = \frac{1}{n-1} X_c^T X_c =
\frac{1}{2}
\begin{bmatrix}
0.33 & -1.67 & 1.33 \\
-1.67 & 0.33 & 1.33 \\
\end{bmatrix}
\begin{bmatrix}
0.33 & -1.67 \\
-1.67 & 0.33 \\
1.33 & 1.33 \\
\end{bmatrix}
$$

Multiplying out and rounding to two decimals:

$$
C \approx \begin{bmatrix}
2.33 & 0.33 \\
0.33 & 2.33 \\
\end{bmatrix}
$$
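As a check on the arithmetic, the sketch below computes the covariance matrix two ways: directly from the formula $\frac{1}{n-1} X_c^T X_c$, and with `np.cov`. Passing `rowvar=False` tells NumPy that features are stored in columns.

```python
import numpy as np

X = np.array([[2.0, 0.0],
              [0.0, 2.0],
              [3.0, 3.0]])
n = X.shape[0]

X_c = X - X.mean(axis=0)
C_manual = X_c.T @ X_c / (n - 1)   # the formula from Step 2
C_numpy = np.cov(X, rowvar=False)  # np.cov centers the data itself

print(np.round(C_manual, 2))
print(np.allclose(C_manual, C_numpy))
```

`np.cov` divides by $n - 1$ by default, so it matches the sample covariance used in this example.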

---

### Step 3: Find eigenvectors and eigenvalues of $C$

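Solve the characteristic equation:

$$
\det(C - \lambda I) = 0
$$

With the exact entries $C_{11} = C_{22} = \tfrac{7}{3}$ and $C_{12} = C_{21} = \tfrac{1}{3}$, this becomes:

$$
\left(\tfrac{7}{3} - \lambda\right)^2 - \left(\tfrac{1}{3}\right)^2 = 0
\quad\Longrightarrow\quad
\lambda = \tfrac{7}{3} \pm \tfrac{1}{3}
$$

Eigenvalues:

$$
\lambda_1 = \tfrac{8}{3} \approx 2.67, \quad \lambda_2 = 2.00
$$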

Eigenvectors (normalized):

$$
v_1 = \frac{1}{\sqrt{2}} \begin{bmatrix}1 \\ 1\end{bmatrix}, \quad
v_2 = \frac{1}{\sqrt{2}} \begin{bmatrix}-1 \\ 1\end{bmatrix}
$$
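The eigendecomposition can be verified with `np.linalg.eigh`, which is intended for symmetric matrices such as $C$. This is a sketch under one caveat: an eigenvector is only defined up to sign, so NumPy may return $-v_1$ or $-v_2$ instead of the vectors written above.

```python
import numpy as np

X = np.array([[2.0, 0.0],
              [0.0, 2.0],
              [3.0, 3.0]])
C = np.cov(X, rowvar=False)

# eigh returns eigenvalues in ascending order; reorder so the largest comes first.
eigenvalues, eigenvectors = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # eigenvectors are the columns

v1, v2 = eigenvectors[:, 0], eigenvectors[:, 1]

print(np.round(eigenvalues, 2))
print(np.round(v1, 3), np.round(v2, 3))

# Each pair should satisfy the defining relation C v = lambda v.
print(np.allclose(C @ v1, eigenvalues[0] * v1))
print(np.allclose(C @ v2, eigenvalues[1] * v2))
```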

---

### Step 4: Explained variance

Total variance = $\lambda_1 + \lambda_2 = \tfrac{8}{3} + 2 = \tfrac{14}{3} \approx 4.67$

Percentage explained by each component:

$$
\text{PC1} = \frac{2.67}{4.67} \approx 57\%, \quad \text{PC2} = \frac{2.00}{4.67} \approx 43\%
$$
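The same ratios follow from the eigenvalues in code. The sketch below also shows the cumulative share, which is often used to decide how many components to keep; the name `cumulative` is introduced here for illustration.

```python
import numpy as np

X = np.array([[2.0, 0.0],
              [0.0, 2.0],
              [3.0, 3.0]])
C = np.cov(X, rowvar=False)

# eigvalsh returns only the eigenvalues, ascending; reverse for largest first.
eigenvalues = np.linalg.eigvalsh(C)[::-1]

explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)

print(np.round(explained_ratio * 100, 1))  # percentage per component
print(np.round(cumulative * 100, 1))       # running total

# The total variance equals the trace of C (sum of the feature variances).
print(np.isclose(eigenvalues.sum(), np.trace(C)))
```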

---

### Interpretation:

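* The first principal component, $v_1$, weights both features equally and explains about 57% of the variance.
* The second principal component, $v_2$, explains the remaining 43%.
* Projecting the data onto $v_1$ alone reduces it to one dimension but discards 43% of the variance, so the reduction is quite lossy here. PCA pays off most when the leading components explain a much larger share.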
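To see what projecting onto $v_1$ keeps and loses, the sketch below reduces the three points to one coordinate each, maps them back into the original 2-D space, and measures the variance retained and discarded. The names `scores` and `X_approx` are illustrative.

```python
import numpy as np

X = np.array([[2.0, 0.0],
              [0.0, 2.0],
              [3.0, 3.0]])
mean = X.mean(axis=0)
X_c = X - mean

v1 = np.array([1.0, 1.0]) / np.sqrt(2)  # first principal component

scores = X_c @ v1                       # 1-D coordinates along v1
X_approx = np.outer(scores, v1) + mean  # back-projection into 2-D

n = X.shape[0]
retained = scores.var(ddof=1)                  # variance along v1
lost = ((X - X_approx) ** 2).sum() / (n - 1)   # variance left behind

print(np.round(scores, 3))
print(np.round(X_approx, 3))
print(round(retained, 2), round(lost, 2))
```

The retained variance equals $\lambda_1$ and the lost variance equals $\lambda_2$, which is why the eigenvalues measure how much each component contributes.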


In [1]:
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.decomposition import PCA

In [2]:
# Load Iris dataset
iris = datasets.load_iris()
X = iris.data  # Original features (4 features)
feature_names = iris.feature_names

In [3]:
feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [4]:
# Apply PCA to reduce to 2 components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

In [5]:
X_pca

array([[-2.68412563,  0.31939725],
       [-2.71414169, -0.17700123],
       [-2.88899057, -0.14494943],
       [-2.74534286, -0.31829898],
       [-2.72871654,  0.32675451],
       [-2.28085963,  0.74133045],
       [-2.82053775, -0.08946138],
       [-2.62614497,  0.16338496],
       [-2.88638273, -0.57831175],
       [-2.6727558 , -0.11377425],
       [-2.50694709,  0.6450689 ],
       [-2.61275523,  0.01472994],
       [-2.78610927, -0.235112  ],
       [-3.22380374, -0.51139459],
       [-2.64475039,  1.17876464],
       [-2.38603903,  1.33806233],
       [-2.62352788,  0.81067951],
       [-2.64829671,  0.31184914],
       [-2.19982032,  0.87283904],
       [-2.5879864 ,  0.51356031],
       [-2.31025622,  0.39134594],
       [-2.54370523,  0.43299606],
       [-3.21593942,  0.13346807],
       [-2.30273318,  0.09870885],
       [-2.35575405, -0.03728186],
       [-2.50666891, -0.14601688],
       [-2.46882007,  0.13095149],
       [-2.56231991,  0.36771886],
       [-2.63953472,

In [6]:
pd.DataFrame(X, columns=feature_names).head()

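|     | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
| --- | ----------------- | ---------------- | ----------------- | ---------------- |
| 0   | 5.1               | 3.5              | 1.4               | 0.2              |
| 1   | 4.9               | 3.0              | 1.4               | 0.2              |
| 2   | 4.7               | 3.2              | 1.3               | 0.2              |
| 3   | 4.6               | 3.1              | 1.5               | 0.2              |
| 4   | 5.0               | 3.6              | 1.4               | 0.2              |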


In [7]:
pd.DataFrame(X_pca, columns=['PC1', 'PC2']).head()

|     | PC1       | PC2       |
| --- | --------- | --------- |
| 0   | -2.684126 | 0.319397  |
| 1   | -2.714142 | -0.177001 |
| 2   | -2.888991 | -0.144949 |
| 3   | -2.745343 | -0.318299 |
| 4   | -2.728717 | 0.326755  |
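A natural follow-up is to ask how much of the Iris variance these two components retain. The sketch below reads scikit-learn's fitted attributes `explained_variance_ratio_`, `components_`, and `mean_`, and checks that `transform` is the same centering-and-projection used in the worked example. It refits the model so the block runs on its own.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_iris().data  # 150 samples, 4 features

pca = PCA(n_components=2).fit(X)

ratio = pca.explained_variance_ratio_  # share of variance per component
print(ratio, ratio.sum())

# components_ has one row per principal component, one column per feature.
print(pca.components_.shape)  # (2, 4)

# transform() centers with the fitted mean, then projects onto the components.
X_manual = (X - pca.mean_) @ pca.components_.T
print(np.allclose(X_manual, pca.transform(X)))
```

Individual components may differ in sign between this manual projection and other implementations; the explained variance ratios are unaffected by that.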
