# Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is an **unsupervised** technique for reducing the dimensionality of data while preserving as much information (measured as variance) as possible.
It is based on the covariance matrix of the data and on the eigenvectors of that matrix.

Steps of PCA:
1. Compute the covariance matrix
2. Obtain eigenvalues and eigenvectors
3. Sort eigenvectors by decreasing eigenvalues
4. Transform the data

Given a dataset with N dimensions, we first compute the covariance matrix. This gives us information about how each pair of dimensions varies together.

Then we compute the eigenvalues and the corresponding eigenvectors, and sort them from the highest to the lowest eigenvalue. Placing the sorted eigenvectors as the columns of a matrix gives the projection matrix. Keeping only the first k columns and multiplying the data by this matrix projects it onto k dimensions, which is how the dimension is reduced.
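
The four steps above can be sketched with NumPy as follows. This is a minimal illustration rather than a reference implementation: the toy array `X`, the choice `k = 1`, and names such as `X_centered`, `W` and `explained_ratio` are assumptions made for the example. It also centers the data before projecting, which the step list leaves implicit.

```python
import numpy

# Hypothetical toy data: 5 observations (rows) with 2 features (columns)
X = numpy.array([[0, 3], [1, 5], [2, 8], [3, 10], [4, 12]], dtype=float)

# Center each feature on its mean
X_centered = X - X.mean(axis=0)

# 1. Covariance matrix (rowvar=False because features are stored as columns)
cov_matrix = numpy.cov(X_centered, rowvar=False)

# 2. Eigenvalues and eigenvectors; eigh is meant for symmetric matrices
#    and returns the eigenvalues in ascending order
eigenvalues, eigenvectors = numpy.linalg.eigh(cov_matrix)

# 3. Sort by decreasing eigenvalue (eigenvectors are the columns)
order = numpy.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# 4. Keep the first k eigenvectors and project the centered data onto them
k = 1
W = eigenvectors[:, :k]
X_projected = X_centered @ W

# Share of the total variance carried by each component
explained_ratio = eigenvalues / eigenvalues.sum()
print(X_projected.shape, explained_ratio)
```

Here the two features are almost perfectly aligned, so the first component alone carries nearly all of the variance.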

In [2]:
import csv, numpy, matplotlib.pyplot as plt

## Covariance
The covariance between two variables measures how they change together. If two variables increase or decrease together, they have positive covariance. If one increases while the other decreases, they have negative covariance. A covariance of zero means they are uncorrelated, that is, they have no linear relationship (they may still be related in a nonlinear way).
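
As a small illustration of the three cases, the sketch below computes the covariance of one variable against three others. The arrays and the helper `sample_cov` are hypothetical names chosen for this example, and the helper divides by n - 1 (the sample convention).

```python
import numpy

x = numpy.array([0, 1, 2, 3, 4], dtype=float)
y_up = 2 * x + 1             # rises when x rises
y_down = -x                  # falls when x rises
y_curve = (x - 2) ** 2 - 2   # depends on x, but not linearly

def sample_cov(a, b):
    """Sample covariance of two 1-D arrays (divides by n - 1)."""
    return numpy.sum((a - a.mean()) * (b - b.mean())) / (len(a) - 1)

cov_up = sample_cov(x, y_up)        # positive
cov_down = sample_cov(x, y_down)    # negative
cov_curve = sample_cov(x, y_curve)  # zero, although y_curve is a function of x
print(cov_up, cov_down, cov_curve)
```

The last case shows why zero covariance only rules out a linear relationship: `y_curve` is fully determined by `x`, yet the positive and negative products cancel.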

If we have a dataset with n observations, each with two features X and Y, the sample covariance is:

$$\text{Cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \mu_X) (Y_i - \mu_Y)$$

Here $\mu_X$ and $\mu_Y$ are the sample means of X and Y.

Example: we have 5 points in 2 dimensions.

In [22]:
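# Define two simple datasets, each with 5 points in 2D
v1 = numpy.array([[0, 3], [1, 5], [2, 8], [3, 10], [4, 12]])
v2 = numpy.array([[0, 3], [0, 5], [0, 8], [0, 10], [0, 12]])

# Compute the per-feature means (axis=0 averages over the points)
v1_mean = numpy.mean(v1, axis=0)
v2_mean = numpy.mean(v2, axis=0)

# Accumulate the products of the deviations from the means
# (named total rather than sum to avoid shadowing the built-in)
total = 0
for v in v1:
    total += (v[0] - v1_mean[0]) * (v[1] - v1_mean[1])
total2 = 0
for v in v2:
    total2 += (v[0] - v2_mean[0]) * (v[1] - v2_mean[1])

# Sample covariance: divide by n - 1, as in the formula above
cov = total / (len(v1) - 1)
cov2 = total2 / (len(v2) - 1)  # zero, because the first feature of v2 is constant

cov

np.float64(5.75)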
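
As a cross-check, the same quantity can be obtained from `numpy.cov`. The sketch below reuses the five points of the first dataset; the names `points`, `cov_sample` and `cov_population` are chosen for this example. It shows both conventions, since dividing by n - 1 (sample covariance) and dividing by n (population covariance) give different values on such a small dataset.

```python
import numpy

points = numpy.array([[0, 3], [1, 5], [2, 8], [3, 10], [4, 12]])
x, y = points[:, 0], points[:, 1]

# numpy.cov returns the full 2x2 covariance matrix;
# the off-diagonal entry [0, 1] is Cov(X, Y)
cov_sample = numpy.cov(x, y)[0, 1]              # default: divides by n - 1
cov_population = numpy.cov(x, y, ddof=0)[0, 1]  # ddof=0: divides by n

print(cov_sample, cov_population)  # approximately 5.75 and 4.6
```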

## Covariance matrix
PCA uses the covariance matrix because it provides useful information about the relationships between different dimensions. Specifically, it helps identify which dimensions are highly correlated (thus we can keep only one of them) and which dimensions are less correlated (so they capture distinct information).

The covariance matrix captures relationships between variables:
- Large covariance in absolute value (relative to the variances of the two variables) → strong linear relationship between them.
- Covariance close to zero → weak or no linear relationship.

Given a dataset where each data point has N features:
$$\bar{X} = (X_1, X_2, \dots, X_N)$$

The covariance matrix is the N × N matrix:
$$\Sigma =
\begin{bmatrix}
\text{Var}(X_1) & \text{Cov}(X_1, X_2) & \dots & \text{Cov}(X_1, X_N) \\
\text{Cov}(X_2, X_1) & \text{Var}(X_2) & \dots & \text{Cov}(X_2, X_N) \\
\vdots & \vdots & \ddots & \vdots \\
\text{Cov}(X_N, X_1) & \text{Cov}(X_N, X_2) & \dots & \text{Var}(X_N)
\end{bmatrix}$$
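
A minimal sketch of how this matrix can be built from data, assuming observations are stored as rows and features as columns. The toy array `X` and the names `n_obs`, `X_centered` and `cov_matrix` are illustrative. It uses the sample convention (division by the number of observations minus one) and compares the result with `numpy.cov`.

```python
import numpy

# Hypothetical toy data: 5 observations (rows) with 2 features (columns)
X = numpy.array([[0, 3], [1, 5], [2, 8], [3, 10], [4, 12]], dtype=float)
n_obs = X.shape[0]

# Subtract the mean of each feature
X_centered = X - X.mean(axis=0)

# Entry (i, j) is the sample covariance between feature i and feature j
cov_matrix = X_centered.T @ X_centered / (n_obs - 1)
print(cov_matrix)

# The diagonal holds the variances, and the matrix is symmetric
print(numpy.allclose(numpy.diag(cov_matrix), X.var(axis=0, ddof=1)))
print(numpy.allclose(cov_matrix, cov_matrix.T))

# Same result as the library routine (rowvar=False: features are columns)
print(numpy.allclose(cov_matrix, numpy.cov(X, rowvar=False)))
```

Because the covariance of two variables does not depend on their order, the matrix is symmetric, which is why symmetric eigen-solvers can be applied to it.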

