## Principal Component Analysis

- PCA is a dimensionality reduction technique that reduces the number of dimensions (features) in a dataset while retaining as much of the variance as possible.
- It does this by finding a new set of features that are linear combinations of the original features. The new features are called principal components (PCs).
- The first PC is the unit-length linear combination of the original features that has the largest variance.
- The second PC is the linear combination with the largest variance among those orthogonal to the first PC.
- The third PC is the linear combination with the largest variance among those orthogonal to the first two PCs.

This process continues until all of the variance in the original features is accounted for, which takes at most as many PCs as there are original features.
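
The variance-maximizing, mutually orthogonal directions described above can be sketched with an eigendecomposition of the covariance matrix. This is a minimal illustration, not the notebook's method: the helper name `pca_eig` and the random data are assumptions made for the example, and scikit-learn's `PCA` is described in its documentation as SVD-based, so this is an equivalent formulation rather than its implementation.

```python
import numpy as np

def pca_eig(X):
    """Return (components, variances) for X of shape (n_samples, n_features).

    Hypothetical helper for illustration only.
    """
    Xc = X - X.mean(axis=0)                  # center each feature
    cov = Xc.T @ Xc / (X.shape[0] - 1)       # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # reorder by decreasing variance
    return eigvecs[:, order].T, eigvals[order]

# Made-up correlated data, only to exercise the helper
rng = np.random.default_rng(0)
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.5, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing

components, variances = pca_eig(X)
print(components)   # one unit-length direction per row
print(variances)    # variance captured along each direction, largest first
```

Each row of `components` is orthogonal to the others, and the variance of the data projected onto row `k` equals `variances[k]`.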

In [1]:
from sklearn.decomposition import PCA
import numpy as np

A 2D example:

| x | y |
|---|---|
| 1 | 2 |
| 3 | 4 |
| 5 | 6 |

In [2]:
data = np.array([[1, 2],
                 [3, 4],
                 [5, 6]])

Mean of each column (writing $X$ for the x column and $Y$ for the y column):

$\bar{X}=\frac{1}{N}\sum_{i=1}^{N} X_i$

$\bar{Y}=\frac{1}{N}\sum_{i=1}^{N} Y_i$

The population standard deviation $\sigma_x$ (dividing by $N$, as `np.std` does by default) is given by:

$\sigma_x = \sqrt{\frac{\sum_{i=1}^{N}(X_i - \bar{X})^2}{N}}$

Where:
- $N$ is the number of data points,
- $X_i$ represents each individual data point,
- $\bar{X}$ is the mean of the data set.

Then standardize the data by computing a z-score $Z_i$ for each data point:

$Z_i = \frac{X_i - \bar{X}}{\sigma_x}$

Where:
- $X_i$ represents each individual data point,
- $\bar{X}$ is the mean of the data set,
- $\sigma_x$ is the standard deviation of the data set.
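
As a quick check of the formulas above, here is a minimal sketch that applies them by hand to the x column of the example table. The variable names are illustrative; the only assumption is that the population form (divide by $N$) is intended, which matches NumPy's default `ddof=0`.

```python
import numpy as np

x = np.array([1.0, 3.0, 5.0])   # x column of the example table

mean_x = x.sum() / x.size                                # 3.0
sigma_x = np.sqrt(((x - mean_x) ** 2).sum() / x.size)    # sqrt(8/3), about 1.633
z = (x - mean_x) / sigma_x                               # about [-1.2247, 0, 1.2247]

print(mean_x, sigma_x, z)

# np.std divides by N by default (ddof=0), so it agrees with the formula;
# passing ddof=1 would give the sample standard deviation (2.0 here) instead.
print(np.std(x), np.std(x, ddof=1))
```

After standardization the column has mean 0 and standard deviation 1, so features measured on different scales contribute comparably to the PCA.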



In [4]:
mean = np.mean(data, axis=0)        # column-wise mean
std_dev = np.std(data, axis=0)      # column-wise population std (ddof=0)
standardized_data = (data - mean) / std_dev  # z-score each column
print(mean)
print(std_dev)
print(standardized_data)

[3. 4.]
[1.63299316 1.63299316]
[[-1.22474487 -1.22474487]
 [ 0.          0.        ]
 [ 1.22474487  1.22474487]]


In [5]:
pca = PCA()
pca.fit(standardized_data)

In [6]:
principal_components = pca.components_              # one row per principal axis
explained_variance = pca.explained_variance_ratio_  # fraction of total variance per PC

In [10]:
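cumulative_explained_variance = np.cumsum(explained_variance)
threshold = 0.95  # keep enough PCs to explain at least 95% of the variance
# argmax returns the index of the first True, so +1 gives the component count
n_components = np.argmax(cumulative_explained_variance >= threshold) + 1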
n_components

1
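
With only two perfectly correlated features the threshold rule trivially returns 1, so a slightly larger case may make the selection step clearer. The sketch below uses made-up explained-variance ratios (not results from this notebook) to show how the boolean mask and `np.argmax` pick the smallest number of components that reaches the threshold.

```python
import numpy as np

# Hypothetical explained-variance ratios for a 4-feature dataset
ratios = np.array([0.70, 0.20, 0.07, 0.03])
threshold = 0.95

cumulative = np.cumsum(ratios)        # roughly [0.70, 0.90, 0.97, 1.00]
mask = cumulative >= threshold        # [False, False, True, True]

# argmax on a boolean array gives the index of the first True
n_components = int(np.argmax(mask)) + 1
print(n_components)                   # 3

# Equivalent lookup on the sorted cumulative sums
alt = int(np.searchsorted(cumulative, threshold)) + 1
print(alt)                            # 3
```

One caveat: if no entry reaches the threshold, `np.argmax` returns 0 and the rule would report 1 component. Since the cumulative sum can end slightly below 1.0 through floating-point rounding, a threshold of exactly 1.0 may be worth guarding against.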

In [8]:
reduced_data = pca.transform(standardized_data)[:, :n_components]

In [9]:
print("Original Data:\n", data)
print("\nStandardized Data:\n", standardized_data)
print("\nPrincipal Components:\n", principal_components)
print("\nExplained Variance:\n", explained_variance)
print("\nCumulative Explained Variance:\n", cumulative_explained_variance)
print("\nReduced Data ({} Components):\n".format(n_components), reduced_data)

Original Data:
 [[1 2]
 [3 4]
 [5 6]]

Standardized Data:
 [[-1.22474487 -1.22474487]
 [ 0.          0.        ]
 [ 1.22474487  1.22474487]]

Principal Components:
 [[-0.70710678 -0.70710678]
 [-0.70710678  0.70710678]]

Explained Variance:
 [1. 0.]

Cumulative Explained Variance:
 [1. 1.]

Reduced Data (1 Components):
 [[ 1.73205081]
 [ 0.        ]
 [-1.73205081]]
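
The reduced data can be reproduced by hand as a projection onto the first principal component. The sketch below copies the standardized data and the first row of the principal components from the output above. It skips the centering step that `pca.transform` performs, on the assumption that the standardized columns already have zero mean.

```python
import numpy as np

# Values copied from the printed output above
standardized = np.array([[-1.22474487, -1.22474487],
                         [ 0.0,         0.0       ],
                         [ 1.22474487,  1.22474487]])
first_pc = np.array([-0.70710678, -0.70710678])

# Each reduced value is the dot product of a row with the first PC
scores = standardized @ first_pc
print(scores)

# The sign of a principal component is arbitrary: flipping it gives an
# equally valid axis and simply negates the scores.
flipped_scores = standardized @ (-first_pc)
print(flipped_scores)
```

The magnitudes match the reduced data shown above (about 1.732, 0, and 1.732). Only the sign depends on which orientation of the axis the solver returns.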
