## Principal Component Analysis

- PCA is a dimensionality reduction technique that reduces the number of dimensions (features) in a dataset while retaining as much of the variance as possible.
- It does this by finding a new set of features, called principal components (PCs), that are linear combinations of the original features.
- The first PC is the unit-length linear combination of the original features that has the largest variance.
- The second PC is the direction with the largest remaining variance among all directions orthogonal to the first PC.
- The third PC is the direction with the largest remaining variance among all directions orthogonal to the first two PCs.
- This process continues until all of the variance in the original features is accounted for.

In [6]:
from sklearn.decomposition import PCA
import numpy as np

2D example:

| x | y |
|---|---|
| 1 | 2 |
| 3 | 4 |
| 5 | 6 |

In [27]:
data = np.array([[1, 2],
                 [3, 4],
                 [5, 6]])

Mean of $X$ and $Y$:

$\bar{X}=\frac{X_1+X_2+X_3+\ldots+X_N}{N}$

$\bar{Y}=\frac{Y_1+Y_2+Y_3+\ldots+Y_N}{N}$

The population standard deviation $\sigma_x$, which is what `data.std(axis=0)` computes by default, is given by:

$\sigma_x = \sqrt{\frac{\sum_{i=1}^{N}(X_i - \bar{X})^2}{N}}$

Where:
- $N$ is the number of data points,
- $X_i$ represents each individual data point,
- $\bar{X}$ is the mean of the data set.

Then standardize the data to obtain $Z$ using this formula:

$Z_i = \frac{X_i - \bar{X}}{\sigma_x}$

Where:
- $X_i$ represents each individual data point,
- $\bar{X}$ is the mean of the data set,
- $\sigma_x$ is the standard deviation of the data set.



In [36]:
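# Column-wise mean and population standard deviation (ddof=0)
mean = data.mean(axis=0)
std_dev = data.std(axis=0)
# Standardize each feature to zero mean and unit variance
standardized_data = (data - mean) / std_dev
# Transposed view (features as rows), printed for inspection only
transposed_data = standardized_data.T
print(mean)
print(std_dev)
print(transposed_data)
print(standardized_data)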

[3. 4.]
[1.63299316 1.63299316]
[[-1.22474487  0.          1.22474487]
 [-1.22474487  0.          1.22474487]]
[[-1.22474487 -1.22474487]
 [ 0.          0.        ]
 [ 1.22474487  1.22474487]]


In [41]:
std_mean=standardized_data.mean(axis=0)
std_mean

array([0., 0.])
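
As a cross-check on the manual standardization, the same result can be obtained with scikit-learn's `StandardScaler`. This is only a sketch of an alternative, not something the rest of the notebook depends on; the names `manual` and `scaled` are introduced here for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1, 2],
                 [3, 4],
                 [5, 6]], dtype=float)

# Manual standardization with the population standard deviation (ddof=0).
manual = (data - data.mean(axis=0)) / data.std(axis=0)

# StandardScaler also centers each column and divides by the population std.
scaled = StandardScaler().fit_transform(data)

print(np.allclose(manual, scaled))
```

Both approaches divide by $N$ rather than $N-1$, so they agree on this data.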

To calculate the covariance matrix for a set of variables (X, Y), you can use the following steps:

1. Calculate the mean ($ \bar{X}, \bar{Y} $) for each variable.

2. For each pair of variables, compute the covariance ($Cov(X, Y)$) using the formula:

$ Cov(X, Y) = \frac{\sum_{i=1}^{N}(X_i - \bar{X})(Y_i - \bar{Y})}{N-1} $

3. Assemble the covariance values into a matrix:

$ 
\text{Covariance Matrix} = 
\begin{bmatrix}
    Cov(X, X) & Cov(X, Y) \\
    Cov(Y, X) & Cov(Y, Y) \\
\end{bmatrix}
$

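This matrix represents the covariance relationships between the variables X and Y.<br>
**In this case, `standardized_data` is used to calculate `covariance_matrix`.** Because the standardization divides by $N$ while the covariance formula divides by $N-1$, the diagonal entries are $\frac{N}{N-1} = \frac{3}{2} = 1.5$ rather than 1.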

In [38]:
# rowvar=False: each column is a variable, each row is an observation
covariance_matrix = np.cov(standardized_data, rowvar=False)
covariance_matrix

array([[1.5, 1.5],
       [1.5, 1.5]])
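
The covariance formula can also be applied by hand and compared with `np.cov`. The sketch below rebuilds the standardized data so that it runs on its own; `centered` and `manual_cov` are names introduced here for illustration.

```python
import numpy as np

data = np.array([[1, 2],
                 [3, 4],
                 [5, 6]], dtype=float)
standardized_data = (data - data.mean(axis=0)) / data.std(axis=0)

n = standardized_data.shape[0]

# Subtract the column means (already ~0 here, kept to mirror the formula).
centered = standardized_data - standardized_data.mean(axis=0)

# Sum of products of deviations, divided by N - 1.
manual_cov = centered.T @ centered / (n - 1)

print(manual_cov)
print(np.allclose(manual_cov, np.cov(standardized_data, rowvar=False)))
```

The matrix product `centered.T @ centered` computes every pairwise sum $\sum_i (X_i - \bar{X})(Y_i - \bar{Y})$ at once, so the whole covariance matrix comes out in a single step.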

To calculate the eigenvalues of a covariance matrix, you can follow these steps:

1. Formulate and calculate the covariance matrix ($C$) based on your data.

2. Use the determinant equation to find the characteristic equation:

$ \text{det}(C - \lambda I) = 0 $

Where:
- $C$ is the covariance matrix,
- $\lambda$ represents the eigenvalue,
- $I$ is the identity matrix.

3. Solve the characteristic equation for $\lambda$ to find the eigenvalues.

The resulting eigenvalues ($\lambda_1, \lambda_2, \ldots, \lambda_n$) represent the variances along the principal components of the data.
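
As a worked example, here is the characteristic equation for the covariance matrix computed earlier, with every entry equal to 1.5:

$ \text{det}(C - \lambda I) = \begin{vmatrix} 1.5 - \lambda & 1.5 \\ 1.5 & 1.5 - \lambda \end{vmatrix} = (1.5 - \lambda)^2 - 1.5^2 = \lambda^2 - 3\lambda = \lambda(\lambda - 3) = 0 $

So $\lambda_1 = 3$ and $\lambda_2 = 0$. The eigenvalues sum to the trace of $C$ ($1.5 + 1.5 = 3$), and the first one carries $3/3 = 100\%$ of the variance, which is consistent with the two features being perfectly correlated in this data.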

To calculate the eigenvectors corresponding to the eigenvalues ($\lambda_1, \lambda_2, \ldots, \lambda_n$) of a covariance matrix, follow these steps:

4. Substitute each eigenvalue back into the equation $(C - \lambda I) \mathbf{v} = \mathbf{0}$ to solve for the corresponding eigenvector ($\mathbf{v}$).

   For each eigenvalue $\lambda_i$:
   
   $ (C - \lambda_i I) \mathbf{v_i} = \mathbf{0} $<br>
   $ C\mathbf{v_i} - \lambda_i\mathbf{v_i} = \mathbf{0}$

   Where:
   - $C$ is the covariance matrix,
   - $\lambda_i$ is the i-th eigenvalue,
   - $\mathbf{v_i}$ is the i-th eigenvector.

5. The solutions to the above equations will give you the eigenvectors corresponding to each eigenvalue.

The resulting eigenvectors ($\mathbf{v_1, v_2, \ldots, v_n}$) are the directions of the principal components: the eigenvector with the largest eigenvalue is the direction along which the data varies the most.

6. To normalize the eigenvector $\mathbf{v_i}$, you can use the following equation:

$ \mathbf{v_{i, \text{normalized}}} = \frac{\mathbf{v_i}}{\sqrt{\sum_{j=1}^{n} (v_{i_j})^2}} $

Where:
- $\mathbf{v_{i, \text{normalized}}}$ is the normalized eigenvector,
- $\mathbf{v_i}$ is the original eigenvector,
- $\sqrt{\sum_{j=1}^{n} (v_{i_j})^2}$ is the Euclidean norm (L2 norm) of the eigenvector.

7. The reduced data matrix $Y$ is calculated by multiplying the standardized data matrix $Z$ by the matrix $V$ whose columns are the normalized eigenvectors of the largest eigenvalues:

$ Y = Z \cdot V $

Where:
- $Y$ is the reduced data matrix (`reduced_data`),
- $Z$ is the standardized data matrix (`standardized_data`),
- $V$ is the matrix of normalized eigenvectors, one per column.

This transformation allows you to project the original data onto the principal components represented by the eigenvectors.
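
The steps above can be sketched end to end with NumPy. This is a minimal illustration under a few assumptions, not a reference implementation: `np.linalg.eigh` is used because the covariance matrix is symmetric, and the names `Z`, `C`, `V`, `Y`, and `k` are chosen here to mirror the formulas.

```python
import numpy as np

data = np.array([[1, 2],
                 [3, 4],
                 [5, 6]], dtype=float)

# Standardize, then compute the covariance matrix.
Z = (data - data.mean(axis=0)) / data.std(axis=0)
C = np.cov(Z, rowvar=False)

# eigh returns eigenvalues in ascending order and
# unit-length eigenvectors as the columns of the second result.
eigenvalues, eigenvectors = np.linalg.eigh(C)

# Reorder so the largest eigenvalue (first PC) comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Keep the first k eigenvectors and project: Y = Z . V
k = 1
V = eigenvectors[:, :k]
Y = Z @ V

print(eigenvalues)
print(Y)
```

The sign of an eigenvector is arbitrary, so the projected values may come out with the opposite sign from another implementation. The magnitudes should match.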




In [31]:
pca = PCA()
pca.fit(standardized_data)

In [32]:
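# Rows of components_ are the principal directions (unit-length eigenvectors)
principal_components = pca.components_
# Fraction of the total variance explained by each component (sums to 1)
explained_variance = pca.explained_variance_ratio_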

In [33]:
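cumulative_explained_variance = np.cumsum(explained_variance)
threshold = 0.95
# argmax returns the first index where the cumulative ratio reaches the threshold;
# add 1 to turn that index into a component count
n_components = np.argmax(cumulative_explained_variance >= threshold) + 1
n_components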

1

In [8]:
reduced_data = pca.transform(standardized_data)[:, :n_components]

In [9]:
print("Original Data:\n", data)
print("\nStandardized Data:\n", standardized_data)
print("\nPrincipal Components:\n", principal_components)
print("\nExplained Variance:\n", explained_variance)
print("\nCumulative Explained Variance:\n", cumulative_explained_variance)
print("\nReduced Data ({} Components):\n".format(n_components), reduced_data)

Original Data:
 [[1 2]
 [3 4]
 [5 6]]

Standardized Data:
 [[-1.22474487 -1.22474487]
 [ 0.          0.        ]
 [ 1.22474487  1.22474487]]

Principal Components:
 [[-0.70710678 -0.70710678]
 [-0.70710678  0.70710678]]

Explained Variance:
 [1. 0.]

Cumulative Explained Variance:
 [1. 1.]

Reduced Data (1 Components):
 [[ 1.73205081]
 [ 0.        ]
 [-1.73205081]]


In [4]:
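# Fit directly on the raw data: PCA centers the features but does not scale them,
# so the scores differ in magnitude from the standardized case; the sign is arbitrary
pca = PCA(n_components=1)
pc1 = pca.fit_transform(data)
pc1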

array([[-2.82842712],
       [ 0.        ],
       [ 2.82842712]])