<a href="https://colab.research.google.com/github/mayur7garg/66DaysOfData/blob/main/Day%2010/PCA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Principal Component Analysis

## Imports

In [1]:
import numpy as np
from sklearn.decomposition import PCA

## Sample Data

In [2]:
A = np.random.randint(low = 0, high = 10, size = (5, 3))
A

array([[4, 8, 1],
       [8, 6, 2],
       [7, 2, 8],
       [4, 3, 4],
       [3, 6, 2]])

## PCA using Linear Algebra

### Calculate the mean per column/feature

In [3]:
M = A.mean(axis = 0)
M

array([5.2, 5. , 3.4])

### Center the data by subtracting the mean columnwise from the data

In [4]:
C = A - M
C

array([[-1.2,  3. , -2.4],
       [ 2.8,  1. , -1.4],
       [ 1.8, -3. ,  4.6],
       [-1.2, -2. ,  0.6],
       [-2.2,  1. , -1.4]])

### Calculate the covariance matrix
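**Note:** `np.cov` calculates the sample covariance (dividing by `n - 1`) by default. To calculate the population covariance (dividing by `n`), pass `bias = True` as an argument. `np.cov` treats each row as a variable, which is why the centered data is transposed before the call.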

In [5]:
V = np.cov(C.T)
V

array([[ 4.7 , -1.5 ,  2.4 ],
       [-1.5 ,  6.  , -6.25],
       [ 2.4 , -6.25,  7.8 ]])
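
To make the sample versus population distinction concrete, here is a minimal sketch that builds both covariance matrices directly from the centered data and compares them with `np.cov`. The names `V_sample` and `V_population` are illustrative and do not appear in the notebook; `A` is copied from the sample output above.

```python
import numpy as np

# Same sample matrix as in the notebook output above
A = np.array([[4, 8, 1],
              [8, 6, 2],
              [7, 2, 8],
              [4, 3, 4],
              [3, 6, 2]])
C = A - A.mean(axis=0)  # center each column
n = A.shape[0]          # number of observations (rows)

# Sample covariance divides by n - 1 (the np.cov default)
V_sample = C.T @ C / (n - 1)
# Population covariance divides by n (np.cov with bias=True)
V_population = C.T @ C / n

print(np.allclose(V_sample, np.cov(C.T)))
print(np.allclose(V_population, np.cov(C.T, bias=True)))
```

Both checks should print `True`. The two matrices differ only by the constant factor `(n - 1) / n`, so they share the same eigenvectors.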

### Calculate the eigenvalues and eigenvectors of the covariance matrix using eigendecomposition
- **Eigenvalues** give the variance of the data along the corresponding eigenvector
- **Eigenvectors** give the directions (principal axes) of the new subspace. They are perpendicular to each other and are returned as the columns of `vectors`

In [6]:
values, vectors = np.linalg.eig(V)
print(f'Eigen values: \n{values}\n')
print(f'Eigen vectors: \n{vectors}')

Eigen values: 
[14.05177389  3.91663968  0.53158643]

Eigen vectors: 
[[ 0.28685293 -0.95013989 -0.12226853]
 [-0.62018395 -0.28146319  0.73222288]
 [ 0.73012825  0.1342113   0.67000005]]
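
The properties listed above can be checked numerically. The sketch below swaps in `np.linalg.eigh`, which is intended for symmetric matrices and returns eigenvalues in ascending order; this is a substitution for illustration, not what the notebook cell uses. `V` is copied from the covariance output above.

```python
import numpy as np

# Covariance matrix copied from the notebook output above
V = np.array([[ 4.7 , -1.5 ,  2.4 ],
              [-1.5 ,  6.  , -6.25],
              [ 2.4 , -6.25,  7.8 ]])

# eigh assumes a symmetric matrix; eigenvalues come back in ascending order
values, vectors = np.linalg.eigh(V)

# Each column v of `vectors` satisfies V @ v = lambda * v
print(np.allclose(V @ vectors, vectors * values))
# The columns are mutually perpendicular unit vectors
print(np.allclose(vectors.T @ vectors, np.eye(3)))
# The eigenvalues sum to the trace of V, i.e. the total variance
print(np.isclose(values.sum(), np.trace(V)))
```

All three checks should print `True`. The last one is why ratios of eigenvalues can be read as shares of the total variance.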


### Transform the original data by taking a dot product of eigenvectors with the centered data

**Note**: 
- To reduce the dimensionality of the data, only the eigenvectors with the largest eigenvalues are used for the transformation. For instance, to transform the data to 2 dimensions, use only the 2 eigenvectors with the 2 highest eigenvalues. `np.linalg.eig` does not guarantee any ordering of the eigenvalues, so sort them before selecting.
- The sum of the eigenvalues of the eigenvectors used, divided by the sum of all eigenvalues, gives the fraction of the variance retained after the transformation. If all eigenvectors are used, the explained variance is 100% and no information is lost.

In [7]:
P = vectors.T.dot(C.T)
P.T

array([[-3.95708317, -0.0263288 ,  0.73539075],
       [-0.8391753 , -3.12975068, -0.54812908],
       [ 5.73547708, -0.24849028,  0.66524824],
       [ 1.33422134,  1.78362101, -0.91572349],
       [-2.27343995,  1.62094875,  0.06321358]])
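
The two notes above can be sketched as follows: sort the eigenpairs, keep the top `k`, project, and compute the retained variance. The names `k`, `W`, `P_reduced`, and `explained` are illustrative and not part of the notebook.

```python
import numpy as np

# Same sample matrix as in the notebook output above
A = np.array([[4, 8, 1],
              [8, 6, 2],
              [7, 2, 8],
              [4, 3, 4],
              [3, 6, 2]])
C = A - A.mean(axis=0)
values, vectors = np.linalg.eig(np.cov(C.T))

# Sort eigenpairs by eigenvalue, largest first
order = np.argsort(values)[::-1]
values, vectors = values[order], vectors[:, order]

k = 2                # target number of dimensions (assumed for this example)
W = vectors[:, :k]   # top-k eigenvectors as columns, shape (3, k)
P_reduced = C @ W    # projected data, shape (5, k)

# Fraction of the total variance kept by the first k components
explained = values[:k].sum() / values.sum()
print(P_reduced.shape)
print(round(explained, 4))
```

`C @ W` is the transpose of `W.T.dot(C.T)`, the form used in the cell above. With the eigenvalues shown earlier, the first two components keep about 97% of the variance.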

## PCA using sklearn

### Create an instance of the `PCA` class and fit the data
Here `PCA` is given a positive integer specifying the number of dimensions of the reduced subspace after transformation. This number can be at most `min(n_samples, n_features)` of the data used to fit the model, which is 3 for the 5 x 3 sample data.

In [8]:
pca = PCA(3)
pca.fit(A)
pca

PCA(copy=True, iterated_power='auto', n_components=3, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)
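
For comparison, a minimal sketch of an actual reduction to 2 dimensions with the same class. The names `pca2` and `Z` are illustrative; `explained_variance_ratio_` is the fitted attribute holding each component's share of the total variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Same sample matrix as in the notebook output above
A = np.array([[4, 8, 1],
              [8, 6, 2],
              [7, 2, 8],
              [4, 3, 4],
              [3, 6, 2]])

# Keep only the 2 components with the largest variance
pca2 = PCA(n_components=2)
Z = pca2.fit_transform(A)  # fit and project in one step

print(Z.shape)
# Share of the total variance kept by the 2 components
print(round(pca2.explained_variance_ratio_.sum(), 4))
```

The retained share should agree with the ratio of eigenvalues computed by hand in the linear algebra section.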

### Get the eigenvalues and eigenvectors of the data using the `explained_variance_` and `components_` attributes of the fitted model
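The eigenvalues and eigenvectors returned by the `PCA` object are sorted by eigenvalue in descending order. Each row of `components_` is one eigenvector, whereas `np.linalg.eig` returns them as columns. The sign of an eigenvector is arbitrary, so some rows below are the negatives of the columns computed earlier.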

In [9]:
print(f'Eigen values: \n{pca.explained_variance_}\n')
print(f'Eigen vectors: \n{pca.components_}')

Eigen values: 
[14.05177389  3.91663968  0.53158643]

Eigen vectors: 
[[ 0.28685293 -0.62018395  0.73012825]
 [ 0.95013989  0.28146319 -0.1342113 ]
 [ 0.12226853 -0.73222288 -0.67000005]]
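
A short sketch of how the two results can be compared programmatically, allowing for the row/column layout and the arbitrary signs. It uses `np.linalg.eigh` and compares absolute values, which is an illustrative shortcut that relies on the eigenvalues being distinct.

```python
import numpy as np
from sklearn.decomposition import PCA

# Same sample matrix as in the notebook output above
A = np.array([[4, 8, 1],
              [8, 6, 2],
              [7, 2, 8],
              [4, 3, 4],
              [3, 6, 2]])
pca = PCA(n_components=3).fit(A)

# Manual eigendecomposition, sorted by eigenvalue in descending order
values, vectors = np.linalg.eigh(np.cov(A.T))
order = np.argsort(values)[::-1]
values, vectors = values[order], vectors[:, order]

# Eigenvalues should match directly
same_values = np.allclose(pca.explained_variance_, values)
# Eigenvectors: rows in sklearn, columns in NumPy, equal up to sign
same_axes = np.allclose(np.abs(pca.components_), np.abs(vectors.T))
print(same_values, same_axes)
```

Both flags should be `True`, which confirms that the two approaches describe the same principal axes.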


### Use the `transform` method of the `PCA` object to transform any data to the same reduced subspace

In [10]:
B = pca.transform(A)
B

array([[-3.95708317,  0.0263288 , -0.73539075],
       [-0.8391753 ,  3.12975068,  0.54812908],
       [ 5.73547708,  0.24849028, -0.66524824],
       [ 1.33422134, -1.78362101,  0.91572349],
       [-2.27343995, -1.62094875, -0.06321358]])
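
As a rough sketch of what `transform` does with the default `whiten=False`: it subtracts the mean learned during `fit` and projects onto the rows of `components_`. The point `x_new` below is a made-up example and is not part of the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

# Same sample matrix as in the notebook output above
A = np.array([[4, 8, 1],
              [8, 6, 2],
              [7, 2, 8],
              [4, 3, 4],
              [3, 6, 2]])
pca = PCA(n_components=3).fit(A)

# Center with the training mean, then project onto the components
manual = (A - pca.mean_) @ pca.components_.T
print(np.allclose(manual, pca.transform(A)))

# Unseen data is mapped with the same mean and components (hypothetical point)
x_new = np.array([[5, 5, 5]])
print(pca.transform(x_new).shape)

# With all 3 components kept, the original data can be recovered exactly
A_back = pca.inverse_transform(pca.transform(A))
print(np.allclose(A_back, A))
```

Because all three components are kept, the round trip through `inverse_transform` is lossless; with fewer components it would return only an approximation of `A`.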