# Week 3 - Classical ML Models - Part II

## Principal Component Analysis

Principal component analysis (PCA) is a dimensionality reduction technique that identifies the significant correlations in a dataset. It therefore lets us reduce the number of dimensions in the dataset while retaining most of the important information.

### Why do we need PCA?

As we have seen so far, ML models tend to work better with larger datasets: more data allows us to train a more accurate model. On the other hand, as we increase the number of dimensions in the dataset, we observe a few negative effects:
- High-dimensional datasets contain many inconsistencies that reduce the model's accuracy.
- Redundant features increase the computation time.

This is where dimensionality reduction comes in: it extracts only the most significant correlations and reduces the number of dimensions.

### PCA computation steps

#### 1. Standardization

In short, data standardization means rescaling the values of every variable into a similar range. It makes model training less biased, because variables with larger values no longer dominate the whole model.

It is carried out by subtracting the mean from each value and dividing the result by the standard deviation.

![standardization](https://d1jnx9ba8s6j9r.cloudfront.net/blog/wp-content/uploads/2019/08/Standardization-Principal-Component-Analysis-Edureka-300x77.png)
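
As a minimal sketch of this step, the snippet below standardizes a small made-up array by hand and compares the result with scikit-learn's `StandardScaler`. The array `X` is purely illustrative and is not part of the course data. Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 4 samples, 2 features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 700.0]])

# z = (value - mean) / standard deviation, computed per column
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# StandardScaler applies the same formula to every column
X_scaled = StandardScaler().fit_transform(X)

# Both approaches should agree up to floating-point error
print(np.allclose(X_manual, X_scaled))
```

After standardization, every column has mean 0 and standard deviation 1, so no single feature dominates because of its units.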

#### 2. Computing covariance matrix

As mentioned earlier, PCA helps to find the correlations within the dataset. The relationships between the different variables of the dataset can be expressed in a covariance matrix.

Mathematically, the covariance matrix is a $p \times p$ matrix, where $p$ is the number of dimensions (variables) in the dataset.

For example, let's say we have a 2-dimensional dataset consisting of variables a and b. In that case, the covariance matrix can be expressed as:
![covariance matrix](https://d1jnx9ba8s6j9r.cloudfront.net/blog/wp-content/uploads/2019/08/Covariance-Matrix-Principal-Component-Analysis-Edureka-150x61.png)

- $Cov(a, a)$ is the covariance of a variable with itself, which is simply its variance
- $Cov(a, b)$ is the covariance between the two variables; since $Cov(a, b) = Cov(b, a)$, the matrix is symmetric
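
To make the structure of this matrix concrete, here is a small sketch that builds it by hand and compares it with `np.cov`. The values of `a` and `b` are made up for illustration, and the sample covariance (division by $n - 1$) is assumed.

```python
import numpy as np

# Hypothetical toy data: the columns of X are the variables a and b
a = np.array([2.0, 4.0, 6.0, 8.0])
b = np.array([1.0, 3.0, 2.0, 6.0])
X = np.column_stack([a, b])

# Center each column, then apply Cov = X_c^T X_c / (n - 1)
X_centered = X - X.mean(axis=0)
cov_manual = X_centered.T @ X_centered / (X.shape[0] - 1)

# np.cov with rowvar=False treats columns as variables
cov_numpy = np.cov(X, rowvar=False)

print(cov_manual)
print(np.allclose(cov_manual, cov_numpy))
```

The diagonal entries are the variances of `a` and `b`, and the two off-diagonal entries are equal.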

#### 3. Eigenvalues and eigenvectors

Let's say we have a 3-dimensional (3 × 3) matrix **A** containing dataset variables:
![matrix](https://miro.medium.com/max/139/1*OBDgTXEUlUt3wfKblp47BQ.png)

According to theory, if we multiply the matrix by a non-zero vector and the result is just a scalar multiple of the original vector, that vector is called an **eigenvector** of the matrix. The scalar in this expression is the corresponding **eigenvalue**. In mathematical terms: $Ax = \lambda x$, or equivalently $(A - \lambda I)x = 0$.

For a non-zero vector $x$ to satisfy this condition, the determinant of $A - \lambda I$ has to be equal to 0. This is the characteristic equation:

$\det(A - \lambda I) = 0$, where $I$ is the identity matrix.

To better understand this, let's analyze a 2-dimensional example:

![matrix](https://miro.medium.com/max/130/1*4htJZnnvPc6CLad3IqPLbw.png)

After multiplying $\lambda$ by the identity matrix and subtracting the result from the 2-dimensional matrix, we get:
![matrix](https://miro.medium.com/max/144/1*eZSqgvsRvB8-sahbKqseGQ.png)

After computing the determinant, setting it equal to 0 and solving for $\lambda$, we get the following eigenvalues: $\frac{5}{2} + i\frac{\sqrt{15}}{2}$ and $\frac{5}{2} - i\frac{\sqrt{15}}{2}$.

To get the eigenvectors, we substitute each eigenvalue into the equation $(A - \lambda I)x = 0$ and solve it for $x$.
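
The defining relation can also be checked numerically. The sketch below uses a small symmetric matrix chosen for illustration (it is not the matrix from the example above) and verifies both $Ax = \lambda x$ and $\det(A - \lambda I) = 0$ for every eigenpair returned by NumPy.

```python
import numpy as np

# Hypothetical symmetric 2 x 2 matrix, used only for this illustration
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is intended for symmetric matrices; eigenvalues come back in ascending order
eig_vals, eig_vecs = np.linalg.eigh(A)

for i in range(len(eig_vals)):
    lam = eig_vals[i]
    x = eig_vecs[:, i]  # eigenvectors are stored as columns
    satisfies_definition = np.allclose(A @ x, lam * x)
    zero_determinant = np.isclose(np.linalg.det(A - lam * np.eye(2)), 0.0)
    print(lam, satisfies_definition, zero_determinant)
```

A covariance matrix is always symmetric, so its eigenvalues are real, unlike the complex eigenvalues of the general matrix in the worked example.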

#### 4. Computing principal components

In short, principal components are a new set of variables derived from the initial dataset that are significant and uncorrelated with each other. Once we have computed the eigenvectors and eigenvalues, we sort them in descending order of eigenvalue: the first component is formed from the eigenvector with the highest eigenvalue, and so on.
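
The ordering step can be sketched as follows. The eigenvalues are made-up numbers and the identity matrix merely stands in for real eigenvectors. The ratio computed at the end, each eigenvalue divided by their sum, is one common way to quantify how significant a component is.

```python
import numpy as np

# Hypothetical eigenvalues of a covariance matrix, in arbitrary order
eig_vals = np.array([0.5, 2.5, 1.0])
# Matching eigenvectors stored as columns (identity used only as a placeholder)
eig_vecs = np.eye(3)

# Indices that sort the eigenvalues from largest to smallest
order = np.argsort(eig_vals)[::-1]
eig_vals_sorted = eig_vals[order]
eig_vecs_sorted = eig_vecs[:, order]  # reorder columns the same way

# Share of the total variance carried by each principal component
explained_ratio = eig_vals_sorted / eig_vals_sorted.sum()

print(eig_vals_sorted)
print(explained_ratio)
```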


#### 5. Reducing the dimensions of the dataset

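The last step is to re-express the original data along the principal components, ordered by their significance. To do so, we keep the eigenvectors with the largest eigenvalues as the columns of a feature vector, and multiply the transpose of the feature vector by the transpose of the standardized data.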
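
A minimal sketch of this projection is shown below, assuming randomly generated data and an arbitrary choice of `k = 2` components. The names `W` and `X_reduced` are introduced here for illustration. With samples stored as rows, the product can equivalently be written as `X @ W`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 50 samples, 3 features, centered column by column
X = rng.normal(size=(50, 3))
X = X - X.mean(axis=0)

# Eigenvectors of the covariance matrix, ordered by descending eigenvalue
eig_vals, eig_vecs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eig_vals)[::-1]

k = 2  # number of components to keep (an arbitrary choice for this sketch)
W = eig_vecs[:, order[:k]]  # feature vector with shape (p, k)

# Transposed feature vector times transposed data, transposed back to one row per sample
X_reduced = (W.T @ X.T).T

print(X_reduced.shape)
```

The variance of each new column equals the corresponding eigenvalue, which is why sorting by eigenvalue keeps the most informative directions.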

### Python implementation

Such a pipeline could take the following form in code:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

def pca(X):
    # Standardize each feature to zero mean and unit variance
    X = StandardScaler().fit_transform(X)

    # Covariance matrix of the features (p x p); columns are the variables
    cov_mat = np.cov(X, rowvar=False)

    # Eigendecomposition; eigh is intended for symmetric matrices
    # and returns real eigenvalues
    eig_vals, eig_vecs = np.linalg.eigh(cov_mat)

    # Pair each eigenvalue with its eigenvector (a column of eig_vecs)
    eig_pairs = [(eig_vals[i], eig_vecs[:, i]) for i in range(len(eig_vals))]

    # Order the pairs by eigenvalue, largest first
    eig_pairs.sort(key=lambda pair: pair[0], reverse=True)

    return eig_pairs
```

However, as in the previous examples, the Scikit-learn library provides functions for performing PCA computations, which reduces the coding time.

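```python
from sklearn.decomposition import PCA

# PCA using scikit-learn
def pca(X):
    # Keep the two components with the largest explained variance
    model = PCA(n_components=2)
    # Fit the model and return the data projected onto those components
    return model.fit_transform(X)
```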
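
As a usage sketch, the snippet below runs scikit-learn's `PCA` on randomly generated data and compares its `explained_variance_` attribute with the largest eigenvalues of the covariance matrix computed with NumPy. The data and variable names are assumptions made for this example. `PCA` centers the data but does not rescale it, so `StandardScaler` is applied first to mirror the standardization step.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))  # hypothetical data: 100 samples, 4 features

# Standardize first, because PCA only centers the data
X_std = StandardScaler().fit_transform(X)

model = PCA(n_components=2)
X_reduced = model.fit_transform(X_std)

# Eigenvalues of the covariance matrix, sorted from largest to smallest
manual_eig_vals = np.sort(np.linalg.eigvalsh(np.cov(X_std, rowvar=False)))[::-1]

print(X_reduced.shape)
print(np.allclose(model.explained_variance_, manual_eig_vals[:2]))
print(model.explained_variance_ratio_)
```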