## Algorithm of PCA demonstration

In this demonstration, we'll use the PCA algorithm to perform dimensionality reduction on the roadmap example that you saw earlier. The demo uses a 2-D example, but the same steps extend to any number of dimensions. You are encouraged to run the same code on a higher-dimensional dataset as well to see how PCA works.

In [1]:
#Let's import the necessary libraries
import numpy as np
import pandas as pd

In [3]:
#Let's write down the coordinates of all the locations
#and store them in a dataframe.
a = [[2,1],[3,1.5],[4,2],[6,3],[7,3.5],[8,4]]
b = ['X','Y']
data = pd.DataFrame(a,columns = b)
data

   X    Y
0  2  1.0
1  3  1.5
2  4  2.0
3  6  3.0
4  7  3.5
5  8  4.0


Let's now use the PCA algorithm to find the principal components.

### Step 1: Covariance Matrix

In [5]:
#Let's find the covariance matrix of the dataset

C = np.cov(data.T)
C

array([[5.6, 2.8],
       [2.8, 1.4]])

As you can see, the variance along X is 5.6, the variance along Y is 1.4, and cov(X, Y) = 2.8. Let's go ahead and perform its eigendecomposition.
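
To make the numbers above less of a black box, here is a minimal sketch that rebuilds the same covariance matrix by hand. The names `points`, `centred` and `C_manual` are introduced only for this illustration. It assumes the sample covariance (division by n - 1), which is the default used by `np.cov`.

```python
import numpy as np

# Same six locations as in the demo, one row per location.
points = np.array([[2, 1], [3, 1.5], [4, 2], [6, 3], [7, 3.5], [8, 4]])

# Centre each column on its mean (5 for X, 2.5 for Y).
centred = points - points.mean(axis=0)

# Sample covariance: sum of products of deviations divided by n - 1.
n = points.shape[0]
C_manual = centred.T @ centred / (n - 1)

print(C_manual)
print(np.allclose(C_manual, np.cov(points.T)))
```

The hand-built matrix should agree with the output of `np.cov(data.T)` in the cell above.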

### Step 2: Eigendecomposition of Covariance Matrix

To find the eigenvectors, we'll use the numpy.linalg.eig function. You're advised to take a look at its documentation to understand how it works:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.eig.html

In [9]:
#Let's find the eigenvectors and the eigenvalues of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eig(C)

In [10]:
#Let's check the eigenvectors
eigenvectors

array([[ 0.89442719, -0.4472136 ],
       [ 0.4472136 ,  0.89442719]])

As per the documentation, the eigenvectors are stored as the columns of the returned matrix. Therefore the eigenvectors are v1 = (0.89442719, 0.4472136) and v2 = (-0.4472136, 0.89442719).

In [11]:
#Let's check the eigenvalues corresponding to them
eigenvalues

array([ 7.00000000e+00, -2.22044605e-16])

As you can see, the eigenvalues corresponding to v1 and v2 are 7 and 0 respectively. The second value is printed as -2.22044605e-16, which is zero up to floating-point rounding error.
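
As a quick sanity check, the sketch below confirms that each column returned by `np.linalg.eig` really satisfies C v = λ v for its eigenvalue. The covariance matrix is typed in from the output of Step 1, and the loop variable names are chosen only for this illustration.

```python
import numpy as np

# Covariance matrix from Step 1.
C = np.array([[5.6, 2.8], [2.8, 1.4]])
eigenvalues, eigenvectors = np.linalg.eig(C)

# Column i of `eigenvectors` pairs with eigenvalues[i].
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    lam = eigenvalues[i]
    # C @ v should equal lam * v up to rounding error.
    print(i, np.allclose(C @ v, lam * v))

# The eigenvalues should also add up to the total variance (the trace of C).
print(np.isclose(eigenvalues.sum(), np.trace(C)))
```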

In [31]:
### Note that the eigenvalues happen to be sorted in decreasing order here,
### so the basis vector matrix does not need to be reordered.
### np.linalg.eig does not guarantee any ordering, though. If the eigenvalues
### are not sorted, the following code does the sorting:
##idx = eigenvalues.argsort()[::-1]
##eigenvalues = eigenvalues[idx]
##eigenvectors = eigenvectors[:,idx]
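
The sorting step can also be sketched as a runnable snippet. This version swaps in `np.linalg.eigh`, which is intended for symmetric matrices such as a covariance matrix and returns its eigenvalues in ascending order, so it is a variation on the demo rather than the demo's own method. The names `vals_sorted` and `vecs_sorted` are illustrative.

```python
import numpy as np

# Covariance matrix from Step 1.
C = np.array([[5.6, 2.8], [2.8, 1.4]])

# eigh returns eigenvalues in ascending order for a symmetric matrix.
vals, vecs = np.linalg.eigh(C)

# Reorder so the largest eigenvalue (the first principal component) comes first.
idx = np.argsort(vals)[::-1]
vals_sorted = vals[idx]
vecs_sorted = vecs[:, idx]  # reorder columns, not rows

print(vals_sorted.round(2))
```

The columns must be reordered with the same index as the eigenvalues, otherwise each eigenvalue would end up paired with the wrong eigenvector. The sign of each eigenvector may differ from the output of `np.linalg.eig`; both signs describe the same direction.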

### Step 3: Transformation of Data

Now all we need to do is represent our data in the basis given by the eigenvectors. This involves taking the inverse of the eigenvector matrix and multiplying it with the original dataset.

In [12]:
#Let's find the inverse of the eigenvectors matrix.
M = np.linalg.inv(eigenvectors)

In [13]:
M

array([[ 0.89442719,  0.4472136 ],
       [-0.4472136 ,  0.89442719]])
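
Notice that M looks like the eigenvector matrix with its rows and columns swapped. A minimal sketch of why: the eigenvectors of a symmetric matrix with distinct eigenvalues are orthogonal, and `np.linalg.eig` returns them with unit length, so the inverse coincides with the transpose. The variable names below are chosen for this illustration.

```python
import numpy as np

# Covariance matrix from Step 1.
C = np.array([[5.6, 2.8], [2.8, 1.4]])
eigenvalues, eigenvectors = np.linalg.eig(C)

M = np.linalg.inv(eigenvectors)

# For an orthonormal eigenvector matrix, the inverse equals the transpose.
print(np.allclose(M, eigenvectors.T))

# Equivalently, V^T V is the identity matrix.
print(np.allclose(eigenvectors.T @ eigenvectors, np.eye(2)))
```

In practice this means the transpose can stand in for the inverse when the basis is orthonormal, which avoids a matrix inversion.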

In [17]:
#Now we go ahead and obtain the new set of datapoints
#by multiplying M with each datapoint.
#We'll take a transpose of the original dataset to do this directly for all the points

Data2 = M @ data.T

In [21]:
#Now let's do a transpose again to find the datapoints in the new basis representation
NewData = Data2.T
#Rounding off for better readability
NewData.round(2)

array([[ 2.24, -0.  ],
       [ 3.35, -0.  ],
       [ 4.47, -0.  ],
       [ 6.71, -0.  ],
       [ 7.83, -0.  ],
       [ 8.94, -0.  ]])

As you can see, the second column is now completely redundant because all of its values are 0. If we drop it, we still retain all the information.
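
To illustrate that nothing is lost, the sketch below keeps only the first coordinate of each point and then rebuilds the original 2-D points from it. The names `reduced` and `reconstructed` are illustrative. Note one assumption specific to this dataset: the points lie on a line through the origin, which is why the reconstruction is exact without centring the data first.

```python
import numpy as np

# Same six locations as in the demo.
points = np.array([[2, 1], [3, 1.5], [4, 2], [6, 3], [7, 3.5], [8, 4]])

C = np.cov(points.T)
eigenvalues, eigenvectors = np.linalg.eig(C)

# Pick the eigenvector with the largest eigenvalue.
v1 = eigenvectors[:, np.argmax(eigenvalues)]

# Project every point onto v1: one number per location instead of two.
reduced = points @ v1

# Map the 1-D coordinates back into the original X-Y space.
reconstructed = np.outer(reduced, v1)

print(reduced.shape)
print(np.allclose(reconstructed, points))
```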

In [35]:
## Check the covariance matrix now
np.cov(NewData.T).round(2)

array([[ 7., -0.],
       [-0.,  0.]])

As we can see, the covariance matrix has been diagonalised. The diagonal entries, 7 and 0, are the eigenvalues found in Step 2.
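
The diagonal form is not a coincidence. If the data is transformed by M, its covariance becomes M C Mᵀ, and with M built from the eigenvectors of C this product is the diagonal matrix of eigenvalues. A minimal sketch of that check, with `C_new` as an illustrative name:

```python
import numpy as np

# Covariance matrix from Step 1.
C = np.array([[5.6, 2.8], [2.8, 1.4]])
eigenvalues, eigenvectors = np.linalg.eig(C)
M = np.linalg.inv(eigenvectors)

# Covariance of the transformed data, computed without touching the data itself.
C_new = M @ C @ M.T

print(C_new.round(2))
print(np.allclose(C_new, np.diag(eigenvalues)))
```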