## Implementation of Principal Component Analysis from scratch using NumPy

1. Importing and generating the data

In [18]:
import numpy as np

In [19]:
np.random.seed(42)

In [20]:
x = np.random.randn(500)
y = 2*x + np.random.randn(500) * 0.3

In [21]:
data = np.column_stack((x,y))

In [22]:
type(data)

numpy.ndarray

2. PCA assumes the data is centered at the origin, so we subtract the mean of each feature

In [23]:
centered_data = data - np.mean(data, axis=0)

Since the covariance matrix depends on the units of each feature, we also standardize the data by dividing each centered feature by its standard deviation

In [24]:
standardized_data = (data - np.mean(data, axis=0))/data.std(axis=0)
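A minimal sketch of why the units matter. The factor of 1000 is a hypothetical unit change (for example metres to millimetres) and the `standardize` helper is introduced only for this illustration. The raw covariance matrix changes with the rescaling, while the covariance of the standardized data does not.

```python
import numpy as np

np.random.seed(42)
x = np.random.randn(500)
y = 2 * x + np.random.randn(500) * 0.3
data = np.column_stack((x, y))

# Hypothetical unit change: express the second feature in units 1000x smaller
rescaled = data * np.array([1.0, 1000.0])

def standardize(a):
    # Center each column and divide by its standard deviation (ddof=0, as above)
    return (a - a.mean(axis=0)) / a.std(axis=0)

cov_raw = np.cov(data, rowvar=False)
cov_rescaled = np.cov(rescaled, rowvar=False)
cov_std = np.cov(standardize(data), rowvar=False)
cov_std_rescaled = np.cov(standardize(rescaled), rowvar=False)

print(np.allclose(cov_raw, cov_rescaled))      # raw covariance depends on the units
print(np.allclose(cov_std, cov_std_rescaled))  # standardized covariance does not
```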

3. Create a covariance matrix

In [25]:
cov_matrix = np.cov(standardized_data, ddof=1, rowvar=False)

We use ddof=1, that is, we divide by n−1 instead of n when computing the variance from a sample (Bessel's correction). This corrects for the fact that deviations measured from the sample mean, rather than the true mean, underestimate the variability.
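As a rough check of what `np.cov` does with ddof=1, the sketch below rebuilds the matrix by hand from the standardized data. The names `manual_cov` and `numpy_cov` are used only for this illustration. Note that `data.std(axis=0)` defaults to ddof=0, so the diagonal of the resulting matrix comes out as n/(n−1) rather than exactly 1.

```python
import numpy as np

np.random.seed(42)
x = np.random.randn(500)
y = 2 * x + np.random.randn(500) * 0.3
data = np.column_stack((x, y))

standardized_data = (data - np.mean(data, axis=0)) / data.std(axis=0)
n = standardized_data.shape[0]

# The columns already have zero mean, so the covariance is X^T X / (n - 1)
manual_cov = standardized_data.T @ standardized_data / (n - 1)
numpy_cov = np.cov(standardized_data, ddof=1, rowvar=False)

print(np.allclose(manual_cov, numpy_cov))
# Diagonal is n / (n - 1) because std used ddof=0 while cov used ddof=1
print(np.diag(numpy_cov), n / (n - 1))
```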

4. Compute the eigenvalues and eigenvectors of the covariance matrix

In [17]:
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

In [26]:
ordered_pc = np.argsort(eigenvalues)[::-1]

In [32]:
print(eigenvalues[ordered_pc],'\n')
# np.linalg.eig returns eigenvectors as columns, so reorder the columns
print(eigenvectors[:, ordered_pc])


[1.99280863 0.01119939] 

[[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]]
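A small sanity check, written as a sketch with illustrative names (`sorted_values`, `sorted_vectors`). `np.linalg.eig` stores each eigenvector as a column, so column `i` should satisfy C v = λ v for the `i`-th eigenvalue after sorting, and the columns should be orthonormal because the covariance matrix is symmetric.

```python
import numpy as np

np.random.seed(42)
x = np.random.randn(500)
y = 2 * x + np.random.randn(500) * 0.3
data = np.column_stack((x, y))

standardized_data = (data - np.mean(data, axis=0)) / data.std(axis=0)
cov_matrix = np.cov(standardized_data, ddof=1, rowvar=False)

eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
order = np.argsort(eigenvalues)[::-1]   # indices of eigenvalues, largest first

sorted_values = eigenvalues[order]
sorted_vectors = eigenvectors[:, order]  # reorder columns, not rows

# Each column v_i should satisfy cov_matrix @ v_i == lambda_i * v_i
for i in range(len(sorted_values)):
    v = sorted_vectors[:, i]
    print(np.allclose(cov_matrix @ v, sorted_values[i] * v))

# Columns should be orthonormal: V^T V = I
print(np.allclose(sorted_vectors.T @ sorted_vectors, np.eye(2)))
```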


explained variance
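One common way to express this, sketched here under the assumption that the explained variance ratio of a component is its eigenvalue divided by the sum of all eigenvalues. The names `explained_variance_ratio` and `cumulative_ratio` are chosen for this illustration. With the eigenvalues printed above, the first component accounts for roughly 99.4% of the total variance.

```python
import numpy as np

np.random.seed(42)
x = np.random.randn(500)
y = 2 * x + np.random.randn(500) * 0.3
data = np.column_stack((x, y))

standardized_data = (data - np.mean(data, axis=0)) / data.std(axis=0)
cov_matrix = np.cov(standardized_data, ddof=1, rowvar=False)

eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
ordered_pc = np.argsort(eigenvalues)[::-1]
sorted_eigenvalues = eigenvalues[ordered_pc]

# Fraction of the total variance captured by each principal component
explained_variance_ratio = sorted_eigenvalues / sorted_eigenvalues.sum()
# Running total, useful for deciding how many components to keep
cumulative_ratio = np.cumsum(explained_variance_ratio)

print(explained_variance_ratio)
print(cumulative_ratio)
```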