# Principal Component Analysis



Perform PCA on the dataset to reduce each sample to a 10-dimensional feature vector.

=========================================================================
- Implementing the PCA algorithm.
    - Start
        - Input: $n$ samples as a matrix $X$ with $n$ rows and $k$ columns.
        - Calculate the mean of each column $j$. $$mean_j = \frac {1}{n} \sum \limits _{i=1} ^{n}X_{ij}$$
        - Calculate the centralised matrix $X_C$ and the covariance matrix $C$. $$X_C=X-mean$$ $$C = \frac {1}{n}(X_C)^TX_C$$
        - Calculate the eigenvalues and eigenvectors of the covariance matrix $C$.
        - Select the top $x$ principal components, which are the eigenvectors corresponding to the $x$ largest eigenvalues, and use them as the columns of matrix $P$.
    - End
    
- Transforming the data using the principal components (matrix $P$) obtained from the PCA algorithm. $$Transformed \: Data = XP$$
- Calculating the covariance matrix of the transformed data by first centralising it (subtracting the column means) and then computing the covariance matrix as above.
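
The covariance matrix of the transformed data can be worked out in closed form, which gives a useful check on the numerical result. The derivation below is a sketch: the symbols $Y$, $Y_C$, $C_Y$, $p_i$ and $\lambda_i$ are introduced here for illustration, and it assumes the eigenvectors in $P$ are chosen orthonormal, which is always possible for a symmetric matrix such as $C$. Let $Y = XP$. Subtracting the column means of $Y$ gives $Y_C = (X - mean)P = X_CP$, so

$$C_Y = \frac {1}{n}(Y_C)^TY_C = P^T \left( \frac {1}{n}(X_C)^TX_C \right) P = P^TCP$$

Each column $p_i$ of $P$ satisfies $Cp_i = \lambda_i p_i$, and $p_i^Tp_j$ is $1$ when $i = j$ and $0$ otherwise, so

$$C_Y = diag(\lambda_1, \dots, \lambda_x)$$

Up to floating-point error, the covariance matrix of the transformed data should therefore be diagonal, and the sum of its entries should equal the sum of the $x$ largest eigenvalues of $C$. The result is the same whether $P$ is applied to $X$ or to $X_C$, because centralising the transformed data removes the difference.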

In [1]:

import cvxpy as cp # convex optimization
import pandas as pd # data processing
import numpy as np # linear algebra

train = "train.csv"
df = pd.read_csv(train, header=None) # read csv file
X = df[:2000].iloc[:, 1:].to_numpy() # convert to numpy array

In [None]:
# Selecting top 10 Principal components
no_of_components = 10

covariance_matrix_X = 0
covariance_matrix_X_transformed = 0

# ====================== CODE HERE ======================  
# Reference: webpage on PCA implementation <https://towardsdatascience.com/a-step-by-step-implementation-of-principal-component-analysis-5520cc6cd598>

mean = np.mean(X, axis=0)  # Mean of each column

X_C = X - mean # Calculate the centralized matrix X_C 
C = (1 / X.shape[0]) * X_C.T @ X_C # and covariance matrix C

# C is symmetric, so use eigh: it returns real eigenvalues and orthonormal eigenvectors,
# whereas eig can return values with spurious complex parts for such matrices
eigenvalues, eigenvectors = np.linalg.eigh(C)

# Select top x principal components
sorted_indices = np.argsort(eigenvalues)[::-1] # Sort eigenvalues in descending order
eigenvectors = eigenvectors[:, sorted_indices] # Sort eigenvectors according to eigenvalues
P = eigenvectors[:, :no_of_components] # Select the top x eigenvectors

# Transform the data using the principal components matrix P
X_transformed = X_C @ P

# Calculate the covariance matrix of the transformed data
X_transformed_C = X_transformed - np.mean(X_transformed, axis=0)
C_transformed = (1 / X_transformed.shape[0]) * X_transformed_C.T @ X_transformed_C 

# Store the covariance matrices of the original and transformed data; their entries are summed below
covariance_matrix_X = C
covariance_matrix_X_transformed = C_transformed

# ==========================================================================

sum_cov_X = np.sum(covariance_matrix_X)
sum_cov_X_transformed = np.sum(covariance_matrix_X_transformed)

print('Please copy the following result to Question 3 "Cov X = )"')
print(np.round(sum_cov_X, 2))
print('Please copy the following result to Question 3 "Cov X_transformed = )"')
print(np.round(sum_cov_X_transformed, 2))
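
As a sanity check on the steps above, the following minimal sketch runs the same procedure on a small synthetic matrix and verifies that the covariance of the transformed data is diagonal, with the largest eigenvalues of $C$ on the diagonal. The function name `pca_sketch`, the random data `X_demo`, and the choice of 3 components are assumptions made for illustration; they are not part of the assignment and do not use `train.csv`.

```python
import numpy as np

def pca_sketch(X, no_of_components):
    """Hypothetical helper: PCA via eigendecomposition of the covariance matrix."""
    X_C = X - X.mean(axis=0)                 # centralise each column
    C = X_C.T @ X_C / X.shape[0]             # covariance matrix with 1/n scaling
    eigenvalues, eigenvectors = np.linalg.eigh(C)  # eigh suits symmetric matrices
    order = np.argsort(eigenvalues)[::-1]    # indices of eigenvalues, largest first
    top_values = eigenvalues[order[:no_of_components]]
    P = eigenvectors[:, order[:no_of_components]]  # top eigenvectors as columns
    return X_C @ P, P, top_values, C

# Synthetic, correlated data (assumption: 200 samples, 6 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))

X_t, P, top_values, C = pca_sketch(X_demo, 3)

# Covariance of the transformed data, computed the same way as for X
X_t_C = X_t - X_t.mean(axis=0)
C_t = X_t_C.T @ X_t_C / X_t.shape[0]

# Both checks should hold up to floating-point error
print(np.allclose(C_t, np.diag(top_values)))
print(np.isclose(C_t.sum(), top_values.sum()))
```

If either check fails on real data, a likely cause is an eigenvector ordering that does not match the sorted eigenvalues, or a different scaling factor (for example $\frac{1}{n-1}$ instead of $\frac{1}{n}$) in one of the two covariance computations.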
