**OBJECTIVE**:
* **UNDERSTANDING PRINCIPAL COMPONENT ANALYSIS FROM SCRATCH**
* **COMPARING THE RESULT WITH THE LIBRARY**

In [53]:
# Import the libraries
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

In [54]:
# Create the dataset and convert it into a DataFrame
x1 = np.array([1,2,3,4,5])
x2 = np.array([2,3,5,8,11])
df = pd.DataFrame({'x1':x1, 'x2':x2})
df

   x1  x2
0   1   2
1   2   3
2   3   5
3   4   8
4   5  11


In [55]:
df.mean()  # mean of each column

x1    3.0
x2    5.8
dtype: float64

In [56]:
df.var()  # sample variance of each column (ddof=1 by default)

x1     2.5
x2    13.7
dtype: float64

In [57]:
df.cov()  # sample covariance matrix (ddof=1 by default)

      x1     x2
x1  2.50   5.75
x2  5.75  13.70
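
The covariance matrix above can be reproduced by hand from the centered data. The following is a minimal sketch, assuming the sample covariance (division by n - 1, which matches the pandas default); the names `X`, `X_centered`, and `cov_manual` are illustrative and are not used elsewhere in the notebook.

```python
import numpy as np

# Same toy data as above, one row per sample.
X = np.array([[1, 2], [2, 3], [3, 5], [4, 8], [5, 11]], dtype=float)

# Center each column, then form (X_c^T X_c) / (n - 1).
X_centered = X - X.mean(axis=0)
n_samples = X.shape[0]
cov_manual = X_centered.T @ X_centered / (n_samples - 1)

print(cov_manual)
```

The diagonal holds the variances of `x1` and `x2`, and the off-diagonal entry is their covariance.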


In [58]:
eigvals, evecs = np.linalg.eig(df.cov(ddof=1))  # eigenvalues and eigenvectors of the covariance matrix

In [59]:
eigvals  # eigenvalues (np.linalg.eig does not guarantee any ordering)

array([ 0.07363719, 16.12636281])

In [60]:
evecs  # eigenvectors, one per column

array([[-0.92133078, -0.38877961],
       [ 0.38877961, -0.92133078]])

In [61]:
e1 = evecs[:, 0]  # eigenvector for the first eigenvalue (column 0)

In [62]:
e2 = evecs[:, 1]  # eigenvector for the second eigenvalue (column 1)

In [63]:
evecs[:, 0].dot(evecs[:, 1])  # dot product of the two eigenvectors

0.0

In [64]:
e1.dot(e2)  # same dot product; a zero result shows the eigenvectors are orthogonal

0.0

In [65]:
np.linalg.inv(evecs)  # inverse of the eigenvector matrix; equals its transpose because the columns are orthonormal

array([[-0.92133078,  0.38877961],
       [-0.38877961, -0.92133078]])
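
The eigen-decomposition can be sanity-checked directly. The sketch below is an illustration rather than part of the original notebook: it swaps in `np.linalg.eigh`, NumPy's solver for symmetric matrices, which returns eigenvalues in ascending order, and the names `residual` and `gram` are hypothetical helpers.

```python
import numpy as np

# Covariance matrix of the toy data, copied from the output above.
cov = np.array([[2.5, 5.75], [5.75, 13.7]])

# eigh is intended for symmetric matrices; eigenvalues come back in ascending order.
eigvals, eigvecs = np.linalg.eigh(cov)

# Each column v of eigvecs should satisfy cov @ v == lambda * v.
residual = cov @ eigvecs - eigvecs * eigvals

# For orthonormal eigenvectors, V^T V is the identity, so inv(V) equals V^T.
gram = eigvecs.T @ eigvecs

print(np.allclose(residual, 0), np.allclose(gram, np.eye(2)))
```

Both checks passing means the transpose could be used in place of `np.linalg.inv(evecs)` when projecting the data.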

In [66]:
df 

   x1  x2
0   1   2
1   2   3
2   3   5
3   4   8
4   5  11


In [67]:
df.mean()  # mean of columns x1 and x2

x1    3.0
x2    5.8
dtype: float64

In [68]:
df_mnc = df - df.mean()  # center the data: subtract each column's mean from its values
df_mnc

    x1   x2
0 -2.0 -3.8
1 -1.0 -2.8
2  0.0 -0.8
3  1.0  2.2
4  2.0  5.2


In [69]:
data = np.asmatrix(df_mnc).T  # centered data as a matrix, transposed so each column is one sample (np.mat is not available in NumPy 2.0+)
data

matrix([[-2. , -1. ,  0. ,  1. ,  2. ],
        [-3.8, -2.8, -0.8,  2.2,  5.2]])

In [70]:
inv_evecs = np.linalg.inv(evecs)  # inverse of the eigenvector matrix (not of the eigenvalues)
inv_evecs

array([[-0.92133078,  0.38877961],
       [-0.38877961, -0.92133078]])

In [71]:
proj_mat = np.linalg.inv(evecs) @ data  # project the centered data onto the eigenvectors
proj_mat

matrix([[ 0.36529905, -0.16725212, -0.31102369, -0.06601564,  0.1789924 ],
        [ 4.2786162 ,  2.96850581,  0.73706463, -2.41570734, -5.5684793 ]])

In [72]:
proj_mat.T  # transpose so rows are samples and columns are components

matrix([[ 0.36529905,  4.2786162 ],
        [-0.16725212,  2.96850581],
        [-0.31102369,  0.73706463],
        [-0.06601564, -2.41570734],
        [ 0.1789924 , -5.5684793 ]])

In [73]:
# Convert to a DataFrame
pca=pd.DataFrame(proj_mat.T,columns=["pca1","pca2"])
pca

       pca1      pca2
0  0.365299  4.278616
1 -0.167252  2.968506
2 -0.311024  0.737065
3 -0.066016 -2.415707
4  0.178992 -5.568479


In [74]:
# Total variance is unchanged by the projection
sum(df.var(ddof=1)), sum(pca.var(ddof=1))

(16.2, 16.2)

In [75]:
df.var(ddof=1)   #variance before pca

x1     2.5
x2    13.7
dtype: float64

In [76]:
pca.var(ddof=1)  #variance after pca

pca1     0.073637
pca2    16.126363
dtype: float64
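
The variances after projection equal the eigenvalues, but in the order `np.linalg.eig` happened to return them, so here the low-variance component comes first. A minimal sketch of sorting the components by descending eigenvalue and turning the eigenvalues into explained-variance ratios follows; the arrays are copied from the outputs above, and `order` and `explained_ratio` are illustrative names.

```python
import numpy as np

# Eigenvalues and eigenvectors copied from the earlier cell outputs.
eigvals = np.array([0.07363719, 16.12636281])
eigvecs = np.array([[-0.92133078, -0.38877961],
                    [ 0.38877961, -0.92133078]])

# Sort eigenpairs so the largest-variance direction comes first.
order = np.argsort(eigvals)[::-1]
eigvals_sorted = eigvals[order]
eigvecs_sorted = eigvecs[:, order]

# Share of the total variance carried by each component.
explained_ratio = eigvals_sorted / eigvals_sorted.sum()
print(explained_ratio)
```

With this ordering, the first component carries roughly 99.5% of the total variance of 16.2.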

# PCA using the library

In [79]:
# Create the dataframe
x1 = np.array([1, 2, 3, 4, 5])
x2 = np.array([2, 3, 5, 8, 11])
df = pd.DataFrame({'x1': x1, 'x2': x2})

# Perform PCA
pca = PCA(n_components=2)  # Specify the number of components (in this case, 2)
principal_components = pca.fit_transform(df)

# Create a new dataframe with the principal components
principal_df = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])

# Explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_

# Print the results
print("Principal Components:")
print(principal_df)
print("\nExplained Variance Ratio:")
print(explained_variance_ratio)

Principal Components:
        PC1       PC2
0 -4.278616  0.365299
1 -2.968506 -0.167252
2 -0.737065 -0.311024
3  2.415707 -0.066016
4  5.568479  0.178992

Explained Variance Ratio:
[0.99545449 0.00454551]
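
The library scores match the from-scratch projection except for column order and sign: `PC1` is the negative of `pca2`, and `PC2` equals `pca1`. This is expected, since an eigenvector is only defined up to sign. The sketch below illustrates one way to check the agreement, assuming components are sorted by descending eigenvalue and compared by absolute value; `scores_scratch` and `scores_sklearn` are hypothetical names.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1, 2], [2, 3], [3, 5], [4, 8], [5, 11]], dtype=float)
X_centered = X - X.mean(axis=0)

# From-scratch scores, with components sorted by descending eigenvalue.
eigvals, eigvecs = np.linalg.eigh(np.cov(X_centered, rowvar=False))
order = np.argsort(eigvals)[::-1]
scores_scratch = X_centered @ eigvecs[:, order]

# Library scores.
scores_sklearn = PCA(n_components=2).fit_transform(X)

# Eigenvectors are defined only up to sign, so compare absolute values.
same_up_to_sign = np.allclose(np.abs(scores_scratch), np.abs(scores_sklearn))
print(same_up_to_sign)
```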


LEARNINGS:

* Dimensionality Reduction: PCA reduces the number of features in a dataset while retaining most of its variance. Projecting the data into a lower-dimensional space simplifies analysis and visualization, especially for datasets with many variables.

* Feature Extraction: PCA builds new features, the principal components, as linear combinations of the original variables. The components are mutually orthogonal, and each is chosen to capture as much of the remaining variance as possible. They can then serve as input to later analysis or modeling.

* Noise Reduction: Dropping the components with low variance can remove noise, on the assumption that low-variance directions carry little signal. In the example above, one component explains about 99.5% of the variance, so the other adds very little.

* Model Efficiency: Fewer dimensions mean less computation for downstream machine learning models. Removing redundant or noisy features can also reduce overfitting, although PCA ignores the target variable, so a gain in accuracy is not guaranteed.
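
The dimensionality-reduction point can be illustrated on the same toy data by keeping a single component and mapping it back to the original space. This is a minimal sketch, assuming scikit-learn's `PCA` with `n_components=1`; `X_reduced`, `X_restored`, and `reconstruction_mse` are illustrative names.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1, 2], [2, 3], [3, 5], [4, 8], [5, 11]], dtype=float)

# Keep only the highest-variance component.
pca_1d = PCA(n_components=1)
X_reduced = pca_1d.fit_transform(X)               # shape (5, 1)
X_restored = pca_1d.inverse_transform(X_reduced)  # back to shape (5, 2)

# Mean squared distance between each sample and its reconstruction.
reconstruction_mse = np.mean(np.sum((X - X_restored) ** 2, axis=1))
print(X_reduced.shape, X_restored.shape, reconstruction_mse)
```

The reconstruction error is small because the discarded component carries under 0.5% of the variance.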