# **PCA**: **Principal Component Analysis**

**PCA Characteristics:**

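*   Feature extraction technique used to reduce dimensionality and so mitigate the curse of dimensionality.
*   Unsupervised machine learning technique: it uses only the input features and ignores any output labels.
*   The underlying math (covariance matrices, eigenvalues, eigenvectors) is fairly involved.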



**Analogy to Understand how PCA works:**

In a soccer match, a photographer wants to capture an image of the game. The match takes place in 3D space, but the photograph records it on a 2D plane. *This is exactly what PCA does: it reduces the number of features, and so the dimensionality, of the data*.

**'PCA is a technique that transforms higher-dimensional data into lower-dimensional data while preserving the essence of the data.'**

So even when the ML model works on the lower-dimensional data, the results remain good.

**Benefits**:

1. Makes the data smaller, which leads to faster execution of algorithms.

2. Visualisation: PCA can reduce, say, 10D data to 3D so that we can plot and inspect it.





**Geometric Intuition:**

Start with feature selection. When you have no idea how the features/columns affect the predictions, simply plot them:

i) look at which feature has the greater spread on the graph, and

ii) compare the variances of the features.

The features with the greater spread, and therefore the greater variance, are prioritized: those columns are treated as the important ones and are selected.

Now suppose we were given some columns and all of them had a high impact on the prediction. How would you choose features then?

Also, if the features have a linear relationship among themselves, their spread on the graph, and so their variance, would be almost the same. How would you choose between them?

**This is where PCA (feature extraction) comes into play.**
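
A small sketch of that dilemma, assuming two made-up features (`feature_a` and `feature_b` are illustrative names, not columns of the dataset used later). When one feature is nearly a linear function of the other, their variances come out almost identical, so variance alone cannot say which one to keep.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical features with a near-linear relationship.
feature_a = rng.normal(size=500)
feature_b = feature_a + rng.normal(scale=0.1, size=500)

var_a = feature_a.var()
var_b = feature_b.var()
corr = np.corrcoef(feature_a, feature_b)[0, 1]

print(f"variance of feature_a: {var_a:.3f}")
print(f"variance of feature_b: {var_b:.3f}")
print(f"correlation between them: {corr:.3f}")
```

The two variances are nearly equal and the correlation is close to 1, which is exactly the case where picking one column by spread is arbitrary.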



To solve this problem of choosing between correlated features, PCA creates new features that combine the properties of the original ones, which makes our work easier.

OR

It creates a new set of features from the existing set, ranks them by how much variance each one captures, and keeps the top-ranked subset.

**How does PCA choose the most important features?**

It rotates the coordinate axes so that data which looked diagonal (/) in the original axes now lies along one of the new axes. The spread (variance) along each new axis is then measured, and the axes with the greatest variance are kept as the new features.

**Number of principal components <= number of original features in the data**
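
The rotation idea can be sketched on toy 2D data. This is only an illustration under stated assumptions: the data is synthetic, and the 45 degree angle is chosen by hand because the points are built to lie along y = x (PCA would find the angle itself).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2D data lying roughly along the diagonal y = x.
x = rng.normal(size=300)
y = x + rng.normal(scale=0.2, size=300)
data = np.column_stack([x, y])
data = data - data.mean(axis=0)  # mean centering

# Rotate the coordinate axes by 45 degrees so the diagonal becomes the new x-axis.
theta = np.deg2rad(45)
new_axes = np.array([[np.cos(theta), np.sin(theta)],    # direction of the new x-axis
                     [-np.sin(theta), np.cos(theta)]])  # direction of the new y-axis
rotated = data @ new_axes.T

var_before = data.var(axis=0)
var_after = rotated.var(axis=0)
print("variance per axis before rotation:", var_before)
print("variance per axis after rotation: ", var_after)
```

Before the rotation both axes carry a similar amount of variance; afterwards almost all of it sits on the first new axis, so the second can be dropped with little loss. The total variance is unchanged by the rotation.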



Variance is a measure of the spread of the data: the wider the spread along an axis, the larger the variance.

**Why is variance important in PCA?**

Variance measures spread, and a greater spread along an axis means that axis carries more information for telling data points apart. The goal of PCA is therefore to find the axes along which the projected data has the maximum variance.



The mean locates the center of the data.

**Covariance**: a measure of the relationship between two random variables, describing to what extent they change together. A positive covariance means the two variables tend to increase together, a negative covariance means one tends to decrease as the other increases, and a covariance near zero means there is no linear relationship between them.
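
A quick numeric illustration of the sign of the covariance, using invented numbers (`hours_studied`, `exam_score` and `hours_of_tv` are hypothetical variables made up for this sketch):

```python
import numpy as np

hours_studied = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
exam_score = np.array([52.0, 58.0, 65.0, 70.0, 78.0])  # rises with hours_studied
hours_of_tv = np.array([9.0, 7.0, 6.0, 4.0, 2.0])      # falls with hours_studied

# np.cov returns a 2x2 matrix; the off-diagonal entry is the covariance.
cov_pos = np.cov(hours_studied, exam_score)[0, 1]
cov_neg = np.cov(hours_studied, hours_of_tv)[0, 1]

# The same value computed by hand: average product of deviations (n - 1 divisor).
manual = ((hours_studied - hours_studied.mean())
          * (exam_score - exam_score.mean())).sum() / (len(hours_studied) - 1)

print(f"{cov_pos:.2f}")  # 16.00
print(f"{cov_neg:.2f}")  # -4.25
```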

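**Eigenvector**: a non-zero vector v whose direction is unchanged when a square matrix A is applied to it: A v = λ v, where the scalar λ is the corresponding eigenvalue. An eigenvector is also known as a latent vector, proper vector, or characteristic vector, and is defined only with reference to a square matrix. In PCA, the eigenvectors of the covariance matrix give the directions of the new axes, and each eigenvalue gives the variance of the data along its eigenvector.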
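
The defining property A v = λ v can be checked directly. The matrix `A` below is an arbitrary symmetric example chosen for this sketch, not something taken from the dataset. Note that `np.linalg.eig` stores the eigenvectors as the columns of the returned matrix.

```python
import numpy as np

# A small symmetric matrix, shaped like a 2x2 covariance matrix.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigen_values, eigen_vectors = np.linalg.eig(A)

# eigen_vectors[:, i] (a column) pairs with eigen_values[i].
checks = []
for i in range(len(eigen_values)):
    v = eigen_vectors[:, i]
    lam = eigen_values[i]
    checks.append(np.allclose(A @ v, lam * v))
    print(lam, checks[-1])
```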

**Step by Step Solution:**

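1. Mean centering: shift the data so that its mean sits at the origin. The principal components then describe the variation around the center of the data.


2. Find the covariance matrix. (np.cov)

3. Find the eigenvalues and eigenvectors of the covariance matrix. (np.linalg.eig)

The eigenvectors with the largest eigenvalues are your principal components; projecting the centered data onto them gives the lower-dimensional data.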
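
The steps above can be sketched as one small function. This is a minimal illustration, not a replacement for a library implementation: `pca_from_scratch` is a name invented here, the data is synthetic, and it uses `np.linalg.eigh` (the variant for symmetric matrices) instead of `np.linalg.eig`, then sorts the eigenvalues explicitly because neither routine promises descending order.

```python
import numpy as np

def pca_from_scratch(X, n_components):
    """Project X (n_samples x n_features) onto its top principal components."""
    # Step 1: mean centering.
    X_centered = X - X.mean(axis=0)
    # Step 2: covariance matrix (columns are the variables).
    covariance = np.cov(X_centered, rowvar=False)
    # Step 3: eigenvalues and eigenvectors (eigenvectors are columns).
    eigen_values, eigen_vectors = np.linalg.eigh(covariance)
    # Sort from largest to smallest eigenvalue.
    order = np.argsort(eigen_values)[::-1]
    eigen_values = eigen_values[order]
    eigen_vectors = eigen_vectors[:, order]
    # Keep the top components as rows, then project the centered data.
    components = eigen_vectors[:, :n_components].T
    return X_centered @ components.T, components, eigen_values

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.2])  # toy data
projected, components, eigen_values = pca_from_scratch(X, n_components=2)
print(projected.shape)  # (200, 2)
```

The variance of each projected column equals the corresponding eigenvalue, which is why the eigenvalues are used to rank the components.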


# Code:

In [1]:
import numpy as np
import pandas as pd

In [2]:
np.random.seed(23)

The NumPy random seed initializes the pseudo-random number generator. Calling np.random.seed with the same value (here 23) always reproduces the same sequence of random numbers, which makes the generated data repeatable.
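
A short sketch of that behaviour (the variable names are only for illustration): reseeding with the same value replays the same numbers, while continuing without reseeding moves on to new ones.

```python
import numpy as np

np.random.seed(23)
first_draw = np.random.rand(3)

np.random.seed(23)               # reset to the same seed
second_draw = np.random.rand(3)  # identical to first_draw

third_draw = np.random.rand(3)   # no reseed, so the sequence continues

print(np.array_equal(first_draw, second_draw))  # True
print(np.array_equal(second_draw, third_draw))  # False
```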

Creating a DataFrame of 40 rows and 4 columns (3 features and a target):

In [57]:
mu_vec1 = np.array([0,0,0])
cov_mat1 = np.array([[1,0,0] , [0,1,0] , [0,0,1]])
class1_sample = np.random.multivariate_normal(mu_vec1 , cov_mat1 , 20)

df = pd.DataFrame(class1_sample , columns = ['features1' , 'features2' , 'features3'])
df['target'] = 1

mu_vec2 = np.array([1,1,1])
cov_mat2 = np.array([[1,0,0] , [0,1,0] , [0,0,1]])
class2_sample = np.random.multivariate_normal(mu_vec2 , cov_mat2 , 20)

df1 = pd.DataFrame(class2_sample , columns = ['features1' , 'features2' , 'features3'])
df1['target'] = 0

df = pd.concat([df, df1], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

df = df.sample(40)  # shuffle the 40 rows

In [58]:
df.head()

Unnamed: 0,features1,features2,features3,target
5,0.272649,-1.051434,0.336153,1
1,2.80022,-0.017483,0.063191,1
19,-1.745,0.854942,-0.548406,1
12,-0.888666,1.254989,-1.816714,1
37,-0.47204,1.570002,1.702376,0


Our target is to convert this 3D data into 2D

In [7]:
import plotly.express as px



In [59]:
fig = px.scatter_3d(df, x=df['features1'], y=df['features2'], z=df['features3'],
              color = df['target'].astype('str'))
fig.update_traces(marker=dict(size=12,
                              line=dict(width=2,
                                        color ='DarkSlateGrey')),
                  selector = dict(mode = 'markers'))
fig.show()
                    


              

The end goal is to represent this data in a 2D coordinate system in the best possible way.


Following the steps above to reduce the dimensionality:

Step 1: mean centering. StandardScaler subtracts each column's mean and also scales each column to unit variance.

In [10]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

df.iloc[:,0:3] = scaler.fit_transform(df.iloc[:,0:3])



Step 2: find the covariance matrix

In [11]:
covariance_matrix = np.cov([df.iloc[:,0] , df.iloc[:,1] , df.iloc[:,2]])
print('Covariance Matrix \n' ,covariance_matrix)

Covariance Matrix 
 [[1.02564103 0.20478114 0.080118  ]
 [0.20478114 1.02564103 0.19838882]
 [0.080118   0.19838882 1.02564103]]


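**Inference:** In this matrix the diagonal entries are the variances of the 3 columns and the off-diagonal entries are the covariances between pairs of columns. The variances are 1.0256 rather than exactly 1 because StandardScaler divides by n = 40 while np.cov divides by n - 1 = 39, and 40/39 ≈ 1.0256.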
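
The 40/39 factor on the diagonal can be reproduced on any standardized data. The sketch below uses random stand-in data and standardizes by hand with the population standard deviation, which is the divisor StandardScaler uses.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))  # stand-in for the 40 x 3 feature matrix

# Standardize each column: zero mean, unit population variance (divides by n).
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

cov_sample = np.cov(X_scaled, rowvar=False)              # divides by n - 1
cov_population = np.cov(X_scaled, rowvar=False, ddof=0)  # divides by n

print(np.diag(cov_sample))      # each entry is 40 / 39
print(np.diag(cov_population))  # each entry is 1
```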


Step 3: find the eigenvalues and eigenvectors

In [12]:
eigen_values , eigen_vectors = np.linalg.eig(covariance_matrix)


In [13]:
eigen_values

array([1.3536065 , 0.94557084, 0.77774573])

In [14]:
eigen_vectors

array([[-0.53875915, -0.69363291,  0.47813384],
       [-0.65608325, -0.01057596, -0.75461442],
       [-0.52848211,  0.72025103,  0.44938304]])

Remember: if you are working with 3D data, you will get 3 eigenvectors and 3 eigenvalues.
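
Each eigenvalue is the variance captured along its eigenvector, so dividing by the total shows how much information a given number of components retains. A minimal sketch, with the eigenvalues copied by hand from the output above:

```python
import numpy as np

# Eigenvalues copied from the np.linalg.eig output above.
eigen_values = np.array([1.3536065, 0.94557084, 0.77774573])

explained_ratio = eigen_values / eigen_values.sum()
cumulative = np.cumsum(explained_ratio)

print(np.round(explained_ratio, 3))
print(np.round(cumulative, 3))
```

For these values, the two largest eigenvalues account for roughly 75% of the total variance, which is what a 2D projection of this data can preserve.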

We choose the two eigenvectors with the largest eigenvalues and call them pc. np.linalg.eig returns the eigenvectors as the columns of eigen_vectors, so we transpose before slicing.

In [16]:
# eigen_vectors[:, i] is the eigenvector for eigen_values[i]. The eigenvalues
# above are already in descending order, so the first two columns are the top two.
pc = eigen_vectors.T[0:2]
pc

array([[-0.53875915, -0.65608325, -0.52848211],
       [-0.69363291, -0.01057596,  0.72025103]])

In [60]:
transformed_df = np.dot(df.iloc[:,0:3] , pc.T)

new_df = pd.DataFrame(transformed_df , columns = ['PC1' , 'PC2'])
new_df['target'] = df['target'].values
new_df.head()

new_df.head() shows the first five projected rows, with columns PC1, PC2, and target. The projected data now has shape (40, 2), plus the target column.

**Inference:** The data has now been transformed into 2D.

In [61]:
new_df['target'] = new_df['target'].astype('str')
fig = px.scatter(x=new_df['PC1'] ,
                 y=new_df['PC2'] , 
                 color=new_df['target'],
                 color_discrete_sequence=px.colors.qualitative.G10
                 )

fig.update_traces(marker=dict(size=12,
                              line=dict(width=2,
                                        color='DarkSlateGrey')),
                  selector=dict(mode='markers'))

fig.show()
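
As a sanity check, the manual route can be compared with scikit-learn's `PCA`. This is a sketch on synthetic stand-in data (not the DataFrame above), and it uses `np.linalg.eigh` with an explicit sort. The two results should agree up to the sign of each component, since an eigenvector and its negation describe the same axis.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(23)
# Synthetic, mildly correlated 3D data: 40 samples, 3 features.
X = rng.normal(size=(40, 3)) @ np.array([[1.0, 0.5, 0.2],
                                         [0.0, 1.0, 0.3],
                                         [0.0, 0.0, 1.0]])
X_centered = X - X.mean(axis=0)

# Manual route: eigenvectors of the covariance matrix, sorted by eigenvalue.
eigen_values, eigen_vectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))
order = np.argsort(eigen_values)[::-1]
pc = eigen_vectors[:, order[:2]].T  # top two eigenvectors as rows
manual = X_centered @ pc.T

# scikit-learn route (it centers the data internally).
sk_pca = PCA(n_components=2)
sk = sk_pca.fit_transform(X)

# Components may differ by sign, so compare absolute values.
same_projection = np.allclose(np.abs(manual), np.abs(sk))
print(same_projection)
```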

# Practical Example on MNIST Dataset with Scikit Learn Implementation

In [38]:
from google.colab import files
import matplotlib.pyplot as plt

In [31]:
uploads = files.upload()

Saving ML mnist dataset.csv to ML mnist dataset (1).csv


In [93]:
df = pd.read_csv('ML mnist dataset.csv')


In [94]:
df.shape

(10000, 785)

In [95]:
df.head()

Unnamed: 0,label,1x1,1x2,1x3,1x4,1x5,1x6,1x7,1x8,1x9,...,28x19,28x20,28x21,28x22,28x23,28x24,28x25,28x26,28x27,28x28
0,7,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [96]:
df.sample()

Unnamed: 0,label,1x1,1x2,1x3,1x4,1x5,1x6,1x7,1x8,1x9,...,28x19,28x20,28x21,28x22,28x23,28x24,28x25,28x26,28x27,28x28
6657,8,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [97]:
x = df.iloc[:,1:]
y = df.iloc[:,0]

In [98]:
from sklearn.model_selection import train_test_split

x_train , x_test , y_train , y_test = train_test_split(x,y,test_size=0.2,random_state=42)

In [99]:
x_train.shape

(8000, 784)

In [100]:
from sklearn.neighbors import KNeighborsClassifier

In [101]:
knn = KNeighborsClassifier()

In [102]:
knn.fit(x_train , y_train)

KNeighborsClassifier()

In [103]:
import time

start = time.time()
y_pred = knn.predict(x_test)
print(time.time() - start)

1.3851611614227295


In [104]:
from sklearn.metrics import accuracy_score

In [105]:
accuracy_score(y_test,y_pred)

0.943

In [106]:
from sklearn.preprocessing import StandardScaler

In [107]:
scaler = StandardScaler()

In [108]:
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

Using PCA

In [109]:
from sklearn.decomposition import PCA

In [110]:
pca = PCA(n_components=100)

In [112]:
x_train_trf = pca.fit_transform(x_train)
x_test_trf = pca.transform(x_test)

In [113]:
x_train_trf.shape

(8000, 100)

**Inference:** After applying PCA the number of features dropped from 784 to 100.

In [115]:
knn = KNeighborsClassifier()

In [116]:
knn.fit(x_train_trf,y_train)

KNeighborsClassifier()

In [117]:
y_pred = knn.predict(x_test_trf)

In [118]:
accuracy_score(y_test,y_pred)

0.9315
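
With 100 components, KNN scores 0.9315 compared with 0.943 on the original 784 pixels. The value 100 was picked by hand; one common way to choose it is to look at the cumulative explained variance. The sketch below is an illustration under stated assumptions: it uses scikit-learn's bundled 8x8 digits dataset (64 features) as a stand-in for the MNIST CSV, and the 95% threshold is an arbitrary example value.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data: scikit-learn's small digits dataset, not the MNIST CSV used above.
X, _ = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Fit with all components and accumulate the explained variance ratios.
pca_full = PCA(svd_solver="full").fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%.
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)

# Shortcut: a float n_components asks PCA to keep that fraction of the variance.
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X_scaled)

print(n_for_95, pca_95.n_components_)
```

The same pattern applies to `x_train` from the cells above: fit `PCA` on the scaled training data only, then reuse the fitted object to transform the test data.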