<a href="https://colab.research.google.com/github/gargiisc/mlc/blob/main/Experiment_10.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**EXPERIMENT 10: Principal Component Analysis**



**Theory:** <br>
As the number of features or dimensions in a dataset increases, the amount of data required to obtain a statistically significant result can grow exponentially. This can lead to issues such as overfitting, increased computation time, and reduced accuracy of machine learning models. These problems, which arise when working with high-dimensional data, are known as the curse of dimensionality.<br>

As the number of dimensions increases, the number of possible combinations of feature values increases exponentially, which makes it difficult to obtain a representative sample of the data. Tasks such as clustering or classification become expensive because the algorithms need to process a much larger feature space, which increases computation time and complexity. Additionally, some machine learning algorithms are sensitive to the number of dimensions and require more data to achieve the same level of accuracy as they would on lower-dimensional data.<br>
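
As a rough illustration of this growth, suppose each feature is split into a fixed number of bins. The bin count of 10 and the helper name `grid_cells` are assumptions chosen for this sketch, not part of the experiment. The number of cells needed to cover the feature space is then the bin count raised to the number of features:

```python
# Illustrative assumption: each feature is discretised into 10 bins.
bins_per_feature = 10

def grid_cells(num_features, bins=bins_per_feature):
    """Number of cells in a grid with `bins` bins along each feature."""
    return bins ** num_features

# Print the cell count for a few feature counts.
for d in (1, 2, 3, 10):
    print(d, grid_cells(d))
```

Covering every cell with at least one sample quickly becomes impractical, which is one way to read the claim above.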

To address the curse of dimensionality, feature engineering techniques such as feature selection and feature extraction are used. Dimensionality reduction is a type of feature extraction that aims to reduce the number of input features while retaining as much of the original information as possible.<br>

In this experiment, we work through one of the most popular dimensionality reduction techniques: Principal Component Analysis (PCA).
<br><br>




**Step 1: Import the necessary libraries**

In [17]:
import numpy as np
from numpy import linalg as la

**Step 2: Give the input dataset.**

In [2]:
x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2, 1, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3, 2.7, 1.6, 1.1, 1.6, 0.9])
data = np.array([x, y])
print(x)
print(y)
print(data)

[2.5 0.5 2.2 1.9 3.1 2.3 2.  1.  1.5 1.1]
[2.4 0.7 2.9 2.2 3.  2.7 1.6 1.1 1.6 0.9]
[[2.5 0.5 2.2 1.9 3.1 2.3 2.  1.  1.5 1.1]
 [2.4 0.7 2.9 2.2 3.  2.7 1.6 1.1 1.6 0.9]]


In [3]:
xMean = np.mean(x)
yMean = np.mean(y)
print(xMean)
print(yMean)

1.81
1.9100000000000001


In [4]:
data.shape

(2, 10)

**Step 3: Compute the mean-adjusted values by subtracting the mean of each feature from every point.**

In [5]:
meanAdjusted = np.zeros((2, 10))
for i in range(len(data[0])):
    meanAdjusted[0][i] = data[0][i] - xMean
for i in range(len(data[1])):
    meanAdjusted[1][i] = data[1][i] - yMean
print(meanAdjusted)

[[ 0.69 -1.31  0.39  0.09  1.29  0.49  0.19 -0.81 -0.31 -0.71]
 [ 0.49 -1.21  0.99  0.29  1.09  0.79 -0.31 -0.81 -0.31 -1.01]]
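
The same centering can be sketched without explicit loops by using NumPy broadcasting. This is an alternative formulation, not the notebook's own cell; the names `row_means` and `mean_adjusted` are chosen here for illustration.

```python
import numpy as np

# Same 2 x 10 dataset as above: row 0 is x, row 1 is y.
data = np.array([
    [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1],
    [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9],
])

# Mean of each row (feature). keepdims=True keeps the shape (2, 1),
# so the means broadcast across the 10 columns.
row_means = data.mean(axis=1, keepdims=True)
mean_adjusted = data - row_means

print(mean_adjusted)
```

Because the shape is taken from `data` itself, this version does not need the hard-coded `(2, 10)` and works for any number of features or samples.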


**Step 4: Compute the covariance matrix of the mean adjusted data**

In [6]:
cov_mat = np.cov(meanAdjusted)
print(cov_mat)

[[0.61655556 0.61544444]
 [0.61544444 0.71655556]]
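
To make the covariance computation explicit, the sketch below builds the same matrix by hand as the product of the centered data with its transpose, divided by n - 1 (the sample covariance, which is what `np.cov` returns by default). The variable names are illustrative.

```python
import numpy as np

# Same 2 x 10 dataset as above: each row is one feature.
data = np.array([
    [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1],
    [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9],
])

n = data.shape[1]                                   # number of samples
centered = data - data.mean(axis=1, keepdims=True)  # mean-adjusted data

# Sample covariance: (X_c @ X_c^T) / (n - 1)
cov_manual = centered @ centered.T / (n - 1)

print(cov_manual)
# np.cov centers its input internally, so passing the raw data gives the same matrix.
print(np.allclose(cov_manual, np.cov(data)))
```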


**Step 5: Compute the eigenvalues and eigenvectors.**

In [7]:
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)

Eigenvectors 
[[-0.73517866 -0.6778734 ]
 [ 0.6778734  -0.73517866]]

Eigenvalues 
[0.0490834  1.28402771]
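
Since a covariance matrix is symmetric, `np.linalg.eigh` is a possible alternative to `np.linalg.eig`: it is designed for symmetric matrices and returns real eigenvalues in ascending order. The sketch below is an alternative, not the notebook's cell; it also checks the defining relation C v = λ v for each pair. Eigenvectors are only defined up to sign, so the signs may differ from the output above.

```python
import numpy as np

data = np.array([
    [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1],
    [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9],
])
cov_mat = np.cov(data)

# eigh: eigenvalues in ascending order, eigenvectors as columns.
eig_vals, eig_vecs = np.linalg.eigh(cov_mat)

# Check that each column v satisfies cov_mat @ v == eigenvalue * v.
for i in range(len(eig_vals)):
    v = eig_vecs[:, i]
    print(eig_vals[i], np.allclose(cov_mat @ v, eig_vals[i] * v))
```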


**Step 6: Arrange the eigenvalues in descending order.**

In [8]:
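eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]
# Sort on the eigenvalue only, so tied eigenvalues never trigger an array comparison
eig_pairs.sort(key=lambda pair: pair[0], reverse=True)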
print('Eigenvalues in descending order:')
for i in eig_pairs:
    print(i[0])

Eigenvalues in descending order:
1.2840277121727839
0.04908339893832736


In [10]:
print('Eigenvectors in descending order: ')
for i in eig_pairs:
    print(i[1])

Eigenvectors in descending order: 
[-0.6778734  -0.73517866]
[-0.73517866  0.6778734 ]


In [11]:
eig_pairs[0][1]

array([-0.6778734 , -0.73517866])
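
The sorted eigenvalues from Step 6 also indicate how much of the total variance each principal component captures. The sketch below computes that share using the eigenvalues printed above; the names `explained` and `cumulative` are illustrative.

```python
import numpy as np

# Eigenvalues in descending order, copied from the Step 6 output.
eig_vals_sorted = np.array([1.2840277121727839, 0.04908339893832736])

# Fraction of total variance carried by each component.
explained = eig_vals_sorted / eig_vals_sorted.sum()
# Running total, useful when deciding how many components to keep.
cumulative = np.cumsum(explained)

print(explained)
print(cumulative)
```

For this dataset the first component accounts for roughly 96% of the variance, which is why keeping only that component loses little information.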

**Step 7: Transform the data by projecting it onto the eigenvectors, largest eigenvalue first, and display the result.**

In [12]:
transformedData1 = np.matmul(meanAdjusted.T, eig_pairs[0][1])
transformedData2 = np.matmul(meanAdjusted.T, eig_pairs[1][1])
print(transformedData1)
print(transformedData2)

[-0.82797019  1.77758033 -0.99219749 -0.27421042 -1.67580142 -0.9129491
  0.09910944  1.14457216  0.43804614  1.22382056]
[-0.17511531  0.14285723  0.38437499  0.13041721 -0.20949846  0.17528244
 -0.3498247   0.04641726  0.01776463 -0.16267529]


In [13]:
transformedData = [transformedData1, transformedData2]
transformedData = np.transpose(transformedData)
print(transformedData)


[[-0.82797019 -0.17511531]
 [ 1.77758033  0.14285723]
 [-0.99219749  0.38437499]
 [-0.27421042  0.13041721]
 [-1.67580142 -0.20949846]
 [-0.9129491   0.17528244]
 [ 0.09910944 -0.3498247 ]
 [ 1.14457216  0.04641726]
 [ 0.43804614  0.01776463]
 [ 1.22382056 -0.16267529]]


In [14]:
matrix_w = np.hstack((eig_pairs[0][1].reshape(2,1), eig_pairs[1][1].reshape(2,1)))
print('Matrix W:\n', matrix_w)

Matrix W:
 [[-0.6778734  -0.73517866]
 [-0.73517866  0.6778734 ]]
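
With the eigenvectors stacked as the columns of W, the projection in Step 7 can be sketched as a single matrix product, and keeping only the first column of W gives the reduced, one-dimensional representation. This sketch rebuilds W with `np.linalg.eigh` and `np.argsort`, which is an assumption of this example rather than the notebook's approach; projected values may differ in sign from the output above because eigenvectors are only defined up to sign.

```python
import numpy as np

data = np.array([
    [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1],
    [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9],
])
centered = data - data.mean(axis=1, keepdims=True)

eig_vals, eig_vecs = np.linalg.eigh(np.cov(centered))
order = np.argsort(eig_vals)[::-1]   # indices for descending eigenvalues
matrix_w = eig_vecs[:, order]        # columns are the sorted eigenvectors

# Project onto both components at once: shape (10, 2).
scores = centered.T @ matrix_w
# Keep only the first principal component: shape (10, 1).
scores_1d = centered.T @ matrix_w[:, :1]

print(scores.shape, scores_1d.shape)
```

The variance of each column of `scores` equals the corresponding eigenvalue, and the columns are uncorrelated.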


**Step 8: Reconstruct the original data from the transformed data.**

In [16]:
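# Undo the projection with the transpose of W (W is orthogonal, so W.T is its inverse).
# For this dataset W happens to be symmetric, so the printed result is unchanged.
originalData = np.matmul(transformedData, matrix_w.T)
# Add the means back to return to the original coordinates
originalData = originalData + np.array([xMean, yMean])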
print(originalData)

[[2.5 2.4]
 [0.5 0.7]
 [2.2 2.9]
 [1.9 2.2]
 [3.1 3. ]
 [2.3 2.7]
 [2.  1.6]
 [1.  1.1]
 [1.5 1.6]
 [1.1 0.9]]
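
Reconstruction is exact above because both components are kept. The sketch below keeps only the first principal component and measures what is lost. It is an illustrative extension under the same data, and names such as `approx` and `discarded_variance` are chosen for this example.

```python
import numpy as np

data = np.array([
    [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1],
    [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9],
])
n = data.shape[1]
means = data.mean(axis=1, keepdims=True)   # shape (2, 1)
centered = data - means

eig_vals, eig_vecs = np.linalg.eigh(np.cov(centered))
w1 = eig_vecs[:, [np.argmax(eig_vals)]]    # top eigenvector, shape (2, 1)

# Project to 1-D, then map back to 2-D and restore the means.
scores_1d = centered.T @ w1                # shape (10, 1)
approx = scores_1d @ w1.T + means.T        # shape (10, 2)

# Squared reconstruction error, scaled like a sample variance.
residual = data.T - approx
discarded_variance = (residual ** 2).sum() / (n - 1)

print(approx)
print(discarded_variance, eig_vals.min())
```

The scaled reconstruction error matches the smaller eigenvalue, which is the variance along the component that was dropped.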
