# Principal Component Analysis (PCA)

PCA is a method for reducing the dimensionality of data. It can be thought of as a projection method in which data **A** with m columns (features) is projected into a subspace **B** with m or fewer columns, while retaining the essence of the original data.  
  
**B = PCA(A)**

## Preparation of Fake data
Here we create a small fake dataset with 3 rows (samples) and 2 columns (features).

In [1]:
from numpy import array
A = array([
    [1,2],
    [3,4],
    [5,6]
])

## Theory

PCA is an operation applied to a dataset, represented by an n × m matrix **A**, that results in a projection of **A**, which we will call **B**. The steps involved are described below.  

***

### Step 1:
The first step is to calculate the mean values of **each column**.  
  
**M = mean(A)**

In [2]:
from numpy import mean
M = mean(A, axis=0)
M

array([3., 4.])

### Step 2:
Next, we need to center the values in each column by subtracting the mean column value.  
  
**C = A - M**

In [3]:
C = A - M
C

array([[-2., -2.],
       [ 0.,  0.],
       [ 2.,  2.]])
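
As a quick sanity check on the centering step, the sketch below rebuilds `A`, subtracts the column means, and confirms that every column of `C` then has a mean of zero. The name `col_means_after` is introduced here only for illustration.

```python
import numpy as np

# Same fake dataset as above: 3 samples, 2 features.
A = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)

M = A.mean(axis=0)  # column means, shape (2,)
C = A - M           # broadcasting subtracts M from every row of A

# After centering, each column should average to zero (up to rounding).
col_means_after = C.mean(axis=0)
print(col_means_after)
```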

### Step 3:
The next step is to calculate the covariance matrix of the centered matrix C.  
  
Covariance measures how two columns vary together; it is the unnormalized counterpart of correlation.  
  
A covariance matrix holds the covariance of every column with every other column, including itself. Its diagonal entries are therefore the variances of the individual columns.

**V = cov(C)**

In [4]:
from numpy import cov
V = cov(C.T)  # cov treats each row as a variable, so we transpose to get the covariance between columns.
V

array([[4., 4.],
       [4., 4.]])
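
To make the covariance step less of a black box, here is a minimal sketch that computes the same matrix by hand. It assumes the sample covariance (dividing by n - 1), which is NumPy's default for `cov`; the names `V_manual` and `V_numpy` are introduced here only for the comparison.

```python
import numpy as np

# Centered data from Step 2.
C = np.array([[-2.0, -2.0], [0.0, 0.0], [2.0, 2.0]])
n = C.shape[0]  # number of samples (rows)

# Sample covariance of the columns: (C^T C) / (n - 1).
V_manual = C.T @ C / (n - 1)

# rowvar=False tells cov that columns are the variables,
# which is equivalent to calling cov(C.T).
V_numpy = np.cov(C, rowvar=False)

print(V_manual)
print(np.allclose(V_manual, V_numpy))
```

Passing `rowvar=False` avoids the explicit transpose, which some readers may find easier to follow.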

### Step 4:
Then we calculate the eigenvalues and eigenvectors, i.e. we perform the **eigendecomposition** of the covariance matrix V.  
  
The eigenvectors represent the directions or components for the reduced subspace of B, whereas each eigenvalue gives the magnitude for its direction, namely the variance of the data along that eigenvector.

In [5]:
from numpy.linalg import eig
eig_val, eig_vec = eig(V)
print("Eigen Values:")
print(eig_val)
print("\nEigen Vectors:")
print(eig_vec)

Eigen Values:
[8. 0.]

Eigen Vectors:
[[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]]
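
The following sketch checks the defining property of the decomposition, V · v = λ · v, for each eigenpair, and confirms that the eigenvectors are orthonormal. It also shows `numpy.linalg.eigh`, an alternative suited to symmetric matrices such as a covariance matrix; note that `eigh` returns eigenvalues in ascending order. The names `is_orthonormal`, `val_h` and `vec_h` are illustrative.

```python
import numpy as np

# Covariance matrix from Step 3.
V = np.array([[4.0, 4.0], [4.0, 4.0]])
eig_val, eig_vec = np.linalg.eig(V)

# Each column of eig_vec is one eigenvector; check V v = lambda v.
pairs_ok = all(
    np.allclose(V @ eig_vec[:, i], eig_val[i] * eig_vec[:, i])
    for i in range(len(eig_val))
)

# For a symmetric matrix with distinct eigenvalues the eigenvectors
# are orthogonal, and eig returns them with unit length.
is_orthonormal = np.allclose(eig_vec.T @ eig_vec, np.eye(2))

# eigh exploits symmetry and returns eigenvalues in ascending order.
val_h, vec_h = np.linalg.eigh(V)

print(pairs_ok, is_orthonormal)
print(val_h)
```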


### Step 5:
The eigenvectors can be sorted by their eigenvalues in descending order to provide a ranking of the components or axes of the new subspace for A.

If all eigenvalues have a similar value, then the existing representation may already be reasonably compressed or dense, and the projection may offer little. If some eigenvalues are close to zero, they represent components or axes of B that may be discarded.

A total of m or fewer components must be selected to make up the chosen subspace. Ideally, we would select the k eigenvectors, called principal components, that have the k largest eigenvalues.

**Important:** Each column of **eig_vec** (in the cell above) is one eigenvector.  

Since the second eigenvalue is 0, we can discard the second column, but let's see what happens if we keep it.
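
The ranking described in this step can be sketched as follows. The eigenvalues here already happen to be in descending order, but `eig` does not guarantee that, so sorting explicitly is safer. The 99% variance threshold used to pick `k` is an assumption made for this illustration, not something prescribed above.

```python
import numpy as np

# Eigenpairs from Step 4 (columns of eig_vec are the eigenvectors).
eig_val = np.array([8.0, 0.0])
eig_vec = np.array([[0.70710678, -0.70710678],
                    [0.70710678,  0.70710678]])

# Indices that sort the eigenvalues from largest to smallest.
order = np.argsort(eig_val)[::-1]
eig_val_sorted = eig_val[order]
eig_vec_sorted = eig_vec[:, order]  # reorder columns to match

# Fraction of the total variance carried by each component.
explained = eig_val_sorted / eig_val_sorted.sum()

# Smallest k whose cumulative share reaches the (assumed) 99% threshold.
k = int(np.searchsorted(np.cumsum(explained), 0.99) + 1)
k = min(k, len(eig_val))

print(explained)
print(k)
```

Here the first component carries all of the variance, so a single component is enough.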

### Step 6:
Once chosen, data can be projected into the subspace via matrix multiplication.  
  
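**B = eig_vec<sup>T</sup> . C<sup>T</sup>**

C is transposed so that the shapes agree. The result has one row per component, so we display its transpose B<sup>T</sup> (which equals C . eig_vec) to get one row per sample.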

In [6]:
# Project the centered data onto the eigenvectors; each row of B is one component.
B = eig_vec.T.dot(C.T)
B.T

array([[-2.82842712,  0.        ],
       [ 0.        ,  0.        ],
       [ 2.82842712,  0.        ]])

**As you can see, the second column is 0, which confirms that we could have discarded the second eigenvector when choosing the subspace.**

In [7]:
# Keep only the first eigenvector (the first column of eig_vec)
red_eig_vec = eig_vec[:,0]
b = red_eig_vec.dot(C.T)
b.reshape((-1,1))

array([[-2.82842712],
       [ 0.        ],
       [ 2.82842712]])
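
Putting the six steps together, a minimal end-to-end sketch might look like the following. The helper `pca_project` is a hypothetical name introduced here, not a NumPy function, and it uses `eigh` rather than `eig` because the covariance matrix is symmetric. The sign of an eigenvector is arbitrary, so the projected values may come out negated relative to the output above; the example therefore prints absolute values.

```python
import numpy as np

def pca_project(A, k):
    """Hypothetical helper: project A onto its top-k principal components."""
    A = np.asarray(A, dtype=float)
    M = A.mean(axis=0)                           # Step 1: column means
    C = A - M                                    # Step 2: center the columns
    V = np.atleast_2d(np.cov(C, rowvar=False))   # Step 3: covariance of columns
    eig_val, eig_vec = np.linalg.eigh(V)         # Step 4: eigendecomposition
    order = np.argsort(eig_val)[::-1][:k]        # Step 5: k largest eigenvalues
    W = eig_vec[:, order]                        # shape (m, k)
    B = C @ W                                    # Step 6: project, shape (n, k)
    return B, W, M

A = np.array([[1, 2], [3, 4], [5, 6]])
B, W, M = pca_project(A, k=1)

# Map the reduced data back to the original space.
A_rec = B @ W.T + M

print(np.abs(B).ravel())
print(np.allclose(A_rec, A))
```

Because this dataset lies exactly on a line, one component reconstructs `A` without loss; on real data the reconstruction would generally be approximate.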