# Principal Component Analysis
PCA is applied to a dataset represented by an $n \times m$ matrix $A$ and produces a projection of $A$, which we call $P$.
$$A = \begin{pmatrix} a_{11} & a_{12}\\
a_{21} & a_{22}\\
a_{31} & a_{32}
\end{pmatrix}$$
$$P = PCA(A)$$

Steps for PCA:
1. Calculate the mean of each column of A: M = mean(A).
2. Center each column by subtracting its mean: C = A - M.
3. Compute the covariance matrix of C, called V.
4. Compute the eigendecomposition of the covariance matrix V.
5. The eigenvectors give the directions, or components, of the reduced subspace; the eigenvalues give the variance of the data along each direction.
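
As a small worked illustration of steps 1 to 3, take the $3 \times 2$ matrix with rows (1, 2), (3, 4), (5, 6). The sample covariance with divisor $n - 1$ is assumed here (this is also the default of NumPy's `np.cov`):
$$M = \begin{pmatrix} 3 & 4 \end{pmatrix}, \quad C = \begin{pmatrix} -2 & -2\\ 0 & 0\\ 2 & 2 \end{pmatrix}, \quad V = \frac{1}{n-1} C^T C = \frac{1}{2}\begin{pmatrix} 8 & 8\\ 8 & 8 \end{pmatrix} = \begin{pmatrix} 4 & 4\\ 4 & 4 \end{pmatrix}$$
This V has trace 8 and determinant 0, so its eigenvalues are 8 and 0: all of the variance lies along a single direction.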

* If all eigenvalues have similar values, no direction dominates, so the existing representation is already reasonably compact and a projection gains little.
* Eigenvalues close to 0 mark components, or axes, that carry almost no variance and may be discarded.
* Ideally, we select the K eigenvectors with the largest eigenvalues; these are called the principal components.
$$B = select(values, vectors)$$
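
The selection step can be sketched as follows. This is a minimal illustration, not a fixed recipe: the helper name `select_components`, the parameter `k`, and the example covariance matrix are assumptions introduced here, and `np.linalg.eigh` is used because a covariance matrix is symmetric.

```python
import numpy as np

def select_components(values, vectors, k):
    """Return the k largest eigenvalues and their eigenvectors (as columns).

    Hypothetical helper for illustration; `vectors[:, i]` is assumed to be
    the eigenvector that belongs to `values[i]`, as returned by NumPy.
    """
    order = np.argsort(values)[::-1]  # indices sorted by descending eigenvalue
    top = order[:k]
    return values[top], vectors[:, top]

# Example covariance matrix (assumed for this sketch)
V = np.array([[4.0, 4.0],
              [4.0, 4.0]])

# eigh is intended for symmetric matrices and returns real eigenvalues
values, vectors = np.linalg.eigh(V)

top_values, B = select_components(values, vectors, k=1)

# Fraction of the total variance kept by the selected components
explained_ratio = top_values.sum() / values.sum()
print(B.shape, explained_ratio)
```

A common way to pick `k` is to keep the smallest number of components whose explained ratio passes a chosen threshold; the threshold itself depends on the application.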
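Instead of eigendecomposition, we can also use the singular value decomposition (SVD). Once the components are chosen, the data can be projected as $$P = B^T C^T,$$ where
B = matrix whose columns are the chosen principal components,
C = centered data from step 2,
P = projection of A, with one column per sample (transpose P to get one row per sample).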
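
The SVD route mentioned above can be sketched like this. It is a minimal example under stated assumptions: the data matrix is an arbitrary small example, the names `svd_values`, `svd_components`, and `svd_projection` are introduced here for illustration, and the covariance uses the divisor n - 1.

```python
import numpy as np

# Small example dataset (assumed for this sketch): 3 samples, 2 features
A = np.array([[1, 2],
              [3, 4],
              [5, 6]], dtype=float)

# Center each column
C = A - A.mean(axis=0)

# Thin SVD of the centered data: C = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(C, full_matrices=False)

# Squared singular values, scaled by n - 1, equal the covariance eigenvalues
svd_values = s**2 / (C.shape[0] - 1)

# Rows of Vt are the principal directions, ordered by decreasing variance
svd_components = Vt

# Project the centered data; this equals U * s
svd_projection = C @ Vt.T

print(svd_values)
print(svd_components)
print(svd_projection)
```

Working on C directly avoids forming the covariance matrix, and the singular values already come sorted in descending order. The signs of the directions may differ from those of an eigendecomposition, since each direction is only defined up to sign.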

In [16]:
import numpy as np
import numpy.linalg as lin
from sklearn.decomposition import PCA

In [12]:
A = np.array([[1,2],
             [3,4],
             [5,6]])
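# Mean of each column
M = np.mean(A, axis=0)
# Center the columns
C = A - M
# Covariance matrix of the centered data (np.cov expects variables in rows)
V = np.cov(C.T)

# Eigendecomposition: values[i] belongs to the column vectors[:, i]
values, vectors = lin.eig(V)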
print(values)
print(vectors)

[8. 0.]
[[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]]


In [15]:
# Project the centered data onto the eigenvectors
P = vectors.T.dot(C.T)
print(P.T)

[[-2.82842712  0.        ]
 [ 0.          0.        ]
 [ 2.82842712  0.        ]]


In [18]:
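# PCA using scikit-learn; rows of components_ are the principal directions
# (signs may differ from lin.eig, since eigenvectors are only defined up to sign)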
pca = PCA(n_components = 2).fit(A)
print(pca.components_)
print(pca.explained_variance_)

[[ 0.70710678  0.70710678]
 [ 0.70710678 -0.70710678]]
[8.00000000e+00 2.25080839e-33]


In [19]:
# Projection
print(pca.transform(A))

[[-2.82842712e+00  2.22044605e-16]
 [ 0.00000000e+00  0.00000000e+00]
 [ 2.82842712e+00 -2.22044605e-16]]
