# Principal Component Analysis

Principal Component Analysis, or PCA for short, is a method for reducing the dimensionality
of data. It can be thought of as a projection method in which data with m columns (features)
is projected into a subspace with m or fewer columns while retaining the essence of the
original data.

## Steps involved 

- The first step is to calculate the mean value of each column of the data matrix A.
   - M = mean(A)

- Next, we need to center the values in each column by subtracting the mean column value.
   - C = A - M

- The next step is to calculate the covariance matrix of the centered matrix C. Correlation
is a normalized measure of the amount and direction (positive or negative) in which two columns
change together. Covariance is the unnormalized counterpart of correlation. A covariance
matrix holds the covariance of every column with every other column, including itself (the
diagonal entries are the variances of the individual columns).
   - V = cov(C)

- Next, we calculate the eigendecomposition of the covariance matrix V. This results in a
list of eigenvalues and a list of eigenvectors.
   - values, vectors = eig(V)

- The eigenvectors represent the directions or components of the reduced subspace B,
whereas the eigenvalues represent the magnitudes of those directions. The eigenvectors can be
sorted by their eigenvalues in descending order to rank the components or axes of the new
subspace for A. If all eigenvalues have a similar value, the existing representation may
already be reasonably compressed or dense, and the projection may offer little. If some
eigenvalues are close to zero, the corresponding components or axes of B may be discarded.
A total of m or fewer components must be selected to comprise the chosen subspace. Ideally,
we would select the k eigenvectors, called principal components, that have the k largest
eigenvalues.
   - B = select(values, vectors)

- Once the components are chosen, the centered data is projected onto them by matrix
multiplication.
   - P = B^T . C^T
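The `select` step in the list above is pseudocode. A minimal sketch of one way to implement it follows, assuming that "select" means "sort by eigenvalue and keep the first k". The helper name `top_k_components` and the parameter `k` are hypothetical, and `np.linalg.eigh` is swapped in for `eig` here because the covariance matrix is symmetric.

```python
import numpy as np


def top_k_components(A, k):
    """Hypothetical helper: return the k largest eigenvalues and their eigenvectors.

    The eigenvectors are returned as the columns of B.
    """
    # center each column on its mean
    C = A - A.mean(axis=0)
    # covariance between columns (rowvar=False treats columns as variables)
    V = np.cov(C, rowvar=False)
    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order
    values, vectors = np.linalg.eigh(V)
    # reorder so that the largest eigenvalue comes first
    order = np.argsort(values)[::-1]
    values, vectors = values[order], vectors[:, order]
    # keep the first k eigenvalues and the matching eigenvector columns
    return values[:k], vectors[:, :k]


A = np.array([[1, 2], [3, 4], [5, 6]])
values, B = top_k_components(A, k=1)
print(values)
print(B)
```

The sign of each eigenvector is arbitrary, so the printed column of `B` may be the negative of what another routine returns; both describe the same axis.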

In [3]:
import numpy as np

In [5]:
# There is no pca() function in NumPy, but we can easily calculate the Principal Component
# Analysis step-by-step using NumPy functions.

# define matrix
A = np.array([
    [1, 2],
    [3, 4],
    [5, 6]
])
print(A)
# column means
M = np.mean(A, axis=0)
# center columns by subtracting column means
C = A - M
# calculate covariance matrix of centered matrix
V = np.cov(C.T)
# factorize covariance matrix
values, vectors = np.linalg.eig(V)
print(vectors)
print(values)
# project the centered data onto the eigenvectors
P = vectors.T.dot(C.T)
# transpose so that each row is one projected sample
print(P.T)



[[1 2]
 [3 4]
 [5 6]]
[[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]]
[8. 0.]
[[-2.82842712  0.        ]
 [ 0.          0.        ]
 [ 2.82842712  0.        ]]
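In the output above, the second eigenvalue is zero and the second projected column is all zeros, so that component carries no information for this matrix. The sketch below repeats the same calculation and checks two things: that the sample variance of each projected column matches its eigenvalue, and that keeping only the largest component reduces the data from two columns to one. The names `variances`, `order`, `reduced` and the choice `k = 1` are assumptions made for illustration.

```python
import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6]])
# center the columns and factorize the covariance matrix, as in the cell above
C = A - A.mean(axis=0)
values, vectors = np.linalg.eig(np.cov(C, rowvar=False))

# project the centered data onto every eigenvector (one column per component)
P = C.dot(vectors)

# the sample variance of each projected column should match its eigenvalue
variances = P.var(axis=0, ddof=1)
print(np.allclose(variances, values))

# keep only the component with the largest eigenvalue (k = 1 is an illustrative choice)
k = 1
order = np.argsort(values)[::-1][:k]
reduced = C.dot(vectors[:, order])
print(reduced.shape)
```

`C.dot(vectors)` is the transpose of `vectors.T.dot(C.T)`, so `P` here corresponds to the `P.T` printed earlier.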
