# Principal Component Analysis (PCA) from scratch
**[Click here for Theory](./1_scratch_fake.ipynb)**

## Loading data

In [1]:
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()

In [2]:
cancer.keys()

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

In [3]:
print(cancer.DESCR)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, f

In [4]:
A = cancer.data
A.shape

(569, 30)

## Performing PCA

In [5]:
import numpy as np

In [6]:
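# Mean of each feature (column); shape (30,)
M = np.mean(A, axis=0)

In [7]:
# Center the data so every feature has zero mean
C = A - M

In [8]:
# Covariance matrix of the features; np.cov expects variables in rows, hence the transpose
V = np.cov(C.T)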
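
As a sanity check on the centering and covariance steps, the sketch below computes the sample covariance by hand and compares it with `np.cov`. It uses a small random matrix as stand-in data (an assumption, not the breast cancer dataset), and the names `A_demo`, `C_demo`, `V_manual` and `V_numpy` are hypothetical.

```python
import numpy as np

# Stand-in data (assumption): 8 samples, 3 features, not the cancer dataset
rng = np.random.default_rng(0)
A_demo = rng.normal(size=(8, 3))

# Center each feature, as in the cells above
C_demo = A_demo - A_demo.mean(axis=0)

# Sample covariance by hand: C^T C / (n - 1)
# np.cov uses the same n - 1 divisor by default
n = C_demo.shape[0]
V_manual = C_demo.T @ C_demo / (n - 1)

# np.cov expects variables in rows, hence the transpose
V_numpy = np.cov(C_demo.T)

print(np.allclose(V_manual, V_numpy))
```

If the two matrices agree, the `np.cov(C.T)` call can be read as shorthand for the explicit matrix product.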

In [9]:
eig_val, eig_vec = np.linalg.eig(V)

In [10]:
print("Eigen Values:")
print(eig_val)
print("\nNumber of eigen values: %d"%eig_val.shape[0])

Eigen Values:
[4.43782605e+05 7.31010006e+03 7.03833742e+02 5.46487379e+01
 3.98900178e+01 3.00458768e+00 1.81533030e+00 3.71466740e-01
 1.55513547e-01 8.40612196e-02 3.16089533e-02 7.49736514e-03
 3.16165652e-03 2.16150395e-03 1.32653879e-03 6.40269304e-04
 3.74883320e-04 2.35169626e-04 1.84583467e-04 1.64180064e-04
 7.81102011e-05 5.76111660e-05 3.49172775e-05 2.83952689e-05
 1.61463677e-05 1.24902419e-05 7.01997261e-07 3.68048171e-06
 2.84790425e-06 2.00491564e-06]

Number of eigen values: 30


In [11]:
print("Shape of Eigen vectors:", eig_vec.shape)

Shape of Eigen vectors: (30, 30)
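
Each column of the eigenvector matrix pairs with one eigenvalue. The sketch below checks that relationship on stand-in random data (an assumption, not the breast cancer dataset). It uses `np.linalg.eigh`, which is intended for symmetric matrices such as a covariance matrix and returns eigenvalues in ascending order; the notebook itself uses `np.linalg.eig`, so this is an alternative rather than the method used above.

```python
import numpy as np

# Stand-in data (assumption): 50 samples, 4 features
rng = np.random.default_rng(0)
A_demo = rng.normal(size=(50, 4))
V_demo = np.cov(A_demo.T)

# eigh assumes a symmetric matrix: real eigenvalues, ascending order
vals, vecs = np.linalg.eigh(V_demo)

# Column vecs[:, i] should satisfy V v = lambda_i v
# (vecs * vals scales column i by vals[i] via broadcasting)
satisfies_eigen_equation = np.allclose(V_demo @ vecs, vecs * vals)

# The eigenvectors of a symmetric matrix are orthonormal
is_orthonormal = np.allclose(vecs.T @ vecs, np.eye(4))

print(satisfies_eigen_equation, is_orthonormal)
```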


### Keeping only the eigenvectors of the top 5 eigenvalues

In [12]:
sorted_eig_val = np.sort(eig_val)
sorted_eig_val

array([7.01997261e-07, 2.00491564e-06, 2.84790425e-06, 3.68048171e-06,
       1.24902419e-05, 1.61463677e-05, 2.83952689e-05, 3.49172775e-05,
       5.76111660e-05, 7.81102011e-05, 1.64180064e-04, 1.84583467e-04,
       2.35169626e-04, 3.74883320e-04, 6.40269304e-04, 1.32653879e-03,
       2.16150395e-03, 3.16165652e-03, 7.49736514e-03, 3.16089533e-02,
       8.40612196e-02, 1.55513547e-01, 3.71466740e-01, 1.81533030e+00,
       3.00458768e+00, 3.98900178e+01, 5.46487379e+01, 7.03833742e+02,
       7.31010006e+03, 4.43782605e+05])

In [13]:
top5 = sorted_eig_val[-5:]
top5

array([3.98900178e+01, 5.46487379e+01, 7.03833742e+02, 7.31010006e+03,
       4.43782605e+05])
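
One way to judge whether five components are enough is the fraction of total variance each eigenvalue accounts for. The sketch below computes this from the eigenvalues printed earlier, copied in by hand; `eig_val_doc`, `ratio` and `cumulative` are hypothetical names, and choosing a cut-off from the cumulative ratio is a common heuristic rather than something the notebook prescribes.

```python
import numpy as np

# Eigenvalues as printed in the notebook output above (copied by hand)
eig_val_doc = np.array([
    4.43782605e+05, 7.31010006e+03, 7.03833742e+02, 5.46487379e+01,
    3.98900178e+01, 3.00458768e+00, 1.81533030e+00, 3.71466740e-01,
    1.55513547e-01, 8.40612196e-02, 3.16089533e-02, 7.49736514e-03,
    3.16165652e-03, 2.16150395e-03, 1.32653879e-03, 6.40269304e-04,
    3.74883320e-04, 2.35169626e-04, 1.84583467e-04, 1.64180064e-04,
    7.81102011e-05, 5.76111660e-05, 3.49172775e-05, 2.83952689e-05,
    1.61463677e-05, 1.24902419e-05, 7.01997261e-07, 3.68048171e-06,
    2.84790425e-06, 2.00491564e-06,
])

# Sort from largest to smallest
sorted_desc = np.sort(eig_val_doc)[::-1]

# Fraction of total variance carried by each component
ratio = sorted_desc / sorted_desc.sum()

# Running total: variance retained when keeping the first k components
cumulative = np.cumsum(ratio)

print("Explained variance ratio (top 5):", ratio[:5])
print("Cumulative ratio after 5 components:", cumulative[4])
```

Because the features are not rescaled before the decomposition, the largest-scale features dominate the variance, which is why the first eigenvalue is so much larger than the rest.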

In [14]:
# Indices of the 5 largest eigenvalues, largest first
# (avoids testing floats for membership with `in`)
top5_idx = np.argsort(eig_val)[::-1][:5]
# Each row of red_eig_vec_T is one of the selected eigenvectors
red_eig_vec_T = eig_vec[:, top5_idx].T
print("Shape of reduced Eigen vectors: ",red_eig_vec_T.shape)

Shape of reduced Eigen vectors:  (5, 30)


In [15]:
print(red_eig_vec_T)

[[ 5.08623202e-03  2.19657026e-03  3.50763298e-02  5.16826469e-01
   4.23694535e-06  4.05260047e-05  8.19399539e-05  4.77807775e-05
   7.07804332e-06 -2.62155251e-06  3.13742507e-04 -6.50984008e-05
   2.23634150e-03  5.57271669e-02 -8.05646029e-07  5.51918197e-06
   8.87094462e-06  3.27915009e-06 -1.24101836e-06 -8.54530832e-08
   7.15473257e-03  3.06736622e-03  4.94576447e-02  8.52063392e-01
   6.42005481e-06  1.01275937e-04  1.68928625e-04  7.36658178e-05
   1.78986262e-05  1.61356159e-06]
 [ 9.28705650e-03 -2.88160658e-03  6.27480827e-02  8.51823720e-01
  -1.48194356e-05 -2.68862249e-06  7.51419574e-05  4.63501038e-05
  -2.52430431e-05 -1.61197148e-05 -5.38692831e-05  3.48370414e-04
   8.19640791e-04  7.51112451e-03  1.49438131e-06  1.27357957e-05
   2.86921009e-05  9.36007477e-06  1.22647432e-05  2.89683790e-07
  -5.68673345e-04 -1.32152605e-02 -1.85961117e-04 -5.19742358e-01
  -7.68565692e-05 -2.56104144e-04 -1.75471479e-04 -3.05051743e-05
  -1.57042845e-04 -5.53071662e-05]
 [-1.2

In [17]:
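# Project the centered data onto the 5 selected eigenvectors
B = red_eig_vec_T.dot(C.T)
# Transpose so that each row is one sample; show the first 5
print(B.T[:5])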

[[ 1.16014257e+03 -2.93917544e+02  4.85783976e+01 -8.71197531e+00
  -3.20004861e+01]
 [ 1.26912244e+03  1.56301818e+01 -3.53945342e+01  1.78612832e+01
   4.33487404e+00]
 [ 9.95793889e+02  3.91567432e+01 -1.70975298e+00  4.19934010e+00
   4.66529118e-01]
 [-4.07180803e+02 -6.73803198e+01  8.67284783e+00 -1.17598673e+01
  -7.11546109e+00]
 [ 9.30341180e+02  1.89340742e+02  1.37480074e+00  8.49918256e+00
  -7.61328922e+00]]
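
The projection can be cross-checked against a singular value decomposition of the centered data, which should give the same scores up to the sign of each component. The sketch below does this on stand-in random data (an assumption, not the breast cancer dataset) with `k = 2` components; all names ending in `_demo`, along with `W`, `B_eig` and `B_svd`, are hypothetical.

```python
import numpy as np

# Stand-in data (assumption): 40 samples, 6 features with different scales
rng = np.random.default_rng(0)
scales = np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.1])
A_demo = rng.normal(size=(40, 6)) * scales
C_demo = A_demo - A_demo.mean(axis=0)
k = 2

# Route 1: eigendecomposition of the covariance matrix
vals, vecs = np.linalg.eigh(np.cov(C_demo.T))
order = np.argsort(vals)[::-1][:k]   # indices of the k largest eigenvalues
W = vecs[:, order].T                 # shape (k, n_features)
B_eig = W.dot(C_demo.T).T            # shape (n_samples, k)

# Route 2: SVD of the centered data; rows of Vt are the principal axes
U, S, Vt = np.linalg.svd(C_demo, full_matrices=False)
B_svd = C_demo @ Vt[:k].T

# Eigenvectors are only defined up to sign, so compare absolute values
same_scores = np.allclose(np.abs(B_eig), np.abs(B_svd))

# Squared singular values over (n - 1) match the covariance eigenvalues
n = C_demo.shape[0]
same_variances = np.allclose(S[:k] ** 2 / (n - 1), vals[order])

print(same_scores, same_variances)
```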
