# Principal Component Analysis - Part II
Author: [Biswajit Sahoo](https://biswajitsahoo1111.github.io/)

<table class="tfo-notebook-buttons" align="center">
  <td align="center">
    <a href="https://colab.research.google.com/github/biswajitsahoo1111/blog_notebooks/blob/master/Principal_Component_Analysis_Part_II.ipynb">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />
    Run Python code in Google Colab</a>
  </td>
  <td align="center">
    <a href="https://www.dropbox.com/s/zkftnkv31neuxgq/Principal_Component_Analysis_Part_II.ipynb?dl=1"><img src="https://www.tensorflow.org/images/download_logo_32px.png" />Download Python code</a>
  </td>
  <td align="center">
    <a href="https://www.dropbox.com/s/7bzat96tt6r9iks/Principal_component_analysis_part_II.Rmd?dl=1"><img src="https://www.tensorflow.org/images/download_logo_32px.png" />Download R code (R Markdown)</a>
  </td>
  <td align="center">
    <a href="https://www.dropbox.com/s/u9gbbviswkfgmsj/pca_part_2_MATLAB_code.pdf?dl=1"><img src="https://www.tensorflow.org/images/download_logo_32px.png" />Download MATLAB code</a>
  </td>
</table>

This post is Part-II of a three-part series on PCA. The other parts of the series can be found at the links below.

* [Part-I: Basic Theory of PCA](https://biswajitsahoo1111.github.io/post/principal-component-analysis-part-i/)
* [Part-III: Reproducing results of a published paper on PCA](https://biswajitsahoo1111.github.io/post/principal-component-analysis-part-iii/)

In this post, we will first apply built-in functions to obtain results and then show how the same results can be obtained without them. Our aim is not to advocate the use of hand-written code. Rather, in our opinion, knowing what happens under the hood when a built-in function is called enhances understanding. In actual applications, readers should always use built-in functions, as they are robust (almost always) and tested for efficiency.

The code in this post is written in Python. Equivalent [MATLAB code](https://github.com/biswajitsahoo1111/PCA/blob/master/pca_part_II_MATLAB_codes.pdf) is also available.

We will use the French food data from reference [2]. Refer to the paper for the original source of the data. We will apply different methods to this data and compare the results. As the dataset is pretty small, one way to load it is to create a dataframe directly from the values in the paper. Another way is to first create a csv file and then read that file. Here we use the former approach.

In [1]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

### Load data

In [2]:
food = pd.DataFrame(data = {"class": ["Blue_collar", "White_collar", "Upper_class", "Blue_collar", "White_collar", "Upper_class",
                                      "Blue_collar", "White_collar", "Upper_class", "Blue_collar", "White_collar", "Upper_class"],
                            "children": np.repeat([2,3,4,5], 3),
                            "bread": [332, 293, 372, 406, 386, 438, 534, 460, 385, 655, 584, 515],
                            "vegetables": [428, 559, 767, 563, 608, 843, 660, 699, 789, 776, 995, 1097],
                            "fruit": [354, 388, 562, 341, 396, 689, 367, 484, 621, 423, 548, 887],
                            "meat": [1437, 1527, 1948, 1507, 1501, 2345, 1620, 1856, 2366, 1848, 2056, 2630],
                            "poultry": [526, 567, 927, 544, 558, 1148, 638, 762, 1149, 759, 893, 1167],
                            "milk": [247, 239, 235, 324, 319, 243, 414, 400, 304, 495, 518, 561],
                            "wine": [427, 258, 433, 407, 363, 341, 407, 416, 282, 486, 319, 284]})
food

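|    | class | children | bread | vegetables | fruit | meat | poultry | milk | wine |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Blue_collar | 2 | 332 | 428 | 354 | 1437 | 526 | 247 | 427 |
| 1 | White_collar | 2 | 293 | 559 | 388 | 1527 | 567 | 239 | 258 |
| 2 | Upper_class | 2 | 372 | 767 | 562 | 1948 | 927 | 235 | 433 |
| 3 | Blue_collar | 3 | 406 | 563 | 341 | 1507 | 544 | 324 | 407 |
| 4 | White_collar | 3 | 386 | 608 | 396 | 1501 | 558 | 319 | 363 |
| 5 | Upper_class | 3 | 438 | 843 | 689 | 2345 | 1148 | 243 | 341 |
| 6 | Blue_collar | 4 | 534 | 660 | 367 | 1620 | 638 | 414 | 407 |
| 7 | White_collar | 4 | 460 | 699 | 484 | 1856 | 762 | 400 | 416 |
| 8 | Upper_class | 4 | 385 | 789 | 621 | 2366 | 1149 | 304 | 282 |
| 9 | Blue_collar | 5 | 655 | 776 | 423 | 1848 | 759 | 495 | 486 |
| 10 | White_collar | 5 | 584 | 995 | 548 | 2056 | 893 | 518 | 319 |
| 11 | Upper_class | 5 | 515 | 1097 | 887 | 2630 | 1167 | 561 | 284 |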


### Centered data matrix

In [3]:
cent_food = food.iloc[:, 2:] - food.iloc[:, 2:].mean()

### Scaled data matrix

In [4]:
scale_food = cent_food / cent_food.std()
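
The following is a minimal sketch, not part of the original notebook, that checks what centering and scaling do. To stay short it uses only the bread and milk columns, and the names `X`, `centered` and `scaled` are introduced here for illustration. One detail worth noting: pandas' `.std()` divides by $n - 1$ by default (`ddof=1`), whereas NumPy's `std` and `var` divide by $n$ unless `ddof=1` is passed.

```python
import numpy as np

# Bread and milk columns of the food table (12 households each).
X = np.column_stack([
    [332, 293, 372, 406, 386, 438, 534, 460, 385, 655, 584, 515],
    [247, 239, 235, 324, 319, 243, 414, 400, 304, 495, 518, 561],
]).astype(float)

centered = X - X.mean(axis=0)
scaled = centered / centered.std(axis=0, ddof=1)   # ddof=1 matches the pandas default

col_means = centered.mean(axis=0)          # zero up to rounding error
col_vars = scaled.var(axis=0, ddof=1)      # unit sample variance
col_vars_ddof0 = scaled.var(axis=0)        # NumPy default ddof=0 gives (n-1)/n = 11/12

print(col_means, col_vars, col_vars_ddof0)
```

Mixing the two conventions is a common source of small mismatches when comparing pandas and NumPy results.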

## Covariance PCA

### Using built-in function

In [5]:
pca_food_cov = PCA().fit(cent_food)

We convert the result to a pandas dataframe to add row names.

In [6]:
pd.DataFrame(data = np.round(pca_food_cov.components_[0:4, :].T, 2),
             index = food.columns[2:],
             columns = ["PC1", "PC2", "PC3", "PC4"])

|    | PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|---|
| bread | 0.07 | 0.58 | -0.4 | 0.11 |
| vegetables | 0.33 | 0.41 | 0.29 | 0.61 |
| fruit | 0.3 | -0.1 | 0.34 | -0.4 |
| meat | 0.75 | -0.11 | -0.07 | -0.29 |
| poultry | 0.47 | -0.24 | -0.38 | 0.33 |
| milk | 0.09 | 0.63 | 0.23 | -0.41 |
| wine | -0.06 | 0.14 | -0.66 | -0.31 |


#### Factor scores

In [7]:
factor_scores_pca_food_cov = pca_food_cov.transform(cent_food)

In [8]:
np.round(factor_scores_pca_food_cov[:, :4], 2)  # We have printed only four PCs out of seven

array([[-635.05, -120.89,  -21.14,  -68.97],
       [-488.56, -142.33,  132.37,   34.91],
       [ 112.03, -139.75,  -61.86,   44.19],
       [-520.01,   12.05,    2.85,  -13.7 ],
       [-485.94,    1.17,   65.75,   11.51],
       [ 588.17, -188.44,  -71.85,   28.56],
       [-333.95,  144.54,  -34.94,   10.07],
       [ -57.51,   42.86,  -26.26,  -46.55],
       [ 571.32, -206.76,  -38.45,    3.69],
       [ -39.38,  264.47, -126.43,  -12.74],
       [ 296.04,  235.92,   58.84,   87.43],
       [ 992.83,   97.15,  121.13,  -78.39]])

#### Variances using built-in function

In [9]:
np.set_printoptions(suppress= True)
np.round(pca_food_cov.explained_variance_, 2)

array([274831.02,  26415.99,   6254.11,   2299.9 ,   2090.2 ,    338.39,
           65.81])

#### Total variance

In [10]:
np.round(np.sum(pca_food_cov.explained_variance_), 2)

312295.43

### Comparison of variance before and after transformation

#### Total variance before transformation

In [11]:
np.round(np.sum(food.iloc[:, 2:].var()), 2)

312295.43

#### Total variance after transformation

In [12]:
np.round(np.sum(np.var(factor_scores_pca_food_cov, axis  = 0, ddof = 1)), 2)

312295.43

It is also instructive to see how the variance of each variable before transformation is redistributed among the principal components. Note that the total variance remains the same in this process, as seen from the code above.
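
Why the total is preserved can be sketched in one line. The symbols are introduced here for illustration: $C$ is the covariance matrix of the centered data and $C = V \Lambda V^T$ is its eigen-decomposition, with $V$ orthogonal and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_7)$. Using the cyclic property of the trace,

$$\sum_{j=1}^{7} \mathrm{Var}(x_j) = \mathrm{tr}(C) = \mathrm{tr}(V \Lambda V^T) = \mathrm{tr}(\Lambda V^T V) = \mathrm{tr}(\Lambda) = \sum_{i=1}^{7} \lambda_i$$

The left-hand side is the sum of the variances of the original variables and the right-hand side is the sum of the variances of the principal components.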

#### Variance along variables before transformation

In [13]:
food.iloc[:, 2:].var()

bread          11480.606061
vegetables     35789.090909
fruit          27255.454545
meat          156618.386364
poultry        62280.515152
milk           13718.750000
wine            5152.628788
dtype: float64

Note that the calculation of variance is unaffected by centering the data matrix. So the variance of the original data matrix and that of the centered data matrix are the same. Check it for yourself. Now let's see how PCA transforms these variances.
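
One way to run the centering check is sketched below, using only the bread column. The variable names are illustrative and not from the original notebook.

```python
import numpy as np

bread = np.array([332, 293, 372, 406, 386, 438, 534, 460, 385, 655, 584, 515], dtype=float)

var_raw = bread.var(ddof=1)                        # sample variance of the raw column
var_centered = (bread - bread.mean()).var(ddof=1)  # same column after subtracting its mean

print(np.isclose(var_raw, var_centered))
```

Both values agree with the `bread` entry in the variance listing above, because subtracting a constant from every observation does not change the deviations from the mean.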

#### Variance along principal components

In [14]:
pd.DataFrame(data= np.round(np.var(factor_scores_pca_food_cov, axis = 0, ddof = 1), 2).reshape(1, -1),
             columns = ["PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7"])

|    | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 |
|---|---|---|---|---|---|---|---|
| 0 | 274831.02 | 26415.99 | 6254.11 | 2299.9 | 2090.2 | 338.39 | 65.81 |


We can obtain the same result using the built-in function.

In [15]:
np.round(pca_food_cov.explained_variance_, 2)

array([274831.02,  26415.99,   6254.11,   2299.9 ,   2090.2 ,    338.39,
           65.81])

## Performing covariance PCA manually using SVD

SVD of a matrix $A$ is defined as

$$A = U\Sigma V^T$$

$U$, $\Sigma$ and $V$ are the factors produced by SVD. But oddly (and sadly), `numpy` does not return these three matrices. It returns $U$, the singular values (the diagonal of $\Sigma$, as a 1-D array) and $V^T$. So to get the matrix $V$, we have to transpose the third result.

In [16]:
U, S, V_transpose = np.linalg.svd(cent_food)
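
The shapes returned by `np.linalg.svd` can be checked on a tiny matrix. This is a sketch with an arbitrary $3 \times 2$ example matrix `A`, not the food data. It passes `full_matrices=False` so that the factors multiply back together directly. The cell above uses the default `full_matrices=True`, for which $U$ is $12 \times 12$.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])   # arbitrary example matrix

# Economy-size SVD: U is 3x2, s holds 2 singular values, Vt is 2x2.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

Sigma = np.diag(s)    # s is a 1-D array, so the diagonal matrix is built explicitly
V = Vt.T              # columns of V are the right singular vectors
A_rebuilt = U @ Sigma @ Vt

print(s.shape, np.allclose(A, A_rebuilt))
```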

#### Loading scores

In [17]:
np.round(V_transpose[:4, :].T, 2)  

array([[ 0.07, -0.58, -0.4 ,  0.11],
       [ 0.33, -0.41,  0.29,  0.61],
       [ 0.3 ,  0.1 ,  0.34, -0.4 ],
       [ 0.75,  0.11, -0.07, -0.29],
       [ 0.47,  0.24, -0.38,  0.33],
       [ 0.09, -0.63,  0.23, -0.41],
       [-0.06, -0.14, -0.66, -0.31]])

#### Factor scores

In [18]:
np.round(cent_food.values @ V_transpose.T[:, :4], 2)

array([[-635.05,  120.89,  -21.14,  -68.97],
       [-488.56,  142.33,  132.37,   34.91],
       [ 112.03,  139.75,  -61.86,   44.19],
       [-520.01,  -12.05,    2.85,  -13.7 ],
       [-485.94,   -1.17,   65.75,   11.51],
       [ 588.17,  188.44,  -71.85,   28.56],
       [-333.95, -144.54,  -34.94,   10.07],
       [ -57.51,  -42.86,  -26.26,  -46.55],
       [ 571.32,  206.76,  -38.45,    3.69],
       [ -39.38, -264.47, -126.43,  -12.74],
       [ 296.04, -235.92,   58.84,   87.43],
       [ 992.83,  -97.15,  121.13,  -78.39]])

#### Variance of principal components

In [19]:
np.round(S ** 2 / 11, 2)

array([274831.02,  26415.99,   6254.11,   2299.9 ,   2090.2 ,    338.39,
           65.81])

Our data matrix contains 12 data points. So to find the variances of the principal components we have to divide the squared singular values by 11, that is, by $n - 1$. For the theory behind this, refer to [Part-I](https://biswajitsahoo1111.github.io/post/principal-component-analysis-part-i/).
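
A short sketch of where the 11 comes from. The notation is introduced here for illustration (Part-I gives the full treatment): $X$ is the centered $n \times p$ data matrix with $n = 12$ and $p = 7$, its SVD is $X = U \Sigma V^T$, and $\sigma_i$ are the singular values. Since $U^T U = I$,

$$C = \frac{1}{n-1} X^T X = \frac{1}{n-1} V \Sigma^T U^T U \Sigma V^T = V \left( \frac{\Sigma^T \Sigma}{n-1} \right) V^T$$

Because $\Sigma^T \Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_p^2)$, this is an eigen-decomposition of the covariance matrix $C$ with eigenvalues $\lambda_i = \sigma_i^2 / (n-1) = \sigma_i^2 / 11$, which is what `S ** 2 / 11` computes.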

## Performing covariance PCA using Eigen-decomposition (Not recommended)

This procedure is not recommended for two reasons. Forming the covariance matrix is computationally inefficient for large matrices, and squaring the data can lose numerical precision when the data matrix contains very small entries. So doing eigen-analysis on the covariance matrix may give erroneous results. However, for our small example we can use it to obtain results.
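
The precision issue can be illustrated with a standard textbook construction (a Läuchli-type matrix). This is a sketch unrelated to the food data. It forms the cross-product $X^T X$, which is the covariance matrix without centering or the $1/(n-1)$ factor. The value `eps = 1e-8` is chosen so that `eps**2` falls below double-precision resolution relative to 1.

```python
import numpy as np

eps = 1e-8
# Exact singular values of X are sqrt(2 + eps**2) and eps.
X = np.array([[1.0, 1.0],
              [eps, 0.0],
              [0.0, eps]])

s = np.linalg.svd(X, compute_uv=False)   # SVD works on X directly
gram = X.T @ X                           # 1 + eps**2 rounds to exactly 1.0 in float64
eigvals = np.linalg.eigvalsh(gram)       # eigenvalues in ascending order

print("smallest singular value from SVD:", s[-1])
print("X^T X as stored:", gram.tolist())
print("smallest eigenvalue of X^T X:", eigvals[0])
```

The SVD recovers the small singular value of about `1e-8`. The stored cross-product matrix has every entry equal to 1, so the information about `eps` is lost before the eigen-analysis even starts.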

In [20]:
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(cent_food, rowvar = False, ddof = 1)) # Eigenvalues are in ascending order
eigenvalues, eigenvectors = eigenvalues[::-1], eigenvectors[:, ::-1]  # Reverse so that eigenvalues are in descending order

#### Loading scores

In [21]:
np.round(eigenvectors[:, :4], 2)

array([[-0.07,  0.58,  0.4 ,  0.11],
       [-0.33,  0.41, -0.29,  0.61],
       [-0.3 , -0.1 , -0.34, -0.4 ],
       [-0.75, -0.11,  0.07, -0.29],
       [-0.47, -0.24,  0.38,  0.33],
       [-0.09,  0.63, -0.23, -0.41],
       [ 0.06,  0.14,  0.66, -0.31]])

#### Factor scores

In [22]:
np.round(cent_food.values @ eigenvectors[:, :4], 2)    # We have printed only first 4 principal components

array([[ 635.05, -120.89,   21.14,  -68.97],
       [ 488.56, -142.33, -132.37,   34.91],
       [-112.03, -139.75,   61.86,   44.19],
       [ 520.01,   12.05,   -2.85,  -13.7 ],
       [ 485.94,    1.17,  -65.75,   11.51],
       [-588.17, -188.44,   71.85,   28.56],
       [ 333.95,  144.54,   34.94,   10.07],
       [  57.51,   42.86,   26.26,  -46.55],
       [-571.32, -206.76,   38.45,    3.69],
       [  39.38,  264.47,  126.43,  -12.74],
       [-296.04,  235.92,  -58.84,   87.43],
       [-992.83,   97.15, -121.13,  -78.39]])
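
The loadings and factor scores printed by the three routes (built-in, SVD, eigen-decomposition) differ only in the sign of some columns. This is expected: if $v$ is an eigenvector, so is $-v$, and each routine is free to return either one. The sketch below is not from the original notebook and its helper names are illustrative. It aligns the signs of the eigen-decomposition loadings with the SVD loadings and checks that the two then agree.

```python
import numpy as np

# Numeric columns of the food table (rows = 12 households).
food_values = np.column_stack([
    [332, 293, 372, 406, 386, 438, 534, 460, 385, 655, 584, 515],               # bread
    [428, 559, 767, 563, 608, 843, 660, 699, 789, 776, 995, 1097],              # vegetables
    [354, 388, 562, 341, 396, 689, 367, 484, 621, 423, 548, 887],               # fruit
    [1437, 1527, 1948, 1507, 1501, 2345, 1620, 1856, 2366, 1848, 2056, 2630],   # meat
    [526, 567, 927, 544, 558, 1148, 638, 762, 1149, 759, 893, 1167],            # poultry
    [247, 239, 235, 324, 319, 243, 414, 400, 304, 495, 518, 561],               # milk
    [427, 258, 433, 407, 363, 341, 407, 416, 282, 486, 319, 284],               # wine
]).astype(float)

centered = food_values - food_values.mean(axis=0)
n = centered.shape[0]

# Route 1: SVD of the centered data matrix.
_, s, Vt = np.linalg.svd(centered, full_matrices=False)
loadings_svd = Vt.T

# Route 2: eigen-decomposition of the covariance matrix, reordered to descending.
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False, ddof=1))
eigvals, loadings_eig = eigvals[::-1], eigvecs[:, ::-1]

# Column-wise dot products are +1 or -1 (up to rounding); use them to flip columns.
signs = np.sign(np.sum(loadings_svd * loadings_eig, axis=0))
loadings_eig_aligned = loadings_eig * signs

same_loadings = np.allclose(loadings_svd, loadings_eig_aligned, atol=1e-6)
same_scores = np.allclose(centered @ loadings_svd, centered @ loadings_eig_aligned, atol=1e-6)
same_variances = np.allclose(s**2 / (n - 1), eigvals)
print(same_loadings, same_scores, same_variances)
```

The variances are unaffected by the sign choice, which is why all three routes print identical values for them.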

#### Variance along principal components

In [23]:
np.round(eigenvalues, 2)

array([274831.02,  26415.99,   6254.11,   2299.9 ,   2090.2 ,    338.39,
           65.81])

Instead of using `np.cov()` to compute the covariance matrix, we can also compute it manually and then perform its eigen-analysis.

In [24]:
cov_matrix_manual_food = (1/11) * cent_food.T @ cent_food

In [25]:
eig_values_new, eig_vectors_new = np.linalg.eigh(cov_matrix_manual_food)  # Eigenvalues are in ascending order
eig_values_new, eig_vectors_new = eig_values_new[::-1], eig_vectors_new[:, ::-1]  # Reverse to descending order

#### Loading scores

In [26]:
np.round(eig_vectors_new[:, :4], 2)

array([[-0.07,  0.58,  0.4 ,  0.11],
       [-0.33,  0.41, -0.29,  0.61],
       [-0.3 , -0.1 , -0.34, -0.4 ],
       [-0.75, -0.11,  0.07, -0.29],
       [-0.47, -0.24,  0.38,  0.33],
       [-0.09,  0.63, -0.23, -0.41],
       [ 0.06,  0.14,  0.66, -0.31]])

#### Variance along principal components

In [27]:
np.round(eig_values_new, 2)

array([274831.02,  26415.99,   6254.11,   2299.9 ,   2090.2 ,    338.39,
           65.81])

## Correlation PCA

When PCA is performed on a scaled data matrix (each variable is centered and has unit variance), it is called correlation PCA. Before discussing correlation PCA, we will take some time to see different ways in which we can obtain the correlation matrix.

### Different ways to obtain correlation matrix

#### Using built-in command

In [28]:
np.round(np.corrcoef(food.iloc[:, 2:], rowvar = False)[:, :4], 2)

array([[ 1.  ,  0.59,  0.2 ,  0.32],
       [ 0.59,  1.  ,  0.86,  0.88],
       [ 0.2 ,  0.86,  1.  ,  0.96],
       [ 0.32,  0.88,  0.96,  1.  ],
       [ 0.25,  0.83,  0.93,  0.98],
       [ 0.86,  0.66,  0.33,  0.37],
       [ 0.3 , -0.36, -0.49, -0.44]])

#### Compute manually

In [29]:
np.round((1/11) * scale_food.T @ scale_food, 2)

|    | bread | vegetables | fruit | meat | poultry | milk | wine |
|---|---|---|---|---|---|---|---|
| bread | 1.0 | 0.59 | 0.2 | 0.32 | 0.25 | 0.86 | 0.3 |
| vegetables | 0.59 | 1.0 | 0.86 | 0.88 | 0.83 | 0.66 | -0.36 |
| fruit | 0.2 | 0.86 | 1.0 | 0.96 | 0.93 | 0.33 | -0.49 |
| meat | 0.32 | 0.88 | 0.96 | 1.0 | 0.98 | 0.37 | -0.44 |
| poultry | 0.25 | 0.83 | 0.93 | 0.98 | 1.0 | 0.23 | -0.4 |
| milk | 0.86 | 0.66 | 0.33 | 0.37 | 0.23 | 1.0 | 0.01 |
| wine | 0.3 | -0.36 | -0.49 | -0.44 | -0.4 | 0.01 | 1.0 |


### Performing correlation PCA using built-in function 

In [30]:
pca_food_cor = PCA().fit(scale_food)

#### Loading scores

In [31]:
np.round(pca_food_cor.components_.T[:, :4], 2)

array([[ 0.24,  0.62, -0.01, -0.54],
       [ 0.47,  0.1 , -0.06, -0.02],
       [ 0.45, -0.21,  0.15,  0.55],
       [ 0.46, -0.14,  0.21, -0.05],
       [ 0.44, -0.2 ,  0.36, -0.32],
       [ 0.28,  0.52, -0.44,  0.45],
       [-0.21,  0.48,  0.78,  0.31]])

#### Factor scores

In [32]:
factor_scores_pca_food_cor = pca_food_cor.transform(scale_food)
np.round(factor_scores_pca_food_cor[:, :4], 2)

array([[-2.86, -0.36,  0.4 ,  0.36],
       [-1.89, -1.79, -1.31, -0.16],
       [-0.12, -0.73,  1.42,  0.2 ],
       [-2.04,  0.32, -0.11,  0.1 ],
       [-1.69, -0.16, -0.51,  0.16],
       [ 1.69, -1.35,  0.99, -0.43],
       [-0.93,  1.37, -0.28, -0.26],
       [-0.25,  0.63,  0.27,  0.29],
       [ 1.6 , -1.74,  0.1 , -0.4 ],
       [ 0.22,  2.78,  0.57, -0.25],
       [ 1.95,  1.13, -0.99, -0.32],
       [ 4.32, -0.1 , -0.57,  0.72]])

#### Variances along principal components

In [33]:
np.round(pca_food_cor.explained_variance_, 2)

array([4.33, 1.83, 0.63, 0.13, 0.06, 0.02, 0.  ])

#### Sum of variances

In [34]:
np.sum(pca_food_cor.explained_variance_)

7.0

### Comparison of variance before and after transformation

#### Total variance before transformation

In [35]:
np.sum(scale_food.var())

7.0

#### Total variance after transformation

In [36]:
np.sum(np.var(factor_scores_pca_food_cor, axis = 0, ddof = 1))

6.999999999999999

Another way to achieve the same result is given below.

In [37]:
np.sum(np.diag(np.cov(factor_scores_pca_food_cor, rowvar = False, ddof = 1)))

7.0

It is also instructive to see how the variance of each variable before transformation is redistributed among the principal components. Note that the total variance remains the same in this process, as seen from the code above.

#### Variance along variables before transformation

In [38]:
scale_food.var()

bread         1.0
vegetables    1.0
fruit         1.0
meat          1.0
poultry       1.0
milk          1.0
wine          1.0
dtype: float64

This is expected, as we have scaled the matrix. Now see how PCA transforms these variances.

#### Variance along principal components

In [39]:
pd.DataFrame(data= np.round(np.var(factor_scores_pca_food_cor, axis = 0, ddof = 1), 2).reshape(1, -1),
             columns = ["PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7"])

|    | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 |
|---|---|---|---|---|---|---|---|
| 0 | 4.33 | 1.83 | 0.63 | 0.13 | 0.06 | 0.02 | 0.0 |


We can obtain the same result using the built-in function.

In [40]:
np.round(pca_food_cor.explained_variance_, 2)

array([4.33, 1.83, 0.63, 0.13, 0.06, 0.02, 0.  ])

## Performing correlation PCA manually using SVD

In [41]:
U_cor, S_cor, V_cor_transpose = np.linalg.svd(scale_food)

#### Loading scores

In [42]:
np.round(V_cor_transpose.T[:, :4], 2)

array([[ 0.24, -0.62,  0.01, -0.54],
       [ 0.47, -0.1 ,  0.06, -0.02],
       [ 0.45,  0.21, -0.15,  0.55],
       [ 0.46,  0.14, -0.21, -0.05],
       [ 0.44,  0.2 , -0.36, -0.32],
       [ 0.28, -0.52,  0.44,  0.45],
       [-0.21, -0.48, -0.78,  0.31]])

#### Factor scores

In [43]:
np.round(scale_food.values @ V_cor_transpose.T[:, :4], 2)

array([[-2.86,  0.36, -0.4 ,  0.36],
       [-1.89,  1.79,  1.31, -0.16],
       [-0.12,  0.73, -1.42,  0.2 ],
       [-2.04, -0.32,  0.11,  0.1 ],
       [-1.69,  0.16,  0.51,  0.16],
       [ 1.69,  1.35, -0.99, -0.43],
       [-0.93, -1.37,  0.28, -0.26],
       [-0.25, -0.63, -0.27,  0.29],
       [ 1.6 ,  1.74, -0.1 , -0.4 ],
       [ 0.22, -2.78, -0.57, -0.25],
       [ 1.95, -1.13,  0.99, -0.32],
       [ 4.32,  0.1 ,  0.57,  0.72]])

#### Variance along each principal component

In [44]:
np.round(S_cor ** 2 /11, 2)

array([4.33, 1.83, 0.63, 0.13, 0.06, 0.02, 0.  ])

#### Sum of variances

In [45]:
np.sum(S_cor ** 2 /11)

6.999999999999997

Again we have to divide by 11 to get the eigenvalues of the correlation matrix. Check the formulation of the correlation matrix in terms of the scaled data matrix to convince yourself.
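
One way to carry out that check is sketched below. It is self-contained and uses NumPy only, and the variable names are illustrative rather than taken from the cells above. It builds the correlation matrix from the scaled data, compares it with `np.corrcoef`, and confirms that the squared singular values divided by $n - 1$ match the eigenvalues of the correlation matrix.

```python
import numpy as np

# Numeric columns of the food table (rows = 12 households).
food_values = np.column_stack([
    [332, 293, 372, 406, 386, 438, 534, 460, 385, 655, 584, 515],               # bread
    [428, 559, 767, 563, 608, 843, 660, 699, 789, 776, 995, 1097],              # vegetables
    [354, 388, 562, 341, 396, 689, 367, 484, 621, 423, 548, 887],               # fruit
    [1437, 1527, 1948, 1507, 1501, 2345, 1620, 1856, 2366, 1848, 2056, 2630],   # meat
    [526, 567, 927, 544, 558, 1148, 638, 762, 1149, 759, 893, 1167],            # poultry
    [247, 239, 235, 324, 319, 243, 414, 400, 304, 495, 518, 561],               # milk
    [427, 258, 433, 407, 363, 341, 407, 416, 282, 486, 319, 284],               # wine
]).astype(float)

centered = food_values - food_values.mean(axis=0)
scaled = centered / centered.std(axis=0, ddof=1)
n, p = scaled.shape

# Correlation matrix written in terms of the scaled data matrix.
corr_manual = scaled.T @ scaled / (n - 1)
corr_builtin = np.corrcoef(food_values, rowvar=False)

# PC variances from the SVD route and from the eigenvalue route.
s = np.linalg.svd(scaled, compute_uv=False)
variances_svd = s**2 / (n - 1)
variances_eig = np.linalg.eigvalsh(corr_builtin)[::-1]   # descending order

print(np.allclose(corr_manual, corr_builtin))
print(np.allclose(variances_svd, variances_eig))
print(variances_svd.sum())   # equals the number of variables, up to rounding
```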

## Using eigen-decomposition (Not recommended)

In [46]:
eig_values_cor, eig_vectors_cor = np.linalg.eigh(np.corrcoef(scale_food, rowvar = False))  # Eigenvalues are in ascending order
eig_values_cor, eig_vectors_cor = eig_values_cor[::-1], eig_vectors_cor[:, ::-1]  # Reverse to descending order

#### Loading scores

In [47]:
np.round(eig_vectors_cor[:, :4], 2)

array([[-0.24,  0.62,  0.01, -0.54],
       [-0.47,  0.1 ,  0.06, -0.02],
       [-0.45, -0.21, -0.15,  0.55],
       [-0.46, -0.14, -0.21, -0.05],
       [-0.44, -0.2 , -0.36, -0.32],
       [-0.28,  0.52,  0.44,  0.45],
       [ 0.21,  0.48, -0.78,  0.31]])

#### Factor scores

In [48]:
np.round(scale_food.values @ eig_vectors_cor[:, :4], 2)

array([[ 2.86, -0.36, -0.4 ,  0.36],
       [ 1.89, -1.79,  1.31, -0.16],
       [ 0.12, -0.73, -1.42,  0.2 ],
       [ 2.04,  0.32,  0.11,  0.1 ],
       [ 1.69, -0.16,  0.51,  0.16],
       [-1.69, -1.35, -0.99, -0.43],
       [ 0.93,  1.37,  0.28, -0.26],
       [ 0.25,  0.63, -0.27,  0.29],
       [-1.6 , -1.74, -0.1 , -0.4 ],
       [-0.22,  2.78, -0.57, -0.25],
       [-1.95,  1.13,  0.99, -0.32],
       [-4.32, -0.1 ,  0.57,  0.72]])

#### Variance along each principal component

In [49]:
np.round(eig_values_cor, 2)

array([4.33, 1.83, 0.63, 0.13, 0.06, 0.02, 0.  ])

I hope this post helps clear up some of the confusion that a beginner might have when encountering PCA for the first time. Please send me a note if you find any errors.

## References

1. Jolliffe, I. T. (2002). Principal component analysis (2nd ed.). Springer, New York.
2. Abdi, H., & Williams, L. J. (2010). Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2(4), 433-459.