# Computer Assignment (CA) No. 12: Principal Components Analysis

## Problem Statement 
For this assignment, we will use one of the data sets available in MATLAB known as the “Cities data set.” Instructions for how to load this data set can be found here: http://www.mathworks.com/help/stats/feature-transformation.html. We will focus on processing the matrix “ratings”. The vector “categories” tells you what each column of the matrix represents.

1. Compute the covariance matrix for the raw data (contained in the data matrix “ratings”).
2. Do an eigenvalue/eigenvector analysis of the covariance matrix, and display both these matrices. Discuss what you observe.
3. Compute the whitening transform we discussed in class, and compute a matrix of transformed data.
4. Demonstrate that the covariance of the transformed data is an identity matrix by computing the covariance of the transformed data matrix.
5. Plot the percent of the variance accounted for by the first n eigenvalues for n = [1,9]. Make sure the eigenvalues are sorted from largest to smallest.
6. Analyze the eigenvectors corresponding to the two largest eigenvalues, and describe which variables in the ratings matrix (e.g., “climate”) are most influential in explaining the variance of the data. Hint: look for eigenvector coefficients that have the largest positive or negative values.
7. Using the original data, find the three most similar cities using the following algorithm:

> (1) Find the two cities that are the closest together using a Euclidean distance between their ratings.

> (2) Find the third city that is the closest to the average of the ratings vectors for the two cities found in (1).

Does your result make sense?
8. Repeat (7) using the decorrelated, or transformed data. Did your ranking of the top three cities change? Does your new ranking make more sense? Explain your findings.
9. Repeat (8) using only the first 3 features from the transformed feature matrix (the first three columns in the transformed feature matrix). Explain any differences you observe from the result in (7) and (8).

Principal Components Analysis is a very useful tool for understanding correlation in data and understanding how various random variables are related.
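
The search described in step 7 can be sketched as follows. This is only an illustrative sketch: the helper name `three_most_similar` and the small `toy` array are hypothetical stand-ins, not part of the assignment or the Cities data set, and ties are broken by taking the first minimum found.

```python
import numpy as np

def three_most_similar(data):
    """Return (i, j, k): the closest pair of rows i, j and the row k nearest their mean."""
    # Pairwise Euclidean distances between all rows.
    diff = data[:, None, :] - data[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # A row is trivially closest to itself, so exclude the diagonal.
    np.fill_diagonal(dist, np.inf)
    i, j = np.unravel_index(np.argmin(dist), dist.shape)

    # Average of the two closest rows, then the nearest remaining row.
    centre = (data[i] + data[j]) / 2.0
    d = np.sqrt(((data - centre) ** 2).sum(axis=1))
    d[[i, j]] = np.inf
    k = int(np.argmin(d))
    return int(i), int(j), k

# Hypothetical toy data: five "cities" with two "ratings" each.
toy = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 1.0], [1.5, 0.5], [10.0, 8.0]])
print(three_most_similar(toy))
```

Because the distance is Euclidean on the raw values, columns with large numeric ranges dominate the result, which is one reason steps 8 and 9 repeat the search on transformed data.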

## Approach and Results
The required Python libraries are loaded first. The cities.mat dataset was saved to an accessible location and loaded into Python via SciPy's io module.

In [6]:
%matplotlib inline 
import matplotlib.pyplot as plt 
import numpy as np 
from scipy.io import loadmat

cities = loadmat('cities.mat')
a = cities['ratings']
print cities['categories']
print a.shape

[u'climate       ' u'housing       ' u'health        ' u'crime         '
 u'transportation' u'education     ' u'arts          ' u'recreation    '
 u'economics     ']
(329, 9)


### Tasks 1 & 2 

$ X = \left[ \begin{array}{c}
            X_1 \\
            \vdots \\
            X_n
       \end{array} \right]$
       
The entries of the covariance matrix are

$$ \sigma_{i, j} = cov(X_i, X_j) = E[(X_i-\mu_i)(X_j - \mu_j)], $$

or, in matrix form,

$$ \Sigma = E[(X-E[X])(X-E[X])^T] $$
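
As a check on the definition, the sketch below computes the sample covariance by hand and compares it with `np.cov`. The array `X` here is synthetic data with the same shape as the ratings matrix (an assumption made so the example runs without cities.mat), and the sample estimate divides by N - 1, which is NumPy's default.

```python
import numpy as np

# Synthetic stand-in for the ratings matrix: 329 observations (rows), 9 variables (columns).
rng = np.random.RandomState(0)
X = rng.normal(size=(329, 9)) * np.arange(1, 10)

# Covariance from the definition: centre each column, then average the outer products.
mu = X.mean(axis=0)
Xc = X - mu
cov_manual = Xc.T.dot(Xc) / (X.shape[0] - 1)

# np.cov treats rows as variables by default, so rowvar=False is needed
# when the variables are stored in columns.
cov_numpy = np.cov(X, rowvar=False)

print(cov_manual.shape)
print(np.allclose(cov_manual, cov_numpy))
```

Calling `np.cov(X)` without `rowvar=False` on data laid out this way would instead return a 329 x 329 matrix relating observations to each other.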

The eigenvalue analysis uses `np.linalg.eig`, whose documentation reads:

Compute the eigenvalues and right eigenvectors of a square array.

Parameters
----------
a : (..., M, M) array
>Matrices for which the eigenvalues and right eigenvectors will
be computed

Returns
-------
w : (..., M) array
>The eigenvalues, each repeated according to its multiplicity.
The eigenvalues are not necessarily ordered. The resulting
array will always be of complex type. When `a` is real
the resulting eigenvalues will be real (0 imaginary part) or
occur in conjugate pairs

v : (..., M, M) array
>The normalized (unit "length") eigenvectors, such that the
>column ``v[:,i]`` is the eigenvector corresponding to the
>eigenvalue ``w[i]``.
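
Since the eigenvalues are not returned in any guaranteed order, they need to be sorted before computing the percent of variance explained. The sketch below shows one way to do this on a small synthetic covariance matrix (the data and variable names are illustrative assumptions). It uses `np.linalg.eigh` rather than `np.linalg.eig`, because a covariance matrix is symmetric and `eigh` returns real eigenvalues for symmetric input.

```python
import numpy as np

# Synthetic data with four columns of clearly different variance.
rng = np.random.RandomState(1)
X = rng.normal(size=(200, 4)) * np.array([5.0, 3.0, 1.0, 0.5])
C = np.cov(X, rowvar=False)

# eigh is intended for symmetric matrices and returns real eigenvalues.
w, v = np.linalg.eigh(C)

# Sort from largest to smallest eigenvalue, reordering the eigenvector columns to match.
order = np.argsort(w)[::-1]
w = w[order]
v = v[:, order]

# Sanity check: C v_i = w_i v_i for every column v_i.
residual = np.abs(C.dot(v) - v * w).max()

# Cumulative percent of the total variance captured by the first n eigenvalues.
percent = 100.0 * np.cumsum(w) / np.sum(w)
print(percent)
```

The entries of column `v[:, 0]` with the largest magnitude indicate which original variables contribute most to the direction of greatest variance.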


In [66]:
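# Note: np.cov treats each row as a variable by default, so for the (329, 9)
# ratings matrix this is a 329 x 329 matrix; np.cov(a, rowvar=False) would
# give the 9 x 9 covariance between the rating categories.
M = np.cov(a)
print M
eigen_values, eigen_vectors = np.linalg.eig(M)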


[[ 7111754.19444444  4423830.5         6048096.41666667 ...,
   6168293.45833333  4409078.19444444  6033722.375     ]
 [ 4423830.5         6132584.75        4357107.875      ...,  5167630.125
   4615377.25        4610387.        ]
 [ 6048096.41666667  4357107.875       6062468.         ...,  6453982.5
   4529283.79166667  6233986.875     ]
 ..., 
 [ 6168293.45833333  5167630.125       6453982.5        ...,  7107146.
   5125567.58333333  6672879.25      ]
 [ 4409078.19444444  4615377.25        4529283.79166667 ...,
   5125567.58333333  4086310.44444444  4737195.375     ]
 [ 6033722.375       4610387.          6233986.875      ...,  6672879.25
   4737195.375       6521358.75      ]]


### Task 3
Compute the whitening transformation: multiply the data by $M^{-1/2}$, where $M = cov(X)$ and

$$ M^{-1/2} = (M^{-1})^{1/2} $$
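
The effect of multiplying by $M^{-1/2}$ can be illustrated on a small example. The sketch below is a minimal version under stated assumptions: the data are synthetic, the mixing matrix `A` is made up for illustration, and $M^{-1/2}$ is built from the eigen-decomposition of the symmetric matrix $M$ (one of several ways to obtain it), which requires all eigenvalues to be strictly positive.

```python
import numpy as np

# Synthetic correlated data: independent columns mixed by a fixed, invertible matrix.
rng = np.random.RandomState(2)
A = np.array([[2.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 0.5, 0.0],
              [0.0, 0.0, 1.0, 0.3],
              [0.0, 0.0, 0.0, 0.5]])
X = rng.normal(size=(500, 4)).dot(A)

# Covariance between the columns (variables).
M = np.cov(X, rowvar=False)

# M = V diag(w) V^T, so M^{-1/2} = V diag(w^{-1/2}) V^T.
w, v = np.linalg.eigh(M)
M_inv_sqrt = (v / np.sqrt(w)).dot(v.T)

# Whiten the centred data and check that its covariance is the identity.
Xw = (X - X.mean(axis=0)).dot(M_inv_sqrt)
cov_white = np.cov(Xw, rowvar=False)
print(np.round(cov_white, 6))
```

If $M$ has zero or near-zero eigenvalues, the inverse square root is not well defined and the result can contain very large or complex values.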

In [81]:
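import scipy.linalg  # a bare `import scipy` may not load the linalg submodule
# Overwrite M with the inverse of its matrix square root, i.e. M^{-1/2}
M = np.linalg.inv(scipy.linalg.sqrtm(M))
# Apply the whitening transform to the data
W = a.T.dot(M)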


array([[  1.36425589e+04 +4.37590343e-15j,
          1.43108978e+05 +7.55534080e+01j,
          2.86716758e+04 +6.25915220e+00j,
          7.18005555e+03 +4.08128820e+00j,
          1.99440174e+04 +3.55051397e+01j,
          3.50418171e+03 +2.20016918e+01j,
          1.51994701e+05 +2.52773364e+01j,
          2.36308859e+04 +1.26430530e+01j,
         -8.84136577e+03 +4.82257287e+01j],
       [  1.43108978e+05 -7.55534080e+01j,
          6.73475843e+06 -2.77284970e-15j,
          1.29072967e+06 -6.92973168e+01j,
          1.51858821e+05 -7.14728086e+01j,
          1.15397571e+06 -4.00459200e+01j,
          2.17119399e+05 -5.35546507e+01j,
          5.87963652e+06 -5.02894862e+01j,
          9.76705941e+05 -6.29104443e+01j,
          8.80102148e+05 -2.73380349e+01j],
       [  2.86716758e+04 -6.25915220e+00j,
          1.29072967e+06 +6.92973168e+01j,
          1.16944968e+06 +2.14895852e-14j,
          1.26582991e+05 -2.17781581e+00j,
          8.06525906e+05 +2.92472167e+01j,
         