# Principal component analysis

[Remember that we can rotate a dataset into a new coordinate space](RotateMahal.ipynb). This is the core idea behind PCA, which rotates a high-dimensional dataset into a space where the maximum amount of variance is captured in the smallest number of dimensions. Realistic datasets often have very high dimensionality, but classical data mining methods can perform poorly in these conditions. This is known as the "curse of dimensionality". Furthermore, high-dimensional data is difficult for humans to visualize.

In [11]:
data <- iris
data <- data[,-5]

print(head(data))

  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
4          4.6         3.1          1.5         0.2
5          5.0         3.6          1.4         0.2
6          5.4         3.9          1.7         0.4


We want to reduce this 4-dimensional dataset into something more manageable. The classical method to achieve this is eigendecomposition, which requires us to first mean-center our data. The arithmetic mean of each dimension is subtracted from the data, which is a common step in data standardization.

In [16]:
means <- colMeans(data)
centered <- t( t(data) - means )

print(means)
print(head(centered))

Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
    5.843333     3.057333     3.758000     1.199333 
     Sepal.Length Sepal.Width Petal.Length Petal.Width
[1,]   -0.7433333  0.44266667       -2.358  -0.9993333
[2,]   -0.9433333 -0.05733333       -2.358  -0.9993333
[3,]   -1.1433333  0.14266667       -2.458  -0.9993333
[4,]   -1.2433333  0.04266667       -2.258  -0.9993333
[5,]   -0.8433333  0.54266667       -2.358  -0.9993333
[6,]   -0.4433333  0.84266667       -2.058  -0.7993333


The resulting set has an arithmetic mean of zero in every column (minor floating-point strangeness may happen when testing this). Note that a double transpose was applied because R recycles a vector down the columns of a matrix when subtracting: we transpose so that each observation becomes a column, subtract the means, and then transpose back to our original column-oriented dataset (yes, each operation will copy the entire dataset in memory).
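
The centering step can also be sketched without the double transpose. The following is a minimal alternative, assuming base R's `sweep` and `scale`; the names `centered_sweep` and `centered_scale` are illustrative and are not part of the original notebook.

```r
data <- iris[, -5]
means <- colMeans(data)

# Original approach: transpose, subtract (the vector recycles down columns), transpose back
centered <- t(t(data) - means)

# Alternative 1: sweep() subtracts means[j] from column j directly
centered_sweep <- sweep(as.matrix(data), 2, means, "-")

# Alternative 2: scale() centers each column; scale = FALSE skips division by the standard deviation
centered_scale <- scale(data, center = TRUE, scale = FALSE)

# All three results should agree up to floating-point error
print(all.equal(centered, centered_sweep, check.attributes = FALSE))
print(all.equal(centered, centered_scale, check.attributes = FALSE))

# Column means of the centered data are zero up to floating-point error
print(max(abs(colMeans(centered))))
```

Comparing with `all.equal` rather than `==` tolerates the small floating-point differences mentioned above.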

Subsequently, we can calculate the variance-covariance matrix of the newly centered data:

In [18]:
c <- cov(centered)
print(c)

             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    0.6856935  -0.0424340    1.2743154   0.5162707
Sepal.Width    -0.0424340   0.1899794   -0.3296564  -0.1216394
Petal.Length    1.2743154  -0.3296564    3.1162779   1.2956094
Petal.Width     0.5162707  -0.1216394    1.2956094   0.5810063
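
Because the data is already mean-centered, the same matrix can be obtained as a matrix product. This is a minimal sketch, relying on the fact that `cov` divides by n - 1 by default; `manual_cov` is an illustrative name that does not appear in the original notebook.

```r
data <- iris[, -5]
centered <- t(t(data) - colMeans(data))
n <- nrow(centered)

# crossprod(X) computes t(X) %*% X; dividing by n - 1 gives the sample covariance
manual_cov <- crossprod(centered) / (n - 1)
print(all.equal(manual_cov, cov(centered)))

# cov() centers its input internally, so the uncentered data gives the same matrix
print(all.equal(cov(data), cov(centered)))
```

The second check suggests that centering is not strictly needed before calling `cov`; doing it explicitly mainly helps to show what the method works with.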


## Variance and covariance

Variance, or squared standard deviation, is a measure of the spread of the numbers in a dataset.

$$ \sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} $$
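
The formula divides by N, which gives the population variance. R's `var` and `cov` divide by N - 1 instead (the sample variance), so the two differ slightly. A small sketch using one iris column illustrates this; the names `pop_var` and `sample_var` are illustrative.

```r
x <- iris$Sepal.Length
N <- length(x)
mu <- mean(x)

# Population variance: divide by N, as in the formula
pop_var <- sum((x - mu)^2) / N

# Sample variance: divide by N - 1, which is what var() and cov() return
sample_var <- sum((x - mu)^2) / (N - 1)

print(pop_var)
print(sample_var)
print(all.equal(sample_var, var(x)))
```

Here `sample_var` should match the `Sepal.Length` diagonal entry of the covariance matrix printed earlier (0.6856935), while `pop_var` is smaller by a factor of (N - 1) / N.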