# PCA

In [91]:
groceries <- c(227.01, 241.42, 188.08, 238.23, 235.86)
leisure <- c(96.98, 140.44, 85.13, 158.22, 103.06)
income <- c(741.29, 854.07, 812.07, 813.69, 731.42)
spend <- data.frame(groceries, leisure, income)

In [3]:
rm(groceries, leisure, income)
attach(spend)
spend

groceries,leisure,income
227.01,96.98,741.29
241.42,140.44,854.07
188.08,85.13,812.07
238.23,158.22,813.69
235.86,103.06,731.42


In [92]:
spend.var <- var(spend)

# covariance matrix for spend
spend.var

,groceries,leisure,income
groceries,480.86085,479.1369,-46.57675
leisure,479.1369,964.7673,891.82637
income,-46.57675,891.8264,2739.06402


In [93]:
spend.cor <- cor(spend)

# correlation matrix for spend
spend.cor

,groceries,leisure,income
groceries,1.0,0.7034587,-0.04058433
leisure,0.70345873,1.0,0.54861532
income,-0.04058433,0.5486153,1.0


## Eigen analysis for spend

In [94]:
eig.var <- eigen(spend.var)

eig.var

eigen() decomposition
$values
[1] 3117.62546  997.08606   69.98063

$vectors
           [,1]       [,2]       [,3]
[1,] 0.05511988 -0.6581783  0.7508416
[2,] 0.39257810 -0.6771368 -0.6223891
[3,] 0.91806549  0.3290700  0.2210627


In [95]:
# sum of squares of each eigenvector (the columns of $vectors);
# each should equal 1
colSums(eig.var$vectors^2)

In [96]:
# percentages of variance accounted for by the three
# principal components
100*eig.var$values / sum(eig.var$values)
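
The percentages above rely on the eigenvalues summing to the total variance, i.e. the trace of the covariance matrix. A minimal check of that identity, rebuilding the data under the hypothetical name `spend_chk` (not a name used elsewhere in this notebook) so nothing in the session is overwritten:

```r
# Hypothetical copy of the data; `spend_chk` is an illustrative name only
spend_chk <- data.frame(
  groceries = c(227.01, 241.42, 188.08, 238.23, 235.86),
  leisure   = c(96.98, 140.44, 85.13, 158.22, 103.06),
  income    = c(741.29, 854.07, 812.07, 813.69, 731.42)
)

S_chk  <- var(spend_chk)          # sample covariance matrix (divisor n - 1)
lambda <- eigen(S_chk)$values     # eigenvalues, largest first

# Total variance computed two ways: trace of S and sum of eigenvalues
total_trace <- sum(diag(S_chk))
total_eigen <- sum(lambda)

# Share of total variance carried by each principal component (in %)
pct_var <- 100 * lambda / total_eigen
pct_var
```

If the two totals agree, the percentages are guaranteed to sum to 100.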

In [97]:
eig.cor <- eigen(spend.cor)

eig.cor

eigen() decomposition
$values
[1] 1.87268832 1.03935677 0.08795491

$vectors
           [,1]        [,2]       [,3]
[1,] -0.5565053 -0.61454914  0.5591344
[2,] -0.7148146  0.01112539 -0.6992255
[3,] -0.4234878  0.78880009  0.4454801


In [98]:
colSums(eig.cor$vectors^2)

In [99]:
100*eig.cor$values / sum(eig.cor$values)

**Note: the correlation matrix appears as it does because the underlying X vector has been standardised by dividing each variable by its standard deviation.**

This means that you get R (the sample correlation matrix) rather than S (the sample covariance matrix).
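
The relationship between the two matrices can be written as $R = D^{-1/2} S D^{-1/2}$, where $D$ is the diagonal matrix holding the variances from $S$. A small sketch of that rescaling, using the hypothetical names `spend_raw`, `R_manual` and `R_builtin`, which are assumptions for illustration:

```r
# Hypothetical copy of the raw data; names here are illustrative only
spend_raw <- data.frame(
  groceries = c(227.01, 241.42, 188.08, 238.23, 235.86),
  leisure   = c(96.98, 140.44, 85.13, 158.22, 103.06),
  income    = c(741.29, 854.07, 812.07, 813.69, 731.42)
)

S <- var(spend_raw)

# D^{-1/2}: diagonal matrix of reciprocal standard deviations
D_inv_sqrt <- diag(1 / sqrt(diag(S)))

# Rescale S by hand, then restore the row/column names
R_manual <- D_inv_sqrt %*% S %*% D_inv_sqrt
dimnames(R_manual) <- dimnames(S)

# Base R's built-in conversion from covariance to correlation
R_builtin <- cov2cor(S)

R_manual
```

Both `R_manual` and `R_builtin` should agree with `cor(spend_raw)` up to floating-point error.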

## Trying to produce standardised data

In [100]:
spend

groceries,leisure,income
227.01,96.98,741.29
241.42,140.44,854.07
188.08,85.13,812.07
238.23,158.22,813.69
235.86,103.06,731.42


In [101]:
m.spend <- apply(spend, 2, mean)
m.spend

In [102]:
var.spend <- apply(spend, 2, var)

gro.var <- as.numeric(var.spend[1])
lei.var <- as.numeric(var.spend[2])
inc.var <- as.numeric(var.spend[3])

## Incredibly ugly method!

In [103]:
spend['groceries'] <- spend['groceries'] / sqrt(gro.var)

In [104]:
spend['leisure'] <- spend['leisure'] / sqrt(lei.var)

In [105]:
spend['income'] <- spend['income'] / sqrt(inc.var)

In [106]:
spend

groceries,leisure,income
10.352263,3.122273,14.16404
11.009397,4.521469,16.31896
8.576951,2.740762,15.51645
10.863925,5.093896,15.5474
10.755846,3.318019,13.97545
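
A tidier alternative to the column-by-column division is base R's `scale()`. The sketch below is one possible way to do it, not the method used in this notebook; `spend_raw`, `col_sd` and `spend_std` are hypothetical names, and the data are rebuilt because `spend` has been overwritten by this point.

```r
# Hypothetical copy of the raw (unscaled) data
spend_raw <- data.frame(
  groceries = c(227.01, 241.42, 188.08, 238.23, 235.86),
  leisure   = c(96.98, 140.44, 85.13, 158.22, 103.06),
  income    = c(741.29, 854.07, 812.07, 813.69, 731.42)
)

# Sample standard deviation of each column (divisor n - 1)
col_sd <- sapply(spend_raw, sd)

# Divide each column by its standard deviation without centring,
# mirroring the manual division in the cells above
spend_std <- as.data.frame(scale(spend_raw, center = FALSE, scale = col_sd))

# The covariance matrix of the scaled data is the correlation matrix
var(spend_std)
```

Centring is switched off only to reproduce the values shown above; subtracting the column means would not change the covariance matrix.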


In [107]:
var(spend)

,groceries,leisure,income
groceries,1.0,0.7034587,-0.04058433
leisure,0.70345873,1.0,0.54861532
income,-0.04058433,0.5486153,1.0


In [108]:
spend.cor

,groceries,leisure,income
groceries,1.0,0.7034587,-0.04058433
leisure,0.70345873,1.0,0.54861532
income,-0.04058433,0.5486153,1.0


In [110]:
eigen(var(spend))

eigen() decomposition
$values
[1] 1.87268832 1.03935677 0.08795491

$vectors
           [,1]        [,2]       [,3]
[1,] -0.5565053 -0.61454914  0.5591344
[2,] -0.7148146  0.01112539 -0.6992255
[3,] -0.4234878  0.78880009  0.4454801


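**The same output is available from the `princomp` command, but `princomp` uses the biased estimate of the covariance (divisor $n$) rather than the unbiased estimate (divisor $n-1$).**

> To match the `eigen` results, multiply the `princomp` variances (the squared standard deviations) by $\dfrac{n}{n-1}$.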
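
A minimal sketch of that adjustment, assuming the standardised data are rebuilt under the hypothetical names `spend_raw` and `spend_std`; the other variable names are also illustrative:

```r
# Hypothetical copy of the raw data, standardised by the sample sd
spend_raw <- data.frame(
  groceries = c(227.01, 241.42, 188.08, 238.23, 235.86),
  leisure   = c(96.98, 140.44, 85.13, 158.22, 103.06),
  income    = c(741.29, 854.07, 812.07, 813.69, 731.42)
)
spend_std <- scale(spend_raw, center = FALSE, scale = sapply(spend_raw, sd))

n  <- nrow(spend_std)
pc <- princomp(spend_std)             # covariance computed with divisor n

# Component variances as reported by princomp (biased)
var_biased <- unname(pc$sdev^2)

# Rescale to the unbiased (divisor n - 1) variances
var_unbiased <- var_biased * n / (n - 1)

# These should match the eigenvalues of the sample correlation matrix
lambda <- eigen(cor(spend_raw))$values
rbind(var_unbiased, lambda)
```

The loadings reported by `princomp` may differ in sign from the columns of `eigen()$vectors`; eigenvectors are only defined up to sign, so this does not affect the interpretation.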

In [112]:
summary(princomp(spend))

Importance of components:
                          Comp.1    Comp.2    Comp.3
Standard deviation     1.2239896 0.9118582 0.2652620
Proportion of Variance 0.6242294 0.3464523 0.0293183
Cumulative Proportion  0.6242294 0.9706817 1.0000000

In [113]:
loadings(princomp(spend))


Loadings:
          Comp.1 Comp.2 Comp.3
groceries  0.557  0.615  0.559
leisure    0.715        -0.699
income     0.423 -0.789  0.445

               Comp.1 Comp.2 Comp.3
SS loadings     1.000  1.000  1.000
Proportion Var  0.333  0.333  0.333
Cumulative Var  0.333  0.667  1.000

---

&nbsp;

&nbsp;

## PCA Ex Q4 PS1

In [114]:
ass.1 <- c(8,12,14,12,9,10,11,11,12,14)
ass.2 <- c(14,13,11,13,10,12,10,15,13,10)
ass.3 <- c(7,13,8,10,12,11,10,12,10,9)

scores <- data.frame(ass.1, ass.2, ass.3)

scores

ass.1,ass.2,ass.3
8,14,7
12,13,13
14,11,8
12,13,10
9,10,12
10,12,11
11,10,10
11,15,12
12,13,10
14,10,9


In [115]:
scores.cov <- var(scores)

scores.cov

,ass.1,ass.2,ass.3
ass.1,3.7888889,-0.9222222,-0.2888889
ass.2,-0.9222222,3.2111111,0.3111111
ass.3,-0.2888889,0.3111111,3.5111111


In [116]:
scores.cor <- cor(scores)

scores.cor

,ass.1,ass.2,ass.3
ass.1,1.0,-0.2643942,-0.079205
ass.2,-0.2643942,1.0,0.0926543
ass.3,-0.079205,0.0926543,1.0


* Assignments 1 and 2 are weakly negatively correlated (r ≈ -0.26).
* Essentially no correlation between 1 and 3 (r ≈ -0.08).
* Essentially no correlation between 2 and 3 (r ≈ 0.09).

The assignments seem to be testing different skills / parts of chemistry.

---

&nbsp;

&nbsp;

### By hand method to produce correlation matrix

In [123]:
(1 / apply(scores, 2, var))

In [161]:
scores.hand <- scores
scores.hand['ass.1'] <- scores.hand['ass.1'] * sqrt(0.263929618768328)
scores.hand['ass.2'] <- scores.hand['ass.2'] * sqrt(0.311418685121107)
scores.hand['ass.3'] <- scores.hand['ass.3'] * sqrt(0.284810126582278)

var(scores.hand)

,ass.1,ass.2,ass.3
ass.1,1.0,-0.2643942,-0.079205
ass.2,-0.2643942,1.0,0.0926543
ass.3,-0.079205,0.0926543,1.0
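
The constants typed in above are the reciprocals of the column variances. The same rescaling can be sketched without copying numbers by hand; `scores_raw`, `col_sd`, `scores_scaled` and `R_hand` are hypothetical names used only for this illustration.

```r
# Hypothetical copy of the assignment scores
scores_raw <- data.frame(
  ass.1 = c(8, 12, 14, 12, 9, 10, 11, 11, 12, 14),
  ass.2 = c(14, 13, 11, 13, 10, 12, 10, 15, 13, 10),
  ass.3 = c(7, 13, 8, 10, 12, 11, 10, 12, 10, 9)
)

# Sample standard deviation of each assignment
col_sd <- sapply(scores_raw, sd)

# Divide every column by its own standard deviation
scores_scaled <- sweep(as.matrix(scores_raw), 2, col_sd, "/")

# Covariance of the rescaled scores, which is the correlation matrix
R_hand <- var(scores_scaled)
R_hand
```

Multiplying by `sqrt(1 / variance)`, as in the cell above, is the same operation as dividing by the standard deviation.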


---

&nbsp;

&nbsp;

In [162]:
scores.cov

,ass.1,ass.2,ass.3
ass.1,3.7888889,-0.9222222,-0.2888889
ass.2,-0.9222222,3.2111111,0.3111111
ass.3,-0.2888889,0.3111111,3.5111111


In [163]:
scores.cor

,ass.1,ass.2,ass.3
ass.1,1.0,-0.2643942,-0.079205
ass.2,-0.2643942,1.0,0.0926543
ass.3,-0.079205,0.0926543,1.0


---

&nbsp;

&nbsp;

## Finding principal components from S

In [165]:
eig.cov <- eigen(scores.cov)

eig.cov

eigen() decomposition
$values
[1] 4.623209 3.361408 2.526494

$vectors
           [,1]       [,2]        [,3]
[1,]  0.7463196  0.3365007 -0.57425985
[2,] -0.5649449 -0.1359171 -0.81385737
[3,] -0.3519153  0.9318229  0.08866676


In [168]:
# spread of the variance between the three principal components
100*eig.cov$values / sum(eig.cov$values)

## Principal Components from R

In [169]:
eig.cor <- eigen(scores.cor)

eig.cor

eigen() decomposition
$values
[1] 1.3117837 0.9529924 0.7352239

$vectors
           [,1]       [,2]        [,3]
[1,]  0.6545354  0.2856441 -0.69999344
[2,] -0.6630159 -0.2280570 -0.71302170
[3,] -0.3633088  0.9308047  0.04011519


In [170]:
100*eig.cor$values / sum(eig.cor$values)

## Summary comments

* Very similar spread of variance across the principal components of the covariance and correlation matrices (roughly 44%, 32% and 24% in both cases).

* The variance is spread fairly evenly across the PCs, so there is no case for dropping one and focusing on the others.
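
The two sets of percentages can be placed side by side for comparison. This is a small sketch; the helper `pct_var()` and the names `scores_raw` and `pct` are hypothetical, introduced only for this example.

```r
# Hypothetical copy of the assignment scores
scores_raw <- data.frame(
  ass.1 = c(8, 12, 14, 12, 9, 10, 11, 11, 12, 14),
  ass.2 = c(14, 13, 11, 13, 10, 12, 10, 15, 13, 10),
  ass.3 = c(7, 13, 8, 10, 12, 11, 10, 12, 10, 9)
)

# Illustrative helper: percentage of total variance per component
pct_var <- function(m) {
  ev <- eigen(m)$values
  100 * ev / sum(ev)
}

# One row per matrix, one column per principal component
pct <- rbind(
  covariance  = pct_var(var(scores_raw)),
  correlation = pct_var(cor(scores_raw))
)
colnames(pct) <- paste0("PC", 1:3)

round(pct, 1)
```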

Looking specifically at the covariance-matrix PCs:

1. Shows a strong contrast between assignment 1 and assignments 2 and 3.
2. Mainly assignment 3, with a small contrast between 1 and 2.
3. Mainly assignments 1 and 2, with a little more emphasis on 2.
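
To see these components as new variables, the centred scores can be projected onto the eigenvectors. The sketch below assumes the hypothetical names `scores_raw`, `centred` and `pc_scores`; it is one way to compute the component scores, not a step taken in the notebook above.

```r
# Hypothetical copy of the assignment scores
scores_raw <- data.frame(
  ass.1 = c(8, 12, 14, 12, 9, 10, 11, 11, 12, 14),
  ass.2 = c(14, 13, 11, 13, 10, 12, 10, 15, 13, 10),
  ass.3 = c(7, 13, 8, 10, 12, 11, 10, 12, 10, 9)
)

eig <- eigen(var(scores_raw))

# Subtract column means only; no rescaling, since S (not R) is used
centred <- scale(as.matrix(scores_raw), center = TRUE, scale = FALSE)

# Each column of pc_scores is one principal component per student
pc_scores <- centred %*% eig$vectors
colnames(pc_scores) <- paste0("PC", 1:3)

# Variances equal the eigenvalues; covariances are (numerically) zero
var(pc_scores)
```

The diagonal of `var(pc_scores)` should reproduce the eigenvalues of the covariance matrix, and the off-diagonal entries should vanish because the components are uncorrelated.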