In [None]:
# David Bourgin
# QuACK Workshop, 11/2/17
# Dimensionality Reduction and Factor Analysis

In [None]:
# Install and load a few useful packages
# corrplot:    for making a pretty heatmap of the sample correlation matrix
# psych:       for the `fa` and `vss` helper functions
# GPArotation: for the oblimin rotation functionality

installed <- rownames(installed.packages()) # names of the installed packages
if (!"corrplot" %in% installed) {install.packages("corrplot")}
if (!"psych" %in% installed) {install.packages("psych")}
if (!"GPArotation" %in% installed) {install.packages("GPArotation")}

library("corrplot")
library("psych")
library("GPArotation")

# seed the rng for reproducibility
set.seed(94608)

# Exploratory Factor Analysis in R

In this notebook we analyze some personality data collected by Bertram Malle, hosted on Stanford's excellent [Psych253 data repository](https://web.stanford.edu/class/psych253/data/). The data consist of participants' self-ratings on 32 personality traits.

In exploratory factor analysis (EFA) our goal is to find a small(er) set of latent dimensions (factors) that account for as much of the co-variation in the raw data as possible. In contrast to a method like PCA, FA explicitly assumes a particular generative model for the data. For more, see the [FA theory notebook](./Factor Analysis Theory.ipynb).

<div class="alert alert-block alert-warning">
**N.B.** This notebook assumes you are familiar with the general factor analysis model. To review its derivation and assumptions, see its exposition in the accompanying [FA theory notebook](./Factor Analysis Theory.ipynb).
</div>
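
As a brief reminder, the model can be written as follows. The symbols here are an assumption chosen for this summary and may differ from those used in the theory notebook. Each standardized observation $x \in \mathbb{R}^p$ is assumed to be generated from $k < p$ latent factors $z$:

$$
x = \Lambda z + \epsilon, \qquad z \sim \mathcal{N}(0, I_k), \qquad \epsilon \sim \mathcal{N}(0, \Psi), \qquad \Psi \text{ diagonal},
$$

which implies the covariance structure

$$
\operatorname{Cov}(x) = \Lambda \Lambda^\top + \Psi .
$$

The rows of $\Lambda$ hold the factor loadings for each variable, and the diagonal of $\Psi$ holds the variable-specific (unique) variances.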


In [None]:
# download the Stanford personality dataset
d = read.table("https://www.stanford.edu/class/psych253/data/personality0.txt")
head(d)

In [None]:
# plot correlations between dimensions
corrplot(cor(d), order = "hclust", tl.col='black', tl.cex=.75)

In [None]:
# standardize the data by subtracting column means (centering) 
# and dividing by the standard deviation (scaling)
d_stan = as.data.frame(scale(d))
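
One consequence of standardizing is that the covariance matrix of the standardized data equals the correlation matrix of the raw data. The sketch below checks this on a small simulated data frame; the data and the names `X` and `X_stan` are made up for illustration and are not part of the personality dataset.

```r
# Simulated data with very different scales per column (illustrative only)
set.seed(3)
X <- data.frame(a = rnorm(50, mean = 10, sd = 5),
                b = runif(50),
                c = rexp(50))

# Center and scale each column, as done for the personality data
X_stan <- as.data.frame(scale(X))

# After standardizing, every column has mean ~0 and standard deviation 1
round(colMeans(X_stan), 10)
apply(X_stan, 2, sd)

# The covariance of the standardized data matches the correlation of the raw data
all.equal(cov(X_stan), cor(X))
```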

In [None]:
# A scree plot for deciding how many factors to use
R = cor(d_stan) # sample correlation matrix
evs = eigen(R)$values # compute eigenvalues
options(repr.plot.width=6, repr.plot.height=4)
plot(evs, type='b', xlab='Component', ylab='Eigenvalue', main="Scree plot")
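
A scree plot is read by eye, so it can help to have a numerical cross-check. One common option, not used in the original notebook, is parallel analysis: compare the observed eigenvalues with those obtained from uncorrelated noise of the same shape. The sketch below is a simplified base-R version on simulated data with two hypothetical factors; all names in it are illustrative. The psych package also provides `fa.parallel` for this purpose.

```r
set.seed(2)
n <- 300  # observations
p <- 8    # variables

# Simulate data from two hypothetical latent factors, four indicators each
f <- matrix(rnorm(n * 2), n, 2)
Lambda <- rbind(cbind(rep(0.8, 4), 0),
                cbind(0, rep(0.8, 4)))
X <- f %*% t(Lambda) + matrix(rnorm(n * p, sd = 0.6), n, p)

# Observed eigenvalues of the sample correlation matrix
obs <- eigen(cor(X))$values

# Eigenvalues from 100 data sets of pure noise with the same dimensions
sims <- replicate(100, eigen(cor(matrix(rnorm(n * p), n, p)))$values)

# 95th percentile of the noise eigenvalues at each position
thresh <- apply(sims, 1, quantile, probs = 0.95)

# Simplified rule: count the observed eigenvalues that exceed the noise threshold
n_keep <- sum(obs > thresh)
n_keep
```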

In [None]:
# perform a VSS analysis to decide the number of factors to use 

# Note the diagonal=FALSE argument: fit is evaluated on the off-diagonal
# elements of the correlation matrix only, because we are concerned with factors
# that account for the maximum COvariance between variables.
# Play around with different rotations here.
options(repr.plot.width=5, repr.plot.height=6)
vss = VSS(d_stan, n=8, rotate="oblimin", diagonal=FALSE)

In [None]:
# Compute the unrotated factor loadings using 5 factors
res1b = fa(d_stan, nfactors=5, rotate="none")

In [None]:
# Compute the proportion of total variance accounted for by the first principal
# component: the eigenvalues of a correlation matrix sum to the number of variables (32)
evs[1] / 32
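
Dividing by the number of variables works because the eigenvalues of a correlation matrix sum to its trace, which is the number of variables. A minimal sketch on simulated data (the names `X_sim`, `evs_sim`, and so on are illustrative assumptions) shows the same calculation for every component at once:

```r
set.seed(1)
n <- 200  # observations
p <- 6    # variables

# Six noisy indicators of a single shared latent variable z (illustrative only)
z <- rnorm(n)
X_sim <- sapply(1:p, function(j) 0.7 * z + rnorm(n))

# Eigenvalues of the sample correlation matrix
evs_sim <- eigen(cor(X_sim))$values

# The eigenvalues sum to the number of variables (the trace of the matrix)
sum(evs_sim)

# Proportion and cumulative proportion of total variance per component
prop_var <- evs_sim / sum(evs_sim)
cum_var <- cumsum(prop_var)
round(cbind(eigenvalue = evs_sim, prop_var, cum_var), 3)
```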

The **communality** of a variable reflects the extent to which the variability across subjects in a particular dimension is ‘explained’ by the set of factors extracted in the factor analysis. 

**Uniqueness** is just 1-communality, and measures (surprise!) the variance that is ‘unique’ to the variable / not shared with other variables.
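
As a small worked example, the sketch below computes both quantities from a made-up loading matrix (three variables, two factors; the numbers are hypothetical, not taken from the personality data). It assumes standardized variables and uncorrelated factors, in which case a variable's communality is the sum of its squared loadings. With an oblique rotation the factor correlations enter the calculation as well.

```r
# Hypothetical loadings: rows are variables, columns are factors
L <- matrix(c(0.8, 0.1,
              0.6, 0.5,
              0.2, 0.7),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("v1", "v2", "v3"), c("F1", "F2")))

# Communality: sum of squared loadings across factors, one value per variable
communality <- rowSums(L^2)    # v1 = 0.65, v2 = 0.61, v3 = 0.53

# Uniqueness: the remaining share of each variable's (unit) variance
uniqueness <- 1 - communality  # v1 = 0.35, v2 = 0.39, v3 = 0.47

round(cbind(communality, uniqueness), 2)
```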

In [None]:
res1b$uniquenesses

In [None]:
# Calculate the communality by hand for the first variable (row 1 of the
# loadings matrix); the uniqueness is then 1 minus this value
loadings_distant = res1b$loadings[1,]
communality_distant = sum(loadings_distant^2)
communality_distant

In [None]:
uniqueness_distant = 1 - communality_distant
uniqueness_distant

In [None]:
# Plot loadings for factors 1 and 2 under different rotations
faRotations <- function(x, k) {
    options(repr.plot.width=10, repr.plot.height=4)
    par(mfrow=c(1,3), pty='s')
    
    # no rotation
    res1c = fa(x, nfactors=k, rotate="none")
    load = res1c$loadings[, 1:2]
    plot(load, type="n", xlab="Factor 1", ylab="Factor 2", main="Partial loading plot (Unrotated)") 
    text(load, labels=names(x), cex=.7)

    # varimax rotation
    res1c = fa(x, nfactors=k, rotate="varimax")
    load = res1c$loadings[, 1:2]
    plot(load, type="n", xlab="Factor 1", ylab="Factor 2", main="Partial loading plot (Varimax rotation)") 
    text(load, labels=names(x), cex=.7)

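    # oblimin rotation
    # NOTE: this panel uses maximum-likelihood extraction (fm='ml'), whereas the
    # two panels above use fa's default (fm='minres'), so it differs from them
    # in extraction method as well as in rotation
    res1c = fa(x, nfactors=k, fm='ml', rotate="oblimin")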
    load = res1c$loadings[, 1:2]
    plot(load, type="n", xlab="Factor 1", ylab="Factor 2", main="Partial loading plot (Oblimin rotation)") 
    text(load, labels=names(x), cex=.7)
}

faRotations(d_stan, 5)
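
When comparing rotations it helps to remember that an orthogonal rotation such as varimax redistributes variance across factors without changing how much of each variable is explained. The sketch below illustrates this with base R's `varimax` (from the stats package, not the psych `fa` wrapper used above) on a made-up loading matrix; the numbers are hypothetical.

```r
# Hypothetical unrotated loadings: four variables, two factors
L <- matrix(c(0.7, 0.3,
              0.6, 0.4,
              0.3, 0.7,
              0.2, 0.8),
            nrow = 4, byrow = TRUE)

# Varimax rotation using base R (stats::varimax)
rot <- varimax(L)
L_rot <- unclass(rot$loadings)  # drop the "loadings" class to get a plain matrix

# The rotation matrix is orthogonal: t(T) %*% T is the identity
round(crossprod(rot$rotmat), 10)

# Communalities (row sums of squared loadings) are unchanged by the rotation
all.equal(rowSums(L^2), rowSums(L_rot^2), check.attributes = FALSE)
```

Oblique rotations such as oblimin allow the factors to correlate, so the simple row-sum identity above no longer applies to them directly.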