# Intro to Factor Analysis

<div class="alert alert-block alert-warning">
**N.B.** This notebook focuses on the theory behind factor analysis. For an example of how to perform factor analysis in R, see the accompanying [Factor Analysis in R notebook](./Factor Analysis in R.ipynb).
</div>

### Motivating Example
Assume we have a collection of $n$ student test scores on a set of $p$ IQ metrics. We find that the scores on certain IQ metrics are highly correlated with the scores on others: for example, students who scored highly on the algebra test also tended to score highly on the arithmetic test. In trying to understand these relationships, we assume that the correlation between two IQ metrics reflects the fact that both are testing the same _latent factors_ of intelligence, but perhaps in different proportions, and with different degrees of accuracy.

We summarize this assumption below. In the graphical model, each node represents a random variable: the $z_i$'s denote the (unobserved) latent factors of intelligence, and the $x_i$'s represent the (observed) IQ test scores. The connections between nodes indicate dependence relations.

![](images/factor_diagram_no_labels.png) 

In psychometrics, a common goal is to discover the smallest number of latent factors that account for the majority of the covariance in IQ metrics. Factor analysis is one means by which we might achieve this goal.

### Factor Analysis
In factor analysis (FA), we seek to discover:
1. The _number_ of latent factors $z$ necessary to account for the observed relations between observed variables $x$.
2. The _influence_ of each latent factor $z$ on each observed variable $x$.
3. The _values_ of the latent factors $z$ given the observed variables $x$.

In order to do so, the FA model makes several assumptions about the way latent factors and observed variables relate to one another. I provide two equivalent descriptions of this model. The first should be more familiar for those versed in machine learning, while the second should be more recognizable for those familiar with regression models.

<div class="alert alert-block alert-info">
**Notation Reference**
<ul>
    <li>$\mathbf{z} = [z_1, \ldots, z_m ]^\top$: The vector of values for the $m$ latent factors</li>
    <li>$\mathbf{x} = [x_1, \ldots, x_p ]^\top$: The vector of values for the $p$ observed variables</li>
    <li>$\mathbf{I}$: The $m \times m$ identity matrix, with ones down the diagonal and zeros everywhere else.</li>
    <li>$\mathbf{0}$: The zero vector, of dimension $m$ or $p$ as required by context</li>
    <li>$\mu = [\mu(x_1), \ldots, \mu(x_p)]^\top$: The vector of sample means for each of the $p$ observed variables.</li>
    <li>$\theta$: The $p \times m$ matrix of factor loadings, where index $(i, j)$ contains the influence of factor $z_j$ on observed variable $x_i$.</li>
    <li>$\Psi$: The $p \times p$ diagonal matrix where entry $(i, i)$ contains the specific variance of observed variable $x_i$, i.e., the variance of $x_i$ that is not explained by the latent factors</li>
</ul>
</div>

<div class="alert alert-block alert-warning">
**N.B.** The expositions below assume that the observed data are normally distributed. While this assumption is not strictly necessary for FA, it often results in more robust factor loadings. As with many machine learning methods, it is good practice to first inspect your raw data and apply any necessary transformations before fitting your model.
</div>

### Machine Learning View
FA is an example of a _generative model_ for our observed data. A standard FA model is:

$$ \mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) $$
$$ \mathbf{x} \ | \ \mathbf{z} \sim \mathcal{N}(\mu + \theta \mathbf{z}, \Psi)$$

In our IQ testing example, this constitutes the assumption that a student's scores on the $p$ IQ metrics are generated according to the following recipe:
1. **Sample $\mathbf{z}$**: Draw a sample from an $m$-dimensional standard multivariate Gaussian, $\mathcal{N}(\mathbf{0}, \mathbf{I})$.
2. **Compute the factor influences**: Map our sampled $\mathbf{z}$ to $\mu+\theta \mathbf{z}$.
3. **Sample $\mathbf{x}$**: Add zero-mean Gaussian noise with covariance $\Psi$ to $\mu + \theta \mathbf{z}$ to produce our observed scores $\mathbf{x}$.

The parameters of the FA model are thus $\mu, \Psi$, and $\theta$. From this description we can see that FA is a linear-Gaussian latent variable model. Like a [mixture model](https://en.wikipedia.org/wiki/Mixture_model), it explains the observed data through unobserved variables, but its latent factors are continuous rather than discrete: each observed dimension of our data vector $\mathbf{x}$ is associated to a varying degree with each of the latent factors. The extent of this association is known as the _factor loading_, and the loadings for factor $z_j$ are held in column $j$ of our _loading matrix_, $\theta$.
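
The three-step recipe above can be sketched in NumPy. This is a minimal illustration rather than a reference implementation: the parameter values and the helper name `sample_fa` are made up for the example. As a sanity check, it compares the sample covariance of the simulated scores with $\theta\theta^\top + \Psi$, the covariance the model implies for $\mathbf{x}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: p = 4 observed scores, m = 2 latent factors.
mu = np.array([100.0, 100.0, 100.0, 100.0])
theta = np.array([[8.0, 1.0],
                  [7.0, 2.0],
                  [1.0, 9.0],
                  [2.0, 6.0]])
psi = np.diag([4.0, 9.0, 4.0, 16.0])


def sample_fa(n, mu, theta, psi, rng):
    """Draw n observations from the FA generative model (illustrative helper)."""
    p, m = theta.shape
    z = rng.standard_normal((n, m))                # step 1: z ~ N(0, I)
    mean = mu + z @ theta.T                        # step 2: mu + theta z
    # step 3: add independent noise with variances given by diag(Psi)
    noise = rng.standard_normal((n, p)) * np.sqrt(np.diag(psi))
    return mean + noise, z


x, z = sample_fa(50_000, mu, theta, psi, rng)

implied_cov = theta @ theta.T + psi                # covariance implied by the model
sample_cov = np.cov(x, rowvar=False)               # covariance of the simulated data
```

With this many samples, `sample_cov` should agree with `implied_cov` up to sampling noise.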

If the above modeling assumptions hold, we can derive a version of the [Expectation Maximization algorithm](https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm) to compute (locally) maximum likelihood estimates of the parameters.
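
As a rough illustration of what such an EM procedure can look like, here is a compact NumPy sketch using the standard EM updates for the linear-Gaussian model above. The function names (`fit_fa_em`, `fa_log_likelihood`), the random initialization, the fixed iteration count, and the simulated data are all assumptions made for this example. It is not the routine used by any particular package.

```python
import numpy as np


def fa_log_likelihood(S, theta, psi_diag, n):
    """Log-likelihood of n centered observations whose ML covariance is S."""
    p = S.shape[0]
    sigma = theta @ theta.T + np.diag(psi_diag)    # covariance implied by the model
    _, logdet = np.linalg.slogdet(sigma)
    trace_term = np.trace(np.linalg.solve(sigma, S))
    return -0.5 * n * (p * np.log(2 * np.pi) + logdet + trace_term)


def fit_fa_em(x, m, n_iter=200, seed=0):
    """Estimate mu, theta and diag(Psi) by EM; also return the likelihood trace."""
    rng = np.random.default_rng(seed)
    n, p = x.shape
    mu = x.mean(axis=0)                            # ML estimate of mu
    xc = x - mu
    S = xc.T @ xc / n                              # ML sample covariance
    theta = rng.standard_normal((p, m))            # arbitrary starting point
    psi_diag = np.diag(S).copy()
    history = []
    for _ in range(n_iter):
        # E-step: E[z | x] = beta (x - mu), with beta = theta^T Sigma^{-1}
        sigma = theta @ theta.T + np.diag(psi_diag)
        beta = np.linalg.solve(sigma, theta).T
        # Average of E[z z^T | x] over the data set
        ezz = np.eye(m) - beta @ theta + beta @ S @ beta.T
        # M-step: update the loadings, then the specific variances
        theta = S @ beta.T @ np.linalg.inv(ezz)
        psi_diag = np.diag(S - theta @ beta @ S).copy()
        history.append(fa_log_likelihood(S, theta, psi_diag, n))
    return mu, theta, psi_diag, history


# Simulated data from a hypothetical two-factor model with five observed variables.
rng = np.random.default_rng(1)
true_theta = np.array([[0.9, 0.0], [0.8, 0.1], [0.1, 0.7], [0.0, 0.8], [0.5, 0.5]])
true_psi = np.array([0.2, 0.3, 0.4, 0.3, 0.5])
z = rng.standard_normal((2000, 2))
x = z @ true_theta.T + rng.standard_normal((2000, 5)) * np.sqrt(true_psi)

mu_hat, theta_hat, psi_hat, history = fit_fa_em(x, m=2)
```

The log-likelihood in `history` should never decrease from one iteration to the next, which is a useful check when experimenting with the updates. The estimated `theta_hat` is only determined up to an orthogonal transformation, so it need not match `true_theta` column by column.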

### Regression View
Equivalently, we can view FA in a way that emphasizes its similarity to standard regression analysis.
1. FA dictates that each observed variable $x_i$ is a _linear function_ of the latent factors, with an intercept set to the expected value for the observed variable and some additional random noise:
<br>
<br>
$$x_i = \mu(x_i) + \theta_{i1} z_1 + \ldots + \theta_{im} z_m + e_{i}$$
 <br>
 In the IQ testing example, this means we can express a student's scores on the $p$ IQ metrics as:
 <br>
 <br>
$$ \mathbf{x} = \mu + \theta \mathbf{z} + \mathbf{e}$$
<br>
where both the latent factors $\mathbf{z}$ and the specific errors $\mathbf{e}$ are normally distributed as:
$$ \mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) $$
$$ \mathbf{e} \sim \mathcal{N}(\mathbf{0}, \Psi)$$
 
2. FA assumes that 
 - Latent factors and specific errors are independent: $Cov(z_i, e_j) = 0$ for $i = 1, \ldots, m$ and $j = 1, \ldots, p$. 
 - Specific errors are also independent of one another: $Cov(e_i, e_j) = 0$ for all $i \neq j$.
 
The above exposition illustrates that we can think of FA as a series of multiple regressions, predicting each of the observable variables from the values of the unobservable common factors $\mathbf{z}$:
 
 $$
 \begin{array}{llc}
 x_1 & = & \mu(x_1) + \theta_{11}z_1 + \theta_{12}z_2 + \dots + \theta_{1m}z_m + e_1 \\ 
 x_2 & = & \mu(x_2) + \theta_{21}z_1 + \theta_{22}z_2 + \dots + \theta_{2m}z_m + e_2 \\ 
 & & \vdots \\ 
 x_p & = & \mu(x_p) + \theta_{p1}z_1 + \theta_{p2}z_2 + \dots + \theta_{pm}z_m + e_p 
 \end{array}
 $$
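
One consequence of this formulation that is worth spelling out is the covariance structure it implies for the observed variables. Using only the assumptions stated above ($\mathrm{Cov}(\mathbf{z}) = \mathbf{I}$, $\mathrm{Cov}(\mathbf{e}) = \Psi$, and $\mathbf{z}$ independent of $\mathbf{e}$), a short derivation gives:

\begin{align}
\mathrm{Cov}(\mathbf{x}) &= \mathrm{Cov}(\mu + \theta \mathbf{z} + \mathbf{e}) \\
&= \theta \, \mathrm{Cov}(\mathbf{z}) \, \theta^\top + \mathrm{Cov}(\mathbf{e}) \\
&= \theta \theta^\top + \Psi
\end{align}

Because $\Psi$ is diagonal, every off-diagonal covariance between two observed variables is attributed entirely to the shared factors through $\theta \theta^\top$, while each diagonal entry splits into a part explained by the factors and a specific part.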

### Estimating Model Parameters
Estimating the parameters of the FA model turns out to be non-trivial: in general, there is no closed-form solution. As a result, FA parameters are typically estimated using one of several iterative approximation methods:
1. The **Maximum Likelihood method** for computing parameter estimates that (locally) maximize the likelihood of the sample correlation matrix.
2. The **Principal Axis Factoring method** for iteratively decomposing the sample correlation matrix using PCA.
3. The **Minimum Residual/Unweighted Least Squares method** for finding parameters which minimize the sum of squared off-diagonal residuals between the sample correlation matrix and its reconstruction from the model.

In R, the `psych` package's `fa` function uses the minimum residual estimator by default, while the `factanal` function from the `stats` package uses the MLE.
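
To give a feel for what "iterative" means here, the following is a simplified NumPy sketch of principal axis factoring. It assumes the common textbook variant: start from squared multiple correlations as communality estimates, then repeatedly eigendecompose the correlation matrix with those communalities on its diagonal. The function name, stopping rule, and toy correlation matrix are illustrative, and production implementations differ in their details.

```python
import numpy as np


def principal_axis_factoring(R, m, n_iter=200, tol=1e-6):
    """Simplified principal axis factoring on a correlation matrix R."""
    R = np.asarray(R, dtype=float)
    # Initial communalities: squared multiple correlations
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
    for _ in range(n_iter):
        reduced = R.copy()
        np.fill_diagonal(reduced, h2)              # communalities on the diagonal
        eigvals, eigvecs = np.linalg.eigh(reduced)
        top = np.argsort(eigvals)[::-1][:m]        # indices of the m largest eigenvalues
        lam = np.clip(eigvals[top], 0.0, None)     # guard against tiny negative values
        loadings = eigvecs[:, top] * np.sqrt(lam)
        new_h2 = np.sum(loadings ** 2, axis=1)     # updated communalities
        converged = np.max(np.abs(new_h2 - h2)) < tol
        h2 = new_h2
        if converged:
            break
    return loadings, h2


# Toy example: an exact one-factor correlation matrix with known loadings.
theta_true = np.array([0.8, 0.7, 0.6, 0.5])
R = np.outer(theta_true, theta_true)
np.fill_diagonal(R, 1.0)

loadings, h2 = principal_axis_factoring(R, m=1)
```

On this toy matrix the recovered loadings should match `theta_true` up to an overall sign, since the sign of an eigenvector is arbitrary.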

### How many factors to use?
So far we have assumed that the number of latent factors to look for in our data has been given. In a real-world example, however, this is rarely the case. Consequently, researchers have come up with the following heuristics for determining how many factors to look for during FA:

1. **Scree plot**: A scree plot displays the eigenvalues of the sample correlation matrix in decreasing order. The $x$ value at the "elbow" of the scree plot (i.e., the point at which the amount of variation captured by subsequent eigenvectors levels off) corresponds to the number of factors to use. 

 ![](./images/scree.jpeg)
 <br>
2. **Very Simple Structure criterion**:  Compares the fit achieved by a number of factor analyses using different loading matrices "simplified" by deleting all except the $c$ greatest loadings per item. Fit is defined as the proportion of the original sum of squared error accounted for by the simplified model. The optimal number of factors corresponds to the $x$ value at which the fit peaks across multiple values of $c$.

 ![](./images/vss.png)
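
The quantities shown in a scree plot are straightforward to compute. The sketch below uses an illustrative helper name, `scree_eigenvalues`, and simulated data from a hypothetical two-factor model. It returns the eigenvalues of the sample correlation matrix in decreasing order, and plotting them against their rank gives the scree plot.

```python
import numpy as np


def scree_eigenvalues(x):
    """Eigenvalues of the sample correlation matrix, largest first."""
    corr = np.corrcoef(x, rowvar=False)
    return np.sort(np.linalg.eigvalsh(corr))[::-1]


# Simulated scores: six observed variables driven by two latent factors.
rng = np.random.default_rng(0)
theta = np.array([[0.8, 0.0]] * 3 + [[0.0, 0.8]] * 3)
n = 5000
z = rng.standard_normal((n, 2))
x = z @ theta.T + 0.6 * rng.standard_normal((n, 6))

eigvals = scree_eigenvalues(x)
```

For data like this, the first two eigenvalues should stand well above the remaining four, so the elbow of the corresponding scree plot would suggest two factors.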

### Factor Rotations
In general, there exist an infinite number of equally good parameterizations of an FA model for any given dataset. Though this might seem like a significant downside of FA, we will see how this can be exploited to identify a collection of maximally interpretable latent factors.

Briefly, let us demonstrate why the factors and factor loadings estimated during FA are not unique. Recall that the FA model can be expressed in vector notation as

$$\mathbf{x} = \mu + \theta \mathbf{z} + \mathbf{e}$$

We see that this is equivalent to 

$$\mathbf{x} = \mu + \theta' \mathbf{z}' + \mathbf{e}$$ 

where $\theta' = \theta \mathbf{T}$ and $\mathbf{z}' = \mathbf{T}^\top \mathbf{z}$ for some orthogonal matrix $\mathbf{T}$ (orthogonal matrices are square matrices which have the property that $\mathbf{T}^\top \mathbf{T} = \mathbf{TT}^\top = \mathbf{I}$). By simple arithmetic,
\begin{align}
\mathbf{x} &= \mu + \theta' \mathbf{z}' + \mathbf{e} \\
&= \mu + \theta \mathbf{T} \mathbf{T}^\top \mathbf{z} + \mathbf{e} \\
&= \mu + \theta \mathbf{I} \mathbf{z} + \mathbf{e} \\
&= \mu + \theta \mathbf{z} + \mathbf{e}
\end{align}

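Since there are infinitely many orthogonal matrices (for $m \geq 2$), and since $\mathbf{z}' = \mathbf{T}^\top \mathbf{z}$ is again distributed as $\mathcal{N}(\mathbf{0}, \mathbf{I})$ because $\mathbf{T}^\top \mathbf{I} \mathbf{T} = \mathbf{I}$, there are also infinitely many combinations of factor loadings and factor values that achieve the same fit for a given $\mathbf{x}$.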
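
The argument above is easy to check numerically. In this small sketch the loading matrix, the factor values, and the 30-degree rotation are arbitrary values picked for illustration. It confirms that the rotated loadings and factors reproduce the same $\theta \mathbf{z}$ and the same $\theta \theta^\top$, even though the loadings themselves change.

```python
import numpy as np

# Arbitrary loadings for p = 4 observed variables and m = 2 factors.
theta = np.array([[0.8, 0.3],
                  [0.7, 0.4],
                  [0.2, 0.9],
                  [0.3, 0.6]])
z = np.array([0.5, -1.2])                      # one draw of the latent factors

# An orthogonal matrix T: rotation by 30 degrees.
angle = np.deg2rad(30)
T = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])

theta_rot = theta @ T                          # theta' = theta T
z_rot = T.T @ z                                # z' = T^T z

same_signal = np.allclose(theta_rot @ z_rot, theta @ z)
same_covariance = np.allclose(theta_rot @ theta_rot.T, theta @ theta.T)
```

Both `same_signal` and `same_covariance` evaluate to `True`, so the data cannot distinguish between `theta` and `theta_rot`.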

In FA, we take advantage of this ambiguity to identify a matrix $\mathbf{T}$ which results in the most _interpretable_ factor loadings. Typically, interpretability is maximized when each observed variable is associated with a single factor. Though this goal is rarely fully realized, we would like to find a matrix which gets us as close as possible.

An intuitive way to understand the effects of changing $\mathbf{T}$ on our factor loadings is to look at a _factor loading plot_. In a loading plot, each dimension corresponds to a latent factor, and each point corresponds to the factor loadings for a given observed variable. We can compare the loading plot for our untransformed factor loadings with the loading plot we achieve when multiplying by $\mathbf{T}$. 

Here's a cartoon example of two loading plots for our IQ test example. The plot on the left shows the loadings for the untransformed factors, while the plot on the right shows the loadings produced by multiplying by a particular orthogonal matrix $\mathbf{T}$:

![](./images/rotations.png)

As you can see, multiplying by the orthogonal matrix $\mathbf{T}$ corresponds to a _rotation_ of the basis factors (axes) about the origin. We would like to find a particular rotation matrix $\mathbf{T}$ which produces rotated coordinates that are easy to interpret. 
- For example, in the transformed plot above, we can reasonably interpret rotated factor $z_1'$ as corresponding to something like reasoning ability, and rotated factor $z_2'$ as corresponding to memory. In contrast, it is much more difficult to interpret the dimensions in the untransformed plot!

### What _is_ a good rotation?
A good rotation is one which emphasizes the "simple structure" in the factor loadings. L. L. Thurstone proposed five conditions for simple structure:
> 1. Each row of the factor matrix should contain at least one zero
> 2. If there are $m$ common factors, each column of the factor matrix should have at least $m$ zeros
> 3. For every pair of columns in the factor matrix, there should be several variables for which entries approach zero in the one column but not in the other
> 4. For every pair of columns in the factor matrix, a large proportion of the variables should have entries approaching zero in both columns when there are four or more factors
> 5. For every pair of columns in the factor matrix, there should be only a small number of variables with nonzero entries in both columns


The goal is thus to find a rotation which produces loadings where each observed variable has a high loading on one factor only and each factor has a high loading for only some of the observed variables.


### Finding a good rotation
There are two families of factor rotations: _orthogonal_ and _oblique_. As described above, multiplying by an orthogonal rotation matrix $\mathbf{T}$ guarantees that the factors in our model remain independent. The alternative is to use an _oblique_ rotation method, which permits dependencies between factors.
- Visually, an oblique rotation corresponds to the case where the dimensions in the loading plot are not constrained to be at right angles to one another.

There are multiple approaches to determining the best rotation for a set of factor loadings. Two of the most commonly used are:
- [Varimax rotation technique](https://en.wikipedia.org/wiki/Varimax_rotation) - An orthogonal method, enforcing that all rotated factors be mutually independent. Identifies the rotation that maximizes the variance of the squared loadings across the observed variables, summed over factors.
- Oblimin rotation technique - An oblique method allowing for rotated factors to be non-independent. Identifies factor loadings which minimize the covariance between the squared loadings of different factors.
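
To make the Varimax idea concrete, here is a short NumPy sketch of the widely used SVD-based iteration for raw Varimax, without Kaiser normalization. The function names and the toy loading matrix are illustrative. The example takes loadings with a perfectly simple structure, hides that structure with a 30-degree rotation, and then asks Varimax to find an interpretable rotation again.

```python
import numpy as np


def varimax_criterion(loadings):
    """Sum over factors of the variance of the squared loadings."""
    return np.sum(np.var(loadings ** 2, axis=0))


def varimax(loadings, max_iter=100, tol=1e-8):
    """Raw Varimax rotation via the standard SVD-based iteration."""
    p, m = loadings.shape
    T = np.eye(m)
    objective = 0.0
    for _ in range(max_iter):
        rotated = loadings @ T
        col_ss = np.sum(rotated ** 2, axis=0)      # column sums of squared loadings
        target = loadings.T @ (rotated ** 3 - rotated @ np.diag(col_ss) / p)
        u, s, vt = np.linalg.svd(target)
        T = u @ vt                                 # nearest orthogonal matrix
        new_objective = s.sum()
        if new_objective < objective * (1 + tol):  # no meaningful improvement
            break
        objective = new_objective
    return loadings @ T, T


# Toy loadings with simple structure, then rotated by 30 degrees to obscure it.
simple = np.array([[0.8, 0.0], [0.7, 0.0], [0.6, 0.0],
                   [0.0, 0.8], [0.0, 0.7], [0.0, 0.6]])
angle = np.deg2rad(30)
mix = np.array([[np.cos(angle), -np.sin(angle)],
                [np.sin(angle),  np.cos(angle)]])
unrotated = simple @ mix

rotated, T = varimax(unrotated)
```

After rotation, each row of `rotated` should have one loading close to zero, possibly with the columns swapped or their signs flipped relative to `simple`. The row sums of squared loadings (the communalities) are unchanged by the rotation.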

Determining what rotation method to use will depend on your data. Some rules of thumb:
- The advantage of Varimax (or any orthogonal rotation method) is that it allows a straightforward interpretation: factor loadings correspond directly to the correlations between a factor and an observed variable. Its downside is that it assumes factors are _a priori_ independent. 

- While oblique methods like Oblimin make fewer assumptions about the relationships between factors, the loadings they produce are more difficult to interpret. While in Varimax we have only a single loading matrix $\mathbf{\theta}'$, in Oblimin we have three separate matrices:
 1. A _structure matrix_ holding the correlation between factors and the observed variables _before_ the dependence between factors is taken into account.
 2. A _pattern matrix_ holding the relationship between each factor and observed variable _after_ the dependence between factors is taken into account.
 3. A _factor correlation matrix_ holding the correlations between latent factors.
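
These three matrices are tied together by a standard identity. The symbols here are introduced only for this illustration: $\mathbf{P}$ for the pattern matrix, $\Phi$ for the factor correlation matrix, and $\mathbf{S}$ for the structure matrix. If the observed variables are standardized and the oblique model is written as $\mathbf{x} = \mu + \mathbf{P}\mathbf{z} + \mathbf{e}$ with $\mathrm{Cov}(\mathbf{z}) = \Phi$, then

\begin{align}
\mathbf{S} &= \mathrm{Cov}(\mathbf{x}, \mathbf{z}) = \mathbf{P} \, \Phi \\
\mathrm{Cov}(\mathbf{x}) &= \mathbf{P} \, \Phi \, \mathbf{P}^\top + \Psi
\end{align}

When the factors are uncorrelated, $\Phi = \mathbf{I}$ and the structure and pattern matrices coincide, which is why an orthogonal rotation yields a single loading matrix.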

### Factor Analysis vs. PCA
Both FA and PCA are commonly used for dimensionality reduction. Importantly, however, the way in which the two methods identify factors/principal components differs. 

The most important conceptual difference between PCA and FA is that PCA is a _descriptive technique_, while FA is a _structural model_. This means that PCA does not assume any underlying model relating the principal components to the observed data, while FA makes a strong assumption about the causal relationship that exists between factors and the observed data.

Computationally, the two approaches differ in the kind of explanation they provide. In particular, principal components are the optimal entities for explaining the _variance_ of each dimension in the data, while factors are the optimal entities for explaining the _covariance_ between pairs of dimensions.

In FA, we model the observed variables as linear functions of the latent factors. In PCA, however, we create _new_ variables that are linear combinations of the observed variables. 
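
The PCA half of this contrast can be shown in a few lines of NumPy. In the sketch below the data are simulated and all variable names are illustrative. The principal component scores are computed as linear combinations of the centered observed variables, with weights given by the eigenvectors of the sample covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 500 observations of 3 correlated variables.
mixing = np.array([[1.0, 0.0, 0.0],
                   [0.8, 0.6, 0.0],
                   [0.5, 0.3, 0.4]])
x = rng.standard_normal((500, 3)) @ mixing.T

xc = x - x.mean(axis=0)                        # center the observed variables
cov = np.cov(xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)         # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each column of `scores` is a new variable: a weighted sum of the observed ones.
scores = xc @ eigvecs
score_cov = np.cov(scores, rowvar=False)
```

The new variables are uncorrelated, and their variances equal the eigenvalues, so `score_cov` is diagonal with `eigvals` on its diagonal. No noise term or latent cause appears anywhere in the construction, in contrast to the FA model.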

However, FA and PCA are still deeply related. In fact, factor loadings and principal component loadings converge as the number of observed variables grows to infinity. This means that factor loadings may be thought of as the asymptotic principal component loadings as the number of items per component grows towards infinity.