### Chapter 12: Principal Components Analysis

*The Geometric Approach to PCA*

In PCA, the goal is to find a new basis for a set of observations where the new variables (principal components) are uncorrelated. We take as our observation vector

$$ 
\mathbf{y} = [y_1, y_2, y_3, \ldots, y_p]'
$$

of which we have $n$ observations. 

PCA consists of two steps. The first is centering the dataset, that is, translating to a new origin at the sample mean $\mathbf{\bar{y}}$. The second is rotating the axes so that the new variables are uncorrelated. A rotation is a multiplication by an orthogonal matrix, say $\mathbf{A}$. If we have completed our task, the new variables $\mathbf{z} = \mathbf{Ay}$ (with $\mathbf{y}$ now denoting the centered observation) will have a diagonal covariance matrix, $\mathbf{S_z} = \mathbf{ASA'}$. Here, $\mathbf{S}$ is the sample covariance matrix of $\mathbf{Y} = [\mathbf{y}_1, \mathbf{y}_2, \mathbf{y}_3, \ldots, \mathbf{y}_n]$.

Since $\mathbf{S}$ is a symmetric matrix, we can diagonalize it by taking the spectral decomposition, $\mathbf{C'SC} = \mathbf{D}$, where $\mathbf{D}$ is a diagonal matrix containing the eigenvalues of $\mathbf{S}$ and $\mathbf{C}$ is an orthogonal matrix whose columns are normalized eigenvectors of $\mathbf{S}$. Thus, $\mathbf{A} = \mathbf{C'}$. Our principal components are the transformed variables $z_j = \mathbf{a}_j'\mathbf{y}$, $j = 1, 2, \ldots, p$, where $\mathbf{a}_j'$ is the $j$th row of $\mathbf{A}$ (the $j$th eigenvector of $\mathbf{S}$), and the variances of our new variables (the principal components) are the eigenvalues of $\mathbf{S}$.
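The two steps (centering, then rotating by the eigenvectors of $\mathbf{S}$) can be sketched numerically. This is a minimal illustration using NumPy on synthetic data; the data, the seed, and all variable names are assumptions made for the example and do not come from the chapter.

```python
import numpy as np

# Hypothetical data: n = 200 observations (rows) of p = 3 correlated variables.
rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.5]])

# Step 1: center, i.e. translate the origin to the sample mean.
Y_c = Y - Y.mean(axis=0)

# Sample covariance matrix S (divisor n - 1).
S = np.cov(Y_c, rowvar=False)

# Step 2: rotate. eigh is meant for symmetric matrices and returns
# eigenvalues in ascending order, so reorder to put the largest first.
eigvals, C = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, C = eigvals[order], C[:, order]

# Observations are stored as rows, so z' = y'C is the row form of z = C'y = Ay.
Z = Y_c @ C

# The covariance matrix of the components should be diagonal,
# with the eigenvalues of S on the diagonal.
S_z = np.cov(Z, rowvar=False)
print(np.round(S_z, 6))
print(np.round(eigvals, 6))
```

Because the observations are stored as rows here, the rotation appears as a right-multiplication by $\mathbf{C}$ rather than a left-multiplication by $\mathbf{A} = \mathbf{C'}$.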

Principal components are not scale invariant, so we need to pay attention to the units in which our observations are measured.

Additionally, it is important that the variables not have widely disparate variances, as this will bias the results of PCA: if one variable has a much larger variance than the others, the first principal component will largely represent that variable. In that case we could standardize the variables before performing PCA, which is equivalent to performing PCA on the correlation matrix, $\mathbf{R}$.

*PCA on the Correlation Matrix*

PCA should typically be performed on the covariance matrix $\mathbf{S}$. However, there are certain circumstances under which it is more sensible to perform PCA on the correlation matrix $\mathbf{R}$. One such case is when the variances of the variables are quite disparate. The variables are standardized in transforming $\mathbf{S}$ to $\mathbf{R}$, and as such the shape of the swarm of points corresponding to our observations will change. However, the principal components of $\mathbf{R}$, unlike those of $\mathbf{S}$, are scale invariant.
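A small numerical check can make the scale dependence concrete. The sketch below uses NumPy on synthetic data; the data, the factor of 1000 used to mimic a change of units, and the helper `sorted_eigvals` are all assumptions introduced for illustration.

```python
import numpy as np

# Hypothetical data: n = 100 observations of p = 3 correlated variables.
rng = np.random.default_rng(1)
Y = rng.normal(size=(100, 3)) @ np.array([[1.0, 0.4, 0.2],
                                          [0.0, 1.0, 0.5],
                                          [0.0, 0.0, 1.0]])

def sorted_eigvals(M):
    """Eigenvalues of a symmetric matrix, largest first."""
    return np.sort(np.linalg.eigvalsh(M))[::-1]

# Change the units of the first variable (e.g. metres to millimetres).
Y_rescaled = Y * np.array([1000.0, 1.0, 1.0])

S = np.cov(Y, rowvar=False)
S_rescaled = np.cov(Y_rescaled, rowvar=False)
R = np.corrcoef(Y, rowvar=False)
R_rescaled = np.corrcoef(Y_rescaled, rowvar=False)

# Standardizing the variables and then taking the covariance reproduces R.
Y_std = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)
S_std = np.cov(Y_std, rowvar=False)

# Proportion of total variance carried by each component.
prop_S = sorted_eigvals(S) / np.trace(S)
prop_S_rescaled = sorted_eigvals(S_rescaled) / np.trace(S_rescaled)
prop_R = sorted_eigvals(R) / np.trace(R)  # trace(R) equals p

print(prop_S)           # depends on the units of measurement
print(prop_S_rescaled)  # dominated by the rescaled variable
print(prop_R)           # unchanged by the rescaling
```

In this sketch the rescaled variable takes over the first component of $\mathbf{S}$, while $\mathbf{R}$ and its components are unaffected by the change of units.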

Importantly, the components from a given matrix $\mathbf{R}$ are not unique to that matrix: the eigenvectors do not depend on the magnitudes of the correlations $r_{jk}$ but on their ratios, so different correlation matrices can share the same components. As such, claiming a percent of variance accounted for is less meaningful in these circumstances.
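The bivariate case makes this concrete. The following is a standard worked example, added here for illustration, for $p = 2$ standardized variables with correlation $r > 0$:

$$
\mathbf{R} = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}, \qquad
\lambda_1 = 1 + r, \quad \lambda_2 = 1 - r, \qquad
\mathbf{a}_1 = \tfrac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \quad
\mathbf{a}_2 = \tfrac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ -1 \end{pmatrix}
$$

The eigenvectors are the same for every value of $r$. Only the eigenvalues change, and with them the share of variance attributed to the first component, $(1 + r)/2$.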

*Deciding which Components to Keep*

A number of methods are standardly used, including:

1. Retaining enough components to account for a specified percentage of the total variance
2. Retaining components whose eigenvalues are greater than the average eigenvalue
3. Using a *scree plot* and looking for a natural break between the large and small eigenvalues
4. Testing the significance of the larger components
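The first three rules are easy to apply once the eigenvalues are in hand. The sketch below uses NumPy with made-up eigenvalues and an 80% threshold; both are assumptions for illustration, since the notes do not prescribe a cutoff. The significance test of rule 4 is not covered here.

```python
import numpy as np

# Hypothetical eigenvalues of S, sorted in decreasing order.
eigvals = np.array([4.2, 2.1, 0.9, 0.5, 0.3])

# Rule 1: smallest number of components whose cumulative proportion
# of variance reaches the threshold.
threshold = 0.80  # assumed cutoff, not a value from the text
cum_prop = np.cumsum(eigvals) / eigvals.sum()
k_percent = int(np.searchsorted(cum_prop, threshold) + 1)

# Rule 2: number of eigenvalues above the average eigenvalue.
k_average = int(np.sum(eigvals > eigvals.mean()))

# Rule 3: a scree plot graphs eigenvalue against component number.
# The successive drops are a rough numerical stand-in for spotting the break.
drops = -np.diff(eigvals)

print(cum_prop)
print(k_percent, k_average)
print(drops)
```

The rules need not agree: with these eigenvalues the percentage rule keeps one more component than the average-eigenvalue rule.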



### Chapter 13: Factor Analysis

**Introduction to the model**

The essential idea behind factor analysis is that the observed variables' responses are generated by a smaller set of underlying variables which cannot be directly observed. If a particular subset of observed variables has high correlations within the subset and low correlations with the remaining variables, there may be a single factor underlying those observed variables' responses.

As in PCA, we have a set of $p$-dimensional observation vectors $\mathbf{y}_i, i = 1, 2, \ldots, n$ with mean $\mathbf{\mu}$ and covariance matrix $\mathbf{\Sigma}$. The factor analysis model expresses each variable as a linear combination of the underlying factors, $f_1, f_2, \ldots, f_m$ (with $m < p$), plus an accompanying error term, $\epsilon$, for that part of the variability which is unique to each variable. Thus, the model is the following

\begin{align}
y_1-\mu_1 &= \lambda_{11} f_1 + \lambda_{12} f_2 + \cdots + \lambda_{1m} f_m + \epsilon_1 \\
y_2-\mu_2 &= \lambda_{21} f_1 + \lambda_{22} f_2 + \cdots + \lambda_{2m} f_m + \epsilon_2 \\
&\vdots \\
y_p-\mu_p &= \lambda_{p1} f_1 + \lambda_{p2} f_2 + \cdots + \lambda_{pm} f_m + \epsilon_p,
\end{align}

where the $f_k$ are factors and the $\lambda_{jk}$ are the *loadings*, or weights, associated with each factor for each variable $y_j$. Each $f_k$ is assumed to have $E[f_k] = 0$, Var($f_k$) $= 1$ and Cov($f_k, f_l$) $= 0$ for $k \neq l$. Similar assumptions hold for the $\epsilon_j$: $E[\epsilon_j] = 0$ and Cov($\epsilon_j, \epsilon_l$) $= 0$ for $j \neq l$, except that each $\epsilon_j$ has its own *specific variance*, Var($\epsilon_j$) $= \psi_j$. Because the $\epsilon_j$ do not covary with each other, the factors are what account for the correlations amongst our observed variables.
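These assumptions determine the variances and covariances of the observed variables. The short derivation below is a sketch that relies on one further standard assumption not spelled out above, namely that the factors are uncorrelated with the error terms, Cov($f_k, \epsilon_j$) $= 0$:

\begin{align}
\text{Var}(y_j) &= \lambda_{j1}^2 + \lambda_{j2}^2 + \cdots + \lambda_{jm}^2 + \psi_j \\
\text{Cov}(y_j, y_l) &= \lambda_{j1}\lambda_{l1} + \lambda_{j2}\lambda_{l2} + \cdots + \lambda_{jm}\lambda_{lm}, \quad j \neq l
\end{align}

In matrix form this reads $\mathbf{\Sigma} = \mathbf{\Lambda\Lambda'} + \mathbf{\Psi}$, where $\mathbf{\Lambda}$ is the $p \times m$ matrix of loadings and $\mathbf{\Psi} = \text{diag}(\psi_1, \psi_2, \ldots, \psi_p)$; both symbols are introduced here for the summary. The off-diagonal entries involve only the loadings, which is the sense in which the factors carry all of the covariance.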

**Methods of Finding Loadings and Factors**

*Principal Component Method*


*(Iterated) Principal Factor Method*


*Maximum Likelihood Method*


**Choosing the number, $m$, of factors**

These methods resemble those for PCA:

1. Choose the number such that a specific percentage of variance is accounted for
2. Choose $m$ equal to the number of eigenvalues greater than the average eigenvalue
3. Use a scree plot, retaining the factors that come before the eigenvalues level off
4. Test the specific hypothesis that the number of factors is $m$

**Rotation of loadings**

The goal of factor rotation is to bring clusters of points close to the axes of the identified factors, often to aid interpretability of the underlying factors by associating them more strongly with subsets of variables.
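A rotation multiplies the loading matrix by an orthogonal matrix, which changes the individual loadings but not the covariance structure they imply. The sketch below uses NumPy with made-up loadings for four variables on two factors and a hand-picked 45-degree angle; the loadings, the angle, and the names `Lambda` and `T` are assumptions for illustration, and no analytic rotation criterion is applied.

```python
import numpy as np

# Hypothetical loadings: p = 4 variables (rows) on m = 2 factors (columns).
Lambda = np.array([[0.7,  0.5],
                   [0.6,  0.6],
                   [0.6, -0.5],
                   [0.7, -0.6]])

# Orthogonal rotation matrix for a hand-picked angle.
theta = np.deg2rad(45)
T = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Rotated loadings.
Lambda_rot = Lambda @ T

# Row sums of squared loadings (the variance of each variable
# explained by the factors) are unchanged by the rotation.
communalities = (Lambda ** 2).sum(axis=1)
communalities_rot = (Lambda_rot ** 2).sum(axis=1)

print(np.round(Lambda_rot, 3))
print(communalities, communalities_rot)
```

Before rotation every variable loads substantially on both factors. After rotation the first two variables load mainly on the first factor and the last two mainly on the second, which is the kind of pattern that aids interpretation.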

Why can principal components not be rotated?

**Factor Scores**

Factor scores are estimates of the underlying factor values for each observation. They are useful when the behavior of the observations in terms of the factors is itself of interest, or when the scores are to be used as input to a further analysis.

The $f$'s are not observed, so they must be estimated, and the most popular approach is to use a regression model with the observed variables as predictors. The model for each observation is:

\begin{align}
\mathbf{f_i} &= \mathbf{B_1}'(\mathbf{y_i}-\mathbf{\bar{y}}) + \mathbf{\epsilon_i} \\
\mathbf{f_i}' &= (\mathbf{y_i}-\mathbf{\bar{y}})'\mathbf{B_1} + \mathbf{\epsilon_i}'
\end{align}

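Stacking the $n$ transposed equations as rows, with $\mathbf{Y_c}$ denoting the centered data matrix, the overall model is thus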

$$
\mathbf{F} = \mathbf{Y_c}\mathbf{B_1} + \mathbf{E}
$$

Estimating $\mathbf{B_1}$ requires evaluating

$$
\mathbf{\hat{B_1}} = (\mathbf{Y_c}'\mathbf{Y_c})^{-1}\mathbf{Y_c}'\mathbf{F} = \mathbf{S_{yy}}^{-1}\mathbf{S_{yf}}
$$

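Since $\mathbf{F}$ is unobserved, we substitute estimates: $\mathbf{S_{yy}}$ is the sample covariance matrix $\mathbf{S}$ of our treatment thus far, and $\mathbf{S_{yf}}$ is estimated by the matrix of estimated loadings, $\mathbf{\hat{\Lambda}}$. The estimated factor scores are therefore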

$$
\mathbf{\hat{F}} = \mathbf{Y_c}\mathbf{S}^{-1}\mathbf{\hat{\Lambda}}
$$
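The regression estimate of the factor scores can be sketched numerically as follows. The example uses NumPy on synthetic data and, as an assumption of this sketch, takes the loadings from the principal component method in its usual form, $\mathbf{\hat{\Lambda}} = \mathbf{C_1 D_1}^{1/2}$ (the first $m$ eigenvectors of $\mathbf{S}$ scaled by the square roots of their eigenvalues). The data, the seed, and the choice $m = 2$ are likewise illustrative.

```python
import numpy as np

# Hypothetical data: n = 300 observations of p = 4 correlated variables.
rng = np.random.default_rng(2)
n, p, m = 300, 4, 2
Y = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))

# Centered data matrix and sample covariance matrix.
Y_c = Y - Y.mean(axis=0)
S = np.cov(Y_c, rowvar=False)

# Assumed loading estimate (principal component method):
# first m eigenvectors scaled by the square roots of their eigenvalues.
eigvals, C = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1][:m]
Lambda_hat = C[:, order] * np.sqrt(eigvals[order])

# Factor scores F_hat = Y_c S^{-1} Lambda_hat.
# Solving S X = Lambda_hat avoids forming the inverse explicitly.
F_hat = Y_c @ np.linalg.solve(S, Lambda_hat)

print(F_hat.shape)
print(np.round(np.cov(F_hat, rowvar=False), 6))
```

With loadings from this particular method the estimated scores come out uncorrelated with unit variance, matching the assumptions placed on the factors. Other loading estimates would not in general give exactly this.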
