# Continuous Latent Variable Models - PCA, FA and ICA

### Preliminaries

- Goal 
  - Introduction to Linear Latent Variable Models (LLVM) on continuous domains, including factor analysis, principal component analysis and independent component analysis
- Materials        
  - Mandatory
    - These lecture notes
  - Optional
    - Bishop pp. 570-573, 577-580, 584-586 (PCA and FA)
    - Bishop pp. 591-592 (ICA)
    - M. Tipping and C. Bishop, [Probabilistic Principal Component Analysis](./files/bishop-ppca-jrss.pdf), Journal of the Royal Statistical Society. Series B, Vol.61, No.3, 1999 


###  Continuous Latent Variable Models

-  (Recall that) mixture models use a discrete class variable.

-  Sometimes, it is more appropriate to think in terms of **continuous**
underlying causes (factors) that control the observed data.

  -  E.g., observe test results for subjects: English, Spanish and French

$$\begin{align*}
  \underbrace{ \begin{bmatrix} x_1\;(=\text{English})\\ x_2\;(=\text{Spanish})\\ x_3\;(=\text{French}) \end{bmatrix} }_{\text{observed}}% &= f(\text{causes},\theta) + \text{noise}\\
&= \begin{bmatrix} \lambda_{11} & \lambda_{12}\\ \lambda_{21} & \lambda_{22}\\ \lambda_{31} & \lambda_{32}\end{bmatrix} \cdot \underbrace{ \begin{bmatrix} z_1\;(=\text{literacy})\\ z_2\;(=\text{intelligence})\end{bmatrix} }_{\text{causes}} +    \underbrace{\begin{bmatrix} v_1\\v_2\\v_3\end{bmatrix} }_{\text{noise}}
\end{align*}$$

- (**Unsupervised Regression**). This is like (linear) regression with unobserved inputs.


###  Dimensionality Reduction

-  If the dimension for the hidden 'causes' ($z$) is smaller than for the observed data ($x$), then the model (tries to) achieve **dimensionality reduction**.

-  Key applications include 
  1. **compression** (store $z$ rather than $x$) 
    - Compression through **real-valued** latent variables can be far more efficient than with discrete clusters.
    - E.g., with two 8-bit hidden factors, one can describe $2^{16} = 65{,}536 \approx 6.6\times 10^4$ settings; a discrete mixture model would need $2^{16}$ clusters to do the same!
  2. **noise reduction** (e.g. in biomedical, financial or speech signals)
  3. **feature extraction** (e.g. as a pre-processor for classification) 
  4. **visualization** (particularly if $\mathrm{dim}(Z)=2$)


  

### Example Problem


###  Model Specification for LC-LVM

- In this lesson, we focus on _Linear_ Continuous Latent Variable Models (**LC-LVM**).  

-  Introduce observation vector ${x}\in\mathbb{R}^D$ and $M$-dim ($M<D$) real-valued **latent factor**  $z$:
$$\begin{align*}
  x &= W z + \mu + \epsilon \\
  z &\sim \mathcal{N}(0,I) \\
  \epsilon &\sim \mathcal{N}(0,\Psi)
\end{align*}$$
or equivalently
$$\begin{align*}
p(x|z,\theta) &= \mathcal{N}(x|\,W z + \mu,\Psi) \tag{likelihood}\\
p(z) &= \mathcal{N}(z|\,0,I) \tag{prior}
\end{align*}$$
where $W$ is the $(D\times M)$-dim **factor loading matrix** and $\theta = \{W,\mu,\Psi\}$ collects the model parameters.

- As we will see, the **observation noise covariance matrix** $\Psi$ is always **diagonal** for interesting LC-LVM models. 

- Note that LC-LVM is very similar to linear regression: $p(y|x) = \mathcal{N}(y|\theta^T {x}, \sigma^2)$. 

- Note also that, a priori, the components of the hidden variable $z$ are **statistically independent** of each other; the components of the observed vector $x$ may be correlated.
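The generative reading of the model specification can be sketched in a few lines of NumPy. The function name `sample_lclvm` and the numerical values for $W$, $\mu$ and $\Psi$ below are hypothetical and chosen only for illustration (three observed scores, two latent factors); this is a minimal sketch, not a reference implementation.

```python
import numpy as np

def sample_lclvm(W, mu, psi, n_samples, seed=0):
    """Draw samples from x = W z + mu + eps, with z ~ N(0, I) and eps ~ N(0, diag(psi))."""
    rng = np.random.default_rng(seed)
    D, M = W.shape
    Z = rng.standard_normal((n_samples, M))                    # latent factors, one row per sample
    eps = rng.standard_normal((n_samples, D)) * np.sqrt(psi)   # diagonal observation noise
    X = Z @ W.T + mu + eps                                     # observations, one row per sample
    return X, Z

# Hypothetical parameters (assumed for illustration only)
W = np.array([[1.0, 0.0],
              [0.8, 0.3],
              [0.7, 0.5]])            # (D x M) factor loading matrix
mu = np.array([6.0, 5.5, 5.0])        # mean of the observations
psi = np.array([0.2, 0.1, 0.3])       # diagonal of the noise covariance Psi

X, Z = sample_lclvm(W, mu, psi, n_samples=200_000)
print(np.round(np.cov(Z, rowvar=False), 2))   # latent components: close to the identity
print(np.round(np.cov(X, rowvar=False), 2))   # observed components: clearly correlated
```

The two printed covariance matrices illustrate the last remark: the latent components are (approximately) uncorrelated, while the observed components are not.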







###  LC-LVM Analysis (1): The marginal distribution $p({x})$

-  Since $p(x|z)$ is Gaussian with a mean that is linear in $z$, and $p(z)$ is Gaussian, the joint $p(x,z) = p(x|z)p(z)$, the marginal $p(x)$ and the conditional
$p(z|x)$ distributions are all Gaussian.

- The marginal distribution for the observed data is
$$
\boxed{ p(x) = \mathcal{N}({x}|\,{\mu},W W^T + \Psi) } 
$$
since the **mean** evaluates to 
$$\begin{align*}
\mathrm{E}[x] &= \mathrm{E}[W z + \mu+ \epsilon] \\
 &= W \mathrm{E}[z] + \mu + \mathrm{E}[\epsilon] \\
 &= \mu 
\end{align*}$$
and the **covariance** matrix is
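$$\begin{align*}
\mathrm{cov}[x] &= \mathrm{E}[({x}-{\mu})({x}-{\mu})^T] \\
  &=  \mathrm{E}[(W z +\epsilon)(W z +\epsilon)^T] \\
   &= W \mathrm{E}[z z^T] W^T + \mathrm{E}[\epsilon \epsilon^T] \\
   &= W W^T + \Psi 
\end{align*}$$
where the cross terms vanish because $z$ and $\epsilon$ are independent with zero mean, and $\mathrm{E}[z z^T] = I$.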

$\Rightarrow$ **LC-LVM is just a MultiVariate Gaussian (MVG) model** $x \sim \mathcal{N}({\mu},\Sigma)$ with the restriction that $\Sigma= W W^T + \Psi$.

-  The effective covariance is the low-rank outer product of two
long skinny matrices plus a diagonal matrix.


<img src="./figures/fig-FA-eq1.png" width="400px">

$\Rightarrow$ LC-LVM provides a MVG model of **intermediate complexity**. Compare the number of free parameters:
  - $D(D+1)/2$ for full Gaussian covariance $\Sigma$
  - $D(M+1)$  for LC-LVM model where $\Sigma = W W^T + \Psi$. 
  - $D$ for diagonal Gaussian covariance $\Sigma = \mathrm{diag}(\sigma_i^2)$
  - $1$ for isotropic Gaussian covariance $\Sigma = \sigma^2 \mathrm{I}$
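To get a feeling for these numbers, the four counts can be tabulated for a concrete setting. The helper `covariance_parameter_counts` and the choice $D=100$, $M=5$ are assumptions made for this illustration only.

```python
def covariance_parameter_counts(D, M):
    """Number of free covariance parameters for the four Gaussian models listed above."""
    return {
        "full": D * (D + 1) // 2,    # symmetric D x D matrix
        "lc_lvm": D * (M + 1),       # D*M entries in W plus D diagonal entries in Psi
        "diagonal": D,               # one variance per component
        "isotropic": 1,              # a single shared variance
    }

counts = covariance_parameter_counts(D=100, M=5)
for name, n in counts.items():
    print(f"{name:>10s}: {n}")
# full: 5050, lc_lvm: 600, diagonal: 100, isotropic: 1
```

With only 600 parameters, the LC-LVM can still model correlations between all 100 components, which a diagonal covariance cannot.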
 


    


###  LC-LVM Analysis (2): The Factor Loading Matrix $W$ is Not Unique

-  The factor loading matrix $W$ can only be estimated up to a rotation matrix $R$. Namely, if we rotate $W \rightarrow WR $, then the covariance matrix for observations $x$ does not change (N.B.: a rotation (or orthogonal) matrix $R$ is a matrix such that $R^TR = R R^T = I$):

$$
W R (W R)^T + \Psi = W R R^T W^T + \Psi = W W^T + \Psi
$$


$\Rightarrow$ Two persons that estimate ML parameters for FA on the same data are **not guaranteed to find the same parameters**, since any rotation of $W$ is equally likely.

$\Rightarrow$ we can infer latent **subspaces** rather than individual components. One has to be careful when interpreting the numerical values of $W$ and $z$.
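The rotation invariance is easy to verify numerically. The sketch below draws an arbitrary $W$ and $\Psi$ (random values, assumed only for illustration), builds an orthogonal matrix $R$ from a QR decomposition, and compares the two covariance matrices.

```python
import numpy as np

rng = np.random.default_rng(1)
D, M = 5, 2
W = rng.standard_normal((D, M))          # arbitrary factor loading matrix
psi = rng.uniform(0.1, 1.0, size=D)      # arbitrary diagonal noise variances

# The Q factor of a QR decomposition is orthogonal: R^T R = R R^T = I
R, _ = np.linalg.qr(rng.standard_normal((M, M)))

cov_original = W @ W.T + np.diag(psi)
cov_rotated = (W @ R) @ (W @ R).T + np.diag(psi)

print(np.allclose(R.T @ R, np.eye(M)))        # R is orthogonal
print(np.allclose(cov_original, cov_rotated)) # the covariance of x is unchanged
```

Since $p(x)$ depends on $W$ only through $W W^T$, the data cannot distinguish $W$ from $WR$.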



###  LC-LVM analysis (3): Constraints on the Noise Variance $\Psi$

-  When doing ML estimation for the parameters, a trivial solution for the covariance matrix $\Sigma_x = W W^T + \Psi$ is setting $\hat W=0$ and $\hat\Psi$ equal to the sample covariance matrix of the data.

-  In this case, all data correlation is explained as noise. (We'd like to avoid this).

$\Rightarrow$ The LC-LVM model is uninteresting without some restriction on the observation noise covariance matrix $\Psi$. 

- The interesting cases are mostly for diagonal $\Psi$. Note that if $\Psi$ is diagonal, all correlations between the $(D)$ components of $x$ **must be explained** by the rank-$M$ matrix $W W^T$. 

##### Factor Analysis 

- In Factor Analysis (**FA**), $\Psi$ is restricted to be _diagonal_:

$$\begin{align*} 
\Psi = \mathrm{diag}(\psi_i) 
\end{align*}$$

##### Probabilistic Principal Component Analysis 

- In Probabilistic Principal Component Analysis (**pPCA**), the variances are further restricted to be the same,
 
$$\begin{align*} 
\Psi = \sigma^2 I 
\end{align*}$$

##### Principal Component Analysis 

- The 'regular' (deterministic) Principal Component Analysis (**PCA**) procedure can be obtained by further requiring that
$$\begin{align*} 
\Psi &= \lim_{\sigma^2\rightarrow 0} \sigma^2 I \\
W^T W &= I
\end{align*}$$ 
i.e., the noise model is discarded altogether and the columns of $W$ are orthonormal. 

- Regular PCA is a well-known deterministic procedure for dimensionality reduction (that predates pPCA).

$\Rightarrow$ FA, pPCA and PCA differ by their model for the noise variance $\Psi$ (namely, diagonal, isotropic and 'zeros').

###  Typical Applications
-  In PCA (or pPCA), the noise variance is assumed to be the same for all components. This is appropriate if all components of the observed data are 'shifted' versions of each other.

$\Rightarrow$ **PCA is very widely applied to image and signal processing tasks!**

-  Google (May-2015): [PCA "face recognition"] $>$ 300K hits; [PCA "noise reduction"] $>$ 100K hits 
-  FA is insensitive to scaling of individual components in the observed data (see appendix).
-  Use FA if the data are not shifted versions of the same kind.

$\Rightarrow$ **FA has a strong history in the 'social sciences'**



###  ML estimation for pPCA Model

- Given the generative model for pPCA 
$$\begin{align*}
p(x_n|z_n) &= \mathcal{N}(x_n\mid W z_n + \mu,\sigma^2 \mathrm{I})\\
p(z_n) &= \mathcal{N}(z_n \mid0,\mathrm{I})
\end{align*}$$
and observations ${D}=\{x_1,\dotsc,x_N\}$, find ML estimates for the parameters $\theta=\{W,\mu,\sigma^2\}$.

- **Inference for ${\mu}$** is easy: ${x}$ is a multivariate Gaussian with mean ${\mu}$, so its ML estimate is
$$ \hat {\mu} = \frac{1}{N}\sum_n {x}_n$$
Now subtract $\hat {\mu}$ from all data points (${x}_n:= {x}_n-\hat {\mu}$) and assume that we have zero-mean data.

##### Solution method 1: Gradient-ascent on the log-likelihood. 

- Work out the gradients for the log-likelihood
$$\log p({D}|{\theta}) = -\frac{N}{2} \log \lvert 2\pi(W W^T + \sigma^2 \mathrm{I})\rvert  -\frac{1}{2}\sum_n {x}_n^T(W W^T + \sigma^2 \mathrm{I})^{-1}{x}_n$$
and optimize w.r.t. $W$ and $\sigma^2$. This is difficult because $W$ and $\sigma^2$ are coupled through the inverse $(W W^T + \sigma^2 \mathrm{I})^{-1}$, but a closed-form solution does exist, see [Tipping and Bishop, 1999](./files/bishop-ppca-jrss.pdf).
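As a rough sketch, the log-likelihood above can be evaluated directly for zero-mean data. The second function uses the closed-form maximizer as we recall it from Tipping and Bishop (1999): $\hat\sigma^2$ is the average of the $D-M$ smallest eigenvalues of the sample covariance, and $\hat W = U_M(\Lambda_M - \hat\sigma^2 I)^{1/2}$ with the arbitrary rotation set to the identity. The function names and the synthetic data are assumptions for illustration.

```python
import numpy as np

def ppca_log_likelihood(X, W, sigma2):
    """log p(D|theta) for zero-mean data X (N x D) with covariance C = W W^T + sigma2 * I."""
    N, D = X.shape
    C = W @ W.T + sigma2 * np.eye(D)
    _, logdet = np.linalg.slogdet(C)
    quad = np.sum(X * np.linalg.solve(C, X.T).T)       # sum_n x_n^T C^{-1} x_n
    return -0.5 * N * (D * np.log(2 * np.pi) + logdet) - 0.5 * quad

def ppca_ml(X, M):
    """Closed-form ML estimate (after Tipping and Bishop, 1999), rotation fixed to I."""
    N, D = X.shape
    S = X.T @ X / N                                     # sample covariance of zero-mean data
    eigvals, eigvecs = np.linalg.eigh(S)                # eigenvalues in ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # reorder to descending
    sigma2 = eigvals[M:].mean()                         # average of the discarded eigenvalues
    W = eigvecs[:, :M] * np.sqrt(eigvals[:M] - sigma2)  # scale the M principal directions
    return W, sigma2

# Synthetic zero-mean data (assumed for illustration)
rng = np.random.default_rng(0)
N, D, M = 500, 4, 2
X = rng.standard_normal((N, M)) @ rng.standard_normal((M, D)) + 0.3 * rng.standard_normal((N, D))
X = X - X.mean(axis=0)

W_ml, sigma2_ml = ppca_ml(X, M)
ll_ml = ppca_log_likelihood(X, W_ml, sigma2_ml)
ll_perturbed = ppca_log_likelihood(X, W_ml + 0.1 * rng.standard_normal((D, M)), sigma2_ml)
print(ll_ml >= ll_perturbed)   # perturbing the ML solution should not increase the likelihood
```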

##### Solution method 2: Use EM



###  Inference for the Hidden Variables, $p(z|x)$

- Both for dimensionality reduction and the E-step in EM-based parameter estimation, we need $p(z|{x})$.
-  (Use Bayes rule). We first derive the joint density $p({x},z)$ and then condition to $p(z|{x})$
-  The covariance between ${x}$ and $z$ is
\begin{align}
\mathrm{cov}[z,x] &= \mathrm{E}[(z-0)(x-\mu)^T] \\
  &= \mathrm{E}[z (W z +\epsilon)^T] \\
  &= \mathrm{E}[z z^T]W^T + \mathrm{E}[z \epsilon^T]= W^T
\end{align}

-  Now we can write out the joint distribution $p(x,z)$,
$$
p(x,z|{\theta}) = \mathcal{N} \left( \begin{bmatrix} z \\{x}\end{bmatrix} \left|\, \begin{bmatrix} 0\\ {\mu} \end{bmatrix}, \begin{bmatrix} I & W^T \\ W & W W^T + \Psi \end{bmatrix} \right. \right)
$$

-  Direct application of the formula for Gaussian conditioning (SRG-5d) leads to the posterior $p(z|{x})$
$$
p(z|{x}) = \mathcal{N}\left( z| \tilde{V}(x-\mu),\,I-\tilde{V}W \right)
$$
with $\tilde{V} = W^T(W W^T + \Psi)^{-1}$.

- In principle, we're done inferring $p(z|{x})$, but note that computing $\tilde{V}$ requires inversion of a $D\times D$ matrix. The computational load can be substantially reduced to inversion of an $M\times M$ matrix (size of $z$), after application of the **matrix inversion identity** (Bishop eqn. C.7),
$$
(A+XBX^T)^{-1} = A^{-1} - A^{-1}X(B^{-1}+X^TA^{-1}X)^{-1}X^TA^{-1}
$$

- Substitution of this identity into the previous formula for $p(z|{x})$ gives
\begin{align}
p(z|x) &= \mathcal{N} \left( z |m,V \right) \\
  m &= V W^T \Psi^{-1}(x-\mu) \\
  V &= (I + W^T \Psi^{-1} W)^{-1}
\end{align}
which relies on an $M\times M$ matrix inversion.

- Note that when ${x}$ is given, the unobserved input $z$ is not known exactly; the uncertainty about input $z$, as expressed by the variance $V=(I + W^T \Psi^{-1} W)^{-1}$, can be computed _before the data point ${x}$ has been seen_.

- Compare this to linear regression, where we have full knowledge about an input-output pair $({x},y)$.
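The equivalence of the $D\times D$ and the $M\times M$ route can be checked numerically. In the sketch below, the function names and the random parameter values are assumptions for illustration; the direct route uses the covariance $I - \tilde{V}W$ that follows from Gaussian conditioning on the joint $p(x,z)$.

```python
import numpy as np

def posterior_direct(x, W, mu, psi):
    """p(z|x) via the D x D inverse: mean V~ (x - mu), covariance I - V~ W."""
    M = W.shape[1]
    V_tilde = W.T @ np.linalg.inv(W @ W.T + np.diag(psi))
    return V_tilde @ (x - mu), np.eye(M) - V_tilde @ W

def posterior_woodbury(x, W, mu, psi):
    """p(z|x) via the M x M inverse: V = (I + W^T Psi^{-1} W)^{-1}, m = V W^T Psi^{-1} (x - mu)."""
    M = W.shape[1]
    Wt_Pinv = W.T / psi                                # W^T Psi^{-1} for diagonal Psi
    V = np.linalg.inv(np.eye(M) + Wt_Pinv @ W)
    return V @ Wt_Pinv @ (x - mu), V

# Arbitrary test values (assumed for illustration)
rng = np.random.default_rng(2)
D, M = 8, 3
W = rng.standard_normal((D, M))
mu = rng.standard_normal(D)
psi = rng.uniform(0.1, 1.0, size=D)
x = rng.standard_normal(D)

m_direct, V_direct = posterior_direct(x, W, mu, psi)
m_fast, V_fast = posterior_woodbury(x, W, mu, psi)
print(np.allclose(m_direct, m_fast), np.allclose(V_direct, V_fast))
```

Note that `V_fast` does not depend on `x`: the posterior uncertainty is the same for every data point.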



### EM for Factor Analysis

-  Subtract sample mean $\hat{\mu} = \frac{1}{N}\sum_n {x}_n$ from all data points.
#### E-step
  - For each data point ${x}_n$, compute the posterior distribution of hidden factors, given the observed data,
\begin{align}
\hat{q}_n(z) &\triangleq p(z|x_n,\hat{\theta}^{-}) = \mathcal{N}(z|m_n,V)\\
V &= (I + W^T \Psi^{-1} W)^{-1} \\
m_n &= V W^T \Psi^{-1} x_n
\end{align}

#### M-step
  - Maximize $\mathrm{F}(\theta,\hat{q})$ w.r.t. ${\theta}$, (see section ... for details)
\begin{align}
\hat{W} &= \left( \sum_n x_n m_n^T \right) \left( N V + \sum_n m_n m_n^T \right)^{-1}\\
\hat \Psi &= \mathrm{diag}\left( \hat{W} V \hat{W}^T + \frac{1}{N} \sum_n (x_n-\hat{W} m_n)(x_n-\hat{W} m_n)^T \right)
\end{align}

- Note again the close relationship between the M-step iteration for LC-LVM and the ML solution for linear regression (in altered notation, $x_n=\theta^T u_n + \epsilon_n$ with $\epsilon_n \sim \mathcal{N}(0,\sigma^2 I)$),
$$ 
\hat{\theta} = (U^T U)^{-1}U^T x = \left( \sum_n u_n u_n^T \right)^{-1} \left( \sum_n u_n x_n^T \right)
$$

- The uncertainty about the inputs $z_n$, expressed by variance $V$ is the essential difference between LC-LVM and regression.

- In general, as $V= (I + W^T \Psi^{-1} W)^{-1} \rightarrow 0$, the equations for linear regression are recovered.
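The E- and M-step can be put together in a short NumPy sketch. The function names, the initialization (random $W$, per-component sample variances for $\Psi$) and the synthetic data are assumptions made for this illustration; $\Psi$ is kept diagonal by retaining only the diagonal of the update, as FA requires.

```python
import numpy as np

def fa_log_likelihood(X, W, psi):
    """Log-likelihood of zero-mean data X (N x D) under covariance W W^T + diag(psi)."""
    N, D = X.shape
    C = W @ W.T + np.diag(psi)
    _, logdet = np.linalg.slogdet(C)
    quad = np.sum(X * np.linalg.solve(C, X.T).T)
    return -0.5 * N * (D * np.log(2 * np.pi) + logdet) - 0.5 * quad

def fa_em(X, M, n_iter=50, seed=0):
    """EM for factor analysis; returns W, the diagonal of Psi and the log-likelihood history."""
    rng = np.random.default_rng(seed)
    X = X - X.mean(axis=0)                        # subtract the sample mean
    N, D = X.shape
    W = rng.standard_normal((D, M))               # assumed initialization
    psi = X.var(axis=0)                           # assumed initialization
    history = []
    for _ in range(n_iter):
        history.append(fa_log_likelihood(X, W, psi))
        # E-step: posterior covariance V and posterior means m_n (rows of Mn)
        Wt_Pinv = W.T / psi                       # W^T Psi^{-1}
        V = np.linalg.inv(np.eye(M) + Wt_Pinv @ W)
        Mn = X @ Wt_Pinv.T @ V                    # row n equals m_n^T
        # M-step: update W, then the diagonal of Psi using the new W
        W = (X.T @ Mn) @ np.linalg.inv(N * V + Mn.T @ Mn)
        resid = X - Mn @ W.T                      # rows x_n - W m_n
        psi = np.einsum("dm,mk,dk->d", W, V, W) + np.mean(resid**2, axis=0)
    return W, psi, history

# Synthetic data from a known FA model (assumed for illustration)
rng = np.random.default_rng(42)
W_true = rng.standard_normal((6, 2))
psi_true = rng.uniform(0.1, 0.5, size=6)
X = rng.standard_normal((1000, 2)) @ W_true.T + rng.standard_normal((1000, 6)) * np.sqrt(psi_true)

W_hat, psi_hat, history = fa_em(X, M=2, n_iter=30)
print(history[0], history[-1])   # the log-likelihood should not decrease over iterations
```

Because of the rotation ambiguity, `W_hat` need not resemble `W_true`; it is more meaningful to compare `W_hat @ W_hat.T` with `W_true @ W_true.T`.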





### Example Problem Revisited


###  Example: Noise Reduction by Auto-Encoder

\begin{center}\includegraphics[width=9cm]{./figures/fig-diabolo}\end{center}

- [Q.] Suppress noise in observation ${x}$ on the basis of observed data set $D=\{x_1,\dotsc,x_N\}$.
- [A.] Use ${D}$ to train FA (or pPCA) network and recover cleaned version $\hat {x}$ from ${x}$ as
$$
\hat {x} = \underbrace{W}_{(D \times M)} \cdot \underbrace{ V W^T\Psi^{-1} }_{ (M \times D)} \cdot {x}
$$
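The encoder-decoder reading of this formula can be sketched as follows. The function name `denoise` and the optional `mu` argument are assumptions for illustration (the formula above presumes zero-mean data); the tiny example uses $D=2$, $M=1$ so that the result can be checked by hand.

```python
import numpy as np

def denoise(x, W, psi, mu=None):
    """Return x_hat = W V W^T Psi^{-1} (x - mu) + mu for diagonal Psi = diag(psi)."""
    D, M = W.shape
    mu = np.zeros(D) if mu is None else mu       # assumed: data already zero-mean by default
    Wt_Pinv = W.T / psi                          # W^T Psi^{-1}
    V = np.linalg.inv(np.eye(M) + Wt_Pinv @ W)   # posterior covariance of z
    m = V @ Wt_Pinv @ (x - mu)                   # encoder: posterior mean of z given x
    return W @ m + mu                            # decoder: map the latent mean back to data space

W = np.array([[1.0],
              [1.0]])
psi = np.array([1.0, 1.0])
x_hat = denoise(np.array([3.0, 1.0]), W, psi)
print(x_hat)   # both components equal 4/3
```

In this example $V = (1+2)^{-1} = 1/3$ and $W^T\Psi^{-1}x = 4$, so $m = 4/3$: the output lies in the column space of $W$ and is shrunk towards zero by the prior on $z$.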


  

### Recap LC-LVM Models

- Model specification
- LC-LVM is just a Gaussian density model for $x$ with restricted covariance matrix $\Sigma = W W^T + \Psi$
- The factor loading matrix is not unique
- Different names for different assumptions on the diagonal noise covariance matrix

###  Independent Component Analysis (ICA)

- Often, in nature the observed data are mixtures of **independent non-Gaussian** sources, e.g., cocktail party
- The ICA (a.k.a. blind source separation) model is aimed at this kind of problem,

\begin{align}
x &= W z + \text{noise} \quad &\text{(x observed)} \\
p(z) &= \prod_k p(z_k) \quad &\text{(unknown non-Gaussian factorized prior)}
\end{align}

-  Many sensor noise models and many algorithms (often EM-based) for estimation of the sources $z$ and the mixing matrix $W$ have been developed.


###  Demo ICA; Blind Signal Separation
\begin{center}\includegraphics[width=11cm]{./figures/fig-bss}\end{center}
