# Continuous Latent Variable Models - PCA, FA and ICA


###  Continuous Latent Variable Models

-  Mixture models use a discrete class variable.
-  Sometimes, it is more appropriate to think in terms of **continuous**
underlying causes (factors) that control the observed data.
  -  E.g., observe test results for subjects: English, Spanish and French
\begin{align}
  \underbrace{ \begin{bmatrix} x_1\;(=\text{English})\\ x_2\;(=\text{Spanish})\\ x_3\;(=\text{French}) \end{bmatrix} }_{\text{observed}}
&= \begin{bmatrix} \lambda_{11} & \lambda_{12}\\ \lambda_{21} & \lambda_{22}\\ \lambda_{31} & \lambda_{32}\end{bmatrix} \cdot \underbrace{ \begin{bmatrix} z_1\;(=\text{illiteracy})\\ z_2\;(=\text{intelligence})\end{bmatrix} }_{\text{causes}} +    \underbrace{\begin{bmatrix} v_1\\v_2\\v_3\end{bmatrix} }_{\text{noise}}
\end{align}

-  (Unsupervised Regression). This is like (linear) regression with unobserved inputs.



###  Dimensionality Reduction
\includegraphics[height=4cm]{./figures/fig-Moerland-FA-geometry}
-  If the dimension for the hidden 'causes' ($z$) is smaller than for the observed data ($x$), then the model (tries to) achieve **dimensionality reduction**.
-  Think of this as an observed data pancake in a 3-D space.


  

###  Why Dimensionality Reduction?


-  Key applications include **compression** (store $z$ rather than $x$) and **visualization**.
-  Compression through **real-valued** latent variables can be far more efficient than with discrete clusters.
-  E.g., with two 8-bit hidden factors, one can describe $2^{16} \approx 6.6\times 10^4$ settings; a discrete mixture model would need $2^{16}$ clusters to do the same!
-  **Noise reduction** (e.g. in biomedical, financial or speech signals), **efficient representation and communication** and **feature extraction** (e.g. as a pre-processor for classification) can also be achieved through compression.




###  Specification for Linear Continuous Latent Variable Model (LC-LVM) 

-  Introduce observation vector ${x}\in\Re^D$ and $M$-dim ($M<D$) real-valued **latent factor**  $z$.
$$
x = W z + \mu + \epsilon, \qquad z \sim \mathcal{N}(0,I), \qquad \epsilon \sim \mathcal{N}(0,\Psi)$$
or equivalently
\begin{align}
p(x|z,\theta) &= \mathcal{N}(x|\,W z + \mu,\Psi) \tag{likelihood}\\
p(z) &= \mathcal{N}(z|\,0,I) \tag{prior}
\end{align}

where $W$ is the $(D\times M)$-dim **factor loading matrix** and **observation noise covariance matrix** $\Psi$ is **diagonal**.

- Note also that the components of the hidden vector $z$ are **statistically independent** of each other (since $p(z)=\mathcal{N}(z|\,0,I)$); the components of the observed vector $x$ may be correlated.

-  Compare to linear regression: $p(y|x) = \mathcal{N}(y|\theta^T {x}, \sigma^2)$

-  Our goal: Given observations ${D}=\{x_1,\dotsc,x_N\}$, find 
  1. ML estimates for the parameters $\theta=\{W,\mu,\Psi\}$ 
  2. Application: the posterior $p(z|x)$.
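
The generative model above can be simulated directly. The following is a minimal NumPy sketch; the function name `sample_lclvm` and the numerical values for $W$, $\mu$ and $\Psi$ are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def sample_lclvm(W, mu, psi, n_samples, rng):
    """Draw samples from x = W z + mu + eps,
    with z ~ N(0, I) and eps ~ N(0, diag(psi))."""
    D, M = W.shape
    z = rng.standard_normal((n_samples, M))                    # latent factors, one row per sample
    eps = rng.standard_normal((n_samples, D)) * np.sqrt(psi)   # noise with diagonal covariance
    x = z @ W.T + mu + eps                                     # observations, one row per sample
    return x, z

rng = np.random.default_rng(0)
D, M = 3, 2                          # e.g. three test scores, two hidden causes
W = rng.standard_normal((D, M))      # illustrative factor loading matrix
mu = np.array([6.0, 7.0, 5.0])       # illustrative mean
psi = np.array([0.1, 0.2, 0.3])      # diagonal of Psi (illustrative)

X, Z = sample_lclvm(W, mu, psi, n_samples=1000, rng=rng)
print(X.shape, Z.shape)  # (1000, 3) (1000, 2)
```

In practice only `X` is observed; `Z` is returned here purely so that estimates can be compared against the true latent values in experiments.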








###  Analysis of LC-LVM: The Marginal Distribution $p({x})$

-  Note: since $p(z)$ is Gaussian and $p({x}|z)$ is Gaussian with a mean that is linear in $z$, the joint
distribution $p({x},z)$, the marginal $p({x})$ and the conditional
$p(z|{x})$ are all Gaussian.

- The marginal distribution for the observed data is
$$
\boxed{ p({x}) = \mathcal{N}({x}|\,{\mu},W W^T + \Psi) }
$$
since
\begin{align}
\mathrm{E}[x] = \mathrm{E}[W z + \mu+ \epsilon] = W \mathrm{E}[z] + \mu + \mathrm{E}[\epsilon] = \mu \tag{mean}
\end{align}
and
\begin{align}
\mathrm{var}[x] &= \mathrm{E}[({x}-{\mu})({x}-{\mu})^T] =  \mathrm{E}[(W z +\epsilon)(W z +\epsilon)^T] \tag{variance}\\
   &= W \mathrm{E}[z z^T] W^T + \mathrm{E}[\epsilon \epsilon^T] = W W^T + \Psi 
\end{align}

$\Rightarrow$ **LC-LVM is just a MultiVariate Gaussian (MVG) model** $x \sim \mathcal{N}({\mu},\Sigma)$ with the restriction that $\Sigma= W W^T + \Psi$.

-  The effective covariance is the low-rank outer product of two
long skinny matrices plus a diagonal matrix.

\begin{figure}
\begin{center}
\includegraphics[height=2cm]{./figures/fig-FA-eq}
\end{center}
\end{figure}
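
The result $\mathrm{var}[x] = W W^T + \Psi$ can be checked numerically by comparing the model covariance with the sample covariance of simulated data. This is only a sanity-check sketch; all parameter values are randomly chosen assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
D, M, N = 4, 2, 200_000

# Illustrative (randomly chosen) model parameters
W = rng.standard_normal((D, M))
mu = rng.standard_normal(D)
psi = rng.uniform(0.1, 0.5, size=D)        # diagonal of Psi

# Simulate x = W z + mu + eps
z = rng.standard_normal((N, M))
eps = rng.standard_normal((N, D)) * np.sqrt(psi)
X = z @ W.T + mu + eps

Sigma_model = W @ W.T + np.diag(psi)        # covariance predicted by the model
Sigma_sample = np.cov(X, rowvar=False)      # covariance estimated from the samples

# Largest entry-wise deviation, relative to the largest covariance entry
rel_err = np.max(np.abs(Sigma_sample - Sigma_model)) / np.max(np.abs(Sigma_model))
print(rel_err)
```

The relative deviation should shrink roughly as $1/\sqrt{N}$ when more samples are drawn.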

-  Number of free parameters
  -  $D(D+1)/2$ for full Gaussian covariance $\Sigma$
  -  $D(M+1)$  for LC-LVM model where $\Sigma = W W^T + \Psi$. 
  -  $D$ for diagonal Gaussian covariance $\Sigma = \mathrm{diag}(\sigma_i^2)$
 

$\Rightarrow$ LC-LVM provides a MVG model of **intermediate complexity**.
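
To get a feeling for these numbers, the three counts can be tabulated for a given $D$ and $M$. The helper name `n_cov_params` is hypothetical; the formulas are the ones listed above.

```python
def n_cov_params(D, M):
    """Number of free covariance parameters for the three Gaussian models."""
    return {
        "full": D * (D + 1) // 2,   # symmetric D x D matrix
        "lc_lvm": D * (M + 1),      # D*M entries of W plus D diagonal entries of Psi
        "diagonal": D,              # one variance per component
    }

counts = n_cov_params(D=100, M=5)
print(counts)  # {'full': 5050, 'lc_lvm': 600, 'diagonal': 100}
```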
    


###  The Factor Loading Matrix $W$ is Not Unique

-  The factor loading matrix $W$ can only be estimated up to a rotation matrix $R$. Namely, if we rotate $W \rightarrow WR $, then the observed variance
$$
W R (W R)^T + \Psi = W R R^T W^T + \Psi = W W^T + \Psi
$$
does not change. (N.B., a rotation (or orthogonal) matrix $R$ is a matrix such that $R^TR = R R^T = I$).

-  Two people who estimate ML parameters for FA on the same data are **not guaranteed to find the same parameters**, since every rotation of $W$ yields exactly the same likelihood.

$\Rightarrow$ we can infer latent **subspaces** rather than individual components. One has to be careful when interpreting the numerical values of $W$ and $z$.
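
The rotation ambiguity is easy to verify numerically. The sketch below uses an arbitrary $2\times 2$ rotation (angle and parameter values are illustrative assumptions) and shows that $W$ changes while the covariance $W W^T + \Psi$ does not.

```python
import numpy as np

rng = np.random.default_rng(2)
D, M = 5, 2
W = rng.standard_normal((D, M))        # illustrative factor loadings
psi = rng.uniform(0.1, 1.0, size=D)    # illustrative diagonal of Psi

theta = 0.7                            # arbitrary rotation angle (radians)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # R^T R = R R^T = I

W_rot = W @ R
Sigma = W @ W.T + np.diag(psi)
Sigma_rot = W_rot @ W_rot.T + np.diag(psi)

print(np.allclose(Sigma, Sigma_rot))   # True: the observed covariance is unchanged
print(np.allclose(W, W_rot))           # False: the loadings themselves differ
```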



###  Constraints on the Noise Variance $\Psi$

-  When doing ML estimation for the parameters, a trivial solution for the covariance matrix $W W^T + \Psi$ is setting $\hat W=0$ and $\hat\Psi$ equal to the sample covariance matrix of the data.
-  In this case, all data correlation is explained as noise. (We'd like to avoid this).

$\Rightarrow$ The LC-LVM model is uninteresting without some restriction on the sensor noise covariance matrix $\Psi$.

 - In **Factor Analysis** (FA), $\Psi$ is restricted to be _diagonal_:
\begin{align} \Psi = \mathrm{diag}(\psi_i) \tag{FA}\end{align}
- Note that if $\Psi$ is diagonal, all correlations between the $(D)$ components of $x$ **must be explained** by the rank-$M$ matrix $W W^T$.

- In **(probabilistic) Principal Component Analysis** (pPCA), the variances are further restricted to be the same,
 \begin{align} \Psi = \sigma^2 I \tag{pPCA}\end{align}

- The 'regular' (deterministic) **Principal Component Analysis** procedure can be obtained by further requiring that
 \begin{align} \Psi = \lim_{\sigma^2\rightarrow 0} \sigma^2 I, \quad \text{and}\quad W^T W = I, \tag{PCA}\end{align} 
i.e., the noise model is discarded altogether and the columns of $W$ are orthonormal.

- Regular PCA is a well-known deterministic procedure for dimensionality reduction (that predates pPCA).

$\Rightarrow$ FA, pPCA and PCA differ by their model for the noise variance $\Psi$ (namely, diagonal, isotropic and 'zeros').

###  Typical Applications
-  In PCA (or pPCA), the noise variance is assumed to be the same for all components. This is appropriate if all components of the observed data are 'shifted' versions of each other.

$\Rightarrow$ **PCA is very widely applied to image and signal processing tasks!**

-  Google (May-2015): [PCA "face recognition"] $>$ 300K hits; [PCA "noise reduction"] $>$ 100K hits 
-  FA is insensitive to scaling of individual components in the observed data (see appendix).
-  Use FA if the data are not shifted versions of the same kind.

$\Rightarrow$ **FA has a strong history in the 'social sciences'**



###  ML estimation for FA Model

-  The parameters to be optimized are $\theta = \{ \mu,W,\Psi \}$

-  (Inference for ${\mu}$). $\hat {\mu}$ is easy: ${x}$ is a multivariate Gaussian with mean ${\mu}$, so its ML estimate is
$$ \hat {\mu} = \frac{1}{N}\sum_n {x}_n$$
    Now subtract $\hat {\mu}$ from all data points (${x}_n:= {x}_n-\hat {\mu}$) and
assume that we have zero-mean data.

-  **Solution 1**. Work out the gradients for the log-likelihood
$$\log p({D}|{\theta}) = -\frac{N}{2} \log |2\pi(W W^T + \Psi)| -\frac{1}{2}\sum_n {x}_n^T(W W^T + \Psi)^{-1}{x}_n$$
and optimize w.r.t. $W$ and $\Psi$, subject to constraints ($\Psi$ diagonal, possibly $W$ orthonormal, etc.). This turns out to be quite difficult since it is not possible to decouple $W$ and $\Psi$. Homework: try this!

-  **Solution 2**. Use EM
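
Whichever solution is used, it helps to be able to evaluate the log-likelihood, e.g. to monitor convergence. The following is a sketch of the expression above for zero-mean data; the function name `fa_log_likelihood` and the test parameters are assumptions made for illustration.

```python
import numpy as np

def fa_log_likelihood(X, W, psi):
    """log p(D|theta) for zero-mean data X of shape (N, D),
    with covariance Sigma = W W^T + diag(psi)."""
    N, D = X.shape
    Sigma = W @ W.T + np.diag(psi)
    _, logdet = np.linalg.slogdet(Sigma)               # log|Sigma|, numerically stable
    quad = np.sum(X * np.linalg.solve(Sigma, X.T).T)   # sum_n x_n^T Sigma^{-1} x_n
    # log|2*pi*Sigma| = D*log(2*pi) + log|Sigma|
    return -0.5 * N * (D * np.log(2 * np.pi) + logdet) - 0.5 * quad

# Illustrative data drawn from a randomly chosen FA model
rng = np.random.default_rng(3)
N, D, M = 500, 4, 2
W_true = rng.standard_normal((D, M))
psi_true = rng.uniform(0.2, 0.6, size=D)
X = rng.standard_normal((N, M)) @ W_true.T + rng.standard_normal((N, D)) * np.sqrt(psi_true)
X = X - X.mean(axis=0)                                 # subtract the ML estimate of mu

ll_true = fa_log_likelihood(X, W_true, psi_true)               # generating parameters
ll_diag = fa_log_likelihood(X, np.zeros((D, M)), X.var(axis=0))  # W = 0, all variance as noise
print(ll_true, ll_diag)
```

Comparing the two printed values gives an impression of how much of the data structure is captured by the factor loadings as opposed to a purely diagonal noise model.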



###  Inference for the Hidden Variables, $p(z|x)$

- Both for dimensionality reduction and the E-step in EM-based parameter estimation, we need $p(z|{x})$.
-  (Use Bayes rule). We first derive the joint density $p({x},z)$ and then condition to $p(z|{x})$
-  The covariance between ${x}$ and $z$
\begin{align}
\mathrm{cov}[z,x] &= \mathrm{E}[(z-0)(x-\mu)^T] \\
  &= \mathrm{E}[z (W z +\epsilon)^T] \\
  &= \mathrm{E}[z z^T]W^T + \mathrm{E}[z \epsilon^T]= W^T
\end{align}

-  Now we can write out the joint distribution $p(x,z)$,
$$
p(x,z|{\theta}) = \mathcal{N} \left( \begin{bmatrix} z \\{x}\end{bmatrix} \left|\, \begin{bmatrix} 0\\ {\mu} \end{bmatrix}, \begin{bmatrix} I & W^T \\ W & W W^T + \Psi \end{bmatrix} \right. \right)
$$

-  Direct application of the formula for Gaussian conditioning (SRG-5d) leads to the posterior $p(z|{x})$
$$
p(z|{x}) = \mathcal{N}\left( z| \tilde{V}(x-\mu),\,I-\tilde{V}W \right)
$$
with $\tilde{V} = W^T(W W^T + \Psi)^{-1}$.

- In principle, we're done with inferring $p(z|{x})$, but note that computing $\tilde{V}$ requires inversion of a $D\times D$ matrix. The computational load can be substantially reduced to inversion of an $M\times M$ matrix (size of $z$), after application of the **matrix inversion identity** (Bishop eqn. C.7),
$$
(A+XBX^T)^{-1} = A^{-1} - A^{-1}X(B^{-1}+X^TA^{-1}X)^{-1}X^TA^{-1}
$$

- Substitution of this identity into the previous formula for $p(z|{x})$ gives
\begin{align}
p(z|x) &= \mathcal{N} \left( z |m,V \right) \\
  m &= V W^T \Psi^{-1}(x-\mu) \\
  V &= (I + W^T \Psi^{-1} W)^{-1}
\end{align}
which relies on an $M\times M$ matrix inversion.

- Note that when ${x}$ is given, the unobserved input $z$ is not known exactly; the uncertainty about input $z$, as expressed by the variance $V=(I + W^T \Psi^{-1} W)^{-1}$, can be computed _before the data point ${x}$ has been seen_.

- Compare this to linear regression, where we have full knowledge about an input-output pair $({x},y)$.
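
The two routes to the posterior can be compared numerically. The sketch below computes $p(z|x)$ once by Gaussian conditioning on the joint (mean $\tilde{V}(x-\mu)$, covariance $I - \tilde{V}W$, with a $D\times D$ inverse) and once with the $M\times M$ form; function names and parameter values are illustrative assumptions.

```python
import numpy as np

def posterior_direct(x, W, mu, psi):
    """Posterior via Gaussian conditioning; inverts a D x D matrix."""
    D, M = W.shape
    Sigma = W @ W.T + np.diag(psi)
    V_tilde = W.T @ np.linalg.inv(Sigma)           # M x D
    return V_tilde @ (x - mu), np.eye(M) - V_tilde @ W

def posterior_woodbury(x, W, mu, psi):
    """Posterior via the matrix inversion identity; inverts an M x M matrix."""
    D, M = W.shape
    Wt_Pinv = W.T / psi                            # W^T Psi^{-1}; Psi is diagonal
    V = np.linalg.inv(np.eye(M) + Wt_Pinv @ W)     # M x M
    return V @ Wt_Pinv @ (x - mu), V

rng = np.random.default_rng(4)
D, M = 50, 3
W = rng.standard_normal((D, M))
mu = rng.standard_normal(D)
psi = rng.uniform(0.1, 1.0, size=D)
x = rng.standard_normal(D)

m1, V1 = posterior_direct(x, W, mu, psi)
m2, V2 = posterior_woodbury(x, W, mu, psi)
print(np.allclose(m1, m2), np.allclose(V1, V2))
```

Because $\Psi$ is diagonal, $\Psi^{-1}$ costs only an element-wise division, so the second form is the one to prefer when $M \ll D$.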



### EM for Factor Analysis

-  Subtract sample mean $\hat{\mu} = \frac{1}{N}\sum_n {x}_n$ from all data points.
#### E-step
  - For each data point ${x}_n$, compute the posterior distribution of hidden factors, given the observed data,
\begin{align}
\hat{q}_n(z) &\triangleq p(z|x_n,\hat{\theta}^{-}) = \mathcal{N}(z|m_n,V)\\
V &= (I + W^T \Psi^{-1} W)^{-1} \\
m_n &= V W^T \Psi^{-1} x_n
\end{align}

#### M-step
  - Maximize $\mathrm{F}(\theta,\hat{q})$ w.r.t. ${\theta}$, (see section ... for details)
\begin{align}
\hat{W} &= \left( \sum_n x_n m_n^T \right) \left( N V + \sum_n m_n m_n^T \right)^{-1}\\
\hat \Psi &= \mathrm{diag}\left[ \hat{W} V \hat{W}^T + \frac{1}{N} \sum_n (x_n-\hat{W} m_n)(x_n-\hat{W} m_n)^T \right]
\end{align}

- Note again the close relationship between the M-step update for LC-LVM and the ML solution for linear regression (in altered notation, $x_n=\theta^T u_n + \epsilon_n$ with $\epsilon_n \sim \mathcal{N}(0,\sigma^2 I)$),
$$ 
\hat{\theta} = (U^T U)^{-1}U^T x = \left( \sum_n u_n u_n^T \right)^{-1} \left( \sum_n u_n x_n^T \right)
$$

- The uncertainty about the inputs $z_n$, expressed by the variance $V$, is the essential difference between LC-LVM and regression.

- In general, as $V= (I + W^T \Psi^{-1} W)^{-1} \rightarrow 0$, the equations for linear regression are recovered.
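
The E- and M-steps above translate almost line by line into NumPy. The following is a minimal sketch, not a tuned implementation: the function name `fa_em`, the random initialization, the fixed iteration count and the synthetic test data are all assumptions, and only the diagonal of the $\Psi$ update is kept, in line with the FA restriction that $\Psi$ is diagonal.

```python
import numpy as np

def fa_em(X, M, n_iter=300, seed=0):
    """EM for factor analysis. X has shape (N, D); returns W (D x M) and diag(Psi) (D,)."""
    rng = np.random.default_rng(seed)
    X = X - X.mean(axis=0)                    # subtract sample mean
    N, D = X.shape
    W = rng.standard_normal((D, M))           # random initialization (assumption)
    psi = X.var(axis=0)                       # start with all variance attributed to noise
    for _ in range(n_iter):
        # E-step: V = (I + W^T Psi^-1 W)^-1 and m_n = V W^T Psi^-1 x_n
        Wt_Pinv = W.T / psi                   # W^T Psi^-1, shape (M, D)
        V = np.linalg.inv(np.eye(M) + Wt_Pinv @ W)
        Mn = X @ Wt_Pinv.T @ V                # row n holds m_n^T (V is symmetric)
        # M-step: W = (sum_n x_n m_n^T)(N V + sum_n m_n m_n^T)^-1
        Szz = N * V + Mn.T @ Mn
        W = np.linalg.solve(Szz, Mn.T @ X).T  # uses symmetry of Szz
        # Psi = diag[ W V W^T + (1/N) sum_n (x_n - W m_n)(x_n - W m_n)^T ]
        resid = X - Mn @ W.T
        psi = np.sum((W @ V) * W, axis=1) + np.mean(resid**2, axis=0)
    return W, psi

# Synthetic data from a known FA model (illustrative values)
rng = np.random.default_rng(5)
W_true = np.array([[1.0, 0.0], [1.0, 0.5], [0.8, -0.5],
                   [0.0, 1.0], [0.3, 1.0], [-0.5, 0.8]])
psi_true = np.array([0.2, 0.3, 0.25, 0.4, 0.5, 0.3])
N = 5000
X = rng.standard_normal((N, 2)) @ W_true.T + rng.standard_normal((N, 6)) * np.sqrt(psi_true)

W_hat, psi_hat = fa_em(X, M=2)
Sigma_hat = W_hat @ W_hat.T + np.diag(psi_hat)
Sigma_sample = np.cov(X, rowvar=False, bias=True)
print(np.max(np.abs(Sigma_hat - Sigma_sample)))
```

Note that `W_hat` should only be compared with `W_true` up to a rotation; the fitted covariance `Sigma_hat` is the quantity that is uniquely determined.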






###  Example: Noise Reduction by Auto-Encoder

\begin{center}\includegraphics[width=9cm]{./figures/fig-diabolo}\end{center}

- [Q.] Suppress noise in observation ${x}$ on the basis of observed data set $D=\{x_1,\dotsc,x_N\}$.
- [A.] Use ${D}$ to train FA (or pPCA) network and recover cleaned version $\hat {x}$ from ${x}$ as
$$
\hat {x} = \underbrace{W}_{(D \times M)} \cdot \underbrace{ V W^T\Psi^{-1} }_{ (M \times D)} \cdot {x}
$$
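
As an illustration of this recipe, the sketch below applies the reconstruction $\hat{x} = W V W^T \Psi^{-1} x$ to synthetic noisy data. For brevity it uses the generating parameters instead of trained ones, which is an assumption; the function name `fa_denoise` and the optional `mu` argument for non-zero-mean data are also additions for illustration.

```python
import numpy as np

def fa_denoise(x, W, psi, mu=None):
    """Project x (shape (D,) or (N, D)) onto the latent space and back:
    x_hat = W V W^T Psi^-1 (x - mu) + mu."""
    D, M = W.shape
    if mu is None:
        mu = np.zeros(D)                          # zero-mean data, as in the formula above
    Wt_Pinv = W.T / psi                           # W^T Psi^-1 (Psi diagonal)
    V = np.linalg.inv(np.eye(M) + Wt_Pinv @ W)    # posterior covariance of z
    m = (x - mu) @ (V @ Wt_Pinv).T                # posterior mean(s) of z
    return m @ W.T + mu

rng = np.random.default_rng(6)
D, M, N = 20, 2, 2000
W = rng.standard_normal((D, M))                   # illustrative loadings
psi = rng.uniform(0.5, 1.5, size=D)               # illustrative noise variances

clean = rng.standard_normal((N, M)) @ W.T         # noise-free signal W z
noisy = clean + rng.standard_normal((N, D)) * np.sqrt(psi)
denoised = fa_denoise(noisy, W, psi)

mse_noisy = np.mean((noisy - clean) ** 2)
mse_denoised = np.mean((denoised - clean) ** 2)
print(mse_noisy, mse_denoised)
```

The reconstruction error of the denoised signal should be clearly below that of the raw observations, since the noise components outside the $M$-dimensional signal subspace are suppressed.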


  

###  Independent Component Analysis (ICA)

- Often, in nature, the observed data are mixtures of **independent non-Gaussian** sources, e.g., the cocktail party problem.
- The ICA (a.k.a. blind source separation) model is aimed at this kind of problem,

\begin{align}
x &= W z + \text{noise} \quad &\text{(x observed)} \\
p(z) &= \prod_k p(z_k) \quad &\text{(unknown non-Gaussian factorized prior)}
\end{align}

-  Many sensor noise models and many algorithms (often EM-based) for estimation of the sources $z$ and the mixing matrix $W$ have been developed.


###  Demo ICA; Blind Signal Separation
\begin{center}\includegraphics[width=11cm]{./figures/fig-bss}\end{center}