# 1. Non-linear Dimensionality Reduction

## 1.1. Introduction

* **Applications of Dimensionality Reduction**

>* Modelling data on/near manifolds
>* Visualization of high-dimensional data
>* Simple building blocks for complex models (e.g. FA $\rightarrow$ LGSSMs, GPLVM $\rightarrow$ GPSSMs)

* **Dimensionality Reduction: Conceptual Space**

><img src = 'images/image1_01.png' width=500>

## 1.2. DR via Distance

* **Goal:** find a mapping that preserves pairwise distances

>$$d^{(y)}_{nm} = ||y^{(n)}-y^{(m)}|| \approx d^{(x)}_{nm} = ||x^{(n)}-x^{(m)}||$$

* **PCA:** Linear DR

>* Data

>$$\mathcal{D} = \{ \mathbf{y}_{1}, \cdots , \mathbf{y}_{N} \} \;\;\;,\;\;\; \mathbf{y}_{n} \in \mathbb{R}^D \;\;\;,\;\;\; \text{w.l.g. assume} \;\;\; \frac{1}{N} \sum_n \mathbf{y}_{n} = \mathbf{0}$$

>* Linear projection

>$$\mathbf{x}_n = \mathbf{w}^T \mathbf{y}_n \;\;\;,\;\;\; \mathbf{x}_{n} \in \mathbb{R}^K$$

>* Variance

>$$\text{Var}(x) = \frac{1}{N} \sum_n \mathbf{x}_n \mathbf{x}_n^T = \frac{1}{N} \sum_n \mathbf{w}^T \mathbf{y}_n \mathbf{y}_n^T \mathbf{w} = \mathbf{w}^T \left( \frac{1}{N} \sum_n \mathbf{y}_n \mathbf{y}_n^T \right) \mathbf{w} = \mathbf{w}^T \mathbf{\Sigma}_y \mathbf{w}$$

>* Objective Function (Lagrangian with multiplier $\lambda$ enforcing the constraint $\mathbf{w}^T \mathbf{w} = 1$)

>$$\mathbf{w}^* = \underset{\mathbf{w}}{\text{argmax}} \; \mathbf{w}^T \mathbf{\Sigma}_y \mathbf{w} - \lambda(\mathbf{w}^T \mathbf{w} - 1)$$

>* Solution

>$$\mathbf{\Sigma}_y \mathbf{w}^* = \lambda \mathbf{w}^*$$

>$$\mathbf{w}^*:  \text{top } K \text{ eigenvectors of } \mathbf{\Sigma}_y \text{ in the order of decreasing eigenvalues}$$
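
A minimal NumPy sketch of the PCA solution above, assuming the data are stored as rows of an $N \times D$ array. The function name `pca` and the synthetic data are illustrative choices, not part of the notes.

```python
import numpy as np

def pca(Y, K):
    """Project the rows of Y (N x D) onto the top-K principal directions."""
    Y = Y - Y.mean(axis=0)                    # enforce the zero-mean assumption
    Sigma = Y.T @ Y / Y.shape[0]              # sample covariance Sigma_y (D x D)
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:K]     # indices of the K largest eigenvalues
    W = eigvecs[:, order]                     # D x K, orthonormal columns
    X = Y @ W                                 # rows are x_n = W^T y_n
    return X, W, eigvals[order]

# Illustrative data: three axes with very different variances.
rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.1])
X, W, lam = pca(Y, K=2)
```

The variance of each projected coordinate equals the corresponding eigenvalue, which matches $\mathbf{w}^T \mathbf{\Sigma}_y \mathbf{w} = \lambda$ for a unit-norm eigenvector.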

* **ISOMAP:** Non-linear DR / geodesic distance via neighbourhood graph

>1. Determine the **neighbors** of each point (e.g. kNN)
>2. Construct a **neighborhood graph** (connect each point to its kNNs / edge length: Euclidean distance)
>3. Compute **shortest path** between two nodes (e.g. Dijkstra's algorithm, Floyd-Warshall algorithm, ...)
>4. Compute **lower-dimensional embedding** (MDS - multidimensional scaling)
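
The four ISOMAP steps can be sketched end to end in NumPy. This is a small illustrative version under stated assumptions: it uses a dense distance matrix, the Floyd-Warshall algorithm for shortest paths, and assumes the kNN graph is connected. The function name `isomap` and the half-circle test data are hypothetical.

```python
import numpy as np

def isomap(Y, n_neighbors, K):
    """Embed the rows of Y (N x D) into K dimensions via geodesic distances."""
    N = Y.shape[0]
    dist = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)

    # Steps 1-2: kNN graph with Euclidean edge lengths (inf = no edge).
    G = np.full((N, N), np.inf)
    np.fill_diagonal(G, 0.0)
    nearest = np.argsort(dist, axis=1)[:, 1:n_neighbors + 1]  # column 0 is the point itself
    rows = np.repeat(np.arange(N), n_neighbors)
    cols = nearest.ravel()
    G[rows, cols] = dist[rows, cols]
    G = np.minimum(G, G.T)                    # make the graph undirected

    # Step 3: all-pairs shortest paths (Floyd-Warshall).
    for k in range(N):
        G = np.minimum(G, G[:, [k]] + G[[k], :])

    # Step 4: classical MDS on the geodesic distances.
    J = np.eye(N) - np.ones((N, N)) / N
    B = -0.5 * J @ (G ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:K]
    return eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0.0, None))

# Points on a half circle: a 1-D manifold embedded in 2-D.
t = np.linspace(0.0, np.pi, 40)
Y = np.column_stack([np.cos(t), np.sin(t)])
X = isomap(Y, n_neighbors=2, K=1)
```

On this example the one-dimensional embedding is close to a linear function of the arc-length parameter `t` (up to sign), which is what preserving geodesic distance should give.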

* **Dijkstra's Algorithm**

>1. Create the **unvisited set** (containing all nodes)
>1. Assign a **tentative distance** to every node ($0$ for initial node, $\infty$ for others)
>1. For every **unvisited neighbour** of the current node,
>  1. calculate the tentative distance through the current node
>  1. update if it is smaller than the current value
>1. **Remove** the current node from the unvisited set
>1. **Stop if:**
>  1. destination node is visited (when planning a route between two specific nodes)
>  1. smallest tentative distance in the unvisited set is $\infty$ (when planning a complete traversal)
>1. **Otherwise:**
>  1. Current node $\leftarrow$ unvisited node with the smallest tentative distance
>  1. Go back to **Step 3**
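
The steps above can be sketched with a priority queue from the Python standard library. This variant computes distances from one source to every node (the complete-traversal case); the adjacency-dict graph format and the example graph are assumptions made for illustration.

```python
import heapq
import math

def dijkstra(graph, source):
    """Shortest-path distances from source; graph maps node -> {neighbour: edge length}."""
    dist = {node: math.inf for node in graph}  # tentative distances
    dist[source] = 0.0
    visited = set()
    heap = [(0.0, source)]                     # (tentative distance, node)
    while heap:
        d, node = heapq.heappop(heap)          # unvisited node with smallest distance
        if node in visited:
            continue                           # stale queue entry, already finalised
        visited.add(node)
        for nbr, length in graph[node].items():
            new_d = d + length                 # distance through the current node
            if new_d < dist[nbr]:
                dist[nbr] = new_d
                heapq.heappush(heap, (new_d, nbr))
    return dist

graph = {
    "a": {"b": 1.0, "c": 4.0},
    "b": {"a": 1.0, "c": 2.0, "d": 5.0},
    "c": {"a": 4.0, "b": 2.0, "d": 1.0},
    "d": {"b": 5.0, "c": 1.0},
    "e": {},                                   # unreachable node keeps distance inf
}
dist = dijkstra(graph, "a")
```

The heap replaces the explicit scan for the smallest tentative distance; unreachable nodes never enter the queue, so their distance stays at $\infty$.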

* **MDS** (a.k.a. **PCoA** - Principal Coordinates Analysis)

>1. Set up the **squared proximity matrix**
>$$$$
>$$D^{(2)} = [d^2_{ij}]$$
>$$$$
>2. Apply **double centering** ($n$: no. of objects) 
>$$$$
>$$B = -\frac{1}{2} JD^{(2)}J \;\;\;,\;\;\; J=I-\frac{1}{n} \mathbf{1} \mathbf{1}^T$$
>$$$$
>3. Determine $K$ largest **eigenvalues and corresponding eigenvectors** of $B$ ($K$: desired dimension)
>4. $X=E_K \Lambda_K^{1/2}$ ($E_K$: matrix of eigenvectors / $\Lambda_K$: diagonal matrix of eigenvalues)
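
A short NumPy sketch of the four MDS steps, assuming the input is a symmetric matrix of pairwise distances. The function name `classical_mds` and the random test points are illustrative.

```python
import numpy as np

def classical_mds(D, K):
    """Classical MDS: D is an n x n distance matrix, K the target dimension."""
    n = D.shape[0]
    D2 = D ** 2                                # step 1: squared proximity matrix
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ D2 @ J                      # step 2: double centering
    eigvals, eigvecs = np.linalg.eigh(B)       # step 3: eigendecomposition
    order = np.argsort(eigvals)[::-1][:K]      # K largest eigenvalues
    lam = np.clip(eigvals[order], 0.0, None)   # guard against tiny negative values
    return eigvecs[:, order] * np.sqrt(lam)    # step 4: X = E_K Lambda_K^{1/2}

# Distances between 2-D points are reproduced by a 2-D embedding.
rng = np.random.default_rng(0)
P = rng.normal(size=(10, 2))
D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
X = classical_mds(D, K=2)
D_hat = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
```

The recovered coordinates differ from the original points by a rotation, reflection and translation, but the pairwise distances agree.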

* **Limitations**

>1. Non-linear embedding-based methods require optimisation of the new representation $x$
>2. They work well for small $K$, but are slow for higher dimensions
>3. They do not provide a quick way to map $y^{(new)} \rightarrow x^{(new)}$
>4. They do not provide a way to map $x^{(new)} \rightarrow y^{(new)}$

## 1.3. DR using Auto-Encoders

* **Framework**

><img src = 'images/image1_02.png' width=250>

>* **Cost function**

>$$\underset{\theta,\phi}{\text{argmin}} \sum^N_{n=1} ||y^{(n)} - \hat{y}^{(n)}||^2 + \text{constraints}$$

>* **Examples for Encoding/Decoding Fn.**
>  * **Linear PCA:** use linear matrix multiplication - $\Phi$ and $\Theta$
>  * **Deep Neural AE:** use DNN
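
As a rough illustration of the cost function, the sketch below trains a linear auto-encoder by plain gradient descent, with the bottleneck dimension $K$ acting as the only constraint. The synthetic data, learning rate, step count and initialisation scale are arbitrary assumptions, not values from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 200, 5, 2

# Synthetic data lying near a K-dimensional linear subspace.
Z = rng.normal(size=(N, K))
Y = Z @ rng.normal(size=(K, D)) + 0.05 * rng.normal(size=(N, D))
Y -= Y.mean(axis=0)

Phi = 0.1 * rng.normal(size=(K, D))     # encoder: x = Phi y
Theta = 0.1 * rng.normal(size=(D, K))   # decoder: y_hat = Theta x

def loss(Phi, Theta):
    """Mean squared reconstruction error over the data set."""
    return np.mean(np.sum((Y - Y @ Phi.T @ Theta.T) ** 2, axis=1))

learning_rate = 0.01
losses = [loss(Phi, Theta)]
for step in range(1000):
    X = Y @ Phi.T                       # encode, N x K
    R = X @ Theta.T - Y                 # reconstruction residual, N x D
    grad_Theta = 2.0 / N * R.T @ X      # gradient w.r.t. the decoder
    grad_Phi = 2.0 / N * (R @ Theta).T @ Y  # gradient w.r.t. the encoder
    Theta -= learning_rate * grad_Theta
    Phi -= learning_rate * grad_Phi
    losses.append(loss(Phi, Theta))
```

Replacing the two matrices with deep networks gives the deep neural auto-encoder case; the loss and training loop keep the same shape.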

* **Objective:** to find **Interesting Embedding** (not identity mapping)

>* **Achieve via constraint**
>  * Dimensionality of $x$
>  * Sparsity - only a subset of $x$ is non-zero
>  * Function complexity

>* **Achieve via data corruption**
>  * Add noise in $y$
>  * Drop-out
>  * Transform (e.g. $y$: image from one angle $\rightarrow$ $\hat{y}$: image from another angle)

## 1.4. DR using Probabilistic Models

* **Family of Models**

>|Model Class|<img src = 'images/image1_03.png' width=67>|<img src = 'images/image1_04.png' width=210>|<img src = 'images/image1_05.png' width=210>|
|-|-|-|-|
|**Full Linear**|　　Factor Analysis (FA)<br/><br/>$p(x)=\mathcal{G}(x;0,I)$<br/><br/>$p(y|x)=\mathcal{G}(y;\theta x,D)$|Inter-battery FA<br/><br/>$p(x),p(x_i)=\mathcal{G}(x;0,I)$<br/><br/>$p(y_i|x)=\mathcal{G}(y_i;\theta^{\text{sh}}_i x + \theta^{\text{pri}}_i x_i, D)$|LGSSM<br/><br/>$p(x_t|x_{t-1})=\mathcal{G}(x_t;\Psi x_{t-1},\Sigma)$<br/><br/>$p(y|x)=\mathcal{G}(y;\theta x,D)$|
|||||
|**Special Linear**|PCA<br/><br/>$D=\sigma^2 I$|Canonical Correlation Analysis<br/><br/>$D=\sigma^2 I$|Slow Feature Analysis<br/><br/>$D=\sigma^2 I \;\;,\;\; \Sigma=1-\Psi^2$<br/><br/>$\Psi=\text{diag}(\psi_1,...,\psi_K)$|
|||||
|**GP maps $y=f(x)$**|GP-LVM|Multi-view GP-LVM|GP-dynamical System|
|||||
|**Others**|e.g. ICA|Information Bottleneck<br/>style and content|e.g. GP-SSM|

* **Probabilistic PCA**

>* **Generative Model**

>$$x_n \sim \mathcal{N}(0,I) \;\;\;,\;\;\;
y_n = {\theta} x_n + \sigma \epsilon_n \;\;\;,\;\;\;
\epsilon_n \sim \mathcal{N}(0,I)$$

>\begin{align}
p(y_n) &= \int p(x_n, y_n) dx_n = \mathcal{N} (y_n|\mathbf{\mu}_y, \Sigma_y) \\
\mathbf{\mu}_y &= \mathbb{E}_{p(x,\mathbf{\epsilon})} [\theta x + \sigma \mathbf{\epsilon}] = \theta \cdot 0 + 0 = 0 \\
\Sigma_y &= \text{Cov}(y) = \mathbb{E}(y y^T) = \mathbb{E}_{p(x,\epsilon)} [(\theta x + \sigma \epsilon)(\theta x + \sigma \epsilon)^T] = \theta \theta^T + \sigma^2 I \\
p(y_n|\theta,\sigma) &= \mathcal{N} (y_n | 0, \theta \theta^T + \sigma^2 I)
\end{align}

>* **ML Solution**

>\begin{align}
\Sigma_y e_k &= \lambda_k e_k \;\;\;\rightarrow\;\;\; E_K = [e_1,...,e_K] \text{ and } \Lambda_K = \text{diag}(\lambda_1,...,\lambda_K)\\
(\sigma^2)^{ML} &= \frac{1}{D-K} \sum^D_{k=K+1} \lambda_k \\
\theta^{ML} &= E_K (\Lambda_K - \sigma^2 I)^{1/2}R \;\;\;\rightarrow\;\;\; R: \text{arbitrary rotation}
\end{align}
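
A minimal sketch of the ML solution, assuming the eigendecomposition is taken of the sample covariance and the arbitrary rotation is fixed to $R = I$. The function name `ppca_ml` and the synthetic data are illustrative.

```python
import numpy as np

def ppca_ml(Y, K):
    """Maximum-likelihood PPCA parameters for the rows of Y (N x D), with R = I."""
    N, D = Y.shape
    Yc = Y - Y.mean(axis=0)
    S = Yc.T @ Yc / N                          # sample covariance
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]          # decreasing eigenvalues
    lam, E = eigvals[order], eigvecs[:, order]
    sigma2 = lam[K:].mean()                    # noise variance: mean of discarded eigenvalues
    theta = E[:, :K] * np.sqrt(lam[:K] - sigma2)  # E_K (Lambda_K - sigma^2 I)^{1/2}
    return theta, sigma2

# Data drawn from the generative model with true noise variance 0.01.
rng = np.random.default_rng(0)
true_theta = rng.normal(size=(5, 2))
X = rng.normal(size=(2000, 2))
Y = X @ true_theta.T + 0.1 * rng.normal(size=(2000, 5))
theta, sigma2 = ppca_ml(Y, K=2)
```

The fitted covariance $\theta \theta^T + \sigma^2 I$ reproduces the top $K$ eigenvalues of the sample covariance exactly and replaces the remaining ones by their average.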

* **GP-LVM**

>* **Gaussian Process**

>\begin{align}
p(f(x)|\theta) &= \mathcal{GP}(0,K(x,x')) \\
K(x,x') &= \sigma^2 \exp \left( -\frac{1}{2l^2} (x-x')^2 \right) \\
p(y(x)|\theta) &= \mathcal{GP}(0,K(x,x')+\sigma_y^2 I)
\end{align}

>* **GP-LVM** (set $C(x,x')=\sum_k x_k x_k'$ to recover PCA)

>\begin{align}
p(f_d) &= \mathcal{GP} (f_d;0,C(x,x'))\\
p(x) &= \mathcal{G} (x;0,I) \\
p(y_d|x,f_d) &= \mathcal{G}(y_d;f_d(x),\sigma^2)
\end{align}

>* **MAP Inference**
>  * Optimise positions of data in latent space
>  * No explicit mappings (ISOMAP-like)

>\begin{align}
p(y|x) &= \int p(y|x,f)p(f)df = \mathcal{G}(y_{1:N};0,\Sigma(x_{1:N})) \\
\\
x_{\text{MAP}} &= \underset{x}{\text{argmax}} p(y|x)p(x) \\
&= \underset{x}{\text{argmax}} \log p(x) - \frac{1}{2} \log \det \Sigma (x_{1:N}) - \frac{1}{2} \text{trace} \left( \Sigma (x_{1:N})^{-1} y_{1:N} y_{1:N}^T \right)
\end{align}
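
The MAP objective can be evaluated directly for a given set of latent positions. The sketch below assumes a squared-exponential kernel, a fixed noise variance, and $D$ independent output dimensions sharing one kernel (so the log-determinant term is counted $D$ times, and the $2\pi$ constants are kept). The function names and test data are hypothetical.

```python
import numpy as np

def rbf_kernel(X, variance=1.0, lengthscale=1.0):
    """Squared-exponential kernel matrix for latent points X (N x Q)."""
    diff = X[:, None, :] - X[None, :, :]
    return variance * np.exp(-0.5 * np.sum(diff ** 2, axis=-1) / lengthscale ** 2)

def gplvm_log_joint(X, Y, noise_var=0.1):
    """log p(Y | X) + log p(X) with f integrated out; Y is N x D."""
    N, D = Y.shape
    Sigma = rbf_kernel(X) + noise_var * np.eye(N)
    L = np.linalg.cholesky(Sigma)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))   # Sigma^{-1} Y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    log_lik = (-0.5 * D * log_det
               - 0.5 * np.sum(Y * alpha)                  # trace(Sigma^{-1} Y Y^T)
               - 0.5 * N * D * np.log(2 * np.pi))
    log_prior = -0.5 * np.sum(X ** 2) - 0.5 * X.size * np.log(2 * np.pi)
    return log_lik + log_prior

# Observations that vary smoothly with a 1-D latent coordinate t.
t = np.linspace(-2.0, 2.0, 20)
Y = np.column_stack([np.sin(t), np.cos(t)])
X_true = t[:, None]
X_shuffled = np.random.default_rng(0).permutation(t)[:, None]
score_true = gplvm_log_joint(X_true, Y)
score_shuffled = gplvm_log_joint(X_shuffled, Y)
```

Latent positions that make the data a smooth function of $x$ score higher than a random reassignment of the same positions; MAP inference would climb this objective by gradient ascent on $x$.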

>* **Back-constrained Inference**
>  * Accelerates inference/learning via recognition model
>  * $\phi$ and $\theta$ optimised using same objective

>\begin{align}
x_{\text{MAP}}(y_{1:N}) &\approx g_\phi (y_{1:N}) \\
\phi &= \underset{\phi}{\text{argmax}} \; p(x=g_\phi(y_{1:N})|y,\theta) = \underset{\phi}{\text{argmax}} \log p(x=g_\phi(y_{1:N}),y|\theta) \\
\theta &= \underset{\theta}{\text{argmax}} \log p(y|\theta) \approx \underset{\theta}{\text{argmax}} \log p(x=g_\phi(y_{1:N}),y|\theta)
\end{align}

## 1.5. Probabilistic Inference as an Auto-encoder

* **Variational Inference**

>* **Goal:** maximise the **approximate likelihood**, i.e. a lower bound $\mathcal{F}(q,\theta)$ on the log-likelihood $\mathcal{L}(\theta)$

><img src = 'images/image1_06.png' width=300>

>\begin{align}
\mathcal{L}(\theta) &= \log p(y|\theta) = \log \int p(y,x|\theta) dx = \log \int \frac{q(x)}{q(x)} p(y,x|\theta)dx \\
&\geq \int q(x) \log \frac{p(y,x|\theta)}{q(x)}  dx = \mathcal{F}(q,\theta) \\
\mathcal{F}(q,\theta) &= \underset{\text{(reconstruction cost)}}{ \int q(x) \log p(y|x,\theta)dx} + \underset{\text{(soft-constraint)}}{ \int q(x) \log \frac{p(x|\theta)}{q(x)}dx} \\
&= - \frac{1}{2\sigma^2} \langle ||y - f_\theta(x)||^2 \rangle_{q(x)} - \frac{D}{2} \log \sigma^2 - \text{KL}(q(x)||p(x))
\end{align}

>* **Reconstruction Cost:** $f_\theta(x)$ should be similar to $y$ over $q(x)$
>* **Soft constraint:** $q(x)$ should be similar to $p(x)$
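
To make the bound concrete, the sketch below evaluates $\mathcal{F}(q,\theta)$ in closed form for a simple case. The assumptions are a linear decoder $f_\theta(x) = \theta x$, a full-covariance Gaussian $q(x)$, a standard normal prior, and the $\log 2\pi$ constant kept in the reconstruction term. All names and numbers are illustrative.

```python
import numpy as np

def free_energy(y, theta, sigma2, mu, S):
    """F(q, theta) for q(x) = N(mu, S), p(x) = N(0, I), p(y|x) = N(theta x, sigma2 I)."""
    D, K = theta.shape
    resid = y - theta @ mu
    expected_sq_err = resid @ resid + np.trace(theta @ S @ theta.T)  # <||y - theta x||^2>_q
    recon = -0.5 * expected_sq_err / sigma2 - 0.5 * D * np.log(2 * np.pi * sigma2)
    kl = 0.5 * (np.trace(S) + mu @ mu - K - np.linalg.slogdet(S)[1])  # KL(q || p)
    return recon - kl

def log_marginal(y, theta, sigma2):
    """Exact log p(y | theta) = log N(y; 0, theta theta^T + sigma2 I)."""
    D = theta.shape[0]
    C = theta @ theta.T + sigma2 * np.eye(D)
    return -0.5 * (y @ np.linalg.solve(C, y) + np.linalg.slogdet(C)[1] + D * np.log(2 * np.pi))

rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 2))
y = rng.normal(size=4)
sigma2 = 0.5

# Exact posterior p(x | y), which is Gaussian for this linear model.
S_post = np.linalg.inv(np.eye(2) + theta.T @ theta / sigma2)
mu_post = S_post @ theta.T @ y / sigma2

F_exact = free_energy(y, theta, sigma2, mu_post, S_post)   # q = exact posterior
F_prior = free_energy(y, theta, sigma2, np.zeros(2), np.eye(2))  # q = prior
log_py = log_marginal(y, theta, sigma2)
```

The bound is tight when $q(x)$ equals the exact posterior and strictly below $\log p(y|\theta)$ otherwise, since the gap is $\text{KL}(q(x)||p(x|y))$.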

* **Flavours of Variational Inference**

>|Flavour|　　　　　　　　　　　　$q(x)$|　　　　　Parameter Learning|
|-|-|-|
|Fixed Family|$q(x)=\mathcal{G}(x;\mu_q,\Sigma_q)$|$\underset{\theta,\mu_q,\Sigma_q}{\text{argmax}} \mathcal{F}(\theta,\mu_q,\Sigma_q)$|
|Structured|$q(x)=\prod^K_{k=1} q_k(x_k)$|$\underset{\{q_k(x)\}^K_{k=1},\theta}{\text{argmax}} \mathcal{F}(\{q_k(x)\}^K_{k=1},\theta)$|
|Recognition Model<br/>(Variational Auto-encoder)|$q_\phi(x)=\mathcal{G}(x;\mu_\phi(y),\Sigma_\phi(y))$|$\underset{\phi,\theta}{\text{argmax}} \mathcal{F}(\phi,\theta)$|
|GP-LVM|$q_\phi(x) = \delta(x-g_\phi(y))$||
