A Deep Gaussian Process (DGP), a hierarchical composition of Gaussian Processes (GPs), can overcome the limitations of a standard (single-layer) GP while retaining its benefits<sup>[1]</sup>.

# Standard (Single-layer) Gaussian Processes  

>Consider inferring a stochastic function $f:\mathbb{R}^{D}\to\mathbb{R}$, given a likelihood $p(y|f)$ and a set of $N$ observations $\mathbf{y}=(y_{1},\dots,y_{N})^{\top}$ at (design) locations $\mathbf X=(\mathbf x_{1},\dots,\mathbf x_{N})^{\top}$.

> Place a GP prior on the function $f$ so that all function values are jointly Gaussian, with a mean function $m:\mathbb{R}^{D}\to\mathbb{R}$ and a covariance function $k:\mathbb{R}^{D}\times\mathbb{R}^{D}\to\mathbb{R}$.

>Define an additional set of $M$ inducing locations $\mathbf{Z}=(\mathbf{z}_{1}, \cdots,\mathbf{z}_{M})^{\top}$. Use the notation $\mathbf{f}=f(\mathbf{X})$ and $\mathbf{u}=f(\mathbf{Z})$ to represent the function values at the design and inducing locations.

By the definition of a GP, the joint density $p(\mathbf{f},\mathbf{u})$ is Gaussian: its mean is the mean function evaluated at every input in $(\mathbf{X},\mathbf{Z})^{\top}$, and its covariance is the covariance function evaluated at every pair of inputs:
$$
\begin{bmatrix} \mathbf{f} \\ \mathbf{u} \end{bmatrix} \sim \mathcal{N} \left( \begin{bmatrix} m(\mathbf{X}) \\ m(\mathbf{Z}) \end{bmatrix}, \begin{bmatrix} K_{\mathbf{XX}} & K_{\mathbf{XZ}} \\ K_{\mathbf{ZX}} & K_{\mathbf{ZZ}} \end{bmatrix} \right)
$$
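The block structure of this joint prior can be sketched numerically. The snippet below is a minimal illustration, not part of the original text: it assumes a squared-exponential (RBF) covariance function and a zero mean function, and the names `rbf_kernel` and `joint_prior` are hypothetical helpers introduced only for this example.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance k(a, b) for all pairs of rows of A and B.

    Assumption: the text does not fix a kernel; RBF is used for illustration.
    """
    sq_dist = (np.sum(A**2, axis=1)[:, None]
               + np.sum(B**2, axis=1)[None, :]
               - 2.0 * A @ B.T)
    return variance * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / lengthscale**2)

def joint_prior(X, Z, mean_fn=lambda A: np.zeros(len(A))):
    """Mean vector and covariance matrix of the joint Gaussian p(f, u).

    Assumption: a zero mean function m unless another mean_fn is passed.
    """
    mean = np.concatenate([mean_fn(X), mean_fn(Z)])
    K_XX, K_XZ, K_ZZ = rbf_kernel(X, X), rbf_kernel(X, Z), rbf_kernel(Z, Z)
    # K_ZX is the transpose of K_XZ, so the joint covariance is symmetric.
    cov = np.block([[K_XX, K_XZ], [K_XZ.T, K_ZZ]])
    return mean, cov

X = np.linspace(-3.0, 3.0, 6)[:, None]      # N = 6 design locations, D = 1
Z = np.array([[-2.4], [0.0], [2.4]])        # M = 3 inducing locations
mean, cov = joint_prior(X, Z)

# Draw one joint sample (f, u); the jitter keeps the Cholesky factor stable.
rng = np.random.default_rng(0)
L = np.linalg.cholesky(cov + 1e-6 * np.eye(len(mean)))
sample = mean + L @ rng.standard_normal(len(mean))
f_sample, u_sample = sample[:len(X)], sample[len(X):]
```

The first $N$ entries of the sample play the role of $\mathbf{f}$ and the last $M$ entries the role of $\mathbf{u}$; the off-diagonal block of `cov` is $K_{\mathbf{XZ}}$.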


The joint density of $\mathbf{y}$, $\mathbf{f}$, and $\mathbf{u}$ is given by:

$$
\begin{align}
p(\mathbf{y},\mathbf{f},\mathbf{u})&= p(\mathbf{f}, \mathbf{u}) p(\mathbf{y}|\mathbf{f}, \not{\mathbf{u}}) \\
&= p(\mathbf{f}|\mathbf{u}) p(\mathbf{u}) \prod_{i=1}^{N}p(y_{i}|f_{i}) \\
&=\underbrace{p(\mathbf{f}|\mathbf{u};\mathbf{X},\mathbf{Z})p(\mathbf{u};\mathbf{Z})}_{\mathrm{GP~prior}}\underbrace{\prod_{i=1}^{N}p(y_{i}|f_{i})}_{\mathrm{likelihood}}
\end{align}
$$  

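**Notice** that the notation $p(\mathbf{f}|\mathbf{u};\mathbf{X},\mathbf{Z})$ makes explicit that the input locations for $\mathbf{f}$ and $\mathbf{u}$ are $\mathbf{X}$ and $\mathbf{Z}$, respectively. The struck-through $\mathbf{u}$ in the first line marks that, given $\mathbf{f}$, the observations $\mathbf{y}$ do not depend on $\mathbf{u}$; the likelihood then factorizes over the data points.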

The prior is $p(\mathbf{u}; \mathbf{Z}) = \mathcal{N}(\mathbf{u} | m(\mathbf{Z}), k(\mathbf{Z}, \mathbf{Z}))$ and the conditional is $p(\mathbf{f} | \mathbf{u}; \mathbf{X}, \mathbf{Z}) = \mathcal{N}(\mathbf{f} | \boldsymbol{\mu}, \mathbf{\Sigma})$, where, for $i, j = 1, \ldots, N$:

$$[\boldsymbol{\mu}]_i = m(\mathbf{x}_i) + \boldsymbol{\alpha}(\mathbf{x}_i)^{\top} (\mathbf{u} - m(\mathbf{Z}))$$
$$[\mathbf{\Sigma}]_{ij} = k(\mathbf{x}_i, \mathbf{x}_j) - \boldsymbol{\alpha}(\mathbf{x}_i)^{\top} k(\mathbf{Z}, \mathbf{Z}) \boldsymbol{\alpha}(\mathbf{x}_j)$$

with $\boldsymbol{\alpha}(\mathbf{x}_{i})=k(\mathbf{Z},\mathbf{Z})^{-1}k(\mathbf{Z},\mathbf{x}_{i})$. Inference is possible in closed form when the likelihood $p(y|f)$ is Gaussian, but its cost scales as $O(N^3)$ in the number of observations $N$, because it requires factorizing an $N\times N$ covariance matrix.
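The conditional mean and covariance above translate almost line by line into code. The following is a minimal sketch under stated assumptions: an RBF covariance function, a zero mean function, and a small jitter added to $k(\mathbf{Z},\mathbf{Z})$ for numerical stability. The helper names `rbf_kernel` and `conditional_f_given_u` are hypothetical and not taken from the source.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance (an assumed kernel, for illustration)."""
    sq_dist = (np.sum(A**2, axis=1)[:, None]
               + np.sum(B**2, axis=1)[None, :]
               - 2.0 * A @ B.T)
    return variance * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / lengthscale**2)

def conditional_f_given_u(X, Z, u, mean_fn=lambda A: np.zeros(len(A)), jitter=1e-8):
    """Mean mu and covariance Sigma of p(f | u; X, Z).

    Assumptions: zero mean function by default; jitter on K_ZZ is a
    numerical safeguard, not part of the formulas in the text.
    """
    K_ZZ = rbf_kernel(Z, Z) + jitter * np.eye(len(Z))
    K_ZX = rbf_kernel(Z, X)
    # Column i of alpha is alpha(x_i) = K_ZZ^{-1} k(Z, x_i).
    # A linear solve is used instead of forming the explicit inverse.
    alpha = np.linalg.solve(K_ZZ, K_ZX)              # shape (M, N)
    mu = mean_fn(X) + alpha.T @ (u - mean_fn(Z))     # [mu]_i
    Sigma = rbf_kernel(X, X) - alpha.T @ K_ZZ @ alpha  # [Sigma]_ij
    return mu, Sigma

X = np.linspace(-2.5, 2.5, 5)[:, None]   # N = 5 design locations
Z = np.linspace(-3.0, 3.0, 4)[:, None]   # M = 4 inducing locations
u = np.sin(Z[:, 0])                      # example inducing values (made up)
mu, Sigma = conditional_f_given_u(X, Z, u)
```

As a sanity check on the formulas, evaluating the conditional at the inducing locations themselves (`X = Z`) should return a mean close to `u` and a covariance close to zero, up to the jitter. The solve involves only the $M\times M$ matrix $k(\mathbf{Z},\mathbf{Z})$, so this step costs $O(M^3 + M^2 N)$ rather than $O(N^3)$.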

## Reference
[1] Salimbeni, H., & Deisenroth, M. (2017). Doubly stochastic variational inference for deep Gaussian processes. *Advances in Neural Information Processing Systems, 30*.