In [1]:
import numpy as np

In [8]:
import gpjax as gpx

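# Zero-mean GP prior with an RBF kernel.
mean = gpx.mean_functions.Zero()
kernel = gpx.kernels.RBF()
prior = gpx.gps.Prior(mean_function=mean, kernel=kernel)

# Gaussian likelihood; num_datapoints is the number of observations (123 is a placeholder).
likelihood = gpx.likelihoods.Gaussian(num_datapoints=123)

# Multiplying the prior by the likelihood constructs the posterior object.
posterior = prior * likelihood

print(type(mean))
print(type(kernel))
print(type(prior))
print(type(posterior))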

<class 'gpjax.mean_functions.Zero'>
<class 'gpjax.kernels.stationary.rbf.RBF'>
<class 'gpjax.gps.Prior'>
<class 'gpjax.gps.ConjugatePosterior'>


In [9]:
# Define a kernel function.

def Kernel(x1, x2):
    """
    Gaussian (RBF) kernel with unit lengthscale and unit variance.
    """
    return np.exp(-0.5 * np.sum((x1 - x2)**2))

We write a GP $f(\cdot) \sim \mathcal{GP}(\mu(\cdot), k(\cdot, \cdot))$ with mean function $\mu: \mathcal{X} \rightarrow \mathbb{R}$ and $\boldsymbol{\theta}$-parameterised kernel $k: \mathcal{X} \times \mathcal{X}\rightarrow \mathbb{R}$. When evaluating the GP on a finite set of points $\mathbf{X}\subset\mathcal{X}$, $k$ gives rise to the Gram matrix $\mathbf{K}_{ff}$ whose $(i, j)^{\text{th}}$ entry is $[\mathbf{K}_{ff}]_{i, j} = k(\mathbf{x}_i, \mathbf{x}_j)$. As is conventional within the literature, we centre our training data and assume $\mu(\mathbf{x}):= 0$ for all $\mathbf{x}\in\mathbf{X}$. We further drop the dependency on $\boldsymbol{\theta}$ and $\mathbf{X}$ for notational convenience in the remainder of this article.
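As a minimal sketch of how the Gram matrix is assembled entry by entry, the snippet below evaluates a unit-lengthscale Gaussian kernel (the same form as `Kernel` above) on every pair of inputs. The helper names `rbf_kernel` and `gram_matrix` and the three input points are illustrative assumptions, not part of GPJax.

```python
import numpy as np


def rbf_kernel(x1, x2):
    """Gaussian kernel with unit lengthscale and unit variance."""
    return np.exp(-0.5 * np.sum((x1 - x2) ** 2))


def gram_matrix(kernel_fn, X):
    """Evaluate kernel_fn on every pair of rows of X, giving an (n, n) matrix."""
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            # [K_ff]_{i, j} = k(x_i, x_j)
            K[i, j] = kernel_fn(X[i], X[j])
    return K


# Three illustrative one-dimensional inputs (hypothetical data).
X = np.array([[0.0], [1.0], [2.0]])
K_ff = gram_matrix(rbf_kernel, X)
print(K_ff)
```

The resulting matrix is symmetric with ones on the diagonal, since $k(\mathbf{x}, \mathbf{x}) = 1$ for this kernel. The double loop is written for clarity; a vectorised version would be preferable for larger inputs.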



We define a joint GP prior over the latent function

\begin{align}
p\left(\mathbf{f}, \mathbf{f}^{\star}\right)=\mathcal{N}\left(\mathbf{0},\left[\begin{array}{ll}
\mathbf{K}_{f f} & \mathbf{K}_{f x} \\
\mathbf{K}_{x f} & \mathbf{K}_{x x}
\end{array}\right]\right)
\end{align}

where $\mathbf{f}^{\star} = f(\mathbf{X}^{\star})$ and the subscript $x$ denotes evaluation of the kernel at the test inputs $\mathbf{X}^{\star}$. Conditional on the GP's latent function $f$, we assume a factorising likelihood generates our observations

\begin{align}
p(\mathbf{y} \mid \mathbf{f})=\prod_{i=1}^n p\left(y_i \mid f_i\right)
\end{align}

Strictly speaking, the likelihood function is $p(\mathbf{y}\,|\,\phi(\mathbf{f}))$ where $\phi$ is the likelihood function's associated link function. Example link functions include the probit or logistic functions for a Bernoulli likelihood and the identity function for a Gaussian likelihood. We eschew this notation for now as this section primarily considers Gaussian likelihood functions where the role of $\phi$ is superfluous. However, this intuition will be helpful for models with a non-Gaussian likelihood, such as those encountered in classification.
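To make the generative story concrete, the following sketch builds the joint prior covariance over training and test points, draws one joint sample, and then generates observations through a Gaussian likelihood with an identity link. The inputs, the noise standard deviation `obs_stddev`, the jitter term and the helper `cross_covariance` are all illustrative assumptions.

```python
import numpy as np


def cross_covariance(A, B):
    """Unit-lengthscale RBF kernel between every row of A and every row of B."""
    sq_dists = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dists)


rng = np.random.default_rng(0)
X = np.linspace(-3.0, 3.0, 5).reshape(-1, 1)  # training inputs (hypothetical)
X_star = np.array([[-1.0], [0.5]])            # test inputs (hypothetical)
n = X.shape[0]

K_ff = cross_covariance(X, X)
K_fx = cross_covariance(X, X_star)
K_xx = cross_covariance(X_star, X_star)

# Block covariance of the joint prior p(f, f*).
joint_cov = np.block([[K_ff, K_fx], [K_fx.T, K_xx]])

# A small jitter keeps the covariance numerically positive definite.
jitter = 1e-8 * np.eye(joint_cov.shape[0])
joint_sample = rng.multivariate_normal(
    np.zeros(joint_cov.shape[0]), joint_cov + jitter
)
f, f_star = joint_sample[:n], joint_sample[n:]

# Factorising Gaussian likelihood: each y_i depends only on f_i.
obs_stddev = 0.1
y = f + obs_stddev * rng.standard_normal(n)
```

Because the noise is added independently to each element of `f`, the likelihood factorises over observations exactly as in the product above.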



Applying Bayes' theorem yields the joint posterior distribution over the latent function
\begin{equation*}
p\left(\mathbf{f}, \mathbf{f}^{\star} \mid \mathbf{y}\right)=\frac{p(\mathbf{y} \mid \mathbf{f}) p\left(\mathbf{f}, \mathbf{f}^{\star}\right)}{p(\mathbf{y})} .
\end{equation*}
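For a Gaussian likelihood with noise variance $\sigma^2$, marginalising $\mathbf{f}$ out of this joint posterior gives the standard closed-form Gaussian over $\mathbf{f}^{\star}$, with mean $\mathbf{K}_{xf}(\mathbf{K}_{ff} + \sigma^2\mathbf{I})^{-1}\mathbf{y}$ and covariance $\mathbf{K}_{xx} - \mathbf{K}_{xf}(\mathbf{K}_{ff} + \sigma^2\mathbf{I})^{-1}\mathbf{K}_{fx}$. The sketch below computes these quantities directly in NumPy; it is not how GPJax implements its posterior, and the function names, data and noise level are assumptions chosen for illustration.

```python
import numpy as np


def rbf(A, B):
    """Unit-lengthscale RBF kernel between every row of A and every row of B."""
    sq_dists = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dists)


def gp_predict(X, y, X_star, obs_stddev):
    """Mean and covariance of p(f* | y) for a zero-mean GP with Gaussian noise."""
    n = X.shape[0]
    K_ff = rbf(X, X)
    K_fx = rbf(X, X_star)
    K_xx = rbf(X_star, X_star)

    # Cholesky factor of (K_ff + sigma^2 I) avoids forming an explicit inverse.
    L = np.linalg.cholesky(K_ff + obs_stddev**2 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    v = np.linalg.solve(L, K_fx)

    mean = K_fx.T @ alpha
    cov = K_xx - v.T @ v
    return mean, cov


# Hypothetical training data and test inputs.
X = np.array([[-1.0], [0.0], [1.0]])
y = np.sin(X).ravel()
X_star = np.array([[0.0], [0.5]])

mean, cov = gp_predict(X, y, X_star, obs_stddev=0.1)
```

The posterior variance on the diagonal of `cov` never exceeds the prior variance of one, reflecting that conditioning on data can only reduce uncertainty in this model.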

The choice of kernel is an important modelling decision because it dictates properties such as the differentiability, variance and characteristic lengthscale of the functions that are admissible under the GP prior. A kernel is a positive-definite function with parameters $\boldsymbol{\theta}$ that maps pairs of inputs $\mathbf{x}, \mathbf{x}' \in \mathcal{X}$ onto the real line. We dedicate the entirety of the Introduction to Kernels notebook to exploring the different GPs each kernel can yield.
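As a small illustration of how kernel parameters shape the prior, the sketch below adds a lengthscale and a variance to the Gaussian kernel and compares the covariance between two inputs one unit apart. The function `rbf_with_params` and its parameter names are hypothetical and are not GPJax's API.

```python
import numpy as np


def rbf_with_params(x1, x2, lengthscale=1.0, variance=1.0):
    """Gaussian kernel with an explicit lengthscale and variance."""
    sq_dist = np.sum((x1 - x2) ** 2)
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)


x, x_prime = np.array([0.0]), np.array([1.0])

# A short lengthscale makes function values one unit apart nearly uncorrelated.
short = rbf_with_params(x, x_prime, lengthscale=0.3)

# A long lengthscale makes the same pair of values strongly correlated.
long = rbf_with_params(x, x_prime, lengthscale=3.0)

print(short, long)
```

With the short lengthscale the covariance is close to zero, so sampled functions can vary rapidly; with the long lengthscale it is close to one, so sampled functions change slowly over the same distance.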