# Gaussian processes – Kernel functions (Python version)
GMM, INSA Toulouse, France <br />
Andrés F. López-Lopera, ONERA-DTIS <br />
May 2021
<br />
___

For this lab session, you are free to use the language of your choice (e.g. R or Python). In this notebook we propose Python implementations.

In [None]:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist

plt.rc('text', usetex=True)
%matplotlib inline

## Covariance functions

We recall some usual covariance functions $k: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$:
- Squared Exponential (SE):
$$ k(x,y) = \sigma^2 \exp\left( - \frac{(x-y)^2}{2 \theta^2} \right)$$

- Matérn 5/2:
$$ k(x,y) = \sigma^2 \left(1+\frac{\sqrt{5} |x-y|}{\theta}+\frac{5 |x-y|^2}{3 \theta^2}\right)  \exp\left( - \frac{\sqrt{5}|x-y|}{\theta} \right) $$ 

- Matérn 3/2:
$$ k(x,y) = \sigma^2 \left(1+\frac{\sqrt{3} |x-y|}{\theta}\right)  \exp\left( - \frac{\sqrt{3}|x-y|}{\theta} \right) $$ 

- Exponential:
$$ k(x,y) = \sigma^2 \exp\left( - \frac{|x-y|}{\theta} \right)  $$ 

- Brownian (for $x, y \geq 0$):
$$ k(x,y) = \sigma^2 \min(x, y) $$ 

- White noise:
$$ k(x,y) = \sigma^2 \delta_{x,y} $$ 

- Constant:
$$ k(x,y) = \sigma^2 $$ 

- Linear:
$$ k(x,y) = \sigma^2 x y $$ 

- Cosine:
$$ k(x,y) = \sigma^2 \cos\left(\frac{x-y}{\theta}\right) $$ 

- Sinc:
$$ k(x,y) = \sigma^2 \frac{\theta}{x-y} \sin\left(\frac{x-y}{\theta}\right) $$ 

**Question 1.** For at least three kernels of your choice, write a function that takes as input the vectors ``x``, ``y`` and ``param`` and returns the matrix with general term $k(x_i, y_j)$.

In [None]:
def SEKernel(x, y, param):
    """ Squared exponential kernel
    input:
      x,y: input vectors
      param: parameters (sigma2, theta)
    output:
      covariance matrix cov(x,y)
    """
    sigma2, theta = param[0], param[1]
    dist = cdist(x, y)/theta
    kern = ## to be filled (covariance matrix)
    return kern

## to be filled (to define another 3 kernel functions) 
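
As a point of comparison for **Question 1**, here is a minimal sketch of three of the kernels above (SE, Matérn 3/2 and exponential). It assumes one-dimensional inputs stored as column vectors, as elsewhere in this notebook, and uses plain NumPy broadcasting instead of ``cdist``. The names ``pairwise_dist``, ``se_kernel_sketch``, ``matern32_kernel_sketch`` and ``exp_kernel_sketch`` are hypothetical and only serve this illustration.

```python
import numpy as np

def pairwise_dist(x, y):
    """Absolute differences |x_i - y_j| for column vectors x (n, 1) and y (m, 1)."""
    return np.abs(x - y.T)

def se_kernel_sketch(x, y, param):
    """Squared exponential kernel; param = (sigma2, theta)."""
    sigma2, theta = param[0], param[1]
    d = pairwise_dist(x, y) / theta
    return sigma2 * np.exp(-0.5 * d**2)

def matern32_kernel_sketch(x, y, param):
    """Matern 3/2 kernel; param = (sigma2, theta)."""
    sigma2, theta = param[0], param[1]
    d = np.sqrt(3.0) * pairwise_dist(x, y) / theta
    return sigma2 * (1.0 + d) * np.exp(-d)

def exp_kernel_sketch(x, y, param):
    """Exponential kernel; param = (sigma2, theta)."""
    sigma2, theta = param[0], param[1]
    d = pairwise_dist(x, y) / theta
    return sigma2 * np.exp(-d)

# small demo grid (assumed values, chosen only for illustration)
x_demo = np.linspace(0, 1, 5).reshape(-1, 1)
param_demo = [2.0, 0.1]
K_se = se_kernel_sketch(x_demo, x_demo, param_demo)
K_m32 = matern32_kernel_sketch(x_demo, x_demo, param_demo)
K_exp = exp_kernel_sketch(x_demo, x_demo, param_demo)
```

Each matrix should be symmetric with $\sigma^2$ on its diagonal, which is a quick sanity check for any kernel you implement.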

**Question 2.** For a grid of 100 points on $[0, 1]$, compute the covariance matrix associated with each kernel you wrote in **Question 1**. Simulate Gaussian samples using the function ``sample``.

In [None]:
# function for generating GP samples
jitter = 1e-10  # small number to ensure numerical stability (eigenvalues of K can decay rapidly)
def sample(mu, var, jitter, N):
    r"""Generate N samples from a multivariate Gaussian \mathcal{N}(mu, var)"""
    L = ## to be filled (Cholesky decomposition)
    f_post = ## to be filled (samples)
    return f_post
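
One possible way to fill in ``sample`` is sketched below, assuming the usual construction $\mu + L z$ with $z$ standard normal and $L$ the Cholesky factor of the jittered covariance. The function name ``sample_sketch`` and the demo variables are hypothetical. The output layout (one sample per column) is an assumption chosen so that ``plt.plot(x, samples)`` draws one curve per sample.

```python
import numpy as np

def sample_sketch(mu, var, jitter, N):
    """Draw N samples from N(mu, var); returns an array of shape (n, N)."""
    n = var.shape[0]
    # Cholesky factor of the jittered covariance: var + jitter*I = L L^T
    L = np.linalg.cholesky(var + jitter * np.eye(n))
    z = np.random.randn(n, N)  # i.i.d. standard normal draws
    # mu may be a scalar, a vector of length n, or an (n, 1) column
    return np.reshape(mu, (-1, 1)) + L @ z

# quick check on a 2x2 covariance (assumed values, for illustration only)
np.random.seed(1)
C_demo = np.array([[1.0, 0.8], [0.8, 1.0]])
S_demo = sample_sketch(0, C_demo, 1e-10, N=20000)
emp_cov = np.cov(S_demo)  # rows are variables, columns are samples
```

With many samples, the empirical covariance ``emp_cov`` should be close to ``C_demo``.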

In [None]:
n = 100 # number of input points
x = np.linspace(0, 1, n).reshape(-1,1)
param = [1, 0.1] # parameters of the GP
nsamples = 10 # number of GP samples

In [None]:
# samples from different types of kernels
np.random.seed(1)

# plotting GP samples
fig = plt.figure(figsize=(9, 3.5))
plt.subplot(1, 2, 1)
kern = SEKernel(x, x, param) 
plt.contourf(x.flatten(), x.flatten(), kern)
plt.xlabel("$x$"); plt.ylabel("$x'$"); plt.title("Squared Exponential")
plt.subplot(1, 2, 2)
samples = sample(0, kern, jitter, N=nsamples)
plt.plot(x, samples)
plt.xlabel("$x$"); plt.ylabel("$Y(x)$"); plt.title("GP samples");

## to be filled (repeat the plots for other kernel functions)

**Question 3.** Change the kernel and the kernel parameters. What are the effects on the sample paths? Write down your observations.

**Question 4.**  Using the SE kernel, generate a large number of samples and extract the vectors of the samples evaluated at two (or three) points of the input space. Plot the associated cloud of points. What happens if the two input points are close to each other? What happens if they are far apart?

In [None]:
nsamples = 1000 # number of samples
kern = SEKernel(x, x, param) # covariance matrix
np.random.seed(1)

samples = sample(0, kern, jitter, N=nsamples)
idx_points = [9, 14, 99]
x_points = ## to be filled
samples_points = ## to be filled

fig = plt.figure(figsize=(9, 3.5))
plt.subplot(1, 2, 1)
plt.plot(samples_points[0,:], samples_points[1,:], "o")
plt.xlabel("$y_1$"); plt.ylabel("$y_2$")
plt.subplot(1, 2, 2)
plt.plot(samples_points[0,:], samples_points[2,:], "o")
plt.xlabel("$y_1$"); plt.ylabel("$y_3$");
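
To interpret the two clouds of points, it can help to compute the prior correlation that the SE kernel assigns to each pair of selected inputs. The sketch below assumes the same grid, the same indices ``[9, 14, 99]`` and $\theta = 0.1$ as in the cell above. The variable names ``x_grid``, ``x_sel`` and ``corr`` are hypothetical.

```python
import numpy as np

theta = 0.1
x_grid = np.linspace(0, 1, 100)
idx_points = [9, 14, 99]
x_sel = x_grid[idx_points]  # selected input locations

# SE correlation matrix between the selected points (sigma2 cancels out)
d = np.abs(x_sel[:, None] - x_sel[None, :])
corr = np.exp(-0.5 * (d / theta) ** 2)
print(np.round(corr, 3))
```

The first two points are about half a length-scale apart and have a correlation of roughly 0.88, so their cloud should look like an elongated ellipse. The first and last points are about nine length-scales apart and are practically uncorrelated, so their cloud should look circular.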

## Building new kernels from other ones

**Question 5.**  As discussed in the course, we can create new kernels by combining predefined ones, e.g.:

$$
\begin{array}{ll}
	\text{Sum of kernels:} & k(x, y) = k_1(x, y) + k_2(x, y) \\
	\text{Product of kernels:} & k(x, y) = k_1(x, y) \times k_2(x, y)
\end{array}	
$$

Experiment with combinations of the kernels you wrote previously. Display the resulting covariance matrix and some GP samples.

In [None]:
nsamples = 10
kern = ## to be filled 

fig = plt.figure(figsize=(9, 3.5))
plt.subplot(1, 2, 1)
plt.contourf(x.flatten(), x.flatten(), kern)
plt.xlabel("$x$"); plt.ylabel("$x'$")
plt.subplot(1, 2, 2)
np.random.seed(0)
samples = sample(0, kern, jitter, N=nsamples)
plt.plot(x, samples)
plt.xlabel("$x$"); plt.ylabel("$Y(x)$"); plt.title("GP samples");
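
As an illustration of the two combination rules, the sketch below builds the sum and the product of an SE kernel and a linear kernel on a small grid. The helper names ``se_kernel_sketch`` and ``lin_kernel_sketch`` and the parameter values are assumptions made for this example. Note that the product of kernels is an elementwise product of the covariance matrices, not a matrix product.

```python
import numpy as np

def se_kernel_sketch(x, y, param):
    """Squared exponential kernel for column vectors; param = (sigma2, theta)."""
    sigma2, theta = param[0], param[1]
    return sigma2 * np.exp(-0.5 * ((x - y.T) / theta) ** 2)

def lin_kernel_sketch(x, y, param):
    """Linear kernel for column vectors; param = (sigma2,)."""
    sigma2 = param[0]
    return sigma2 * (x @ y.T)

x_demo = np.linspace(0, 1, 50).reshape(-1, 1)
K_se = se_kernel_sketch(x_demo, x_demo, [1.0, 0.2])
K_lin = lin_kernel_sketch(x_demo, x_demo, [1.0])

K_sum = K_se + K_lin   # sum of kernels
K_prod = K_se * K_lin  # product of kernels (elementwise)

# both combinations should remain positive semi-definite up to round-off
min_eig_sum = np.linalg.eigvalsh(K_sum).min()
min_eig_prod = np.linalg.eigvalsh(K_prod).min()
```

Either matrix can then be passed to ``sample`` in place of ``kern`` in the cell above.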

## Gaussian process regression

We aim at approximating the test function $f : x \in [0, 1] \mapsto 10x + \sin(6\pi x)$ by a Gaussian process regression model with conditional mean and covariance:

$$m(x) = k(x, X) k(X,X)^{-1} Y$$

$$c(x,y) = k(x,y) - k(x, X) k(X,X)^{-1} k(X,y)$$


**Question 6.** Write two functions $m$ and $c$ that return the conditional mean and covariance. These functions will typically take as inputs the scalar/vector of prediction point(s) ``x``, the DoE vector ``X``, the vector of responses ``Y``, a kernel function ``kern``, and the covariance parameters ``param``.

In [None]:
# functions used for computing the conditional mean and covariance functions
def cond_mean(x, X, Y, kern, param):
    """Conditional GP mean vector
    input:
      x: vector of prediction points
      X: DoE vector
      Y: vector of responses
      kern: kernel function
      param: parameters of the covariance
    output:
      conditional mean
    """
    m = ## to be filled 
    return m

def cond_cov(x, X, Y, kern, param):
    """Conditional GP covariance matrix 
    input:
      x: vector of prediction points
      X: DoE vector
      Y: vector of responses
      kern: kernel function
      param: parameters of the covariance
    output:
      conditional covariance
    """
    c = ## to be filled 
    return c
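
A minimal sketch of the two formulas above is given below, assuming column-vector inputs and a kernel with the signature ``kern(x, y, param)``. It solves linear systems instead of forming $k(X,X)^{-1}$ explicitly. The names ``cond_mean_sketch``, ``cond_cov_sketch`` and ``se_kernel_sketch``, as well as the optional ``jitter`` argument, are assumptions of this example and not part of the lab's interface.

```python
import numpy as np

def se_kernel_sketch(x, y, param):
    """Squared exponential kernel for column vectors; param = (sigma2, theta)."""
    sigma2, theta = param[0], param[1]
    return sigma2 * np.exp(-0.5 * ((x - y.T) / theta) ** 2)

def cond_mean_sketch(x, X, Y, kern, param, jitter=1e-10):
    """m(x) = k(x, X) k(X, X)^{-1} Y, computed with a linear solve."""
    K_XX = kern(X, X, param) + jitter * np.eye(X.shape[0])
    return kern(x, X, param) @ np.linalg.solve(K_XX, Y)

def cond_cov_sketch(x, X, Y, kern, param, jitter=1e-10):
    """c(x, x') = k(x, x') - k(x, X) k(X, X)^{-1} k(X, x').
    Y is unused; it is kept only to mirror the signature used in the notebook."""
    K_XX = kern(X, X, param) + jitter * np.eye(X.shape[0])
    k_xX = kern(x, X, param)
    return kern(x, x, param) - k_xX @ np.linalg.solve(K_XX, k_xX.T)

# sanity check at the design points (hypothetical data, for illustration only)
X_demo = np.linspace(0, 1, 6).reshape(-1, 1)
Y_demo = np.sin(2 * np.pi * X_demo)
m_at_X = cond_mean_sketch(X_demo, X_demo, Y_demo, se_kernel_sketch, [1.0, 0.2])
c_at_X = cond_cov_sketch(X_demo, X_demo, Y_demo, se_kernel_sketch, [1.0, 0.2])
```

At the design points the conditional mean should reproduce the observations and the conditional variance should be numerically zero, which makes a convenient test for your own implementation.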

**Question 7.** Create a design of experiments $X$ composed of 5 to 20 points in the input space (regularly spaced points for instance) and compute the vector of observations $Y = f(X)$. Display in the same figure the design points and the target function.

In [None]:
def f(x): # target function
    return(10*x + np.sin(6*np.pi*x))

n_design = ## to be filled (number of input points)
X = ## to be filled (design points)
Y = ## to be filled (responses at the design points)

fig = plt.figure(figsize=(6, 5))
plt.plot(X, Y, 'x', color = 'C1', label = "obs")
X2 = np.linspace(0, 1, 1000).reshape(-1,1)
plt.plot(X2, f(X2), '--', color = 'C1', label = "test function")
plt.legend();

**Question 8.**  Considering the SE kernel, draw on the same graph $f(x)$, $m(x)$ and $95\%$ confidence intervals: $m(x) \pm 1.96 \sqrt{c(x, x)}$.

In [None]:
x = np.linspace(0, 1, 500).reshape(-1,1) # vector of prediction points
param = [1, 0.1] # parameters of the covariance
mu = ## to be filled (mean vector)
Cov = ## to be filled (covariance matrix)

def plotGP(x, m, c, X, Y, y):
    """
    input:
      x: test points
      m: conditional mean vector
      c: conditional covariance matrix
      X: DoE vector
      Y: vector of responses
      y: responses at test points
    output: GP regression plot
    """
    # pointwise standard deviation (abs guards against tiny negative round-off values)
    sd = np.sqrt(np.abs(np.diag(c)))
    upperBound = m.flatten() + 1.96*sd
    lowerBound = m.flatten() - 1.96*sd
    
    fig = plt.figure(figsize=(6, 5))
    plt.plot(X, Y, "x", color = "C1", label = "obs") 
    plt.plot(x, y, '--', color = 'C1', label = "test function")  # use the argument y rather than the global f
    plt.fill_between(x.flatten(), lowerBound, upperBound,
                     label=r"CI 95\%",  # the percent sign is escaped because usetex=True
                     color="C0", alpha=0.3)
    plt.plot(x, m, color="C0", label = "predicted mean")
    plt.xlabel("$x$")
    plt.ylabel("$f(x)$")
    plt.legend()
    
plotGP(x, mu, Cov, X, Y, f(x))

**Question 9.**  Change the kernel as well as the values in ``param``. What is the effect of
- $\sigma^2$ on $m(x)$? Can you prove this result?
- $\sigma^2$ on the conditional variance $v(x) = c(x, x)$? Can you prove this result?
- $\theta$ on $m(x)$ (try (very) small and large values)?
- $\theta$ on $v(x)$ (try (very) small and large values)?
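
For the two questions on $\sigma^2$, one possible line of argument is sketched here. It assumes that the kernel factorizes as $k(x,y) = \sigma^2 r(x,y)$, where the correlation function $r$ does not depend on $\sigma^2$. This notation is introduced only for this sketch. The factorization holds for each single kernel listed at the beginning of the notebook, but not for a sum of kernels with different variances.

$$
\begin{array}{rl}
	m(x) & = \sigma^2 r(x, X) \left[\sigma^2 r(X,X)\right]^{-1} Y = r(x, X) \, r(X,X)^{-1} Y \\
	v(x) & = \sigma^2 r(x,x) - \sigma^2 r(x, X) \left[\sigma^2 r(X,X)\right]^{-1} \sigma^2 r(X,x) = \sigma^2 \left( r(x,x) - r(x, X) \, r(X,X)^{-1} r(X,x) \right)
\end{array}
$$

Under this assumption, $m(x)$ does not depend on $\sigma^2$, while $v(x)$ is proportional to $\sigma^2$.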

**Question 10.** Generate samples from the conditional process.

In [None]:
nsamples = 10
np.random.seed(0)  # set the seed before drawing the samples
samples = ## to be filled 

fig = plt.figure(figsize=(9, 3.5))
plt.subplot(1, 2, 1)
plt.contourf(x.flatten(), x.flatten(), Cov)
plt.xlabel("$x$"); plt.ylabel("$x'$")
plt.subplot(1, 2, 2)
plt.plot(x, samples)
plt.xlabel("$x$"); plt.ylabel("$Y(x)$"); plt.title("GP samples")
plt.plot(X, Y, 'x', color = 'C1', label = "obs");
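
A self-contained sketch of conditional sampling is given below, assuming an SE kernel and the construction $m + L z$, where $L$ is the Cholesky factor of the jittered conditional covariance. All names (``se_kernel_sketch``, ``X_demo``, ``cond_samples``, ...) and the data are hypothetical. A slightly larger jitter than for the prior is used here because the conditional covariance is exactly singular at the design points.

```python
import numpy as np

def se_kernel_sketch(x, y, param):
    """Squared exponential kernel for column vectors; param = (sigma2, theta)."""
    sigma2, theta = param[0], param[1]
    return sigma2 * np.exp(-0.5 * ((x - y.T) / theta) ** 2)

np.random.seed(0)
param_demo = [1.0, 0.2]
X_demo = np.linspace(0, 1, 6).reshape(-1, 1)    # design points
Y_demo = np.sin(2 * np.pi * X_demo)             # hypothetical responses
x_demo = np.linspace(0, 1, 51).reshape(-1, 1)   # prediction grid (contains X_demo)

# conditional mean and covariance on the prediction grid
K_XX = se_kernel_sketch(X_demo, X_demo, param_demo) + 1e-10 * np.eye(6)
k_xX = se_kernel_sketch(x_demo, X_demo, param_demo)
mu_demo = k_xX @ np.linalg.solve(K_XX, Y_demo)
Cov_demo = (se_kernel_sketch(x_demo, x_demo, param_demo)
            - k_xX @ np.linalg.solve(K_XX, k_xX.T))

# conditional samples: mu + L z, with Cov + jitter*I = L L^T
jitter_demo = 1e-8
L_demo = np.linalg.cholesky(Cov_demo + jitter_demo * np.eye(51))
cond_samples = mu_demo + L_demo @ np.random.randn(51, 5)  # one sample per column
```

Every conditional sample should pass (up to the jitter) through the observations, which is what the right-hand plot of the cell above lets you check visually.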

**Question 11.**  Use the resulting model to predict values of $f$ for $x \in [1, 1.5]$. What can you conclude?

In [None]:
x = np.linspace(0, 1.5, 500).reshape(-1,1) # vector of prediction points
mu = cond_mean(x, X, Y, SEKernel, param) # conditional mean
Cov = cond_cov(x, X, Y, SEKernel, param) # conditional covariance

plotGP(x, mu, Cov, X, Y, f(x))

**Question 12.** Repeat the procedure in **Question 11** but this time considering $k(x,y) = k_{lin}(x,y) + k_{cos}(x,y) + k_{SE}(x,y)$. For instance, fix the length-scale parameter of the cosine kernel to $\theta_{cos} = 1/(6\pi)$.

In [None]:
def linCosineSEKernel(x, y, param):
    # input:
    #  x,y: input vectors
    #  param: parameters (sigma2_lin, sigma2_cos, theta_cos, sigma2_SE, theta_SE)
    # output:
    #  kern: covariance matrix cov(x,y)
    kern = ## to be filled 
    return(kern)

x = np.linspace(0, 1.5, 500).reshape(-1,1) # vector of prediction points
param = [1, 1, 1/(6*np.pi), 1, 0.5] # parameters of the covariance
mu = cond_mean(x, X, Y, linCosineSEKernel, param) # conditional mean
Cov = cond_cov(x, X, Y, linCosineSEKernel, param) # conditional covariance

# conditional covariance matrix
fig = plt.figure(figsize=(4.5, 3.5))
plt.contourf(x.flatten(), x.flatten(), Cov)
plt.xlabel("$x$"); plt.ylabel("$x'$"); plt.title("Conditional covariance (lin + cos + SE)")

# GP regression plot: plotGP opens its own figure and already displays the observations
plotGP(x, mu, Cov, X, Y, f(x))
plt.title("GP regression (lin + cos + SE)");
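
One possible form of the composite kernel is sketched below, assuming column-vector inputs and the parameter ordering given in the comments of ``linCosineSEKernel``. The name ``lin_cos_se_kernel_sketch`` and the demo grid are hypothetical.

```python
import numpy as np

def lin_cos_se_kernel_sketch(x, y, param):
    """Sum of linear, cosine and SE kernels for column vectors x (n, 1), y (m, 1).
    param = (sigma2_lin, sigma2_cos, theta_cos, sigma2_SE, theta_SE)."""
    sigma2_lin, sigma2_cos, theta_cos, sigma2_se, theta_se = param
    diff = x - y.T  # signed differences x_i - y_j
    k_lin = sigma2_lin * (x @ y.T)
    k_cos = sigma2_cos * np.cos(diff / theta_cos)
    k_se = sigma2_se * np.exp(-0.5 * (diff / theta_se) ** 2)
    return k_lin + k_cos + k_se

x_demo = np.linspace(0, 1.5, 40).reshape(-1, 1)
param_demo = [1, 1, 1 / (6 * np.pi), 1, 0.5]
K_demo = lin_cos_se_kernel_sketch(x_demo, x_demo, param_demo)
min_eig = np.linalg.eigvalsh(K_demo).min()
```

With $\theta_{cos} = 1/(6\pi)$, the cosine component has the same period as the $\sin(6\pi x)$ term of the test function, which is why this choice is suggested in the question.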

**Bonus question.** After testing different kernels and various values for $\sigma^2$ and $\theta$, which one would you recommend?