# 5. Gaussian Process

In Bayesian regression, we assumed a linear relationship between the output and the inputs (or a transformation of the inputs).
- Given a training data set $\mathcal{D} = \{(X_n, T_n)\}_{n=1}^N$, we want to predict the output target $t$ for a new input $x$.
    - Inference: get the posterior distribution $p(w|X, T)$
    - Decision: get the posterior predictive distribution $p(t|x, X, T)$ for a new input $x$

**Inference**:
- assume a linear model $t = y(x, w) + \epsilon$ with $y(x, w) = w^Tx$, where $\epsilon$ is the modeling noise and $y$ is the model to be learned.
- assume a prior distribution for parameters $w$, e.g., $p(w) = N(w|0, I)$
- assume a likelihood function $p(T|X, w)$, e.g., $p(T|X, w) = N(T|y(X), \beta^{-1}I) = N(T|Xw, \beta^{-1}I)$, where $\beta$ is the precision of the modeling noise. The unknown parameters are $w$ and $\beta$; their maximum-likelihood estimates are found by maximizing the log-likelihood $\log p(T|X, w)$.
    - $p(T|X, w) = \prod_{n=1}^N N(T_n|X_nw, \beta^{-1})$
    - $\log p(T|X, w) = -\frac{\beta}{2}\sum_{n=1}^N (T_n - X_nw)^2 + \frac{N}{2}\log \beta - \frac{N}{2}\log (2\pi)$
    - by solving the optimization problem, we can get the $w$ and $\beta$ that maximize the likelihood 
- compute the posterior distribution $p(w|X, T)$, which is proportional to the product of the prior distribution and the likelihood. Because the Gaussian prior is conjugate to the Gaussian likelihood, the posterior distribution is also Gaussian. Write $p(w|X, T) = N(w|m_N, S_N)$ after observing $N$ data points from the training set. With $X$ the $N \times D$ design matrix, the mean $m_N$ and covariance $S_N$ follow from the rules for marginal and conditional Gaussians:
    - $p(w|X, T) = N(w|m_N, S_N)$
    - $S_N^{-1} = I + \beta X^TX$
    - $m_N = \beta S_N X^T T$
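
The posterior formulas above can be sketched in a few lines of JAX. This is a minimal illustration, not a reference implementation: the function name `blr_posterior`, the toy data, and the assumption that $\beta$ is known are all choices made here for the example. $X$ is taken to be the $N \times D$ design matrix, so the posterior precision is $I + \beta X^TX$.

```python
import jax.numpy as jnp
from jax import random


def blr_posterior(X, T, beta):
    """Posterior N(w | m_N, S_N) under the prior N(w | 0, I).

    X: (N, D) design matrix, T: (N,) targets, beta: noise precision (assumed known).
    """
    D = X.shape[1]
    S_N_inv = jnp.eye(D) + beta * X.T @ X   # S_N^{-1} = I + beta X^T X
    S_N = jnp.linalg.inv(S_N_inv)           # explicit inverse is acceptable for small D
    m_N = beta * S_N @ X.T @ T              # m_N = beta S_N X^T T
    return m_N, S_N


# Toy data (illustrative): t = 2 x - 1 + noise, with a constant column for the bias
key_x, key_eps = random.split(random.PRNGKey(0))
x = random.uniform(key_x, (50,), minval=-1.0, maxval=1.0)
X = jnp.stack([x, jnp.ones_like(x)], axis=1)   # (50, 2)
beta = 25.0                                    # noise std = 0.2
T = 2.0 * x - 1.0 + random.normal(key_eps, (50,)) / jnp.sqrt(beta)

m_N, S_N = blr_posterior(X, T, beta)
print(m_N)   # should be close to the generating weights [2, -1]
```

With 50 points the prior has little influence, so `m_N` lands near the weights used to generate the data, and `S_N` shrinks as more data is added.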

**Decision**
- based on conditional and marginal probability rules, the posterior predictive distribution $p(t|x,X,T)$ can be expressed as:
    - $p(t|x,X,T) = \int p(t|x,w)p(w|X,T)dw$
    - $p(t|x,X,T) = N(t|m_N^Tx, \sigma_N^2(x))$
    - $\sigma_N^2(x) = \beta^{-1} + x^TS_Nx$
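
As a small worked example of the predictive formulas, the sketch below evaluates $m_N^Tx$ and $\beta^{-1} + x^TS_Nx$ for two new inputs. The function name `blr_predict` and the posterior values `m_N`, `S_N` are made up for illustration; in practice they would come from the inference step.

```python
import jax.numpy as jnp


def blr_predict(x_new, m_N, S_N, beta):
    """Predictive mean and variance for each row of x_new, shape (M, D)."""
    mean = x_new @ m_N                                                 # m_N^T x
    var = 1.0 / beta + jnp.einsum('md,de,me->m', x_new, S_N, x_new)    # beta^{-1} + x^T S_N x
    return mean, var


# Hypothetical posterior over [slope, bias]
m_N = jnp.array([2.0, -1.0])
S_N = jnp.array([[0.02, 0.0],
                 [0.0, 0.01]])
beta = 25.0   # noise variance 1 / beta = 0.04

# Two new inputs (second column is the constant bias feature)
x_new = jnp.array([[0.0, 1.0],
                   [3.0, 1.0]])
mean, var = blr_predict(x_new, m_N, S_N, beta)
# mean is approximately [-1, 5]; var is approximately [0.05, 0.23]
```

The variance is never smaller than the noise level $\beta^{-1}$, and it grows as $x$ moves away from the origin because the uncertainty in the slope is multiplied by $x$.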


A Gaussian process (GP) is very similar to the above procedure, but there are also differences:
- Instead of inferring the parameters $w$ of a given linear model $y(x) = w^Tx$, a GP infers the function $y(x)$ itself directly from the training data.
- A GP is a non-parametric approach: there is no fixed-size parameter vector, and the effective number of parameters grows with the size of the training data set.
- A GP is a distribution over functions, defined by a mean function $m(x)$ and a covariance function $k(x, x')$.
- A GP is a memory-based approach, like k-NN: it stores the training data and uses all of it to predict the output for a new input. When new data is added to the training set, the covariance matrix of the training data has to be rebuilt and inverted again, i.e., the model is retrained.
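
To make "a distribution over functions" concrete, the sketch below draws a few sample functions from a zero-mean GP prior evaluated on a grid. The helper `rbf_gram`, the choice of a squared exponential kernel, the grid, and the jitter value are assumptions made for this illustration.

```python
import jax.numpy as jnp
from jax import random


def rbf_gram(x1, x2, variance=1.0, length_scale=1.0):
    """Squared exponential kernel matrix for 1-D inputs x1 (N,) and x2 (M,)."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return variance * jnp.exp(-0.5 * sq_dist / length_scale ** 2)


x = jnp.linspace(-5.0, 5.0, 50)
K = rbf_gram(x, x)                                  # (50, 50) prior covariance

# A small jitter on the diagonal keeps the Cholesky factorization stable in float32
L = jnp.linalg.cholesky(K + 1e-3 * jnp.eye(x.shape[0]))

# If z ~ N(0, I), then L z ~ N(0, K): each column is one function evaluated on the grid
z = random.normal(random.PRNGKey(0), (x.shape[0], 3))
samples = L @ z                                     # (50, 3)
```

Plotting each column of `samples` against `x` gives three smooth random curves; a smaller `length_scale` makes them wigglier, a larger `variance` makes them larger in amplitude.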

## 5.1 Gaussian Process Regression

The regression problem is standard:
- given training set $(X, T)$, predict the output $t$ for a new input $x$.

**Inference**
- assume the targets are noisy observations of a latent function $y(x)$ with a Gaussian process prior, so that $t_i = y(x_i) + \epsilon_i$ for all the data points in the training set, where $\epsilon_i$ is the modeling noise.
- get the conditional distribution $p(T|y(X)) = N(T|y(X), \beta^{-1}I_N)$, where $\beta$ is the precision of the noise
- assume the Gaussian process prior, i.e., the marginal distribution of the function values, $p(y(X)) = N(y(X)|m(X), k(X, X))$, where $m(X)$ is the mean function and $k(X, X)$ is the covariance function, also known as the kernel function.
    - a zero mean is the usual choice for the prior (for example, after centering the targets), i.e., $p(y) = N(y|0, K)$, where $K = k(X, X)$ is the $N \times N$ Gram matrix with entries $K_{nm} = k(x_n, x_m)$
- find the marginal distribution of the targets $p(T|X)$:
    - $p(T|X) = \int p(T, y|X)dy = \int p(T|y,X)p(y|X)dy$
    - $p(T|X) = N(T|0, C_N)$
    - $C_N = K + \beta^{-1}I_N$ -> the two covariances simply add because the noise is independent of $y$
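
Since $p(T|X)$ is a zero-mean Gaussian with covariance $C_N$, its log density can be evaluated directly, which is the quantity one would typically use to compare kernel settings. The sketch below is one possible way to compute it with a Cholesky factorization; the function name `log_marginal_likelihood` and the toy numbers are assumptions for illustration.

```python
import jax.numpy as jnp
import jax.scipy as jsp


def log_marginal_likelihood(K, T, noise_variance):
    """log N(T | 0, C_N) with C_N = K + noise_variance * I (noise_variance = 1 / beta)."""
    N = T.shape[0]
    C = K + noise_variance * jnp.eye(N)
    L = jnp.linalg.cholesky(C)                      # C = L L^T, L lower triangular
    alpha = jsp.linalg.cho_solve((L, True), T)      # C^{-1} T without forming C^{-1}
    log_det_half = jnp.sum(jnp.log(jnp.diag(L)))    # 0.5 * log|C|
    return -0.5 * T @ alpha - log_det_half - 0.5 * N * jnp.log(2.0 * jnp.pi)


# Toy example: three 1-D inputs with a unit-variance, unit-length-scale RBF kernel
x = jnp.array([-1.0, 0.0, 2.0])
K = jnp.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
T = jnp.array([0.5, -0.2, 1.0])
lml = log_marginal_likelihood(K, T, noise_variance=0.1)
```

The first term measures how well the targets fit the covariance structure, and the log-determinant term penalizes overly flexible covariances.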

**Decision**
- find the predictive distribution $p(t|x, X, T)$:
    - get the joint distribution $p(t, T|x, X)$ by extending the marginal distribution $p(T|X)$ with one more data point
        - $p(t, T|x, X) = N(T, t|0, C_{N+1})$
        - $C_{N+1} = \begin{bmatrix} C_N & \mathbf{k} \\ \mathbf{k}^T & c \end{bmatrix}$ -> (N+1)-by-(N+1) matrix
        - $c = k(x, x) + \beta^{-1}$, and $\mathbf{k}$ is the $N$-vector with elements $\mathbf{k}_n = k(x_n, x)$ for $n = 1, \dots, N$, i.e., the kernel between every training input and the new input
    - get the conditional distribution $p(t|x, X, T)$ from the joint distribution $p(t, T|x, X)$
        - $p(t|x, X, T) = N(t|m(x), \sigma^2(x))$
        - $m(x) = \mathbf{k}^TC_N^{-1}T$
        - $\sigma^2(x) = c - \mathbf{k}^TC_N^{-1}\mathbf{k}$
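
The step from the joint to the conditional distribution uses the standard result for partitioned Gaussians. As a short derivation sketch (the block names $a$, $b$, $A$, $B$, $D$ are introduced here only for illustration): if

$$
\begin{bmatrix} a \\ b \end{bmatrix} \sim N\left(0, \begin{bmatrix} A & B \\ B^T & D \end{bmatrix}\right),
\quad \text{then} \quad
p(b|a) = N\left(b \,|\, B^TA^{-1}a, \; D - B^TA^{-1}B\right).
$$

Setting $a = T$, $b = t$, $A = C_N$, $B = \mathbf{k}$ (the vector with elements $k(x_n, x)$) and $D = c$ gives

$$
m(x) = \mathbf{k}^TC_N^{-1}T = \sum_{n=1}^N a_n k(x_n, x), \quad a = C_N^{-1}T,
\qquad
\sigma^2(x) = c - \mathbf{k}^TC_N^{-1}\mathbf{k}.
$$

The mean is therefore a weighted sum of kernel evaluations at the training inputs, and the subtracted term in the variance is non-negative, so observing data can only reduce the prior variance $c$.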

In [8]:
import jax
import jax.scipy as jsp
import jax.numpy as jnp


class GaussianProcessRegression():
    def rbf_kernel(self, params, x1, x2):
        # squared exponential (RBF) kernel between two single inputs of shape (D,)
        sq_dist = jnp.sum((x1 - x2)**2)
        return params['variance'] * jnp.exp(-0.5 * sq_dist / params['length_scale']**2)

    def construct_covariance_matrix(self, params, X1, X2=None):
        # X1: (N, D), X2: (M, D) -> K: (N, M) with K[i, j] = k(X1[i], X2[j])
        if X2 is None:
            X2 = X1
        # inner vmap runs over the rows of X2, outer vmap over the rows of X1
        kernel_vmap = jax.vmap(jax.vmap(self.rbf_kernel, in_axes=(None, None, 0)), in_axes=(None, 0, None))
        return kernel_vmap(params, X1, X2)

    def compute_posterior(self, params, X_train, y_train, X_test, noise_variance):
        # blocks of the covariance matrix C_{N+1} = [C_N, k; k.T, c]
        K = self.construct_covariance_matrix(params, X_train)             # (N, N)
        K_s = self.construct_covariance_matrix(params, X_test)            # (M, M)
        K_sT = self.construct_covariance_matrix(params, X_test, X_train)  # (M, N)

        # C_N = K + beta^{-1} I, refer to Eq.(6.62) in PRML
        C = K + noise_variance * jnp.eye(len(X_train))

        # directly computing the inverse is numerically unstable,
        # so we solve the linear systems C a = y and C V = k instead

        # posterior mean: m(x) = k^T C_N^{-1} T
        mu_s = jnp.matmul(K_sT, jsp.linalg.solve(C, y_train))

        # posterior covariance of the noise-free function values
        cov_s = K_s - jnp.matmul(K_sT, jsp.linalg.solve(C, K_sT.T))

        # predictive variance of t: diagonal plus the noise term, since c = k(x, x) + beta^{-1}
        var_s = jnp.diag(cov_s) + noise_variance

        return mu_s, var_s

    # Make predictions
    def predict(self, params, X_train, y_train, X_test, noise_variance):
        mu_s, var_s = self.compute_posterior(params, X_train, y_train, X_test, noise_variance)
        return mu_s, var_s


In [9]:
import matplotlib.pyplot as plt
from jax import random
import jax.numpy as jnp

# Generate some example data (separate keys for the inputs and the noise)
key = random.PRNGKey(0)
key_x, key_noise = random.split(key)
X_train = random.uniform(key_x, (10, 1), minval=-5, maxval=5)
y_train = jnp.sin(X_train) + 0.2 * random.normal(key_noise, (10, 1))

# Define the kernel parameters
params = {'variance': 1.0, 'length_scale': 1.0}

# Define the noise variance
noise_variance = 0.05

# Generate some test points
X_test = jnp.linspace(-5, 5, 100)[:, None]

# Perform Gaussian Process Regression
gpr = GaussianProcessRegression()
mu_s, var_s = gpr.predict(params, X_train, y_train, X_test, noise_variance)

# mu_s has shape (100, 1) because y_train has shape (10, 1); flatten it for plotting
mu = mu_s.ravel()
std = jnp.sqrt(var_s)

# Plot the results: training points, predictive mean, and a one-standard-deviation band
plt.figure()
plt.plot(X_train, y_train, 'kx')
plt.plot(X_test.ravel(), mu, 'r')
plt.fill_between(X_test.ravel(), mu - std, mu + std, color='red', alpha=0.2)
plt.show()

In [5]:
y_train.shape

(10, 1)
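
The comment in `compute_posterior` about avoiding an explicit inverse can be taken one step further: because $C_N$ is symmetric positive definite, a Cholesky factorization is a common alternative to a general linear solve. The sketch below is a self-contained variant under that assumption; the names `rbf_gram` and `gp_predict_cholesky`, the default hyperparameters, and the toy data are illustrative and not part of the class above.

```python
import jax.numpy as jnp
import jax.scipy as jsp


def rbf_gram(X1, X2, variance=1.0, length_scale=1.0):
    """Squared exponential kernel matrix for X1 (N, D) and X2 (M, D), shape (N, M)."""
    sq_dist = jnp.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return variance * jnp.exp(-0.5 * sq_dist / length_scale ** 2)


def gp_predict_cholesky(X_train, y_train, X_test, noise_variance, variance=1.0, length_scale=1.0):
    """Predictive mean and variance of t at X_test using a Cholesky factor of C_N."""
    N = X_train.shape[0]
    C = rbf_gram(X_train, X_train, variance, length_scale) + noise_variance * jnp.eye(N)
    k = rbf_gram(X_train, X_test, variance, length_scale)   # (N, M), one column per test point
    L = jnp.linalg.cholesky(C)                               # C = L L^T
    alpha = jsp.linalg.cho_solve((L, True), y_train)         # C^{-1} T
    v = jsp.linalg.solve_triangular(L, k, lower=True)        # L^{-1} k
    mean = k.T @ alpha                                       # m(x) = k^T C^{-1} T
    c = variance + noise_variance                            # c = k(x, x) + beta^{-1}
    var = c - jnp.sum(v ** 2, axis=0)                        # c - k^T C^{-1} k
    return mean, var


# Toy check: predict at the training inputs themselves
X_train = jnp.array([[-1.0], [0.0], [1.0]])
y_train = jnp.sin(X_train).ravel()
mean, var = gp_predict_cholesky(X_train, y_train, X_train, noise_variance=0.01)
```

With a small noise variance the predictive mean at the training inputs stays close to the observed targets, and the predictive variance stays between zero and the prior value $c$.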