# Gaussian Process Regression (GPR) Explained - Part 1: Introduction

## 1. Introduction

**Gaussian Process Regression (GPR)** is a non-parametric, Bayesian approach to regression. It provides not only predictions but also an estimate of the uncertainty in those predictions. GPR can be seen as a generalization of Bayesian linear regression: instead of placing a prior on a fixed set of coefficients, it places a prior directly on the function, which allows for complex, non-linear relationships between the inputs and the outputs.

In **linear regression**, the model is:

$y = X\beta + \varepsilon$

where:
- $y$ is the vector of observed values.
- $X$ is the matrix of input features.
- $\beta$ are the regression coefficients.
- $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$ is a vector of independent Gaussian noise terms, each with zero mean and variance $\sigma^2$.
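
As a point of reference, the linear model above can be fit by ordinary least squares. The following is a minimal sketch using NumPy; the synthetic data, the random seed, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): y = 1 + 2*x + noise
n = 50
x = rng.uniform(0.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])      # design matrix with an intercept column
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(0.0, 0.1, size=n)

# Ordinary least squares estimate of the coefficients beta
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```

The estimate is a single coefficient vector, so the fitted function is always a straight line, whatever the data look like.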

In GPR, we take a more flexible approach by assuming that the underlying function $f(\mathbf{x})$ comes from a **Gaussian process**, rather than assuming a fixed linear form.

---

## 2. Gaussian Process

### Definition

A **Gaussian Process (GP)** is a collection of random variables, any finite number of which have a joint Gaussian distribution. Because a function can be viewed as a collection of values indexed by its inputs, a GP defines a distribution over functions.

We say that a function $f(\mathbf{x})$ follows a Gaussian process if:

$$
f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))
$$

where:
- $m(\mathbf{x})$ is the **mean function**, which describes the average value of the function at each point $\mathbf{x}$. A common simplifying choice, used in the rest of this article, is $m(\mathbf{x}) = 0$.
- $k(\mathbf{x}, \mathbf{x}')$ is the **covariance function** (or kernel), which describes how the function values at $\mathbf{x}$ and $\mathbf{x}'$ are correlated.
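
To make "a distribution over functions" concrete, the sketch below draws a few functions from a zero-mean GP prior on a 1D grid. It assumes the squared exponential kernel that appears later in the worked example; the grid, the seed, the jitter term, and the helper name `rbf_kernel` are illustrative choices.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, signal_var=1.0):
    """Squared exponential kernel matrix for 1D input arrays a and b."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return signal_var * np.exp(-0.5 * sq_dist / length_scale**2)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 50)

mean = np.zeros_like(x)                           # m(x) = 0
cov = rbf_kernel(x, x) + 1e-8 * np.eye(x.size)    # small jitter for numerical stability

# Each row is one function drawn from the prior, evaluated on the grid x
samples = rng.multivariate_normal(mean, cov, size=3)
print(samples.shape)  # (3, 50)
```

Evaluating the GP on a finite grid reduces it to an ordinary multivariate Gaussian, which is exactly what the definition above states.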


# Gaussian Process Regression (GPR) Explained - Part 2: From Prior to Posterior

## 3. GPR: From Prior to Posterior

In GPR, we place a **prior** on the function $f(\mathbf{x})$ by assuming that it follows a Gaussian process. The goal is to update this prior with observed data to obtain a **posterior distribution** over the functions, which can be used to make predictions at new points.

### Problem Setup

Given training data $\mathcal{D} = \{ (\mathbf{x}_i, y_i) \}_{i=1}^n$, we assume:

$$
y_i = f(\mathbf{x}_i) + \varepsilon_i
$$

where $\varepsilon_i \sim \mathcal{N}(0, \sigma_n^2)$ is independent Gaussian noise with variance $\sigma_n^2$.

### Joint Distribution of Training and Test Points

Let $f_* = f(\mathbf{x}_*)$ be the function value at a new test point $\mathbf{x}_*$. By the definition of a GP, any finite set of function values is jointly Gaussian, so under the zero-mean prior the function values at the training points $\mathbf{X}$ and the function value at the test point $\mathbf{x}_*$ have the joint distribution:

$$
\begin{bmatrix}
f(\mathbf{X}) \\
f_*
\end{bmatrix}
\sim \mathcal{N} \left( 0, \begin{bmatrix}
K(\mathbf{X}, \mathbf{X}) & K(\mathbf{X}, \mathbf{x}_*) \\
K(\mathbf{x}_*, \mathbf{X}) & k(\mathbf{x}_*, \mathbf{x}_*)
\end{bmatrix} \right)
$$

Here:
- $K(\mathbf{X}, \mathbf{X})$ is the covariance matrix of the training points.
- $K(\mathbf{X}, \mathbf{x}_*)$ is the vector of covariances between the training points and the test point, and $K(\mathbf{x}_*, \mathbf{X})$ is its transpose.
- $k(\mathbf{x}_*, \mathbf{x}_*)$ is the prior variance of the function value at the test point.
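
The block structure can be sketched numerically as follows. The inputs and the squared exponential kernel are assumptions chosen for illustration; any valid kernel would produce the same layout.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared exponential kernel matrix for 1D input arrays a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale**2)

X_train = np.array([0.0, 1.0, 2.5])   # hypothetical training inputs
x_star = np.array([1.5])              # hypothetical test input

K = rbf_kernel(X_train, X_train)        # K(X, X), shape (3, 3)
k_star = rbf_kernel(X_train, x_star)    # K(X, x_*), shape (3, 1)
k_ss = rbf_kernel(x_star, x_star)       # k(x_*, x_*), shape (1, 1)

# Assemble the joint prior covariance of [f(X), f_*]
joint_cov = np.block([[K, k_star],
                      [k_star.T, k_ss]])

# Same matrix, obtained by evaluating the kernel on all inputs at once
all_inputs = np.concatenate([X_train, x_star])
assert np.allclose(joint_cov, rbf_kernel(all_inputs, all_inputs))
print(joint_cov.shape)  # (4, 4)
```

The final check shows that the joint covariance is nothing more than the kernel evaluated on the training and test inputs together.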


# Gaussian Process Regression (GPR) Explained - Part 3: Noisy Observations and Posterior Distribution

## 4. Incorporating Noisy Observations

In practice, we don’t observe the true function values $f(\mathbf{X})$; instead, we observe noisy versions $y = f(\mathbf{X}) + \varepsilon$, where $\varepsilon \sim \mathcal{N}(0, \sigma_n^2 I)$ is Gaussian noise with variance $\sigma_n^2$.

Thus, the covariance matrix of the **observed outputs** $y$ is:

$$
\text{Cov}(y) = K(\mathbf{X}, \mathbf{X}) + \sigma_n^2 I
$$
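
This follows in one line if the noise $\varepsilon$ is independent of $f(\mathbf{X})$, which is the standard assumption and is taken for granted here, so that the cross-covariance terms vanish:

$$
\text{Cov}(y) = \text{Cov}(f(\mathbf{X})) + \text{Cov}(\varepsilon) = K(\mathbf{X}, \mathbf{X}) + \sigma_n^2 I
$$

Under the same assumption, the noise does not affect the covariance between $y$ and $f_*$, which remains $K(\mathbf{X}, \mathbf{x}_*)$.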

---

## 5. Conditioning to Get the Posterior Distribution

We now want to compute the **posterior distribution** of the function value $f_*$ at the test point $\mathbf{x}_*$, conditioned on the observed data $y$.

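Because $y$ and $f_*$ are jointly Gaussian (the joint distribution from Section 3, with $K(\mathbf{X}, \mathbf{X})$ replaced by $K(\mathbf{X}, \mathbf{X}) + \sigma_n^2 I$ in the top-left block), the standard conditioning rule for multivariate Gaussians gives a Gaussian posterior, $f_* \mid \mathbf{X}, y, \mathbf{x}_* \sim \mathcal{N}(\mu_*, \sigma_*^2)$, with: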

### Posterior Mean

$$
\mu_* = K(\mathbf{x}_*, \mathbf{X}) [K(\mathbf{X}, \mathbf{X}) + \sigma_n^2 I]^{-1} y
$$

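This is the best estimate of $f_*$ at $\mathbf{x}_*$ given the observed data, in the sense that it minimizes the expected squared error. Note that it is a linear combination of the observed outputs $y$, with weights determined by the kernel.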

### Posterior Variance

$$
\sigma_*^2 = k(\mathbf{x}_*, \mathbf{x}_*) - K(\mathbf{x}_*, \mathbf{X}) [K(\mathbf{X}, \mathbf{X}) + \sigma_n^2 I]^{-1} K(\mathbf{X}, \mathbf{x}_*)
$$

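This tells us how uncertain the prediction is at the test point $\mathbf{x}_*$: it is the prior variance minus the reduction in uncertainty gained from the training data. The variance depends only on the input locations, not on the observed values $y$. It describes the latent function value $f_*$; the predictive variance of a new noisy observation at $\mathbf{x}_*$ is $\sigma_*^2 + \sigma_n^2$.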
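
The two formulas above can be sketched in NumPy as follows. This version uses a Cholesky factorization rather than an explicit matrix inverse, which is a common choice for numerical stability but not something the formulas require. The function names, the kernel, and the toy data are assumptions for illustration.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, signal_var=1.0):
    """Squared exponential kernel matrix for 1D input arrays a and b."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return signal_var * np.exp(-0.5 * sq_dist / length_scale**2)

def gp_posterior(X_train, y_train, x_star, noise_var, kernel=rbf_kernel):
    """Posterior mean and variance of f_* at each test input in x_star."""
    K_y = kernel(X_train, X_train) + noise_var * np.eye(X_train.size)
    k_star = kernel(X_train, x_star)                 # shape (n, m)
    L = np.linalg.cholesky(K_y)                      # K_y = L @ L.T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))  # K_y^{-1} y
    v = np.linalg.solve(L, k_star)                   # L^{-1} K(X, x_*)
    mean = k_star.T @ alpha
    var = np.diag(kernel(x_star, x_star)) - np.sum(v**2, axis=0)
    return mean, var

# Toy data (assumed): noisy-free samples of sin(x), tiny noise variance
X_train = np.array([0.0, 1.0, 2.0])
y_train = np.sin(X_train)
x_star = np.array([1.0, 10.0])   # one point on the data, one far away
mean, var = gp_posterior(X_train, y_train, x_star, noise_var=1e-4)
print(mean, var)
```

At the test input that coincides with a training point the variance is close to the noise level, while far from the data the prediction reverts to the prior: mean near zero and variance near $k(\mathbf{x}_*, \mathbf{x}_*)$.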


# Gaussian Process Regression (GPR) Explained - Part 4: Simple Example

## 6. Simple Example

Consider a simple 1D example where we have 3 training points $X = [1, 2, 3]$ and corresponding noisy observations $y = [1.1, 1.9, 3.2]$. We want to predict the function value $f_*$ at a new point $x_* = 1.5$.

### 1. Define the Covariance Function

We'll use the **squared exponential (RBF)** kernel:

$$
k(x, x') = \sigma_f^2 \exp\left( -\frac{(x - x')^2}{2l^2} \right)
$$

where $\sigma_f^2 = 1$ and $l = 1$ are the signal variance and length scale, respectively.

### 2. Compute the Covariance Matrices

- **Covariance matrix for training points**:

$$
K(X, X) = \begin{bmatrix}
k(1, 1) & k(1, 2) & k(1, 3) \\
k(2, 1) & k(2, 2) & k(2, 3) \\
k(3, 1) & k(3, 2) & k(3, 3)
\end{bmatrix}
= \begin{bmatrix}
1.0 & 0.6065 & 0.1353 \\
0.6065 & 1.0 & 0.6065 \\
0.1353 & 0.6065 & 1.0
\end{bmatrix}
$$

- **Covariance vector between training points and the test point**:

$$
K(X, x_*) = \begin{bmatrix}
k(1, 1.5) \\
k(2, 1.5) \\
k(3, 1.5)
\end{bmatrix}
= \begin{bmatrix}
0.8825 \\
0.8825 \\
0.3247
\end{bmatrix}
$$

- **Variance at the test point**:

$$
k(x_*, x_*) = k(1.5, 1.5) = 1
$$
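
The entries above can be reproduced with a few lines of NumPy. This is a small verification sketch; the function name `k` and the variable names are chosen here to mirror the notation of the example.

```python
import numpy as np

def k(x1, x2, sigma_f2=1.0, l=1.0):
    """Squared exponential kernel for scalar inputs."""
    return sigma_f2 * np.exp(-((x1 - x2) ** 2) / (2.0 * l**2))

X = np.array([1.0, 2.0, 3.0])
x_star = 1.5

K = np.array([[k(a, b) for b in X] for a in X])   # K(X, X)
k_vec = np.array([k(a, x_star) for a in X])       # K(X, x_*)
k_ss = k(x_star, x_star)                          # k(x_*, x_*)

print(np.round(K, 4))
print(np.round(k_vec, 4))
print(k_ss)
```

For instance, $k(1, 2) = \exp(-0.5) \approx 0.6065$ and $k(3, 1.5) = \exp(-1.125) \approx 0.3247$, matching the matrices above.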

### 3. Compute the Posterior Mean and Variance

- **Posterior mean**:

$$
\mu_* = K(x_*, X) [K(X, X) + \sigma_n^2 I]^{-1} y
$$

where we assume a noise standard deviation of $\sigma_n = 0.1$, so the noise variance is $\sigma_n^2 = 0.1^2 = 0.01$.

- **Posterior variance**:

$$
\sigma_*^2 = k(x_*, x_*) - K(x_*, X) [K(X, X) + \sigma_n^2 I]^{-1} K(X, x_*)
$$

Evaluating these two expressions with the matrices from step 2 gives both the predicted function value $\mu_*$ and its uncertainty $\sigma_*^2$ at the test point $x_* = 1.5$.
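
To carry the example through to actual numbers, the sketch below evaluates both expressions with NumPy, using the training data, kernel hyperparameters, and noise variance stated above. The variable names are illustrative, and linear solves are used in place of an explicit inverse.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0])
y = np.array([1.1, 1.9, 3.2])
x_star = 1.5
sigma_f2, l, sigma_n2 = 1.0, 1.0, 0.01

def k(a, b):
    """Squared exponential kernel; broadcasts over NumPy arrays."""
    return sigma_f2 * np.exp(-((a - b) ** 2) / (2.0 * l**2))

K = k(X[:, None], X[None, :])          # K(X, X), shape (3, 3)
k_vec = k(X, x_star)                   # K(X, x_*), shape (3,)
K_y = K + sigma_n2 * np.eye(3)         # K(X, X) + sigma_n^2 I

# Solve linear systems instead of forming the inverse explicitly
mu_star = k_vec @ np.linalg.solve(K_y, y)
var_star = k(x_star, x_star) - k_vec @ np.linalg.solve(K_y, k_vec)

print(f"mu_*      = {mu_star:.4f}")
print(f"sigma_*^2 = {var_star:.4f}")
```

With these settings the posterior mean comes out at approximately 1.35 and the posterior variance at approximately 0.025, so the prediction at $x_* = 1.5$ lies between the two neighboring observations and is considerably more certain than the prior variance of 1.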
