# Spatial Prediction

<hr>

**How to Model Covariance Matrices?**<br>
Often we have observations at some locations but no direct estimate of the covariance between one location and another.

One can assume that the covariance between a pair of cities is determined by the distance between them (spatial correlation). In this case,

$\displaystyle  \textsf{Cov}(X_1,X_2) = k(Z_1,Z_2)$

where $k(Z_1, Z_2)$ is some covariance function of the locations $Z_1, Z_2$, for example the RBF kernel (squared exponential), which decays exponentially with the squared distance

$\displaystyle  \textsf{Cov}(X_1,X_2) = k(Z_1, Z_2) = \exp (- \frac{\| Z_1 - Z_2\| ^2}{2\ell ^2})$

where $\ell$ is a parameter to be estimated, called the **length-scale**. Covariance functions are often called **kernels** ($k$) as well, although kernels form a broader class of functions.

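Evaluating the kernel on every pair of observed locations gives the covariance matrix, shown here for $N = 2$:

$
\displaystyle  \mathbf{\Sigma_N} = \begin{bmatrix}  \mathbf{k(x_1, x_1)} &  \mathbf{k(x_1, x_2)}\\ \mathbf{k(x_2, x_1)} &  \mathbf{k(x_2, x_2)} \end{bmatrix}
$

which is also called a kernel ($k$) matrix, $K_N$.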
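
As a minimal sketch of how such a kernel matrix could be assembled, the NumPy snippet below evaluates the RBF kernel on every pair of locations. The function name `rbf_kernel_matrix`, the coordinates, and `length_scale=1.0` are illustrative assumptions, not values from these notes.

```python
import numpy as np

def rbf_kernel_matrix(Z, length_scale=1.0):
    """Return K with K[i, j] = exp(-||Z_i - Z_j||^2 / (2 * length_scale^2))."""
    Z = np.asarray(Z, dtype=float)
    diff = Z[:, None, :] - Z[None, :, :]      # pairwise differences, shape (N, N, D)
    sq_dist = np.sum(diff ** 2, axis=-1)      # squared Euclidean distances, shape (N, N)
    return np.exp(-sq_dist / (2.0 * length_scale ** 2))

# Three hypothetical locations in the plane (made-up coordinates).
Z = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
K = rbf_kernel_matrix(Z, length_scale=1.0)
print(np.round(K, 4))
```

The diagonal is 1 (each location is perfectly correlated with itself) and the entries shrink as the locations move apart.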


High-level idea: 
- Choose a probabilistic model: a multivariate Gaussian whose prior covariance decays with distance
- Observe values $y_1, \dots, y_N$ at some locations $x_1, \dots, x_N$
- Obtain the posterior distribution of the unobserved variables $Y_*$: $p(Y_* | y_1, \dots, y_N)$, which is again Gaussian

Interpolation with Gaussian processes ([Kriging](https://en.wikipedia.org/wiki/Kriging)):
- Build the kernel matrix $K_N$ of the $N$ observations
- For a new point $x_*$, compute the vector $k_*^T = [k(x_*, x_1), \dots, k(x_*, x_N)]$, then the posterior mean and variance

    $\mu_{*|1:N} = \mu_{*} + k_{*}^T K_{N}^{-1} (y_{1:N} - \mu_{1:N})$
    
    $\sigma_{*|1:N}^2 = \sigma_*^2 - k_{*}^T K_{N}^{-1} k_*$

    where $\mu_*$ and $\sigma_*^2 = k(x_*, x_*)$ are the prior mean and variance at $x_*$

<img alt="Kriging" src="assets/kriging.jpg" width="300">

<img alt="Kriging with temperature sensors" src="assets/kriging_temperatures.jpg" width="600">
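
The two Kriging formulas can be sketched in a few lines of NumPy. This is a hedged illustration rather than a reference implementation: it assumes 1-D locations, an RBF kernel (so the prior variance $k(x_*, x_*)$ is 1), a constant prior mean, and made-up data; the names `rbf` and `krige` are hypothetical.

```python
import numpy as np

def rbf(a, b, length_scale=1.0):
    """RBF kernel between 1-D location arrays a (N,) and b (M,); returns (N, M)."""
    d = a[:, None] - b[None, :]
    return np.exp(-d ** 2 / (2.0 * length_scale ** 2))

def krige(x_obs, y_obs, x_new, length_scale=1.0, mean=0.0):
    """Posterior mean and variance at x_new, assuming a constant prior mean."""
    K_N = rbf(x_obs, x_obs, length_scale)         # N x N kernel matrix
    k_star = rbf(x_obs, x_new, length_scale)      # N x M, column j is k_* for x_new[j]
    alpha = np.linalg.solve(K_N, y_obs - mean)    # K_N^{-1} (y - mu)
    mu_post = mean + k_star.T @ alpha
    v = np.linalg.solve(K_N, k_star)              # K_N^{-1} k_*
    prior_var = 1.0                               # k(x_*, x_*) = 1 for the RBF kernel
    var_post = prior_var - np.sum(k_star * v, axis=0)
    return mu_post, var_post

# Made-up observations; the query points are an observed location,
# a location between observations, and a location far from all data.
x_obs = np.array([0.0, 1.0, 2.5])
y_obs = np.array([1.0, 0.5, -0.3])
x_new = np.array([1.0, 1.7, 10.0])
mu, var = krige(x_obs, y_obs, x_new)
print("posterior mean:    ", np.round(mu, 3))
print("posterior variance:", np.round(var, 3))
```

At an observed location the posterior reproduces the observation with (numerically) zero variance, while far from the data it falls back to the prior mean and prior variance. `np.linalg.solve` is used instead of forming $K_N^{-1}$ explicitly, which is generally the more stable choice.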

<hr>

**Kernel functions**
- Helps to predict at any $x_*$ by parameterizing the correlations in a relatively simple and computationally efficient way
- Determines the shape of the predicted function
- The kernel function applied must yield a covariance matrix, i.e. *symmetric, positive semidefinite*

$\therefore$ A kernel function expresses covariance by a function: $cov(y_i, y_j) = k(x_i, x_j)$
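
The symmetry and positive-semidefiniteness requirement can be checked numerically on a finite set of points. The sketch below is only a sanity check on one sample of inputs (it does not prove that a function is a valid kernel); the helper name `is_valid_covariance`, the tolerance, and the sample points are assumptions made for illustration.

```python
import numpy as np

def is_valid_covariance(K, tol=1e-10):
    """True if K is symmetric and all of its eigenvalues are >= -tol."""
    symmetric = np.allclose(K, K.T)
    eigvals = np.linalg.eigvalsh((K + K.T) / 2.0)   # eigenvalues of the symmetric part
    return bool(symmetric and eigvals.min() >= -tol)

x = np.array([0.0, 0.7, 1.5, 3.0])
d = np.abs(x[:, None] - x[None, :])   # pairwise distances

K_rbf = np.exp(-d ** 2 / 2.0)   # RBF kernel with length-scale 1
K_dist = d                      # plain distance: symmetric, but not a covariance

print("RBF matrix valid:     ", is_valid_covariance(K_rbf))
print("distance matrix valid:", is_valid_covariance(K_dist))
```

The plain distance matrix fails because its diagonal is zero while its off-diagonal entries are not, so its eigenvalues sum to zero and at least one of them must be negative.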

Extending this to all points in our space gives a **Gaussian Process** (GP): a collection of random variables, any finite subset of which has a joint Gaussian distribution.

A Gaussian process is fully specified by its mean and covariance functions:

$m(x) = \mu_x = \mathbb {E}[f(x)]$

$k(x, x') = cov(f(x), f(x'))$

The kernel function must produce a positive semidefinite matrix for any set of inputs (for instance, any matrix of the form $A^T A$ is positive semidefinite). Covariance functions are commonly classified by what they depend on:
- If the covariance function is translation invariant, i.e. it depends only on $Z_1 - Z_2$, then it is called *stationary*
- If the covariance function depends only on the norm $\lVert Z_1 - Z_2 \rVert$, i.e. only on the distance between $Z_1$ and $Z_2$, then it is called *isotropic*
- If the covariance function depends only on $Z_1^T Z_2$, then it is called a *dot product* covariance

Examples of kernels and the effects of their parameters:

1. Radial Basis Function (RBF) Kernel (*Default*)

    $k(x_i, x_j) = \exp (-\frac{\Vert x_i - x_j \Vert^2}{2 \ell ^2})$
    
    If $\ell$ is large, the squared distance divided by $2\ell^2$ goes to zero, so two points that are far apart still have a large covariance and the fitted function varies slowly. For small $\ell$, each point is interpolated from only its closest neighbours, so the fitted function *varies quickly*.
    
    <img alt="RBF Kernel and $\ell$ parameter" src="assets/rbf_kernel.jpg" width="300">
    
    <img alt="RBF Kernel and $\ell$ parameter 2" src="assets/rbf_kernel_2.jpg" width="600">
    
    
2. Gamma-exponential Kernel

    The exponent $\gamma$ alters the sharpness/shape of the covariance function as a function of the distance between two points
    
    $k(x_i, x_j) = \exp (-(\frac{\Vert x_i - x_j \Vert^2}{2 \ell ^2})^\gamma)$
    
    <img alt="Gamma-exponential Kernel" src="assets/gamma_exponential.jpg" width="600">
    

3. Polynomial Kernel

    Linear kernel: $k(x, x') = \langle x, x' \rangle$
    
    Quadratic kernel: $k(x, x') = (\langle x, x' \rangle + 1)^2$
    
    <img alt="Polynomial Kernel" src="assets/polynomial_kernel.jpg" width="600">
    
    
4. Periodic Kernel

    $k(x, x') = \exp (-\frac{2\sin^2 (\pi (x - x') / p)}{2 \ell ^2})$
    
    where $p$ is the period parameter: the covariance repeats every $p$, so a point is strongly correlated with the points one or more full periods away, e.g. seasonal temperature
    
    This has advantages over the RBF kernel when extrapolating periodic data. The RBF kernel interpolates well when data points are close, but its predictions become very uncertain when moving away from the data points (extrapolation)
    
    <img alt="Periodic Kernel" src="assets/periodic_kernel.jpg" width="600">
    
5. Other possible kernel functions

    *Assume here that $x$ is a difference between points, e.g. $x = Z_1 - Z_2$, and $r$ is a distance, e.g. $r = \Vert Z_1 - Z_2 \Vert$*
    
    - Constant, $\sigma_0^2$
    - Linear, $\sum_{d=1}^{D} \sigma_d^2 x_d x_d'$
    - Squared exponential, $\exp (- \frac{r^2}{2 \ell ^2})$
    - Matérn, $\frac{1}{2^{\nu - 1} \Gamma(\nu)} (\frac{\sqrt{2 \nu}}{\ell} r)^{\nu} K_{\nu} (\frac{\sqrt{2 \nu}}{\ell} r)$
    - Exponential, $\exp (- \frac{r}{\ell})$
    - Rational quadratic, $(1 + \frac{r^2}{2 \alpha \ell^2})^{-\alpha}$
    - Neural network, $\sin ^{-1}\left( \frac{2\mathrm{{\boldsymbol x}}^{\intercal }{\boldsymbol \Sigma }\mathrm{{\boldsymbol x}}'}{\sqrt{(1 + 2\mathrm{{\boldsymbol x}}^{\intercal }{\boldsymbol \Sigma }\mathrm{{\boldsymbol x}})(1 + 2{\mathrm{{\boldsymbol x}}^\prime }^{\intercal }{\boldsymbol \Sigma }\mathrm{{\boldsymbol x}}')}} \right)$
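
To make the effect of the parameters concrete, the sketch below evaluates a few of the listed kernels as plain functions of the distance $r$ (or of the difference $x - x'$ for the periodic kernel). The function names, default parameter values, and sample distances are assumptions chosen for illustration; the periodic kernel uses the same parameterization as written in these notes.

```python
import numpy as np

def rbf(r, length_scale=1.0):
    """Squared exponential: exp(-r^2 / (2 l^2))."""
    return np.exp(-r ** 2 / (2.0 * length_scale ** 2))

def exponential(r, length_scale=1.0):
    """Exponential: exp(-r / l)."""
    return np.exp(-r / length_scale)

def rational_quadratic(r, length_scale=1.0, alpha=1.0):
    """Rational quadratic: (1 + r^2 / (2 alpha l^2))^(-alpha)."""
    return (1.0 + r ** 2 / (2.0 * alpha * length_scale ** 2)) ** (-alpha)

def periodic(x_diff, period=1.0, length_scale=1.0):
    """Periodic kernel, parameterized as in the notes above."""
    s = np.sin(np.pi * x_diff / period)
    return np.exp(-2.0 * s ** 2 / (2.0 * length_scale ** 2))

r = np.array([0.0, 0.5, 1.0, 2.0])

# A larger length-scale keeps the covariance high at larger distances.
for ell in (0.5, 1.0, 3.0):
    print(f"RBF, length-scale {ell}:", np.round(rbf(r, ell), 3))

print("exponential:       ", np.round(exponential(r), 3))
print("rational quadratic:", np.round(rational_quadratic(r), 3))

# Points a whole number of periods apart are (almost) perfectly correlated.
print("periodic, x - x' in {0.5, 1, 2}:", np.round(periodic(np.array([0.5, 1.0, 2.0])), 3))
```

All of these equal 1 at zero distance; they differ in how quickly the covariance falls off, and the periodic kernel rises back to 1 at multiples of the period.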


How to build more covariance functions? Some processes are a mix of functions and therefore we may have to estimate the covariance structure with a mixed function.

Some ways to build kernel functions:

- A sum of kernel functions is a kernel function, $k(x, x') = k_1 (x, x') + k_2 (x, x')$, e.g. $k_{linear} + k_{periodic}$
- A product of kernel functions is a kernel function, $k(x, x') = k_1 (x, x') \cdot k_2 (x, x')$
    - $k_{linear} \cdot k_{periodic}$, where the amplitude of the periodic pattern is scaled up and down with the magnitude of the inputs
    - $k_{RBF} \cdot k_{periodic}$
    - $k((x_1, x_2, t), (x_1', x_2', t')) = k_{space} ((x_1, x_2), (x_1', x_2')) \cdot k_{time} (t, t')$
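
The sum and product rules can be illustrated numerically: build the component kernel matrices on the same inputs, combine them elementwise, and confirm that the smallest eigenvalue stays non-negative up to rounding. This is a sketch on one made-up grid of 1-D inputs, with hypothetical helper names and parameter values; it illustrates the rules rather than proving them.

```python
import numpy as np

def linear(x1, x2):
    """Linear kernel for 1-D inputs: k(x, x') = x * x'."""
    return np.outer(x1, x2)

def rbf(x1, x2, length_scale=1.0):
    d = x1[:, None] - x2[None, :]
    return np.exp(-d ** 2 / (2.0 * length_scale ** 2))

def periodic(x1, x2, period=1.0, length_scale=1.0):
    d = x1[:, None] - x2[None, :]
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / (2.0 * length_scale ** 2))

x = np.linspace(0.0, 3.0, 8)   # made-up 1-D inputs

K_sum = linear(x, x) + periodic(x, x)                     # k_linear + k_periodic
K_prod = rbf(x, x, length_scale=2.0) * periodic(x, x)     # k_RBF * k_periodic (elementwise)

for name, K in [("sum", K_sum), ("product", K_prod)]:
    min_eig = np.linalg.eigvalsh(K).min()
    print(f"{name}: symmetric={np.allclose(K, K.T)}, min eigenvalue={min_eig:.2e}")
```

Note that the product of kernels is the elementwise (Hadamard) product of the kernel matrices, not the matrix product.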

<hr>

**Effect of Measurement Noise on Gaussian Processes**

1. Stationarity

    The ground truth may not be stationary, e.g. it may have large variance at low $x$ values and small variance at high $x$ values

    <img alt="Non-Stationarity" src="assets/nonstationary_kernels.jpg" width="300">

    In that case, apply a $\log$ transformation to the inputs, e.g. $k(x, x') = \exp (- \Vert \log(0.1 + x) - \log(0.1 + x') \Vert^2)$

    <img alt="Log-Transformation" src="assets/log_transformation_kernel.jpg" width="300">
    
2. Noise in observations, $\tau$

    Given observations, $y_i = f(x_i) + \epsilon_i$, where $\epsilon_i \sim N(0, \tau^2)$ are iid
    
    How does noise affect variance / covariance?
    
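    $cov(y_i, y_j) = cov(y_i' + \epsilon_i, y_j' + \epsilon_j) = cov(y_i', y_j') + cov(\epsilon_i, \epsilon_j)$, which equals $k(x_i, x_j) + 0$ if $i \neq j$ and $k(x_i, x_i) + \tau^2$ if $i = j$,
    
    where $y_i' = f(x_i)$ and $y_j' = f(x_j)$ are the ground-truth (noise-free) values of $y$; the cross terms vanish because the noise is independent of $f$
    
    $\therefore$ The covariance matrix of the observed $y_1, \dots, y_N$ becomes $K_N + \tau^2 \mathbb{1}$, where $\mathbb{1}$ is the identity matrix, i.e. we add $\tau^2$ to the diagonal of the covariance matrix. The prediction formulas are then applied as before, with $K_N + \tau^2 \mathbb{1}$ in place of $K_N$.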
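
A small sketch of the noisy case: the only change to the prediction step is the $\tau^2$ added to the diagonal of the kernel matrix. It assumes a zero prior mean, an RBF kernel with unit prior variance, 1-D inputs, and made-up observations; `predict_noisy` is a hypothetical helper name.

```python
import numpy as np

def rbf(a, b, length_scale=1.0):
    d = a[:, None] - b[None, :]
    return np.exp(-d ** 2 / (2.0 * length_scale ** 2))

def predict_noisy(x_obs, y_obs, x_new, tau, length_scale=1.0):
    """Zero-mean GP posterior for f(x_new) given observations with noise std tau."""
    n = len(x_obs)
    K_noisy = rbf(x_obs, x_obs, length_scale) + tau ** 2 * np.eye(n)   # K_N + tau^2 * identity
    k_star = rbf(x_obs, x_new, length_scale)
    mu = k_star.T @ np.linalg.solve(K_noisy, y_obs)
    var = 1.0 - np.sum(k_star * np.linalg.solve(K_noisy, k_star), axis=0)
    return mu, var

# Made-up data; we predict at a location that was itself observed.
x_obs = np.array([0.0, 1.0, 2.0])
y_obs = np.array([0.2, 0.9, 0.1])
x_new = np.array([1.0])

for tau in (0.0, 0.3, 1.0):
    mu, var = predict_noisy(x_obs, y_obs, x_new, tau)
    print(f"tau={tau}: mean={mu[0]:.3f}, variance={var[0]:.3f}")
```

With $\tau = 0$ the prediction passes exactly through the observation. As $\tau$ grows, the model trusts each observation less, so some uncertainty remains even at observed locations and the mean is pulled towards the prior mean.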
    
    
<hr>

**General comments on Gaussian Processes**

- Flexible, nonlinear regression method
- The kernel determines what kind of function we fit
- Applicable far beyond spatial models, in many settings of nonlinear regression (including time series)
- The predictive uncertainty can guide decisions, e.g. placing sensors where they reduce uncertainty the most (cf. *Bayesian Optimization*)
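
As one hedged illustration of uncertainty-guided sensor placement, the sketch below repeatedly adds a sensor at the candidate location with the largest posterior variance. This greedy maximum-variance rule is a simple heuristic chosen for illustration, not a full Bayesian optimization procedure; the kernel, noise level `tau`, candidate grid, and initial sensor are all assumptions.

```python
import numpy as np

def rbf(a, b, length_scale=1.0):
    d = a[:, None] - b[None, :]
    return np.exp(-d ** 2 / (2.0 * length_scale ** 2))

def posterior_variance(x_obs, x_cand, tau=0.1, length_scale=1.0):
    """GP posterior variance at x_cand; it depends only on locations, not on measured values."""
    K = rbf(x_obs, x_obs, length_scale) + tau ** 2 * np.eye(len(x_obs))
    k_star = rbf(x_obs, x_cand, length_scale)
    return 1.0 - np.sum(k_star * np.linalg.solve(K, k_star), axis=0)

candidates = np.linspace(0.0, 10.0, 101)   # hypothetical 1-D candidate sites
sensors = np.array([5.0])                  # hypothetical existing sensor

for _ in range(3):
    var = posterior_variance(sensors, candidates)
    best = candidates[np.argmax(var)]      # most uncertain candidate site
    sensors = np.append(sensors, best)

print("chosen sensor locations:", sensors)
```

Because the posterior variance of a GP does not depend on the observed values, the placement can be planned before any measurement is taken. The greedy rule spreads the sensors out, first towards the ends of the interval and then into the remaining gaps.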

<hr>

# Basic code
A `minimal, reproducible example`