# Experimental Design for Gaussian Processes 

## Model-Based Design for GPs 
While space-filling designs such as Latin hypercube or maximin designs are agnostic to the model being fit to the data, another approach is to develop design strategies that are tailored to the specific model being used; in our case, a GP.

We begin with the following setup. Suppose we are interested in fitting a GP

$$ y(\cdot) \sim \mathcal{GP}(0, k(\cdot, \cdot))$$

over a $D$-dimensional input space. Let $\tilde{\mathbf{X}}$ denote the $M \times D$ matrix containing a set of input points $\tilde{\mathbf{x}}_1, \dots, \tilde{\mathbf{x}}_M$. From these $M$ inputs, we seek to choose an $N \times D$ design $\mathbf{X}$ consisting of inputs $\mathbf{x}_1, \dots, \mathbf{x}_N$ that is "optimal" in some sense.

It will be helpful to establish some notation here. First let $\overline{\mathbf{X}}$ be the $(M - N) \times D$ matrix containing the points not selected in the design $\mathbf{X}$. I will denote the kernel matrices by $\mathbf{K} := k(\mathbf{X})$, $\mathbf{\tilde{K}} := k(\mathbf{\tilde{X}})$, $\mathbf{\overline{K}} := k(\overline{\mathbf{X}})$ and the random vectors $\mathbf{y} := y(\mathbf{X})$, $\mathbf{\tilde{y}} := y(\mathbf{\tilde{X}})$, $\overline{\mathbf{y}} := y(\overline{\mathbf{X}})$.
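To make the notation concrete, here is a minimal NumPy sketch that splits a candidate set into a design and its complement and builds the three kernel matrices. The squared-exponential kernel, its hyperparameters, the random candidate points, and all variable names are assumptions made purely for illustration; nothing in the setup above requires them.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between the rows of A and the rows of B.

    This kernel choice is an assumption for illustration only.
    """
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return variance * np.exp(-0.5 * np.maximum(sq_dists, 0.0) / lengthscale**2)

rng = np.random.default_rng(0)
M, N, D = 20, 5, 2

# Candidate set X_tilde: M points in D dimensions (here, uniform on the unit cube).
X_tilde = rng.uniform(size=(M, D))

# An arbitrary (not optimized) design: N of the M candidate rows.
design_idx = rng.choice(M, size=N, replace=False)
in_design = np.zeros(M, dtype=bool)
in_design[design_idx] = True

X = X_tilde[in_design]        # N x D design
X_bar = X_tilde[~in_design]   # (M - N) x D points left out of the design

K = rbf_kernel(X, X)                    # N x N
K_tilde = rbf_kernel(X_tilde, X_tilde)  # M x M
K_bar = rbf_kernel(X_bar, X_bar)        # (M - N) x (M - N)
```

Choosing a design then amounts to choosing which rows of the candidate matrix go into `X`; the rest of the quantities follow from that choice.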

Note that once we have selected $\mathbf{X}$, conditioning on the observed $\mathbf{y}$ yields the standard GP predictive distribution over the unobserved points:

$$ \mathbf{\overline{y}}|\mathbf{y}, \mathbf{X}, \mathbf{\overline{X}} \sim \mathcal{N}_{M - N}\left(\mu_{\mathbf{X}}(\overline{\mathbf{X}}), k_{\mathbf{X}}(\overline{\mathbf{X}})  \right)$$

where 

$$
\begin{align*}
\mu_{\mathbf{X}}(\overline{\mathbf{X}}) &= k\left(\overline{\mathbf{X}}, \mathbf{X}\right) \mathbf{K}^{-1} \mathbf{y} \\
k_{\mathbf{X}}(\overline{\mathbf{X}}) &= k\left(\overline{\mathbf{X}}\right) - k\left(\overline{\mathbf{X}}, \mathbf{X}\right) \mathbf{K}^{-1} k\left(\mathbf{X}, \overline{\mathbf{X}}\right)
\end{align*}
$$

I will let $\overline{\mathbf{y}}_{\overline{\mathbf{X}}|\mathbf{X}}$ denote the random vector with this predictive distribution. 
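The predictive mean and covariance above can be computed as in the following sketch, which uses a Cholesky factorization of $\mathbf{K}$ rather than an explicit inverse. The helper names `rbf_kernel` and `gp_predict`, the kernel choice, the toy responses, and the small `jitter` added to the diagonal for numerical stability are all assumptions for illustration and are not part of the model as stated.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=0.5):
    """Squared-exponential kernel (an illustrative assumption)."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-0.5 * np.maximum(sq_dists, 0.0) / lengthscale**2)

def gp_predict(X, y, X_bar, kernel, jitter=1e-8):
    """Zero-mean GP predictive mean and covariance at X_bar given (X, y)."""
    K = kernel(X, X) + jitter * np.eye(X.shape[0])  # jitter is a numerical aid
    K_cross = kernel(X_bar, X)                       # k(X_bar, X)
    L = np.linalg.cholesky(K)                        # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y
    mean = K_cross @ alpha
    V = np.linalg.solve(L, K_cross.T)                # L^{-1} k(X, X_bar)
    cov = kernel(X_bar, X_bar) - V.T @ V             # k(X_bar) - k K^{-1} k^T
    return mean, cov

rng = np.random.default_rng(1)
X = rng.uniform(size=(5, 2))        # design points
X_bar = rng.uniform(size=(15, 2))   # unobserved points
y = np.sin(X.sum(axis=1))           # toy responses, purely illustrative

mean, cov = gp_predict(X, y, X_bar, rbf_kernel)
```

Note that `cov` is computed without ever touching `y`: the predictive covariance depends only on the input locations.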

### Maximum Entropy Design
Intuitively, we seek the design $\mathbf{X}$ that yields the most information about the unobserved points; that is, the selection that leaves $\overline{\mathbf{y}}_{\overline{\mathbf{X}}|\mathbf{X}}$ with as little uncertainty as possible. As our measure of uncertainty, we will first consider Shannon's entropy

$$ H\left(\overline{\mathbf{y}}_{\overline{\mathbf{X}}|\mathbf{X}}\right) = -\mathbb{E}\left[\log p_{\overline{\mathbf{y}}_{\overline{\mathbf{X}}|\mathbf{X}}}\left(\overline{\mathbf{y}}\right)\right]$$

where the expectation is with respect to the $(M - N)$-dimensional Gaussian distribution of $\overline{\mathbf{y}}_{\overline{\mathbf{X}}|\mathbf{X}}$. The negative of entropy is called *information*, and hence minimizing entropy is equivalent to maximizing information:

$$ I_{\overline{\mathbf{X}}|\mathbf{X}} := -H\left(\overline{\mathbf{y}}_{\overline{\mathbf{X}}|\mathbf{X}}\right)$$
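Since the predictive distribution is Gaussian, its entropy has the standard closed form for a $d$-dimensional Gaussian with covariance $\boldsymbol{\Sigma}$,

$$ H = \frac{1}{2} \log \det\left(2 \pi e \boldsymbol{\Sigma}\right) = \frac{d}{2}\left(1 + \log 2\pi\right) + \frac{1}{2} \log \det \boldsymbol{\Sigma} $$

with $d = M - N$ and $\boldsymbol{\Sigma} = k_{\mathbf{X}}(\overline{\mathbf{X}})$. The sketch below evaluates this quantity for two candidate designs on a one-dimensional grid. The kernel, the lengthscale, the grid, the nugget of $10^{-4}$ added to the diagonal (to keep the log-determinant numerically well defined), and the function names are all assumptions for illustration.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=0.2):
    """Squared-exponential kernel (an illustrative assumption)."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-0.5 * np.maximum(sq_dists, 0.0) / lengthscale**2)

def gaussian_entropy(cov):
    """Differential entropy (in nats) of a Gaussian: 0.5 * log det(2*pi*e*cov)."""
    dim = cov.shape[0]
    sign, logdet = np.linalg.slogdet(cov)
    if sign <= 0:
        raise ValueError("covariance must be positive definite")
    return 0.5 * (dim * (1.0 + np.log(2.0 * np.pi)) + logdet)

def conditional_entropy(K_tilde, design_idx):
    """Entropy of the unselected responses given the selected ones."""
    design_idx = np.asarray(design_idx)
    rest_idx = np.setdiff1d(np.arange(K_tilde.shape[0]), design_idx)
    K = K_tilde[np.ix_(design_idx, design_idx)]
    K_cross = K_tilde[np.ix_(rest_idx, design_idx)]
    K_bar = K_tilde[np.ix_(rest_idx, rest_idx)]
    cond_cov = K_bar - K_cross @ np.linalg.solve(K, K_cross.T)
    return gaussian_entropy(cond_cov)

# Candidate set: 12 equally spaced points on [0, 1].
X_tilde = np.linspace(0.0, 1.0, 12)[:, None]
K_tilde = rbf_kernel(X_tilde, X_tilde) + 1e-4 * np.eye(12)  # nugget: assumption

H_clustered = conditional_entropy(K_tilde, [0, 1, 2])   # three adjacent points
H_spread = conditional_entropy(K_tilde, [0, 5, 11])     # three spread-out points
print(H_clustered, H_spread)
```

In this toy setting the spread-out design leaves less entropy in the unobserved points than the clustered one. The entropy is a function of the input locations alone, so candidate designs can be compared before any responses are observed.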

Now, intuitively it seems that minimizing the entropy over the unobserved points $\overline{\mathbf{y}}_{\overline{\mathbf{X}}|\mathbf{X}}$ would be equivalent to maximizing the information content of the selected points $\mathbf{y}$. We make this idea concrete by deriving an entropy decomposition. 