# Experimental Design for Gaussian Processes 

## Model-Based Design for GPs 
While space-filling designs like Latin hypercube or maximin are agnostic to the model being fit to the data, another approach is to develop design strategies that are tailored to the specific model being used; in our case, a GP.

We begin with the following setup. Suppose we are interested in fitting a GP

$$ y(\cdot) \sim \mathcal{GP}(0, k(\cdot, \cdot))$$

over a $D$-dimensional input space. Let $\tilde{\mathbf{X}}$ denote the $M \times D$ matrix containing a set of input points $\tilde{\mathbf{x}}_1, \dots, \tilde{\mathbf{x}}_M$. From these $M$ inputs, we seek to choose an $N \times D$ design $\mathbf{X}$ consisting of inputs $\mathbf{x}_1, \dots, \mathbf{x}_N$ that is "optimal" in some sense.

It will be helpful to establish some notation here. First let $\overline{\mathbf{X}}$ be the $(M - N) \times D$ matrix containing the points not selected in the design $\mathbf{X}$. I will denote the kernel matrices by $\mathbf{K} := k(\mathbf{X})$, $\mathbf{\tilde{K}} := k(\mathbf{\tilde{X}})$, $\mathbf{\overline{K}} := k(\overline{\mathbf{X}})$ and the random vectors $\mathbf{y} := y(\mathbf{X})$, $\mathbf{\tilde{y}} := y(\mathbf{\tilde{X}})$, $\overline{\mathbf{y}} := y(\overline{\mathbf{X}})$. 

Note that once we have selected $\mathbf{X}$, conditioning on the observed $\mathbf{y}$ yields the standard GP predictive distribution over the unobserved points:

$$ \mathbf{\overline{y}}|\mathbf{y}, \mathbf{X}, \mathbf{\overline{X}} \sim \mathcal{N}_{M - N}\left(\mu_{\mathbf{X}}(\overline{\mathbf{X}}), k_{\mathbf{X}}(\overline{\mathbf{X}})  \right)$$

where 

$$
\begin{align*}
\mu_{\mathbf{X}}(\overline{\mathbf{X}}) &= k\left(\overline{\mathbf{X}}, \mathbf{X}\right) \mathbf{K}^{-1} \mathbf{y} \\
k_{\mathbf{X}}(\overline{\mathbf{X}}) &= k\left(\overline{\mathbf{X}}\right) - k\left(\overline{\mathbf{X}}, \mathbf{X}\right) \mathbf{K}^{-1} k\left(\mathbf{X}, \overline{\mathbf{X}}\right)
\end{align*}
$$

I will let $\overline{\mathbf{y}}_{\overline{\mathbf{X}}|\mathbf{X}}$ denote the random vector with this predictive distribution. 
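The predictive equations above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the squared-exponential kernel, its lengthscale, the jitter term, and the names `rbf_kernel` and `gp_predict` are hypothetical choices that do not appear in the text above.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=0.3, variance=1.0):
    """Squared-exponential kernel k(A, B); an assumed kernel for illustration."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return variance * np.exp(-0.5 * np.maximum(sq_dists, 0.0) / lengthscale**2)

def gp_predict(X, y, X_bar, kernel=rbf_kernel, jitter=1e-8):
    """Predictive mean and covariance at X_bar after conditioning on (X, y)."""
    K = kernel(X, X) + jitter * np.eye(X.shape[0])  # jitter is a numerical aid
    K_cross = kernel(X_bar, X)                      # k(X_bar, X)
    L = np.linalg.cholesky(K)                       # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y
    V = np.linalg.solve(L, K_cross.T)               # L^{-1} k(X, X_bar)
    mean = K_cross @ alpha
    cov = kernel(X_bar, X_bar) - V.T @ V
    return mean, cov

rng = np.random.default_rng(0)
X_tilde = rng.uniform(size=(8, 2))                  # M = 8 candidates, D = 2
design_idx = np.array([0, 3, 5])                    # an arbitrary design, N = 3
rest_idx = np.setdiff1d(np.arange(8), design_idx)
X, X_bar = X_tilde[design_idx], X_tilde[rest_idx]

# Simulate observations from the GP prior at the design points.
y = rng.multivariate_normal(np.zeros(len(design_idx)), rbf_kernel(X, X))
mean, cov = gp_predict(X, y, X_bar)
```

Note that `cov` is computed without ever touching `y`, which is the property exploited in the entropy argument that follows.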

### Maximum Entropy Design
Intuitively, we seek to select $\mathbf{X}$ so that observing $\mathbf{y}$ yields the most information about the outputs at the unobserved points; that is, the selection that leaves $\overline{\mathbf{y}}_{\overline{\mathbf{X}}|\mathbf{X}}$ with as little uncertainty as possible. As our measure of uncertainty, we will first consider Shannon's entropy

$$ H\left(\overline{\mathbf{y}}_{\overline{\mathbf{X}}|\mathbf{X}}\right) = -\mathbb{E}\left[\log p_{\overline{\mathbf{y}}_{\overline{\mathbf{X}}|\mathbf{X}}}\left(\overline{\mathbf{y}}\right)\right]$$

where the expectation is with respect to the $(M - N)$-dimensional Gaussian distribution of $\overline{\mathbf{y}}_{\overline{\mathbf{X}}|\mathbf{X}}$. The negative of entropy is called *information*, and hence minimizing entropy is equivalent to maximizing information 

$$ I_{\overline{\mathbf{X}}|\mathbf{X}} := -H\left(\overline{\mathbf{y}}_{\overline{\mathbf{X}}|\mathbf{X}}\right)$$

Since everything here is Gaussian, the formula for the entropy of a Gaussian distribution will come up repeatedly; it is derived in the appendix for reference. Applying that formula to $\overline{\mathbf{y}}_{\overline{\mathbf{X}}|\mathbf{X}}$, we find that the entropy of the predictive distribution over the outputs at the unobserved locations is given by 
 
$$
\begin{align*}
H\left(\overline{\mathbf{y}}_{\overline{\mathbf{X}}|\mathbf{X}}\right) &= \frac{1}{2}\log\det\left(2\pi e k_{\mathbf{X}}(\overline{\mathbf{X}})\right) \\
&= \frac{1}{2}\log\det\left(2\pi e \left[k\left(\overline{\mathbf{X}}\right) - k\left(\overline{\mathbf{X}}, \mathbf{X}\right) \mathbf{K}^{-1} k\left(\mathbf{X}, \overline{\mathbf{X}}\right) \right] \right)
\end{align*}
$$
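As a rough sketch of how this quantity might be evaluated in practice, the snippet below computes the predictive entropy for one candidate design using `numpy.linalg.slogdet`, which avoids overflow or underflow in the determinant. The kernel, the one-dimensional candidate grid, and the helper names `gaussian_entropy` and `predictive_entropy` are assumptions made for illustration only.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=0.2):
    """Squared-exponential kernel; an assumed kernel for illustration."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def gaussian_entropy(cov):
    """Entropy 0.5 * log det(2*pi*e*cov) of a Gaussian with covariance cov."""
    sign, logdet = np.linalg.slogdet(cov)
    if sign <= 0:
        raise ValueError("covariance must be positive definite")
    return 0.5 * (cov.shape[0] * np.log(2.0 * np.pi * np.e) + logdet)

def predictive_entropy(X, X_bar, kernel=rbf_kernel):
    """Entropy of the GP predictive distribution at X_bar given design X."""
    K = kernel(X, X)
    K_cross = kernel(X_bar, X)                     # k(X_bar, X)
    pred_cov = kernel(X_bar, X_bar) - K_cross @ np.linalg.solve(K, K_cross.T)
    return gaussian_entropy(pred_cov)

X_tilde = np.linspace(0.0, 1.0, 6)[:, None]        # M = 6 candidates, D = 1
design_idx = [0, 2, 5]                             # an arbitrary design, N = 3
rest_idx = [1, 3, 4]
X, X_bar = X_tilde[design_idx], X_tilde[rest_idx]

H_pred = predictive_entropy(X, X_bar)
H_prior_rest = gaussian_entropy(rbf_kernel(X_bar, X_bar))
```

Here `H_prior_rest` is the entropy at the unselected points before conditioning; since conditioning subtracts a positive semi-definite matrix from the covariance, `H_pred` cannot exceed it.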

Now, as mentioned above, the goal is to select $\mathbf{X}$ to minimize $H\left(\overline{\mathbf{y}}_{\overline{\mathbf{X}}|\mathbf{X}}\right)$. Notice that the entropy of a Gaussian distribution does not depend on its mean. Since the GP predictive covariance $k_{\mathbf{X}}$ does not depend on $\mathbf{y}$ (the observed values being conditioned on), this implies that $H\left(\overline{\mathbf{y}}_{\overline{\mathbf{X}}|\mathbf{X}}\right)$ does not actually depend on $\mathbf{y}$; it depends on the design only through the kernel evaluated at the input locations. In more general non-Gaussian settings this is not the case, and hence it is typical to instead consider minimizing $\mathbb{E} H\left(\overline{\mathbf{y}}_{\overline{\mathbf{X}}|\mathbf{X}}\right)$, where the expectation is with respect to the prior on $\mathbf{y}$. Again, this is not required here, but I state it in the interest of providing a more general picture. 
 
We will now show that minimizing $H\left(\overline{\mathbf{y}}_{\overline{\mathbf{X}}|\mathbf{X}}\right)$ is equivalent to maximizing $H(\mathbf{y})$; that is, the optimal design is to choose input points $\mathbf{X}$ corresponding to the locations that are *most uncertain* under the GP prior. To verify this claim, we show that the prior entropy $H(\tilde{\mathbf{y}})$ can be decomposed as a sum of the prior entropy at the selected inputs $H(\mathbf{y})$ and the entropy of the predictive distribution at the remaining inputs $H\left(\overline{\mathbf{y}}_{\overline{\mathbf{X}}|\mathbf{X}}\right)$. To this end, note that the entropy of the prior at locations $\tilde{\mathbf{X}}$ is 

$$ H(\tilde{\mathbf{y}}) = \frac{1}{2} \log\det\left(2\pi e \tilde{\mathbf{K}} \right) = \frac{M}{2} \log(2\pi e) + \frac{1}{2} \log\det(\tilde{\mathbf{K}})$$

where $\tilde{\mathbf{K}}$ can be written in block form as 

$$\tilde{\mathbf{K}} = \begin{pmatrix} \mathbf{K} & k(\mathbf{X}, \overline{\mathbf{X}}) \\ 
k(\overline{\mathbf{X}}, \mathbf{X}) & \overline{\mathbf{K}} \end{pmatrix} $$

We can rewrite the determinant $\det(\tilde{\mathbf{K}})$ in terms of each block using the Schur complement formula for determinants of block matrices, which applies whenever $\mathbf{K}$ is invertible; see e.g., [this](https://math.stackexchange.com/questions/1905652/proofs-of-determinants-of-block-matrices) StackExchange post. 

$$
\begin{align*}
\det(\tilde{\mathbf{K}}) &= \det(\mathbf{K}) \cdot \det\left(\overline{\mathbf{K}} - k(\overline{\mathbf{X}}, \mathbf{X}) \mathbf{K}^{-1} k(\mathbf{X}, \overline{\mathbf{X}})\right) \\
&= \det(\mathbf{K}) \cdot \det(k_{\mathbf{X}}(\overline{\mathbf{X}}))
\end{align*}
$$
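To make the block determinant identity concrete, consider the smallest possible case, offered here as an illustrative worked example: $M = 2$ candidates with $N = 1$ selected, so that $\mathbf{X} = \mathbf{x}$ and $\overline{\mathbf{X}} = \overline{\mathbf{x}}$ are single points and every block is a scalar. Assuming $k(\mathbf{x}, \mathbf{x}) > 0$,

$$
\begin{align*}
\det(\tilde{\mathbf{K}}) &= k(\mathbf{x}, \mathbf{x}) k(\overline{\mathbf{x}}, \overline{\mathbf{x}}) - k(\mathbf{x}, \overline{\mathbf{x}})^2 \\
&= k(\mathbf{x}, \mathbf{x}) \cdot \left[k(\overline{\mathbf{x}}, \overline{\mathbf{x}}) - \frac{k(\overline{\mathbf{x}}, \mathbf{x})^2}{k(\mathbf{x}, \mathbf{x})} \right] \\
&= \det(\mathbf{K}) \cdot k_{\mathbf{x}}(\overline{\mathbf{x}})
\end{align*}
$$

The bracketed term is exactly the predictive variance at $\overline{\mathbf{x}}$ after observing the output at $\mathbf{x}$.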

Plugging this back into the expression for $H(\tilde{\mathbf{y}})$, we obtain
$$
\begin{align*}
H(\tilde{\mathbf{y}}) &= \left\{\frac{N}{2} \log(2\pi e) + \frac{1}{2}\log\det(\mathbf{K}) \right\} + \left\{\frac{M - N}{2} \log(2\pi e) + \frac{1}{2}\log\det(k_{\mathbf{X}}(\overline{\mathbf{X}})) \right\} \\
&= H(\mathbf{y}) + H\left(\overline{\mathbf{y}}_{\overline{\mathbf{X}}|\mathbf{X}}\right)
\end{align*}
$$
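The decomposition is easy to check numerically. The sketch below does so for one arbitrary split of a small candidate set; the kernel, the candidate grid, and the helper names are assumptions for illustration rather than part of the derivation.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=0.2):
    """Squared-exponential kernel; an assumed kernel for illustration."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def gaussian_entropy(cov):
    """Entropy 0.5 * log det(2*pi*e*cov) of a Gaussian with covariance cov."""
    sign, logdet = np.linalg.slogdet(cov)
    if sign <= 0:
        raise ValueError("covariance must be positive definite")
    return 0.5 * (cov.shape[0] * np.log(2.0 * np.pi * np.e) + logdet)

X_tilde = np.linspace(0.0, 1.0, 6)[:, None]    # M = 6 candidates, D = 1
design_idx = [1, 4]                            # selected points, N = 2
rest_idx = [0, 2, 3, 5]                        # the remaining M - N points

K_tilde = rbf_kernel(X_tilde, X_tilde)         # full M x M kernel matrix
K = K_tilde[np.ix_(design_idx, design_idx)]    # block for the design
K_cross = K_tilde[np.ix_(rest_idx, design_idx)]  # k(X_bar, X)
K_bar = K_tilde[np.ix_(rest_idx, rest_idx)]    # block for unselected points

# Predictive covariance at the unselected points (Schur complement of K).
pred_cov = K_bar - K_cross @ np.linalg.solve(K, K_cross.T)

H_full = gaussian_entropy(K_tilde)             # H(y_tilde)
H_design = gaussian_entropy(K)                 # H(y)
H_pred = gaussian_entropy(pred_cov)            # predictive entropy

gap = abs(H_full - (H_design + H_pred))        # should be numerically zero
```

Extracting every block from the same `K_tilde` keeps the check exact up to floating-point error; permuting the rows and columns of $\tilde{\mathbf{K}}$ to put the design first does not change its determinant.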

Since the left-hand side (the entropy of the prior at locations $\tilde{\mathbf{X}}$) is fixed, we see that minimizing $H\left(\overline{\mathbf{y}}_{\overline{\mathbf{X}}|\mathbf{X}}\right)$ is equivalent to maximizing $H(\mathbf{y})$. This sort of entropy decomposition shows up in a variety of contexts; I emphasize that the decomposition is typically of the form 

$$ H(\tilde{\mathbf{y}}) = H(\mathbf{y}) + \mathbb{E}_{\mathbf{y}} H\left(\overline{\mathbf{y}}_{\overline{\mathbf{X}}|\mathbf{X}}\right)$$ 
where the expectation is with respect to the prior, but in this special Gaussian setting the predictive entropy does not depend on $\mathbf{y}$, so the expectation can be dropped.
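The equivalence between minimizing the predictive entropy and maximizing $H(\mathbf{y})$ can be illustrated by brute force on a small candidate set. The sketch below enumerates every size-$N$ subset, which is only feasible for small $M$; the exhaustive search, the kernel, and the unevenly spaced candidate points are assumptions chosen for illustration, not a recommended design algorithm.

```python
from itertools import combinations

import numpy as np

def rbf_kernel(A, B, lengthscale=0.2):
    """Squared-exponential kernel; an assumed kernel for illustration."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def gaussian_entropy(cov):
    """Entropy 0.5 * log det(2*pi*e*cov) of a Gaussian with covariance cov."""
    sign, logdet = np.linalg.slogdet(cov)
    if sign <= 0:
        raise ValueError("covariance must be positive definite")
    return 0.5 * (cov.shape[0] * np.log(2.0 * np.pi * np.e) + logdet)

# Unevenly spaced candidates (M = 7, D = 1) to avoid ties between mirrored designs.
X_tilde = np.array([0.0, 0.1, 0.25, 0.45, 0.6, 0.85, 1.0])[:, None]
M, N = X_tilde.shape[0], 3
K_tilde = rbf_kernel(X_tilde, X_tilde)

results = []  # tuples of (design indices, H(y), predictive entropy)
for design in combinations(range(M), N):
    d = list(design)
    r = [i for i in range(M) if i not in design]
    K = K_tilde[np.ix_(d, d)]
    K_cross = K_tilde[np.ix_(r, d)]
    pred_cov = K_tilde[np.ix_(r, r)] - K_cross @ np.linalg.solve(K, K_cross.T)
    results.append((design, gaussian_entropy(K), gaussian_entropy(pred_cov)))

best_by_prior = max(results, key=lambda t: t[1])  # maximizes H(y)
best_by_pred = min(results, key=lambda t: t[2])   # minimizes predictive entropy

# The two entropies should sum to the same constant H(y_tilde) for every design.
totals = np.array([h_y + h_pred for _, h_y, h_pred in results])
```

Because `totals` is constant across designs, ranking designs by `H(y)` in decreasing order is the same as ranking them by predictive entropy in increasing order.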

# Appendix

## Information and Entropy of multivariate Gaussian
Here we derive the information (the negative of the Shannon entropy, as defined in the main text) of a multivariate Gaussian 

$$ X \sim \mathcal{N}_D(\mu, \Sigma)$$

We have 
$$
\begin{align*}
I(X) &= \mathbb{E}\left[\log p_X(X) \right] \\
     &= -\frac{1}{2}\log\det(2\pi \Sigma) - \frac{1}{2} \mathbb{E}\left[(X - \mu)^T \Sigma^{-1} (X - \mu) \right] \\
     &= -\frac{1}{2}\log\det(2\pi \Sigma) - \frac{1}{2} \mu^T \Sigma^{-1}\mu - \frac{1}{2}\mathbb{E}\left[X^T \Sigma^{-1} X \right] + \mathbb{E}\left[X^T \Sigma^{-1} \mu \right] \\
     &= -\frac{1}{2}\log\det(2\pi \Sigma) - \frac{1}{2} \mu^T \Sigma^{-1}\mu - \frac{1}{2}\left[\text{tr}(\Sigma^{-1}\Sigma) + \mu^T \Sigma^{-1} \mu \right] + \mu^T \Sigma^{-1} \mu \\
     &= -\frac{1}{2}\log\det(2\pi \Sigma) - \frac{1}{2} \mu^T \Sigma^{-1}\mu - \frac{D}{2} - \frac{1}{2}\mu^T \Sigma^{-1} \mu + \mu^T \Sigma^{-1} \mu \\
     &= -\frac{1}{2}\log\det(2\pi \Sigma) - \frac{D}{2} \\
     &= -\frac{1}{2}\log\det(2\pi e \Sigma)
\end{align*}
$$

The entropy is thus 
$$ H(X) = -I(X) = \frac{1}{2}\log\det(2\pi e \Sigma)$$
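As a sanity check on this formula, the closed-form entropy can be compared against a Monte Carlo estimate of $-\mathbb{E}[\log p_X(X)]$. The sketch below is illustrative only: the covariance, mean, sample size, and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 3

# An arbitrary symmetric positive definite covariance and mean.
A = rng.normal(size=(D, D))
Sigma = A @ A.T + D * np.eye(D)
mu = rng.normal(size=D)

# Closed form: H(X) = 0.5 * log det(2*pi*e*Sigma).
_, logdet = np.linalg.slogdet(Sigma)
H_closed = 0.5 * (D * np.log(2.0 * np.pi * np.e) + logdet)

# Monte Carlo estimate of -E[log p_X(X)].
samples = rng.multivariate_normal(mu, Sigma, size=200_000)
diff = samples - mu
quad = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(Sigma), diff)
log_p = -0.5 * (D * np.log(2.0 * np.pi) + logdet) - 0.5 * quad
H_mc = -log_p.mean()
```

The two values should agree up to Monte Carlo error, which shrinks at the usual $1/\sqrt{n}$ rate in the number of samples. Changing `mu` leaves `H_closed` untouched, consistent with the mean terms cancelling in the derivation.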