---
title: The basics of Bayesian optimization
author: "Simon Dirmeier <simon.dirmeier @ web.de>"
date: March 2021
bibliography: ./references/references.bib
link-citations: true
output:
---
knitr::opts_chunk$set(comment = NA, warning = FALSE, error = FALSE,
fig.align = "center", fig.width=11, fig.height=4)
Many real-world problems involve optimization of functions which are difficult, or costly, to evaluate. For instance, in deep learning, finding the optimal hyperparameters and architecture of a neural network is a cumbersome process that should ideally be automated with as few function evaluations (i.e., trainings of the network) as possible. For problems like these, Bayesian optimization (BO) offers a unifying framework where the function to evaluate is approximated using a surrogate model.
In this notebook, we introduce the basics of BO. More specifically, we are interested in finding the maximizer of some function $f$ over a domain $\mathcal{X}$:
\begin{equation} \mathbf{x}^{*} = \arg \max_{\mathbf{x} \in \mathcal{X}} f(\mathbf{x}) \end{equation}
We assume that we are allowed to query $f$ at arbitrary points $\mathbf{x} \in \mathcal{X}$, but that every evaluation is expensive, so we want to get by with as few of them as possible.
Throughout this notebook, we will use Stan to fit the surrogate model. Some introductory references on Bayesian optimization include @snoek2012practical, @shahriari2015taking and @frazier2018tutorial. Feedback and comments are welcome!
suppressMessages({
library(ggplot2)
library(cowplot)
library(pracma)
library(rdist)
library(rstan)
})
set.seed(23)
For convenience, we define a method to plot the GP fit and data.
scatterplot <- function(x.true, y.true, f_star = NULL, f_star_var = NULL, data = NULL) {
df <- data.frame(x = x.true, y = y.true)
g <- ggplot() +
theme_cowplot() +
theme(
axis.line.x = element_line(color = "black", size = .25),
axis.line.y = element_line(color = "black", size = .25),
axis.text = element_text(size = 12),
axis.title = element_text(size = 13),
legend.text = element_text(size = 13)
)
# posterior mean and a one-standard-deviation band around it
if (!is.null(f_star) && !is.null(f_star_var)) {
df_star <- data.frame(
x = x.true,
y = f_star,
lower = f_star - sqrt(f_star_var),
upper = f_star + sqrt(f_star_var)
)
g <- g +
geom_ribbon(
data = df_star,
aes(x, ymin = lower, ymax = upper, fill = "#DCBCBC")
) +
geom_line(data = df_star, aes(x, y, color = "darkred"))
}
# observations queried so far
if (!is.null(data)) {
g <- g + geom_point(data = data, aes(x, y, color = "darkred"))
}
g +
geom_line(data = df, aes(x, y, color = "darkgrey")) +
scale_color_manual(
breaks = c("darkgrey", "darkred", "black"),
labels = c("Function to optimize", "Posterior", "Acquisition function"),
values = c("darkgrey", "darkred", "black")
) +
scale_fill_manual(
breaks = c("lightgrey", "darkred", "#DCBCBC"),
values = c("lightgrey", "darkred", "#DCBCBC")
) +
labs(color = NULL) +
guides(fill = FALSE)
}
Usually the function to maximize is difficult or expensive to query, so we want to keep the number of function evaluations as small as possible. In this example, we demonstrate Bayesian optimization on the function below.
# the synthetic function we want to maximize
f <- function(x) cos(4 * x) + exp(-(x**2) / 2)
Without loss of generality, we constrain the optimization to a dense grid of $m$ points on the interval $[-5, 5]$.
m <- 1000
x.init <- seq(-5, 5, length.out = m)
y.init <- f(x.init)
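Since this toy function is cheap to evaluate (which is, of course, only true for the demonstration), we can record its maximizer on the grid as a ground truth to compare against later:
x.init[which.max(y.init)]  # ground-truth maximizer of f on the grid
max(y.init)                # the corresponding maximum value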
The function to optimize is shown below:
scatterplot(x.init, y.init)
To optimize this function, we define functions to fit a surrogate model and to make predictions using the model fit. As mentioned above, we use Stan to implement the GP. The Stan code is fairly short: it takes the data, a set of points for which predictions should be made, and the kernel hyperparameters. The model and parameters blocks are actually empty, since here we confine ourselves to fixed kernel hyperparameters instead of inferring their posterior.
surrogate.model.file <- "_models/bo-surrogate.stan"
cat(readLines(surrogate.model.file), sep = "\n")
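If you do not have the model file at hand: a minimal prediction-only GP model could look roughly like the sketch below. Only the names (`N`, `x`, `y`, `N_star`, `x_star`, `rho`, `alpha`, `sigma`, `f_star`, `f_star_cov`) are taken from the data list assembled in `predict.gp` further down; the implementation itself is an assumption, not necessarily the actual contents of the file.
data {
  int<lower=1> N;              // number of observations
  real x[N];                   // training inputs
  vector[N] y;                 // training targets
  int<lower=1> N_star;         // number of prediction points
  real x_star[N_star];         // prediction inputs
  real<lower=0> rho;           // kernel length scale (fixed)
  real<lower=0> alpha;         // kernel marginal standard deviation (fixed)
  real<lower=0> sigma;         // observation noise (fixed)
}
parameters {
}
model {
}
generated quantities {
  vector[N_star] f_star;       // predictive posterior means
  vector[N_star] f_star_cov;   // predictive posterior variances
  {
    matrix[N, N] K = cov_exp_quad(x, alpha, rho)
      + diag_matrix(rep_vector(square(sigma), N));
    matrix[N, N] L = cholesky_decompose(K);
    matrix[N, N_star] K_star = cov_exp_quad(x, x_star, alpha, rho);
    vector[N] L_div_y = mdivide_left_tri_low(L, y);
    matrix[N, N_star] v = mdivide_left_tri_low(L, K_star);
    f_star = v' * L_div_y;
    for (n in 1:N_star)
      f_star_cov[n] = square(alpha) - dot_self(v[, n]);
  }
}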
Stan requires us to compile the model once before we can use it. We do that like this:
gp <- stan_model(file = surrogate.model.file)
The predict function takes the compiled Stan model, the data, and the points for which to make predictions.
predict.gp <- function(gp, x.d, y.d, x.star) {
dat <- list(
N = length(x.d),
x = array(x.d),
y = array(y.d),
N_star = length(x.star),
x_star = array(x.star),
# kernel hyperparameters are fixed rather than inferred
rho = 1,
alpha = 1,
sigma = 1
)
# a single iteration with algorithm 'Fixed_param' suffices, since we
# only evaluate the generated quantities of the model
pred <- rstan::sampling(
gp, dat,
chains = 1, algorithm = "Fixed_param", iter = 1, show_messages = FALSE, refresh = 0
)
ext <- rstan::extract(pred)
f.star <- as.vector(ext$f_star)
f.star.var <- as.vector(ext$f_star_cov)
# posterior means and variances at x.star
list(f.star, f.star.var)
}
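To illustrate the interface (a hypothetical call, not part of the optimization loop below), we could fit the GP to a single observation at zero and predict on the whole grid:
preds <- predict.gp(gp, x.d = 0, y.d = f(0), x.star = x.init)
length(preds[[1]])  # one posterior mean per grid point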
As acquisition function we use the upper confidence bound (see @snoek2012practical), which is computed from the predictive posterior mean and standard deviation:
\begin{equation} a(\mathbf{x}) = \mu(\mathbf{x}) + \kappa \sigma(\mathbf{x}) \end{equation}
where $\mu(\mathbf{x})$ and $\sigma(\mathbf{x})$ are the mean and standard deviation of the predictive posterior at $\mathbf{x}$, and $\kappa$ is a constant that balances exploration and exploitation.
acquisition.function <- function(f.star, f.star.var, conf) {
# upper confidence bound: posterior mean plus a multiple of the
# posterior standard deviation
f.star + conf * sqrt(f.star.var)
}
Finally, we define a function that proposes the next point to evaluate:
acquire <- function(x.d, y.d, x.init, conf = 2.0) {
# fit the surrogate once and reuse the posterior for the acquisition
preds <- predict.gp(gp, x.d, y.d, x.init)
f.star <- preds[[1]]
f.star.var <- preds[[2]]
ucb <- acquisition.function(f.star, f.star.var, conf)
# propose the point that maximizes the upper confidence bound
x.next <- x.init[which.max(ucb)]
list(
x.next = x.next,
ucb = ucb,
f.star = f.star,
f.star.var = f.star.var
)
}
We start with a random point on the interval defined above and query it against the function that we want to optimize.
set.seed(23)
x.d <- runif(1, -5, 5)
y.d <- f(x.d)
Then we train the surrogate model and use the acquisition function to propose a new point.
iter <- acquire(x.d, y.d, x.init)
x.n <- iter$x.next
ucb <- iter$ucb
f.star <- iter$f.star
f.star.var <- iter$f.star.var
For the acquisition of the first point, we first fit the GP to the data (consisting of one observation). The posterior mean is shown in red, and the posterior variance in light red. We then compute the acquisition function on the grid of $m$ points defined above and propose the point at which it is maximal, shown as the black dot on the acquisition function.
Note how the acquisition function has a dent around the observed data point: the posterior variance is small there, so the upper confidence bound is low.
scatterplot(
x.init, y.init, f.star, f.star.var,
data = data.frame(x = x.d, y = y.d)
) +
geom_line(
data = data.frame(x = x.init, y = ucb),
aes(x, y, color = "black")
) +
geom_point(
data = data.frame(x = x.n, y = max(ucb)),
aes(x, y)
)
We then evaluate the point and add it to the data.
y.n <- f(x.n)
x.d <- c(x.d, x.n)
y.d <- c(y.d, y.n)
We can repeat this process of querying points and then acquiring new points as often as we want, until we think we have a good estimate of the maximum of $f$.
iter <- acquire(x.d, y.d, x.init)
x.n <- iter$x.next
ucb <- iter$ucb
f.star <- iter$f.star
f.star.var <- iter$f.star.var
The data now consist of two observations, hence the acquisition function has two dents here.
scatterplot(
x.init, y.init, f.star, f.star.var,
data = data.frame(x = x.d, y = y.d)
) +
geom_line(
data = data.frame(x = x.init, y = ucb),
aes(x, y, color = "black")
) +
geom_point(
data = data.frame(x = x.n, y = max(ucb)),
aes(x, y)
)
Let's repeat this procedure a couple of times, i.e., propose a point, evaluate it, and add it to the data.
for (i in seq(20)) {
# evaluate the proposed point and add it to the data
y.n <- f(x.n)
x.d <- c(x.d, x.n)
y.d <- c(y.d, y.n)
# fit the surrogate and propose the next point
iter <- acquire(x.d, y.d, x.init)
x.n <- iter$x.next
}
For the last iteration, we extract the acquisition function as well as the posterior mean and variance.
ucb <- iter$ucb
f.star <- iter$f.star
f.star.var <- iter$f.star.var
Then we plot the results of the last iteration.
scatterplot(
x.init, y.init, f.star, f.star.var,
data = data.frame(x = x.d, y = y.d)
) +
geom_line(
data = data.frame(x = x.init, y = ucb),
aes(x, y, color = "black")
) +
geom_point(
data = data.frame(x = x.n, y = max(ucb)),
aes(x, y)
)
This worked nicely. While some of the proposed points are spread around the entire domain of $f$ (exploration), most of them cluster around its maximum (exploitation).
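Since the toy function is cheap to evaluate, we can also compare the best point queried during the optimization against the grid maximizer we computed at the beginning:
x.d[which.max(y.d)]        # best point found by Bayesian optimization
x.init[which.max(y.init)]  # ground-truth maximizer on the grid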
The notebook is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
sessionInfo()