## Bayesian Hyperparameter Optimization

Because the cLSTM is sensitive to its hyperparameters, we use Bayesian optimization to search for a high-performing set of training and regularization hyperparameter values. Bayesian optimization is a natural framework for hyperparameter search: it is designed for optimization under uncertainty, particularly for noisy black-box functions that are expensive to evaluate, such as training and scoring a neural network.

Under the Bayesian optimization framework, we wish to identify a set of hyperparameters $\theta^\ast$ such that

$$
\theta^\ast \approx \arg\max_\theta \sigma(f(\mathcal{D};\theta))
$$

where $f$ is the model whose hyperparameters we wish to tune, trained on data $\mathcal{D}$, and $\sigma(\cdot)$ is a scoring model on $f$. In other words, we wish to find the hyperparameters $\theta$ that maximize the model's score (defined below).

First, we'll load in the data and prepare it for training:

In [13]:
import torch
import numpy as np
from scipy.stats import beta
from utils import load_data

from bayes_opt import BayesianOptimization
from bayes_opt.observer import JSONLogger
from bayes_opt.event import Events

from clstm import cLSTM, train_model_gista, train_model_adam

In [9]:
mice = load_data()

In [10]:
mouse2 = mice[2]

In [11]:
# percent reads / times
mouse2_pct = mouse2['reads_percent']
mouse2_abs = mouse2['reads_abs_mean']

In [12]:
IMP_READS = np.arange(20)
top_reads_pct = mouse2_pct[IMP_READS, :].T
top_reads_abs = mouse2_abs[IMP_READS, :].T

mean_abs = np.mean(top_reads_abs, axis=0)
std_abs = np.std(top_reads_abs, axis=0)
top_reads_abs = (top_reads_abs - mean_abs) / std_abs

X_torch_pct = torch.tensor(top_reads_pct[np.newaxis], dtype=torch.float32)
X_torch_abs = torch.tensor(top_reads_abs[np.newaxis], dtype=torch.float32)
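
The standardization and batching step above can be illustrated on a toy array. This is a minimal sketch with made-up numbers rather than the mouse data, and it uses only NumPy (the conversion to a `torch` tensor is left out):

```python
import numpy as np

# Toy stand-in for the abundance matrix: 5 time points x 3 series (made-up values).
toy = np.array([
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 300.0],
    [3.0, 30.0, 200.0],
    [4.0, 40.0, 500.0],
    [5.0, 50.0, 400.0],
])

# Standardize each column (series) over time, as is done for top_reads_abs.
standardized = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# Add a leading batch dimension: (time, series) -> (1, time, series).
batched = standardized[np.newaxis]

print(standardized.mean(axis=0))  # approximately 0 for every column
print(standardized.std(axis=0))   # approximately 1 for every column
print(batched.shape)
```

Standardizing per column puts every series on a comparable scale regardless of its absolute abundance.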

Now, let's define the score model $\sigma$. The score function is tricky in this context because we do not have a concrete way of quantifying how well a model is doing. Since we're interested in the Granger Causality coefficients and not the overall predictive power, using the MSE loss alone would not give us the results we want. However, we can encode the belief that we expect there to be some causal links between bacteria. We also believe that causal links should generally be more prevalent along the diagonal, that is, a series' own past should tend to help predict its future. This led us to develop the following score function:

$$
\sigma(C) = \text{Beta}(\mathbb{E}[C]; \alpha, \beta) + \mathbb{E}[\text{diag}\{C\}] \cdot \left(1 - \prod_i \mathbb{1}_{c_i = 1}\right)
$$

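where we use $\alpha = \beta = 1.6$, and $C \in \{0, 1\}^{N \times N}$ is the collection of Granger Causality terms. The product of indicators equals 1 only when every entry of $C$ is 1, so the diagonal bonus is withheld from the degenerate all-ones solution. In other words, we reward models with a mixture of zero and non-zero GC terms, and we further reward models with non-zero diagonal GC terms.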

This function is imprecise, but it allows us to run the optimization and find models that provide interesting results. In practice, non-linear models are extremely difficult to train, and most end up with all non-zero or all zero GC values. Thus, any model that gets a non-zero score with the above function is potentially of value to us. Whether or not it is precisely calibrated, the score helps us discover useful hyperparameter configurations.
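
To make the behavior of the score concrete, here is a small sketch that evaluates it on a few hand-built binary matrices. The Beta density is written out with `math.gamma` so the example depends only on NumPy; it should agree with `scipy.stats.beta(1.6, 1.6).pdf`. The names `beta_pdf` and `gc_score` and the toy matrices are illustrative assumptions, not part of the notebook.

```python
import math
import numpy as np

ALPHA = BETA = 1.6  # values stated in the text

def beta_pdf(x, a=ALPHA, b=BETA):
    """Density of Beta(a, b) at x in [0, 1], written out via the gamma function."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)

def gc_score(c):
    """Score a binary GC matrix: Beta density of its mean plus a diagonal bonus."""
    c = np.asarray(c, dtype=float)
    mean = float(c.mean())
    not_all_ones = float(mean != 1.0)  # the bonus is withheld for the all-ones matrix
    return beta_pdf(mean) + float(c.diagonal().mean()) * not_all_ones

n = 4
cases = {
    "all zeros": np.zeros((n, n)),
    "all ones": np.ones((n, n)),
    "diagonal only": np.eye(n),
    "off-diagonal only": np.ones((n, n)) - np.eye(n),
}
scores = {name: gc_score(c) for name, c in cases.items()}
for name, value in scores.items():
    print(f"{name:>18}: {value:.3f}")
```

Both degenerate matrices score zero. The diagonal-only and off-diagonal-only matrices receive the same Beta term because the density is symmetric when $\alpha = \beta$, so they differ only by the diagonal bonus.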

In [14]:
def get_gc_score(gc):
    score = beta(a=1.6, b=1.6).pdf(gc.mean())
    score += gc.diagonal().mean() * (gc.mean() != 1.)
    return score

In [22]:
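heatmaps = []       # raw GC matrices, one per evaluated configuration
eval_heatmaps = []  # also appended to inside evaluate(); defined here so the cell runs on its own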
def evaluate(n_hidden, lr_scale, lam_scale, lam_ridge_scale, truncation, data=X_torch_abs, max_iter=2500):
    """ 
    Evaluate a given set of hyperparameters by training a model for
    `max_iter` iterations (2500 by default) and returning a score.

    This is the black-box function that our Bayesian optimization algorithm
    will attempt to maximize.
    """
    # transform continuous values into valid hyperparameters
    n_hidden = int(n_hidden + .5)
    truncation = int(truncation + .5)
    lr = 10**lr_scale
    lam = 10**lam_scale
    lam_ridge = 10**lam_ridge_scale
    
    # train the model
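    # number of series, read from the last data dimension
    # (assumes data has shape (batch, time, series))
    gcmodel = cLSTM(data.shape[-1], n_hidden)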
    gcmodel.to('cuda')
    train_loss_list, train_mse_list = train_model_gista(
        gcmodel,
        data.to('cuda'),
        lam=lam,
        lam_ridge=lam_ridge,
        lr=lr,
        max_iter=max_iter,
        check_every=100,
        truncation=truncation,
        verbose=1
    )
    gc = gcmodel.GC(threshold=False).cpu().detach().numpy()
    heatmaps.append(gc)
    eval_heatmaps.append(gc)
    gc_thresh = (heatmaps[-1] > 0).astype('float')
    
    # return the resulting model's score
    return get_gc_score(gc_thresh)

Here, we define the range of hyperparameters which the algorithm should search over. We optimize the following ranges:

- **\# of hidden LSTM nodes:** min: 10, max: 256
- **Learning rate:** min: 0.00001, max: 0.01 (searched on a log10 scale)
- **GC sparsity penalty:** min: 0.000001, max: 1 (searched on a log10 scale)
- **LSTM Ridge term:** min: 0.000001, max: 1 (searched on a log10 scale)
- **Time series window:** min: 3, max: 20
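
Because three of these ranges are searched on a log10 scale, the bounds handed to the optimizer are exponents rather than raw values. The conversion can be sketched as follows; `log10_bounds` is a hypothetical helper introduced only for illustration, and the sampled exponent is an arbitrary example value.

```python
import math

# Ranges quoted in the list above, as (min, max) in natural units.
log_scale_ranges = {
    "lr_scale": (1e-5, 1e-2),
    "lam_scale": (1e-6, 1.0),
    "lam_ridge_scale": (1e-6, 1.0),
}

def log10_bounds(lo, hi):
    """Hypothetical helper: turn a (min, max) range into log10 exponent bounds."""
    return (math.log10(lo), math.log10(hi))

exponent_bounds = {
    name: log10_bounds(lo, hi) for name, (lo, hi) in log_scale_ranges.items()
}
for name, bounds in exponent_bounds.items():
    print(name, bounds)

# An exponent proposed by the optimizer is mapped back with 10 ** exponent,
# the same transformation evaluate() applies.
sampled_exponent = -3.5  # arbitrary example value
learning_rate = 10 ** sampled_exponent
print(learning_rate)
```

Searching over exponents gives each order of magnitude equal weight, which suits parameters such as learning rates and penalties that span several decades.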

In [16]:
pbounds = {
    'n_hidden': (10, 256),
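    'lr_scale': (-5, -2),         # (min, max) of log10 learning rate
    'lam_scale': (-6, 0),         # (min, max) of log10 GC sparsity penalty
    'lam_ridge_scale': (-6, 0),   # (min, max) of log10 ridge penalty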
    'truncation': (3, 20)
}

In [17]:
optimizer = BayesianOptimization(
    f=evaluate,
    pbounds=pbounds,
    verbose=0
)

We then optimize using the Expected Improvement (EI) acquisition function, which chooses the next hyperparameters at the point in hyperparameter space where the surrogate model predicts the highest expected improvement over the maximum score we have seen so far.
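
For intuition, the standard closed-form Expected Improvement under a Gaussian posterior can be sketched in a few lines. This is the textbook formula for maximization, not necessarily the exact code `bayes_opt` runs; the exploration offset `xi` and the example numbers are assumptions chosen for illustration.

```python
import math

def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, best, xi=0.0):
    """EI for maximization, given a posterior N(mu, sigma^2) at a candidate point."""
    improvement = mu - best - xi
    if sigma <= 0.0:
        # No uncertainty: the improvement is deterministic.
        return max(improvement, 0.0)
    z = improvement / sigma
    return improvement * normal_cdf(z) + sigma * normal_pdf(z)

best_so_far = 1.0
# Candidate A: predicted slightly below the best score, with low uncertainty.
ei_a = expected_improvement(mu=0.9, sigma=0.05, best=best_so_far)
# Candidate B: predicted further below the best score, but highly uncertain.
ei_b = expected_improvement(mu=0.7, sigma=0.5, best=best_so_far)
print(ei_a, ei_b)
```

Candidate B has the lower predicted mean but the larger EI, which shows how the acquisition function trades off exploiting good predictions against exploring uncertain regions.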

Since each point is extremely expensive to evaluate, we evaluate 5 initial random points and then 12 points selected by our acquisition function, saving the results in a `.json` file.

In [240]:
logger = JSONLogger(path="./logs.json")
optimizer.subscribe(Events.OPTMIZATION_STEP, logger)

In [None]:
optimizer.maximize(
    init_points=5,
    n_iter=12,
    acq='ei'
)

In [21]:
optimizer.max

{'target': 1.9629146090709826,
 'params': {'lam_ridge_scale': -5.420411491727301,
  'lam_scale': -0.8248482248774252,
  'lr_scale': -2.4178290417522756,
  'n_hidden': 208.61825702254285,
  'truncation': 8.555859902725139}}

The best model we found had a score of `1.96`, a value that is hard to interpret on its own. The optimizer selected a very small LSTM ridge penalty ($10^{-5.42} \approx 0.0000038$) and a relatively large GC sparsity penalty ($10^{-0.82} \approx 0.15$), as well as a window size of $9$ and $209$ hidden LSTM units, trained with a learning rate of $10^{-2.42} \approx 0.0038$. In our experiments, we found this hyperparameter configuration did indeed give us well-performing models whose GC values were relatively consistent across experiments.
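
The values quoted above follow from applying the same transformations as `evaluate` to the raw optimizer output. A short sketch, where `decode` is a hypothetical helper and the parameter values are copied from the `optimizer.max` output:

```python
# Raw parameters copied from the optimizer.max output above.
best_params = {
    "lam_ridge_scale": -5.420411491727301,
    "lam_scale": -0.8248482248774252,
    "lr_scale": -2.4178290417522756,
    "n_hidden": 208.61825702254285,
    "truncation": 8.555859902725139,
}

def decode(params):
    """Hypothetical helper mirroring the transformations inside evaluate()."""
    return {
        "n_hidden": int(params["n_hidden"] + 0.5),      # round to nearest integer
        "truncation": int(params["truncation"] + 0.5),  # round to nearest integer
        "lr": 10 ** params["lr_scale"],
        "lam": 10 ** params["lam_scale"],
        "lam_ridge": 10 ** params["lam_ridge_scale"],
    }

decoded = decode(best_params)
for name, value in decoded.items():
    print(f"{name}: {value:.3g}")
```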