# Deep Gaussian Processes with Doubly Stochastic VI

In this notebook, we provide a GPyTorch implementation of deep Gaussian processes, where training and inference are performed using the doubly stochastic variational inference method of Salimbeni and Deisenroth, 2017 (https://arxiv.org/abs/1705.08933), adapted to CG-based inference.

We'll train a simple two-layer deep GP on the `elevators` UCI dataset.

In [1]:
import torch
from torch.nn import Linear
from gpytorch.means import ConstantMean
from gpytorch.kernels import RBFKernel, ScaleKernel
from gpytorch.variational import VariationalStrategy, CholeskyVariationalDistribution
from gpytorch.distributions import MultivariateNormal
from gpytorch.models import AbstractVariationalGP
from gpytorch.mlls import VariationalELBO, AddedLossTerm
from gpytorch.likelihoods import GaussianLikelihood

In [2]:
from gpytorch.models.deep_gps import AbstractDeepGPHiddenLayer, AbstractDeepGP, DeepGaussianLikelihood

## Loading Data

For this example notebook, we'll be using the `elevators` UCI dataset used in the paper. Running the next cell downloads a copy of the dataset that has already been scaled and normalized appropriately. We then split the data 80/20 into a training set and a test set, and standardize the inputs and targets using statistics computed on the training set only.

**Note**: Running the next cell will attempt to download a ~400 KB dataset file to the current directory.

In [3]:
import urllib.request
import os.path
from scipy.io import loadmat
from math import floor
import numpy as np

if not os.path.isfile('elevators.mat'):
    print('Downloading \'elevators\' UCI dataset...')
    urllib.request.urlretrieve('https://drive.google.com/uc?export=download&id=1jhWL3YUHvXIaftia4qeAyDwVxo6j1alk', 'elevators.mat')
    
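data = torch.Tensor(loadmat('elevators.mat')['data'])

# Shuffle the rows with a fixed seed *before* slicing out inputs and targets,
# so that the permutation actually affects X and y.
N = data.shape[0]
np.random.seed(0)
data = data[np.random.permutation(np.arange(N)),:]

X = data[:, :-1]
y = data[:, -1]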

train_n = int(floor(0.8*len(X)))

train_x = X[:train_n, :].contiguous().cuda()
train_y = y[:train_n].contiguous().cuda()

test_x = X[train_n:, :].contiguous().cuda()
test_y = y[train_n:].contiguous().cuda()

mean = train_x.mean(dim=-2, keepdim=True)
std = train_x.std(dim=-2, keepdim=True) + 1e-6
train_x = (train_x - mean) / std
test_x = (test_x - mean) / std

mean,std = train_y.mean(),train_y.std()
train_y = (train_y - mean) / std
test_y = (test_y - mean) / std

Downloading 'elevators' UCI dataset...
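
The standardization at the end of the cell above uses the mean and standard deviation of the training split only, and applies those same statistics to the test split. As a minimal sketch of that idea, here is a plain-Python version on a toy 1-D list (the helper name `split_and_standardize` and the toy values are ours, purely for illustration; the notebook itself works on CUDA tensors):

```python
from statistics import mean, stdev

def split_and_standardize(values, train_fraction=0.8, eps=1e-6):
    """Split a 1-D list into train/test and z-score both with train statistics."""
    train_n = int(train_fraction * len(values))
    train, test = values[:train_n], values[train_n:]

    mu = mean(train)
    # Sample standard deviation (n - 1 in the denominator), plus a small
    # epsilon to avoid dividing by zero for constant features.
    sigma = stdev(train) + eps

    def scale(xs):
        return [(x - mu) / sigma for x in xs]

    # The test split is scaled with the *training* mean and std.
    return scale(train), scale(test), mu, sigma

values = [float(v) for v in range(10)]  # toy data, not the elevators dataset
train, test, mu, sigma = split_and_standardize(values)
print(len(train), len(test))  # 8 2
```

Computing the statistics on the training split alone keeps information about the test set out of the preprocessing.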


In [4]:
from torch.utils.data import TensorDataset, DataLoader
train_dataset = TensorDataset(train_x, train_y)
train_loader = DataLoader(train_dataset, batch_size=1024, shuffle=True)

## Defining hidden GP layers

In GPyTorch, defining a GP involves extending one of our abstract GP models and defining a `forward` method that returns the prior. For deep GPs, things are similar, but there are two abstract classes that must be extended: one for hidden layers and one for the deep GP model itself.

In the next cell, we define an example deep GP hidden layer. This looks very similar to every other variational GP you might define. However, there are a few key differences:

1. Instead of extending `AbstractVariationalGP`, we extend `AbstractDeepGPHiddenLayer`.
2. An `AbstractDeepGPHiddenLayer` needs a number of input dimensions, a number of output dimensions, and a number of samples. This is much like a linear layer in a standard neural network: `input_dims` defines how many inputs this hidden layer expects, and `output_dims` defines how many hidden GPs to create outputs for.
3. In practice, instances of `AbstractDeepGPHiddenLayer` are never called by the user directly. Instead, they are incorporated into a deep GP model (defined later in this notebook). They also behave slightly differently from standard abstract GPs: calling them returns samples from the variational distribution rather than the variational distribution itself.
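
To make the third point concrete, here is a rough plain-Python sketch of what "returning samples" can look like. It assumes, as in the doubly stochastic approach, that samples are drawn independently per input point from the marginals of the variational distribution using the reparameterization `f = mean + std * eps`. The function name and the toy numbers are hypothetical, and this is not the code GPyTorch runs internally:

```python
import random

def sample_layer_output(means, stds, num_samples, rng):
    """Draw num_samples reparameterized samples f = mean + std * eps, eps ~ N(0, 1).

    means and stds hold one marginal mean / standard deviation per input point.
    The result has shape num_samples x num_points (as nested lists).
    """
    return [
        [m + s * rng.gauss(0.0, 1.0) for m, s in zip(means, stds)]
        for _ in range(num_samples)
    ]

rng = random.Random(0)
q_means = [0.5, -1.0, 2.0]  # hypothetical marginal means of q(f) at 3 inputs
q_stds = [0.1, 0.2, 0.0]    # hypothetical marginal standard deviations

samples = sample_layer_output(q_means, q_stds, num_samples=5, rng=rng)
print(len(samples), len(samples[0]))  # 5 3
```

Each of the `num_samples` rows is then fed to the next layer as if it were an ordinary input, which is why the final output of the deep GP is a collection of `num_samples` Gaussians rather than one.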

In [6]:
class ToyDeepGPHiddenLayer(AbstractDeepGPHiddenLayer):
    def __init__(self, input_dims, output_dims, num_inducing=512, num_samples=1):
        inducing_points = torch.randn(output_dims, num_inducing, input_dims)

        variational_distribution = CholeskyVariationalDistribution(
            num_inducing_points=num_inducing,
            batch_size=output_dims
        )

        variational_strategy = VariationalStrategy(
            self,
            inducing_points,
            variational_distribution,
            learn_inducing_locations=True
        )

        super(ToyDeepGPHiddenLayer, self).__init__(variational_strategy,
                                            input_dims,
                                            output_dims,
                                            num_samples=num_samples)

        self.mean_module = ConstantMean(batch_size=output_dims)
        self.covar_module = ScaleKernel(RBFKernel(batch_size=output_dims,
                                                  ard_num_dims=input_dims), batch_size=output_dims,
                                        ard_num_dims=None)
        
        self.linear_layer = Linear(input_dims, 1)


    def forward(self, x):
        mean_x = self.linear_layer(x).squeeze(-1)
        covar_x = self.covar_module(x)
        return MultivariateNormal(mean_x, covar_x)

## Defining the deep GP model

A deep GP model itself consists of two main components:

1. Defining a GP layer that will serve as the output layer.
2. Taking an `AbstractDeepGPHiddenLayer`, or a `torch.nn.Sequential` containing deep GP hidden layers, to call before forwarding through the output layer.

In the next cell, we define an example deep GP. For the most part, this also looks like an `AbstractVariationalGP`, but we need to tell it how many input dimensions to expect (i.e., the output dimensionality of the hidden network), how many output dimensions there are (i.e., the total number of model outputs), and also provide a `hidden_gp_net`.

Typically the `hidden_gp_net` will take the form of a `torch.nn.Sequential` consisting of a number of GP hidden layers.

In [7]:
class ToyDeepGP(AbstractDeepGP):
    def __init__(self, input_dims, output_dims, hidden_gp_net, num_samples, num_inducing=256):
        inducing_points = torch.randn(output_dims, num_inducing, input_dims)

        variational_distribution = CholeskyVariationalDistribution(
            num_inducing_points=num_inducing,
            batch_size=output_dims
        )

        variational_strategy = VariationalStrategy(
            self,
            inducing_points,
            variational_distribution,
            learn_inducing_locations=True
        )
        
        super(ToyDeepGP, self).__init__(variational_strategy,
                                                  input_dims,
                                                  output_dims,
                                                  num_samples,
                                                  hidden_gp_net)
        
        self.mean_module = ConstantMean(batch_size=output_dims)
        self.covar_module = ScaleKernel(RBFKernel(batch_size=output_dims,
                                                  ard_num_dims=input_dims), batch_size=output_dims,
                                        ard_num_dims=None)
    
    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return MultivariateNormal(mean_x, covar_x)

## Building the model

Now that we've defined a class for our hidden layers and a class for our output layer, we can build our deep GP. To do this, we create a hidden GP layer, put it in a `torch.nn.Sequential`, and pass that to an instance of the toy deep GP we just defined above.

In [8]:
num_samples = 5
hidden_layer_size = train_x.size(-1)

hidden_gp = ToyDeepGPHiddenLayer(input_dims=train_x.size(-1),
                                 output_dims=hidden_layer_size,
                                 num_samples=num_samples).cuda()
hidden_net = torch.nn.Sequential(hidden_gp)
# Uncomment these lines to use a 3 layer deep GP instead of a 2 layer deep GP!
# hidden_gp2 = ToyDeepGPHiddenLayer(input_dims=hidden_layer_size,
#                                 output_dims=hidden_layer_size,
#                                 num_samples=num_samples).cuda()
# hidden_net = torch.nn.Sequential(hidden_gp, hidden_gp2)
model = ToyDeepGP(hidden_layer_size, 1, hidden_gp_net=hidden_net, num_samples=num_samples).cuda()

## Likelihood

Because deep GPs rely on internal sampling (even in the stochastic variational setting), we need to handle the likelihood in a slightly different way. In the future, we anticipate `DeepLikelihood` being a general wrapper around an arbitrary likelihood once likelihoods become a little more general purpose, but for now we simply define a `DeepGaussianLikelihood` to use for regression.

In [9]:
likelihood = DeepGaussianLikelihood(num_samples=num_samples).cuda()
mll = VariationalELBO(likelihood, model, train_x.size(-2), combine_terms=False)

## Training the model

The training loop for a deep GP looks similar to a standard GP model with stochastic variational inference, but there are a few differences:

1. Because the output of a deep GP is actually `num_outputs x num_samples` Gaussians rather than a single Gaussian, we need to expand the labels to shape `num_outputs x num_samples x minibatch_size` before calling the ELBO.
2. Because deep GPs involve a few added loss terms and normalize slightly differently, we created the `VariationalELBO` above with `combine_terms=False`. This returns the individual terms separately, which lets us do the extra normalization we need to make the math work out.
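
Both points can be illustrated without GPyTorch. The sketch below expands a minibatch of labels to `output_dims x num_samples x minibatch_size` and then computes a generic doubly stochastic ELBO estimate for a Gaussian likelihood: the expected log likelihood is averaged over samples, rescaled from the minibatch to the full dataset, and the KL term is subtracted. All function names are ours, the numbers are toy values, and the scaling shown is the textbook estimator rather than the exact normalization `VariationalELBO` applies internally:

```python
import math

def expand_labels(y_batch, output_dims, num_samples):
    """Repeat a length-B label list to shape output_dims x num_samples x B."""
    return [[list(y_batch) for _ in range(num_samples)] for _ in range(output_dims)]

def expected_log_lik(y, mean, var, noise):
    """Closed form of E_{f ~ N(mean, var)}[log N(y | f, noise)]."""
    return -0.5 * math.log(2 * math.pi * noise) - ((y - mean) ** 2 + var) / (2 * noise)

def ds_elbo(y_expanded, means, variances, noise, kl_total, num_data):
    """Monte Carlo ELBO estimate for one output dimension.

    y_expanded, means and variances all have shape num_samples x B.
    """
    num_samples, batch_size = len(means), len(means[0])
    total = sum(
        expected_log_lik(y, m, v, noise)
        for y_row, m_row, v_row in zip(y_expanded, means, variances)
        for y, m, v in zip(y_row, m_row, v_row)
    )
    # Average over samples, then rescale the minibatch sum to the full dataset.
    data_term = (num_data / batch_size) * total / num_samples
    return data_term - kl_total

y_batch = [0.0, 1.0]                                   # toy minibatch, B = 2
labels = expand_labels(y_batch, output_dims=1, num_samples=2)
means = [[0.0, 1.0], [0.0, 1.0]]                       # num_samples x B
variances = [[0.0, 0.0], [0.0, 0.0]]

elbo = ds_elbo(labels[0], means, variances, noise=1.0, kl_total=0.5, num_data=4)
print(round(elbo, 4))  # -4.1758
```

In the training cell, the same roles are played by `log_lik` (the data term), `kl_div` and `added_loss` (the regularizers), with the loss taken as the negative ELBO.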

In [14]:
import time

num_epochs = 60

optimizer = torch.optim.Adam([
    {'params': model.parameters()},
    {'params': likelihood.parameters()},
], lr=0.01)

for i in range(num_epochs):
    # Within each epoch, we will go over each minibatch of data
    for minibatch_i, (x_batch, y_batch) in enumerate(train_loader):
        start_time = time.time()
        optimizer.zero_grad()
        output = model(x_batch)
        # Here we handle the fact that the output is actually num_samples Gaussians by expanding the labels.
        y_batch = y_batch.unsqueeze(0).unsqueeze(0).expand(model.output_dims, model.num_samples, -1)
        log_lik, kl_div, _, added_loss = mll(output, y_batch, num_samples=num_samples)

        # Here we do some extra normalization for deep GPs because of the number of samples involved.
        elbo = log_lik * num_samples - kl_div + added_loss.div(mll.num_data)
        loss = -elbo

        print('Epoch %d [%d/%d] - Loss: %.3f - - Time: %.3f' % (i + 1, minibatch_i, len(train_loader), loss.item(), time.time() - start_time))

        loss.backward()
        optimizer.step()


## Make predictions and get an RMSE

The output distribution of a deep GP in this framework is actually a mixture of `num_samples` Gaussians for each output. We get predictions the same way as with all GPyTorch models, but we currently need to do some reshaping to get the means and variances into a convenient form.

SVGP gets an RMSE of around 0.41 after 60 epochs of training, so getting roughly 0.35 from a 2-layer deep GP without much tuning is pretty good!

In [10]:
preds = likelihood(model(test_x))

# Here, model.output_dims is just 1, but the reshape below is general.
predictive_means = preds.mean.reshape(model.output_dims, num_samples, -1)
predictive_variances = preds.variance.reshape(model.output_dims, num_samples, -1)

rmse = torch.mean(torch.pow(predictive_means[0].mean(0) - test_y, 2)).sqrt()
print(rmse)

tensor(0.3430, device='cuda:0', grad_fn=<SqrtBackward>)
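
The cell above only averages the component means to get a point prediction. If a predictive variance for the mixture is also needed, it can be obtained from the per-sample means and variances with the law of total variance. The following is a minimal plain-Python sketch under the assumption that the `num_samples` components are equally weighted; the helper names and the toy numbers are ours and are not part of GPyTorch:

```python
import math

def mixture_moments(means, variances):
    """Mean and variance of an equally weighted Gaussian mixture, per test point.

    means and variances have shape num_samples x num_points (nested lists).
    """
    num_samples = len(means)
    num_points = len(means[0])
    mix_mean, mix_var = [], []
    for j in range(num_points):
        m = sum(row[j] for row in means) / num_samples
        # E[f^2] of the mixture is the average of (variance + mean^2).
        second_moment = sum(
            v_row[j] + m_row[j] ** 2 for m_row, v_row in zip(means, variances)
        ) / num_samples
        mix_mean.append(m)
        mix_var.append(second_moment - m ** 2)
    return mix_mean, mix_var

def rmse(pred, target):
    """Root mean squared error between two equal-length lists."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))

# Toy example: 2 samples, 2 test points.
toy_means = [[0.0, 1.0], [2.0, 1.0]]
toy_vars = [[1.0, 0.5], [1.0, 0.5]]

mix_mean, mix_var = mixture_moments(toy_means, toy_vars)
print(mix_mean, mix_var)  # [1.0, 1.0] [2.0, 0.5]
print(rmse(mix_mean, [1.0, 2.0]))
```

Note how the first test point ends up with a larger mixture variance than either component: the disagreement between the component means adds to the uncertainty.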
