# Large-Scale Stochastic Variational GP Regression (CUDA)

## Overview

In this notebook, we give an overview of how to use stochastic variational GP (SVGP) regression (https://arxiv.org/pdf/1411.2005.pdf) to train rapidly with minibatches on the `3droad` UCI dataset, which has hundreds of thousands of training examples.

In [1]:
import math
import torch
import gpytorch
from matplotlib import pyplot as plt

# Make plots inline
%matplotlib inline

## Loading Data

For this example notebook, we'll be using the `3droad` UCI dataset. Running the next cell downloads a copy of the dataset that has already been scaled and normalized appropriately; the cell additionally rescales each input feature to the range [-1, 1]. We split the data in order, using the first 80% as training data and the last 20% as test data.

**Note**: Running the next cell will attempt to download a **~136 MB** file to the current directory.

In [2]:
import urllib.request
import os.path
from scipy.io import loadmat
from math import floor

if not os.path.isfile('3droad.mat'):
    print('Downloading \'3droad\' UCI dataset...')
    urllib.request.urlretrieve('https://www.dropbox.com/s/f6ow1i59oqx05pl/3droad.mat?dl=1', '3droad.mat')
    
data = torch.Tensor(loadmat('3droad.mat')['data'])
X = data[:, :-1]
X = X - X.min(0)[0]
X = 2 * (X / X.max(0)[0]) - 1
y = data[:, -1]

# Use the first 80% of the data for training, and the last 20% for testing.
train_n = int(floor(0.8*len(X)))

train_x = X[:train_n, :].contiguous().cuda()
train_y = y[:train_n].contiguous().cuda()

test_x = X[train_n:, :].contiguous().cuda()
test_y = y[train_n:].contiguous().cuda()
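
To make the preprocessing arithmetic in the cell above concrete, here is a small dependency-free sketch of the same two steps: shifting and scaling each feature column to [-1, 1], then taking an ordered 80/20 split. The helper name `scale_columns` and the toy values are made up for illustration and are not taken from `3droad`; the notebook itself performs these steps with tensor operations.

```python
import math


def scale_columns(rows):
    """Rescale every column of a list of rows to the range [-1, 1]."""
    cols = list(zip(*rows))
    mins = [min(c) for c in cols]
    # Width of each column after its minimum has been subtracted.
    spans = [max(c) - lo for c, lo in zip(cols, mins)]
    return [
        [2 * ((v - lo) / span) - 1 for v, lo, span in zip(row, mins, spans)]
        for row in rows
    ]


# Toy data (hypothetical values): 5 rows, 2 features.
rows = [[0.0, 10.0], [1.0, 20.0], [2.0, 30.0], [3.0, 40.0], [4.0, 50.0]]
scaled = scale_columns(rows)

# Ordered 80/20 split, mirroring train_n = int(floor(0.8 * len(X))).
train_n = int(math.floor(0.8 * len(scaled)))
train_rows, test_rows = scaled[:train_n], scaled[train_n:]

print(scaled[0], scaled[-1])
print(len(train_rows), len(test_rows))
```

Note that a column whose values are all identical would have a span of zero and cause a division by zero, both in this sketch and in the tensor version above.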

## Creating a DataLoader

The next step is to create a torch `DataLoader` that will handle getting us random minibatches of data. This involves using the standard `TensorDataset` and `DataLoader` modules provided by PyTorch.

In this notebook we use a fairly large batch size of 1024 to make optimization run faster, but you can of course change this.

In [3]:
from torch.utils.data import TensorDataset, DataLoader
train_dataset = TensorDataset(train_x, train_y)
train_loader = DataLoader(train_dataset, batch_size=1024, shuffle=True)

test_dataset = TensorDataset(test_x, test_y)
test_loader = DataLoader(test_dataset, batch_size=1024, shuffle=False)
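
The batch size determines how many optimizer steps make up one epoch (the training log in this notebook reports 340 minibatches per epoch). The sketch below shows the arithmetic for a loader that keeps the final partial batch, which is the `DataLoader` default (`drop_last=False`). The function names and the training-set size of 100,000 are hypothetical and chosen only for illustration.

```python
import math


def num_minibatches(n_examples, batch_size, drop_last=False):
    """Number of batches yielded per epoch for a given dataset size."""
    if drop_last:
        return n_examples // batch_size
    return math.ceil(n_examples / batch_size)


def last_batch_size(n_examples, batch_size):
    """Size of the final (possibly partial) batch when drop_last is False."""
    remainder = n_examples % batch_size
    return remainder if remainder else batch_size


# Hypothetical training-set size, used only to illustrate the arithmetic.
n_train = 100_000
print(num_minibatches(n_train, 1024))
print(num_minibatches(n_train, 1024, drop_last=True))
print(last_batch_size(n_train, 1024))
```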

## Defining the SVGP Model

We now define the GP regression model. Because we are doing variational inference and *not* exact inference, we subclass `AbstractVariationalGP`. Because we want to learn the inducing point locations, we use a `HalfWhitenedVariationalStrategy` with `learn_inducing_locations=True`, together with a `CholeskyVariationalDistribution` over the inducing values.

The inducing points are initialized to the first 500 training inputs.

In [4]:
from gpytorch.models import AbstractVariationalGP
from gpytorch.variational import CholeskyVariationalDistribution
from gpytorch.variational import HalfWhitenedVariationalStrategy

class GPModel(AbstractVariationalGP):
    def __init__(self, inducing_points):
        variational_distribution = CholeskyVariationalDistribution(inducing_points.size(0))
        variational_strategy = HalfWhitenedVariationalStrategy(self, inducing_points, variational_distribution, learn_inducing_locations=True)
        super(GPModel, self).__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())
        
    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

inducing_points = train_x[:500, :]
model = GPModel(inducing_points=inducing_points).cuda()
likelihood = gpytorch.likelihoods.GaussianLikelihood().cuda()
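
As a rough guide to what the variational part of this model has to learn, the sketch below counts the free parameters of a Gaussian variational distribution q(u) = N(m, L Lᵀ) with a lower-triangular Cholesky factor L, plus the learned inducing locations. This is a count for the mathematical parameterization only; how a library actually stores these tensors may differ. The function name is hypothetical, and `input_dim=2` is a placeholder rather than a statement about `3droad`.

```python
def svgp_parameter_counts(num_inducing, input_dim):
    """Count free parameters of q(u) = N(m, L L^T) and of the inducing locations."""
    return {
        # One mean entry per inducing point.
        "variational_mean": num_inducing,
        # Entries on and below the diagonal of the Cholesky factor L.
        "cholesky_factor": num_inducing * (num_inducing + 1) // 2,
        # One input-space coordinate vector per inducing point.
        "inducing_locations": num_inducing * input_dim,
    }


# 500 inducing points as in the cell above; input_dim=2 is a placeholder.
counts = svgp_parameter_counts(num_inducing=500, input_dim=2)
print(counts)
print(sum(counts.values()))
```

The quadratic growth of the Cholesky factor is one reason the number of inducing points is usually kept in the hundreds or low thousands.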

## Training the Model

The cell below trains the model above, learning the kernel and likelihood hyperparameters of the Gaussian process together with the variational parameters and the inducing point locations by maximizing the ELBO.

Unlike when using the exact GP marginal log likelihood, performing variational inference allows us to make use of stochastic optimization techniques. For this example, we'll do four epochs of training.

The optimization loop differs from the one seen in our simpler tutorials in that it involves looping over both a number of training iterations (epochs) *and* minibatches of the data. However, the basic process is the same: for each minibatch, we forward through the model, compute the loss (the negative ELBO, via `VariationalELBO`), call backward, and do a step of optimization.
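
For reference, the objective being optimized can be sketched as follows. The notation is introduced here for illustration: $N$ is the number of training points, $\mathbf{u}$ are the function values at the inducing points, $q(\mathbf{u})$ is the variational distribution, and $B$ is a minibatch of indices. The full-data ELBO is

$$
\mathcal{L} = \sum_{i=1}^{N} \mathbb{E}_{q(f_i)}\left[\log p(y_i \mid f_i)\right] - \mathrm{KL}\left[q(\mathbf{u}) \,\|\, p(\mathbf{u})\right],
$$

and because the first term is a sum over data points, it has an unbiased minibatch estimate

$$
\hat{\mathcal{L}}_B = \frac{N}{|B|} \sum_{i \in B} \mathbb{E}_{q(f_i)}\left[\log p(y_i \mid f_i)\right] - \mathrm{KL}\left[q(\mathbf{u}) \,\|\, p(\mathbf{u})\right].
$$

The `num_data` argument passed to `VariationalELBO` supplies $N$ for this rescaling. An implementation may also rescale the whole objective by a constant (for example, dividing by $N$), which changes the printed loss values but not the location of the optimum. The loss in this notebook additionally includes a log-prior term over hyperparameters.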

In [None]:
from time import time

model.train()
likelihood.train()

# We'll do 4 epochs of training in this tutorial
num_epochs = 4

# We use Adam here, with a single learning rate for the model and likelihood parameters
optimizer = torch.optim.Adam([
    {'params': model.parameters()},
    {'params': likelihood.parameters()},
], lr=0.01)

# Multiply the learning rate by 0.1 after epochs 3 and 5
# (with num_epochs = 4, only the first milestone is reached)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[3, 5], gamma=0.1)

# Our loss object. We're using the VariationalELBO, which essentially just computes the ELBO
mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=train_y.size(0), combine_terms=False)

for i in range(num_epochs):
    # Within each epoch, we will go over each minibatch of data
    for minibatch_i, (x_batch, y_batch) in enumerate(train_loader):
        st = time()
        optimizer.zero_grad()
        output = model(x_batch)
        # with combine_terms=False, we get the terms of the ELBO separated so we can print them individually if we'd like.
        # loss = -mll(output, y_batch) would also work.
        log_lik, kl_div, log_prior = mll(output, y_batch)
        loss = -(log_lik - kl_div + log_prior)
        print(
            f'Epoch {i + 1} [{minibatch_i}/{len(train_loader)}]'
            f' - Loss: {loss.item():.3f} [{log_lik.item():.3f}, {kl_div.item():.3f}, {log_prior.item():.3f}]'
            f'- Time: {time() - st:.3f}'
        )

        loss.backward()
        optimizer.step()

    # Step the scheduler once per epoch, after the optimizer updates
    # (recent PyTorch versions warn if scheduler.step() runs before optimizer.step())
    scheduler.step()

Epoch 1 [0/340] - Loss: 233.928 [-233.928, 0.000, 0.000]- Time: 0.648
Epoch 1 [1/340] - Loss: 269.385 [-269.384, 0.000, 0.000]- Time: 0.023
Epoch 1 [2/340] - Loss: 227.632 [-227.632, 0.000, 0.000]- Time: 0.026
Epoch 1 [3/340] - Loss: 254.816 [-254.816, 0.000, 0.000]- Time: 0.023
Epoch 1 [4/340] - Loss: 245.082 [-245.082, 0.000, 0.000]- Time: 0.024
Epoch 1 [5/340] - Loss: 237.467 [-237.467, 0.000, 0.000]- Time: 0.025
Epoch 1 [6/340] - Loss: 244.266 [-244.266, 0.000, 0.000]- Time: 0.023
Epoch 1 [7/340] - Loss: 229.941 [-229.941, 0.000, 0.000]- Time: 0.028
Epoch 1 [8/340] - Loss: 252.439 [-252.439, 0.000, 0.000]- Time: 0.027
Epoch 1 [9/340] - Loss: 213.721 [-213.721, 0.000, 0.000]- Time: 0.026
Epoch 1 [10/340] - Loss: 239.035 [-239.035, 0.000, 0.000]- Time: 0.027
Epoch 1 [11/340] - Loss: 227.559 [-227.559, 0.000, 0.000]- Time: 0.023
Epoch 1 [12/340] - Loss: 230.745 [-230.745, 

Epoch 1 [115/340] - Loss: 90.308 [-90.305, 0.003, 0.000]- Time: 0.024
Epoch 1 [116/340] - Loss: 98.171 [-98.168, 0.003, 0.000]- Time: 0.023
Epoch 1 [117/340] - Loss: 92.716 [-92.713, 0.003, 0.000]- Time: 0.022
Epoch 1 [118/340] - Loss: 100.603 [-100.600, 0.003, 0.000]- Time: 0.025
Epoch 1 [119/340] - Loss: 86.759 [-86.756, 0.003, 0.000]- Time: 0.025
Epoch 1 [120/340] - Loss: 93.563 [-93.559, 0.003, 0.000]- Time: 0.024
Epoch 1 [121/340] - Loss: 86.456 [-86.453, 0.003, 0.000]- Time: 0.026
Epoch 1 [122/340] - Loss: 91.948 [-91.945, 0.003, 0.000]- Time: 0.023
Epoch 1 [123/340] - Loss: 89.140 [-89.137, 0.003, 0.000]- Time: 0.023
Epoch 1 [124/340] - Loss: 84.679 [-84.675, 0.003, 0.000]- Time: 0.023
Epoch 1 [125/340] - Loss: 85.214 [-85.211, 0.003, 0.000]- Time: 0.024
Epoch 1 [126/340] - Loss: 84.209 [-84.205, 0.004, 0.000]- Time: 0.026
Epoch 1 [127/340] - Loss: 86.704 [-86.701, 0.004, 0.000]- Time: 0.023
Epoch 1 [128/340] - Loss: 87.766 [-87.763, 0.004, 0.000]- Time: 0.031
Epoch 1 [129/340] 

Epoch 1 [233/340] - Loss: 59.752 [-59.746, 0.006, 0.000]- Time: 0.026
Epoch 1 [234/340] - Loss: 59.141 [-59.136, 0.006, 0.000]- Time: 0.025
Epoch 1 [235/340] - Loss: 57.082 [-57.076, 0.006, 0.000]- Time: 0.023
Epoch 1 [236/340] - Loss: 57.988 [-57.982, 0.006, 0.000]- Time: 0.023
Epoch 1 [237/340] - Loss: 54.716 [-54.710, 0.006, 0.000]- Time: 0.022
Epoch 1 [238/340] - Loss: 57.793 [-57.787, 0.006, 0.000]- Time: 0.022
Epoch 1 [239/340] - Loss: 56.254 [-56.248, 0.006, 0.000]- Time: 0.025
Epoch 1 [240/340] - Loss: 57.257 [-57.251, 0.006, 0.000]- Time: 0.025
Epoch 1 [241/340] - Loss: 54.162 [-54.156, 0.006, 0.000]- Time: 0.023
Epoch 1 [242/340] - Loss: 63.951 [-63.945, 0.006, 0.000]- Time: 0.026
Epoch 1 [243/340] - Loss: 56.490 [-56.484, 0.006, 0.000]- Time: 0.025
Epoch 1 [244/340] - Loss: 52.663 [-52.657, 0.006, 0.000]- Time: 0.024
Epoch 1 [245/340] - Loss: 61.246 [-61.240, 0.006, 0.000]- Time: 0.028
Epoch 1 [246/340] - Loss: 49.133 [-49.127, 0.006, 0.000]- Time: 0.023
Epoch 1 [247/340] - 

Epoch 2 [12/340] - Loss: 43.317 [-43.310, 0.007, 0.000]- Time: 0.026
Epoch 2 [13/340] - Loss: 40.206 [-40.199, 0.007, 0.000]- Time: 0.024
Epoch 2 [14/340] - Loss: 39.262 [-39.254, 0.007, 0.000]- Time: 0.023
Epoch 2 [15/340] - Loss: 45.068 [-45.061, 0.007, 0.000]- Time: 0.026
Epoch 2 [16/340] - Loss: 44.956 [-44.949, 0.007, 0.000]- Time: 0.027
Epoch 2 [17/340] - Loss: 46.324 [-46.317, 0.007, 0.000]- Time: 0.025
Epoch 2 [18/340] - Loss: 44.176 [-44.169, 0.007, 0.000]- Time: 0.024
Epoch 2 [19/340] - Loss: 43.007 [-43.000, 0.007, 0.000]- Time: 0.026
Epoch 2 [20/340] - Loss: 45.275 [-45.268, 0.007, 0.000]- Time: 0.024
Epoch 2 [21/340] - Loss: 42.845 [-42.838, 0.007, 0.000]- Time: 0.023
Epoch 2 [22/340] - Loss: 39.850 [-39.842, 0.007, 0.000]- Time: 0.027
Epoch 2 [23/340] - Loss: 46.319 [-46.312, 0.007, 0.000]- Time: 0.023
Epoch 2 [24/340] - Loss: 42.549 [-42.541, 0.007, 0.000]- Time: 0.025
Epoch 2 [25/340] - Loss: 41.131 [-41.123, 0.007, 0.000]- Time: 0.024
Epoch 2 [26/340] - Loss: 44.172 [-

Epoch 2 [131/340] - Loss: 34.932 [-34.924, 0.008, 0.000]- Time: 0.025
Epoch 2 [132/340] - Loss: 36.563 [-36.555, 0.008, 0.000]- Time: 0.022
Epoch 2 [133/340] - Loss: 35.927 [-35.919, 0.008, 0.000]- Time: 0.023
Epoch 2 [134/340] - Loss: 35.672 [-35.663, 0.008, 0.000]- Time: 0.024
Epoch 2 [135/340] - Loss: 38.008 [-38.000, 0.008, 0.000]- Time: 0.024
Epoch 2 [136/340] - Loss: 36.324 [-36.316, 0.008, 0.000]- Time: 0.024
Epoch 2 [137/340] - Loss: 34.023 [-34.014, 0.008, 0.000]- Time: 0.026
Epoch 2 [138/340] - Loss: 34.854 [-34.845, 0.008, 0.000]- Time: 0.026
Epoch 2 [139/340] - Loss: 34.062 [-34.053, 0.008, 0.000]- Time: 0.025
Epoch 2 [140/340] - Loss: 36.162 [-36.154, 0.008, 0.000]- Time: 0.024
Epoch 2 [141/340] - Loss: 34.205 [-34.196, 0.008, 0.000]- Time: 0.024
Epoch 2 [142/340] - Loss: 35.150 [-35.142, 0.008, 0.000]- Time: 0.024
Epoch 2 [143/340] - Loss: 39.445 [-39.437, 0.008, 0.000]- Time: 0.024
Epoch 2 [144/340] - Loss: 35.458 [-35.450, 0.008, 0.000]- Time: 0.029
Epoch 2 [145/340] - 

Epoch 2 [250/340] - Loss: 28.824 [-28.815, 0.009, 0.000]- Time: 0.022
Epoch 2 [251/340] - Loss: 33.220 [-33.211, 0.009, 0.000]- Time: 0.023
Epoch 2 [252/340] - Loss: 33.410 [-33.401, 0.009, 0.000]- Time: 0.022
Epoch 2 [253/340] - Loss: 35.368 [-35.359, 0.009, 0.000]- Time: 0.024
Epoch 2 [254/340] - Loss: 31.816 [-31.807, 0.009, 0.000]- Time: 0.023
Epoch 2 [255/340] - Loss: 32.536 [-32.527, 0.009, 0.000]- Time: 0.026
Epoch 2 [256/340] - Loss: 32.131 [-32.122, 0.009, 0.000]- Time: 0.027
Epoch 2 [257/340] - Loss: 35.063 [-35.054, 0.009, 0.000]- Time: 0.026
Epoch 2 [258/340] - Loss: 31.521 [-31.512, 0.009, 0.000]- Time: 0.027
Epoch 2 [259/340] - Loss: 32.428 [-32.419, 0.009, 0.000]- Time: 0.024
Epoch 2 [260/340] - Loss: 31.248 [-31.239, 0.009, 0.000]- Time: 0.028
Epoch 2 [261/340] - Loss: 31.892 [-31.883, 0.009, 0.000]- Time: 0.028
Epoch 2 [262/340] - Loss: 29.535 [-29.525, 0.009, 0.000]- Time: 0.023
Epoch 2 [263/340] - Loss: 33.939 [-33.930, 0.009, 0.000]- Time: 0.028
Epoch 2 [264/340] - 

Epoch 3 [29/340] - Loss: 29.979 [-29.969, 0.010, 0.000]- Time: 0.026
Epoch 3 [30/340] - Loss: 27.879 [-27.869, 0.010, 0.000]- Time: 0.026
Epoch 3 [31/340] - Loss: 27.717 [-27.707, 0.010, 0.000]- Time: 0.024
Epoch 3 [32/340] - Loss: 28.116 [-28.106, 0.010, 0.000]- Time: 0.024
Epoch 3 [33/340] - Loss: 29.994 [-29.983, 0.010, 0.000]- Time: 0.027
Epoch 3 [34/340] - Loss: 30.252 [-30.242, 0.010, 0.000]- Time: 0.023
Epoch 3 [35/340] - Loss: 31.304 [-31.294, 0.010, 0.000]- Time: 0.025
Epoch 3 [36/340] - Loss: 30.255 [-30.245, 0.010, 0.000]- Time: 0.024
Epoch 3 [37/340] - Loss: 29.974 [-29.964, 0.010, 0.000]- Time: 0.027
Epoch 3 [38/340] - Loss: 28.745 [-28.735, 0.010, 0.000]- Time: 0.024
Epoch 3 [39/340] - Loss: 26.900 [-26.890, 0.010, 0.000]- Time: 0.027
Epoch 3 [40/340] - Loss: 26.331 [-26.321, 0.010, 0.000]- Time: 0.025
Epoch 3 [41/340] - Loss: 28.628 [-28.617, 0.010, 0.000]- Time: 0.023
Epoch 3 [42/340] - Loss: 28.648 [-28.638, 0.010, 0.000]- Time: 0.023
Epoch 3 [43/340] - Loss: 29.083 [-

Epoch 3 [149/340] - Loss: 25.331 [-25.320, 0.011, 0.000]- Time: 0.027
Epoch 3 [150/340] - Loss: 26.010 [-25.999, 0.011, 0.000]- Time: 0.022
Epoch 3 [151/340] - Loss: 26.932 [-26.921, 0.011, 0.000]- Time: 0.027
Epoch 3 [152/340] - Loss: 26.978 [-26.968, 0.011, 0.000]- Time: 0.022
Epoch 3 [153/340] - Loss: 26.089 [-26.078, 0.011, 0.000]- Time: 0.026
Epoch 3 [154/340] - Loss: 25.033 [-25.022, 0.011, 0.000]- Time: 0.025
Epoch 3 [155/340] - Loss: 25.343 [-25.332, 0.011, 0.000]- Time: 0.025
Epoch 3 [156/340] - Loss: 22.961 [-22.950, 0.011, 0.000]- Time: 0.027
Epoch 3 [157/340] - Loss: 24.391 [-24.380, 0.011, 0.000]- Time: 0.023
Epoch 3 [158/340] - Loss: 28.486 [-28.475, 0.011, 0.000]- Time: 0.023
Epoch 3 [159/340] - Loss: 25.895 [-25.884, 0.011, 0.000]- Time: 0.023
Epoch 3 [160/340] - Loss: 23.731 [-23.720, 0.011, 0.000]- Time: 0.022
Epoch 3 [161/340] - Loss: 24.278 [-24.267, 0.011, 0.000]- Time: 0.024
Epoch 3 [162/340] - Loss: 25.650 [-25.639, 0.011, 0.000]- Time: 0.023
Epoch 3 [163/340] - 

Epoch 3 [267/340] - Loss: 24.949 [-24.937, 0.011, 0.000]- Time: 0.030
Epoch 3 [268/340] - Loss: 22.701 [-22.690, 0.011, 0.000]- Time: 0.026
Epoch 3 [269/340] - Loss: 25.812 [-25.801, 0.011, 0.000]- Time: 0.023
Epoch 3 [270/340] - Loss: 23.419 [-23.408, 0.011, 0.000]- Time: 0.025
Epoch 3 [271/340] - Loss: 22.973 [-22.961, 0.011, 0.000]- Time: 0.023
Epoch 3 [272/340] - Loss: 23.832 [-23.821, 0.011, 0.000]- Time: 0.029
Epoch 3 [273/340] - Loss: 25.813 [-25.801, 0.011, 0.000]- Time: 0.028
Epoch 3 [274/340] - Loss: 22.197 [-22.185, 0.011, 0.000]- Time: 0.029
Epoch 3 [275/340] - Loss: 22.074 [-22.063, 0.011, 0.000]- Time: 0.025
Epoch 3 [276/340] - Loss: 21.468 [-21.456, 0.011, 0.000]- Time: 0.027
Epoch 3 [277/340] - Loss: 21.544 [-21.532, 0.011, 0.000]- Time: 0.024
Epoch 3 [278/340] - Loss: 21.347 [-21.336, 0.011, 0.000]- Time: 0.029
Epoch 3 [279/340] - Loss: 24.099 [-24.087, 0.011, 0.000]- Time: 0.024
Epoch 3 [280/340] - Loss: 24.721 [-24.710, 0.011, 0.000]- Time: 0.026
Epoch 3 [281/340] - 

Epoch 4 [46/340] - Loss: 23.833 [-23.822, 0.012, 0.000]- Time: 0.026
Epoch 4 [47/340] - Loss: 24.151 [-24.139, 0.012, 0.000]- Time: 0.024
Epoch 4 [48/340] - Loss: 20.129 [-20.118, 0.012, 0.000]- Time: 0.026
Epoch 4 [49/340] - Loss: 22.382 [-22.370, 0.012, 0.000]- Time: 0.024
Epoch 4 [50/340] - Loss: 20.419 [-20.407, 0.012, 0.000]- Time: 0.027
Epoch 4 [51/340] - Loss: 20.146 [-20.134, 0.012, 0.000]- Time: 0.022
Epoch 4 [52/340] - Loss: 22.353 [-22.341, 0.012, 0.000]- Time: 0.027
Epoch 4 [53/340] - Loss: 20.425 [-20.413, 0.012, 0.000]- Time: 0.022
Epoch 4 [54/340] - Loss: 21.152 [-21.140, 0.012, 0.000]- Time: 0.023
Epoch 4 [55/340] - Loss: 22.264 [-22.253, 0.012, 0.000]- Time: 0.023


## Making Predictions

The next cell computes the predictive mean for the test set one minibatch at a time and concatenates the results on the CPU. Each `preds` object is a `MultivariateNormal`, so the predictive mean is available as `preds.mean`, and the predictive covariance can be obtained from the same object if needed. Because the test set is substantially smaller than the training set, minibatching is not strictly required here; for smaller problems you could pass the full `test_x` tensor to the model in a single call.

In [None]:
model.eval()
likelihood.eval()
# Collect the per-batch predictive means, then concatenate them once at the end
means = []
with torch.no_grad():
    for x_batch, y_batch in test_loader:
        preds = model(x_batch)
        means.append(preds.mean.cpu())
means = torch.cat(means)

In [None]:
print('Test MAE: {}'.format(torch.mean(torch.abs(means - test_y.cpu()))))
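
For clarity, the metric printed above is the mean absolute error. The sketch below computes it on a few made-up numbers in plain Python and also mirrors the batch-then-concatenate pattern used in the prediction cell. The helper names and all values are hypothetical and do not come from the model.

```python
def mean_absolute_error(predictions, targets):
    """Average of |prediction - target| over all examples."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


def in_batches(values, batch_size):
    """Yield consecutive slices of `values`, like an unshuffled loader."""
    for start in range(0, len(values), batch_size):
        yield values[start:start + batch_size]


# Hypothetical predictions and targets, for illustration only.
predictions = [1.0, 2.0, 4.0, 8.0]
targets = [1.5, 2.0, 3.0, 10.0]

# Collect per-batch results and concatenate, as the prediction cell does.
collected = []
for batch in in_batches(predictions, batch_size=3):
    collected.extend(batch)

mae = mean_absolute_error(collected, targets)
print(mae)
```

Because the test loader is not shuffled, concatenating batch outputs preserves the original ordering, so the predictions line up with `test_y` element by element.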