# Large-Scale Stochastic Variational GP Regression (CUDA) (w/ KISS-GP)

## Overview

In this notebook, we give an overview of how to use Deep Kernel Learning (DKL) with SKI-based stochastic variational regression to train rapidly on minibatches of the `3droad` UCI dataset, whose training split here contains roughly 350,000 examples.

Stochastic variational inference has several major advantages over the standard regression setting. Most notably, the data-dependent part of the ELBO used for optimization decomposes into a sum over training examples, so it can be estimated on minibatches and optimized with stochastic gradient descent techniques. See https://arxiv.org/pdf/1411.2005.pdf and https://arxiv.org/pdf/1611.00336.pdf for the technical details.
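
As a rough sketch of that decomposition (the notation here is an assumption introduced for illustration, following the general form used in the first reference, and is not defined elsewhere in this notebook): with $N$ training points, inducing values $\mathbf{u}$, variational distribution $q$, and a minibatch $B$ drawn uniformly at random, the ELBO and its minibatch estimate can be written as

```latex
\mathcal{L}
  = \sum_{i=1}^{N} \mathbb{E}_{q(f_i)}\left[\log p(y_i \mid f_i)\right]
    - \mathrm{KL}\left(q(\mathbf{u}) \,\|\, p(\mathbf{u})\right)
  \approx \frac{N}{|B|} \sum_{i \in B} \mathbb{E}_{q(f_i)}\left[\log p(y_i \mid f_i)\right]
    - \mathrm{KL}\left(q(\mathbf{u}) \,\|\, p(\mathbf{u})\right)
```

The rescaling by $N / |B|$ is what makes the minibatch sum an unbiased estimate of the full sum; the KL term does not depend on the data and is computed once per step.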

In [1]:
import math
import torch
import gpytorch
from matplotlib import pyplot as plt

# Make plots inline
%matplotlib inline

## Loading Data

For this example notebook, we'll be using the `3droad` UCI dataset used in the paper. Running the next cell downloads a copy of the dataset that has already been scaled and normalized appropriately. We then simply split the data, using the first 80% for training and the last 20% for testing.

**Note**: Running the next cell will attempt to download a **~340 MB** file to the current directory.

In [2]:
!ls ~/data/uci/3droad/3droad.mat

/home/jrg365/data/uci/3droad/3droad.mat


In [3]:
import urllib.request
import os.path
from scipy.io import loadmat
from math import floor

if not os.path.isfile('3droad.mat'):
    print('Downloading \'3droad\' UCI dataset...')
    urllib.request.urlretrieve('https://www.dropbox.com/s/f6ow1i59oqx05pl/3droad.mat?dl=1', '3droad.mat')
    
data = torch.Tensor(loadmat('3droad.mat')['data'])
X = data[:, :-1]
X = X - X.min(0)[0]
X = 2 * (X / X.max(0)[0]) - 1
y = data[:, -1]

# Use the first 80% of the data for training, and the last 20% for testing.
train_n = int(floor(0.8*len(X)))

train_x = X[:train_n, :].contiguous().cuda()
train_y = y[:train_n].contiguous().cuda()

test_x = X[train_n:, :].contiguous().cuda()
test_y = y[train_n:].contiguous().cuda()
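
To make the preprocessing in the cell above concrete, here is a minimal plain-Python sketch of the same two steps on a toy table: per-column min-max scaling to `[-1, 1]`, then an 80/20 split by position. The helper `scale_columns` and the toy values are hypothetical and are not part of the notebook or of PyTorch; the sketch assumes no column is constant.

```python
from math import floor


def scale_columns(rows):
    """Min-max scale each column of a list of rows to the interval [-1, 1]."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    # Range of each column; assumes max > min for every column.
    ranges = [max(col) - lo for col, lo in zip(columns, mins)]
    return [
        [2 * (value - lo) / rng - 1 for value, lo, rng in zip(row, mins, ranges)]
        for row in rows
    ]


# Toy feature table (hypothetical values): 5 examples, 2 features.
toy_rows = [[0.0, 10.0], [5.0, 20.0], [10.0, 40.0], [2.5, 25.0], [7.5, 30.0]]
scaled = scale_columns(toy_rows)

# First 80% of the rows for training, the remainder for testing.
train_n = int(floor(0.8 * len(scaled)))
train_rows, test_rows = scaled[:train_n], scaled[train_n:]
```

Note that, as in the notebook cell, the scaling statistics are computed over the whole table before splitting.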

## Creating a DataLoader

The next step is to create a torch `DataLoader` that will handle getting us random minibatches of data. This involves using the standard `TensorDataset` and `DataLoader` modules provided by PyTorch.

In this notebook we'll be using a fairly large batch size of 2028 just to make optimization run faster, but you could of course change this as you see fit.

In [4]:
from torch.utils.data import TensorDataset, DataLoader
train_dataset = TensorDataset(train_x, train_y)
train_loader = DataLoader(train_dataset, batch_size=2028, shuffle=True)
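
Conceptually, a shuffling loader permutes the example indices once per epoch and slices them into consecutive chunks. The following is a minimal plain-Python sketch of that idea, not PyTorch's implementation; `minibatches` is a hypothetical helper introduced only for illustration.

```python
import math
import random


def minibatches(n_examples, batch_size, seed=0):
    """Yield lists of shuffled example indices, one list per minibatch."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    for start in range(0, n_examples, batch_size):
        # The final slice may be shorter than batch_size.
        yield indices[start:start + batch_size]


batches = list(minibatches(n_examples=10, batch_size=4))
num_batches = math.ceil(10 / 4)  # number of minibatches per epoch
```

The same `ceil(N / batch_size)` count is why each epoch in the training output below runs over 172 minibatches with a batch size of 2028.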

## Defining the DKL Feature Extractor

Next, we define the neural network feature extractor used to define the deep kernel. In this case, we use a fully connected network with the architecture `d -> 1000 -> 1000 -> 500 -> 50 -> 2`, following the original DKL paper, here with batch normalization and a ReLU after each hidden linear layer. All of the code below uses standard PyTorch implementations of neural network layers.

In [5]:
data_dim = train_x.size(-1)

class LargeFeatureExtractor(torch.nn.Sequential):           
    def __init__(self):                                      
        super(LargeFeatureExtractor, self).__init__()        
        self.add_module('linear1', torch.nn.Linear(data_dim, 1000))
        self.add_module('bn1', torch.nn.BatchNorm1d(1000))
        self.add_module('relu1', torch.nn.ReLU())
        self.add_module('linear2', torch.nn.Linear(1000, 1000))
        self.add_module('bn2', torch.nn.BatchNorm1d(1000))
        self.add_module('relu2', torch.nn.ReLU())                       
        self.add_module('linear3', torch.nn.Linear(1000, 500))
        self.add_module('bn3', torch.nn.BatchNorm1d(500))
        self.add_module('relu3', torch.nn.ReLU())                  
        self.add_module('linear4', torch.nn.Linear(500, 50))       
        self.add_module('bn4', torch.nn.BatchNorm1d(50))
        self.add_module('relu4', torch.nn.ReLU())                  
        self.add_module('linear5', torch.nn.Linear(50, 2))         
                                                             
feature_extractor = LargeFeatureExtractor().cuda()
# num_features is the number of final features extracted by the neural network, in this case 2.
num_features = 2

## Defining the GP Regression Layer

We now define the GP regression module that, intuitively, will act as the final "layer" of our neural network. In this case, because we are doing variational inference and *not* exact inference, we will be using an `AbstractVariationalGP`. To use grid interpolation for variational inference, we'll be using a `GridInterpolationVariationalStrategy`.

Because the feature extractor we defined above extracts two features, we'll need to define our grid bounds over two dimensions.

See the CIFAR example for using an `AbstractVariationalGP` with an `AdditiveGridInterpolationVariationalStrategy`. That strategy additionally assumes the kernel decomposes additively. This is a strong modelling assumption, but it allows us to use many more output features from the neural network.

In [6]:
from gpytorch.models import AbstractVariationalGP
from gpytorch.variational import CholeskyVariationalDistribution, GridInterpolationVariationalStrategy
class GPRegressionLayer(AbstractVariationalGP):
    def __init__(self, grid_size=32, grid_bounds=[(-1, 1), (-1, 1)]):
        variational_distribution = CholeskyVariationalDistribution(num_inducing_points=grid_size*grid_size)
        variational_strategy = GridInterpolationVariationalStrategy(self,
                                                                    grid_size=grid_size,
                                                                    grid_bounds=grid_bounds,
                                                                    variational_distribution=variational_distribution)
        super(GPRegressionLayer, self).__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel(
            log_lengthscale_prior=gpytorch.priors.SmoothedBoxPrior(0.001, 1., sigma=0.1, log_transform=True)
        ))

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)
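
The reason the feature extractor is limited to two outputs is that a dense grid needs `grid_size ** num_features` inducing points. Here is a minimal plain-Python sketch of a regular grid over the bounds used above; `grid_points` is a hypothetical helper, and the library's actual grid placement may differ (for example at the boundaries).

```python
from itertools import product


def grid_points(grid_size, grid_bounds):
    """Return every point of a regular grid with grid_size points per dimension."""
    axes = []
    for lower, upper in grid_bounds:
        step = (upper - lower) / (grid_size - 1)
        axes.append([lower + i * step for i in range(grid_size)])
    # Cartesian product of the per-dimension axes.
    return list(product(*axes))


points = grid_points(grid_size=32, grid_bounds=[(-1, 1), (-1, 1)])
print(len(points))  # 1024, i.e. 32 * 32
```

This matches the `num_inducing_points=grid_size*grid_size` argument in the cell above. With 50 features instead of 2, the same construction would need `32 ** 50` points, which is why more features require an additive decomposition.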

## Defining the DKL Model

With the feature extractor and GP regression layer defined, we can now define our full model. To do this, we simply create a module whose `forward()` method passes the data first through the feature extractor, and then through the GP regression layer.

The only other interesting feature of the model below is that we use a helper function, `scale_to_bounds`, to ensure that the features extracted by the neural network fit within the grid bounds used for SKI.

In [14]:
class DKLModel(gpytorch.Module):
    def __init__(self, feature_extractor, num_features, grid_bounds=(-1., 1.)):
        super(DKLModel, self).__init__()
        self.feature_extractor = feature_extractor
        self.gp_layer = GPRegressionLayer()
        self.grid_bounds = grid_bounds
        self.num_features = num_features

    def forward(self, x):
        features = self.feature_extractor(x)
        features = gpytorch.utils.grid.scale_to_bounds(features, self.grid_bounds[0], self.grid_bounds[1])
        res = self.gp_layer(features)
        return res

model = DKLModel(feature_extractor, num_features=num_features).cuda()
likelihood = gpytorch.likelihoods.GaussianLikelihood().cuda()
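
To illustrate what scaling features to the grid bounds means, here is a minimal plain-Python sketch of a linear min-max rescale. This is an assumption about the general idea only: `rescale_to_bounds` is a hypothetical helper, not GPyTorch's `scale_to_bounds`, whose exact formula may differ.

```python
def rescale_to_bounds(values, lower, upper):
    """Linearly map a list of floats so its min lands on `lower` and its max on `upper`."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Degenerate case: place every value at the midpoint of the interval.
        return [(lower + upper) / 2 for _ in values]
    scale = (upper - lower) / (hi - lo)
    return [lower + (v - lo) * scale for v in values]


# Hypothetical raw network outputs for one feature dimension.
features = [-3.0, 0.0, 1.0, 5.0]
scaled = rescale_to_bounds(features, -1.0, 1.0)
```

Because the rescale depends on the minimum and maximum of the values it is given, the mapping can change from one minibatch to the next.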

## Training the Model

The cell below trains the DKL model above, learning both the hyperparameters of the Gaussian process **and** the parameters of the neural network in an end-to-end fashion using Type-II MLE.

Unlike when using the exact GP marginal log likelihood, performing variational inference allows us to make use of stochastic optimization techniques. For this example, we'll do six epochs of training. Given the small size of the neural network relative to the size of the dataset, this should be sufficient to achieve comparable accuracy to what was observed in the DKL paper.

The optimization loop differs from the one seen in our simpler tutorials in that it involves looping over both a number of training iterations (epochs) *and* minibatches of the data. However, the basic process is the same: for each minibatch, we forward through the model, compute the loss (the `VariationalELBO`), call `backward()`, and do a step of optimization.
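
The training cell requests the ELBO terms separately and combines them into a scalar loss by hand. Here is a minimal sketch of that arithmetic, using the three numbers from the first line of the training output (the printed `-0.000` prior term is approximated as `0.0`); `negative_elbo` is a hypothetical helper name.

```python
def negative_elbo(log_lik, kl_div, log_prior):
    """Combine the three ELBO terms into the scalar loss that is minimized."""
    return -(log_lik - kl_div + log_prior)


# Values copied from the first line of the training output.
loss = negative_elbo(log_lik=-171.786, kl_div=0.121, log_prior=0.0)
print(round(loss, 3))  # 171.907
```

The result agrees with the loss printed on that line, which is a quick way to check the sign conventions: the log likelihood and log prior are maximized, while the KL divergence is penalized.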

In [15]:
model.train()
likelihood.train()

# We'll do 6 epochs of training in this tutorial
num_epochs = 6

# We use the Adam optimizer here, applying weight decay only to the feature extractor parameters
optimizer = torch.optim.Adam([
    {'params': model.feature_extractor.parameters(), 'weight_decay': 1e-3},
    {'params': model.gp_layer.parameters()},
    {'params': likelihood.parameters()},
], lr=0.1)

scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[3, 5], gamma=0.1)

# Our loss object. We're using the VariationalELBO, which essentially just computes the ELBO
mll = gpytorch.mlls.VariationalELBO(likelihood, model.gp_layer, num_data=train_y.size(0), combine_terms=False)

for i in range(num_epochs):
    scheduler.step()
    # Within each iteration, we will go over each minibatch of data
    for minibatch_i, (x_batch, y_batch) in enumerate(train_loader):        
        optimizer.zero_grad()
        
        # Because the grid is relatively small, we turn off Toeplitz matrix multiplication
        # and perform the multiplications directly, which we find more efficient for very small grids.
        with gpytorch.settings.use_toeplitz(False):
            output = model(x_batch)
            log_lik, kl_div, log_prior = mll(output, y_batch)
            loss = -(log_lik - kl_div + log_prior)
            print('Epoch %d [%d/%d] - Loss: %.3f [%.3f, %.3f, %.3f]' % (i + 1, minibatch_i, len(train_loader), loss.item(), log_lik.item(), kl_div.item(), log_prior.item()))

        # The actual optimization step
        loss.backward()
        optimizer.step()
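
The `MultiStepLR` scheduler used above multiplies the learning rate by `gamma` each time its epoch counter reaches a milestone. Here is a minimal plain-Python sketch of that rule with the settings from the cell; `multistep_lr` is a hypothetical helper. How the scheduler's internal counter lines up with the loop variable `i` depends on the PyTorch version and on where `scheduler.step()` is called, so read `epoch` here as the scheduler's own counter.

```python
def multistep_lr(base_lr, milestones, gamma, epoch):
    """Learning rate after decaying by `gamma` at each milestone reached by `epoch`."""
    num_decays = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * gamma ** num_decays


# Settings taken from the training cell: lr=0.1, milestones=[3, 5], gamma=0.1.
for epoch in range(6):
    print(epoch, multistep_lr(0.1, milestones=[3, 5], gamma=0.1, epoch=epoch))
```

Under these settings the rate starts at 0.1, drops by a factor of ten at counter value 3, and by another factor of ten at counter value 5.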

Epoch 1 [0/172] - Loss: 171.907 [-171.786, 0.121, -0.000]
Epoch 1 [1/172] - Loss: 158.461 [-157.545, 0.916, -0.000]
Epoch 1 [2/172] - Loss: 145.601 [-144.634, 0.967, -0.000]
Epoch 1 [3/172] - Loss: 126.168 [-125.540, 0.628, -0.000]
Epoch 1 [4/172] - Loss: 120.279 [-119.786, 0.494, -0.000]
Epoch 1 [5/172] - Loss: 112.098 [-111.550, 0.548, -0.000]
Epoch 1 [6/172] - Loss: 100.376 [-99.964, 0.412, -0.000]
Epoch 1 [7/172] - Loss: 90.687 [-90.449, 0.238, -0.000]
Epoch 1 [8/172] - Loss: 91.281 [-91.098, 0.183, -0.000]
Epoch 1 [9/172] - Loss: 74.134 [-73.978, 0.157, -0.000]
Epoch 1 [10/172] - Loss: 65.697 [-65.568, 0.129, -0.000]
Epoch 1 [11/172] - Loss: 60.717 [-60.583, 0.134, -0.000]
Epoch 1 [12/172] - Loss: 61.200 [-61.065, 0.135, -0.000]
Epoch 1 [13/172] - Loss: 53.973 [-53.855, 0.118, -0.000]
Epoch 1 [14/172] - Loss: 54.097 [-53.984, 0.113, -0.000]
Epoch 1 [15/172] - Loss: 46.051 [-45.938, 0.114, -0.000]
Epoch 1 [16/172] - Loss: 42.500 [-42.404, 0.096, -0.000]
Epoch 1 [17/172] - Loss: 39.

Epoch 1 [148/172] - Loss: 5.327 [-5.324, 0.003, -0.000]
Epoch 1 [149/172] - Loss: 5.029 [-5.026, 0.003, -0.000]
Epoch 1 [150/172] - Loss: 5.265 [-5.263, 0.003, -0.000]
Epoch 1 [151/172] - Loss: 5.076 [-5.073, 0.003, -0.000]
Epoch 1 [152/172] - Loss: 5.124 [-5.121, 0.003, -0.000]
Epoch 1 [153/172] - Loss: 5.118 [-5.115, 0.003, -0.000]
Epoch 1 [154/172] - Loss: 5.024 [-5.022, 0.003, -0.000]
Epoch 1 [155/172] - Loss: 5.109 [-5.106, 0.003, -0.000]
Epoch 1 [156/172] - Loss: 5.134 [-5.131, 0.003, -0.000]
Epoch 1 [157/172] - Loss: 5.177 [-5.174, 0.003, -0.000]
Epoch 1 [158/172] - Loss: 4.959 [-4.956, 0.003, -0.000]
Epoch 1 [159/172] - Loss: 4.741 [-4.739, 0.003, -0.000]
Epoch 1 [160/172] - Loss: 5.186 [-5.183, 0.003, -0.000]
Epoch 1 [161/172] - Loss: 4.922 [-4.919, 0.003, -0.000]
Epoch 1 [162/172] - Loss: 4.857 [-4.855, 0.003, -0.000]
Epoch 1 [163/172] - Loss: 5.135 [-5.133, 0.003, -0.000]
Epoch 1 [164/172] - Loss: 5.087 [-5.084, 0.003, -0.000]
Epoch 1 [165/172] - Loss: 4.894 [-4.891, 0.003, 

Epoch 2 [126/172] - Loss: 4.242 [-4.240, 0.002, -0.000]
Epoch 2 [127/172] - Loss: 4.342 [-4.340, 0.002, -0.000]
Epoch 2 [128/172] - Loss: 4.495 [-4.493, 0.002, -0.000]
Epoch 2 [129/172] - Loss: 4.381 [-4.379, 0.002, -0.000]
Epoch 2 [130/172] - Loss: 4.305 [-4.303, 0.002, -0.000]
Epoch 2 [131/172] - Loss: 4.373 [-4.371, 0.002, -0.000]
Epoch 2 [132/172] - Loss: 4.221 [-4.219, 0.002, -0.000]
Epoch 2 [133/172] - Loss: 4.321 [-4.319, 0.002, -0.000]
Epoch 2 [134/172] - Loss: 4.547 [-4.545, 0.002, -0.000]
Epoch 2 [135/172] - Loss: 4.413 [-4.411, 0.002, -0.000]
Epoch 2 [136/172] - Loss: 4.332 [-4.330, 0.002, -0.000]
Epoch 2 [137/172] - Loss: 4.321 [-4.319, 0.002, -0.000]
Epoch 2 [138/172] - Loss: 4.526 [-4.524, 0.002, -0.000]
Epoch 2 [139/172] - Loss: 4.277 [-4.275, 0.002, -0.000]
Epoch 2 [140/172] - Loss: 4.327 [-4.325, 0.002, -0.000]
Epoch 2 [141/172] - Loss: 4.235 [-4.232, 0.002, -0.000]
Epoch 2 [142/172] - Loss: 4.447 [-4.444, 0.002, -0.000]
Epoch 2 [143/172] - Loss: 4.581 [-4.579, 0.002, 

Epoch 3 [104/172] - Loss: 4.096 [-4.095, 0.002, -0.000]
Epoch 3 [105/172] - Loss: 4.053 [-4.052, 0.002, -0.000]
Epoch 3 [106/172] - Loss: 4.092 [-4.090, 0.002, -0.000]
Epoch 3 [107/172] - Loss: 4.075 [-4.073, 0.002, -0.000]
Epoch 3 [108/172] - Loss: 4.077 [-4.075, 0.002, -0.000]
Epoch 3 [109/172] - Loss: 4.183 [-4.181, 0.002, -0.000]
Epoch 3 [110/172] - Loss: 4.105 [-4.103, 0.002, -0.000]
Epoch 3 [111/172] - Loss: 4.142 [-4.140, 0.002, -0.000]
Epoch 3 [112/172] - Loss: 4.090 [-4.089, 0.002, -0.000]
Epoch 3 [113/172] - Loss: 4.085 [-4.083, 0.002, -0.000]
Epoch 3 [114/172] - Loss: 4.084 [-4.082, 0.002, -0.000]
Epoch 3 [115/172] - Loss: 4.122 [-4.120, 0.002, -0.000]
Epoch 3 [116/172] - Loss: 4.158 [-4.156, 0.002, -0.000]
Epoch 3 [117/172] - Loss: 4.082 [-4.081, 0.002, -0.000]
Epoch 3 [118/172] - Loss: 4.111 [-4.109, 0.002, -0.000]
Epoch 3 [119/172] - Loss: 4.122 [-4.120, 0.002, -0.000]
Epoch 3 [120/172] - Loss: 4.197 [-4.195, 0.002, -0.000]
Epoch 3 [121/172] - Loss: 4.061 [-4.059, 0.002, 

Epoch 4 [80/172] - Loss: 4.021 [-4.020, 0.002, -0.000]
Epoch 4 [81/172] - Loss: 3.998 [-3.997, 0.002, -0.000]
Epoch 4 [82/172] - Loss: 4.083 [-4.081, 0.002, -0.000]
Epoch 4 [83/172] - Loss: 4.012 [-4.011, 0.001, -0.000]
Epoch 4 [84/172] - Loss: 4.071 [-4.070, 0.001, -0.000]
Epoch 4 [85/172] - Loss: 4.037 [-4.035, 0.001, -0.000]
Epoch 4 [86/172] - Loss: 4.071 [-4.070, 0.001, -0.000]
Epoch 4 [87/172] - Loss: 4.042 [-4.040, 0.002, -0.000]
Epoch 4 [88/172] - Loss: 4.032 [-4.030, 0.001, -0.000]
Epoch 4 [89/172] - Loss: 4.006 [-4.005, 0.002, -0.000]
Epoch 4 [90/172] - Loss: 3.982 [-3.981, 0.002, -0.000]
Epoch 4 [91/172] - Loss: 3.991 [-3.990, 0.001, -0.000]
Epoch 4 [92/172] - Loss: 4.050 [-4.048, 0.001, -0.000]
Epoch 4 [93/172] - Loss: 4.010 [-4.009, 0.002, -0.000]
Epoch 4 [94/172] - Loss: 4.000 [-3.999, 0.002, -0.000]
Epoch 4 [95/172] - Loss: 3.972 [-3.970, 0.001, -0.000]
Epoch 4 [96/172] - Loss: 3.981 [-3.979, 0.002, -0.000]
Epoch 4 [97/172] - Loss: 4.010 [-4.008, 0.001, -0.000]
Epoch 4 [9

Epoch 5 [56/172] - Loss: 3.992 [-3.990, 0.001, -0.000]
Epoch 5 [57/172] - Loss: 3.969 [-3.967, 0.002, -0.000]
Epoch 5 [58/172] - Loss: 4.055 [-4.054, 0.002, -0.000]
Epoch 5 [59/172] - Loss: 3.995 [-3.994, 0.002, -0.000]
Epoch 5 [60/172] - Loss: 3.968 [-3.966, 0.001, -0.000]
Epoch 5 [61/172] - Loss: 4.005 [-4.003, 0.002, -0.000]
Epoch 5 [62/172] - Loss: 4.088 [-4.086, 0.002, -0.000]
Epoch 5 [63/172] - Loss: 4.001 [-4.000, 0.002, -0.000]
Epoch 5 [64/172] - Loss: 4.004 [-4.002, 0.001, -0.000]
Epoch 5 [65/172] - Loss: 3.988 [-3.986, 0.002, -0.000]
Epoch 5 [66/172] - Loss: 4.010 [-4.009, 0.001, -0.000]
Epoch 5 [67/172] - Loss: 4.020 [-4.018, 0.002, -0.000]
Epoch 5 [68/172] - Loss: 4.009 [-4.007, 0.001, -0.000]
Epoch 5 [69/172] - Loss: 3.976 [-3.975, 0.001, -0.000]
Epoch 5 [70/172] - Loss: 4.027 [-4.025, 0.002, -0.000]
Epoch 5 [71/172] - Loss: 3.975 [-3.974, 0.002, -0.000]
Epoch 5 [72/172] - Loss: 3.989 [-3.988, 0.001, -0.000]
Epoch 5 [73/172] - Loss: 4.065 [-4.064, 0.002, -0.000]
Epoch 5 [7

Epoch 6 [32/172] - Loss: 4.000 [-3.998, 0.002, -0.000]
Epoch 6 [33/172] - Loss: 3.987 [-3.985, 0.002, -0.000]
Epoch 6 [34/172] - Loss: 3.983 [-3.981, 0.002, -0.000]
Epoch 6 [35/172] - Loss: 3.996 [-3.995, 0.002, -0.000]
Epoch 6 [36/172] - Loss: 4.005 [-4.004, 0.001, -0.000]
Epoch 6 [37/172] - Loss: 3.995 [-3.994, 0.001, -0.000]
Epoch 6 [38/172] - Loss: 3.923 [-3.922, 0.002, -0.000]
Epoch 6 [39/172] - Loss: 3.946 [-3.945, 0.002, -0.000]
Epoch 6 [40/172] - Loss: 3.976 [-3.974, 0.001, -0.000]
Epoch 6 [41/172] - Loss: 3.995 [-3.994, 0.001, -0.000]
Epoch 6 [42/172] - Loss: 4.009 [-4.007, 0.001, -0.000]
Epoch 6 [43/172] - Loss: 3.993 [-3.992, 0.002, -0.000]
Epoch 6 [44/172] - Loss: 3.978 [-3.976, 0.001, -0.000]
Epoch 6 [45/172] - Loss: 3.956 [-3.955, 0.001, -0.000]
Epoch 6 [46/172] - Loss: 3.937 [-3.935, 0.001, -0.000]
Epoch 6 [47/172] - Loss: 3.990 [-3.989, 0.001, -0.000]
Epoch 6 [48/172] - Loss: 3.971 [-3.969, 0.002, -0.000]
Epoch 6 [49/172] - Loss: 3.997 [-3.996, 0.002, -0.000]
Epoch 6 [5

## Making Predictions

The next cell computes the predictive distribution for the test set using the standard SKI testing code, with no acceleration or precomputation. The predictive mean is available as `preds.mean`, alongside the predictive covariance. Because the test set is substantially smaller than the training set, we don't need to make predictions in minibatches here, although our other tutorials demonstrate how to do this (for example, see the CIFAR tutorial).

In [16]:
model.eval()
likelihood.eval()
with torch.no_grad(), gpytorch.settings.use_toeplitz(False):
    preds = model(test_x)

In [17]:
print('Test MAE: {}'.format(torch.mean(torch.abs(preds.mean - test_y))))

Test MAE: 9.38260269165039
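
For reference, the mean absolute error reported above is the average of the absolute differences between predicted means and targets. Here is a minimal plain-Python sketch on hypothetical toy values; `mean_absolute_error` is a helper introduced only for illustration.

```python
def mean_absolute_error(predictions, targets):
    """Average of |prediction - target| over paired values."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


# Toy values (hypothetical): absolute errors are 0.5, 0.0 and 2.0.
mae = mean_absolute_error([1.0, 2.0, 4.0], [1.5, 2.0, 2.0])
```

The MAE is in the units of the target variable, so its magnitude should be read relative to the scale of `test_y`.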
