# Large-Scale Stochastic Variational GP Regression (CUDA) (w/ SVGP)

## Overview

In this notebook, we give an overview of how to combine Deep Kernel Learning (DKL) with stochastic variational GP (SVGP) regression to train rapidly on minibatches of the `song` UCI dataset, which has over 500,000 examples in 90 dimensions.

Stochastic variational inference has several major advantages over the standard regression setting. Most notably, the expected log likelihood term of the ELBO used for optimization decomposes into a sum over data points, so stochastic gradient descent techniques can be used. See https://arxiv.org/pdf/1411.2005.pdf and https://arxiv.org/pdf/1611.00336.pdf for more technical details.
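
To make the decomposition concrete, here is the usual form of the SVGP objective, written in notation we introduce only for this sketch ($N$ training points, a minibatch $B$, inducing values $\mathbf{u}$, latent function values $f_i$). None of these symbols are names used elsewhere in the notebook.

```latex
\mathcal{L}(q)
  = \sum_{i=1}^{N} \mathbb{E}_{q(f_i)}\big[\log p(y_i \mid f_i)\big]
    - \mathrm{KL}\big[q(\mathbf{u}) \,\|\, p(\mathbf{u})\big]

\hat{\mathcal{L}}(q)
  = \frac{N}{|B|} \sum_{i \in B} \mathbb{E}_{q(f_i)}\big[\log p(y_i \mid f_i)\big]
    - \mathrm{KL}\big[q(\mathbf{u}) \,\|\, p(\mathbf{u})\big]
```

Because the first term is a sum over data points, the minibatch estimate $\hat{\mathcal{L}}$ is an unbiased estimate of $\mathcal{L}$ when $B$ is drawn uniformly at random. The factor $N$ is why the loss object later in the notebook is told the training set size through `num_data`.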

In [1]:
import math
import torch
import gpytorch
from matplotlib import pyplot as plt

# Make plots inline
%matplotlib inline

## Loading Data

For this example notebook, we'll be using the `song` UCI dataset used in the paper. Running the next cell downloads a copy of the dataset that has already been scaled and normalized appropriately. We simply use the first 80% of the data for training and the last 20% for testing.

**Note**: Running the next cell will attempt to download a **~340 MB** file to the current directory.

In [2]:
import urllib.request
import os.path
from scipy.io import loadmat
from math import floor

if not os.path.isfile('song.mat'):
    print('Downloading \'song\' UCI dataset...')
    urllib.request.urlretrieve('https://www.dropbox.com/s/mg91x4c0muatanp/song.mat?dl=1', 'song.mat')
    
data = torch.Tensor(loadmat('song.mat')['data'])
X = data[:, :-1]
X = X - X.min(0)[0]
X = 2 * (X / X.max(0)[0]) - 1
y = data[:, -1]

# Use the first 80% of the data for training, and the last 20% for testing.
train_n = int(floor(0.8*len(X)))

train_x = X[:train_n, :].contiguous().cuda()
train_y = y[:train_n].contiguous().cuda()

test_x = X[train_n:, :].contiguous().cuda()
test_y = y[train_n:].contiguous().cuda()
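
The rescaling and the ordered split in the cell above can be checked on a toy tensor. This is a minimal CPU-only sketch; the numbers in `X` are made up and stand in for the real feature matrix.

```python
import torch

# Toy stand-in for the feature matrix: 5 rows, 2 columns (values are made up).
X = torch.tensor([[0.0, 10.0],
                  [1.0, 20.0],
                  [2.0, 30.0],
                  [3.0, 40.0],
                  [4.0, 50.0]])

# Shift each column so its minimum is 0, scale so its maximum is 1,
# then map [0, 1] onto [-1, 1].
X = X - X.min(0)[0]
X = 2 * (X / X.max(0)[0]) - 1

# Ordered 80/20 split: the first 80% of rows train, the rest test.
train_n = int(0.8 * len(X))
train_x, test_x = X[:train_n], X[train_n:]

print(X.min(0)[0], X.max(0)[0])       # per-column minimum and maximum
print(train_x.shape, test_x.shape)
```

Note that a constant column would make `X.max(0)[0]` zero after the shift and produce a division by zero, so this recipe assumes every feature takes at least two distinct values.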

## Creating a DataLoader

The next step is to create a torch `DataLoader` that will handle getting us random minibatches of data. This involves using the standard `TensorDataset` and `DataLoader` modules provided by PyTorch.

In this notebook we use a fairly large batch size of 1024 to make optimization run faster, but you can change this as you see fit.

In [3]:
from torch.utils.data import TensorDataset, DataLoader
train_dataset = TensorDataset(train_x, train_y)
train_loader = DataLoader(train_dataset, batch_size=1024, shuffle=True)
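
To see what the loader produces, here is a small CPU-only sketch with toy tensors. The names `toy_x`, `toy_y`, and `toy_loader` are illustrative and not part of the notebook.

```python
import math
import torch
from torch.utils.data import TensorDataset, DataLoader

# Toy tensors standing in for train_x / train_y (10 rows, 3 features).
toy_x = torch.randn(10, 3)
toy_y = torch.randn(10)

toy_loader = DataLoader(TensorDataset(toy_x, toy_y), batch_size=4, shuffle=True)

# Each iteration yields one (inputs, targets) minibatch.
batch_sizes = [x_batch.size(0) for x_batch, y_batch in toy_loader]

print(len(toy_loader))   # number of minibatches per epoch: ceil(10 / 4)
print(batch_sizes)       # the last batch holds the leftover rows
```

The same arithmetic explains the `403` that appears in the training log further down: with a batch size of 1024, 403 minibatches per epoch means the training set holds between 411,649 and 412,672 rows.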

## Defining the DKL Feature Extractor

Next, we define the neural network feature extractor used to define the deep kernel. In this case, we use a fully connected network with the architecture `d -> 1000 -> 500 -> 50 -> 2`, as described in the original DKL paper. All of the code below uses standard PyTorch implementations of neural network layers.

In [4]:
data_dim = train_x.size(-1)

class LargeFeatureExtractor(torch.nn.Sequential):           
    def __init__(self):                                      
        super(LargeFeatureExtractor, self).__init__()        
        self.add_module('linear1', torch.nn.Linear(data_dim, 1000))
        self.add_module('relu1', torch.nn.ReLU())                  
        self.add_module('linear2', torch.nn.Linear(1000, 500))     
        self.add_module('relu2', torch.nn.ReLU())                  
        self.add_module('linear3', torch.nn.Linear(500, 50))       
        self.add_module('relu3', torch.nn.ReLU())                  
        self.add_module('linear4', torch.nn.Linear(50, 2))         
                                                             
feature_extractor = LargeFeatureExtractor().cuda()
# num_features is the number of final features extracted by the neural network, in this case 2.
num_features = 2
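
As a quick sanity check of the architecture, the following CPU-only sketch builds a network with the same layer sizes and confirms the output shape. The input width of 90 is taken from the dataset description in the overview, and `toy_extractor` is a name used only here.

```python
import torch

# Same layer sizes as the extractor above, built on the CPU for a quick check.
toy_extractor = torch.nn.Sequential(
    torch.nn.Linear(90, 1000),
    torch.nn.ReLU(),
    torch.nn.Linear(1000, 500),
    torch.nn.ReLU(),
    torch.nn.Linear(500, 50),
    torch.nn.ReLU(),
    torch.nn.Linear(50, 2),
)

batch = torch.randn(8, 90)            # 8 random rows standing in for a minibatch
features = toy_extractor(batch)
print(features.shape)                 # torch.Size([8, 2])

# Count weights and biases across all four linear layers.
n_params = sum(p.numel() for p in toy_extractor.parameters())
print(n_params)
```

Each input row is mapped to a two-dimensional feature vector, which is what the GP layer defined next receives as its input.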

## Defining the GP Regression Layer

We now define the GP regression module that, intuitively, will act as the final "layer" of our neural network. In this case, because we are doing variational inference and *not* exact inference, we will be using an `AbstractVariationalGP`. In this example, because we will be learning the inducing point locations, we'll be using a base `VariationalStrategy` with `learn_inducing_locations=True`.

Because the feature extractor we defined above extracts two features, we'll need to define our grid bounds over two dimensions.

In [5]:
from gpytorch.models import AbstractVariationalGP
from gpytorch.variational import CholeskyVariationalDistribution
from gpytorch.variational import VariationalStrategy
class GPRegressionLayer(AbstractVariationalGP):
    def __init__(self, inducing_points):
        variational_distribution = CholeskyVariationalDistribution(inducing_points.size(0))
        variational_strategy = VariationalStrategy(self, inducing_points, variational_distribution, learn_inducing_locations=True)
        super(GPRegressionLayer, self).__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel(
            log_lengthscale_prior=gpytorch.priors.SmoothedBoxPrior(0.001, 1., sigma=0.1, log_transform=True)
        ))

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)
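
`CholeskyVariationalDistribution` parameterizes the variational covariance over the inducing values through a Cholesky-style factor. The following is a minimal pure-PyTorch sketch of that idea, not GPyTorch's implementation; `raw`, `L`, and `S` are names made up for illustration.

```python
import torch

torch.manual_seed(0)
num_inducing = 5

# Unconstrained free parameters: any square matrix of real numbers.
raw = torch.randn(num_inducing, num_inducing)

# Keep only the lower triangle. S = L L^T is then symmetric and
# positive semi-definite no matter what values the free parameters take.
L = torch.tril(raw)
S = L @ L.T

eigvals = torch.linalg.eigvalsh(S)
print(torch.allclose(S, S.T))            # symmetry check
print(eigvals.min().item())              # smallest eigenvalue, non-negative up to rounding

# Number of free entries in a lower-triangular factor.
n_free = num_inducing * (num_inducing + 1) // 2
print(n_free)
```

The appeal of this parameterization is that the optimizer can move freely in the unconstrained entries while the covariance stays valid. For the 500 inducing points used in this notebook, a lower-triangular factor has 125,250 free entries.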

## Defining the DKL Model

With the feature extractor and GP regression layer defined, we can now define our full model. To do this, we simply create a module whose `forward()` method passes the data first through the feature extractor, and then through the GP regression layer.

The only other interesting feature of the model below is that we use a helper function, `scale_to_bounds`, to ensure that the features extracted by the neural network fit within the bounds given by `grid_bounds` (here `(-1, 1)`), the same range used to initialize the inducing points.
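
To illustrate what rescaling features into fixed bounds means, here is a hypothetical stand-in written in plain PyTorch. The function `rescale_to_bounds` is our own illustration and is not the library helper, which may differ in details such as leaving a small margin inside the bounds.

```python
import torch

def rescale_to_bounds(x, lower, upper):
    """Linearly map the values of x so that they span [lower, upper].

    Hypothetical illustration only; uses the global min and max of x.
    """
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min) * (upper - lower) + lower

features = torch.tensor([[-3.0, 0.5], [2.0, 7.0], [0.0, 1.0]])
scaled = rescale_to_bounds(features, -1.0, 1.0)
print(scaled.min().item(), scaled.max().item())
```

Keeping the features in a known range means the inducing points, which are initialized in that range, start out where the features actually lie.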

In [6]:
class DKLModel(gpytorch.Module):
    def __init__(self, inducing_points, feature_extractor, num_features, grid_bounds=(-1., 1.)):
        super(DKLModel, self).__init__()
        self.feature_extractor = feature_extractor
        self.gp_layer = GPRegressionLayer(inducing_points)
        self.grid_bounds = grid_bounds
        self.num_features = num_features

    def forward(self, x):
        features = self.feature_extractor(x)
        features = gpytorch.utils.grid.scale_to_bounds(features, self.grid_bounds[0], self.grid_bounds[1])
        res = self.gp_layer(features)
        return res
inducing_points = gpytorch.utils.grid.scale_to_bounds(feature_extractor(train_x[:500, :]), -1, 1)
model = DKLModel(inducing_points=inducing_points, feature_extractor=feature_extractor, num_features=num_features).cuda()
likelihood = gpytorch.likelihoods.GaussianLikelihood().cuda()

## Training the Model

The cell below trains the DKL model above, learning both the hyperparameters of the Gaussian process **and** the parameters of the neural network in an end-to-end fashion using Type-II MLE.
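
Training everything end to end with one optimizer relies on PyTorch's per-parameter-group options. The sketch below uses small stand-in parameters (`toy_net`, `toy_gp_params`, `toy_noise` are illustrative names) to show how a setting such as `weight_decay` can apply to one group only.

```python
import torch

# Stand-ins for the three components that are trained jointly.
toy_net = torch.nn.Linear(4, 2)                          # plays the feature extractor
toy_gp_params = [torch.nn.Parameter(torch.zeros(3))]     # plays the GP-layer parameters
toy_noise = [torch.nn.Parameter(torch.zeros(1))]         # plays the likelihood noise

optimizer = torch.optim.SGD([
    {'params': toy_net.parameters(), 'weight_decay': 1e-3},  # regularize the network only
    {'params': toy_gp_params},
    {'params': toy_noise},
], lr=0.001)

# Options not set on a group fall back to the optimizer-level defaults.
for group in optimizer.param_groups:
    print(group['lr'], group['weight_decay'])
```

All three groups share the learning rate passed at the top level, while only the first group carries a non-zero weight decay.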

Unlike when using the exact GP marginal log likelihood, performing variational inference allows us to make use of stochastic optimization techniques. For this example, we'll do two epochs of training. Given the small size of the neural network relative to the size of the dataset, this should be sufficient to achieve comparable accuracy to what was observed in the DKL paper.

The optimization loop differs from the one seen in our simpler tutorials in that it involves looping over both a number of training iterations (epochs) *and* minibatches of the data. However, the basic process is the same: for each minibatch, we forward through the model, compute the loss (the negative ELBO, computed with `VariationalELBO`), call `backward()`, and do a step of optimization.
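
The nested loop structure can be sketched on its own with a toy problem. In this CPU-only example, a linear model and a mean squared error stand in for the DKL model and the negative ELBO; all names are illustrative.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)

# Toy regression data and a linear model standing in for the DKL model.
x = torch.randn(64, 1)
y = 2.0 * x.squeeze(-1) + 1.0
toy_model = torch.nn.Linear(1, 1)
loader = DataLoader(TensorDataset(x, y), batch_size=16, shuffle=True)
optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.05)

def loss_fn(pred, target):
    # Mean squared error, used here in place of the negative ELBO.
    return torch.mean((pred.squeeze(-1) - target) ** 2)

initial_loss = loss_fn(toy_model(x), y).item()

for epoch in range(5):                    # outer loop: epochs
    for x_batch, y_batch in loader:       # inner loop: minibatches
        optimizer.zero_grad()             # clear gradients from the previous step
        loss = loss_fn(toy_model(x_batch), y_batch)
        loss.backward()                   # backpropagate through the model
        optimizer.step()                  # update the parameters

final_loss = loss_fn(toy_model(x), y).item()
print(initial_loss, final_loss)
```

The four calls inside the inner loop are the same ones used in the training cell; only the model and the loss object change.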

In [7]:
from gpytorch.mlls.variational_elbo import VariationalELBO

model.train()
likelihood.train()

# We'll do 2 epochs of training in this tutorial
num_epochs = 2

# We use SGD here, rather than Adam. Empirically, we find that SGD is better for variational regression
optimizer = torch.optim.SGD([
    {'params': model.feature_extractor.parameters(), 'weight_decay': 1e-3},
    {'params': model.gp_layer.parameters()},
    {'params': likelihood.parameters()},
], lr=0.001)

# Our loss object. We're using the VariationalELBO, which essentially just computes the ELBO
mll = VariationalELBO(likelihood, model.gp_layer, num_data=train_y.size(0))

for i in range(num_epochs):
    # Within each iteration, we will go over each minibatch of data
    for minibatch_i, (x_batch, y_batch) in enumerate(train_loader):
        optimizer.zero_grad()
        
        # Because the grid is relatively small, we turn off Toeplitz matrix multiplication and perform the multiplications directly.
        # We find this to be more efficient when the grid is very small.
        with gpytorch.settings.use_toeplitz(False):
            output = model(x_batch)
            # Calc loss and backprop gradients
            loss = -mll(output, y_batch)
            print('Epoch %d [%d/%d] - Loss: %.3f' % (i + 1, minibatch_i, len(train_loader), loss.item()))

        # The actual optimization step
        loss.backward()
        optimizer.step()

Epoch 1 [0/403] - Loss: 790.282
Epoch 1 [1/403] - Loss: 3.822
Epoch 1 [2/403] - Loss: 3.738
Epoch 1 [3/403] - Loss: 3.846
Epoch 1 [4/403] - Loss: 4.077
Epoch 1 [5/403] - Loss: 3.522
Epoch 1 [6/403] - Loss: 3.655
Epoch 1 [7/403] - Loss: 3.317
Epoch 1 [8/403] - Loss: 3.106
Epoch 1 [9/403] - Loss: 3.085
Epoch 1 [10/403] - Loss: 3.047
Epoch 1 [11/403] - Loss: 2.972
Epoch 1 [12/403] - Loss: 2.939
Epoch 1 [13/403] - Loss: 2.895
Epoch 1 [14/403] - Loss: 2.831
Epoch 1 [15/403] - Loss: 2.797
Epoch 1 [16/403] - Loss: 2.810
Epoch 1 [17/403] - Loss: 2.770
Epoch 1 [18/403] - Loss: 2.782
Epoch 1 [19/403] - Loss: 2.774
Epoch 1 [20/403] - Loss: 2.705
Epoch 1 [21/403] - Loss: 2.663
Epoch 1 [22/403] - Loss: 2.599
Epoch 1 [23/403] - Loss: 2.728
Epoch 1 [24/403] - Loss: 2.645
Epoch 1 [25/403] - Loss: 2.654
Epoch 1 [26/403] - Loss: 2.487
Epoch 1 [27/403] - Loss: 2.545
Epoch 1 [28/403] - Loss: 2.465
Epoch 1 [29/403] - Loss: 2.461
Epoch 1 [30/403] - Loss: 2.448
Epoch 1 [31/403] - Loss: 2.444
...

Epoch 1 [260/403] - Loss: 1.921
Epoch 1 [261/403] - Loss: 1.916
Epoch 1 [262/403] - Loss: 1.918
Epoch 1 [263/403] - Loss: 1.918
Epoch 1 [264/403] - Loss: 1.920
Epoch 1 [265/403] - Loss: 1.913
Epoch 1 [266/403] - Loss: 1.918
Epoch 1 [267/403] - Loss: 1.915
Epoch 1 [268/403] - Loss: 1.916
Epoch 1 [269/403] - Loss: 1.917
Epoch 1 [270/403] - Loss: 1.915
Epoch 1 [271/403] - Loss: 1.911
Epoch 1 [272/403] - Loss: 1.913
Epoch 1 [273/403] - Loss: 1.904
Epoch 1 [274/403] - Loss: 1.914
Epoch 1 [275/403] - Loss: 1.909
Epoch 1 [276/403] - Loss: 1.912
Epoch 1 [277/403] - Loss: 1.906
Epoch 1 [278/403] - Loss: 1.904
Epoch 1 [279/403] - Loss: 1.908
Epoch 1 [280/403] - Loss: 1.903
Epoch 1 [281/403] - Loss: 1.902
Epoch 1 [282/403] - Loss: 1.902
Epoch 1 [283/403] - Loss: 1.901
Epoch 1 [284/403] - Loss: 1.901
Epoch 1 [285/403] - Loss: 1.901
Epoch 1 [286/403] - Loss: 1.902
Epoch 1 [287/403] - Loss: 1.900
Epoch 1 [288/403] - Loss: 1.894
Epoch 1 [289/403] - Loss: 1.900
Epoch 1 [290/403] - Loss: 1.895
...

Epoch 2 [117/403] - Loss: 1.780
Epoch 2 [118/403] - Loss: 1.770
Epoch 2 [119/403] - Loss: 1.773
Epoch 2 [120/403] - Loss: 1.768
Epoch 2 [121/403] - Loss: 1.775
Epoch 2 [122/403] - Loss: 1.766
Epoch 2 [123/403] - Loss: 1.774
Epoch 2 [124/403] - Loss: 1.765
Epoch 2 [125/403] - Loss: 1.771
Epoch 2 [126/403] - Loss: 1.771
Epoch 2 [127/403] - Loss: 1.769
Epoch 2 [128/403] - Loss: 1.769
Epoch 2 [129/403] - Loss: 1.771
Epoch 2 [130/403] - Loss: 1.775
Epoch 2 [131/403] - Loss: 1.771
Epoch 2 [132/403] - Loss: 1.768
Epoch 2 [133/403] - Loss: 1.767
Epoch 2 [134/403] - Loss: 1.767
Epoch 2 [135/403] - Loss: 1.770
Epoch 2 [136/403] - Loss: 1.766
Epoch 2 [137/403] - Loss: 1.761
Epoch 2 [138/403] - Loss: 1.763
Epoch 2 [139/403] - Loss: 1.766
Epoch 2 [140/403] - Loss: 1.763
Epoch 2 [141/403] - Loss: 1.759
Epoch 2 [142/403] - Loss: 1.766
Epoch 2 [143/403] - Loss: 1.766
Epoch 2 [144/403] - Loss: 1.763
Epoch 2 [145/403] - Loss: 1.764
Epoch 2 [146/403] - Loss: 1.768
Epoch 2 [147/403] - Loss: 1.759
...

Epoch 2 [377/403] - Loss: 1.693
Epoch 2 [378/403] - Loss: 1.693
Epoch 2 [379/403] - Loss: 1.700
Epoch 2 [380/403] - Loss: 1.693
Epoch 2 [381/403] - Loss: 1.695
Epoch 2 [382/403] - Loss: 1.689
Epoch 2 [383/403] - Loss: 1.693
Epoch 2 [384/403] - Loss: 1.695
Epoch 2 [385/403] - Loss: 1.694
Epoch 2 [386/403] - Loss: 1.695
Epoch 2 [387/403] - Loss: 1.695
Epoch 2 [388/403] - Loss: 1.686
Epoch 2 [389/403] - Loss: 1.694
Epoch 2 [390/403] - Loss: 1.694
Epoch 2 [391/403] - Loss: 1.693
Epoch 2 [392/403] - Loss: 1.686
Epoch 2 [393/403] - Loss: 1.693
Epoch 2 [394/403] - Loss: 1.689
Epoch 2 [395/403] - Loss: 1.694
Epoch 2 [396/403] - Loss: 1.698
Epoch 2 [397/403] - Loss: 1.687
Epoch 2 [398/403] - Loss: 1.687
Epoch 2 [399/403] - Loss: 1.694
Epoch 2 [400/403] - Loss: 1.692
Epoch 2 [401/403] - Loss: 1.685
Epoch 2 [402/403] - Loss: 1.681


## Making Predictions

The next cell computes the predictive distribution for the test set (both the predictive mean, available as `preds.mean`, and the predictive covariance) using the standard prediction code, with no acceleration or precomputation. Because the test set is substantially smaller than the training set, we don't need to make predictions in minibatches here, although our other tutorials demonstrate how to do this (for example, see the CIFAR tutorial).
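
For test sets that do not fit in memory at once, predicting in minibatches is one option. The following is a minimal CPU-only sketch of that pattern, with a linear layer standing in for the trained model; `toy_model`, `toy_test_x`, and `test_loader` are illustrative names.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)

toy_model = torch.nn.Linear(3, 1)     # stand-in for the trained model
toy_test_x = torch.randn(10, 3)       # stand-in for test_x

# No shuffling, so predictions stay aligned with the test targets.
test_loader = DataLoader(TensorDataset(toy_test_x), batch_size=4, shuffle=False)

toy_model.eval()
means = []
with torch.no_grad():
    for (x_batch,) in test_loader:
        # With a GP model one would take the `.mean` of the returned
        # distribution here; the linear stand-in returns values directly.
        means.append(toy_model(x_batch).squeeze(-1))
batched_mean = torch.cat(means)

with torch.no_grad():
    full_mean = toy_model(toy_test_x).squeeze(-1)
print(torch.allclose(batched_mean, full_mean, atol=1e-6))
```

Batching like this recovers per-point predictions only. Covariances between test points that land in different minibatches are not computed.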

In [8]:
model.eval()
likelihood.eval()
with torch.no_grad(), gpytorch.settings.use_toeplitz(False):
    preds = model(test_x)

In [9]:
print('Test MAE: {}'.format(torch.mean(torch.abs(preds.mean - test_y))))

Test MAE: 0.4752172827720642
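
The metric reported above is the mean absolute error between the predictive mean and the test targets. A tiny worked example with made-up numbers shows the computation:

```python
import torch

toy_pred = torch.tensor([1.0, 2.0, 3.0])   # made-up predictive means
toy_true = torch.tensor([1.5, 2.0, 2.0])   # made-up targets

# Absolute errors are 0.5, 0.0 and 1.0, so their mean is 0.5.
mae = torch.mean(torch.abs(toy_pred - toy_true))
print(mae.item())   # 0.5
```

The error is expressed in the units of the targets as stored in `song.mat`.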
