# Large Scale Stochastic Variational Deep Kernel Learning Regression

## Overview

This notebook shows how to combine Deep Kernel Learning (DKL) with SKI-based stochastic variational regression to train quickly with minibatches on the `song` UCI dataset, which has over 500,000 examples in 90 dimensions.

Stochastic variational inference has several major advantages over exact GP regression. Most notably, the ELBO used for optimization decomposes into a sum over data points, so stochastic gradient descent techniques can be applied. See https://arxiv.org/pdf/1411.2005.pdf and https://arxiv.org/pdf/1611.00336.pdf for the technical details.
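To make the decomposition concrete, here is a sketch of the usual sparse variational bound. The symbols are ours rather than the notebook's: $\mathbf{u}$ are the inducing values, $q(\mathbf{u})$ is the variational distribution, $N$ is the number of training points, and $B$ is a minibatch of indices.

$$
\mathcal{L} = \sum_{i=1}^{N} \mathbb{E}_{q(f_i)}\big[\log p(y_i \mid f_i)\big] - \mathrm{KL}\big[q(\mathbf{u}) \,\|\, p(\mathbf{u})\big]
\;\approx\; \frac{N}{|B|} \sum_{i \in B} \mathbb{E}_{q(f_i)}\big[\log p(y_i \mid f_i)\big] - \mathrm{KL}\big[q(\mathbf{u}) \,\|\, p(\mathbf{u})\big]
$$

Because the data term is a sum over points, the rescaled minibatch sum is an unbiased estimate of it when $B$ is sampled uniformly. The `n_data` argument passed to the loss object later in the notebook presumably plays the role of $N$.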

In [1]:
import math
import torch
import gpytorch
from matplotlib import pyplot as plt
from torch import nn, optim
from torch.autograd import Variable
from gpytorch.kernels import RBFKernel, GridInterpolationKernel
from gpytorch.means import ConstantMean
from gpytorch.likelihoods import GaussianLikelihood
from gpytorch.random_variables import GaussianRandomVariable

# Make plots inline
%matplotlib inline


## Loading Data

For this example notebook, we'll be using the `song` UCI dataset used in the paper. Running the next cell downloads a copy of the dataset that has already been scaled and normalized appropriately. We split the data in order, without shuffling: the first 90% is used for training and the last 10% for testing.

**Note**: Running the next cell will attempt to download a **~340 MB** file to the current directory.

In [3]:
import urllib.request
import os.path
from scipy.io import loadmat
from math import floor

if not os.path.isfile('song.mat'):
    print('Downloading \'song\' UCI dataset...')
    urllib.request.urlretrieve('https://www.dropbox.com/s/mg91x4c0muatanp/song.mat?dl=1', 'song.mat')
    
data = torch.Tensor(loadmat('song.mat')['data'])
X = data[:, :-1]
X = X - X.min(0)[0]
X = 2 * (X / X.max(0)[0]) - 1
y = data[:, -1]

# Use the first 90% of the data for training, and the last 10% for testing.
train_n = int(floor(0.9*len(X)))

train_x = X[:train_n, :].contiguous().cuda()
train_y = y[:train_n].contiguous().cuda()

test_x = X[train_n:, :].contiguous().cuda()
test_y = y[train_n:].contiguous().cuda()
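
As a minimal sketch of the preprocessing in the cell above, the snippet below applies the same min-max scaling to `[-1, 1]` and the same ordered split (with the 0.9 fraction from the code) to a small random tensor. The `toy_*` names and the CPU fallback via `torch.cuda.is_available()` are our own additions for illustration; the notebook itself calls `.cuda()` unconditionally.

```python
import math
import torch

# Toy stand-in for the (rows, features + target) matrix loaded from song.mat.
torch.manual_seed(0)
toy_data = torch.randn(10, 4)

toy_X = toy_data[:, :-1]
toy_y = toy_data[:, -1]

# Min-max scale each feature column to [-1, 1], as in the cell above.
toy_X = toy_X - toy_X.min(0)[0]
toy_X = 2 * (toy_X / toy_X.max(0)[0]) - 1

# Ordered (unshuffled) split using the same 0.9 fraction as the cell above.
toy_train_n = int(math.floor(0.9 * len(toy_X)))

# Assumption: fall back to the CPU when no GPU is present.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
toy_train_x = toy_X[:toy_train_n].contiguous().to(device)
toy_test_x = toy_X[toy_train_n:].contiguous().to(device)

print(toy_train_x.shape, toy_test_x.shape)
```

Note that an ordered split like this only gives a fair test set if the rows are not sorted in a way that correlates with the target.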

## Creating a DataLoader

The next step is to create a torch `DataLoader` that serves random minibatches of the data. This uses the standard `TensorDataset` and `DataLoader` classes provided by PyTorch.

We use a fairly large batch size of 1024 to make optimization run faster, but you can change it to suit your hardware.
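The batch size determines how many optimizer steps make up one epoch. A `DataLoader` with its default `drop_last=False` keeps the final partial batch, so the count is the ceiling of the dataset size divided by the batch size. The helper name and the example sizes below are hypothetical:

```python
import math

def num_minibatches(n_examples, batch_size):
    """Number of batches per epoch when the last partial batch is kept."""
    return math.ceil(n_examples / batch_size)

# Hypothetical sizes for illustration only.
print(num_minibatches(2500, 1024))   # 3
print(num_minibatches(2048, 1024))   # 2
```

The `[k/453]` counter in the training log later in the notebook is this number for the actual training set.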

In [4]:
from torch.utils.data import TensorDataset, DataLoader
train_dataset = TensorDataset(train_x, train_y)
train_loader = DataLoader(train_dataset, batch_size=1024, shuffle=True)

## Defining the DKL Feature Extractor

Next, we define the neural network feature extractor used to define the deep kernel. In this case, we use a fully connected network with the architecture `d -> 1000 -> 500 -> 50 -> 2`, as described in the original DKL paper. All of the code below uses standard PyTorch implementations of neural network layers.

In [5]:
data_dim = train_x.size(-1)

class LargeFeatureExtractor(nn.Sequential):           
    def __init__(self):                                      
        super(LargeFeatureExtractor, self).__init__()        
        self.add_module('linear1', nn.Linear(data_dim, 1000))
        self.add_module('relu1', nn.ReLU())                  
        self.add_module('linear2', nn.Linear(1000, 500))     
        self.add_module('relu2', nn.ReLU())                  
        self.add_module('linear3', nn.Linear(500, 50))       
        self.add_module('relu3', nn.ReLU())                  
        self.add_module('linear4', nn.Linear(50, 2))         
                                                             
feature_extractor = LargeFeatureExtractor().cuda()
# num_features is the number of final features extracted by the neural network, in this case 2.
num_features = 2
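
As a quick sanity check of the layer sizes, the sketch below rebuilds the same architecture with the `nn.Sequential` constructor and pushes a random batch through it on the CPU. The input width of 90 is taken from the dataset description in the overview; the variable names and the random batch are assumptions made for illustration.

```python
import torch
from torch import nn

# Same layer sizes as the notebook's LargeFeatureExtractor.
input_dim = 90
extractor = nn.Sequential(
    nn.Linear(input_dim, 1000), nn.ReLU(),
    nn.Linear(1000, 500), nn.ReLU(),
    nn.Linear(500, 50), nn.ReLU(),
    nn.Linear(50, 2),
)

# A random batch of 8 inputs, used only to check output shapes.
batch = torch.randn(8, input_dim)
with torch.no_grad():
    features = extractor(batch)

# Total number of trainable weights and biases.
n_params = sum(p.numel() for p in extractor.parameters())
print(features.shape)   # torch.Size([8, 2])
print(n_params)         # 616652
```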

## Defining the GP Regression Layer

We now define the GP regression module that, intuitively, will act as the final "layer" of our neural network. Because we are doing variational inference and *not* exact inference, we extend one of the `VariationalGP` models rather than `ExactGP`. Specifically, because we'd still like to use SKI as a kernel approximation, we use `GridInducingVariationalGP`.

Because the feature extractor we defined above extracts two features, we'll need to define our grid bounds over two dimensions.

See the CIFAR and MNIST examples for using an `AdditiveGridInducingVariationalGP`, which additionally assumes that the kernel decomposes additively. This is a strong modelling assumption, but it allows us to use many more output features from the neural network.
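To see why the number of extracted features matters, the sketch below builds a plain Cartesian-product grid with `grid_size` points per dimension and counts the points. This is only an illustration of the scaling; the `grid_points` helper is hypothetical, and the library may place or pad its grid differently.

```python
from itertools import product

def grid_points(grid_size, grid_bounds):
    """Cartesian-product grid with grid_size evenly spaced points per dimension."""
    axes = []
    for lower, upper in grid_bounds:
        step = (upper - lower) / (grid_size - 1)
        axes.append([lower + i * step for i in range(grid_size)])
    return list(product(*axes))

points = grid_points(20, [(-1, 1), (-1, 1)])
print(len(points))   # 400 = 20 ** 2
print(20 ** 10)      # grid size if the network emitted 10 features instead
```

With two features the grid has 400 points, but the count grows exponentially with the number of dimensions, which is what the additive decomposition avoids.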

In [6]:
class GPRegressionLayer(gpytorch.models.GridInducingVariationalGP):
    def __init__(self, grid_size=20, grid_bounds=[(-1, 1), (-1, 1)]):
        super(GPRegressionLayer, self).__init__(grid_size=grid_size, grid_bounds=grid_bounds)
        self.mean_module = ConstantMean()
        self.covar_module = RBFKernel()
        self.register_parameter(
            name="log_outputscale",
            parameter=torch.nn.Parameter(torch.Tensor([0]))
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x) * self.log_outputscale.exp()
        return GaussianRandomVariable(mean_x, covar_x)

## Defining the DKL Model

With the feature extractor and GP regression layer defined, we can now define our full model. To do this, we simply create a module whose `forward()` method passes the data first through the feature extractor, and then through the GP regression layer.

The only other interesting feature of the model below is that we use a helper function, `scale_to_bounds`, to ensure that the features extracted by the neural network fit within the grid bounds used for SKI.
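The idea behind such a rescaling step can be sketched in plain PyTorch. The function below is a hypothetical stand-in, not the library's implementation: it linearly maps the smallest and largest feature values onto the lower and upper bounds, whereas the real helper may, for example, leave a margin inside the bounds.

```python
import torch

def rescale_to_bounds(x, lower, upper):
    """Hypothetical stand-in: linearly map the global min/max of x to [lower, upper]."""
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min) * (upper - lower) + lower

raw = torch.tensor([[-3.0, 0.5], [2.0, 7.0]])
scaled = rescale_to_bounds(raw, -1.0, 1.0)
print(scaled.min().item(), scaled.max().item())   # -1.0 1.0
```

Because the minimum and maximum are recomputed for every input tensor, a rescaling of this kind depends on the batch it is applied to.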

In [7]:
class DKLModel(gpytorch.Module):
    def __init__(self, feature_extractor, n_features, grid_bounds=(-1., 1.)):
        super(DKLModel, self).__init__()
        self.feature_extractor = feature_extractor
        self.gp_layer = GPRegressionLayer()
        self.grid_bounds = grid_bounds
        self.n_features = n_features

    def forward(self, x):
        features = self.feature_extractor(x)
        features = gpytorch.utils.scale_to_bounds(features, self.grid_bounds[0], self.grid_bounds[1])
        res = self.gp_layer(features)
        return res

model = DKLModel(feature_extractor, n_features=num_features).cuda()
likelihood = gpytorch.likelihoods.GaussianLikelihood().cuda()

## Training the Model

The cell below trains the DKL model above, learning both the hyperparameters of the Gaussian process **and** the parameters of the neural network in an end-to-end fashion using Type-II MLE.

Unlike when using the exact GP marginal log likelihood, performing variational inference allows us to make use of stochastic optimization techniques. For this example, we'll do one epoch of training. Given the small size of the neural network relative to the size of the dataset, this should be sufficient to achieve comparable accuracy to what was observed in the DKL paper.

The optimization loop differs from the one in our simpler tutorials in that it loops over both a number of epochs *and* the minibatches of the data. However, the basic process is the same: for each minibatch, we run a forward pass through the model, compute the loss (the negative of the `VariationalMarginalLogLikelihood`, i.e. the negative ELBO), call `backward()`, and take an optimization step.
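The nested structure described above can be sketched without any GP machinery. In the toy example below, a linear model and a mean-squared-error loss stand in for the DKL model and the negative ELBO; all `toy_*` names and the data are assumptions made for illustration.

```python
import torch
from torch import nn
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)

# Toy linear-regression data standing in for the real training set.
toy_x = torch.randn(256, 3)
toy_y = toy_x @ torch.tensor([1.0, -2.0, 0.5]) + 0.01 * torch.randn(256)
toy_loader = DataLoader(TensorDataset(toy_x, toy_y), batch_size=64, shuffle=True)

toy_model = nn.Linear(3, 1)
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()  # stand-in for the negative ELBO

with torch.no_grad():
    initial_loss = loss_fn(toy_model(toy_x).squeeze(-1), toy_y).item()

n_steps = 0
for epoch in range(5):                      # outer loop: epochs
    for x_batch, y_batch in toy_loader:     # inner loop: minibatches
        toy_optimizer.zero_grad()
        loss = loss_fn(toy_model(x_batch).squeeze(-1), y_batch)
        loss.backward()
        toy_optimizer.step()
        n_steps += 1

with torch.no_grad():
    final_loss = loss_fn(toy_model(toy_x).squeeze(-1), toy_y).item()

print(n_steps, initial_loss, final_loss)
```

The total number of optimizer steps is the number of epochs times the number of minibatches per epoch, here 5 × 4 = 20.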

In [8]:
model.train()
likelihood.train()

# We'll do 1 epoch of training in this tutorial
num_epochs = 1

# We use SGD here, rather than Adam. Empirically, we find that SGD is better for variational regression
optimizer = torch.optim.SGD([
    {'params': model.parameters()},
    {'params': likelihood.parameters()},
], lr=0.1)

# Our loss object. We're using the VariationalMarginalLogLikelihood, which essentially just computes the ELBO
mll = gpytorch.mlls.VariationalMarginalLogLikelihood(likelihood, model, n_data=train_y.size(0))

for i in range(num_epochs):
    # Within each epoch, we go over each minibatch of data
    for minibatch_i, (x_batch, y_batch) in enumerate(train_loader):
        optimizer.zero_grad()
        
        # Because the grid is relatively small, we turn off Toeplitz matrix multiplication and perform the
        # multiplications directly, which we find to be more efficient for very small grids.
        with gpytorch.settings.use_toeplitz(False), gpytorch.beta_features.diagonal_correction():
            output = model(x_batch)
            loss = -mll(output, y_batch)
            print('Epoch %d [%d/%d] - Loss: %.3f' % (i + 1, minibatch_i, len(train_loader), loss.item()))

        # The actual optimization step
        loss.backward()
        optimizer.step()

Epoch 1 [0/453] - Loss: 4.055
Epoch 1 [1/453] - Loss: 3.407
Epoch 1 [2/453] - Loss: 3.707
Epoch 1 [3/453] - Loss: 3.544
Epoch 1 [4/453] - Loss: 3.433
Epoch 1 [5/453] - Loss: 3.351
Epoch 1 [6/453] - Loss: 3.295
Epoch 1 [7/453] - Loss: 3.238
Epoch 1 [8/453] - Loss: 3.185
Epoch 1 [9/453] - Loss: 3.127
Epoch 1 [10/453] - Loss: 3.062
Epoch 1 [11/453] - Loss: 2.975
Epoch 1 [12/453] - Loss: 2.804
Epoch 1 [13/453] - Loss: 2.516
Epoch 1 [14/453] - Loss: 2.281
Epoch 1 [15/453] - Loss: 2.172
Epoch 1 [16/453] - Loss: 2.081
Epoch 1 [17/453] - Loss: 2.022
Epoch 1 [18/453] - Loss: 1.968
Epoch 1 [19/453] - Loss: 1.929
Epoch 1 [20/453] - Loss: 1.884
Epoch 1 [21/453] - Loss: 1.856
Epoch 1 [22/453] - Loss: 1.847
Epoch 1 [23/453] - Loss: 1.826
Epoch 1 [24/453] - Loss: 1.814
Epoch 1 [25/453] - Loss: 1.795
Epoch 1 [26/453] - Loss: 1.778
Epoch 1 [27/453] - Loss: 1.770
Epoch 1 [28/453] - Loss: 1.754
Epoch 1 [29/453] - Loss: 1.746
Epoch 1 [30/453] - Loss: 1.730
Epoch 1 [31/453] - Loss: 1.720
Epoch 1 [32/453] -

Epoch 1 [261/453] - Loss: 0.918
Epoch 1 [262/453] - Loss: 1.023
Epoch 1 [263/453] - Loss: 0.959
Epoch 1 [264/453] - Loss: 0.925
Epoch 1 [265/453] - Loss: 0.964
Epoch 1 [266/453] - Loss: 0.892
Epoch 1 [267/453] - Loss: 1.163
Epoch 1 [268/453] - Loss: 0.956
Epoch 1 [269/453] - Loss: 0.923
Epoch 1 [270/453] - Loss: 0.955
Epoch 1 [271/453] - Loss: 0.899
Epoch 1 [272/453] - Loss: 0.925
Epoch 1 [273/453] - Loss: 0.914
Epoch 1 [274/453] - Loss: 0.912
Epoch 1 [275/453] - Loss: 0.945
Epoch 1 [276/453] - Loss: 0.934
Epoch 1 [277/453] - Loss: 0.897
Epoch 1 [278/453] - Loss: 0.939
Epoch 1 [279/453] - Loss: 0.938
Epoch 1 [280/453] - Loss: 0.871
Epoch 1 [281/453] - Loss: 0.923
Epoch 1 [282/453] - Loss: 0.917
Epoch 1 [283/453] - Loss: 0.940
Epoch 1 [284/453] - Loss: 0.919
Epoch 1 [285/453] - Loss: 0.915
Epoch 1 [286/453] - Loss: 0.953
Epoch 1 [287/453] - Loss: 0.952
Epoch 1 [288/453] - Loss: 0.920
Epoch 1 [289/453] - Loss: 0.934
Epoch 1 [290/453] - Loss: 0.891
Epoch 1 [291/453] - Loss: 0.914
Epoch 1 

KeyboardInterrupt: 

## Making Predictions

The next cell computes the predictive distribution for the test set using the standard SKI testing code, with no acceleration or precomputation. The result carries both the predictive covariance and the predictive mean, the latter available as `preds.mean()`. Because the test set is substantially smaller than the training set, we don't need to make predictions in minibatches here, although our other tutorials demonstrate how to do this (for example, see the CIFAR tutorial).
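If the test set did not fit in memory at once, predictions could be made in chunks and concatenated. The sketch below shows that pattern with a plain linear layer standing in for the DKL model, followed by the mean absolute error used in the final cell. The names `toy_predictor`, `held_out_x`, and `held_out_y` are hypothetical, and with the real model each chunk would return a distribution whose mean is collected instead.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in predictor and held-out data for illustration only.
toy_predictor = nn.Linear(3, 1)
held_out_x = torch.randn(10, 3)
held_out_y = torch.randn(10)

toy_predictor.eval()
chunks = []
with torch.no_grad():
    for x_chunk in torch.split(held_out_x, 4):   # chunks of 4, 4 and 2 rows
        chunks.append(toy_predictor(x_chunk).squeeze(-1))
pred_mean = torch.cat(chunks)

# Mean absolute error between predictions and targets.
mae = torch.mean(torch.abs(pred_mean - held_out_y))
print('Test MAE: {}'.format(mae.item()))
```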

In [9]:
model.eval()
likelihood.eval()
with torch.no_grad(), gpytorch.settings.use_toeplitz(False):
    preds = model(test_x)

In [10]:
print('Test MAE: {}'.format(torch.mean(torch.abs(preds.mean() - test_y))))

Test MAE: 0.47488638758659363
