# Exact GP Regression with Multiple GPUs and Kernel Partitioning

In this notebook, we'll demonstrate training exact GPs on large datasets using two key features from the paper https://arxiv.org/abs/1903.08114: 

1. The ability to distribute the kernel matrix across multiple GPUs, for additional parallelism.
2. Partitioning the kernel into chunks computed on-the-fly when performing each MVM to reduce memory usage.
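
To make the second point concrete, here is a minimal sketch of a partitioned kernel matrix-vector multiply (MVM). It is written in plain NumPy for readability and is not GPyTorch's implementation. The names `rbf_kernel` and `partitioned_kernel_mvm` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0):
    """Dense RBF kernel block between the rows of x1 and the rows of x2."""
    sq_dist = (
        (x1 ** 2).sum(axis=1)[:, None]
        - 2.0 * x1 @ x2.T
        + (x2 ** 2).sum(axis=1)[None, :]
    )
    # Clamp tiny negative values caused by floating-point round-off.
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0) / lengthscale ** 2)

def partitioned_kernel_mvm(x, v, partition_size, lengthscale=1.0):
    """Compute K(x, x) @ v one row block at a time.

    Only a (partition_size x n) block of K exists in memory at any moment.
    """
    n = x.shape[0]
    out = np.empty(n, dtype=v.dtype)
    for start in range(0, n, partition_size):
        stop = min(start + partition_size, n)
        block = rbf_kernel(x[start:stop], x, lengthscale)  # built on the fly
        out[start:stop] = block @ v                        # block is then discarded
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((200, 3))
v = rng.standard_normal(200)

dense = rbf_kernel(x, x) @ v                      # materializes all of K
chunked = partitioned_kernel_mvm(x, v, partition_size=64)
print(np.allclose(dense, chunked))
```

The trade-off is that each kernel block is recomputed for every MVM, so smaller partitions save memory at the cost of extra computation.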

We'll be using the `protein` dataset, which has about 37000 training examples. The techniques in this notebook can be applied to much larger datasets, but the training time required will depend on the computational resources you have available: both the number of GPUs available and the amount of memory they have (which determines the partition size) have a significant effect on training time.

In [1]:
import math
import torch
import gpytorch
import sys
from matplotlib import pyplot as plt
sys.path.append('../')
from LBFGS import FullBatchLBFGS

%matplotlib inline
%load_ext autoreload
%autoreload 2

In [2]:
torch.__version__

'1.1.0.dev20190411'

## Downloading Data
We will be using the UCI Protein dataset, which contains more than 40,000 data points in total. The next cell downloads the dataset from Google Drive and loads it.

In [3]:
import os
import urllib.request
from scipy.io import loadmat
dataset = 'protein'
if not os.path.isfile(f'{dataset}.mat'):
    print(f'Downloading \'{dataset}\' UCI dataset...')
    urllib.request.urlretrieve('https://drive.google.com/uc?export=download&id=1nRb8e7qooozXkNghC5eQS0JeywSXGX2S',
                               f'{dataset}.mat')
    
data = torch.Tensor(loadmat(f'{dataset}.mat')['data'])

## Normalization and Train/Test Splits

In the next cell, we split the data 80/20 as train and test, and do some basic z-score feature normalization.

In [4]:
import numpy as np

N = data.shape[0]
# make train/test split (first 80% of rows for training, the rest for testing)
n_train = int(0.8 * N)
train_x, train_y = data[:n_train, :-1], data[:n_train, -1]
test_x, test_y = data[n_train:, :-1], data[n_train:, -1]

# normalize features
mean = train_x.mean(dim=-2, keepdim=True)
std = train_x.std(dim=-2, keepdim=True) + 1e-6 # prevent dividing by 0
train_x = (train_x - mean) / std
test_x = (test_x - mean) / std

# normalize labels (using training-set statistics only)
mean, std = train_y.mean(), train_y.std()
train_y = (train_y - mean) / std
test_y = (test_y - mean) / std

# make contiguous
train_x, train_y = train_x.contiguous(), train_y.contiguous()
test_x, test_y = test_x.contiguous(), test_y.contiguous()

output_device = torch.device('cuda:0')

train_x, train_y = train_x.to(output_device), train_y.to(output_device)
test_x, test_y = test_x.to(output_device), test_y.to(output_device)

In [5]:
print(train_x.size(-2))

36584
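
For a sense of scale, the back-of-the-envelope calculation below estimates how much memory the kernel matrix needs at this training-set size. It assumes `float32` storage (4 bytes per entry, the default for `torch.Tensor`) and assumes that one partition holds `checkpoint_size` rows of the kernel matrix, using the `checkpoint_size=15000` value passed to `train` later in this notebook. `kernel_bytes` is a hypothetical helper, not part of GPyTorch.

```python
def kernel_bytes(n_rows, n_cols, bytes_per_element=4):
    """Memory needed for a dense n_rows x n_cols block of the kernel matrix."""
    return n_rows * n_cols * bytes_per_element

GIB = 2 ** 30
n_train = 36584

full = kernel_bytes(n_train, n_train)    # the whole kernel matrix
chunk = kernel_bytes(15000, n_train)     # one partition of 15000 rows

print(f"full kernel:   {full / GIB:.2f} GiB")   # 4.99 GiB
print(f"one partition: {chunk / GIB:.2f} GiB")  # 2.04 GiB
```

These figures cover only the kernel entries themselves. Gradients, preconditioner storage, and solver workspace come on top, which is why the partition size usually has to be chosen well below what the raw arithmetic suggests.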


## How many GPUs do you want to use?

In the next cell, set the `n_devices` variable to the number of GPUs you'd like to use. This run uses 2 GPUs; to use every device available, set `n_devices = torch.cuda.device_count()` instead.

In [6]:
n_devices = 2  # torch.cuda.device_count()
print('Planning to run on {} GPUs.'.format(n_devices))

Planning to run on 2 GPUs.


## GP Model + Training Code

In the next cells we define our GP model and training code. On the model side, the only difference from the Simple GP tutorials is the use of `MultiDeviceKernel` to wrap the base covariance module, which allows multiple GPUs to be used behind the scenes. Training uses the `BayesOptDerivativeOptimizer` imported from the local `bayesopt` module.

In [7]:
import sys
sys.path.append('../')
from bayesopt import BayesOptDerivativeOptimizer, _init_model_from_vector

In [8]:
class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood, n_devices):
        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        base_covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel(ard_num_dims=None))
        
        if n_devices > 1:
            self.covar_module = gpytorch.kernels.MultiDeviceKernel(
                base_covar_module, device_ids=range(n_devices),
                output_device=output_device
            )
        else:
            self.covar_module = base_covar_module
    
    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

def train(train_x,
          train_y,
          n_devices,
          output_device,
          checkpoint_size,
          preconditioner_size,
          n_training_iter,
          use_lbfgs=False,
):
    likelihood = gpytorch.likelihoods.GaussianLikelihood().to(output_device)
    model = ExactGPModel(train_x, train_y, likelihood, n_devices).to(output_device)
    model.train()
    likelihood.train()
    
    optimizer = BayesOptDerivativeOptimizer(
        list(model.parameters()),
        optim_lbs=[-4., -2., -2., -2.],
        optim_ubs=[1., 2., 2., 2.],
    )
    
    mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
    def closure(param_vec):
        _init_model_from_vector(model, param_vec.to(model.train_targets.device).to(model.train_targets.dtype))
        with gpytorch.beta_features.checkpoint_kernel(checkpoint_size), \
             gpytorch.settings.max_preconditioner_size(preconditioner_size):
            output = model(train_x)
            loss = -mll(output, train_y)
            loss.backward()
        return loss
    
    for i in range(n_training_iter):
        optimizer.step(closure)
        optimizer.zero_grad()
        print(f'** Iteration {i}, Best value: {optimizer.train_y[:, 0].min()}')

    return model, likelihood
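
To give some intuition for what distributing the kernel across GPUs involves, the sketch below assigns each device a contiguous range of kernel rows. This is a conceptual illustration only: it assumes the work is split by rows, and `split_rows_across_devices` is a hypothetical helper rather than anything `MultiDeviceKernel` exposes.

```python
def split_rows_across_devices(n_rows, n_devices):
    """Return one (start, stop) row range per device.

    Range sizes differ by at most one row, so the work is balanced.
    """
    base, extra = divmod(n_rows, n_devices)
    ranges = []
    start = 0
    for device in range(n_devices):
        size = base + (1 if device < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

print(split_rows_across_devices(36584, 2))  # [(0, 18292), (18292, 36584)]
```

Under this assumption, each device would build and multiply only its own rows of the kernel, and the partial results would be gathered on `output_device`.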

## Training

In [14]:
preconditioner_size = 100

In [15]:
model, likelihood = train(train_x, train_y,
                          n_devices=n_devices, output_device=output_device,
                          checkpoint_size=15000,
                          preconditioner_size=preconditioner_size,
                          n_training_iter=20)

tensor([-3.8578,  1.7209, -0.7383,  1.0251])


  " a gpytorch.settings.max_cg_iterations(value) context.".format(k + 1, residual_norm.mean(), tolerance)


RuntimeError: CUDA out of memory. Tried to allocate 1.02 GiB (GPU 0; 10.73 GiB total capacity; 6.06 GiB already allocated; 523.56 MiB free; 1.45 GiB cached)
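
An out-of-memory error like the one above usually means the partition size is too large for the available GPU memory. One possible remedy, sketched below, is to retry with progressively smaller partitions until a step succeeds. Everything here is an assumption made for illustration: `find_working_partition_size` is a hypothetical helper, and `run_fn` stands in for a callable that runs one training step with the given partition size and raises `RuntimeError` when memory runs out.

```python
def find_working_partition_size(run_fn, n_train, max_halvings=8):
    """Halve the partition size until run_fn(partition_size) succeeds."""
    size = n_train
    for _ in range(max_halvings + 1):
        try:
            run_fn(size)
            return size
        except RuntimeError as err:
            if "out of memory" not in str(err):
                raise  # unrelated failure: do not mask it
            size = max(1, size // 2)
    raise RuntimeError("no partition size fit in memory")

# Simulated training step: pretend anything above 10000 rows runs out of memory.
def fake_step(partition_size):
    if partition_size > 10000:
        raise RuntimeError("CUDA out of memory (simulated)")

print(find_working_partition_size(fake_step, n_train=36584))  # 9146
```

In a real run it would be sensible to call `torch.cuda.empty_cache()` between attempts so that memory held by the failed attempt is released before retrying.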

## Testing: Computing test-time caches

In [9]:
likelihood = gpytorch.likelihoods.GaussianLikelihood().to(output_device)
model = ExactGPModel(train_x, train_y, likelihood, n_devices).to(output_device)

In [10]:
params = torch.load('model_params.pth')
model.load_state_dict(params)

IncompatibleKeys(missing_keys=[], unexpected_keys=[])

In [11]:
# Get into evaluation (predictive posterior) mode
model.eval()
likelihood.eval()

with torch.no_grad(), gpytorch.settings.fast_pred_var(), gpytorch.beta_features.checkpoint_kernel(5000):
    latent_pred = model(test_x)

## Testing: Computing predictions

In [13]:
test_rmse = torch.sqrt(torch.mean(torch.pow(latent_pred.mean - test_y, 2)))
print(f"Test RMSE: {test_rmse.item()}")

Test RMSE: 0.5815756916999817


In [14]:
import gc  # not imported earlier in the notebook

gc.collect()
torch.cuda.empty_cache()

In [13]:
torch.save(model.state_dict(), 'model_params.pth')