# Multitask GP Regression

## Introduction

This notebook demonstrates how to perform standard (Kronecker) multitask regression.

This differs from the [Hadamard multitask example](./Hadamard_Multitask_GP_Regression.ipynb) in one key way:
- Here, we assume that we want to learn **all tasks per input**. (The kernel that we learn is expressed as a Kronecker product of an input kernel and a task kernel.)
- In the other notebook, we assume that we want to learn one task per input. For each input, we specify which task we care about. (The kernel in that notebook is the Hadamard product of an input kernel and a task kernel.)

Multitask regression, first introduced in [this paper](https://papers.nips.cc/paper/3189-multi-task-gaussian-process-prediction.pdf), learns the similarities between the outputs simultaneously. It's useful when you are performing regression on multiple functions that share the same inputs, especially if they have similarities (such as being sinusoidal).

Given inputs $x$ and $x'$, and tasks $i$ and $j$, the covariance between two datapoints and two tasks is given by

\begin{equation*}
  k([x, i], [x', j]) = k_\text{inputs}(x, x') \cdot k_\text{tasks}(i, j)
\end{equation*}

where $k_\text{inputs}$ is a standard kernel (e.g. RBF) that operates on the inputs, and
$k_\text{tasks}$ is a special kernel - the `IndexKernel` - which is a lookup table containing the inter-task covariance.
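
As a concrete illustration of the product form above, the following minimal sketch builds the full multitask covariance with plain PyTorch. The helper `rbf`, its lengthscale, and the task covariance matrix `B` are illustrative assumptions rather than values from this notebook, and the row/column ordering (inputs first, tasks second) is a choice made for this sketch.

```python
import torch


def rbf(x1, x2, lengthscale=0.2):
    """Squared-exponential kernel on 1-D inputs (illustrative lengthscale)."""
    diff = x1.unsqueeze(-1) - x2.unsqueeze(-2)
    return torch.exp(-0.5 * (diff / lengthscale) ** 2)


x = torch.linspace(0, 1, 4)           # 4 inputs
B = torch.tensor([[1.0, 0.6],         # hypothetical 2 x 2 inter-task covariance
                  [0.6, 1.5]])
num_tasks = B.shape[0]

K_inputs = rbf(x, x)                  # 4 x 4 input covariance
K_full = torch.kron(K_inputs, B)      # 8 x 8; rows ordered by (input, task)

# Covariance between (x_a, task i) and (x_b, task j)
a, i, b, j = 1, 0, 3, 1
entry = K_full[a * num_tasks + i, b * num_tasks + j]
print(entry.item(), (K_inputs[a, b] * B[i, j]).item())
```

Both printed numbers agree, since each entry of the Kronecker product is exactly $k_\text{inputs}(x, x')$ times the looked-up task covariance.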

In [None]:
import math
import torch
import gpytorch
from matplotlib import pyplot as plt

%matplotlib inline
%load_ext autoreload
%autoreload 2

### Set up training data

In the next cell, we set up the training data for this example. We use 100 regularly spaced points on [0, 1], evaluate each function on them, and add Gaussian noise to get the training labels.

We'll have two functions - a sine function (y1) and a cosine function (y2).

For MTGPs, our `train_targets` will actually have two dimensions, with the second dimension corresponding to the different tasks.

In [None]:
train_x = torch.linspace(0, 1, 100)

train_y = torch.stack([
    torch.sin(train_x * (2 * math.pi)) + torch.randn(train_x.size()) * 0.2,
    torch.cos(train_x * (2 * math.pi)) + torch.randn(train_x.size()) * 0.2,
], -1)

In [None]:
train_y.shape

### Defining the models


In [None]:

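class SpectralModel(gpytorch.models.ExactGP):
    """Exact GP for a single task whose kernel is a `SpectralGPKernel`."""

    def __init__(self, train_x, train_y, likelihood, **kwargs):
        # Requires the external `spectralgp` package
        from spectralgp.kernels import SpectralGPKernel

        super(SpectralModel, self).__init__(train_x, train_y, likelihood)

        self.mean_module = gpytorch.means.ConstantMean()

        # kwargs (e.g. latent_mod, latent_lh) are forwarded to the kernel
        self.covar_module = SpectralGPKernel(**kwargs)

        # Data-dependent initialization of the kernel and its latent model
        self.covar_module.initialize_from_data(train_x, train_y, **kwargs)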

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)



##### The latent models

The model provided by `spectralgp.models.SpectralModel` takes in a latent model and latent likelihood to define the GP over the log-spectral density. In the multi-task setting this latent model is shared among tasks, so we only need a single one.

The simplest thing to do here is to let the `SpectralModel` instances handle the data-dependent initialization of the latent model. We therefore define the latent model and likelihood as `None`, then redefine them after the first model is built and pass them through to the next data dimension. This ensures that each output dimension has an appropriately initialized latent model and that the latent model is shared correctly.

In [None]:
latent_lh = None
latent_mod = None
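
The "start with `None`, then pass the initialized object along" pattern can be shown in isolation. The sketch below uses two hypothetical stand-in classes, `ToyLatent` and `ToyTaskModel`, which are not part of `spectralgp`; it only illustrates how the first task triggers initialization and later tasks reuse the same object.

```python
class ToyLatent:
    """Hypothetical stand-in for a latent model; not part of spectralgp."""

    def __init__(self, init_value):
        self.init_value = init_value


class ToyTaskModel:
    """Hypothetical per-task model that builds a latent model only if none is given."""

    def __init__(self, targets, latent=None):
        if latent is None:
            # Data-dependent initialization happens once, on the first task
            latent = ToyLatent(init_value=sum(targets) / len(targets))
        self.latent = latent


targets_per_task = [[0.1, 0.4, 0.3], [1.0, 0.8, 0.9]]
latent = None
models = []
for targets in targets_per_task:
    models.append(ToyTaskModel(targets, latent=latent))
    latent = models[-1].latent  # hand the same object to the next task

# Both task models hold the very same latent object
print(models[0].latent is models[1].latent)
```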

##### The Data Models
Define the lists of data models and likelihoods (one per task), passing in the training data and the shared latent model for each one.

In [None]:
data_lh_list = []
data_mod_list = []
for dim in range(train_y.shape[-1]):
    # One Gaussian likelihood per task, with a tight prior on the noise
    lh = gpytorch.likelihoods.GaussianLikelihood(noise_prior=gpytorch.priors.SmoothedBoxPrior(1e-8, 1e-3))

    data_lh_list.append(lh)
    data_mod_list.append(SpectralModel(train_x, train_y[:, dim], likelihood=lh,
                                       latent_mod=latent_mod, latent_lh=latent_lh))

    # Reuse the latent model initialized by this task for the next task
    latent_mod = data_mod_list[dim].covar_module.latent_mod
    latent_lh = data_mod_list[dim].covar_module.latent_lh

### Defining the Samplers

***Gradient updates*** remain largely the same as in the single-task setting: we need to define a single factory that creates the loss functions for the hyperparameters of the model.

***Elliptical slice sampling*** must be handled differently from the 1-D case, since we need samples of the spectral densities corresponding to the kernel of each output dimension. We therefore make a list of factories, each of which defines a likelihood function of the spectral densities based on the data from that output dimension. Within each dimension the `ess_factory` initialization remains the same, so it's just a simple list comprehension of calls that are exactly like the 1-D case.

***The alternating sampler*** takes in the factories defined above and runs in the same fashion as in the single-task case. For the multi-task case we provide the `spectralgp.samplers.GibbsAlternatingSampler` to handle generating updates to the hyperparameters and latent GP.
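
For readers unfamiliar with elliptical slice sampling, here is a minimal, generic sketch of a single ESS update in plain PyTorch. It is not the `spectralgp` implementation: the function name `elliptical_slice_step`, the toy Gaussian likelihood, and the identity prior covariance are all assumptions made for illustration.

```python
import math
import torch


def elliptical_slice_step(f, prior_chol, log_lik):
    """One ESS update for a latent vector f with prior N(0, L L^T)."""
    nu = prior_chol @ torch.randn(f.shape[0])        # auxiliary draw from the prior
    log_y = log_lik(f) + torch.log(torch.rand(()))   # slice height
    theta = torch.rand(()) * 2 * math.pi             # initial angle
    theta_min, theta_max = theta - 2 * math.pi, theta
    while True:
        f_new = f * torch.cos(theta) + nu * torch.sin(theta)
        if log_lik(f_new) > log_y:
            return f_new
        # Shrink the bracket towards theta = 0 (the current state) and retry
        if theta < 0:
            theta_min = theta
        else:
            theta_max = theta
        theta = theta_min + torch.rand(()) * (theta_max - theta_min)


torch.manual_seed(0)
obs = torch.tensor([0.5, -0.3, 0.8])                 # toy observations


def toy_log_lik(f):
    # Gaussian likelihood with noise std 0.2 (illustrative)
    return -0.5 * ((obs - f) ** 2).sum() / 0.2 ** 2


f = torch.zeros(3)
prior_chol = torch.eye(3)
for _ in range(20):                                  # 20 ESS updates
    f = elliptical_slice_step(f, prior_chol, toy_log_lik)
print(f)
```

In the multi-task setting described above, each output dimension would supply its own likelihood function, while an outer loop alternates these sampling steps with gradient updates of the hyperparameters.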

As usual, we need to choose the total number of iterations as well as the number of SGD updates and ESS samples per iteration.

In [None]:
total_iters = 1
sgd_iters = 5
ess_iters = 20

In [None]:
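import spectralgp

# The sampler and factories are taken from `spectralgp.samplers`, the package
# named in the text above (GPyTorch itself has no `samplers` module).
alt_sampler = spectralgp.samplers.AlternatingSampler(
    data_mod_list, data_lh_list,
    spectralgp.samplers.ss_multmodel_factory,
    [spectralgp.samplers.ess_factory] * train_y.shape[-1],
    numInnerSamples=ess_iters, numOuterSamples=sgd_iters, totalSamples=total_iters,
    num_dims=1, num_tasks=train_y.shape[-1],
)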

In [None]:
%pdb

In [None]:
alt_sampler.run()

## Set up the model

The model should be somewhat similar to the `ExactGP` model in the [simple regression example](../01_Simple_GP_Regression/Simple_GP_Regression.ipynb).

The differences:

1. We're going to wrap `ConstantMean` with a `MultitaskMean`. This makes sure we have a mean function for each task.
2. Rather than just using an `RBFKernel`, we're using it in conjunction with a `MultitaskKernel`. This gives us the covariance function described in the introduction.
3. We're using a `MultitaskMultivariateNormal` and `MultitaskGaussianLikelihood`. This allows us to deal with the predictions/outputs in a nice way. For example, when we call `MultitaskMultivariateNormal.mean`, we get an `n x num_tasks` matrix back.

You may also notice that we don't use a `ScaleKernel`, since the `IndexKernel` will do some scaling for us. (This way we're not overparameterizing the kernel.)
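
To see why a separate `ScaleKernel` would be redundant, it helps to look at how the task covariance is parameterized. GPyTorch's `IndexKernel` represents it as a low-rank factor plus a diagonal, $B = WW^\top + \mathrm{diag}(v)$. The sketch below reproduces that form in plain PyTorch with made-up values for `W` and `v`; with `rank=1`, `W` has a single column.

```python
import torch

num_tasks, rank = 2, 1
W = torch.tensor([[0.9], [0.7]])   # hypothetical num_tasks x rank factor
v = torch.tensor([0.05, 0.10])     # hypothetical positive per-task variances

B = W @ W.T + torch.diag(v)        # num_tasks x num_tasks task covariance

# The diagonal of B sets each task's output scale, which is the role a
# ScaleKernel would otherwise play; the off-diagonal couples the tasks.
print(B)
print(torch.diagonal(B))
```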

In [None]:
class MultitaskGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(MultitaskGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.MultitaskMean(
            gpytorch.means.ConstantMean(), num_tasks=2
        )
        self.covar_module = gpytorch.kernels.MultitaskKernel(
            gpytorch.kernels.RBFKernel(), num_tasks=2, rank=1
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultitaskMultivariateNormal(mean_x, covar_x)

    
likelihood = gpytorch.likelihoods.MultitaskGaussianLikelihood(num_tasks=2)
model = MultitaskGPModel(train_x, train_y, likelihood)

## Train the model hyperparameters


In [None]:
# Find optimal model hyperparameters
model.train()
likelihood.train()

# Use the Adam optimizer
optimizer = torch.optim.Adam([
    {'params': model.parameters()},  # Includes MultitaskGaussianLikelihood parameters
], lr=0.1)

# "Loss" for GPs - the negative marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

n_iter = 50
for i in range(n_iter):
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)
    loss.backward()
    print('Iter %d/%d - Loss: %.3f' % (i + 1, n_iter, loss.item()))
    optimizer.step()

## Make predictions with the model

In [None]:
# Set into eval mode
model.eval()
likelihood.eval()

# Initialize plots
f, (y1_ax, y2_ax) = plt.subplots(1, 2, figsize=(8, 3))

# Make predictions
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    test_x = torch.linspace(0, 1, 51)
    predictions = likelihood(model(test_x))
    mean = predictions.mean
    lower, upper = predictions.confidence_region()
    
# `mean`, `lower` and `upper` each have shape (n_test, num_tasks):
# column 0 holds the predictions for the first task,
# column 1 holds the predictions for the second task

# Plot training data as black stars
y1_ax.plot(train_x.detach().numpy(), train_y[:, 0].detach().numpy(), 'k*')
# Predictive mean as blue line
y1_ax.plot(test_x.numpy(), mean[:, 0].numpy(), 'b')
# Shade in confidence 
y1_ax.fill_between(test_x.numpy(), lower[:, 0].numpy(), upper[:, 0].numpy(), alpha=0.5)
y1_ax.set_ylim([-3, 3])
y1_ax.legend(['Observed Data', 'Mean', 'Confidence'])
y1_ax.set_title('Observed Values (Likelihood)')

# Plot training data as black stars
y2_ax.plot(train_x.detach().numpy(), train_y[:, 1].detach().numpy(), 'k*')
# Predictive mean as blue line
y2_ax.plot(test_x.numpy(), mean[:, 1].numpy(), 'b')
# Shade in confidence 
y2_ax.fill_between(test_x.numpy(), lower[:, 1].numpy(), upper[:, 1].numpy(), alpha=0.5)
y2_ax.set_ylim([-3, 3])
y2_ax.legend(['Observed Data', 'Mean', 'Confidence'])
y2_ax.set_title('Observed Values (Likelihood)')

None
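
The shaded bands in the plots come from `confidence_region()`, which in GPyTorch spans two standard deviations on either side of the mean. A minimal sketch of the same computation, with made-up means and variances standing in for the model's predictions:

```python
import torch

# Made-up (n, num_tasks) predictive means and variances, for illustration only
mean = torch.tensor([[0.0, 1.0], [0.5, 0.8]])
variance = torch.tensor([[0.04, 0.09], [0.01, 0.16]])

stddev = variance.sqrt()
lower = mean - 2 * stddev   # lower edge of the band, per input and task
upper = mean + 2 * stddev   # upper edge of the band, per input and task
print(lower, upper, sep="\n")
```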