# Probabilistic Time Series Analysis

## Week 10: Sparse Gaussian Process Methods

Places where you are supposed to fill in code are marked

    #
    # TODO: some instructions
    # 
    
We will run and discuss the rest of the code if time permits; otherwise, try it out at home and try to answer the questions in the text boxes for yourself.

### Please turn in the code before 12/5/2018 5:20pm. 

### Your work will be evaluated based on the code and plots. You don't need to write down your answers to other questions in the text blocks, just think them over.

### Title your submission file `lab10-student-[YOUR NET ID].ipynb`.

# Setup

In [None]:
import numpy as np
import GPy
import time

%matplotlib inline

# I. Pseudo-Dataset Size Requirement

In [None]:
def uniform_samples(x_min, x_max, n_samples, f, noise_scale):
    """Generates f(x) + noise for x uniformly distributed in [x_min, x_max]."""
    X = np.random.uniform(x_min, x_max, size=(n_samples, 1))
    Y = f(X) + np.random.normal(scale=noise_scale, size=(n_samples, 1))
    return X, Y

def gapped_samples(x_min, x_max, n_samples, f, noise_scale):
    """Generates f(x) + noise for x uniformly distributed in [x_min, x_max], excluding the middle third."""
    third = (x_max - x_min) / 3.0
    half = n_samples // 2  # integer division, so the size is also valid in Python 3
    X = np.concatenate([
        np.random.uniform(x_min, x_min + third, size=(half, 1)),
        np.random.uniform(x_max - third, x_max, size=(n_samples - half, 1))])
    Y = f(X) + np.random.normal(scale=noise_scale, size=X.shape)
    return X, Y

In [None]:
X, Y = uniform_samples(-20.0, 20.0, 100, np.sin, 0.05)

In [None]:
kernel = GPy.kern.RBF(input_dim=1, variance=1., lengthscale=1.)

As a reminder, here is an example of "naive" GP regression, as we studied in the last lab:

In [None]:
m_reg = GPy.models.GPRegression(X, Y, kernel)
_ = m_reg.optimize()
_ = m_reg.plot(plot_density=True)

And here is an example of FITC regression, where the `num_inducing` parameter controls how many points are used in the pseudo-dataset.

In [None]:
m_sparse = GPy.models.SparseGPRegression(X, Y, kernel, num_inducing=10)
m_sparse.inference_method = GPy.inference.latent_function_inference.FITC()
_ = m_sparse.optimize()
_ = m_sparse.plot(plot_density=True)
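
To make the role of the pseudo-dataset more concrete, here is a minimal NumPy sketch of the covariance approximation that FITC is usually described as making: the full kernel matrix $K_{NN}$ is replaced by the low-rank term $Q_{NN} = K_{NM} K_{MM}^{-1} K_{MN}$ plus a diagonal correction that restores the exact variances. This is an illustration under stated assumptions, not GPy's internal implementation: `rbf`, `fitc_covariance`, the `jitter` term, and the evenly spaced inducing inputs `Z_demo` are all hypothetical choices made for the example.

```python
import numpy as np

def rbf(A, B, variance=1.0, lengthscale=1.0):
    """RBF kernel matrix between two sets of 1-D inputs of shape (n, 1) and (m, 1)."""
    sq_dists = (A[:, None, 0] - B[None, :, 0]) ** 2
    return variance * np.exp(-0.5 * sq_dists / lengthscale ** 2)

def fitc_covariance(X, Z, jitter=1e-8):
    """FITC-style approximation of K_NN built from inducing inputs Z (illustrative only)."""
    K_nn = rbf(X, X)
    K_nm = rbf(X, Z)
    K_mm = rbf(Z, Z) + jitter * np.eye(len(Z))  # jitter is an assumption, for numerical stability
    Q_nn = K_nm.dot(np.linalg.solve(K_mm, K_nm.T))  # low-rank part: K_NM K_MM^{-1} K_MN
    # Keep the low-rank off-diagonal entries, but restore the exact diagonal of K_NN.
    return Q_nn + np.diag(np.diag(K_nn - Q_nn))

rng = np.random.RandomState(0)
X_demo = rng.uniform(-20.0, 20.0, size=(50, 1))        # made-up inputs
Z_demo = np.linspace(-20.0, 20.0, 10).reshape(-1, 1)   # M = 10 pseudo-inputs (assumed layout)
K_fitc = fitc_covariance(X_demo, Z_demo)
print(K_fitc.shape)
```

Only the $M \times M$ matrix $K_{MM}$ is ever inverted here, which is where the savings over the naive method come from.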

In [None]:
#
# TODO: Try varying num_inducing in the example above to identify how many samples are needed before the quality
# of the fit with FITC is similar to that with the naive method. Show one plot where there are still not enough
# pseudo-data points, and one where the results are similar.
#

In [None]:
test_X = np.linspace(-20.0, 20.0, 400).reshape(400, 1)

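def model_distance(model_ref, model_test):
    """Returns the squared L^2 difference between the two models' predictive means on test_X, relative to model_ref."""
    ref_out = model_ref.predict(test_X)[0]
    test_out = model_test.predict(test_X)[0]
    return np.sum((ref_out - test_out) ** 2) / np.sum(ref_out ** 2)

print(model_distance(m_reg, m_sparse))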
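
To see what `model_distance` measures, the same ratio can be evaluated on plain arrays, without fitting any model. This is only a sketch: `relative_sq_error` is a hypothetical helper name, and the two arrays are made-up stand-ins for the predictive means returned by `predict`. Scaling the reference by 1.1 perturbs every entry by 10%, so the ratio should come out close to $0.1^2 = 0.01$.

```python
import numpy as np

def relative_sq_error(ref_out, test_out):
    """Squared L^2 distance between two outputs, relative to the squared norm of the reference."""
    return np.sum((ref_out - test_out) ** 2) / np.sum(ref_out ** 2)

grid = np.linspace(-20.0, 20.0, 400).reshape(400, 1)
ref_demo = np.sin(grid)        # stand-in for the reference model's predictive mean
test_demo = 1.1 * ref_demo     # stand-in for a model that is off by 10% everywhere
demo_distance = relative_sq_error(ref_demo, test_demo)
print(demo_distance)
```

Because the differences are squared, a value of 0.01 corresponds to a 10% relative error in the norm itself.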

In [None]:
#
# TODO: Let's make the above result a little more quantitative. The function model_distance gives a numerical relative
# squared difference (in L^2 norm) between the outputs of two models. Make a plot of this number vs. the pseudo-dataset
# size for a reasonable range of sizes.
# 

# II. Efficiency Comparison

Here is a simple way to measure how long a piece of Python code takes:

In [None]:
X, Y = uniform_samples(-20.0, 20.0, 1000, np.sin, 0.05)

In [None]:
start = time.time()
m_reg = GPy.models.GPRegression(X, Y, kernel)
_ = m_reg.optimize()
end = time.time()
print('Naive method: {:.3f} seconds'.format(end - start))

start = time.time()
m_sparse = GPy.models.SparseGPRegression(X, Y, kernel, num_inducing=10)
m_sparse.inference_method = GPy.inference.latent_function_inference.FITC()
_ = m_sparse.optimize()
end = time.time()
print('Sparse method: {:.3f} seconds'.format(end - start))
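
If you plan to repeat this measurement many times, it may help to wrap it in a function. The sketch below is one possible way to do that and is not part of the lab's provided code: `mean_runtime` and its `repeats` parameter are hypothetical names, and it assumes Python 3, where `time.perf_counter` is available and is generally better suited to measuring short intervals than `time.time`.

```python
import time

def mean_runtime(fn, repeats=3):
    """Calls fn() `repeats` times and returns the mean wall-clock time in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)

# Toy workload standing in for "build and optimize a model".
calls = []
elapsed = mean_runtime(lambda: calls.append(sum(i * i for i in range(10000))), repeats=5)
print('Mean runtime: {:.6f} seconds'.format(elapsed))
```

For the models above, you would pass a function that constructs the model and calls `optimize()` on it.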

As you can see, we weren't lying to you: the sparse method is much faster. But how much faster? Let's investigate the runtime asymptotics we claimed in class.

There are two relevant numbers: $N$, the number of data points in the original set ($N = 1000$ above), and $M$, the number of pseudo-data points we use ($M = 10$ above). The naive method is supposed to run in time $O(N^3)$, while the FITC method is supposed to run in time $O(NM^2)$.
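
A power law $t = cN^p$ becomes a straight line on a log-log plot, since $\log t = \log c + p \log N$, so the exponent $p$ can be read off as the slope. The sketch below fits that slope with `np.polyfit`. The timings in it are synthetic, generated from an exact cubic purely to illustrate the fit; they are not measurements, and `fit_exponent` is a hypothetical helper name.

```python
import numpy as np

def fit_exponent(sizes, times):
    """Returns the slope of log(times) against log(sizes), i.e. p in times ~ c * sizes**p."""
    slope, _intercept = np.polyfit(np.log(sizes), np.log(times), 1)
    return slope

sizes = np.array([100.0, 200.0, 400.0, 800.0])
synthetic_times = 2e-9 * sizes ** 3   # synthetic: an exact N^3 law, not real timings
exponent = fit_exponent(sizes, synthetic_times)
print(round(exponent, 3))
```

Real timings are noisy and include overhead that does not scale with $N$, so the fitted slope may only approach the theoretical value at larger sizes.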

In [None]:
#
# TODO: Study the dependence of the runtimes of the two methods on N: fix some reasonable M, vary N, for each N 
# drawing several sets of samples as above, and measure how long it takes to optimize each of the two models. For each
# N and each model, take the mean of the times you measure, and plot these. If you want, try to confirm the power
# scaling on a log-log plot. (Timing studies are hard and this is a very naive way to measure code execution, so don't
# worry if it doesn't look perfect.)
#

In [None]:
#
# TODO: Just for the FITC method, study the dependence of the runtime on M: fix some reasonable N, vary M, and proceed
# as above.
#