# Fitting a Gaussian process kernel

In the [previous post](TODO: URL) we introduced the Gaussian process model with the exponentiated quadratic covariance function. In this post we will introduce parametrized covariance functions (kernels), fit them to real-world data, and use them to make posterior predictions.

We will implement the Gaussian process model in [TensorFlow Probability](https://www.tensorflow.org/probability/), which lets us define and tune the model without implementing the underlying computations ourselves.

In [23]:
# Imports
%matplotlib notebook

import sys
import numpy as np
import pandas as pd
import scipy
import sklearn
import sklearn.datasets
import tensorflow as tf
import tensorflow_probability as tfp
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from functools import partial

import bokeh
import bokeh.io
import bokeh.plotting
from bokeh.models import BoxAnnotation, Span, Label, Legend, Title, LinearAxis, Range1d
from bokeh.palettes import brewer
bokeh.io.output_notebook(hide_banner=True)

tfd = tfp.distributions
psd_kernels = tfp.positive_semidefinite_kernels

sns.set_style('darkgrid')
np.random.seed(42)
tf.set_random_seed(42)
#

In [2]:
from polynomial import Linear
from rational_quadratic import RationalQuadratic

- https://www.tensorflow.org/probability/api_docs/python/tfp/distributions/GaussianProcessRegressionModel
- https://www.tensorflow.org/probability/api_docs/python/tfp/positive_semidefinite_kernels
- https://github.com/tensorflow/probability/tree/master/tensorflow_probability/python/positive_semidefinite_kernels

## Mauna Loa CO2 data

The dataset used in this example is the monthly average [atmospheric CO<sub>2</sub>](https://en.wikipedia.org/wiki/Carbon_dioxide_in_Earth%27s_atmosphere) concentrations (in parts per million (ppm)) collected at the [Mauna Loa Observatory](https://en.wikipedia.org/wiki/Mauna_Loa_Observatory) in Hawaii. The observatory has been collecting these CO<sub>2</sub> concentrations since 1958 and [showed](https://en.wikipedia.org/wiki/Keeling_Curve) the first significant evidence of rapidly increasing CO<sub>2</sub> levels in the atmosphere. 

These measurements of atmospheric CO<sub>2</sub> concentration show several distinct features: a long-term rising trend, seasonal variation, and smaller irregularities. These features have made the dataset a canonical example in Gaussian process modelling [[1](#References)].

In this post the data is downloaded as a CSV file from the [Scripps CO<sub>2</sub> Program website](http://scrippsco2.ucsd.edu/data/atmospheric_co2/mlo). The data is loaded and plotted below.

In [3]:
# Load the data
# Load the data from the Scripps CO2 program website. 
co2_df = pd.read_csv(
    ('http://scrippsco2.ucsd.edu/assets/data/atmospheric/stations/'
     'in_situ_co2/monthly/monthly_in_situ_co2_mlo.csv'), 
    header=54, # Data starts here
    skiprows=[55, 56], # Headers consist of multiple rows
    usecols=[3, 4], # Only keep the 'Date' and 'CO2' columns
    na_values='-99.99'  # NaNs are denoted as '-99.99'
)

# Drop missing values
co2_df.dropna(inplace=True)
# Remove whitespace from column names
co2_df.rename(columns=lambda x: x.strip(), inplace=True)
#

In [4]:
# Plot data
p = bokeh.plotting.figure(width=600, height=400)
p.xaxis.axis_label = 'Date'
p.yaxis.axis_label = 'CO₂ [ppm]'
p.add_layout(Title(
    text='In situ air measurements at Mauna Loa Observatory, Hawaii',
    text_font_style="italic"), 'above')
p.add_layout(Title(
    text='Atmospheric CO₂ concentrations', 
    text_font_size="16pt"), 'above')
p.line(
    co2_df.Date, co2_df.CO2, legend='All data',
    line_width=2, line_color='midnightblue')
p.legend.location = 'top_left'
bokeh.plotting.show(p)
#

In this post we are going to make predictions for the measurements from 2008 and after based on the observed measurements from before 2008.

In [5]:
# Split the data into observed and to predict
date_split = 2008
df_observed = co2_df[co2_df.Date < date_split]
print('{} measurements in the observed set'.format(len(df_observed)))
df_predict = co2_df[co2_df.Date >= date_split]
print('{} measurements in the prediction set'.format(len(df_predict)))
#

593 measurements in the observed set
130 measurements in the prediction set


## Gaussian process model

A Gaussian process is uniquely defined by its mean function $m(x)$ and covariance function $k(x,x')$:

$$f(x) \sim \mathcal{GP}(m(x),k(x,x'))$$

For simplicity, we will set the mean function to zero, $m(x)=0$, in this post. All structure in the data therefore has to be captured by the covariance function, which is also what drives the posterior predictions.



In [6]:
# Optimize model parameters via maximum marginal likelihood

# Define a kernel with trainable parameters. Note we transform the trainable
# variables to apply a positivity constraint.
# Smooth kernel
amplitude_smooth = tf.exp(tf.Variable(np.float64(0)), name='amplitude_smooth')
length_scale_smooth = tf.exp(tf.Variable(np.float64(0)), name='length_scale_smooth')
kernel_smooth = psd_kernels.ExponentiatedQuadratic(
    amplitude=amplitude_smooth, 
    length_scale=length_scale_smooth)
# Periodic kernel smoothed
amplitude_periodic = tf.exp(tf.Variable(np.float64(0)), name='amplitude_periodic')
length_scale_periodic = tf.exp(tf.Variable(np.float64(0)), name='length_scale_periodic')
period_periodic = tf.exp(tf.Variable(np.float64(0)), name='period_periodic')
amplitude_periodic_smooth = tf.exp(tf.Variable(np.float64(0)), name='amplitude_periodic_smooth')
length_scale_periodic_smooth = tf.exp(tf.Variable(np.float64(5)), name='length_scale_periodic_smooth')
# kernel_periodic = psd_kernels.ExpSinSquared(
#     amplitude=amplitude_periodic, 
#     length_scale=length_scale_periodic,
#     period=period_periodic)
kernel_periodic = (
    psd_kernels.ExpSinSquared(
        amplitude=amplitude_periodic, 
        length_scale=length_scale_periodic,
        period=period_periodic) * 
    psd_kernels.ExponentiatedQuadratic(
        amplitude=amplitude_periodic_smooth, 
        length_scale=length_scale_periodic_smooth))
# Linear kernel
slope_linear = tf.exp(tf.Variable(np.float64(0)), name='slope_linear')
bias_linear = tf.exp(tf.Variable(np.float64(0)), name='bias_linear')
amplitude_linear_smooth = tf.exp(tf.Variable(np.float64(0)), name='amplitude_linear_smooth')
length_scale_linear_smooth = tf.exp(tf.Variable(np.float64(5)), name='length_scale_linear_smooth')
kernel_dotprod = (
    Linear(
        slope_variance=slope_linear,
        bias_variance=bias_linear) *
    psd_kernels.ExponentiatedQuadratic(
        amplitude=amplitude_linear_smooth, 
        length_scale=length_scale_linear_smooth))

# Rational quadratic
amplitude_irregular = tf.exp(tf.Variable(np.float64(0)), name='amplitude_irregular')
length_scale_irregular = tf.exp(tf.Variable(np.float64(0)), name='length_scale_irregular')
scale_mixture_irregular = tf.exp(tf.Variable(np.float64(0)), name='scale_mixture_irregular')
kernel_irregular = RationalQuadratic(
    amplitude=amplitude_irregular,
    length_scale=length_scale_irregular,
    scale_mixture_rate=scale_mixture_irregular
)

kernel = kernel_smooth + kernel_periodic + kernel_dotprod + kernel_irregular

observation_noise_variance = tf.exp(
    tf.Variable(np.float64(0)), name='observation_noise_variance')

# mean_fn = lambda _: tf.reduce_mean(
#     df_observed.CO2.values, keepdims=True)
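
The kernel above is built by multiplying and adding simpler kernels. As a rough illustration of why that works, here is a minimal NumPy sketch, separate from the TensorFlow Probability code. The function names, the toy inputs, and the parameter values are assumptions made for this example, and the periodic kernel uses the standard textbook form, which may differ in detail from the library's implementation.

```python
import numpy as np

def exponentiated_quadratic(xa, xb, amplitude=1.0, length_scale=1.0):
    """amplitude^2 * exp(-(xa - xb)^2 / (2 * length_scale^2)) for 1-D inputs."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return amplitude**2 * np.exp(-0.5 * sq_dist / length_scale**2)

def exp_sin_squared(xa, xb, amplitude=1.0, length_scale=1.0, period=1.0):
    """amplitude^2 * exp(-2 * sin^2(pi * |xa - xb| / period) / length_scale^2)."""
    dist = np.abs(xa[:, None] - xb[None, :])
    return amplitude**2 * np.exp(
        -2.0 * np.sin(np.pi * dist / period) ** 2 / length_scale**2)

# Hypothetical inputs: five years of monthly time points, in years
x = np.arange(60) / 12.0

# Locally periodic component: a yearly pattern whose shape may drift slowly
k_periodic = (
    exp_sin_squared(x, x, period=1.0) *
    exponentiated_quadratic(x, x, length_scale=10.0))
# Smooth long-term component
k_smooth = exponentiated_quadratic(x, x, amplitude=2.0, length_scale=5.0)
# Sums and products of valid kernels are again valid kernels
k_total = k_smooth + k_periodic

# A valid covariance matrix is symmetric with non-negative eigenvalues
# (up to floating point round-off)
min_eig = np.linalg.eigvalsh(k_total).min()
print(k_total.shape, np.allclose(k_total, k_total.T), min_eig > -1e-8)
```

Multiplying the periodic kernel by a wide exponentiated quadratic lets the seasonal pattern change slowly over time instead of repeating exactly.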

In [7]:
batch_size=128
batch_date, batch_co2 = tf.data.Dataset.from_tensor_slices(
    (df_observed.Date.values.reshape(-1, 1), 
     df_observed.CO2.values)).shuffle(
    buffer_size=len(df_observed)).repeat(count=None).batch(
    batch_size).make_one_shot_iterator().get_next()
# print(observations_batch)

# with tf.Session() as sess:
#     _d, _c = sess.run([batch_date, batch_co2])
#     print(type(_d), _d.shape, _d)
#     print(type(_c), _c.shape, _c)



In [39]:
# We'll use an unconditioned GP to train the kernel parameters.
gp_batched = tfd.GaussianProcess(
    kernel=kernel,
    index_points=batch_date,
#     index_points=df_observed.Date.values.reshape(-1, 1),
    observation_noise_variance=observation_noise_variance)
neg_log_likelihood_batch = -gp_batched.log_prob(batch_co2)

gp_full = tfd.GaussianProcess(
    kernel=kernel,
    index_points=df_observed.Date.values.reshape(-1, 1),
    observation_noise_variance=observation_noise_variance)
neg_log_likelihood_full = -gp_full.log_prob(df_observed.CO2.values)

optimizer = tf.train.AdamOptimizer(learning_rate=0.002)
optimize = optimizer.minimize(neg_log_likelihood_batch)


session = tf.InteractiveSession()
session.run(tf.global_variables_initializer())

batch_nlls = []
full_ll = []
for i in range(5000):
    _, nlls = session.run([optimize, neg_log_likelihood_batch])
    batch_nlls.append((i, nlls))
    if i % 100 == 0:
        # Evaluate the NLL on all observed data every 100 steps
        ll = session.run(neg_log_likelihood_full)
        full_ll.append((i, ll))
        print("Step {}: NLL = {}".format(i, ll))
# Re-evaluate on all observed data after the final update
ll = session.run(neg_log_likelihood_full)
full_ll.append((i, ll))
print("Final NLL = {}".format(ll))



Step 0: NLL = 698.053018821953
Step 100: NLL = 612.6159080417453
Step 200: NLL = 553.7397651212278
Step 300: NLL = 502.70199007037763
Step 400: NLL = 458.1886817044202
Step 500: NLL = 419.68603184452127
Step 600: NLL = 386.98343805927107
Step 700: NLL = 353.733002528729
Step 800: NLL = 325.4975099953751
Step 900: NLL = 292.85706597705655
Step 1000: NLL = 266.7404194550019
Step 1100: NLL = 242.48553324792573
Step 1200: NLL = 222.33307439114645
Step 1300: NLL = 212.06290143817898
Step 1400: NLL = 192.91815229296094
Step 1500: NLL = 180.89971588405217
Step 1600: NLL = 175.09836292098385
Step 1700: NLL = 164.7795850806126
Step 1800: NLL = 161.69650065337862
Step 1900: NLL = 156.64038056814172
Step 2000: NLL = 152.41354528424677
Step 2100: NLL = 147.70585427842457
Step 2200: NLL = 147.20028959844456
Step 2300: NLL = 143.815886273199
Step 2400: NLL = 142.23883204351102
Step 2500: NLL = 140.85428052371037
Step 2600: NLL = 140.78306292156026
Step 2700: NLL = 138.52067981757205
Step 2800: NLL =
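
The quantity minimized above is the negative log marginal likelihood (NLL) of the observations under the Gaussian process. To make that number less opaque, the following NumPy sketch computes the same kind of quantity for a zero-mean GP with Gaussian observation noise. It is an illustration under stated assumptions, not the library's implementation: the helper names, the toy sine data, and the parameter values are all made up for this example.

```python
import numpy as np

def exponentiated_quadratic(xa, xb, amplitude=1.0, length_scale=1.0):
    """Exponentiated quadratic kernel matrix for 1-D inputs."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return amplitude**2 * np.exp(-0.5 * sq_dist / length_scale**2)

def gp_neg_log_likelihood(x, y, kernel_fn, noise_variance):
    """NLL of y under a zero-mean GP at x with Gaussian observation noise."""
    n = len(x)
    # Covariance of the noisy observations: K + sigma^2 * I
    cov = kernel_fn(x, x) + noise_variance * np.eye(n)
    chol = np.linalg.cholesky(cov)
    # alpha = cov^{-1} y, computed with two solves against the Cholesky factor
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y))
    data_fit = 0.5 * y @ alpha
    complexity = np.sum(np.log(np.diag(chol)))  # equals 0.5 * log|cov|
    return data_fit + complexity + 0.5 * n * np.log(2 * np.pi)

# Hypothetical toy data: a noisy sine wave
rng = np.random.RandomState(0)
x = np.linspace(0, 5, 30)
y = np.sin(x) + 0.1 * rng.randn(30)

# Compare a too short, a reasonable, and a too long length scale
nlls = {}
for length_scale in (0.01, 1.0, 100.0):
    kernel_fn = lambda a, b, ls=length_scale: exponentiated_quadratic(
        a, b, length_scale=ls)
    nlls[length_scale] = gp_neg_log_likelihood(
        x, y, kernel_fn, noise_variance=0.01)
    print(length_scale, nlls[length_scale])
```

The data-fit term rewards kernels that explain the observations, while the log-determinant term penalizes overly flexible ones. On this toy data the intermediate length scale should score best, and that trade-off is what the optimizer exploits when tuning the kernel parameters.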

In [43]:
# fig, ax1 = plt.subplots()
# ax1.plot(*zip(*batch_nlls))
# ax2 = ax1.twinx()
# ax2.plot(*zip(*full_ll))
# plt.show()

fig = bokeh.plotting.figure(
    width=600, height=400, 
    x_range=(0, 5000), y_range=(50, 200))
fig.add_layout(Title(
    text='Negative Log-Likelihood (NLL) during training iterations', 
    text_font_size="16pt"), 'above')
fig.xaxis.axis_label = 'iteration'
fig.yaxis.axis_label = 'NLL batch'
# First plot
fig.line(
    *zip(*batch_nlls), legend='Batch data',
    line_width=2, line_color='midnightblue')
# Second plot
# Setting the second y axis range name and range
fig.extra_y_ranges = {'ax2': Range1d(start=100, end=700)}
fig.line(
    *zip(*full_ll), legend='All observed data',
    line_width=2, line_color='red', y_range_name='ax2')
# Adding the second axis to the plot.  
fig.add_layout(LinearAxis(
    y_range_name='ax2', axis_label='NLL all'), 'right')

fig.legend.location = 'top_right'
bokeh.plotting.show(fig)

In [9]:
# Construct the posterior at a new set of `index_points` using the same
# kernel, whose parameters were optimized above.
gprm = tfd.GaussianProcessRegressionModel(
    kernel=kernel,
    index_points=df_predict.Date.values.reshape(-1, 1),
    observation_index_points=df_observed.Date.values.reshape(-1, 1),
    observations=df_observed.CO2.values,
    observation_noise_variance=observation_noise_variance)

samples = gprm.sample(10)
mean = gprm.mean()
# mean = gprm.loc
std = gprm.stddev()

samples_ = session.run(samples)
print('samples_: ', samples_.shape)
mean_ = session.run(mean)
print('mean_: ', mean_.shape)
std_ = session.run(std)
print('std_: ', std_.shape)

samples_:  (10, 130)
mean_:  (130,)
std_:  (130,)
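
For intuition about what the regression model returns, here is a minimal NumPy sketch of the standard posterior mean and standard deviation of a zero-mean GP conditioned on noisy observations. The helper names, toy data, and parameter values are assumptions for illustration. The sketch returns the uncertainty of the latent function only, so its standard deviation may differ from a library result that also adds observation noise to the predictions.

```python
import numpy as np

def exponentiated_quadratic(xa, xb, amplitude=1.0, length_scale=1.0):
    """Exponentiated quadratic kernel matrix for 1-D inputs."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return amplitude**2 * np.exp(-0.5 * sq_dist / length_scale**2)

def gp_posterior(x_obs, y_obs, x_new, kernel_fn, noise_variance):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    # Covariance of the noisy observations
    k_oo = kernel_fn(x_obs, x_obs) + noise_variance * np.eye(len(x_obs))
    # Cross-covariance between observed and new points
    k_on = kernel_fn(x_obs, x_new)
    # Prior covariance of the new points
    k_nn = kernel_fn(x_new, x_new)
    # solved = k_oo^{-1} k_on
    solved = np.linalg.solve(k_oo, k_on)
    mean = solved.T @ y_obs
    cov = k_nn - solved.T @ k_on
    # Clip tiny negative variances caused by round-off
    std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
    return mean, std

# Hypothetical toy data: observe a sine wave on [0, 5], predict on [5, 7]
x_obs = np.linspace(0, 5, 30)
y_obs = np.sin(x_obs)
x_new = np.linspace(5, 7, 10)

mean, std = gp_posterior(
    x_obs, y_obs, x_new, exponentiated_quadratic, noise_variance=0.01)
# Two-standard-deviation band around the posterior mean
upper, lower = mean + 2 * std, mean - 2 * std
print(mean.shape, std.shape)
```

The standard deviation grows as the new points move away from the observed range, which is why a prediction band widens the further we extrapolate.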


In [10]:
upper = mean_ + 2*std_
lower = mean_ - 2*std_
band_x = np.append(
    df_predict.Date.values, df_predict.Date.values[::-1])
band_y = np.append(lower, upper[::-1])

print('band_x: ', band_x.shape)

band_x:  (260,)


In [11]:
# tfd.GaussianProcessRegressionModel??
# dir(gprm)
# gprm.scale??

In [12]:
variables = [
    amplitude_periodic, 
    length_scale_periodic,
    period_periodic,
    slope_linear,
    bias_linear,
    amplitude_linear_smooth,
    length_scale_linear_smooth,
    length_scale_irregular,
    scale_mixture_irregular,
    observation_noise_variance
]

variables_eval = session.run(variables)
for var, var_eval in zip(variables, variables_eval):
    print(var.name, var_eval)

amplitude_periodic:0 1.7904324389439996
length_scale_periodic:0 1.5945158803225827
period_periodic:0 0.9998979224046203
slope_linear:0 0.4849915193005098
bias_linear:0 0.19011146207107876
amplitude_linear_smooth:0 0.48499145710840175
length_scale_linear_smooth:0 165.57043833022817
length_scale_irregular:0 0.8625489996556547
scale_mixture_irregular:0 0.18804732202597016
observation_noise_variance:0 0.047068450162612865


In [55]:
# make plot
fig = bokeh.plotting.figure(
    width=600, height=400,
    x_range=(2007, 2019), y_range=(370, 420))
fig.xaxis.axis_label = 'Date'
fig.yaxis.axis_label = 'CO₂ (ppm)'
fig.add_layout(Title(
    text='In situ air measurements at Mauna Loa Observatory, Hawaii',
    text_font_style="italic"), 'above')
fig.add_layout(Title(
    text='Atmospheric CO₂ concentrations', 
    text_font_size="16pt"), 'above')
fig.line(
    co2_df.Date, co2_df.CO2, legend='True data',
    line_width=2, line_color='midnightblue', line_dash='4 3')
fig.line(
    df_predict.Date.values, mean_, legend='Predictions',
    line_width=2, line_color='firebrick')
fig.patch(
    band_x, band_y, color='firebrick', alpha=0.4, 
    line_color='firebrick', legend='±2σ')

fig.legend.location = 'top_left'
fig.toolbar.autohide = True
bokeh.plotting.show(fig)

In [14]:
# session.close()

In [15]:
# Versions used
print('Python: {}.{}.{}'.format(*sys.version_info[:3]))
print('Numpy: {}'.format(np.__version__))
print('Pandas: {}'.format(pd.__version__))
print('Tensorflow: {}'.format(tf.__version__))
print('Tensorflow Probability: {}'.format(tfp.__version__))
print('sklearn: {}'.format(sklearn.__version__))
print('matplotlib: {}'.format(matplotlib.__version__))
print('seaborn: {}'.format(sns.__version__))
#

Python: 3.6.6
Numpy: 1.15.1
Pandas: 0.23.4
Tensorflow: 1.12.0
Tensorflow Probability: 0.5.0
sklearn: 0.19.2
matplotlib: 2.2.3
seaborn: 0.9.0


## References

1. [Gaussian Processes for Machine Learning. Chapter 5: Model Selection and Adaptation of Hyperparameters](http://www.gaussianprocess.org/gpml/chapters/RW5.pdf) by Carl Edward Rasmussen and Christopher K. I. Williams.

To read:
- http://130.243.105.49/Research/Learning/courses/ml/2011/lectures/ML_2011_L05.pdf 
- https://www.inf.ed.ac.uk/teaching/courses/mlpr/2016/notes/w7c_gaussian_process_kernels.pdf
- https://george.readthedocs.io/en/latest/user/kernels/
- http://ml.dcs.shef.ac.uk/gpss/gpws14/KernelDesign.pdf
