# [5SSD0] Probabilistic Programming - Assignment

### Year: 2022-2023

In [None]:
# Enter your name and student ID
name = 
ID   = 

In this assignment, we will go through the cycle of model specification, performance evaluation, critiquing the model and revising it.

![Figure taken from "Build, Compute, Critique, Repeat: Data Analysis with Latent Variable Models"](figures/model-critique.png)
_Figure taken from "Build, Compute, Critique, Repeat: Data Analysis with Latent Variable Models" ([pdf](https://www.cs.columbia.edu/~blei/papers/Blei2014b.pdf))_

In questions 1 and 2, you will build a simple model, fit it to data and evaluate its performance on future data. You will find that its performance is not great. In question 3, you will improve the model in multiple ways. Finally, in question 4, you will do model selection based on free energy.

The final questions require knowledge from the last probabilistic programming session, but questions 1 and 2 can be done relatively early in the course.

In [None]:
using Pkg
Pkg.activate(".")
Pkg.instantiate();

In [None]:
using CSV
using DataFrames
using LinearAlgebra
using ProgressMeter
using RxInfer
using Plots
default(label="",
        grid=false, 
        linewidth=3, 
        markersize=4,
        guidefontsize=12, 
        margins=15Plots.pt)

## Problem: Forecasting Air Quality

Many Europeans suspect that the air quality in their city is declining. A [recent study](https://doi.org/10.1016/j.snb.2007.09.060) measured the air quality of a major city in northern Italy using an electronic nose. The measurements were made in the middle of the city and reflect urban activity. We will inspect the measured chemical concentrations and build a model to predict CO concentrations at future time points.

![https://www.theguardian.com/environment/2020/apr/07/air-pollution-linked-to-far-higher-covid-19-death-rates-study-finds](figures/air-milan-wide.png)

Photograph taken by Claudio Furlan/LaPresse/Zuma Press/Rex/Shutterstock ([link](https://www.theguardian.com/environment/2020/apr/07/air-pollution-linked-to-far-higher-covid-19-death-rates-study-finds))

The data can be found here: https://archive.ics.uci.edu/ml/datasets/Air+Quality. I've done some pre-processing and selected the most important features. In this assignment we will infer parameters in a model of the data and predict air quality in the future. For that purpose, the data has been split into past and future.

In [None]:
# Load training data
past_data = DataFrame(CSV.File("data/airquality_past.csv"))

In [None]:
# Number of data points
N = 100;

Let's visualize the carbon monoxide measurements over time.

In [None]:
scatter(past_data[:,1], 
        past_data[:,2], 
        size=(900,300), 
        color="black", 
        xlabel="time", 
        ylabel="CO (ppm)",
        ylims=[400,2000])

## 1. Auto-regression

We suspect that there is a temporal dependence in this dataset. In other words, the data changes relatively slowly over time and neighbouring data points end up being highly correlated. To exploit this correlation, we will build an _auto-regressive model_ of the form:

$$ y_k = \theta y_{k-1} + \epsilon_k \, , $$

where the noise $\epsilon_k$ is drawn from a zero-mean Gaussian with precision parameter $\tau$: 

$$ \epsilon_k \sim \mathcal{N}(0, \tau^{-1}) \, .$$

Tasks:
- [1pt] Specify the above equation as a probabilistic model in RxInfer, using $\tau = 1.0$.
- [1pt] Specify and execute an inference procedure to infer a posterior distribution for $\theta$.
- [1pt] Plot the inferred distribution over the interval $[0,\ 2]$.
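
As a sanity check on whatever RxInfer returns, the posterior for $\theta$ has a closed form when the prior is Gaussian. The prior parameters $m_0$ and $v_0$ below are assumptions (the assignment does not prescribe a prior), and the first observation $y_1$ is treated as given. With $p(\theta) = \mathcal{N}(m_0, v_0)$ and the likelihood above, the posterior is $p(\theta \mid \mathcal{D}) = \mathcal{N}(m_N, v_N)$ with

$$ v_N^{-1} = v_0^{-1} + \tau \sum_{k=2}^{N} y_{k-1}^2 \, , \qquad m_N = v_N \Big( \frac{m_0}{v_0} + \tau \sum_{k=2}^{N} y_{k-1} \, y_k \Big) \, . $$

Because the measurements are in the hundreds to thousands, the sum $\sum_k y_{k-1}^2$ is large and the posterior will likely be very narrow compared to the plotting interval.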

In [None]:
### YOUR CODE HERE

## 2. Predictions

Now that we have inferred a posterior distribution for the coefficient, we can start making predictions. The data set also contains "future data" for which we want to make 1-step ahead predictions. The posterior predictive distribution for the next time step is:

$$ p(y_{t+1} \mid y_{t}, \mathcal{D}) = \int p(y_{t+1} \mid \theta, y_{t}) p(\theta \mid \mathcal{D}) \, \mathrm{d}\theta \, , $$

where $\mathcal{D}$ refers to "past data" (used to infer the posterior distribution). To make 1-step ahead predictions, you will have to loop over the future data (i.e., `for t in 1:T`), plug in the current data point and compute the parameters of the posterior predictive distribution for the next data point. For the initial $y_t$, you may use the last entry of the "past data" set.
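
If the posterior is Gaussian, $p(\theta \mid \mathcal{D}) = \mathcal{N}(m, v)$, the integral above evaluates to $\mathcal{N}(y_{t+1} \mid m \, y_t, \; v \, y_t^2 + \tau^{-1})$. The loop can then be sketched as follows. This is a minimal sketch, not a reference solution: the posterior moments `m_θ` and `v_θ`, the toy series and the name `T_toy` are all made up for illustration and should be replaced by your inferred values and the real data.

```julia
# Hypothetical posterior moments for θ (assumption); use your inferred values instead
m_θ, v_θ = 0.9, 0.01
τ = 1.0

# Toy stand-ins for the last past observation and the future series (not the real data)
y_last = 1000.0
y_future = [1010.0, 990.0, 1005.0]

T_toy = length(y_future)
pred_mean = zeros(T_toy)
pred_var = zeros(T_toy)

for t in 1:T_toy
    # Condition on the previous observation; the first prediction uses the last past point
    y_prev = t == 1 ? y_last : y_future[t-1]

    # Moments of the Gaussian posterior predictive distribution
    pred_mean[t] = m_θ * y_prev
    pred_var[t] = v_θ * y_prev^2 + inv(τ)
end
```

Note that the predictive variance grows with $y_t^2$, so uncertainty about $\theta$ matters more when the previous measurement is large.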

Tasks:
- [1pt] Compute the 1-step ahead predictions (mean and variance) for the "future data" set.
- [1pt] Plot the predictions (variance in `ribbon=`) along with the actual future data (scatterplot).

---

Note that if you failed to infer a posterior distribution in the previous question, you can still answer this question using a standard normal, $p(\theta) = \mathcal{N}(0,1)$.

In [None]:
# Load test data
future_data = DataFrame(CSV.File("data/airquality_future.csv"))

In [None]:
T = size(future_data,1);

In [None]:
### YOUR CODE HERE

## 3. Model critique

Our model only considers extremely short-term changes, which are highly affected by noise. Furthermore, we fixed the noise precision $\tau$ at $1.0$ for convenience, not on the basis of domain expertise or data. We are going to improve the model on both counts. First, auto-regressive models can be extended further into the past to capture slower trends over time:

$$ y_k = \sum_{m=1}^{M} \theta_m y_{k-m} + \epsilon_k \, ,$$

where $M$ corresponds to model order. Secondly, we can put a prior probability distribution over $\tau$ and infer a posterior $p(\tau \mid \mathcal{D})$ simultaneously. To do that, you will have to specify the constraint $q(\theta,\tau) = q(\theta)q(\tau)$ in the variational inference procedure.
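
The order-$M$ equation can be read as a regression of $y_k$ on the vector of its $M$ predecessors. One possible way to prepare the data for that is sketched below in plain Julia. The toy series and the names `M_toy`, `x_prev` and `y_next` are assumptions for illustration, not part of the assignment.

```julia
# Toy series standing in for the CO measurements (not the real data)
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
M_toy = 3

# For each k > M, collect the M previous observations, most recent first,
# so that x_prev[i] = [y[k-1], y[k-2], ..., y[k-M]] pairs with y_next[i] = y[k]
x_prev = [y[k-1:-1:k-M_toy] for k in M_toy+1:length(y)]
y_next = y[M_toy+1:end]
```

With this layout, each observation follows $y_k = \theta^{\top} x_k + \epsilon_k$, where $x_k$ is the corresponding entry of `x_prev`. The first $M$ observations only serve as inputs, so a series of length $N$ yields $N - M$ regression pairs.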

Tasks:
- [1pt] Extend the model with an order parameter $M$ and noise precision estimation.
- [1pt] Infer the approximate posteriors for $\theta$ and $\tau$, for model order $M=3$.
- [1pt] Visualize the posterior for the noise precision, $q(\tau)$ (think carefully about its range).
- [1pt] Visualize the 1-step ahead predictions (mean and variance) on the future data.

In [None]:
# Number of iterations of variational inference
n_iters = 10;

# Model order
M = 3;

In [None]:
### YOUR CODE HERE

## 4. Model selection

We now essentially have a different model for each value of $M$. Which is the best?

Tasks:
- [1pt] Compute the free energies for a given range of model orders and report the best performing one.
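
As a reminder of why free energy can serve as a selection criterion, here is the standard decomposition for a model with order $M$ (written with the factorized $q$ from question 3):

$$ F_M[q] = \mathbb{E}_{q(\theta,\tau)}\big[ \log q(\theta,\tau) - \log p(\mathcal{D}, \theta, \tau \mid M) \big] = \mathrm{KL}\big[ q(\theta,\tau) \, \| \, p(\theta,\tau \mid \mathcal{D}, M) \big] - \log p(\mathcal{D} \mid M) \, . $$

Since the KL divergence is non-negative, $F_M[q] \geq -\log p(\mathcal{D} \mid M)$, so a lower free energy indicates a higher bound on the model evidence. One point to consider, offered as a suggestion rather than a requirement: a higher order $M$ uses more of the initial observations as inputs, so the comparison is most meaningful when every model order is scored on the same set of predicted observations.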

In [None]:
# Model order range
model_orders = [2,4,8,16,32];

In [None]:
### YOUR CODE HERE

## Submission

Before you submit the assignment, make sure your notebook runs! You can do that by going to the `Kernel` tab in the toolbar and pressing `Restart & Run All`. This is important! If your code doesn't run, we can't verify the correctness of your answer.

When you're ready, head on over to Canvas and upload your notebook.