<a href="https://colab.research.google.com/github/euxoa/ompeluseura/blob/master/ProbabilisticProgramming.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Probabilistic programming

This is about how and why to do Bayesian models programmatically. For probabilistic programming more generally, see Wikipedia or https://arxiv.org/abs/1809.10756.

## Generative models and their uses

A generative model describes 
* how your data might arise,
* conditioned on the _parameters_ of the model, 
* including parameterized uncertainty, that is, things that you cannot or do not want to model.

In probabilistic programming (PP), you set up a generative model for your data and let the PP system infer what the parameters might be. Because of the uncertainty in your model, you will not get a single definite answer, but typically a set of _parameter scenarios or samples_. These and their variation may then be useful for:
* Inference: _“The weather station has warmed 0.7±0.2 °C during the last decade.”_
* Uncertainty: _“±0.2 °C (95% credible interval)”_
* Hypothesis testing: _“The observed difference between groups is within expected random fluctuation, so no support for the idea they are different.”_
* Prediction: _“The model indicates a GDP drop of 0.2% during the next quarter.”_
* Prediction: _“Based on other users and their preferences, we recommend these books to you...”_
* Decisions: _“Our expected loss (or risk, or a combination) is minimized by doing A instead of B.”_

Note that you can often estimate a generative model by finding the most probable parameters (_maximum a posteriori_, MAP), which coincides with maximum likelihood (ML) when the prior is flat. This is fast, but it does not work for all models, is sometimes misleading, and omits the uncertainty.

Parametric approximation techniques exist for the uncertainty around the MAP estimate and for the whole posterior. We do not deal with them here.

## Examples of generative models:
* Ordinary regression:
$y$ arises as a function of covariates, plus noise
($b$ and $\sigma$ arise from some fuzzy distribution left unspecified)
$$y = x ^T b + N(0, \sigma)$$
* Autoregressive time-series model (a damped random walk):
$y$ arises as a function of previous $y$, plus noise
($a$ again comes from something random, left unspecified but within $[0, 1]$)
$$y(t) = a y(t-1) + (1-a) N(0, \sigma)$$
* Latent random walk for time series:
$$y(t) = \mu(t) + N(0, \sigma_\textit{obs})$$
$$\mu(t) = \mu(t-1) + N(0, \sigma_\textit{rw})$$
Note that all $\mu(t)$ are parameters. More parameters than observations! But this is OK.
* A biased coin:
$$y_i \sim \textrm{Bernoulli}(\theta)$$
* Many coins, each with different unknown bias, and a distribution for the biases:
$$y_{ki} \sim \textrm{Bernoulli}(\theta_k)$$
$$\theta_k \sim \textrm{Beta}(a_1, a_2)$$
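
Because these models are generative, each can be run forward to produce fake data. The sketch below does this with NumPy for the latent random walk and the many-coins model. The function names, the starting value $\mu(0) = 0$, and all parameter values are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

def simulate_latent_random_walk(n_steps, sigma_rw, sigma_obs, rng):
    """Latent random walk: mu(t) = mu(t-1) + noise, y(t) = mu(t) + noise."""
    steps = rng.normal(0.0, sigma_rw, size=n_steps)
    mu = np.cumsum(steps)                      # assumes mu(0) = 0
    y = mu + rng.normal(0.0, sigma_obs, size=n_steps)
    return mu, y

def simulate_coins(n_coins, n_flips, a1, a2, rng):
    """Many coins: theta_k ~ Beta(a1, a2), y_ki ~ Bernoulli(theta_k)."""
    theta = rng.beta(a1, a2, size=n_coins)
    # Row k holds the flips of coin k; theta is broadcast along the rows.
    y = rng.binomial(1, theta[:, None], size=(n_coins, n_flips))
    return theta, y

mu, y_rw = simulate_latent_random_walk(100, sigma_rw=0.1, sigma_obs=0.5, rng=rng)
theta, y_coins = simulate_coins(5, 20, a1=2.0, a2=2.0, rng=rng)
print(y_rw[:5])
print(theta, y_coins.mean(axis=1))
```

Inference goes the other way: given only `y_rw` or `y_coins`, recover plausible values of `mu` or `theta`.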


In general, you define how data $y$ depends on parameters, $p(y | \theta)$, called the likelihood, and how parameters are distributed if you have no data at all, $p(\theta)$, called the prior. PP gives you samples from $p(\theta | y)$, the posterior distribution of the parameters given the data.
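
As a small worked case, take the biased coin from the examples above and assume, purely for illustration, a $\textrm{Beta}(a_1, a_2)$ prior on $\theta$. With $k$ heads among $n$ flips:
$$p(y | \theta) = \theta^k (1-\theta)^{n-k}, \qquad p(\theta) \propto \theta^{a_1-1} (1-\theta)^{a_2-1}$$
$$p(\theta | y) \propto p(y | \theta)\, p(\theta) \propto \theta^{k+a_1-1} (1-\theta)^{n-k+a_2-1}$$
This is the kernel of $\textrm{Beta}(a_1 + k,\ a_2 + n - k)$, so in this case the posterior is available in closed form. Most models are not this convenient.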


## Why Bayes?
* You need good, detailed estimates of uncertainty.
* You don’t have a clear unit of independent data, but a hierarchy, lateral dependencies in time or space, or qualitatively different sets of observations.
* You want to combine an old model (or a strong prior) with new data (a model update).
* You have so little data that uncertainty dominates.


## Why probabilistic programming?
* Most practical problems are somehow non-standard, and getting the details right is often critical. Standard packages then do not apply, and you need a custom model.
* Your setup, data, or model has non-trivial structure, and there is no hope with standard packages at all.
* You don’t have three months to derive the gradients and write a sampler.


## Why not probabilistic programming?
* Your data is too big; computation would take forever.
* No time, just give me some estimates, any estimates better than guessing!


## Levels:
* Use a specialized package: in R, brms or rstanarm; Facebook’s Prophet, …; you don’t even know you are using PP.
* Write raw likelihood: Stan, PyMC, TensorFlow Probability, …
* There are also approximate inference techniques: variational, etc., but these are maybe for advanced users only.
* Research on inference techniques, deep probabilistic models, etc.


## Given data and a model, how do you get the parameters?
* In principle easy: $$p(\theta | y) \propto p(y | \theta) p(\theta)$$
* In practice, you typically cannot solve for the distribution of $\theta$ in closed form.
   * Finding the normalization constant would require very non-trivial integration.
* $\theta$ is high-dimensional, so grids over the parameters are not feasible.
* But there are ways to get samples from the posterior.
* A sample means a set of possible (parameter) worlds behind the data, or scenarios of parameters:
   * “It seems that the tulips are either yellow and small, or big and a bit darker, but we don’t know which one.”
* For continuous distributions, the best samplers are based on gradients, or even higher derivatives, of the posterior distribution.
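
For a single parameter, the proportionality above can be evaluated directly on a grid, which makes the idea of posterior samples concrete. The sketch below assumes a made-up data set (7 heads in 10 flips of one coin) and a $\textrm{Beta}(2, 2)$ prior; neither comes from the text.

```python
import numpy as np

# Hypothetical data: 7 heads in 10 flips of one coin.
n_flips, n_heads = 10, 7
a1, a2 = 2.0, 2.0                        # assumed Beta(a1, a2) prior on theta

theta = np.linspace(0.001, 0.999, 999)   # grid over the single parameter
log_prior = (a1 - 1) * np.log(theta) + (a2 - 1) * np.log1p(-theta)
log_lik = n_heads * np.log(theta) + (n_flips - n_heads) * np.log1p(-theta)

# Unnormalized log posterior; subtracting the max avoids numerical underflow.
log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())
post /= post.sum()                       # on a grid, normalization is a plain sum

post_mean = np.sum(theta * post)

# Draw grid points with posterior probabilities: a set of parameter scenarios.
rng = np.random.default_rng(1)
samples = rng.choice(theta, size=2000, p=post)
print(post_mean, samples.mean(), samples.std())
```

With $d$ parameters the same grid would need $999^d$ evaluations, which is why this approach stops being practical almost immediately.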

Some platforms compute gradients and find an optimum for you (TensorFlow etc.).
Other platforms compute gradients, sample, and so find the posterior distribution for you (Stan etc.).
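
To make the optimum-finding route concrete, here is a minimal sketch of gradient ascent on the log posterior of a single coin, assuming 7 heads in 10 flips and a $\textrm{Beta}(2, 2)$ prior (both made up for illustration). A platform would derive the gradient automatically; here it is written by hand, and the optimization runs over an unconstrained variable `z` with $\theta = 1/(1+e^{-z})$.

```python
import numpy as np

n_flips, n_heads = 10, 7      # hypothetical data
a1, a2 = 2.0, 2.0             # assumed Beta(a1, a2) prior

# log p(theta | y) = A log(theta) + B log(1 - theta) + const
A = n_heads + a1 - 1
B = n_flips - n_heads + a2 - 1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_log_post(z):
    """Derivative of A log(sigmoid(z)) + B log(1 - sigmoid(z)) w.r.t. z."""
    return A - (A + B) * sigmoid(z)

z = 0.0                            # start at theta = 0.5
for _ in range(500):
    z += 0.1 * grad_log_post(z)    # plain gradient ascent, fixed step size

theta_map = sigmoid(z)
theta_closed_form = A / (A + B)    # mode of the Beta posterior, for checking
print(theta_map, theta_closed_form)
```

The result is a single point. It says nothing about how uncertain $\theta$ is, which is the gap that sampling fills.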


## Probabilistic programming workflow:
   * Typically you write the data-generating process as code.
   * The platform computes the likelihood (if it's not already obvious).
   * The platform computes gradients of your code.
   * The platform samples (nontrivial: tuning and utilizing gradients).
   * You get a sample from the posterior.
   * You make quality checks.


## Sampling itself is nontrivial
   * For good sampling, gradients are required.
   * The sampler needs tuning, which PP platforms do automatically.
   * Still, quality checks are needed, but these are now partly automatic.
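
One common quality check compares several independent chains: if they disagree, the sampler has not yet explored the posterior. Below is a minimal sketch of the basic Gelman-Rubin statistic, often written $\hat{R}$. The function name `r_hat` and the synthetic chains are assumptions for illustration, and PP platforms typically report more refined variants of this statistic.

```python
import numpy as np

def r_hat(chains):
    """Basic Gelman-Rubin statistic for an array of shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    n_draws = chains.shape[1]
    within = chains.var(axis=1, ddof=1).mean()            # W: mean within-chain variance
    between = n_draws * chains.mean(axis=1).var(ddof=1)   # B: variance of chain means
    var_plus = (n_draws - 1) / n_draws * within + between / n_draws
    return np.sqrt(var_plus / within)

rng = np.random.default_rng(3)
# Synthetic "chains" for illustration only, not the output of a real sampler.
mixed = rng.normal(0.0, 1.0, size=(4, 1000))              # all chains agree
stuck = mixed + np.array([[0.0], [0.0], [3.0], [3.0]])    # two chains sit elsewhere

print(r_hat(mixed))   # close to 1
print(r_hat(stuck))   # clearly above 1
```

Values close to 1 are consistent with well-mixed chains; values clearly above 1 mean the draws should not be trusted yet.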