---
title: Stochaskell
author: David A Roberts
csl: styles/chicago-author-date.csl
link-citations: true
---
Probabilistic programming systems automate the task of translating specifications of probabilistic models into executable inference procedures. Here we present Stochaskell, an embedded domain-specific language for specifying probabilistic programs that can be compiled to multiple existing probabilistic programming languages. Programs are written as Haskell code, which transparently constructs an intermediate representation that is then passed to specialised code generation modules. This allows tight integration with existing probabilistic programming systems, with each system providing inference for different parameters of a model. These model-specific inference strategies are also written as Haskell code, providing a powerful means of structuring and composing inference methods.
At one end of the probabilistic programming spectrum, systems like Stan produce high-performance inference programs utilising sampling algorithms such as Hamiltonian Monte Carlo (HMC). However, a consequence of this is that Stan does not support sampling discrete model parameters, requiring users to manually marginalise them out of the model prior to implementing it in Stan.[^1] In particular, this precludes the ability to perform inference on models in which the dimensionality of the parameter space is itself a random variable. At the other end of the spectrum, implementations of Church [@goodman08] emphasise the ability to perform inference with few limitations on the models permitted.
Unfortunately, the model specification language is often tightly coupled with the system for translating the specification into an executable inference program. This requires users to rewrite their models in a number of different languages in order to fully take advantage of the variety of available probabilistic programming systems. Not only is this a burden on the user, but it also limits the ability to integrate a number of different probabilistic programming systems in the implementation of a single model. For instance, it may be desirable to infer continuous model parameters with Stan, whilst delegating the discrete and trans-dimensional parameters of the same model to a more universally applicable method.
Here we present Stochaskell, a probabilistic programming language designed for portability, allowing models and inference strategies to be written in a single language, but inference to be performed by a variety of probabilistic programming systems. This is achieved through runtime code generation, whereby code is automatically produced to perform inference via an external probabilistic programming system on subsets of the model specified by the user. In this way, users can benefit from the diverse probabilistic programming ecosystem without the cost of manually rewriting models in multiple different languages.
Many probabilistic programming systems --- particularly those implementing the technique described by @wingate11 --- rely on code generation by translating probabilistic programs to a general-purpose programming language. Probabilistic C [@paige14] goes further by positioning itself as a code generation target language for other probabilistic programming systems. Hakaru [@narayanan16] performs inference through program transformation, but both the source and target are expressed in the same probabilistic programming language. The Fun language [@borgstrom11] is implemented through compilation to Infer.NET programs. To the best of our knowledge, Stochaskell is the first probabilistic programming language capable of generating code for multiple probabilistic programming systems, spanning diverse probabilistic programming paradigms.
In this section, we will introduce Stochaskell through a number of example probabilistic programs.[^2] We begin with a simple program to simulate a homogeneous Poisson process over a fixed interval:
```haskell
poissonProcess :: R -> R -> P (Z,RVec)
poissonProcess rate t = do
  n <- poisson (rate * t)
  s <- orderedSample n (uniform 0 t)
  return (n,s)
```
The program takes two real numbers as input,[^3] specifying the rate of the process and the length of the interval over which to simulate it:

```haskell
> simulate (poissonProcess 1 5)
(2,[1.2151445207978782,2.7518508992238075])
```
Of course, the real benefit of probabilistic programming is not just the ability to simulate data, but to infer model parameters given some observed data. However, we defer further discussion of this to the following section, focusing for now on just the programs themselves.
We now consider the following program, which implements a Gaussian process (GP) model [@rasmussen06, sec. 2.2]:
```haskell
gp :: (R -> R -> R) -> Z -> RVec -> P RVec
gp kernel n x = do
  let mu = vector [ 0 | i <- 1...n ]
      cov = matrix [ kernel (x!i) (x!j) | i <- 1...n, j <- 1...n ]
  g <- normal mu cov
  return g
```
The definitions of `mu` and `cov` illustrate the usage of Stochaskell's abstract arrays.[^4] In contrast to concrete arrays,[^5] the dimensions of abstract arrays can be specified in terms of random variables. Array elements can be accessed using the conventional subscript operator, as witnessed by the expression `(x!i)`, providing the value of the *i*th location at which we are evaluating the GP. It is also worth drawing attention to the fact that the first input to this program is a function which takes two real locations and returns the desired covariance of the GP at those locations.[^6] This highlights an advantage of the embedded approach, whereby existing Haskell functions can be freely used within Stochaskell programs.
For instance, in the following code,[^7] `kernelSE1` specifies a squared exponential kernel function with unit signal variance and length-scale hyper-parameters [@rasmussen06, sec. 4.2], and a small Gaussian observation noise of variance $10^{-6}$:

```haskell
kernelSE1 = kernelSE (log 1) (log 1)

kernelSE lsv lls2 a b =
  exp (lsv - (a - b)^2 / (2 * exp lls2))
    + if a == b then 1e-6 else 0
```
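Because `kernelSE` is ordinary polymorphic Haskell, the same definition can also be evaluated directly at concrete `Double` arguments, outside of any probabilistic program. A minimal standalone sketch (re-stating the definitions above at the `Double` type, purely for illustration):

```haskell
-- Squared exponential kernel with diagonal jitter, as defined above,
-- specialised to plain Doubles.
kernelSE :: Double -> Double -> Double -> Double -> Double
kernelSE lsv lls2 a b =
  exp (lsv - (a - b)^2 / (2 * exp lls2))
    + if a == b then 1e-6 else 0

kernelSE1 :: Double -> Double -> Double
kernelSE1 = kernelSE (log 1) (log 1)

main :: IO ()
main = do
  print (kernelSE1 0 0)  -- unit variance plus the 1e-6 jitter on the diagonal
  print (kernelSE1 0 1)  -- exp (-1/2); no jitter off the diagonal
```

Evaluating the kernel at a few points like this is a quick sanity check that it behaves as intended before using it inside `gp`.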
Although it is possible to use the provided `normal` primitive to sample from a multivariate Gaussian distribution,[^8] it is also possible to do so more explicitly by transforming a vector of (univariate) standard normal samples, which can be useful for fine-tuning inference efficiency. The following program defines such a transformation with the aid of a Cholesky decomposition [@kroese11, sec. 4.3]:
```haskell
normalChol :: Z -> RVec -> RMat -> P RVec
normalChol n mu cov = do
  w <- joint vector [ normal 0 1 | i <- 1...n ]
  return (mu + chol cov #> w)
```
The `joint` keyword indicates that we are sampling a random vector whose elements have the given marginal distributions. Note that, although the elements of this random vector are (conditionally) independent, correlations can be induced by applying a deterministic transformation, as with the matrix-vector multiplication (`#>`) in the above program.
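To see why this transformation yields the right covariance: if `w` has identity covariance and `L` is the Cholesky factor of `cov`, then `L #> w` has covariance `L Lᵀ = cov`. A plain-Haskell sanity check for the 2×2 case (`chol2` is a hand-written factor for this illustration, not part of Stochaskell):

```haskell
import Data.List (transpose)

-- Lower-triangular Cholesky factor of the 2x2 covariance [[1, r], [r, 1]],
-- computed by hand: L = [[1, 0], [r, sqrt (1 - r^2)]].
chol2 :: Double -> [[Double]]
chol2 r = [[1, 0], [r, sqrt (1 - r * r)]]

-- Naive matrix product, sufficient for this small check.
matMul :: [[Double]] -> [[Double]] -> [[Double]]
matMul a b = [ [ sum (zipWith (*) row col) | col <- transpose b ] | row <- a ]

main :: IO ()
main = print (matMul (chol2 0.5) (transpose (chol2 0.5)))
-- recovers [[1.0, 0.5], [0.5, 1.0]] up to floating-point rounding
```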
Building upon this, we can write a GP classification model [@rasmussen06, sec. 3.3], where `bernoulliLogit` represents the logit-parametrised Bernoulli distribution:
```haskell
gpClassifier :: (R -> R -> R) -> Z -> RVec -> P (RVec,BVec)
gpClassifier kernel n x = do
  g <- gpChol kernel n x
  phi <- joint vector [ bernoulliLogit (g!i) | i <- 1...n ]
  return (g,phi)
```
In this example, `gpChol` is itself a probabilistic program, defined much like the `gp` program earlier, but utilising the `normalChol` program instead of the `normal` primitive. This demonstrates how sub-programs can be easily composed in Stochaskell, another benefit we obtain for free from embedding.
Now for a somewhat different example, the following program implements the stick-breaking process of @sethuraman94 with concentration parameter `alpha`:
```haskell
stickBreak :: R -> P RVec
stickBreak alpha = do
  sticks <- joint vector [ beta 1 alpha | i <- 1...infinity ] :: P RVec
  let sticks' = vector [ 1 - (sticks!i) | i <- 1...infinity ]
      rems = scanl (*) 1 sticks'
      probs = vector [ (sticks!i) * (rems!i) | i <- 1...infinity ]
  return probs
```
Here we are utilising the higher-order function *scan*. Much like the closely related *fold* operation (sometimes called *reduce* or *accumulate*), scan allows us to recursively combine elements of an array. In this example, `rems` is a vector containing the cumulative product of `sticks'`.
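The same `scanl` exists on ordinary Haskell lists, which makes the cumulative-product behaviour easy to inspect in isolation (a standalone sketch, not Stochaskell code):

```haskell
-- Cumulative products of remaining stick lengths, mirroring 'rems' in
-- stickBreak: starting from 1, repeatedly multiply by each element.
remsOf :: [Double] -> [Double]
remsOf = scanl (*) 1

main :: IO ()
main = print (remsOf [0.5, 0.5, 0.5])  -- [1.0,0.5,0.25,0.125]
```

With each stick taking half of what remains, each successive remainder halves, exactly as the stick-breaking construction requires.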
Although fold and scan provide only a restricted form of recursion, they are powerful enough to implement practically useful models such as Dirichlet processes (DPs), as demonstrated in the following program. Stochaskell supports higher-order functions like these, rather than allowing unlimited recursion, to strike a balance between expressiveness and portability, given that some probabilistic programming systems support only limited forms of recursion, if any at all.
```haskell
dirichletProcess :: R -> P R -> P (P R)
dirichletProcess alpha base = do
  probs <- stickBreak alpha
  atoms <- joint vector [ base | i <- 1...infinity ]
  let randomDistribution = do
        stick <- pmf probs
        return (atoms!stick)
  return randomDistribution
```
Note that we can see from the type signature that this program encodes a distribution over distributions. These random distributions can be used like any other sub-program, as shown in this implementation of a DP mixture model [@neal00]:
```haskell
dpmm :: Z -> P RVec
dpmm n = do
  let base = uniform 0 100
  paramDist <- dirichletProcess 5 base
  params <- joint vector [ paramDist | j <- 1...n ]
  values <- joint vector [ normal (params!j) 1 | j <- 1...n ]
  return values
```
For details of the intermediate representation (IR) into which Stochaskell programs are translated, see @roberts19 [sec. 3.1].
Statistical inference, within a Bayesian paradigm,
involves the consideration of the posterior distribution of model parameters conditioned on observed data.
For clarity, we will focus on the common case of performing inference by drawing approximate samples from the posterior via Markov chain Monte Carlo (MCMC) methods.
However, note that Stochaskell does not restrict itself to this case,
and integration with a wide range of different inference methods is possible.
Posterior distributions can be expressed in Stochaskell with a familiar notation,[^9] for instance given a probabilistic program `prior` and some `observed` data:

```haskell
posterior = [ z | (z,y) <- prior, y == observed ]
```
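As a plain-Haskell analogue (with the list monad standing in for Stochaskell's `P`, purely for illustration), the same comprehension notation reads as filtering the prior down to the outcomes consistent with the observation:

```haskell
-- A toy discrete "prior" over (parameter, data) pairs, in the list monad.
prior :: [(Int, Char)]
prior = [(1, 'a'), (2, 'b'), (3, 'a')]

-- Conditioning on the observation 'a' keeps only consistent parameters.
posterior :: [Int]
posterior = [ z | (z, y) <- prior, y == 'a' ]

main :: IO ()
main = print posterior  -- [1,3]
```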
Stochaskell supports programmable inference, as introduced by @mansinghka14, with inference strategies expressed in the same language as the model, i.e. Haskell. This allows us to seamlessly combine different inference methods, be they provided by native Haskell code or external probabilistic programming systems. In the latter case, runtime code generation allows us to offload inference of different parts of a model without any manual code duplication.
Stochaskell integrates with Stan, to provide high-performance inference for (conditionally) fixed-dimensional, continuous model parameters. This is achieved by automatically translating Stochaskell programs into Stan code, compiling the resulting code with CmdStan, then executing the compiled model and parsing the results so that they can be further processed within Haskell. This allows the user to offload inference to Stan with a single line of code, receiving a list of posterior samples in return:[^10]
```haskell
samples <- hmcStan posterior
```
The IR outlined in the previous section can be translated to Stan code in a fairly straightforward manner. Each `Apply` node is translated to a variable definition with an automatically generated identifier. Array definitions correspond to a for-loop, or a set of nested for-loops, which iterate over each possible index value and store the appropriate element value into an array. A `Fold` node produces code that first assigns the default value to a variable, then iteratively updates it by looping over the elements of the vector.
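To make the `Fold` case concrete, a hypothetical lowering function along these lines might emit the following Stan-like statements. All identifiers here are invented for the sketch; this is not Stochaskell's actual code generator:

```haskell
-- Hypothetical lowering of fold(op, init, v) to an imperative loop:
-- assign the initial value, then update the accumulator per element.
genFold :: String -> String -> String -> String -> [String]
genFold acc initExpr vec op =
  [ acc ++ " = " ++ initExpr ++ ";"
  , "for (i in 1:num_elements(" ++ vec ++ ")) {"
  , "  " ++ acc ++ " = " ++ op ++ "(" ++ acc ++ ", " ++ vec ++ "[i]);"
  , "}"
  ]

main :: IO ()
main = mapM_ putStrLn (genFold "acc" "0" "xs" "add")
```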
Similar backends are also provided for integration with PyMC3 and Edward.
To demonstrate the portability of Stochaskell across a variety of probabilistic programming paradigms, we also provide integration with Church. The implementation and usage of this is quite similar to the Stan backend, allowing inference to be easily offloaded:[^11]
```haskell
samples <- mhChurch posterior
```
The main conceptual difference from the Stan code generation is that, for Church, arrays are represented by memoised functions. In particular, random arrays --- constructed with the `joint` keyword in Stochaskell --- utilise Church's support for stochastic memoisation [@goodman08, sec. 2.1].
The figure on the right illustrates a simulation of the DP mixture model program presented in the previous section, sampled via the Church backend.
Stochaskell provides a native Haskell library both for simulating probabilistic programs and for computing the probability density function of the distribution they represent, where possible. A notable feature of the latter is that we provide limited support for computing the density of transformed random variables. Suppose we have $Y = g(X)$, where $X$ is a random variable with density $p_X$ and $g$ is an injective, differentiable transformation; then the density of $Y$ is given by

$$p_Y(y) = \frac{p_X(g^{-1}(y))}{\left|\det J_g(g^{-1}(y))\right|}$$

where the denominator is the absolute value of the determinant of the Jacobian matrix of the transformation.
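As a concrete instance of the change-of-variables rule, taking $Y = \exp X$ with $X \sim \mathcal{N}(0,1)$ yields the log-normal density, since the Jacobian of $\exp$ evaluated at $x = \log y$ is simply $y$. A quick standalone check in plain Haskell:

```haskell
-- Standard normal density.
normalPdf :: Double -> Double
normalPdf x = exp (negate (x * x) / 2) / sqrt (2 * pi)

-- Density of Y = exp X via change of variables:
-- p_Y(y) = p_X(log y) / |exp x at x = log y| = p_X(log y) / y.
logNormalPdf :: Double -> Double
logNormalPdf y = normalPdf (log y) / y

main :: IO ()
main = print (logNormalPdf 1)  -- p_X(0) / 1 = 1 / sqrt (2 * pi)
```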
This density computation library allows us to provide a simple implementation of the Metropolis--Hastings (M--H) inference algorithm, with support for user-defined proposal distributions:[^12]
```haskell
mh target proposal x = do
  y <- simulate (proposal x)
  let f = density target
      q z = density' (proposal z)
      a = fromLogFloat $
            (f y * q y x) / (f x * q x y)
  accept <- bernoulli (if a > 1 then 1 else a)
  return (if accept then y else x)
```
Proposals are specified by probabilistic programs written in Stochaskell, alongside the model. For instance, given the following random walk proposal:
```haskell
proposal :: R -> P R
proposal x = do
  z <- normal 0 1
  return (x + z)
```
we can sample from the distribution represented by some `program` via M--H as follows:

```haskell
x <- chain n (program `mh` proposal) x0
```

where `n` is the number of iterations and `x0` is some initial state.
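A `chain`-like iteration can be sketched in plain Haskell with `foldM` (named `chainSketch` here to emphasise it is an illustration, not Stochaskell's own definition):

```haskell
import Control.Monad (foldM)

-- Iterate a monadic transition kernel n times from an initial state,
-- as 'chain n kernel x0' does for an M-H transition kernel.
chainSketch :: Monad m => Int -> (a -> m a) -> a -> m a
chainSketch n kernel x0 = foldM (\x _ -> kernel x) x0 [1 .. n]

main :: IO ()
main = do
  -- With a deterministic "kernel", five steps of (+1) from 0:
  x <- chainSketch 5 (\x -> return (x + 1)) (0 :: Int)
  print x  -- 5
```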
For details of how such samplers are derived automatically, see @roberts19.
^[{-} Replication of @green09 [fig. 1] with a Reversible Jump sampler automatically derived by Stochaskell using the method described in @roberts19. [Binder](https://mybinder.org/v2/gh/davidar/stochaskell/master?filepath=rj.ipynb)] ![](images/soccer.svg){width=100%} ^[{-} Replication of @green95 [figs. 1--4], as above. [Binder](https://mybinder.org/v2/gh/davidar/stochaskell/master?filepath=rj-coal.ipynb)]
| | |
|---|---|
| ![](images/coal-mean.svg) | ![](images/coal-n.svg) |
| ![](images/coal-s.svg) | ![](images/coal-g.svg) |
To quickly get started using Stochaskell, it can be launched via Binder. It can also be launched on a local machine by installing Docker and running the following commands:
```shell
docker pull davidar/stochaskell
docker run -it -p8888:8888 davidar/stochaskell
```
To build from source, rather than using a pre-built image, run the following commands:
```shell
git clone --recursive https://github.com/davidar/stochaskell.git
docker build -t stochaskell:latest stochaskell/
docker run -it -p8888:8888 stochaskell:latest
```
Footnotes

[^1]: See §7 of the Stan User's Guide.

[^2]: Also see *A Gentle Introduction to Haskell* and the Stochaskell API documentation.

[^3]: Each represented by the type `R` in the type signature. Likewise, `Z` represents integers, `RVec` a vector of reals, and `P (Z,RVec)` a probability distribution over pairs of these.

[^4]: Using the monad comprehensions language extension.

[^5]: Such as those provided by the standard `Data.Array` module.

[^6]: The kernel function should be symmetric and positive semidefinite [@rasmussen06, sec. 4.1].

[^7]: Here we are utilising the `Data.Boolean.Overload` module to ensure the definition is sufficiently polymorphic.

[^8]: {-} Plot of five independent simulations of `gp kernelSE1`.

[^9]: Thanks to monad comprehensions, and a generalised `guard` function.

[^10]: {-} Plot of posterior samples from the Gaussian process classification model program from the previous section, sampled via the Stan backend.

[^11]: {-} Histogram of $n=10^4$ samples from the Dirichlet process mixture model program from the previous section, via the WebChurch implementation of Church.

[^12]: Here `density` includes the Jacobian adjustment, whereas `density'` does not. The latter allows us to use proposals which make few random choices relative to the size of the state, under the assumption that they do not involve non-trivial transformations. Note: these functions have been deprecated in favour of the more efficient `lpdf` and `lpdfAux` functions.