# Minimal probabilistic expression in Haskell
### Haskell Kōan: Probabilistic programming with conditioning

We implement a tiny embedded domain-specific language which allows us to _sample from random variables_ and
build computations from them. We also build the less-known but equally important ability to
use _conditional reasoning_, where we can condition a random variable on another random variable. Implementing this is the raison d'être of this Kōan.

Implementing this within Haskell with a monadic interface allows us to leverage the full power of Haskell,
making our computations compositional, and our implementation of _conditioning_ concise. We will allow users to build expressions denoting _random variables_, which will automatically be typed (due to being embedded within Haskell). We also gain the ability to _introspect_ these expressions, which will be the key requirement to build the conditioning infrastructure. Finally, we can _freely mix_ ordinary Haskell code with our random variables. All of this comes for free from the design of the library.

This can be seen as a minimal implementation of [`monad-bayes`](https://github.com/adscib/monad-bayes), whose general approach we follow while reducing the generality for brevity.

Let's begin: we first enable [GADTs](https://en.wikibooks.org/wiki/Haskell/GADT), and import some libraries.

In [2]:
{-# LANGUAGE GADTs #-}
import Control.Monad (ap, forM_, liftM2, replicateM)
import System.Random (getStdRandom, randomR)
import qualified Data.Map as M

Now, we create a new type called `R`, for *random variable expression*.

There are three ways to construct a random variable expression:

### `Return a`

`Return a` lifts a pure value `a` into a random variable, which takes on only one value `a`. This has no randomness.

For example, we can construct

```
f = Return 5
```
to lift the pure value `5` as a random variable expression.

### `Uniform f`

`Uniform f` takes a callback function `f :: Float -> R a`. The input of `f` is a random sample `(s :: Float)`.  The output of `f` is  `(f s :: R a)`, which is a new *random variable expression* which might depend on the random sample `s`. `Uniform f :: R a` wraps this request as a new *random variable expression*. (This model of thinking follows the freer-monad approach, [about which one can read more here](http://okmij.org/ftp/Haskell/extensible/more.pdf))

For example, we can construct

```
f = Uniform $ \u -> -- ^ request for a random sample 
      Uniform $ \u' -> -- ^ request for another random sample
        Return $ u + u' -- ^ use the responses to return a value
```

### `Weigh w r`

`Weigh w r` takes as input a scaling factor `w :: Float`, and more random computations that we wish to perform as `r :: R a`. It scales the probability of the `r` computation to be executed by a factor of `w`. 

For example, we can construct:

```
f = Uniform $ \u -> -- ^ request for a random *uniformly distributed* sample
      Weigh u (Return $ u * u) -- ^ probability of returning value `u*u` is now scaled by `u`.
```


Note that while the user of the library does not need to care about the actual probabilities, the library
implicitly keeps track of the probabilities of the random samples.

These random expressions `R` support `Functor`, `Applicative`, and `Monad` instances, which we implement in the following code block to easily build up random computations. We include `runUnweighted`, which shows how to run an `R a` computation to get a _random_ `a` value, without taking into account the weights (hence, `unweighted`).

In [3]:
data R a where
  Return :: a -> R a -- ^ lift a pure value
  Uniform :: (Float -> R a) -- ^ computation that needs a random number to provide the rest 
                            -- of the computation
            -> R a
  Weigh :: Float -> R a -> R a   -- ^ scale the probability of the computation by a factor
  
instance Functor R where
  fmap f (Return x) = Return (f x)
  fmap f (Uniform rand2m) = Uniform $ \r -> fmap f (rand2m r)
  fmap f (Weigh w m) = Weigh w (fmap f m) 
  
instance Applicative R where
  pure = Return
  (<*>) = ap

instance Monad R where
  Return a >>= f = f a
  Uniform rand2m >>= f = Uniform $ \r -> rand2m r >>= f
  (Weigh w m) >>= f = Weigh w (m >>= f)
  
-- | Run a random computation, *not taking into account the weights*.
-- | This will be used to run the _traced computation_, which
-- | does take into account the weights.
runUnweighted :: R a -> IO a
runUnweighted (Return a) = return a
runUnweighted (Weigh w m) = runUnweighted m
runUnweighted (Uniform rand2m) = do
  r <- getStdRandom $ randomR (0, 1)
  runUnweighted (rand2m r) 
  

-- | A value that is uniformly distributed over (0, 1). Convenient constructor
uniform01 :: R Float
uniform01 = Uniform Return

-- Quick run of the uniform sampler to see what it outputs.
runUnweighted (replicateM 5 uniform01) >>= \xs -> putStrLn $ "uniform random values: " <> show xs


uniform random values: [0.99059993,0.53365195,0.96819735,0.83136976,0.47343558]
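`runUnweighted` draws its samples from the global generator, so its output changes on every run. As a minimal sketch (a standalone file, not one of the notebook's cells), the same three constructors can also be interpreted against a fixed list of samples, which makes it easy to see which `Uniform` consumes which number. The names `runWith` and `twoSamples` are assumptions made for illustration.

```haskell
-- Standalone sketch: the same three constructors as the notebook's `R`,
-- written as an ordinary data type so that no language extension is needed.
data R a
  = Return a
  | Uniform (Float -> R a)
  | Weigh Float (R a)

-- | Hypothetical helper: interpret a computation against a fixed supply of
-- samples. Weights are ignored, as in `runUnweighted`. Returns Nothing if
-- the supply runs out before the computation finishes.
runWith :: [Float] -> R a -> Maybe a
runWith _ (Return a) = Just a
runWith rs (Weigh _ m) = runWith rs m
runWith (r:rs) (Uniform k) = runWith rs (k r)
runWith [] (Uniform _) = Nothing

-- | Two requests for randomness; the result is the sum of both samples.
twoSamples :: R Float
twoSamples = Uniform $ \u -> Uniform $ \u' -> Return (u + u')
```

Here `runWith [0.25, 0.5] twoSamples` evaluates to `Just 0.75`: the first `Uniform` receives `0.25` and the second receives `0.5`. With only one sample available, `runWith [0.25] twoSamples` is `Nothing`.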

### Composing `Uniform` to build more complex random variables

Now that we have a way to get random numbers _uniformly_ from the range (0, 1) we'll use this to build
more complicated random variables --- namely, coins with different biases.

`coin p` creates a coin that returns `1` with probability `p`, and `0` with probability `(1 - p)`.
We pick a number `r` uniformly from `(0, 1)`. If this number is less than `p`, we return `1`, otherwise we
return `0`.

In [4]:
-- | 'coin p' returns 1 with probability p, 0 with probability (1 - p) 
coin :: Num a => Float -> R a
coin p = do
  r <- uniform01
  return $ if r < p then 1 else 0
  
-- | Example runs of the coin
runUnweighted (replicateM 10 (coin 0)) >>= \xs -> putStrLn $ "coin 0: " <> show xs
runUnweighted (replicateM 10 (coin 1)) >>= \xs -> putStrLn $ "coin 1: " <> show xs
runUnweighted (replicateM 10 (coin 0.5)) >>= \xs -> putStrLn $ "coin 0.5: " <> show xs

coin 0: [0,0,0,0,0,0,0,0,0,0]

coin 1: [1,1,1,1,1,1,1,1,1,1]

coin 0.5: [0,1,1,0,1,1,0,1,1,0]

We also build `discrete`, that lets us pick a discrete value from a list of values with equal probability.

In [5]:
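-- | Choose a discrete value with equal probability
discrete :: [a] -> R a
discrete as = do
  r <- uniform01
  -- clamp the index so that a sample of exactly 1.0 still picks the last element
  let ix = min (length as - 1) (floor $ r * fromIntegral (length as))
  return $ as !! ix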
  
runUnweighted (replicateM 10 (discrete [1, 10, 100])) >>= \xs -> putStrLn $ "discrete: " <> show xs

discrete: [10,100,100,10,10,100,100,100,1,1]
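The index computation inside `discrete` splits the unit interval into equally sized pieces, one per list element. The sketch below isolates that arithmetic as a pure function so it can be checked by hand; `pickIndex` is a hypothetical name, and the `min` guard for a sample of exactly `1.0` is an assumption about how that edge case should be handled.

```haskell
-- | Hypothetical helper mirroring the index computation in `discrete`:
-- map a sample r in [0, 1] to an index into a list of length n (n >= 1).
-- The `min` keeps r == 1.0 from producing the out-of-range index n.
pickIndex :: Int -> Float -> Int
pickIndex n r = min (n - 1) (floor (r * fromIntegral n))
```

For a three-element list, `map (pickIndex 3) [0.0, 0.34, 0.66, 0.99, 1.0]` gives `[0,1,1,2,2]`: each third of the interval selects one element.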

Let's also write a tiny utility to plot many values onto the command line with fancy
ASCII-art. This is useful when we want to sample a _large number of things_ and look at
histograms with `histogram`, or look at the values, with `printvals`.

In [6]:
-- | List of characters that represent sparklines
sparkchars :: String
sparkchars = "_▁▂▃▄▅▆▇█"

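-- | Convert a value to a sparkline character, scaled against the maximum value
num2spark :: RealFrac a => a -- ^ Max value
  -> a -- ^ Current value
  -> Char
num2spark maxv curv
   | maxv == 0 = head sparkchars -- avoid dividing by zero when every value is 0
   | otherwise =
       sparkchars !!
         (floor $ (curv / maxv) * (fromIntegral (length sparkchars - 1)))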

-- | Print sparklines with title
printvals :: RealFrac a => String -> [a] -> IO ()
printvals title vs = do 
  let maxv = maximum vs
  putStrLn $ title ++ " " ++ map (num2spark maxv) vs
  
-- | Create a histogram from values.
histogram :: RealFrac a
          => String -- ^ title
          -> Int -- ^ number of buckets
          -> [a] -- values
          -> IO ()
histogram title nbuckets vs = do
        let minv = minimum vs
            maxv = maximum vs
            perbucket = (maxv - minv) / (fromIntegral nbuckets)
            bucket v = floor ((v - minv) / perbucket)
            bucketed = M.fromListWith (+) [(bucket v, 1) | v <- vs]
        printvals title $ M.elems $ bucketed
        

We can now draw our previous coins. A vertical bar means that we got a `1`, and not having a vertical bar
means that we got a `0`. We would expect `bias 0` to have no vertical bars (since we should never get a `1`). Similarly, we would expect `bias 1` to have only vertical bars (since we should always get a `1`).

In [7]:
-- | Print samples of coins.
runUnweighted (replicateM 20 (coin 0)) >>= printvals "coin: bias 0"
runUnweighted (replicateM 20 (coin 0.2)) >>= printvals "coin: bias 0.2"
runUnweighted (replicateM 20 (coin 0.5)) >>= printvals "coin: bias 0.5"
runUnweighted (replicateM 20 (coin 0.8)) >>= printvals "coin: bias 0.8"
runUnweighted (replicateM 20 (coin 1)) >>= printvals "coin: bias 1"

## Why to `weigh` 

We now have a DSL that can describe expressions of random variables and sample from their distributions. But what is missing and essential is the ability to do _inference_: we wish to modify our distribution based on observations or arbitrary conditions. `weigh` allows us to seamlessly include conditioning into our language concisely. 

Besides inference, conditioning makes our language a lot more convenient. For example, suppose we wish to describe a random variable which is the sum of two dice, where each die lands on a _prime_ value. Given `weigh`, we can express this as:

```
dice = discrete [1..6]
sumdice = do
 d <- dice
 d' <- dice
 weigh $ if isprime d then 1 else 0
 weigh $ if isprime d' then 1 else 0
 return $ d + d'
```

Doing this without `weigh` would require us to construct this probability distribution from a uniform distribution, which is a repellent task.
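Because the dice example is small, the distribution that the two `weigh` calls describe can be enumerated exactly. The following standalone sketch does so with plain lists instead of the `R` type; `isPrime`, `weightedSums` and `probOfSum` are hypothetical names introduced for illustration. A weight of `0` removes an outcome and a weight of `1` keeps it, so conditioning here amounts to filtering followed by renormalising.

```haskell
-- | Primality test for small positive integers.
isPrime :: Int -> Bool
isPrime n = n > 1 && all (\k -> n `mod` k /= 0) [2 .. n - 1]

-- | All (sum, weight) pairs for two dice. The weight is 1 when both dice
-- show a prime and 0 otherwise, mirroring the two calls to `weigh`.
weightedSums :: [(Int, Float)]
weightedSums =
  [ (d + d', w d * w d')
  | d <- [1 .. 6], d' <- [1 .. 6] ]
  where
    w x = if isPrime x then 1 else 0

-- | Normalised probability of a given sum under the weights above.
probOfSum :: Int -> Float
probOfSum s = sum [w | (v, w) <- weightedSums, v == s] / total
  where
    total = sum (map snd weightedSums)
```

Only the nine pairs drawn from `{2, 3, 5}` keep a non-zero weight, so for instance `probOfSum 5` is `2/9` (from `2 + 3` and `3 + 2`) while `probOfSum 2` is `0`.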

Let's see how `weigh` allows us to express the biased coin in a different way. We call this new implementation `coin'`. We pick `0` and `1` with equal probability by using the call `fair <- discrete [0, 1]`. We then
_bias_ the probability of getting a `0` or a `1` with calls to `weigh`:

- If `fair` returns `1`, we bias the probability by `b` by calling `weigh $ b`.
- Similarly, if `fair` returns `0`, we bias the probability by `(1 - b)` by calling `weigh $ 1 - b`.

This gives us a biased coin, but in a completely different way: before, the computation itself returned a biased value. Now in `coin'`, the computation is unbiased, but the probability of it being executed is biased. This can be seen by running `coin'` with `runUnweighted`: we will see that the samples appear as if they are from a fair coin.
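To see why these two weights yield a coin with bias `b` once they are taken into account, multiply the probability of each fair outcome by its weight:

$$
\begin{aligned}
P(1) &\propto \tfrac{1}{2} \cdot b \\
P(0) &\propto \tfrac{1}{2} \cdot (1 - b)
\end{aligned}
$$

The two right-hand sides sum to $\tfrac{1}{2}$, so after normalising we get $P(1) = b$ and $P(0) = 1 - b$, which is exactly the distribution of `coin b`.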

In [8]:
-- | Change the weight of the rest of the computation. As of now, we cannot
-- interpret this.
weigh :: Float -> R ()
weigh w = Weigh w (Return ())

-- | A biased coin, created from a fair coin.
coin' :: (Eq a, Num a) => Float -> R a
coin' b = do
  -- | pick heads or tails with uniform probability
  fair <- discrete [0, 1]
  -- | if the fair coin landed 1...
  if fair == 1
  then weigh $ b -- weigh the outcome by `b`.
  else weigh $ (1 - b) -- otherwise weigh the outcome by `1 - b`.
  -- | return the value that was tossed, with the new weight.
  return fair

-- | If we had weighting, we would see no vertical bars. But we do see vertical bars,
-- | as in the case of (coin 0.5), since without weighing, `coin'` is a fair coin.
runUnweighted (replicateM 40 (coin' 0)) 
  >>= printvals "coin' 0 (will appear as fair coin, since we do not deal with weigh): "

coin' 0 (will appear as fair coin, since we do not deal with weigh):  ██████_████_█_██_██__██___█_█___█_█_█___

## How to `weigh`

We need the ability to take into account the calls to `weigh` we have in the code.
For this, we use a technique described in the [Church programming language paper](https://web.stanford.edu/~ngoodman/papers/churchUAI08_rev2.pdf), and also explained in the paper [Denotational validation of higher-order bayesian inference](https://arxiv.org/abs/1711.03219). 

#### The high-level perspective:

- In `runUnweighted`, we knew the distribution from which the random sample came: the uniform distribution. This allowed us to sample from this distribution directly.

- `Weigh` changes the probability of a value arbitrarily, and so there is no direct way for us to sample from the updated distribution.

- The way around is to use an MCMC-like technique, where we rely on _relative weights_ of samples.

- However, to perform MCMC, we need a way to propose a new sample which is close to the original sample. We only know the computation and the randomness that generated this sample. So, to propose a new value, we perturb the randomness which generated the sample to produce a new proposal sample.

- We also need a way to compare the weights of these two samples. This means we need to know the total sample weight.

This naturally leads us to consider _traces of computation_, which contain the value, the randomness that was used to produce this value, and the final weight of this value. We can perform Metropolis-Hastings on the space of _traces of computation_. We call this data structure a `Trace`.



#### The `Trace` data structure:

The idea is to sample from the space of "program traces", where a `Trace a` keeps track of:
- The final output value --- `traceval`
- All the randomness used in producing this output value (the list of `Float` samples that have been generated for each invocation of `Uniform`) --- `tracerands`
- All weighting that has been done on the output value (the product of all `weigh`s found along this computational trace) --- `traceweight`.

We store the traces in a `Trace a` object, and provide the functions:
- `liftTrace` to lift a pure value into a `Trace` that uses no randomness and weight `1.0`
- `weighTrace w t` to multiply the weight of an existing `(t :: Trace)` by `w`
- `recordRandomnessTrace r t` to store `r` in `tracerands`. This is used to record the fact
   that we have used this randomness, and will be used later on to "replay" the trace, with a perturbation.
   
*TODO: replace all randomness with samples*

In [9]:
-- | Trace of computation 
data Trace a = 
  Trace { traceval :: a -- ^ value being traced
        , tracerands :: [Float] -- ^ all the randomness used to produce this value
        , traceweight :: Float -- ^ weight of this current trace value
        } deriving(Show)

-- | Lift a value to a trace. Start it with weight 1.0, no randomness, and the given value
liftTrace :: a -> Trace a
liftTrace a = Trace a [] 1.0

-- | Weigh a trace by the given weight 
weighTrace :: Float -> Trace a -> Trace a
weighTrace w tr = tr {traceweight=(traceweight tr)*w}

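-- | Record the use of randomness along the trace. `traceR` calls this after
-- the rest of the computation has been traced, so prepending keeps the
-- samples in the order in which they were consumed (the order a replay needs).
recordRandomnessTrace :: Float -> Trace a -> Trace a
recordRandomnessTrace r tr = tr {tracerands=r : tracerands tr}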

Now, we implement `traceR :: R a -> R (Trace a)`, which records the full trace that was used to produce a value.

The implementation strategy is to take an expression `(r :: R a)`, and inject the correct introspection for `Return`, `Uniform` and `Weigh`. More specifically:
- for a `Return x`, we use `liftTrace` to create a new traced value from a pure value
- for a `Uniform`, we record the randomness that was passed to it with `recordRandomnessTrace`
- for a `Weigh`, we record the weight with `weighTrace`.

In [10]:
-- | given a regular computation, edit the computation to trace
-- | the computation. 
traceR :: R a -> R (Trace a)
traceR (Return x) = Return $ liftTrace x
traceR (Uniform rand2ra) = 
  Uniform $ \r -> do
      -- | feed the inner computation the random value it wants,
      -- | and continue tracing it
      t <- traceR (rand2ra r)
      -- | record the randomness that we used now.
      return $ recordRandomnessTrace r t
-- | record the weighing
traceR (Weigh w m) = do
 t <- traceR m
 return $ (weighTrace w t)

-- | Trace will have weight 0.2 or 0.8, and a single random value.
runUnweighted (traceR (coin' 0.2)) >>= \tr -> putStrLn $ "trace of coin': " <> show tr
-- | Trace will have weight 0.5, and a single random value.
runUnweighted (traceR (coin' 0.5)) >>= \tr -> putStrLn $ "trace of coin': " <> show tr
-- | Trace will have two random values
runUnweighted (traceR $ liftM2 (+) (coin' 0.2) (coin' 0.2)) >>= 
  \tr -> putStrLn $ "trace of coin': " <> show tr

trace of coin': Trace {traceval = 0, tracerands = [0.47372633], traceweight = 0.8}

trace of coin': Trace {traceval = 0, tracerands = [0.44166064], traceweight = 0.5}

trace of coin': Trace {traceval = 2, tracerands = [0.6901764,0.54960936], traceweight = 4.0000003e-2}
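The traces above differ from run to run because the samples come from the global generator. As a deterministic illustration of what a trace contains, here is a standalone sketch that computes the same three pieces of information from a fixed list of samples. It redefines the three constructors locally, and `traceWith` and `coinW` are hypothetical names; `coinW` is a hand-written version of the weighted coin that takes `0` for samples below `0.5` and `1` otherwise.

```haskell
-- Standalone sketch with a local copy of the three constructors.
data R a
  = Return a
  | Uniform (Float -> R a)
  | Weigh Float (R a)

-- | Hypothetical pure tracer: run a computation against a fixed supply of
-- samples and report (value, samples consumed in order, product of weights).
-- Returns Nothing if the supply runs out.
traceWith :: [Float] -> R a -> Maybe (a, [Float], Float)
traceWith _ (Return a) = Just (a, [], 1.0)
traceWith rs (Weigh w m) = do
  (a, used, total) <- traceWith rs m
  Just (a, used, w * total)
traceWith (r:rs) (Uniform k) = do
  (a, used, total) <- traceWith rs (k r)
  Just (a, r : used, total)
traceWith [] (Uniform _) = Nothing

-- | A weighted coin written directly with constructors: a fair draw whose
-- outcome is then weighed by b (for 1) or by 1 - b (for 0).
coinW :: Float -> R Int
coinW b = Uniform $ \r ->
  let fair :: Int
      fair = if r < 0.5 then 0 else 1
  in Weigh (if fair == 1 then b else 1 - b) (Return fair)
```

Feeding the sample `0.3` to `coinW 0.2` yields the value `0`, the consumed randomness `[0.3]` and a weight of about `0.8`; feeding `0.7` yields the value `1` with a weight of about `0.2`. This matches the shape of the first trace printed above.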

Next, we build another helper called `runRWithRandomness`. This allows us to _feed_ a sequence of pre-determined
values to run a random computation. This allows us to perturb a trace by changing the randomness it used, and then re-running it to see the new trace.

In [11]:
-- | Run the random variable, using the randomness provided until the
-- | randomness is exhausted
runRWithRandomness :: [Float] -> R a -> R a
runRWithRandomness _ (Return a) = Return a
-- | Feed the Uniform sampling the randomness we have, and continue running
-- with the rest of the randomness
runRWithRandomness (r:rs) (Uniform rand2m) = runRWithRandomness rs (rand2m r)
-- | ran out of randomness: leave the remaining requests to be sampled afresh
runRWithRandomness [] (Uniform rand2m) = Uniform rand2m
-- | keep the weight, and keep feeding randomness to the rest of the computation
runRWithRandomness rs (Weigh w m) = Weigh w (runRWithRandomness rs m)

-- | We feed the coin forcibly with the randomness as value 0.
-- Note that `coin 0.000001` will be extremely unlikely to return a 1
-- unless it is fed a random value < 0.000001.
runUnweighted (runRWithRandomness [0] (coin 0.000001)) >>= 
  \x -> putStrLn $ "coin forcibly fed with randomness 0 (super unlikely to become 1): " <> show x

coin forcibly fed with randomness 0 (super unlikely to become 1): 1

Next, we build the proposal that runs a new computation given the current computation. 

We take the randomness that was used to produce the old trace, called `rand`. We edit the randomness at an
index `ix` to be some arbitrary new value `r`. This new randomness is called `newrand`. We then re-run the original computation `mt` with the new randomness `newrand`. This gives us the new trace `t'`.

In [13]:
-- | Propose a new trace given the randomness that was used to produce
-- the old trace
proposeTrace :: [Float]  -- ^ randomness used to produce old trace
  -> R (Trace a) -- ^ computation 
  -> R (Trace a) -- ^ new proposal trace.
proposeTrace rand mt  = do
  -- | Pick a random position in the randomness of the original trace
  ix <- discrete [0..(length $ rand) - 1]
  -- | Edit the trace at this position by changing the randomness
  r <- uniform01                  
  let (randl, randr) = splitAt ix (rand)
  -- | replace the randomness of the trace at this position with this
  -- new random value, and now re-run the computation
  let newrand = randl ++ [r] ++ drop 1 randr 
  -- | re-run the old computation, by feeding it the new randomness
  t' <- runRWithRandomness newrand mt 
  return $ t'
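The list surgery at the heart of the proposal can be looked at in isolation. The sketch below is a pure, standalone version of that single step; `perturbAt` is a hypothetical name, and leaving the list untouched for an out-of-range index is an assumption made here for safety rather than something the proposal above does.

```haskell
-- | Hypothetical helper: replace the sample at index ix with r.
-- The list is returned unchanged when ix is out of range.
perturbAt :: Int -> Float -> [Float] -> [Float]
perturbAt ix r rand
  | ix < 0 = rand
  | otherwise =
      case splitAt ix rand of
        (before, _ : after) -> before ++ r : after
        _ -> rand
```

For example, `perturbAt 1 0.9 [0.1, 0.2, 0.3]` gives `[0.1, 0.9, 0.3]`: every sample except the chosen one is replayed as before, which is what keeps the proposed trace close to the old one.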

Now, we implement `tracedMhStep`, which is a step of Metropolis-Hastings over this space of traces. We propose a new trace `t'` from the randomness used to generate the old trace `t` and the computation `mt`.

Our acceptance ratio is `ratio`, which is the ratio of the weights of the traces, multiplied by the ratio of the
amounts of randomness used to produce the traces. We then perform the usual acceptance criterion of Metropolis-Hastings.
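As a sketch of where such a correction comes from, assume the proposal picks one of the recorded samples of the old trace uniformly and redraws it. Write $w(t)$ for `traceweight t` and $|t|$ for the length of `tracerands t` (this notation is introduced here for illustration only). The forward proposal then picks a given position with probability $1/|t|$, the reverse proposal with probability $1/|t'|$, and the standard Metropolis-Hastings acceptance probability becomes:

$$
\alpha(t \to t') = \min\left(1,\ \frac{w(t')}{w(t)} \cdot \frac{|t|}{|t'|}\right)
$$

When both traces use the same amount of randomness, as is the case for every computation sampled with Metropolis-Hastings in this document, this reduces to the plain weight ratio $w(t')/w(t)$.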

In [15]:
-- | Take one step of traced Metropolis-Hastings, starting from the old trace
tracedMhStep :: R (Trace a) -- ^ computation
  -> Trace a -- ^ old trace  state
  -> R (Trace a) -- ^ new state
tracedMhStep mt t = do
  -- | propose a new trace given the randomness used to produce
  -- | the old trace.
  t' <- proposeTrace (tracerands t) mt
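  -- | TODO: lookup "rosenbluth factor"
  -- The proposal picks one recorded sample uniformly, so the forward and
  -- reverse proposals have probability 1/length t and 1/length t': the weight
  -- ratio is multiplied by (length t / length t').
  let ratio = (traceweight t' * (fromIntegral . length . tracerands $ t)) /
              (traceweight t * (fromIntegral . length . tracerands $ t'))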
  accept <- uniform01
  return $ if accept < ratio then t' else t

To kick off Metropolis-Hastings, we need an initial trace on which to iterate. This needs to be
a _legal_ trace: that is, it needs to have non-zero weight. If it does not, the calculations in `tracedMhStep` will go awry: our acceptance ratios are only sensible as long as the denominator is non-zero, and the denominator contains the weight of the initial trace. So, we just repeatedly sample from `mt` until we
get a trace with non-zero weight.

In [16]:
-- | sample from the computation till we find a trace with
-- non zero weight
nonZeroWeightTrace :: R (Trace a) -> R (Trace a)
nonZeroWeightTrace mt = do
  t <- mt
  if traceweight t == 0
  then nonZeroWeightTrace mt
  else return $ t

We now have all the pieces to implement `(tracedMH :: Int -> R a -> R [a])`. Given the number of samples we want
and the random variable, we sample from the random variable using traced Metropolis-Hastings. We first lift the untraced computation `m` to a traced `tm = traceR m`. Then, we find a first legal trace `t` using `nonZeroWeightTrace tm`. Given this, we then continue to take steps of `tracedMhStep` (ten steps between consecutive recorded samples), and store the traces in the helper function `go`.

Finally, we map `traceval` over the list of traces `traces` to extract the values from the traces.

We implement a small helper called `runsWeighted` which first invokes `tracedMH` to build the traced MH computation, and then runs this computation using `runUnweighted`.

In [17]:
-- | Repeat the monadic computation n times
loopM :: Monad m => Int -> (a -> m a) -> a -> m a
loopM 0 _ a = return a
loopM n f a = f a >>= loopM (n - 1) f


-- | Take samples from a random variable by using traced Metropolis-Hastings
tracedMH :: Int -> R a -> R [a]
tracedMH n m = do
  -- | create the traced randomness source, and sample from it till we get an acceptable
  -- | computation
  let tm = traceR m 
  t <- nonZeroWeightTrace tm
  -- | Int -> Trace a -> R [Trace a]
  let go 0 t = pure []
      go n t = do
         t' <- loopM 10 (tracedMhStep tm) $ t
         ts <- go (n - 1) t'
         return $ t:ts
  traces <- go n t
  return $ map traceval traces
  
-- | get N samples from a random variable that uses `weigh`, by sampling using traced Metropolis-Hastings.
runsWeighted :: Int -> R a -> IO [a]
runsWeighted n m = runUnweighted $ tracedMH n m

## Payoff: `coin` vs `coin'`

Let's use the machinery to run `coin'`  (recall that `coin'` was defined using `weigh`,
and behaved like a fair coin when run with `runUnweighted`). We will run both the `coin` and `coin'` for different
biases to check that their histograms look roughly the same.

Also note that from this point onward, we will _only use_ `runsWeighted`, since our weighted sampler
completely subsumes the unweighted sampler.

In [10]:
runsWeighted 100 (coin 0) >>=  histogram "coin:  bias 0" 2
runsWeighted 100 (coin' 0) >>=  histogram "coin': bias 0" 2

putStrLn "---"
runsWeighted 100 (coin 0.2) >>=  histogram "coin:  bias 0.2" 2
runsWeighted 100 (coin' 0.2) >>=  histogram "coin': bias 0.2" 2

putStrLn "---"
runsWeighted 100 (coin 0.5) >>=  histogram "coin:  0.5" 2
runsWeighted 100 (coin' 0.5) >>=  histogram "coin': 0.5" 2

putStrLn "---"
runsWeighted 100 (coin 0.8) >>=  histogram "coin:  0.8" 2
runsWeighted 100 (coin' 0.8) >>=  histogram "coin': 0.8" 2

putStrLn "---"
runsWeighted 100 (coin 1) >>=  histogram "coin:  1" 2
runsWeighted 100 (coin' 1) >>=  histogram "coin': 1" 2

coin:  bias 0 █

coin': bias 0 █

---

coin:  bias 0.2 █▁

coin': bias 0.2 █▂

---

coin:  0.5 █▇

coin': 0.5 ▇█

---

coin:  0.8 ▂█

coin': 0.8 ▂█

---

coin:  1 █

coin': 1 █

Excellent, `coin` and `coin'` have similar distributions!

## Sampling distributions with `weigh`:

Next, we use the `weigh` mechanism to sample from _any shape of distribution we want_. The idea is this: if we want to sample points with a distribution `dist :: Float -> Float`, we will sample uniformly a value `r` in the range `(lo, hi)`, and then _weigh this `r`_ by the distribution `dist`.

This allows us to sample from shapes such as:
- $f(x) \propto x^2 : 0 \leq x \leq 6$
- $f(x) \propto |\sin x| : 0 \leq x \leq 6 $
- $f(x) \propto e^{-x^2} : -6 \leq x \leq 6$ (gaussian)

In [11]:
distributionToR :: (Float, Float) -- ^ support
  -> (Float -> Float) -- ^ distribution
  -> R Float
distributionToR (lo, hi) dist = do
  r <- uniform01
  -- | take a random value uniformly distributed in (lo, hi)
  let val = lo + r * (hi - lo)
  -- | weigh the sample `val` with weight `dist val`
  weigh $ dist val
  -- | return the value, with the new weight applied.
  return $ val


runsWeighted 1000 (distributionToR (0, 6) (^2)) >>= histogram "x^2" 25
runsWeighted 1000 (distributionToR (0, 6) (abs . sin)) >>= histogram "|sin x|" 25
runsWeighted 1000 (distributionToR (-6, 6) (\x -> exp (-1.0 * x * x))) >>= histogram "e^{-x^2}" 25

x^2 ______▁▁▁▁▂▂▂▂▂▃▄▄▆▅▅▆█▇_

|sin x| ▂▂▅▆▆▅▇▆▆▄▃▂_▁▂▃▅▆▆█▆▇▆▄▃_

e^{-x^2} ____▁▁▃▄▇▇▇█▇▄▂▁▁_____

Great, all of them work.
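One way to sanity-check histograms like these is to compare them against a number that can be computed exactly. For $f(x) \propto x^2$ on $[0, 6]$ the mean is $\int_0^6 x^3 \, dx / \int_0^6 x^2 \, dx = 324 / 72 = 4.5$. The standalone sketch below computes the same kind of weighted average that `weigh` expresses, but over a fixed grid of midpoints instead of random samples. `weightedMean` is a hypothetical helper, and it uses `Double` rather than the `Float` used elsewhere.

```haskell
-- | Hypothetical check: the mean of a density proportional to `dist` on
-- (lo, hi), approximated by a weighted average over n grid midpoints.
-- Each grid point plays the role of a uniform sample weighed by `dist`.
weightedMean :: Int -> (Double, Double) -> (Double -> Double) -> Double
weightedMean n (lo, hi) dist =
  sum [x * dist x | x <- xs] / sum [dist x | x <- xs]
  where
    step = (hi - lo) / fromIntegral n
    xs = [lo + (fromIntegral i + 0.5) * step | i <- [0 .. n - 1]]
```

With `weightedMean 1000 (0, 6) (^2)` the result is close to `4.5`, and for the symmetric curve `\x -> exp (-x * x)` on `(-6, 6)` it is close to `0`. A sampler whose histogram has a visibly different centre of mass would be suspect.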

### Inference

Let's now use similar ideas to estimate the bias of a coin. We know how likely heads or tails is, given the bias of a coin. Let's call this $P(data|bias)$ (probability of the data given the bias of the coin), better known as _likelihood_. For our coin, this is:

$$
\begin{align}
P(1|bias) &= bias \\
P(0|bias) &= 1 - bias
\end{align}
$$

What we want to do is solve the _inverse problem_.
Given observations about coin flips from a coin with an _unknown bias_, we wish to _predict its bias_.
That is, we want to find its distribution $P(\text{bias}|\text{data})$. We solve this problem using Bayes' theorem. We know that:

$$P(bias|data) = \frac{P(data|bias) P(bias)}{P(data)}$$ 

The denominator is a normalization factor that is constant for a fixed $data$. Thus, we write:

$$P(bias|data) \propto P(data|bias) P(bias)$$

Thus, if $P(bias)$ is our _prior belief_ about the biases, then $P(data|bias)$ is how much we need to multiply the
prior with to get the _posterior belief_.
In our case, as mentioned above, the value of $P(data|bias)$ is:

$$
\begin{align}
P(1|bias) &= bias \\
P(0|bias) &= 1 - bias
\end{align}
$$


We implement `estimateBias`, which first draws a candidate bias `b` from a prior and then loops over all the observations `obs`. For each observation `ob`, it weighs the bias _in sequence_ by $\texttt{likelihood} = P(data|bias)$. Finally, it returns the bias `b`, whose weight has been scaled by the observations. In the cell below, the prior is a bell-shaped curve centred at `0.5` and sampled over the range `(0, 2)` through `distributionToR`, rather than a uniform distribution on `[0, 1]`.

In [12]:
-- | Given a list of observations from a coin and the bias, return a value proportional
-- to the coin having that bias. Find this by multiplying by bias if we have a 1, (1 - bias)
-- if we have a 0, for each heads/tails we see.
estimateBias :: [Int] -> R Float
estimateBias obs = do
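  -- prior: a bell-shaped curve centred at 0.5, sampled over (0, 2).
  -- NOTE: candidates with b > 1 are not valid biases and make (1 - b) negative below.
  b <- distributionToR (0, 2) (\d -> exp (-1.0 * (d - 0.5) * (d - 0.5)) / 5.0)
  weigh $ if b >= 0 then 1 else 0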
  forM_ obs $ \ob -> do
    -- | scale by P(data|bias)
    let likelihood = if ob == 1 then b else (1 - b)
    weigh likelihood
  return b
  
replicateList :: Int -> [a] -> [a]
replicateList n as = mconcat $ replicate n as 

runsWeighted 1000 (estimateBias []) >>= histogram "estimate with no data" 10
runsWeighted 1000 (estimateBias [1]) >>= histogram "estimate with [1]" 10
runsWeighted 1000 (estimateBias [0]) >>= histogram "estimate with [0]" 10
runsWeighted 1000 (estimateBias [0, 1]) >>= histogram "estimate with [0, 1]" 10
runsWeighted 1000 (estimateBias [1, 0]) >>= histogram "estimate with [1, 0]" 10
runsWeighted 1000 (estimateBias [1, 0, 1, 0]) >>= histogram "estimate with [1, 0]x2" 10
runsWeighted 1000 (estimateBias (replicateList 8 [1, 0])) >>= histogram "estimate with [1, 0]x8" 10
runsWeighted 1000 (estimateBias (replicateList 20 [1, 0])) >>= histogram "estimate with [1, 0]x20" 10

estimate with no data ▇▇█▇▇▅▄▂▁▁_

estimate with [1] ▁▂▆▆▇█▇▅▃▂_

estimate with [0] ▁▄▆▇▇▇█▅▅▅_

estimate with [0, 1] ▁▄▅█▇▇▇▆▃▁_

estimate with [1, 0] ▁▃▅▅▇█▆▅▄▁_

estimate with [1, 0]x2 _▁▁___▂▄▆█_

estimate with [1, 0]x8 ___▁█_

estimate with [1, 0]x20 __█_

This is as we expect: when we have no data, the samples simply follow the prior over `bias`. As we see data, we update our beliefs about the likely values of `bias`. For example, when we have only seen a `[1]`, we consider it much more likely that the coin is unfair: a bias of `1` becomes the most likely value.
However, _as soon as we see more data_ in the case of `[0,1]`, we adjust our belief, and now believe that the coin is most likely fair. This allows us to _infer_ the likely bias of the coin, given the observed data.
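As a cross-check of this reasoning that does not depend on any sampler, the posterior for a uniform prior on `[0, 1]` can be evaluated on a grid. The standalone sketch below does that; `posteriorWeight` and `posteriorMean` are hypothetical names, the uniform prior is an assumption of the sketch, and it uses `Double` instead of `Float`. For `h` heads and `t` tails the exact posterior under this prior is a Beta distribution with mean $(h + 1) / (h + t + 2)$, which gives known values to compare against.

```haskell
-- | Unnormalised posterior weight of a bias b given observations
-- (1 = heads, 0 = tails) under a uniform prior on [0, 1]:
-- the product of the per-observation likelihoods.
posteriorWeight :: [Int] -> Double -> Double
posteriorWeight obs b = product [if ob == 1 then b else 1 - b | ob <- obs]

-- | Posterior mean of the bias, approximated on n midpoints of [0, 1].
posteriorMean :: Int -> [Int] -> Double
posteriorMean n obs =
  sum [b * posteriorWeight obs b | b <- bs] / sum [posteriorWeight obs b | b <- bs]
  where
    bs = [(fromIntegral i + 0.5) / fromIntegral n | i <- [0 .. n - 1]]
```

With no data, `posteriorMean 1000 []` is `0.5`. A single head moves it to roughly `2/3`, and `[1, 0]` brings it back to `0.5`, which mirrors the shift in belief described above.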

#### Conclusion

For me, the main takeaways from this experiment are:

- The general implementation of probabilistic programs with `Return` and `Uniform` is well-known. These allow us to sample easily from expressions of random variables, and are very straightforward to implement in Haskell. This covers the `runUnweighted` part of the story. 


- In addition to the above, the ability to _weigh_ computations is quite powerful: it lets us describe many operations more naturally than we could otherwise (see `coin` versus `coin'`). This is much less well-known
in the folklore (as far as I am aware), and was very interesting to learn about and implement. I originally ran across this at [`monad-bayes`](https://github.com/adscib/monad-bayes). There appears to be a rich paper-trail that one can chase, starting from `monad-bayes`.


- Our random variable expressions come with `Functor`, `Applicative` and `Monad` instances. Throughout our implementation, we made use of the machinery that comes from implementing these typeclasses. Since these are standard typeclasses, this interacts nicely with the rest of the ecosystem. Indeed, we could get this "for free" by using the free monad infrastructure. This would shrink our already tiny implementation. 

  
- Monte Carlo is a powerful class of techniques, which allows us to get an _approximation_ to an _arbitrary distribution_. This makes it quite unlike many of the other mathematical objects that I know. For example, in the case of groups, [deciding if two group elements are equal (the word problem) is undecidable in general](https://en.wikipedia.org/wiki/Word_problem_for_groups). I do not know of any theory of "approximate computing of groups".


- What I currently dislike about this approach is needing to have `runUnweighted`, inside which `runsWeighted` is implemented. It is useful to implement it this way, since we can have our weighted computation (`tracedMhStep`) live within the same `R` monad, and reuse the infrastructure. However, we need to introduce a `Trace` and then eliminate the `Trace`, which makes the API less elegant than what I had hoped. 



#### References

- [Church, a language for generative models](https://web.stanford.edu/~ngoodman/papers/churchUAI08_rev2.pdf)
- [Denotational validation of higher order bayesian inference](https://arxiv.org/abs/1711.03219)
- [Practical probabilistic programming with monads](http://mlg.eng.cam.ac.uk/pub/pdf/SciGhaGor15.pdf)
- [`monad-bayes`](https://github.com/adscib/monad-bayes)