# Probabilistic Programming - 1 
## Monte Carlo sampling

So far you've been doing all your calculations by hand. As you have probably learned, this is time-consuming and error-prone. In this lesson we introduce Probabilistic Programming as a way to automate some of the labour. We will cover 2 software packages, ForneyLab.jl and Turing.jl, and mainly show you how to specify probabilistic models in both. The main take-home point is that, while Probabilistic Programming requires some specialised knowledge of probability theory and Bayesian inference, implementing an inference procedure is straightforward once you have the right tools.

### Preliminaries

- Goal 
  - Learn to write a basic probabilistic program using Monte Carlo sampling.
- Materials        
  - Mandatory
    - These lecture notes.
    - [Intro to programming in Julia](https://youtu.be/8h8rQyEpiZA?t=233).
    - Tutorials using [Turing.jl](https://turing.ml/dev/tutorials/0-introduction/).
  - Optional
    - Cheatsheets: [how does Julia differ from Matlab / Python](https://docs.julialang.org/en/v1/manual/noteworthy-differences/index.html).

### Introduction to Turing
In this lesson we introduce Turing as an alternative to ForneyLab. Turing is another Probabilistic Programming library available in Julia. Unlike ForneyLab, Turing relies on sampling-based schemes to perform inference. This has advantages and disadvantages, which you will investigate in the coming lessons.
Things to keep in mind when using Turing:
1. Sampling-based inference always runs. This means Turing can handle a wider class of problems than ForneyLab.
2. Sampling-based inference is stochastic. Your results will vary between runs.
3. Sampling-based inference is slow.
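
The last two points can be seen in a plain Monte Carlo estimate, without Turing. The sketch below is only an illustration: the function names `mc_second_moment` and `spread`, the seeds, and the sample sizes are made up for this example. It estimates $E[x^2]$ for $x \sim \mathcal{N}(0,1)$, whose exact value is 1.

```julia
using Random, Statistics

# Monte Carlo estimate of E[x^2] for x ~ N(0, 1); the exact value is 1.
function mc_second_moment(rng, n_samples)
    x = randn(rng, n_samples)   # independent draws from N(0, 1)
    return mean(x .^ 2)         # the sample average approximates the expectation
end

# Two runs with different seeds give different (but close) answers.
est_a = mc_second_moment(MersenneTwister(1), 10_000)
est_b = mc_second_moment(MersenneTwister(2), 10_000)
println("run A: ", est_a, "   run B: ", est_b)

# Standard deviation of the estimate across 50 independent runs.
spread(n_samples) = std([mc_second_moment(MersenneTwister(seed), n_samples) for seed in 1:50])
println("spread with 100 samples:    ", spread(100))
println("spread with 10_000 samples: ", spread(10_000))
```

The run-to-run spread shrinks as the number of samples grows, which is why accurate sampling-based answers cost computation time.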

Let's build a model! We are going to work with the same linear regression model that you have already investigated in ForneyLab. Since we have already covered the model specification, we can go a little faster this time.

In [None]:
# # Package managing
# using Pkg
# Pkg.activate("workspace")
# Pkg.instantiate()

In [1]:
using Turing
using Plots
using Random

In [None]:
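# Parameters
true_W = [1.0;0.5]   # weights of the data-generating process
true_σ = 1.          # standard deviation of the inputs
true_μ = 0.          # mean of the inputs
true_slope = 0.1     # upward drift added to the inputs at every step
n = 10               # number of data points (assumed value; n was not defined in this cell)

function generate_data(true_W, true_σ, true_μ, n)
    # Noisy inputs with a linear trend; note that true_slope is read from the global scope
    x = randn(size(true_W)[1],n) .* true_σ .+ true_μ + cumsum(ones(size(true_W)[1],n) * true_slope,dims=2)
    # Noise-free outputs, stored as a 1×n row
    y = true_W' * x
    return [x[:,i] for i in 1:n],y
end

x_data,y_data = generate_data(true_W,true_σ,true_μ,n)
scatter(1:n,y_data[:], label="")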


In [None]:
# Turing uses the @model macro to define the model function
@model linear_regression(x, y, n) = begin # Number of datapoints (n) as additional input
    
    # Parameters for priors
    μ_w = [0.,0.]
    σ_w = [1. 0. ; 0. 1.]
    σ_y = 1.
    W ~ MvNormal(μ_w,σ_w) # As before we define a 2 dimensional Gaussian prior for the weights

    for i in 1:n # Loop over data points
        
        y[i] ~ Normal(W' * x[i], σ_y) # Estimate y as the dot product of W and x
        
    end
end;

And that's it! Now we are ready to do inference. For Turing that means selecting a sampling algorithm and setting the associated parameters. For this example we will use the No-U-Turn Sampler (NUTS). As above, don't worry too much about the details of the inference algorithm. If you are feeling adventurous, though, feel free to try out the other sampling algorithms available or different parameter settings. Can you get better results? What happens to the runtime?
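
To give a feel for what a sampling algorithm does, here is a minimal sketch of random-walk Metropolis written in plain Julia. This is not NUTS and not how Turing is implemented; it is a much simpler sampler applied to the same model (zero-mean, identity-covariance prior on W and unit noise). The names `log_posterior`, `metropolis`, and `step_size`, as well as the seed and the number of samples, are choices made for this illustration.

```julia
using Random, Statistics

# Unnormalised log-posterior of the weights: N(0, I) prior, unit-variance Gaussian likelihood.
function log_posterior(w, xs, ys)
    lp = -0.5 * sum(abs2, w)               # log prior, up to a constant
    for (x, y) in zip(xs, ys)
        lp += -0.5 * (y - w' * x)^2        # log likelihood, up to a constant
    end
    return lp
end

# Random-walk Metropolis: propose a small Gaussian step and accept it
# with probability min(1, posterior ratio); otherwise keep the current point.
function metropolis(rng, xs, ys; n_samples=20_000, step_size=0.1)
    w = zeros(2)
    lp = log_posterior(w, xs, ys)
    samples = Matrix{Float64}(undef, 2, n_samples)
    for t in 1:n_samples
        proposal = w .+ step_size .* randn(rng, 2)
        lp_new = log_posterior(proposal, xs, ys)
        if log(rand(rng)) < lp_new - lp
            w, lp = proposal, lp_new
        end
        samples[:, t] = w                  # store the current state, accepted or not
    end
    return samples
end

rng = MersenneTwister(42)
w_true = [1.0, 0.5]
xs = [randn(rng, 2) .+ 0.1 * i for i in 1:20]   # inputs with an upward trend
ys = [w_true' * x for x in xs]                  # noise-free outputs
samples = metropolis(rng, xs, ys)

burn_in = 2_000                                 # discard the early, unconverged part
w_mean = vec(mean(samples[:, burn_in+1:end], dims=2))
println("posterior mean estimate: ", w_mean)
```

Samplers such as NUTS follow the same accept-or-reject idea but use gradient information to propose better moves, which typically means fewer wasted samples.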

In [None]:
# Disable status bars for the inference procedure
Turing.turnprogress(false);

# Run the sampling procedure and generate an MCMCChain
chain = Turing.sample(linear_regression(x_data,y_data,n), NUTS(500,0.65),5000);

The inference procedure has introduced a new object: the MCMCChain. This object holds the results of our sampling procedure as well as some diagnostic information to assess convergence of the sampler. Let's take a look at what's inside using the describe() function in Turing.

In [None]:
describe(chain)

The above table holds a lot of information. Many of the entries are specific to the inference algorithm, so we will focus on the results that matter: the posterior distribution over the weights W[1] and W[2]. Did the sampler successfully recover the parameters of the data-generating process?
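
For this particular model the posterior over the weights is also available in closed form, which gives a reference to compare the sampled means against. The sketch below assumes the same prior (zero mean, identity covariance) and noise level ($\sigma_y = 1$) as the model above; the function name `exact_posterior`, the seed, and the data size are choices made for this illustration.

```julia
using LinearAlgebra, Random

# Exact posterior for W under a N(0, I) prior and likelihood y_i ~ N(W' x_i, σ_y^2):
#   Σ_post = (I + X Xᵀ / σ_y²)⁻¹,   μ_post = Σ_post X y / σ_y²
# X holds one data point per column, y is a vector of outputs.
function exact_posterior(X, y, σ_y)
    Σ_post = inv(I + X * X' ./ σ_y^2)
    μ_post = Σ_post * (X * y) ./ σ_y^2
    return μ_post, Σ_post
end

rng = MersenneTwister(0)
w_true = [1.0, 0.5]
n_points = 200
X = randn(rng, 2, n_points) .+ cumsum(fill(0.1, 2, n_points), dims=2)  # inputs with a trend
y = X' * w_true                                                        # noise-free outputs

μ_post, Σ_post = exact_posterior(X, y, 1.0)
println("exact posterior mean: ", μ_post)
println("exact posterior std:  ", sqrt.(diag(Σ_post)))
```

When run on the same data as the sampler, the means and standard deviations reported by describe(chain) should be close to these values, up to sampling noise.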

Now it's your turn! Below are 2 tasks to get you started working with Probabilistic Programming in Turing. An important part of being a good probabilistic programmer is using the right tool for the right job, and to do that you need to be familiar with multiple toolboxes. Good luck and may the odds be ever in your favour.

#### Assignment 1.
Generate some new data and change the value of true_μ to something other than 0. How does the model fit now? As was the case with the ForneyLab model, we are missing an intercept and it is your task to extend the model to remedy this. However, this time you should implement it *without* adding an additional weight to W.
A common trick from the probabilistic programmer's bag is reparameterisation: the act of writing equivalent models in different ways. Besides leading to faster code, reparameterisation can allow you to work with models that would otherwise violate the constraints of your chosen library or inference procedure. This particular reparameterisation is mostly for illustrative purposes, but it is always important to be aware of what your options are.
Can you think of a different parameterisation for the intercept?
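
As a generic illustration of reparameterisation (not a solution to this assignment), a Gaussian variable $x \sim \mathcal{N}(\mu, s)$ can equivalently be written as $x = \mu + s z$ with $z \sim \mathcal{N}(0, 1)$. The sketch below checks this empirically in plain Julia; the values of `μ` and `s`, the seed, and the number of draws are arbitrary choices for this example.

```julia
using Random, Statistics

rng = MersenneTwister(7)
μ, s = 2.0, 0.5            # illustrative values, not taken from the lesson
z = randn(rng, 100_000)    # z ~ N(0, 1): the standardised variable
x = μ .+ s .* z            # x = μ + s z has mean μ and standard deviation s

println("sample mean of x: ", mean(x), "  (target ", μ, ")")
println("sample std of x:  ", std(x), "  (target ", s, ")")
```

Both forms describe the same distribution, but a sampler explores them in different coordinates, which can change how well it performs.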

In [None]:
# Your code here

#### Assignment 2.
The code below generates a dataset of 0's and 1's. Your task is to turn the Turing model above into a binary classifier. The probabilistic model looks like this:
$$ W \sim \mathcal{N}(\mathbf{\mu_w},\mathbf{\sigma_w})$$
$$ y^\prime{} \sim \mathcal{N}(W^Tx,\sigma_y)$$
$$ p = \sigma(y^\prime{})$$
$$ y \sim \mathcal{B}er(p)$$

As you can see, it is pretty similar to the linear regression covered so far. The main difference is that $y$ now follows a Bernoulli distribution, since the data consists of 0's and 1's. The Bernoulli distribution takes a single parameter $p$ in the range [0,1]. Hence we need to squash the output of the regression $y^\prime{}$ using a logistic function $\sigma$. To that end we have provided a function for you to use.

In [None]:
# Sigmoid function
σ(x) = 1 / (1 + exp(-x))

# Parameters for generating data
true_W = [1.0;0.5]
true_σ = 1.
true_μ = 2.
n = 10

function generate_binary_data(true_W, true_σ, true_μ, n)
    x = randn(size(true_W)[1],n) .* true_σ .+ true_μ
    y = round.(σ.(true_W' * x))
    return [x[:,i] for i in 1:n],y
end

# Binary dataset
x_data,y_data = generate_binary_data(true_W,true_σ,true_μ,n);
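
To see what the squashing step does, the sketch below applies the same logistic formula to a few values and then draws Bernoulli outcomes with the resulting probability. It uses plain Julia only; the name `logistic`, the seed, and the number of draws are choices made for this illustration. Note that the data generator in the cell above rounds the squashed value to 0 or 1 instead of sampling from a Bernoulli distribution.

```julia
using Random, Statistics

logistic(a) = 1 / (1 + exp(-a))          # same formula as σ in the cell above

# Large negative inputs map close to 0, zero maps to 0.5, large positive inputs close to 1.
println(logistic.([-5.0, 0.0, 5.0]))

rng = MersenneTwister(3)
p = logistic(1.0)                        # a probability strictly between 0 and 1
draws = rand(rng, 10_000) .< p           # each entry is true with probability p
println("p = ", p, ", fraction of ones = ", mean(draws))
```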

In [None]:
@model binary_classifier(x, y, n) = begin
    
    # Parameters for priors
    μ_w = # YOUR CODE HERE
    σ_w = # YOUR CODE HERE
    σ_y = # YOUR CODE HERE
    W ~ # YOUR CODE HERE

    y_prime = Vector(undef,n)
    for i in 1:n # Loop over data points
        
        y_prime[i] ~ #YOUR CODE HERE
        y[i] ~ # YOUR CODE HERE
        
    end
end

chain = Turing.sample(binary_classifier(x_data,y_data,n), NUTS(500,0.65),5000);
describe(chain)

#### Assignment 3: Optional
Write either the regression or the classification model in at least 2 different parameterisations and investigate the differences. You can examine any of the following:
- Runtime. Figure out how to profile your code in Julia.
- Convergence. Check the documentation of Turing and MCMCChains for automated diagnostics. Good search terms to get you started are "Chain Plots" and "Gelman-Rubin statistics".
- Scaling. Investigate how the different parameterisations scale with the number of data points or input dimensions by modifying the data-generating process.
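
As background for the convergence option, the sketch below computes the classic (non-split) Gelman-Rubin statistic by hand for one parameter. It is a simplified illustration and not the MCMCChains implementation; the function name `gelman_rubin` and the synthetic chains are made up for this example. Values close to 1 suggest that the chains agree with each other.

```julia
using Random, Statistics

# Gelman-Rubin statistic for `chains`, an n_draws × n_chains matrix
# holding the samples of a single parameter, one chain per column.
function gelman_rubin(chains)
    n = size(chains, 1)
    chain_means = vec(mean(chains, dims=1))
    chain_vars = vec(var(chains, dims=1))
    W = mean(chain_vars)                  # average within-chain variance
    B = n * var(chain_means)              # between-chain variance
    var_plus = (n - 1) / n * W + B / n    # pooled estimate of the posterior variance
    return sqrt(var_plus / W)
end

rng = MersenneTwister(11)
good = randn(rng, 1_000, 4)               # four chains sampling the same distribution
bad = good .+ [0.0 1.0 2.0 3.0]           # four chains centred at different locations
println("R-hat, agreeing chains:    ", gelman_rubin(good))
println("R-hat, disagreeing chains: ", gelman_rubin(bad))
```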

In [None]:
# YOUR CODE HERE

#### Assignment 4: Optional
Download the Titanic dataset from https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv. It contains information about the passengers of the Titanic, including whether or not they survived the sinking. Use your newly acquired knowledge of Probabilistic Programming to build a classifier that predicts whether a passenger survives based on the available information. You will have to do your own data wrangling to get the dataset into a shape that you can work with, and come up with your own model specification and parameterisation.

Feel free to use any of the tricks you have learned or know from your prior experience. The only constraint is that you *must* solve the problem using Probabilistic Programming. 

In [None]:
# YOUR CODE HERE