horton2013-sampling.Rmd

---
title: "Integrating Computation Into Statistics Courses: Another Worked Example"
subtitle: "Evaluating and Sampling from A Probability Distribution"
author: "Alex Stringer"
date: '`r Sys.Date()`'
output: 
  html_document:
    toc: true
editor_options: 
  chunk_output_type: console
---

```{r setup-noshow, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r setup-show,include=FALSE}
# Load the tidyverse packages
suppressMessages({
  suppressWarnings({
    library(tidyverse)
  })
})
```

# Introduction

## Purpose of Document

This document is a discussion and implementation of the second extended example from Horton (2013): I Hear, I Forget. I Do, I Understand: A Modified Moor-Method Mathematical Statistics Course. *The American Statistician* 67:4 219-228 (although the actual example is originally from Evans and Rosenthal (2004): *Probability and Statistics: The Science of Uncertainty*, New York: W. H. Freeman and Company). The purpose of this document is to illustrate some of the challenges that instructors might face when teaching topics that require computation to actually implement, but where computational skills *may or may not* be part of the learning objectives for the course. This is in slight contrast to the [other tutorial](http://awstringer1.github.io/leaf2018/horton2013-introduction.html) covering an example from the Horton (2013) paper, which focusses more on how computation can be used to enhance student understanding of topics which would traditionally be taught in a purely analytical manner. 

This document contains the fully worked example, discussion, and several implmentations of algorithms for sampling from a probability distribution. Both base `R` and, where appropriate, more modern implementations are given side-by-side, and instructors can choose which to use depending on the learning outcomes for the course. In particular, the base `R` versions are usually better for a course where computation is used but not necessarily a main learning objective; in contrast, the more advanced versions might be better if `R` programming is a main learning objective.


## Background & Learning Objectives

The computational concepts here are appropriate for students with one introductory programming course, and ideally one applied course that uses `R`. If they don't have the latter, a brief introduction to `R` programming may be required.

The learning objectives of the example covered here may include:

  - Using computation to solve otherwise intractable analytical problems
  - Using computation to help understand and verify the solution to difficult analytical problems
  - Develop intermediate `R` skills and use them fluently in the above; for example, understand functions well enough to be able to use `integrate` without difficulty
  - Fluency with `ggplot`, and for this to be the default choice for plotting
  

# The Example

The question for students is as follows (this is paraphrased but otherwise taken directly from Horton (2013)):

  - Can you find the CDF of a distribution with density $f(y) = c(1+|y|)^{3}\exp\left(-y^{4} \right)$, for $-\infty < y < \infty$, and can you find the normalizing constant $c$?
  - Write a program to sample from this density
  
The first part could be done analytically (although it's difficult and messy); we will show how this part of the question can be solved computationally. The second part literally asks students to write a program.

## First Part

The CDF is the integral of the density,
$$
F(y) = \int_{-\infty}^{y} f(s)ds
$$
Students might approach this problem by trying to brute-force evaluate this integral using their knowledge of calculus. This is difficult, but it's a difficult *calculus* problem, not a difficult *statistical* problem. If calculus skills are what's meant to be tested here then that's one thing, but if the point of the problem is to have students develop practical skills in working with complicated probability densities, then working through this problem analytically can remove focus from this goal.

Whatever the learning objectives, computation can be used to aid understanding of this problem, and give students a way to check if their answers are reasonable. `R` has a built in `integrate` function that can be used to numerically evaluate improper integrals of reasonably well-behaved functions. Students can use this right away to find the normalizing constant $c$:
```{r integrate-1}
# Define the function of interest
fx <- function(x) (1+abs(x))^3 * exp(-x^4)
# Integrate it over the entire domain
fx_int <- integrate(fx,-Inf,Inf)$value
fx_int
# The normalizing constant is the reciprocal of this
norm_const <- 1 / fx_int
norm_const
```
Students should then double-check that their normalized density integrates to 1:
```{r integrate-2}
fxc <- function(x) norm_const * fx(x)
integrate(fxc,-Inf,Inf)$value
```
The code here looks simple, but students with little knowledge of `R` can easily get trapped in trying to write it. In particular, it uses some features and concepts that might be unfamiliar to students with little programming background:

  - Functions are objects; they are defined using `function`, which is itself a function, and you give them an expression that takes one or more variables and returns a result. This is simple enough, but students will find their first functions hard to debug, in particular because the reference semantics that `R` uses (pass-by-value, but with lazy evaluation) are somewhat unique, and make it hard for first time users to read the stack trace and figure out what went wrong
  - Because functions are objects, they can be passed around like any other data. They can be passed as arguments to *other* functions, which is what happens with `integrate`. `integrate` takes an argument which is a function, and applies an algorithm that uses that function, returning a value (actually, returning a list, from which you have to extract that value- another concept to explain)
  - `R` uses *environments* (analagous to *namespaces* in langugages like python and C++) to store and search for data. We used this here in the definition of `fxc`: the variable `norm_const` and the function `fx` are stored in the global environment; the function `fxc` will first look for these values in its execution environment, and then not finding them, will search in its calling environment (the global environment), where it will find them. It then passes its own argument `x` to `fx`, evaluates that, multiplies the result by `norm_const`, and returns *this* value. An `R` programmer might think to write the code this way, but then to explain to students what it's doing and why it works requires the introduction of several new concepts.
  
An alternative to introducing some of these new concepts is to, well, not introduce them, and write the code in a different way:
```{r integrate-3}
# Print out the value of norm_const
norm_const
# Copy it and fx directly
fxc2 <- function(x) 0.1468513 * (1+abs(x))^3 * exp(-x^4)
integrate(fxc2,-Inf,Inf)$value
```
The central tradeoff for the instructor seems to be time spent teaching programming vs. quality of the resulting program. The second implementation of `fxc` gives exactly the same result, but requires copying and pasting of previously used code, which is tedious and introduces further possibility of human error. However one might argue that blindly pulling variables from the global environment *also* introduces the possibility of error and should be considered bad practice! Every instructor will probably have different opinions about technical points like this; the point is, these are difficulties that should be thought of ahead of showing these examples to students, in order to ensure that  
the example helps achieve your desired learning objectives and avoiding falling too far down a rabbit hole.

Moving on, we can also write an `R` function to evaluate the CDF. Whether this is used as the actual answer to the problem, or to aid students' understanding and let them check their answer numerically, it is a useful exercise.
```{r integrate-4}
fx_cdf <- function(x) integrate(fxc,-Inf,x)$value
# Test it out
fx_cdf(-1) # Arbitrary
fx_cdf(0) # Should equal fx_cdf(Inf) / 2, as density is symmetric
fx_cdf(1) # Arbitrary, should equal 1 - fx_cdf(-1) again by symmetry
fx_cdf(Inf) # Should equal 1
```
This is a good time to mention to students strategies for checking their work. We checked 2 properties of the CDF that we knew from its definition:

  - $F(x)$ should satisfy $F(\infty) = 1$, because densities integrate to 1
  - $F(x) = 1 - F(-x)$, because $f(x) = f(-x)$, i.e. the underlying density is symmetric
  
We probably want to check two more properties of $F(x)$: that it is non-decreasing in $x$, and that its shape actually matches that of $f(x)$. Both of these can be accomplished using a plot. If students aren't familiar with `ggplot`, they would have to be introduced at this point, and indeed this is a good example of using `ggplot` to plot functions:
```{r plot-function-1}
# Create a dataframe that holds the x-axis range that we want to plot
# Pass this to ggplot, and add a stat_function layer. ggplot does the rest

pdf_plt <- data_frame(x = c(-2,2)) %>%
  ggplot(aes(x = x)) +
  theme_light() +
  stat_function(fun = fxc,colour = "purple") +
  labs(title = "PDF",
       x = "y",
       y = "f(y)") +
  scale_x_continuous(breaks = seq(-2,2,by=0.5))

# Need a vectorized version of fx_cdf
fx_cdf_v <- function(x) {
  out <- numeric(length(x))
  for(i in 1:length(x)) {
    out[i] <- fx_cdf(x[i])
  }
  out
}

cdf_plt <- data_frame(x = c(-2,2)) %>%
  ggplot(aes(x = x)) +
  theme_light() +
  stat_function(fun = fx_cdf_v,colour = "purple") +
  labs(title = "CDF",
       x = "y",
       y = "F(y)") +
  scale_x_continuous(breaks = seq(-2,2,by=0.5))

cowplot::plot_grid(pdf_plt,cdf_plt,nrow = 1)
```
We can see from these plots that the CDF follows the shape of the PDF, increasing sharply where the PDF is high and slowly where the PDF is low. 

Even just plotting these functions uncovered another important concept in `R` programming, namely that of *vectorization*. `ggplot` requires functions used internally to accept and return vector arguments and values. If students aren't familiar with this yet, the instructor can dig a bit deeper into what happened here:

  - The `fxc` function uses operations that all are already vectorized, meaning they take a vector as an argument and return the operation evaluated on each element of that vector. The operations `+`, `*`, `abs`, `exp` and `^` all work this way, among many others
  - The `fx_cdf` function passes its argument to the `upper` named argument of the `integrate` function, which accepts only a single scalar value, **not** a vector. As such, the function `fx_cdf` can't automatically accept and return vectors. We can fix this somewhat easily as above, by wrapping the function in a `for` loop.

If computation itself is a learning objective of the course, this is a great place to introduce the concept of a *higher-order* function, one which takes another function as an argument, and possibly returns a function as its return value. We can write a higher order function that will vectorize its input as follows:
```{r vectorize-1}
vectorize <- function(f) {
  # f is a function that accepts scalar arguments and returns scalar values
  # vectorize(f) will return a function, which accepts vector arguments and returns vector values
  function(x) {
    out <- numeric(length(x))
    for (i in 1:length(x)) {
      out[i] <- f(x[i])
    }
    out
  }
}

fx_cdf_v2 <- vectorize(fx_cdf)
fx_cdf_v2(c(-1,0,1))
```
There is a lot going on up there; this is a useful exercise only, again, if `R` programming is part of the learning objectives for the course.

## Second Part

The second part of the problem is to write a program that will produce a sample from the density $f(y)$. We'll go over the practical aspects of doing this three ways: Inverse Transform Sampling, Importance Sampling, and MCMC. We won't do the theory or make any claims about which one is better suited to this problem; we'll just focus on potential challenges students might face trying to implement each.

### Inverse Transform Sampling

If $X \sim F$ with CDF $F(x)$, and $U \sim Unif(0,1)$, then $F^{-1}(U) \overset{d}{=} X$. This can be used to generate a single point from $F$ according to the following algorithm:

  - Draw $u$ from $U \sim Unif(0,1)$
  - Return $F^{-1}(u)$
  
This can be repeated using many independent random draws from $U \sim Unif(0,1)$ to get a random sample from $F$. The implementation is very straightforward... except we don't know $F(x)$ here, so we certainly don't know $F^{-1}(x)$.

We can proceed using `R` by noting that if $y = F^{-1}(x)$ then $F(y) - x = 0$. `R` has a built-in function `uniroot` for finding roots of functions, i.e. solving equations of the form $g(y) = 0$ for $y$. We can use this to find $y = F^{-1}(x)$ for any $x$ as follows:
```{r inv-1}
cdf_inv <- function(x) {
  # Function returns F^-1(x)
  # Define the function of y to pass to uniroot. y is now the variable, and x is considered a fixed constant
  g <- function(y) fx_cdf(y) - x
  # Pass it to uniroot; search for the root on a large interval (this practicality can be further
  # discussed with students)
  uniroot(g,lower = -10,upper = 10)$root
}
# Test it out
cdf_inv(fx_cdf(-1))
cdf_inv(fx_cdf(1))
cdf_inv(fx_cdf(-.5))
cdf_inv(fx_cdf(.5))
cdf_inv(fx_cdf(-2))
cdf_inv(fx_cdf(2)) # Yikes!
cdf_inv(fx_cdf(1.5))
cdf_inv(fx_cdf(1.6))
cdf_inv(fx_cdf(1.8))
```

We see that our method works okay on points around the range of the density that contains most of the mass (about -2 to 2, although by 2 it gets pretty bad); out in the tails it can be very inaccurate. It remains to be seen whether this would affect the sampling. This is a good time to have students reflect back on how cool what we've done is; we couldn't write down expressions for the CDF or its inverse, but we can still compute with them using `R`. It's also a good time to point out the limitations; the naive numerical integration and root-finding methods we used are good, but aren't always accurate, and understanding that this is something that needs to be checked in applications is important.

Now we have to write an algorithm to sample from the original density. This can be accomplished with a simple loop:
```{r inv-2,cache=TRUE}
set.seed(34879)
B <- 10000 # Size of sample to generate
thesample <- numeric(B)
for (b in 1:B) {
  # Generate a uniform random deviate
  u <- runif(1)
  # Compute the inverse CDF at it
  thesample[b] <- cdf_inv(u)
  # Print some progress. Don't print it every iteration- printing stuff to the screen
  # is actually much more computationally intensive than actually doing an iteration of
  # this algorithm!
  if (b %% 500 == 0) {
    cat("Produced",b,"samples\n")
  }
}

# Basic summary statistics; by symmetry we know the distribution we're sampling from has expected value 0,
# so let's check if we got that part right:
mean(thesample) # pretty close
sd(thesample)
range(thesample)
# ...and we make a plot to see for sure. We'll cut it off at x = 2, even though our sample goes past that
data_frame(x = thesample) %>%
  ggplot(aes(x = x)) +
  theme_classic() + 
  geom_histogram(aes(y = ..density..),bins = 100,colour = "black",fill = "orange") +
  stat_function(fun = fxc,colour = "purple") +
  coord_cartesian(xlim = c(-2,2)) +
  labs(title = "Sample from our density",
       subtitle = "Inverse Transform Sampling",
       x = "y",
       y = "f(y)")

```
To summarize, in implementing this simple sampling algorithm covered the following computational concepts:

  - Computing the inverse CDF, which introduced `uniroot` and a discussion about accuracy of numerical computations
  - Writing a basic loop and storing the results
  - Making a histogram with an overlayed density curve in `ggplot`
  
This sampling can also be introduced using more advanced programming constructs, namely the `map` and `reduce` functions from the `purrr` package (part of the `tidyverse`). The above loop is simple enough that this probably isn't necessary, but it is potentially a good example to help illustrate the use of map/reduce and working with data in lists in `R`. These concepts certainly aren't new, and are implemented in base `R` through `apply`, `lapply`, `sapply`, `vapply`... the implementations in the `purrr` package provide an API that is consistent in its inputs/outputs, and much simpler to use and learn.
```{r inv-3, cache=TRUE}
set.seed(34879)
B <- 10000
thesample <- runif(B) %>% # Generate the B unif(0,1) realizations in one line
  map(~c(x = cdf_inv(.x))) %>% # Apply the cdf_inv function to each element of the above vector, returning a list of named vectors
  reduce(bind_rows) # Concatenate the named vectors into a data_frame

thesample
```

Doing it this way introduces map/reduce in `R`: take data, apply a function to each datapoint individually (whether they be elements in a vector, rows in a data_frame, or arbitrary items in a list), then collapse the results together somehow. Specifically,

  - We generated a vector of `B` independent realizations of a Unif(0,1) random variable
  - We passed this to the `purrr::map` function, which takes a vector or list as an argument and a function of one variable as arguments, and applies that function to each element of the input. This is accomplished in base `R` using the `apply` family. Two advantages of using `purrr` are a consistent set of functions that are compatible with each other, take the same inputs in the same order as each other and return predictible output, and the compact formula `~` syntax for defining anonymous functions. This makes this type of functional programming easier for students to learn, as there is much less time focussing on indicental details (which `apply` function do I use? What are the arguments? What is it going to return? Will it always return a list or sometimes a vector? And so on).
  - We made sure the function we mapped returned a *named* vector, which allows us to `reduce` the list of named vectors returned by `map` into a single `data_frame` using `dplyr::bind_rows`.

### Rejection Sampling

We'll implement another sampling algorithm, Rejection Sampling, which is still simple but slightly less so. The Rejection Sampling algorithm is basically:

  - Choose a box $\lbrack x_{l},x_{u} \rbrack\times\lbrack 0,y_{u} \rbrack$ that contains the probability density from which you want to sample- this means identify the domain (if infinite, pick a finite range in which nearly all the mass is), and find the mode, and choose the three parameters accordingly
  - Sample $u \sim Unif(x_{l},x_{u})$
  - Accept $u$ with probability $f(u)$. In computational terms this means generate another uniform random variate $t \sim Unif(0,y_{u})$, and accept $u$ if $t < f(u)$, an event that happens with probability $f(u)$.
  - Repeat until $N$ points are accepted, where $N$ is the size of sample you want
  
This algorithm is straightforward to state. To implement it requires considering a few details:

  - To choose $x_{l}$ and $x_{u}$, we can look at the plot of the density and figure out where most of the mass is. Students should think about whether this is an okay thing to do; is there a bump somewhere out in the tails that we didn't include in the plot? The density in question is dominated by the $e^{-x^4}$ term when $|x| > 2$ (approximately), so it's fine here, but it's important for students to ask this question. Another important question to ask is: what happens if we get this wrong? If we're too wide then we will end up rejecting more samples and the code will take longer to run; if we're too narrow then we miss potentially large parts of the density. We'll be a bit conservative and go with $\lbrack 3,3 \rbrack$.
  - To choose $y_{u}$, we need to know the maximum value our density could take. Can we just look at the plot (looks like about 0.6), or do we need to know any exact value, by finding the critical points and checking if they are maxima? We can answer this by again considering: what actually happens if we get this value wrong? If it's too high, we will reject more points and the algorithm will run slower. If it's too low, we'll undersample important parts of the density. We can go conservative and choose, say, $y_{u} = 0.7$ in this case. The discussion is important for students to have.
  
With those decisions made we can implement our algorithm:
```{r importance-1,cache=TRUE}
set.seed(29432)
xmin <- -3
xmax <- 3
ymax <- 0.7

N <- 10000 # Number of samples we want, NOT number of iterations that will be performed
numiter <- 0 # A counter to count the number of iterations performed
acceptancerate <- numeric() # Track the acceptance rate over iterations
thesample <- numeric()

numaccepted <- 0 # Counter of number accepted
while(numaccepted < N) {
  u <- runif(1,xmin,xmax)
  if (runif(1,0,ymax) < fxc(u)) {
    numaccepted <- numaccepted + 1
    thesample <- c(thesample,u)
  }
  numiter <- numiter + 1
  acceptancerate <- c(acceptancerate,numaccepted / numiter)
  if (numiter %% 5000 == 0) {
    cat("Iterations:",numiter,"Accepted:",numaccepted,"Acceptance Rate:",scales::percent(numaccepted / numiter),"\n")
  }
}
cat("Final Iterations:",numiter,"\nFinal Accepted:",numaccepted,"\nFinal Acceptance Rate:",scales::percent(numaccepted / numiter),"\n")

# Summary statistics and plot
mean(thesample) # pretty close
sd(thesample)
range(thesample)
# ...and we make a plot to see for sure. We'll cut it off at x = 2, even though our sample goes past that
data_frame(x = thesample) %>%
  ggplot(aes(x = x)) +
  theme_classic() + 
  geom_histogram(aes(y = ..density..),bins = 100,colour = "black",fill = "orange") +
  stat_function(fun = fxc,colour = "purple") +
  coord_cartesian(xlim = c(-2,2)) +
  labs(title = "Sample from our density",
       subtitle = "Rejection Sampling",
       x = "y",
       y = "f(y)")

# Also plot the acceptance rate over iterations
data_frame(y = acceptancerate,
           x = 1:length(acceptancerate)) %>%
  ggplot(aes(x = x,y = y,group = 1)) +
  theme_classic() +
  geom_line() +
  scale_x_continuous(labels = scales::comma_format()) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "Acceptance Rate by Iteration",
       subtitle = "Rejection Sampling Algorithm",
       x = "Iteration",
       y = "Acceptance Rate")
```

Rejection sampling introduces some similar computational concepts as the previous example on inverse transform sampling (loops, etc). There are some new concepts introduced:

  - Monitoring the behaviour of an algorithm. With the inverse transform sampling, the procedure was formulaic: just put a uniform random sample into the inverse CDF, and the result is guaranteed to be a random sample from our distribution of interest. We had to monitor the *results*, to check the impact of numerical inaccuracies in computing the inverse CDF, but not much other than this could go wrong in the algorithm itself. For rejection sampling, the algorithm is more prescriptive, a list of steps that are to be iterated. In addition to looking at the output, we also chose to monitor the progress of the algorithm. It is useful for students to discuss how and why we might do this. We could be looking for programming errors on our part, instability in the algorithm, or something else.
  - The effect of parameters on the algorithm. We had to put some thought into how to choose the box in which we generate our uniform random variables; students might benefit from some "stress testing" on this. What happens when you choose the box differently? What if it's too big? Too small?


### Markov Chain Monte Carlo (MCMC)

We will implement one final sampling algorithm, namely Metropolis-Hastings MCMC. What we need to get the computer to do is:

  - At a given point $x$, sample a new point $x^{\prime}$ from a proposal density $g(x^{\prime}|x)$ that we choose and know how to sample from
  - Accept this new point with probability $\frac{f(x^{\prime})g(x|x^{\prime})}{f(x)g(x^{\prime}|x)}$. If this is greater than 1, definitely accept the new point. Else keep the old point $x$
  - Repeat a prespecified number of times
  
There are several details to be worked out:

  - How many iterations?
  - Choosing the starting point. How do we do this, and how sensitive are the results to this choice?
  - Choosing the proposal density. Again, how, and how sensitive are the results to this?
  - The results, since they are a realization of a markov chain, are not independent, and the head of the chain won't even be a sample from the desired distribution as the chain needs time to converge to its stationary distribution. We need to *thin* the chain and cut off a chosen number of the first iterations (*burn in*). How to choosing the thinning proportion and the number of burn-ins? How does this affect the results?

These represent a mix of theoretical and practical considerations. For our application here we'll make some choices, which you of course are free to change:

  - We'll run it for 1,000 iterations
  - We'll start at $x_{0} = )$, the mean/median of our distribution
  - We'll use a normal proposal, $x^{\prime} \sim N(x,0.25)$. The choice of proposal standard deviation might need to be adjusted; this is both a theoretical and practical choice
  - We'll use a burn in of 100 iterations and thin the results by 1/9, for a final sample size of 100
  
Let's see how we might implement this:
```{r mcmc-1,cache = TRUE}
set.seed(24389)
niter <- 1000
burnin <- 100
thinning <- 9
xstart <- 0

iterations <- numeric(niter)
accepted <- numeric(niter)
curpoint <- xstart
for (i in 1:niter) {
  # Propose a new point
  newpoint <- rnorm(1,curpoint,.25)
  # Decide whether to accept
  if (runif(1) < (fx(newpoint) * dnorm(curpoint,newpoint,.25)) / (fx(curpoint) * dnorm(newpoint,curpoint,.25))) {
    iterations[i] <- newpoint
    accepted[i] <- 1
  } else {
    iterations[i] <- curpoint
    accepted[i] <- 0
  }
  curpoint <- iterations[i]
  
  if (i %% 100 == 0) {
    cat("Iteration: ",i,", Acceptance Rate: ",mean(accepted[1:i]),"\n")
  }
}

# Burn in and thin
iterations_out <- iterations[(burnin+1):niter][1:(niter-burnin) %% thinning  == 0]

# Check out the trace plot, and histogram of samples
traceplot <- data_frame(y = iterations,
                        x = 1:length(iterations)) %>%
  ggplot(aes(x = x,y = y,group = 1)) +
  theme_classic() +
  geom_path() +
  labs(title = "MCMC Trace Plot",
       x = "Iteration",
       y = "Accepted Value") +
  coord_flip()

histogramofsample <- data_frame(x = iterations_out) %>%
  ggplot(aes(x = x)) +
  theme_classic() +
  geom_histogram(aes(y = ..density..),bins = 30,colour = "black",fill = "lightblue") +
  stat_function(fun = fxc,colour = "purple") +
  labs(title = "Histogram of MCMC Samples",
       subtitle = "Purple curve: desired density",
       x = "Sampled Value",
       y = "Density")

cowplot::plot_grid(traceplot,histogramofsample,nrow = 1)
```

This implementation required some computational concepts:

  - Basic `for` loops
  - A bit of fancy indexing, in the burn-in and thinning part
  - Plotting with `ggplot`
  
Looking at the results, we see some immediate problems that can be discussed in a deeper treatment of MCMC. From a computational perspective, the natural thing to do is to change the inputs to the algorithm. We can make a principled choice here, based on our knowledge of the behaviour of the algorithm, to try a bimodal proposal density and lower the proposal standard deviation- but nonetheless, this requires rerunning the code with changed parameters. This introduces a potential integrity issue into our analysis: if we just start changing things willy-nilly, we lose track of which experiments we performed (you could keep track in a notebook, but most students probably won't). If we mitigate this by just copying and pasting the whole algorithm for every new set of inputs, our code becomes cluttered and we increase the chances of human error. Best practice dictates that we write a function.

This is another layer of computational complexity to what is supposed to be a practical problem. Discussion of why we're doing this, as above, should help the students appreciate what problem we're solving by adding this extra layer of work to our procedure.

Let's write the function. First we'll write a simple pair of proposal distribution functions, for computing the density of and sampling from an equal mixture of two normals, with low standard deviation. This should help our algorithm visit both modes of the distribution, without having to jump around the whole sample space, increasing the chances of visiting low-density areas. We'll then implement a function to perform the above MCMC algorithm.
```{r mcmc-2,cache = TRUE}
# Functions for computing density and sampling from a mixture of 2 normals,
# centered at mu and -mu
m2n_density <- function(x,mu,sig) {
  .5 * dnorm(x,mu,sig) + .5 * dnorm(x,-mu,sig)
}
m2n_sample <- function(mu,sig) {
  # mu is a vector of length 2
  if (rbinom(1,1,.5) == 0) {
    rnorm(1,mu,sig)
  } else {
    rnorm(1,-mu,sig)
  }
}

mcmc_sample1 <- function(niter,propsd,burnin,thinning,xstart = 0) {
  iterations <- numeric(niter)
  curpoint <- xstart
  for (i in 1:niter) {
    # Propose a new point
    newpoint <- m2n_sample(curpoint,propsd)
    # Compute the acceptance probability
    accprob <- (fx(newpoint) * m2n_density(curpoint,newpoint,propsd)) / (fx(curpoint) * m2n_density(newpoint,curpoint,propsd))
    # Decide whether to accept
    if (runif(1) < accprob) {
      iterations[i] <- newpoint
    } else {
      iterations[i] <- curpoint
    }
    curpoint <- iterations[i]
  }

  # Burn in and thin
  iterations_out <- iterations[(burnin+1):niter][1:(niter-burnin) %% thinning  == 0]

  # Check out the trace plot, and histogram of samples
  traceplot <- data_frame(y = iterations,
                          x = 1:length(iterations)) %>%
    ggplot(aes(x = x,y = y,group = 1)) +
    theme_classic() +
    geom_path() +
    labs(title = "MCMC Trace Plot",
         x = "Iteration",
         y = "Accepted Value") +
    scale_x_continuous(labels = scales::comma_format()) +
    coord_flip()

  histogramofsample <- data_frame(x = iterations_out) %>%
    ggplot(aes(x = x)) +
    theme_classic() +
    geom_histogram(aes(y = ..density..),bins = 30,colour = "black",fill = "lightblue") +
    stat_function(fun = fxc,colour = "purple") +
    labs(title = "Histogram of MCMC Samples",
         subtitle = "Purple curve: desired density",
         x = "Sampled Value",
         y = "Density")
  
  # Return the sample and the plots
  list(
    sample = iterations_out,
    traceplot = traceplot,
    histogram = histogramofsample
  )
}

# Try it out
testsample <- mcmc_sample1(niter = 1000,propsd = .25,burnin = 100,thinning = 9)
cowplot::plot_grid(testsample$traceplot,testsample$histogram,nrow = 1)
```
Not bad! The major advantage of keeping the code in a function is now if we want to assess the effect of changing an input, we can just call the function again. Our notebook contains a record of everything we've done, and the chance of human error remains the same as if we only called the function once.
```{r mcmc-3}
testsample <- mcmc_sample1(niter = 1000,propsd = .1,burnin = 100,thinning = 9)
cowplot::plot_grid(testsample$traceplot,testsample$histogram,nrow = 1)

testsample <- mcmc_sample1(niter = 10000,propsd = .1,burnin = 1000,thinning = 9)
cowplot::plot_grid(testsample$traceplot,testsample$histogram,nrow = 1)
```

If computation is a focus of the course itself, we can turn to the question of how to improve the above code. One improvement might be to pass the proposal distribution itself as an argument to the sampling function, rather than just its variance. This allows rapid testing of different proposals. This is a good point to introduce different function syntax in `R`, as provided by the `rlang::as_function` function. This lets us pass a function as an argument using

  - anonymous function syntax, `function(x) x + 1`
  - a string, `"rnorm"`
  - a function object, `rnorm`
  - most helpfully, a formula: `~ .x + 1`

These variants are becoming more and more popular in `R` programming. Adding this to our function can be done as follows:

```{r mcmc-4,cache=TRUE}
mcmc_sample2 <- function(niter,proposal_sample,proposal_density,burnin,thinning,xstart = 0) {
  # Convert proposal to functions
  proposal_sample <- rlang::as_function(proposal_sample)
  proposal_density <- rlang::as_function(proposal_density)
  
  # The rest of the code remains mostly unchanged
  iterations <- numeric(niter)
  curpoint <- xstart
  for (i in 1:niter) {
    # Propose a new point
    newpoint <- proposal_sample(curpoint)
    # Compute the acceptance probability
    accprob <- (fx(newpoint) * proposal_density(curpoint,newpoint)) / (fx(curpoint) * proposal_density(newpoint,curpoint,propsd))
    # Decide whether to accept
    if (runif(1) < accprob) {
      iterations[i] <- newpoint
    } else {
      iterations[i] <- curpoint
    }
    curpoint <- iterations[i]
  }

  # Burn in and thin
  iterations_out <- iterations[(burnin+1):niter][1:(niter-burnin) %% thinning  == 0]

  # Check out the trace plot, and histogram of samples
  traceplot <- data_frame(y = iterations,
                          x = 1:length(iterations)) %>%
    ggplot(aes(x = x,y = y,group = 1)) +
    theme_classic() +
    geom_path() +
    labs(title = "MCMC Trace Plot",
         x = "Iteration",
         y = "Accepted Value") +
    scale_x_continuous(labels = scales::comma_format()) +
    coord_flip()

  histogramofsample <- data_frame(x = iterations_out) %>%
    ggplot(aes(x = x)) +
    theme_classic() +
    geom_histogram(aes(y = ..density..),bins = 30,colour = "black",fill = "lightblue") +
    stat_function(fun = fxc,colour = "purple") +
    labs(title = "Histogram of MCMC Samples",
         subtitle = "Purple curve: desired density",
         x = "Sampled Value",
         y = "Density")
  
  # Return the sample and the plots
  list(
    sample = iterations_out,
    traceplot = traceplot,
    histogram = histogramofsample
  )
}

# Try it out with the same proposal as before
# Use the formula syntax to pass fixed arguments to the passed functions
testsample <- mcmc_sample2(niter = 1000,
                           proposal_sample = ~m2n_sample(mu = .x,sig = .1),
                           proposal_density = ~m2n_density(x = .x,mu = .y,sig = .1),
                           burnin = 100,
                           thinning = 9)
cowplot::plot_grid(testsample$traceplot,testsample$histogram,nrow = 1)

testsample <- mcmc_sample2(niter = 10000,
                           proposal_sample = ~m2n_sample(mu = .x,sig = .1),
                           proposal_density = ~m2n_density(x = .x,mu = .y,sig = .1),
                           burnin = 1000,
                           thinning = 9)
cowplot::plot_grid(testsample$traceplot,testsample$histogram,nrow = 1)

# Try it with a different proposal. Now you can do anything you want, like a noncentral t:
testsample <- mcmc_sample2(niter = 10000,
                           proposal_sample = ~rt(n=1,df=5,ncp=.x),
                           proposal_density = ~dt(x = .x,df=5,ncp = .y),
                           burnin = 1000,
                           thinning = 9)
cowplot::plot_grid(testsample$traceplot,testsample$histogram,nrow = 1)
```