-
Notifications
You must be signed in to change notification settings - Fork 0
/
intro_to_RStan.Rmd
475 lines (325 loc) · 21.2 KB
/
intro_to_RStan.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
# Introduction to Bayesian Inference using Stan
## Introduction
### Lecture video (Length: 1:26:17)
```{r, warnings=FALSE, message=FALSE, echo=FALSE}
library(vembedr)
embed_youtube('P0XttSki1jQ', height=400) %>% use_align('center')
```
[[**Watch on YouTube**]](https://youtu.be/P0XttSki1jQ)
### Learning outcomes
Today’s session aims to introduce you to the basic Stan programming etiquette for Bayesian analysis in RStudio using Stan as an Interface, and producing output and interpreting it's results. By the end of this session, you should be able to perform the following:
1. Develop basic Stan code
2. Know how to write and compile a Stan Program to compute the **posterior distribution** for simple parameters (i.e., mean, standard deviation, a proportion etc.,)
3. Know how to interpret the results churned from a Stan
You can follow the live walkthrough demonstration delivered in the first 1-hour of the practical, and then use the practical session to try the practical tutorials yourself by following the instructions and trying out the tasks.
### Live walk-through demonstration video (Length: 1:23:56)
```{r, warnings=FALSE, message=FALSE, echo=FALSE}
embed_youtube('N_P_EtT1MZk', height=400) %>% use_align('center')
```
[[**Watch on YouTube**]](https://youtu.be/N_P_EtT1MZk)
### Datasets & setting up the work directory
Since, this is our first practical lesson for **Day 1**, let us create a new folder **CPD-course** at the desktop location of our computer. Now, create a sub folder called "**Day 1**" within the **CPD-course** folder. Here, we will store all our R and Stan scripts. Set your work directory to the **Day 1** folder.
For Windows, the code for setting the work directory will be:
```{r, eval = FALSE}
setwd("C:/Users/AccountName/Desktop/CPD-course/Day 1")
```
For MAC, the code for setting the work directory will be:
```{r, eval = FALSE}
setwd("/Users/AccountName/Desktop/CPD-course/Day 1")
```
### Loading packages
To start writing scripts for Bayesian analysis, we will need to load the `rstan` package.
```{r, eval = FALSE}
# Load the packages with library()
library('rstan')
library('tidybayes')
library('tidyverse')
```
Note that when you load `rstan` from cran, you will see some recommendations on using multiple cores for speeding the process. For the best experience, we highly recommend using this code:
```{r, eval=FALSE}
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
```
This tells RStudio to use multiple core for parallel processing whenever Stan is being implemented. Every time you want to use Stan - make sure to load `parallel::detectCores()` and `rstan_options` code.
## Basic building blocks I: About Stan Programming
This section describes how to code up a basic Stan model. This section forms the foundation for later, and more complex models.
### Opening a Stan Script in RStudio
Alright, let's open a **Stan** file. You can do this by clicking and selecting `File` > `New File` > `Stan File`
```{r echo=FALSE, out.width = "100%", fig.align='center', cache=TRUE,}
knitr::include_graphics('images/general/Open_stan_file.png')
```
When you open a new **Stan** file, you will be greeted with an `untitled script` which contains the following bits of code:
```{r, eval=FALSE, highlight=FALSE}
//
// This Stan program defines a simple model, with a
// vector of values 'y' modeled as normally distributed
// with mean 'mu' and standard deviation 'sigma'.
//
// Learn more about model development with Stan at:
//
// http://mc-stan.org/users/interfaces/rstan.html
// https://github.com/stan-dev/rstan/wiki/RStan-Getting-Started
//
// The input data is a vector 'y' of length 'N'.
data {
int<lower=0> N;
vector[N] y;
}
// The parameters accepted by the model. Our model
// accepts two parameters 'mu' and 'sigma'.
parameters {
real mu;
real<lower=0> sigma;
}
// The model to be estimated. We model the output
// 'y' to be normally distributed with mean 'mu'
// and standard deviation 'sigma'.
model {
y ~ normal(mu, sigma);
}
```
Do not worry about that - it is just some formalities put there by default by the developers behind `rstan`. You can delete everything you see in this script as we will build our own basic script from scratch.
Once you have deleted the default information save the empty file in the **Day 1** folder naming it `Predicting a proportion.stan`. Whenever you are saving Stan programme in RStudio always make sure to save it with a `.stan` ending.
### Basic structure of a Stan script in RStudio
A typical **Stan** program consist of the following 6 **blocks**:
1. Data
2. Transformed data
3. Parameters
4. Transformed parameters
5. Model
6. Generated quantities
Out of these 6 blocks, the `Data`, `Parameters` and `Model` block **must** be specified. These are three compulsory blocks needed in any Stan script in order for a Bayesian model to work within the `rstan` environment. Let us define what these three important blocks are.
**FIRST:** The **data block** allows the user to declare how the model reads the dataset from RStudio by specifying the sample size **N** or observations; the number of **k** parameters that needs to be estimated; the names or list of independent variables for the corresponding parameters (e.g., coefficients); as well as data constraints etc., A **data** block is specified accordingly as:
```{r, eval=FALSE, highlight=FALSE}
data {
}
```
It is within these curly brackets will specify these details of our dataset. They must be precise as it will have to correspond with the data that is loaded in RStudio's memory.
**SECOND:** The **parameters block** allows the use to declare all primitive unknown quantities, including their respective storage types, dimensions, and constraints. The parameters that go here are the ones we want to infer or predict - e.g., includes the `mean`, `variance`, `sd`, `coefficient` and many more. A **parameters** block is specified after the **data** block:
```{r, eval=FALSE, highlight=FALSE}
data {
}
parameters {
}
```
**THIRD:** The **model block** allows the use to declare and specify the sampling statements for the **dependent variable** (i.e., likelihood function) and parameters (i.e., priors) to be used in the model. A **model** block is specified after the **parameters** block:
```{r, eval=FALSE, highlight=FALSE}
data {
}
parameters {
}
model {
}
```
Note that adding a double forward slashes ``//`` lets the user add a comment to script. Let add comments to the blocks:
```{r, eval=FALSE, highlight=FALSE}
// Add comments after double forward slashes
data {
// data block
}
parameters {
// parameters block
}
model {
// model block
}
```
<div class="note">
**Important Notes:** Since, the other blocks are not compulsory - we will leave them out for now. But we will come back and explain what those remaining blocks are in **Day 2 and 3**. Now, save your Stan script.
</div>
## Basic building blocks II: Data types and constraint declarations
In Stan, all parameters and data must be defined as variables with a specific type. Note, this is quite a pain but going through this step allows `rstan` to perform really fast. There are four basic **data** types:
1. `int`: for integers, used for specifying the sample size, and is applied to discrete variables
2. `real`: for continuous, is applied to continuous variables (i.e., ratio or interval)
3. `vector`: for a column vector of `reals`
4. `matrix`: for a matrix of `reals`
For **constraints**, we specify them on variables. For example, if we are dealing with a proportion **p** we will code it as `real<lower=0, upper=1> p` tells Stan that **p** can be any value from 0 to 1, inclusive. Note that specifying constraints really help speed Stan up so use them wherever possible.
Lastly, you can create **arrays** of variables. For example, `real p[10]` tells Stan that **p** is an array of 10 real values. We can also create a matrix to represent a set of independent variables. Now that we have discussed these points - let work with an actual demonstration to show **data types** and **constraints** work.
## Basic building blocks III: Developing our model
<div class="note">
**PROBLEM:** Last year, in the advanced GEOG0125 course, there was 13 students enrolled. It was Anwar's first time being a teacher on the course, and hence was quite curious to know what proportion of students will pass with a distinction. Since, he has no data of the pass rates - therefore has no **prior knowledge**. First, he assumes that the **proportion of those passing with a distinction** comes from a **Beta distribution** that is uniform **Beta(1.0, 1.0)**. In term 3, after tedious marking of reports, he observes that 4 students (out of the total: 13) passed with flying colours. What is the posterior distribution of those getting a distinction for GEOG0125?
</div>
Let us build our first model that predicts the posterior distribution of those passing the GEOG0125 module with a distinction. Let us extract the information:
- Total sample size is 13 (N) [i.e., **denominator**]
- The outcome is the number of students who passed (70% and above) is 4 (p) [i.e., **numerator**]
- **Data**: Proportion (or prevalence) of distinction grades is 4/13, but this is just one instance (it could have been these other likely outcomes: 0/13, 1/13, 2/13, 3/13, 4/13, ..., 12/13 or even 13/13). Let us represent proportion of distinction grades with some probability distribution $\theta$ which has a **Binomial distribution** (likelihood function).
- There is no prior knowledge, but we are dealing with a **proportion** here, so a **Beta distribution** is best for this problem and should be assigned as a prior. We are assuming that it's PDF has a uniform pattern i.e., **Beta(1.0, 1.0)** because all these instance (i.e., 0/13, 1/13, 2/13, 3/13, 4/13, ..., 12/13 or even 13/13) have a equal chance of happening. This is an example of an **uninformative prior**
We can code this information in Stan.
**FIRST:** We specify the total number of student as integer `int N` which cannot be a negative number in the **data** block. Also, we also need to specify the number of students that passed as an integer `int p` which cannot be a negative number in the **data** block too.
```{r, eval=FALSE, highlight=FALSE}
data {
int<lower = 0> N;
int<lower = 0> p;
}
```
**SECOND:** For the **parameters** block, here we will need to specify the name of the parameter that we want to infer. Here, its $\theta$ which is the **proportion** of those who got a distinction in GEOG0125. Note that $\theta$ follows a beta distribution and is therefore translated as a probability as a `real` value that is within the range of 0 to 1, inclusive.
```{r, eval=FALSE, highlight=FALSE}
data {
int<lower = 0> N;
int<lower = 0> p;
}
parameters {
real<lower=0, upper=1> theta;
}
```
**THIRD:** For the **model** block, here we will need to define our **posterior distributions**. Here, we need to state that **p** likelihood function is sampled from a **binomial distribution** as a function of **N** and $\theta$ (`binomial(N, theta)`). We also have to say $\theta$ is sampled from a **beta distribution** that is uniform. The model block will be:
```{r, eval=FALSE, highlight=FALSE}
data {
int<lower = 0> N;
int<lower = 0> p;
}
parameters {
real<lower=0, upper=1> theta;
}
model {
p ~ binomial(N, theta); // our likelihood function or observation model
theta ~ beta(1.0, 1.0); // our prior distribution alpha = 1 and beta = 1
}
```
<div class="note">
**COMPLIMENTS:** Well done, we have built our first Bayesian model. Let save it the script, what we need to do is compile and run it through RStudio to get our results.
</div>
## Basic building blocks IV: Compiling our Stan code in RStudio
Now, let us turn our attention to RStudio. The Stan script needs to be compiled from an R script. We will first need to create a list object using `list()` to connect the data to the information specified in the **data** before running the function `stan()` to the Bayesian model to get the results. The **N** and **p** will need to defined in the list object.
```{r, eval = FALSE}
# create list object and call it dataset
dataset <- list(N = 13, p = 4)
```
Now, using the `stan()` to compile and obtain the posterior estimation of the proportions of students passing with a distinction:
```{r, eval = FALSE}
# the directory needs to be set to where you save the datasets and Stan script
prediction.passes = stan("Predicting a proportion.stan", data=dataset, iter=3000, chains=3, verbose = FALSE)
```
Some notes on the above code's arguments:
- `data=` we are pushing the data we created from the `list()` to the Stan script. Stan is calling it.
- `iter=` we are asking the `stan()` to perform 3,000 iterations on each chain to generate the posterior samples. The chain can be a **MCMC**, **NUTS** or **HMC** algorithm (**NUTS** No-U-turn sampler is the default)
- `chains=` we are asking the `stan()` function to perform 3 chains (i.e., any of these algorithms can be stated in the function: i.e., **MCMC**, **NUTS** & **HMC**)
The resulting output can be printed with the function `print()`. Here, we are going to print the mean, standard error in mean, SD and the IQR ranges with 95% limits (i.e., 2.5% and 97.5%):
```{r, eval=FALSE}
print(prediction.passes, probs=c(0.025, 0.25, 0.5, 0.75, 0.975))
```
We obtain this summary table:
```{r, eval=FALSE, highlight=FALSE}
Inference for Stan model: Predicting a proportion.
3 chains, each with iter=3000; warmup=1500; thin=1;
post-warmup draws per chain=1500, total post-warmup draws=4500.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
theta 0.33 0.00 0.12 0.12 0.24 0.32 0.41 0.58 1495 1
lp__ -10.08 0.02 0.73 -12.21 -10.26 -9.80 -9.61 -9.55 1879 1
Samples were drawn using NUTS(diag_e) at Thu Jan 12 14:39:10 2023.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
```
<div class="note">
**What does it all mean?** The top part states that 3 chains were run for 3000 iterations. However, the first 1500 samples generated from each chain were discarded as warm-up, meaning that only 1500 samples from each chain were kept, resulting 4500 (1500x3) total post-warm-up sample draws. The output shows the summary statistics for our $\theta$. The `lp__` is the log-probability - which is used to quantify how well the model is for the data but, in my opinion, its not a useful estimate. Instead, use the effective sample size `n_eff` and `Rhat`. If the `Rhat` is less than 1.1 for all parameters - it means that the estimation of our parameters are fine.
</div>
## Basic building blocks V: Extract posterior samples and interpretation
At this point, let us extract the posterior samples and graph them to understand it posterior distribution. We use the `extract()` function from the `rstan` package, and graph them:
```{r, eval=FALSE}
# extracting the samples (it should be 4500)
theta.pass_draws <- rstan::extract(prediction.passes, 'theta')[[1]]
# create a graph i.e., histogram
hist(theta.pass_draws, xlim = c(0,0.8), ylim = c(0,800),
main= "Posterior samples", ylab = "Posterior Density", xlab = expression(paste("Probability of Parameter: ", theta, " [%]")))
```
```{r echo=FALSE, out.width = "100%", fig.align='center', cache=TRUE}
knitr::include_graphics('images/general/histogram_plot_pdy.png')
```
We want to compute the posterior mean, with the 0.025 and 0.975 quantiles (these limits are referred to as **95% credibility limits (95% CrI)**). Here is how we compute them:
```{r, eval=FALSE, highlight=FALSE}
# Calculating posterior mean (estimator)
mean(theta.pass_draws)
[1] 0.3305062
# Calculating posterior intervals
quantile(theta.pass_draws, probs=c(0.025, 0.975))
2.5% 97.5%
0.1239263 0.5800643
```
<div class="note">
**Interpretation:** The predicted proportion of students passing the GEOG0125 from our sample posterior distribution was 0.3305 (33.05%). Our predictions, with 95% credibility, can be within the limits of 0.1239 and 0.5801. Formally writing as $\theta$ = 33.05% (95% CrI: 12.39-58.01%).
</div>
## Basic building blocks VI: Updating our prediction with new information
We can use the 0.3305 as our new prior as opposed to using a uniform again to make an updated prediction. The updated beta distribution takes a new form of **Beta(5, 10)** centered on 0.3305. This is an **informative prior** because we have information now. This was derived from multiplying the likelihood function (i.e., **Binomial(13, 4)**) with the prior (**Beta(1,1)**) results in a Beta distribution again. This is known as a **conjugate**.
We can use this information in fact to make a prediction on the current 2022/23 cohort who can potentially get a distinction in GEOG0125 for this year? Let's see:
- This year, the total sample size is 26 (N)
- Assuming roughly a third (like last year) will probably get a distinction equates to 9 (p)
- Let us represent the proportion of distinction grades with some probability distribution $\theta$ with a **Binomial distribution** (likelihood function) [Binomial(26, 9)].
- We have prior knowledge of last year's **proportion** (0.3305) which has a **Beta distribution** [Beta(5, 10)] - this is our new **informative prior**
- Let updates the codes:
**Updated dataset in our RScript**
```{r, eval=FALSE}
dataset.updated <- list(N=26, p=9)
```
**Updated Stan script**
```{r, eval=FALSE, highlight=FALSE}
data {
int<lower = 0> N;
int<lower = 0> p;
}
parameters {
real<lower=0, upper=1> theta;
}
model {
p ~ binomial(N, theta); // our likelihood function or observation model
theta ~ beta(5, 10); // our prior distribution updated to alpha = 5 and beta = 10
}
```
**Obtaining updated posterior samples by compiling new Stan script**
```{r, eval=FALSE}
predicting.passes.updated = stan("Predicting a proportion updated.stan", data=dataset.updated, iter=3000, chains=3, verbose = FALSE)
print(predicting.passes.updated, probs=c(0.025, 0.25, 0.5, 0.75, 0.975))
```
**Updated output summary table**
```{r, eval=FALSE, highlight=FALSE}
Inference for Stan model: Predicting a proportion.
3 chains, each with iter=3000; warmup=1500; thin=1;
post-warmup draws per chain=1500, total post-warmup draws=4500.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
theta 0.34 0.00 0.07 0.20 0.29 0.34 0.39 0.49 1956 1
lp__ -26.83 0.02 0.74 -28.94 -26.97 -26.54 -26.37 -26.32 1582 1
Samples were drawn using NUTS(diag_e) at Thu Jan 12 17:20:47 2023.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
```
**Histogram of posterior samples' distribution**
```{r, eval=FALSE}
# extracting the samples (it should be 4500)
theta.pass_draws.new <- rstan::extract(predicting.passes.updated, 'theta')[[1]]
# create a graph i.e., histogram
hist(theta.pass_draws.new, xlim = c(0,1),
main= "Posterior samples [Updated]", ylab = "Posterior Density", xlab = expression(paste("Probability of Parameter: ", theta, " [%]")))
```
```{r echo=FALSE, out.width = "100%", fig.align='center', cache=TRUE}
knitr::include_graphics('images/general/histogram_plot_pdy2.png')
```
**Obtain updated posterior proportion and 95% CrI**
```{r, eval=FALSE, highlight=FALSE}
# Calculating posterior mean (estimator)
mean(theta.pass_draws.new)
[1] 0.3420825
# Calculating posterior intervals
quantile(theta.pass_draws.new, probs=c(0.025, 0.975))
2.5% 97.5%
0.2027673 0.4935101
```
<div class="note">
**Interpretation:** For the current cohort, the predicted proportion of students passing the GEOG0125 from our updated posterior distribution was 0.3421 (34.21%). Our predictions, with 95% credibility, can be within the limits of 0.2027 and 0.4935. We can formally write this as $\theta$ = 34.21% (95% CrI: 20.27-49.35%).
</div>
## Tasks
### Task 1 - Low-level arsenic poisoning in Cornwall, UK
**Try this first problem in Stan**: Suppose, in a small survey, a random sample of 50 villagers from a large population in Cornwall were at risk of arsenic poisoning due to long-term low-level environmental exposure status were selected. Each person’s disease status (i.e., metallic toxicity) was recorded as either **Diseased** or **None**. 19 of the respondents have found to be diseased. The distribution of the prevalence is assumed to be of a Beta distribution with **Beta(3,5)**.
What is the predicted posterior prevalence of arsenic poisoning in Cornwall and its 95% Credibility intervals?
### Task 2 - Body mass index (BMI) problem
**Try this second problem in Stan**: The mean BMI value is 23 with SD of 1.2. Simulate sample of 1000 with BMI values based on this distribution `N(23, 1.2)` and perform Bayesian inference.
What is the posterior mean BMI and its 95% Credibility intervals?
**Hints:**
1. In the R script, use the function `rnorm()` to generate your sample of 1000 BMI points
2. In the R script, create a `list()` with `N` and `bmi` defined
3. In the Stan script, define the data block in accordance to the `list()` object
4. In the Stan script, the `bmi` values are measured outcome. Code this in the model block as a likelihood function using the `norm(mu, sigma)` notation
5. In the Stan script, use the parameter block, and make sure to code your `mu` (mean) and `sigma` (standard deviation) as real (non-negative) numbers
<div class="note">
**Note:** Solutions for task 1 and 2 will be made available later today
</div>