# CBS Week 1 Exercise: Jupyter Notebooks and Basic Probability


In [1]:
suppressPackageStartupMessages({
    library(tidyverse)
    library(testthat)
    library(knitr)
    library(kableExtra)
    library(IRdisplay)  
})

A FAQ will be maintained for each notebook, and you can find the FAQ for this exercise by clicking the "Modules" tab on Canvas. If you have a question, please check the FAQ first to see whether it has already been addressed. If it hasn't, please post your question to the "Week 1 Exercise" discussion thread on the Canvas Discussion board. Almost all class-related questions should be posted to the Discussion board rather than sent to the instructors directly, so that everyone in the class can see the responses. If you see a question on the Discussion board that you're able to answer, please jump in and do so.

## Introduction to Jupyter Notebooks

Future assignments and tutorials will take the form of Jupyter notebooks like this one. Typically you'll be asked to edit some of the code chunks and to add text in response to short answer questions. This introductory notebook will be graded automatically but the grade will not count. Completing and submitting the notebook, however, will get you ready for next week's tutorial and the notebooks to come.

Notebooks may include questions that ask for a free-text response. For example:

### Exercise 1 (0 points)

Name one way in which human brains are superior to the best current digital electronic computers. Provide your answer in the cell immediately below this one.


YOUR ANSWER HERE

Notebooks may also include questions that ask you to write or edit some code. For example:

### Exercise 2 (1 point)

Complete the definition of a function `add_pair` that returns the sum of two numbers.

In [None]:
add_pair <- function(a,b){
    # YOUR CODE HERE
    stop('No Answer Given!')
}

Some coding questions will be automatically graded. For example, the automatic grader might check that your `add_pair` function produces the right answers on a couple of examples. The tests here are formulated using the `testthat` package.

In [None]:
# expect_equal() is the current testthat form of expect_that(..., equals(...))
expect_equal(add_pair(1, 4), 5)
expect_equal(add_pair(-100, -8), -108)
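
Most answers later in this notebook are probabilities stored as floating-point numbers, so it may help to know that `testthat`'s equality expectations compare values up to a small numerical tolerance rather than bit for bit. The sketch below illustrates the difference; the names `exact_match` and `close_match` are illustrative only and are not part of the notebook's tests.

```r
library(testthat)

# Exact comparison of doubles can fail because of rounding error
exact_match <- (0.1 + 0.2 == 0.3)

# all.equal() compares up to a small numerical tolerance
close_match <- isTRUE(all.equal(0.1 + 0.2, 0.3))

# expect_equal() also allows a small tolerance, so this expectation passes
expect_equal(0.1 + 0.2, 0.3)
```

In practice this means a correct answer should not fail a test merely because of rounding in the last few decimal places.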

Often the tests will be hidden. For example, here's a completely unfair question that we would never ask.

### Exercise 3 (1 point)
I'm thinking of an integer `x` between 1 and 1,000,000. What is `x`?

In [None]:
x <-
    # YOUR CODE HERE
    stop('No Answer Given!')

Here's one test that your solution should pass. 

In [None]:
# expect_true() is the current testthat form of expect_that(..., equals(TRUE))
expect_true(x > 0)

But there's also a hidden test that you're not seeing, which checks whether your guess matches the true value of `x`. Unless you're very lucky, your solution will fail this test (but don't worry: this notebook doesn't count for anything!).

After you've finished working through the notebook, try "validating" the notebook by pressing the Validate button in the toolbar above. The validation process checks that your code passes all of the *visible* tests but doesn't consider the hidden tests. So just because the notebook validates doesn't mean that all of your answers are correct. 

The remaining sections of this notebook introduce or review some things that will be useful for the module on probabilistic models of cognition. Programming in R should be familiar to all of the psychology students but may be less familiar to some others. On the other hand, the section on probability theory will probably be familiar to those of you with a technical background but may be less familiar to some of the psychology students. 


## Programming in R

We'll assume that you are fairly familiar with R and the tidyverse. The required background roughly corresponds to the material in the first four sections of "Working with Data" from Danielle Navarro's [R for Psychological Science](https://psyr.djnavarro.net/index.html). These sections are called

* [Prelude to Data](https://psyr.djnavarro.net/prelude-to-data.html)
* [Data Types](https://psyr.djnavarro.net/data-types.html)
* [Describing Data](https://psyr.djnavarro.net/describing-data.html)
* [Visualizing Data](https://psyr.djnavarro.net/visualising-data.html)

Psychology students should already have this background by virtue of completing Andy Perfors' version of PSYC30013 (Research Methods for Human Inquiry). If you're not a psychology student, our assumption is that you're already familiar with at least one programming language, and should therefore be able to pick up R mostly on your own. We're happy to help, of course, if you run into problems, and posting to the discussion board or asking your tutor during tutorials are the best ways to seek advice.


## Probability Theory

Let's set up a joint distribution over three binary variables. Each variable takes one of two possible values (1 or 2), and because there are 3 variables there are $2 \times 2 \times 2 = 8$ possible settings of the variables. We'll directly specify a joint probability distribution `P(x,y,z)` over these settings.

In [None]:
d1 <- tibble(x = c(1,1,1,1,2,2,2,2), 
             y = c(1,1,2,2,1,1,2,2), 
             z = c(1,2,1,2,1,2,1,2), 
             p_x_y_z = c(0.3, 0.25, 0.2, 0.1, 0.05, 0.05, 0.03, 0.02) )

print(d1)
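
As a side note, the eight settings do not have to be typed out by hand. A minimal sketch, assuming the `tidyr` package (part of the tidyverse) is available: `expand_grid()` enumerates every combination of the supplied values. The names `settings` and `n_settings` are illustrative and are not used elsewhere in the notebook.

```r
library(tidyr)

# Enumerate every combination of x, y and z in {1, 2}.
# expand_grid() varies the first column slowest, giving the same row order as d1.
settings <- expand_grid(x = 1:2, y = 1:2, z = 1:2)

n_settings <- nrow(settings)  # one row per possible setting
print(settings)
```

The probabilities themselves still have to be supplied separately, since they are part of the model rather than something that can be derived from the grid.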


### Exercise 4 (0 points)

To make sense as a probability distribution, the entries of `d1$p_x_y_z` need to sum to 1. Please check that this condition is satisfied by writing some code in the next cell.


In [None]:
# YOUR CODE HERE
stop('No Answer Given!')

Having the joint distribution `d1` allows you to compute distributions over any subset of variables given observations over any other subset of variables. We'll try a few examples.

### Exercise 5 (1 point)
What's the probability that x, y and z all equal 2? We'll use `p_x2_y2_z2` to denote $P(x=2,y=2,z=2)$. For this and all remaining questions you can either write code to compute the answer or figure things out by hand and just define the relevant variable (here `p_x2_y2_z2`) to have the correct value.

In [None]:
# compute P(x=2,y=2,z=2)
p_x2_y2_z2 <-
# YOUR CODE HERE
stop('No Answer Given!')

In [None]:
# this cell contains some hidden tests! You can leave it empty except for this comment

The handout for Week 1 includes equations for Marginalization and Conditional Probability that apply when there are just two variables `a` and `b`. Let's try out similar ideas for the three variable case.
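
As a warm-up, here is a sketch of the two-variable case on a small made-up distribution. It assumes the handout uses the standard definitions, namely that marginalization sums the joint over the variable being removed and that $P(a|b) = P(a,b)/P(b)$. The table `toy` and its probabilities are hypothetical and unrelated to `d1`.

```r
library(tidyverse)

# Hypothetical joint distribution P(a, b) over two binary variables
toy <- tibble(a = c(1, 1, 2, 2),
              b = c(1, 2, 1, 2),
              p_a_b = c(0.4, 0.1, 0.2, 0.3))

# Marginalization: P(b = 2) sums the joint over every value of a
p_b2 <- toy %>%
    filter(b == 2) %>%
    summarise(p = sum(p_a_b)) %>%
    pull(p)

# Joint probability of the single setting a = 1, b = 2
p_a1_b2 <- toy %>%
    filter(a == 1, b == 2) %>%
    pull(p_a_b)

# Conditional probability: P(a = 1 | b = 2) = P(a = 1, b = 2) / P(b = 2)
p_a1_given_b2 <- p_a1_b2 / p_b2
```

The same `filter()` and `sum()` pattern is one possible way to approach the three-variable exercises, although working the answers out by hand is equally acceptable.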


### Exercise 6 (1 point)
What's $P(z=2)$, or the probability that z equals 2? We'll use `p_z2` to denote this probability.

In [None]:
# compute P(z = 2)
p_z2 <-
# YOUR CODE HERE
stop('No Answer Given!')

In [None]:
# this cell contains some hidden tests! You can leave it empty except for this comment

### Exercise 7 (1 point)
What's $P(x=2,y=2)$, or the probability that x and y both equal 2? We'll use `p_x2_y2` to denote this probability.

In [None]:
# compute P(x = 2, y = 2)
p_x2_y2 <-
# YOUR CODE HERE
stop('No Answer Given!')

In [None]:
# this cell contains some hidden tests! You can leave it empty except for this comment

### Exercise 8 (1 point)
What's $P(z=2|x=2,y=2)$, or the probability that z=2 given that x and y both equal 2? We'll use `p_z2_given_x2_y2` as the variable name for this conditional probability.

In [None]:
# compute P(z=2|x=2, y=2)
p_z2_given_x2_y2 <-
# YOUR CODE HERE
stop('No Answer Given!')

In [None]:
# this cell contains some hidden tests! You can leave it empty except for this comment

### Exercise 9 (2 points)
What's $P(z=2|x=2)$, or the probability that z equals 2 given that x equals 2? We'll use `p_z2_given_x2` as the variable name for this conditional probability.

In [None]:
# compute P(z=2|x=2)
p_z2_given_x2 <-
# YOUR CODE HERE
stop('No Answer Given!')

In [None]:
# this cell contains some hidden tests! You can leave it empty except for this comment

Now let's think about another distribution on three variables x, y and z. This time we're not explicitly given the joint distribution `P(x,y,z)` -- instead we are given the distributions `P(x)`, `P(y|x)`, and `P(z|x,y)`.

Here's the distribution `P(x)`:


In [None]:
xtab <- tibble(`P(x=1)` = c(0.2), `P(x=2)` = c(0.8) )  
kable(xtab, "html", align="cc")  %>% 
    as.character()  %>% 
    display_html()


Here's the conditional probability distribution `P(y|x)`:

In [None]:
ytab <- tibble(x = c(1,2), `P(y=1|x)` = c(0.9,0.1), `P(y=2|x)` = c(0.1,0.9) )
kable(ytab, "html", align="ccc")  %>% 
    as.character()  %>% 
    display_html()
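
Each row of this table fixes a value of x, so the two probabilities in a row form a distribution over y and should sum to 1. A quick sanity check, sketched here by re-creating `ytab` so that the snippet runs on its own (the name `row_totals` is illustrative):

```r
library(tidyverse)

ytab <- tibble(x = c(1, 2), `P(y=1|x)` = c(0.9, 0.1), `P(y=2|x)` = c(0.1, 0.9))

# Each row fixes x, so the two entries in that row form a distribution over y
row_totals <- ytab %>%
    mutate(total = `P(y=1|x)` + `P(y=2|x)`) %>%
    pull(total)

print(row_totals)
```

Note that it is the rows, not the columns, that must sum to 1: the column `P(y=1|x)` mixes probabilities from two different conditional distributions.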

And here's the conditional probability distribution `P(z|x,y)`:

In [None]:
ztab <- tibble(x = c(1,1,2,2), y = c(1,2,1,2), `P(z=1|x,y)` = c(1,0.5,0.5,0), `P(z=2|x,y)` = c(0,0.5,0.5,1) )
kable(ztab, "html", align="cccc")  %>% 
    as.character()  %>% 
    display_html()

We can combine these three elements into a larger table `d2` as follows.

In [None]:
d2 <- tibble(x = c(1,1,1,1,2,2,2,2), 
             y = c(1,1,2,2,1,1,2,2), 
             z = c(1,2,1,2,1,2,1,2), 
             p_x = c(0.2, 0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8),
             p_y_given_x = c(0.9, 0.9, 0.1, 0.1, 0.1, 0.1, 0.9, 0.9), 
             p_z_given_x_y = c(1,0,0.5, 0.5,0.5,0.5,0,1) )

print(d2)

Let's pick just a single row in the table: the third row. The entries in this row tell us that $P(x = 1) = 0.2$, that $P(y = 2 | x = 1) = 0.1$, and that $P(z = 1|x = 1, y = 2) = 0.5$. We've introduced some redundancy here. For example, the first four rows all have $x = 1$, which means that `p_x` is identical for all four rows.

Note that the columns `p_x`, `p_y_given_x` and `p_z_given_x_y` do NOT specify probability distributions over the 8 rows of the table. For a start, they do not sum to 1 (each of these columns sums to 4).

The handout for Week 1 (available on Canvas) includes an equation for the Chain Rule that covers the two variable case. We can use a similar idea here to compute the joint distribution over the three variables `x`, `y` and `z`.
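
For reference, here is a sketch of the two-variable case on made-up numbers, assuming the handout's Chain Rule is the standard $P(a,b) = P(a)\,P(b|a)$. The table `toy2` and its probabilities are hypothetical and unrelated to `d2`.

```r
library(tidyverse)

# Hypothetical factors P(a) and P(b | a) for two binary variables,
# laid out with one row per setting of (a, b)
toy2 <- tibble(a = c(1, 1, 2, 2),
               b = c(1, 2, 1, 2),
               p_a = c(0.3, 0.3, 0.7, 0.7),
               p_b_given_a = c(0.6, 0.4, 0.2, 0.8))

# Chain rule: P(a, b) = P(a) * P(b | a), applied row by row
toy2 <- toy2 %>%
    mutate(p_a_b = p_a * p_b_given_a)

# A valid joint distribution sums to 1 over all settings
total <- sum(toy2$p_a_b)
```

Unlike the columns `p_a` and `p_b_given_a`, the new column `p_a_b` is a genuine distribution over the rows of the table.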

### Exercise 10 (1 point)

Add a column `p_x_y_z` to `d2` that specifies a joint distribution over the 8 possible settings of the three variables.

In [None]:
d2 <- d2  %>% 
# YOUR CODE HERE
stop('No Answer Given!')

In [None]:
# expect_equal() is the current testthat form of expect_that(..., equals(...))
expect_equal(sum(d2$p_x_y_z), 1)

Now that we've computed the joint distribution, we can use it to compute distributions over any subset of variables given observations over any other subset of variables. We'll do just one example.

### Exercise 11 (1 point)

What's $P(x=2|z=2)$, or the probability that x = 2 given that z = 2? We'll use `p_x2_given_z2` as the variable name for this conditional probability.

In [None]:
p_x2_given_z2 <- 
# YOUR CODE HERE
stop('No Answer Given!')

In [None]:
# this cell contains some hidden tests! You can leave it empty except for this comment