# Lab 2: Random Variables and Sampling

**Name(s):**

1.

2.

<br>

---

In this notebook, we will look at sampling from probability mass functions (more generally, from *probability distributions*) and use functions in R to sample from different probability distributions to model some fun processes.

<br>

---

## Exercise 1 - Dice Rolling

Consider rolling a six-sided die.

If the die is fair, then all outcomes are equally likely. The outcome of a single roll of this die can be thought of as the outcome of a single round of an experiment where you sample one of the numbers 1 through 6, all with equal probability. We can do that in R by using the `sample` function, as shown below. (Select the code cell below and press Shift+Enter to run it.)


In [None]:
#sample(x, size, replace = FALSE, prob = NULL)
roll <- sample(c(1,2,3,4,5,6), size=1)
print(roll)

### Question 1: 

You'll notice that the first argument given is `c(1,2,3,4,5,6)`. What do you think this represents in the function? Put your answer in the text cell below this one.

### Question 2:

To simulate rolling a die multiple times, we can include arguments for `size` (the number of rolls) and `replace` (a value of TRUE or FALSE that denotes whether we sample with replacement or not). Modify the function call below in order to roll the die 10 times and save the result to `rolls`. Don't mess around with the `set.seed(251)` line of code. That just ensures that everyone gets the same sequence of random numbers every time the cell is run. Yes, that doesn't sound random at all... check out [Random seed](https://en.wikipedia.org/wiki/Random_seed) on Wikipedia for more information.
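To see what a seed does, here is a minimal sketch. The seed value 42 and the use of `runif` are arbitrary choices made for this illustration only; they are not part of the lab.

```r
# Illustration only: the seed 42 and runif() are arbitrary choices, not part of the lab.
set.seed(42)
first_draw <- runif(3)   # three uniform random numbers between 0 and 1

set.seed(42)             # reset the generator to the same starting state
second_draw <- runif(3)

print(identical(first_draw, second_draw))  # TRUE: same seed, same numbers
```

Resetting the seed replays the same sequence, which is why everyone's `rolls` should match in this lab.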

In [None]:
# replace the TODOs with proper arguments
set.seed(251) # <-- leave this alone
rolls <- sample(c(1,2,3,4,5,6), size=TODO, replace=TODO)
print(rolls)

### Question 3:

We can count up the number of occurrences of a particular outcome by using the `which` function in R. For example, our code from Question 2 should have left `rolls` with the value `[3, 4, 3, 4, 2, 3, 3, 6, 6, 2]`. Each element within that array has an *index*, which you can think of as an integer 1, 2, 3, ... representing the address of that element within the `rolls` array. So, `rolls[1]` is the first element in `rolls`, which is a 3. `rolls[10]` is the 10th element, a 2. And so on.
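As a quick sketch of indexing, here is the same idea on a made-up vector `v` (hypothetical, not part of the lab). Note that R starts counting indices at 1, not 0.

```r
# Hypothetical vector, used only to illustrate indexing.
v <- c(10, 20, 30, 40)
print(v[1])       # first element: 10
print(v[4])       # fourth element: 40
print(length(v))  # number of elements: 4
```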

The `which` function returns an array of all the indices of our input array at which the condition we plug in is true. For example, to find the positions of all of the elements of `rolls` that are equal to 3, we can do:

In [None]:
which(rolls==3)

You can check `rolls` above and verify that those elements are indeed equal to 3. We can find out how *many* elements are equal to 3 by wrapping the `which` command in a command called `length`:

In [None]:
length(which(rolls==3))

Based on the total length of `rolls` and how many elements are equal to 3, can you write a line or two of code that will estimate the probability of rolling a 3 from our 10 rolls?

In [None]:
prob_3 <- 0 # TODO
print(prob_3)

### Question 4:

How does this compare to the actual probability of rolling a 3 on a fair die? Why are these different? How could you improve your estimate?

### Question 5:

Improve your estimate of the probability of rolling a 3. What do you need to do in order to get your estimate to be accurate within 0.01?

<br>

---

## Exercise 2 - Loaded dice

Consider now a *loaded* 6-sided die. That is, the different sides do not all necessarily have the same probability of being rolled.

### Question 6:

Suppose that the die is loaded such that each of the outcomes 1, 2 and 3 is twice as likely as each of the outcomes 4, 5 and 6. That is, for example, $p(1) = 2p(4)$, $p(1) = p(2)$, and $p(4) = p(5)$.

Let $X$ be a random variable describing the outcome of rolling the die. What is the probability mass function for $X$? Write your answer using LaTeX and Markdown in a new text cell below.

### Question 7:

Find the cumulative distribution function for $X$.  What is the probability that you roll a $4$ or lower with the loaded die? 

### Question 8:

We can use the `prob` argument for the `sample` function to give the outcomes unequal probabilities in our simulated roll of the die. `prob` should be an array of the same length as the set of outcomes, where each element in `prob` is the probability of the corresponding element from the set of outcomes. Replace the placeholder (`c(1,0,0,0,0,0)`) with the actual probabilities for each outcome of the die roll (from Question 6).
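To illustrate how `prob` lines up with the outcomes, here is a small sketch using a made-up three-colour spinner. The outcomes, probabilities, seed, and variable names are all hypothetical and unrelated to the loaded die.

```r
# Hypothetical three-colour spinner; outcomes and probabilities are made up for illustration.
outcomes <- c("red", "green", "blue")
probs <- c(0.5, 0.3, 0.2)   # one probability per outcome, in the same order

set.seed(1)  # arbitrary seed so the sketch is reproducible
spins <- sample(outcomes, size = 1000, replace = TRUE, prob = probs)

# Relative frequency of each colour: red should be near 0.5, green near 0.3, blue near 0.2
print(table(spins) / length(spins))
```

The first entry of `probs` goes with the first entry of `outcomes`, the second with the second, and so on.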

In [None]:
# replace the `prob` argument with what the probabilities for each outcome are
roll <- sample(c(1,2,3,4,5,6), size=1, replace=TRUE, prob=c(1,0,0,0,0,0))
print(roll)

### Question 9:

Verify that your `sample` function call works by estimating $p(3)$ and $p(5)$ using 10,000 samples. Are they close to the pmf values you solved for in Question 6?

### Question 10:

Now use a set of 10,000 sample rolls of the loaded die to check the value of the cdf at 4 from Question 7. That is, what is the probability that we observe a roll of 4 or less?

*Hint: Instead of `==` to check where our `rolls` array is equal to a number, we can use `<=` to check where `rolls` is less than or equal to a number.*
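As a sketch of how `<=` behaves, here it is applied to a made-up vector `v` (hypothetical, not the `rolls` from this lab). The comparison is done element by element.

```r
# Hypothetical vector, used only to illustrate the comparison.
v <- c(2, 5, 1, 6)
print(v <= 3)         # TRUE FALSE TRUE FALSE
print(which(v <= 3))  # indices 1 and 3
```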

<br>

---

## Exercise 3 - Sampling from the geometric distribution

In class we saw that a random variable $X$ that follows a **geometric** distribution with parameter $p$ has the pmf

$$p(X=k) = (1-p)^{k-1}p,$$

for $k=1,2,\ldots$ and $p(X=k)=0$ for all other $k$.
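To get a feel for the formula, the sketch below evaluates it directly for the first few values of $k$. The choice $p = 0.2$ and the variable names are arbitrary, picked only for this illustration.

```r
# Illustration only: p = 0.2 is an arbitrary choice.
p <- 0.2
k <- 1:5
pmf <- (1 - p)^(k - 1) * p   # the pmf formula above, evaluated at k = 1, ..., 5
print(pmf)                   # approximately 0.2, 0.16, 0.128, 0.1024, 0.08192

# Summing the formula over many values of k gives a total very close to 1
total <- sum((1 - p)^((1:1000) - 1) * p)
print(total)
```

Each value is the previous one multiplied by $(1-p)$, so the probabilities shrink geometrically as $k$ grows.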



### Question 11:

**Explain-it-back:** Explain in words out loud to your lab partner, and write below, what $k$ and $p$ represent in the above equation for the pmf. Give an example of a situation that might be appropriate to model using a geometric random variable. It may **not** involve flipping coins, because we've already beaten those examples to death.

### Question 12:

To draw a sample from a geometric distribution with parameter $p$, we can use the `rgeom` function in R. We must provide `rgeom` with `n`, the number of samples, and `prob`, the parameter $p$ of the geometric distribution. (Writing `p=0.5` also works, because R partially matches `p` to the argument name `prob`.) We can draw a sample of size 20 from a geometric distribution with $p=0.5$ as follows.

In [None]:
x <- rgeom(n=20, prob=0.5)
print(x)

Consider the pmf above, and the values for our samples from the geometric distribution. The pmf should be 0 for $X=0$, and yet we have a bunch of samples for $X$ that are 0. Why is this? 

*Hint: It might help to look up the documentation for the `rgeom` function in R. You might search the web for "geometric distribution in R".*

### Question 13:

How should we transform our sample from Question 12 to be consistent with the definition from class?