# **Week 4: Probability Review**

```
.------------------------------------.
|   __  ____  ______  _  ___ _____   |
|  |  \/  \ \/ / __ )/ |/ _ \___  |  |
|  | |\/| |\  /|  _ \| | | | | / /   |
|  | |  | |/  \| |_) | | |_| |/ /    |
|  |_|  |_/_/\_\____/|_|\___//_/     |
'------------------------------------'

```

This week, we’ll review basic probability concepts through problem solving.

## **Pre-Configuring the Notebook**

### **Switching to the R Kernel on Colab**

By default, Google Colab uses Python as its programming language. To use R instead, you’ll need to manually switch the kernel by going to **Runtime > Change runtime type**, and selecting R as the kernel. This allows you to run R code in the Colab environment.

However, our notebook is already configured to use R by default. Unless something goes wrong, you shouldn’t need to manually change runtime type.

### **Importing Required Packages**
**Run the following lines of code**:

In [None]:
#Do not modify

setwd("/content")

# Remove `MXB107-Notebooks` if it already exists
if (dir.exists("MXB107-Notebooks")) {
  system("rm -rf MXB107-Notebooks")
}

# Clone the repository
system("git clone https://github.com/edelweiss611428/MXB107-Notebooks.git")

# Change working directory to "MXB107-Notebooks"
setwd("MXB107-Notebooks")

# Run the pre-configuration script
invisible(source("R/preConfigurated.R"))

**Do not modify the following**

In [None]:
if (!require("testthat")) install.packages("testthat"); library("testthat")

test_that("Test if all packages have been loaded", {

  expect_true(all(c("ggplot2", "tidyr", "dplyr", "stringr", "magrittr", "knitr") %in% loadedNamespaces()))

})

## **Workshop Questions**

### **Question 1**

Three (fair) six-sided dice are rolled and their values summed. Calculate the probability of observing exactly two dice showing the same value and the third showing a different value.

Here, we will utilise the combinatoric definition of probability (which can be applied when each possible outcome has the same probability):

Probability of an event $A$ is the number of possible ways for $A$ to occur divided by the number of possible outcomes.

$$
\Pr(A) = \frac{\text{Number of ways } A \text{ can occur}}{\text{Number of possible outcomes}}.
$$
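
As a minimal illustration of this definition (a sketch, not one of the workshop questions), consider a single fair die and the event "the face shown is even". The names `faces`, `nFavourable`, and `prEven` are made up for this example.

```r
# All equally likely outcomes of rolling one fair six-sided die
faces = 1:6

# Number of outcomes in the event "the face is even"
nFavourable = sum(faces %% 2 == 0)

# Counting definition: favourable outcomes / possible outcomes
prEven = nFavourable / length(faces)
prEven
```

Three of the six faces are even, so the ratio is 3/6.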

#### **The Brute-Force Approach**

In many scenarios, we can use a brute-force approach to calculate the probability of an event. Essentially, this involves generating all possible outcomes in the sample space and then computing the probability of the event of interest.

We can generate all possible outcomes of rolling three 6-sided dice using the `expand.grid()` function. The result is a `data.frame`, assigned to `outcomes`.

In [None]:
die1 = die2 = die3 = 1:6
outcomes = expand.grid(die1, die2, die3)
outcomes %>% head(10) %>% kable()

##### **Question 1.1**

Rename the `Var1`, `Var2`, and `Var3` columns to `Die1`, `Die2`, and `Die3`.

<details>
<summary>▶️ Click to show the solution</summary>

```r
outcomes %>%
  rename(Die1 = Var1, Die2 = Var2, Die3 = Var3) -> outcomes
outcomes %>% head(10) %>% kable()
```

</details>

##### **The `ifelse()` Function**

The `ifelse(test, yes, no)` function evaluates the logical or comparison expression in `test`. If `test` is `TRUE`, the value in `yes` is returned. Otherwise, the value in `no` is returned.



In [None]:
ifelse(3>5, "Aha!", "Voila!")

`test` can also be a vector, in which case `ifelse()` is applied element-wise and returns a vector of the same length.

In [None]:
ifelse(c(1,2,3,4,5)>3, "Aha!", "Voila!")

We can create a new column named `Die1GreaterThan3` whose value is 1 if `Die1 > 3` and 0 otherwise.

In [None]:
outcomes %>%
  mutate(Die1GreaterThan3 = ifelse(Die1 > 3, 1, 0)) %>%
  head(10) %>%
  kable()

##### **Question 1.2**

Create a new column named `Satisfied` whose value is 1 if exactly two dice show the same value and the third shows a different value, and 0 otherwise.

<details>
<summary>▶️ Click to show hint</summary>

We need `Die1 == Die2` and `Die1 != Die3` **or** `Die2 == Die3` and `Die1 != Die3`, **or** `Die1 == Die3` and `Die1 != Die2`.

</details>

<details>
<summary>▶️ Click to show the solution</summary>

```r
outcomes %>%
  mutate(Satisfied = ifelse((Die1 == Die2 & Die1 != Die3) |
                            (Die2 == Die3 & Die1 != Die2)|
                            (Die1 == Die3 & Die2 != Die3), 1, 0)) -> outcomes
outcomes %>%                            
  head(10) %>%
  kable()
```

</details>

##### **Question 1.3**

The sum of `Satisfied` is the number of possible ways for our event of interest to occur. Calculate the probability of observing the event.


<details>
<summary>▶️ Click to show hint</summary>

`n()` can be used within `summarise()` to calculate the number of rows.

</details>

<details>
<summary>▶️ Click to show the solution</summary>

```r
outcomes %>%
  summarise(Pr = sum(Satisfied)/n())
```

</details>

#### **A "Smarter" Approach**

Sometimes the brute-force approach is not practical, for example when the sample space is too large for our computers to enumerate or when the problem is simply too complex to solve this way.

**Recall that**:

The complement of an event $A$, denoted $A^c$, consists of all outcomes in the sample space $S$ that are not in $A$.

$$
\Pr(A^c) = 1 - \Pr(A).
$$
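
To see the complement rule at work on a different event from the one in this question, the sketch below computes the probability that at least one of three dice shows a six, once by direct counting and once via the complement. The names `rolls`, `prA`, and `prNoSix` are introduced only for this illustration.

```r
# Sample space for three fair dice (216 equally likely outcomes)
rolls = expand.grid(Die1 = 1:6, Die2 = 1:6, Die3 = 1:6)

# Event A: at least one die shows a six (counted directly)
prA = mean(rolls$Die1 == 6 | rolls$Die2 == 6 | rolls$Die3 == 6)

# Complement A^c: no die shows a six, which has 5 * 5 * 5 outcomes
prNoSix = 5^3 / 6^3

# The two routes should agree (up to floating-point rounding)
all.equal(prA, 1 - prNoSix)
```

Counting the complement needs only one multiplication, whereas the direct route requires checking every row of the sample space.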




Instead of directly calculating the probability of observing exactly two dice showing the same value and the third showing a different value, we could compute it indirectly by:

- Letting `Pr1` be the probability that all three dice show different values (Event `E1`)
- Letting `Pr2` be the probability that all three dice show the same value (Event `E2`)

Then, since these three cases are mutually exclusive and exhaustive, we can write:

$$\Pr(\text{exactly two the same}) = 1 - \text{Pr1} - \text{Pr2}.$$

##### **Question 1.4**
Calculate the probability of the event of interest following the "smarter" approach.

<details>
<summary>▶️ Click to show the solution</summary>

There are $6^3$ cases in total, of which $6 \times 5 \times 4$ belong to event `E1` and $6$ belong to event `E2`.


```r
N = 6*6*6
Pr1 = 6*5*4/N
Pr2 = 6/N
1 - Pr1 - Pr2
```

</details>

##### **Question 1.5**

Combine the approaches from Questions 1.3 and 1.4: use the brute-force approach to calculate `Pr1` and `Pr2`, then use them to obtain the probability of the event of interest.

<details>
<summary>▶️ Click to show the solution</summary>

```r
outcomes %>%
  mutate(E1 = ifelse(Die1 != Die2 & Die2 != Die3 & Die1 != Die3, 1, 0),
         E2 = ifelse(Die1 == Die2 & Die2 == Die3, 1, 0)) %>%
  summarise(Pr1 = sum(E1)/n(),
            Pr2 = sum(E2)/n()) %>%
  mutate(Pr = 1 - Pr1 - Pr2)
```

</details>

### **Question 2**
Assume a year with 365 days (i.e., non-leap year).

#### **Question 2.1**

What is the probability that two people share the same birthday (month and date; not year)?

The solution is most directly found by calculating the probability that two people don’t share the same birthday and then using the property of complementary events.

- Let $A$ be the event that two people share the same birthday
- Let $A^c$ be the event that two people don’t share a birthday

There are $365 \times 365$ combinations of birthdays for two people, and $365 \times 364$ of them satisfy event $A^c$. Thus,

$$\text{Pr}(A^c) = \frac{365}{365}\times\frac{364}{365} = 0.9972603.$$

And, $\text{Pr}(A) = 1 - \text{Pr}(A^c) = 0.0027397$.

In R, we can use the `prod()` function to compute the product of all elements of a vector.


In [None]:
prod(364:365) == 364*365
prod(c(365, 365)) == 365^2
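
As a quick arithmetic check of the two-person result, the same counts can be put together with `prod()`. This is only a sketch; the names `nDifferent`, `nTotal`, `prNoMatch`, and `prMatch` are made up here.

```r
# Number of ordered birthday pairs in which the two birthdays differ
nDifferent = prod(364:365)

# Total number of ordered birthday pairs
nTotal = prod(c(365, 365))

prNoMatch = nDifferent / nTotal   # Pr(A^c)
prMatch = 1 - prNoMatch           # Pr(A)
round(c(prNoMatch, prMatch), 7)
```

Note that `prMatch` equals $1/365$: whatever the first person's birthday is, the second person matches it on exactly one day of the year.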

#### **Question 2.2**

Given that there are about 300 people enrolled in `MXB107`, what is the probability that no two of them share a birthday?


<details>
<summary>▶️ Click to show hint 1</summary>

Let $A^c$ be the event that no one shares the same birthday. Thus,

$$\text{Pr}(A^c) = \frac{365}{365}\times\frac{364}{365}\times \cdots \times \frac{365-300+1}{365}.$$

</details>

<details>
<summary>▶️ Click to show hint 2</summary>

If you use `prod(365:(365-300+1))`, the result overflows: it exceeds the largest number R can represent as a double (about $1.8 \times 10^{308}$), so R returns `Inf` (infinity).

A better approach is to compute a vector of **proportions** and then use `prod()` on that vector.

</details>

<details>
<summary>▶️ Click to show the solution</summary>

```r
prop = (365:(365-300+1))/365
prod(prop)
```

</details>
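
Another common way to avoid overflow, shown here as an alternative sketch rather than the intended solution, is to work on the log scale: sum the logarithms of the proportions and exponentiate at the end. The names `k`, `prLog`, and `prDirect` are introduced only for this example.

```r
k = 300                                # number of people
prop = (365:(365 - k + 1)) / 365       # the k proportions to multiply

# The raw numerator overflows to Inf in double precision
prod(365:(365 - k + 1))

# Sum the logs, then exponentiate; compare with prod() on the proportions
prLog = exp(sum(log(prop)))
prDirect = prod(prop)
all.equal(prLog, prDirect)
```

Both routes keep every intermediate value within the range of a double, so they should agree up to rounding error.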



#### **Question 2.3**

What if there are 366 people enrolled in `MXB107`?

<details>
<summary>▶️ Click to show the solution</summary>

The probability that no two people share a birthday is 0, by Dirichlet's (pigeonhole) principle: with 366 people and only 365 possible birthdays, at least two people must share one.
</details>
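
The product formula from Question 2.2 gives the same answer, because with 366 people the last factor is $0/365$. A minimal sketch, with `k` as an illustrative name for the number of people:

```r
k = 366
prop = (365:(365 - k + 1)) / 365   # runs from 365/365 down to 0/365

tail(prop, 1)   # the last factor is 0
prod(prop)      # so the whole product is 0
```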

### **Question 3**

A Covid-19 test has the following properties:

- If a person has Covid-19, the test correctly detects it 95% of the time.
- If a person does not have Covid-19, the test correctly shows a negative result 98% of the time.

In the population, 1% of people actually have the virus.

In [None]:
svgCode = paste(readLines("figures/bayes.svg", warn = F), collapse = "\n")
display_html(svgCode)

#### **Question 3.1**

If a randomly selected person is infected, what is the probability of testing positive?

In [None]:
# Write your solution in here (no R code)

<details>
<summary>▶️ Click to show the solution</summary>

Let $P$ be the event that a test is positive

Let $I$ be the event that a person is infected

From the information in the question, we can see that:

$$\Pr(P \mid I) = 0.95$$

This is what's called the "true positive rate".

</details>

#### **Question 3.2**

What is the probability of testing positive?

In [None]:
# Write your solution in here (no R code)

<details>
<summary>▶️ Click to show the solution</summary>

Here we are interested in determining the unconditional probability $\Pr(P)$. We can calculate this using the law of total probability:

$$
\begin{align}
\Pr(P) &= \Pr(P \mid I)\times\Pr(I) + \Pr(P \mid I^c)\times\Pr(I^c) \\
&= 0.95 \times 0.01 + 0.02 \times 0.99 \\
&= 0.0293
\end{align}
$$

</details>
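
Although the answer is meant to be worked out by hand, the arithmetic can be double-checked in R. The sketch below uses made-up variable names (`prI`, `prPgivenI`, `prPgivenNotI`, `prP`) for the quantities in the law of total probability.

```r
prI = 0.01                 # Pr(I): proportion of the population infected
prPgivenI = 0.95           # Pr(P | I): true positive rate
prPgivenNotI = 1 - 0.98    # Pr(P | I^c): false positive rate

# Law of total probability
prP = prPgivenI * prI + prPgivenNotI * (1 - prI)
prP
```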

#### **Question 3.3**

If a randomly selected person tests positive, what is the probability that the person actually has Covid-19?

<details>
<summary>▶️ Click to show hint</summary>

Recall Bayes' theorem:

$$
\Pr(A \mid B) = \frac{\Pr(B \mid A) \times \Pr(A)}{\Pr(B)}
$$

</details>

In [None]:
# Write your solution in here (no R code)

<details>
<summary>▶️ Click to show the solution</summary>

Here we want to find $\Pr(I \mid P)$. We can obtain this via Bayes' theorem:

$$
\begin{align}
\Pr(I \mid P) &= \frac{\Pr(P \mid I) \times \Pr(I)}{\Pr(P)} \\
&= \frac{0.95 \times 0.01}{0.0293} \\
&\approx 0.3242
\end{align}
$$

</details>
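
As an arithmetic check, and to give some intuition, the sketch below computes $\Pr(I \mid P)$ in two ways: with Bayes' theorem, and by counting people in a hypothetical population of 100,000. The population size and all variable names are assumptions made for this illustration.

```r
prI = 0.01; prPgivenI = 0.95; prPgivenNotI = 0.02

# Bayes' theorem
prP = prPgivenI * prI + prPgivenNotI * (1 - prI)
prIgivenP = prPgivenI * prI / prP

# Same calculation as head counts in a hypothetical population
popSize = 100000
truePositives = popSize * prI * prPgivenI            # infected, test positive
falsePositives = popSize * (1 - prI) * prPgivenNotI  # not infected, test positive

round(c(prIgivenP, truePositives / (truePositives + falsePositives)), 4)
```

In head counts, about 950 of the roughly 2,930 positive tests come from infected people, which is where the figure of about 0.32 comes from.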




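Note that the **probability of testing positive given that a person is infected** tells you how good a test is at detecting infected patients. However, what we care about in practice is **the probability of being infected given a positive test**. Here, even with a 95% true positive rate, only about 32% of people who test positive are actually infected, because the virus is rare in the population.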

Many serious diseases have hallmark symptoms that almost always appear in patients who have the disease (i.e., almost 100% "true positive rate"). But those same symptoms can also appear in many other, less severe or more common illnesses. So, seeing the symptom alone isn’t enough to confidently diagnose the severe disease.

**Examples**:

- Chest pain: It’s a classic symptom of a heart attack (which can be life-threatening), but it’s also common in many less serious conditions like acid reflux, muscle strain, or anxiety.
- Fatigue: Seen in severe diseases like cancer or autoimmune disorders, but also in common conditions like sleep deprivation or stress.

### **Question 4**

We have seen the classical definition of probability based on counting. There is another important (classical) definition, the frequentist definition, based on relative frequency. It defines the probability of an event as the limiting value of the proportion of times the event occurs as the number of trials becomes very large:

$$
\Pr(A) = \lim_{n \to \infty} \frac{N_A(n)}{n}
$$

where $N_A(n)$ is the number of times event $A$ occurs in $n$ trials.


We know that the probability of getting `Heads` (1) by tossing a fair coin is 0.5. Similarly, the probability of getting `Tails` (0) is 1 - 0.5 = 0.5. To simulate this coin tossing experiment `n` times, we could run `sample(0:1, size = n, replace = T)`.

In [None]:
n = 10
x = sample(0:1, size = n, replace = T) #Head = 1; Tail = 0
print(x)
mean(x) #The relative frequency of Head

Every time you rerun the block above, you get a new `x` vector. Our computers do not truly produce random values, though. Computers use pseudo-random number generators, which produce values that appear random but are actually generated by a deterministic process.

To ensure reproducibility (i.e., getting the same results each time you run the code), you can use `set.seed(seed)` to fix the starting point of the random number generator.

In [None]:
set.seed(123)
n = 10
x = sample(0:1, size = n, replace = T)
print(x)
mean(x)
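
To connect the simulation with the frequentist definition, one can track the relative frequency of `Heads` as the number of tosses grows. The following is a minimal sketch; `runningFreq` and the chosen checkpoints are illustrative choices.

```r
set.seed(123)
n = 10000
x = sample(0:1, size = n, replace = TRUE)   # Heads = 1; Tails = 0

# Relative frequency of Heads after 1, 2, ..., n tosses
runningFreq = cumsum(x) / seq_along(x)

# Inspect the running frequency at a few checkpoints
runningFreq[c(10, 100, 1000, 10000)]
```

The later checkpoints typically sit closer to 0.5 than the early ones, which is the behaviour the limit in the definition describes.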

Assume that you want to toss a coin `n` times then calculate the relative frequency of `Heads` and repeat the process `nRepeats` times. To perform this experiment in R, we can do the following steps:

- Create an empty numeric vector of length `nRepeats` to store the relative frequencies of `Heads`.

In [None]:
nRepeats = 100
headFreqn10 = numeric(nRepeats) #Here, n = 10.

- Use a `for` loop to repeat the process `nRepeats` times and save the results to `headFreqn10`.

In [None]:
set.seed(123)
for(i in 1:nRepeats){ #Repeat for i varying from 1 to nRepeats
   x = sample(0:1, size = 10, replace = T) #Toss a coin 10 times
   headFreqn10[i] = mean(x) #Compute the relative frequency of `Head` in `x` and save the value to the i-th value in `headFreqn10`
}

#### **Question 4.1**

Follow the steps above to create new vectors named `headFreqn100`, `headFreqn1000`, and `headFreqn10000` that contain the relative frequencies of heads in 100, 1000, and 10000 tosses, respectively (repeated 100 times).

Set a seed value to ensure reproducibility.

<details>
<summary>▶️ Click to show the solution</summary>

```r
set.seed(123)

nRepeats = 100
headFreqn100 = numeric(nRepeats)
headFreqn1000 = numeric(nRepeats)
headFreqn10000 = numeric(nRepeats)

for(i in 1:nRepeats){
   x = sample(0:1, size = 100, replace = T)
   headFreqn100[i] = mean(x)
}

for(i in 1:nRepeats){
   x = sample(0:1, size = 1000, replace = T)
   headFreqn1000[i] = mean(x)
}

for(i in 1:nRepeats){
   x = sample(0:1, size = 10000, replace = T)
   headFreqn10000[i] = mean(x)
}
```
</details>
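
The three loops differ only in the number of tosses, so the repetition can be wrapped in a function. The sketch below is one possible way to do this with base R's `replicate()`; `simulateHeadFreq()` and `headFreqByN` are hypothetical names, not part of the course code.

```r
# Hypothetical helper: relative frequency of Heads in n tosses,
# repeated nRepeats times
simulateHeadFreq = function(n, nRepeats = 100) {
  replicate(nRepeats, mean(sample(0:1, size = n, replace = TRUE)))
}

set.seed(123)

# One vector of 100 relative frequencies per number of tosses
headFreqByN = lapply(c(n100 = 100, n1000 = 1000, n10000 = 10000),
                     simulateHeadFreq)
str(headFreqByN)
```

The result is a named list, so `headFreqByN$n1000` plays the role of `headFreqn1000` in the loop-based solution.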

#### **Question 4.2**

Group the simulated data in `headFreqn10`, `headFreqn100`, `headFreqn1000`, and `headFreqn10000` into a single data frame. Then, visualise the distribution of the relative frequency of heads for different numbers of coin tosses with boxplots. What observations can you make from the results?

<details>
<summary>▶️ Click to show the solution</summary>

```r
headFreq = data.frame(n10 = headFreqn10,
                      n100 = headFreqn100,
                      n1000 = headFreqn1000,
                      n10000 = headFreqn10000)

headFreq %>% boxplot()
abline(h = 0.5, col = "red")
```
</details>
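
One way to quantify what the boxplots show is to compare the spread of the simulated relative frequencies with the standard deviation of a sample proportion, $\sqrt{p(1-p)/n}$ with $p = 0.5$. The sketch below re-simulates the data so that it runs on its own; `nTosses`, `simulatedSD`, and `theoreticalSD` are illustrative names.

```r
set.seed(123)
nTosses = c(10, 100, 1000, 10000)

# Standard deviation of 100 simulated relative frequencies, for each n
simulatedSD = sapply(nTosses, function(n) {
  sd(replicate(100, mean(sample(0:1, size = n, replace = TRUE))))
})

# Standard deviation of the proportion of Heads for a fair coin
theoreticalSD = sqrt(0.5 * 0.5 / nTosses)

data.frame(n = nTosses, simulatedSD, theoreticalSD)
```

The spread shrinks roughly by a factor of $\sqrt{10}$ each time the number of tosses increases tenfold, while the centre stays near 0.5.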




**The remaining questions are for you to try on your own, with solutions released at the end of the week.**

### **Question 5**
- For any randomly selected card from a standard 52-card deck:
  - What is the probability that a player draws an ace?
  - What is the probability that a player draws a diamond?
- Are the events “player draws an ace” and “player draws a diamond” independent?
- Determine the probability of drawing an Ace of Diamonds from the deck.

In [None]:
# Write your solution in here (no R code)

<details>
<summary>▶️ Click to show the solution</summary>

```r
Solution will be released at the end of the week!
```
</details>




### **Question 6**

A fair six-sided die is rolled 6 times. What is the probability that the product of all 6 outcomes is an odd number?

In [None]:
# Write your solution in here (no R code)

<details>
<summary>▶️ Click to show the solution</summary>

```r
Solution will be released at the end of the week!
```
</details>




### **Question 7**

Sometimes there's no better option than using a brute-force method. Suppose a fair six-sided die is rolled 4 times, and the results are summed. Use a brute-force approach (with `dplyr` and `%>%`, as demonstrated earlier) to calculate the probability that the total sum is greater than 10.



<details>
<summary>▶️ Click to show the solution</summary>

```r
Solution will be released at the end of the week!
```
</details>




### **Question 8**

You have two coins:

- One two-headed coin (always lands on `Heads`),
- One fair coin (`Heads` or `Tails` with equal probability).

You randomly select one of the two coins and flip it once.

**The result is Heads!!!**

What is the probability that you selected the two-headed coin?

In [None]:
# Write your solution in here (no R code)

<details>
<summary>▶️ Click to show the solution</summary>

```r
Solution will be released at the end of the week!
```
</details>


