In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("gla06.ipynb")

<img src="./ccsf.png" alt="CCSF Logo" width=200px style="margin:0px -5px">

# Guided Learning Activity 06: The Law of Large Numbers

This Guided Learning Activity is designed for you to complete alongside a Data Ambassador from the course. You might find that it feels like a combination of the lectures and lab assignment. Whether you are participating live or watching the recording of the live meeting, let the Data Ambassador guide you through the following tasks. There will be moments for you to reflect and explore your own ideas as a way to solidify concepts and skills introduced by your instructor. Keep in mind that this is not a graded assignment for MATH 108 by default. If you have any concerns about participation, reach out to your instructor.

---

## Learning Objectives

1. Apply probability rules to perform calculations.
2. Use heatmaps to create and analyze visual representations of probabilistic data.
3. Create functions to simulate probabilistic behavior.
4. Distinguish between probability and empirical distributions.
5. Observe the Law of Large Numbers and use it to detect false claims.

---

## Configure the Notebook

Run the following code cell to set up the notebook.

In [None]:
from datascience import *
import numpy as np
from sequence import generate_sequence
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

---

## A Two-Digit Sequence Generator

Someone created a sequence generator function `generate_sequence` that creates two-digit sequences (as a list of two numbers) based on the numbers 1, 2, 3, 4, 5, 6. Run the following code cell a few times to see the function in action.

In [None]:
generate_sequence()

---

### Claimed Behavior

The creator of `generate_sequence` claims that the function generates sequences by randomly picking two digits in the following way:
```mermaid
graph TD;
    A[Start] --> B["Randomly select first digit from {1,2,3,4,5,6}"]
    B --> C{"Is the first digit even?"}
    
    C -->|"Yes"| E["Randomly select second digit from {1,2,3,4,5,6}"]
    C -->|"No"| D["Randomly select second digit from {2,4,6}"]
```

---

### Probability Exercises

For the next few tasks, practice applying the basic principles from probability theory to calculate some probability values. Work through these tasks assuming the function `generate_sequence` works as claimed.
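
As a warm-up, here is a minimal sketch of two rules you will likely need: the multiplication rule for a two-step process and the complement rule for "at least one." It uses a different experiment so that the task answers stay yours to find. The marble bag and the coin flips are made up for illustration and are not part of `generate_sequence`.

```python
from fractions import Fraction

# Hypothetical bag: 3 red marbles and 2 blue marbles, drawn without replacement.
# Multiplication rule: P(red first AND blue second)
#   = P(red first) * P(blue second, given red first)
prob_red_first = Fraction(3, 5)
prob_blue_second_given_red_first = Fraction(2, 4)
prob_red_then_blue = prob_red_first * prob_blue_second_given_red_first

# Complement rule: P(at least one head in 3 fair coin flips)
#   = 1 - P(no heads in 3 flips)
prob_no_heads = Fraction(1, 2) ** 3
prob_at_least_one_head = 1 - prob_no_heads

print(prob_red_then_blue)        # 3/10
print(prob_at_least_one_head)    # 7/8
```

Notice that the second factor in the multiplication rule depends on what happened in the first step, just as the second digit's options depend on the first digit in the claimed behavior.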

### Task 01 📍

What is the probability that the first number in the sequence generated by `generate_sequence()` is `3`?

Assign `prob_first_3` to a value or expression that represents this probability value. The value should be a number between `0` and `1`, inclusive.

In [None]:
prob_first_3 = ...
prob_first_3

In [None]:
grader.check("task_01")

---

### Task 02 📍

What is the probability that `generate_sequence()` would produce the sequence `[5, 6]`?

Assign `prob_56` to a value or expression that represents this probability value. The value should be a number between `0` and `1`, inclusive.

In [None]:
prob_56 = ...
prob_56

In [None]:
grader.check("task_02")

---

### Task 03 📍

What is the probability that `generate_sequence()` would produce the sequence `[6, 5]`?

Assign `prob_65` to a value or expression that represents this probability value. The value should be a number between `0` and `1`, inclusive.

In [None]:
prob_65 = ...
prob_65

In [None]:
grader.check("task_03")

---

### Task 04 📍

What is the probability that `generate_sequence()` would produce the sequence `[1, 1]`?

Assign `prob_11` to a value or expression that represents this probability value. The value should be a number between `0` and `1`, inclusive.

In [None]:
prob_11 = ...
prob_11

In [None]:
grader.check("task_04")

---

### Task 05 📍

What is the probability that `generate_sequence()` will produce a two-digit sequence where both digits in the generated sequence are greater than `4`?

Assign `prob_both_above_4` to a value or expression that represents this probability value. The value should be a number between `0` and `1`, inclusive.

In [None]:
prob_both_above_4 = ...
prob_both_above_4

In [None]:
grader.check("task_05")

---

### Task 06 📍

If `generate_sequence()` is run 10 times, what is the probability that at least one of the ten two-digit sequences is `[5, 6]`?

Assign `prob_at_least_one_56` to a value or expression that represents this probability value. The value should be a number between `0` and `1`, inclusive.

In [None]:
prob_at_least_one_56 = ...
prob_at_least_one_56

In [None]:
grader.check("task_06")

---

### Probability Distribution

* A **probability distribution** describes the theoretical likelihood of the different outcomes in a random process.
    * Probability distributions are based on mathematical models and assumptions.
* For the two-number sequence generated from the claimed behavior of the `generate_sequence` function, the probability distribution of this experiment is the collection of all possible two-number sequences together with the probability that each will occur.
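
To make the idea concrete, here is a small sketch of a probability distribution for a simpler, hypothetical experiment (two flips of a fair coin, which is not part of this activity's generator). It pairs every possible outcome with its probability and confirms that the probabilities add up to 1.

```python
from fractions import Fraction
from itertools import product

# Hypothetical experiment: flip a fair coin twice.
# Each of the 4 equally likely outcomes has probability 1/4.
outcomes = list(product("HT", repeat=2))
prob_dist = {outcome: Fraction(1, len(outcomes)) for outcome in outcomes}

for outcome, prob in prob_dist.items():
    print(outcome, prob)

# The probabilities in any probability distribution add up to 1.
total = sum(prob_dist.values())
print(total)  # 1
```

The distribution for `generate_sequence` has the same shape (outcome paired with probability), but its outcomes are not all equally likely.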

---

### Task 07 📍

Create a function called `prob_of_sequence` that:
* Takes two integers as input (assumed to be numbers between 1 and 6, inclusive)
* Returns the probability of generating a two-digit sequence based on the provided integers and the claimed behavior of the `generate_sequence` function. Here is the logic again for how the function is claimed to work:

```mermaid
graph LR;
    A[Start] --> B["Randomly select first digit from {1,2,3,4,5,6}"]
    B --> C{"Is the first digit even?"}
    
    C -->|"Yes"| E["Randomly select second digit from {1,2,3,4,5,6}"]
    C -->|"No"| D["Randomly select second digit from {2,4,6}"]
```

**Hint:** `prob_of_sequence(5, 5)` should return `0`.

In [None]:
# Define the function
def prob_of_sequence(first_digit, second_digit):
    ...

# Check the function
prob_of_sequence(5, 5)

In [None]:
grader.check("task_07")

---

### Task 08 📍

Create a table called `prob_dist_tbl` with three columns, `'First Digit'`, `'Second Digit'`, and `'Probability'`, whose rows list every possible sequence and its probability. For example, the table should be in the following format:

|First Digit|Second Digit|Probability|
|---|---|---|
|`1`|`1`|`prob_of_sequence(1, 1)`|
|...|...|...|
|`5`|`6`|`prob_of_sequence(5, 6)`|
|...|...|...|
|`6`|`6`|`prob_of_sequence(6, 6)`|

**Hints**: 
* Try starting from an empty table and building up `prob_dist_tbl` row by row using `for` loops and the `with_row` table method.
* In each `for` loop, go through all the numbers 1 through 6.
* Use the function `prob_of_sequence` to calculate the probability of the sequence.

In [None]:
digits = np.arange(1, 7)
prob_dist_tbl = Table(["First Digit", "Second Digit", "Probability"])
for first_digit in ...:
    for second_digit in ...:
        ...

prob_dist_tbl

In [None]:
grader.check("task_08")

---

### Heat Map

A heat map is a visualization that utilizes variations in shading to convey additional information. When relating pairs of values with a numerical value, you can use a heat map to visualize the intensity of the numerical value for each pair. 

<img src="./heatmap.svg" alt="Heatmap example"/>

For example, the above heat map shows a colored rectangular region for each town and month combination. The shading of each region reflects the average temperature for that combination. This visual lets you quickly see that it is hotter in the summer and cooler in the winter.

Notice that a heat map somewhat resembles a colorized pivot table. Next, we'll guide you to use a heat map to visualize the probability distribution for the two-digit sequence scenario above. 
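
The link between a three-column table, a pivot-style grid, and shading can be sketched in plain Python. This is only an illustration: the towns and temperatures below are made up (they are not the values in the image above), and text characters stand in for colors.

```python
# Hypothetical (made-up) average temperatures in long format:
# one (town, month, temperature) row per combination.
rows = [
    ("Town A", "Jan", 50), ("Town A", "Jul", 70),
    ("Town B", "Jan", 30), ("Town B", "Jul", 90),
]

# Pivot: one grid row per town, one grid column per month.
towns = sorted({town for town, _, _ in rows})
months = ["Jan", "Jul"]
grid = {town: {} for town in towns}
for town, month, temp in rows:
    grid[town][month] = temp

# Map each value to a shade: denser characters mean hotter.
shades = " .:*#"
low = min(temp for _, _, temp in rows)
high = max(temp for _, _, temp in rows)

def shade(temp):
    # Scale temp to an index between 0 and len(shades) - 1.
    position = (temp - low) / (high - low)
    return shades[round(position * (len(shades) - 1))]

print("       " + "  ".join(months))
for town in towns:
    print(town, "  ".join(f" {shade(grid[town][m])} " for m in months))
```

A plotting library does the same two steps, arranging values in a grid and mapping each value to a color, with a continuous color scale instead of five characters.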

### Task 09 📍

To help create a heat map, write a function `to_pivot` that takes a three-column table in the format of `prob_dist_tbl` as input and returns the same information as a pivot table where the First Digit values are on the rows and the Second Digit values are on the columns. You can expect from the table provided as input that:
* The first column is labeled `'First Digit'` and has distinct values.
* The second column is labeled `'Second Digit'` and has distinct values.
* The third column contains numerical values like probabilities (or relative frequencies).

Here is a partial visual of the structure of the output of this function when applied to `prob_dist_tbl`.

<img src="./to_pivot.png" alt="A preview of the resulting pivot table" width=500px/>


**Hints**: 
* You can use a function like `sum` for the `collect` function with the `pivot` method.
* Try calling the function on `prob_dist_tbl` to see if you get the same information from the previous task but in a different format.

In [None]:
def to_pivot(tbl):
    return ...

In [None]:
grader.check("task_09")

---

### `heatmap` Function

### Task 10 📍

We've created the following function `heatmap` that takes in a table like `prob_dist_tbl`, uses your `to_pivot` function, and creates a heat map. Run the following code cell to define that function and try it out with the `prob_dist_tbl`.

In [None]:
def heatmap(tbl):
    '''
    Produces a heat map for a tbl like prob_dist_tbl where the first column is
    labeled "First Digit", the second column is labeled "Second Digit", and
    the third column is a column of probability or relative frequency values.
    '''
    import plotly.express as px
    fig = px.imshow(
            to_pivot(tbl).to_df().set_index('First Digit'),
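            # The pivot's index (First Digit) is drawn on the y-axis; its columns (Second Digit) on the x-axis.
            labels=dict(x="Second Digit", y="First Digit", color="Probability/Relative Frequency"),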
            text_auto=True,
            aspect="auto"
        )
    fig.update_layout(
        yaxis = dict(
        tickmode = 'array',
        tickvals = np.arange(1, 11))
    )
    display(fig)

# Call the function on prob_dist_tbl
heatmap(prob_dist_tbl)

In [None]:
grader.check("task_10")

---

### Task 11 📍🔎

<!-- BEGIN QUESTION -->

What do you notice about the shade variation in the heat map visualization for the probability distribution for the two-digit sequence generation?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

---

### Empirical Distributions

* **Empirical distributions** are derived from actual observed data. Instead of being based on theoretical probabilities, they reflect the frequencies of observed occurrences in a repeated experiment.
* You can create an empirical distribution for the two-digit sequence generation by running the `generate_sequence` function many times and calculating the relative frequency for all the sequences based on your observations.
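
As a rough sketch of that idea on a simpler, hypothetical experiment (rolling one fair die, using Python's standard `random` module rather than the notebook's libraries), counts are turned into relative frequencies by dividing by the number of observations:

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical experiment: roll a fair six-sided die 600 times.
n = 600
rolls = [random.randint(1, 6) for _ in range(n)]

# Count each face, then divide by the number of rolls
# to turn counts into relative frequencies.
counts = Counter(rolls)
empirical_dist = {face: counts[face] / n for face in range(1, 7)}

for face, rel_freq in empirical_dist.items():
    print(face, round(rel_freq, 3))
```

Each relative frequency is an observed proportion, so the values change from run to run (or with a different seed), while the theoretical probability of each face stays fixed at 1/6.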

---

### Task 12 📍

Write a function called `simulate_one_sequence_as_claimed` that takes no input and returns a list of 2 items where
* The first item is a random number between 1 and 6, inclusive.
* The second item is:
    * a random number between 1 and 6, inclusive, if the first item is even,
    * a random number from 2, 4, 6 if the first item is odd.

In [None]:
def simulate_one_sequence_as_claimed():
    ...

# Try calling the function
simulate_one_sequence_as_claimed()

In [None]:
grader.check("task_12")

---

### Observing Many Simulations

Now that you have a way to simulate one two-digit sequence both as `generate_sequence` actually works and as it is claimed to work, you can use a `for` loop to simulate generating these sequences many times. To speed things along, we used Python code you already know to write a function that generates `n` two-digit sequences from a function like `generate_sequence` or `simulate_one_sequence_as_claimed` and produces a heat map showing the empirical distribution of those sequences.

Run the following code cell a few times to define the function and generate empirical distributions for 100 two-digit sequences generated from `generate_sequence` and from `simulate_one_sequence_as_claimed`.

In [None]:
def simulate_two_number_sequence(function, n):
    simulated_observation_tbl = Table(['First Digit', 'Second Digit'])
    
    for _ in np.arange(n):
        simulated_observation_tbl = simulated_observation_tbl.with_row(
            function()
        )
    empirical_dist_tbl = simulated_observation_tbl.group(['First Digit', 'Second Digit'])
    frequencies = empirical_dist_tbl.column('count')
    # Divide each count by the total number of observations (not the number of distinct sequences).
    relative_frequencies = frequencies / np.sum(frequencies)
    empirical_dist_tbl = empirical_dist_tbl.with_column('count', relative_frequencies)
    empirical_dist_tbl = empirical_dist_tbl.relabeled('count', 'Relative Frequency')
    heatmap(empirical_dist_tbl)

# Run the simulation 100 times.
n = 100
print('An empirical distribution (n=100) based on how `generate_sequence` actually works:')
simulate_two_number_sequence(generate_sequence, n)
print('An empirical distribution (n=100) based on how `generate_sequence` is claimed to work:')
simulate_two_number_sequence(simulate_one_sequence_as_claimed, n)

---

## Law of Large Numbers (Law of Averages)

The Law of Large Numbers states that if a chance experiment is repeated independently and under identical conditions, the relative frequency of an event will converge to its theoretical probability as the number of repetitions increases. "Converge" is a technical term that roughly means that, with a sufficiently large sample size, the relative frequency is very likely to be very close to the theoretical probability.

---

### Convergence Example

For example, imagine drawing numbered ($1$, $2$, $3$, $4$, $5$) tokens from a bag at random where each token has an equal chance of being selected. The theoretical probability of drawing any one number is $1/5$ ($20\%$).

* Small Sample Size ($n = 10$):
    * If you randomly draw 10 numbers, you might get $1$ five times ($50\%$) and never get $3$ ($0\%$), which deviates significantly from the expected $20\%$ for each number.
    * The empirical distribution (a bar chart of how often each number appears) will look uneven and may not reflect the true distribution.
* Larger Sample Size ($n = 100$):
    * With $100$ draws, each number tends to appear closer to $20\%$ of the time, but there will still be some variation (e.g., $1$ appears $18$ times, $3$ appears $22$ times).
* Even Larger Sample Size ($n = 10{,}000$):
    * With $10{,}000$ draws, the proportion of each number appearing is very likely to be very close to $20\%$, and the empirical distribution will closely match the true uniform distribution.
 
<img src="./empirical_dists.png" alt="Example empirical distributions for n = 10, n = 100, and n = 10,000">

* Since generating all possible samples and computing exact probabilities is often infeasible, empirical histograms approximate probability distributions.
* By repeatedly simulating random processes, you can estimate probabilities without complex mathematical calculations, making computer simulations a powerful tool for understanding randomness.
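
The token example can be simulated in a few lines. This is a minimal sketch that assumes tokens are drawn with replacement and uses Python's standard `random` module. The helper name `largest_gap` is made up for this illustration.

```python
import random
from collections import Counter

random.seed(1)  # fixed seed so the sketch is repeatable

tokens = [1, 2, 3, 4, 5]
theoretical_prob = 1 / len(tokens)  # 0.2 for every token

def largest_gap(n):
    """Draw n tokens with replacement and return the largest distance
    between any token's relative frequency and 0.2."""
    draws = random.choices(tokens, k=n)
    counts = Counter(draws)
    return max(abs(counts[t] / n - theoretical_prob) for t in tokens)

for n in [10, 100, 10_000]:
    print(n, round(largest_gap(n), 4))
```

The printed gap usually shrinks as `n` grows, although any single run can vary. That is the pattern the Law of Large Numbers describes.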

---

### Task 13 📍🔎

<!-- BEGIN QUESTION -->

Using the slider below, slowly increase the value of `n` to see how the heat maps generated from `generate_sequence` and `simulate_one_sequence_as_claimed` differ as `n` increases. What do you notice about each empirical distribution in relation to the probability distribution heat map from [Task 10](#Task-10), and how does your observation relate to the Law of Large Numbers (Law of Averages)?

Here is the probability distribution heat map for reference:

In [None]:
heatmap(prob_dist_tbl)

In [None]:
import ipywidgets as widgets
from IPython.display import display

slider = widgets.IntSlider(value=10, min=10, max=10000, step=10, description='n:', continuous_update=False)

def update_plot(n):
    print(f'An empirical distribution (n={n}) based on how `generate_sequence` works:')
    simulate_two_number_sequence(generate_sequence, n)
    print(f'An empirical distribution (n={n}) based on how `simulate_one_sequence_as_claimed` works:')
    simulate_two_number_sequence(simulate_one_sequence_as_claimed, n)

widgets.interactive(update_plot, n=slider)

_Type your answer here, replacing this text._

<!-- END QUESTION -->

---

## Reflection

In this activity, you practiced using fundamental probability rules to perform calculations. Additionally, you explored the Law of Large Numbers by conducting probability simulations and analyzing their outcomes. You used computational tools to visualize results, interpret statistical trends, and understand how increasing the number of trials brings relative frequencies closer to the theoretical probabilities. By engaging in hands-on experimentation and reflection, you gained insight into randomness, convergence, and the importance of large sample sizes in probability theory.

---

## License

This content is licensed under the <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)</a>.

<img src="./by-nc-sa.png" width=100px>