In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab06.ipynb")

<img src="./ccsf.png" alt="CCSF Logo" width=200px style="margin:0px -5px">

# Lab 06: Simulation and Chance

## References

* [Sections 9.3 - 9.4 of the Textbook](https://ccsf-math-108.github.io/textbook/chapters/09/4/Monty_Hall_Problem.html)
* [Sections 10.0 - 10.4 of the Textbook](https://ccsf-math-108.github.io/textbook/chapters/10/Sampling_and_Empirical_Distributions.html)
* [datascience Documentation](https://datascience.readthedocs.io/)

---

## Lab Assignment Reminders

- 🚨 Make sure to run the code cell at the top of this notebook that starts with `# Initialize Otter` to load the auto-grader.
- Your tasks are categorized as auto-graded (📍) and manually graded (📍🔎):
    - **For all auto-graded tasks:**
        - Replace the `...` in the provided code cell with your own code.
        - Run the `grader.check` code cell to execute tests on your code.
        - There are no hidden auto-grader tests in the lab assignments. This means if you pass the tests, you can assume you've completed the task successfully.
    - **For all manually graded tasks:**
        - You may need to provide your own response to the provided prompt. Replace the template text "_Type your answer here, replacing this text._" with your own words.
        - You might need to produce a graphic or another output using code. Replace the `...` in the code cell to generate the image, table, etc.
        - In either case, check your response with a classmate, a tutor, or the instructor before moving on.
- In this assignment and all future ones, please **do not re-assign variables**! _For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you may fail tests that you thought you were passing previously!_
- You may [submit](#Submit-Your-Assignment-to-Canvas) this assignment as many times as you want before the deadline. Your instructor will score the last version you submit once the deadline has passed.
- **Collaborating on labs is encouraged!** You should rarely remain stuck for more than a few minutes on questions in labs, so ask an instructor or classmate for help. (Explaining things is beneficial, too -- the best way to solidify your knowledge of a subject is to explain it.) However, please don't just share answers.

---

## Configure the Notebook

Run the following cell to configure this Notebook.

In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

---

## Probability

### Electric Board

An electronic board with lights contains 10 equal-sized zones labeled with values from 1 to 10.

<img src="./number_grid.png" alt="a 2x5 grid of numbers 1 - 10" width=40%></img>

The board illuminates a sequence of two numbers at random. First, one number lights up in pink; then a second number, different from the first, lights up in orange. For example, the following image shows 6 lit up in pink followed by 5 lit up in orange.

<img src="./number_grid_color.png" alt="a 2x5 grid of numbers 1 - 10 where 6 is pink and 5 is orange" width=40%></img>

After the sequence of two numbers is shown, the board resets.

<img src="./number_grid.png" alt="a 2x5 grid of numbers 1 - 10" width=40%></img>

Use this situation to practice with the basic ideas of probability that you've learned so far.

---

### A Definition of Probability

Given an experiment where all possible outcomes are equally likely, the probability of an event occurring is calculated by dividing the number of ways that event can occur by the total number of possible outcomes.
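
As a small illustration with a different experiment (one roll of a fair six-sided die, which is not part of this lab's board), the definition can be sketched as a count of favorable outcomes divided by a count of all outcomes. The names `outcomes` and `even_outcomes` are made up for this example.

```python
from fractions import Fraction

# Hypothetical experiment: one roll of a fair six-sided die.
outcomes = [1, 2, 3, 4, 5, 6]

# Event of interest: the roll is even.
even_outcomes = [face for face in outcomes if face % 2 == 0]

# Probability = (ways the event can occur) / (total equally likely outcomes).
chance_even = Fraction(len(even_outcomes), len(outcomes))
print(chance_even, float(chance_even))
```

Using `Fraction` keeps the result exact; a plain division such as `3 / 6` would work just as well for the tasks below.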

---

### Task 01 📍

A sequence of two numbers is illuminated on the board as described above. What is the probability that the first number in the sequence is 6?

Assign `chance_6` to a value or expression that represents this probability value. The value should be a number between `0` and `1`, inclusive.

In [None]:
chance_6 = ...
chance_6

In [None]:
grader.check("task_01")

---

### The Multiplication Rule

To calculate the probability of one event happening and then another event happening, multiply the probability of the first event by the probability of the second event given that the first event happened. This is generally referred to as the multiplication rule.
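
Here is a minimal sketch of the multiplication rule for an unrelated, made-up experiment: drawing two marbles without replacement from a bag of 3 red and 2 blue marbles. The bag and all variable names are assumptions chosen only for illustration.

```python
from fractions import Fraction

# Hypothetical bag: 3 red marbles and 2 blue marbles, drawn without replacement.
red, blue = 3, 2
total = red + blue

# Chance that the first marble is red.
chance_first_red = Fraction(red, total)

# Chance that the second marble is red, given that the first one was red
# (one red marble, and so one marble overall, has been removed).
chance_second_red_given_first_red = Fraction(red - 1, total - 1)

# Multiplication rule: P(A and then B) = P(A) * P(B given A).
chance_both_red = chance_first_red * chance_second_red_given_first_red
print(chance_both_red)
```

Notice that the second probability is computed on a smaller bag, because the first draw changes what is left to choose from.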

---

### Task 02 📍

Consider the sequence of two numbers illuminated on this board as described above (first the 6, and then the 5). What is the probability that this sequence (6, 5) occurs?

Assign `chance_65` to a value or expression that represents this probability value. The value should be a number between `0` and `1`, inclusive.

In [None]:
chance_65 = ...
chance_65

In [None]:
grader.check("task_02")

---

### Task 03 📍

Suppose another sequence of two numbers is illuminated on this board. What is the probability that the sequence is (6, 6)?

Assign `chance_66` to a value or expression that represents this probability value. The value should be a number between `0` and `1`, inclusive.

In [None]:
chance_66 = ...
chance_66

In [None]:
grader.check("task_03")

---

### Task 04 📍

If all possible sequences are equally likely, what is the chance that any sequence of two numbers (where the two numbers are different) is illuminated on the board?

Assign `chance_of_any_sequence` to a value or expression that represents this probability value. The value should be a number between `0` and `1`, inclusive.

In [None]:
chance_of_any_sequence = ...
chance_of_any_sequence

In [None]:
grader.check("task_04")

---

### The Addition Rule for Disjoint Events

To calculate the probability of one event or another event happening, where the two events are disjoint (or mutually exclusive of one another), we calculate the probability of each event happening and then add the values. This is called the addition rule for disjoint events.
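
The rule can be sketched with a made-up example: the chance that one roll of a fair six-sided die shows a 1 or a 6. These two events are disjoint because a single roll cannot show both faces. The sketch also cross-checks the sum by counting outcomes directly; all names are hypothetical.

```python
from fractions import Fraction

# Hypothetical experiment: one roll of a fair six-sided die.
outcomes = [1, 2, 3, 4, 5, 6]

# Two disjoint events: "the roll is 1" and "the roll is 6".
chance_one = Fraction(1, len(outcomes))
chance_six = Fraction(1, len(outcomes))

# Addition rule for disjoint events: P(A or B) = P(A) + P(B).
chance_one_or_six = chance_one + chance_six

# Cross-check by counting the outcomes that satisfy either event.
favorable = [face for face in outcomes if face in (1, 6)]
chance_by_counting = Fraction(len(favorable), len(outcomes))

print(chance_one_or_six, chance_by_counting)
```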

---

### Task 05 📍

Suppose another sequence of two numbers is illuminated on this board. What is the probability that the sequence is either (6, 5) or (5, 6)?

Assign `chance_65_56` to a value or expression that represents this probability value. The value should be a number between `0` and `1`, inclusive.

In [None]:
chance_65_56 = ...
chance_65_56

In [None]:
grader.check("task_05")

---

### The Complement Rule

One way to determine the probability of an event not occurring is to subtract the probability that it will occur from 1. This is generally referred to as the complement rule.
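
The complement rule is especially handy for "at least one" questions. As a hedged sketch with a made-up experiment, consider tossing a fair coin three times and asking for the chance of at least one head. The sketch assumes the tosses do not affect each other, so the multiplication rule gives the chance of no heads at all.

```python
from fractions import Fraction

# Hypothetical experiment: three tosses of a fair coin.
num_tosses = 3
chance_tails = Fraction(1, 2)

# Chance that every toss is tails (multiplication rule, applied num_tosses times).
chance_no_heads = chance_tails ** num_tosses

# Complement rule: P(at least one head) = 1 - P(no heads).
chance_at_least_one_head = 1 - chance_no_heads
print(chance_at_least_one_head)
```

Computing the complement avoids having to add up the chances of exactly one, exactly two, and exactly three heads.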

---

### Task 06 📍

Twenty sequences of two numbers are illuminated on the board. What is the probability that at least one of the sequences is (5, 6)?

Assign `chance_at_least_one_56` to a value or expression that represents this probability value. The value should be a number between `0` and `1`, inclusive.

In [None]:
chance_not_56 = ...
chance_at_least_one_56 = ...
chance_at_least_one_56

In [None]:
grader.check("task_06")

---

## Distributions

You recently learned about probability and empirical distributions. As a reminder:

* **Probability distributions** describe the theoretical likelihood of different outcomes in a random process. They are based on mathematical models and assumptions.
* **Empirical distributions** are derived from actual observed data. Instead of being based on theoretical probabilities, they reflect the frequencies of observed occurrences in a repeated experiment.
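
To make the contrast concrete, the sketch below compares the two kinds of distribution for a made-up experiment, rolling a fair six-sided die. It uses only Python's standard library; the seed, the number of rolls, and the variable names are arbitrary choices for illustration.

```python
import random
from collections import Counter
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]

# Probability distribution: the theoretical chance of each face of a fair die.
probability_distribution = {face: Fraction(1, 6) for face in faces}

# Empirical distribution: observed proportions from simulated rolls.
random.seed(0)  # arbitrary seed so the run can be repeated
num_rolls = 1000
rolls = [random.choice(faces) for _ in range(num_rolls)]
counts = Counter(rolls)
empirical_distribution = {face: counts[face] / num_rolls for face in faces}

for face in faces:
    print(face, float(probability_distribution[face]), empirical_distribution[face])
```

The empirical proportions will usually be close to, but not exactly equal to, the theoretical value of 1/6, and they tend to get closer as `num_rolls` grows.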

---

## Sampling Basketball Data

We will now move on to sampling, which we discussed in more depth in this week's lectures. We'll guide you through this code, but if you wish to read more about different kinds of samples before attempting these tasks, you can check out [Chapter 10 of the textbook](https://ccsf-math-108.github.io/textbook/chapters/10/Sampling_and_Empirical_Distributions.html).

We will be sampling from a data set that contains information about NBA players, including their ages and salaries. Run the cell below to load the player and salary data.

In [None]:
player_data = Table.read_table("player_data.csv")
salary_data = Table.read_table("salary_data.csv")
full_data = salary_data.join("PlayerName", player_data, "Name")

# The show method immediately displays the contents of a table.
# This way, we can display the top of several tables using a single cell.
player_data.show(3)
salary_data.show(3)
full_data.show(3)

Rather than getting data on every player (as in the tables loaded above), imagine that we had gotten data on only a smaller subset of the players. For 492 players, it's not so unreasonable to expect to see all the data, but usually we aren't so lucky. 

If we want to make estimates about a certain numerical property of the population (known as a parameter, e.g. the population mean or median), we may have to come up with these estimates based only on a smaller sample, using the corresponding sample statistic. Whether these estimates are useful or not often depends on how the sample was gathered. We have prepared some example sample datasets to see how they compare to the full NBA dataset. Later we'll ask you to create your own samples to see how they behave.

To save typing and increase the clarity of your code, we will package the analysis code into a few functions. This will be useful in the rest of the lab as we will repeatedly need to create histograms and collect summary statistics from that data.

We've defined the `histograms` function below, which takes a table with columns `Age` and `Salary` and draws a histogram for each one. It uses bin widths of 1 year for `Age` and $1,000,000 for `Salary`.

In [None]:
def histograms(t):
    ages = t.column('Age')
    salaries = t.column('Salary')/1000000
    t1 = t.drop('Salary').with_column('Salary', salaries)
    age_bins = np.arange(min(ages), max(ages) + 2, 1) 
    salary_bins = np.arange(min(salaries), max(salaries) + 1, 1)
    t1.hist('Age', bins=age_bins, unit='year')
    plt.title('Age distribution')
    t1.hist('Salary', bins=salary_bins, unit='million dollars')
    plt.title('Salary distribution') 
    
histograms(full_data)
print('Two histograms should be displayed below')

### Task 07 📍

Create a function called `compute_statistics` that takes a table containing ages and salaries and:
- Draws a histogram of ages
- Draws a histogram of salaries
- Returns a two-element array containing the average age and average salary (in that order)

You can call the `histograms` function to draw the histograms! 

*Note:* More charts will be displayed when running the test cell. Please feel free to ignore the charts.


In [None]:
def compute_statistics(age_and_salary_data):
    ...
    age = ...
    salary = ...
    ...
    

full_stats = compute_statistics(full_data)
full_stats

In [None]:
grader.check("task_07")

### Simple random sampling

One way to create a sample from a population is to sample uniformly at random from the population. In this demonstration, we are thinking of the population as the 492 players in the original data set. In a **simple random sample (SRS) without replacement**, we ensure that each player is selected at most once. Imagine writing down each player's name on a card, putting the cards in a box, and shuffling the box. Then, pull out cards one by one and set them aside, stopping when the specified sample size is reached.
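
The card-drawing picture can be sketched in plain Python with the standard library's `random.sample`, which draws without replacement. The list of names below is invented for illustration and is not taken from the lab's data.

```python
import random

# Hypothetical population: one name per card (not taken from the lab's data).
cards = ["Ana", "Ben", "Chloe", "Dev", "Emi", "Farid", "Gia", "Hugo"]
sample_size = 3

# random.sample draws without replacement, so no card can be picked twice.
srs = random.sample(cards, sample_size)
print(srs)
```

Running this repeatedly gives different samples, but each one contains three distinct names.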



### Producing simple random samples

Sometimes, it’s useful to take random samples even when we have the data for the whole population. It helps us understand sampling accuracy.

### `sample`

The table method `sample` produces a random sample from the table. By default, it draws at random **with replacement** from the rows of a table. It takes in the sample size as its argument and returns a **table** with only the rows that were selected. 

Run the cell below to see an example call to `sample()` with a sample size of 5, with replacement. Because this is done with replacement, it is possible to see a player more than once in the resulting table.

In [None]:
# Just run this cell

salary_data.sample(5)

Pass the optional argument `with_replacement=False` to `sample()` to draw the sample without replacement.

Run the cell below to see an example call to `sample()` with a sample size of 5, without replacement. In this case, it is not possible to see a player more than once in the resulting table.

In [None]:
# Just run this cell

salary_data.sample(5, with_replacement=False)

### Task 08 📍🔎

<!-- BEGIN QUESTION -->

Produce a simple random sample of size 44 from `full_data`. Run your analysis (`compute_statistics`) on it. Run the cell several times to see how the histograms and statistics change across different samples. Then answer the questions below:

- How much does the average age change across samples? 
- What about the average salary?

**Note:** Since this task does not have an auto-grader, make sure to check your results with a classmate, a tutor, or the instructor before moving on.

_Type your answer here, replacing this text._

In [None]:
my_small_srswor_data = ...
my_small_stats = ...
my_small_stats

<!-- END QUESTION -->

### Task 09 📍🔎

<!-- BEGIN QUESTION -->

As in the previous question, analyze several simple random samples of size 100 from `full_data`. Then, answer the following questions:
- Do the histogram shapes seem to change more or less across samples of size 100 than across samples of size 44?
- Are the sample averages and histograms closer to their true values/shape for age or for salary?  What did you expect to see?

**Note:** Since this task does not have an auto-grader, make sure to check your results with a classmate, a tutor, or the instructor before moving on.

_Type your answer here, replacing this text._

In [None]:
my_large_srswor_data = ...
my_large_stats = ...
my_large_stats

<!-- END QUESTION -->

## Sampling from a Distribution

Suppose that you are studying a categorical variable from some population and you have a summary of the proportion of each value. For example, a sandwich shop has a record from last year that 70% of their customers purchased chips with their sandwich and 30% did not. In this hypothetical situation, the variable `bought_chips` has two values, `True` and `False`.

You could try to gather the original data or create a table that shows `True` on 70% of the rows and `False` on 30%. Instead, the `datascience` library has a function called `sample_proportions` that is helpful for generating random samples from a population like this. The basic format of the command is:

``` python 
sample_proportions(sample_size, probabilities)
```
where `sample_size` is the size of the sample and `probabilities` is an array of probabilities that reflect the chance of randomly selecting one of the variable's values.

If you wanted to simulate randomly sampling 3 customers from the sandwich shop customer population, you could define `sample_size = 3` and set up an array `population_arr = make_array(0.70, 0.30)` where the first value represents the chance of someone from the population purchasing chips with their sandwich. From that point, running `sample_proportions(sample_size, population_arr)` would create an array of 2 values. The first value in the array would be the proportion of `True` values (to represent the proportion of the 3 random customers that purchased chips) and the second value would be the proportion of `False` values (reflecting the customers that didn't purchase chips).
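
To picture what such a call computes, here is a plain-Python sketch of the idea. It is not the library's implementation: the helper name `sketch_sample_proportions` is hypothetical, and it relies only on the standard library's `random.choices`.

```python
import random

def sketch_sample_proportions(sample_size, probabilities):
    """Illustrative stand-in: draw `sample_size` category indices at random
    with the given probabilities, then return the proportion in each category."""
    categories = range(len(probabilities))
    draws = random.choices(categories, weights=probabilities, k=sample_size)
    return [draws.count(category) / sample_size for category in categories]

# Index 0 stands for "bought chips" and index 1 for "did not buy chips".
proportions = sketch_sample_proportions(3, [0.70, 0.30])
print(proportions)
```

With a sample size of 3, each proportion can only be 0, 1/3, 2/3, or 1, and the two proportions always add up to 1. This sketch returns a list, whereas `sample_proportions` returns an array.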

Run the following command to see how this works. You will get a variety of results, but you should see arrays with `1`, `0`, `0.66666667`, and `0.33333333` values.

In [None]:
sample_size = 3
population_arr = make_array(0.70, 0.30)
sample_proportions(sample_size, population_arr)

### Task 10 📍

Suppose that a typical lunch rush for the sandwich shop consists of 30 sandwich purchases. Use the `sample_proportions` function to simulate randomly selecting 30 customers from a population where it is assumed that 70% purchase chips with their sandwich and 30% do not. Run this simulation 10,000 times and create an array called `chips` that contains the proportion of the randomly sampled 30 customers that bought chips. Finally, make a histogram of the proportions.

Your histogram should look **similar** to the following one where the histogram is approximately centered on the population value of 70% (0.7).

<img src="proportion_bought_chips.png" alt="histogram of 10,000 random sample" width = 40%>

In [None]:
sample_size = ...
population_arr = ...

chips = make_array()

for ...:
    random_sample = ...
    prop_bought_chips = ...
    chips = np.append(chips, prop_bought_chips)

Table().with_column('bought_chips', ...).hist('bought_chips')
plt.title('Proportion Bought Chips (Sample Size=30)')
plt.show()

In [None]:
grader.check("task_10")

---

## Submit Your Assignment to Canvas

Follow these steps to submit your lab assignment:

1. **Check the Assignment Completion Requirements:** This assignment is scored as Complete or Incomplete. Make sure to check with your instructor about their requirements for a Complete score. 
2. **Run the Auto-Grader:** Ensure you have executed the code cell containing the command `grader.check_all()` to run all tests for auto-graded tasks marked with 📍. This command will execute all auto-grader tests sequentially.
3. **Complete Manually Graded Tasks:** Verify that you have responded to all the manually graded tasks marked with 📍🔎.
4. **Save Your Work:** In the notebook's Toolbar, go to `File -> Save Notebook` to save your work and create a checkpoint.
5. **Download the Notebook:** In the notebook's Toolbar, go to `File -> Download HTML` to download the HTML version (`.html`) of this notebook.
6. **Upload to Canvas:** On the Canvas Assignment page, click "Start Assignment" or "New Attempt" to upload the downloaded `.html` file.

---

## Attribution

This content is licensed under the <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)</a> and derived from the <a href="https://www.data8.org/">Data 8: The Foundations of Data Science</a> offered by the University of California, Berkeley.

<img src="./by-nc-sa.png" width=100px>

---

To double-check your work, the cell below will rerun all of the auto-grader tests.

In [None]:
grader.check_all()