In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("pre06.ipynb")

<table style="width: 100%;">
<tr style="background-color: transparent;">
<td width="100px"><img src="https://cs104williams.github.io/assets/cs104-logo.png" width="90px" style="text-align: center"/></td>
<td>
  <p style="margin-bottom: 0px; text-align: left; font-size: 18pt;"><strong>CSCI 104: Data Science and Computing for All</strong><br>
                Williams College<br>
                Fall 2023</p>
</td>
</tr>
</table>


# Prelab 6: Probability and Simulation

**Instructions**
- Before you begin, execute the cell at the TOP of the notebook to load the provided tests, as well as the following cell to set up the notebook by importing some helpful libraries. Each time you start your server, you will need to execute these cells again.
- Be sure to consult your [Python Reference](https://cs104williams.github.io/assets/python-library-ref.html)!
- Complete this notebook by filling in the cells provided. 
- Please do not reassign variables anywhere in the notebook. For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you thought you were passing previously.
- There are no hidden tests in prelabs.

<hr/>
<h2>Setup</h2>


In [None]:
# Run this cell to set up the notebook.
# These lines import the numpy, datascience, and cs104 libraries.

import numpy as np
from datascience import *
from cs104 import *
%matplotlib inline

<hr style="margin-bottom: 0px; padding:0; border: 2px solid #500082;"/>


## 1. Probability and Randomness (20 pts)



<font color='#B1008E'>
    
##### Learning objectives
- Reason rigorously about probability of outcomes involving random chance.
- Practice the addition and multiplication rules for probabilities.
- Use a loop to simulate and save outcomes dictated by random chance.
</font>

You may wish to review [Section 9.5](https://inferentialthinking.com/chapters/09/5/Finding_Probabilities.html) to help gain intuition for probabilities. Good ways to approach probability calculations include:

- Thinking one trial at a time: What does the first one have to be? Then what does the next one have to be?
- Breaking up the event into distinct ways in which it can happen.
- Seeing if it is easier to find the chance that the event does not happen.
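
The three approaches above can be sketched on a separate, hypothetical example: a fair six-sided die rolled independently. The die, the number of rolls, and all variable names are assumptions made for illustration, and the sketch uses only Python's standard `fractions` module so that the results stay exact.

```python
from fractions import Fraction

# Hypothetical setup: a fair six-sided die, rolled independently.
p_six = Fraction(1, 6)      # chance of a six on any one roll
p_not_six = 1 - p_six       # chance of anything else

# One trial at a time (multiplication rule): the first six appears on roll 3,
# so rolls 1 and 2 are not sixes and roll 3 is a six.
first_six_on_third = p_not_six * p_not_six * p_six

# Distinct ways (addition rule): exactly one six in 2 rolls happens as
# (six, not six) or (not six, six); the two ways cannot both occur.
exactly_one_six_in_two = p_six * p_not_six + p_not_six * p_six

# Complement: "at least one six in 4 rolls" is the opposite of "no sixes".
at_least_one_six = 1 - p_not_six ** 4

print(first_six_on_third, exactly_one_six_in_two, at_least_one_six)
```

Working with `Fraction` rather than floats makes it easy to check each step by hand.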

**Running example:** An Olympic archer hits the bull’s-eye 80% of the time. Assume each shot is independent of the others. If the archer shoots 6 arrows, what is the probability of each result described below?

#### Part 1.1 (5 pts)


The first bull’s-eye comes on the third arrow.

In [None]:
first_bullseye_on_third = ...
first_bullseye_on_third

In [None]:
grader.check("p1.1")

#### Part 1.2 (5 pts)


The bull’s-eye is hit all six times.

In [None]:
bullseye_hit_all_six = ...
bullseye_hit_all_six

In [None]:
grader.check("p1.2")

#### Part 1.3 (5 pts)


The bull’s-eye is missed at least once.  (Think about how this relates to the previous part!)

In [None]:
bullseye_missed_atleast_once = ...
bullseye_missed_atleast_once

In [None]:
grader.check("p1.3")

#### Part 1.4 (5 pts)


Use the [simulate](http://cs.williams.edu/~cs104/auto/inference-library-ref.html#simulate) function in our inference library to create an empirical distribution of the number of bull’s-eyes hit when our archer shoots 6 arrows.  We provide a function, `hits_in_6`, that produces one outcome.  You should use 10,000 trials.
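
As a rough illustration of what a simulation loop does, here is a minimal sketch on a different, hypothetical experiment: counting heads in 4 fair coin flips. The helper `repeat_trials` is a stand-in written for this sketch and is not the library's `simulate` function, whose exact signature you should look up in the reference. The sketch uses only Python's standard `random` module.

```python
import random

def heads_in_4():
    """Hypothetical one-outcome function: number of heads in 4 fair coin flips."""
    return sum(random.random() < 0.5 for _ in range(4))

def repeat_trials(make_one_outcome, num_trials):
    """Stand-in for a simulate-style helper (an assumption, not the course
    library): call the function num_trials times and collect the results."""
    return [make_one_outcome() for _ in range(num_trials)]

random.seed(104)  # fixed seed so the sketch is reproducible
outcomes = repeat_trials(heads_in_4, 10_000)

# Empirical proportion of trials with all 4 heads, next to the exact value.
empirical_all_heads = outcomes.count(4) / len(outcomes)
exact_all_heads = 0.5 ** 4
print(empirical_all_heads, exact_all_heads)
```

With many trials, the empirical proportion should land close to the exact probability, which is the same comparison you can make between your histogram and your computed answers.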

In [None]:
def hits_in_6():
    """
    Return the number of bullseyes hit in a random sample of 6 shots.
    """
    hit_miss_proportions = make_array(0.8, 0.2)
    return 6 * sample_proportions(6, hit_miss_proportions).item(0)

hits = ...

Table().with_columns('Hits', hits).hist(bins = np.arange(0.5, 7.5, 1))

In [None]:
grader.check("p1.4")

Here is the histogram again, with a line marking the percentage of samples with 6 hits predicted by your `bullseye_hit_all_six` probability.  Does your histogram agree with that value?

In [None]:
plot = Table().with_columns('Hits', hits).hist(bins = np.arange(0.5, 7.5, 1))
plot.line(y = bullseye_hit_all_six, color='yellow', linestyle='--')

<hr style="margin-bottom: 0px; padding:0; border: 2px solid #500082;"/>


## 2. Scrabble Tiles (35 pts)



<font color='#B1008E'>
    
##### Learning objectives
- Write functions that compute probabilities for outcomes dictated by random chance.
- Compute the empirical distribution of a test statistic for a model.
</font>

In this question, we'll simulate properties of a sample of tiles from [Scrabble](https://en.wikipedia.org/wiki/Scrabble).  In this game, players randomly draw tiles that each show a letter and a point value, and then attempt to create interconnected words like those in a crossword puzzle.

The next two cells create a table containing the letter and point value for every tile in Scrabble and an array of vowels.

In [None]:
scrabble_tiles = Table().read_table("tiles.csv")
scrabble_tiles.sample(10)

In [None]:
vowels_array = make_array("A", "E", "I", "O", "U") #we won't count Y as a vowel for the purposes of this prelab

#### Part 2.1 (5 pts)


Write a function to compute the **exact** probability of randomly drawing a particular letter from the full collection of tiles.
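
The idea of an exact probability as "matching tiles divided by total tiles" can be sketched on a small, hypothetical bag. The bag contents and the function name `chance_of_letter` are assumptions for illustration; the sketch uses plain Python lists rather than the `scrabble_tiles` table.

```python
from fractions import Fraction

# Hypothetical bag of 10 tiles (not the real Scrabble distribution).
toy_bag = ["A", "A", "A", "B", "C", "E", "E", "E", "E", "Z"]

def chance_of_letter(bag, letter):
    """Exact chance of drawing `letter` in one random draw from `bag`."""
    return Fraction(bag.count(letter), len(bag))

print(chance_of_letter(toy_bag, "E"))  # 4 of the 10 tiles are E
```

Your own function will need to do the equivalent counting with table operations.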

In [None]:
def probability_of_drawing_letter(letter):
    ...

In [None]:
# Example call of the function 
probability_of_drawing_letter("A")

In [None]:
grader.check("p2.1")

#### Part 2.2 (5 pts)


Set `probability_of_drawing_vowel` to the probability that a tile drawn from the full set of tiles is a vowel. 

*Hint:* We gave you the `vowels_array` at the beginning of this question.

In [None]:
# You may use more than one line if you like.
probability_of_drawing_vowel = ...
probability_of_drawing_vowel

In [None]:
grader.check("p2.2")

#### Part 2.3 (5 pts)


Write a function to compute the probability of drawing all vowels when you draw `n` tiles.

*Hint:* The probability you just computed may be helpful here. 

In [None]:
def probability_of_drawing_all_vowels(n):
    ...

# Example call: drawing 5 tiles 
probability_of_drawing_all_vowels(5)

In [None]:
grader.check("p2.3")

The following plots the probability of drawing all vowels as you increase `n`.  You may run it once you complete the above cell.

In [None]:
probabilities_of_all_vowels = Table(["Num Tiles", "Probability of All Vowels"])
for i in np.arange(0,16):
    probabilities_of_all_vowels.append([i, probability_of_drawing_all_vowels(i)])
probabilities_of_all_vowels.scatter("Num Tiles")

#### Part 2.4 (5 pts)


For the last parts of this question, we will compute an empirical distribution for the proportion of letter E's in samples of `n` tiles drawn randomly from the full set of tiles.  As a first step, we'll need to be able to draw `n` random tiles from `scrabble_tiles`.  We'll use [table.sample](https://www.cs.williams.edu/~cs104/auto/python-library-ref.html#sample).  Notice that we don't sample with replacement: once we select a tile, we don't want to select the same tile again.  Run the following cell a few times to verify our sample works as expected.

In [None]:
scrabble_tiles.sample(5, with_replacement=False)

Complete the following function, `sample_letters`. This function creates a random tile sample of size `sample_size` and returns the letters from that sample as an array.
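
To see why the choice between sampling with and without replacement matters, here is a minimal sketch using Python's standard `random` module instead of the table's `sample` method. The tile labels are hypothetical and are numbered only so that repeated draws of the same physical tile are easy to spot.

```python
import random

# Hypothetical bag of 6 distinct tiles, labeled so duplicates are easy to spot.
toy_tiles = ["A1", "A2", "E1", "E2", "T1", "Z1"]

random.seed(104)  # fixed seed so the sketch is reproducible

# Without replacement: each physical tile can appear at most once.
without = random.sample(toy_tiles, 4)

# With replacement: the same tile may be drawn more than once.
with_repl = random.choices(toy_tiles, k=4)

print(without, len(set(without)))      # always 4 distinct tiles
print(with_repl, len(set(with_repl)))  # may be fewer than 4 distinct tiles
```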

In [None]:
def sample_letters(sample_size):
    ...

In [None]:
grader.check("p2.4")

#### Part 2.5 (5 pts)


Now, complete the following function `proportion_e`. This function takes `sample`, an array of letters, as input
and returns the proportion of E's in that sample.

In [None]:
def proportion_e(sample):
    ...

In [None]:
grader.check("p2.5")

#### Part 2.6 (5 pts)


Let's create an empirical distribution for the proportion of E's in `num_trials` samples, each of size `sample_size`.  

*Hint:*
This function can be written in one line using `simulate_sample_statistic` and the functions you created above.
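
The general pattern behind a sample-statistic simulation is: draw a sample, compute a statistic from it, and repeat. The sketch below shows that pattern on a hypothetical jar of marbles using only the standard library. The helper `repeat_statistic` and its parameter order are assumptions made for this sketch; it is not `simulate_sample_statistic`, so check the library reference for that function's actual arguments.

```python
import random

# Hypothetical population: 20 marbles, 5 of them red.
marbles = ["red"] * 5 + ["blue"] * 15

def draw_marbles(sample_size):
    """Draw sample_size marbles without replacement."""
    return random.sample(marbles, sample_size)

def proportion_red(sample):
    """Proportion of red marbles in a sample."""
    return sample.count("red") / len(sample)

def repeat_statistic(make_sample, statistic, sample_size, num_trials):
    """Stand-in helper (not the course library): draw a sample, compute its
    statistic, and repeat num_trials times."""
    return [statistic(make_sample(sample_size)) for _ in range(num_trials)]

random.seed(104)  # fixed seed so the sketch is reproducible
stats = repeat_statistic(draw_marbles, proportion_red, 10, 1000)
print(stats[:5])
```

Passing the sampling function and the statistic function as arguments keeps the repetition logic separate from the details of any one experiment.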

In [None]:
def empirical_distribution_for_proprtion_e(sample_size, num_trials):
    ...

empirical_distribution_for_proprtion_e(15, 5)

Here is one example use of the `empirical_distribution_for_proprtion_e` function you just wrote that uses our plotting code to draw the histogram of the computed distribution.  

You do not need to change anything in the cell below. Use it to help debug your `empirical_distribution_for_proprtion_e` function. 

In [None]:
def plot_empirical_distribution_for_proportion_e(sample_size, num_trials):
    statistics = empirical_distribution_for_proprtion_e(sample_size, num_trials)
    
    # Create a histogram of our sample statistic.
    Table().with_column("Proportion of E's", statistics).hist(bins=np.arange(0,1.01,1/sample_size))
    
plot_empirical_distribution_for_proportion_e(15, 10000)

In [None]:
grader.check("p2.6")

Once your function is complete, you can run the following cell, which creates an interactive visualization where you can view the distributions for different choices of sample size and number of trials.

In [None]:
all_letters = np.unique(scrabble_tiles.sort("Letter").column("Letter"))
interact(plot_empirical_distribution_for_proportion_e, sample_size = Slider(1,50,1), 
        num_trials=Choice(make_array(10,20,50,100,1000,10000)))

#### Part 2.7 (5 pts)


Suppose you don't know how many of each letter are in your collection of 98 tiles.  We can use the empirical distribution of the proportion of E's found in randomly selected samples of size 15 to estimate the number of E's in the collection.

We start by gathering our sample statistic (proportion of sample that is E) for 10000 samples:

In [None]:
# No work necessary here. Just run this cell.
sample_size = 15
sample_statistics = empirical_distribution_for_proprtion_e(sample_size, 10000)
Table().with_column("Proportion of E's in samples of size 15", sample_statistics).hist(bins=np.arange(0,1,1/15))

If our samples are truly random, the average proportion of E's in our samples will approximate the proportion of E's in the population.  Here is that average proportion in our **samples**.

In [None]:
np.mean(sample_statistics)

Use this observation to estimate the number of E's in our collection, namely the whole **population** of 98 tiles.  Make sure your answer is a whole number, since we can't have fractional tiles.
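
The step from an average sample proportion to a population count can be sketched on a hypothetical jar of beads. The jar, its size, and every name below are assumptions for illustration; the sketch uses only the standard library and does not touch the tile data.

```python
import random

# Hypothetical population: a jar of 40 beads, 6 of them green.
jar = ["green"] * 6 + ["white"] * 34

random.seed(104)  # fixed seed so the sketch is reproducible
beads_per_sample = 8
proportions = []
for _ in range(5000):
    sample = random.sample(jar, beads_per_sample)   # without replacement
    proportions.append(sample.count("green") / beads_per_sample)

mean_proportion = sum(proportions) / len(proportions)

# Scale the average sample proportion up to the population size,
# then round because a count of beads must be a whole number.
estimated_green = round(mean_proportion * len(jar))
print(mean_proportion, estimated_green)
```

The estimate relies only on the sample proportions and the known population size, which is the same information available to you in this part.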

In [None]:
estimate_of_es = ...
estimate_of_es

In [None]:
grader.check("p2.7")

<hr class="m-0" style="border: 3px solid #500082;"/>

# You're Done!
Follow these steps to submit your work:
* Run the tests and verify that they pass as you expect. 
* Choose **Save Notebook** from the **File** menu.
* **Run the final cell** and click the link below to download the zip file. 

Once you have downloaded that file, go to [Gradescope](https://www.gradescope.com/) and submit the zip file to 
the corresponding assignment. For Prelab N, the assignment will be called "Prelab N Autograder".

Once you have submitted, your Gradescope assignment should show you passing all the tests you passed in your assignment notebook.


## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)