In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("pre08.ipynb")

<table style="width: 100%;">
<tr style="background-color: transparent;">
<td width="100px"><img src="https://cs104williams.github.io/assets/cs104-logo.png" width="90px" style="text-align: center"/></td>
<td>
  <p style="margin-bottom: 0px; text-align: left; font-size: 18pt;"><strong>CSCI 104: Data Science and Computing for All</strong><br>
                Williams College<br>
                Fall 2024</p>
</td>
</tr>
</table>


# Prelab 8: Estimation

**Instructions**
- Before you begin, execute the cell at the TOP of the notebook to load the provided tests, as well as the following cell to set up the notebook by importing some helpful libraries. Each time you start your server, you will need to execute these cells again.  
- Be sure to consult your [Python Reference](https://cs104williams.github.io/assets/python-library-ref.html)!
- Complete this notebook by filling in the cells provided. 
- Please be sure to not re-assign variables throughout the notebook.  For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you thought you were passing previously.
- There are no hidden tests in prelabs.

<hr/>
<h2>Setup</h2>


In [None]:
# Run this cell to set up the notebook.
# These lines import the numpy, datascience, and cs104 libraries.

import numpy as np
from datascience import *
from cs104 import *
%matplotlib inline

<hr style="margin-bottom: 0px; padding:0; border: 2px solid #500082;"/>


## 1. Bootstrapping Snow's Study (30 pts)



<font color='#B1008E'>
    
##### Learning objectives
- Use the `bootstrap_statistic()` function to estimate uncertainty from a sampling process. 

</font>


In the last lab, we performed a hypothesis test to determine the cause of Cholera using a random subset of John Snow's original data from his natural experiment.  (Spoiler: it's the water!)  

Here are the first few observations in the sample.  Recall that we're treating each row as one individual whose water supply is from the given 'Water Company', either S&V or Lambeth.  The 'Cholera death' column is 1 if that person died during the study and 0 if that person did not.

In [None]:
snow_sample = Table().read_table('snow_data_small.csv')
snow_sample.show(5)

#### Part 1.1 S&V sample (5 pts)


As a first step, extract just the S&V water consumers from our `snow_sample`.  Create a new table called `s_and_v` with **only one** column, "Cholera death", that contains those values for all people who drink S&V-supplied water in `snow_sample`.

In [None]:
s_and_v = ...

# Sanity checks
s_and_v.show(5)

In [None]:
grader.check("p1.1")

#### Part 1.2 Chosen Statistic: Proportion of fatalities (5 pts)


Using this one sample, we will estimate a parameter of the entire population.  Specifically, we'll estimate the proportion of individuals in the S&V water district who died of Cholera.  (We'll stick to that one district; it's straightforward to look at the other district once you understand the steps.) 

Later, we will use confidence intervals to compute a range of values that reflects the uncertainty of our estimates.


Complete the function `proportion_fatalities()`. This function takes as input an array where 0 encodes survival and 1 encodes a fatality, as in the column 'Cholera death'. The function outputs the proportion of individuals who died of Cholera. 
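As a small illustration of the idea, here is a sketch on made-up data, separate from the Snow sample. The array `toy_outcomes` and the variable names below are hypothetical and chosen only for this example. It shows that, for an array of 0's and 1's, counting the 1's and dividing by the length gives the same value as taking the mean.

```python
import numpy as np

# Made-up data for illustration only: 1 = died, 0 = survived.
toy_outcomes = np.array([0, 1, 0, 0, 1, 0, 0, 0])

# Count the 1's and divide by the number of entries...
prop_by_count = np.count_nonzero(toy_outcomes) / len(toy_outcomes)

# ...which matches the mean of the 0/1 array.
prop_by_mean = np.mean(toy_outcomes)

print(prop_by_count, prop_by_mean)  # both are 0.25 (2 deaths out of 8)
```

Either approach works for any 0/1 array; which one you use in your own function is up to you.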

In [None]:
def proportion_fatalities(sample):
    """
    This function takes an array of 0's and 1's, as in the 'Cholera death' column, and 
    outputs the proportion of individuals who died of Cholera. 
    """
    prop_fatality = ...
    return prop_fatality

In [None]:
s_and_v_cholera_proportion = proportion_fatalities(s_and_v.column('Cholera death'))
print("Proportion of deaths is " + str(s_and_v_cholera_proportion))  # should be 0.036828

In [None]:
grader.check("p1.2")

#### Part 1.3 Bootstrap the Sample (5 pts)


Now, we'll use the `bootstrap_statistic()` function to obtain bootstrap resamples from `s_and_v` and estimate the uncertainty in our estimate of the parameter. 

Read the documentation for `bootstrap_statistic()` [here](https://www.cs.williams.edu/~cs104/auto/inference-library-ref.html). In Lab 8, you'll implement this function yourself, but for now, our goal is to understand what the function outputs and why it's important for data science. 
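To build intuition for what a bootstrap produces, the sketch below resamples a made-up array with replacement and recomputes a statistic on each resample. It uses only NumPy. The helper `toy_bootstrap`, its argument order, and the `seed` parameter are assumptions made for this illustration, and they need not match the signature or behavior of the course's `bootstrap_statistic()`; consult the linked documentation for that.

```python
import numpy as np

def toy_bootstrap(values, statistic, num_trials, seed=0):
    """Hypothetical helper (not the course library function).

    Resamples `values` with replacement `num_trials` times and returns
    an array holding `statistic` computed on each resample.
    """
    rng = np.random.default_rng(seed)
    n = len(values)
    results = np.empty(num_trials)
    for i in range(num_trials):
        # Each resample has the same size as the original sample.
        resample = rng.choice(values, size=n, replace=True)
        results[i] = statistic(resample)
    return results

# Made-up 0/1 data for illustration only.
toy_outcomes = np.array([0, 1, 0, 0, 1, 0, 0, 0, 0, 0])

# Five resampled proportions, one per trial.
toy_stats = toy_bootstrap(toy_outcomes, np.mean, 5)
print(toy_stats)
```

The output is one value of the statistic per trial, so the spread of those values gives a sense of how much the statistic varies from sample to sample.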

In [None]:
num_trials = 5 # we'll start with just a small number of resamples 
tiny_bootstrap = bootstrap_statistic(...,
                                     ...,
                                     num_trials)


tiny_bootstrap

In [None]:
grader.check("p1.3")

#### Part 1.4 Bootstrapped Proportions Distribution (5 pts)


Create 5000 bootstrap resamples using the same `bootstrap_statistic` function.  

We have provided code to plot a histogram of those proportions. 

*Note:* This might take a minute or two to run. 

In [None]:
bootstrap_proportions = ...

# No need to modify. Plots histogram of bootstrap resamples. 
Table().with_columns("Proportion fatalities", bootstrap_proportions).hist()

In [None]:
grader.check("p1.4")

#### Part 1.5 Confidence Interval (5 pts)


Using the array `bootstrap_proportions`, find the values at the two edges of the middle 95% of the bootstrapped proportion estimates. Store these in the `lower_bound` and `upper_bound` variables below.
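The sketch below illustrates what "the middle 95%" means, using made-up numbers rather than your bootstrap results. The array `toy_estimates` and the other names are hypothetical. It uses NumPy's `np.percentile`, which interpolates between values, so its results may differ slightly from a percentile helper that follows a different convention.

```python
import numpy as np

# Made-up estimates for illustration only (not bootstrap output).
rng = np.random.default_rng(0)
toy_estimates = rng.normal(loc=0.5, scale=0.1, size=1000)

# The middle 95% leaves 2.5% of the values in each tail.
toy_lower = np.percentile(toy_estimates, 2.5)
toy_upper = np.percentile(toy_estimates, 97.5)

# Fraction of the estimates that fall between the two bounds.
inside = np.mean((toy_estimates >= toy_lower) & (toy_estimates <= toy_upper))
print(toy_lower, toy_upper, inside)
```

The fraction inside the bounds should be very close to 0.95, which is a quick way to check that the two percentiles were chosen correctly.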

In [None]:
lower_bound = ...
lower_bound

In [None]:
upper_bound = ...
upper_bound

In [None]:
grader.check("p1.5")

#### Part 1.6 True Parameter (5 pts)


The following cell plots your bootstrapped proportions again, this time with a yellow line delineating your 95% confidence interval, a red point at the original sample's proportion of Cholera deaths, and a green square point at the proportion for the whole population of S&V water drinkers.

In [None]:
plot = Table().with_columns("Proportions", bootstrap_proportions).hist()

# 95% confidence interval
plot.interval(lower_bound, upper_bound)

# Red dot: proportion for the original sample of S&V customers
plot.dot(s_and_v_cholera_proportion)

# Green square: proportion for *all* of S&V customers
plot.square(0.0315)

Examine the plot above and indicate whether each of these statements is true or false by assigning `True` or `False` to the four variables in the next cell.

- **Statement 1**: The original sample's proportion of Cholera deaths is within your confidence interval.
- **Statement 2**: There is a 5% chance that the population's proportion of Cholera deaths is lower than your confidence interval's lower bound. 
- **Statement 3**: Starting with a larger initial sample will make your confidence interval narrower.
- **Statement 4**: Creating more bootstrap resamples will make your confidence interval narrower.

You can, of course, get the right answers by trial and error, but try to think about the statements carefully before answering!

In [None]:
statement_1 = ...
statement_2 = ...
statement_3 = ...
statement_4 = ...

In [None]:
grader.check("p1.6")

<hr class="m-0" style="border: 3px solid #500082;"/>

# You're Done!
Follow these steps to submit your work:
* Run the tests and verify that they pass as you expect. 
* Choose **Save Notebook** from the **File** menu.
* **Run the final cell** and click the link below to download the zip file. 

Once you have downloaded that file, go to [Gradescope](https://www.gradescope.com/) and submit the zip file to 
the corresponding assignment. For Prelab N, the assignment will be called "Prelab N Autograder".

Once you have submitted, your Gradescope assignment should show you passing all the tests you passed in your assignment notebook.


## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)