# Homework 5: Permutation Testing, Percentiles, and Bootstrapping

## Due Tuesday, May 23rd at 11:59PM

Welcome to Homework 5! This homework will cover:

- Permutation Testing (see [CIT 12.0-12.2](https://inferentialthinking.com/chapters/12/Comparing_Two_Samples.html))
- Percentiles (see [CIT 13.1](https://inferentialthinking.com/chapters/13/1/Percentiles.html))
- Bootstrapping and Confidence Intervals (see [CIT 13.2](https://inferentialthinking.com/chapters/13/2/Bootstrap.html) and [CIT 13.3](https://inferentialthinking.com/chapters/13/3/Confidence_Intervals.html))

### Instructions

Remember to start early and submit often. You are given six slip days throughout the quarter to extend deadlines. See the syllabus for more details. With the exception of using slip days, late work will not be accepted unless you have made special arrangements with your instructor.

**Important**: For homeworks, the `otter` tests don't usually tell you that your answer is correct. More often, they help catch careless mistakes. It's up to you to ensure that your answer is correct. If you're not sure, ask someone (not for the answer, but for some guidance about your approach). These are great questions for office hours (see the schedule on the [Calendar](https://dsc10.com/calendar)) or Ed. Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged.

In [None]:
# Don't change this cell; just run it. 
import babypandas as bpd
import numpy as np

import warnings
warnings.simplefilter('ignore')

import matplotlib.pyplot as plt
plt.style.use('ggplot')

import otter
grader = otter.Notebook()

%reload_ext pandas_tutor

### Aside: Random Seeds 🌱

Throughout this homework – and in upcoming assignments – you'll notice that we frequently call the function `np.random.seed` with an integer argument. What exactly does that do?

To see for yourself, run the cell below several times.

In [None]:
np.random.seed(25)

print(np.random.multinomial(10, [0.5, 0.5]))
print(np.random.multinomial(10, [0.5, 0.5]))

`np.random.multinomial(10, [0.5, 0.5])` should return a random result each time it's called. However, each time you ran the cell above, you saw the same output – `[7 3]` and `[5 5]`.

**If you call `np.random.seed` in a cell, then every time you run the cell, you will see the same results, even if there are calls to "random" functions and methods in the cell.** Think of calling `np.random.seed` as "undoing" the randomness in the cell. If you change the `25` above to some other number, you may see something other than `[7 3]` and `[5 5]`, but each time you run the changed cell, you will still see the same result.
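
As a small illustration of the point above, the sketch below sets the same seed twice and checks that the "random" draws match. The seed `25` is reused from the cell above; the names `first_run`, `second_run`, and `same` are made up for this example.

```python
import numpy as np

# Set the seed, then make two "random" draws.
np.random.seed(25)
first_run = [np.random.multinomial(10, [0.5, 0.5]) for _ in range(2)]

# Reset to the same seed and make two draws again.
np.random.seed(25)
second_run = [np.random.multinomial(10, [0.5, 0.5]) for _ in range(2)]

# The two runs agree draw for draw.
same = all(np.array_equal(a, b) for a, b in zip(first_run, second_run))
print(same)  # True
```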

We use seeds to make it easier to autograde questions that rely on randomness, such as those that require you to bootstrap. When we use a particular seed in a question, we know exactly what the correct answer should be. When we don't, the range of correct answers is much wider, so it's harder to tell whether you actually answered the question correctly.

You're not responsible for understanding how seeds and random number generators work under the hood – all you need to know is that when you see a call to `np.random.seed`:
- Don't change it.
- Don't be alarmed if you see the same results each time you run that cell.

If you're interested in learning more, read [this Wikipedia article](https://en.wikipedia.org/wiki/Pseudorandom_number_generator).
<!-- It turns out that generating _truly_ random numbers is quite difficult. Instead, computers often generate _pseudorandom_ numbers, which are numbers that look like they were generated randomly (such as those in the cell above) but were actually generated by a complicated, non-random process. Each of these processes has a "key", or "seed," that determines the initial conditions for this non-random process. -->

## 1. Python vs Java 🐍☕

[Stack Overflow](https://stackoverflow.com/) is a forum where users can ask and answer questions about code. (If you've never used it before, you should use it as a resource!)

In this section, we'll work with a dataset of Stack Overflow questions from 2016 to 2020, downloaded from [Kaggle](https://www.kaggle.com/datasets/imoore/60k-stack-overflow-questions-with-quality-rate?resource=download&select=train.csv). The data has been cleaned and condensed for the purposes of this question.

The dataset contains six columns: `'Id'`, `'Title'`, `'Body'`, `'Tags'`, `'CreationDate'`, `'Ratings'`. Let's read it in and store it as a DataFrame called `stack_overflow`.

| Column | Description |
| --- | --- |
| `'Id'` | ID of the question |
| `'Title'` | Title of the question |
| `'Body'` | Description of the question |
| `'Tags'` | Tags used to categorize question |
| `'CreationDate'` | Date it was made |
| `'Ratings'` | Rating of the post |

In [None]:
stack_overflow = bpd.read_csv('data/stack_overflow.csv')
stack_overflow

**Question 1.1.**
Stack Overflow gives users the ability to upvote or downvote questions and answers, which means that each post has a rating. **We're interested in exploring whether Python posts have higher ratings than Java posts.**

To determine the language of a post, we can look in the `'Tags'` column. The values in the `'Tags'` column contain several tags that are used to categorize posts. One of the tags for each post will be the language the post is about – for instance, in row 2 above, one of the tags for the post is `'<python>'`.

Below, assign `python_java` to a DataFrame that only contains questions that used the tags `'<python>'` or `'<java>'`. Don't worry about capitalization as all the tags have already been lowercased. Note that these questions may include other tags as well, as long as they have at least one of `'<python>'` or `'<java>'`.

*Hints:* 
- Use `str.contains`.
- There is a tag called `'<javascript>'`; make sure that is not in your final DataFrame.
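
To see why the second hint matters, here is a minimal sketch using plain Python substring checks on made-up tag strings. The `in` operator stands in for `str.contains` here, and `example_tags` is hypothetical rather than taken from the dataset.

```python
# Hypothetical tag strings, written in the same '<tag>' style as the 'Tags' column.
example_tags = ['<java><spring>', '<javascript><html>', '<python><pandas>']

# Searching for 'java' alone also matches '<javascript>'.
loose = [tags for tags in example_tags if 'java' in tags]

# Including the angle brackets matches only the exact tag.
strict = [tags for tags in example_tags if '<java>' in tags]

print(loose)
print(strict)
```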

In [None]:
python_java = ...
python_java

In [None]:
grader.check("q1_1")

Upon further investigation, it looks like there are some posts that contain both the tags `'<python>'` and `'<java>'`. For the purposes of answering our question, we only want posts that have `'<python>'` or `'<java>'`, but not both. We've gone ahead and removed the posts that contained both tags and saved the resulting DataFrame to `fixed_python_java`, which you should use in Question 1.2.

In [None]:
# Don't change this cell; just run it.
fixed_python_java = python_java[python_java.get('Tags').str.contains('<python>') & python_java.get('Tags').str.contains('<java>') == False]
fixed_python_java

**Question 1.2.** Each post has associated with it many tags, but the only piece of information in the `'Tags'` column we're interested in is whether the language of the post is Python or Java.

Complete the implementation of the function `simplify_tag`, which takes in a string of tags associated with a single post and returns either `'Python'` or `'Java'`. Once you've done that, create a new DataFrame named `python_java_with_language` that has all the same columns as `fixed_python_java`, in the same order, with an additional column named `'CodingLanguage'` that contains the coding language associated with the post.

In [None]:
def simplify_tag(tag): 
    ...
    
python_java_with_language = ...
python_java_with_language

In [None]:
grader.check("q1_2")

**Question 1.3.** As a reminder, we're interested in exploring whether Python posts have higher ratings than Java posts. In order to do that, we need to have ratings in the form of numbers, but right now, the `'Ratings'` column contains categorical values. The [Kaggle](https://www.kaggle.com/datasets/imoore/60k-stack-overflow-questions-with-quality-rate?resource=download&select=train.csv) page describes these values as follows: 

- `'HQ'`: High-quality posts without a single edit.
- `'LQ_EDIT'`: Low-quality posts with a negative score, and multiple community edits. However, they still remain open after those changes.
- `'LQ_CLOSE'`: Low-quality posts that were closed by the community without a single edit.

We're going to assign a numerical rating of `1` to low-quality posts (`'LQ_CLOSE'` and `'LQ_EDIT'`) and a numerical rating of `2` to high-quality posts (`'HQ'`).

Complete the implementation of the function `numerical_rating`, which takes in a **string** from the `'Ratings'` column and returns the corresponding numerical rating as described above. Once you've implemented `numerical_rating`, use it to replace the values in the `'Ratings'` column of `python_java_with_language` with the corresponding numerical ratings.

*Note*: Our solution is only two lines long, and one of the lines involves defining a dictionary. You don't have to use a dictionary in your implementation, but doing so will help keep your code concise!

In [None]:
def numerical_rating(rating):
    ...

python_java_with_language = ...
python_java_with_language

In [None]:
grader.check("q1_3")

**Question 1.4.** Using the DataFrame `python_java_with_language`, calculate the difference between the **mean** `'Ratings'` of Python posts and Java posts. Assign your answer to `observed_difference`.

$$\text{observed difference} = \text{mean Python post rating} - \text{mean Java post rating}$$

In [None]:
observed_difference = ...
observed_difference

In [None]:
grader.check("q1_4")

**Question 1.5.** What does the number you obtained for `observed_difference` mean? Assign `q1_4` to 1, 2, 3, or 4, corresponding to the best explanation below.

1. In our sample, the mean Python post rating is higher than the mean Java post rating by about 14 percent.
1. In our sample, the mean Python post rating is higher than the mean Java post rating by about 14 points.
1. In our sample, the mean Java post rating is higher than the mean Python post rating by about 14 percent.
1. In our sample, the mean Java post rating is higher than the mean Python post rating by about 14 points.

In [None]:
q1_4 = ...

In [None]:
grader.check("q1_5")

Now we want to conduct a permutation test to see if it is by chance that the average rating for Python posts is higher than the average rating of Java posts in our sample, or if Python posts have a higher rating on average than Java posts.

- **Null Hypothesis**: The ratings of Python posts and Java posts come from the same distribution.  
- **Alternative Hypothesis**: The ratings of Python posts are higher on average than the ratings of Java posts.

**Question 1.6.** Assign `python_java_rating` to a DataFrame with only two columns, `'CodingLanguage'` and `'Ratings'`, since these are the only relevant columns in `python_java_with_language` for this permutation test.

In [None]:
python_java_rating = ...
python_java_rating

In [None]:
grader.check("q1_6")

**Question 1.7.** To perform the permutation test, 1000 times, create two random groups by shuffling the `'CodingLanguage'` column of `python_java_rating`. Don't change the `'Ratings'` column. For each pair of random groups, calculate the difference in mean ratings (Python minus Java) and store your 1000 differences in the `differences` array.  

*Note*: Since we are working with a relatively large data set, it may take **up to five minutes** to generate 1000 permutations. One suggestion is to make sure your code works correctly with fewer repetitions, say, 20, before using 1000 repetitions.
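
The shuffling step can be sketched on toy data as follows. This is only an illustration under assumptions: it uses plain NumPy arrays instead of a DataFrame, two made-up groups `'A'` and `'B'`, and 20 repetitions. The helper `diff_in_means` is hypothetical.

```python
import numpy as np

np.random.seed(0)  # only so this sketch is reproducible

# Toy data: five values labeled 'A' and five labeled 'B'.
labels = np.array(['A'] * 5 + ['B'] * 5)
values = np.array([2, 2, 1, 2, 2, 1, 1, 2, 1, 1])

def diff_in_means(labels, values):
    """Mean of group 'A' minus mean of group 'B'."""
    return values[labels == 'A'].mean() - values[labels == 'B'].mean()

observed = diff_in_means(labels, values)

differences = np.array([])
for _ in range(20):
    # Shuffle the labels only; the values stay in place.
    shuffled_labels = np.random.permutation(labels)
    differences = np.append(differences, diff_in_means(shuffled_labels, values))

print(observed)
print(differences[:5])
```

Shuffling the labels breaks any real association between group and value, so each simulated difference is one that could arise if the two groups came from the same distribution.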

In [None]:
differences = ...

# Just display the first ten differences.
differences[:10]

In [None]:
grader.check("q1_7")

**Question 1.8.** Compute a p-value for this hypothesis test and assign your answer to `p_val`. To decide whether to use `<=` or `>=` in the calculation of the p-value, think about whether larger values or smaller values of our test statistic favor the alternative hypothesis.

In [None]:
p_val = ...
p_val

In [None]:
grader.check("q1_8")

**Question 1.9.** Assign the variable `q1_9` to a **list** of all the true statements below.

1. We accept the null hypothesis at the 0.05 significance level.
2. We reject the null hypothesis at the 0.01 significance level.
3. We fail to reject the null hypothesis at the 0.01 significance level.
4. We accept the null hypothesis at the 0.01 significance level.
5. We fail to reject the null hypothesis at the 0.05 significance level.
6. We reject the null hypothesis at the 0.05 significance level.

In [None]:
q1_9 = ...

In [None]:
grader.check("q1_9")

**Question 1.10.** Suppose in this question you had shuffled the `'Ratings'` column instead and kept the `'CodingLanguage'` column in the same order. Assign `q1_10` to either 1, 2, 3, or 4, corresponding to the true statement below.


1. The new p-value from shuffling `'Ratings'` would be $1 - p$, where $p$ is the old p-value from shuffling `'CodingLanguage'` (i.e. your answer to Question 1.8).
2. We would need to change our null hypothesis in order to shuffle the `'Ratings'` column. 
3. There would be no difference in the conclusion of the test if we had shuffled the `'Ratings'` column instead.
4. The `'Ratings'` column cannot be shuffled because it contains numbers.

In [None]:
q1_10 = ...

In [None]:
grader.check("q1_10")

**Question 1.11.** Which of the following choices best describes the purpose of shuffling one of the columns in our dataset in a permutation test? Assign `q1_11` to either 1, 2, 3, or 4.

1. Shuffling mitigates noise in our data by generating new permutations of the data.
1. Shuffling is a special case of bootstrapping and allows us to produce interval estimates.
1. Shuffling allows us to generate new data under the null hypothesis, which we can use in testing our hypothesis.
1. Shuffling allows us to generate new data under the alternative hypothesis, which explains that the data come from different distributions.

In [None]:
q1_11 = ...

In [None]:
grader.check("q1_11")

## 2. Video Game Price Percentiles 🎮

Percentiles associate numbers in a dataset to their positions when the dataset is sorted in ascending order. You may be familiar with the idea of percentiles from height and weight measurements at the doctor's office, or from standardized test scores.

There are many different ways to precisely define a percentile. In [Lecture 19](https://dsc10.com/resources/lectures/lec19/lec19.html#Percentiles), we saw two different approaches:
- Using a mathematical definition (see the slide in Lecture 19 titled _[How to calculate percentiles using the mathematical definition](https://dsc10.com/resources/lectures/lec19/lec19.html#How-to-calculate-percentiles-using-the-mathematical-definition)_).
- Using `np.percentile`.

In Questions 2.1 through 2.4, we will use the mathematical definition, and in Question 2.5, we will use `np.percentile`.
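
The two approaches can give different answers on the same data. The sketch below compares them on a made-up array `toy`. The helper `percentile_math` is hypothetical and assumes $0 < p \leq 100$. It reads the mathematical definition as "sort the values, then take the $k$th one, counting from 1".

```python
import math
import numpy as np

def percentile_math(values, p):
    """p-th percentile via the mathematical definition (assumes 0 < p <= 100)."""
    sorted_values = np.sort(values)
    n = len(sorted_values)
    k = math.ceil(p * n / 100)   # smallest integer >= (p / 100) * n
    return sorted_values[k - 1]  # k counts from 1; Python indexes from 0

toy = np.array([4, 1, 3, 2])

print(percentile_math(toy, 50))  # 2
print(np.percentile(toy, 50))    # 2.5
```

With its default settings, `np.percentile` interpolates between neighboring values, so it can return a number that is not in the collection. The mathematical definition always returns one of the original values.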

The file `steam_games.csv` contains information about various games sold on the online video game store and distribution service [Steam](https://store.steampowered.com/). The data comes from [Kaggle](https://www.kaggle.com/datasets/thedevastator/get-your-game-on-metacritic-recommendations-and?resource=download).

The columns are:
- `'Game'`: The name of the video game
- `'ReleaseDate'`: The date it was released
- `'Metacritic'`: The review score it earned on [metacritic.com](https://www.metacritic.com/), or 0 if it's not reviewed
- `'RecommendationCount'`: Number of times it has been recommended in Steam 
- `'IsFree'`: Whether the game is free or not
- `'GenreIsXXX'`:  Whether the game belongs to genre XXX (many such columns)
- `'Price'`: The price of the game when it was initially released

Let's read in the data and explore the full set of column names.

In [None]:
steam = bpd.read_csv('data/steam_games.csv')
steam

**Question 2.1.** Pick the best choice of bins below for a histogram showing the distribution of `'Price'`, then create the histogram. Make sure all the data is included!

Use one of the following:

- `rating_bins = np.arange(0, 100, 10)`
- `rating_bins = np.arange(0, 200, 10)`
- `rating_bins = np.arange(0, 500, 10)`
- `rating_bins = np.arange(0, 700, 10)`

In [None]:
rating_bins = ...

# Now create a density histogram showing the distribution of 'Price' using rating_bins

Some games are marked as not free according to the `'IsFree'` column, yet their `'Price'` is listed as 0. This is likely due to the game having a page on the Steam website before it was readily available to be purchased. We'll say that a game is a *paid game* if both `'IsFree'` is `False` and `'Price'` is nonzero.

For the paid games only, let's compare the prices of indie games and non-indie games. An indie (short for independent) game is one that has been developed by an individual or small team as opposed to a large studio.

Run the cell below to create a DataFrame of paid games that are not indie games, and a sorted array of their prices.

In [None]:
paid_not_indie_df = steam[(steam.get('GenreIsIndie') == False) & 
                          (steam.get('IsFree') == False) & 
                          (steam.get('Price') != 0)]
paid_not_indie_prices  = paid_not_indie_df.get('Price')
paid_not_indie_prices = np.sort(paid_not_indie_prices)
paid_not_indie_prices

**Question 2.2.** Calculate the 63rd percentile of `paid_not_indie_prices` using the [mathematical definition](https://dsc10.com/resources/lectures/lec19/lec19.html#How-to-calculate-percentiles-using-the-mathematical-definition) given in Lecture 19. That is:
- Set `n` to be the number of elements in `paid_not_indie_prices`. 
- Set `k` to be the smallest integer greater than or equal to $\frac {63}{100} \cdot n$. 
- Assign the 63rd percentile of the array `paid_not_indie_prices` to `paid_not_indie_prices_63rd`.

You must use the variables provided for you when solving this problem. For this problem, **do not** use `np.percentile`.

In [None]:
n = ...
k = ...

# Don't change this line. In order to proceed, k needs to be stored as an int, not a float.
# This line is not changing the mathematical value of k, just how it is stored.
k = int(k)

paid_not_indie_prices_63rd = ...

In [None]:
grader.check("q2_2")

**Question 2.3.** Now we'll compare the value we just calculated with the 63rd percentile of the prices of non-free **indie** games.

Create a DataFrame called `paid_indie_df` containing only the paid games that belong to the indie genre. Calculate the 63rd percentile of prices for these games, using the same mathematical procedure. Assign to the variable `absolute_difference` the absolute difference in the 63rd percentile of prices for paid indie games and paid non-indie games.

As before, use the variables provided and **do not** use `np.percentile`.

*Hint*:  Remember to sort the prices using `np.sort` before computing percentiles.

In [None]:
paid_indie_df = ...

paid_indie_prices = ...

n_2 = ...
k_2 = ...

k_2 = int(k_2) # Don't change this.

paid_indie_prices_63rd = ...

absolute_difference = ...
absolute_difference

In [None]:
grader.check("q2_3")

**Question 2.4.** Say that UCSD is developing a new game where students will be able to create custom avatars of themselves and take classes at a virtual UCSD. The university decides to set the price of this new game at 15 dollars and then advertise it as a great bargain for an excellent education.

This game is one that is: 
- Not free.
- Still in early access (that is, not finished yet).
- Massively multiplayer.
- Not indie.

Consider a new collection of values, containing the prices of all the games that share these characteristics, plus one more, $15 for UCSD's game:

In [None]:
new_collection_df = steam[(steam.get('IsFree') == False) & 
                          (steam.get('Price') != 0 ) & 
                          (steam.get('GenreIsEarlyAccess') == True ) &  
                          (steam.get('GenreIsMassivelyMultiplayer') == True ) & 
                          (steam.get('GenreIsIndie') == False)]
new_collection = np.array(new_collection_df.get('Price'))
new_collection = np.sort(np.append(new_collection, 15))
new_collection

For what integer values of $p$ would we be able to say that this new collection of values has 15 as its $p$th percentile? Create a **list** called `percentile_range` of all integer values of $p$ such that the $p$th percentile of the new collection equals 15, according to the **mathematical** definition of percentile. 

This is a math question, not a coding question. You should create the list `percentile_range` manually, by solving a math problem on paper and inputting your answer in the form of a Python list.

**Do not use `np.percentile`.**

In [None]:
percentile_range = ...

In [None]:
grader.check("q2_4")

**Question 2.5.** The first _quartile_ of a numerical collection is the 25th percentile, the second quartile is the 50th percentile, and the third quartile is the 75th percentile. Quartiles are so named because they divide the collection into quarters.

Make a list called `price_quartiles` that contains the values for the first, second, and third quartiles (in that order) of the `'Price'` data provided in `steam`. For this problem, calculate the percentiles **using `np.percentile`**.

In [None]:
price_quartiles = ...
price_quartiles

In [None]:
grader.check("q2_5")

## 3. Live Crystal Scoops 🔮

<center><img src='images/crystals.jpg' width=30%>(<a href="https://www.youtube.com/watch?v=JHdKb-LumRk">source</a>)</center>

Over the last year, _live crystal scoops_ have become popular on TikTok. There are TikTok pages that collect and sell [crystals](https://en.wikipedia.org/wiki/Crystal), which some believe have the power to heal both the body and the mind. These pages don't sell crystals individually, but rather they "scoop" a random collection of their inventory, put the collected crystals in a bag, and send that bag to the customer. What makes them _live_ crystal scoops is that these pages typically livestream the act of scooping these crystals for every order they receive and include the order number in the stream, so that customers can verify that what they receive is actually what was scooped. For instance, [@chloesmith.uk](https://www.tiktok.com/@chloesmith.uk) is one such page.

Last night, you were scrolling endlessly on TikTok, and came across crystal scooping livestreams by two accounts, _Scoops by Shelly_ and _Crystals by Cathy_. Both are selling scoops for $29.99. Intrigued, you decide to order a scoop from Cathy, and in the livestream it seems that you pulled a hefty scoop. When your order is finally delivered, however, you're disappointed to find that the total weight of the crystals you received is much lower than what you expected given what you saw on the livestream. Should you have purchased a scoop from Shelly instead?

**Question 3.1.** Ideally, you want to determine the mean weight of **all** scoops from *Crystals by Cathy*. However, it's not feasible to do so, because her scoops are very expensive and she has many other customers. Instead, you will collect a sample of scoops to obtain a ____________ statistic to estimate this ____________ parameter.

Complete the sentence above by filling in the blanks. Set `q3_1` to 1, 2, 3, or 4.

1. sample; population
2. test; sample
3. population; sample
4. test; population

In [None]:
q3_1 = ...

In [None]:
grader.check("q3_1")

Fortunately, you have an incredible crystal resource at your disposal, the [Crystals Live Share Group](https://www.facebook.com/groups/846961549165998) on Facebook. You make a post and ask the members who've bought scoops from *Scoops by Shelly* and *Crystals by Cathy* to weigh their packages in grams. You're overwhelmed by the amazing community response and receive 80 different scoop weights in total from other buyers, 40 from *Scoops by Shelly* buyers and 40 from *Crystals by Cathy* buyers.  

Let's look at all the data that you crowdsourced. Each entry in the `'Weight'` column represents the weight of one scoop, in grams.

In [None]:
crystal_weights = bpd.read_csv('data/crystals.csv')
crystal_weights

**Question 3.2.** To start, we'll look at only the scoops in our sample from *Crystals by Cathy*. Below, assign `cathy_crystals` to a DataFrame with only the scoops from *Crystals by Cathy*. Then, assign `cathy_mean` to the mean weight of the *Crystals by Cathy* scoops in our sample.

In [None]:
cathy_crystals = ...
cathy_mean = ...
cathy_mean

In [None]:
grader.check("q3_2")

You're done! Or are you? You have a single estimate for the true mean weight of Cathy's scoops. However, you don't know how close that estimate is, or how much it could have varied if you'd had a different sample. In other words, you have an estimate, but no understanding of how close that estimate is to the true mean weight of *all* of Cathy's scoops.

This is where the idea of resampling via **[bootstrapping](https://inferentialthinking.com/chapters/13/2/Bootstrap.html)** comes in. Assuming that our sample resembles the population fairly well, we can resample from our original sample to produce more samples. From each of these resamples, we can produce another estimate for the true mean weight, which gives us a distribution of sample means that describes how the estimate might vary given different samples. We can then use this distribution to produce an interval that estimates the true mean weight of Cathy's scoops.
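
The resampling idea can be sketched on toy data. This is a minimal illustration under assumptions, not the solution to the questions that follow: `sample` is a made-up array of eight measurements, `np.random.choice` with `replace=True` stands in for resampling rows of a DataFrame, and the middle 90% is an arbitrary choice of interval.

```python
import numpy as np

np.random.seed(0)  # only so this sketch is reproducible

# Hypothetical sample of 8 measurements.
sample = np.array([12.0, 15.5, 9.8, 14.2, 11.1, 13.7, 10.4, 16.0])

boot_means = np.array([])
for _ in range(1000):
    # Resample with replacement, keeping the original sample size.
    resample = np.random.choice(sample, size=len(sample), replace=True)
    boot_means = np.append(boot_means, resample.mean())

# Middle 90% of the bootstrapped means.
lower = np.percentile(boot_means, 5)
upper = np.percentile(boot_means, 95)
print(lower, upper)
```

Each resample has the same size as the original sample, so the spread of `boot_means` reflects how much a mean from a sample of that size tends to vary.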

**Question 3.3.** Complete the following code to produce 1000 bootstrapped estimates for the mean weight of Cathy's scoops. Store your 1000 estimates in an array called `resample_means`.

In [None]:
resample_means = ...
for i in np.arange(1000):
    resample = ...
    resample_mean = ...
    resample_means = ...
resample_means

In [None]:
grader.check("q3_3")

Let's look at the distribution of your estimates:

In [None]:
bpd.DataFrame().assign(BootstrappedMeans = resample_means).plot(kind='hist', density=True, ec='w', bins=20, figsize=(10, 5));

**Question 3.4.** Using the array `resample_means`, compute an approximate 95% confidence interval for the true mean weight of Cathy's scoops. Save the lower and upper bounds of the interval as `cathy_lower_bound` and `cathy_upper_bound`, respectively.

*Hint*: Use `np.percentile`.

In [None]:
cathy_lower_bound = ...
cathy_upper_bound = ...

# Print the confidence interval.
print("Bootstrapped 95% confidence interval for the true mean weight of Cathy's scoops: [{:f}, {:f}]".format(cathy_lower_bound, cathy_upper_bound))

In [None]:
grader.check("q3_4")

**Question 3.5.** Which of the following would likely make the histogram from Question 3.3 wider? If you believe more than one would, choose the answer with the most substantial effect. Assign to `q3_5` either 1, 2, 3, or 4.

1. Increasing the number of resamples (repetitions of the bootstrap) to 3000.
1. Decreasing the number of resamples (repetitions of the bootstrap) to 500.
1. Starting with a larger sample of 100 scoops.
1. Starting with a smaller sample of 20 scoops.

In [None]:
q3_5 = ...
q3_5

In [None]:
grader.check("q3_5")

**Question 3.6.** Suppose you want to estimate the weight of the lightest scoop Cathy has ever scooped, her biggest scam. Would bootstrapping be effective in estimating this weight? Assign `bootstrapping_effective` to either `True` or `False`, representing your answer.

In [None]:
bootstrapping_effective = ...

In [None]:
grader.check("q3_6")

**Question 3.7.** Now let's address a different question: how does the average weight of a *Scoops by Shelly* scoop compare to the average weight of a *Crystals by Cathy* scoop? Create a DataFrame called `shelly_scoops` that contains only the weights of scoops from *Scoops by Shelly*, and set `shelly_mean` equal to the mean weight of Shelly's scoops as you did for *Crystals by Cathy* in Question 3.2. Then, set `observed_diff_mean` to the difference in mean scoop weight between Cathy's and Shelly's scoops in our sample.

$$\text{difference} = \text{mean weight of Cathy's scoops} - \text{mean weight of Shelly's scoops}$$

In [None]:
shelly_scoops = ...
shelly_mean = ...
observed_diff_mean = ...
observed_diff_mean

In [None]:
grader.check("q3_7")

If you completed Question 3.7 correctly, you should have found that Shelly and Cathy's mean scoop weights were quite different. Remember, all we have access to are samples of size 40 from each seller. Would we see this large of a difference if we had access to the population – that is, the weights of all scoops ever produced by both sellers – or was it just by chance that our samples displayed this difference? Let's do a **hypothesis test** to find out. We'll state our hypotheses as follows:

- **Null Hypothesis**: The mean weight of scoops from *Crystals by Cathy* is equal to the mean weight of scoops from *Scoops by Shelly*. Equivalently, the difference in the mean scoop weight for the two sellers equals 0 grams.

- **Alternative Hypothesis**: The mean weight of scoops from *Crystals by Cathy* is not equal to the mean weight of scoops from *Scoops by Shelly*. Equivalently, the difference in the mean scoop weight for the two sellers does not equal 0 grams.

Since we were able to set up our hypothesis test as a question of whether a certain population parameter – the difference in mean scoop weight for *Crystals by Cathy* and *Scoops by Shelly* – is equal to a certain value, we can **test our hypotheses by constructing a confidence interval** for the parameter. This is the method we used in [Lecture 20](https://dsc10.com/resources/lectures/lec20/lec20.html). You can read more about conducting a hypothesis test with a confidence interval in [CIT 13.4](https://inferentialthinking.com/chapters/13/4/Using_Confidence_Intervals.html).

*Note*: We are not conducting a permutation test here, although that would also be a valid approach to test these hypotheses.
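
The decision rule behind this approach can be sketched in a few lines. The function name `reject_at_5_percent` and the example intervals are hypothetical; the rule is the one described in CIT 13.4, where a 95% confidence interval that excludes the hypothesized value leads to rejecting the null at the 0.05 level.

```python
def reject_at_5_percent(left, right, null_value=0):
    """Reject the null when the 95% interval does not contain the null value."""
    return not (left <= null_value <= right)

# Hypothetical intervals, not taken from the crystal data.
print(reject_at_5_percent(-3.2, 1.5))  # False: 0 is inside the interval
print(reject_at_5_percent(0.8, 4.1))   # True: 0 is outside the interval
```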

**Question 3.8.** Compute 1000 bootstrapped estimates for the difference in the mean scoop weight for *Scoops by Shelly* and *Crystals by Cathy*. As in Question 3.7, do Cathy minus Shelly. Store your 1000 estimates in the `difference_means` array.

You should generate your Shelly resamples by sampling from `shelly_scoops`, and your Cathy resamples by sampling from `cathy_crystals`. You should not use `crystal_weights` at all.

In [None]:
np.random.seed(23) # Ignore this, and don't change it.

difference_means = ...

# Just display the first ten differences.
difference_means[:10]

In [None]:
grader.check("q3_8")

Let's visualize your estimates:

In [None]:
bpd.DataFrame().assign(BootstrappedDifferenceMeans = difference_means).plot(kind = 'hist', density=True, ec='w', bins=20, figsize=(10, 5));

**Question 3.9.** Compute a 95% confidence interval for the difference in mean weights of Shelly and Cathy's scoops (as before, Cathy minus Shelly). Assign the left and right endpoints of this confidence interval to `left_endpoint` and `right_endpoint` respectively. Use `np.percentile` to find the endpoints.

In [None]:
left_endpoint = ...
right_endpoint = ...

print("Bootstrapped 95% confidence interval for the mean difference in weights of Shelly and Cathy's scoops:\n [{:f}, {:f}]".format(left_endpoint, right_endpoint))

In [None]:
grader.check("q3_9")

**Question 3.10.** Based on the confidence interval you've created, would you reject the null hypothesis at the 0.05 significance level? Set `reject_null` to `True` if you would reject the null hypothesis, and `False` if you would not.

In [None]:
reject_null = ...

In [None]:
grader.check("q3_10")

**Question 3.11.** What if the Facebook group members had recorded all of their scoop weights in pounds instead of grams? Would your hypothesis test still come to the same conclusion either way? Set `same_conclusion` to `True` or `False`.

In [None]:
same_conclusion = ...

In [None]:
grader.check("q3_11")

## 4. Chocolate Shop  🍫

You are planning to open a chocolate shop and you want to get a sense of the local residents' chocolate preferences. You survey 510 randomly-selected local residents and ask which type of chocolate they prefer the most among four options – `'dark'`, `'milk'`, `'white'`, `'bittersweet'`. You also record some indecisive individuals as `'undecided'`.

Run the next cell to load in the results of the survey.

In [None]:
chocolate = bpd.read_csv('data/chocolate.csv')
chocolate

Assume that your sample is a uniform random sample of the local population. Below, we compute the proportion of people in your sample that prefer each type of chocolate.

In [None]:
chocolate.assign(counts=chocolate.get('chocolate')).groupby('chocolate').count().get('counts') / chocolate.shape[0]

What you're truly interested in, though, is the proportion of *all local residents* that prefer each type of chocolate. These are *population parameters* (plural, because there are 5 proportions).

In this question, we will start by computing a confidence interval for the true proportion of residents that prefer `'dark'` chocolate, and then later compute a confidence interval for the true difference between the proportion of residents that prefer `'dark'` chocolate and the proportion that prefer `'milk'` chocolate.

<center><img src="images/choco-pun.jpeg" width=35%></center>


Below, we have given you code that computes 1000 bootstrapped estimates of the true proportion of residents who prefer `'dark'` chocolate over the other options. Run the next cell to calculate these estimates and display a histogram of their values.

In [None]:
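def proportions_in_resamples():
    """Return an array of 1000 bootstrapped proportions of residents who prefer 'dark' chocolate."""
    np.random.seed(55) # Ignore this, and don't change it.
    num_residents = chocolate.shape[0]
    proportions = np.array([])
    for i in np.arange(1000):
        # Resample the survey with replacement, keeping the original sample size.
        resample = chocolate.sample(num_residents, replace = True)
        # Proportion of this resample that prefers 'dark' chocolate.
        resample_proportion = np.count_nonzero(resample.get('chocolate') == 'dark') / num_residents
        proportions = np.append(proportions, resample_proportion)
    return proportions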

boot_dark_proportions = proportions_in_resamples()
bpd.DataFrame().assign(Estimated_Proportion_Dark=boot_dark_proportions).plot(kind='hist', density=True, ec='w', figsize=(10, 5));

**Question 4.1.** Using the array `boot_dark_proportions`, compute an approximate **99%** (not 95%) confidence interval for the true proportion of residents who prefer `'dark'` chocolate. Assign the lower and upper endpoints of the interval to `dark_lower_bound` and `dark_upper_bound`, respectively.

*Note*: As we did in lecture, use `np.percentile` whenever computing confidence intervals.

In [None]:
dark_lower_bound = ...
dark_upper_bound = ...

# Print the confidence interval:
print("Bootstrapped 99% confidence interval for the true proportion of residents who prefer dark chocolate in the population:\n[{:f}, {:f}]".format(dark_lower_bound, dark_upper_bound))

In [None]:
grader.check("q4_1")

**Question 4.2.**
Is it true that 99% of the population lies in the range `dark_lower_bound` to `dark_upper_bound`? Assign the variable `q4_2` to either `True` or `False`. 

In [None]:
q4_2 = ...

In [None]:
grader.check("q4_2")

**Question 4.3.**
Is it true that the true proportion of residents who prefer `'dark'` chocolate over the other chocolates is a random quantity with approximately a 99% chance of falling between `dark_lower_bound` and `dark_upper_bound`? Assign the variable `q4_3` to either `True` or `False`.

In [None]:
q4_3 = ...

In [None]:
grader.check("q4_3")

**Question 4.4.**
Suppose we were somehow able to produce 20,000 new samples, each one a uniform random sample of 510 residents taken directly from the population. For each of those 20,000 new samples, we create a 99% confidence interval for the proportion of residents who prefer `'dark'` chocolate. Roughly how many of those 20,000 intervals should we expect to actually contain the true population proportion? Assign your answer to the variable `how_many` below. It should be of type `int`, representing the *number* of intervals, not the proportion or percentage.
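To make the thought experiment concrete, here is a small simulation sketch with deliberately different numbers. Everything in it is an assumption for illustration: a hypothetical population in which the true proportion is 0.3, samples of size 100, only 200 samples, and 90% intervals. It repeatedly draws a fresh sample, bootstraps a percentile interval from that sample, and records how often the interval contains the true proportion.

```python
import numpy as np

np.random.seed(0)  # Arbitrary seed so the sketch is reproducible.

true_p = 0.3          # Hypothetical population proportion.
n = 100               # Hypothetical sample size.
num_samples = 200     # Number of fresh samples drawn from the population.
num_resamples = 500   # Bootstrap resamples per sample.

num_covering = 0
for i in np.arange(num_samples):
    # Draw a fresh sample of 0s and 1s directly from the population.
    sample = np.random.choice([0, 1], size=n, p=[1 - true_p, true_p])
    # Bootstrap: each row is one resample (with replacement) of this sample.
    resamples = np.random.choice(sample, size=(num_resamples, n), replace=True)
    boot_props = resamples.mean(axis=1)
    # 90% interval: 5th to 95th percentile of the bootstrapped proportions.
    left = np.percentile(boot_props, 5)
    right = np.percentile(boot_props, 95)
    if left <= true_p <= right:
        num_covering = num_covering + 1

coverage = num_covering / num_samples
print(coverage)
```

The printed fraction is the share of simulated intervals that captured the true proportion; comparing it to the confidence level used to build the intervals shows what that level describes.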

In [None]:
how_many = ...
how_many

In [None]:
grader.check("q4_4")

**Question 4.5.** We also created 90%, 95%, and 99.9% confidence intervals from one sample (shown below), but forgot to label which was which! Match each interval to its confidence level and assign your choices (either 1, 2, or 3) to the variables `ci_90`, `ci_95`, and `ci_999`, corresponding to the 90%, 95%, and 99.9% confidence intervals, respectively.

*Hint*: Drawing the confidence intervals out on paper might help you visualize them better.

1. $[0.273, 0.363]$


2. $[0.268, 0.380]$


3. $[0.295, 0.354]$


In [None]:
ci_90 = ...
ci_95 = ...
ci_999 = ...
ci_90, ci_95, ci_999

In [None]:
grader.check("q4_5")

**Question 4.6.** Based on the survey results shown at the start of the question, it seems that `'dark'` chocolate is more popular than `'milk'` chocolate among residents. We would like to construct a range of likely values – that is, a confidence interval – for the difference in popularity, which we define as:

$$\text{(Proportion of residents who prefer dark chocolate)} - \text{(Proportion of residents who prefer milk chocolate)}$$

Create a function, `differences_in_resamples`, that creates **1000 bootstrapped resamples of the original survey data** in the `chocolate` DataFrame, computes the difference in proportions for each resample, and returns an array of these differences. Store your bootstrapped estimates in an array called `boot_differences` and plot a histogram of these estimates.

*Note*: While this might sound like a job for permutation testing, this is instead a bootstrapping question. Note that our goal is to estimate a population parameter – the difference between the proportion of all residents that prefer dark chocolate and the proportion of all residents that prefer milk chocolate – not to answer a question about whether two samples come from the same distribution.

*Hint*: Use the code for `proportions_in_resamples` given to you above as a starting point.

In [None]:
def differences_in_resamples():
    np.random.seed(55) # Ignore this, and don't change it.
    ...

boot_differences = ...

# Plot a histogram of boot_differences.

In [None]:
grader.check("q4_6")

**Question 4.7.** Compute an approximate 99% confidence interval for the difference in proportions. Assign the lower and upper bounds of the interval to `diff_lower_bound` and `diff_upper_bound`, respectively.

In [None]:
diff_lower_bound = ...
diff_upper_bound = ...

# Print the confidence interval:
print("Bootstrapped 99% confidence interval for the difference in popularity between dark chocolate and milk chocolate:\n[{:f}, {:f}]".format(diff_lower_bound, diff_upper_bound))

In [None]:
grader.check("q4_7")

**Question 4.8.** In this question, you computed two 99% confidence intervals:
- In Question 4.1, you found a 99% confidence interval for the proportion of residents who prefer `'dark'` chocolate among the four chocolate options. Let's call this the "dark chocolate CI."
- In Question 4.7, you found a 99% confidence interval for the difference between the proportion of residents who prefer `'dark'` chocolate and the proportion of residents who prefer `'milk'` chocolate. Let's call this the "difference CI." 

Which of the explanations below best describes the widths of these two confidence intervals? Set `q4_8` to either 1, 2, 3, or 4.

1. The dark chocolate CI is **wider** than the difference CI because we have **more certainty** in an estimate of a single unknown parameter than in the difference between two unknown parameters.
1. The dark chocolate CI is **narrower** than the difference CI because we have **more certainty** in an estimate of a single unknown parameter than in the difference between two unknown parameters.
1. The dark chocolate CI is **wider** than the difference CI because we have **less certainty** in an estimate of a single unknown parameter than in the difference between two unknown parameters.
1. The dark chocolate CI is **narrower** than the difference CI because we have **less certainty** in an estimate of a single unknown parameter than in the difference between two unknown parameters.

In [None]:
q4_8 = ...

In [None]:
grader.check("q4_8")

## Finish Line 🏁

Congratulations! You are done with Homework 5, the second-to-last homework of the quarter!

To submit your assignment:

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells.
1. Read through the notebook to make sure everything is fine and all tests passed.
1. Run the cell below to run all tests, and make sure that they all pass.
1. Download your notebook using `File -> Download as -> Notebook (.ipynb)`, then upload your notebook to Gradescope.
1. Stick around while the Gradescope autograder grades your work. Make sure you see that all tests have passed on Gradescope.
1. Check that you have a confirmation email from Gradescope and save it as proof of your submission.

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
grader.check_all()