# Homework 4: Simulation, Sampling, and Bootstrapping

## Due Tuesday, May 14th at 11:59PM

Welcome to Homework 4! This homework will cover:
- Simulations (see [CIT 9.3-9.4](https://inferentialthinking.com/chapters/09/3/Simulation.html))
- Sampling and Empirical Distributions (see [CIT 10-10.4](https://inferentialthinking.com/chapters/10/Sampling_and_Empirical_Distributions.html))
- Bootstrapping and Confidence Intervals (see [CIT 13.2](https://inferentialthinking.com/chapters/13/2/Bootstrap.html) and [CIT 13.3](https://inferentialthinking.com/chapters/13/3/Confidence_Intervals.html))

### Instructions

Remember to start early and submit often. You are given six slip days throughout the quarter to extend deadlines. See the syllabus for more details. With the exception of using slip days, late work will not be accepted unless you have made special arrangements with your instructor.

**Important**: For homeworks, the `otter` tests don't usually tell you that your answer is correct. More often, they help catch careless mistakes. It's up to you to ensure that your answer is correct. If you're not sure, ask someone (not for the answer, but for some guidance about your approach). These are great questions for office hours (the schedule can be found [here](https://dsc10.com/calendar)) or Ed. Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. 

In [None]:
# Please don't change this cell, but do make sure to run it.
import babypandas as bpd
import numpy as np

import matplotlib.pyplot as plt
plt.style.use('ggplot')

import otter
grader = otter.Notebook()

## 1. Triton Spirit Night Lotto, continued 🔱 🏀

In the last homework, we calculated the probability of winning the grand prize (free housing) on a Spirit Night Lotto lottery ticket, and found that it was quite low 😭.

In [None]:
# Just run this cell, do not change it!
free_housing_chance = (1 / 66) * (1 / 65) * (1 / 64) * (1 / 63) * (1 / 62) * (1 / 15)
free_housing_chance

In this question, we'll approach the same question not using math, but using simulation. 

It's important to remember how this lottery works:
- First, you pick five **different** numbers, one at a time, from 1 to 66, representing the winning score at the Spirit Night game.
- Then, you separately pick a number from 1 to 15. This is because the top scoring athlete (Bryce Pope) scored 15 points during the game. Let's say you select 9.
- The six numbers you have selected, or **your numbers**, can be represented all together as (7, 32, 24, 65, 13, 9). This is a _sequence_ of six numbers – **order matters**!

The **winning numbers** are chosen by King Triton, who draws five balls, one at a time, **without replacement**, from a pot of white balls numbered 1 to 66. Then, he draws a gold ball, the Tritonball, from a pot of gold balls numbered 1 to 15. Both pots are completely separate, hence the different ball colors. For example, maybe the winning numbers are (65, 9, 24, 23, 1, 9).

We'll assume for this problem that in order to win the grand prize (free housing), all six of your numbers need to match the winning numbers and be in the **exact same positions**. In other words, your entire sequence of numbers must be exactly the same as King Triton's sequence of winning numbers. However, if some numbers in your sequence match up with the corresponding number in the winning sequence, you will still win some Triton Cash.

Suppose again that you select (7, 32, 24, 65, 13, 9) and the winning numbers are (65, 9, 24, 23, 1, 9). In this case, two of your numbers are considered to match two of the winning numbers. 
- Your numbers: (7, 32, **24**, 65, 13, **9**)
- Winning numbers: (65, 9, **24**, 23, 1, **9**)

You won't win free housing, but you will win some Triton Cash. Note that although both sequences include the number 65 within the first five numbers (representing a white ball), since they are in different positions, that's not considered a match.
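To make the position-by-position matching rule concrete, here is a minimal sketch that uses the two example sequences above. It assumes both sequences are stored as NumPy arrays of the same length; the names `yours`, `winners`, `same_position`, and `example_matches` are illustrative only and are not used elsewhere in this notebook.

```python
import numpy as np

# The two example sequences from the text above (illustrative variable names).
yours = np.array([7, 32, 24, 65, 13, 9])
winners = np.array([65, 9, 24, 23, 1, 9])

# Comparing two arrays of equal length gives one True/False value per position.
same_position = yours == winners

# np.count_nonzero counts the True entries, i.e. the positional matches.
example_matches = np.count_nonzero(same_position)
print(example_matches)  # 2
```

Note that 65 appears in both sequences but in different positions, so it contributes `False` to the comparison, in line with the rule described above.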


**Question 1.1.** Implement a function called `simulate_one_ticket`. It should take no arguments, and it should return an array with 6 random numbers, simulating how the numbers are selected for a single Spirit Night Lotto ticket. The first five numbers should all be randomly chosen without replacement, from 1 to 66. The last number should be randomly chosen from 1 to 15.

In [None]:
def simulate_one_ticket():
    """Simulate one Spirit Night Lotto ticket."""
    ...

In [None]:
grader.check("q1_1")

**Question 1.2.** It's draw day. You checked the winning numbers King Triton drew, which happened to be **(26, 49, 64, 5, 12, 7)**. Below, calculate how many matches there are between the winning numbers and a randomly generated ticket, and save the result in `num_matches`. Remember, order matters when counting matches!

***Hint:*** You don't need a `for`-loop for this question. There is a one-line solution using `np.count_nonzero`.

In [None]:
winning = np.array([26, 49, 64, 5, 12, 7])
simulated_ticket = simulate_one_ticket()
num_matches = ...

print(f"The number of matches between the winning numbers {winning} and the simulated ticket {simulated_ticket} is {num_matches}.")

In [None]:
grader.check("q1_2")

**Question 1.3.** You are disappointed because you bought a lottery ticket but you did not win free housing. To make yourself feel better, you write a simulation to remind yourself how unlikely it is to win the grand prize. 

Implement a simulation where you call the function `simulate_one_ticket` 100,000 times. In your 100,000 tickets, **how many times did you win the grand prize (free housing)?** Assign your answer to `count_free_housing`. (It would cost a fortune if you were to buy 100,000 tickets – it's pretty nice to be able to simulate this experiment instead of doing it in real life!)

***Hint:*** Start by writing a simulation where you only buy 10 tickets. Once you are sure you have that figured out, then ramp it up to 100,000 tickets. This is a good general practice for writing simulations: start small! It may take a little while (up to a minute) for Python to perform the calculations when you are buying 100,000 tickets. 
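The "start small" pattern in the hint can be sketched on an unrelated toy experiment: rolling a die and counting sixes. Everything here (the helper `roll_is_six`, the names `repetitions` and `six_count`) is hypothetical and only shows the shape of a simulation loop, not the lotto simulation itself.

```python
import numpy as np

def roll_is_six():
    """Simulate one die roll and report whether it came up six."""
    return np.random.choice(np.arange(1, 7)) == 6

repetitions = 10  # start small; increase once the loop behaves as expected
six_count = 0
for i in np.arange(repetitions):
    # Run the experiment once and update the running count.
    if roll_is_six():
        six_count = six_count + 1

print(six_count)
```

Keeping the number of repetitions in a single variable makes it easy to scale the simulation up later by changing one line.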

In [None]:
winning = np.array([26, 49, 64, 5, 12, 7])
count_free_housing = ...
...
count_free_housing

In [None]:
grader.check("q1_3")

How many times did you win free housing? Remember, the mathematical probability of winning free housing is quite low, on the order of $10^{-11}$. That's a lot lower than 1 in 100,000, which is $10^{-5}$.
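As a rough check on these orders of magnitude, the arithmetic can be written out as below. The variable names are illustrative; the only inputs are the pot sizes described earlier (66 white balls, 15 gold balls) and the 100,000 tickets.

```python
# Number of equally likely winning sequences:
# five ordered white balls drawn without replacement, then one gold ball.
possible_sequences = 66 * 65 * 64 * 63 * 62 * 15
grand_prize_probability = 1 / possible_sequences

# Expected number of grand prizes among 100,000 tickets.
expected_wins = 100_000 * grand_prize_probability

print(possible_sequences)       # 16086470400
print(grand_prize_probability)  # about 6.2e-11
print(expected_wins)            # about 6.2e-06
```

An expected count this far below 1 suggests that seeing zero grand prizes in 100,000 simulated tickets is the typical outcome.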

**Question 1.4.** As we've seen, you would need to be extremely lucky to win the grand prize. To encourage more students to buy Spirit Night Lotto tickets despite the terrible odds, there are some additional prizes. Students can win Triton Cash if *some* of their numbers match the corresponding winning numbers, as described in the introduction. Again, simulate the act of buying 100,000 tickets, but this time find **the greatest number of matches achieved by any of your tickets**, and assign this number to `most_matches`. 

For example, if 90,000 of your tickets matched 1 winning number and 10,000 of your tickets matched 2 winning numbers, then you would set `most_matches` to 2. If 99,999 of your tickets matched 1 winning number and one of your tickets matched 4 winning numbers, you would set `most_matches` to 4. If you happened to win the grand prize on one of your tickets, you would set `most_matches` to 6. 

***Hint:*** There are several ways to approach this; one way involves storing the number of matches per ticket in an array and finding the largest number in that array. 

In [None]:
winning = np.array([26, 49, 64, 5, 12, 7])
most_matches = ...
...
most_matches

In [None]:
grader.check("q1_4")

**Question 1.5.** Suppose one Spirit Night Lotto ticket costs $5.

The Spirit Night Lotto advertisement on Instagram promises you will never lose money because of the following generous prizes:

- Win $15 with a 1-number match

- Win $50 with a 2-number match

- Win $100 with a 3-number match

- Win $1,000 with a 4-number match

- Win $5,000 with a 5-number match

- Win $25,000 with a 6-number match (free housing!)

If you had the money to buy 100,000 tickets, what would be your net winnings from buying these tickets? Since this is net winnings, this should account for the prizes you win and the cost of buying the tickets. Assign the amount to `net_winnings`. Note that a positive value means you won money overall, and a negative value means you lost money overall. Do you believe the advertisement's claims?

The winning numbers are the same as in the previous parts: **(26, 49, 64, 5, 12, 7)**.

***Hint:*** Again, there are a few ways you could approach this problem. One way involves generating another 100,000 random tickets and counting the amount earned per ticket, adding to a running total. Alternatively, if you created an array of the number of matches per ticket in Question 1.4, you could loop through that array. For practice, you can try solving this problem multiple ways!
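One possible way to organize the running-total idea is sketched below on made-up data. It assumes that a ticket with 0 matches wins $0 (the advertisement does not list a prize for that case), and the names `prizes`, `toy_matches`, and `toy_net` are hypothetical. The match counts are invented for illustration and are not simulation output.

```python
import numpy as np

# Prize in dollars for 0, 1, 2, ..., 6 matches; position k holds the prize for k matches.
# The $0 entry for 0 matches is an assumption.
prizes = np.array([0, 15, 50, 100, 1000, 5000, 25000])
ticket_cost = 5

# Hypothetical match counts for five tickets (made-up values).
toy_matches = np.array([0, 1, 0, 2, 1])

toy_net = 0
for m in toy_matches:
    # Add this ticket's prize and subtract what the ticket cost.
    toy_net = toy_net + prizes[m] - ticket_cost

print(toy_net)  # 55
```

Storing the prizes in an array indexed by the number of matches avoids a long chain of `if`/`elif` statements, though a chain of conditionals would work as well.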

In [None]:
net_winnings = ...
...
net_winnings

In [None]:
grader.check("q1_5")

## 2. Sampling with NBA Data 🏀

In this question, we will use a [dataset](https://www.kaggle.com/datasets/jamiewelsh2/nba-player-salaries-2022-23-season?resource=download) consisting of information about NBA player salaries and per game statistics for the 2022-23 season. We'll use this data to get some practice with sampling. Run the cell below to load the data into a DataFrame.

In [None]:
# Just run this cell, do not change it!
nba_data = bpd.read_csv('data/nba_players.csv')
nba_data

We've provided a function called `compute_statistics` that takes as input a DataFrame with two columns, `'Points'` and `'Salary'`, and then:
- draws a histogram of `'Points'`,
- draws a histogram of `'Salary'`, and
- returns a two-element array containing the mean `'Points'` and mean `'Salary'`.

Run the cell below to define the `compute_statistics` function, and a helper function called `histograms`. Don't worry about how this code works, and please don't change anything.

In [None]:
# Don't change this cell, just run it.
def histograms(df):
    points = df.get('Points').values
    salaries = df.get('Salary').values
    
    plt.subplots(1, 2, figsize=(15, 4), dpi=100)

    plt.subplot(1, 2, 1)
    plt.hist(points, density=True, alpha=0.5, color='blue', ec='w', bins=np.arange(0, 35, 1))
    plt.title('Distribution of Points')

    plt.subplot(1, 2, 2)
    plt.hist(salaries, density=True, alpha=0.5, color='blue', ec='w', bins=np.arange(0, 50000000, 1000000))
    plt.title('Distribution of Salaries')
    
def compute_statistics(points_and_salaries_data, draw=True):
    if draw:
        histograms(points_and_salaries_data)
    avg_points = points_and_salaries_data.get('Points').mean()
    avg_salary = points_and_salaries_data.get('Salary').mean()
    avg_array = np.array([avg_points, avg_salary]) 
    return avg_array

We can use this `compute_statistics` function to show the distribution of `'Points'` and `'Salary'` and compute their means, for any collection of players. 

Run the next cell to show these distributions and compute the means for all NBA players. Notice that an array containing the mean `'Points'` and mean `'Salary'` values is displayed before the histograms.

In [None]:
nba_stats = compute_statistics(nba_data)
nba_stats

Now, imagine that instead of having access to the full *population* of NBA players, we only have access to data on a smaller subset of players, or a *sample*.  For 467 players, it's not so unreasonable to expect to see all the data, but usually we aren't so lucky.  Instead, we often make *statistical inferences* about a large underlying population using a smaller sample.

**Statistical inference** is the process of using data in a sample to _infer_ some characteristic about the population from which the sample was drawn. A common strategy for statistical inference is to estimate a parameter of the population by computing a corresponding statistic on a sample. This strategy sometimes works well and sometimes doesn't.  The degree to which it gives us useful answers depends on several factors.

One very important factor in the utility of samples is how they were gathered. Let's look at some different sampling strategies.

### Convenience sampling
One sampling methodology, which is **generally a bad idea**, is to choose players which are somehow convenient to sample.  For example, you might choose players from a team that's near your house, since it's easier to collect information about them.  This is called *convenience sampling*.

**Question 2.1.**  Suppose you live in New Jersey and you decide to manually look up information from the three closest teams:
- New York Knicks (`'NYK'`)
- Brooklyn Nets (`'BRK'`)
- Philadelphia 76ers (`'PHI'`)

Assign `convenience_sample` to a subset of `nba_data` that contains only the rows for players that are in one of these three teams.

In [None]:
convenience_sample = ...
convenience_sample

In [None]:
grader.check("q2_1")

**Question 2.2.** Assign `convenience_stats` to an array of the mean `'Points'` and mean `'Salary'` of your convenience sample.  Since they're computed on a sample, these are called *sample means*. 

***Hint:*** Use the function `compute_statistics`; it's okay if histograms are displayed as well.

In [None]:
convenience_stats = ...
convenience_stats

In [None]:
grader.check("q2_2")

Next, we'll compare the distribution of `'Points'` in our convenience sample to the distribution of `'Points'` for all the players in our dataset.

In [None]:
# Just run this cell, do not change it!
def compare_points(first, second, first_title, second_title):
    """Compare the points in two DataFrames."""
    bins = np.arange(0, 35, 1)
    
    plt.subplots(1, 2, figsize=(15, 4), dpi=85)

    plt.subplot(1, 2, 1)
    plt.hist(first.get('Points'), bins=bins, density=True, ec='w', color='blue', alpha=0.5)
    plt.title(f'Points ({first_title})')
    
    plt.subplot(1, 2, 2)
    plt.hist(second.get('Points'), bins=bins, density=True, ec='w', color='blue', alpha=0.5)
    plt.title(f'Points ({second_title})')

compare_points(nba_data, convenience_sample, 'All Players', 'Convenience Sample')

**Question 2.3.** From what you see in the histograms above, did the convenience sample give us an accurate picture of the points for the full population of NBA players?  Why or why not?

Assign either 1, 2, 3, or 4 to the variable `sampling_q3` below. 
1. Yes. The sample is large enough, so it is an accurate representation of the population.
1. No. Convenience samples generally don't give us an accurate representation of the population.
1. No. Normally convenience samples give us an accurate representation of the population, but we just got unlucky.
1. No. Normally convenience samples give us an accurate representation of the population, but only if the sample size is large enough. Our convenience sample here was too small.

In [None]:
sampling_q3 = ...

In [None]:
grader.check("q2_3")

### Simple random sampling
A more principled approach is to sample uniformly at random from the players.  If we ensure that each player is selected at most once, this is a **random sample without replacement**, sometimes called a "**simple random sample**" or "**SRS**".  Imagine writing down each player's name on a card, putting the cards in a hat, and shuffling the hat.  To sample, pull out cards one by one and set them aside, stopping when the specified *sample size* is reached.
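The cards-in-a-hat picture can be sketched with NumPy on a toy example. The "hat" below is a hypothetical array of ten card labels, not the NBA data, and the names `cards` and `drawn` are illustrative.

```python
import numpy as np

# Hypothetical "hat" of ten cards, labeled 0 through 9.
cards = np.arange(10)

# replace=False means a card that has been drawn cannot be drawn again,
# so every card appears at most once in the sample.
drawn = np.random.choice(cards, 4, replace=False)
print(drawn)
```

With `replace=True` instead, the same card could show up more than once, which would no longer be a simple random sample.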

We've produced two simple random samples of `nba_data`: the variable `small_srs_data` contains an SRS of size 70, and the variable `large_srs_data` contains an SRS of size 180.

Now we'll run the same analyses on the small simple random sample, the large simple random sample, and the convenience sample. The subsequent code draws the histograms and computes the means for `'Points'` and `'Salary'`.

In [None]:
# Don't change this cell, but do run it.
small_srs_data = bpd.read_csv('data/small_srs_salary.csv')
large_srs_data = bpd.read_csv('data/large_srs_salary.csv')

small_stats = compute_statistics(small_srs_data, draw=False);
large_stats = compute_statistics(large_srs_data, draw=False);
convenience_stats = compute_statistics(convenience_sample, draw=False);

print('Full data stats:                 ', nba_stats)
print('Small SRS stats:                 ', small_stats)
print('Large SRS stats:                 ', large_stats)
print('Convenience sample stats:        ', convenience_stats)

color_dict = {
    'small SRS': 'blue',
    'large SRS': 'green',
    'convenience sample': 'orange'
}

plt.subplots(3, 2, figsize=(15, 15), dpi=100)
i = 1

for df, name in zip([small_srs_data, large_srs_data, convenience_sample], color_dict.keys()):
    plt.subplot(3, 2, i)
    i += 2
    plt.hist(df.get('Points'), density=True, alpha=0.5, color=color_dict[name], ec='w', 
             bins=np.arange(0, 35, 1))
    plt.title(f'Points ({name})');

i = 2
for df, name in zip([small_srs_data, large_srs_data, convenience_sample], color_dict.keys()):
    plt.subplot(3, 2, i)
    i += 2
    plt.hist(df.get('Salary'), density=True, alpha=0.5, color=color_dict[name], ec='w', 
             bins=np.arange(0, 50000000, 1000000))
    plt.title(f'Salaries ({name})');

### Producing simple random samples
Often it's useful to take random samples even when we have a larger dataset available.  One reason is that doing so can help us understand how inaccurate other samples are.

As we saw in [Lecture 13](https://dsc10.com/resources/lectures/lec13/lec13.html#Sampling-rows-from-a-DataFrame), DataFrames have a `.sample` method for producing simple random samples.  Note that its default is to sample **without** replacement, which aligns with how simple random samples are drawn.

**Question 2.4.** Produce a simple random sample of size 70 from `nba_data`. Store an array containing the mean `'Points'` and mean `'Salary'` of your SRS in `my_small_stats`. Again, it's fine if histograms are displayed.

In [None]:
my_small_stats = ...
my_small_stats

Run the cell above many times, to collect new samples and compute their sample means.

<br>

Now, recall, `small_stats` is an array containing the mean `'Points'` and mean `'Salary'` for the one small SRS that we provided you with:

In [None]:
small_stats

Answer the following two-fold question:
- Are the values in `my_small_stats` (the mean `'Points'` and `'Salary'` for **your** small SRS) similar to the values in `small_stats` (the mean `'Points'` and `'Salary'` for the small SRS **we provided you with**)? 
- Each time you collect a new sample – i.e. each time you re-run the cell where `my_small_stats` is defined – do the values in `my_small_stats` change a lot?

Assign either 1, 2, 3, or 4 to the variable `sampling_q4` below.
1. The values in `my_small_stats` are identical to the values in `small_stats`, and change a bit each time a new sample is collected.
1. The values in `my_small_stats` are identical to the values in `small_stats`, and don't change at all each time a new sample is collected.
1. The values in `my_small_stats` are slightly different from the values in `small_stats`, and change a bit each time a new sample is collected.
1. The values in `my_small_stats` are very different from the values in `small_stats`, and don't change at all each time a new sample is collected.

In [None]:
sampling_q4 = ...

In [None]:
grader.check("q2_4")

**Question 2.5.** Similarly, create a simple random sample of size 180 from `nba_data` and store an array of the sample's mean `'Points'` and mean `'Salary'` in `my_large_stats`.

In [None]:
my_large_stats = ...
my_large_stats

Run the cell in which `my_large_stats` is defined many times. Do the histograms and  mean statistics (mean `'Points'` and mean `'Salary'`) seem to change more or less across samples of size 180 than across samples of size 70?

Assign either 1, 2, or 3 to the variable `sampling_q5` below. 

1. The statistics change *less* across samples of size 180 than across samples of size 70.
1. The statistics change an *equal amount* across samples of size 180 and across samples of size 70.
1. The statistics change *more* across samples of size 180 than across samples of size 70.

In [None]:
sampling_q5 = ...

In [None]:
grader.check("q2_5")

## 3. Milk Tea, Yippee! 🥛🍵

You are planning to open a milk tea shop in La Jolla! To get a sense of the local residents' milk tea preferences, you survey 200 randomly selected La Jolla residents and ask which type of tea they prefer the most among six options – `'jasmine'`, `'oolong'`, `'black'`, `'golden'`, `'matcha'`, `'Thai'`.

<center><img src="images/tea.png" width=70%></center>

Run the next cell to load in the results of the survey.

In [None]:
survey = bpd.read_csv('data/tea.csv')
survey

What you're truly interested in, though, is the proportion of *all La Jolla residents* that prefer each type of tea. These are *population parameters* (plural, because there are six proportions).

Your friends tell you that jasmine tea is popular and that your shop should focus on jasmine tea-based creations. To make an informed decision, you decide to look at your survey data to determine the proportion of La Jolla residents that prefer `'jasmine'` tea over all other types of teas.

**Question 3.1.** Start by calculating the proportion of residents in your sample who prefer `'jasmine'` tea. Assign this value to `jasmine_proportion`.



In [None]:
jasmine_proportion = ...
jasmine_proportion

In [None]:
grader.check("q3_1")

You're done... or are you? You have a single estimate for the true proportion of residents who prefer `'jasmine'` tea. However, you don't know how close that estimate is, or how much it could have varied if you'd had a different sample. In other words, you have an estimate, but no understanding of how close that estimate is to the true proportion of all local residents who prefer `'jasmine'` tea.

This is where the idea of resampling via **[bootstrapping](https://inferentialthinking.com/chapters/13/2/Bootstrap.html)** comes in. Assuming that our sample resembles the population fairly well, we can resample from our original sample to produce more samples. From each of these resamples, we can produce another estimate for the true proportion of residents who prefer `'jasmine'` tea, which gives us a distribution of sample proportions that describes how the estimate might vary given different samples. We can then use this distribution to understand the **variability** in the estimated proportion of residents who prefer `'jasmine'` tea.
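The resampling idea can be sketched with plain NumPy on a tiny made-up sample. This is only an illustration under stated assumptions: the ten responses below are invented, the sample is encoded as 1 ("prefers jasmine") or 0 (anything else), and the names `toy_sample` and `toy_boot_props` are hypothetical. The survey itself lives in a DataFrame, so the details of resampling it will differ.

```python
import numpy as np

# Made-up sample of 10 responses: 1 means "prefers jasmine", 0 means anything else.
toy_sample = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])

toy_boot_props = np.array([])
for i in np.arange(500):
    # Resample with replacement, keeping the same size as the original sample.
    toy_resample = np.random.choice(toy_sample, len(toy_sample), replace=True)
    # The mean of a 0/1 array is the proportion of 1s in that resample.
    toy_boot_props = np.append(toy_boot_props, toy_resample.mean())

# The spread of the resampled proportions describes the variability of the estimate.
print(np.std(toy_boot_props))
```

Resampling with replacement at the original sample size is what lets each resample differ from the original sample while still being drawn only from the data we have.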


**Question 3.2.** Now, let's use bootstrapping to get a sense of the distribution of the sample proportion. Complete the following code to produce 1,000 bootstrapped estimates for the proportion of residents who prefer `'jasmine'` tea. Store your 1,000 estimates in an array named `boot_jasmine_proportions`.

In [None]:
boot_jasmine_proportions = ...
for i in np.arange(1000):
    resample = ...
    resample_proportion = ...
    boot_jasmine_proportions = ...

boot_jasmine_proportions

In [None]:
grader.check("q3_2")

**Question 3.3.** Using the array `boot_jasmine_proportions`, compute an approximate **95%** confidence interval for the true proportion of residents who prefer `'jasmine'` tea.  Compute the lower and upper ends of the interval, named `jasmine_lower_bound` and `jasmine_upper_bound`, respectively.
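As a reminder of how percentiles pick out the middle of a distribution, here is a small sketch on a stand-in array. The values are evenly spaced numbers between 0 and 1, chosen only so the result is easy to read; they are not bootstrapped estimates, and the names `toy_values`, `toy_left`, and `toy_right` are hypothetical.

```python
import numpy as np

# Stand-in values: 0.000, 0.001, ..., 1.000 (not real bootstrapped estimates).
toy_values = np.arange(0, 1001) / 1000

# The middle 95% of the values lies between the 2.5th and 97.5th percentiles.
toy_left = np.percentile(toy_values, 2.5)
toy_right = np.percentile(toy_values, 97.5)
print(toy_left, toy_right)  # close to 0.025 and 0.975
```

The same reasoning gives the endpoints for other confidence levels: a middle 90% interval, for instance, would leave 5% in each tail.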

In [None]:
jasmine_lower_bound = ...
jasmine_upper_bound = ...

# Print the confidence interval:
print("Bootstrapped 95% confidence interval for the true proportion of residents who prefer jasmine tea in the population:\n[{:f}, {:f}]".format(jasmine_lower_bound, jasmine_upper_bound))

In [None]:
grader.check("q3_3")

**Question 3.4.**
Is it true that 95% of the population lies in the range `jasmine_lower_bound` to `jasmine_upper_bound`? Assign the variable `q3_4` to either `True` or `False`. 

In [None]:
q3_4 = ...

In [None]:
grader.check("q3_4")

**Question 3.5.**
Is it true that the proportion of La Jolla residents who prefer `'jasmine'` tea over the other teas is a random quantity with approximately a 95% chance of falling between `jasmine_lower_bound` and `jasmine_upper_bound`? Assign the variable `q3_5` to either `True` or `False`.

In [None]:
q3_5 = ...

In [None]:
grader.check("q3_5")

**Question 3.6.**
Suppose we were somehow able to produce 2,000 new samples, each one a uniform random sample of 200 La Jolla residents taken directly from the population. For each of those 2,000 new samples, we create a 95% confidence interval for the proportion of residents who prefer `'jasmine'` tea. Roughly how many of those 2,000 intervals should we expect to actually contain the true proportion of the population? Assign your answer to the variable `how_many` below. It should be of type `int`, representing the *number* of intervals, not the proportion or percentage.

In [None]:
how_many = ...
how_many

In [None]:
grader.check("q3_6")

**Question 3.7.** We also created 90%, 96%, and 99% confidence intervals from one sample (shown below), but forgot to label which confidence intervals were which! Match the interval to the percent of confidence the interval represents and assign your choices (either 1, 2, or 3) to variables `ci_90`, `ci_96`, and `ci_99`, corresponding to the 90%, 96%, and 99% confidence intervals respectively.

***Hint:*** Drawing the confidence intervals out on paper might help you visualize them better.


1. $[0.185, 0.31]$

2. $[0.195, 0.3]$

3. $[0.175,  0.325]$




In [None]:
ci_90 = ...
ci_96 = ...
ci_99 = ...
ci_90, ci_96, ci_99

In [None]:
grader.check("q3_7")

**Question 3.8.** Based on the results in `survey`, it seems that `'jasmine'` tea is more popular than `'black'` tea among residents. We would like to construct a range of likely values – that is, a confidence interval – for the difference in popularity, which we define as:

$$\text{(Proportion of residents who prefer jasmine tea)} - \text{(Proportion of residents who prefer black tea)}$$
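For a single sample, the difference defined above can be computed as in the following sketch. The eight responses are made up, and the names `toy_teas`, `toy_jasmine`, `toy_black`, and `toy_difference` are hypothetical; the sketch uses a NumPy array rather than the `survey` DataFrame.

```python
import numpy as np

# Made-up responses from eight hypothetical residents.
toy_teas = np.array(['jasmine', 'black', 'jasmine', 'oolong',
                     'black', 'jasmine', 'matcha', 'Thai'])

# The mean of a True/False array is the proportion of True values.
toy_jasmine = np.mean(toy_teas == 'jasmine')  # 3 of 8
toy_black = np.mean(toy_teas == 'black')      # 2 of 8

toy_difference = toy_jasmine - toy_black
print(toy_difference)  # 0.125
```

A positive difference means jasmine was preferred more often than black in that sample; a negative difference means the opposite.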

Create a function, `differences_in_resamples`, that creates **1000 bootstrapped resamples of the original survey data** in the `survey` DataFrame, computes the difference in proportions for each resample, and returns an array of these differences. Store your bootstrapped estimates in an array called `boot_differences` and plot a histogram of these estimates.

***Hints:*** 
- Use your code from Question 3.2 as a starting point.
- To plot your histogram, you'll first need to create a DataFrame with one column, whose entries are the values in `boot_differences`. 

In [None]:
def differences_in_resamples():
    ...

boot_differences = ...

# Plot a histogram of boot_differences.

In [None]:
grader.check("q3_8")

**Question 3.9.** Compute an approximate 95% confidence interval for the difference in proportions. Assign the lower and upper bounds of the interval to `diff_lower_bound` and `diff_upper_bound`, respectively.

In [None]:
diff_lower_bound = ...
diff_upper_bound = ...

# Print the confidence interval:
print("Bootstrapped 95% confidence interval for the difference in popularity between jasmine tea and black tea:\n[{:f}, {:f}]".format(diff_lower_bound, diff_upper_bound))

In [None]:
grader.check("q3_9")

**Question 3.10.** In this question, you computed two 95% confidence intervals:
- In Question 3.3, you found a 95% confidence interval for the proportion of residents who prefer `'jasmine'` tea among the six tea options. Let's call this the "jasmine CI."
- In Question 3.9, you found a 95% confidence interval for the difference between the proportion of residents who prefer `'jasmine'` tea and the proportion of residents who prefer `'black'` tea. Let's call this the "difference CI." 

Choose how to best fill in the blanks to describe the widths of these two confidence intervals. Set `q3_10` to either 1, 2, 3, or 4.

>The jasmine CI is ________________________ than the difference CI because we have a ________________________ for a single unknown parameter than the difference between two unknown parameters.

1. wider; more accurate guess
1. narrower; more accurate guess
1. wider; less accurate guess
1. narrower; less accurate guess

In [None]:
q3_10 = ...

In [None]:
grader.check("q3_10")

## 4. Need a Lyft? 🚗
Sofia is planning on traveling to Boston this summer. Since she won't have a car there, she's planning on using Uber or Lyft, two popular ride-sharing apps. Let's compare the cost of these ride-sharing apps to help Sofia save some money!

Our [dataset](https://www.kaggle.com/datasets/ravi72munde/uber-lyft-cab-prices/data) contains a **sample** of all Uber and Lyft rides in Boston. We have information on the `'app'` (Lyft or Uber), the `'mode'` (level of service), the `'destination'` and `'source'` neighborhoods, the `'price'` in dollars, and the `'distance'` in miles. Run the cell below to load this data into the DataFrame `rideshare`.

In [None]:
rideshare = bpd.read_csv('data/rideshare.csv')
rideshare

**Question 4.1.** Let's start by determining the mean price for each of the two ridesharing apps. Create a DataFrame called `uber_lyft_means`, indexed by `'app'`, with one column called `'price'` that contains the mean price for each ridesharing app. Sort `uber_lyft_means` in ascending order of `'price'`.

***Hint:*** This takes just one line of code.

In [None]:
uber_lyft_means = ...
uber_lyft_means

In [None]:
grader.check("q4_1")

**Question 4.2.** Based on the data we have, one rideshare app appears to be cheaper than the other. But the data we have access to is only a sample of all rides, and thus the mean prices we computed above are only sample statistics, not  population parameters. Let's now extend each of our estimates to create a confidence interval for the mean price of **all** rides on each app. We'll start with Uber.

Produce 1,000 bootstrapped estimates for the mean price of **all** Uber rides. Store the estimates in the `uber_means` array. Then, use the `uber_means` array to calculate an approximate **99% confidence interval** for the true mean price of all Uber rides. Assign the endpoints of your interval to `lower_bound` and `upper_bound`.

***Hint:*** Make sure to query **before** resampling!

In [None]:
uber_means = ...


lower_bound = ...
upper_bound = ...

# Display the estimates in a histogram.
bpd.DataFrame().assign(Estimated_Mean_Price=uber_means).plot(kind='hist', density=True, ec='w', figsize=(10, 5), title='Uber');
plt.plot([lower_bound, upper_bound], [0, 0], color='gold', linewidth=10, label='99% confidence interval');

# Don't change what's below (though you will need to copy and change it in 4.3).
app_name = 'Uber'
f'A 99% confidence interval for the average {app_name} ride price is [{lower_bound}, {upper_bound}].'

In [None]:
grader.check("q4_2")

**Question 4.3.** Now we want to calculate the corresponding confidence interval for Lyft. Instead of copying our code from Question 4.2 and changing it to work for Lyft, let's write a more general function that works for _both_ Uber and Lyft. 

Create a function called `app_and_hist`, which takes in the name of a ridesharing app as a string, and:
1. **Plots the histogram** of 1,000 bootstrapped estimates for that app's mean price.
2. **Returns** a string describing the approximate 99% confidence interval for that app's mean price, formatted in the same way as the string displayed for Uber in Question 4.2. 

***Notes:*** 
- Make sure your function both plots a histogram and **returns** a string. For example, `app_and_hist('Uber')` should return a string that starts with `'A 99% confidence interval for the average Uber ride price is'`. It's okay if you see the return string displayed before the plot.
- The string displayed at the end of Question 4.2 was created using a feature of Python called f-strings. You'll need to copy and change that f-string expression. Read [this article](https://realpython.com/python-f-strings/#simple-syntax) for more details about f-strings if you're interested.
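As a quick illustration of how an f-string substitutes values into text, consider the sketch below. The endpoint values and variable names are hypothetical, chosen only to show the formatting; they are not results from the rideshare data.

```python
# Hypothetical values, chosen only to show the formatting.
example_app = 'Lyft'
example_low = 16.5
example_high = 17.25

# Expressions inside curly braces are evaluated and inserted into the string.
example_string = f'A 99% confidence interval for the average {example_app} ride price is [{example_low}, {example_high}].'
print(example_string)
# A 99% confidence interval for the average Lyft ride price is [16.5, 17.25].
```

Because the app name is a variable inside the braces, the same expression produces the right sentence for either app.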

In [None]:
def app_and_hist(app_name):
    ...
    
# Example calls to the function. Don't change the lines below.
uber_string = app_and_hist('Uber')
print(uber_string)
lyft_string = app_and_hist('Lyft')
print(lyft_string)

In [None]:
grader.check("q4_3")

## Finish Line: Almost there, but make sure to follow the steps below to submit! 🏁

**_Citations:_** Did you use any generative artificial intelligence tools to assist you on this assignment? If so, please state, for each tool you used, the name of the tool (ex. ChatGPT) and the problem(s) in this assignment where you used the tool for help.

<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">

Please cite tools here.

<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">

To submit your assignment:

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells.
2. Read through the notebook to make sure everything is fine and all tests passed.
3. Run the cell below to run all tests, and make sure that they all pass.
4. Download your notebook using `File -> Download as -> Notebook (.ipynb)`, then upload your notebook to Gradescope.
5. Stick around while the Gradescope autograder grades your work. Make sure you see that all tests have passed on Gradescope.
6. Check that you have a confirmation email from Gradescope and save it as proof of your submission. 

In [None]:
grader.check_all()