# Homework 5: Simulation, Sampling, and Hypothesis Testing

## Due Tuesday, November 9 at 11:59pm PST

Welcome to Homework 5! This homework will cover:
- Simulations: see [DDS 5.1](https://eldridgejm.github.io/dive_into_data_science/05-probability_and_simulation/probability_and_simulation.html)
- Populations, Samples, Parameters, and Statistics: see [DDS 6.1](https://eldridgejm.github.io/dive_into_data_science/06-populations_and_samples/1_populations_and_samples.html) and [DDS 6.2](https://eldridgejm.github.io/dive_into_data_science/06-populations_and_samples/2_parameters_and_statistics.html)
- Models and Hypothesis Testing: see DDS 7.1

### Instructions

This assignment is due Tuesday, November 9 at 11:59pm. You are given six slip days throughout the quarter to extend deadlines. See the syllabus for more details. With the exception of using slip days, late work will not be accepted unless you have made special arrangements with your instructor.

**Important**: For homeworks, the `otter` tests don't usually tell you that your answer is correct. More often, they help catch careless mistakes. It's up to you to ensure that your answer is correct. If you're not sure, ask someone (not for the answer, but for some guidance about your approach). These are great questions for office hours (schedule on Canvas) or your team's chatroom on Campuswire. Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. 

In [None]:
# please don't change this cell, but do make sure to run it
import babypandas as bpd
import matplotlib.pyplot as plt
import numpy as np
import otter
grader = otter.Notebook()

## 1. Powerball, Continued 🎱

In the last homework, we calculated the probability of winning the jackpot on a Powerball lottery ticket, and found that it was quite low. (Surprise!)

In [None]:
jackpot_chance = (1/77)*(1/76)*(1/75)*(1/74)*(1/73)*(1/26) 
jackpot_chance

In this question, we'll approach the same question not using math, but using simulation. 

It's important to remember how this lottery game works:
- When you buy a Powerball ticket, you pick five different numbers, one at a time, from 1 to 77. Then you separately pick a number from 1 to 26. These are **your numbers**. For example, you may select (67, 35, 10, 3, 52, 23). This is a sequence of six numbers - order matters!
- The **winning numbers** are chosen by somebody drawing five balls, one at a time without replacement, from a collection of white balls numbered 1 to 77. Then they draw a red ball (the Powerball) from a collection of red balls numbered 1 to 26. For example, maybe the winning numbers are (35, 42, 10, 8, 73, 23).

We’ll assume for this problem that in order to win the biggest prize (the jackpot), all six of your numbers need to match the winning numbers and be in the **exact same order**. In other words, your entire sequence of numbers must be exactly the same. However, if some numbers in your sequence match up with the corresponding number in the winning sequence, you can still win some money. In the example below, two of your numbers are considered to match two of the winning numbers. Notice that although both sequences include the number 35, since they are in different positions, that's not considered a match.

- your numbers: (67, 35, **10**, 3, 52, **23**)
- winning numbers: (35, 42, **10**, 8, 73, **23**)
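
The position-by-position comparison in this example can be checked with a short sketch. It assumes both sequences are stored as NumPy arrays; the variable names are illustrative and not part of the assignment.

```python
import numpy as np

# The example sequences from above, stored as arrays (names are illustrative).
your_numbers = np.array([67, 35, 10, 3, 52, 23])
winning_numbers = np.array([35, 42, 10, 8, 73, 23])

# Comparing two arrays with == works position by position,
# so the two 35s (in different positions) do not count as a match.
is_match = your_numbers == winning_numbers

# np.count_nonzero counts the True entries.
n_matches = np.count_nonzero(is_match)
print(n_matches)  # 2
```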

**Question 1.1.** Write a function called `simulate_one_ticket`. It should take no arguments, and it should return an array with 6 random numbers. The first five numbers should all be randomly chosen (without replacement) from between 1 and 77. The last number should be between 1 and 26.

In [None]:
def simulate_one_ticket():
    """Simulate one Powerball lottery ticket."""
    ...

In [None]:
grader.check("q1_1")

**Question 1.2.** It's draw day, and you checked the winning numbers, which happened to be (77, 2, 43, 11, 29, 3). You didn't win the jackpot, and you are quite sad. 

Suppose you want to remind yourself how unlikely it is to win a jackpot. Call the function `simulate_one_ticket` 100,000 times. (It would cost at least $500,000 if you were to buy that many! It's pretty nice to be able to simulate this experiment instead of doing it in real life). In your 100,000 tickets, how many times did you win the jackpot? Assign your answer to `count_jackpot`.

*Hint*: Try it first with only buying 10 tickets. Once you are sure you have that figured out, change it to 100,000 tickets. It will take a little while (around a minute) for Python to perform the calculations when you are buying 100,000 tickets.

*Hint 2*: You'll have to count how many of the numbers you chose match the numbers that were drawn. One way to do this involves [`np.count_nonzero`](https://numpy.org/doc/stable/reference/generated/numpy.count_nonzero.html).

In [None]:
count_jackpot = ...
...
count_jackpot

In [None]:
grader.check("q1_2")

Remember, the mathematical probability of winning the jackpot is quite low, on the order of $10^{-11}$. That's a lot lower than 1 in 100,000, which is $10^{-5}$.
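
To put those two numbers side by side, here is a quick back-of-the-envelope calculation of how many jackpots you would expect among 100,000 independent tickets. The names `n_tickets` and `expected_jackpots` are introduced only for this illustration.

```python
# Same probability as computed at the start of this question.
jackpot_chance = (1/77)*(1/76)*(1/75)*(1/74)*(1/73)*(1/26)

# Expected number of jackpots = number of tickets * chance per ticket.
n_tickets = 100_000
expected_jackpots = n_tickets * jackpot_chance
print(expected_jackpots)  # about 1.6e-06, far less than one jackpot
```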

**Question 1.3.** Suppose you can win a smaller prize if some of your numbers match the corresponding winning numbers, as described above. Again simulate buying 100,000 tickets, but this time keep track of **the greatest number of matches achieved by any of your tickets**, and assign this to `wins`. 

For example, if 90,000 of your tickets matched 1 winning number and 10,000 of your tickets matched 2 winning numbers, then you would set `wins` to 2. If you happened to win the jackpot on one of your tickets, you would set `wins` to 6.

In [None]:
wins = ...
...
wins

In [None]:
grader.check("q1_3")

**Question 1.4.** Suppose one Powerball lottery ticket costs $5.

The Powerball commercial on TV promises you will never lose money because of the following generous prizes:

- Win $15 with a 1-number match

- Win $100 with a 2-number match

- Win $1,000 with a 3-number match

- Win $10,000 with a 4-number match

- Win $100,000 with a 5-number match

- Win $1,000,000 with a 6-number match (Jackpot!)

If you had the money to buy 100,000 tickets, how much money would you win from buying these tickets, accounting for the cost of buying the tickets? Assign the amount to `winning_money`. Note that a positive value means you won money overall, and a negative value means you lost money overall. Do you believe the commercial's claims?

*Hint:* If you used a `for`-loop in 1.3, you can likely write a similar `for`-loop in this part. If you determined the number of tickets with 1 match, 2 matches, 3 matches, etc. in 1.3, then you don't need a `for`-loop at all in this part.
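
As a rough sketch of what "accounting for the cost" means, the toy calculation below uses ten made-up tickets rather than simulated ones. The array `matches_per_ticket` and the other names are hypothetical, and indexing one array with another is just one possible way to look up prizes.

```python
import numpy as np

ticket_cost = 5
# prizes[k] is the payout for a ticket with k matches (0 matches pays nothing).
prizes = np.array([0, 15, 100, 1_000, 10_000, 100_000, 1_000_000])

# Hypothetical match counts for 10 tickets (made up, not simulated).
matches_per_ticket = np.array([0, 0, 1, 0, 0, 2, 0, 0, 0, 1])

total_prizes = prizes[matches_per_ticket].sum()      # 15 + 100 + 15 = 130
total_cost = ticket_cost * len(matches_per_ticket)   # 5 * 10 = 50
net = total_prizes - total_cost
print(net)  # 80
```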

In [None]:
winning_money = ...
...
winning_money

In [None]:
grader.check("q1_4")

## 2. Sampling with NBA Data 🏀

In this question, we'll use our familiar player and salary data from the 2015-16 NBA season to get some practice with sampling. Run the cells below to load the player and salary data, which come from different DataFrames, and to merge them into a single DataFrame, indexed by player.

In [None]:
player_data = bpd.read_csv("data/player_data.csv").set_index('Name')
salary_data = bpd.read_csv("data/salary_data.csv").set_index('PlayerName')
full_data = salary_data.merge(player_data, left_index=True, right_index=True)
full_data

We'll start by creating a function called `compute_statistics` that takes as input a DataFrame with two columns, `'Points'` and `'Salary'`, and then:
- draws a histogram of `'Points'`,
- draws a histogram of `'Salary'`, and
- returns a two-element array containing the mean `'Points'` and mean `'Salary'`.


Run the cell below to define the `compute_statistics` function, and a helper function called `histograms`. Don't worry about how this code works, and please don't change anything.

In [None]:
# Don't change this cell, just run it.

def histograms(df):
    points = df.get('Points').values
    salaries = df.get('Salary').values
    
    a = plt.figure(1)
    plt.hist(points, density=True, alpha=0.5, color='blue', ec='w', bins=np.arange(0, 2500, 50))
    plt.title('Distribution of Points')
    s = plt.figure(2)
    plt.hist(salaries, density=True, alpha=0.5, color='blue', ec='w', bins=np.arange(0, 3.5 * 10**7, 2.5 * 10**6))
    plt.title('Distribution of Salaries')
    
def compute_statistics(points_and_salary_data, draw=True):
    if draw:
        histograms(points_and_salary_data)
    points = np.average(points_and_salary_data.get('Points').values)
    salary = np.average(points_and_salary_data.get('Salary').values)
    avg_points_salary_array = np.array([points, salary]) 
    return avg_points_salary_array

We can use this `compute_statistics` function to show the distribution of `'Points'` and `'Salary'` and compute their means, for any collection of players. 

Run the next cell to show these distributions and compute the means for all NBA players. Notice that the array containing the mean `'Points'` and mean `'Salary'` values is displayed before the histograms, and the numbers are given in scientific notation.

In [None]:
full_stats = compute_statistics(full_data)
full_stats

Now, imagine that instead of having access to the full *population* of NBA players, we had only gotten data on a smaller subset of the players, or a *sample*.  For 492 players, it's not so unreasonable to expect to see all the data, but usually we aren't so lucky.  Instead, we often make *statistical inferences* about a large underlying population using a smaller sample.

A **statistical inference** is a statement about some characteristic of the underlying population, such as "the average salary of NBA players in 2014 was $3 million".  You may have heard the word "inference" used in other contexts.  It's important to keep in mind that statistical inferences can be wrong.

A common strategy for inference using samples is to estimate parameters of the population by computing the same statistics on a sample.  This strategy sometimes works well and sometimes doesn't.  The degree to which it gives us useful answers depends on several factors.
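
The difference between a parameter and a statistic can be seen in a small sketch. The population below is randomly generated for illustration and is not the NBA data; all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up "population" of 500 salaries in dollars (not the NBA data).
population = rng.integers(500_000, 20_000_000, size=500)

# Parameter: computed from the whole population.
population_mean = population.mean()

# Statistic: the same quantity computed from a random sample of 50.
sample = rng.choice(population, size=50, replace=False)
sample_mean = sample.mean()

print(population_mean, sample_mean)
```

The two printed values will usually be close but not equal, which is why the way the sample was gathered matters so much.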

One very important factor in the utility of samples is how they were gathered. Let's look at some different sampling strategies.

### Convenience sampling
One sampling methodology, which is **generally a bad idea**, is to choose players who are somehow convenient to sample.  For example, you might choose players from a team that's near your house, since it's easier to survey them.  This is called, somewhat pejoratively, *convenience sampling*.

**Question 2.1.**  Suppose you live in New Jersey and you only survey players from the three closest teams:
- New York Knicks (`'NYK'`)
- Brooklyn Nets (`'BRK'`)
- Philadelphia 76ers (`'PHI'`)

Assign `convenience_sample` to a subset of `full_data` that contains only the rows for players on one of these three teams.

In [None]:
convenience_sample = ...
convenience_sample

In [None]:
grader.check("q2_1")

**Question 2.2.** Assign `convenience_stats` to an array of the mean `'Points'` and mean `'Salary'` of your convenience sample.  Since they're computed on a sample, these are called *sample means*. 

*Hint*: It's fine to draw two histograms as well as assign the variable `convenience_stats`.

In [None]:
convenience_stats = ...
convenience_stats

In [None]:
grader.check("q2_2")

Next, we'll compare the distribution of points in our convenience sample with the distribution of points for all players in our dataset.

In [None]:
# just run this cell, don't change it
def compare_points(first, second, first_title, second_title):
    """Compare the points in two DataFrames."""
    bins = np.arange(0, 2500, 50)
    first.plot(kind='hist', y='Points', bins=bins, density=True, ec='w', color='blue', alpha=0.5)
    plt.title('Points Distribution for ' + first_title)
    second.plot(kind='hist', y='Points', bins=bins, density=True, ec='w', color='blue', alpha=0.5)
    plt.title('Points Distribution for ' + second_title)

compare_points(full_data, convenience_sample, 'All Players', 'Convenience Sample')

**Question 2.3.** From what you see in the histograms above, does the convenience sample give us an accurate picture of points for the full population of NBA players? Would you expect it to, in general? Assign either 1, 2, 3, or 4 to the variable `sampling_q3` below.
1. Yes. The sample is large enough, so it is an accurate representation of the population.
2. No. The sample is too small, so it won't give us an accurate representation of the population.
3. No. But this was just an unlucky sample; normally this would give us an accurate representation of the population.
4. No. This type of sample doesn't give us an accurate representation of the population.

In [None]:
sampling_q3 = ...
sampling_q3

In [None]:
grader.check("q2_3")

For some context, check out how the teams in the New Jersey area did during the 2015-2016 season. Here are the [official standings](https://www.espn.com/nba/standings/_/season/2016).

### Simple random sampling
A more principled approach is to sample uniformly at random from the players.  If we ensure that each player is selected at most once, this is a *simple random sample without replacement*, sometimes abbreviated to "simple random sample" or "SRS".  Imagine writing down each player's name on a card, putting the cards in a hat, and shuffling the hat.  To sample, pull out cards one by one and set them aside, stopping when the specified *sample size* is reached.
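
The cards-in-a-hat procedure can be sketched with `np.random.choice`. The names below are hypothetical stand-ins for the cards, not players from the dataset.

```python
import numpy as np

# Hypothetical names standing in for the cards in the hat.
names = np.array(['Ana', 'Ben', 'Cy', 'Dee', 'Eli', 'Fay', 'Gus', 'Hal'])

# replace=False means a name that has been drawn is set aside,
# so no name can appear twice in the sample.
srs = np.random.choice(names, 3, replace=False)
print(srs)
```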

We've produced two samples of `salary_data` in this way: `small_srs_salary.csv` and `large_srs_salary.csv` contain, respectively, a sample of size 67 and a larger sample of size 150.  



Now we'll run the same analyses on the small simple random sample, the large simple random sample, and the convenience sample. The `load_data` function below loads a salary table and merges it with `player_data`. The subsequent code draws the histograms and computes the means for `'Points'` and `'Salary'`.

In [None]:
# Don't change this cell, but do run it.

def load_data(salary_file):
    return player_data.merge(bpd.read_csv(salary_file), left_index=True, right_on='PlayerName')

small_srs_data = load_data('data/small_srs_salary.csv')
large_srs_data = load_data('data/large_srs_salary.csv')

small_stats = compute_statistics(small_srs_data, draw=False);
large_stats = compute_statistics(large_srs_data, draw=False);
convenience_stats = compute_statistics(convenience_sample, draw=False);

print('Full data stats:                 ', full_stats)
print('Small SRS stats:', small_stats)
print('Large SRS stats:', large_stats)
print('Convenience sample stats:        ', convenience_stats)

color_dict = {
    'small simple random': 'blue',
    'large simple random': 'green',
    'convenience': 'orange'
}

plt.subplots(3, 2, figsize=(15, 15), dpi=100)
i = 1

for df, name in zip([small_srs_data, large_srs_data, convenience_sample], color_dict.keys()):
    plt.subplot(3, 2, i)
    i += 2
    plt.hist(df.get('Points'), density=True, alpha=0.5, color=color_dict[name], ec='w', 
             bins=np.arange(0, 2500, 50));
    plt.title(f'Points histogram for {name} sample')

i = 2
for df, name in zip([small_srs_data, large_srs_data, convenience_sample], color_dict.keys()):
    plt.subplot(3, 2, i)
    i += 2
    plt.hist(df.get('Salary'), density=True, alpha=0.5, color=color_dict[name], ec='w', 
             bins=np.arange(0, 3.5 * 10**7, 2.5 * 10**6));
    plt.title(f'Salary histogram for {name} sample')

#plt.show()

### Producing simple random samples
Often it's useful to take random samples even when we have a larger dataset available.  One reason is that it can help us understand how inaccurate other samples are.

DataFrames provide the method `sample()` for producing simple random samples.  Note that its default is to sample **without** replacement. 
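
As a small illustration of `sample()`, the sketch below uses a made-up table. It is written with `pandas` on the assumption that `babypandas` behaves the same way for these arguments; the table `toy` is hypothetical.

```python
import pandas as pd

# A tiny made-up table; pandas is used here on the assumption that
# the babypandas sample() method behaves the same way for these arguments.
toy = pd.DataFrame({
    'Points': [10, 250, 900, 1200, 40, 610],
    'Salary': [1e6, 2e6, 8e6, 1.5e7, 9e5, 4e6],
})

# Three rows drawn without replacement (the default).
print(toy.sample(3))

# Passing replace=True allows the same row to be drawn more than once.
print(toy.sample(3, replace=True))
```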

**Question 2.4.** Produce a simple random sample *without replacement* of size 67 from `full_data`. Run your analysis on it again, and store the resulting array of mean `'Points'` and mean `'Salary'` in `my_small_stats`.

In [None]:
my_small_stats = ...
my_small_stats

Run the cell containing `my_small_stats` several times to get new samples and new sample means.

Are your results similar to those in the small sample we provided you? Do things change a lot across separate samples?  Assign either 1, 2, 3, or 4 to the variable `sampling_q4` below.
1. The results are very different from the small sample, and don't change at all across separate samples.
2. The results are not at all different from the small sample, and change a bit across separate samples.
3. The results are somewhat different from the small sample, and change a bit across separate samples.
4. The results are not at all different from the small sample, and don't change at all across separate samples.

In [None]:
sampling_q4 = ...
sampling_q4

In [None]:
grader.check("q2_4")

**Question 2.5.** Similarly, create a simple random sample *without replacement* of size 175 from `full_data` and store an array of the sample's mean `'Points'` and mean `'Salary'` in `my_large_stats`.

In [None]:
my_large_stats = ...
my_large_stats

Run the cell containing `my_large_stats` many times.

Do the histograms and mean statistics seem to change more or less across samples of this size than across samples of size 67? And for which variable are the sample means and histograms closer to their true values, `'Points'` or `'Salary'`? Assign either 1, 2, 3, 4, or 5 to the variable `sampling_q5` below.

Is this what you expected to see?
1. The statistics change *less* across samples of this size than across smaller samples. The statistics are closer to their true values for `'Points'` than they are for `'Salary'`.
2. The statistics change *less* across samples of this size than across smaller samples. The statistics are closer to their true values for `'Salary'` than they are for `'Points'`.
3. The statistics change *more* across samples of this size than across smaller samples. The statistics are closer to their true values for `'Points'` than they are for `'Salary'`.
4. The statistics change *more* across samples of this size than across smaller samples. The statistics are closer to their true values for `'Salary'` than they are for `'Points'`.
5. The statistics change an *equal amount* across samples of this size as across smaller samples. The statistics for `'Points'` and `'Salary'` are *equally close* to their true values.

In [None]:
sampling_q5 = ...
sampling_q5

In [None]:
grader.check("q2_5")

## 3. Google Play Store Apps 📱

In this problem, we will work with the data set of [Google Play Store Apps](https://www.kaggle.com/lava18/google-play-store-apps) that we used in Lab 3.  This time, we'll pretend you're an app developer, looking to draw some insight from this data set to help you make a better app.

In [None]:
# Run this cell to load the data
playstore_apps = bpd.read_csv('data/googleplaystore.csv')
playstore_apps

As a reminder, each row in the DataFrame is an app. We've cleaned the data a bit since the last time we saw it, removing rows with missing values.

Let's set the index of the DataFrame to the app's name in order to be able to interpret what the rows represent more easily. 

In [None]:
playstore_apps = playstore_apps.set_index('App')
playstore_apps

Suppose that as an app developer, you want to address the question of whether there is any significant difference in rating between free apps and paid apps.  The only columns of data we'll need to answer this question are `'Rating'` and `'Type'`, so let's keep just those by using `.get()` and passing in a `list` of columns.

In [None]:
playstore_apps = playstore_apps.get(['Rating', 'Type'])
playstore_apps

**Question 3.1.** You want to determine if free or paid apps have a higher average rating. Calculate the difference between the mean rating for all `Paid` apps and the mean rating for all `Free` apps (do `Paid` minus `Free`) and store the result in variable `true_difference`.

In [None]:
true_difference = ...
true_difference 

In [None]:
grader.check("q3_1")

**Question 3.2.** Create a function that takes as input a DataFrame of apps with columns `'Rating'` and `'Type'`, and returns the difference between the mean rating for `Paid` apps and the mean rating for `Free` apps (again, calculate `Paid` minus `Free`).

When called on input `playstore_apps`, the output should be the same as `true_difference`. However, this function should work on *any* DataFrame of apps, provided there are at least some `Paid` apps and some `Free` apps.

In [None]:
def mean_diff(app_df):
    ...

mean_diff(playstore_apps)

In [None]:
grader.check("q3_2")

**Question 3.3.** Let's suppose that as an app developer, you don't know the value of `true_difference` because it is calculated from all the apps in the data set, while you can only load 1000 apps at a time. You want to look at 1000 random apps, sampled without replacement, to get a representative sample of the full data set. Write a function called `pick_1000` that simulates this. Specifically, the function should take *no* arguments and should return a DataFrame of 1000 randomly selected apps.

In [None]:
def pick_1000():
    """Randomly select 1000 different apps from Google Play Store."""
    ...
pick_1000()

In [None]:
grader.check("q3_3")

Now, even without access to the full `playstore_apps` data set, you can get an idea of the difference between mean ratings of `Paid` and `Free` apps, based on the 1000 in your random sample. The `mean_diff` function you wrote should be able to calculate the difference in mean ratings for our random sample. 

In [None]:
mean_diff(pick_1000())

But what if you'd picked a different random 1000 apps for your sample? Surely, you'd get a different answer, but how different? Run the cell above a few times. You should get different results each time. If not, check for a mistake in your `mean_diff` function or your `pick_1000` function.

To answer this question of how the mean difference changes as our sample changes, let's repeat our experiment.
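
The general pattern of repeating an experiment and recording one statistic per repetition can be sketched as follows. The population here is randomly generated for illustration and is not the app data; the sample size and number of repetitions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up population of 2,000 ratings between 1 and 5 (not the app data).
population = rng.uniform(1, 5, size=2_000)

sample_means = np.array([])
for _ in range(200):
    # Draw a new sample of 100 without replacement and record its mean.
    sample = rng.choice(population, size=100, replace=False)
    sample_means = np.append(sample_means, sample.mean())

# The spread of the recorded statistics shows how much a single
# sample's answer can vary.
print(sample_means.min(), sample_means.max())
```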

**Question 3.4.** 500 times, randomly select 1000 apps and calculate the difference of mean ratings between `Paid` and `Free` apps (do `Paid` minus `Free`). Record the 500 differences of mean ratings in an array called `experiment_differences`.

*Hint*: Feel free to use previously defined functions. First try simulating 10 trials. Once you are sure you have that figured out, change it to 500 trials. It may take about a minute to run with 500 trials.

In [None]:
experiment_differences = ...

In [None]:
grader.check("q3_4")

**Question 3.5.** When you ran your experiment 500 times, you got 500 different estimates for the difference of mean ratings between `Paid` and `Free` apps, stored in `experiment_differences`. These estimates are statistics because they come from samples. Create a density histogram showing the distribution of these statistics.

<!-- BEGIN QUESTION -->

In [None]:
# Create your histogram here.

<!-- END QUESTION -->



**Question 3.6.** Compute the average value of the 500 statistics in `experiment_differences` and store your average in `approximate_difference`. This average is an estimate of the difference in mean ratings for the full data set, which is the population parameter you're trying to approximate here.

In [None]:
approximate_difference = ...
approximate_difference

In [None]:
grader.check("q3_6")

**Question 3.7.**  Now you have an estimate for the difference in mean ratings between `Paid` and `Free` apps, but you'd like to know how good of an estimate it is. How far is `approximate_difference`, calculated from your sample statistics, from `true_difference`, the parameter calculated from the full `playstore_apps` population? Compute the absolute difference between the two values and store it in the variable `error`. 

In [None]:
error = ...
error

In [None]:
grader.check("q3_7")

If you'd like to explore this some more, try taking samples of different sizes, and calculating the error in your corresponding estimates. Do estimates derived from bigger samples tend to be more accurate?

## 4. Among Us ඞ

Among Us is an online game that exploded in popularity last year during the COVID-19 quarantine. It is best described as a social deduction game. The goal is to identify an Impostor among a group of Crewmates. If you want to learn more about the game (or play - it's free!), check out their [wiki](https://among-us.fandom.com/wiki/Among_Us) or watch gameplay on YouTube. We will provide all information necessary to answer the questions in this section.

In a game of Among Us, you choose a color to assign to your character and then you are randomly assigned to either be a Crewmate or an Impostor. Being the analytical person you are, you notice that there are some colors that are chosen to be an Impostor more often than others. You decide to explore this. Your model is:

<table>
    <tr><th>Player</th><th>Estimated Chance of Impostor</th></tr>
    <tr><td>Red</td><td>13%</td></tr>
    <tr><td>Orange</td><td>10%</td></tr>
    <tr><td>Yellow</td><td>5%</td></tr>
    <tr><td>Green</td><td>11%</td></tr>
    <tr><td>Lime</td><td>5%</td></tr>
    <tr><td>Blue</td><td>9%</td></tr>
    <tr><td>Cyan</td><td>10%</td></tr>
    <tr><td>Purple</td><td>8%</td></tr>
    <tr><td>Pink</td><td>4%</td></tr>
    <tr><td>Black</td><td>10%</td></tr>
    <tr><td>White</td><td>2%</td></tr>
    <tr><td>Brown</td><td>13%</td></tr>
</table>

Let's store these values in an array called `among_us_distribution`, representing our assumptions about the distribution of Impostors across different colors.

In [None]:
among_us_distribution = np.array([0.13, 0.10, 0.05, 0.11, 0.05, 0.09, 0.10, 0.08, 0.04, 0.10, 0.02, 0.13])
among_us_distribution

The color that you select for your character is Pink, so you estimate that you have a 4% chance of being an Impostor in any given game. During a long and boring quarantine, you play 100 games, and you are selected to be an Impostor 8 times. You start to suspect that 4% might be *too low* an estimate, and that your model is wrong.

**Question 4.1.** Using your model in which you have a 4% chance of being Impostor as your null hypothesis, write a simulation that runs 100 games and keeps track of the **absolute difference** between: 
- the number of games in which you are an Impostor, and 
- the number of times you'd expect to be an Impostor in 100 games according to your model (4).

Run your simulation 5000 times. Keep track of the differences in an *array* called `among_us_differences`. 
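
The shape of such a simulation can be sketched on a different, simpler null model: a fair coin flipped 50 times. The numbers and names below are made up for illustration, and `np.random.multinomial` is only one possible way to simulate the trials.

```python
import numpy as np

# Toy null model: a fair coin, flipped 50 times per trial.
n_flips = 50
expected_heads = 25

abs_diffs = np.array([])
for _ in range(1_000):
    # multinomial returns the counts for [heads, tails] in one trial.
    heads = np.random.multinomial(n_flips, [0.5, 0.5])[0]
    abs_diffs = np.append(abs_diffs, abs(heads - expected_heads))

print(abs_diffs[:10])
```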

In [None]:
# With all your simulations, try a small number of repetitions first, then increase it.
n_repetitions = ...

...

# Visualize with a histogram
bpd.DataFrame().assign(Difference = among_us_differences).plot(kind='hist', bins=np.arange(10), density=True, ec='w');

In [None]:
grader.check("q4_1")

**Question 4.2.** Your null hypothesis was that you have a 4% chance of being an Impostor, but you got Impostor 8 times out of 100. Compute the proportion of times in our simulation that we saw an outcome at least as extreme as what we observed in real life. Assign your result to `among_us_p_value`.
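
For reference, here is a sketch of how a p-value of this kind can be computed from an array of simulated statistics. It uses a toy fair-coin setting with made-up numbers (50 flips, 32 heads observed), and `rng.binomial` is used only to generate the toy simulated values quickly.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setting: fair-coin null, 50 flips, 32 heads observed (made-up numbers).
observed_stat = abs(32 - 25)

# 5,000 simulated test statistics under the null.
simulated_heads = rng.binomial(50, 0.5, size=5_000)
simulated_stats = np.abs(simulated_heads - 25)

# Proportion of simulated statistics at least as extreme as the observed one.
p_value = np.count_nonzero(simulated_stats >= observed_stat) / len(simulated_stats)
print(p_value)
```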

In [None]:
among_us_p_value = ...
among_us_p_value

In [None]:
grader.check("q4_2")

**Question 4.3.** Based on the histogram and the p-value, set the variable `among_us_null_hypothesis` below to `True` if you think your model is plausible or `False` if it should be rejected at the standard 0.05 significance level.

In [None]:
among_us_null_hypothesis = ...
among_us_null_hypothesis

In [None]:
grader.check("q4_3")

**Question 4.4.** In this question, we chose as our test statistic the absolute difference between the number of times you were an Impostor and the number of times you expected to be an Impostor. But this is not the only statistic we could have chosen; there are many that could have worked here. From the options below, choose the test statistic that would **not** have worked for this hypothesis test, and save your choice in the variable `among_us_bad_choice`. 

1. The number of times you were an Impostor.
2. The proportion of times you were an Impostor.
3. The absolute difference between the proportion of times you were an Impostor and the proportion of times you expected to be an Impostor.
4. The sum of the number of times you were an Impostor and the number of times you expected to be an Impostor.

In [None]:
among_us_bad_choice = ...

In [None]:
grader.check("q4_4")

## 5. A distribution of M&M's Ⓜ️

It turns out that the different colors of the popular candy [M&M's](https://en.wikipedia.org/wiki/M%26M%27s) are made separately by different machines, and then combined into bags for sale. Each bag contains 6 colors (red, orange, yellow, green, brown, and blue). You're curious whether all six of the colors are equally represented in each bag, so you get to work eating your way through an absurd number of bags of M&M's. 

Your data is below. Each row represents one bag, and each column represents a color. You've counted how many of each color were in each bag.

In [None]:
m = bpd.read_csv("data/m&m.csv")
m

Imagine dumping all 468 bags together and then separating all those M&M's by color to get a distribution of M&M's by color.

The array below represents the actual proportions of M&M's of each color (the order is red, orange, yellow, green, brown, blue). This represents an empirical distribution because it's based on the data that you actually observed.

In [None]:
# You don't have to know how this code works.
# It's summing up each column in the DataFrame above to find the total number of each color, 
# and dividing the total number of each color by the total number of all colors combined.

empirical_color_distribution = m.to_numpy().sum(axis=0) / m.to_numpy().sum()
empirical_color_distribution

Your original belief is that all colors are distributed uniformly overall.

The array below represents the proportions of M&M's of each color, according to your model. This is a theoretical probability distribution because it's based on your theoretical model.

In [None]:
theoretical_color_distribution = np.array([1/6, 1/6, 1/6, 1/6, 1/6, 1/6]) 
theoretical_color_distribution

**Question 5.1.** You'd like to do a hypothesis test to determine whether your model is accurate in describing your data. However, there are six categories of M&M colors: red, orange, yellow, green, brown, and blue, so you're a little confused about which test statistic to use here. Which of the following is **not** a reasonable choice of test statistic? Save your choice in the variable `unreasonable_test_statistic`. You may only choose one.

1. The total variation distance between the theoretical distribution (expected proportion of colors) and the empirical distribution (actual proportion of colors).
2. The sum of the absolute differences between the theoretical distribution (expected proportion of colors) and the empirical distribution (actual proportion of colors).
3. The absolute difference between the sum of the theoretical distribution (expected proportion of colors) and the sum of the empirical distribution (actual proportion of colors).

In [None]:
unreasonable_test_statistic = ...

In [None]:
grader.check("q5_1")

**Question 5.2.** We'll use the TVD, i.e. total variation distance, as our test statistic. Below, complete the implementation of the function `total_variation_distance`, which takes in two distributions (stored as arrays) as arguments and returns the total variation distance between the two arrays.

Then, use the function `total_variation_distance` to determine the TVD between the empirical color distribution you observed and the theoretical color distribution. Assign this TVD to `observed_tvd`.
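
As a reminder, and assuming the usual definition from the course text, the total variation distance between two distributions $p$ and $q$ over the same categories is half the sum of the absolute differences of their proportions. A small worked example with made-up distributions $p = (0.5, 0.3, 0.2)$ and $q = (0.4, 0.4, 0.2)$:

$$\text{TVD}(p, q) = \frac{1}{2} \sum_i |p_i - q_i| = \frac{1}{2}\left(|0.5 - 0.4| + |0.3 - 0.4| + |0.2 - 0.2|\right) = \frac{1}{2}(0.1 + 0.1 + 0) = 0.1$$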

In [None]:
def total_variation_distance(first_distrib, second_distrib):
    '''Computes the total variation distance between two distributions.'''
    ...

observed_tvd = ...
observed_tvd

In [None]:
grader.check("q5_2")

**Question 5.3.** Now we'll calculate 5000 simulated TVDs to see what a typical TVD between an empirical distribution and the theoretical distribution would look like if our model were accurate. Since our real-life data includes 33,335 M&M's, in each trial of the simulation, we'll draw 33,335 M&M's at random from our theoretical distribution, then calculate the TVD between the color distribution of this sample and the theoretical color distribution. Store these 5000 simulated TVDs in an array called `simulated_tvds`.
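
As a rough illustration of what a single trial can look like (a sketch on a made-up model, not the required solution), the code below draws one random sample from a hypothetical three-category distribution with `np.random.multinomial` and converts the resulting counts to proportions. The names `toy_model`, `toy_sample_size`, `toy_counts`, and `toy_empirical` are assumptions introduced only for this example.

```python
import numpy as np

# Hypothetical three-category model (not the M&M model from this question).
toy_model = np.array([0.5, 0.3, 0.2])
toy_sample_size = 1000

# Draw the category counts for one random sample of the given size.
toy_counts = np.random.multinomial(toy_sample_size, toy_model)

# Convert the counts to proportions to get an empirical distribution.
toy_empirical = toy_counts / toy_sample_size
print(toy_empirical)
```

The printed proportions change from run to run, but they always sum to 1 and tend to land close to the model's proportions when the sample is large.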

In [None]:
simulated_tvds = ...

# Visualize the distribution of TVDs with a histogram
bpd.DataFrame().assign(TVD = simulated_tvds).plot(kind='hist', density=True, ec='w');

In [None]:
grader.check("q5_3")

**Question 5.4.** Now, compute the p-value for this hypothesis test: the proportion of times in our simulation that we saw a TVD greater than or equal to our observed TVD. Assign your result to `color_p_value`.

Additionally, conclude whether we should reject the null hypothesis at the standard 0.05 significance level. Set the variable `color_null` below to `True` if you think your model is plausible or `False` if you think the null hypothesis should be rejected.
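
The idea of "the proportion of simulated statistics at least as large as the observed one" can be sketched on made-up numbers. The arrays and names below (`toy_simulated`, `toy_observed`, `toy_p_value`) are hypothetical and unrelated to the M&M data.

```python
import numpy as np

# Hypothetical simulated statistics and observed statistic (made-up numbers).
toy_simulated = np.array([0.01, 0.02, 0.03, 0.04, 0.05])
toy_observed = 0.04

# Proportion of simulated statistics that are at least as large as the observed one.
toy_p_value = np.count_nonzero(toy_simulated >= toy_observed) / len(toy_simulated)
print(toy_p_value)
```

Here two of the five simulated values are at least 0.04, so the proportion is 0.4. Since 0.4 is larger than a 0.05 cutoff, these toy numbers would not lead us to reject the model.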

In [None]:
color_p_value = ...
color_null = ...
color_p_value, color_null

In [None]:
grader.check("q5_4")

## 6. Sentiment Analysis 📰

Using text analysis, data scientists can identify positive, negative, and neutral expressions in text. This is known as sentiment analysis.

Suppose that you consider UCSD's student-run newspaper, [The Guardian](https://ucsdguardian.org/), to be a fairly positive publication. You estimate that about 10% of sentences are negative, 30% are neutral, and 60% are positive. We'll represent that as an array called `guardian_model`.

In [None]:
guardian_model = np.array([0.1, 0.3, 0.6]) 
guardian_model

Let's do a hypothesis test to check if our model is accurate. Suppose you run a sentiment analysis program on the most recent issue of The Guardian, and find that out of 4700 sentences, 2772 are positive (a proportion of $\frac{2772}{4700} \approx 0.590$). 

**Question 6.1.** Complete the implementation of the function `one_simulation`, which has no arguments and returns the proportion of sentences with a positive sentiment, out of 4700 sentences whose sentiments are randomly generated according to our model.

In [None]:
def one_simulation():
    ...

one_simulation()

In [None]:
grader.check("q6_1")

**Question 6.2.** Now, let's conduct 5000 simulations. Create an array named `proportion_diffs` that keeps track of the **absolute difference between the proportion of positive sentences in a given simulation and the expected proportion of positive sentences in our model**. Use the function `one_simulation` from the previous question to do this.
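
One common pattern for collecting a statistic across many repetitions is to start with an empty array and append to it inside a `for` loop. The sketch below shows that pattern on a made-up coin-flipping example. The helper `toy_trial`, the array `toy_diffs`, and the choice of 200 repetitions are all assumptions for illustration and are not part of this assignment.

```python
import numpy as np

def toy_trial():
    # Hypothetical helper: proportion of heads in 100 fair coin flips.
    flips = np.random.choice(['H', 'T'], 100)
    return np.count_nonzero(flips == 'H') / 100

toy_diffs = np.array([])
for _ in np.arange(200):
    # Record how far this trial's proportion is from the expected 0.5.
    toy_diffs = np.append(toy_diffs, abs(toy_trial() - 0.5))

print(len(toy_diffs))
```

After the loop, `toy_diffs` holds one absolute difference per repetition, so its length equals the number of repetitions.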

In [None]:
proportion_diffs = ...

# Visualize with a histogram
bpd.DataFrame().assign(Abs_Diff = proportion_diffs).plot(kind='hist', bins=np.arange(0, 0.04, 0.0025), density=True, ec='w');  

In [None]:
grader.check("q6_2")

**Question 6.3.** Recall that our sentiment analysis program found 2772 out of 4700 sentences to be positive. Use this information to calculate the p-value for this hypothesis test. Assign your result to `guardian_p`.

In [None]:
guardian_p = ...
guardian_p

In [None]:
grader.check("q6_3")

**Question 6.4.** Assign the number (1, 2, or 3) of the best conclusion of this hypothesis test to the variable `guardian_conclusion`.
   
   1. We should reject the null hypothesis because it is unlikely that we'd see the observed number of positive sentences if our model were correct. 
    
   2. We should accept the null hypothesis because our observed data is consistent with our model.
    
   3. We should neither reject nor accept the null hypothesis because we haven't seen any evidence that our model is wrong, but we also don't know that it's accurate.
    

In [None]:
guardian_conclusion = ...

In [None]:
grader.check("q6_4")

# Finish Line

To submit your assignment:

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells.
2. Read through the notebook to make sure everything is fine and all tests passed.
3. Run the cell below to run all tests, and make sure that they all pass.
4. Download your notebook using `File -> Download as -> Notebook (.ipynb)`, then upload your notebook to Gradescope.

In [None]:
grader.check_all()