# Homework 7: Confidence Intervals, the Normal Distribution, and the Central Limit Theorem

## Due Saturday, March 5th at 11:59PM

Welcome to Homework 7! This week, we will cover confidence intervals, normal distributions, and the Central Limit Theorem. You can find additional help on these topics in the following readings:

* [Note 24](https://notes.dsc10.com/06-estimation/2_confidence_intervals.html): Confidence Intervals
* [Note 25](https://notes.dsc10.com/06-estimation/3_ht_using_intervals.html): Hypothesis Tests for Parameters using Confidence Intervals
* [CIT 14.2](https://www.inferentialthinking.com/chapters/14/2/Variability.html): Variability, Standard Deviation, Standard Units, Chebyshev's Bounds
* [CIT 14.3](https://www.inferentialthinking.com/chapters/14/3/SD_and_the_Normal_Curve.html): The Standard Deviation (SD) and the Normal Curve 
* [CIT 14.4](https://www.inferentialthinking.com/chapters/14/4/Central_Limit_Theorem.html): The Central Limit Theorem
* [CIT 14.5](https://www.inferentialthinking.com/chapters/14/5/Variability_of_the_Sample_Mean.html): The Variability of the Sample Mean
* [CIT 14.6](https://inferentialthinking.com/chapters/14/6/Choosing_a_Sample_Size.html): Choosing a Sample Size

### Instructions

This assignment is due Saturday, March 5th at 11:59PM. You are given six slip days throughout the quarter to extend deadlines. See the syllabus for more details. With the exception of using slip days, late work will not be accepted unless you have made special arrangements with your instructor.

**Important**: For homeworks, the `otter` tests don't usually tell you that your answer is correct. More often, they help catch careless mistakes. It's up to you to ensure that your answer is correct. If you're not sure, ask someone (not for the answer, but for some guidance about your approach). These are great questions for office hours (see the schedule on the [Calendar](https://dsc10.com/calendar)) or Campuswire. Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged.

In [None]:
# Please don't change this cell, but do make sure to run it
import babypandas as bpd
import numpy as np

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

import otter
grader = otter.Notebook()

from IPython.display import IFrame
def show_clt_slides():
    src = "https://docs.google.com/presentation/d/e/2PACX-1vTcJd3U1H1KoXqBFcWGKFUPjZbeW4oiNZZLCFY8jqvSDsl4L1rRTg7980nPs1TGCAecYKUZxH5MZIBh/embed?start=false&loop=false&delayms=3000"
    width = 700
    height = 370
    display(IFrame(src, width, height))

## 1. Comparing UCSD Writing Courses ✍️
Suppose it's application season and you're a current high school senior looking to apply to UCSD for data science. (Did you know that UCSD was the [second-most applied to college in the US](https://ucsdnews.ucsd.edu/pressrelease/uc-san-diego-admits-record-52946-first-year-and-transfer-students) for Fall 2021?) Say you're not too good at writing and want to strategically analyze the writing courses of each college, to figure out where you have the best shot at getting a decent grade. Luckily, you're made aware of UCSD's [Course and Professor Evaluations (CAPE)](https://cape.ucsd.edu/), which keeps data on the average grade received in each course for each quarter.

In the DataFrame below, each row corresponds to a particular quarter's offering of a college writing course, except for Muir's writing courses, for which data is unavailable. We have information on the name of each course, the average study hours per week for the quarter, and the average grade for the quarter (on a 4.0 GPA scale). Now it's time to analyze and figure out whether the writing course rumors are true, or if people just like complaining.

In [None]:
writing = bpd.read_csv('data/writing_courses_ucsd.csv', index_col=0)
writing

**Question 1.1.** Let's start by determining the mean study hours and mean grade for each course. Create a table called `course_means`, indexed by `course`, with columns `'Study Hrs/wk'` and `'grades'`, showing the mean study hours and mean grade for each course.

In [None]:
course_means = ...
course_means

In [None]:
grader.check("q1_1")

**Question 1.2.** Marshall's writing course, `'DOC'`, seems to have a pretty high mean grade based on the data in our sample, but this sample doesn't include all course offerings. Produce 1,000 bootstrapped estimates for the mean grade of all offerings of `'DOC'`. Store the estimates in the `doc_averages` array. 

Then, use the `doc_averages` array to calculate an approximate 99% confidence interval for the true mean grade. Assign the corresponding bounds to `lower_bound` and `upper_bound`. 
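As a rough illustration of the percentile bootstrap described above, here is a minimal sketch using only Python's standard library. The grades in `toy_grades` are made up, and the helper names `percentile` and `bootstrap_mean_ci` are hypothetical; in the notebook you would work with the `writing` DataFrame, `babypandas`, and `np.percentile` instead.

```python
import random
import statistics

def percentile(values, p):
    """p-th percentile (0-100), interpolating linearly between sorted values."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)

def bootstrap_mean_ci(sample, repetitions=1000, confidence=99, seed=0):
    """Approximate confidence interval for the population mean via the bootstrap."""
    rng = random.Random(seed)
    boot_means = []
    for _ in range(repetitions):
        # Resample with replacement, keeping the original sample size.
        resample = rng.choices(sample, k=len(sample))
        boot_means.append(statistics.mean(resample))
    tail = (100 - confidence) / 2  # 0.5 on each side for a 99% interval
    return percentile(boot_means, tail), percentile(boot_means, 100 - tail)

# Made-up average grades, for illustration only.
toy_grades = [3.1, 3.4, 2.9, 3.6, 3.3, 3.0, 3.5, 3.2, 3.45, 3.15]
low, high = bootstrap_mean_ci(toy_grades)
print(low, high)
```

The key design choice is that each resample has the same size as the original sample and is drawn with replacement; the interval is then read off from the middle 99% of the resampled means.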

In [None]:
doc_averages = ...

...

# Display the estimates in a histogram.
bpd.DataFrame().assign(Estimated_Average_Grades=doc_averages).plot(kind='hist', density=True, ec='w', figsize=(10, 5));

lower_bound = ...
upper_bound = ...

# Don't change the line below (though you will need to copy and change it in 1.3)
"A 99% confidence interval for {} is [{}, {}]".format("DOC", lower_bound, upper_bound)

In [None]:
grader.check("q1_2")

**Question 1.3.** You want to create a similar histogram for each of the other courses, and also calculate the corresponding confidence intervals. Repeating the process above 4 times would be time-consuming. Create a function called `ci_and_hist`, which takes in a course name as its input, **plots the histogram** for 1,000 bootstrapped estimates for the mean grade **and returns** a string describing the approximate 99% confidence interval for the course's mean grade, formatted in the same way as the string displayed for `'DOC'` in Question 1.2. Start with the code from 1.2 and generalize it to work for any course.

**_Note_:** Make sure your function both plots a histogram and **returns** a string. For example, `ci_and_hist('MMW')` should return a string that starts with `'A 99% confidence interval for MMW is'`.

In [None]:
def ci_and_hist(course_name):
    ...
    
# Example call to the function. Don't change the lines below
test_var = ci_and_hist('MMW')
print(test_var)
other_var = ci_and_hist('WCWP')
print(other_var)

In [None]:
grader.check("q1_3")

**Question 1.4.** Your friend claims that Warren's writing course, `'WCWP'`, is actually not as easy as everyone says. She claims that since our CAPE data is only a sample of the full population of course offerings, the actual mean grade for `'WCWP'`  could be 2.95. Run the cell below to use the `ci_and_hist` function you defined above to calculate an approximate 99% confidence interval for the mean grade in `'WCWP'`. Do you reject her hypothesis at the 0.01 significance level? Assign 1, 2, 3, or 4 to `q1_4`.

1. No, because the confidence interval includes 2.95.
2. No, because the confidence interval doesn't include 2.95.
3. Yes, because the confidence interval includes 2.95.
4. Yes, because the confidence interval doesn't include 2.95.

In [None]:
ci_and_hist("WCWP") 

In [None]:
q1_4 = ...

In [None]:
grader.check("q1_4")

## 2. Testing the Central Limit Theorem: Coin Flips and Midterm Scores 💯


The Central Limit Theorem tells us that the probability distribution of the sum or mean of a large random sample drawn with replacement will be roughly normal, *regardless of the distribution of the population from which the sample is drawn*.

That's a pretty big claim, but the theorem doesn't stop there. It further states that, if we're using the mean as our statistic, the standard deviation of this normal distribution is given by $$\text{SD of Distribution of Possible Sample Means} = \frac{\text{Population SD}}{\sqrt{\text{sample size}}}$$

In other words, suppose we start with *any distribution* that has standard deviation $\sigma$, take a sample of size $n$ (where $n$ is a large number) from that distribution with replacement, and compute the mean of that sample. If we repeat this procedure many times, then those sample means will have a roughly normal distribution with standard deviation $\frac{\sigma}{\sqrt{n}}$.

That's an even bigger claim than the first one! The proof of the theorem is beyond the scope of this class.
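To make the $\frac{\sigma}{\sqrt{n}}$ claim concrete, the sketch below checks it on a small population that is not part of this assignment: the six faces of a fair die. The function name `sample_mean_sd` and the choice of 4,000 repetitions are assumptions made for illustration.

```python
import math
import random
import statistics

rng = random.Random(42)
faces = [1, 2, 3, 4, 5, 6]                # a small, decidedly non-normal population
population_sd = statistics.pstdev(faces)  # population SD of a fair die

def sample_mean_sd(n, repetitions=4000):
    """SD of the means of many samples of size n, drawn with replacement."""
    means = [sum(rng.choices(faces, k=n)) / n for _ in range(repetitions)]
    return statistics.pstdev(means)

for n in [4, 25, 100]:
    predicted = population_sd / math.sqrt(n)  # what the CLT predicts
    observed = sample_mean_sd(n)              # what the simulation gives
    print(n, round(predicted, 3), round(observed, 3))
```

The predicted and observed columns should agree closely, and both shrink as `n` grows.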

Run the cell below to see a short presentation that describes the CLT at a high level.

In [None]:
show_clt_slides()

In this exercise, we will be exploring some data to see the CLT in action.

**Question 2.1.** The CLT only applies when sample sizes are "sufficiently large." This isn't a very precise statement. Is 10 large?  How about 50?  The truth is that it depends both on the original population distribution and just how "normal" you want the result to look. Let's use a simulation to get a feel for how the distribution of the sample mean changes as sample size goes up.

Consider a coin flip. If we say heads is $1$ and tails is $0$, then there's a 50% chance of getting a 1 and a 50% chance of getting a 0, which is definitely not a normal distribution.  The average of several coin tosses is equal to the proportion of heads in those coin tosses, so the CLT should apply if we compute the sample proportion of heads many times.
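As a quick sanity check of the claim that the mean of 0/1 coin flips equals the proportion of heads, here is a tiny sketch with made-up outcomes:

```python
# 1 = heads, 0 = tails; these eight outcomes are made up for illustration.
flips = [1, 0, 0, 1, 1, 0, 1, 1]

proportion_heads = flips.count(1) / len(flips)  # fraction of flips that are heads
mean_of_flips = sum(flips) / len(flips)         # ordinary arithmetic mean

print(proportion_heads, mean_of_flips)
```

Both quantities are computed from the same count of 1s, which is why the CLT for sample means also covers sample proportions.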

Write a function called `simulate_sample_n` that takes in a sample size `n`. It should repeat, 5000 times, the process of:
- simulating `n` flips of a fair coin, and
- counting the proportion of flips that were heads

`simulate_sample_n` should return an array that contains 5000 sample proportions, using the process outlined above.

In [None]:
def simulate_sample_n(n):
    ...
simulate_sample_n(5)

In [None]:
grader.check("q2_1")

<br>
The code below will use the function you just defined to plot the empirical distribution of the sample mean for several different sample sizes.

In [None]:
bins = np.arange(-0.01, 1.05, 0.02)

for sample_size in np.array([2, 5, 10, 20, 50, 100, 200, 400]):
    bpd.DataFrame().assign(**{'Sample_Size:{}'.format(sample_size) : simulate_sample_n(sample_size)}) \
                   .plot(kind='hist', density=True, ec='w', bins=bins, 
                         title=f'Sample Size {sample_size}', legend=None, figsize=(5, 3));
    plt.xlim(-0.01, 1.05)
    plt.ylim(0, 25);

You can see that even the means of samples of 10 items follow a roughly bell-shaped distribution.  A sample of 50 items looks quite bell-shaped. Note also that as sample sizes increased, the distributions of sample proportions became narrower and taller.

Now we will test the second claim of the CLT: That the SD of the sample mean is the SD of the original distribution, divided by the square root of the sample size.

$$\text{SD of Distribution of Possible Sample Means} = \frac{\text{Population SD}}{\sqrt{\text{sample size}}}$$

Below, we will read in the scores of this quarter's Midterm Exam (which we have modified slightly for anonymity). We'll treat this DataFrame as our population, and we'll take samples directly from it. We've computed the standard deviation of the midterm scores for you; you will need to use this at some point in the question.

In [None]:
midterm = bpd.read_csv('data/wi22_midterm_scores.csv')
midterm_std = np.std(midterm.get('Score'))
midterm_std

**Question 2.2.** Write a function called `predict_sd` that takes in a sample size `n`. It returns the predicted standard deviation (according to the CLT) of the sample mean's distribution, for samples of size `n` taken from the midterm data.

**_Hint:_** Use `midterm_std`, and **do not** use `simulate_sample_n`.

In [None]:
def predict_sd(n):
    ...

predict_sd(10)

In [None]:
grader.check("q2_2")

**Question 2.3.** Write a function called `empirical_sd` that takes a sample size `n` as its argument. The function should simulate drawing 1000 samples of size `n` from the midterm scores dataset, with replacement, and it should return the **standard deviation of the distribution of the means** of those 1000 samples.

**_Hint:_** This function will be similar to the `simulate_sample_n` function you wrote earlier.

In [None]:
def empirical_sd(n): 
    sample_means = np.array([])
    ...
    return np.std(sample_means)
empirical_sd(10)

In [None]:
grader.check("q2_3")

The cell below will plot the predicted SDs (computed by your `predict_sd` function) and empirical SDs (computed by your `empirical_sd` function) for the midterm scores data for various sample sizes. It may take a few moments to run.

In [None]:
sd_df = bpd.DataFrame().assign(Sample_Size = np.arange(1, 101, 10))
predicted = sd_df.get('Sample_Size').apply(predict_sd)
empirical = sd_df.get('Sample_Size').apply(empirical_sd)
sd_df = sd_df.assign(Predicted_SD = predicted, Empirical_SD = empirical)
ax = sd_df.plot(kind='scatter',x='Sample_Size', y='Empirical_SD',label = 'Empirical_SD', color = 'green', alpha=.7, s=300, figsize=(10, 5));
ax = sd_df.plot(kind='scatter',x='Sample_Size', y='Predicted_SD',label = 'Predicted_SD', color = 'orange', alpha=.7, s=300, ax=ax)
ax.set_ylabel('Standard Deviation');

It looks like the CLT is pretty accurate!

## 3. UCSD's Housing Crisis 🏠

[The article we linked in Question 1](https://ucsdnews.ucsd.edu/pressrelease/uc-san-diego-admits-record-52946-first-year-and-transfer-students) notes that UCSD admitted a record-breaking number of students this application season, despite the pandemic. According to [data provided by the Registrar's Office](https://blink.ucsd.edu/instructors/courses/enrollment/week3.html), there are now 40,873 students at UCSD. 

Also during the pandemic, UCSD's Housing Dining Hospitality (HDH) announced their decision to remove triple occupancy dorm rooms and replace two-year housing guarantees with a priority system. As a result of the new housing policy, many students were left scrambling to secure housing for the 2021-22 school year. In response, UCSD
- directed students to an off-campus housing website,
- hosted an off-campus housing webinar, and
- offered an option for students to live in local hotels at a discounted rate.

A data scientist at UCSD wanted to see if students were actually satisfied with the solutions UCSD provided. She polled a uniform random sample of all UCSD students, and determined that 210 of the 700 sampled students found the solutions that UCSD provided to be satisfactory.

In [None]:
# Run this cell, but don't change it.
survey = bpd.DataFrame().assign(
    Opinion=np.array(["Satisfactory", "Unsatisfactory"]),
    Count=np.array([210,   490]))
sample_size = survey.get("Count").sum()
survey_results = survey.assign(
    Proportion=survey.get("Count") / sample_size)
survey_results

Next, she used 1000 bootstrap resamples to compute a confidence interval for the proportion of all UCSD students who found the solutions satisfactory.  Run the next cell to see the empirical distribution of `'Satisfactory'` proportions in the 1000 resamples.

Note that we're using `np.random.multinomial` to do the resampling here, since each element of the resample is either 0 (unsatisfactory) or 1 (satisfactory) with known probabilities. This accomplishes the same thing as using `.sample(n, replace=True)` on a DataFrame with one row per surveyed student, but runs much quicker.

In [None]:
boot_proportions = np.array([])
for i in np.arange(1000):
    resample = np.random.multinomial(sample_size, survey_results.get('Proportion')) / sample_size
    boot_proportions = np.append(boot_proportions, resample[0])
bpd.DataFrame().assign(boot_proportions = boot_proportions).plot(kind='hist', density=True, ec='w', bins=np.arange(0.15, 0.45, .01), figsize=(10,5));

Recall, the Central Limit Theorem says

$$\text{SD of Distribution of Possible Sample Means} = \frac{\text{Population SD}}{\sqrt{\text{sample size}}}$$

Furthermore, in a population whose members are 0 and 1, there is a simple formula for the standard deviation of that population:

$$\text{Population SD} = \sqrt{(\text{Proportion of 0s in Population}) \times (\text{Proportion of 1s in Population})}$$

(Figuring out this formula, starting from the definition of the standard deviation, is a fun exercise for those who enjoy algebra.)
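If you'd like to convince yourself numerically before attempting the algebra, the sketch below compares the formula with the SD computed directly from the definition. The population of three 1s and seven 0s is made up for illustration.

```python
import math
import statistics

# A made-up 0-1 population: three 1s and seven 0s.
population = [1] * 3 + [0] * 7

p_ones = population.count(1) / len(population)
p_zeros = 1 - p_ones

sd_from_definition = statistics.pstdev(population)  # population SD, from the definition
sd_from_formula = math.sqrt(p_zeros * p_ones)       # the shortcut formula above

print(sd_from_definition, sd_from_formula)
```

The two values agree up to floating-point rounding.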

**Question 3.1.**
**Without accessing the data in `boot_proportions` in any way**, compute an approximation of the standard deviation of the array `boot_proportions` and assign it to the variable `approximate_sd`.

Instead of using `boot_proportions` directly, use **both** the Central Limit Theorem and the population standard deviation formula above. Since you don't know the true proportions of 0s and 1s in the population, use the proportions in the sample instead (since they're likely to be similar).

In [None]:
approximate_sd = ...
approximate_sd

In [None]:
grader.check("q3_1")

**Question 3.2.** Compute the actual standard deviation of the array `boot_proportions` to verify that your answer to Question 3.1 is approximately correct.

In [None]:
exact_sd = ...
exact_sd

In [None]:
grader.check("q3_2")

**Question 3.3.**
**Still without accessing `boot_proportions` in any way**, compute an approximate 95% confidence interval for the proportion of students that would find UCSD's solutions satisfactory. The cell below `grader.check("q3_3")` draws your interval in green below the histogram of `boot_proportions`; use that to verify that your answer looks right.

**_Hint:_** In the past, we've used `np.percentile` on the array of bootstrapped estimates to find the bounds for the confidence interval. Now, **we're not allowed to use the bootstrapped distribution**, so we can't do it that way. But we don't need to! The Central Limit Theorem tells us that the distribution of the sample mean is roughly normal with a certain standard deviation. We also know that about 95% of the area of the normal distribution falls within 2 standard deviations of the mean.
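The reasoning in the hint can be sketched on a different, hypothetical poll in which 90 of 250 respondents answer "yes". The function name `normal_ci_for_proportion` is an assumption for illustration, and the sample SD stands in for the unknown population SD.

```python
import math

def normal_ci_for_proportion(successes, n, z=2):
    """Approximate CI for a population proportion using the CLT (z=2 gives about 95%)."""
    p_hat = successes / n
    sample_sd = math.sqrt(p_hat * (1 - p_hat))   # SD of a 0-1 sample
    sd_of_sample_mean = sample_sd / math.sqrt(n) # CLT: SD / sqrt(sample size)
    return p_hat - z * sd_of_sample_mean, p_hat + z * sd_of_sample_mean

# Hypothetical poll, not the survey in this question.
low, high = normal_ci_for_proportion(90, 250)
print(round(low, 4), round(high, 4))
```

Note that no resampling is needed: the interval is centered at the sample proportion and extends two SDs of the sample mean in each direction.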

In [None]:
lower_limit = ...
upper_limit = ...

# Your interval is:
[lower_limit, upper_limit]

In [None]:
grader.check("q3_3")

In [None]:
# Run this cell to plot your confidence interval.
bpd.DataFrame().assign(boot_proportions = boot_proportions).plot(kind='hist', density=True, ec='w', bins=np.arange(0.15, 0.45, 0.01), figsize=(10, 5), alpha=0.65);
plt.plot([upper_limit, lower_limit], [0, 0], color='green', linewidth=10, label='Normal CI');
plt.legend();

Your confidence interval should make it clear that we're pretty confident that relatively few students were satisfied by UCSD's solutions. This makes sense, as the proportion of `'Satisfactory'` opinions in the sample was only 0.30. 

Just to confirm that this conclusion is accurate, the data scientist decides to redo the survey with a larger sample to estimate the population proportion of `'Satisfactory'` opinions with greater accuracy. She would be happy if the **standard deviation of the sample mean were only 0.005**.  She'll need to take a new sample that's large enough to achieve that. Polling is time-consuming, so the sample also shouldn't be bigger than necessary.

Instead of making the conservative assumption that the population standard deviation is 0.5 (the largest possible SD of a 0-1 population), she decides to assume that it's equal to the standard deviation of her first sample. That is,

$$\text{Population SD} \approx \sqrt{(\text{Proportion of 0s in Sample}) \times (\text{Proportion of 1s in Sample})}$$

Under that assumption, she computes the smallest sample size necessary in order to be confident that the standard deviation of the sample mean is only 0.005.

**Question 3.4.**
What sample size did she find? Assign your answer to the variable `new_sample_size`, which should be of type `int`.

Use the fact that $$\text{SD of Distribution of Possible Sample Means} = \frac{\text{Population SD}}{\sqrt{\text{sample size}}}$$

**_Hint:_** When converting to `int`, don't round down!
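To see why rounding down is a problem, here is a sketch with hypothetical numbers (a population SD of 0.45 and a target SD of 0.02, neither of which comes from this question). Solving the formula for the sample size gives $n = (\text{Population SD} / \text{target SD})^2$; the helper name `required_sample_size` is an assumption for illustration.

```python
import math

def required_sample_size(population_sd, target_sd):
    """Smallest whole sample size whose sample-mean SD is at most target_sd."""
    exact = (population_sd / target_sd) ** 2  # usually not a whole number
    return math.ceil(exact)                   # round up, never down

n = required_sample_size(0.45, 0.02)
print(n)

# Rounding down instead would leave the SD slightly above the target.
print(0.45 / math.sqrt(n - 1) > 0.02, 0.45 / math.sqrt(n) <= 0.02)
```

Because the SD of the sample mean shrinks as the sample grows, the next whole number above the exact solution is the smallest sample that meets the target.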

In [None]:
new_sample_size = ...
new_sample_size

In [None]:
grader.check("q3_4")

**Question 3.5.** Suppose the data scientist wants to be even more precise and take a sample of sufficient size such that the standard deviation of the sample mean distribution is 0.00125. Is it possible for her to do this? Choose the best answer and explanation, then assign `q3_5` to either 1, 2, 3, or 4.
1. Yes. She can repeat the sample again until she comes across a sample with a standard deviation of 0.00125.
2. Yes. Since 0.00125 is a quarter of 0.005, the required sample size is a fourth of `new_sample_size`.
3. Yes. Since 0.00125 is a quarter of 0.005, the required sample size is four times `new_sample_size`.
4. No, the sample size required to reach that sample mean standard deviation is larger than the number of students at UCSD.

In [None]:
q3_5 = ...

In [None]:
grader.check("q3_5")

## 4. Key Concepts 🔑

**Question 4.1.** How do we standardize a value of 23 if it comes from a dataset where the mean is 55 and the standard deviation is 8? Assign `q4_1` to either 1, 2, 3, or 4.

1.
$\dfrac{({55-23})^2}{8}$

2.
$\dfrac{55-23}{8}$

3.
$\dfrac{23-55}{8}$

4.
$\dfrac{{23-55}}{\sqrt{8}}$

In [None]:
q4_1 = ...

In [None]:
grader.check("q4_1")

**Question 4.2.** According to Chebyshev's Inequality, for any dataset, at least half the data falls within how many standard deviations of the mean? Assign the smallest correct answer to `q4_2`.

1. 0.87
2. 1.28
3. 1.42
4. 1.83
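As a reminder of how Chebyshev's bound behaves, the sketch below evaluates it for a couple of values of $z$ (deliberately not the options above) and compares it against a made-up, heavily skewed dataset. The helper names are hypothetical.

```python
import statistics

def chebyshev_lower_bound(z):
    """Chebyshev: at least 1 - 1/z^2 of any dataset lies within z SDs of the mean."""
    return 1 - 1 / z ** 2

def fraction_within(data, z):
    """Actual fraction of the data within z SDs of the mean."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)
    return sum(abs(x - mean) <= z * sd for x in data) / len(data)

# Made-up, right-skewed data with one large outlier.
skewed = [1, 1, 1, 2, 2, 3, 4, 6, 9, 40]

for z in [2, 3]:
    print(z, chebyshev_lower_bound(z), fraction_within(skewed, z))
```

The bound is a guarantee, not an estimate: the actual fraction can be much larger, but never smaller.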

In [None]:
q4_2 = ...

In [None]:
grader.check("q4_2")

**Question 4.3.** Assign `q4_3` to a **list** of all statements below that are **always** true.

1. If we know the mean and SD of a distribution, we can calculate a 95% confidence interval by stepping out two standard deviations from the mean in either direction.
2. An empirical histogram of the sample average of a large random sample drawn with replacement from a population will be roughly normal.
3. An empirical histogram of the sample median of a large random sample drawn with replacement from a population will be roughly normal.
4. For any distribution, 68% of the data falls within one standard deviation of the mean.
5. For any distribution, at least 68% of the data falls within two standard deviations of the mean.

In [None]:
q4_3 = ...

In [None]:
grader.check("q4_3")

## Finish Line 🏁

Congratulations! You are done with Homework 7 – the final homework of the quarter! 🎉

To submit your assignment:

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells.
2. Read through the notebook to make sure everything is fine and all tests passed.
3. Run the cell below to run all tests, and make sure that they all pass.
4. Download your notebook using `File -> Download as -> Notebook (.ipynb)`, then upload your notebook to Gradescope.

In [None]:
grader.check_all()