## 3B: Practice
In this exercise, you will practice inferential statistics with confidence intervals, bootstrapping, and hypothesis testing. Problems may involve a combination of math and code. 

Recall that you can use LaTeX to nicely format your math inside Markdown cells by enclosing equations in single dollar signs (e.g., $x^2+4=8$) for inline math or double dollar signs for centered equations like $$P(X > 5) = \frac{1}{6}.$$ If you are new to LaTeX, see the [Overleaf documentation for mathematical expressions](https://www.overleaf.com/learn/latex/mathematical_expressions) for a reference.

Show your work and/or briefly explain your answers. In general, you will not receive full credit for numeric answers with no accompanying work or justification (math, code, or explanation). For numeric answers, we will accept answers that are very slightly off due to rounding, using a z-score of 2 instead of 1.96, etc.

When you finish, go to Kernel --> Restart and Run All, then double-check that your notebook looks correct before submitting your .ipynb file (the notebook file) on Gradescope.

In [None]:
# Run this code cell to import relevant libraries
import numpy as np
import pandas as pd
from scipy import stats

<!-- BEGIN QUESTION -->

### Question 1
The General Social Survey asked the following question to a random sample of 1,155 Americans: “After an average work day, about how many hours do you have to relax or pursue activities that you enjoy?” A 95% confidence interval for the mean number of hours spent relaxing or pursuing activities they enjoy was (1.38, 1.92).

1. Your friend reads the survey and says it means "95% of the survey respondents reported between 1.38 and 1.92 hours." Is this a valid interpretation of the confidence interval? Why or why not?
2. Suppose another set of researchers reported a confidence interval of (1.29, 2.01) based on the same sample of 1,155 Americans. Is this indicative of a higher or lower confidence level (the percentage)?
3. Suppose next year a new survey asking the same question is conducted, and this time the sample size is 2,500. Assuming that the summary statistics (mean and standard deviation) are roughly the same as before, how will the new confidence interval differ from the (1.38, 1.92) computed before? Why?


<!--
BEGIN QUESTION
name: q1
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### Question 2
1. A random survey of 1,000 US adults found that 42% believe raising the minimum wage will help the economy. Using the normal distribution, construct a 95% confidence interval for the true percentage of US adults who believe this.
2. A study of 19 randomly sampled Risso's dolphins finds an average of 4.4 micrograms of mercury per wet gram of muscle, with a standard deviation of 2.3. Construct a 95% confidence interval around this empirical mean using Student's t-distribution.
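
As a hedged illustration of the two interval types above (not a solution to the question), the sketch below computes a normal-approximation interval for a proportion and a t-based interval for a mean from summary statistics. The helper names `proportion_ci` and `mean_ci_t` and the example numbers are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy import stats


def proportion_ci(p_hat, n, confidence=0.95):
    """Normal-approximation CI for a proportion (hypothetical helper)."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    return (p_hat - z * se, p_hat + z * se)


def mean_ci_t(mean, sd, n, confidence=0.95):
    """t-based CI for a mean from summary statistics (hypothetical helper)."""
    # Critical value from the t-distribution with n - 1 degrees of freedom
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)
    se = sd / np.sqrt(n)
    return (mean - t_crit * se, mean + t_crit * se)


# Made-up inputs, unrelated to the question above
prop_interval = proportion_ci(p_hat=0.5, n=100)
mean_interval = mean_ci_t(mean=10.0, sd=3.0, n=25)
print(prop_interval, mean_interval)
```

Both helpers use the same "estimate plus or minus critical value times standard error" structure; only the critical value and the standard error formula differ.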

<!--
BEGIN QUESTION
name: q2
manual: false
points:
    - 2
    - 2
    - 2
    - 2
-->

In [None]:
# Code for question 2 (or you can use a hand calculator and show your work)
# Please store the answer to each question in the following variables

q2_1 = ... # 95% CI stored as a list or tuple with [lower-bound, upper-bound]
q2_2 = ... # 95% CI stored as a list or tuple with [lower-bound, upper-bound]

...

_Type your answer here, replacing this text._

### Question 3
You have a small dataset of the total number of miles that each individual in a random sample walked over the last week: `data = [1, 3, 4, 8, 14, 23, 39, 51, 106, 319]` as defined in the code below.
1. Construct a 95% confidence interval for the mean of `data` using Student's t-distribution.
2. Use bootstrapping with 100,000 bootstrap resamples to construct a 95% confidence interval for the mean of `data`.
3. Which confidence interval is more reasonable? Why?
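
The two constructions above can be sketched as follows, on a made-up array rather than `data`. This is a minimal sketch under stated assumptions: the bootstrap interval uses the percentile method (one of several bootstrap interval methods), and the names `t_mean_ci`, `bootstrap_mean_ci`, and `toy` are hypothetical.

```python
import numpy as np
from scipy import stats


def t_mean_ci(sample, confidence=0.95):
    """t-based CI for the mean of raw data (hypothetical helper)."""
    sample = np.asarray(sample, dtype=float)
    # stats.sem uses the sample standard deviation (ddof=1) by default
    return stats.t.interval(confidence, df=len(sample) - 1,
                            loc=sample.mean(), scale=stats.sem(sample))


def bootstrap_mean_ci(sample, n_resamples=10_000, confidence=0.95, seed=0):
    """Percentile bootstrap CI for the mean (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float)
    # Each row holds the indices of one resample drawn with replacement
    idx = rng.integers(0, len(sample), size=(n_resamples, len(sample)))
    boot_means = sample[idx].mean(axis=1)
    alpha = 1 - confidence
    lower, upper = np.percentile(
        boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper


toy = [2, 5, 7, 11, 40]  # made-up values for illustration only
t_lo, t_hi = t_mean_ci(toy)
b_lo, b_hi = bootstrap_mean_ci(toy)
print((t_lo, t_hi), (b_lo, b_hi))
```

The t-based interval is symmetric around the sample mean by construction, while the percentile bootstrap interval can never extend beyond the range of the observed values.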

<!--
BEGIN QUESTION
name: q3
manual: false
points:
    - 2
    - 2
    - 2
    - 2
-->

In [None]:
# Run but do not modify
data = np.array([1, 3, 4, 8, 14, 23, 39, 51, 106, 319])

# Code for question 3
# Please store the answer to each question in the following variables
# Store each answer as a list or tuple with [lower-bound, upper-bound]


q3_1 = ... # 95% CI using t-distribution
q3_2 = ... # 95% CI with bootstrapping


...

##### Question 3.3
_Type your answer here, replacing this text._

<!-- END QUESTION -->

### Question 4
#### Part 1. 
It is believed that nearsightedness affects about 8% of all children. In a random sample of 194 children, 21 are nearsighted. Consider the following question: do these data provide evidence that the 8% value is inaccurate? State the specific hypotheses you will test to answer this question and indicate whether it is a one-sided or two-sided test (you can do either, just clarify which). Use a significance level of 0.05. Conduct the hypothesis test and calculate the p-value using the normal distribution. Interpret your result.
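
For reference, a one-proportion z-test of the kind described above might be sketched like this. The function name `one_proportion_z_test` and the input numbers are hypothetical, and the sketch assumes a two-sided alternative with the standard error computed under the null value `p0`.

```python
import numpy as np
from scipy import stats


def one_proportion_z_test(successes, n, p0):
    """Two-sided z-test for H0: p = p0 (hypothetical helper)."""
    p_hat = successes / n
    # Standard error uses the null proportion, not the observed one
    se = np.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se
    # Two-sided p-value: probability of a result at least this extreme
    p_value = 2 * stats.norm.sf(abs(z))
    return z, p_value


# Made-up example: 60 successes in 100 trials against H0: p = 0.5
z_stat, p_val = one_proportion_z_test(60, 100, 0.5)
print(z_stat, p_val)
```

For a one-sided test, the p-value would instead be `stats.norm.sf(z)` or `stats.norm.cdf(z)`, depending on the direction of the alternative.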

#### Part 2.
A USA Today/Gallup poll asked a group of unemployed and underemployed Americans if they have had major problems in their relationships with their spouse or another close family member as a result of not having a job (if unemployed) or not having a full-time job (if underemployed). 27% of the 1,145 unemployed respondents and 25% of the 675 underemployed respondents said they had major problems in relationships as a result of their employment status. Consider the following question: is the percentage of those having major problems different for unemployed versus underemployed Americans? State the specific hypotheses you will test to answer this question and indicate whether it is a one-sided or two-sided test (you can do either, just clarify which). 

Use a significance level of 0.05. Conduct the hypothesis test and calculate the p-value. You can do so most easily using [`scipy.stats.ttest_ind_from_stats`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind_from_stats.html#scipy.stats.ttest_ind_from_stats). Alternatively, if you wish to run the test using the normal distribution for a (very) slightly tighter p-value, you can look up the standard error calculations for the difference of proportions in Chapter 6.2 of the OpenIntro Statistics book referenced in the Prepare. You will get similar p-values and the same conclusion either way. Interpret your result.
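
The two routes mentioned above can be compared on made-up numbers. This is only a sketch: `two_proportion_z_test` is a hypothetical helper that uses the pooled standard error, and the t-test route assumes each response is coded 0/1 so that a group's standard deviation is approximately $\sqrt{p(1-p)}$ for large samples.

```python
import numpy as np
from scipy import stats


def two_proportion_z_test(p1, n1, p2, n2):
    """Two-sided z-test for H0: p1 = p2 with pooled SE (hypothetical helper)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, 2 * stats.norm.sf(abs(z))


# Made-up inputs, unrelated to the poll above
p1, n1 = 0.30, 500
p2, n2 = 0.25, 400

z_stat, p_z = two_proportion_z_test(p1, n1, p2, n2)

# t-test from summary statistics, treating each response as 0 or 1
# (assumption: std of a 0/1 variable is roughly sqrt(p * (1 - p)))
t_result = stats.ttest_ind_from_stats(
    mean1=p1, std1=np.sqrt(p1 * (1 - p1)), nobs1=n1,
    mean2=p2, std2=np.sqrt(p2 * (1 - p2)), nobs2=n2,
    equal_var=False,
)
print(p_z, t_result.pvalue)
```

With samples this large, the two p-values differ only slightly, so either route leads to the same conclusion.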



<!--
BEGIN QUESTION
name: q4
manual: False
points:
    - 1
    - 1
    - 1
    - 1
-->

In [None]:
# Code for question 4
# Please store the answer to each question in the following variables

q4_1 = ... # part 1: p-value
q4_2 = ... # part 2: p-value

...

##### Question 4 Interpretation

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### Question 5
Below we import two versions of the `university_data` dataset: `uni`, which includes universities with NAs in the `cost_after_aid` column, and `uni_caa_no_na`, which excludes them. The first dataset contains information about 311 universities in the United States. In general, private universities charge higher tuition rates than public universities. However, private universities often argue that once you take financial aid into account, the cost is often not different. In this question, you will explore this issue.

1. First, report the average `tuition` of `public` schools and the average `tuition` of `private` schools to confirm the basic notion that `private` schools charge higher tuition on average.

2. Consider the null hypothesis that `private` and `public` universities have the same average `cost_after_aid`. Conduct a two-sided t-test to determine whether the dataset provides statistically significant evidence to reject the null hypothesis in favor of the alternative hypothesis that they have different average `cost_after_aid`. You will notice that some universities do not have a value recorded for `cost_after_aid`. For now, use `uni_caa_no_na` for your analysis and assume that the remaining universities are a random sample of American universities. Report the resulting p-value. Interpret your results at a significance level of 0.05.

3. In the previous step you tested for statistical significance of the difference in `cost_after_aid` between public and private schools. What is the effect size? Report the average `cost_after_aid` of `public` schools and the average `cost_after_aid` of `private` schools.

4. In step 2 we assumed that we could omit the universities with missing data and the remainder would be a random sample of American universities. Is that assumption well justified? Consider especially the average values you computed in steps 1 and 3, and consider which universities are missing the `cost_after_aid` information. Given this, what can you say about the argument private universities often make that, once you take financial aid into account, the cost is often not different?
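
The general workflow in the steps above (group means, a two-sample t-test, an effect size, and a check of which rows are missing) can be sketched on a toy table. Everything here is hypothetical: the column names `sector` and `cost` and all values are invented and are not the columns of `university_data.csv`. The sketch also assumes Welch's unequal-variance t-test, which is one reasonable choice rather than a required one.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy data with one missing value (all names and numbers are made up)
toy = pd.DataFrame({
    "sector": ["public"] * 4 + ["private"] * 5,
    "cost": [10.0, 12.0, 11.0, 13.0, 15.0, 18.0, 14.0, 17.0, np.nan],
})

# Count missing values per group to see whether missingness is balanced
missing_by_sector = toy["cost"].isna().groupby(toy["sector"]).sum()

# Drop rows with a missing cost before comparing groups
clean = toy.dropna(subset=["cost"])
group_means = clean.groupby("sector")["cost"].mean()

public = clean.loc[clean["sector"] == "public", "cost"]
private = clean.loc[clean["sector"] == "private", "cost"]

# Two-sided Welch t-test (does not assume equal variances)
t_stat, p_value = stats.ttest_ind(private, public, equal_var=False)

# Effect size reported here as the absolute difference in group means
effect_size = abs(group_means["private"] - group_means["public"])
print(missing_by_sector, group_means, p_value, effect_size, sep="\n")
```

If the missing values are concentrated in one group, or in a particular kind of school within a group, dropping them can change what population the remaining rows represent.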

<!--
BEGIN QUESTION
name: q5_auto
manual: false
points:
    - 2
    - 2
    - 1
    - 1
    - 2
    - 2
-->

In [None]:
# Run but do not modify this code
uni = pd.read_csv("university_data.csv")
uni_caa_no_na = pd.read_csv("university_data_caa_no_na.csv")
uni.head()

# Code for question 5 here
# Place your answers in lists in the following order

q5_1 = ... # [public, private]
q5_2 = ... # p-value
q5_3 = ... # [public, private, effect_size], note: the effect size should be positive


##### Question 5.4

_Type your answer here, replacing this text._