# Statistics 1
In this exercise, you will practice inferential statistics with confidence intervals, bootstrapping, and hypothesis testing. Problems may involve a combination of math and code. 

Recall that you can use LaTeX to nicely format your math inside Markdown cells by enclosing equations in single dollar signs (e.g., $x^2+4=8$) for inline math or double dollar signs for centered equations like $$P(X > 5) = \frac{1}{6}.$$ For a reference if you are new to LaTeX, see the [overleaf documentation for mathematical expressions](https://www.overleaf.com/learn/latex/mathematical_expressions). 

In [2]:
import numpy as np
import pandas as pd
from scipy import stats

### Question 1
The General Social Survey asked the following question to a random sample of 1,155 Americans: “After an average work day, about how many hours do you have to relax or pursue activities that you enjoy?” A 95% confidence interval for the mean number of hours spent relaxing or pursuing activities they enjoy was (1.38, 1.92).
1. Your friend reads the survey and says it means "95% of the survey respondents reported between 1.38 and 1.92 hours." Is this a valid interpretation of the confidence interval? Why or why not?
2. Suppose another set of researchers reported a confidence interval of (1.29, 2.01) based on the same sample of 1,155 Americans. Is this indicative of a higher or lower confidence level (the percentage)?
3. Suppose next year a new survey asking the same question is conducted, and this time the sample size is 2,500. Assuming that the summary statistics (mean and standard deviation) are roughly the same as before, how will the new confidence interval differ from the (1.38, 1.92) computed before? Why?

### Answer 1
1. This is not a valid interpretation. A 95% confidence interval does not mean that 95% of the data points fall within the interval; it is a statement about the population mean, not about individual responses. Instead, it means that if repeated samples were taken and a 95% confidence interval were computed from each one, about 95% of those intervals would contain the population mean. The 0.95 describes the long-run success rate of the procedure, not the probability that this particular interval (1.38, 1.92) contains the mean, and it does not mean that 95% of the population distribution lies inside the interval.

2. This indicates a higher confidence level. The same sample is used, so the point estimate and standard error are unchanged, yet the interval widens from (1.38, 1.92) to (1.29, 2.01). With everything else fixed, a wider interval corresponds to a larger critical value and therefore a higher confidence level.

3. The new confidence interval will be narrower. In the confidence interval formula $\bar{x} \pm z^* \frac{s}{\sqrt{n}}$, increasing $n$ shrinks the margin of error $z^* s/\sqrt{n}$, so both endpoints move closer to $\bar{x}$.
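
The repeated-sampling interpretation in part 1 and the $\sqrt{n}$ scaling in part 3 can be checked with a short simulation. The sketch below is illustrative only: the normal population with mean 1.65 and standard deviation 4.7 is an assumption chosen for convenience, not something reported by the survey, and the seed and variable names are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical population parameters (assumptions for illustration only)
true_mean, true_sd = 1.65, 4.7
n, n_repeats = 1155, 2000

# Draw many independent samples and build a 95% CI from each one
samples = rng.normal(true_mean, true_sd, size=(n_repeats, n))
means = samples.mean(axis=1)
ses = samples.std(axis=1, ddof=1) / np.sqrt(n)
z_star = stats.norm.ppf(0.975)
lower, upper = means - z_star * ses, means + z_star * ses

# Fraction of intervals that contain the true mean (should be close to 0.95)
coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
print(f"coverage: {coverage:.3f}")

# Part 3: with the same mean and SD, the margin of error scales as 1/sqrt(n)
old_margin = (1.92 - 1.38) / 2
new_margin = old_margin * np.sqrt(1155 / 2500)
print(f"approximate new interval: ({1.65 - new_margin:.2f}, {1.65 + new_margin:.2f})")
```

Under the stated assumption of unchanged summary statistics, the margin of error drops from 0.27 to roughly 0.18, so the new interval would be about (1.47, 1.83).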

### Question 2
1. A random survey of 1,000 US adults found that 42% believe raising the minimum wage will help the economy. Using the normal distribution, construct a 95% confidence interval for the true percentage of US adults who believe this.
2. A study of 19 random Risso's dolphins finds that the average amount of micrograms of mercury per wet gram of muscle in a dolphin is 4.4, with a standard deviation of 2.3. Construct a 95% confidence interval around this empirical mean using the Student's t-distribution.

In [4]:
# 2.1
print(stats.norm.interval(0.95, loc = 0.42, scale=np.sqrt(0.42*(1-0.42))/np.sqrt(1000)))

# 2.2
print(stats.t.interval(0.95, loc = 4.4, scale=2.3/np.sqrt(19), df=18))

(0.38940948891043026, 0.4505905110895697)
(3.2914354851665495, 5.508564514833451)


### Answer 2
1. Approximately (0.389, 0.451), i.e., between 38.9% and 45.1% of US adults.
2. Approximately (3.29, 5.51) micrograms of mercury per wet gram of muscle.
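
As a cross-check, the same intervals can be computed directly from the formulas $\hat{p} \pm z^* \sqrt{\hat{p}(1-\hat{p})/n}$ and $\bar{x} \pm t^*_{n-1}\, s/\sqrt{n}$. This is a minimal sketch; the variable names are illustrative and not part of the original solution.

```python
import numpy as np
from scipy import stats

# 2.1: normal-approximation interval for a proportion
p_hat, n = 0.42, 1000
se_p = np.sqrt(p_hat * (1 - p_hat) / n)
z_star = stats.norm.ppf(0.975)  # two-sided 95% critical value
ci_prop = (p_hat - z_star * se_p, p_hat + z_star * se_p)

# 2.2: t-interval for a mean with m - 1 degrees of freedom
x_bar, s, m = 4.4, 2.3, 19
se_mean = s / np.sqrt(m)
t_star = stats.t.ppf(0.975, df=m - 1)
ci_mean = (x_bar - t_star * se_mean, x_bar + t_star * se_mean)

print(ci_prop)
print(ci_mean)
```

Both results should agree with the output of `stats.norm.interval` and `stats.t.interval` above up to floating-point rounding.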

### Question 3
You have a small dataset of the total number of miles that a random subset of individuals have walked over the last week: `data = [1, 3, 4, 8, 14, 23, 39, 51, 106, 319]` as defined in the code below.
1. Construct a 95% confidence interval for the mean of `data` using the student's t-distribution.
2. Use bootstrapping with 100,000 bootstrap resamples to construct a 95% confidence interval for the mean of `data`.
3. Which confidence interval is more reasonable? Why?

In [5]:
data = np.array([1, 3, 4, 8, 14, 23, 39, 51, 106, 319])

In [6]:
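# 3.1
# Note: np.std defaults to ddof=0 (population SD), which is what produced the output below;
# np.std(data, ddof=1) is the usual sample SD for a t-interval and gives a slightly wider interval.
print(stats.t.interval(0.95, loc=np.mean(data), scale=np.std(data)/np.sqrt(len(data)), df=len(data)-1))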

# 3.2
sample = np.random.choice(data, size=(100000, len(data)), replace=True)
sampleMeans = np.average(sample, axis=1)
ciL = np.percentile(sampleMeans, 2.5)
ciU = np.percentile(sampleMeans, 97.5)
print((ciL, ciU))

(-9.412687084679476, 123.01268708467947)
(13.7, 122.10249999999941)


### Answer 3
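1. Approximately (-9.41, 123.01).
2. Approximately (13.7, 122.1). The endpoints vary slightly from run to run because the resamples are random.
3. The bootstrap confidence interval is more reasonable. The sample is extremely small ($n = 10$) and strongly right-skewed, so the normality assumption behind the t-interval is doubtful, and its lower bound is negative even though miles walked cannot be below zero. The bootstrap interval makes no normality assumption: it resamples the observed data with replacement and reads the interval off the resulting distribution of sample means, so it stays within the range of plausible values.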
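
One detail worth checking in part 1 is the standard deviation: `np.std` defaults to `ddof=0`, whereas a t-interval is normally built from the sample standard deviation (`ddof=1`). The sketch below compares the two and reruns the percentile bootstrap with a seeded generator so the result is reproducible. The helper `t_interval`, the seed, and the variable names are illustrative choices, not part of the original solution.

```python
import numpy as np
from scipy import stats

data = np.array([1, 3, 4, 8, 14, 23, 39, 51, 106, 319])
n = len(data)

def t_interval(x, ddof):
    """95% t-interval for the mean, using the SD computed with the given ddof."""
    se = np.std(x, ddof=ddof) / np.sqrt(len(x))
    return stats.t.interval(0.95, df=len(x) - 1, loc=np.mean(x), scale=se)

ci_ddof0 = t_interval(data, ddof=0)  # population SD, as in the cell above
ci_ddof1 = t_interval(data, ddof=1)  # sample SD, slightly wider

# Percentile bootstrap: resample with replacement, then take the
# 2.5th and 97.5th percentiles of the resampled means
rng = np.random.default_rng(42)
resamples = rng.choice(data, size=(100_000, n), replace=True)
boot_means = resamples.mean(axis=1)
ci_boot = tuple(np.percentile(boot_means, [2.5, 97.5]))

print(ci_ddof0)
print(ci_ddof1)
print(ci_boot)
```

Either version of the t-interval has a negative lower bound, while the bootstrap interval cannot fall below the smallest observed value, which supports the conclusion in part 3.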

### Question 4
#### Part 1. 
It is believed that nearsightedness affects about 8% of all children. In a random sample of 194 children, 21 are nearsighted. Consider the following question: do these data provide evidence that the 8% value is inaccurate? State the specific hypotheses you will test to answer this question and indicate whether it is a one-sided or two-sided test (you can do either, just clarify which). Use a significance level of 0.05. Conduct the hypothesis test and calculate the p-value using the normal distribution. Interpret your result.

#### Part 2.
A USA Today/Gallup poll asked a group of unemployed and underemployed Americans if they have had major problems in their relationships with their spouse or another close family member as a result of not having a job (if unemployed) or not having a full-time job (if underemployed). 27% of the 1,145 unemployed respondents and 25% of the 675 underemployed respondents said they had major problems in relationships as a result of their employment status. Consider the following question: is the percentage of those having major problems different for unemployed versus underemployed Americans? State the specific hypotheses you will test to answer this question and indicate whether it is a one-sided or two-sided test (you can do either, just clarify which). 

Use a significance level of 0.05. Conduct the hypothesis test and calculate the p-value. You can do so most easily using [`scipy.stats.ttest_ind_from_stats`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind_from_stats.html#scipy.stats.ttest_ind_from_stats). Alternatively, you can look up the standard error calculations for the difference of proportions in Chapter 6.2 of the OpenIntro Statistics book referenced in the prepare and run the test using the normal distribution for a (very) slightly tighter p-value; you will get similar p-values and the same conclusion either way. Interpret your result.

In [7]:
# Code for question 4.1
# One-sided p-value P(Z > z), where z = sqrt(n) * (p_hat - p0) / sqrt(p0 * (1 - p0))
print(1 - stats.norm.cdf(np.sqrt(194)*(21/194-0.08)/np.sqrt(0.08*0.92)))

0.07349538001845213


In [8]:
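# Code for question 4.2
# Each group is summarized by (mean, SD, n); a 0/1 indicator with proportion p has SD sqrt(p * (1 - p))
stats.ttest_ind_from_stats(0.27, np.sqrt(0.27*(1-0.27)), 1145, 0.25, np.sqrt(0.25*(1-0.25)), 675)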

Ttest_indResult(statistic=0.9368337461051707, pvalue=0.3489685143193123)

### Answer 4
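1. We test $H_0: p = 0.08$ against $H_A: p > 0.08$ (a one-sided test). At a significance level of 0.05, we fail to reject the null hypothesis, since the p-value is 0.0735 > 0.05. The data do not provide statistically significant evidence that the 8% value is inaccurate.

2. We test $H_0: p_{\text{unemployed}} = p_{\text{underemployed}}$ against $H_A: p_{\text{unemployed}} \neq p_{\text{underemployed}}$ (a two-sided test). At a significance level of 0.05, we fail to reject the null hypothesis that there is no difference between the percentages of unemployed and underemployed Americans who have relationship problems as a result of their employment status, since the p-value is 0.349 > 0.05.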
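
For comparison, both tests can be run with explicit z statistics under the normal approximation. The sketch below uses the pooled standard error for the difference of two proportions, a common choice when the null hypothesis is that the proportions are equal. The helper names `one_prop_z` and `two_prop_z` are hypothetical and not part of the original solution.

```python
import numpy as np
from scipy import stats

def one_prop_z(p_hat, p0, n):
    """z statistic for H0: p = p0, using the standard error under the null."""
    return (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)

def two_prop_z(p1, n1, p2, n2):
    """z statistic for H0: p1 = p2, using the pooled proportion."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Part 1: nearsightedness, 21 of 194 children versus the claimed 8%
z1 = one_prop_z(21 / 194, 0.08, 194)
p_one_sided = stats.norm.sf(z1)           # P(Z > z1)
p_two_sided = 2 * stats.norm.sf(abs(z1))  # P(|Z| > |z1|)

# Part 2: unemployed (27% of 1,145) versus underemployed (25% of 675)
z2 = two_prop_z(0.27, 1145, 0.25, 675)
p_diff = 2 * stats.norm.sf(abs(z2))

print(z1, p_one_sided, p_two_sided)
print(z2, p_diff)
```

The two-sided p-value for part 1 is twice the one-sided value (about 0.147), so the conclusion at the 0.05 level is the same whichever alternative is chosen.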

### Question 5
Below we import the `university_data` dataset we have looked at before. It contains information about 311 universities in the United States. In general, private universities charge higher tuition rates than public universities. However, private universities often argue that once you take financial aid into account, the cost is often not different. In this question you will explore this issue.
1. First, report the average `tuition` of `public` schools and the average `tuition` of `private` schools to confirm the basic notion that `private` schools charge higher tuition on average.
2. Consider the null hypothesis that `private` and `public` universities have the same average `cost_after_aid`. Conduct a two-sided t-test to determine whether the dataset provides statistically significant evidence to reject the null hypothesis in favor of the alternative hypothesis that they have different average `cost_after_aid`. You will notice that some universities do not have a value recorded for `cost_after_aid`. For now, simply omit those universities from your analysis and assume that the remaining are a random sample of American universities. Report the resulting p-value. Interpret your results at a significance level of 0.05.
3. In the previous step you tested for statistical significance of the difference in `cost_after_aid` between public and private schools. What is the effect size? Report the average `cost_after_aid` of `public` schools and the average `cost_after_aid` of `private` schools.
4. In step 2 we assumed that we could omit the universities with missing data and the remainder would be a random sample of American universities. Is that assumption well justified? Consider especially the average values you computed in steps 1 and 3 and consider which universities are missing the `cost_after_aid` information. Given this, what can you say about the claim that "private universities often argue that once you take financial aid into account, the cost is often not different?"

In [8]:
uni = pd.read_csv("university_data.csv")
uni.head()

| Unnamed: 0 | act_avg | sat_avg | enrollment | city | acceptance_rate | percent_receiving_aid | cost_after_aid | state | hs_gpa_avg | tuition | Institution_name | institution_type | us_rank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 32.0 | 1400.0 | 5400.0 | Princeton | 7.0 | 60.0 | 16793.0 | NJ | 3.9 | 47140 | Princeton University | private | 1.0 |
| 1 | 32.0 | 1430.0 | 6710.0 | Cambridge | 5.0 | 55.0 | 16338.0 | MA | 4.0 | 48949 | Harvard University | private | 2.0 |
| 2 | 32.0 | 1450.0 | 5941.0 | Chicago | 8.0 | 42.0 | 27767.0 | IL | 4.0 | 54825 | University of Chicago | private | 3.0 |
| 3 | 32.0 | 1420.0 | 5472.0 | New Haven | 6.0 | 50.0 | 18385.0 | CT | NaN | 51400 | Yale University | private | 3.0 |
| 4 | 32.0 | 1430.0 | 6113.0 | New York | 6.0 | 48.0 | 21041.0 | NY | NaN | 57208 | Columbia University | private | 5.0 |


In [9]:
# Q1 
print("Average public tuition: ")
print(uni[uni['institution_type']=='public']['tuition'].mean())
print("Average private tuition: ")
print(uni[uni['institution_type']=='private']['tuition'].mean())

Average public tuition: 
25968.97894736842
Average private tuition: 
40871.0350877193


In [10]:
# Q2
publicCAA = uni[uni['institution_type']=='public']['cost_after_aid'].dropna()
privateCAA = uni[uni['institution_type']=='private']['cost_after_aid'].dropna()
print(stats.ttest_ind_from_stats(publicCAA.mean(), publicCAA.std(),len(publicCAA), privateCAA.mean(), privateCAA.std(),len(privateCAA)))

Ttest_indResult(statistic=3.6952785181489114, pvalue=0.000313509738634415)


In [11]:
# Q3
publicCAA.mean() - privateCAA.mean()

4515.956964006258

In [12]:
# Q4
uni[uni['cost_after_aid'].isna()]['tuition'].mean()

23898.958333333332

### Answer 5
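1. The average public tuition is about 25,969 and the average private tuition is about 40,871, confirming that private schools charge higher tuition on average.
2. The p-value of the two-sided test is 0.000314, which is less than our significance level of 0.05. Therefore, we reject the null hypothesis that public and private universities have the same average `cost_after_aid`.
3. The difference in average `cost_after_aid` (public minus private) is about 4,516, so among the universities that report this value, public schools have the higher average cost after aid.
4. The assumption is not well justified. The universities missing `cost_after_aid` have an average tuition of about 23,899, which is below the average for public schools (25,969) and far below the average for private schools (40,871). The missing values are therefore concentrated among lower-tuition schools rather than spread at random, and dropping them leaves a sample that over-represents more expensive schools, which may inflate the public average `cost_after_aid`. The remaining data are consistent with the private universities' claim, since private schools do not show a higher average cost after aid, but because of the non-random missingness this comparison should not be taken as representative of all American universities.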
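
To investigate part 4 more directly, one could tabulate how often `cost_after_aid` is missing within each `institution_type` and rerun the comparison with Welch's t-test, which does not assume equal variances. The sketch below uses a small made-up DataFrame purely to show the calls: its values are hypothetical and are not taken from `university_data.csv`. With the real data, `toy` would be replaced by `uni`.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy data for illustration only, NOT from university_data.csv
toy = pd.DataFrame({
    "institution_type": ["public"] * 5 + ["private"] * 5,
    "tuition": [12000, 15000, 22000, 28000, 30000,
                38000, 41000, 45000, 50000, 52000],
    "cost_after_aid": [np.nan, np.nan, 18000, 21000, 23000,
                       17000, 19000, 22000, np.nan, 26000],
})

# Share of schools with missing cost_after_aid within each institution type
missing_share = toy["cost_after_aid"].isna().groupby(toy["institution_type"]).mean()
print(missing_share)

# Average tuition for schools with (False) and without (True) cost_after_aid
tuition_by_missing = toy.groupby(toy["cost_after_aid"].isna())["tuition"].mean()
print(tuition_by_missing)

# Welch's two-sided t-test on the schools that do report cost_after_aid
public = toy.loc[toy["institution_type"] == "public", "cost_after_aid"].dropna()
private = toy.loc[toy["institution_type"] == "private", "cost_after_aid"].dropna()
result = stats.ttest_ind(public, private, equal_var=False)
print(result.statistic, result.pvalue)
```

If the missing share differs sharply between public and private schools in the real data, that would be further evidence that the remaining universities are not a random sample.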