## Problem Set 2: Hypothesis testing (90 pts)
Problem set 2 is about confidence intervals and hypothesis testing.  

### Instructions
- Follow the questions and instructions below to complete this problem set. For some questions, write and execute Python code for data analysis in Cell mode. Comment your code to explain each step, and use `Shift+Enter` to execute it.
- Some questions require a written discussion. Provide a detailed discussion of your results, including interpretations and answers to the questions, in Raw mode.

- Once you have completed the assignment, save your Jupyter notebook with the following naming convention: ECN310_ProblemSetX_LastName_FirstName.ipynb (replace X with the assignment number).

_Hint:_ If you want to use numpy, pandas, or scipy.stats, be sure to import the modules with `import numpy as np`, `import pandas as pd`, and `import scipy.stats`. Import these libraries in the code block below.

In [1]:
import numpy as np
import pandas as pd
import scipy.stats

## Problem 1 Unbiased Estimators and confidence interval (20 pts, 4 pts per question)
Suppose you are a data scientist working for a health research company. The company is interested in estimating the average resting heart rate (in beats per minute) for a population of adults aged 20-40 years in a certain city. Due to budget and time constraints, you cannot measure the heart rate of every adult in this age group in the city. Therefore, you decide to take a random sample.

You randomly select 150 adults aged 20-40 years from the city and measure their resting heart rates. The data collected from these 150 individuals are as follows:

- Sample mean (x̄) = 72 beats per minute

- Sample standard deviation (s) = 12 beats per minute


a. Explain what it means for the sample mean $\bar{x}$ to be an unbiased estimator of the population mean µ.


b. Calculate the standard error of the mean (SE) for your sample. Use the formula: $$ SE = \frac{s}{\sqrt{n}}$$ where s is the sample standard deviation and n is the sample size.




In [7]:
# Please put your answer below. You can code it or calculate it manually.

s = 12
n = 150
SE = s / (n**0.5)
print(SE)

0.9797958971132713


c. Discuss how increasing the sample size might affect the standard error of the mean and the reliability of your estimate.
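
As a rough illustration of part (c), the sketch below evaluates $SE = s/\sqrt{n}$ for a few sample sizes. The sizes 600 and 2400 and the helper name `standard_error` are hypothetical, chosen only to show that quadrupling $n$ halves the standard error.

```python
import math

s = 12  # sample standard deviation from the problem statement


def standard_error(s, n):
    """Standard error of the mean: s / sqrt(n)."""
    return s / math.sqrt(n)


# 150 is the sample size in the problem; 600 and 2400 are hypothetical,
# each four times the previous one.
for n in (150, 600, 2400):
    print(n, round(standard_error(s, n), 4))
```

Because $n$ enters through a square root, precision improves slowly: each halving of the standard error requires four times as many observations.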


d. Assume a significance level of $\alpha=0.05$. Use the scipy.stats module to find the critical t value for 49 degrees of freedom, `df=49`, and 
a tail probability of $\alpha/2$.

_hint:_ use `scipy.stats.t.ppf()`

In [8]:
# Type your code below and be sure to use print() to print the critical t value

tval = scipy.stats.t.ppf(0.05/2, df=49)
print(tval) # this value is negative but we can use abs() to make it positive

-2.0095752371292397


e. Calculate a 95% confidence interval for the population mean using the sample mean from above.

In [57]:
# Type your code below and be sure to use print() to print your answer when the code is run

# using t interval function from lecture slides and df = 149
ci = scipy.stats.t.interval(confidence=(1-0.05),loc=72, scale=SE, df=n-1)
print(ci)

(70.0639103958682, 73.9360896041318)
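
The same interval can be rebuilt by hand as $\bar{x} \pm t_{\alpha/2,\,n-1} \cdot SE$. The sketch below is one way to check this, assuming the sample values from Problem 1; the variable names `t_crit`, `lower`, and `upper` are illustrative only.

```python
import scipy.stats

x_bar, s, n = 72, 12, 150  # sample mean, sample std dev, sample size
alpha = 0.05

se = s / n**0.5  # standard error of the mean
# Upper-tail critical value, so it is positive and no abs() is needed
t_crit = scipy.stats.t.ppf(1 - alpha / 2, df=n - 1)

lower = x_bar - t_crit * se
upper = x_bar + t_crit * se
print(lower, upper)
```

The interval is symmetric around the sample mean, and its half-width is the critical value multiplied by the standard error.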


# Hypothesis Testing (27 pts, 3 pts per question)


### Problem 2: T test and confidence intervals
Scenario: An administrator is gathering information about student performance in two classes. She randomly selects 10 students from each class. The scores are as follows:

- Class A sample scores: 72, 85, 90, 70, 80, 92, 78, 84, 89, 73
- Class B sample scores: 82, 80, 75, 86, 90, 87, 84, 83, 88, 85



a. Write code to produce the sample mean and sample standard deviation of scores of each class.  Make sure to use the `print()` function to print out the mean and standard deviation in the code block. (4 pts).
   _Hint_: You need to import numpy and create numpy arrays from the samples above first

In [58]:
# Write the code for your answer in this codeblock

A = [72, 85, 90, 70, 80, 92, 78, 84, 89, 73]
B = [82, 80, 75, 86, 90, 87, 84, 83, 88, 85]

mean_A = sum(A)/len(A)
mean_B = sum(B)/len(B)

# for sample stdev, we need to use ddof=1 - Source: https://stackoverflow.com/questions/34050491/standard-deviation-in-numpy
std_A = np.std(A, ddof=1)
std_B = np.std(B, ddof=1)

print(mean_A, mean_B)
print(std_A, std_B)

81.3 84.0
7.930952023559341 4.320493798938574


b. Consider only the Class A sample. With $\alpha = 5\%$, conduct a hypothesis test with the null hypothesis $\mu \leq 83$. Code the test in the block below and print either 'reject the null' or 'fail to reject'.
_Hint:_ import the scipy.stats module and use the `scipy.stats.ttest_1samp()` function.


In [60]:
# Code the answer to part b in this codeblock

results = scipy.stats.ttest_1samp(A, 83, alternative='greater')
print("Reject the null" if results.pvalue < 0.05 else "Fail to reject") # I used Python's conditional (ternary) expression to keep this to one line

Fail to reject
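
For reference, the one-sample statistic is $t = (\bar{x} - \mu_0)/(s/\sqrt{n})$, and for the alternative $\mu > 83$ the p-value is the upper-tail probability $P(T_{n-1} > t)$. The sketch below recomputes both by hand for Class A; the names `t_stat` and `p_value` are illustrative only.

```python
import numpy as np
import scipy.stats

A = np.array([72, 85, 90, 70, 80, 92, 78, 84, 89, 73])
mu0 = 83  # boundary value of the null hypothesis mu <= 83

# t statistic built from the sample mean and sample standard deviation (ddof=1)
t_stat = (A.mean() - mu0) / (A.std(ddof=1) / np.sqrt(len(A)))

# Upper-tail probability under the t distribution with n - 1 degrees of freedom
p_value = scipy.stats.t.sf(t_stat, df=len(A) - 1)

print(t_stat, p_value)
print("Reject the null" if p_value < 0.05 else "Fail to reject")
```

The sample mean (81.3) lies below 83, so the statistic is negative and the upper-tail p-value is well above 0.05.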


c. Construct a 95% confidence interval for the population mean for Class B.

In [61]:
# Code your answer for part c in this codeblock

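# the standard error must use Class B's sample size, len(B) = 10, not n = 150 from Problem 1
ci = scipy.stats.t.interval(confidence=(1-0.05), loc=mean_B, scale=(std_B/(len(B)**0.5)), df=len(B)-1)
print(f"({ci[0]:.2f}, {ci[1]:.2f})")

(80.91, 87.09)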


d. At a significance level of 5%, determine whether there is a difference between the scores of Class A and Class B. In the space below, code and explain your answer.


In [26]:
# Code your answer here

results = scipy.stats.ttest_ind(A, B, equal_var=True)
results.pvalue

0.35698249333842125
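
One way to turn the p-value into an explicit decision is sketched below. It runs the pooled-variance test from the cell above and, as an assumption beyond the original answer, also runs Welch's test (`equal_var=False`), since the two sample standard deviations differ noticeably.

```python
import scipy.stats

A = [72, 85, 90, 70, 80, 92, 78, 84, 89, 73]
B = [82, 80, 75, 86, 90, 87, 84, 83, 88, 85]
alpha = 0.05

pooled = scipy.stats.ttest_ind(A, B, equal_var=True)   # assumes equal variances
welch = scipy.stats.ttest_ind(A, B, equal_var=False)   # does not assume equal variances

for name, res in (("pooled", pooled), ("Welch", welch)):
    decision = "reject the null" if res.pvalue < alpha else "fail to reject"
    print(f"{name}: t = {res.statistic:.3f}, p = {res.pvalue:.3f} -> {decision}")
```

Both versions fail to reject the null hypothesis of equal means at the 5% level, so the sample does not provide evidence of a difference between the two classes.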

### Problem 3: t-test using built-in function in Python
Scenario: A nutritionist claims that a new diet plan significantly reduces an individual's body mass index (BMI) over a period of 3 months. To test this claim, 12 participants were selected, and their BMIs were recorded before starting the diet plan and after completing 3 months on the diet. The BMIs are as follows:

- Before: 25, 30, 28, 32, 24, 29, 27, 31, 23, 26, 28, 33
- After: 24, 28, 27, 30, 23, 27, 25, 29, 22, 25, 27, 31

Task: Use Python to perform a two sample t-test on the BMI data before and after the diet plan. Determine if the diet plan has the desired and significant effect on BMI.


1. Prepare the data for the t-test by creating numpy arrays for before and after


In [62]:
# Please write your executable code here
before = [25, 30, 28, 32, 24, 29, 27, 31, 23, 26, 28, 33]
after = [24, 28, 27, 30, 23, 27, 25, 29, 22, 25, 27, 31]

np.mean(before), np.mean(after)

(28.0, 26.5)

2. Perform the two sample t-test.


In [64]:
# Please write your executable code here

# I am using ttest_rel because this appears to be a paired t test. 
# The two samples aren't from different people but from the same group.
# Source - https://www.geeksforgeeks.org/how-to-conduct-a-paired-samples-t-test-in-python/#

results = scipy.stats.ttest_rel(before, after)
results

TtestResult(statistic=9.9498743710662, pvalue=7.773530457329027e-07, df=11)

3. Interpret the results. Is there evidence that the diet plan had the desired result at a significance level of 5%, at 10%?
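
Because the claim is directional (the diet *reduces* BMI), a one-sided paired test is a natural companion to the two-sided one. The sketch below assumes the alternative "mean of before minus after is greater than 0" and compares the p-value with both significance levels; the loop and printed labels are illustrative only.

```python
import numpy as np
import scipy.stats

before = np.array([25, 30, 28, 32, 24, 29, 27, 31, 23, 26, 28, 33])
after = np.array([24, 28, 27, 30, 23, 27, 25, 29, 22, 25, 27, 31])

# Two-sided paired test, as in the cell above
two_sided = scipy.stats.ttest_rel(before, after)

# One-sided paired test: H1 is mean(before - after) > 0, i.e. BMI went down
one_sided = scipy.stats.ttest_rel(before, after, alternative='greater')

for alpha in (0.05, 0.10):
    decision = "reject the null" if one_sided.pvalue < alpha else "fail to reject"
    print(f"alpha = {alpha}: {decision}")
```

Since the statistic is positive, the one-sided p-value is half the two-sided one, and the null hypothesis is rejected at both the 5% and 10% levels.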


# Part 3: Hypothesis testing with income data (25 pts)

In this problem, you will perform hypothesis tests using the database of individuals born in the early 1980s in the United States -- the same dataset that you have used in Problem Set 1. Below are some variables we could use. We will only use gender, but feel free to explore on your own some other variables.
- educ: number of years of education completed
- annual_income: Annual income from wages
- gender: denotes the gender of the individual
- minority: 1 if the individual belongs to a minority group, 0 otherwise.
- mother_educ: 1 if the individual’s mother has a college degree, 0 otherwise.
- gpa_grade_8: GPA in 8th grade.
- retention: 1 if the individual was required to repeat a grade during middle school, 0 otherwise

1. Load the data and needed libraries (3pts).

In [65]:
# Please write your executable code here
df = pd.read_excel("data_NLSY97.xlsx", index_col=0)
df

| PUBID - YTH ID CODE 1997 | educ | gpa_grade_8 | retention | annual_income | total_weeks_exper | black | hispanic | white | mother_educ | minority | gender |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 16.0 | 3.0 | 0.0 | NaN | NaN | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | female |
| 2 | 14.0 | 3.5 | 0.0 | 115000.0 | 965.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | male |
| 3 | 14.0 | 3.0 | 0.0 | NaN | 776.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | female |
| 4 | 12.0 | 4.0 | 0.0 | 45000.0 | 1008.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | female |
| 5 | 12.0 | 2.5 | 0.0 | 150000.0 | 890.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | male |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9018 | 8.0 | 2.0 | 0.0 | 90000.0 | 942.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | female |
| 9019 | 13.0 | 1.5 | 1.0 | 43000.0 | 824.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | male |
| 9020 | 14.0 | 3.5 | 0.0 | NaN | 494.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | male |
| 9021 | 13.0 | 1.5 | 0.0 | 47000.0 | 965.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | male |


2. Create two data series from the data frame, one with annual_income for females and one for males (4pts)

In [66]:
# Please write your executable code here
female = df[df['gender'] == 'female']['annual_income']
male = df[df['gender'] == 'male']['annual_income']

3. Conduct a two sample hypothesis test to determine if there is a statistically significant difference between the annual incomes of males and females (6pts)  

Hint: Because the series you created include NaN for missing values,  you need to set the keyword `nan_policy='omit'` in the  `scipy.stats.ttest_ind()` method. Also set the key word argument `equal_var=False` to assume that the variance in annual incomes are different for males and females.  To learn more about the scipy.stats.ttest_ind method, execute `scipy.stats.ttest_ind?`

In [2]:
# Please write your executable code here

# The module was imported with `import scipy.stats`, so the function is called as scipy.stats.ttest_ind
results = scipy.stats.ttest_ind(male, female, nan_policy='omit', equal_var=False)
results
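
To see what the two keywords in the hint do, the sketch below uses synthetic data (not the NLSY97 file). The group sizes, means, spreads, and the positions of the missing values are all made up for illustration. It checks that `nan_policy='omit'` gives the same Welch test as dropping the NaN entries by hand.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)

# Synthetic "income" samples with different spreads (hypothetical values)
group_1 = rng.normal(50_000, 10_000, size=200)
group_2 = rng.normal(45_000, 15_000, size=180)

# Inject missing values, mimicking respondents with no reported income
group_1[::10] = np.nan
group_2[::15] = np.nan

# Welch's t-test, letting scipy skip the NaN entries
omit = scipy.stats.ttest_ind(group_1, group_2, nan_policy='omit', equal_var=False)

# Same test after removing the NaN entries manually
manual = scipy.stats.ttest_ind(
    group_1[~np.isnan(group_1)],
    group_2[~np.isnan(group_2)],
    equal_var=False,
)

print(float(omit.statistic), float(manual.statistic))
```

Without `nan_policy='omit'`, the default policy propagates the missing values and the test returns NaN.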

4. Interpret the result (4pts)


5. Now conduct a two sample hypothesis test to determine whether males earn more annual income than females. Again, you will need to set the keywords `equal_var=False` and `nan_policy='omit'` as well as the correct `alternative= ` (4pts)

In [55]:
# Please write your code below

# alternative='greater' tests H1: mean(male) > mean(female), following the argument order (male, female)
results = scipy.stats.ttest_ind(male, female, nan_policy='omit', equal_var=False, alternative='greater')
results
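
The `alternative` keyword is read relative to the argument order: `'greater'` means the mean of the first sample exceeds that of the second. The sketch below demonstrates this on two small made-up samples (`first` and `second` are hypothetical, not taken from the dataset), where the first sample clearly has the larger mean.

```python
import scipy.stats

# Hypothetical samples; the first has the larger mean by construction
first = [52, 48, 55, 60, 51, 58]
second = [41, 45, 39, 47, 43, 40]

greater = scipy.stats.ttest_ind(first, second, equal_var=False, alternative='greater')
less = scipy.stats.ttest_ind(first, second, equal_var=False, alternative='less')

print("H1: mean(first) > mean(second), p =", greater.pvalue)
print("H1: mean(first) < mean(second), p =", less.pvalue)
```

The two one-sided p-values sum to 1, so a p-value near 1 together with a large positive statistic usually signals that the alternative points the wrong way for the question being asked.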

6. Interpret the results (4pts). Specifically, what can you conclude and what can you _not_ conclude from the test?