# Practice notebook for hypothesis tests using NHANES data

This notebook will give you the opportunity to perform some hypothesis tests with the NHANES data that are similar to
what was done in the week 3 case study notebook.

You can enter your code into the cells that say "enter your code here", and you can type responses to the questions into the cells that say "Type Markdown and Latex".

Note that most of the code that you will need to write below is very similar to code that appears in the case study notebook.  You will need to edit code from that notebook in small ways to adapt it to the prompts below.

To get started, we will use the same module imports and read the data in the same way as we did in the case study:

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import statsmodels.api as sm
import statsmodels.stats.proportion as smprop
import numpy as np
import scipy.stats.distributions as dist

In [2]:
da = pd.read_csv("nhanes_2015_2016.csv")

## Question 1

Conduct a hypothesis test (at the 0.05 level) for the null hypothesis that the proportion of women who smoke is equal to the proportion of men who smoke.

In [4]:
# prepare the data and drop any missing values

da["SMQ020x"] = da.SMQ020.replace({1: "Yes", 2: "No", 7: np.nan, 9: np.nan})
da["RIAGENDRx"] = da.RIAGENDR.replace({1: "Male", 2: "Female"})
dx = da[["SMQ020x", "RIDAGEYR", "RIAGENDRx"]].dropna()


# make two small views, one for females and one for males,
# and recode the smoking column to 1 (Yes) and 0 (No) so it can be counted

dx_females = dx.loc[dx.RIAGENDRx=="Female", "SMQ020x"].replace({"Yes": 1, "No": 0})
dx_males = dx.loc[dx.RIAGENDRx=="Male", "SMQ020x"].replace({"Yes": 1, "No": 0})


# then apply the statsmodels two-sample t-test to get the test statistic and p-value

sm.stats.ttest_ind(dx_females, dx_males) # prints test statistic, p-value, degrees of freedom

(-16.420585558984445, 3.0320887866906843e-59, 5723.0)
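
Because smoking status is a 0/1 variable, the same comparison can also be framed as a two-proportion z-test with a pooled standard error. The sketch below is one way to do that by hand. It assumes counts of 906 smokers among 2972 women and 1413 among 2753 men; these counts are back-calculated from the sample proportions and sample sizes used later in this notebook, and the helper name `two_proportion_ztest` is hypothetical. The statistic need not match the `ttest_ind` output exactly, because the two tests estimate the standard error differently.

```python
import math

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test of H0: p1 == p2 using the pooled proportion (hypothetical helper)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # erfc(|z| / sqrt(2)) equals the two-sided standard normal tail probability.
    pvalue = math.erfc(abs(z) / math.sqrt(2))
    return z, pvalue

# Assumed counts, back-calculated from the proportions used in this notebook.
z_stat, p_val = two_proportion_ztest(906, 2972, 1413, 2753)
print(z_stat, p_val)
```

A large negative statistic here means the female proportion is well below the male proportion relative to the sampling noise.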

__Q1a.__ Write 1-2 sentences explaining the substance of your findings to someone who does not know anything about statistical hypothesis tests.

If women and men in the population smoked at the same rate, a gap as large as the one in our sample
(about 30% of women versus about 51% of men) would almost never arise by chance:
the probability of seeing it (the p-value) is about 3.03e-59.
We therefore conclude that the smoking rates of women and men are different.

__Q1b.__ Construct three 95% confidence intervals: one for the proportion of women who smoke, one for the proportion of men who smoke, and one for the difference in the rates of smoking between women and men.

### Using statsmodels

In [6]:
# 95% CI for the proportion of females who smoke  
sm.stats.proportion_confint( sum(dx_females), len(dx_females))  

(0.2882949879861214, 0.32139545615923526)

In [7]:
# 95% CI for the proportion of Males who smoke 
sm.stats.proportion_confint( sum(dx_males), len(dx_males))

(0.49458749263718593, 0.5319290347874418)

In [8]:
# 95% CI for the difference in proportions (females minus males)
smprop.confint_proportions_2indep(sum(dx_females), len(dx_females), sum(dx_males), len(dx_males), compare='diff', alpha=0.05, correction=True)

(-0.23316728428702627, -0.18329691308756202)

__Q1c.__ Comment on any ways in which the confidence intervals that you found in part b reinforce, contradict, or add support to the hypothesis test conducted in part a.

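The two approaches agree: the test rejects the null hypothesis of no difference, and the 95% confidence interval for the difference (about -0.23 to -0.18) excludes zero.
The confidence intervals add information by quantifying the gap: the proportion of women who smoke is roughly 18 to 23 percentage points lower than the proportion of men, and the two single-group intervals do not overlap.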

### Using NumPy

In [8]:
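# standard error for the proportion of females who smoke
# (p and n are the sample proportion and sample size, entered by hand)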
p = .304845
n = 2972
se_female = np.sqrt(p * (1 - p)/n)
se_female

0.00844415041930423

In [18]:
# CI for females
p = .304845 
lcb = p - 1.96 * se_female
ucb = p + 1.96 * se_female
(lcb, ucb)

(0.2882944651781637, 0.32139553482183625)

In [9]:
p = .513258
n = 2753
se_male = np.sqrt(p * (1 - p)/ n)
se_male

0.009526078787008965

In [19]:
# CI for Males
p = .513258 
lcb = p - 1.96 * se_male
ucb = p + 1.96 * se_male
(lcb, ucb)

(0.49458688557746244, 0.5319291144225375)

In [10]:
se_diff = np.sqrt(se_female**2 + se_male**2)
se_diff

0.012729880335656654

In [11]:
d = .304845 - .513258
lcb = d - 1.96 * se_diff
ucb = d + 1.96 * se_diff
(lcb, ucb)

(-0.23336356545788706, -0.18346243454211297)
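
The repeated steps above can be wrapped in small functions. This is a minimal sketch: `wald_ci` and `diff_ci` are hypothetical helper names, and the inputs are the same rounded proportions and sample sizes typed in by hand above.

```python
import math

def wald_ci(p, n, z=1.96):
    """Normal-approximation (Wald) interval for a proportion p estimated from n observations."""
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

def diff_ci(p1, n1, p2, n2, z=1.96):
    """Interval for p1 - p2; the two squared standard errors are added before taking the root."""
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    d = p1 - p2
    return d - z * se, d + z * se

female_ci = wald_ci(0.304845, 2972)
male_ci = wald_ci(0.513258, 2753)
difference_ci = diff_ci(0.304845, 2972, 0.513258, 2753)
print(female_ci, male_ci, difference_ci)
```

An interval for the difference that excludes zero points the same way as rejecting equal proportions at the 0.05 level.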

## Question 2

Partition the population into two groups based on whether a person has graduated college or not, using the educational attainment variable [DMDEDUC2](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#DMDEDUC2).  Then conduct a test of the null hypothesis that the average heights (in centimeters) of the two groups are equal.  Next, convert the heights from centimeters to inches, and conduct a test of the null hypothesis that the average heights (in inches) of the two groups are equal.

In [28]:
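# prepare the data
# DMDEDUC2 == 5 means college graduate or above; codes 1-4 are lower attainment
# note: code 9 ("Don't know") is counted as "No" here; any code not listed in the
# mapping is left as-is and so lands in neither group below

da.DMDEDUC2.value_counts()
da['grad'] = da['DMDEDUC2'].replace({5: "Yes", 4: "No", 3: "No", 1: "No", 2: "No", 9: "No"})
dx = da[["grad", "BMXHT"]].dropna()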


# make two views of our data: heights of graduates and of non-graduates
dx_grad = dx.loc[dx.grad == 'Yes', "BMXHT"]
dx_not_grad = dx.loc[dx.grad == 'No', "BMXHT"]

# z-test for the difference in mean height (cm) between the two groups
cm_test = sm.stats.ztest(dx_grad, dx_not_grad)

# convert to inches and repeat the test
dx_grad_inch = dx_grad * 0.393701
dx_not_grad_inch = dx_not_grad * 0.393701

inch_test = sm.stats.ztest(dx_grad_inch, dx_not_grad_inch)
(cm_test, inch_test)

((7.578706943765076, 3.4901585776605263e-14),
 (7.578706943765011, 3.490158577662278e-14))

__Q2a.__ Based on the analysis performed here, are you confident that people who graduated from college have a different average height compared to people who did not graduate from college?

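Yes. The z-statistic is about 7.58 with a p-value of about 3.5e-14, so a difference in sample means this large would be extremely unlikely if the two groups had the same average height.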

__Q2b:__ How do the results obtained using the heights expressed in inches compare to the results obtained using the heights expressed in centimeters?

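The results are essentially identical: the test statistic (about 7.58) and the p-value (about 3.49e-14) agree up to floating-point rounding. Converting units multiplies both the mean difference and its standard error by the same constant, so their ratio does not change.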
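
A quick way to check that a change of units cannot alter the test is to run it on simulated heights. The sketch below does not use NHANES data; the group sizes, means, and the helper name `two_sample_z` are assumptions made only for illustration, and the helper uses an unpooled standard error rather than the default of `sm.stats.ztest`.

```python
import numpy as np

def two_sample_z(x, y):
    """Unpooled two-sample z-statistic for the difference in means (hypothetical helper)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    se = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    return (x.mean() - y.mean()) / se

rng = np.random.default_rng(0)
heights_a_cm = rng.normal(168.0, 10.0, size=500)   # simulated, not NHANES
heights_b_cm = rng.normal(166.0, 10.0, size=800)   # simulated, not NHANES

CM_TO_INCH = 1 / 2.54
z_cm = two_sample_z(heights_a_cm, heights_b_cm)
z_inch = two_sample_z(heights_a_cm * CM_TO_INCH, heights_b_cm * CM_TO_INCH)
print(z_cm, z_inch)
```

The two printed statistics should agree up to floating-point rounding, whatever positive conversion factor is used.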

## Question 3

Conduct a hypothesis test of the null hypothesis that the average BMI for men between 30 and 40 is equal to the average BMI for men between 50 and 60.  Then carry out this test again after log transforming the BMI values.

In [55]:
# keep males with non-missing age and BMI
dx = da[["RIDAGEYR", "RIAGENDRx", "BMXBMI"]].dropna()
dx = dx.loc[dx.RIAGENDRx == "Male"]

# two age bands (both endpoints inclusive)
male34 = dx.loc[(dx.RIDAGEYR >= 30) & (dx.RIDAGEYR <= 40), ['BMXBMI']]
male56 = dx.loc[(dx.RIDAGEYR >= 50) & (dx.RIDAGEYR <= 60), ['BMXBMI']]


normal_data = sm.stats.ztest(male34.BMXBMI, male56.BMXBMI)

# log data calculation 
male34_log = np.log(male34.BMXBMI)
male56_log = np.log(male56.BMXBMI)

log_data = sm.stats.ztest(male34_log, male56_log)

(normal_data,log_data)

((0.8984008016755222, 0.36897190924214873),
 (0.7057844184100666, 0.4803222133688403))

__Q3a.__ How would you characterize the evidence that mean BMI differs between these age bands, and how would you characterize the evidence that mean log BMI differs between these age bands?

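Neither test provides evidence of a difference: the p-values are about 0.37 for BMI and about 0.48 for log BMI,
both well above 0.05, so we fail to reject the null hypothesis in either case. This does not show that the means are equal, only that the data are consistent with equality.
The two results differ somewhat, likely because the log transformation reduces the right skew of BMI and so lessens the influence of very large values.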

## Question 4

Suppose we wish to compare the mean BMI between college graduates and people who have not graduated from college, focusing on women between the ages of 30 and 40.  First, consider the variance of BMI within each of these subpopulations using graphical techniques, and through the estimated subpopulation variances.  Then, calculate pooled and unpooled estimates of the standard error for the difference between the mean BMI in the two populations being compared.  Finally, test the null hypothesis that the two population means are equal, using each of the two different standard errors.

In [None]:
# insert your code here
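
One possible starting point for the standard-error part of this question is sketched below. It uses simulated BMI values, not the NHANES subsample, so the numbers it prints say nothing about the real data; the helper names `se_pooled`, `se_unpooled`, and `z_and_pvalue`, as well as the group sizes and spreads, are assumptions made for illustration.

```python
import math
import numpy as np

def se_unpooled(x, y):
    """Standard error of mean(x) - mean(y), allowing the two variances to differ."""
    return np.sqrt(np.var(x, ddof=1) / len(x) + np.var(y, ddof=1) / len(y))

def se_pooled(x, y):
    """Standard error of mean(x) - mean(y), assuming a common variance."""
    n1, n2 = len(x), len(y)
    pooled_var = ((n1 - 1) * np.var(x, ddof=1) + (n2 - 1) * np.var(y, ddof=1)) / (n1 + n2 - 2)
    return np.sqrt(pooled_var * (1 / n1 + 1 / n2))

def z_and_pvalue(x, y, se):
    """z-statistic and two-sided normal p-value for a given standard error."""
    z = (np.mean(x) - np.mean(y)) / se
    return z, math.erfc(abs(z) / math.sqrt(2))

rng = np.random.default_rng(1)
bmi_grad = rng.normal(27.0, 6.0, size=150)       # simulated, not NHANES
bmi_not_grad = rng.normal(30.0, 8.0, size=400)   # simulated, not NHANES

pooled = se_pooled(bmi_grad, bmi_not_grad)
unpooled = se_unpooled(bmi_grad, bmi_not_grad)
print("pooled  ", pooled, z_and_pvalue(bmi_grad, bmi_not_grad, pooled))
print("unpooled", unpooled, z_and_pvalue(bmi_grad, bmi_not_grad, unpooled))
```

The two standard errors coincide when the groups have equal size and drift apart as the group sizes and variances become more unequal, which is why comparing the sample variances first is informative.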

__Q4a.__ Comment on the strength of evidence against the null hypothesis that these two populations have equal mean BMI.

__Q4b.__ Comment on the degree to which the two populations have different variances, and on the extent to which the results using different approaches to estimating the standard error of the mean difference give divergent results.

## Question 5

Conduct a test of the null hypothesis that the first and second diastolic blood pressure measurements within a subject have the same mean values.

In [None]:
# insert your code here
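
Since both readings come from the same person, a paired analysis works with the within-subject differences. The sketch below shows that idea on simulated readings rather than NHANES data. In the NHANES file the two diastolic readings are assumed to be the columns `BPXDI1` and `BPXDI2`; the code does not read them, and the helper name `paired_z` and all simulation settings are hypothetical.

```python
import math
import numpy as np

def paired_z(first, second):
    """One-sample z-test on the within-subject differences (hypothetical helper)."""
    d = np.asarray(first, dtype=float) - np.asarray(second, dtype=float)
    se = d.std(ddof=1) / np.sqrt(len(d))
    z = d.mean() / se
    # erfc(|z| / sqrt(2)) equals the two-sided standard normal tail probability.
    return z, math.erfc(abs(z) / math.sqrt(2))

rng = np.random.default_rng(2)
subject_level = rng.normal(70.0, 12.0, size=1000)           # simulated per-person baseline
first = subject_level + rng.normal(0.3, 3.0, size=1000)     # simulated first reading
second = subject_level + rng.normal(0.0, 3.0, size=1000)    # simulated second reading

z_paired, p_paired = paired_z(first, second)
print(z_paired, p_paired)
```

With real data, rows with a missing first or second reading would need to be dropped before taking differences.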

__Q5a.__ Briefly describe your findings for an audience that is not familiar with statistical hypothesis testing.

__Q5b.__ Pretend that the first and second diastolic blood pressure measurements were taken on different people.  Modify the analysis above as appropriate for this setting.

In [None]:
# insert your code here
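
If the two sets of readings are treated as coming from different people, the comparison becomes a two-sample test in which the variances of the two groups are added. The sketch below illustrates this on simulated readings, not NHANES data; the helper name `unpaired_z` and the simulation settings are assumptions. It also prints the standard error a paired analysis would use on the same simulated values, purely for comparison.

```python
import math
import numpy as np

def unpaired_z(x, y):
    """Two-sample z-test treating x and y as independent samples (unpooled standard error)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    se = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    z = (x.mean() - y.mean()) / se
    return z, math.erfc(abs(z) / math.sqrt(2)), se

rng = np.random.default_rng(3)
subject_level = rng.normal(70.0, 12.0, size=1000)           # simulated per-person baseline
first = subject_level + rng.normal(0.3, 3.0, size=1000)     # simulated first reading
second = subject_level + rng.normal(0.0, 3.0, size=1000)    # simulated second reading

z_ind, p_ind, se_ind = unpaired_z(first, second)

# For comparison: the standard error of the mean within-subject difference.
diff = first - second
se_paired = diff.std(ddof=1) / np.sqrt(len(diff))
print(z_ind, p_ind, se_ind, se_paired)
```

In this simulation the readings within a person are strongly correlated, so ignoring the pairing inflates the standard error and weakens the test.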

__Q5c.__ Briefly describe how the approaches used and the results obtained in the preceding two parts of the question differ.