# Practice notebook for confidence intervals using NHANES data

This notebook will give you the opportunity to practice working with confidence intervals using the NHANES data.

You can enter your code into the cells that say "enter your code here", and you can type responses to the questions into the cells that say "Type Markdown and Latex".

Note that most of the code that you will need to write below is very similar to code that appears in the case study notebook. You will need to edit code from that notebook in small ways to adapt it to the prompts below.

To get started, we will use the same module imports and read the data in the same way as we did in the case study:


In [5]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels.api as sm
import scipy.stats

da = pd.read_csv("nhanes_2015_2016.csv")

## Question 1

Restrict the sample to women between 35 and 50, then use the marital status variable [DMDMARTL](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#DMDMARTL) to partition this sample into two groups - women who are currently married, and women who are not currently married. Within each of these groups, calculate the proportion of women who have completed college. Calculate 95% confidence intervals for each of these proportions.


In [6]:
da.head()

Unnamed: 0,SEQN,ALQ101,ALQ110,ALQ130,SMQ020,RIAGENDR,RIDAGEYR,RIDRETH1,DMDCITZN,DMDEDUC2,...,BPXSY2,BPXDI2,BMXWT,BMXHT,BMXBMI,BMXLEG,BMXARML,BMXARMC,BMXWAIST,HIQ210
0,83732,1.0,,1.0,1,1,62,3,1.0,5.0,...,124.0,64.0,94.8,184.5,27.8,43.3,43.6,35.9,101.1,2.0
1,83733,1.0,,6.0,1,1,53,3,2.0,3.0,...,140.0,88.0,90.4,171.4,30.8,38.0,40.0,33.2,107.9,
2,83734,1.0,,,1,1,78,3,1.0,3.0,...,132.0,44.0,83.4,170.1,28.8,35.6,37.0,31.0,116.5,2.0
3,83735,2.0,1.0,1.0,2,2,56,3,1.0,5.0,...,134.0,68.0,109.8,160.9,42.4,38.5,37.7,38.3,110.1,2.0
4,83736,2.0,1.0,1.0,2,2,42,4,1.0,4.0,...,114.0,54.0,55.2,164.9,20.3,37.4,36.0,27.2,80.4,2.0


In [7]:
# dataframe with only women 35 to 50
women35to50 = da.query("RIDAGEYR >= 35 & RIDAGEYR <= 50 & RIAGENDR ==2")
assert np.all(women35to50["RIAGENDR"] == 2)
assert np.all(women35to50["RIDAGEYR"] >= 35)
assert np.all(women35to50["RIDAGEYR"] <= 50)

# partitioned dataframes
married_women35to50 = women35to50.query("DMDMARTL == 1")
nonmarried_women35to50 = women35to50.query("DMDMARTL != 1")
assert np.all(married_women35to50["DMDMARTL"] == 1)
assert np.all(nonmarried_women35to50["DMDMARTL"] != 1)

In [8]:
completed_college_married = married_women35to50.query("DMDEDUC2 == 5")
completed_college_nonmarried = nonmarried_women35to50.query("DMDEDUC2 == 5")

# phat_married = np.mean(completed_college_married)
# phat_nonmarried = np.mean(completed_college_nonmarried)

# pmarried = (completed_college_married["DMDEDUC2"].size / married_women35to50["DMDEDUC2"].size)
# pnonmarried = (completed_college_nonmarried["DMDEDUC2"].size/ nonmarried_women35to50["DMDEDUC2"].size)

confmarried = sm.stats.proportion_confint(
    completed_college_married["DMDEDUC2"].size, married_women35to50["DMDEDUC2"].size
)
confnonmarried = sm.stats.proportion_confint(
    completed_college_nonmarried["DMDEDUC2"].size,
    nonmarried_women35to50["DMDEDUC2"].size,
)
print(
    f"The 95% confidence interval for the proportion of married women who completed college is {confmarried}"
)
print(
    f"The 95% confidence interval for the proportion of non-married women who completed college is {confnonmarried}"
)

# lcb = pmarried - 1.96 * np.sqrt(pmarried * (1 - pmarried) / married_women35to50.size)
# ucb = pmarried + 1.96 * np.sqrt(pmarried * (1 - pmarried) / married_women35to50.size)
# print(lcb, ucb)

The 95% confidence interval for the proportion of married women who completed college is (0.31638193710753626, 0.4052216263668512)
The 95% confidence interval for the proportion of non-married women who completed college is (0.16936816767089768, 0.2566673352876822)


**Q1a.** Identify which of the two confidence intervals is wider, and explain why this is the case.


**Q1b.** Write 1-2 sentences summarizing these findings for an audience that does not know what a confidence interval is (the goal here is to report the substance of what you learned about how marital status and educational attainment are related, not to teach a person what a confidence interval is).


## Question 2

Construct a 95% confidence interval for the proportion of smokers who are female. Construct a 95% confidence interval for the proportion of smokers who are male. Construct a 95% confidence interval for the **difference** between those two gender proportions.


In [9]:
females = da.query("RIAGENDR == 2")
males = da.query("RIAGENDR == 1")

female_smoker = females["SMQ020"] == 2
male_smoker = males["SMQ020"] == 1

In [10]:
female_smoker_ci = sm.stats.proportion_confint(
    np.sum(female_smoker),
    female_smoker.size,
)
male_smoker_ci = sm.stats.proportion_confint(
    np.sum(male_smoker),
    male_smoker.size,
)
print(
    f"The 95% confidence interval for the proportion of female smokers is {female_smoker_ci}"
)
print(
    f"The 95% confidence interval for the proportion of male smokers is {male_smoker_ci}"
)

The 95% confidence interval for the proportion of female smokers is (0.677667131270064, 0.7107737289449898)
The 95% confidence interval for the proportion of male smokers is (0.49349056386167417, 0.5307935970661982)


In [11]:
smoker_proportion_diff_ci = sm.stats.confint_proportions_2indep(
    np.sum(male_smoker), male_smoker.size, np.sum(female_smoker), female_smoker.size
)
smoker_proportion_diff_ci
print(
    f"A CI for the difference between male smoking proportion and the female smoking proportion is {smoker_proportion_diff_ci}"
)

A CI for the difference between male smoking proportion and the female smoking proportion is (-0.20684777261087148, -0.15700176164490956)


**Q2a.** Why might it be relevant to report the separate gender proportions **and** the difference between the gender proportions?


**Q2b.** How does the **width** of the confidence interval for the difference between the gender proportions compare to the widths of the confidence intervals for the separate gender proportions?


## Question 3

Construct a 95% interval for height ([BMXHT](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BMX_I.htm#BMXHT)) in centimeters. Then convert height from centimeters to inches by dividing by 2.54, and construct a 95% confidence interval for height in inches. Finally, convert the endpoints (the lower and upper confidence limits) of the confidence interval from inches to back to centimeters


In [12]:
heights = da["BMXHT"].dropna()

heights_ci = sm.stats.DescrStatsW(heights).zconfint_mean()
heights_ci

(165.88055125872887, 166.40511769949427)

In [13]:
inches_height = heights / 2.54
inches_heights_ci = sm.stats.DescrStatsW(inches_height).zconfint_mean()
inches_heights_ci

(65.30730364516884, 65.51382586594264)

**Q3a.** Describe how the confidence interval constructed in centimeters relates to the confidence interval constructed in inches.


## Question 4

Partition the sample based on 10-year age bands, i.e. the resulting groups will consist of people with ages from 18-28, 29-38, etc. Construct 95% confidence intervals for the difference between the mean BMI for females and for males within each age band.


In [48]:
da["age_ranges"] = pd.cut(da["RIDAGEYR"], bins=[10, 18, 28, 38, 48, 58, 68, 78, 88])
# da.head()
gender_age_bmi = (
    da.groupby(["age_ranges", "RIAGENDR"])
    .agg({"BMXBMI": [np.mean, np.var, np.size]})
    .unstack()
)
diff_means = (
    gender_age_bmi[("BMXBMI", "mean", 1)] - gender_age_bmi[("BMXBMI", "mean", 2)]
)
gender_age_bmi["diff_means"] = diff_means
gender_age_bmi.head()

# gender_age_bmi["var"]
# gender_age_bmi.reset_index(inplace=True)

# male_age_bmi = gender_age_bmi["RIAGENDR"] == 1
# female_age_bmi = gender_age_bmi["RIAGENDR"] == 2
# female_age_bmi = gender_age_bmi.query("RIAGENDR == 2")

  da.groupby(["age_ranges", "RIAGENDR"])
  .agg({"BMXBMI": [np.mean, np.var, np.size]})
  .agg({"BMXBMI": [np.mean, np.var, np.size]})


Unnamed: 0_level_0,BMXBMI,BMXBMI,BMXBMI,BMXBMI,BMXBMI,BMXBMI,diff_means
Unnamed: 0_level_1,mean,mean,var,var,size,size,Unnamed: 7_level_1
RIAGENDR,1,2,1,2,1,2,Unnamed: 7_level_2
age_ranges,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3
"(10, 18]",26.333333,26.394118,56.016129,51.616084,63,70,-0.060784
"(18, 28]",27.058186,28.019433,44.61592,64.784043,458,498,-0.961247
"(28, 38]",29.69718,29.943443,45.248362,63.347226,467,494,-0.246263
"(38, 48]",29.514646,31.003733,37.270418,64.716266,398,514,-1.489086
"(48, 58]",29.385132,30.787361,37.841365,58.48564,419,454,-1.40223


In [49]:
# gender_age_bmi["BMXBMI", "var", 1]

In [15]:
da["age_ranges"] = pd.cut(da["RIDAGEYR"], bins=[10, 18, 28, 38, 48, 58, 68, 78, 88])
male_bmi = da.query("RIAGENDR == 1")
male_bmi = male_bmi[["BMXBMI", "age_ranges"]]
male_age_bmi = male_bmi.groupby("age_ranges")
# male_age_bmi.head()

female_bmi = da.query("RIAGENDR == 2")
female_bmi = female_bmi[["BMXBMI", "age_ranges"]]
female_age_bmi = female_bmi.groupby("age_ranges")
female_age_bmi.head()

  male_age_bmi = male_bmi.groupby("age_ranges")
  female_age_bmi = female_bmi.groupby("age_ranges")


Unnamed: 0,BMXBMI,age_ranges
3,42.4,"(48, 58]"
4,20.3,"(38, 48]"
5,28.6,"(68, 78]"
7,28.2,"(28, 38]"
12,26.6,"(28, 38]"
13,43.7,"(58, 68]"
15,35.4,"(48, 58]"
16,32.8,"(18, 28]"
17,25.3,"(18, 28]"
18,38.0,"(18, 28]"


In [35]:
diff_means = sm.stats.CompareMeans(male_age_bmi["BMXBMI"], female_age_bmi["BMXBMI"])

In [50]:
age_ranges = pd.cut(da["RIDAGEYR"], bins=[0, 10, 20, 30, 40, 50, 60, 70, 80])
unique_age_ranges = list(set(age_ranges))


def MakeUnPooledMeanConfidenceInterval(data_one, data_two, confidence):
    n1 = data_one.size
    n2 = data_two.size

    v1 = np.var(data_one, ddof=1)
    v2 = np.var(data_two, ddof=1)

    standard_error = np.sqrt(v1 / n1 + v2 / n2)

    t_multiplier = scipy.stats.t.ppf(1 - (1 - confidence) / 2, df=min(n1, n2))

    diff = np.mean(data_one) - np.mean(data_two)

    lower_bound = diff - t_multiplier * standard_error
    upper_bound = diff + t_multiplier * standard_error

    return {"n1": n1, "n2": n2, "lower_bound": lower_bound, "upper_bound": upper_bound}


for age_range in sorted(unique_age_ranges):
    is_this_age = da[age_ranges == age_range]
    males_bmi = is_this_age[is_this_age["RIAGENDR"] == 1]["BMXBMI"]
    females_bmi = is_this_age[is_this_age["RIAGENDR"] == 2]["BMXBMI"]
    ci = MakeUnPooledMeanConfidenceInterval(males_bmi, females_bmi, 0.95)
    print(
        "Age: {} | CI for difference in male (n1:{}) bmi less female (n2:{}) bmi: ({:.2f}, {:.2f}) | Width: {:.2f}".format(
            age_range,
            ci["n1"],
            ci["n2"],
            ci["lower_bound"],
            ci["upper_bound"],
            ci["upper_bound"] - ci["lower_bound"],
        )
    )

Age: (10, 20] | CI for difference in male (n1:175) bmi less female (n2:165) bmi: (-1.70, 1.29) | Width: 3.00
Age: (20, 30] | CI for difference in male (n1:432) bmi less female (n2:514) bmi: (-1.63, 0.21) | Width: 1.85
Age: (30, 40] | CI for difference in male (n1:458) bmi less female (n2:474) bmi: (-1.68, 0.25) | Width: 1.93
Age: (40, 50] | CI for difference in male (n1:401) bmi less female (n2:502) bmi: (-2.38, -0.49) | Width: 1.90
Age: (50, 60] | CI for difference in male (n1:454) bmi less female (n2:470) bmi: (-2.39, -0.64) | Width: 1.75
Age: (60, 70] | CI for difference in male (n1:437) bmi less female (n2:441) bmi: (-2.59, -0.78) | Width: 1.81
Age: (70, 80] | CI for difference in male (n1:402) bmi less female (n2:410) bmi: (-1.96, -0.40) | Width: 1.56


**Q4a.** How do the widths of these confidence intervals differ? Provide an explanation for any substantial differences in the confidence interval widths that you see.


## Question 5

Construct a 95% confidence interval for the first and second systolic blood pressure measures, and for the difference between the first and second systolic blood pressure measurements within a subject.


In [99]:
first_systolic = da["BPXSY1"].dropna()
second_systolic = da["BPXSY2"].dropna()

diff_systolic = first_systolic - second_systolic
diff_systolic.describe

<bound method NDFrame.describe of 0        4.0
1        6.0
2        6.0
3       -2.0
4      -14.0
        ... 
5730     0.0
5731     2.0
5732     8.0
5733    -2.0
5734     4.0
Length: 5567, dtype: float64>

In [100]:
first_systolic_ci = sm.stats.DescrStatsW(first_systolic).zconfint_mean()
second_systolic_ci = sm.stats.DescrStatsW(second_systolic).zconfint_mean()
# diff_systolic_ci = sm.stats.DescrStatsW(diff_systolic).zconfint_mean()
diff_systolic_ci = sm.stats.confint_proportions_2indep(
    np.sum(first_systolic),
    first_systolic.size,
    np.sum(second_systolic),
    second_systolic.size,
)
print(first_systolic_ci)
print(second_systolic_ci)
print(diff_systolic_ci)

(124.59174272058787, 125.57748520016754)
(124.29493306967777, 125.27110125733216)
(nan, nan)


  dist = crit * np.sqrt(


**Q5a.** Based on these confidence intervals, would you say that a difference of zero between the population mean values of the first and second systolic blood pressure measures is consistent with the data?


**Q5b.** Discuss how the width of the confidence interval for the within-subject difference compares to the widths of the confidence intervals for the first and second measures.


## Question 6

Construct a 95% confidence interval for the mean difference between the average age of a smoker, and the average age of a non-smoker.


In [None]:
# insert your code here

**Q6a.** Use graphical and numerical techniques to compare the variation in the ages of smokers to the variation in the ages of non-smokers.


In [None]:
# insert your code here

**Q6b.** Does it appear that uncertainty about the mean age of smokers, or uncertainty about the mean age of non-smokers contributed more to the uncertainty for the mean difference that we are focusing on here?
