# Practice notebook for confidence intervals using NHANES data

This notebook will give you the opportunity to practice working with confidence intervals using the NHANES data.

You can enter your code into the cells that say "enter your code here", and you can type responses to the questions into the cells that say "Type Markdown and Latex".

Note that most of the code that you will need to write below is very similar to code that appears in the case study notebook.  You will need to edit code from that notebook in small ways to adapt it to the prompts below.

To get started, we will use the same module imports and read the data in the same way as we did in the case study:

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels.api as sm
from scipy.stats import t

da = pd.read_csv("nhanes_2015_2016.csv")

In [2]:
da.head()

Unnamed: 0,SEQN,ALQ101,ALQ110,ALQ130,SMQ020,RIAGENDR,RIDAGEYR,RIDRETH1,DMDCITZN,DMDEDUC2,...,BPXSY2,BPXDI2,BMXWT,BMXHT,BMXBMI,BMXLEG,BMXARML,BMXARMC,BMXWAIST,HIQ210
0,83732,1.0,,1.0,1,1,62,3,1.0,5.0,...,124.0,64.0,94.8,184.5,27.8,43.3,43.6,35.9,101.1,2.0
1,83733,1.0,,6.0,1,1,53,3,2.0,3.0,...,140.0,88.0,90.4,171.4,30.8,38.0,40.0,33.2,107.9,
2,83734,1.0,,,1,1,78,3,1.0,3.0,...,132.0,44.0,83.4,170.1,28.8,35.6,37.0,31.0,116.5,2.0
3,83735,2.0,1.0,1.0,2,2,56,3,1.0,5.0,...,134.0,68.0,109.8,160.9,42.4,38.5,37.7,38.3,110.1,2.0
4,83736,2.0,1.0,1.0,2,2,42,4,1.0,4.0,...,114.0,54.0,55.2,164.9,20.3,37.4,36.0,27.2,80.4,2.0


## Question 1

Restrict the sample to women between 35 and 50, then use the marital status variable [DMDMARTL](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#DMDMARTL) to partition this sample into two groups - women who are currently married, and women who are not currently married.  Within each of these groups, calculate the proportion of women who have completed college.  Calculate 95% confidence intervals for each of these proportions.

In [3]:
# enter your code here
women = da[(da['RIAGENDR'] == 2)]
women_35_50 = women[women['RIDAGEYR'].isin(range(35,51))]
women_35_50.head()

Unnamed: 0,SEQN,ALQ101,ALQ110,ALQ130,SMQ020,RIAGENDR,RIDAGEYR,RIDRETH1,DMDCITZN,DMDEDUC2,...,BPXSY2,BPXDI2,BMXWT,BMXHT,BMXBMI,BMXLEG,BMXARML,BMXARMC,BMXWAIST,HIQ210
4,83736,2.0,1.0,1.0,2,2,42,4,1.0,4.0,...,114.0,54.0,55.2,164.9,20.3,37.4,36.0,27.2,80.4,2.0
34,83799,,,,2,2,37,2,1.0,4.0,...,110.0,72.0,66.6,161.6,25.5,,,,,2.0
50,83828,1.0,,2.0,2,2,39,1,2.0,3.0,...,100.0,62.0,71.3,162.0,27.2,36.8,34.6,29.1,94.6,
52,83832,2.0,1.0,4.0,2,2,50,1,2.0,1.0,...,,,105.9,157.7,42.6,29.2,35.0,40.7,129.1,
55,83837,2.0,2.0,,2,2,45,1,1.0,2.0,...,114.0,68.0,77.5,148.3,35.2,30.5,34.0,34.4,107.6,2.0


In [4]:
married_women = women_35_50[women_35_50['DMDMARTL'] == 1]
married_women.head()

Unnamed: 0,SEQN,ALQ101,ALQ110,ALQ130,SMQ020,RIAGENDR,RIDAGEYR,RIDRETH1,DMDCITZN,DMDEDUC2,...,BPXSY2,BPXDI2,BMXWT,BMXHT,BMXBMI,BMXLEG,BMXARML,BMXARMC,BMXWAIST,HIQ210
34,83799,,,,2,2,37,2,1.0,4.0,...,110.0,72.0,66.6,161.6,25.5,,,,,2.0
50,83828,1.0,,2.0,2,2,39,1,2.0,3.0,...,100.0,62.0,71.3,162.0,27.2,36.8,34.6,29.1,94.6,
55,83837,2.0,2.0,,2,2,45,1,1.0,2.0,...,114.0,68.0,77.5,148.3,35.2,30.5,34.0,34.4,107.6,2.0
61,83851,2.0,1.0,1.0,1,2,37,3,1.0,3.0,...,122.0,74.0,85.1,155.3,35.3,32.5,33.6,36.1,106.5,1.0
62,83853,,,,2,2,49,3,1.0,3.0,...,116.0,84.0,76.1,166.7,27.4,39.2,38.6,32.6,88.7,2.0


In [5]:
unmarried_women = women_35_50[~(women_35_50['DMDMARTL'] == 1)]
unmarried_women.head()

Unnamed: 0,SEQN,ALQ101,ALQ110,ALQ130,SMQ020,RIAGENDR,RIDAGEYR,RIDRETH1,DMDCITZN,DMDEDUC2,...,BPXSY2,BPXDI2,BMXWT,BMXHT,BMXBMI,BMXLEG,BMXARML,BMXARMC,BMXWAIST,HIQ210
4,83736,2.0,1.0,1.0,2,2,42,4,1.0,4.0,...,114.0,54.0,55.2,164.9,20.3,37.4,36.0,27.2,80.4,2.0
52,83832,2.0,1.0,4.0,2,2,50,1,2.0,1.0,...,,,105.9,157.7,42.6,29.2,35.0,40.7,129.1,
58,83845,1.0,,,1,2,44,4,1.0,1.0,...,116.0,78.0,133.3,171.5,45.3,37.3,35.7,48.7,,2.0
100,83911,1.0,,1.0,2,2,43,4,1.0,4.0,...,126.0,76.0,91.1,172.3,30.7,40.4,38.2,34.6,101.6,
127,83958,2.0,1.0,2.0,1,2,47,4,1.0,3.0,...,148.0,76.0,58.6,160.6,22.7,36.8,35.2,28.0,77.9,
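
One caveat about the `~(women_35_50['DMDMARTL'] == 1)` filter above: it keeps every row whose code is not 1, which includes missing values and the survey's "refused" (77) and "don't know" (99) codes. The following is a minimal sketch, using a hypothetical handful of codes rather than the NHANES file, of how an explicit list of codes changes the count. In pandas the same idea could be written with `isin([2, 3, 4, 5, 6])`.

```python
# Hypothetical DMDMARTL codes for eight respondents (None marks a missing value).
# Per the NHANES codebook: 1 = married, 2-6 = other marital statuses,
# 77 = refused, 99 = don't know.
codes = [1, 5, None, 77, 3, 1, 6, 99]

NOT_MARRIED_CODES = {2, 3, 4, 5, 6}

# "Anything that is not 1": also counts missing, refused and don't-know answers.
loose = [c for c in codes if c != 1]

# Only respondents who reported a status other than married.
strict = [c for c in codes if c in NOT_MARRIED_CODES]

print(len(loose), len(strict))
```

Whether the two definitions differ for the actual sample depends on how many such rows the 35-50 age group contains.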


**DMDEDUC2 values:**
- 4: Some college or AA degree
- 5: College graduate or above

**Only code 5 is counted as having completed college.**

In [6]:
married_graduates = married_women[(married_women['DMDEDUC2'].isin([5]))]
married_graduates.head()

Unnamed: 0,SEQN,ALQ101,ALQ110,ALQ130,SMQ020,RIAGENDR,RIDAGEYR,RIDRETH1,DMDCITZN,DMDEDUC2,...,BPXSY2,BPXDI2,BMXWT,BMXHT,BMXBMI,BMXLEG,BMXARML,BMXARMC,BMXWAIST,HIQ210
63,83854,2.0,1.0,1.0,2,2,46,1,1.0,5.0,...,116.0,68.0,110.2,162.7,41.6,39.2,35.6,43.4,110.2,2.0
76,83875,2.0,2.0,,2,2,42,4,2.0,5.0,...,102.0,74.0,91.6,163.1,34.4,41.3,39.5,35.8,99.4,2.0
114,83935,,,,2,2,44,5,1.0,5.0,...,102.0,72.0,55.7,154.4,23.4,35.3,33.5,26.0,86.6,2.0
124,83953,1.0,,5.0,2,2,46,1,1.0,5.0,...,118.0,66.0,67.9,153.8,28.7,36.0,34.5,32.5,93.0,2.0
166,84016,1.0,,2.0,2,2,41,1,2.0,5.0,...,100.0,62.0,68.9,169.0,24.1,39.0,36.0,26.9,88.4,1.0


In [7]:
unmarried_graduates = unmarried_women[(unmarried_women['DMDEDUC2'].isin([5]))]
unmarried_graduates.head()

Unnamed: 0,SEQN,ALQ101,ALQ110,ALQ130,SMQ020,RIAGENDR,RIDAGEYR,RIDRETH1,DMDCITZN,DMDEDUC2,...,BPXSY2,BPXDI2,BMXWT,BMXHT,BMXBMI,BMXLEG,BMXARML,BMXARMC,BMXWAIST,HIQ210
519,84625,2.0,1.0,1.0,2,2,40,5,2.0,5.0,...,,,50.6,158.2,20.2,,,,,
531,84646,,,,2,2,50,3,1.0,5.0,...,102.0,66.0,62.8,168.9,22.0,38.3,33.4,26.8,76.2,2.0
537,84659,2.0,1.0,1.0,2,2,47,4,1.0,5.0,...,128.0,84.0,136.2,168.1,48.2,32.4,38.5,40.8,151.2,2.0
549,84677,2.0,1.0,1.0,1,2,43,4,1.0,5.0,...,134.0,54.0,85.6,166.9,30.7,43.1,35.9,31.9,95.1,2.0
658,84859,2.0,2.0,,2,2,42,4,1.0,5.0,...,108.0,62.0,99.6,163.7,37.2,37.0,41.0,39.7,124.7,2.0


In [8]:
# proportion of married women who are college graduates
n1 = len(married_women)
p1 = len(married_graduates)/n1

print("There are {} married women, {} of whom are graduates".format(n1, len(married_graduates)))
print("The proportion of married women who graduated college is {}".format(p1))

There are 449 married women, 162 of whom are graduates
The proportion of married women who graduated college is 0.36080178173719374


In [9]:
# proportion of unmarried women who are college graduates
n2 = len(unmarried_women)
p2 = len(unmarried_graduates)/n2

print("There are {} unmarried women, {} of whom are graduates".format(n2, len(unmarried_graduates)))
print("The proportion of unmarried women who graduated college is {}".format(p2))

There are 338 unmarried women, 72 of whom are graduates
The proportion of unmarried women who graduated college is 0.21301775147928995


In [10]:
conf_int1 = sm.stats.proportion_confint(n1 * p1, n1)
conf_int1

(0.31638193710753626, 0.4052216263668512)

In [11]:
conf_int2 = sm.stats.proportion_confint(n2 * p2, n2)
conf_int2

(0.16936816767089768, 0.2566673352876822)

In [12]:
conf_int1[1] - conf_int1[0]

0.08883968925931496

In [13]:
conf_int2[1] - conf_int2[0]

0.08729916761678452

__Q1a.__ Identify which of the two confidence intervals is wider, and explain why this is the case. 

- The confidence interval for married women is slightly wider (about 0.089 versus 0.087).

- All other things being equal, a larger sample gives a smaller standard error and a narrower interval, so the larger married group (449 versus 338) would be expected to have the narrower interval. However, the standard error also depends on p(1 - p), which is largest at p = 0.5, as the "conservative confidence interval" approach relies on. The proportion for married women (0.36) is closer to 0.5 than the proportion for unmarried women (0.21), and this effect outweighs the difference in sample size.

- The closer a proportion is to 0.5 (the proportion of maximum uncertainty), the larger its variance.
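
As a quick numerical check of this reasoning, the sketch below recomputes the two interval widths from the counts printed above, using the normal-approximation formula `2 * z * sqrt(p * (1 - p) / n)`, which is the default method of `proportion_confint`. The helper name `wald_width` is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def wald_width(successes, n, level=0.95):
    """Width of the normal-approximation (Wald) interval for a proportion."""
    p = successes / n
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for 95%
    return 2 * z * sqrt(p * (1 - p) / n)

# Counts taken from the printed output above.
married = wald_width(162, 449)
unmarried = wald_width(72, 338)
print(round(married, 4), round(unmarried, 4))

# Variance terms p(1 - p): the married proportion is closer to 0.5.
p1, p2 = 162 / 449, 72 / 338
print(round(p1 * (1 - p1), 3), round(p2 * (1 - p2), 3))
```

The larger `p * (1 - p)` term for the married group more than offsets its larger sample size.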

__Q1b.__ Write 1-2 sentences summarizing these findings for an audience that does not know what a confidence interval is (the goal here is to report the substance of what you learned about how marital status and educational attainment are related, not to teach a person what a confidence interval is).

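**Among women aged 35-50, an estimated 36% of those who are currently married have completed college (plausibly between about 32% and 41%), compared with about 21% of those who are not currently married (plausibly between about 17% and 26%). Because these ranges do not overlap, and because proportions rather than counts are compared (so the different group sizes, 449 and 338, do not distort the comparison), married women in this age group appear more likely to hold a college degree.**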

## Question 2

Construct a 95% confidence interval for the proportion of smokers who are female. Construct a 95% confidence interval for the proportion of smokers who are male. Construct a 95% confidence interval for the **difference** between those two gender proportions.

In [14]:
f_smokers = women[women['SMQ020'] == 1]
f_smokers.head()

Unnamed: 0,SEQN,ALQ101,ALQ110,ALQ130,SMQ020,RIAGENDR,RIDAGEYR,RIDRETH1,DMDCITZN,DMDEDUC2,...,BPXSY2,BPXDI2,BMXWT,BMXHT,BMXBMI,BMXLEG,BMXARML,BMXARMC,BMXWAIST,HIQ210
12,83752,1.0,,2.0,1,2,30,2,1.0,4.0,...,104.0,50.0,71.2,163.6,26.6,37.3,35.7,31.0,90.7,2.0
18,83762,,,,1,2,27,4,1.0,4.0,...,144.0,84.0,107.9,168.5,38.0,40.1,39.0,41.6,114.8,1.0
22,83775,2.0,1.0,,1,2,69,2,1.0,1.0,...,132.0,48.0,77.7,160.2,30.3,32.7,37.6,30.7,106.8,2.0
27,83785,2.0,1.0,1.0,1,2,60,2,1.0,5.0,...,136.0,74.0,75.6,145.2,35.9,31.0,33.1,36.0,108.0,2.0
30,83788,2.0,1.0,1.0,1,2,69,3,1.0,4.0,...,148.0,72.0,84.0,164.6,31.0,35.0,35.8,33.0,103.0,2.0


In [15]:
men = da[(da['RIAGENDR'] == 1)]
m_smokers = men[men['SMQ020'] == 1]
m_smokers.head()

Unnamed: 0,SEQN,ALQ101,ALQ110,ALQ130,SMQ020,RIAGENDR,RIDAGEYR,RIDRETH1,DMDCITZN,DMDEDUC2,...,BPXSY2,BPXDI2,BMXWT,BMXHT,BMXBMI,BMXLEG,BMXARML,BMXARMC,BMXWAIST,HIQ210
0,83732,1.0,,1.0,1,1,62,3,1.0,5.0,...,124.0,64.0,94.8,184.5,27.8,43.3,43.6,35.9,101.1,2.0
1,83733,1.0,,6.0,1,1,53,3,2.0,3.0,...,140.0,88.0,90.4,171.4,30.8,38.0,40.0,33.2,107.9,
2,83734,1.0,,,1,1,78,3,1.0,3.0,...,132.0,44.0,83.4,170.1,28.8,35.6,37.0,31.0,116.5,2.0
6,83741,1.0,,8.0,1,1,22,4,1.0,4.0,...,112.0,74.0,76.6,165.4,28.0,38.8,38.0,34.0,86.6,
10,83747,1.0,,1.0,1,1,46,3,1.0,5.0,...,150.0,90.0,86.2,176.7,27.6,41.0,38.0,33.6,104.3,2.0


In [16]:
nf = len(women)
pf = len(f_smokers)/nf

f_se = np.sqrt((pf * (1-pf))/nf)

In [17]:
print("There are {} females, {} of whom are smokers".format(nf, len(f_smokers)))
print("The proportion of females who smoke is {}".format(pf))

There are 2976 females, 906 of whom are smokers
The proportion of females who smoke is 0.30443548387096775


In [18]:
f_CI = sm.stats.proportion_confint(nf * pf, nf)
f_CI

(0.2879026244757051, 0.3209683432662304)

In [19]:
nm = len(men)
pm = len(m_smokers)/nm

m_se = np.sqrt((pm * (1-pm))/nm)

In [20]:
m_CI = sm.stats.proportion_confint(nm * pm, nm)
m_CI

(0.49349056386167417, 0.5307935970661982)

In [21]:
print("There are {} males, {} of whom are smokers".format(nm, len(m_smokers)))
print("The proportion of males who smoke is {}".format(pm))

There are 2759 males, 1413 of whom are smokers
The proportion of males who smoke is 0.5121420804639362


In [22]:
se_diff = np.sqrt(f_se**2 + m_se**2)
se_diff

0.012716649609722899

In [23]:
d = pm - pf
lcb = d - 1.96 * se_diff
ucb = d + 1.96 * se_diff
(lcb, ucb)

(0.18278196335791153, 0.2326312298280253)

In [24]:
f_CI[1] - f_CI[0]

0.033065718790525334

In [25]:
m_CI[1] - m_CI[0]

0.03730303320452405

In [26]:
ucb - lcb

0.049849266470113784

__Q2a.__ Why might it be relevant to report the separate gender proportions **and** the difference between the gender proportions?

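- The separate gender proportions describe how common smoking is within each gender: about 51% of men and about 30% of women are smokers.
- The difference quantifies the gap between the genders directly (about 21 percentage points, with a 95% interval of roughly 18 to 23 points) together with its own uncertainty, which cannot be read off the two separate intervals.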

__Q2b.__ How does the **width** of the confidence interval for the difference between the gender proportions compare to the widths of the confidence intervals for the separate gender proportions?

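**It is wider (about 0.050) than either of the confidence intervals for the separate gender proportions (about 0.033 and 0.037). The variance of the difference is the sum of the variances of the two separate estimates, so the difference has a larger margin of error.**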
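
The relationship between the three widths can be sketched from the counts printed above. Assuming the two gender groups are independent samples, the standard error of the difference is the square root of the sum of the squared standard errors, so the interval for the difference is wider than either separate interval but narrower than their sum. The value 1.96 is the same normal multiplier used in the cell that computes `lcb` and `ucb`.

```python
from math import sqrt

# Counts from the printed output above.
nf, f_smokers = 2976, 906
nm, m_smokers = 2759, 1413

pf, pm = f_smokers / nf, m_smokers / nm
f_se = sqrt(pf * (1 - pf) / nf)
m_se = sqrt(pm * (1 - pm) / nm)

# Standard errors of independent estimates add in quadrature.
se_diff = sqrt(f_se**2 + m_se**2)

width_f = 2 * 1.96 * f_se
width_m = 2 * 1.96 * m_se
width_diff = 2 * 1.96 * se_diff
print(round(width_f, 4), round(width_m, 4), round(width_diff, 4))
```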

## Question 3

Construct a 95% interval for height ([BMXHT](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BMX_I.htm#BMXHT)) in centimeters.  Then convert height from centimeters to inches by dividing by 2.54, and construct a 95% confidence interval for height in inches.  Finally, convert the endpoints (the lower and upper confidence limits) of the confidence interval from inches back to centimeters.

In [27]:
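def get_CI(col):
    # Confidence interval for the mean of `col`, based on the t distribution.
    mean = np.mean(col)
    std = np.std(col)
    # len(col) also counts missing values, if there are any
    df = len(col) - 1
    # t.ppf(0.95, df) is the 95th percentile, so mean +/- t_star * se
    # covers 90%; a two-sided 95% interval needs t.ppf(0.975, df)
    t_star = t.ppf(0.95, df)
    se = std / np.sqrt(len(col))
    lb = mean - t_star * se  # lower bound
    ub = mean + t_star * se  # upper bound

    return (lb, ub)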

In [28]:
get_CI(da['BMXHT'])

(165.9238964301693, 166.3617725280539)

In [29]:
get_CI(da['BMXHT']/2.54)

(65.32436867329498, 65.4967608378165)

In [30]:
165.9238964301693/65.32436867329498

2.5400000000000005

__Q3a.__ Describe how the confidence interval constructed in centimeters relates to the confidence interval constructed in inches.

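**Each endpoint of the confidence interval in centimeters is 2.54 times the corresponding endpoint of the interval in inches, so converting the inch-based endpoints back to centimeters reproduces the interval computed directly in centimeters.**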
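
Two details of `get_CI` are worth a sketch. First, the `0.95` quantile leaves 5% in each tail and so gives a 90% two-sided interval; a 95% interval uses the 97.5th percentile. Second, the interval is scale-equivariant: dividing the data by 2.54 divides both endpoints by 2.54. The example below uses hypothetical heights and, as a simplifying assumption, the normal quantile from the standard library in place of `t.ppf` (the two are nearly identical for samples as large as NHANES). `mean_ci` is an illustrative name, not part of the notebook.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_ci(values, level=0.95):
    """Two-sided normal-approximation confidence interval for a mean."""
    data = [v for v in values if v is not None]    # drop missing values first
    q = NormalDist().inv_cdf(1 - (1 - level) / 2)  # 0.975 quantile for 95%
    se = stdev(data) / sqrt(len(data))             # sample std (n - 1 denominator)
    m = mean(data)
    return (m - q * se, m + q * se)

# Hypothetical heights in centimeters, with one missing value.
# The sample is tiny on purpose: it only illustrates the scaling property.
heights_cm = [184.5, 171.4, 170.1, 160.9, 164.9, None, 158.3, 176.7, 165.4, 169.0]
heights_in = [h / 2.54 if h is not None else None for h in heights_cm]

lo_cm, hi_cm = mean_ci(heights_cm)
lo_in, hi_in = mean_ci(heights_in)

# Converting the inch endpoints back to centimeters recovers the cm interval.
print((lo_cm, hi_cm))
print((lo_in * 2.54, hi_in * 2.54))
```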

## Question 4

Partition the sample based on 10-year age bands, i.e. the resulting groups will consist of people with ages from 18-28, 29-38, etc. Construct 95% confidence intervals for the difference between the mean BMI for females and for males within each age band.

In [31]:
da['RIAGENDR'] = da.RIAGENDR.replace({1: "Male", 2: "Female"})

In [32]:
da.head()

Unnamed: 0,SEQN,ALQ101,ALQ110,ALQ130,SMQ020,RIAGENDR,RIDAGEYR,RIDRETH1,DMDCITZN,DMDEDUC2,...,BPXSY2,BPXDI2,BMXWT,BMXHT,BMXBMI,BMXLEG,BMXARML,BMXARMC,BMXWAIST,HIQ210
0,83732,1.0,,1.0,1,Male,62,3,1.0,5.0,...,124.0,64.0,94.8,184.5,27.8,43.3,43.6,35.9,101.1,2.0
1,83733,1.0,,6.0,1,Male,53,3,2.0,3.0,...,140.0,88.0,90.4,171.4,30.8,38.0,40.0,33.2,107.9,
2,83734,1.0,,,1,Male,78,3,1.0,3.0,...,132.0,44.0,83.4,170.1,28.8,35.6,37.0,31.0,116.5,2.0
3,83735,2.0,1.0,1.0,2,Female,56,3,1.0,5.0,...,134.0,68.0,109.8,160.9,42.4,38.5,37.7,38.3,110.1,2.0
4,83736,2.0,1.0,1.0,2,Female,42,4,1.0,4.0,...,114.0,54.0,55.2,164.9,20.3,37.4,36.0,27.2,80.4,2.0


In [33]:
df = da[da['RIAGENDR'].isin(['Female', 'Male'])].copy()

In [34]:
df["AGEGRP"] = pd.cut(df.RIDAGEYR, [18, 29, 39, 49, 59, 69, 80])
df.head()

Unnamed: 0,SEQN,ALQ101,ALQ110,ALQ130,SMQ020,RIAGENDR,RIDAGEYR,RIDRETH1,DMDCITZN,DMDEDUC2,...,BPXDI2,BMXWT,BMXHT,BMXBMI,BMXLEG,BMXARML,BMXARMC,BMXWAIST,HIQ210,AGEGRP
0,83732,1.0,,1.0,1,Male,62,3,1.0,5.0,...,64.0,94.8,184.5,27.8,43.3,43.6,35.9,101.1,2.0,"(59, 69]"
1,83733,1.0,,6.0,1,Male,53,3,2.0,3.0,...,88.0,90.4,171.4,30.8,38.0,40.0,33.2,107.9,,"(49, 59]"
2,83734,1.0,,,1,Male,78,3,1.0,3.0,...,44.0,83.4,170.1,28.8,35.6,37.0,31.0,116.5,2.0,"(69, 80]"
3,83735,2.0,1.0,1.0,2,Female,56,3,1.0,5.0,...,68.0,109.8,160.9,42.4,38.5,37.7,38.3,110.1,2.0,"(49, 59]"
4,83736,2.0,1.0,1.0,2,Female,42,4,1.0,4.0,...,54.0,55.2,164.9,20.3,37.4,36.0,27.2,80.4,2.0,"(39, 49]"
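
One property of `pd.cut` matters for the bins above: intervals are closed on the right by default, so the first bin is (18, 29] and any respondent aged exactly 18 falls outside every bin. A minimal sketch with a few hypothetical ages, showing the default behaviour and the `include_lowest=True` option:

```python
import pandas as pd

ages = pd.Series([18, 19, 29, 30, 80])  # hypothetical ages
bins = [18, 29, 39, 49, 59, 69, 80]

default = pd.cut(ages, bins)                         # first bin is (18, 29]
inclusive = pd.cut(ages, bins, include_lowest=True)  # first bin also contains 18

# True marks an age that was not assigned to any bin.
print(default.isna().tolist())
print(inclusive.isna().tolist())
```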


In [35]:
pr = df.groupby(["AGEGRP", "RIAGENDR"]).agg({"BMXBMI": [np.mean, np.std, np.size]}).unstack()
pr

AGEGRP,BMXBMI mean (Female),BMXBMI mean (Male),BMXBMI std (Female),BMXBMI std (Male),BMXBMI size (Female),BMXBMI size (Male)
"(18, 29]",28.08245,27.230677,7.890613,6.587966,551.0,508.0
"(29, 39]",30.208211,29.772422,8.192074,6.825048,481.0,452.0
"(39, 49]",30.922332,29.563409,7.911045,6.179002,511.0,402.0
"(49, 59]",30.864732,29.193807,7.584018,5.974769,451.0,437.0
"(59, 69]",31.029806,29.322426,7.79901,5.904651,468.0,449.0
"(69, 80]",29.284897,28.214483,6.398495,5.201107,444.0,448.0


In [36]:
# calculate the StandardErrorOfMean for females and for males within each age band
pr["BMXBMI", "sem", "Female"] = pr["BMXBMI", "std", "Female"] / np.sqrt(pr["BMXBMI", "size", "Female"])
pr["BMXBMI", "sem", "Male"] = pr["BMXBMI", "std", "Male"] / np.sqrt(pr["BMXBMI", "size", "Male"]) 

In [37]:
pr

AGEGRP,BMXBMI mean (Female),BMXBMI mean (Male),BMXBMI std (Female),BMXBMI std (Male),BMXBMI size (Female),BMXBMI size (Male),BMXBMI sem (Female),BMXBMI sem (Male)
"(18, 29]",28.08245,27.230677,7.890613,6.587966,551.0,508.0,0.336151,0.292294
"(29, 39]",30.208211,29.772422,8.192074,6.825048,481.0,452.0,0.373526,0.321023
"(39, 49]",30.922332,29.563409,7.911045,6.179002,511.0,402.0,0.349964,0.308181
"(49, 59]",30.864732,29.193807,7.584018,5.974769,451.0,437.0,0.357117,0.285812
"(59, 69]",31.029806,29.322426,7.79901,5.904651,468.0,449.0,0.360509,0.278658
"(69, 80]",29.284897,28.214483,6.398495,5.201107,444.0,448.0,0.303659,0.245729


In [38]:
# calculate the mean difference of BMI between females and males within each age band
# calculate its SE and the lower and upper limits of its 95% CI.
pr["BMXBMI", "mean_diff", ""] = pr["BMXBMI", "mean", "Female"] - pr["BMXBMI", "mean", "Male"]
pr["BMXBMI", "sem_diff", ""] = np.sqrt(pr["BMXBMI", "sem", "Female"]**2 + pr["BMXBMI", "sem", "Male"]**2) 
pr["BMXBMI", "lcb_diff", ""] = pr["BMXBMI", "mean_diff", ""] - 1.96 * pr["BMXBMI", "sem_diff", ""] 
pr["BMXBMI", "ucb_diff", ""] = pr["BMXBMI", "mean_diff", ""] + 1.96 * pr["BMXBMI", "sem_diff", ""] 

pr["BMXBMI", "width", ""] = pr["BMXBMI", "ucb_diff", ""] - pr["BMXBMI", "lcb_diff", ""] 

In [39]:
pr

AGEGRP,BMXBMI mean (Female),BMXBMI mean (Male),BMXBMI std (Female),BMXBMI std (Male),BMXBMI size (Female),BMXBMI size (Male),BMXBMI sem (Female),BMXBMI sem (Male),BMXBMI mean_diff,BMXBMI sem_diff,BMXBMI lcb_diff,BMXBMI ucb_diff,BMXBMI width
"(18, 29]",28.08245,27.230677,7.890613,6.587966,551.0,508.0,0.336151,0.292294,0.851772,0.445459,-0.021326,1.724871,1.746198
"(29, 39]",30.208211,29.772422,8.192074,6.825048,481.0,452.0,0.373526,0.321023,0.435789,0.492522,-0.529554,1.401132,1.930686
"(39, 49]",30.922332,29.563409,7.911045,6.179002,511.0,402.0,0.349964,0.308181,1.358923,0.466315,0.444945,2.272902,1.827957
"(49, 59]",30.864732,29.193807,7.584018,5.974769,451.0,437.0,0.357117,0.285812,1.670925,0.457407,0.774407,2.567443,1.793036
"(59, 69]",31.029806,29.322426,7.79901,5.904651,468.0,449.0,0.360509,0.278658,1.70738,0.45565,0.814306,2.600454,1.786149
"(69, 80]",29.284897,28.214483,6.398495,5.201107,444.0,448.0,0.303659,0.245729,1.070414,0.39063,0.30478,1.836049,1.531269


__Q4a.__ How do the widths of these confidence intervals differ?  Provide an explanation for any substantial differences in the confidence interval widths that you see.

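**The widths are broadly similar, ranging from about 1.5 to 1.9 BMI units. The narrowest interval is for the (69, 80] band, where the BMI standard deviations are smallest for both genders, and the widest is for the (29, 39] band, where they are largest. The group sizes are similar across bands, so the differences in width mainly reflect differences in BMI variability. In the first two age bands the intervals include zero, so there is not enough evidence to say that mean BMI is higher for females than for males, but from the (39, 49] band onwards the intervals lie entirely above zero, so we estimate that females have a higher mean BMI than males.**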
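
To see where the differences in width come from, the sketch below recomputes each interval width from the standard deviations and group sizes in the table above, using `2 * 1.96 * sqrt(sd_f**2 / n_f + sd_m**2 / n_m)`. The numbers are copied from the table; the variable names are illustrative.

```python
from math import sqrt

# (female std, male std, female n, male n) per age band, copied from the table.
bands = {
    "(18, 29]": (7.890613, 6.587966, 551, 508),
    "(29, 39]": (8.192074, 6.825048, 481, 452),
    "(39, 49]": (7.911045, 6.179002, 511, 402),
    "(49, 59]": (7.584018, 5.974769, 451, 437),
    "(59, 69]": (7.79901, 5.904651, 468, 449),
    "(69, 80]": (6.398495, 5.201107, 444, 448),
}

widths = {}
for band, (sd_f, sd_m, n_f, n_m) in bands.items():
    # Standard error of the difference between two independent means.
    sem_diff = sqrt(sd_f**2 / n_f + sd_m**2 / n_m)
    widths[band] = 2 * 1.96 * sem_diff

for band, w in widths.items():
    print(band, round(w, 3))

narrowest = min(widths, key=widths.get)
widest = max(widths, key=widths.get)
print("narrowest:", narrowest, "widest:", widest)
```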

## Question 5

Construct a 95% confidence interval for the first and second systolic blood pressure measures, and for the difference between the first and second systolic blood pressure measurements within a subject.

Cols: BPXSY1 & BPXSY2

In [40]:
CI1 = sm.stats.DescrStatsW(da["BPXSY1"].dropna()).zconfint_mean()
CI1

(124.59174272058787, 125.57748520016754)

In [41]:
CI2 = sm.stats.DescrStatsW(da["BPXSY2"].dropna()).zconfint_mean()
CI2

(124.29493306967777, 125.27110125733216)

In [42]:
from statsmodels.stats import weightstats as sms

In [43]:
cm = sms.CompareMeans(sms.DescrStatsW(da["BPXSY1"].dropna()), sms.DescrStatsW(da["BPXSY2"].dropna()))

In [44]:
CI3 = cm.tconfint_diff(usevar='unequal')
CI3

(-0.3921284617422677, 0.9953220554877552)

In [45]:
CI1[1] - CI1[0]

0.9857424795796703

In [46]:
CI2[1] - CI2[0]

0.9761681876543946

In [47]:
CI3[1] - CI3[0]

1.3874505172300229

__Q5a.__ Based on these confidence intervals, would you say that a difference of zero between the population mean values of the first and second systolic blood pressure measures is consistent with the data?

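- Yes: the interval for the difference computed above, about (-0.39, 1.00), contains 0. Note that this interval treats the two sets of measurements as independent samples rather than as paired measurements on the same subjects.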

__Q5b.__ Discuss how the width of the confidence interval for the within-subject difference compares to the widths of the confidence intervals for the first and second measures.

- The widths of the confidence intervals for the first and second measures are very similar (about 0.99 and 0.98). The interval for the difference computed above is wider, about 1.39, which is roughly 0.4 more than either.
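
Because both measurements come from the same subjects, a within-subject interval can also be built from the per-subject differences rather than from two independent samples. The sketch below uses simulated data (not NHANES values) with an assumed shared per-subject baseline, to illustrate why the paired interval is typically much narrower when the two measurements are strongly correlated. All names and parameters are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)

# Simulated paired readings: each subject has a baseline plus measurement noise.
n = 500
baseline = [random.gauss(125, 18) for _ in range(n)]
first = [b + random.gauss(0.5, 4) for b in baseline]
second = [b + random.gauss(0.0, 4) for b in baseline]

def ci(center, se, z=1.96):
    """Normal-approximation 95% interval around `center`."""
    return (center - z * se, center + z * se)

# Paired: one difference per subject, then a CI for the mean difference.
diffs = [a - b for a, b in zip(first, second)]
paired = ci(mean(diffs), stdev(diffs) / sqrt(n))

# Unpaired: treats the two columns as independent samples.
se_unpaired = sqrt(stdev(first)**2 / n + stdev(second)**2 / n)
unpaired = ci(mean(first) - mean(second), se_unpaired)

print("paired width:  ", round(paired[1] - paired[0], 2))
print("unpaired width:", round(unpaired[1] - unpaired[0], 2))
```

With the notebook's data, the paired version could be computed from `da["BPXSY1"] - da["BPXSY2"]` after dropping missing values.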

## Question 6

Construct a 95% confidence interval for the mean difference between the average age of a smoker, and the average age of a non-smoker.

In [48]:
smokers = da[da['SMQ020'] == 1]['RIDAGEYR']
non_smokers = da[~(da['SMQ020'] == 1)]['RIDAGEYR']

In [49]:
CI_S1 = sm.stats.DescrStatsW(smokers.dropna()).zconfint_mean()
CI_S1

(51.38591951147112, 52.80726720694198)

In [50]:
CI_S2 = sm.stats.DescrStatsW(non_smokers.dropna()).zconfint_mean()
CI_S2

(44.68411549118788, 45.929467646985415)

In [51]:
cm2 = sms.CompareMeans(sms.DescrStatsW(smokers.dropna()), sms.DescrStatsW(non_smokers.dropna()))

In [52]:
CI_S3 = cm2.tconfint_diff(usevar='unequal')
CI_S3

(5.844708831852225, 7.73489474838758)

__Q6a.__ Use graphical and numerical techniques to compare the variation in the ages of smokers to the variation in the ages of non-smokers.  
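
A possible numerical starting point, sketched with hypothetical ages rather than the NHANES columns, is to compare the standard deviation and interquartile range of the two groups. `spread_summary` is an illustrative helper; with the notebook's data it could be applied to lists such as `smokers.dropna().tolist()` and `non_smokers.dropna().tolist()`, and a side-by-side boxplot or histogram would give the graphical comparison.

```python
from statistics import stdev, quantiles

def spread_summary(values):
    """Return the sample standard deviation and interquartile range."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    return {"std": stdev(values), "iqr": q3 - q1}

# Hypothetical ages, for illustration only.
smoker_ages = [30, 27, 69, 60, 69, 62, 53, 78, 22, 46]
non_smoker_ages = [56, 42, 37, 39, 50, 45, 44, 43, 47, 40]

print(spread_summary(smoker_ages))
print(spread_summary(non_smoker_ages))
```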

__Q6b.__ Does it appear that uncertainty about the mean age of smokers, or uncertainty about the mean age of non-smokers contributed more to the uncertainty for the mean difference that we are focusing on here?

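- Uncertainty about the mean age of smokers contributed somewhat more: the interval for smokers is about 1.42 years wide, compared with about 1.25 years for non-smokers. The interval for the difference, about (5.84, 7.73), lies entirely above zero, so smokers are estimated to be older on average than non-smokers.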
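
One way to make this comparison concrete is to back out each group's standard error from its interval and compare the squared standard errors, since (assuming independent groups) the variance of the difference is their sum. The sketch uses the interval endpoints printed above, rounded to four decimals; the 1.96 multiplier matches the normal-based `zconfint_mean` intervals.

```python
from math import sqrt

z = 1.96
ci_smokers = (51.3859, 52.8073)      # CI_S1, rounded
ci_non_smokers = (44.6841, 45.9295)  # CI_S2, rounded

# Half-width = z * se, so se = width / (2 * z).
se_s = (ci_smokers[1] - ci_smokers[0]) / (2 * z)
se_n = (ci_non_smokers[1] - ci_non_smokers[0]) / (2 * z)

var_diff = se_s**2 + se_n**2
share_smokers = se_s**2 / var_diff
implied_width = 2 * z * sqrt(var_diff)

print("share of variance from smokers:", round(share_smokers, 2))
print("implied width of difference CI:", round(implied_width, 2))
```

The implied width is close to the width of `CI_S3` (about 1.89), which is consistent with the two groups' uncertainties combining in quadrature.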