# Confidence Intervals


This tutorial demonstrates how to load data, clean and manipulate a dataset, and construct confidence intervals for the difference between two population proportions and for the difference between two population means.

We will use the 2015-2016 wave of the NHANES data for our analysis.

*Note: We have provided a notebook with further analysis, including confidence intervals for a single population proportion and a single population mean, in addition to the analysis shown in this tutorial. I highly recommend checking it out!*

For the population proportions, we will analyze the difference between the proportions of female and male smokers. The column that identifies smokers and non-smokers in our dataset is "SMQ020".

For the population means, we will analyze the difference in mean body mass index (BMI) between the female and male populations. The column that holds the BMI value is "BMXBMI".

Gender is recorded in the column "RIAGENDR".

In [30]:
import pandas as pd
import numpy as np
import matplotlib
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
import statsmodels.api as sm

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [18]:
url = "nhanes_2015_2016.csv"
da = pd.read_csv(url)

### Investigating and Cleaning Data

In [20]:
da["SMQ020"] = da["SMQ020"].replace({1: "Yes", 2: "No", 7: np.nan, 9: np.nan})
da["SMQ020"].head(10)

0    Yes
1    Yes
2    Yes
3     No
4     No
5     No
6    Yes
7     No
8     No
9     No
Name: SMQ020, dtype: object

In [21]:
da["RIAGENDR"] = da["RIAGENDR"].replace({1: "Male", 2: "Female"})
da["RIAGENDR"].head(10)

0      Male
1      Male
2      Male
3    Female
4    Female
5    Female
6      Male
7    Female
8      Male
9      Male
Name: RIAGENDR, dtype: object

In [24]:
dx = da[["SMQ020", "RIAGENDR"]].dropna()
pd.crosstab(dx.RIAGENDR, dx.SMQ020)

SMQ020      No   Yes
RIAGENDR
Female    2066   906
Male      1340  1413


In [8]:
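# Encode smoking status as 1 (Yes) / 0 (No) in a new column, so the group mean is the proportion of smokers
dx["SMQ020x"] = dx["SMQ020"].replace({"Yes": 1, "No": 0})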

dz = dx.groupby("RIAGENDR").agg({"SMQ020x":[np.mean, np.size]})
dz.columns = ["Proportion", "Total n"]
dz

          Proportion  Total n
RIAGENDR
Female      0.304845     2972
Male        0.513258     2753


### Constructing Confidence Intervals

Now that we have the sample proportions of male and female smokers, we can calculate confidence intervals.  From lecture, we know that the general form is:

$$Best\ Estimate \pm Margin\ of\ Error$$

Where the *Best Estimate* is the **observed proportion or mean** from the sample and the *Margin of Error* is the **multiplier (a t or z critical value) times the standard error** of the estimate.

The equation for a 95% confidence interval can therefore be written as:

$$Sample\ Proportion\ or\ Mean\ \pm (t\text{-}multiplier \times Standard\ Error)$$

The Standard Error (SE) is calculated differently for a proportion and for a mean:

$$Standard\ Error\ for\ Population\ Proportion = \sqrt{\frac{Sample\ Proportion \times (1 - Sample\ Proportion)}{Number\ Of\ Observations}}$$

$$Standard\ Error\ for\ Mean = \frac{Sample\ Standard\ Deviation}{\sqrt{Number\ Of\ Observations}}$$

Lastly, the standard error for the difference of two proportions or two means, assuming the two samples are independent, is:

$$Standard\ Error\ for\ Difference\ of\ Two\ Population\ Proportions\ Or\ Means = \sqrt{(SE_{1})^2 + (SE_{2})^2}$$
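The standard error formulas translate almost directly into code. The following is a minimal sketch rather than part of the original notebook: the helper names `se_proportion`, `se_mean`, and `se_difference` are hypothetical, and the inputs are the rounded proportions and group sizes from the table above.

```python
import math

def se_proportion(p, n):
    """Standard error of a sample proportion p based on n observations."""
    return math.sqrt(p * (1 - p) / n)

def se_mean(sd, n):
    """Standard error of a sample mean, given the sample standard deviation sd."""
    return sd / math.sqrt(n)

def se_difference(se_1, se_2):
    """Standard error of the difference of two independent estimates."""
    return math.sqrt(se_1**2 + se_2**2)

# Rounded smoking proportions and group sizes for females and males
se_f = se_proportion(0.304845, 2972)
se_m = se_proportion(0.513258, 2753)
print(se_f, se_m, se_difference(se_f, se_m))
```

Keeping the three pieces separate mirrors the formulas: the same `se_difference` helper works whether the two inputs come from proportions or from means.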

#### Difference of Two Population Proportions

In [3]:
p1 = .304845
n1 = 2972
se_female = np.sqrt(p1 * (1 - p1)/n1)
se_female

0.00844415041930423

In [4]:
p2 = .513258
n2 = 2753
se_male = np.sqrt(p2 * (1 - p2)/ n2)
se_male

0.009526078787008965

In [5]:
se_diff = np.sqrt(se_female**2 + se_male**2)
se_diff

0.012729880335656654

In [6]:
"""
Calculate a 95% CI
"""

d = p2 - p1  # male proportion minus female proportion
zstar = 1.96
lcb = d - zstar * se_diff
ucb = d + zstar * se_diff
(lcb, ucb)

(0.18346243454211297, 0.23336356545788706)
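The cells above can be folded into a single function that works from raw counts instead of rounded proportions. This is a sketch under stated assumptions, not the notebook's own method: the name `diff_proportion_ci` is hypothetical, the multiplier is taken from the standard normal distribution via the standard library's `statistics.NormalDist`, and the standard error is the unpooled one used in this tutorial.

```python
import math
from statistics import NormalDist

def diff_proportion_ci(successes_1, n_1, successes_2, n_2, level=0.95):
    """CI for p_2 - p_1 using the unpooled standard error and a normal multiplier."""
    p_1 = successes_1 / n_1
    p_2 = successes_2 / n_2
    se_diff = math.sqrt(p_1 * (1 - p_1) / n_1 + p_2 * (1 - p_2) / n_2)
    # Two-sided interval: the multiplier is the (0.5 + level/2) quantile
    z_star = NormalDist().inv_cdf(0.5 + level / 2)
    d = p_2 - p_1
    return d - z_star * se_diff, d + z_star * se_diff

# Female smokers: 906 of 2972; male smokers: 1413 of 2753
lcb, ucb = diff_proportion_ci(906, 2972, 1413, 2753)
print(round(lcb, 3), round(ucb, 3))
```

Because the function uses the exact quantile rather than the rounded value 1.96, its bounds can differ from the hand calculation in the later decimal places.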

In [None]:
"""
With 95% confidence, the population proportion of males who smoke is
estimated to be 18.3 to 23.3 percentage points higher than the population
proportion of females who smoke.
"""

#### Difference of Two Population Means

In [25]:
dx = da[["BMXBMI", "RIAGENDR"]].dropna()

In [27]:
dz = dx.groupby("RIAGENDR").agg({"BMXBMI": [np.mean, np.std, np.size]})
dz.columns = ['mean', 'std', 'size']
dz

               mean       std    size
RIAGENDR
Female    29.939946  7.753319  2944.0
Male      28.778072  6.252568  2718.0


In [25]:
s1, n1 = 7.753319, 2944
s2, n2 = 6.252568, 2718

# estimated standard error of sample mean
sem_female = s1 / np.sqrt(n1)
sem_male = s2 / np.sqrt(n2)
(sem_female, sem_male)

(0.14289556500661638, 0.11993161927499355)

In [31]:
"""
Unpooled approach
"""

sem_diff = np.sqrt(sem_female**2 + sem_male**2)
sem_diff

d = 29.939946 - 28.778072
zstar = 1.96

lcb = d - zstar * sem_diff
ucb = d + zstar * sem_diff
(lcb, ucb)

'\nUnpooled approach\n'

0.18655491363264096

(0.7962263692800211, 1.5275216307199737)

In [28]:
df = n1+n2-2
df

"""
For df = 5660 and a 95% confidence level, the critical t value is tstar = 1.96 (to two decimals)
"""

5660
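The value 1.96 need not be looked up in a table. With thousands of degrees of freedom the t distribution is practically indistinguishable from the standard normal, so the sketch below uses the normal quantile from Python's standard library as a stand-in; this substitution is an assumption of the sketch. If SciPy is installed, `scipy.stats.t.ppf(0.975, 5660)` returns the t critical value itself, which also rounds to 1.96.

```python
from statistics import NormalDist

# Two-sided 95% interval: 2.5% in each tail, so use the 97.5th percentile
confidence_level = 0.95
zstar = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
print(round(zstar, 4))
```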

In [32]:
"""
pooled approach
"""

sem_diff_pool = np.sqrt(((n1-1)*s1**2 + (n2-1)*s2**2)/(n1+n2-2)) * np.sqrt((n1+n2)/(n1*n2))
tstar = 1.96

lcb = d - tstar * sem_diff_pool
ucb = d + tstar * sem_diff_pool
(lcb, ucb)

'\npooled approach\n'

(0.7931204088659315, 1.5306275911340634)
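Both intervals can be reproduced from the summary statistics alone with one small helper. This is an illustrative sketch: the function name `diff_means_ci` and its parameters are hypothetical, the default multiplier of 1.96 is the value used in this tutorial, and the inputs are the rounded means, standard deviations, and group sizes from the BMI table.

```python
import math

def diff_means_ci(mean_1, sd_1, n_1, mean_2, sd_2, n_2, multiplier=1.96, pooled=False):
    """CI for mean_1 - mean_2 from summary statistics."""
    if pooled:
        # Pooled variance: assumes both populations share a common variance
        var_pooled = ((n_1 - 1) * sd_1**2 + (n_2 - 1) * sd_2**2) / (n_1 + n_2 - 2)
        se = math.sqrt(var_pooled * (1 / n_1 + 1 / n_2))
    else:
        # Unpooled: each group contributes its own variance estimate
        se = math.sqrt(sd_1**2 / n_1 + sd_2**2 / n_2)
    d = mean_1 - mean_2
    return d - multiplier * se, d + multiplier * se

# (mean, std, size) for BMXBMI by gender
female = (29.939946, 7.753319, 2944)
male = (28.778072, 6.252568, 2718)

unpooled_ci = diff_means_ci(*female, *male)
pooled_ci = diff_means_ci(*female, *male, pooled=True)
print(unpooled_ci)
print(pooled_ci)
```

The `pooled` flag only changes how the standard error is computed; the point estimate and the multiplier are the same in both cases.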

In [None]:
"""
The pooled and unpooled intervals are nearly identical.
What can you conclude from this?

Because the two CIs barely differ, our results are fairly robust
to whether or not we assume equal variances.
"""