Confidence Intervals

This tutorial demonstrates how to load data, clean and manipulate a dataset, and construct confidence intervals for the difference between two population proportions and for the difference between two population means.

We will use the 2015-2016 wave of the NHANES data for our analysis.

For the population proportions, we will analyze the difference in the proportion of smokers between females and males. The column that distinguishes smokers from non-smokers is "SMQ020".

For the population means, we will analyze the difference in mean body mass index (BMI) between females and males. The BMI values are in the column "BMXBMI".

Gender is recorded in the column "RIAGENDR".


In [27]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import statsmodels.api as sm

In [4]:
url = "C:/Users/eli/Desktop/nhanes_2015_2016.csv"
da = pd.read_csv(url)

Investigating and Cleaning Data

In [5]:
# Recode SMQ020 into new variable SMQ020x: 1 -> "Yes", 2 -> "No";
# 7 (refused) and 9 (don't know) are treated as missing
da["SMQ020x"] = da.SMQ020.replace({1: "Yes", 2: "No", 7: np.nan, 9: np.nan})
da["SMQ020x"]

0       Yes
1       Yes
2       Yes
3        No
4        No
       ... 
5730    Yes
5731     No
5732    Yes
5733    Yes
5734     No
Name: SMQ020x, Length: 5735, dtype: object

In [6]:
# Recode RIAGENDR from 1/2 to Male/Female into new variable RIAGENDRx
da["RIAGENDRx"] = da.RIAGENDR.replace({1: "Male", 2: "Female"})
da["RIAGENDRx"]

0         Male
1         Male
2         Male
3       Female
4       Female
         ...  
5730    Female
5731      Male
5732    Female
5733      Male
5734    Female
Name: RIAGENDRx, Length: 5735, dtype: object

In [9]:
dx = da[["SMQ020x", "RIAGENDRx"]].dropna()
pd.crosstab(dx.SMQ020x, dx.RIAGENDRx)

RIAGENDRx  Female  Male
SMQ020x
No           2066  1340
Yes           906  1413


In [11]:
# Recode SMQ020x from Yes/No to 1/0 in place, so that its mean is the proportion of smokers
dx["SMQ020x"] = dx.SMQ020x.map({"Yes": 1, "No": 0})

In [13]:
# The mean of a 0/1 indicator is the proportion of ones, i.e. the proportion of smokers;
# size is the number of respondents in each group
dz = dx.groupby("RIAGENDRx").agg({"SMQ020x": ["mean", "size"]})
dz.columns = ["Proportion of smokers", "Total n"]
dz

           Proportion of smokers  Total n
RIAGENDRx
Female                  0.304845     2972
Male                    0.513258     2753
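The proportions in this table can be reproduced directly from the crosstab counts, because the mean of a 0/1 indicator equals the number of ones divided by the group size. A minimal sketch, with the counts copied from the crosstab; the names `counts` and `proportions` are illustrative and not part of the original notebook:

```python
# Counts copied from the crosstab of SMQ020x by RIAGENDRx.
# `counts` and `proportions` are hypothetical names used only for this check.
counts = {
    "Female": {"Yes": 906, "No": 2066},
    "Male": {"Yes": 1413, "No": 1340},
}

proportions = {}
for gender, c in counts.items():
    n = c["Yes"] + c["No"]      # group size
    p_hat = c["Yes"] / n        # sample proportion of smokers
    proportions[gender] = (p_hat, n)
    print(f"{gender}: proportion = {p_hat:.6f}, n = {n}")
```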


Difference of Two Population Proportions

In [17]:
p_hat_f = 0.304845
q_hat_f = 1 - p_hat_f
n_f = 2972

p_hat_m = 0.513258
q_hat_m = 1 - p_hat_m
n_m = 2753

In [16]:
Z_score = 1.96 # for 95% Confidence Interval
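The value 1.96 is the 97.5th percentile of the standard normal distribution, which leaves 2.5% in each tail for a two-sided 95% interval. As a small sketch, it can be derived with the standard library instead of being typed in; the names `confidence`, `alpha`, and `z_star` are illustrative assumptions:

```python
from statistics import NormalDist

confidence = 0.95
alpha = 1 - confidence

# Two-sided interval: put alpha / 2 in each tail of the standard normal.
z_star = NormalDist().inv_cdf(1 - alpha / 2)
print(round(z_star, 4))
```

Changing `confidence` to another level, such as 0.90 or 0.99, gives the corresponding critical value.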

In [19]:
# Margin of error: critical value times the standard error of the difference in proportions
diff = Z_score * np.sqrt((p_hat_f * q_hat_f / n_f) + (p_hat_m * q_hat_m / n_m))
diff

0.024950565457887045

In [21]:
lower_bound = (p_hat_f - p_hat_m) - diff
lower_bound

-0.23336356545788706

In [23]:
upper_bound = (p_hat_f - p_hat_m) + diff
upper_bound

-0.18346243454211297

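With 95% confidence, the difference between the population proportions of smokers (female minus male) lies between
-0.23336356545788706 and -0.18346243454211297. Because the whole interval is below zero, the data suggest that the proportion of smokers is lower among women than among men.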
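The steps above can be collected into one reusable function. This is a minimal sketch of the same unpooled (Wald) interval, not a library routine; the function name `diff_proportions_ci` and its parameters are assumptions made for illustration, and the critical value is computed rather than rounded to 1.96, so the bounds differ from the ones above only far past the fourth decimal place.

```python
from math import sqrt
from statistics import NormalDist


def diff_proportions_ci(p1, n1, p2, n2, confidence=0.95):
    """Wald confidence interval for p1 - p2 from two independent samples."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Standard error of the difference, using each sample's own proportion
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    margin = z * se
    estimate = p1 - p2
    return estimate - margin, estimate + margin


# Female minus male, using the proportions and group sizes from the table
prop_lower, prop_upper = diff_proportions_ci(0.304845, 2972, 0.513258, 2753)
print(prop_lower, prop_upper)
```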

Difference of Two Population Means

In [24]:
da["BMXBMI"].head()

0    27.8
1    30.8
2    28.8
3    42.4
4    20.3
Name: BMXBMI, dtype: float64

In [25]:
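# Note: size counts every row in the group, including any rows where BMXBMI is missing;
# "count" would give the number of non-missing values instead
da.groupby("RIAGENDRx").agg({"BMXBMI": ["mean", "std", "size"]})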

              BMXBMI
                mean       std  size
RIAGENDRx
Female     29.939946  7.753319  2976
Male       28.778072  6.252568  2759


In [34]:
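# First approach (unpooled): the two population variances are not assumed to be equal
X_bar1 = 29.939946  # female sample mean
X_bar2 = 28.778072  # male sample mean
s1 = 7.753319       # female sample standard deviation
s2 = 6.252568       # male sample standard deviation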

In [35]:
n1 = 2976
n2 = 2759
Degrees_of_Freedom = min((n1 - 1), (n2 - 1))
Degrees_of_Freedom

2758
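Taking the smaller of `n1 - 1` and `n2 - 1` is a conservative choice of degrees of freedom. A common alternative for the unpooled case is the Welch-Satterthwaite approximation, sketched below with the summary statistics from the table. This is a different technique from the one used in this notebook, and the function name `welch_df` is hypothetical.

```python
def welch_df(s1, n1, s2, n2):
    """Welch-Satterthwaite approximate degrees of freedom for unequal variances."""
    v1 = s1 ** 2 / n1   # estimated variance of the first sample mean
    v2 = s2 ** 2 / n2   # estimated variance of the second sample mean
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))


# Summary statistics copied from the groupby table (female, then male)
df_welch = welch_df(7.753319, 2976, 6.252568, 2759)
df_conservative = min(2976 - 1, 2759 - 1)
print(df_welch, df_conservative)
```

With samples this large, either choice gives a critical value very close to 1.96, so the resulting interval is essentially the same.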

In [31]:
T_score = 1.96 # 95% confidence level; with this many degrees of freedom the t critical value is approximately equal to the Z_score
diff_means = T_score * np.sqrt((s1 * s1 / n1) + (s2 * s2 / n2))
diff_means

0.3633642745235303

In [36]:
lower = (X_bar1 - X_bar2) - diff_means
lower

0.798509725476467

In [37]:
upper =  (X_bar1 - X_bar2) + diff_means
upper

1.5252382745235278

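With 95% confidence, the difference between the population mean BMI of women and men (female minus male) lies between
0.798509725476467 and 1.5252382745235278. Because the whole interval is above zero, the data suggest that mean BMI is higher among women than among men.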
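The calculation for the means can be wrapped in a function in the same way. This is a minimal sketch under the large-sample assumption used above, with a normal critical value standing in for the t critical value; the function name `diff_means_ci` and its parameters are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def diff_means_ci(mean1, s1, n1, mean2, s2, n2, confidence=0.95):
    """Unpooled large-sample confidence interval for mean1 - mean2."""
    # Normal critical value; assumes both samples are large enough
    # that the t critical value is essentially the same.
    crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)   # unpooled standard error
    margin = crit * se
    estimate = mean1 - mean2
    return estimate - margin, estimate + margin


# Female minus male, using the summary statistics from the groupby table
bmi_lower, bmi_upper = diff_means_ci(29.939946, 7.753319, 2976,
                                     28.778072, 6.252568, 2759)
print(bmi_lower, bmi_upper)
```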