# Module 1 Quiz

**==========================================================================================**

Is there an association between hypertension and body mass index (BMI)? In the following assessment, you will use the BPUrban1000 data set to conduct a series of hypothesis tests and interpret their results.

The BMI group variable from BPUrban1000 uses the cutoff value of 27.49 kg/m², in agreement with the JFCM paper that we read earlier. BMI greater than the cutoff was found to be significantly associated with hypertension.

Produce a 2x2 table showing the distribution of hypertension by bmi_group, where bmi_group is a categorical variable taking the values 'BMI>27.49' and 'BMI<=27.49'. Note that the BPUrban1000 data set contains the variables hypertension and bmi_group.

## Import Libraries

In [18]:
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import random

import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols

from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

from datetime import datetime, timedelta, date

%matplotlib inline
#sets the default autosave frequency in seconds
%autosave 60 
# sns.set() resets the style to its default, so set style and font scale in one call
sns.set(style='dark', font_scale=1.2)
#sns.set(rc={'figure.figsize':(14,10)})

plt.rc('axes', titlesize=9)
plt.rc('axes', labelsize=14)
plt.rc('xtick', labelsize=12)
plt.rc('ytick', labelsize=12)

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns',None)
#pd.set_option('display.max_rows',None)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format','{:.2f}'.format)

random.seed(0)
np.random.seed(0)
np.set_printoptions(suppress=True)

Autosaving every 60 seconds


## Import Data

In [2]:
df = pd.read_csv("BPUrban1000.csv")

In [None]:
df

In [None]:
df.info()

In [None]:
df.dtypes.value_counts()

In [None]:
df.isnull().sum()

In [None]:
# Create the 2x2 table for hypertension by bmi_group
contingency_table = pd.crosstab(df['hypertension'], df['bmi_group'], margins=True)

contingency_table

In [None]:
#How many hypertensive adults have BMI>27.49?

In [None]:
#How many non-hypertensive adults have BMI>27.49?

In [None]:
#How many hypertensive adults have BMI<=27.49?

In [None]:
#How many non-hypertensive adults have BMI<=27.49?

In [None]:
#What proportion of all adults (i.e. regardless of BMI) have hypertension?

In [None]:
#What proportion of adults do not have hypertension?

In [None]:
#What proportion of adults with BMI>27.49 have hypertension?

In [None]:
#What proportion of adults with BMI<=27.49 have hypertension?
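
The count and proportion questions above can all be read off a cross-tabulation. The sketch below uses a tiny made-up data frame (`toy`, which is hypothetical and not the BPUrban1000 data) with the same two columns, so the numbers it prints are illustrative only.

```python
import pandas as pd

# Hypothetical toy data standing in for BPUrban1000; values are made up
toy = pd.DataFrame({
    "hypertension": [1, 1, 1, 0, 0, 0, 0, 1, 0, 0],
    "bmi_group": ["BMI>27.49", "BMI>27.49", "BMI<=27.49", "BMI>27.49", "BMI<=27.49",
                  "BMI<=27.49", "BMI<=27.49", "BMI>27.49", "BMI>27.49", "BMI<=27.49"],
})

# Cell counts: rows are hypertension status, columns are BMI group
counts = pd.crosstab(toy["hypertension"], toy["bmi_group"])
htn_high_bmi = counts.loc[1, "BMI>27.49"]      # hypertensive, BMI>27.49
no_htn_high_bmi = counts.loc[0, "BMI>27.49"]   # non-hypertensive, BMI>27.49

# Overall proportion with and without hypertension (mean of a 0/1 column)
p_htn = toy["hypertension"].mean()
p_no_htn = 1 - p_htn

# Proportion with hypertension within each BMI group
p_htn_by_group = toy.groupby("bmi_group")["hypertension"].mean()

print(counts)
print(p_htn, p_no_htn)
print(p_htn_by_group)
```

On the real data, replacing `toy` with `df` should give the quiz answers; `pd.crosstab(..., normalize="columns")` is another way to get the within-group proportions directly.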

In [None]:
#Now, conduct a formal hypothesis test for the difference in the true population proportions (two proportions z-test).

from statsmodels.stats.proportion import proportions_ztest

# Define the counts and sample sizes for the two groups
count = np.array([contingency_table.loc[1, 'BMI>27.49'], contingency_table.loc[1, 'BMI<=27.49']])
nobs = np.array([contingency_table.loc['All', 'BMI>27.49'], contingency_table.loc['All', 'BMI<=27.49']])

# Conduct the two-proportion z-test
stat, p_value = proportions_ztest(count, nobs)
stat, p_value
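
To see what `proportions_ztest` computes with its default settings, the sketch below reproduces the pooled two-proportion z statistic by hand using only the standard library. It uses the counts that are hard-coded in the Fisher's exact test cell of this notebook; which column belongs to which BMI group is an assumption here, but that choice affects only the sign of z, not the two-sided p-value.

```python
import math

# Hypertensive counts and group sizes, taken from the hard-coded 2x2 table
# in the Fisher's exact test cell; the column-to-group mapping is assumed
x1, n1 = 154, 425 + 154   # assumed: BMI>27.49
x2, n2 = 58, 363 + 58     # assumed: BMI<=27.49

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)   # pooled proportion under H0: p1 == p2

# Standard error of the difference under the null hypothesis
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se

# Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"z = {z:.3f}, p = {p_value:.2e}")
```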


In [None]:
# Now, conduct Fisher’s exact test for count data on the 2x2 table of hypertension and BMI group.

from scipy.stats import fisher_exact

# Define the contingency table
contingency_table = [
    [363, 425],  # No hypertension
    [58, 154]    # Hypertension
]

# Conduct Fisher's exact test
odds_ratio, p_value = fisher_exact(contingency_table)

odds_ratio, p_value
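
For a 2x2 table, the odds ratio returned by `fisher_exact` is the sample odds ratio (the cross-product ratio), so it can be checked by hand. The sketch below does that for the table above and adds a Wald 95% confidence interval on the log scale; the interval is a standard large-sample approximation added here for illustration, not something the cell above computes.

```python
import math

# Cell counts from the table above
a, b = 363, 425   # no hypertension
c, d = 58, 154    # hypertension

# Sample odds ratio: cross-product of the 2x2 table
odds_ratio = (a * d) / (b * c)

# Wald interval: log(OR) +/- 1.96 * sqrt(1/a + 1/b + 1/c + 1/d)
log_or = math.log(odds_ratio)
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
ci_low = math.exp(log_or - 1.96 * se_log_or)
ci_high = math.exp(log_or + 1.96 * se_log_or)

print(f"OR = {odds_ratio:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```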

# Module 2 Quiz

Are smoking and use of chewing tobacco associated with elevated odds of hypertension among urban Indian adults? In this assessment, you will walk through the steps needed to answer this question using the BPUrban1000 data set.

In two separate simple logistic regression models, produce summary tables for the association between hypertension and smoking, and between hypertension and use of chewing tobacco. Note that the smoking variable is called smoking and the chewing tobacco variable is called chewing.

In [3]:
df.head()

Unnamed: 0,hypertension,systolic,diastolic,bmi,bmi_group,cholesterol,age,age_group,parent_history_HTN,salt,salt_group,smoking,chewing,activity,weight,waist_circumference,hip_circumference,waist_hip_ratio,fasting_glucose,caste,marital_status,religion,education,gender
0,0,136,59,35.0,BMI>27.49,183,63,>=60,both,1.5,Salt<=5,False,True,sedentary,82.1,61.9,82.3,0.75,75,Schedule tribe,Married,Muslim,>=Graduate,Female
1,0,134,90,18.4,BMI<=27.49,127,30,30-44,either,9.1,5<Salt<=10,False,False,moderate,49.3,99.1,93.7,1.06,130,Other caste,Married,Sikh,>=Graduate,Male
2,1,149,76,28.1,BMI>27.49,132,24,20-29,both,15.2,10<Salt<=20,False,True,severe,78.1,101.1,88.9,1.14,129,General,Married,Hindu,<=Primary,Female
3,0,117,65,29.1,BMI>27.49,126,52,45-59,none,6.0,5<Salt<=10,False,False,moderate,66.6,88.7,100.5,0.88,126,General,Unmarried,Hindu,Middle,Female
4,0,110,74,33.1,BMI>27.49,182,61,>=60,none,13.8,10<Salt<=20,False,False,moderate,73.4,60.4,81.8,0.74,76,Schedule caste,Unmarried,Hindu,<=Primary,Female


In [4]:
df['smoking'] = df['smoking'].astype(int)
df['chewing'] = df['chewing'].astype(int)

In [5]:
df.head()

Unnamed: 0,hypertension,systolic,diastolic,bmi,bmi_group,cholesterol,age,age_group,parent_history_HTN,salt,salt_group,smoking,chewing,activity,weight,waist_circumference,hip_circumference,waist_hip_ratio,fasting_glucose,caste,marital_status,religion,education,gender
0,0,136,59,35.0,BMI>27.49,183,63,>=60,both,1.5,Salt<=5,0,1,sedentary,82.1,61.9,82.3,0.75,75,Schedule tribe,Married,Muslim,>=Graduate,Female
1,0,134,90,18.4,BMI<=27.49,127,30,30-44,either,9.1,5<Salt<=10,0,0,moderate,49.3,99.1,93.7,1.06,130,Other caste,Married,Sikh,>=Graduate,Male
2,1,149,76,28.1,BMI>27.49,132,24,20-29,both,15.2,10<Salt<=20,0,1,severe,78.1,101.1,88.9,1.14,129,General,Married,Hindu,<=Primary,Female
3,0,117,65,29.1,BMI>27.49,126,52,45-59,none,6.0,5<Salt<=10,0,0,moderate,66.6,88.7,100.5,0.88,126,General,Unmarried,Hindu,Middle,Female
4,0,110,74,33.1,BMI>27.49,182,61,>=60,none,13.8,10<Salt<=20,0,0,moderate,73.4,60.4,81.8,0.74,76,Schedule caste,Unmarried,Hindu,<=Primary,Female


In [None]:
X = df["smoking"]
y = df["hypertension"]

In [None]:
# Add a constant to the model (for the intercept)
X = sm.add_constant(X)

In [None]:
# Logistic regression for hypertension and smoking
logit_smoking = sm.Logit(y, X)
result_smoking = logit_smoking.fit()

In [None]:
result_smoking.summary()

In [None]:
X = df["chewing"]
y = df["hypertension"]

In [None]:
# Add a constant to the model (for the intercept)
X = sm.add_constant(X)

In [None]:
# Logistic regression for hypertension and chewing tobacco
logit_chewing = sm.Logit(y, X)
result_chewing = logit_chewing.fit()

In [None]:
result_chewing.summary()

Estimate the association of hypertension with smoking and with use of chewing tobacco in a single model while controlling for age, religion, and gender. Specifically, estimate the log-odds ratio between hypertension and smoking while controlling for these other variables.

In [9]:
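# dtype=int keeps the dummy columns as 0/1 integers (newer pandas versions default to bool)
df = pd.get_dummies(df, columns=['religion', 'gender'], drop_first=True, dtype=int)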

In [10]:
df.head()

Unnamed: 0,hypertension,systolic,diastolic,bmi,bmi_group,cholesterol,age,age_group,parent_history_HTN,salt,salt_group,smoking,chewing,activity,weight,waist_circumference,hip_circumference,waist_hip_ratio,fasting_glucose,caste,marital_status,education,religion_Hindu,religion_Muslim,religion_Other,religion_Sikh,gender_Male
0,0,136,59,35.0,BMI>27.49,183,63,>=60,both,1.5,Salt<=5,0,1,sedentary,82.1,61.9,82.3,0.75,75,Schedule tribe,Married,>=Graduate,0,1,0,0,0
1,0,134,90,18.4,BMI<=27.49,127,30,30-44,either,9.1,5<Salt<=10,0,0,moderate,49.3,99.1,93.7,1.06,130,Other caste,Married,>=Graduate,0,0,0,1,1
2,1,149,76,28.1,BMI>27.49,132,24,20-29,both,15.2,10<Salt<=20,0,1,severe,78.1,101.1,88.9,1.14,129,General,Married,<=Primary,1,0,0,0,0
3,0,117,65,29.1,BMI>27.49,126,52,45-59,none,6.0,5<Salt<=10,0,0,moderate,66.6,88.7,100.5,0.88,126,General,Unmarried,Middle,1,0,0,0,0
4,0,110,74,33.1,BMI>27.49,182,61,>=60,none,13.8,10<Salt<=20,0,0,moderate,73.4,60.4,81.8,0.74,76,Schedule caste,Unmarried,<=Primary,1,0,0,0,0


In [11]:
df.columns

Index(['hypertension', 'systolic', 'diastolic', 'bmi', 'bmi_group', 'cholesterol', 'age', 'age_group', 'parent_history_HTN', 'salt', 'salt_group', 'smoking', 'chewing', 'activity', 'weight', 'waist_circumference', 'hip_circumference', 'waist_hip_ratio', 'fasting_glucose', 'caste', 'marital_status', 'education', 'religion_Hindu', 'religion_Muslim', 'religion_Other', 'religion_Sikh', 'gender_Male'], dtype='object')

In [13]:
X = df[["smoking","chewing","age",'religion_Hindu', 'religion_Muslim', 'religion_Other', 'religion_Sikh', 'gender_Male']]
y = df["hypertension"]

In [14]:
# Add a constant to the model (for the intercept)
X = sm.add_constant(X)

In [15]:
# Logistic regression for hypertension on smoking, chewing, age, religion, and gender
logit_rest = sm.Logit(y, X)
result_rest = logit_rest.fit()

Optimization terminated successfully.
         Current function value: 0.416138
         Iterations 7


In [16]:
result_rest.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | hypertension | No. Observations: | 1000 |
| Model: | Logit | Df Residuals: | 991 |
| Method: | MLE | Df Model: | 8 |
| Date: | Sun, 16 Jun 2024 | Pseudo R-squ.: | 0.1945 |
| Time: | 17:30:10 | Log-Likelihood: | -416.14 |
| converged: | True | LL-Null: | -516.59 |
| Covariance Type: | nonrobust | LLR p-value: | 4.102e-39 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -6.3938 | 0.828 | -7.718 | 0.000 | -8.018 | -4.770 |
| smoking | 0.8009 | 0.287 | 2.795 | 0.005 | 0.239 | 1.363 |
| chewing | 0.6041 | 0.173 | 3.491 | 0.000 | 0.265 | 0.943 |
| age | 0.0924 | 0.008 | 10.899 | 0.000 | 0.076 | 0.109 |
| religion_Hindu | -0.5463 | 0.650 | -0.841 | 0.401 | -1.820 | 0.727 |
| religion_Muslim | -0.8441 | 0.686 | -1.230 | 0.219 | -2.189 | 0.501 |
| religion_Other | -1.0914 | 0.897 | -1.217 | 0.224 | -2.850 | 0.667 |
| religion_Sikh | -0.5174 | 0.735 | -0.704 | 0.481 | -1.957 | 0.923 |
| gender_Male | 0.1297 | 0.174 | 0.745 | 0.456 | -0.211 | 0.471 |


In [17]:
# Calculate the 95% confidence interval for the log-odds ratio between hypertension and smoking while controlling for age, religion, gender, and the use of chewing tobacco.
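
A Wald 95% confidence interval for a logistic regression coefficient is the estimate plus or minus about 1.96 standard errors. The sketch below applies that to the smoking row of the summary table (coef 0.8009, std err 0.287) and also exponentiates to the odds-ratio scale. Because the inputs are the rounded values printed in the table, the result may differ from the reported bounds in the last digit.

```python
import math

# Rounded values copied from the smoking row of the summary table
coef = 0.8009
std_err = 0.287
z_crit = 1.959964  # 97.5th percentile of the standard normal

# Interval on the log-odds scale
lower = coef - z_crit * std_err
upper = coef + z_crit * std_err

# Same interval on the odds-ratio scale
or_lower, or_upper = math.exp(lower), math.exp(upper)

print(f"log-odds ratio CI: ({lower:.3f}, {upper:.3f})")
print(f"odds ratio CI:     ({or_lower:.3f}, {or_upper:.3f})")
```

With the fitted model in memory, `result_rest.conf_int()` returns the same kind of interval for every coefficient without the rounding error.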

# Module 3 Quiz

In the practice quiz for this module, we computed the ROC curve and AUC for the logistic regression model of hypertension on smoking, chewing, age, religion, and gender. Using the p-values in the coefficient table, remove the predictors of hypertension that are not statistically significant at the 0.05 level and recalculate the AUC for the new (reduced) model.
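
The two steps described above (filter predictors by p-value, then score the refitted model by AUC) can be sketched with the standard library. The p-values are copied from the coefficient table of the full model earlier in this notebook. The `auc_from_scores` helper and the four toy scores are hypothetical; they only illustrate what AUC measures, namely the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one.

```python
# P>|z| values copied from the coefficient table of the full model
p_values = {
    "smoking": 0.005, "chewing": 0.000, "age": 0.000,
    "religion_Hindu": 0.401, "religion_Muslim": 0.219,
    "religion_Other": 0.224, "religion_Sikh": 0.481,
    "gender_Male": 0.456,
}

# Keep only the predictors significant at the 0.05 level
kept = [name for name, p in p_values.items() if p < 0.05]
print(kept)


def auc_from_scores(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities (made up for illustration)
toy_auc = auc_from_scores([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)
```

In the notebook itself, the reduced model would be refit with `sm.Logit` on the kept columns, and its predicted probabilities from `.predict()` passed to the `roc_curve` and `auc` functions imported at the top.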

**==========================================================================================================**

In [19]:
df.head()

Unnamed: 0,hypertension,systolic,diastolic,bmi,bmi_group,cholesterol,age,age_group,parent_history_HTN,salt,salt_group,smoking,chewing,activity,weight,waist_circumference,hip_circumference,waist_hip_ratio,fasting_glucose,caste,marital_status,education,religion_Hindu,religion_Muslim,religion_Other,religion_Sikh,gender_Male
0,0,136,59,35.0,BMI>27.49,183,63,>=60,both,1.5,Salt<=5,0,1,sedentary,82.1,61.9,82.3,0.75,75,Schedule tribe,Married,>=Graduate,0,1,0,0,0
1,0,134,90,18.4,BMI<=27.49,127,30,30-44,either,9.1,5<Salt<=10,0,0,moderate,49.3,99.1,93.7,1.06,130,Other caste,Married,>=Graduate,0,0,0,1,1
2,1,149,76,28.1,BMI>27.49,132,24,20-29,both,15.2,10<Salt<=20,0,1,severe,78.1,101.1,88.9,1.14,129,General,Married,<=Primary,1,0,0,0,0
3,0,117,65,29.1,BMI>27.49,126,52,45-59,none,6.0,5<Salt<=10,0,0,moderate,66.6,88.7,100.5,0.88,126,General,Unmarried,Middle,1,0,0,0,0
4,0,110,74,33.1,BMI>27.49,182,61,>=60,none,13.8,10<Salt<=20,0,0,moderate,73.4,60.4,81.8,0.74,76,Schedule caste,Unmarried,<=Primary,1,0,0,0,0


In [None]:
(df["bmi"]).mean()

In [None]:
(df["age"]).mean()

In [None]:
(df["salt"]).mean()

In [None]:
# bmi_cent = bmi - mean(bmi),
#          age_cent = age - mean(age),
#          salt_cent = salt - mean(salt),

In [None]:
df["bmi_cent"] = df["bmi"] - (df["bmi"]).mean()

In [None]:
df["age_cent"] = df["age"] - (df["age"]).mean()

In [None]:
df["salt_cent"] = df["salt"] - (df["salt"]).mean()
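
Centering subtracts each variable's own mean, so every centered column averages to zero and the intercept of a model fitted on the centered columns refers to a person at the sample means of those variables. A minimal sketch on a made-up three-row data frame (`toy` is hypothetical, not BPUrban1000):

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "bmi": [22.0, 27.5, 31.0],
    "age": [30, 45, 60],
    "salt": [4.0, 9.0, 14.0],
})

# Center each variable on its own mean
for col in ["bmi", "age", "salt"]:
    toy[f"{col}_cent"] = toy[col] - toy[col].mean()

# Each centered column should now average to (numerically) zero
print(toy[["bmi_cent", "age_cent", "salt_cent"]].mean())
```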

In [None]:
df.head()

In [None]:
df.smoking.value_counts()

In [None]:
df['smoking'] = df['smoking'].astype(int)

In [None]:
df.smoking.value_counts()

In [None]:
df.chewing.value_counts()

In [None]:
df['chewing'] = df['chewing'].astype(int)

In [None]:
df.chewing.value_counts()

In [None]:
df["parent_history_HTN"].value_counts()

In [None]:
# dtype=int keeps the dummy columns numeric (newer pandas versions default to bool)
df_dummies = pd.get_dummies(df['parent_history_HTN'], prefix='parent_history_HTN', dtype=int)

In [None]:
df_dummies

In [None]:
df = pd.concat([df, df_dummies], axis=1)

In [None]:
df.head()

In [None]:
df.activity.value_counts()

In [None]:
# dtype=int keeps the dummy columns numeric (newer pandas versions default to bool)
df_dummies = pd.get_dummies(df['activity'], prefix='activity', dtype=int)

In [None]:
df_dummies

In [None]:
df = pd.concat([df, df_dummies], axis=1)

In [None]:
df

In [None]:
df.columns

In [None]:
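# The model includes an intercept, so one level of each categorical variable is left out
# as the reference category (the first level, matching drop_first=True used earlier)
X = df[['bmi_cent', 'age_cent', 'salt_cent', 'parent_history_HTN_either', 'parent_history_HTN_none',
        'activity_moderate', 'activity_sedentary', 'activity_severe', 'smoking', 'chewing']]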

In [None]:
y = df[["systolic"]]

In [None]:
X = sm.add_constant(X)
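
When an intercept is added, including a dummy column for every level of a categorical variable makes the design matrix rank deficient, because the dummy columns sum to the constant column. The sketch below checks this with NumPy on made-up data (`toy` and its values are hypothetical); dropping one level per variable restores full column rank.

```python
import numpy as np
import pandas as pd

# Made-up categorical values for illustration only
toy = pd.DataFrame({"parent_history_HTN": ["both", "either", "none", "either", "none", "both"]})

# All levels dummy-coded, plus an explicit intercept column
all_levels = pd.get_dummies(toy["parent_history_HTN"], dtype=float)
all_levels.insert(0, "const", 1.0)

# Same, but with the first level dropped as the reference category
ref_dropped = pd.get_dummies(toy["parent_history_HTN"], drop_first=True, dtype=float)
ref_dropped.insert(0, "const", 1.0)

rank_all = np.linalg.matrix_rank(all_levels.to_numpy())
rank_ref = np.linalg.matrix_rank(ref_dropped.to_numpy())

print(all_levels.shape[1], rank_all)   # number of columns vs. rank, all levels kept
print(ref_dropped.shape[1], rank_ref)  # number of columns vs. rank, reference dropped
```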

In [None]:
model = sm.OLS(y, X).fit()

In [None]:
model.summary()

#### Python code done by Dennis Lam