# nhanes_inferential_2023
NHANES 2021-2023 Inferential Analytics Assignment

In this assignment, you will use NHANES data to perform basic inferential statistics using Python in Google Colab. You will explore relationships and differences in health metrics and demographic variables, utilizing the skills learned in class to answer key questions about the dataset. Your final analysis should be saved as a Google Colab notebook and uploaded to a GitHub repository.


In [7]:

import pandas as pd
from scipy.stats import chi2_contingency
import numpy as np

In [8]:

demo = pd.read_sas('https://wwwn.cdc.gov/Nchs/Nhanes/2021-2022/DEMO_L.XPT', format='xport')
bp = pd.read_sas('https://wwwn.cdc.gov/Nchs/Nhanes/2021-2022/BPXO_L.XPT', format='xport')
body = pd.read_sas('https://wwwn.cdc.gov/Nchs/Nhanes/2021-2022/BMX_L.XPT', format='xport')
chol_total = pd.read_sas('https://wwwn.cdc.gov/Nchs/Nhanes/2021-2022/TCHOL_L.XPT', format='xport')
glycohemo = pd.read_sas('https://wwwn.cdc.gov/Nchs/Nhanes/2021-2022/GHB_L.XPT', format='xport')
crp = pd.read_sas('https://wwwn.cdc.gov/Nchs/Nhanes/2021-2022/HSCRP_L.XPT', format='xport')
dm = pd.read_sas('https://wwwn.cdc.gov/Nchs/Nhanes/2021-2022/DIQ_L.XPT', format='xport')
phy = pd.read_sas('https://wwwn.cdc.gov/Nchs/Nhanes/2021-2022/PAQ_L.XPT', format='xport')
whd = pd.read_sas('https://wwwn.cdc.gov/Nchs/Nhanes/2021-2022/WHQ_L.XPT', format='xport')

In [9]:
nhanes_data = demo.merge(bp, on='SEQN', how='left')\
                  .merge(body, on='SEQN', how='left')\
                  .merge(chol_total, on='SEQN', how='left')\
                  .merge(glycohemo, on='SEQN', how='left')\
                  .merge(crp, on='SEQN', how='left')\
                  .merge(dm, on='SEQN', how='left')\
                  .merge(phy, on='SEQN', how='left')\
                  .merge(whd, on='SEQN', how='left')


print(nhanes_data)

           SEQN  SDDSRVYR  RIDSTATR  RIAGENDR  RIDAGEYR  RIDAGEMN  RIDRETH1  \
0      130378.0      12.0       2.0       1.0      43.0       NaN       5.0   
1      130379.0      12.0       2.0       1.0      66.0       NaN       3.0   
2      130380.0      12.0       2.0       2.0      44.0       NaN       2.0   
3      130381.0      12.0       2.0       2.0       5.0       NaN       5.0   
4      130382.0      12.0       2.0       1.0       2.0       NaN       3.0   
...         ...       ...       ...       ...       ...       ...       ...   
11928  142306.0      12.0       2.0       1.0       9.0       NaN       2.0   
11929  142307.0      12.0       2.0       2.0      49.0       NaN       4.0   
11930  142308.0      12.0       2.0       1.0      50.0       NaN       2.0   
11931  142309.0      12.0       2.0       1.0      40.0       NaN       2.0   
11932  142310.0      12.0       2.0       2.0      80.0       NaN       3.0   

       RIDRETH3  RIDEXMON  RIDEXAGM  ...  PAD790U  
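
Because every merge above is a left join on `SEQN`, the merged frame should keep exactly one row per participant in `demo`. A minimal sketch of how that could be checked, using two small made-up frames (`demo_toy` and `bp_toy` are hypothetical stand-ins, not NHANES data) and the `validate` argument of `DataFrame.merge`:

```python
import pandas as pd

# Hypothetical stand-ins for two NHANES component files.
demo_toy = pd.DataFrame({"SEQN": [1.0, 2.0, 3.0], "RIDAGEYR": [43.0, 66.0, 44.0]})
bp_toy = pd.DataFrame({"SEQN": [1.0, 3.0], "BPXOSY3": [120.0, 135.0]})

# validate="one_to_one" raises pandas.errors.MergeError if SEQN repeats in either frame.
merged_toy = demo_toy.merge(bp_toy, on="SEQN", how="left", validate="one_to_one")

# A left join keeps every row of the left frame; unmatched rows get NaN.
print(len(merged_toy) == len(demo_toy))
print(merged_toy["BPXOSY3"].isna().sum())
```

The same `validate` argument could be added to each `.merge(...)` call in the chain, on the assumption that every component file has one row per `SEQN`.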

In [10]:
# 1 = married/living with partner; everything else, including missing and 77/99, becomes 0.
nhanes_data['DMDMARTZ'] = nhanes_data['DMDMARTZ'].apply(lambda x: 1 if x == 1 else 0)
# 1 = codes 4-5 (code 4 is "some college or AA degree"); everything else, including missing, becomes 0.
nhanes_data['DMDEDUC2'] = nhanes_data['DMDEDUC2'].apply(lambda x: 1 if x in [4, 5] else 0)
# Set the refused (7777) and don't-know (9999) codes to NaN; assigning back avoids the chained-inplace FutureWarning.
nhanes_data['PAD680'] = nhanes_data['PAD680'].replace([7777, 9999], np.nan)
nhanes_data['WHD020'] = nhanes_data['WHD020'].replace([7777, 9999], np.nan)
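
The recoding above sends missing values to 0. If one wanted to keep them missing instead, one option is `Series.map` with an explicit dictionary, since values absent from the dictionary become NaN. A sketch on a made-up series (`marital_toy` is hypothetical, and the code meanings are an assumption that should be checked against the NHANES codebook):

```python
import numpy as np
import pandas as pd

marital_toy = pd.Series([1.0, 2.0, 3.0, 77.0, np.nan])
# Assumed coding: 1 -> married (1); 2 and 3 -> not married (0).
# Codes absent from the dictionary (77, 99) and NaN map to NaN.
married_flag = marital_toy.map({1.0: 1, 2.0: 0, 3.0: 0})
print(married_flag.tolist())
```

Downstream calls such as `pd.crosstab` drop rows with NaN in either variable, so participants with no recorded answer would no longer be counted as "not married".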


   - **Question 1**: "Is there an association between marital status (married or not married) and education level (bachelor’s degree or higher vs. less than a bachelor’s degree)?"  
     - Variables: `DMDMARTZ` (marital status) and `DMDEDUC2` (education level). Recode as specified.

In [11]:

from scipy.stats import chi2_contingency
contingency_table = pd.crosstab(nhanes_data['DMDMARTZ'], nhanes_data['DMDEDUC2'])
chi2, p, dof, ex = chi2_contingency(contingency_table)
print(f"Chi2 Statistic: {chi2}, p-value: {p}")


Chi2 Statistic: 1677.006341355503, p-value: 0.0


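ANS: Yes. The chi-square test (χ² ≈ 1677.01, p < 0.001; the printed p-value of 0.0 is too small to display) indicates a statistically significant association between marital status and education level.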
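
With a sample this large, a tiny p-value says little about how strong the association is. One common companion measure is Cramér's V, computed from the chi-square statistic. A sketch on a made-up 2x2 table (the counts are invented for illustration and are not NHANES results):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Invented counts: rows = not married / married, columns = education 0 / 1.
table = pd.DataFrame([[300, 100], [200, 200]])

# For 2x2 tables chi2_contingency applies Yates' continuity correction by default.
chi2, p, dof, expected = chi2_contingency(table)
n = table.to_numpy().sum()
# Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1))); ranges from 0 to 1.
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(f"chi2={chi2:.2f}, p={p:.3g}, dof={dof}, V={cramers_v:.3f}")
```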

   - **Question 2**: "Is there a difference in the mean sedentary behavior time between those who are married and those who are not married?"  
     - Variables: `DMDMARTZ` (marital status, recoded) and `PAD680` (sedentary behavior time, cleaned).

In [12]:

from scipy.stats import ttest_ind
married = nhanes_data[nhanes_data['DMDMARTZ'] == 1]['PAD680'].dropna()
not_married = nhanes_data[nhanes_data['DMDMARTZ'] == 0]['PAD680'].dropna()
t_stat, p_val = ttest_ind(married, not_married)
print(f"T-Statistic: {t_stat}, p-value: {p_val}")

T-Statistic: -3.8740971309213887, p-value: 0.0001078679819334709


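ANS: Yes. The t-test (t ≈ -3.87, p ≈ 0.0001) shows a statistically significant difference in mean sedentary time between the two groups. The negative t-statistic indicates that the married group has the lower mean.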
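
`ttest_ind` assumes equal variances in the two groups by default. If that assumption is in doubt, Welch's variant is available through `equal_var=False`. A sketch on simulated data (the group means and spreads are made up):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Simulated sedentary minutes for two groups with different spreads.
group_a = rng.normal(loc=350, scale=150, size=400)
group_b = rng.normal(loc=380, scale=220, size=300)

student = ttest_ind(group_a, group_b)                  # pooled-variance t-test
welch = ttest_ind(group_a, group_b, equal_var=False)   # Welch's t-test
print(f"Student: t={student.statistic:.3f}, p={student.pvalue:.4f}")
print(f"Welch:   t={welch.statistic:.3f}, p={welch.pvalue:.4f}")
```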

   - **Question 3**: "How do age and marital status affect systolic blood pressure?"  
     - Variables: `RIDAGEYR` (age), `DMDMARTZ` (marital status, recoded), and `BPXOSY3` (systolic blood pressure).

In [13]:
import statsmodels.api as sm
from statsmodels.formula.api import ols
model = ols('BPXOSY3 ~ RIDAGEYR + C(DMDMARTZ)', data=nhanes_data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

                   sum_sq      df            F    PR(>F)
C(DMDMARTZ)  1.737748e+03     1.0     6.860916  0.008828
RIDAGEYR     5.889160e+05     1.0  2325.138035  0.000000
Residual     1.893791e+06  7477.0          NaN       NaN

ANS: Both predictors are statistically significant. Age accounts for far more of the variation in systolic blood pressure (F ≈ 2325.14, p < 0.001), while marital status has a much smaller but still significant effect after adjusting for age (F ≈ 6.86, p ≈ 0.009).


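The ANOVA table reports whether each term matters but not the direction or size of its effect; the fitted coefficients do. A sketch on simulated data (the column names `sbp`, `age`, `married` and all numbers are hypothetical):

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(1)
n = 500
toy = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "married": rng.integers(0, 2, size=n),
})
# Simulated outcome: rises 0.5 mmHg per year of age, plus noise.
toy["sbp"] = 100 + 0.5 * toy["age"] + rng.normal(0, 10, size=n)

fit = ols("sbp ~ age + C(married)", data=toy).fit()
# params holds the intercept, the slope for age, and the married-vs-not offset.
print(fit.params)
print(fit.conf_int())
```

On the real data, the analogous call would be `model.params` on the fitted `model` object from the cell above.
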

   - **Question 4**: "Is there a correlation between self-reported weight and minutes of sedentary behavior?"  
     - Variables: `WHD020` (self-reported weight, cleaned) and `PAD680` (sedentary behavior time, cleaned).

In [14]:
subset_data = nhanes_data[['WHD020', 'PAD680']].dropna()
from scipy.stats import pearsonr
correlation, p_value = pearsonr(subset_data['WHD020'], subset_data['PAD680'])
print(f"Correlation: {correlation}, p-value: {p_value}")

Correlation: 0.1559714584645021, p-value: 1.6988498386828133e-44

ANS: Yes. There is a weak positive correlation between self-reported weight and sedentary minutes (r ≈ 0.16, p < 0.001). It is statistically significant, but the two variables share only about 2% of their variance (r² ≈ 0.024).

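A correlation coefficient can be squared to give the share of variance the two variables have in common, which helps separate statistical significance from practical strength. A sketch on simulated data (all values are made up):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
x = rng.normal(size=2000)
# y shares only a small part of its variance with x.
y = 0.15 * x + rng.normal(size=2000)

r, p = pearsonr(x, y)
print(f"r={r:.3f}, r^2={r**2:.3f}, p={p:.3g}")
```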