# Exploring Physiological Predictors of Obesity Using NHANES and Relational Data Models

## Introduction

The National Health and Nutrition Examination Survey (NHANES) provides comprehensive data on the health and nutritional status of the U.S. population. Given the ongoing obesity epidemic in the United States, analyzing this dataset offers a valuable opportunity to understand factors contributing to obesity.

While lifestyle choices play a significant role, physiological factors are also important contributors. This project aims to explore NHANES data to identify potential correlations and predictors related to obesity. Ultimately, these insights could help inform strategies to prevent the development of obesity in at-risk populations.

## Objective

This project aims to:

* Clean and preprocess the NHANES datasets from 2017–2020, focusing on demographics and biometrics.
* Create a relational database structure that organizes the cleaned data for easy querying and analysis.
* Explore potential correlations between physiological and lifestyle factors with obesity-related measures.
* Develop preliminary predictive models to identify key predictors of obesity using the cleaned data.
* Document the data cleaning process thoroughly to ensure reproducibility and transparency.

## Import Libraries & Setup

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import os
!pip install pyreadstat
import pyreadstat #since the data files are .xpt files, this library is needed to import the table



## Data Cleaning

Prior to initiating the data cleaning process, the required Python libraries are imported. Relevant NHANES datasets will be loaded sequentially at each step to maintain clarity and focus. This approach facilitates exploration of the data structure, variable types, and initial summaries, enabling a comprehensive understanding of the data's scope and quality.

### Demographics Data Cleaning

In [2]:
def standardize_id_column(df, original_id='SEQN', new_id='participant_id'):
    """
    Renames the identifier column in a DataFrame from original_id to new_id.
    If the original_id is not present, returns the DataFrame unchanged.

    Parameters:
    - df: pandas DataFrame
    - original_id: name of the identifier column to replace (default 'SEQN')
    - new_id: standardized name to use (default 'participant_id')

    Returns:
    - DataFrame with standardized ID column
    """
    if original_id in df.columns:
        df = df.rename(columns={original_id: new_id})
    return df

In [3]:
file_path = '2017-2020/2017-2020.P_DEMO.xpt'

df_demo, meta = pyreadstat.read_xport(file_path)
df_demo = standardize_id_column(df_demo)

In [4]:
df_demo.head(10)

Unnamed: 0,participant_id,SDDSRVYR,RIDSTATR,RIAGENDR,RIDAGEYR,RIDAGEMN,RIDRETH1,RIDRETH3,RIDEXMON,DMDBORN4,...,FIAINTRP,MIALANG,MIAPROXY,MIAINTRP,AIALANGA,WTINTPRP,WTMECPRP,SDMVPSU,SDMVSTRA,INDFMPIR
0,109263.0,66.0,2.0,1.0,2.0,,5.0,6.0,2.0,1.0,...,2.0,,,,,7891.762435,8951.815567,3.0,156.0,4.66
1,109264.0,66.0,2.0,2.0,13.0,,1.0,1.0,2.0,1.0,...,2.0,1.0,2.0,2.0,1.0,11689.747264,12271.157043,1.0,155.0,0.83
2,109265.0,66.0,2.0,1.0,2.0,,3.0,3.0,2.0,1.0,...,2.0,,,,,16273.825939,16658.764203,1.0,157.0,3.06
3,109266.0,66.0,2.0,2.0,29.0,,5.0,6.0,2.0,2.0,...,2.0,1.0,2.0,2.0,1.0,7825.646112,8154.968193,2.0,168.0,5.0
4,109267.0,66.0,1.0,2.0,21.0,,2.0,2.0,,2.0,...,2.0,,,,,26379.991724,0.0,1.0,156.0,5.0
5,109268.0,66.0,1.0,2.0,18.0,,3.0,3.0,,1.0,...,2.0,,,,,19639.221008,0.0,1.0,155.0,1.66
6,109269.0,66.0,2.0,1.0,2.0,,2.0,2.0,1.0,1.0,...,2.0,,,,,5906.250521,6848.271782,2.0,152.0,0.96
7,109270.0,66.0,2.0,2.0,11.0,,4.0,4.0,1.0,1.0,...,2.0,1.0,2.0,2.0,,4613.057696,4886.930378,1.0,150.0,1.88
8,109271.0,66.0,2.0,1.0,49.0,,3.0,3.0,2.0,1.0,...,2.0,1.0,2.0,2.0,1.0,8481.589837,8658.732873,1.0,167.0,
9,109272.0,66.0,2.0,1.0,0.0,3.0,1.0,1.0,2.0,1.0,...,2.0,,,,,7037.380216,7872.776233,1.0,155.0,0.73


Many of the original column names in the dataset are not immediately intuitive. To improve readability and facilitate analysis, we will first list all column names and then rename them with more straightforward, descriptive labels. The original column descriptions can be referenced here: NHANES Demographics Codebook.

Additionally, based on the project’s scope, several columns that are not relevant to our analysis will be dropped to streamline the dataset.

In [5]:
df_demo.columns

Index(['participant_id', 'SDDSRVYR', 'RIDSTATR', 'RIAGENDR', 'RIDAGEYR',
       'RIDAGEMN', 'RIDRETH1', 'RIDRETH3', 'RIDEXMON', 'DMDBORN4', 'DMDYRUSZ',
       'DMDEDUC2', 'DMDMARTZ', 'RIDEXPRG', 'SIALANG', 'SIAPROXY', 'SIAINTRP',
       'FIALANG', 'FIAPROXY', 'FIAINTRP', 'MIALANG', 'MIAPROXY', 'MIAINTRP',
       'AIALANGA', 'WTINTPRP', 'WTMECPRP', 'SDMVPSU', 'SDMVSTRA', 'INDFMPIR'],
      dtype='object')

In [6]:
df_demo.drop(['SDDSRVYR','RIDAGEMN','RIDRETH1','RIDEXMON','DMDBORN4','DMDYRUSZ','RIDEXPRG','SIALANG', 
         'SIAPROXY', 'SIAINTRP','FIALANG','FIAPROXY', 'FIAINTRP', 'MIALANG', 'MIAPROXY', 
         'MIAINTRP', 'AIALANGA','WTINTPRP', 'WTMECPRP', 'SDMVPSU', 'SDMVSTRA'],
        axis=1,inplace=True, errors='ignore')

In [7]:
df_demo.columns

Index(['participant_id', 'RIDSTATR', 'RIAGENDR', 'RIDAGEYR', 'RIDRETH3',
       'DMDEDUC2', 'DMDMARTZ', 'INDFMPIR'],
      dtype='object')

In [8]:
df_demo = df_demo.rename(columns={
    'RIAGENDR': 'gender',
    'RIDAGEYR' : 'age_year',
    'RIDRETH3' : 'race',
    'DMDEDUC2' : 'education_level', 
    'DMDMARTZ' : 'marital_status',
    'INDFMPIR' : 'family_income_poverty',
    'RIDSTATR' : 'interview_exam_status'
})

In [9]:
df_demo.head(10)

Unnamed: 0,participant_id,interview_exam_status,gender,age_year,race,education_level,marital_status,family_income_poverty
0,109263.0,2.0,1.0,2.0,6.0,,,4.66
1,109264.0,2.0,2.0,13.0,1.0,,,0.83
2,109265.0,2.0,1.0,2.0,3.0,,,3.06
3,109266.0,2.0,2.0,29.0,6.0,5.0,3.0,5.0
4,109267.0,1.0,2.0,21.0,2.0,4.0,3.0,5.0
5,109268.0,1.0,2.0,18.0,3.0,,,1.66
6,109269.0,2.0,1.0,2.0,2.0,,,0.96
7,109270.0,2.0,2.0,11.0,4.0,,,1.88
8,109271.0,2.0,1.0,49.0,3.0,2.0,3.0,
9,109272.0,2.0,1.0,0.0,1.0,,,0.73


In [10]:
df_demo.dtypes

participant_id           float64
interview_exam_status    float64
gender                   float64
age_year                 float64
race                     float64
education_level          float64
marital_status           float64
family_income_poverty    float64
dtype: object

In [11]:
df_demo.shape

(15560, 8)

In [12]:
df_demo.isnull().sum()

participant_id              0
interview_exam_status       0
gender                      0
age_year                    0
race                        0
education_level          6328
marital_status           6328
family_income_poverty    2201
dtype: int64

In [13]:
df_demo.describe()

Unnamed: 0,participant_id,interview_exam_status,gender,age_year,race,education_level,marital_status,family_income_poverty
count,15560.0,15560.0,15560.0,15560.0,15560.0,9232.0,9232.0,13359.0
mean,117042.5,1.919023,1.503792,33.742481,3.486118,3.551993,1.708622,2.405937
std,4491.92943,0.272808,0.500002,25.320532,1.622734,1.214109,2.755878,1.634346
min,109263.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0
25%,113152.75,2.0,1.0,10.0,3.0,3.0,1.0,1.02
50%,117042.5,2.0,2.0,30.0,3.0,4.0,1.0,1.96
75%,120932.25,2.0,2.0,56.0,4.0,4.0,2.0,3.88
max,124822.0,2.0,2.0,80.0,7.0,9.0,99.0,5.0


Based on the NHANES data documentation, the variable RIDSTATR (renamed as interview_exam_status) indicates whether a participant completed only the interview (1) or both the interview and physical examination (2). Since this project focuses on physical and medical examination data, participants with a value of 1 will be excluded. After filtering out these participants, the interview_exam_status column will be dropped, as it will no longer provide useful information.

In [14]:
df_demo = df_demo[df_demo['interview_exam_status'] != 1]

In [15]:
df_demo.head(10)

Unnamed: 0,participant_id,interview_exam_status,gender,age_year,race,education_level,marital_status,family_income_poverty
0,109263.0,2.0,1.0,2.0,6.0,,,4.66
1,109264.0,2.0,2.0,13.0,1.0,,,0.83
2,109265.0,2.0,1.0,2.0,3.0,,,3.06
3,109266.0,2.0,2.0,29.0,6.0,5.0,3.0,5.0
6,109269.0,2.0,1.0,2.0,2.0,,,0.96
7,109270.0,2.0,2.0,11.0,4.0,,,1.88
8,109271.0,2.0,1.0,49.0,3.0,2.0,3.0,
9,109272.0,2.0,1.0,0.0,1.0,,,0.73
10,109273.0,2.0,1.0,36.0,3.0,4.0,3.0,0.83
11,109274.0,2.0,1.0,68.0,7.0,4.0,3.0,1.2


In [16]:
df_demo.drop(['interview_exam_status'], axis = 1, inplace = True)

In [17]:
df_demo.shape

(14300, 7)

After the initial cleaning process, the resulting dataset contains approximately 143,000 rows and 7 columns.

To maintain consistency with the original NHANES coding and to support traceability, the decision was made to retain the raw categorical values for columns containing encoded data (such as gender, race/ethnicity, and education level). These encoded values follow NHANES's standard coding conventions and will be referenced using corresponding lookup tables or detailed clarifications provided in the README.

In [18]:
df_demo.head(10)

Unnamed: 0,participant_id,gender,age_year,race,education_level,marital_status,family_income_poverty
0,109263.0,1.0,2.0,6.0,,,4.66
1,109264.0,2.0,13.0,1.0,,,0.83
2,109265.0,1.0,2.0,3.0,,,3.06
3,109266.0,2.0,29.0,6.0,5.0,3.0,5.0
6,109269.0,1.0,2.0,2.0,,,0.96
7,109270.0,2.0,11.0,4.0,,,1.88
8,109271.0,1.0,49.0,3.0,2.0,3.0,
9,109272.0,1.0,0.0,1.0,,,0.73
10,109273.0,1.0,36.0,3.0,4.0,3.0,0.83
11,109274.0,1.0,68.0,7.0,4.0,3.0,1.2


At this stage, missing values (NaN) in other columns are retained to maximize the number of participants included in the dataset. This approach preserves potential correlations with other lab and physical examination variables. The cleaned dataframe has been saved as a CSV file for subsequent import into a relational database.

In [19]:
df_demo.to_csv('cleaned_demographics.csv', index=False)

The same cleaning and preprocessing steps will be applied to the other tables and datasets to ensure consistency and maintain data quality for the forthcoming analysis.

### Body Measures Data Cleaning

In [20]:
file_path = '2017-2020/2017-2020.P_BMX.xpt'

df_bmx, meta = pyreadstat.read_xport(file_path)
df_bmx = standardize_id_column(df_bmx)

In [21]:
df_bmx.head(10)

Unnamed: 0,participant_id,BMDSTATS,BMXWT,BMIWT,BMXRECUM,BMIRECUM,BMXHEAD,BMIHEAD,BMXHT,BMIHT,...,BMXLEG,BMILEG,BMXARML,BMIARML,BMXARMC,BMIARMC,BMXWAIST,BMIWAIST,BMXHIP,BMIHIP
0,109263.0,4.0,,,,,,,,,...,,,,,,,,,,
1,109264.0,1.0,42.2,,,,,,154.7,,...,36.3,,33.8,,22.7,,63.8,,85.0,
2,109265.0,1.0,12.0,,91.6,,,,89.3,,...,,,18.6,,14.8,,41.2,,,
3,109266.0,1.0,97.1,,,,,,160.2,,...,40.8,,34.7,,35.8,,117.9,,126.1,
4,109269.0,3.0,13.6,,90.9,,,,,1.0,...,,,,1.0,,1.0,,1.0,,
5,109270.0,1.0,75.3,,,,,,156.0,,...,42.6,,36.1,,31.0,,91.4,,,
6,109271.0,1.0,98.8,,,,,,182.3,,...,40.1,,42.0,,38.2,,120.4,,108.2,
7,109272.0,1.0,7.1,,63.6,,41.3,,,,...,,,13.0,,15.5,,,,,
8,109273.0,1.0,74.3,,,,,,184.2,,...,41.0,,41.1,,30.2,,86.8,,94.5,
9,109274.0,1.0,103.7,,,,,,185.3,,...,44.0,,47.0,,32.0,,109.6,,107.8,


In [22]:
df_bmx.shape

(14300, 22)

In [23]:
df_bmx.isnull().sum()

participant_id        0
BMDSTATS              0
BMXWT               225
BMIWT             13712
BMXRECUM          12830
BMIRECUM          14257
BMXHEAD           13990
BMIHEAD           14300
BMXHT              1143
BMIHT             14129
BMXBMI             1163
BMDBMIC            9551
BMXLEG             3316
BMILEG            13812
BMXARML             810
BMIARML           13813
BMXARMC             816
BMIARMC           13807
BMXWAIST           1726
BMIWAIST          13683
BMXHIP             4438
BMIHIP            13924
dtype: int64

Looking at the document associated with this dataset, each of the column and their contents were assessed. There are participants without body measurement data, which will be excluded. 

In [24]:
df_bmx = df_bmx[df_bmx['BMDSTATS'] != 4]

In [25]:
df_bmx.shape

(14107, 22)

The head circumference is measured for children 0 years - 6 months old. Since this value is obtained only for a small portion of the participant population, the decision was made to drop that column. 

There are some columns with comments to explain some of the missing data and the decision was made to keep those. The rest of the columns will be renamed to be more intuitive.

In [26]:
df_bmx.drop(['BMDSTATS','BMXHEAD','BMIHEAD'], axis = 1, inplace = True)

In [27]:
df_bmx.columns

Index(['participant_id', 'BMXWT', 'BMIWT', 'BMXRECUM', 'BMIRECUM', 'BMXHT',
       'BMIHT', 'BMXBMI', 'BMDBMIC', 'BMXLEG', 'BMILEG', 'BMXARML', 'BMIARML',
       'BMXARMC', 'BMIARMC', 'BMXWAIST', 'BMIWAIST', 'BMXHIP', 'BMIHIP'],
      dtype='object')

In [28]:
df_bmx = df_bmx.rename(columns={
    'BMXWT':'weight_kg',
    'BMIWT': 'weight_comment',
    'BMXRECUM': 'recumbent_length_cm',
    'BMIRECUM': 'recumbent_length_comment',
    'BMXHT': 'standing_height', 
    'BMIHT': 'standing_height_comment',
    'BMXBMI': 'bmi',
    'BMDBMIC': 'bmi_category_child',
    'BMXLEG': 'upper_leg_cm',
    'BMILEG': 'upper_leg_comment',
    'BMXARML': 'upper_arm_cm',
    'BMIARML': 'upper_arm_comment',
    'BMXARMC': 'arm_circumference_cm',
    'BMIARMC': 'arm_circ_comment',
    'BMXWAIST': 'waist_circ_cm',
    'BMIWAIST': 'waist_circ_comment', 
    'BMXHIP': 'hip_circ_cm', 
    'BMIHIP': 'hip_circ_comment'
})

In [29]:
df_bmx.head(10)

Unnamed: 0,participant_id,weight_kg,weight_comment,recumbent_length_cm,recumbent_length_comment,standing_height,standing_height_comment,bmi,bmi_category_child,upper_leg_cm,upper_leg_comment,upper_arm_cm,upper_arm_comment,arm_circumference_cm,arm_circ_comment,waist_circ_cm,waist_circ_comment,hip_circ_cm,hip_circ_comment
1,109264.0,42.2,,,,154.7,,17.6,2.0,36.3,,33.8,,22.7,,63.8,,85.0,
2,109265.0,12.0,,91.6,,89.3,,15.0,2.0,,,18.6,,14.8,,41.2,,,
3,109266.0,97.1,,,,160.2,,37.8,,40.8,,34.7,,35.8,,117.9,,126.1,
4,109269.0,13.6,,90.9,,,1.0,,,,,,1.0,,1.0,,1.0,,
5,109270.0,75.3,,,,156.0,,30.9,4.0,42.6,,36.1,,31.0,,91.4,,,
6,109271.0,98.8,,,,182.3,,29.7,,40.1,,42.0,,38.2,,120.4,,108.2,
7,109272.0,7.1,,63.6,,,,,,,,13.0,,15.5,,,,,
8,109273.0,74.3,,,,184.2,,21.9,,41.0,,41.1,,30.2,,86.8,,94.5,
9,109274.0,103.7,,,,185.3,,30.2,,44.0,,47.0,,32.0,,109.6,,107.8,
10,109275.0,20.9,,,,120.4,,14.4,2.0,,,,1.0,,1.0,,1.0,,


With the comment sections, most of the sections are marked "Could not obtain". The comment sections that only have such notation was removed, as the lack of value is self evident with missing values (NaN).

In [30]:
df_bmx.drop(['recumbent_length_comment','upper_leg_comment','upper_arm_comment',
         'arm_circ_comment','waist_circ_comment','hip_circ_comment'], axis = 1, inplace = True)

In [31]:
df_bmx.head(10)

Unnamed: 0,participant_id,weight_kg,weight_comment,recumbent_length_cm,standing_height,standing_height_comment,bmi,bmi_category_child,upper_leg_cm,upper_arm_cm,arm_circumference_cm,waist_circ_cm,hip_circ_cm
1,109264.0,42.2,,,154.7,,17.6,2.0,36.3,33.8,22.7,63.8,85.0
2,109265.0,12.0,,91.6,89.3,,15.0,2.0,,18.6,14.8,41.2,
3,109266.0,97.1,,,160.2,,37.8,,40.8,34.7,35.8,117.9,126.1
4,109269.0,13.6,,90.9,,1.0,,,,,,,
5,109270.0,75.3,,,156.0,,30.9,4.0,42.6,36.1,31.0,91.4,
6,109271.0,98.8,,,182.3,,29.7,,40.1,42.0,38.2,120.4,108.2
7,109272.0,7.1,,63.6,,,,,,13.0,15.5,,
8,109273.0,74.3,,,184.2,,21.9,,41.0,41.1,30.2,86.8,94.5
9,109274.0,103.7,,,185.3,,30.2,,44.0,47.0,32.0,109.6,107.8
10,109275.0,20.9,,,120.4,,14.4,2.0,,,,,


In [32]:
df_bmx.shape

(14107, 13)

The comment sections that are kept have values more than just "Not Obtained". These will be interpreted with appropriate comment and translated into string values

In [33]:
df_bmx.to_csv('cleaned_bodymeasures.csv', index=False)

### Lab Results Data Cleaning

Laboratory analysis is a vital component of the NHANES 2017–2020 dataset, offering detailed biomarker data that can be leveraged for predictive modeling. The most common sources of lab data are blood and urine samples. To facilitate data cleaning and relational database design, all lab results were categorized into two main groups: blood-based tests and urine-based tests.

This separation enhances the clarity and usability of the database structure, especially in the context of developing an obesity prediction model that emphasizes blood biomarkers. While blood tests serve as the primary focus due to their strong association with metabolic health, urine tests are also retained for potential secondary insights.

During data cleaning, each lab dataset from the 2017–2020 cycle was examined for completeness, consistent formatting, and variable alignment. Efforts were made to standardize variable names and units across different files to ensure compatibility. Only participants with valid lab results and complete demographic data were retained for analysis.

#### Urine Labs

##### Albumin and Creatinine

Albumin is a protein often found in urine if there is damage to the kidneys. Creatinine is waste product that is excreted in urine and helps evaluate the kidney function.

In [34]:
file_path = '2017-2020/urine/1.P_ALB_CR.xpt'

df, meta = pyreadstat.read_xport(file_path)
df = standardize_id_column(df)

In [35]:
df.shape

(13027, 8)

In [36]:
df.head(10)

Unnamed: 0,participant_id,URXUMA,URXUMS,URDUMALC,URXUCR,URXCRS,URDUCRLC,URDACT
0,109264.0,,,,,,,
1,109266.0,5.5,5.5,0.0,36.0,3182.4,0.0,15.28
2,109270.0,4.0,4.0,0.0,165.0,14586.0,0.0,2.42
3,109271.0,2.4,2.4,0.0,32.0,2828.8,0.0,7.5
4,109273.0,4.9,4.9,0.0,121.0,10696.4,0.0,4.05
5,109274.0,12.8,12.8,0.0,120.0,10608.0,0.0,10.67
6,109275.0,3.7,3.7,0.0,20.0,1768.0,0.0,18.5
7,109277.0,14.7,14.7,0.0,244.0,21569.6,0.0,6.02
8,109278.0,8.4,8.4,0.0,124.0,10961.6,0.0,6.77
9,109279.0,13.9,13.9,0.0,251.0,22188.4,0.0,5.54


In [37]:
df.columns

Index(['participant_id', 'URXUMA', 'URXUMS', 'URDUMALC', 'URXUCR', 'URXCRS',
       'URDUCRLC', 'URDACT'],
      dtype='object')

In [38]:
df = df.rename(columns={
    'URXUMA': 'albumin_urine_ug_mL',
    'URXUMS': 'albumin_urine_mg_L', 
    'URDUMALC': 'alb_comment',
    'URXUCR': 'creatinine_urine_mg_dL', 
    'URXCRS': 'creatinine_urine_umol_L', 
    'URDUCRLC': 'creatinine_comment',
    'URDACT': 'alb_creat_ratio'
})

In [39]:
df.isnull().sum()

participant_id               0
albumin_urine_ug_mL        517
albumin_urine_mg_L         517
alb_comment                517
creatinine_urine_mg_dL     518
creatinine_urine_umol_L    518
creatinine_comment         518
alb_creat_ratio            518
dtype: int64

Some columns contain a similar number of missing (NaN) values, raising the question of whether these missing values occur for the same participants. To assess potential overlap, the following code was executed to identify if the missing values correspond to the same participant IDs. This check helps ensure that removing rows with missing data will not disproportionately reduce the dataset.

In [40]:
# Assuming your dataframe is called df, and participant ID column is 'SEQN'
def get_common_nan_ids(df, col1, col2, id_col='participant_id', verbose=True):
    """
    Returns a set of participant IDs where BOTH col1 and col2 are NaN.
    
    Parameters:
    - df: pandas DataFrame
    - col1, col2: column names to check for NaNs
    - id_col: column name for participant IDs (default 'participant_id')
    
    Returns:
    - Set of participant IDs with NaNs in both columns
    """
    ids_nan_col1 = set(df.loc[df[col1].isna(), id_col])
    ids_nan_col2 = set(df.loc[df[col2].isna(), id_col])
    common_nan_ids = ids_nan_col1.intersection(ids_nan_col2)
    
    if verbose:
        print(f"Number of NaNs in {col1}: {len(ids_nan_col1)}")
        print(f"Number of NaNs in {col2}: {len(ids_nan_col2)}")
        print(f"Number of IDs with NaNs in both columns: {len(common_nan_ids)}")
    
    return common_nan_ids

In [41]:
def drop_rows_with_common_nan_ids(df, col1, col2, id_col='participant_id'):
    """
    Drops rows where BOTH col1 and col2 are NaN.
    Uses get_common_nan_ids() to identify rows.
    
    Parameters:
    - df: pandas DataFrame
    - col1, col2: column names to check for NaNs
    - id_col: column name for participant IDs (default 'participant_id')
    
    Returns:
    - cleaned DataFrame (copy)
    """
    common_nan_ids = get_common_nan_ids(df, col1, col2, id_col, verbose=False)
    rows_dropped = df[id_col].isin(common_nan_ids).sum()
    
    print(f"Rows dropped where both {col1} and {col2} were NaN: {rows_dropped}")
    
    return df[~(df[col1].isna() & df[col2].isna())].copy()

In [42]:
common_nan = get_common_nan_ids(df,'albumin_urine_ug_mL','creatinine_urine_mg_dL')

Number of NaNs in albumin_urine_ug_mL: 517
Number of NaNs in creatinine_urine_mg_dL: 518
Number of IDs with NaNs in both columns: 517


In [43]:
df_cleaned = drop_rows_with_common_nan_ids(df, 'albumin_urine_ug_mL', 'creatinine_urine_mg_dL')

Rows dropped where both albumin_urine_ug_mL and creatinine_urine_mg_dL were NaN: 517


##### Arsenic

Arsenic is a naturally occurring mineral present in water, air, and soil, existing in both organic and inorganic forms. While the inorganic form is more toxic, exposure to any form of arsenic can be harmful to human health. Elevated arsenic levels have been linked to adverse effects on multiple body systems, including the cardiovascular and endocrine systems [3].

Urinary arsenic levels serve as a biomarker reflecting the concentration of arsenic in the bloodstream. Given arsenic’s potential impact on the endocrine system—a key regulator of metabolism and body weight—these lab values are relevant for investigating associations with obesity in this analysis.

##### Total Arsenic

In [44]:
file_path = '2017-2020/urine/2.P_UTAS.xpt'

df1, meta = pyreadstat.read_xport(file_path)
df1 = standardize_id_column(df1)

In [45]:
df1.head(10)

Unnamed: 0,participant_id,WTSAPRP,URXUAS,URDUASLC
0,109266.0,28660.015986,2.05,0.0
1,109270.0,17900.682903,3.25,0.0
2,109273.0,80106.859617,5.16,0.0
3,109274.0,24512.27628,2.92,0.0
4,109287.0,25828.523003,5.8,0.0
5,109288.0,8535.018174,4.24,0.0
6,109290.0,12410.268374,144.72,0.0
7,109295.0,28235.246814,3.48,0.0
8,109300.0,66737.887353,5.08,0.0
9,109309.0,33019.729726,0.95,0.0


In [46]:
df1.columns

Index(['participant_id', 'WTSAPRP', 'URXUAS', 'URDUASLC'], dtype='object')

In [47]:
df1 = df1.rename(columns={
    'URXUAS':'total_arsenic_ug_L',
    'URDUASLC': 'total_arsenic_comment'
})

In [48]:
df1.drop('WTSAPRP', axis=1,inplace = True)

In [49]:
df1.isna().sum()

participant_id             0
total_arsenic_ug_L       320
total_arsenic_comment    320
dtype: int64

In [50]:
df1.dropna(axis=1, inplace=True)

##### Speciated Arsenic

In [51]:
file_path = '2017-2020/urine/3.P_UAS.xpt'

df2, meta = pyreadstat.read_xport(file_path)
df2 = standardize_id_column(df2)

In [52]:
df2.shape

(4890, 14)

In [53]:
df2.columns

Index(['participant_id', 'WTSAPRP', 'URXUAS3', 'URDUA3LC', 'URXUAS5',
       'URDUA5LC', 'URXUAB', 'URDUABLC', 'URXUAC', 'URDUACLC', 'URXUDMA',
       'URDUDALC', 'URXUMMA', 'URDUMMAL'],
      dtype='object')

In [54]:
df2 = df2.drop('WTSAPRP', axis = 1)

In [55]:
df2 = df2.rename(columns={
    'WTSAPRP': 'subsample_weight',
    'URXUAS3':'arsenous_acid_ug_L',
    'URDUA3LC':'arsenous_acid_comment', 
    'URXUAS5': 'arsenic_acid_ug_L',
    'URDUA5LC': 'arsenic_acid_comment',
    'URXUAB': 'arsenobetaine_ug_L',
    'URDUABLC':'arsenobetaine_comment',
    'URXUAC': 'arsenocholoine_ug_L',
    'URDUACLC': 'arsenocholine_comment',
    'URXUDMA': 'dimethylarsinic_acid_ug_L', 
    'URDUDALC': 'dimethylarsinic_comment',
    'URXUMMA': 'monomethylarsonic_acid_ug_L',
    'URDUMMAL': 'monometylarsonic_comment'
})

In [56]:
df2.head(10)

Unnamed: 0,participant_id,arsenous_acid_ug_L,arsenous_acid_comment,arsenic_acid_ug_L,arsenic_acid_comment,arsenobetaine_ug_L,arsenobetaine_comment,arsenocholoine_ug_L,arsenocholine_comment,dimethylarsinic_acid_ug_L,dimethylarsinic_comment,monomethylarsonic_acid_ug_L,monometylarsonic_comment
0,109266.0,0.08,1.0,0.56,1.0,0.82,1.0,0.08,1.0,1.35,1.0,0.14,1.0
1,109270.0,0.51,0.0,0.56,1.0,0.82,1.0,0.08,1.0,2.07,0.0,0.14,1.0
2,109273.0,0.08,1.0,0.56,1.0,2.74,0.0,0.08,1.0,1.35,1.0,0.14,1.0
3,109274.0,0.08,1.0,0.56,1.0,0.82,1.0,0.08,1.0,1.35,1.0,0.14,1.0
4,109287.0,0.08,1.0,0.56,1.0,0.82,1.0,0.08,1.0,3.6,0.0,0.14,1.0
5,109288.0,0.08,1.0,0.56,1.0,0.82,1.0,0.08,1.0,4.22,0.0,0.14,1.0
6,109290.0,0.44,0.0,0.56,1.0,147.82,0.0,0.76,0.0,6.63,0.0,0.7,0.0
7,109295.0,0.08,1.0,0.56,1.0,1.28,0.0,0.08,1.0,1.35,1.0,0.14,1.0
8,109300.0,0.08,1.0,0.56,1.0,3.03,0.0,0.08,1.0,1.35,1.0,0.14,1.0
9,109309.0,0.08,1.0,0.56,1.0,0.82,1.0,0.08,1.0,1.35,1.0,0.14,1.0


In [57]:
df2.isna().sum()

participant_id                   0
arsenous_acid_ug_L             265
arsenous_acid_comment          265
arsenic_acid_ug_L              265
arsenic_acid_comment           265
arsenobetaine_ug_L             265
arsenobetaine_comment          265
arsenocholoine_ug_L            265
arsenocholine_comment          265
dimethylarsinic_acid_ug_L      265
dimethylarsinic_comment        265
monomethylarsonic_acid_ug_L    265
monometylarsonic_comment       265
dtype: int64

In [58]:
value_cols = [col for col in df2.columns if col != 'participant_id']
rows_all_nan = df2[value_cols].isna().all(axis=1)
print(f"Number of rows missing all arsenic values: {rows_all_nan.sum()}")

Number of rows missing all arsenic values: 265


In [59]:
df2_cleaned = df2[~rows_all_nan].copy()

print(f"Number of rows dropped: {rows_all_nan.sum()}")

Number of rows dropped: 265


##### Chromium

Chromium is a trace mineral found in two main forms: trivalent chromium (Cr³⁺), which is naturally present in food and supplements, and hexavalent chromium (Cr⁶⁺), a toxic by-product of industrial processes such as metal manufacturing 
[4]. Trivalent chromium has been suggested to play a role in carbohydrate, lipid, and protein metabolism by enhancing insulin action; however, no definitive physiological function has been firmly established. In contrast, hexavalent chromium is recognized as potentially carcinogenic, especially when inhaled or ingested in high amounts. Due to the uncertainty surrounding its physiological role, a standardized reference range for chromium has not been established.

For the purposes of this project, which focuses on assessing correlations with obesity, trivalent chromium is of particular interest due to its possible metabolic effects. However, a limitation of the NHANES dataset is that urinary chromium measurements are not specifically categorized by valence state. While this makes it difficult to distinguish between beneficial and harmful forms, urine chromium concentration can still serve as a proxy for overall chromium exposure. Monitoring these levels may offer insight into whether exposure levels are within a potentially physiological or harmful range.

In [60]:
file_path = '2017-2020/urine/4.P_UCM.xpt'

df3, meta = pyreadstat.read_xport(file_path)
df3 = standardize_id_column(df3)

In [61]:
df3.columns

Index(['participant_id', 'WTSAPRP', 'URXUCM', 'URDUCMLC'], dtype='object')

In [62]:
df3 = df3.drop('WTSAPRP',axis=1)

In [63]:
df3.head(10)

Unnamed: 0,participant_id,URXUCM,URDUCMLC
0,109266.0,0.13,1.0
1,109270.0,0.13,1.0
2,109273.0,0.19,0.0
3,109274.0,0.4,0.0
4,109287.0,0.13,1.0
5,109288.0,0.29,0.0
6,109290.0,0.27,0.0
7,109295.0,0.26,0.0
8,109300.0,0.13,1.0
9,109309.0,0.13,1.0


In [64]:
df3.isna().sum()

participant_id      0
URXUCM            321
URDUCMLC          321
dtype: int64

Since there is only one lab value that is being measured in this dataset, any NaN values will be dropped.

In [65]:
df3 = df3.rename(columns={
   'URXUCM': 'chromium_ug_L',
    'URDUCMLC': 'chromium_comment'
})

In [66]:
df3.dropna(axis=1,inplace=True)

##### Flame Retardant

Flame retardants (FRs) are chemicals applied to materials such as furniture, electronics, electrical devices, and construction products to reduce flammability and slow the spread of fire. There are several major classes of flame retardants, including brominated flame retardants (BFRs), hexabromocyclododecane (HBCD), organophosphate flame retardants (OPFRs), tetrabromobisphenol A (TBBPA), and polybrominated diphenyl ethers (PBDEs). These compounds are highly persistent in the environment and can accumulate in human tissue due to their resistance to degradation [5].

Among them, PBDEs have been extensively studied for their potential adverse health effects, including endocrine and thyroid disruption, immunotoxicity, reproductive toxicity, carcinogenicity, and negative impacts on fetal and child development. Although young children are particularly vulnerable to these effects, adults are also susceptible to long-term exposure. Given their endocrine-disrupting properties, flame retardants may play a role in the development or progression of obesity.

According to the documentation from NHANES for this lab, the unit of measurement is all ng/mL.

In [67]:
file_path = '2017-2020/urine/5.P_FR.xpt'

df4, meta = pyreadstat.read_xport(file_path)
df4 = standardize_id_column(df4)

In [68]:
df4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4929 entries, 0 to 4928
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   participant_id  4929 non-null   float64
 1   WTSBPRP         4929 non-null   float64
 2   URXBCPP         4617 non-null   float64
 3   URDBCPLC        4617 non-null   float64
 4   URXBCEP         4618 non-null   float64
 5   URDCEPLC        4618 non-null   float64
 6   URXBDCP         4599 non-null   float64
 7   URDBDCLC        4599 non-null   float64
 8   URXDBUP         4614 non-null   float64
 9   URDDUPLC        4614 non-null   float64
 10  URXDPHP         4622 non-null   float64
 11  URDDPHLC        4622 non-null   float64
 12  URXTBBA         4622 non-null   float64
 13  URDBBALC        4622 non-null   float64
dtypes: float64(14)
memory usage: 539.2 KB


In [69]:
df4.columns

Index(['participant_id', 'WTSBPRP', 'URXBCPP', 'URDBCPLC', 'URXBCEP',
       'URDCEPLC', 'URXBDCP', 'URDBDCLC', 'URXDBUP', 'URDDUPLC', 'URXDPHP',
       'URDDPHLC', 'URXTBBA', 'URDBBALC'],
      dtype='object')

In [70]:
df4.head(10)

Unnamed: 0,participant_id,WTSBPRP,URXBCPP,URDBCPLC,URXBCEP,URDCEPLC,URXBDCP,URDBDCLC,URXDBUP,URDDUPLC,URXDPHP,URDDPHLC,URXTBBA,URDBBALC
0,109271.0,20156.439742,0.0707,1.0,0.0707,1.0,0.137,0.0,0.0707,1.0,0.886,0.0,0.0354,1.0
1,109277.0,51738.369518,0.709,0.0,0.0707,1.0,1.15,0.0,0.138,0.0,1.3,0.0,0.0354,1.0
2,109282.0,97190.5545,,,4.23,0.0,3.66,0.0,0.501,0.0,2.05,0.0,0.0354,1.0
3,109285.0,85548.221421,0.0707,1.0,4.42,0.0,2.27,0.0,0.0707,1.0,10.3,0.0,0.062,0.0
4,109288.0,9103.868995,0.347,0.0,0.145,0.0,0.859,0.0,0.0707,1.0,1.29,0.0,0.0354,1.0
5,109301.0,14215.93737,0.116,0.0,0.0707,1.0,5.18,0.0,0.45,0.0,0.701,0.0,0.0354,1.0
6,109302.0,5984.497323,0.0707,1.0,0.0707,1.0,4.81,0.0,0.171,0.0,0.732,0.0,0.0354,1.0
7,109303.0,16549.099643,0.0707,1.0,0.366,0.0,0.0707,1.0,0.0707,1.0,0.162,0.0,0.0354,1.0
8,109304.0,40089.354988,0.174,0.0,0.386,0.0,2.68,0.0,0.14,0.0,1.28,0.0,0.0354,1.0
9,109307.0,49745.101247,0.0707,1.0,0.0707,1.0,0.903,0.0,0.0707,1.0,1.55,0.0,0.121,0.0


In [71]:
df4 = df4.drop('WTSBPRP', axis = 1)

In [72]:
df4 = df4.rename(columns={
    'URXBCPP': '1_chloro_2_propyl_phosphate', 
    'URDBCPLC' : '1ch_2pro_comment', 
    'URXBCEP': 'bis_1_chloroethyl_phosphate',
    'URDCEPLC' : 'bis_1_chlo_phos_comment',
    'URXBDCP': '1_3_dichloro_2_propyl_phosphate', 
    'URDBDCLC': '1_3_di_2_pro_comment', 
    'URXDBUP': 'dibutyl_phosphate',
    'URDDUPLC': 'dibutyl_phos_comment', 
    'URXDPHP': 'diphenyl_phosphate',
    'URDDPHLC': 'diphe_phos_comment',
    'URXTBBA': '2_3_4_5_tetrabromobenzoic_acid',
    'URDBBALC': '2_3_4_5_tet_comment'
})

In [73]:
df4.isnull().sum()

participant_id                       0
1_chloro_2_propyl_phosphate        312
1ch_2pro_comment                   312
bis_1_chloroethyl_phosphate        311
bis_1_chlo_phos_comment            311
1_3_dichloro_2_propyl_phosphate    330
1_3_di_2_pro_comment               330
dibutyl_phosphate                  315
dibutyl_phos_comment               315
diphenyl_phosphate                 307
diphe_phos_comment                 307
2_3_4_5_tetrabromobenzoic_acid     307
2_3_4_5_tet_comment                307
dtype: int64

There are multiple columns that have over 300 NaN values. There may be rows that are missing all of these values but there may be rows that are missing some of the values. The decision was made to drop rows that are missing all values to ensure retention of meaningful data.

In [74]:
df4.columns #to make it easy to copy and paste the names of these columns without risking any typos

Index(['participant_id', '1_chloro_2_propyl_phosphate', '1ch_2pro_comment',
       'bis_1_chloroethyl_phosphate', 'bis_1_chlo_phos_comment',
       '1_3_dichloro_2_propyl_phosphate', '1_3_di_2_pro_comment',
       'dibutyl_phosphate', 'dibutyl_phos_comment', 'diphenyl_phosphate',
       'diphe_phos_comment', '2_3_4_5_tetrabromobenzoic_acid',
       '2_3_4_5_tet_comment'],
      dtype='object')

In [75]:
value_cols = [col for col in df4.columns if col != 'participant_id']
rows_all_nan = df4[value_cols].isna().all(axis=1)
print(f"Number of rows missing all FR values: {rows_all_nan.sum()}")

Number of rows missing all FR values: 307


In [76]:
# Drop rows where all value columns are NaN (excluding participant_id)
df4_cleaned = df4[~rows_all_nan].copy()

print(f"Number of rows dropped: {rows_all_nan.sum()}")

Number of rows dropped: 307


In [77]:
df4_cleaned.head(10)

Unnamed: 0,participant_id,1_chloro_2_propyl_phosphate,1ch_2pro_comment,bis_1_chloroethyl_phosphate,bis_1_chlo_phos_comment,1_3_dichloro_2_propyl_phosphate,1_3_di_2_pro_comment,dibutyl_phosphate,dibutyl_phos_comment,diphenyl_phosphate,diphe_phos_comment,2_3_4_5_tetrabromobenzoic_acid,2_3_4_5_tet_comment
0,109271.0,0.0707,1.0,0.0707,1.0,0.137,0.0,0.0707,1.0,0.886,0.0,0.0354,1.0
1,109277.0,0.709,0.0,0.0707,1.0,1.15,0.0,0.138,0.0,1.3,0.0,0.0354,1.0
2,109282.0,,,4.23,0.0,3.66,0.0,0.501,0.0,2.05,0.0,0.0354,1.0
3,109285.0,0.0707,1.0,4.42,0.0,2.27,0.0,0.0707,1.0,10.3,0.0,0.062,0.0
4,109288.0,0.347,0.0,0.145,0.0,0.859,0.0,0.0707,1.0,1.29,0.0,0.0354,1.0
5,109301.0,0.116,0.0,0.0707,1.0,5.18,0.0,0.45,0.0,0.701,0.0,0.0354,1.0
6,109302.0,0.0707,1.0,0.0707,1.0,4.81,0.0,0.171,0.0,0.732,0.0,0.0354,1.0
7,109303.0,0.0707,1.0,0.366,0.0,0.0707,1.0,0.0707,1.0,0.162,0.0,0.0354,1.0
8,109304.0,0.174,0.0,0.386,0.0,2.68,0.0,0.14,0.0,1.28,0.0,0.0354,1.0
9,109307.0,0.0707,1.0,0.0707,1.0,0.903,0.0,0.0707,1.0,1.55,0.0,0.121,0.0


In [78]:
file_path = '2017-2020/urine/6.P_SSFR.xpt'

df5, meta = pyreadstat.read_xport(file_path)
df5 = standardize_id_column(df5)

In [79]:
df5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4929 entries, 0 to 4928
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   participant_id  4929 non-null   float64
 1   WTSSBPP         4929 non-null   float64
 2   SSIPPP          3913 non-null   float64
 3   SSIPPPL         3913 non-null   float64
 4   SSBPPP          3923 non-null   float64
 5   SSBPPPL         3923 non-null   float64
dtypes: float64(6)
memory usage: 231.2 KB


In [80]:
df5.isnull().sum()

participant_id       0
WTSSBPP              0
SSIPPP            1016
SSIPPPL           1016
SSBPPP            1006
SSBPPPL           1006
dtype: int64

In [81]:
df5 = df5.drop('WTSSBPP', axis=1)

In [82]:
df5 = df5.rename(columns={
    'SSIPPP' : '2_isopropylphenyl_phenyl_phosphate',
    'SSIPPPL' : '2_isopropylphenyl_phenyl_phosphate_comment',
    'SSBPPP' : '4_tert_butylphenyl_phenyl_phosphate',
    'SSBPPPL' : '4_tert_butylphenyl_phenyl_phosphate_comment'
})


In [83]:
common_nan = get_common_nan_ids(df5, '2_isopropylphenyl_phenyl_phosphate', '4_tert_butylphenyl_phenyl_phosphate', id_col='participant_id')

Number of NaNs in 2_isopropylphenyl_phenyl_phosphate: 1016
Number of NaNs in 4_tert_butylphenyl_phenyl_phosphate: 1006
Number of IDs with NaNs in both columns: 1006


In [84]:
df5_cleaned = drop_rows_with_common_nan_ids(df5, '2_isopropylphenyl_phenyl_phosphate', '4_tert_butylphenyl_phenyl_phosphate', id_col='participant_id')

Rows dropped where both 2_isopropylphenyl_phenyl_phosphate and 4_tert_butylphenyl_phenyl_phosphate were NaN: 1006


##### Iodine

Iodine is a trace element commonly found in foods and iodized salt, playing a critical role in thyroid function. The thyroid gland is essential for regulating metabolism and is particularly important for fetal and infant development [6].

Approximately 90% of ingested iodine is excreted in the urine. While urinary iodine concentration is not considered a reliable indicator of iodine status at the individual level, it can be used for assessing iodine sufficiency across populations [7].

In [85]:
file_path = '2017-2020/urine/7.P_UIO.xpt'

df6, meta = pyreadstat.read_xport(file_path)
df6 = standardize_id_column(df6)

In [86]:
df6.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4890 entries, 0 to 4889
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   participant_id  4890 non-null   float64
 1   WTSAPRP         4890 non-null   float64
 2   URXUIO          4600 non-null   float64
 3   URDUIOLC        4600 non-null   float64
dtypes: float64(4)
memory usage: 152.9 KB


In [87]:
df6 = df6.drop('WTSAPRP', axis = 1)

In [88]:
df6 = df6.rename(columns={
    'URXUIO' : 'urine_iodine',
    'URDUIOLC' : 'urine_iodine_comment'
})

In [89]:
df6.isna().sum()

participant_id            0
urine_iodine            290
urine_iodine_comment    290
dtype: int64

In [90]:
df6_clean = df6.dropna(subset = ['urine_iodine'])

##### Mercury

Mercury is a heavy metal historically used in devices such as barometers and thermometers. At elevated levels, it is known to cause neurotoxicity, with particularly severe effects on fetal development. Environmental exposure—especially in occupational settings involving manufacturing or chemical production—is a common source of mercury-related toxicity.

Urinary mercury concentration is a standard method for assessing inorganic mercury exposure. Clinical symptoms may begin to appear at concentrations around 100 µg/L, and levels exceeding 800 µg/L can be fatal [9].

In [91]:
file_path = '2017-2020/urine/8.P_UHG.xpt'

df7, meta = pyreadstat.read_xport(file_path)
df7 = standardize_id_column(df7)

In [92]:
df7.head(10)

Unnamed: 0,participant_id,WTSAPRP,URXUHG,URDUHGLC
0,109266.0,28660.015986,0.09,1.0
1,109270.0,17900.682903,0.09,1.0
2,109273.0,80106.859617,0.09,1.0
3,109274.0,24512.27628,0.09,1.0
4,109287.0,25828.523003,0.09,1.0
5,109288.0,8535.018174,0.09,1.0
6,109290.0,12410.268374,1.27,0.0
7,109295.0,28235.246814,0.09,1.0
8,109300.0,66737.887353,0.09,1.0
9,109309.0,33019.729726,0.09,1.0


In [93]:
df7.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4890 entries, 0 to 4889
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   participant_id  4890 non-null   float64
 1   WTSAPRP         4890 non-null   float64
 2   URXUHG          4600 non-null   float64
 3   URDUHGLC        4600 non-null   float64
dtypes: float64(4)
memory usage: 152.9 KB


In [94]:
df7 = df7.drop('WTSAPRP', axis =1)

In [95]:
df7 = df7.rename(columns={
    'URXUHG' : 'urine_mercury',
    'URDUHGLC' : 'urine_mercury_comment'
})

In [96]:
df7_clean = df7.dropna(subset=['urine_mercury'])

In [97]:
df7_clean.head(5)

Unnamed: 0,participant_id,urine_mercury,urine_mercury_comment
0,109266.0,0.09,1.0
1,109270.0,0.09,1.0
2,109273.0,0.09,1.0
3,109274.0,0.09,1.0
4,109287.0,0.09,1.0


##### Metals

There are other types of metals that can be found in urine. From this dataset from NHANES, the metals tested are: barium, cadmium, cobalt, cesium, molybdenum, manganese, lead, antiomny, tin, thallium, and tungsten. 

Barium, if consumed in high concentrations, it can cause cardiac arrhythmias or paralysis [8]

Cadmium is a known carcinogen which is excreted primarily by the renal system. Cadminum can lead to various health conditions throughout the body including kidney and liver dysfunctions. [10]




In [98]:
file_path = '2017-2020/urine/9.P_UM.xpt'

df8, meta = pyreadstat.read_xport(file_path)
df8 = standardize_id_column(df8)

In [99]:
df8.columns.to_list()

['participant_id',
 'WTSAPRP',
 'URXUBA',
 'URDUBALC',
 'URXUCD',
 'URDUCDLC',
 'URXUCO',
 'URDUCOLC',
 'URXUCS',
 'URDUCSLC',
 'URXUMO',
 'URDUMOLC',
 'URXUMN',
 'URDUMNLC',
 'URXUPB',
 'URDUPBLC',
 'URXUSB',
 'URDUSBLC',
 'URXUSN',
 'URDUSNLC',
 'URXUTL',
 'URDUTLLC',
 'URXUTU',
 'URDUTULC']

In [100]:
df8 = df8.rename(columns = {
    'URXUBA': 'urine_barium',
    'URDUBALC': 'barium_comment',
    'URXUCD': 'urine_cadmium',
    'URDUCDLC': 'cadmium_comment',
    'URXUCO': 'urine_cobalt',
    'URDUCOLC': 'cobalt_comment',
     'URXUCS': 'urine_cesium',
     'URDUCSLC': 'cesium_comment',
     'URXUMO': 'urine_molybdenum',
     'URDUMOLC': 'molybdenum_comment',
     'URXUMN': 'urine_manganese',
     'URDUMNLC':'manganese_comment',
     'URXUPB':'urine_lead',
     'URDUPBLC':'lead_comment',
     'URXUSB':'urine_antimony',
     'URDUSBLC':'antimony_comment',
     'URXUSN':'urine_tin',
     'URDUSNLC':'tin_comment',
     'URXUTL':'urine_thallium',
     'URDUTLLC':'thallium_comment',
     'URXUTU':'urine_tungsten',
     'URDUTULC':'tungsten_comment'
})

In [101]:
df8 = df8.drop('WTSAPRP', axis = 1)

In [102]:
df8.head(10)

Unnamed: 0,participant_id,urine_barium,barium_comment,urine_cadmium,cadmium_comment,urine_cobalt,cobalt_comment,urine_cesium,cesium_comment,urine_molybdenum,...,urine_lead,lead_comment,urine_antimony,antimony_comment,urine_tin,tin_comment,urine_thallium,thallium_comment,urine_tungsten,tungsten_comment
0,109266.0,0.359,0.0,0.039,1.0,0.214,0.0,2.16,0.0,8.0,...,0.17,0.0,0.016,1.0,0.14,1.0,0.064,0.0,0.013,1.0
1,109270.0,2.422,0.0,0.868,0.0,0.449,0.0,9.64,0.0,78.66,...,0.532,0.0,0.046,0.0,2.35,0.0,0.354,0.0,0.271,0.0
2,109273.0,0.37,0.0,0.213,0.0,0.274,0.0,2.33,0.0,33.09,...,0.28,0.0,0.082,0.0,0.14,1.0,0.078,0.0,0.036,0.0
3,109274.0,1.72,0.0,0.184,0.0,0.482,0.0,2.803,0.0,74.82,...,0.3,0.0,0.053,0.0,4.09,0.0,0.159,0.0,0.099,0.0
4,109287.0,7.531,0.0,0.215,0.0,0.303,0.0,2.85,0.0,142.88,...,0.244,0.0,0.092,0.0,3.91,0.0,0.161,0.0,0.231,0.0
5,109288.0,0.271,0.0,0.039,1.0,0.097,0.0,5.98,0.0,36.21,...,,,0.067,0.0,0.97,0.0,0.175,0.0,0.036,0.0
6,109290.0,0.72,0.0,0.631,0.0,0.379,0.0,11.91,0.0,111.07,...,0.641,0.0,0.116,0.0,1.31,0.0,0.577,0.0,0.057,0.0
7,109295.0,1.27,0.0,0.039,1.0,0.323,0.0,4.145,0.0,25.76,...,0.06,0.0,0.022,0.0,0.14,1.0,0.147,0.0,0.02,0.0
8,109300.0,0.69,0.0,0.166,0.0,0.171,0.0,2.129,0.0,15.42,...,0.13,0.0,0.016,1.0,0.14,1.0,0.063,0.0,0.013,1.0
9,109309.0,0.61,0.0,0.039,1.0,0.263,0.0,1.673,0.0,5.14,...,,,0.028,0.0,0.3,0.0,0.099,0.0,0.03,0.0


In [103]:
df8.isnull().sum()

participant_id          0
urine_barium          295
barium_comment        295
urine_cadmium         295
cadmium_comment       295
urine_cobalt          296
cobalt_comment        296
urine_cesium          295
cesium_comment        295
urine_molybdenum      295
molybdenum_comment    295
urine_manganese       295
manganese_comment     295
urine_lead            953
lead_comment          953
urine_antimony        295
antimony_comment      295
urine_tin             295
tin_comment           295
urine_thallium        295
thallium_comment      295
urine_tungsten        295
tungsten_comment      295
dtype: int64

As done prior, the plan is to delete rows that are missing most, if not all, of these values. 

In [104]:
df8.columns.to_list()

['participant_id',
 'urine_barium',
 'barium_comment',
 'urine_cadmium',
 'cadmium_comment',
 'urine_cobalt',
 'cobalt_comment',
 'urine_cesium',
 'cesium_comment',
 'urine_molybdenum',
 'molybdenum_comment',
 'urine_manganese',
 'manganese_comment',
 'urine_lead',
 'lead_comment',
 'urine_antimony',
 'antimony_comment',
 'urine_tin',
 'tin_comment',
 'urine_thallium',
 'thallium_comment',
 'urine_tungsten',
 'tungsten_comment']

In [105]:
value_cols = [col for col in df8.columns if col != 'participant_id']
rows_all_nan = df8[value_cols].isna().all(axis=1)
print(f"Number of rows missing all metal values: {rows_all_nan.sum()}")

Number of rows missing all metal values: 295


In [106]:
# Drop rows where all value columns are NaN (excluding participant_id)
df8_cleaned = df8[~rows_all_nan].copy()

print(f"Number of rows dropped: {rows_all_nan.sum()}")

Number of rows dropped: 295


##### Nickel

Nickel is another heavy metal that can potentially cause health risks if exposed at high levels. The most common reaction to nickel is contact dermatitis; however, inhalation and ingestion is also a possibility. When assessing urine samples, greater than 10 mg/dL in nickel concentration may indicate excessive exposure and calls for thorough evaluation [11].

In [107]:
file_path = '2017-2020/urine/10.P_UNI.xpt'

df9, meta = pyreadstat.read_xport(file_path)
df9 = standardize_id_column(df9)

In [108]:
df9.columns.to_list()

['participant_id', 'WTSAPRP', 'URXUNI', 'URDUNILC']

In [109]:
df9 = df9.drop('WTSAPRP', axis=1)

In [110]:
df9 = df9.rename(columns={
    'URXUNI': 'urine_nickel',
    'URDUNILC': 'urine_nickel_comment'
})

In [111]:
df9 = df9.dropna(subset=['urine_nickel'])

In [112]:
df9.head(10)

Unnamed: 0,participant_id,urine_nickel,urine_nickel_comment
0,109266.0,0.46,0.0
1,109270.0,1.08,0.0
2,109273.0,0.91,0.0
3,109274.0,1.17,0.0
4,109287.0,4.01,0.0
5,109288.0,1.08,0.0
6,109290.0,2.72,0.0
7,109295.0,0.22,1.0
8,109300.0,0.55,0.0
9,109309.0,0.92,0.0


##### Organophosphate Insecticides

The lower limit of detection for the insecticide is 0.1 ng/mL.

In [113]:
file_path = '2017-2020/urine/11.P_OPD.xpt'

df10, meta = pyreadstat.read_xport(file_path)
df10 = standardize_id_column(df10)

In [114]:
df10.head(10)

Unnamed: 0,participant_id,WTSBPRP,URXOP1,URDOP1LC,URXOP2,URDOP2LC,URXOP3,URDOP3LC,URXOP4,URDOP4LC,URXOP5,URDOP5LC,URXOP6,URDOP6LC
0,109271.0,20156.439742,0.682,0.0,0.288,0.0,0.179,0.0,0.0707,1.0,0.0707,1.0,0.0707,1.0
1,109277.0,51738.369518,4.45,0.0,15.4,0.0,2.59,0.0,0.281,0.0,0.377,0.0,0.0707,1.0
2,109282.0,97190.5545,1.58,0.0,21.0,0.0,1.31,0.0,1.08,0.0,0.26,0.0,0.0707,1.0
3,109285.0,85548.221421,1.26,0.0,1.3,0.0,0.49,0.0,0.0707,1.0,0.0707,1.0,0.0707,1.0
4,109288.0,9103.868995,0.196,0.0,1.12,0.0,0.0707,1.0,0.0707,1.0,0.0707,1.0,0.0707,1.0
5,109301.0,14215.93737,1.58,0.0,1.43,0.0,0.519,0.0,0.0707,1.0,0.15,0.0,0.0707,1.0
6,109302.0,5984.497323,3.29,0.0,35.8,0.0,1.18,0.0,1.13,0.0,0.456,0.0,0.0707,1.0
7,109303.0,16549.099643,0.426,0.0,,,0.202,0.0,0.0707,1.0,0.0707,1.0,0.0707,1.0
8,109304.0,40089.354988,1.52,0.0,3.53,0.0,1.5,0.0,0.0707,1.0,0.104,0.0,0.0707,1.0
9,109307.0,49745.101247,4.05,0.0,0.825,0.0,,,0.127,0.0,0.0707,1.0,0.0707,1.0


In [115]:
df10.shape

(4929, 14)

In [116]:
df10.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4929 entries, 0 to 4928
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   participant_id  4929 non-null   float64
 1   WTSBPRP         4929 non-null   float64
 2   URXOP1          4618 non-null   float64
 3   URDOP1LC        4618 non-null   float64
 4   URXOP2          4607 non-null   float64
 5   URDOP2LC        4607 non-null   float64
 6   URXOP3          4604 non-null   float64
 7   URDOP3LC        4604 non-null   float64
 8   URXOP4          4611 non-null   float64
 9   URDOP4LC        4611 non-null   float64
 10  URXOP5          4621 non-null   float64
 11  URDOP5LC        4621 non-null   float64
 12  URXOP6          4620 non-null   float64
 13  URDOP6LC        4620 non-null   float64
dtypes: float64(14)
memory usage: 539.2 KB


In [117]:
df10.isnull().sum()

participant_id      0
WTSBPRP             0
URXOP1            311
URDOP1LC          311
URXOP2            322
URDOP2LC          322
URXOP3            325
URDOP3LC          325
URXOP4            318
URDOP4LC          318
URXOP5            308
URDOP5LC          308
URXOP6            309
URDOP6LC          309
dtype: int64

In [118]:
df10.columns.to_list()

['participant_id',
 'WTSBPRP',
 'URXOP1',
 'URDOP1LC',
 'URXOP2',
 'URDOP2LC',
 'URXOP3',
 'URDOP3LC',
 'URXOP4',
 'URDOP4LC',
 'URXOP5',
 'URDOP5LC',
 'URXOP6',
 'URDOP6LC']

In [119]:
df10 = df10.drop('WTSBPRP', axis = 1)

In [120]:
df10 = df10.rename(columns={
 'URXOP1': 'dimethylphosphate_ng_mL',
 'URDOP1LC' : 'dimethylphosphate_comment',
 'URXOP2' : 'diethylphosphate_ng_mL',
 'URDOP2LC' : 'diethylphosphate_comment',
 'URXOP3' : 'dimethylthiophosphate_ng_mL',
 'URDOP3LC' :'dimethylthiophosphate_comment',
 'URXOP4': 'diethylthiophosphate_ng_mL',
 'URDOP4LC' :'diethylthiophosphate_comment',
 'URXOP5':'dimethyldithiophosphate_ng_mL',
 'URDOP5LC' :'dimethyldithiophosphate_comment' ,
 'URXOP6':'diethyldithiophosphate_ng_mL',
 'URDOP6LC':'diethyldithiophosphate_comment'
})

In [121]:
df10.columns.to_list()

['participant_id',
 'dimethylphosphate_ng_mL',
 'dimethylphosphate_comment',
 'diethylphosphate_ng_mL',
 'diethylphosphate_comment',
 'dimethylthiophosphate_ng_mL',
 'dimethylthiophosphate_comment',
 'diethylthiophosphate_ng_mL',
 'diethylthiophosphate_comment',
 'dimethyldithiophosphate_ng_mL',
 'dimethyldithiophosphate_comment',
 'diethyldithiophosphate_ng_mL',
 'diethyldithiophosphate_comment']

In [122]:
value_cols = [col for col in df10.columns if col != 'participant_id']
rows_all_nan = df10[value_cols].isna().all(axis=1)
print(f"Number of rows missing all OPD values: {rows_all_nan.sum()}")

Number of rows missing all OPD values: 307


In [123]:
# Drop rows where all value columns are NaN (excluding participant_id)
df10_cleaned = df10[~rows_all_nan].copy()

print(f"Number of rows dropped: {rows_all_nan.sum()}")

Number of rows dropped: 307


##### Perchlorate, Nitrate & Thiocyanate



In [124]:
file_path = '2017-2020/urine/12.P_PERNT.xpt'

df11, meta = pyreadstat.read_xport(file_path)
df11 = standardize_id_column(df11)

In [125]:
df11.head(10)

Unnamed: 0,participant_id,WTSAPRP,URXUP8,URDUP8LC,URXNO3,URDNO3LC,URXSCN,URDSCNLC
0,109266.0,28660.015986,0.57,0.0,36900.0,0.0,223.0,0.0
1,109270.0,17900.682903,4.02,0.0,48200.0,0.0,2960.0,0.0
2,109273.0,80106.859617,2.17,0.0,49900.0,0.0,4740.0,0.0
3,109274.0,24512.27628,6.95,0.0,2410.0,0.0,5290.0,0.0
4,109287.0,25828.523003,7.29,0.0,78700.0,0.0,603.0,0.0
5,109288.0,8535.018174,2.14,0.0,50600.0,0.0,370.0,0.0
6,109290.0,12410.268374,2.97,0.0,70600.0,0.0,1470.0,0.0
7,109295.0,28235.246814,2.14,0.0,24700.0,0.0,1010.0,0.0
8,109300.0,66737.887353,0.876,0.0,10800.0,0.0,270.0,0.0
9,109309.0,33019.729726,1.51,0.0,16800.0,0.0,171.0,0.0


In [126]:
df11.columns.to_list()

['participant_id',
 'WTSAPRP',
 'URXUP8',
 'URDUP8LC',
 'URXNO3',
 'URDNO3LC',
 'URXSCN',
 'URDSCNLC']

In [127]:
df11 = df11.drop('WTSAPRP', axis =1)

In [128]:
df11 = df11.rename(columns = {
 'URXUP8': 'perchlorate_urine_ng_mL',
 'URDUP8LC': 'perchlorate_comment',
 'URXNO3':'nitrate_urine_ng_mL',
 'URDNO3LC':'nitrate_comment',
 'URXSCN':'thiocyanate_urine_ng_mL',
 'URDSCNLC': 'thiocyanate_comment'
})

In [129]:
df11.columns.to_list()

['participant_id',
 'perchlorate_urine_ng_mL',
 'perchlorate_comment',
 'nitrate_urine_ng_mL',
 'nitrate_comment',
 'thiocyanate_urine_ng_mL',
 'thiocyanate_comment']

In [130]:
value_cols = [col for col in df11.columns if col != 'participant_id']
rows_all_nan = df11[value_cols].isna().all(axis=1)
print(f"Number of rows missing all PERNT values: {rows_all_nan.sum()}")

Number of rows missing all PERNT values: 391


In [131]:
# Drop rows where all value columns are NaN (excluding participant_id)
df11_cleaned = df11[~rows_all_nan].copy()

print(f"Number of rows dropped: {rows_all_nan.sum()}")

Number of rows dropped: 391


##### Urine Pregnancy Test

Point of care urine pregnancy test was performed on women 20-44 years of age. 

In [132]:
file_path = '2017-2020/urine/13.P_UCPREG.xpt'

df12, meta = pyreadstat.read_xport(file_path)
df12 = standardize_id_column(df12)

In [133]:
df12.head()

Unnamed: 0,participant_id,URXPREG
0,109266.0,2.0
1,109284.0,2.0
2,109286.0,1.0
3,109291.0,2.0
4,109297.0,2.0


In [134]:
df12 = df12.rename(columns = {
    'URXPREG' : 'pregnancy_test_result'
})

In [135]:
df12 = df12.dropna(subset=['pregnancy_test_result'])

##### Volatile Organic Compound (VOC) Metabolites

On the NHANES lab dataset, there are two VOC tables: P_UVOC and P_UVOC2. 


In [136]:
file_path = '2017-2020/urine/14.P_UVOC.xpt'

df13, meta = pyreadstat.read_xport(file_path)
df13 = standardize_id_column(df13)

In [137]:
df13.columns.to_list()

['participant_id',
 'WTSAPRP',
 'URX2MH',
 'URD2MHLC',
 'URX34M',
 'URD34MLC',
 'URXAAM',
 'URDAAMLC',
 'URXAMC',
 'URDAMCLC',
 'URXATC',
 'URDATCLC',
 'URXBMA',
 'URDBMALC',
 'URXBPM',
 'URDBPMLC',
 'URXCEM',
 'URDCEMLC',
 'URXCYHA',
 'URDCYALC',
 'URXCYM',
 'URDCYMLC',
 'URXDHB',
 'URDDHBLC',
 'URXGAM',
 'URDGAMLC',
 'URXHEM',
 'URDHEMLC',
 'URXHP2',
 'URDHP2LC',
 'URXHPM',
 'URDHPMLC',
 'URXIPM3',
 'URDPM3LC',
 'URXMAD',
 'URDMADLC',
 'URXMB3',
 'URDMB3LC',
 'URXPHG',
 'URDPHGLC',
 'URXPMM',
 'URDPMMLC',
 'URXTTC',
 'URDTTCLC']

In [138]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2017/DataFiles/P_UVOC.htm"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Get the first table
table = soup.find('table')
rows = table.find_all('tr')

# Extract header and rows
headers = [th.get_text(strip=True) for th in rows[0].find_all('th')]
data = [
    [td.get_text(strip=True) for td in row.find_all('td')]
    for row in rows[1:]
]

df_info = pd.DataFrame(data, columns=headers)

#Concurrently working on IBM data science certificate and a new learned skill was webscraping so the decision was made to utilize the skill to future-proof the cleaning process and to ensure there are no typos/the names are accurate

In [139]:
import re

# Clean text to snake_case
def to_snake_case(text):
    text = text.lower()
    text = text.replace("-", "_")            # Hyphens → underscores
    text = text.replace("/", "_")            # Slashes → underscores
    text = re.sub(r"[^\w\s_]", "", text)     # Remove punctuation except underscores
    text = re.sub(r"\s+", "_", text)         # Spaces → underscores
    return text

# Step 1: Create initial rename_dict from table (matching only columns in df13)
rename_dict = {
    row["VARIABLE NAME"]: to_snake_case(row["ANALYTE NAME"])
    for _, row in df_info.iterrows()
    if row["VARIABLE NAME"] in df13.columns
}

# Step 2: Clean comment code columns (ending in "LC")
unit_suffixes = ["_ng_ml", "_ug_l", "_mg_dl", "_umol_l", "_nmol_l"]
comment_renames = {}

for col in df13.columns:
    if col.endswith("LC") and col not in rename_dict:
        base_col = col[:-2]  # Remove 'LC'
        match_col = base_col.replace("URD", "URX")

        if match_col in rename_dict:
            clean_name = rename_dict[match_col]

            # Strip any unit suffix
            for unit in unit_suffixes:
                if clean_name.endswith(unit):
                    clean_name = clean_name[: -len(unit)]
                    break

            comment_renames[col] = f"{clean_name}_comment"

# Step 3: Merge comment renames into main rename_dict
rename_dict.update(comment_renames)

In [140]:
df13_cleaned = df13.rename(columns=rename_dict)

In [141]:
df13_cleaned.head()

Unnamed: 0,participant_id,WTSAPRP,2_methylhippuric_acid_ng_ml,2_methylhippuric_acid_comment,3__and_4_methylhippuric_acid_ng_ml,3__and_4_methylhippuric_acid_comment,n_acetyl_s_2_carbamoylethyl_l_cysteine_ng_ml,n_acetyl_s_2_carbamoylethyl_l_cysteine_comment,n_acetyl_s_n_methylcarbamoyl_l_cysteine_ng_ml,n_acetyl_s_n_methylcarbamoyl_l_cysteine_comment,...,mandelic_acid_ng_ml,mandelic_acid_comment,n_acetyl_s_4_hydroxy_2_butenyl_l_cysteine_ng_ml,n_acetyl_s_4_hydroxy_2_butenyl_l_cysteine_comment,phenylglyoxylic_acid_ng_ml,phenylglyoxylic_acid_comment,n_acetyl_s_3_hydroxypropyl_1_methyl_l_cysteine_ng_ml,n_acetyl_s_3_hydroxypropyl_1_methyl_l_cysteine_comment,2_thioxothiazolidine_4_carboxylic_acid,2_thioxothiazolidine_4_carboxylic_acid_comment
0,109266.0,28660.015986,3.54,1.0,18.7,0.0,19.0,0.0,12.9,0.0,...,37.2,0.0,0.424,1.0,75.6,0.0,60.6,0.0,7.9,1.0
1,109270.0,17900.682903,9.33,0.0,106.0,0.0,184.0,0.0,91.0,0.0,...,210.0,0.0,5.63,0.0,298.0,0.0,257.0,0.0,7.9,1.0
2,109273.0,80106.859617,90.1,0.0,491.0,0.0,122.0,0.0,543.0,0.0,...,213.0,0.0,31.2,0.0,438.0,0.0,927.0,0.0,7.9,1.0
3,109274.0,24512.27628,39.9,0.0,118.0,0.0,90.0,0.0,129.0,0.0,...,222.0,0.0,9.71,0.0,230.0,0.0,2350.0,0.0,37.2,0.0
4,109287.0,25828.523003,14.7,0.0,121.0,0.0,135.0,0.0,64.0,0.0,...,1270.0,0.0,11.2,0.0,807.0,0.0,240.0,0.0,7.9,1.0


In [142]:
df13_cleaned = df13_cleaned.drop('WTSAPRP',axis=1)

In [143]:
df13_cleaned.isnull().sum()

participant_id                                                0
2_methylhippuric_acid_ng_ml                                 565
2_methylhippuric_acid_comment                               565
3__and_4_methylhippuric_acid_ng_ml                          565
3__and_4_methylhippuric_acid_comment                        565
n_acetyl_s_2_carbamoylethyl_l_cysteine_ng_ml                565
n_acetyl_s_2_carbamoylethyl_l_cysteine_comment              565
n_acetyl_s_n_methylcarbamoyl_l_cysteine_ng_ml               565
n_acetyl_s_n_methylcarbamoyl_l_cysteine_comment             565
2_aminothiazoline_4_carboxylic_acid_ng_ml                   565
2_aminothiazoline_4_carboxylic_acid_comment                 565
n_acetyl_s_benzyl_l_cysteine_ng_ml                          565
n_acetyl_s_benzyl_l_cysteine_comment                        565
n_acetyl_s_n_propyl_l_cysteine_ng_ml                        565
n_acetyl_s_n_propyl_l_cysteine_comment                      565
n_acetyl_s_2_carboxyethyl_l_cysteine_ng_

In [144]:
# The three columns were missed since they did not follow the exact pattern. They were manually renamed

df13_cleaned = df13_cleaned.rename(columns ={
    'SEQN':'participant_id',
    'URDCYALC':'n_acetyl_s_1_cyano_2_hydroxyethyl_l_cysteine_comment',
    'URDPM3LC': 'n_acetyl_s_4_hydroxy_2_methyl_2_butenyl_l_cysteine_comment'
})

In [145]:
df13_cleaned.columns.to_list()

['participant_id',
 '2_methylhippuric_acid_ng_ml',
 '2_methylhippuric_acid_comment',
 '3__and_4_methylhippuric_acid_ng_ml',
 '3__and_4_methylhippuric_acid_comment',
 'n_acetyl_s_2_carbamoylethyl_l_cysteine_ng_ml',
 'n_acetyl_s_2_carbamoylethyl_l_cysteine_comment',
 'n_acetyl_s_n_methylcarbamoyl_l_cysteine_ng_ml',
 'n_acetyl_s_n_methylcarbamoyl_l_cysteine_comment',
 '2_aminothiazoline_4_carboxylic_acid_ng_ml',
 '2_aminothiazoline_4_carboxylic_acid_comment',
 'n_acetyl_s_benzyl_l_cysteine_ng_ml',
 'n_acetyl_s_benzyl_l_cysteine_comment',
 'n_acetyl_s_n_propyl_l_cysteine_ng_ml',
 'n_acetyl_s_n_propyl_l_cysteine_comment',
 'n_acetyl_s_2_carboxyethyl_l_cysteine_ng_ml',
 'n_acetyl_s_2_carboxyethyl_l_cysteine_comment',
 'n_acetyl_s_1_cyano_2_hydroxyethyl_l_cysteine_ng_ml',
 'n_acetyl_s_1_cyano_2_hydroxyethyl_l_cysteine_comment',
 'n_acetyl_s_2_cyanoethyl_l_cysteine_ng_ml',
 'n_acetyl_s_2_cyanoethyl_l_cysteine_comment',
 'n_acetyl_s_34_dihydroxybutyl_l_cysteine_ng_ml',
 'n_acetyl_s_34_dihydroxy

In [146]:
value_cols = [col for col in df13_cleaned.columns if col != 'participant_id']
rows_all_nan = df13_cleaned[value_cols].isna().all(axis=1)
print(f"Number of rows missing all VOC values: {rows_all_nan.sum()}")

Number of rows missing all VOC values: 565


In [147]:
# Drop rows where all value columns are NaN (excluding participant_id)
df13_cleaned = df13_cleaned[~rows_all_nan].copy()

print(f"Number of rows dropped: {rows_all_nan.sum()}")


Number of rows dropped: 565


In [148]:
file_path = '2017-2020/urine/15.P_UVOC2.xpt'

df14, meta = pyreadstat.read_xport(file_path)
df14 = standardize_id_column(df14)

In [149]:
df14.head()

Unnamed: 0,participant_id,WTVOC2PP,URXMUCA,URDMUCLC,URXPHMA,URDPMALC
0,109266.0,29122.785906,6.94,1.0,0.106,1.0
1,109270.0,18436.336755,213.0,0.0,0.185,0.0
2,109273.0,93177.905637,147.0,0.0,1.1,0.0
3,109274.0,27374.984127,244.0,0.0,0.171,0.0
4,109287.0,25946.229537,130.0,0.0,0.153,0.0


In [150]:
df14 = df14.rename(columns={
    'URXMUCA': 'trans_trans_muconic_acid_ng_ml',
    'URDMUCLC': 'trans_trans_muconic_acid_comment',
    'URXPHMA': 'phenylmercapturic_acid_ng_ml',
    'URDPMALC': 'phenylmercapturic_acid_comment'
})   

In [151]:
df14 = df14.drop('WTVOC2PP', axis=1)

In [152]:
df14.isnull().sum()

participant_id                        0
trans_trans_muconic_acid_ng_ml      994
trans_trans_muconic_acid_comment    994
phenylmercapturic_acid_ng_ml        994
phenylmercapturic_acid_comment      994
dtype: int64

In [153]:
common_nan = get_common_nan_ids(df14, 'trans_trans_muconic_acid_ng_ml', 'phenylmercapturic_acid_ng_ml', id_col='participant_id')

Number of NaNs in trans_trans_muconic_acid_ng_ml: 994
Number of NaNs in phenylmercapturic_acid_ng_ml: 994
Number of IDs with NaNs in both columns: 994


In [154]:
df14_cleaned = drop_rows_with_common_nan_ids(df14, 'trans_trans_muconic_acid_ng_ml', 'phenylmercapturic_acid_ng_ml', id_col='participant_id')

Rows dropped where both trans_trans_muconic_acid_ng_ml and phenylmercapturic_acid_ng_ml were NaN: 994


All of the urine labs have been cleaned. The dataframes from all of the urine labs will be collated into one large dataframe named urine_labs into csv file.

In [155]:
df_names = [var for var in globals() if isinstance(globals()[var], pd.DataFrame)]
print(df_names)

['__', 'df_demo', '_4', '_9', '_13', '_15', '_18', 'df_bmx', '_21', '_29', '_31', 'df', '_36', 'df_cleaned', 'df1', '_45', 'df2', '_56', 'df2_cleaned', 'df3', '_63', 'df4', '_70', 'df4_cleaned', '_77', 'df5', 'df5_cleaned', 'df6', 'df6_clean', 'df7', '_92', 'df7_clean', '_97', 'df8', '_102', 'df8_cleaned', 'df9', '_112', 'df10', '_114', 'df10_cleaned', 'df11', '_125', 'df11_cleaned', 'df12', '_133', 'df13', 'df_info', 'df13_cleaned', '_141', 'df14', '_149', 'df14_cleaned']


In [156]:
urine_dfs = [
    df_cleaned,
    df1,
    df2_cleaned,
    df3,
    df4_cleaned,
    df5_cleaned,
    df6_clean,
    df7_clean,
    df8_cleaned,
    df9,
    df10_cleaned,
    df11_cleaned,
    df12,
    df13_cleaned,
    df14_cleaned
]

from functools import reduce

df_urine_combined = reduce(
    lambda left, right: pd.merge(left, right, on="participant_id", how="outer"),
    urine_dfs
)

In [157]:
df_urine_combined.to_csv("cleaned_urine_labs_combined.csv", index=False)

In [158]:
urine_df = pd.read_csv('cleaned_urine_labs_combined.csv')

urine_df.head(10)

Unnamed: 0,participant_id,albumin_urine_ug_mL,albumin_urine_mg_L,alb_comment,creatinine_urine_mg_dL,creatinine_urine_umol_L,creatinine_comment,alb_creat_ratio,arsenous_acid_ug_L,arsenous_acid_comment,...,phenylglyoxylic_acid_ng_ml,phenylglyoxylic_acid_comment,n_acetyl_s_3_hydroxypropyl_1_methyl_l_cysteine_ng_ml,n_acetyl_s_3_hydroxypropyl_1_methyl_l_cysteine_comment,2_thioxothiazolidine_4_carboxylic_acid,2_thioxothiazolidine_4_carboxylic_acid_comment,trans_trans_muconic_acid_ng_ml,trans_trans_muconic_acid_comment,phenylmercapturic_acid_ng_ml,phenylmercapturic_acid_comment
0,109266.0,5.5,5.5,0.0,36.0,3182.4,0.0,15.28,0.08,1.0,...,75.6,0.0,60.6,0.0,7.9,1.0,6.94,1.0,0.106,1.0
1,109270.0,4.0,4.0,0.0,165.0,14586.0,0.0,2.42,0.51,0.0,...,298.0,0.0,257.0,0.0,7.9,1.0,213.0,0.0,0.185,0.0
2,109271.0,2.4,2.4,0.0,32.0,2828.8,0.0,7.5,,,...,,,,,,,,,,
3,109273.0,4.9,4.9,0.0,121.0,10696.4,0.0,4.05,0.08,1.0,...,438.0,0.0,927.0,0.0,7.9,1.0,147.0,0.0,1.1,0.0
4,109274.0,12.8,12.8,0.0,120.0,10608.0,0.0,10.67,0.08,1.0,...,230.0,0.0,2350.0,0.0,37.2,0.0,244.0,0.0,0.171,0.0
5,109275.0,3.7,3.7,0.0,20.0,1768.0,0.0,18.5,,,...,,,,,,,,,,
6,109277.0,14.7,14.7,0.0,244.0,21569.6,0.0,6.02,,,...,,,,,,,,,,
7,109278.0,8.4,8.4,0.0,124.0,10961.6,0.0,6.77,,,...,,,,,,,,,,
8,109279.0,13.9,13.9,0.0,251.0,22188.4,0.0,5.54,,,...,,,,,,,,,,
9,109282.0,16.0,16.0,0.0,192.0,16972.8,0.0,8.33,,,...,,,,,,,,,,


In [159]:
urine_df.shape

(12795, 129)

#### Blood Labs

**Alpha-1-acid glycoprotein (AGP)**

Alpha-1-acid glycoprotein (AGP), also known as orosomucoid (ORM), is an acute-phase serum protein present in humans and many animal species. It is produced in response to inflammation, although its precise biological role remains under investigation and somewhat ambiguous [2]. According to Ceciliani et al. (2019), AGP may play a role in immunometabolism, a function potentially relevant to understanding the obesity epidemic in the U.S.

In the NHANES dataset, AGP levels were measured in children aged 3–5 years and females aged 12–49 years. This data offers an opportunity to explore potential correlations between AGP serum concentrations and obesity prevalence among the female participants in the study.

In [160]:
file_path = '2017-2020/blood/1.P_SSAGP.xpt'

df_b1, meta = pyreadstat.read_xport(file_path)
df_b1 = standardize_id_column(df_b1)

In [161]:
df_b1.head(10)

Unnamed: 0,participant_id,WTSSAGPP,SSAGP
0,109264.0,0.0,
1,109266.0,10003.783188,0.796
2,109277.0,23329.384783,0.746
3,109279.0,14416.168293,0.94
4,109284.0,17705.030492,1.08
5,109286.0,21951.734438,0.358
6,109288.0,0.0,
7,109291.0,25981.10596,0.766
8,109297.0,0.0,
9,109309.0,0.0,


In [162]:
df_b1.shape

(3823, 3)

In [163]:
df_b1 = df_b1.rename(columns={
    'SSAGP': 'alpha_1_agp_g_l'
})

In [164]:
df_b1.head()

Unnamed: 0,participant_id,WTSSAGPP,alpha_1_agp_g_l
0,109264.0,0.0,
1,109266.0,10003.783188,0.796
2,109277.0,23329.384783,0.746
3,109279.0,14416.168293,0.94
4,109284.0,17705.030492,1.08


In [165]:
df_b1 = df_b1.drop('WTSSAGPP', axis=1)

In [166]:
df_b1 = df_b1.dropna(subset = ['alpha_1_agp_g_l'])

In [167]:
df_b1.head(5)

Unnamed: 0,participant_id,alpha_1_agp_g_l
1,109266.0,0.796
2,109277.0,0.746
3,109279.0,0.94
4,109284.0,1.08
5,109286.0,0.358


**Lipid Panel**

Lipids are essential molecules that support a range of physiological functions, including hormone production and cellular structure. However, excessive lipid levels—particularly certain types—are associated with increased risk of cardiovascular disease.

To assess lipid status, a fasting lipid panel is commonly used. This test typically includes measurements of:
- LDL (low-density lipoprotein, or “bad” cholesterol),
- HDL (high-density lipoprotein, or “good” cholesterol),
- Total cholesterol, and
- Triglycerides

The NHANES dataset includes all of these values, enabling analysis of lipid profiles across a large representative population. This section focuses on cleaning and preparing these variables for analysis.

In [168]:
file_path = '2017-2020/blood/2.P_HDL.xpt'

df_b2, meta = pyreadstat.read_xport(file_path)
df_b2 = standardize_id_column(df_b2)

In [169]:
df_b2.head()

Unnamed: 0,participant_id,LBDHDD,LBDHDDSI
0,109264.0,72.0,1.86
1,109266.0,56.0,1.45
2,109270.0,47.0,1.22
3,109271.0,33.0,0.85
4,109273.0,42.0,1.09


In [170]:
df_b2 = df_b2.rename(columns ={
    'LBDHDD':'direct_hdl_mg_dl',
    'LBDHDDSI':'direct_hdl_mmol_l'
})

In [171]:
df_b2.isnull().sum()

participant_id          0
direct_hdl_mg_dl     1370
direct_hdl_mmol_l    1370
dtype: int64

In [172]:
common_nan = get_common_nan_ids(df_b2, 'direct_hdl_mg_dl', 'direct_hdl_mmol_l', id_col='participant_id')

Number of NaNs in direct_hdl_mg_dl: 1370
Number of NaNs in direct_hdl_mmol_l: 1370
Number of IDs with NaNs in both columns: 1370


In [173]:
df_b2 = drop_rows_with_common_nan_ids(df_b2, 'direct_hdl_mg_dl', 'direct_hdl_mmol_l', id_col='participant_id')

Rows dropped where both direct_hdl_mg_dl and direct_hdl_mmol_l were NaN: 1370


In [174]:
file_path = '2017-2020/blood/3.P_TRIGLY.xpt'

df_b3, meta = pyreadstat.read_xport(file_path)
df_b3 = standardize_id_column(df_b3)

In [175]:
df_b3.columns.to_list()

['participant_id',
 'WTSAFPRP',
 'LBXTR',
 'LBDTRSI',
 'LBDLDL',
 'LBDLDLSI',
 'LBDLDLM',
 'LBDLDMSI',
 'LBDLDLN',
 'LBDLDNSI']

In [176]:
df_b3 = df_b3.rename(columns={
    'LBXTR':'triglyceride_mg_dl',
    'LBDTRSI':'triglyceride_mmol_l',
    'LBDLDL':'ldl_friedewald_mg_dl',
    'LBDLDLSI':'ldl_friedwalkd_mmol_l',
    'LBDLDLM': 'ldl_martin_hopkins_mg_dl',
    'LBDLDMSI': 'ldl_martin_hopkins_mmol_l',
    'LBDLDLN':'ldl_nih_mg_dl',
    'LBDLDNSI':'ldl_nih_mmol_l'
})

In [177]:
df_b3 = df_b3.drop('WTSAFPRP', axis=1)

In [178]:
df_b3.isnull().sum()

participant_id                 0
triglyceride_mg_dl           440
triglyceride_mmol_l          440
ldl_friedewald_mg_dl         473
ldl_friedwalkd_mmol_l        473
ldl_martin_hopkins_mg_dl     473
ldl_martin_hopkins_mmol_l    473
ldl_nih_mg_dl                448
ldl_nih_mmol_l               448
dtype: int64

In [179]:
value_cols = [col for col in df_b3.columns if col != 'participant_id']
rows_all_nan = df_b3[value_cols].isna().all(axis=1)
print(f"Number of rows missing all cholesterol values: {rows_all_nan.sum()}")

Number of rows missing all cholesterol values: 440


In [180]:
# Drop rows where all value columns are NaN (excluding participant_id)
df_b3_cleaned = df_b3[~rows_all_nan].copy()

print(f"Number of rows dropped: {rows_all_nan.sum()}")

Number of rows dropped: 440


In [181]:
file_path = '2017-2020/blood/4.P_TCHOL.xpt'

df_b4, meta = pyreadstat.read_xport(file_path)
df_b4 = standardize_id_column(df_b4)

In [182]:
df_b4.head()

Unnamed: 0,participant_id,LBXTC,LBDTCSI
0,109264.0,166.0,4.29
1,109266.0,195.0,5.04
2,109270.0,103.0,2.66
3,109271.0,147.0,3.8
4,109273.0,164.0,4.24


In [183]:
df_b4 = df_b4.rename(columns={
    'LBXTC': 'total_cholesterol_mg_dl',
    'LBDTCSI':'total_cholesterol_mmol_l'
})

In [184]:
df_b4.isnull().sum()

participant_id                 0
total_cholesterol_mg_dl     1370
total_cholesterol_mmol_l    1370
dtype: int64

In [185]:
common_nan = get_common_nan_ids(df_b4, 'total_cholesterol_mg_dl', 'total_cholesterol_mmol_l', id_col='participant_id')

Number of NaNs in total_cholesterol_mg_dl: 1370
Number of NaNs in total_cholesterol_mmol_l: 1370
Number of IDs with NaNs in both columns: 1370


In [186]:
df_b4 = drop_rows_with_common_nan_ids(df_b4, 'total_cholesterol_mg_dl', 'total_cholesterol_mmol_l', id_col='participant_id')

Rows dropped where both total_cholesterol_mg_dl and total_cholesterol_mmol_l were NaN: 1370


**Chromium and Cobalt (Blood)**

NHANES data on chromium and cobalt levels were collected on patients aged 40-150 years old. 

In [187]:
file_path = '2017-2020/blood/5.P_CRCO.xpt'

df_b5, meta = pyreadstat.read_xport(file_path)
df_b5 = standardize_id_column(df_b5)

In [188]:
df_b5.columns.to_list()

['participant_id',
 'LBXBCR',
 'LBDBCRSI',
 'LBDBCRLC',
 'LBXBCO',
 'LBDBCOSI',
 'LBDBCOLC']

In [189]:
df_b5 = df_b5.rename(columns={
    'LBXBCR':'chromium_blood_ug_l', 
    'LBDBCRSI': 'chromium_blood_nmol_l',
    'LBDBCRLC':'chromium_blood_comment', 
    'LBXBCO':'cobalt_blood_ug_l', 
    'LBDBCOSI':'cobalt_blood_nmol_l', 
    'LBDBCOLC' :'cobalt_blood_comment'
})

In [190]:
df_b5.isnull().sum()

participant_id              0
chromium_blood_ug_l       302
chromium_blood_nmol_l     302
chromium_blood_comment    302
cobalt_blood_ug_l         299
cobalt_blood_nmol_l       299
cobalt_blood_comment      299
dtype: int64

In [191]:
common_nan = get_common_nan_ids(df_b5, 'chromium_blood_ug_l', 'cobalt_blood_ug_l', id_col='participant_id')

Number of NaNs in chromium_blood_ug_l: 302
Number of NaNs in cobalt_blood_ug_l: 299
Number of IDs with NaNs in both columns: 297


In [192]:
df_b5 = drop_rows_with_common_nan_ids(df_b5, 'chromium_blood_ug_l', 'cobalt_blood_ug_l', id_col='participant_id')

Rows dropped where both chromium_blood_ug_l and cobalt_blood_ug_l were NaN: 297


**Complete Blood Count with Differential**

CBC with diff is the most common blood work that is ordered for a baseline lab. CBC can be useful to assess the patients for acute inflammation in the body and anemia. 

There are many values that are extracted and assessed through the CBC panel. To more efficiently extract the information, the decision was made to utilize webscraping rather than individually typing out each lab value. The units for each of the columns is included in the README file for this project.

In [193]:
file_path = '2017-2020/blood/6.P_CBC.xpt'

df_b6, meta = pyreadstat.read_xport(file_path)
df_b6 = standardize_id_column(df_b6)

In [194]:
url = "https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2017/DataFiles/P_CBC.htm"

df_info_raw = pd.read_html(url)[0]

# Use the first row as the header
df_info_raw.columns = df_info_raw.iloc[0]
df_info = df_info_raw.drop(index=0).reset_index(drop=True)

In [195]:
df_info.head()

Unnamed: 0,Variable Name,Analyte Description,LLOD,ULOD,Units
0,LBXWBCSI,White blood cell count,0.02,363.0,x 103 cells/uL
1,LBXLYPCT,Lymphocyte percent,0.0,100.0,%
2,LBXMOPCT,Monocyte percent,0.0,100.0,%
3,LBXNEPCT,Segmented neutrophils percent,0.0,100.0,%
4,LBXEOPCT,Eosinophils percent,0.0,100.0,%


In [196]:
df_info.columns.to_list()

['Variable  Name', 'Analyte  Description', 'LLOD', 'ULOD', 'Units']

In [197]:
rename_dict = {
    row["Variable  Name"]: to_snake_case(row["Analyte  Description"])
    for _, row in df_info.iterrows()
    if row["Variable  Name"] in df_b6.columns
}

In [198]:
df_b6 = df_b6.rename(columns=rename_dict)

In [199]:
df_b6.head()

Unnamed: 0,participant_id,white_blood_cell_count,lymphocyte_percent,monocyte_percent,segmented_neutrophils_percent,eosinophils_percent,basophils_percent,LBDLYMNO,LBDMONO,LBDNENO,...,red_blood_cell_count,hemoglobin,LBXHCT,mean_cell_volume,LBXMC,LBXMCHSI,red_cell_distribution_width,platelet_count,mean_platelet_volume,LBXNRBC
0,109263.0,,,,,,,,,,...,,,,,,,,,,
1,109264.0,4.5,45.6,6.2,46.4,1.4,0.5,2.1,0.3,2.1,...,4.8,13.7,40.5,84.3,33.7,28.4,13.1,263.0,8.2,0.1
2,109265.0,9.5,46.4,10.9,39.2,2.9,0.7,4.4,1.0,3.7,...,4.5,12.6,36.6,81.2,34.4,27.9,13.1,286.0,6.6,0.1
3,109266.0,7.8,34.5,6.0,58.3,0.8,0.5,2.7,0.5,4.5,...,4.35,12.3,36.5,83.7,33.6,28.1,14.0,314.0,6.9,0.1
4,109269.0,9.1,38.3,7.8,48.8,4.1,1.1,3.5,0.7,4.4,...,4.21,11.7,33.5,79.6,34.9,27.8,13.4,287.0,6.9,0.1


In [200]:
df_b6.columns.to_list()

#Some of the names were renamed based on what was available on the first table in the URL. For the other ones that were not, they were manually renamed

['participant_id',
 'white_blood_cell_count',
 'lymphocyte_percent',
 'monocyte_percent',
 'segmented_neutrophils_percent',
 'eosinophils_percent',
 'basophils_percent',
 'LBDLYMNO',
 'LBDMONO',
 'LBDNENO',
 'LBDEONO',
 'LBDBANO',
 'red_blood_cell_count',
 'hemoglobin',
 'LBXHCT',
 'mean_cell_volume',
 'LBXMC',
 'LBXMCHSI',
 'red_cell_distribution_width',
 'platelet_count',
 'mean_platelet_volume',
 'LBXNRBC']

In [201]:
df_b6 = df_b6.rename(columns={
 'LBDLYMNO': 'lymphocyte_number',
 'LBDMONO': 'monocyte_number',
 'LBDNENO':'segmented_neutrophils_number',
 'LBDEONO':'eosinophils_number',
 'LBDBANO':'basophils_number',
 'LBXHCT':'hematocrit_percent',
 'LBXMC':'mean_cell_hgb_concentration',
 'LBXMCHSI':'mean_cell_hemoglobin',
 'LBXNRBC':'nucelated_red_blood_cells'
})

In [202]:
df_b6.columns

Index(['participant_id', 'white_blood_cell_count', 'lymphocyte_percent',
       'monocyte_percent', 'segmented_neutrophils_percent',
       'eosinophils_percent', 'basophils_percent', 'lymphocyte_number',
       'monocyte_number', 'segmented_neutrophils_number', 'eosinophils_number',
       'basophils_number', 'red_blood_cell_count', 'hemoglobin',
       'hematocrit_percent', 'mean_cell_volume', 'mean_cell_hgb_concentration',
       'mean_cell_hemoglobin', 'red_cell_distribution_width', 'platelet_count',
       'mean_platelet_volume', 'nucelated_red_blood_cells'],
      dtype='object')

In [203]:
df_b6.isnull().sum()

participant_id                      0
white_blood_cell_count           1616
lymphocyte_percent               1621
monocyte_percent                 1621
segmented_neutrophils_percent    1621
eosinophils_percent              1621
basophils_percent                1621
lymphocyte_number                1621
monocyte_number                  1621
segmented_neutrophils_number     1621
eosinophils_number               1621
basophils_number                 1621
red_blood_cell_count             1616
hemoglobin                       1616
hematocrit_percent               1616
mean_cell_volume                 1616
mean_cell_hgb_concentration      1616
mean_cell_hemoglobin             1616
red_cell_distribution_width      1616
platelet_count                   1616
mean_platelet_volume             1616
nucelated_red_blood_cells        1621
dtype: int64

In [204]:
value_cols = [col for col in df_b6.columns if col != 'participant_id']
rows_all_nan = df_b6[value_cols].isna().all(axis=1)
print(f"Number of rows missing all CBC values: {rows_all_nan.sum()}")

Number of rows missing all CBC values: 1616


In [205]:
# Drop rows where all value columns are NaN (excluding participant_id)
df_b6_cleaned = df_b6[~rows_all_nan].copy()

print(f"Number of rows dropped: {rows_all_nan.sum()}")

Number of rows dropped: 1616


**Cotinine**

Cotinine is a metabolite that is produced when nicotine is processed. Its long half-life makes it a good marker for assessing tobacco exposure or usage. 

In [206]:
file_path = '2017-2020/blood/7.P_COT.xpt'

df_b7, meta = pyreadstat.read_xport(file_path)
df_b7 = standardize_id_column(df_b7)

In [207]:
df_b7.columns.to_list()

['participant_id', 'LBXCOT', 'LBDCOTLC', 'LBXHCOT', 'LBDHCOLC']

In [208]:
df_b7 = df_b7.rename(columns={
    'LBXCOT':'serum_cotinine_ng_ml',
    'LBDCOTLC':'serum_cotinine_comment',
    'LBXHCOT':'serum_hydroxycotinine_ng_ml',
    'LBDHCOLC':'serum_hydroxycotinine_comment'
})

In [209]:
df_b7.isnull().sum()

participant_id                      0
serum_cotinine_ng_ml             1632
serum_cotinine_comment           1632
serum_hydroxycotinine_ng_ml      1632
serum_hydroxycotinine_comment    1632
dtype: int64

In [210]:
common_nan = get_common_nan_ids(df_b7, 'serum_cotinine_ng_ml', 'serum_hydroxycotinine_ng_ml', id_col='participant_id')

Number of NaNs in serum_cotinine_ng_ml: 1632
Number of NaNs in serum_hydroxycotinine_ng_ml: 1632
Number of IDs with NaNs in both columns: 1632


In [211]:
df_b7 = drop_rows_with_common_nan_ids(df_b7, 'serum_cotinine_ng_ml', 'serum_hydroxycotinine_ng_ml', id_col='participant_id')

Rows dropped where both serum_cotinine_ng_ml and serum_hydroxycotinine_ng_ml were NaN: 1632


**Cytomegalovirus**

Cytomegalovirus (CMV) is a double-stranded DNA virus that causes flu-like symptoms in immunocompetant population but can cause organ damage in immunocompromised (i.e. HIV/AIDS) population. CMV virus is transmitted via bodily fluids including sexual contact [12]. 

Avidity tests for whether the CMV infection was recent or in the past. Low avidity shows recent infection and high avidity shows past infeciton.

In [212]:
file_path = '2017-2020/blood/8.P_CMV.xpt'

df_b8, meta = pyreadstat.read_xport(file_path)
df_b8 = standardize_id_column(df_b8)

In [213]:
df_b8.columns.to_list()

['participant_id', 'LBXIGG', 'LBXIGM', 'LBXIGGA']

In [214]:
df_b8 = df_b8.rename(columns={
    'LBXIGG':'cmv_igg',
    'LBXIGM':'cmv_igm', 
    'LBXIGGA':'cmv_igg_avidity'
})

In [215]:
df_b8.isnull().sum()

#there are missing avidity value which would indicate that the person was never infected with CMV. The null values for the IgG and IgM would indicate missing data so rows without these two values will be dropped

participant_id        0
cmv_igg             617
cmv_igm             617
cmv_igg_avidity    1307
dtype: int64

In [216]:
common_nan = get_common_nan_ids(df_b8, 'cmv_igg', 'cmv_igm', id_col='participant_id')

Number of NaNs in cmv_igg: 617
Number of NaNs in cmv_igm: 617
Number of IDs with NaNs in both columns: 617


In [217]:
df_b8 = drop_rows_with_common_nan_ids(df_b8, 'cmv_igg', 'cmv_igm', id_col='participant_id')

Rows dropped where both cmv_igg and cmv_igm were NaN: 617


**Ethylene Oxide**

Ethylene Oxide (EtO) is a colorless gas that is used to produce various materials as well as sterilize medical equipments. Exposure to EtO most often is due to aerosolization. EtO is a well-known carcinogen and long term exposure to this substance could lead to blood cancers such as non-Hodgkin lymphoma, myeloma and lymphocytic leukemia [13].

The unit for EtO measurement in the blood is picomoles per gram of hemoglobin (pmol/g Hb).

In [218]:
file_path = '2017-2020/blood/9.P_ETHOX.xpt'

df_b9, meta = pyreadstat.read_xport(file_path)
df_b9 = standardize_id_column(df_b9)

In [219]:
df_b9.columns.to_list()

['participant_id', 'WTSAPRP', 'LBXEOA', 'LBDEOALC']

In [220]:
df_b9 = df_b9.drop('WTSAPRP',axis=1)

In [221]:
df_b9 = df_b9.rename(columns={
    'LBXEOA':'eto_pmol_g_hb',
    'LBDEOALC':'eto_comment'
})

In [222]:
df_b9.isnull().sum()

participant_id      0
eto_pmol_g_hb     424
eto_comment       424
dtype: int64

In [223]:
df_b9 = df_b9.dropna(subset=['eto_pmol_g_hb'])

In [224]:
df_b9.head()

Unnamed: 0,participant_id,eto_pmol_g_hb,eto_comment
0,109266.0,18.7,0.0
1,109270.0,37.7,0.0
2,109273.0,359.0,0.0
3,109274.0,61.2,0.0
5,109290.0,25.5,0.0


**Ferritin and iron panel**

Ferritin and iron panel are used to assess someone's iron status. Low values in the iron panel and ferritin along with clinical symptoms are corroborated to diagnose iron deficiency anemia. 

In [225]:
file_path = '2017-2020/blood/10.P_FERTIN.xpt'

df_b10, meta = pyreadstat.read_xport(file_path)
df_b10 = standardize_id_column(df_b10)

In [226]:
df_b10.head()

Unnamed: 0,participant_id,LBXFER,LBDFERSI
0,109263.0,,
1,109264.0,15.7,15.7
2,109265.0,42.1,42.1
3,109266.0,11.6,11.6
4,109269.0,41.7,41.7


In [227]:
df_b10 = df_b10.rename(columns={
    'LBXFER':'ferritin_ng_ml',
    'LBDFERSI':'ferritin_ug_l'
})

In [228]:
df_b10.isnull().sum()

participant_id       0
ferritin_ng_ml    1426
ferritin_ug_l     1426
dtype: int64

In [229]:
common_nan = get_common_nan_ids(df_b10, 'ferritin_ng_ml', 'ferritin_ug_l', id_col='participant_id')

Number of NaNs in ferritin_ng_ml: 1426
Number of NaNs in ferritin_ug_l: 1426
Number of IDs with NaNs in both columns: 1426


In [230]:
df_b10 = drop_rows_with_common_nan_ids(df_b10, 'ferritin_ng_ml', 'ferritin_ug_l', id_col='participant_id')

Rows dropped where both ferritin_ng_ml and ferritin_ug_l were NaN: 1426


In [231]:
file_path = '2017-2020/blood/11.P_FETIB.xpt'

df_b11, meta = pyreadstat.read_xport(file_path)
df_b11 = standardize_id_column(df_b11)

In [232]:
df_b11.columns.to_list()

['participant_id',
 'LBXIRN',
 'LBDIRNSI',
 'LBXUIB',
 'LBDUIBLC',
 'LBDUIBSI',
 'LBDTIB',
 'LBDTIBSI',
 'LBDPCT']

In [233]:
df_b11 = df_b11.rename(columns={
 'LBXIRN':'iron_frozen_ug_dl',
 'LBDIRNSI':'iron_frozen_umol_l',
 'LBXUIB':'uibc_ug_dl',
 'LBDUIBLC':'uibc_comment',
 'LBDUIBSI':'uibc_umol_l',
 'LBDTIB':'tibc_ug_dl',
 'LBDTIBSI':'tibc_umol_l',
 'LBDPCT':'transferrin_saturation'
})

In [234]:
df_b11.isnull().sum()

participant_id              0
iron_frozen_ug_dl         904
iron_frozen_umol_l        904
uibc_ug_dl                949
uibc_comment              949
uibc_umol_l               949
tibc_ug_dl                956
tibc_umol_l               956
transferrin_saturation    956
dtype: int64

In [235]:
value_cols = [col for col in df_b11.columns if col != 'participant_id']
rows_all_nan = df_b11[value_cols].isna().all(axis=1)
print(f"Number of rows missing all iron panel values: {rows_all_nan.sum()}")

Number of rows missing all iron panel values: 904


In [236]:
# Drop rows where all value columns are NaN (excluding participant_id)
df_b11_cleaned = df_b11[~rows_all_nan].copy()

print(f"Number of rows dropped: {rows_all_nan.sum()}")

Number of rows dropped: 904


**Folate**



In [237]:
file_path = '2017-2020/blood/12.P_FOLATE.xpt'

df_b12, meta = pyreadstat.read_xport(file_path)
df_b12 = standardize_id_column(df_b12)

In [238]:
df_b12.columns

Index(['participant_id', 'WTFOLPRP', 'LBDRFO', 'LBDRFOSI'], dtype='object')

In [239]:
df_b12 = df_b12.drop('WTFOLPRP', axis=1)

In [240]:
df_b12 = df_b12.rename(columns={
    'LBDRFO':'rbc_folate_ng_ml',
    'LBDRFOSI':'rbc_folate_nmol_l'
})

In [241]:
df_b12.isnull().sum()

participant_id         0
rbc_folate_ng_ml     966
rbc_folate_nmol_l    966
dtype: int64

In [242]:
common_nan = get_common_nan_ids(df_b12, 'rbc_folate_ng_ml', 'rbc_folate_nmol_l', id_col='participant_id')

Number of NaNs in rbc_folate_ng_ml: 966
Number of NaNs in rbc_folate_nmol_l: 966
Number of IDs with NaNs in both columns: 966


In [243]:
df_b12 = drop_rows_with_common_nan_ids(df_b12, 'rbc_folate_ng_ml', 'rbc_folate_nmol_l', id_col='participant_id')

Rows dropped where both rbc_folate_ng_ml and rbc_folate_nmol_l were NaN: 966


In [255]:
file_path = '2017-2020/blood/13.P_FOLFMS.xpt'

df_b13, meta = pyreadstat.read_xport(file_path)
df_b13 = standardize_id_column(df_b13)

#There are a lot of technical names for these values so webscraping will be done instead of manual renaming of the columns

In [253]:
url = "https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2017/DataFiles/P_FOLFMS.htm"

df_info_raw = pd.read_html(url)[1]

# Use the first row as the header
df_info_raw.columns = df_info_raw.iloc[0]

In [254]:
df_info_raw.columns

Index(['LBXSF1SI', '5-Methyltetrahydrofolate', 0.13], dtype='object', name=0)

In [249]:
rename_dict = {
    row["Variable Name"]: to_snake_case(row["Analyte Description"])
    for _, row in df_info.iterrows()
    if row["Variable Name"] in df_b13.columns
}

KeyError: 'Variable Name'

## References

2. Ceciliani, F., & Lecchi, C. (2019). The immune functions of α1 acid glycoprotein. Current Protein & Peptide Science, 20(6), 505–524. https://doi.org/10.2174/1389203720666190405101138
3. National Institute of Environmental Health Sciences. (n.d.). Arsenic. U.S. Department of Health and Human Services. Retrieved July 3, 2025, from https://www.niehs.nih.gov/health/topics/agents/arsenic
4. Office of Dietary Supplements. (2022, June 2). Chromium: Health professional fact sheet. U.S. Department of Health and Human Services. Retrieved July 6, 2025, from https://ods.od.nih.gov/factsheets/Chromium-HealthProfessional/
5. National Institute of Environmental Health Sciences. (2025, January 30). Flame retardants. U.S. Department of Health and Human Services. Retrieved July 7, 2025, from https://www.niehs.nih.gov/health/topics/agents/flame_retardants
6. Office of Dietary Supplements. (2024, November 5). Iodine: Health professional fact sheet. U.S. Department of Health and Human Services, National Institutes of Health. Retrieved July 8, 2025, from https://ods.od.nih.gov/factsheets/Iodine-HealthProfessional/
7. Pearce, E. N., & Caldwell, K. L. (2016). Urinary iodine, thyroid function, and thyroglobulin as biomarkers of iodine status. The American Journal of Clinical Nutrition, 104(Suppl 3), 898S–901S. https://doi.org/10.3945/ajcn.115.110395
8. Agency for Toxic Substances and Disease Registry. (2007, August). Toxicological profile for barium and barium compounds: Public health statement. U.S. Department of Health and Human Services. https://www.ncbi.nlm.nih.gov/books/NBK598787/
9. Ye, B. J., Kim, B. G., Jeon, M. J., Kim, S. Y., Kim, H. C., Jang, T. W., Chae, H. J., Choi, W. J., Ha, M. N., & Hong, Y. S. (2016). Evaluation of mercury exposure level, clinical diagnosis and treatment for mercury intoxication. Annals of occupational and environmental medicine, 28, 5. https://doi.org/10.1186/s40557-015-0086-8
10. Genchi, G., Sinicropi, M. S., Lauria, G., Carocci, A., & Catalano, A. (2020). The Effects of Cadmium Toxicity. International journal of environmental research and public health, 17(11), 3782. https://doi.org/10.3390/ijerph17113782
11. Gates, A., Jakubowski, J. A., & Regina, A. C. (2025). Nickel toxicology. In StatPearls. StatPearls Publishing. Retrieved May 20, 2023, from https://www.ncbi.nlm.nih.gov/books/NBK592400/
12. Gupta, M., & Shorman, M. (2025). Cytomegalovirus Infections. In StatPearls. StatPearls Publishing.
13. U.S. Environmental Protection Agency. (n.d.). Our current understanding of ethylene oxide (EtO). EPA. Retrieved July 16, 2025, from https://www.epa.gov/hazardous-air-pollutants-ethylene-oxide/our-current-understanding-ethylene-oxide-eto