# Urine Lab Data Cleaning

Laboratory analysis is a vital component of the NHANES 2017–2020 dataset, offering detailed biomarker data that can be leveraged for predictive modeling. The most common sources of lab data are blood and urine samples. To facilitate data cleaning and relational database design, all lab results were categorized into two main groups: blood-based tests and urine-based tests.

This separation enhances the clarity and usability of the database structure, especially in the context of developing an obesity prediction model that emphasizes blood biomarkers. While blood tests serve as the primary focus due to their strong association with metabolic health, urine tests are also retained for potential secondary insights.

During data cleaning, each lab dataset from the 2017–2020 cycle was examined for completeness, consistent formatting, and variable alignment. Efforts were made to standardize variable names and units across different files to ensure compatibility. Only participants with valid lab results and complete demographic data were retained for analysis.

### Albumin and Creatinine

Albumin is a protein often found in urine if there is damage to the kidneys. Creatinine is waste product that is excreted in urine and helps evaluate the kidney function.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import os
!pip install pyreadstat
import pyreadstat #since the data files are .xpt files, this library is needed to import the table
import re
import requests
from bs4 import BeautifulSoup
from nhanes_utils import to_snake_case, get_common_nan_ids, standardize_id_column, drop_rows_with_common_nan_ids



In [2]:
file_path = '2017-2020/urine/1.P_ALB_CR.xpt'

df, meta = pyreadstat.read_xport(file_path)
df = standardize_id_column(df)

In [3]:
df.shape

(13027, 8)

In [4]:
df.head(10)

Unnamed: 0,participant_id,URXUMA,URXUMS,URDUMALC,URXUCR,URXCRS,URDUCRLC,URDACT
0,109264.0,,,,,,,
1,109266.0,5.5,5.5,0.0,36.0,3182.4,0.0,15.28
2,109270.0,4.0,4.0,0.0,165.0,14586.0,0.0,2.42
3,109271.0,2.4,2.4,0.0,32.0,2828.8,0.0,7.5
4,109273.0,4.9,4.9,0.0,121.0,10696.4,0.0,4.05
5,109274.0,12.8,12.8,0.0,120.0,10608.0,0.0,10.67
6,109275.0,3.7,3.7,0.0,20.0,1768.0,0.0,18.5
7,109277.0,14.7,14.7,0.0,244.0,21569.6,0.0,6.02
8,109278.0,8.4,8.4,0.0,124.0,10961.6,0.0,6.77
9,109279.0,13.9,13.9,0.0,251.0,22188.4,0.0,5.54


In [5]:
df.columns

Index(['participant_id', 'URXUMA', 'URXUMS', 'URDUMALC', 'URXUCR', 'URXCRS',
       'URDUCRLC', 'URDACT'],
      dtype='object')

In [6]:
df = df.rename(columns={
    'URXUMA': 'albumin_urine_ug_mL',
    'URXUMS': 'albumin_urine_mg_L', 
    'URDUMALC': 'alb_comment',
    'URXUCR': 'creatinine_urine_mg_dL', 
    'URXCRS': 'creatinine_urine_umol_L', 
    'URDUCRLC': 'creatinine_comment',
    'URDACT': 'alb_creat_ratio'
})

In [7]:
df.isnull().sum()

participant_id               0
albumin_urine_ug_mL        517
albumin_urine_mg_L         517
alb_comment                517
creatinine_urine_mg_dL     518
creatinine_urine_umol_L    518
creatinine_comment         518
alb_creat_ratio            518
dtype: int64

Some columns contain a similar number of missing (NaN) values, raising the question of whether these missing values occur for the same participants. To assess potential overlap, the following code was executed to identify if the missing values correspond to the same participant IDs. This check helps ensure that removing rows with missing data will not disproportionately reduce the dataset.

In [8]:
common_nan = get_common_nan_ids(df,'albumin_urine_ug_mL','creatinine_urine_mg_dL')

Number of NaNs in albumin_urine_ug_mL: 517
Number of NaNs in creatinine_urine_mg_dL: 518
Number of IDs with NaNs in both columns: 517


In [9]:
df_cleaned = drop_rows_with_common_nan_ids(df, 'albumin_urine_ug_mL', 'creatinine_urine_mg_dL')

Rows dropped where both albumin_urine_ug_mL and creatinine_urine_mg_dL were NaN: 517


### Arsenic

Arsenic is a naturally occurring mineral present in water, air, and soil, existing in both organic and inorganic forms. While the inorganic form is more toxic, exposure to any form of arsenic can be harmful to human health. Elevated arsenic levels have been linked to adverse effects on multiple body systems, including the cardiovascular and endocrine systems [3].

Urinary arsenic levels serve as a biomarker reflecting the concentration of arsenic in the bloodstream. Given arsenic’s potential impact on the endocrine system—a key regulator of metabolism and body weight—these lab values are relevant for investigating associations with obesity in this analysis.

#### Total Arsenic

In [10]:
file_path = '2017-2020/urine/2.P_UTAS.xpt'

df1, meta = pyreadstat.read_xport(file_path)
df1 = standardize_id_column(df1)

In [11]:
df1.head(10)

Unnamed: 0,participant_id,WTSAPRP,URXUAS,URDUASLC
0,109266.0,28660.015986,2.05,0.0
1,109270.0,17900.682903,3.25,0.0
2,109273.0,80106.859617,5.16,0.0
3,109274.0,24512.27628,2.92,0.0
4,109287.0,25828.523003,5.8,0.0
5,109288.0,8535.018174,4.24,0.0
6,109290.0,12410.268374,144.72,0.0
7,109295.0,28235.246814,3.48,0.0
8,109300.0,66737.887353,5.08,0.0
9,109309.0,33019.729726,0.95,0.0


In [12]:
df1.columns

Index(['participant_id', 'WTSAPRP', 'URXUAS', 'URDUASLC'], dtype='object')

In [13]:
df1 = df1.rename(columns={
    'URXUAS':'total_arsenic_ug_L',
    'URDUASLC': 'total_arsenic_comment'
})

In [14]:
df1.drop('WTSAPRP', axis=1,inplace = True)

In [15]:
df1.isna().sum()

participant_id             0
total_arsenic_ug_L       320
total_arsenic_comment    320
dtype: int64

In [16]:
df1.dropna(axis=1, inplace=True)

#### Speciated Arsenic

In [17]:
file_path = '2017-2020/urine/3.P_UAS.xpt'

df2, meta = pyreadstat.read_xport(file_path)
df2 = standardize_id_column(df2)

In [18]:
df2.shape

(4890, 14)

In [19]:
df2.columns

Index(['participant_id', 'WTSAPRP', 'URXUAS3', 'URDUA3LC', 'URXUAS5',
       'URDUA5LC', 'URXUAB', 'URDUABLC', 'URXUAC', 'URDUACLC', 'URXUDMA',
       'URDUDALC', 'URXUMMA', 'URDUMMAL'],
      dtype='object')

In [20]:
df2 = df2.drop('WTSAPRP', axis = 1)

In [21]:
df2 = df2.rename(columns={
    'WTSAPRP': 'subsample_weight',
    'URXUAS3':'arsenous_acid_ug_L',
    'URDUA3LC':'arsenous_acid_comment', 
    'URXUAS5': 'arsenic_acid_ug_L',
    'URDUA5LC': 'arsenic_acid_comment',
    'URXUAB': 'arsenobetaine_ug_L',
    'URDUABLC':'arsenobetaine_comment',
    'URXUAC': 'arsenocholoine_ug_L',
    'URDUACLC': 'arsenocholine_comment',
    'URXUDMA': 'dimethylarsinic_acid_ug_L', 
    'URDUDALC': 'dimethylarsinic_comment',
    'URXUMMA': 'monomethylarsonic_acid_ug_L',
    'URDUMMAL': 'monometylarsonic_comment'
})

In [22]:
df2.head(10)

Unnamed: 0,participant_id,arsenous_acid_ug_L,arsenous_acid_comment,arsenic_acid_ug_L,arsenic_acid_comment,arsenobetaine_ug_L,arsenobetaine_comment,arsenocholoine_ug_L,arsenocholine_comment,dimethylarsinic_acid_ug_L,dimethylarsinic_comment,monomethylarsonic_acid_ug_L,monometylarsonic_comment
0,109266.0,0.08,1.0,0.56,1.0,0.82,1.0,0.08,1.0,1.35,1.0,0.14,1.0
1,109270.0,0.51,0.0,0.56,1.0,0.82,1.0,0.08,1.0,2.07,0.0,0.14,1.0
2,109273.0,0.08,1.0,0.56,1.0,2.74,0.0,0.08,1.0,1.35,1.0,0.14,1.0
3,109274.0,0.08,1.0,0.56,1.0,0.82,1.0,0.08,1.0,1.35,1.0,0.14,1.0
4,109287.0,0.08,1.0,0.56,1.0,0.82,1.0,0.08,1.0,3.6,0.0,0.14,1.0
5,109288.0,0.08,1.0,0.56,1.0,0.82,1.0,0.08,1.0,4.22,0.0,0.14,1.0
6,109290.0,0.44,0.0,0.56,1.0,147.82,0.0,0.76,0.0,6.63,0.0,0.7,0.0
7,109295.0,0.08,1.0,0.56,1.0,1.28,0.0,0.08,1.0,1.35,1.0,0.14,1.0
8,109300.0,0.08,1.0,0.56,1.0,3.03,0.0,0.08,1.0,1.35,1.0,0.14,1.0
9,109309.0,0.08,1.0,0.56,1.0,0.82,1.0,0.08,1.0,1.35,1.0,0.14,1.0


In [23]:
df2.isna().sum()

participant_id                   0
arsenous_acid_ug_L             265
arsenous_acid_comment          265
arsenic_acid_ug_L              265
arsenic_acid_comment           265
arsenobetaine_ug_L             265
arsenobetaine_comment          265
arsenocholoine_ug_L            265
arsenocholine_comment          265
dimethylarsinic_acid_ug_L      265
dimethylarsinic_comment        265
monomethylarsonic_acid_ug_L    265
monometylarsonic_comment       265
dtype: int64

In [24]:
value_cols = [col for col in df2.columns if col != 'participant_id']
rows_all_nan = df2[value_cols].isna().all(axis=1)
print(f"Number of rows missing all arsenic values: {rows_all_nan.sum()}")

Number of rows missing all arsenic values: 265


In [25]:
df2_cleaned = df2[~rows_all_nan].copy()

print(f"Number of rows dropped: {rows_all_nan.sum()}")

Number of rows dropped: 265


### Chromium

Chromium is a trace mineral found in two main forms: trivalent chromium (Cr³⁺), which is naturally present in food and supplements, and hexavalent chromium (Cr⁶⁺), a toxic by-product of industrial processes such as metal manufacturing 
[4]. Trivalent chromium has been suggested to play a role in carbohydrate, lipid, and protein metabolism by enhancing insulin action; however, no definitive physiological function has been firmly established. In contrast, hexavalent chromium is recognized as potentially carcinogenic, especially when inhaled or ingested in high amounts. Due to the uncertainty surrounding its physiological role, a standardized reference range for chromium has not been established.

For the purposes of this project, which focuses on assessing correlations with obesity, trivalent chromium is of particular interest due to its possible metabolic effects. However, a limitation of the NHANES dataset is that urinary chromium measurements are not specifically categorized by valence state. While this makes it difficult to distinguish between beneficial and harmful forms, urine chromium concentration can still serve as a proxy for overall chromium exposure. Monitoring these levels may offer insight into whether exposure levels are within a potentially physiological or harmful range.

In [26]:
file_path = '2017-2020/urine/4.P_UCM.xpt'

df3, meta = pyreadstat.read_xport(file_path)
df3 = standardize_id_column(df3)

In [27]:
df3.columns

Index(['participant_id', 'WTSAPRP', 'URXUCM', 'URDUCMLC'], dtype='object')

In [28]:
df3 = df3.drop('WTSAPRP',axis=1)

In [29]:
df3.head(10)

Unnamed: 0,participant_id,URXUCM,URDUCMLC
0,109266.0,0.13,1.0
1,109270.0,0.13,1.0
2,109273.0,0.19,0.0
3,109274.0,0.4,0.0
4,109287.0,0.13,1.0
5,109288.0,0.29,0.0
6,109290.0,0.27,0.0
7,109295.0,0.26,0.0
8,109300.0,0.13,1.0
9,109309.0,0.13,1.0


In [30]:
df3.isna().sum()

participant_id      0
URXUCM            321
URDUCMLC          321
dtype: int64

Since there is only one lab value that is being measured in this dataset, any NaN values will be dropped.

In [31]:
df3 = df3.rename(columns={
   'URXUCM': 'chromium_ug_L',
    'URDUCMLC': 'chromium_comment'
})

In [32]:
df3.dropna(axis=1,inplace=True)

### Flame Retardant

Flame retardants (FRs) are chemicals applied to materials such as furniture, electronics, electrical devices, and construction products to reduce flammability and slow the spread of fire. There are several major classes of flame retardants, including brominated flame retardants (BFRs), hexabromocyclododecane (HBCD), organophosphate flame retardants (OPFRs), tetrabromobisphenol A (TBBPA), and polybrominated diphenyl ethers (PBDEs). These compounds are highly persistent in the environment and can accumulate in human tissue due to their resistance to degradation [5].

Among them, PBDEs have been extensively studied for their potential adverse health effects, including endocrine and thyroid disruption, immunotoxicity, reproductive toxicity, carcinogenicity, and negative impacts on fetal and child development. Although young children are particularly vulnerable to these effects, adults are also susceptible to long-term exposure. Given their endocrine-disrupting properties, flame retardants may play a role in the development or progression of obesity.

According to the documentation from NHANES for this lab, the unit of measurement is all ng/mL.

In [33]:
file_path = '2017-2020/urine/5.P_FR.xpt'

df4, meta = pyreadstat.read_xport(file_path)
df4 = standardize_id_column(df4)

In [34]:
df4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4929 entries, 0 to 4928
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   participant_id  4929 non-null   float64
 1   WTSBPRP         4929 non-null   float64
 2   URXBCPP         4617 non-null   float64
 3   URDBCPLC        4617 non-null   float64
 4   URXBCEP         4618 non-null   float64
 5   URDCEPLC        4618 non-null   float64
 6   URXBDCP         4599 non-null   float64
 7   URDBDCLC        4599 non-null   float64
 8   URXDBUP         4614 non-null   float64
 9   URDDUPLC        4614 non-null   float64
 10  URXDPHP         4622 non-null   float64
 11  URDDPHLC        4622 non-null   float64
 12  URXTBBA         4622 non-null   float64
 13  URDBBALC        4622 non-null   float64
dtypes: float64(14)
memory usage: 539.2 KB


In [35]:
df4.columns

Index(['participant_id', 'WTSBPRP', 'URXBCPP', 'URDBCPLC', 'URXBCEP',
       'URDCEPLC', 'URXBDCP', 'URDBDCLC', 'URXDBUP', 'URDDUPLC', 'URXDPHP',
       'URDDPHLC', 'URXTBBA', 'URDBBALC'],
      dtype='object')

In [36]:
df4.head(10)

Unnamed: 0,participant_id,WTSBPRP,URXBCPP,URDBCPLC,URXBCEP,URDCEPLC,URXBDCP,URDBDCLC,URXDBUP,URDDUPLC,URXDPHP,URDDPHLC,URXTBBA,URDBBALC
0,109271.0,20156.439742,0.0707,1.0,0.0707,1.0,0.137,0.0,0.0707,1.0,0.886,0.0,0.0354,1.0
1,109277.0,51738.369518,0.709,0.0,0.0707,1.0,1.15,0.0,0.138,0.0,1.3,0.0,0.0354,1.0
2,109282.0,97190.5545,,,4.23,0.0,3.66,0.0,0.501,0.0,2.05,0.0,0.0354,1.0
3,109285.0,85548.221421,0.0707,1.0,4.42,0.0,2.27,0.0,0.0707,1.0,10.3,0.0,0.062,0.0
4,109288.0,9103.868995,0.347,0.0,0.145,0.0,0.859,0.0,0.0707,1.0,1.29,0.0,0.0354,1.0
5,109301.0,14215.93737,0.116,0.0,0.0707,1.0,5.18,0.0,0.45,0.0,0.701,0.0,0.0354,1.0
6,109302.0,5984.497323,0.0707,1.0,0.0707,1.0,4.81,0.0,0.171,0.0,0.732,0.0,0.0354,1.0
7,109303.0,16549.099643,0.0707,1.0,0.366,0.0,0.0707,1.0,0.0707,1.0,0.162,0.0,0.0354,1.0
8,109304.0,40089.354988,0.174,0.0,0.386,0.0,2.68,0.0,0.14,0.0,1.28,0.0,0.0354,1.0
9,109307.0,49745.101247,0.0707,1.0,0.0707,1.0,0.903,0.0,0.0707,1.0,1.55,0.0,0.121,0.0


In [37]:
df4 = df4.drop('WTSBPRP', axis = 1)

In [38]:
df4 = df4.rename(columns={
    'URXBCPP': '1_chloro_2_propyl_phosphate', 
    'URDBCPLC' : '1ch_2pro_comment', 
    'URXBCEP': 'bis_1_chloroethyl_phosphate',
    'URDCEPLC' : 'bis_1_chlo_phos_comment',
    'URXBDCP': '1_3_dichloro_2_propyl_phosphate', 
    'URDBDCLC': '1_3_di_2_pro_comment', 
    'URXDBUP': 'dibutyl_phosphate',
    'URDDUPLC': 'dibutyl_phos_comment', 
    'URXDPHP': 'diphenyl_phosphate',
    'URDDPHLC': 'diphe_phos_comment',
    'URXTBBA': '2_3_4_5_tetrabromobenzoic_acid',
    'URDBBALC': '2_3_4_5_tet_comment'
})

In [39]:
df4.isnull().sum()

participant_id                       0
1_chloro_2_propyl_phosphate        312
1ch_2pro_comment                   312
bis_1_chloroethyl_phosphate        311
bis_1_chlo_phos_comment            311
1_3_dichloro_2_propyl_phosphate    330
1_3_di_2_pro_comment               330
dibutyl_phosphate                  315
dibutyl_phos_comment               315
diphenyl_phosphate                 307
diphe_phos_comment                 307
2_3_4_5_tetrabromobenzoic_acid     307
2_3_4_5_tet_comment                307
dtype: int64

There are multiple columns that have over 300 NaN values. There may be rows that are missing all of these values but there may be rows that are missing some of the values. The decision was made to drop rows that are missing all values to ensure retention of meaningful data.

In [40]:
df4.columns #to make it easy to copy and paste the names of these columns without risking any typos

Index(['participant_id', '1_chloro_2_propyl_phosphate', '1ch_2pro_comment',
       'bis_1_chloroethyl_phosphate', 'bis_1_chlo_phos_comment',
       '1_3_dichloro_2_propyl_phosphate', '1_3_di_2_pro_comment',
       'dibutyl_phosphate', 'dibutyl_phos_comment', 'diphenyl_phosphate',
       'diphe_phos_comment', '2_3_4_5_tetrabromobenzoic_acid',
       '2_3_4_5_tet_comment'],
      dtype='object')

In [41]:
value_cols = [col for col in df4.columns if col != 'participant_id']
rows_all_nan = df4[value_cols].isna().all(axis=1)
print(f"Number of rows missing all FR values: {rows_all_nan.sum()}")

Number of rows missing all FR values: 307


In [42]:
# Drop rows where all value columns are NaN (excluding participant_id)
df4_cleaned = df4[~rows_all_nan].copy()

print(f"Number of rows dropped: {rows_all_nan.sum()}")

Number of rows dropped: 307


In [43]:
df4_cleaned.head(10)

Unnamed: 0,participant_id,1_chloro_2_propyl_phosphate,1ch_2pro_comment,bis_1_chloroethyl_phosphate,bis_1_chlo_phos_comment,1_3_dichloro_2_propyl_phosphate,1_3_di_2_pro_comment,dibutyl_phosphate,dibutyl_phos_comment,diphenyl_phosphate,diphe_phos_comment,2_3_4_5_tetrabromobenzoic_acid,2_3_4_5_tet_comment
0,109271.0,0.0707,1.0,0.0707,1.0,0.137,0.0,0.0707,1.0,0.886,0.0,0.0354,1.0
1,109277.0,0.709,0.0,0.0707,1.0,1.15,0.0,0.138,0.0,1.3,0.0,0.0354,1.0
2,109282.0,,,4.23,0.0,3.66,0.0,0.501,0.0,2.05,0.0,0.0354,1.0
3,109285.0,0.0707,1.0,4.42,0.0,2.27,0.0,0.0707,1.0,10.3,0.0,0.062,0.0
4,109288.0,0.347,0.0,0.145,0.0,0.859,0.0,0.0707,1.0,1.29,0.0,0.0354,1.0
5,109301.0,0.116,0.0,0.0707,1.0,5.18,0.0,0.45,0.0,0.701,0.0,0.0354,1.0
6,109302.0,0.0707,1.0,0.0707,1.0,4.81,0.0,0.171,0.0,0.732,0.0,0.0354,1.0
7,109303.0,0.0707,1.0,0.366,0.0,0.0707,1.0,0.0707,1.0,0.162,0.0,0.0354,1.0
8,109304.0,0.174,0.0,0.386,0.0,2.68,0.0,0.14,0.0,1.28,0.0,0.0354,1.0
9,109307.0,0.0707,1.0,0.0707,1.0,0.903,0.0,0.0707,1.0,1.55,0.0,0.121,0.0


In [44]:
file_path = '2017-2020/urine/6.P_SSFR.xpt'

df5, meta = pyreadstat.read_xport(file_path)
df5 = standardize_id_column(df5)

In [45]:
df5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4929 entries, 0 to 4928
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   participant_id  4929 non-null   float64
 1   WTSSBPP         4929 non-null   float64
 2   SSIPPP          3913 non-null   float64
 3   SSIPPPL         3913 non-null   float64
 4   SSBPPP          3923 non-null   float64
 5   SSBPPPL         3923 non-null   float64
dtypes: float64(6)
memory usage: 231.2 KB


In [46]:
df5.isnull().sum()

participant_id       0
WTSSBPP              0
SSIPPP            1016
SSIPPPL           1016
SSBPPP            1006
SSBPPPL           1006
dtype: int64

In [47]:
df5 = df5.drop('WTSSBPP', axis=1)

In [48]:
df5 = df5.rename(columns={
    'SSIPPP' : '2_isopropylphenyl_phenyl_phosphate',
    'SSIPPPL' : '2_isopropylphenyl_phenyl_phosphate_comment',
    'SSBPPP' : '4_tert_butylphenyl_phenyl_phosphate',
    'SSBPPPL' : '4_tert_butylphenyl_phenyl_phosphate_comment'
})


In [49]:
common_nan = get_common_nan_ids(df5, '2_isopropylphenyl_phenyl_phosphate', '4_tert_butylphenyl_phenyl_phosphate', id_col='participant_id')

Number of NaNs in 2_isopropylphenyl_phenyl_phosphate: 1016
Number of NaNs in 4_tert_butylphenyl_phenyl_phosphate: 1006
Number of IDs with NaNs in both columns: 1006


In [50]:
df5_cleaned = drop_rows_with_common_nan_ids(df5, '2_isopropylphenyl_phenyl_phosphate', '4_tert_butylphenyl_phenyl_phosphate', id_col='participant_id')

Rows dropped where both 2_isopropylphenyl_phenyl_phosphate and 4_tert_butylphenyl_phenyl_phosphate were NaN: 1006


### Iodine

Iodine is a trace element commonly found in foods and iodized salt, playing a critical role in thyroid function. The thyroid gland is essential for regulating metabolism and is particularly important for fetal and infant development [6].

Approximately 90% of ingested iodine is excreted in the urine. While urinary iodine concentration is not considered a reliable indicator of iodine status at the individual level, it can be used for assessing iodine sufficiency across populations [7].

In [51]:
file_path = '2017-2020/urine/7.P_UIO.xpt'

df6, meta = pyreadstat.read_xport(file_path)
df6 = standardize_id_column(df6)

In [52]:
df6.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4890 entries, 0 to 4889
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   participant_id  4890 non-null   float64
 1   WTSAPRP         4890 non-null   float64
 2   URXUIO          4600 non-null   float64
 3   URDUIOLC        4600 non-null   float64
dtypes: float64(4)
memory usage: 152.9 KB


In [53]:
df6 = df6.drop('WTSAPRP', axis = 1)

In [54]:
df6 = df6.rename(columns={
    'URXUIO' : 'urine_iodine',
    'URDUIOLC' : 'urine_iodine_comment'
})

In [55]:
df6.isna().sum()

participant_id            0
urine_iodine            290
urine_iodine_comment    290
dtype: int64

In [56]:
df6_clean = df6.dropna(subset = ['urine_iodine'])

### Mercury

Mercury is a heavy metal historically used in devices such as barometers and thermometers. At elevated levels, it is known to cause neurotoxicity, with particularly severe effects on fetal development. Environmental exposure—especially in occupational settings involving manufacturing or chemical production—is a common source of mercury-related toxicity.

Urinary mercury concentration is a standard method for assessing inorganic mercury exposure. Clinical symptoms may begin to appear at concentrations around 100 µg/L, and levels exceeding 800 µg/L can be fatal [9].

In [57]:
file_path = '2017-2020/urine/8.P_UHG.xpt'

df7, meta = pyreadstat.read_xport(file_path)
df7 = standardize_id_column(df7)

In [58]:
df7.head(10)

Unnamed: 0,participant_id,WTSAPRP,URXUHG,URDUHGLC
0,109266.0,28660.015986,0.09,1.0
1,109270.0,17900.682903,0.09,1.0
2,109273.0,80106.859617,0.09,1.0
3,109274.0,24512.27628,0.09,1.0
4,109287.0,25828.523003,0.09,1.0
5,109288.0,8535.018174,0.09,1.0
6,109290.0,12410.268374,1.27,0.0
7,109295.0,28235.246814,0.09,1.0
8,109300.0,66737.887353,0.09,1.0
9,109309.0,33019.729726,0.09,1.0


In [59]:
df7.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4890 entries, 0 to 4889
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   participant_id  4890 non-null   float64
 1   WTSAPRP         4890 non-null   float64
 2   URXUHG          4600 non-null   float64
 3   URDUHGLC        4600 non-null   float64
dtypes: float64(4)
memory usage: 152.9 KB


In [60]:
df7 = df7.drop('WTSAPRP', axis =1)

In [61]:
df7 = df7.rename(columns={
    'URXUHG' : 'urine_mercury',
    'URDUHGLC' : 'urine_mercury_comment'
})

In [62]:
df7_clean = df7.dropna(subset=['urine_mercury'])

In [63]:
df7_clean.head(5)

Unnamed: 0,participant_id,urine_mercury,urine_mercury_comment
0,109266.0,0.09,1.0
1,109270.0,0.09,1.0
2,109273.0,0.09,1.0
3,109274.0,0.09,1.0
4,109287.0,0.09,1.0


### Metals

There are other types of metals that can be found in urine. From this dataset from NHANES, the metals tested are: barium, cadmium, cobalt, cesium, molybdenum, manganese, lead, antiomny, tin, thallium, and tungsten. 

Barium, if consumed in high concentrations, it can cause cardiac arrhythmias or paralysis [8]

Cadmium is a known carcinogen which is excreted primarily by the renal system. Cadminum can lead to various health conditions throughout the body including kidney and liver dysfunctions. [10]




In [64]:
file_path = '2017-2020/urine/9.P_UM.xpt'

df8, meta = pyreadstat.read_xport(file_path)
df8 = standardize_id_column(df8)

In [65]:
df8.columns.to_list()

['participant_id',
 'WTSAPRP',
 'URXUBA',
 'URDUBALC',
 'URXUCD',
 'URDUCDLC',
 'URXUCO',
 'URDUCOLC',
 'URXUCS',
 'URDUCSLC',
 'URXUMO',
 'URDUMOLC',
 'URXUMN',
 'URDUMNLC',
 'URXUPB',
 'URDUPBLC',
 'URXUSB',
 'URDUSBLC',
 'URXUSN',
 'URDUSNLC',
 'URXUTL',
 'URDUTLLC',
 'URXUTU',
 'URDUTULC']

In [66]:
df8 = df8.rename(columns = {
    'URXUBA': 'urine_barium',
    'URDUBALC': 'barium_comment',
    'URXUCD': 'urine_cadmium',
    'URDUCDLC': 'cadmium_comment',
    'URXUCO': 'urine_cobalt',
    'URDUCOLC': 'cobalt_comment',
     'URXUCS': 'urine_cesium',
     'URDUCSLC': 'cesium_comment',
     'URXUMO': 'urine_molybdenum',
     'URDUMOLC': 'molybdenum_comment',
     'URXUMN': 'urine_manganese',
     'URDUMNLC':'manganese_comment',
     'URXUPB':'urine_lead',
     'URDUPBLC':'lead_comment',
     'URXUSB':'urine_antimony',
     'URDUSBLC':'antimony_comment',
     'URXUSN':'urine_tin',
     'URDUSNLC':'tin_comment',
     'URXUTL':'urine_thallium',
     'URDUTLLC':'thallium_comment',
     'URXUTU':'urine_tungsten',
     'URDUTULC':'tungsten_comment'
})

In [67]:
df8 = df8.drop('WTSAPRP', axis = 1)

In [68]:
df8.head(10)

Unnamed: 0,participant_id,urine_barium,barium_comment,urine_cadmium,cadmium_comment,urine_cobalt,cobalt_comment,urine_cesium,cesium_comment,urine_molybdenum,...,urine_lead,lead_comment,urine_antimony,antimony_comment,urine_tin,tin_comment,urine_thallium,thallium_comment,urine_tungsten,tungsten_comment
0,109266.0,0.359,0.0,0.039,1.0,0.214,0.0,2.16,0.0,8.0,...,0.17,0.0,0.016,1.0,0.14,1.0,0.064,0.0,0.013,1.0
1,109270.0,2.422,0.0,0.868,0.0,0.449,0.0,9.64,0.0,78.66,...,0.532,0.0,0.046,0.0,2.35,0.0,0.354,0.0,0.271,0.0
2,109273.0,0.37,0.0,0.213,0.0,0.274,0.0,2.33,0.0,33.09,...,0.28,0.0,0.082,0.0,0.14,1.0,0.078,0.0,0.036,0.0
3,109274.0,1.72,0.0,0.184,0.0,0.482,0.0,2.803,0.0,74.82,...,0.3,0.0,0.053,0.0,4.09,0.0,0.159,0.0,0.099,0.0
4,109287.0,7.531,0.0,0.215,0.0,0.303,0.0,2.85,0.0,142.88,...,0.244,0.0,0.092,0.0,3.91,0.0,0.161,0.0,0.231,0.0
5,109288.0,0.271,0.0,0.039,1.0,0.097,0.0,5.98,0.0,36.21,...,,,0.067,0.0,0.97,0.0,0.175,0.0,0.036,0.0
6,109290.0,0.72,0.0,0.631,0.0,0.379,0.0,11.91,0.0,111.07,...,0.641,0.0,0.116,0.0,1.31,0.0,0.577,0.0,0.057,0.0
7,109295.0,1.27,0.0,0.039,1.0,0.323,0.0,4.145,0.0,25.76,...,0.06,0.0,0.022,0.0,0.14,1.0,0.147,0.0,0.02,0.0
8,109300.0,0.69,0.0,0.166,0.0,0.171,0.0,2.129,0.0,15.42,...,0.13,0.0,0.016,1.0,0.14,1.0,0.063,0.0,0.013,1.0
9,109309.0,0.61,0.0,0.039,1.0,0.263,0.0,1.673,0.0,5.14,...,,,0.028,0.0,0.3,0.0,0.099,0.0,0.03,0.0


In [69]:
df8.isnull().sum()

participant_id          0
urine_barium          295
barium_comment        295
urine_cadmium         295
cadmium_comment       295
urine_cobalt          296
cobalt_comment        296
urine_cesium          295
cesium_comment        295
urine_molybdenum      295
molybdenum_comment    295
urine_manganese       295
manganese_comment     295
urine_lead            953
lead_comment          953
urine_antimony        295
antimony_comment      295
urine_tin             295
tin_comment           295
urine_thallium        295
thallium_comment      295
urine_tungsten        295
tungsten_comment      295
dtype: int64

As done prior, the plan is to delete rows that are missing most, if not all, of these values. 

In [70]:
df8.columns.to_list()

['participant_id',
 'urine_barium',
 'barium_comment',
 'urine_cadmium',
 'cadmium_comment',
 'urine_cobalt',
 'cobalt_comment',
 'urine_cesium',
 'cesium_comment',
 'urine_molybdenum',
 'molybdenum_comment',
 'urine_manganese',
 'manganese_comment',
 'urine_lead',
 'lead_comment',
 'urine_antimony',
 'antimony_comment',
 'urine_tin',
 'tin_comment',
 'urine_thallium',
 'thallium_comment',
 'urine_tungsten',
 'tungsten_comment']

In [71]:
value_cols = [col for col in df8.columns if col != 'participant_id']
rows_all_nan = df8[value_cols].isna().all(axis=1)
print(f"Number of rows missing all metal values: {rows_all_nan.sum()}")

Number of rows missing all metal values: 295


In [72]:
# Drop rows where all value columns are NaN (excluding participant_id)
df8_cleaned = df8[~rows_all_nan].copy()

print(f"Number of rows dropped: {rows_all_nan.sum()}")

Number of rows dropped: 295


### Nickel

Nickel is another heavy metal that can potentially cause health risks if exposed at high levels. The most common reaction to nickel is contact dermatitis; however, inhalation and ingestion is also a possibility. When assessing urine samples, greater than 10 mg/dL in nickel concentration may indicate excessive exposure and calls for thorough evaluation [11].

In [73]:
file_path = '2017-2020/urine/10.P_UNI.xpt'

df9, meta = pyreadstat.read_xport(file_path)
df9 = standardize_id_column(df9)

In [74]:
df9.columns.to_list()

['participant_id', 'WTSAPRP', 'URXUNI', 'URDUNILC']

In [75]:
df9 = df9.drop('WTSAPRP', axis=1)

In [76]:
df9 = df9.rename(columns={
    'URXUNI': 'urine_nickel',
    'URDUNILC': 'urine_nickel_comment'
})

In [77]:
df9 = df9.dropna(subset=['urine_nickel'])

In [78]:
df9.head(10)

Unnamed: 0,participant_id,urine_nickel,urine_nickel_comment
0,109266.0,0.46,0.0
1,109270.0,1.08,0.0
2,109273.0,0.91,0.0
3,109274.0,1.17,0.0
4,109287.0,4.01,0.0
5,109288.0,1.08,0.0
6,109290.0,2.72,0.0
7,109295.0,0.22,1.0
8,109300.0,0.55,0.0
9,109309.0,0.92,0.0


### Organophosphate Insecticides

The lower limit of detection for the insecticide is 0.1 ng/mL.

In [79]:
file_path = '2017-2020/urine/11.P_OPD.xpt'

df10, meta = pyreadstat.read_xport(file_path)
df10 = standardize_id_column(df10)

In [80]:
df10.head(10)

Unnamed: 0,participant_id,WTSBPRP,URXOP1,URDOP1LC,URXOP2,URDOP2LC,URXOP3,URDOP3LC,URXOP4,URDOP4LC,URXOP5,URDOP5LC,URXOP6,URDOP6LC
0,109271.0,20156.439742,0.682,0.0,0.288,0.0,0.179,0.0,0.0707,1.0,0.0707,1.0,0.0707,1.0
1,109277.0,51738.369518,4.45,0.0,15.4,0.0,2.59,0.0,0.281,0.0,0.377,0.0,0.0707,1.0
2,109282.0,97190.5545,1.58,0.0,21.0,0.0,1.31,0.0,1.08,0.0,0.26,0.0,0.0707,1.0
3,109285.0,85548.221421,1.26,0.0,1.3,0.0,0.49,0.0,0.0707,1.0,0.0707,1.0,0.0707,1.0
4,109288.0,9103.868995,0.196,0.0,1.12,0.0,0.0707,1.0,0.0707,1.0,0.0707,1.0,0.0707,1.0
5,109301.0,14215.93737,1.58,0.0,1.43,0.0,0.519,0.0,0.0707,1.0,0.15,0.0,0.0707,1.0
6,109302.0,5984.497323,3.29,0.0,35.8,0.0,1.18,0.0,1.13,0.0,0.456,0.0,0.0707,1.0
7,109303.0,16549.099643,0.426,0.0,,,0.202,0.0,0.0707,1.0,0.0707,1.0,0.0707,1.0
8,109304.0,40089.354988,1.52,0.0,3.53,0.0,1.5,0.0,0.0707,1.0,0.104,0.0,0.0707,1.0
9,109307.0,49745.101247,4.05,0.0,0.825,0.0,,,0.127,0.0,0.0707,1.0,0.0707,1.0


In [81]:
df10.shape

(4929, 14)

In [82]:
df10.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4929 entries, 0 to 4928
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   participant_id  4929 non-null   float64
 1   WTSBPRP         4929 non-null   float64
 2   URXOP1          4618 non-null   float64
 3   URDOP1LC        4618 non-null   float64
 4   URXOP2          4607 non-null   float64
 5   URDOP2LC        4607 non-null   float64
 6   URXOP3          4604 non-null   float64
 7   URDOP3LC        4604 non-null   float64
 8   URXOP4          4611 non-null   float64
 9   URDOP4LC        4611 non-null   float64
 10  URXOP5          4621 non-null   float64
 11  URDOP5LC        4621 non-null   float64
 12  URXOP6          4620 non-null   float64
 13  URDOP6LC        4620 non-null   float64
dtypes: float64(14)
memory usage: 539.2 KB


In [83]:
df10.isnull().sum()

participant_id      0
WTSBPRP             0
URXOP1            311
URDOP1LC          311
URXOP2            322
URDOP2LC          322
URXOP3            325
URDOP3LC          325
URXOP4            318
URDOP4LC          318
URXOP5            308
URDOP5LC          308
URXOP6            309
URDOP6LC          309
dtype: int64

In [84]:
df10.columns.to_list()

['participant_id',
 'WTSBPRP',
 'URXOP1',
 'URDOP1LC',
 'URXOP2',
 'URDOP2LC',
 'URXOP3',
 'URDOP3LC',
 'URXOP4',
 'URDOP4LC',
 'URXOP5',
 'URDOP5LC',
 'URXOP6',
 'URDOP6LC']

In [85]:
df10 = df10.drop('WTSBPRP', axis = 1)

In [86]:
df10 = df10.rename(columns={
 'URXOP1': 'dimethylphosphate_ng_mL',
 'URDOP1LC' : 'dimethylphosphate_comment',
 'URXOP2' : 'diethylphosphate_ng_mL',
 'URDOP2LC' : 'diethylphosphate_comment',
 'URXOP3' : 'dimethylthiophosphate_ng_mL',
 'URDOP3LC' :'dimethylthiophosphate_comment',
 'URXOP4': 'diethylthiophosphate_ng_mL',
 'URDOP4LC' :'diethylthiophosphate_comment',
 'URXOP5':'dimethyldithiophosphate_ng_mL',
 'URDOP5LC' :'dimethyldithiophosphate_comment' ,
 'URXOP6':'diethyldithiophosphate_ng_mL',
 'URDOP6LC':'diethyldithiophosphate_comment'
})

In [87]:
df10.columns.to_list()

['participant_id',
 'dimethylphosphate_ng_mL',
 'dimethylphosphate_comment',
 'diethylphosphate_ng_mL',
 'diethylphosphate_comment',
 'dimethylthiophosphate_ng_mL',
 'dimethylthiophosphate_comment',
 'diethylthiophosphate_ng_mL',
 'diethylthiophosphate_comment',
 'dimethyldithiophosphate_ng_mL',
 'dimethyldithiophosphate_comment',
 'diethyldithiophosphate_ng_mL',
 'diethyldithiophosphate_comment']

In [88]:
value_cols = [col for col in df10.columns if col != 'participant_id']
rows_all_nan = df10[value_cols].isna().all(axis=1)
print(f"Number of rows missing all OPD values: {rows_all_nan.sum()}")

Number of rows missing all OPD values: 307


In [89]:
# Drop rows where all value columns are NaN (excluding participant_id)
df10_cleaned = df10[~rows_all_nan].copy()

print(f"Number of rows dropped: {rows_all_nan.sum()}")

Number of rows dropped: 307


##### Perchlorate, Nitrate & Thiocyanate



In [90]:
file_path = '2017-2020/urine/12.P_PERNT.xpt'

df11, meta = pyreadstat.read_xport(file_path)
df11 = standardize_id_column(df11)

In [91]:
df11.head(10)

Unnamed: 0,participant_id,WTSAPRP,URXUP8,URDUP8LC,URXNO3,URDNO3LC,URXSCN,URDSCNLC
0,109266.0,28660.015986,0.57,0.0,36900.0,0.0,223.0,0.0
1,109270.0,17900.682903,4.02,0.0,48200.0,0.0,2960.0,0.0
2,109273.0,80106.859617,2.17,0.0,49900.0,0.0,4740.0,0.0
3,109274.0,24512.27628,6.95,0.0,2410.0,0.0,5290.0,0.0
4,109287.0,25828.523003,7.29,0.0,78700.0,0.0,603.0,0.0
5,109288.0,8535.018174,2.14,0.0,50600.0,0.0,370.0,0.0
6,109290.0,12410.268374,2.97,0.0,70600.0,0.0,1470.0,0.0
7,109295.0,28235.246814,2.14,0.0,24700.0,0.0,1010.0,0.0
8,109300.0,66737.887353,0.876,0.0,10800.0,0.0,270.0,0.0
9,109309.0,33019.729726,1.51,0.0,16800.0,0.0,171.0,0.0


In [92]:
df11.columns.to_list()

['participant_id',
 'WTSAPRP',
 'URXUP8',
 'URDUP8LC',
 'URXNO3',
 'URDNO3LC',
 'URXSCN',
 'URDSCNLC']

In [93]:
df11 = df11.drop('WTSAPRP', axis =1)

In [94]:
df11 = df11.rename(columns = {
 'URXUP8': 'perchlorate_urine_ng_mL',
 'URDUP8LC': 'perchlorate_comment',
 'URXNO3':'nitrate_urine_ng_mL',
 'URDNO3LC':'nitrate_comment',
 'URXSCN':'thiocyanate_urine_ng_mL',
 'URDSCNLC': 'thiocyanate_comment'
})

In [95]:
df11.columns.to_list()

['participant_id',
 'perchlorate_urine_ng_mL',
 'perchlorate_comment',
 'nitrate_urine_ng_mL',
 'nitrate_comment',
 'thiocyanate_urine_ng_mL',
 'thiocyanate_comment']

In [96]:
value_cols = [col for col in df11.columns if col != 'participant_id']
rows_all_nan = df11[value_cols].isna().all(axis=1)
print(f"Number of rows missing all PERNT values: {rows_all_nan.sum()}")

Number of rows missing all PERNT values: 391


In [97]:
# Drop rows where all value columns are NaN (excluding participant_id)
df11_cleaned = df11[~rows_all_nan].copy()

print(f"Number of rows dropped: {rows_all_nan.sum()}")

Number of rows dropped: 391


### Urine Pregnancy Test

Point of care urine pregnancy test was performed on women 20-44 years of age. 

In [98]:
file_path = '2017-2020/urine/13.P_UCPREG.xpt'

df12, meta = pyreadstat.read_xport(file_path)
df12 = standardize_id_column(df12)

In [99]:
df12.head()

Unnamed: 0,participant_id,URXPREG
0,109266.0,2.0
1,109284.0,2.0
2,109286.0,1.0
3,109291.0,2.0
4,109297.0,2.0


In [100]:
df12 = df12.rename(columns = {
    'URXPREG' : 'pregnancy_test_result'
})

In [101]:
df12 = df12.dropna(subset=['pregnancy_test_result'])

### Volatile Organic Compound (VOC) Metabolites

On the NHANES lab dataset, there are two VOC tables: P_UVOC and P_UVOC2. 


In [102]:
file_path = '2017-2020/urine/14.P_UVOC.xpt'

df13, meta = pyreadstat.read_xport(file_path)
df13 = standardize_id_column(df13)

In [103]:
df13.columns.to_list()

['participant_id',
 'WTSAPRP',
 'URX2MH',
 'URD2MHLC',
 'URX34M',
 'URD34MLC',
 'URXAAM',
 'URDAAMLC',
 'URXAMC',
 'URDAMCLC',
 'URXATC',
 'URDATCLC',
 'URXBMA',
 'URDBMALC',
 'URXBPM',
 'URDBPMLC',
 'URXCEM',
 'URDCEMLC',
 'URXCYHA',
 'URDCYALC',
 'URXCYM',
 'URDCYMLC',
 'URXDHB',
 'URDDHBLC',
 'URXGAM',
 'URDGAMLC',
 'URXHEM',
 'URDHEMLC',
 'URXHP2',
 'URDHP2LC',
 'URXHPM',
 'URDHPMLC',
 'URXIPM3',
 'URDPM3LC',
 'URXMAD',
 'URDMADLC',
 'URXMB3',
 'URDMB3LC',
 'URXPHG',
 'URDPHGLC',
 'URXPMM',
 'URDPMMLC',
 'URXTTC',
 'URDTTCLC']

In [104]:
url = "https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2017/DataFiles/P_UVOC.htm"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Get the first table
table = soup.find('table')
rows = table.find_all('tr')

# Extract header and rows
headers = [th.get_text(strip=True) for th in rows[0].find_all('th')]
data = [
    [td.get_text(strip=True) for td in row.find_all('td')]
    for row in rows[1:]
]

df_info = pd.DataFrame(data, columns=headers)

#Concurrently working on IBM data science certificate and a new learned skill was webscraping so the decision was made to utilize the skill to future-proof the cleaning process and to ensure there are no typos/the names are accurate

In [105]:
# Step 1: Create initial rename_dict from table (matching only columns in df13)
rename_dict = {
    row["VARIABLE NAME"]: to_snake_case(row["ANALYTE NAME"])
    for _, row in df_info.iterrows()
    if row["VARIABLE NAME"] in df13.columns
}

# Step 2: Clean comment code columns (ending in "LC")
unit_suffixes = ["_ng_ml", "_ug_l", "_mg_dl", "_umol_l", "_nmol_l"]
comment_renames = {}

for col in df13.columns:
    if col.endswith("LC") and col not in rename_dict:
        base_col = col[:-2]  # Remove 'LC'
        match_col = base_col.replace("URD", "URX")

        if match_col in rename_dict:
            clean_name = rename_dict[match_col]

            # Strip any unit suffix
            for unit in unit_suffixes:
                if clean_name.endswith(unit):
                    clean_name = clean_name[: -len(unit)]
                    break

            comment_renames[col] = f"{clean_name}_comment"

# Step 3: Merge comment renames into main rename_dict
rename_dict.update(comment_renames)

In [106]:
df13_cleaned = df13.rename(columns=rename_dict)

In [107]:
df13_cleaned.head()

Unnamed: 0,participant_id,WTSAPRP,2_methylhippuric_acid_ng_ml,2_methylhippuric_acid_comment,3__and_4_methylhippuric_acid_ng_ml,3__and_4_methylhippuric_acid_comment,n_acetyl_s_2_carbamoylethyl_l_cysteine_ng_ml,n_acetyl_s_2_carbamoylethyl_l_cysteine_comment,n_acetyl_s_n_methylcarbamoyl_l_cysteine_ng_ml,n_acetyl_s_n_methylcarbamoyl_l_cysteine_comment,...,mandelic_acid_ng_ml,mandelic_acid_comment,n_acetyl_s_4_hydroxy_2_butenyl_l_cysteine_ng_ml,n_acetyl_s_4_hydroxy_2_butenyl_l_cysteine_comment,phenylglyoxylic_acid_ng_ml,phenylglyoxylic_acid_comment,n_acetyl_s_3_hydroxypropyl_1_methyl_l_cysteine_ng_ml,n_acetyl_s_3_hydroxypropyl_1_methyl_l_cysteine_comment,2_thioxothiazolidine_4_carboxylic_acid,2_thioxothiazolidine_4_carboxylic_acid_comment
0,109266.0,28660.015986,3.54,1.0,18.7,0.0,19.0,0.0,12.9,0.0,...,37.2,0.0,0.424,1.0,75.6,0.0,60.6,0.0,7.9,1.0
1,109270.0,17900.682903,9.33,0.0,106.0,0.0,184.0,0.0,91.0,0.0,...,210.0,0.0,5.63,0.0,298.0,0.0,257.0,0.0,7.9,1.0
2,109273.0,80106.859617,90.1,0.0,491.0,0.0,122.0,0.0,543.0,0.0,...,213.0,0.0,31.2,0.0,438.0,0.0,927.0,0.0,7.9,1.0
3,109274.0,24512.27628,39.9,0.0,118.0,0.0,90.0,0.0,129.0,0.0,...,222.0,0.0,9.71,0.0,230.0,0.0,2350.0,0.0,37.2,0.0
4,109287.0,25828.523003,14.7,0.0,121.0,0.0,135.0,0.0,64.0,0.0,...,1270.0,0.0,11.2,0.0,807.0,0.0,240.0,0.0,7.9,1.0


In [108]:
df13_cleaned = df13_cleaned.drop('WTSAPRP',axis=1)

In [109]:
df13_cleaned.isnull().sum()

participant_id                                                0
2_methylhippuric_acid_ng_ml                                 565
2_methylhippuric_acid_comment                               565
3__and_4_methylhippuric_acid_ng_ml                          565
3__and_4_methylhippuric_acid_comment                        565
n_acetyl_s_2_carbamoylethyl_l_cysteine_ng_ml                565
n_acetyl_s_2_carbamoylethyl_l_cysteine_comment              565
n_acetyl_s_n_methylcarbamoyl_l_cysteine_ng_ml               565
n_acetyl_s_n_methylcarbamoyl_l_cysteine_comment             565
2_aminothiazoline_4_carboxylic_acid_ng_ml                   565
2_aminothiazoline_4_carboxylic_acid_comment                 565
n_acetyl_s_benzyl_l_cysteine_ng_ml                          565
n_acetyl_s_benzyl_l_cysteine_comment                        565
n_acetyl_s_n_propyl_l_cysteine_ng_ml                        565
n_acetyl_s_n_propyl_l_cysteine_comment                      565
n_acetyl_s_2_carboxyethyl_l_cysteine_ng_

In [110]:
# The three columns were missed since they did not follow the exact pattern. They were manually renamed

df13_cleaned = df13_cleaned.rename(columns ={
    'SEQN':'participant_id',
    'URDCYALC':'n_acetyl_s_1_cyano_2_hydroxyethyl_l_cysteine_comment',
    'URDPM3LC': 'n_acetyl_s_4_hydroxy_2_methyl_2_butenyl_l_cysteine_comment'
})

In [111]:
df13_cleaned.columns.to_list()

['participant_id',
 '2_methylhippuric_acid_ng_ml',
 '2_methylhippuric_acid_comment',
 '3__and_4_methylhippuric_acid_ng_ml',
 '3__and_4_methylhippuric_acid_comment',
 'n_acetyl_s_2_carbamoylethyl_l_cysteine_ng_ml',
 'n_acetyl_s_2_carbamoylethyl_l_cysteine_comment',
 'n_acetyl_s_n_methylcarbamoyl_l_cysteine_ng_ml',
 'n_acetyl_s_n_methylcarbamoyl_l_cysteine_comment',
 '2_aminothiazoline_4_carboxylic_acid_ng_ml',
 '2_aminothiazoline_4_carboxylic_acid_comment',
 'n_acetyl_s_benzyl_l_cysteine_ng_ml',
 'n_acetyl_s_benzyl_l_cysteine_comment',
 'n_acetyl_s_n_propyl_l_cysteine_ng_ml',
 'n_acetyl_s_n_propyl_l_cysteine_comment',
 'n_acetyl_s_2_carboxyethyl_l_cysteine_ng_ml',
 'n_acetyl_s_2_carboxyethyl_l_cysteine_comment',
 'n_acetyl_s_1_cyano_2_hydroxyethyl_l_cysteine_ng_ml',
 'n_acetyl_s_1_cyano_2_hydroxyethyl_l_cysteine_comment',
 'n_acetyl_s_2_cyanoethyl_l_cysteine_ng_ml',
 'n_acetyl_s_2_cyanoethyl_l_cysteine_comment',
 'n_acetyl_s_34_dihydroxybutyl_l_cysteine_ng_ml',
 'n_acetyl_s_34_dihydroxy

In [112]:
value_cols = [col for col in df13_cleaned.columns if col != 'participant_id']
rows_all_nan = df13_cleaned[value_cols].isna().all(axis=1)
print(f"Number of rows missing all VOC values: {rows_all_nan.sum()}")

Number of rows missing all VOC values: 565


In [113]:
# Drop rows where all value columns are NaN (excluding participant_id)
df13_cleaned = df13_cleaned[~rows_all_nan].copy()

print(f"Number of rows dropped: {rows_all_nan.sum()}")


Number of rows dropped: 565


In [114]:
file_path = '2017-2020/urine/15.P_UVOC2.xpt'

df14, meta = pyreadstat.read_xport(file_path)
df14 = standardize_id_column(df14)

In [115]:
df14.head()

Unnamed: 0,participant_id,WTVOC2PP,URXMUCA,URDMUCLC,URXPHMA,URDPMALC
0,109266.0,29122.785906,6.94,1.0,0.106,1.0
1,109270.0,18436.336755,213.0,0.0,0.185,0.0
2,109273.0,93177.905637,147.0,0.0,1.1,0.0
3,109274.0,27374.984127,244.0,0.0,0.171,0.0
4,109287.0,25946.229537,130.0,0.0,0.153,0.0


In [116]:
df14 = df14.rename(columns={
    'URXMUCA': 'trans_trans_muconic_acid_ng_ml',
    'URDMUCLC': 'trans_trans_muconic_acid_comment',
    'URXPHMA': 'phenylmercapturic_acid_ng_ml',
    'URDPMALC': 'phenylmercapturic_acid_comment'
})   

In [117]:
df14 = df14.drop('WTVOC2PP', axis=1)

In [118]:
df14.isnull().sum()

participant_id                        0
trans_trans_muconic_acid_ng_ml      994
trans_trans_muconic_acid_comment    994
phenylmercapturic_acid_ng_ml        994
phenylmercapturic_acid_comment      994
dtype: int64

In [119]:
common_nan = get_common_nan_ids(df14, 'trans_trans_muconic_acid_ng_ml', 'phenylmercapturic_acid_ng_ml', id_col='participant_id')

Number of NaNs in trans_trans_muconic_acid_ng_ml: 994
Number of NaNs in phenylmercapturic_acid_ng_ml: 994
Number of IDs with NaNs in both columns: 994


In [120]:
df14_cleaned = drop_rows_with_common_nan_ids(df14, 'trans_trans_muconic_acid_ng_ml', 'phenylmercapturic_acid_ng_ml', id_col='participant_id')

Rows dropped where both trans_trans_muconic_acid_ng_ml and phenylmercapturic_acid_ng_ml were NaN: 994


All of the urine labs have been cleaned. The dataframes from all of the urine labs will be collated into one large dataframe named urine_labs into csv file.

In [121]:
df_names = [var for var in globals() if isinstance(globals()[var], pd.DataFrame)]
print(df_names)

['__', 'df', '_4', 'df_cleaned', 'df1', '_11', 'df2', '_22', 'df2_cleaned', 'df3', '_29', 'df4', '_36', 'df4_cleaned', '_43', 'df5', 'df5_cleaned', 'df6', 'df6_clean', 'df7', '_58', 'df7_clean', '_63', 'df8', '_68', 'df8_cleaned', 'df9', '_78', 'df10', '_80', 'df10_cleaned', 'df11', '_91', 'df11_cleaned', 'df12', '_99', 'df13', 'df_info', 'df13_cleaned', '_107', 'df14', '_115', 'df14_cleaned']


In [122]:
urine_dfs = [
    df_cleaned,
    df1,
    df2_cleaned,
    df3,
    df4_cleaned,
    df5_cleaned,
    df6_clean,
    df7_clean,
    df8_cleaned,
    df9,
    df10_cleaned,
    df11_cleaned,
    df12,
    df13_cleaned,
    df14_cleaned
]

from functools import reduce

df_urine_combined = reduce(
    lambda left, right: pd.merge(left, right, on="participant_id", how="outer"),
    urine_dfs
)

In [123]:
df_urine_combined.to_csv("cleaned_urine_labs_combined.csv", index=False)

In [124]:
urine_df = pd.read_csv('cleaned_urine_labs_combined.csv')

urine_df.head(10)

Unnamed: 0,participant_id,albumin_urine_ug_mL,albumin_urine_mg_L,alb_comment,creatinine_urine_mg_dL,creatinine_urine_umol_L,creatinine_comment,alb_creat_ratio,arsenous_acid_ug_L,arsenous_acid_comment,...,phenylglyoxylic_acid_ng_ml,phenylglyoxylic_acid_comment,n_acetyl_s_3_hydroxypropyl_1_methyl_l_cysteine_ng_ml,n_acetyl_s_3_hydroxypropyl_1_methyl_l_cysteine_comment,2_thioxothiazolidine_4_carboxylic_acid,2_thioxothiazolidine_4_carboxylic_acid_comment,trans_trans_muconic_acid_ng_ml,trans_trans_muconic_acid_comment,phenylmercapturic_acid_ng_ml,phenylmercapturic_acid_comment
0,109266.0,5.5,5.5,0.0,36.0,3182.4,0.0,15.28,0.08,1.0,...,75.6,0.0,60.6,0.0,7.9,1.0,6.94,1.0,0.106,1.0
1,109270.0,4.0,4.0,0.0,165.0,14586.0,0.0,2.42,0.51,0.0,...,298.0,0.0,257.0,0.0,7.9,1.0,213.0,0.0,0.185,0.0
2,109271.0,2.4,2.4,0.0,32.0,2828.8,0.0,7.5,,,...,,,,,,,,,,
3,109273.0,4.9,4.9,0.0,121.0,10696.4,0.0,4.05,0.08,1.0,...,438.0,0.0,927.0,0.0,7.9,1.0,147.0,0.0,1.1,0.0
4,109274.0,12.8,12.8,0.0,120.0,10608.0,0.0,10.67,0.08,1.0,...,230.0,0.0,2350.0,0.0,37.2,0.0,244.0,0.0,0.171,0.0
5,109275.0,3.7,3.7,0.0,20.0,1768.0,0.0,18.5,,,...,,,,,,,,,,
6,109277.0,14.7,14.7,0.0,244.0,21569.6,0.0,6.02,,,...,,,,,,,,,,
7,109278.0,8.4,8.4,0.0,124.0,10961.6,0.0,6.77,,,...,,,,,,,,,,
8,109279.0,13.9,13.9,0.0,251.0,22188.4,0.0,5.54,,,...,,,,,,,,,,
9,109282.0,16.0,16.0,0.0,192.0,16972.8,0.0,8.33,,,...,,,,,,,,,,


In [125]:
urine_df.shape

(12795, 129)