# NHANES - Create csv file

The purpose of this notebook is to load the NHANES laboratory (TfR and ferritin), demographic, and questionnaire data files from the 2005-2010 and 2015-2018 cycles, combine the files from different years, merge the lab data with the demographic data, do some data cleaning, and save the output as a single clean CSV file.

*The years only go up to 2018 because the NHANES program suspended field operations in March 2020 due to the coronavirus disease 2019 (COVID-19) pandemic. As a result, data collection for the NHANES 2019-2020 cycle was not completed and the collected data are not nationally representative.* 

## Links to XPT files and data codebooks

**NHANES XPT files and data codebooks:**

**2017-2018** 
- TFR https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/TFR_J.htm
- Ferritin https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/FERTIN_J.htm
- Demographics https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/DEMO_J.htm
- Reproductive Health Questionnaire https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/RHQ_J.htm
- Depression Score https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/DPQ_J.htm

**2015-2016** 
- TFR https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/TFR_I.htm
- Ferritin https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/FERTIN_I.htm
- Demographics https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm
- Reproductive Health Questionnaire https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/RHQ_I.htm
- Depression Score https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DPQ_I.htm

--------------------------------------
**Years 2011-2014 must be excluded because they don't have ferritin and TfR measures.**

------------------------------------------------

**2009-2010** 
- TFR https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/TFR_F.htm
- Ferritin https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/FERTIN_F.htm
- Demographics https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/DEMO_F.htm
- Reproductive Health Questionnaire https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/RHQ_F.htm
- Depression Score https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/DPQ_F.htm

**2007-2008** 
- TFR https://wwwn.cdc.gov/Nchs/Nhanes/2007-2008/TFR_E.htm
- Ferritin https://wwwn.cdc.gov/Nchs/Nhanes/2007-2008/FERTIN_E.htm
- Demographics https://wwwn.cdc.gov/Nchs/Nhanes/2007-2008/DEMO_E.htm
- Reproductive Health Questionnaire https://wwwn.cdc.gov/Nchs/Nhanes/2007-2008/RHQ_E.htm
- Depression Score https://wwwn.cdc.gov/Nchs/Nhanes/2007-2008/DPQ_E.htm

**2005-2006** 
- TFR https://wwwn.cdc.gov/Nchs/Nhanes/2005-2006/TFR_D.htm
- Ferritin https://wwwn.cdc.gov/Nchs/Nhanes/2005-2006/FERTIN_D.htm
- Demographics https://wwwn.cdc.gov/Nchs/Nhanes/2005-2006/DEMO_D.htm
- Reproductive Health Questionnaire https://wwwn.cdc.gov/Nchs/Nhanes/2005-2006/RHQ_D.htm
- Depression Score https://wwwn.cdc.gov/Nchs/Nhanes/2005-2006/DPQ_D.htm


## Import packages

In [1]:
import pandas as pd
import numpy as np
import xport
from glob import glob

## Load Data Files

### TFR Data

In [2]:
tfr_files = glob('data/TFR*.XPT')

# Read each cycle's file and stack the rows (DataFrame.append was removed in pandas 2.0)
tfr = pd.concat(
    [pd.read_sas(file, format='xport', encoding='utf-8') for file in tfr_files],
    ignore_index=True,
)

### Ferritin Data

In [3]:
fer_files = glob('data/FERTIN*.XPT')

# Read each cycle's file and stack the rows into one DataFrame
ferritin = pd.concat(
    [pd.read_sas(file, format='xport', encoding='utf-8') for file in fer_files],
    ignore_index=True,
)

In [4]:
# Keep only the participant ID and the ferritin measure in ng/mL
ferritin = ferritin[['SEQN', 'LBXFER']]

### Demographic Data

In [5]:
demo_files = glob('data/DEMO*.XPT')

# Read each cycle's file and stack the rows into one DataFrame
demographics = pd.concat(
    [pd.read_sas(file, format='xport', encoding='utf-8') for file in demo_files],
    ignore_index=True,
)

### Reproductive Health Questionnaire (RHQ) Data

In [6]:
rhq_files = glob('data/RHQ*.XPT')

# Read each cycle's file and stack the rows into one DataFrame
rhq = pd.concat(
    [pd.read_sas(file, format='xport', encoding='utf-8') for file in rhq_files],
    ignore_index=True,
)

In [7]:
# Keep only SEQN and RHQ197 (renamed to 'months-postpartum' later in this notebook)
rhq = rhq[['SEQN', 'RHQ197']]

### Depression Score Data

In [8]:
dpq_files = glob('data/DPQ*.XPT')

# Read each cycle's file and stack the rows into one DataFrame
dpq = pd.concat(
    [pd.read_sas(file, format='xport', encoding='utf-8') for file in dpq_files],
    ignore_index=True,
)
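
The five loading cells above follow the same glob, read, and stack pattern, so they could share one helper. The sketch below is one possible way to write it; `load_cycles` and its `reader` parameter are hypothetical names introduced here, not part of the original notebook. It sorts the matched paths so the row order is reproducible and raises an error when no file matches the pattern.

```python
from glob import glob

import pandas as pd


def load_cycles(pattern, reader=None):
    """Read every file matching `pattern` and stack the rows into one DataFrame.

    `load_cycles` and `reader` are hypothetical names used for illustration.
    """
    if reader is None:
        # Default mirrors the read_sas call used in the loading cells
        def reader(path):
            return pd.read_sas(path, format='xport', encoding='utf-8')

    files = sorted(glob(pattern))
    if not files:
        raise FileNotFoundError(f'No files match {pattern!r}')
    return pd.concat([reader(path) for path in files], ignore_index=True)
```

With this helper, the TFR cell would reduce to `tfr = load_cycles('data/TFR*.XPT')`.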

#### Create the dpq score

There is a bug when reading XPT files: in certain fields, a float value of 0.0 is read in as 5.397605e-79 (the lowest IBM float value).


In [9]:
# See issue by checking value counts below
dpq['DPQ010'].value_counts()

5.397605e-79    19640
1.000000e+00     4270
2.000000e+00     1211
3.000000e+00      945
9.000000e+00       44
7.000000e+00       12
Name: DPQ010, dtype: int64

In [10]:
# Replace this value with 0
dpq = dpq.replace(5.397605346934028e-79, 0)

In [11]:
# Check that it worked
dpq['DPQ010'].value_counts()

0.0    19640
1.0     4270
2.0     1211
3.0      945
9.0       44
7.0       12
Name: DPQ010, dtype: int64
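
The value 5.397605e-79 equals 16**-65, the smallest positive normalized number in the IBM hexadecimal floating-point format that XPT files use. As an alternative to matching one exact float, the sketch below zeroes any numeric value whose magnitude falls under a small tolerance. The helper name `zero_out_tiny` and the tolerance `1e-70` are assumptions chosen for illustration.

```python
import pandas as pd

# Smallest positive normalized IBM hexadecimal float: 16**-65 (about 5.4e-79)
IBM_MIN_POSITIVE = 16.0 ** -65


def zero_out_tiny(df, tol=1e-70):
    """Return a copy of `df` with numeric values smaller in magnitude than `tol` set to 0."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include='number').columns
    values = out[numeric_cols]
    # NaN compares False, so missing values are left untouched
    out[numeric_cols] = values.mask(values.abs() < tol, 0.0)
    return out
```

Real questionnaire codes are whole numbers, so a tolerance this small cannot collide with a legitimate answer.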

In [12]:
# Replace value 7 (answer "refused") with 0
dpq = dpq.replace(7, 0)

In [13]:
# Replace value 9 (answer "don't know") with 0
dpq = dpq.replace(9, 0)
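
The two `replace` calls above apply to every column of `dpq`. A more targeted variant, sketched below, recodes 7 and 9 only in the questionnaire item columns so that an identifier such as SEQN can never be altered. It assumes the item columns are the ones whose names start with `DPQ`; `recode_nonresponse` is a hypothetical helper name.

```python
import pandas as pd


def recode_nonresponse(df, prefix='DPQ'):
    """Return a copy of `df` with 7 (refused) and 9 (don't know) set to 0 in item columns."""
    out = df.copy()
    # Assumption: questionnaire items are the columns whose names start with `prefix`
    item_cols = [col for col in out.columns if col.startswith(prefix)]
    out[item_cols] = out[item_cols].replace({7: 0, 9: 0})
    return out
```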

#### Create a new column that sums the depression variables into a score for each participant

In [14]:
# Make SEQN column the index so that it's not included in dpq score
dpq.set_index('SEQN', inplace=True)

In [15]:
# Select only numeric columns from 'dpq' DataFrame
numeric_columns = dpq.select_dtypes(include='number')

# Sum these numeric columns across rows to calculate 'dpq_score'
dpq['dpq_score'] = numeric_columns.sum(axis=1)

In [16]:
# Reset the index to its original state
dpq.reset_index(inplace=True)

In [17]:
# Keep only the SEQN and dpq_score columns
dpq = dpq[['SEQN','dpq_score']]
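
Summing every numeric column includes whatever columns the DPQ files contain besides SEQN, such as the follow-up item DPQ100 if it is present. If the intended score is the nine-item total (DPQ010 through DPQ090), naming the items explicitly makes that choice visible. The sketch below is one way to do it under that assumption; `DPQ_ITEMS` and `dpq_item_score` are hypothetical names, and `min_count=1` is an added choice that leaves the score missing for participants with no answers instead of scoring them 0.

```python
import pandas as pd

# Assumption: the score is the sum of the nine items DPQ010, DPQ020, ..., DPQ090
DPQ_ITEMS = [f'DPQ{n:03d}' for n in range(10, 100, 10)]


def dpq_item_score(df, items=DPQ_ITEMS):
    """Return the row-wise sum of `items`; NaN when a participant answered none of them."""
    # min_count=1 requires at least one non-missing item, otherwise the sum is NaN
    return df[items].sum(axis=1, min_count=1)
```

It would be applied before the columns are trimmed, for example `dpq['dpq_score'] = dpq_item_score(dpq)`.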

## Merge data to create Main Dataframe

In [18]:
# Merge the DataFrames on 'SEQN' (the participant ID)
nhanes = tfr.merge(demographics, on=['SEQN'], how='outer')

In [19]:
nhanes = nhanes.merge(dpq, on=['SEQN'], how='outer')

In [20]:
nhanes = nhanes.merge(rhq, on=['SEQN'], how='outer')

In [21]:
nhanes = nhanes.merge(ferritin, on=['SEQN'], how='outer')
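
The four merges above can also be written as a single fold over a list of DataFrames. The sketch below uses `functools.reduce` and adds `validate='one_to_one'`, an extra safeguard not in the original cells that makes pandas raise an error if SEQN is duplicated in any table. `merge_on_seqn` is a hypothetical helper name.

```python
from functools import reduce

import pandas as pd


def merge_on_seqn(frames):
    """Outer-merge a list of DataFrames on SEQN, left to right."""
    return reduce(
        lambda left, right: left.merge(
            right, on='SEQN', how='outer', validate='one_to_one'
        ),
        frames,
    )
```

For this notebook the call would be `nhanes = merge_on_seqn([tfr, demographics, dpq, rhq, ferritin])`.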

## Clean data & filter for variables of interest

### Replace the bug scientific notation value with 0 in the main dataframe

In [22]:
# Replace this value with 0 in the main dataframe
nhanes = nhanes.replace(5.397605346934028e-79, 0)

### Drop any non-female participants

In [23]:
# RIAGENDR values to drop: 1 (male) and '.' (missing)
values_to_drop = [1, '.']

In [24]:
# Drop rows whose 'RIAGENDR' value is in values_to_drop
nhanes = nhanes[~nhanes['RIAGENDR'].isin(values_to_drop)]

#### NHANES only asked about pregnancy status for females between 20 and 44 years of age at the time of the exam.
> Drop any participants under the age of 20 or over the age of 44

In [25]:
# Define the column and threshold value
column_to_check = 'RIDAGEYR'
threshold_value = 20

# Keep only rows where 'RIDAGEYR' is at least 20
nhanes = nhanes[nhanes[column_to_check] >= threshold_value]

In [26]:
# Define the column and threshold value
column_to_check = 'RIDAGEYR'
threshold_value = 44

# Keep only rows where 'RIDAGEYR' is at most 44
nhanes = nhanes[nhanes[column_to_check] <= threshold_value]
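
The sex and age filters above can be combined into one boolean mask. The sketch below keeps rows coded female explicitly, under the assumption that RIAGENDR is coded 1 for male and 2 for female; `filter_females_20_to_44` is a hypothetical helper name. `Series.between` is inclusive on both ends by default, matching the `>= 20` and `<= 44` comparisons.

```python
import pandas as pd


def filter_females_20_to_44(df):
    """Keep rows with RIAGENDR == 2 and RIDAGEYR between 20 and 44 inclusive."""
    # Assumption: RIAGENDR is coded 1 = male, 2 = female
    is_female = df['RIAGENDR'] == 2
    in_age_range = df['RIDAGEYR'].between(20, 44)
    return df[is_female & in_age_range]
```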

### Rename columns to be more readable

In [27]:
nhanes = nhanes.rename(columns={'LBXHGB': 'hemoglobin', 
                                'RHQ197': 'months-postpartum',
                                'RIAGENDR': 'sex',
                                'RIDAGEYR': 'age',
                                'RIDRETH1': 'race-ethnicity',
                                'DMDEDUC2': 'edu-level',
                                'DMDMARTL': 'maritial-status',
                                'RIDEXPRG': 'pregnancy-status',
                                'SDMVPSU': 'masked-variance-psu',
                                'SDMVSTRA': 'masked-variance-stratum',
                                'INDHHIN2': 'household-income',
                                'INDFMPIR': 'income-to-poverty-ratio',
                                'LBXFER': 'ferritin',
                                'LBXTFR': 'tfr'}) 

In [28]:
nhanes.drop(columns = ['RIDRETH3','DMDEDUC3',
                       'DMDHRAGE','INDFMIN2',
                       'RIDSTATR','RIDAGEMN','RIDEXMON',
                       'RIDEXAGM', 'DMQMILIZ', 'DMQADFC', 
                       'DMDBORN4','DMDCITZN', 'DMDYRSUS', 
                       'SIALANG','SIAPROXY','SIAINTRP',
                       'FIALANG','FIAPROXY', 'FIAINTRP',
                       'MIALANG', 'MIAPROXY', 'MIAINTRP',
                       'AIALANGA','DMDHHSIZ','DMDFMSIZ',
                       'DMDHHSZA','DMDHHSZB','DMDHHSZE',
                       'DMDHRGND','DMDHRBR4', 'DMDHREDU', 
                       'DMDHRMAR','DMDHSEDU','DMDHRAGZ', 
                       'DMDHREDZ', 'DMDHRMAZ', 'DMDHSEDZ',
                       'RIDAGEEX', 'DMDBORN', 'INDHHINC',
                       'INDFMINC', 'DMDHRBRN',
                       'DMQMILIT', 'DMDBORN2', 'DMDSCHOL', 
                       'DMDHRBR2', 'AIALANG'], inplace=True)

### Clean up 'pregnancy-status' column

In [29]:
# Pregnancy status codes: 1 = pregnant, 2 = not pregnant, 3 = could not be ascertained
nhanes['pregnancy-status'].value_counts()

2.0    5333
1.0     586
3.0     395
Name: pregnancy-status, dtype: int64

In [30]:
# Pregnancy status: recode 3 as 0 (grouped with not pregnant)
nhanes['pregnancy-status'] = nhanes['pregnancy-status'].replace(3, 0)

In [31]:
# Pregnancy status, make 2 = 0
nhanes['pregnancy-status'] = nhanes['pregnancy-status'].replace(2, 0)

In [32]:
# Pregnancy status: set to 2 for participants who are not pregnant and fewer than 13 months postpartum
nhanes.loc[(nhanes['months-postpartum'] < 13) & (nhanes['pregnancy-status'] == 0), 'pregnancy-status'] = 2

In [33]:
nhanes['pregnancy-status'].value_counts()

0.0    5408
1.0     586
2.0     320
Name: pregnancy-status, dtype: int64
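
The three recoding steps above can be collected into one function, which makes the final coding (0 = not pregnant or unknown, 1 = pregnant, 2 = postpartum under 13 months) easier to test. This is a sketch of the same logic; `recode_pregnancy` is a hypothetical helper name.

```python
import pandas as pd


def recode_pregnancy(status, months_postpartum):
    """Recode pregnancy status to 0 (not pregnant or unknown), 1 (pregnant), 2 (postpartum)."""
    # Codes 2 and 3 both become 0, as in the cells above
    recoded = status.replace({2: 0, 3: 0})
    # Not pregnant and fewer than 13 months postpartum -> 2
    postpartum = (months_postpartum < 13) & (recoded == 0)
    return recoded.mask(postpartum, 2)
```

It would be called with the two renamed columns, for example `recode_pregnancy(nhanes['pregnancy-status'], nhanes['months-postpartum'])`.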

### Create depression threshold column

In [34]:
# Score greater than or equal to 10 will be defined as depression
nhanes['depression'] = (nhanes['dpq_score'] >= 10).astype(int)
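
Because comparing a missing value with 10 yields False, participants without a `dpq_score` are coded 0 by the cell above. If missing scores should stay missing instead, one option is sketched below. This is an alternative under that assumption, not the notebook's method, and `depression_flag` is a hypothetical helper name.

```python
import pandas as pd


def depression_flag(score, threshold=10):
    """Return 1.0 where score >= threshold, 0.0 where below, and NaN where score is missing."""
    flag = (score >= threshold).astype(float)
    # Restore the missingness that the comparison turned into False
    return flag.where(score.notna())
```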

## Save csv

In [35]:
nhanes.to_csv('data/nhanes.csv', index=False)