# NHANES - Explore and clean data

The purpose of this notebook is to examine the dataframe for missing values, clean the data, and save the output as a single clean CSV file.

## Import packages

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency

## Load data

In [4]:
nhanes = pd.read_csv('data/nhanes.csv')

## Clean up dataset

### Explore outcome variable (dpq score) missing values 

In [5]:
# Check how many missing outcome variable values
nhanes['dpq_score'].isnull().sum()

207

In [6]:
# Is there a pattern to where the depression scores are missing or is it random?

In [7]:
# Check how many missing predictor variable values
nhanes['ferritin'].isnull().sum()

603

In [8]:
# How many rows are both ferritin and dpq scores null?

In [9]:
# Look at the overlap of ferritin null values and dpq score null values

# Create a boolean mask for rows where both 'ferritin' and 'dpq_score' are null
mask = nhanes['ferritin'].isnull() & nhanes['dpq_score'].isnull()

# Apply the mask to filter the DataFrame
overlap_nulls_df = nhanes[mask]

In [10]:
len(overlap_nulls_df)

207

In [11]:
len(nhanes)

6314

In [12]:
207/6314

0.03278428888184986

**Every row with a null dpq score also has a null ferritin value (207 of 6,314 rows, about 3.3%), so let's drop all rows with a null dpq score and see how many null ferritin rows are left**
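The overlap reasoning above can also be checked in a single step. The following is a minimal sketch on a toy dataframe (the values and the names `toy`, `overlap_table`, and `dpq_nulls_subset_of_ferritin` are illustrative assumptions, not part of the NHANES data): it cross-tabulates the two missingness flags and tests whether every null dpq score coincides with a null ferritin value.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the NHANES frame; the values are illustrative only.
toy = pd.DataFrame({
    "dpq_score": [3.0, np.nan, 0.0, np.nan, 7.0],
    "ferritin": [41.0, np.nan, np.nan, np.nan, 12.5],
})

dpq_null = toy["dpq_score"].isnull()
ferritin_null = toy["ferritin"].isnull()

# Cross-tabulate the two missingness flags to see all four combinations at once.
overlap_table = pd.crosstab(
    dpq_null, ferritin_null,
    rownames=["dpq_score null"], colnames=["ferritin null"],
)

# True when every row with a null dpq_score also has a null ferritin.
dpq_nulls_subset_of_ferritin = bool((~dpq_null | ferritin_null).all())
```

On the real data, the same two lines could be run against `nhanes` before the null dpq rows are dropped.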

In [13]:
nhanes = nhanes.dropna(subset=['dpq_score'])

In [14]:
nhanes['dpq_score'].isnull().sum()

0

In [15]:
nhanes['ferritin'].isnull().sum()

396

In [16]:
nhanes['pregnancy-status'].value_counts()

0.0    5212
1.0     575
2.0     320
Name: pregnancy-status, dtype: int64

In [17]:
# Check null ferritin values
ferritin_isnull = nhanes[nhanes['ferritin'].isnull()]

In [18]:
ferritin_isnull.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 396 entries, 27 to 6103
Data columns (total 20 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   SEQN                     396 non-null    float64
 1   tfr                      5 non-null      float64
 2   LBDTFRSI                 0 non-null      float64
 3   SDDSRVYR                 396 non-null    float64
 4   sex                      396 non-null    float64
 5   age                      396 non-null    float64
 6   race-ethnicity           396 non-null    float64
 7   edu-level                396 non-null    float64
 8   maritial-status          396 non-null    float64
 9   household-income         296 non-null    float64
 10  income-to-poverty-ratio  347 non-null    float64
 11  pregnancy-status         396 non-null    float64
 12  WTINT2YR                 396 non-null    float64
 13  WTMEC2YR                 396 non-null    float64
 14  masked-variance-psu     

### Check TFR missing values

In [19]:
nhanes['tfr'].isnull().sum()

423

In [20]:
tfr_isnull = nhanes[nhanes['tfr'].isnull()]

In [21]:
tfr_isnull

Unnamed: 0,SEQN,tfr,LBDTFRSI,SDDSRVYR,sex,age,race-ethnicity,edu-level,maritial-status,household-income,income-to-poverty-ratio,pregnancy-status,WTINT2YR,WTMEC2YR,masked-variance-psu,masked-variance-stratum,dpq_score,months-postpartum,ferritin,depression
27,31381.0,,,4.0,2.0,34.0,5.0,2.0,1.0,,2.41,0.0,72642.691949,76094.488614,2.0,44.0,7.0,,,0
28,31383.0,,,4.0,2.0,27.0,3.0,3.0,1.0,,4.22,0.0,66674.297133,70222.532080,2.0,54.0,6.0,,,0
33,31431.0,,,4.0,2.0,20.0,1.0,2.0,5.0,,,1.0,1547.159611,1537.424539,2.0,57.0,1.0,,,0
37,31502.0,,,4.0,2.0,38.0,3.0,3.0,1.0,,1.92,0.0,95837.707862,100391.672914,2.0,46.0,0.0,,,0
75,31861.0,,,4.0,2.0,22.0,5.0,5.0,6.0,,5.00,0.0,77329.946421,81717.303622,1.0,56.0,0.0,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6063,93291.0,,,9.0,2.0,24.0,5.0,4.0,5.0,99.0,,0.0,33328.911983,34812.142387,2.0,122.0,0.0,,,0
6067,93322.0,,,9.0,2.0,23.0,3.0,4.0,5.0,6.0,1.62,0.0,70690.643521,76003.688167,2.0,121.0,10.0,,48.6,1
6098,93619.0,,,9.0,2.0,44.0,5.0,2.0,1.0,14.0,1.72,0.0,13508.737701,13721.095669,2.0,121.0,0.0,,42.5,0
6099,93653.0,,,9.0,2.0,25.0,3.0,5.0,6.0,14.0,2.97,0.0,110170.177077,110393.459614,1.0,132.0,3.0,,,0


In [22]:
# Look at the overlap of tfr null values and dpq score null values

# Create a boolean mask for rows where both 'tfr' and 'dpq_score' are null
mask_tfr = nhanes['tfr'].isnull() & nhanes['dpq_score'].isnull()

# Apply the mask to filter the DataFrame
tfr_overlap_nulls_df = nhanes[mask_tfr]

In [23]:
tfr_overlap_nulls_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 0 entries
Data columns (total 20 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   SEQN                     0 non-null      float64
 1   tfr                      0 non-null      float64
 2   LBDTFRSI                 0 non-null      float64
 3   SDDSRVYR                 0 non-null      float64
 4   sex                      0 non-null      float64
 5   age                      0 non-null      float64
 6   race-ethnicity           0 non-null      float64
 7   edu-level                0 non-null      float64
 8   maritial-status          0 non-null      float64
 9   household-income         0 non-null      float64
 10  income-to-poverty-ratio  0 non-null      float64
 11  pregnancy-status         0 non-null      float64
 12  WTINT2YR                 0 non-null      float64
 13  WTMEC2YR                 0 non-null      float64
 14  masked-variance-psu      0 non-null   

Note: no rows have both tfr and dpq_score null. This is expected at this point, because all rows with a null dpq_score were already dropped above.

### Standardize ferritin values across years

> Different years used different processes for processing ferritin

"Serum Ft [15,16] and TfR [17,18] were analyzed using the immunoturbidimetric assay method via Roche kits on a Hitachi 912 clinical analyzer for 2005 to 2008 samples and on an Elecsys 170 for Ft [19] and Hitachi Mod P for TfR [20] for the 2009 to 2010 samples. Due to the use of different technologies for Ft assessment and following best-practice, concentrations were standardized using the formula,
 (https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/TFR_F.htm). "

https://www.sciencedirect.com/science/article/pii/S0022316623726187?via%3Dihub#sec2

- Three different methods were used across the years, so the values need to be standardized or the analysis limited to fewer years

https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/FERTIN_J.htm#LBXFER

The Roche Mod E170 analyzer was used for most of 2015-2016 and replaced with the Roche Cobas e601 analyzer in mid-2016. Randomly selected serum samples (n=188) from NHANES 2015-2016 participants, QC material, and proficiency testing specimens were measured using both instruments and the results were used to conduct the analysis. On average, ferritin values measured from the Roche e601 analyzer were 8.8% higher than values from the Roche Mod E170 (p<.0001). Data from the bridging study indicated the correlation coefficient (r) between the measurements was 0.999. Regression analyses were performed using Analyse-it, v4.30.4. Given that the data showed proportional differences in variability, a weighted Deming regression was chosen to adjust the ferritin results (ng/mL). The forward and backward equations are below:

Forward:    Y (e601) = 0.2243 (95%CI: -0.0069 – 0.4554) + X (E170) * 1.079 (95%CI: 1.070 – 1.088)

Backward:  Y (E170) = -0.2079 (95%CI: -0.4233 – 0.0074) + X (e601) * 0.9271 (95%CI: 0.9195 – 0.9348)

These regression equations should be used when examining trends of ferritin data across 2015-2016 and 2009-2010 cycles, or combining 2015-2016 data with these previous cycles. For analysis involving 2015-2016 data and data collected prior to 2009-2010 cycle, please refer to the documentation accompanying the 2009 -2010 (FERTIN_F) and 2003-2004 (L06TFR_C) ferritin data for additional adjustments.

Results in this 2015-2016 dataset from specimens analyzed using the Roche MOD E170 were adjusted using the above forward regression equation.
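To make the forward and backward equations quoted above concrete, here is a small worked sketch using only the point estimates (the confidence intervals are ignored). The function names `e170_to_e601` and `e601_to_e170` are hypothetical helpers introduced for illustration. Because the two equations come from a regression fit, they are only approximately inverse to each other.

```python
def e170_to_e601(x):
    """Forward equation: E170 ferritin (ng/mL) to e601-equivalent."""
    return 0.2243 + 1.079 * x


def e601_to_e170(y):
    """Backward equation: e601 ferritin (ng/mL) to E170-equivalent."""
    return -0.2079 + 0.9271 * y


value_e170 = 100.0
value_e601 = e170_to_e601(value_e170)   # forward conversion
round_trip = e601_to_e170(value_e601)   # back again; close to, but not exactly, 100
```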

In [24]:
nhanes['SDDSRVYR'].value_counts()

6.0     1377
4.0     1253
9.0     1240
5.0     1180
10.0    1057
Name: SDDSRVYR, dtype: int64

- 4 = 2005-2006
- 5 = 2007-2008
- 6 = 2009-2010
- 9 = 2015-2016
- 10 = 2017-2018

For the 2005-2006 and 2007-2008 cycles (values 4 and 5 in the SDDSRVYR column, respectively), transform the ferritin values using the following regression equation:
E170 = 10**(0.989 * log10(Hitachi 912) + 0.049), where "Hitachi 912" refers to the current 'ferritin' value.

In [25]:
# Apply the transformation to the 'ferritin' column for specific 'SDDSRVYR' groups
def transform_ferritin(row):
    if row['SDDSRVYR'] in [4.0, 5.0]:
        return 10**(0.989 * np.log10(row['ferritin']) + 0.049)
    else:
        return row['ferritin']

In [26]:
nhanes['transformed_ferritin'] = nhanes.apply(transform_ferritin, axis=1)
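The row-wise `apply` above could also be written as a vectorized operation, which is usually faster on larger frames. This is a sketch on a toy dataframe (the values and the names `toy`, `hitachi_cycles`, and `converted` are illustrative assumptions); it applies the same Hitachi 912 to E170 equation only to cycles 4 and 5 and leaves other rows, including nulls, unchanged.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the NHANES frame; the values are illustrative only.
toy = pd.DataFrame({
    "SDDSRVYR": [4.0, 5.0, 6.0, 9.0, 4.0],
    "ferritin": [10.0, 100.0, 100.0, 50.0, np.nan],
})

# Rows measured on the Hitachi 912 (cycles 4 and 5).
hitachi_cycles = toy["SDDSRVYR"].isin([4.0, 5.0])

# Same equation as transform_ferritin, computed for every row at once.
converted = 10 ** (0.989 * np.log10(toy["ferritin"]) + 0.049)

# Keep the original value outside cycles 4 and 5; use the converted value inside.
toy["transformed_ferritin"] = toy["ferritin"].where(~hitachi_cycles, converted)
```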

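For values 4, 5, 6, and 9 in the SDDSRVYR column, transform the values in 'transformed_ferritin' using the following forward equation, where "E170" refers to the current 'transformed_ferritin' value:
Y (e601) = 0.2243 (95%CI: -0.0069 – 0.4554) + X (E170) * 1.079 (95%CI: 1.070 – 1.088)

Note: the CDC documentation quoted above says that 2015-2016 results measured on the E170 were already adjusted with this forward equation, so confirm that cycle 9 needs this step before relying on the transformed values.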

In [27]:
# Apply the new transformation to the 'transformed_ferritin' column for specified 'SDDSRVYR' groups
def apply_forward_equation(row):
    if row['SDDSRVYR'] in [4.0, 5.0, 6.0, 9.0]:
        return 0.2243 + (row['transformed_ferritin'] * 1.079)
    else:
        return row['transformed_ferritin']

In [28]:
nhanes['transformed_ferritin'] = nhanes.apply(apply_forward_equation, axis=1)

In [29]:
nhanes['ferritin'].describe()

count    5711.000000
mean       51.989471
std        57.450924
min         1.040000
25%        19.000000
50%        37.000000
75%        66.000000
max      1720.000000
Name: ferritin, dtype: float64

In [30]:
nhanes['transformed_ferritin'].describe()

count    5711.000000
mean       56.933053
std        62.163773
min         1.040000
25%        21.285638
50%        40.878436
75%        72.517300
max      1856.104300
Name: transformed_ferritin, dtype: float64

### Check missing ferritin values

Is there a trend where the data is missing? Are hemoglobin levels also missing where ferritin is null?

In [31]:
# Analyze the distribution of missing 'ferritin' values across different variables
missing_value_analysis = nhanes[nhanes['ferritin'].isna()].describe(include='all')

In [32]:
# For a more detailed pattern analysis, let's check the proportion of missing values by 'SDDSRVYR'
missing_by_sddsrvyr = nhanes.groupby('SDDSRVYR')['ferritin'].apply(lambda x: x.isna().mean())

In [33]:
missing_by_sddsrvyr

SDDSRVYR
4.0     0.071828
5.0     0.077119
6.0     0.048656
9.0     0.066129
10.0    0.062441
Name: ferritin, dtype: float64

In [34]:
missing_by_race_ethnicity = nhanes.groupby('race-ethnicity')['ferritin'].apply(lambda x: x.isna().mean())

> - By Race-Ethnicity ('race-ethnicity'): The missing proportion of 'ferritin' values varies across different race-ethnicity groups, with the lowest missing rates in groups coded as 1.0 (5.0%) and 2.0 (4.8%), and the highest in group 4.0 (9.7%). This suggests that there might be a pattern in missingness related to race-ethnicity, with some groups having higher rates of missing data.

> - The higher missing rate in the race-ethnicity group coded as 4.0 (Non-Hispanic Black) might warrant further investigation to understand the underlying reasons.

#### Is there a statistically significant difference among the missing_by_race_ethnicity groups?

To check this we can conduct a Chi-square test for independence:
> The null hypothesis (H0) for the Chi-square test in this context is that there is no association between 'race-ethnicity' and the likelihood of 'ferritin' values being missing—that is, the proportion of missing 'ferritin' values is the same across all 'race-ethnicity' groups. The alternative hypothesis (Ha) is that there is an association, meaning the proportion of missing 'ferritin' values differs among the groups.

In [35]:
# Create a contingency table of 'race-ethnicity' groups and missing status of 'ferritin'
contingency_table = (
    nhanes.groupby('race-ethnicity')['ferritin']
    .apply(lambda x: pd.Series([x.isna().sum(), x.notna().sum()],
                               index=['Missing', 'Not Missing']))
    .unstack()
)

In [36]:
# Perform the Chi-square test for independence
chi2, p_value, dof, expected = chi2_contingency(contingency_table)


In [37]:
chi2, p_value, dof, expected

(41.000047697771,
 2.6877675431819974e-08,
 4,
 array([[  81.50843295, 1175.49156705],
        [  40.39757655,  582.60242345],
        [ 146.15752415, 2107.84247585],
        [  84.29670869, 1215.70329131],
        [  43.63975766,  629.36024234]]))

The Chi-square test for independence gave a Chi-square statistic of approximately 41.00 with 4 degrees of freedom and a p-value of approximately 2.7e-08. Because the p-value is far below the common significance level of 0.05, we reject the null hypothesis: the proportion of missing 'ferritin' values differs among the 'race-ethnicity' groups.
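For reference, the same kind of test can be set up with `pd.crosstab`, which builds the group-by-missingness table directly. This is a sketch on made-up data (the counts below are illustrative assumptions, not NHANES results), intended only to show the shape of the inputs and outputs of `chi2_contingency`.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data: three groups of 40 rows with 2, 3, and 10 missing ferritin values.
toy = pd.DataFrame({
    "race-ethnicity": [1.0] * 40 + [3.0] * 40 + [4.0] * 40,
    "ferritin": (
        [np.nan] * 2 + [30.0] * 38
        + [np.nan] * 3 + [30.0] * 37
        + [np.nan] * 10 + [30.0] * 30
    ),
})

# Rows: race-ethnicity group; columns: ferritin missing (False/True).
table = pd.crosstab(toy["race-ethnicity"], toy["ferritin"].isna())

# dof is (number of groups - 1) * (number of columns - 1).
chi2, p_value, dof, expected = chi2_contingency(table)
```

Note that `pd.crosstab` orders the columns as not missing (False) then missing (True), the reverse of the table built above, which does not affect the test statistic.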

> The analysis indicates that the likelihood of 'ferritin' values being missing is associated with the 'race-ethnicity' category of the respondents. This finding could be important for further analyses and in deciding how to handle the missing 'ferritin' values, as it suggests that the missingness may not be completely random.

Given the finding that the likelihood of 'ferritin' values being missing is associated with the 'race-ethnicity' category, it's crucial to handle these missing values in a way that minimizes bias and retains as much information as possible. 

> **Be sure to mention this in the discussion section: Imputation is likely preferable to preserve data points and maintain the representativeness of the dataset. However, the potential for bias from imputing data (or dropping null values) is important to consider, especially when using complex survey data such as NHANES.**

### Look also at household income and income-to-poverty ratio values

In [38]:
# Analyze the proportion of missing 'ferritin' values by 'household-income'
missing_by_household_income = nhanes.groupby('household-income')['ferritin'].apply(lambda x: x.isna().mean())


In [39]:
# Analyze the proportion of missing 'ferritin' values by 'income-to-poverty-ratio'
# For a more granular analysis, categorize 'income-to-poverty-ratio' into quantiles
nhanes['income_to_poverty_ratio_category'] = pd.qcut(nhanes['income-to-poverty-ratio'], q=4, duplicates='drop')
missing_by_income_to_poverty_ratio = nhanes.groupby('income_to_poverty_ratio_category')['ferritin'].apply(lambda x: x.isna().mean())


In [40]:
missing_by_household_income, missing_by_income_to_poverty_ratio

(household-income
 1.0     0.120567
 2.0     0.110497
 3.0     0.072000
 4.0     0.026144
 5.0     0.060000
 6.0     0.055357
 7.0     0.067416
 8.0     0.059585
 9.0     0.051829
 10.0    0.048583
 12.0    0.100000
 13.0    0.062500
 14.0    0.048117
 15.0    0.059155
 77.0    0.116667
 99.0    0.065217
 Name: ferritin, dtype: float64,
 income_to_poverty_ratio_category
 (-0.001, 0.96]    0.064631
 (0.96, 1.85]      0.064723
 (1.85, 3.63]      0.057325
 (3.63, 5.0]       0.060302
 Name: ferritin, dtype: float64)

The proportion of missing 'ferritin' values varies across 'household-income' codes, with rates above 11% in the two lowest income codes (12.1% for 1.0 and 11.0% for 2.0), which might indicate that economic factors play a role in the missingness of 'ferritin' data. However, the 'income-to-poverty-ratio' quartiles show nearly flat rates (5.7% to 6.5%), so the effect does not appear pronounced across broader economic categories.
The findings suggest that economic status, particularly at the lowest income levels, could influence the likelihood of 'ferritin' data being missing. This pattern was not formally tested here, but it might reflect access to healthcare, participation in certain parts of the survey, or other socio-economic factors influencing data collection.

### Check missing tfr values

In [41]:
# Is there a trend among the null tfr scores?
tfr_missing_by_sddsrvyr = nhanes.groupby('SDDSRVYR')['tfr'].apply(lambda x: x.isna().mean())
tfr_missing_by_race_ethnicity = nhanes.groupby('race-ethnicity')['tfr'].apply(lambda x: x.isna().mean())

In [42]:
tfr_missing_by_sddsrvyr

SDDSRVYR
4.0     0.071828
5.0     0.075424
6.0     0.048656
9.0     0.084677
10.0    0.068117
Name: tfr, dtype: float64

In [43]:
tfr_missing_by_race_ethnicity

race-ethnicity
1.0    0.052506
2.0    0.051364
3.0    0.055457
4.0    0.101538
5.0    0.101040
Name: tfr, dtype: float64
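The same groupby-and-lambda pattern is repeated for each variable in this notebook. One possible way to factor it out is a small helper; `missing_rate_by` is a hypothetical name introduced here, and the toy values are illustrative only.

```python
import numpy as np
import pandas as pd


def missing_rate_by(df, group_col, target_col):
    """Proportion of null target_col values within each level of group_col."""
    return df[target_col].isna().groupby(df[group_col]).mean()


# Toy stand-in for the NHANES frame; the values are illustrative only.
toy = pd.DataFrame({
    "SDDSRVYR": [4.0, 4.0, 5.0, 5.0, 5.0, 5.0],
    "tfr": [np.nan, 3.1, 2.8, np.nan, 4.0, 3.3],
})

rates = missing_rate_by(toy, "SDDSRVYR", "tfr")
```

With this helper, a call such as `missing_rate_by(nhanes, 'race-ethnicity', 'tfr')` would be expected to give the same proportions as the lambda version above.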

### Combine race and ethnicity categories based on previous paper

The previous paper used three categories: Hispanic, non-Hispanic White, and non-Hispanic Black (from RIDRETH1). Explain the reasoning for this and the potential bias, as well as future directions. We will also keep the 'Other' category, as it seems odd to exclude it.

In [44]:
# Combine groups 1 and 2 (the two Hispanic categories) into a single group coded 1
nhanes['race-ethnicity'] = nhanes['race-ethnicity'].replace([1, 2], 1)

In [46]:
nhanes['race-ethnicity'].value_counts()

3.0    2254
1.0    1880
4.0    1300
5.0     673
Name: race-ethnicity, dtype: int64
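To make the combined codes easier to read in later tables and plots, they could be mapped to labels. This is a sketch under assumptions: the label for code 4 follows the note earlier in this notebook, code 1 is the combined Hispanic group, and the labels for codes 3 and 5 are assumed from the RIDRETH1 coding and should be checked against the codebook. The names `race_ethnicity_labels`, `codes`, `combined`, and `labelled` are illustrative.

```python
import pandas as pd

# Assumed labels for the combined coding; verify against the NHANES codebook.
race_ethnicity_labels = {
    1.0: "Hispanic",
    3.0: "Non-Hispanic White",
    4.0: "Non-Hispanic Black",
    5.0: "Other",
}

# Toy column with one row per original code.
codes = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name="race-ethnicity")

# Same recoding as above: fold code 2 into code 1.
combined = codes.replace([1, 2], 1)

# Map the numeric codes to readable labels.
labelled = combined.map(race_ethnicity_labels)
```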

## Save csv

In [51]:
# Note: this writes to the same path the data was loaded from, so the original file is overwritten
nhanes.to_csv('data/nhanes.csv', index=False)