# HDS 5210 Final Project: Osteoporosis in Multiple Data Sets
### Elisabeth C. DeMarco

This project examines data about osteoporosis in women in the United States using 2 data sources: the National Health and Nutrition Examination Survey (NHANES) and the Study of Women's Health Across the Nation (SWAN). While NHANES is continuously run, SWAN was most recently conducted between 2006-2008. Thus, the present project focuses on NHANES 2007-2008 and 2006-2008 data. 

All data/documentation can be found at the following links: 
* NHANES:
    * Demographics:
        * Documentation: https://wwwn.cdc.gov/Nchs/Nhanes/2007-2008/DEMO_E.htm
        * Data: https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Demographics&CycleBeginYear=2007
            * File name: `DEMO_E.XPT`
    * Ostoporosis Questionnaire Module: 
        * Documentation: https://wwwn.cdc.gov/Nchs/Nhanes/2007-2008/OSQ_E.htm
        * Data: https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Questionnaire&CycleBeginYear=2007
             * File name: `OSQ_E.XPT`
    * Current Health Status: 
        * Documentation: https://wwwn.cdc.gov/Nchs/Nhanes/2007-2008/HSQ_E.htm
        * Data: https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Questionnaire&CycleBeginYear=2007
            * File name: `HSQ_E.XPT`
    * Mental Health - Depression Screener: 
        * Documentation: https://wwwn.cdc.gov/Nchs/Nhanes/2007-2008/DPQ_E.htm
        * Data: https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Questionnaire&CycleBeginYear=2007
            * File name: `DPQ_E.XPT`

* SWAN:
    * Study of Women's Health Across the Nation (SWAN) Cross-Sectional Screener Dataset, [United States], 
        1995-1997 (ICPSR 4368): https://www.icpsr.umich.edu/web/ICPSR/studies/4368/summary
        * Codebook and data files are available for download at the same site
            * File Name: '04368-0001-Data.tsv'
    * Study of Women's Health Across the Nation (SWAN) 2006-2008: https://www.icpsr.umich.edu/web/ICPSR/studies/32961
        * Codebook and data file are available for download at the same site
            * File Name: '32961-0001-Data.tsv'

Files were saved to the `/final` directory for ease of access.  

## Study Objectives
I will be working with a medical student for her 2022 Summer Research Fellowship to examine associations between osteoporosis and mental health, with a particular focus on mental health treatment and gender disparities. We plan to use secondary data from the National Health and Nutrition Examination Survey (NHANES), a nationally representative survey comprised of in-home questionnaires and physical examinations administered at a mobile site. We will examine a subset of individuals with osteoporosis for mental health outcomes and treatment. In particular, we will use the data on osteoporosis outcomes (such as broken bones and diagnosis), general impression of mental health, and PHQ-9 scores (a commonly used screening tool for depression). Treatment data could be gathered from reports of prescription medication and mental health encounters in other modules of NHANES not included above. The larger project will include NHANES data from multiple years, using those years in which NHANES collected data on osteoporosis.

As no one dataset contains all the information one typically needs, cross-comparison between two similar data sources is often required. The current project will focus on proor-of-concept, data management, and exploratory analysis. In addition, I will compare data from NHANES with SWAN to determine whether NHANES gathers comparable information to SWAN, a longitudinal in-person study. From the current project, I will identify variables that could be harmonized and compared between the two data sources. This cross-sectional analysis will be the first step towards future comparisons between these two data sets, including later longitudinal study. This will be particularly useful as SWAN stopped collecting data in 2008, while NHANES continues to collect data. If the conclusions drawn from both studies are similar regarding women's mental health and osteoporosis outcomes, NHANES would be a good candidate data source to continue to monitor national trends in these areas. If these conclusions are different, the necessity of longitudinal studies such as SWAN would be underscored.

## Variables of Interest

The documentation for both datasets was examined and a final set of variables of interest collected. These are detailed in the `DeMarco_HDS5210Final_VariableFile.xlsx` file. Concepts of interest are summarized below:
* Age
* Sex
* Race/Ethnicity
* Education
* Osteoporosis diagnosis
* Broken bone
* Fracture of wrist or hip
* Pain interference with activities
* Feeling nervous or anxious in last month
* Felt depressed in last 2 weeks
* Overall health

## Creating the Analytic Data Set

SWAN data for 2006-2008 is limited to "Women age 51 through 63" while NHANES samples a broader portion of the US population. To facilitate accurate comparisons, NHANES data was limited to women age 51 through 63. 

Although NHANES is designed for weighted analysis, the present analysis uses unweighted counts for initial comparisons, as SWAN data does not use any weighting or oversampling. Variables that could be harmonized were identified. Common names were assigned to these variables to create a single set. The SWAN and NHANES data were then concatenated to create a composite analytic data set.

In [1]:
# Import Packages and read in data
import pandas as pd
import numpy as np

# Read in tsv file from SWAN data
SWAN_2008 = pd.read_csv('32961-0001-Data.tsv', sep = '\t')
SWAN_demographics = pd.read_csv('04368-0001-Data.tsv', sep = '\t')

# Read in NHANES files
NHANES_osteoporosis = pd.read_sas('OSQ_E.XPT')
NHANES_demographics = pd.read_sas('DEMO_E.XPT')
NHANES_currenthealth = pd.read_sas('HSQ_E.XPT')
NHANES_depression = pd.read_sas('DPQ_E.XPT')

  interactivity=interactivity, compiler=compiler, result=result)


In [2]:
# Create a function to subset columns given a list of variables
def column_select (dataframe, l):
    '''dataframe, list -> dataframe
    This function takes two arguments, a dataframe and a list. It 
    returns a dataframe with only those columns matching the input
    of the list provided. 
    '''
    smalldata = dataframe[l]
    return smalldata   

In [24]:
# Pull the variables of interest from the SWAN 2008 file
SWAN_2008_Analytic = column_select(SWAN_2008, ['SWANID', 'AGE10',
                                              'OSTEOPR10', 'BROKEBO10',
                                              'BONES110',
                                              'PAINTRF10', 'NERV4WK10',
                                              'DEPRESS10', 'OVERHLT10'])

# Pull the variables of interest from the SWAN crosssectional file
SWAN_demo_Analytic = column_select(SWAN_demographics, ['SWANID', 'RACE', 'DEGREE'])

In [25]:
SWAN_2008_Analytic.head

<bound method NDFrame.head of       SWANID AGE10 OSTEOPR10 BROKEBO10 BONES110 PAINTRF10 NERV4WK10 DEPRESS10  \
0      10046    62         1         0                  1         6         1   
1      10056    61         1         0                  1         6         1   
2      10153    61         1         0                  1         6         1   
3      10196    56         1         0                  1         4         1   
4      10245    57         1         0                  1         5         2   
...      ...   ...       ...       ...      ...       ...       ...       ...   
2240   99805    52         1         0                  1         6         1   
2241   99809    53         1         0                  1         4         2   
2242   99888    58         1         0                  4         3         4   
2243   99898    55         1         0                  1         6         2   
2244   99962    57         1         0                  1         5         1  

In [26]:
# Use a left-join on SWANID to create an overall SWAN2008 analytic dataset
SWAN_full = SWAN_2008_Analytic.merge(SWAN_demo_Analytic, how = 'left', left_on = 'SWANID', right_on = 'SWANID')

SWAN_full.head

<bound method NDFrame.head of       SWANID AGE10 OSTEOPR10 BROKEBO10 BONES110 PAINTRF10 NERV4WK10 DEPRESS10  \
0      10046    62         1         0                  1         6         1   
1      10056    61         1         0                  1         6         1   
2      10153    61         1         0                  1         6         1   
3      10196    56         1         0                  1         4         1   
4      10245    57         1         0                  1         5         2   
...      ...   ...       ...       ...      ...       ...       ...       ...   
2240   99805    52         1         0                  1         6         1   
2241   99809    53         1         0                  1         4         2   
2242   99888    58         1         0                  4         3         4   
2243   99898    55         1         0                  1         6         2   
2244   99962    57         1         0                  1         5         1  

In [27]:
# Missing values in the SWAN data are denoted with '-7', '-8', '-9' and are replaced with NaN
SWAN_analytic = SWAN_full.replace(['-7', '-8', '-9', ' '], np.nan)

In [28]:
# Review that missing data has been appropriately removed
SWAN_analytic['DEPRESS10'].value_counts()

1    1452
2     442
3     144
4      52
Name: DEPRESS10, dtype: int64

In [38]:
SWAN_analytic

Unnamed: 0,SWANID,AGE10,OSTEOPR10,BROKEBO10,BONES110,PAINTRF10,NERV4WK10,DEPRESS10,OVERHLT10,RACE,DEGREE
0,10046,62,1,0,,1,6,1,1,2,2
1,10056,61,1,0,,1,6,1,2,4,3
2,10153,61,1,0,,1,6,1,3,3,2
3,10196,56,1,0,,1,4,1,3,2,5
4,10245,57,1,0,,1,5,2,2,4,2
...,...,...,...,...,...,...,...,...,...,...,...
2240,99805,52,1,0,,1,6,1,3,1,4
2241,99809,53,1,0,,1,4,2,1,4,4
2242,99888,58,1,0,,4,3,4,5,3,3
2243,99898,55,1,0,,1,6,2,3,4,5


In [8]:
# Subset the columns from the NHANES data sets
NHANES_demo_analytic = column_select(NHANES_demographics, ['SEQN', 'WTINT2YR', 
                                                          'RIDAGEYR', 'RIAGENDR',
                                                          'RIDRETH1',
                                                          'DMDEDUC2'])
NHANES_osteo_analytic = column_select(NHANES_osteoporosis, ['SEQN', 'OSQ060',
                                                           'OSQ010A', 'OSQ010B'])

NHANES_health_analytic = column_select(NHANES_currenthealth, ['SEQN', 'HSQ493', 'HSQ496', 'HSD010'])

NHANES_dep_analytic = column_select(NHANES_depression, ['SEQN', 'DPQ020'])

In [9]:
# Create a function for NHANES joins, which rely on 'SEQN'
def NHANES_join(df1, df2, method):
    '''(dataframe, dataframe, string) -> dataframe
    This function takes 2 dataframes from NHANES data and a merge method based on pd.merge and
    creates a merged dataframe. This function assumes both dataframes include the SEQN variable used
    as a unique identifier in NHANES.
    '''
    merged = df1.merge(df2, how = method, left_on = 'SEQN', right_on = 'SEQN')
    
    return merged

In [10]:
# Use the NHANES-join function to sequentially join dataframes and create the full analytic dataset
NHANES_demo_osteo = NHANES_join(NHANES_demo_analytic, NHANES_osteo_analytic, 'left')
NHANES_do_health = NHANES_join(NHANES_demo_osteo, NHANES_health_analytic, 'left')
NHANES_full = NHANES_join(NHANES_do_health, NHANES_dep_analytic, 'left')

In [11]:
# View the merged data set
NHANES_full.head

<bound method NDFrame.head of           SEQN      WTINT2YR  RIDAGEYR  RIAGENDR  RIDRETH1  DMDEDUC2  OSQ060  \
0      41475.0  59356.356426      62.0       2.0       5.0       3.0     2.0   
1      41476.0  35057.218405       6.0       2.0       5.0       NaN     NaN   
2      41477.0   9935.266183      71.0       1.0       3.0       3.0     2.0   
3      41478.0  12846.712058       1.0       2.0       3.0       NaN     NaN   
4      41479.0   8727.797555      52.0       1.0       1.0       1.0     2.0   
...        ...           ...       ...       ...       ...       ...     ...   
10144  51619.0   5197.083889      61.0       1.0       1.0       1.0     2.0   
10145  51620.0  27909.120820      50.0       2.0       3.0       2.0     2.0   
10146  51621.0  11057.659484      17.0       1.0       2.0       NaN     NaN   
10147  51622.0   9842.672903      60.0       2.0       4.0       1.0     2.0   
10148  51623.0  24692.989537      72.0       1.0       3.0       1.0     2.0   

       OS

In [12]:
# To correct the data types of HSQ493, HSQ496, and DPQ020, '77' was chosen to fill NaN
NHANES_full = NHANES_full.fillna(77)

In [46]:
# All columns except weighting changed to integer type
NHANES_analytic = NHANES_full[['SEQN', 'RIDAGEYR', 'RIAGENDR', 'RIDRETH1', 'DMDEDUC2', 'OSQ060',
                            'OSQ010A', 'OSQ010B', 'HSQ493', 'HSQ496', 'HSD010', 'DPQ020']].astype('int64')

NHANES_analytic.head

<bound method NDFrame.head of         SEQN  RIDAGEYR  RIAGENDR  RIDRETH1  DMDEDUC2  OSQ060  OSQ010A  \
0      41475        62         2         5         3       2        2   
1      41476         6         2         5        77      77       77   
2      41477        71         1         3         3       2        2   
3      41478         1         2         3        77      77       77   
4      41479        52         1         1         1       2        2   
...      ...       ...       ...       ...       ...     ...      ...   
10144  51619        61         1         1         1       2        2   
10145  51620        50         2         3         2       2        2   
10146  51621        17         1         2        77      77       77   
10147  51622        60         2         4         1       2        2   
10148  51623        72         1         3         1       2        2   

       OSQ010B  HSQ493  HSQ496  HSD010  DPQ020  
0            2       7      10       3      

In [47]:
# Replace missing values for the NHANES data set (77, 99)
NHANES_analytic = NHANES_analytic.replace([77, 99, ' '], np.nan)

In [48]:
NHANES_analytic.head

<bound method NDFrame.head of         SEQN  RIDAGEYR  RIAGENDR  RIDRETH1  DMDEDUC2  OSQ060  OSQ010A  \
0      41475      62.0         2         5       3.0     2.0      2.0   
1      41476       6.0         2         5       NaN     NaN      NaN   
2      41477      71.0         1         3       3.0     2.0      2.0   
3      41478       1.0         2         3       NaN     NaN      NaN   
4      41479      52.0         1         1       1.0     2.0      2.0   
...      ...       ...       ...       ...       ...     ...      ...   
10144  51619      61.0         1         1       1.0     2.0      2.0   
10145  51620      50.0         2         3       2.0     2.0      2.0   
10146  51621      17.0         1         2       NaN     NaN      NaN   
10147  51622      60.0         2         4       1.0     2.0      2.0   
10148  51623      72.0         1         3       1.0     2.0      2.0   

       OSQ010B  HSQ493  HSQ496  HSD010  DPQ020  
0          2.0     7.0    10.0     3.0     0

In [58]:
# To facilitate comparison betweeen SWAN and NHANES data, NHANES data limited to women age 51 - 63
ageFilter = (NHANES_analytic['RIDAGEYR'] >= 51.0) & (NHANES_analytic['RIDAGEYR'] <= 63.0)
genderFilter = NHANES_analytic['RIAGENDR'] == 2.0
finalFilter = ageFilter & genderFilter

NHANES_filtered = NHANES_analytic[finalFilter]
NHANES_filtered.shape

(659, 12)

In [29]:
# Create variables in SWAN data for hip fracture and wrist fracture to harmonize with NHANES
# View the unique values
SWAN_analytic['BONES110'].value_counts()

RIGHT FOOT                        4
LEFT RIBS                         3
LEFT PINKY TOE                    3
LEFT WRIST                        3
LEFT ELBOW                        3
RIGHT ANKLE                       2
BOTH FEET                         1
LEFT METATARSAL BONE              1
5TH METATARSEL IN RT FOOT         1
LEFT ULNA                         1
LEFT ULNA & RADIUS                1
LEFT HAND NEAR WRIST              1
NOSE                              1
LEFT SMALL TOE                    1
LEFT FOOT                         1
RIGHT SHOULDER                    1
LEFT MIDDLE TOE                   1
RIBS BOTH SIDES                   1
RIGHT FOURTH TOE                  1
RIGHT TOE                         1
RIGHT GREAT TOE                   1
LT PINKY TOE, LT 5TH DIGIT        1
RIGHT WRIST                       1
RIGHT 3RD TOE                     1
RIGHT HIP                         1
RIGHT THUMB                       1
LEFT TOE                          1
LEFT HAND 1ST & 2ND FINGER  

In [21]:
# Create a new column for wrist fracture
SWAN_analytic['WristFrac'] = [1 if (bone.lower == 'both wrists') else 0 for bone in SWAN_analytic['BONES110']]

AttributeError: 'float' object has no attribute 'lower'

In [32]:
SWAN_analytic['BONES110']

0       NaN
1       NaN
2       NaN
3       NaN
4       NaN
       ... 
2240    NaN
2241    NaN
2242    NaN
2243    NaN
2244    NaN
Name: BONES110, Length: 2245, dtype: object

In [33]:
SWAN_analytic['BONES110'].value_counts()

RIGHT FOOT                        4
LEFT RIBS                         3
LEFT PINKY TOE                    3
LEFT WRIST                        3
LEFT ELBOW                        3
RIGHT ANKLE                       2
BOTH FEET                         1
LEFT METATARSAL BONE              1
5TH METATARSEL IN RT FOOT         1
LEFT ULNA                         1
LEFT ULNA & RADIUS                1
LEFT HAND NEAR WRIST              1
NOSE                              1
LEFT SMALL TOE                    1
LEFT FOOT                         1
RIGHT SHOULDER                    1
LEFT MIDDLE TOE                   1
RIBS BOTH SIDES                   1
RIGHT FOURTH TOE                  1
RIGHT TOE                         1
RIGHT GREAT TOE                   1
LT PINKY TOE, LT 5TH DIGIT        1
RIGHT WRIST                       1
RIGHT 3RD TOE                     1
RIGHT HIP                         1
RIGHT THUMB                       1
LEFT TOE                          1
LEFT HAND 1ST & 2ND FINGER  

In [39]:
# View the columns for SWAN data
SWAN_analytic.columns

Index(['SWANID', 'AGE10', 'OSTEOPR10', 'BROKEBO10', 'BONES110', 'PAINTRF10',
       'NERV4WK10', 'DEPRESS10', 'OVERHLT10', 'RACE', 'DEGREE'],
      dtype='object')

In [54]:
# Rename variables in SWAN data to match the common definitions
SWAN_analytic = SWAN_analytic.rename(columns = {'SWANID': 'ID', 'AGE10': 'Age', 'OSTEOPR10': 'OsteoDx', 'BROKEBO10': 'BrokeBone', 'BONES110':'FractureType', 'PAINTRF10':'Pain',
       'NERV4WK10': 'Anxiety', 'DEPRESS10': 'Depress', 'OVERHLT10': 'OverallHealth', 'RACE': 'Race', 'DEGREE': 'Educ'})

In [None]:
# Create broken bone composite variable in NHANES to harmonize with SWAN

In [41]:
# View the columns in NHANES
NHANES_filtered.columns

Index(['SEQN', 'RIDAGEYR', 'RIAGENDR', 'RIDRETH1', 'DMDEDUC2', 'OSQ060',
       'OSQ010A', 'OSQ010B', 'HSQ493', 'HSQ496', 'HSD010', 'DPQ020'],
      dtype='object')

In [55]:
# Rename the variables in NHANES data to match the common definitions
NHANES_filtered = NHANES_filtered.rename(columns = {'SEQN': 'ID', 'RIDAGEYR': 'Age', 'OSQ060': 'OsteoDx', 'OSQ010A': 'HipFracture', 'OSQ010B': 'WristFracture', 'HSQ493':'Pain',
       'HSQ496': 'Anxiety', 'DPQ020': 'Depress', 'HSD010': 'OverallHealth', 'RIDRETH1': 'Race', 'DMDEDUC2': 'Educ'})

In [56]:
# Concatenate the NHANES and SWAN datasets, using the source identifier as a key and resetting the index
compositeData = pd.concat([SWAN_analytic, NHANES_filtered], axis = 0, keys = ['SWAN', 'NHANES'], names = ['source']).reset_index(level = 'source')

In [57]:
# Check the composite analytic data set
compositeData

Unnamed: 0,source,ID,Age,OsteoDx,BrokeBone,FractureType,Pain,Anxiety,Depress,OverallHealth,Race,Educ,RIAGENDR,HipFracture,WristFracture
0,SWAN,10046,62,1,0,,1,6,1,1,2,2,,,
1,SWAN,10056,61,1,0,,1,6,1,2,4,3,,,
2,SWAN,10153,61,1,0,,1,6,1,3,3,2,,,
3,SWAN,10196,56,1,0,,1,4,1,3,2,5,,,
4,SWAN,10245,57,1,0,,1,5,2,2,4,2,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10121,NHANES,51596,52,2,,,0,8,0,4,1,1,2.0,2.0,2.0
10129,NHANES,51604,53,2,,,0,2,0,2,3,5,2.0,2.0,1.0
10136,NHANES,51611,55,2,,,0,0,0,3,3,4,2.0,2.0,2.0
10140,NHANES,51615,61,1,,,,,,,2,1,2.0,2.0,2.0


Could potentially join groups/datasets together based on age, once aggregated?

OR concatenate the datasets with NHANES or SWAN as the identifier, then run functions to recode -- would first need to rename all the columns to harmonize them (requires making NHANES variable for broken bone and SWAN variables for hip fracture and wrist fracture)

### Recoding Data based on Codebook

Variables were harmonized based on cross-examination of the available codebooks (detailed in `DeMarco_HDS5210Final_VariableFile.xlsx`). Variables were renamed and values recoded to facilitate comprehension based on the provided codebooks. Functions were created to recode variables based on these codebooks.

## Data Analysis

ideas for comparisons
* pivot table by age 
* pivot table by osteoporosis status
* pivot table by depression variable - would need to reconcile between the two

consider looking into variations in pain, current health status, anxious, broken bones in above gropus
* can use these as aggregating variables, to count or get an average - may require more data manipulation

will need to ask about field level transformations (would something like recoding the data from numeric to strings count? or binning the data for analysis, such as for pain (like transforming data from 0-30 days to "less than half" to "more than half", etc.)?), to make sure I meet that requirement, and figure out what to plot

## Data Visualization and Discussion

### References
Centers for Disease Control and Prevention (CDC). National Center for Health Statistics (NCHS). National Health and Nutrition Examination Survey Data. Hyattsville, MD: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention, 2007-2008, https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2007.

Sutton-Tyrrell, Kim, Selzer, Faith, Sowers, MaryFran, Finkelstein, Joel, Powell, Lynda, Gold, Ellen, … Brooks, Maria Mori. Study of Women’s Health Across the Nation (SWAN), 2006-2008: Visit 10 Dataset. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2018-11-15. https://doi.org/10.3886/ICPSR32961.v2

#### Github Submission

In [None]:
a=input('''
Are you ready to submit your work?
1. Click the Save icon (or do Ctrl-S / Cmd-S)
2. Type "yes" or "no" below
3. Press Enter

''')

if a=='yes':
    !git pull
    !git add Final_DeMarco.ipynb
    !git commit -a -m "Submitting final project programming assignment"
    !git push
else:
    print('''
    
OK. We can wait.
''')