# HDS 5210 Final Project: Osteoporosis in Multiple Data Sets
### Elisabeth C. DeMarco

This project examines data about osteoporosis in women in the United States using two data sources: the National Health and Nutrition Examination Survey (NHANES) and the Study of Women's Health Across the Nation (SWAN). While NHANES runs continuously, SWAN was most recently conducted between 2006 and 2008. Thus, the present project focuses on NHANES 2007-2008 and SWAN 2006-2008 data.

All data/documentation can be found at the following links: 
* NHANES:
    * Demographics:
        * Documentation: https://wwwn.cdc.gov/Nchs/Nhanes/2007-2008/DEMO_E.htm
        * Data: https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Demographics&CycleBeginYear=2007
            * File name: `DEMO_E.XPT`
    * Osteoporosis Questionnaire Module: 
        * Documentation: https://wwwn.cdc.gov/Nchs/Nhanes/2007-2008/OSQ_E.htm
        * Data: https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Questionnaire&CycleBeginYear=2007
            * File name: `OSQ_E.XPT`
    * Current Health Status: 
        * Documentation: https://wwwn.cdc.gov/Nchs/Nhanes/2007-2008/HSQ_E.htm
        * Data: https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Questionnaire&CycleBeginYear=2007
            * File name: `HSQ_E.XPT`
    * Mental Health - Depression Screener: 
        * Documentation: https://wwwn.cdc.gov/Nchs/Nhanes/2007-2008/DPQ_E.htm
        * Data: https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Questionnaire&CycleBeginYear=2007
            * File name: `DPQ_E.XPT`

* SWAN:
    * Study of Women's Health Across the Nation (SWAN) Cross-Sectional Screener Dataset, [United States], 
        1995-1997 (ICPSR 4368): https://www.icpsr.umich.edu/web/ICPSR/studies/4368/summary
        * Codebook and data files are available for download at the same site
            * File name: `04368-0001-Data.tsv`
    * Study of Women's Health Across the Nation (SWAN) 2006-2008: https://www.icpsr.umich.edu/web/ICPSR/studies/32961
        * Codebook and data file are available for download at the same site
            * File name: `32961-0001-Data.tsv`

Files were saved to the `/final` directory for ease of access.  

## Study Objectives
I will be working with a medical student for her 2022 Summer Research Fellowship to examine associations between osteoporosis and mental health, with a particular focus on mental health treatment and gender disparities. We plan to use secondary data from the National Health and Nutrition Examination Survey (NHANES), a nationally representative survey comprised of in-home questionnaires and physical examinations administered at a mobile site. We will examine a subset of individuals with osteoporosis for mental health outcomes and treatment. In particular, we will use the data on osteoporosis outcomes (such as broken bones and diagnosis), general impression of mental health, and PHQ-9 scores (a commonly used screening tool for depression). Treatment data could be gathered from reports of prescription medication and mental health encounters in other modules of NHANES not included above. The larger project will include NHANES data from multiple years, using those years in which NHANES collected data on osteoporosis.

Because no single dataset contains all the information a study typically needs, cross-comparison between two similar data sources is often required. The current project focuses on proof-of-concept, data management, and exploratory analysis. In addition, I will compare data from NHANES with SWAN, a longitudinal in-person study, to determine whether NHANES gathers comparable information. From the current project, I will identify variables that could be harmonized and compared between the two data sources. This cross-sectional analysis is the first step towards future comparisons between the two data sets, including later longitudinal study. This will be particularly useful because SWAN stopped collecting data in 2008, while NHANES continues to collect data. If the conclusions drawn from both studies are similar regarding women's mental health and osteoporosis outcomes, NHANES would be a good candidate data source for continued monitoring of national trends in these areas. If the conclusions differ, the necessity of longitudinal studies such as SWAN would be underscored.

## Variables of Interest

The documentation for both datasets was examined and a final set of variables of interest collected. These are detailed in the `DeMarco_HDS5210Final_VariableFile.xlsx` file. Concepts of interest are summarized below:
* Age
* Sex
* Race/Ethnicity
* Education
* Osteoporosis diagnosis
* Broken bone
* Fracture of wrist or hip
* Pain interference with activities
* Feeling nervous or anxious in last month
* Felt depressed in last 2 weeks
* Overall health
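
To make the idea of harmonizing concepts concrete, the sketch below maps a few of the concepts above to one variable name per source and renames columns to a shared label. The pairings are an assumption inferred from the variable names selected later in this notebook, not taken from the variable file, and should be checked against both codebooks. The names `HARMONIZED_NAMES` and `rename_to_harmonized` are hypothetical, and the toy values are made up. Renaming alone does not reconcile response codings, which would still need recoding.

```python
import pandas as pd

# Hypothetical concept-to-variable lookup; pairings are assumed from variable
# names and must be confirmed in the SWAN and NHANES codebooks.
HARMONIZED_NAMES = {
    "age": {"swan": "AGE10", "nhanes": "RIDAGEYR"},
    "osteoporosis_dx": {"swan": "OSTEOPR10", "nhanes": "OSQ060"},
    "felt_depressed": {"swan": "DEPRESS10", "nhanes": "DPQ020"},
    "overall_health": {"swan": "OVERHLT10", "nhanes": "HSD010"},
}

def rename_to_harmonized(df, source):
    """Rename source-specific columns ('swan' or 'nhanes') to shared concept names."""
    mapping = {names[source]: concept for concept, names in HARMONIZED_NAMES.items()}
    # Columns absent from the mapping (for example the ID column) are left unchanged
    return df.rename(columns=mapping)

# Toy row standing in for NHANES data (values are made up)
toy_nhanes = pd.DataFrame({"SEQN": [41475.0], "RIDAGEYR": [62.0], "OSQ060": [2.0]})
renamed = rename_to_harmonized(toy_nhanes, "nhanes")
print(list(renamed.columns))
```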

## Creating the Analytic Data Sets

SWAN data for 2006-2008 is limited to "Women age 51 through 63" while NHANES samples a broader portion of the US population. To facilitate accurate comparisons, NHANES data was limited to women age 51 through 63. 
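
The restriction described above could be sketched as follows. This is a minimal example, assuming the NHANES coding in which `RIAGENDR == 2` denotes female respondents and `RIDAGEYR` is age in years; the helper name `subset_women_51_63` is hypothetical and the toy rows are made up.

```python
import pandas as pd

def subset_women_51_63(nhanes: pd.DataFrame) -> pd.DataFrame:
    """Keep female respondents (RIAGENDR == 2) aged 51 through 63."""
    is_female = nhanes["RIAGENDR"] == 2
    in_age_range = nhanes["RIDAGEYR"].between(51, 63)  # inclusive on both ends
    return nhanes.loc[is_female & in_age_range].copy()

# Toy rows standing in for the NHANES demographics data (values are made up)
toy = pd.DataFrame({
    "SEQN": [1.0, 2.0, 3.0, 4.0],
    "RIDAGEYR": [62.0, 6.0, 52.0, 51.0],
    "RIAGENDR": [2.0, 2.0, 1.0, 2.0],
})
women_51_63 = subset_women_51_63(toy)
print(women_51_63["SEQN"].tolist())
```

Returning a copy keeps later recoding steps from modifying the full NHANES dataframe by accident.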

In [1]:
# Import Packages and read in data
import pandas as pd

# Read in tsv file from SWAN data
SWAN_2008 = pd.read_csv('32961-0001-Data.tsv', sep = '\t')
SWAN_demographics = pd.read_csv('04368-0001-Data.tsv', sep = '\t')

# Read in NHANES files
NHANES_osteoporosis = pd.read_sas('OSQ_E.XPT')
NHANES_demographics = pd.read_sas('DEMO_E.XPT')
NHANES_currenthealth = pd.read_sas('HSQ_E.XPT')
NHANES_depression = pd.read_sas('DPQ_E.XPT')
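
Some SWAN fields appear blank in the displayed output further down (for example `BONES110`), which suggests that at least some columns are read as text rather than numbers. A minimal sketch of one way to handle this, assuming the blanks are whitespace strings, is to coerce selected columns with `pd.to_numeric`. The helper name `to_numeric_columns` is hypothetical and the toy values are made up.

```python
import pandas as pd

def to_numeric_columns(df, columns):
    """Return a copy of df with the named columns converted to numbers.

    Values that cannot be parsed (such as blank strings) become NaN.
    """
    cleaned = df.copy()
    for col in columns:
        cleaned[col] = pd.to_numeric(cleaned[col], errors="coerce")
    return cleaned

# Toy rows standing in for the SWAN visit file (values are made up)
toy_swan = pd.DataFrame({
    "SWANID": [10046, 10056],
    "BONES110": [" ", "2"],
    "AGE10": ["62", "61"],
})
toy_clean = to_numeric_columns(toy_swan, ["BONES110", "AGE10"])
print(toy_clean.dtypes)
```

Whether a blank should be treated as missing or as "not applicable" depends on the SWAN codebook, so the coercion above is only a starting point.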


In [10]:
# Create a function to subset columns given a list of variables
def column_select(dataframe, columns):
    '''(dataframe, list) -> dataframe
    This function takes two arguments, a dataframe and a list of
    column names. It returns a dataframe with only those columns.
    A KeyError is raised if any name is missing from the dataframe.
    '''
    return dataframe[columns]

In [15]:
# Pull the variables of interest from the SWAN 2008 file
SWAN_2008_Analytic = column_select(SWAN_2008, ['SWANID', 'AGE10',
                                              'OSTEOPR10', 'BROKEBO10',
                                              'BONES110',
                                              'PAINTRF10', 'NERV4WK10',
                                              'DEPRESS10', 'OVERHLT10'])

# Pull the variables of interest from the SWAN crosssectional file
SWAN_demo_Analytic = column_select(SWAN_demographics, ['SWANID', 'RACE', 'DEGREE'])

In [16]:
SWAN_2008_Analytic.head()

      SWANID AGE10 OSTEOPR10 BROKEBO10 BONES110 PAINTRF10 NERV4WK10 DEPRESS10  \
0      10046    62         1         0                  1         6         1   
1      10056    61         1         0                  1         6         1   
2      10153    61         1         0                  1         6         1   
3      10196    56         1         0                  1         4         1   
4      10245    57         1         0                  1         5         2   

In [17]:
# Use a left-join on SWANID to create an overall SWAN2008 analytic dataset
SWAN_full = SWAN_2008_Analytic.merge(SWAN_demo_Analytic, how='left', on='SWANID')

SWAN_full.head()

      SWANID AGE10 OSTEOPR10 BROKEBO10 BONES110 PAINTRF10 NERV4WK10 DEPRESS10  \
0      10046    62         1         0                  1         6         1   
1      10056    61         1         0                  1         6         1   
2      10153    61         1         0                  1         6         1   
3      10196    56         1         0                  1         4         1   
4      10245    57         1         0                  1         5         2   
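
A left join silently keeps visit rows that have no screener match, so it can help to check the join explicitly. The sketch below uses the `validate` and `indicator` options of `DataFrame.merge` on toy data. It assumes each `SWANID` appears at most once in each file, which should be confirmed for the real data; `validate='one_to_one'` raises an error if that assumption fails. The `RACE` and `DEGREE` values are made up.

```python
import pandas as pd

# Toy stand-ins for the visit file and the cross-sectional screener
visit = pd.DataFrame({"SWANID": [10046, 10056, 10153], "AGE10": [62, 61, 61]})
screener = pd.DataFrame({"SWANID": [10046, 10056], "RACE": [1, 2], "DEGREE": [3, 4]})

checked = visit.merge(
    screener,
    how="left",
    on="SWANID",
    validate="one_to_one",  # error if SWANID repeats on either side
    indicator=True,         # adds a _merge column describing each row's origin
)

# IDs present in the visit file but missing from the screener
unmatched = checked.loc[checked["_merge"] == "left_only", "SWANID"].tolist()
print(unmatched)
```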

In [20]:
# Subset the columns from the NHANES data sets
NHANES_demo_analytic = column_select(NHANES_demographics, ['SEQN', 'WTINT2YR', 
                                                          'RIDAGEYR', 'RIAGENDR',
                                                          'RIDRETH1',
                                                          'DMDEDUC2'])
NHANES_osteo_analytic = column_select(NHANES_osteoporosis, ['SEQN', 'OSQ060',
                                                           'OSQ010A', 'OSQ010B'])

NHANES_health_analytic = column_select(NHANES_currenthealth, ['SEQN', 'HSQ493', 'HSQ496', 'HSD010'])

NHANES_dep_analytic = column_select(NHANES_depression, ['SEQN', 'DPQ020'])

In [21]:
# Create a function for NHANES joins, which rely on 'SEQN'
def NHANES_join(df1, df2, method):
    '''(dataframe, dataframe, string) -> dataframe
    This function takes 2 dataframes from NHANES data and a merge method based on pd.merge and
    creates a merged dataframe. This function assumes both dataframes include the SEQN variable used
    as a unique identifier in NHANES.
    '''
    return df1.merge(df2, how=method, on='SEQN')

In [23]:
# Use the NHANES_join function to sequentially join dataframes and create the full analytic dataset
NHANES_demo_osteo = NHANES_join(NHANES_demo_analytic, NHANES_osteo_analytic, 'left')
NHANES_do_health = NHANES_join(NHANES_demo_osteo, NHANES_health_analytic, 'left')
NHANES_full = NHANES_join(NHANES_do_health, NHANES_dep_analytic, 'left')
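
As the larger project adds more NHANES modules, the chain of intermediate dataframes could be replaced by a single fold over a list. The following is a minimal sketch using `functools.reduce`; the helper name `join_all_on_seqn` is hypothetical and the toy values are made up.

```python
from functools import reduce

import pandas as pd

def join_all_on_seqn(frames, how="left"):
    """Merge a list of NHANES dataframes on SEQN, left to right."""
    return reduce(lambda left, right: left.merge(right, how=how, on="SEQN"), frames)

# Toy stand-ins for three NHANES modules
demo = pd.DataFrame({"SEQN": [41475.0, 41476.0], "RIDAGEYR": [62.0, 6.0]})
osteo = pd.DataFrame({"SEQN": [41475.0], "OSQ060": [2.0]})
dep = pd.DataFrame({"SEQN": [41475.0], "DPQ020": [0.0]})

combined = join_all_on_seqn([demo, osteo, dep])
print(combined)
```

With left joins, the first dataframe in the list determines which respondents are kept, so the demographics file should come first, as in the sequential version.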

In [24]:
# View the merged data set
NHANES_full.head()

          SEQN      WTINT2YR  RIDAGEYR  RIAGENDR  RIDRETH1  DMDEDUC2  OSQ060  \
0      41475.0  59356.356426      62.0       2.0       5.0       3.0     2.0   
1      41476.0  35057.218405       6.0       2.0       5.0       NaN     NaN   
2      41477.0   9935.266183      71.0       1.0       3.0       3.0     2.0   
3      41478.0  12846.712058       1.0       2.0       3.0       NaN     NaN   
4      41479.0   8727.797555      52.0       1.0       1.0       1.0     2.0   


### References
Centers for Disease Control and Prevention (CDC). National Center for Health Statistics (NCHS). National Health and Nutrition Examination Survey Data. Hyattsville, MD: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention, 2007-2008, https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2007.

Sutton-Tyrrell, Kim, Selzer, Faith, Sowers, MaryFran, Finkelstein, Joel, Powell, Lynda, Gold, Ellen, … Brooks, Maria Mori. Study of Women’s Health Across the Nation (SWAN), 2006-2008: Visit 10 Dataset. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2018-11-15. https://doi.org/10.3886/ICPSR32961.v2