# Week 11 Assignment

Because I was unable to conduct our workshop this week, I'm keeping the assignment light as well.  Below you'll find just two steps for this week: one programming exercise and then a planning activity for your final project.

For clarification, the "final project" I've been referring to is your "final."  It is not a project in addition to a final exam.  They're one and the same.

Please do the programming exercise and verify that your code works using the tests, then think about your final project and fill out the questions in the second part.

---

### 47.1: Filtering and summarizing data

For this work, you'll find a data file in `/data/complications_all.csv`.

Read in the data file and create a variable called `mo_hospitals` that contains a data frame from the `complications_all.csv` file, filtered down to only contain those hospitals from the state of Missouri (MO).

Then aggregate that data by hospital into a variable named `mo_summary`.  There are some key fields that we want to summarize:
* We want to know the earliest date that each hospital was participating in any program
* We want to know the latest date that each hospital stopped participating in any program
* We want to know the total number of patients in the denominators of these programs

Some things to note:
* You will need to convert the `Start Date` and `End Date` to actual datetime fields
* You will need to clean up and convert the `Denominator` field to be numeric - the rule that you should use is to simply remove any records where the `Denominator` is `'Not Available'`


The final result of this step should be a new data frame called `mo_summary` that contains one row for each hospital and contains the min start date, max end date, and total denominator.  Use the names `start_date`, `end_date`, and `number` for those columns in `mo_summary`.


You do not need to create your code in the form of a function, just make sure your variable names match what I've described above so the tests work.
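The steps described above can be sketched end to end on a tiny, made-up data frame before running them on the real file. This is only an illustration under stated assumptions: the `toy` rows are invented, the `toy_` variable names are hypothetical (chosen so they do not clash with the names the tests expect), and the column names mirror the ones mentioned in this assignment.

```python
import pandas as pd

# Invented rows that mimic the relevant columns of complications_all.csv
toy = pd.DataFrame({
    "Facility Name": ["A", "A", "B", "B", "C"],
    "State": ["MO", "MO", "MO", "MO", "KS"],
    "Start Date": ["04/01/2015", "07/01/2015", "07/01/2016", "07/01/2016", "04/01/2015"],
    "End Date": ["03/31/2018", "06/30/2018", "06/30/2018", "06/30/2018", "03/31/2018"],
    "Denominator": ["100", "Not Available", "25", "75", "10"],
})

# 1. Keep only Missouri rows; .copy() returns an independent frame to modify
toy_mo = toy[toy["State"] == "MO"].copy()

# 2. Parse the MM/DD/YYYY strings into datetime columns
toy_mo["start_date"] = pd.to_datetime(toy_mo["Start Date"], format="%m/%d/%Y")
toy_mo["end_date"] = pd.to_datetime(toy_mo["End Date"], format="%m/%d/%Y")

# 3. Drop 'Not Available' denominators, then convert the rest to numbers
toy_clean = toy_mo[toy_mo["Denominator"] != "Not Available"].copy()
toy_clean["Denominator"] = toy_clean["Denominator"].astype(float)

# 4. One row per hospital: earliest start, latest end, total denominator
toy_summary = toy_clean.groupby("Facility Name").agg(
    start_date=("start_date", "min"),
    end_date=("end_date", "max"),
    number=("Denominator", "sum"),
)
print(toy_summary)
```

Note that in this sketch the `'Not Available'` rows are removed before the dates are aggregated, so their dates do not contribute to the minimum or maximum.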

In [1]:
import pandas as pd

# Read in data from "complications_all.csv" using pandas
data = pd.read_csv('/data/complications_all.csv')

In [2]:
# Test the import of the data
data.shape

(91395, 18)

In [3]:
# Create a filter variable for the state of Missouri ("MO")
state_filter = data['State'] == 'MO'

# This is just to show you the name to use for the variable you need to create for this step to pass.
# Assign mo_hospitals to the data that meets the criteria for state_filter
mo_hospitals = data[state_filter]

# Test the filter
mo_hospitals.shape

(2133, 18)

In [4]:
# These assertions will help make sure that you're on the right track.
assert(mo_hospitals['State'].unique() == ['MO'])
assert(mo_hospitals.shape == (2133,18))

In [5]:
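# View the first rows
# Note: head needs parentheses, i.e. mo_hospitals.head(); without them the notebook
# displays the bound method (and with it the whole frame), as in the output below
mo_hospitals.head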

<bound method NDFrame.head of       Facility ID                    Facility Name           Address  \
45534      260001            MERCY HOSPITAL JOPLIN     100 MERCY WAY   
45535      260001            MERCY HOSPITAL JOPLIN     100 MERCY WAY   
45536      260001            MERCY HOSPITAL JOPLIN     100 MERCY WAY   
45537      260001            MERCY HOSPITAL JOPLIN     100 MERCY WAY   
45538      260001            MERCY HOSPITAL JOPLIN     100 MERCY WAY   
...           ...                              ...               ...   
47662      263304  SHRINERS HOSPITALS FOR CHILDREN  4400 CLAYTON AVE   
47663      263304  SHRINERS HOSPITALS FOR CHILDREN  4400 CLAYTON AVE   
47664      263304  SHRINERS HOSPITALS FOR CHILDREN  4400 CLAYTON AVE   
47665      263304  SHRINERS HOSPITALS FOR CHILDREN  4400 CLAYTON AVE   
47666      263304  SHRINERS HOSPITALS FOR CHILDREN  4400 CLAYTON AVE   

              City State  ZIP Code     County Name    Phone Number  \
45534       JOPLIN    MO     64804 

In [6]:
# Convert the start date to a datetime field
# Parse the 'Start Date' strings, which are in MM/DD/YYYY format
start_date = pd.to_datetime(mo_hospitals['Start Date'], format='%m/%d/%Y')

# Create a new column in the mo_hospitals dataframe
mo_hospitals['start_date'] = start_date

# Check the start_date
mo_hospitals['start_date']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


45534   2015-04-01
45535   2015-07-01
45536   2015-07-01
45537   2015-07-01
45538   2015-07-01
           ...    
47662   2016-07-01
47663   2016-07-01
47664   2016-07-01
47665   2016-07-01
47666   2016-07-01
Name: start_date, Length: 2133, dtype: datetime64[ns]
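The warning above is pandas' `SettingWithCopyWarning`: the new column is being assigned to a frame that was produced by filtering another frame. One common way to avoid it, shown here as a minimal sketch on invented data rather than as the required approach, is to take an explicit `.copy()` of the filtered rows before adding columns. The names `frame` and `subset` are hypothetical.

```python
import pandas as pd

# Invented stand-in for the full data set
frame = pd.DataFrame({
    "State": ["MO", "KS", "MO"],
    "Start Date": ["04/01/2015", "07/01/2015", "07/01/2016"],
})

# .copy() makes the filtered result an independent data frame,
# so adding a column to it does not trigger the warning
subset = frame[frame["State"] == "MO"].copy()
subset["start_date"] = pd.to_datetime(subset["Start Date"], format="%m/%d/%Y")

print(subset.dtypes)
```

The original `frame` is left untouched; only `subset` gains the `start_date` column.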

In [7]:
# Convert the end date to a datetime field
# Parse the 'End Date' strings, which are in MM/DD/YYYY format
end_date = pd.to_datetime(mo_hospitals['End Date'], format='%m/%d/%Y')

# Create a new column in the mo_hospitals dataframe
mo_hospitals['end_date'] = end_date

# Check the end_date
mo_hospitals['end_date']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


45534   2018-03-31
45535   2018-06-30
45536   2018-06-30
45537   2018-06-30
45538   2018-06-30
           ...    
47662   2018-06-30
47663   2018-06-30
47664   2018-06-30
47665   2018-06-30
47666   2018-06-30
Name: end_date, Length: 2133, dtype: datetime64[ns]

In [None]:
# Create a filter for numeric denominator fields - keep only those where the denominator is NOT 'Not Available'
denom_filter = mo_hospitals['Denominator'] != 'Not Available'

In [8]:
# Originally could not get the number column to reflect the correct sum
# Testing the denominator revealed that sum() was concatenating the values because "Denominator" was a string type in the original data set
testing = mo_hospitals[denom_filter]
testing['Denominator'].sum()

NameError: name 'denom_filter' is not defined
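The `NameError` above most likely means the cell that defines `denom_filter` had not been run in this kernel session before this cell was executed. Separately, the behavior the comments describe can be reproduced on a small, invented series: summing a column of strings joins them end to end, while converting to a numeric type first gives the arithmetic total. The names `raw`, `kept`, `text_sum`, and `numeric_sum` are hypothetical.

```python
import pandas as pd

# Invented values stored as strings, like the raw "Denominator" column
raw = pd.Series(["10", "20", "Not Available", "5"], dtype=object, name="Denominator")

# Keep only the entries that hold a number
kept = raw[raw != "Not Available"]

# Summing strings concatenates them instead of adding them
text_sum = kept.sum()

# Converting to float first produces the arithmetic total
numeric_sum = kept.astype(float).sum()

print(repr(text_sum), numeric_sum)
```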

In [None]:

# Create a data set with only the rows that have a numeric denominator
mo_hospitals_clean = mo_hospitals[denom_filter]

In [None]:
# Set the type for "Denominator" to be a float
mo_hospitals_clean['Denominator'] = mo_hospitals_clean['Denominator'].astype(float)

In [None]:
# Group by hospital
grouped_hospitals = mo_hospitals_clean.groupby('Facility Name')

In [None]:
# Testing the denominator column
grouped_hospitals['Denominator'].get_group("WRIGHT MEMORIAL HOSPITAL")

In [None]:
# Testing the ability to sum the denominator column as a grouped object
# (summing only 'Denominator' avoids trying to add up the text and date columns)
grouped_hospitals['Denominator'].sum()

In [None]:
# Create the final data frame from the grouped_hospitals data frame by applying aggregating functions
# To achieve the prescribed result - earliest start date, latest end date, and sum of all the denominators (individuals)
mo_summary = grouped_hospitals.agg(
    start_date = pd.NamedAgg(column = 'start_date', aggfunc = "min"),
    end_date = pd.NamedAgg(column = 'end_date', aggfunc = 'max'),
    number = pd.NamedAgg(column = 'Denominator', aggfunc = 'sum'))

In [None]:
# Test the mo_summary
mo_summary

In [None]:
assert(mo_summary['number'].sum() == 1766908)
assert(mo_summary['start_date'].min() == pd.Timestamp(2015,4,1))
assert(mo_summary['end_date'].max() == pd.Timestamp(2018,6,30))
assert(mo_summary.shape == (108,3))
assert(mo_summary.loc['BARNES JEWISH HOSPITAL'].number == 131313)
assert(mo_summary.loc['BOONE HOSPITAL CENTER'].number == 63099)

---

### 47.2: Planning your final project

You should be thinking about the things we've been learning and how you can apply them to your final project.  Use the rubric to help guide your thinking and then answer the questions below.  This is meant as a guide to help you think through what you will do.

#### A. Data Access

Your project should include data from at least two distinct types of sources (locations and/or file types).  For example: AWS S3, Relational Databases, Internet, Web Services, local files.  List what data sources you're planning to use.

National Health and Nutrition Examination Survey (NHANES) 2007-2008:
https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2007

NHANES is a long-running program of the National Center for Health Statistics within the CDC. Its complex sampling design is intended to produce nationally representative estimates. These estimates can then be used to inform policy, understand health issues within the US, and track trends. NHANES collects information on a variety of topics, including mental health, general health, social determinants of health (e.g., insurance, education, and citizenship status), and other demographic information. Unique participant IDs are assigned and can be used to merge data from different modules. Below are the modules I intend to use for this project:

Demographics: https://wwwn.cdc.gov/Nchs/Nhanes/2007-2008/DEMO_E.htm

Osteoporosis Questionnaire Module: https://wwwn.cdc.gov/Nchs/Nhanes/2007-2008/OSQ_E.htm

Current Health Status: https://wwwn.cdc.gov/Nchs/Nhanes/2007-2008/HSQ_E.htm

Mental Health - Depression Screener: https://wwwn.cdc.gov/Nchs/Nhanes/2007-2008/DPQ_E.htm

Note: all links provided are to the documentation modules. NHANES provides documentation and data files for each module.
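Because the modules share a participant ID, combining them comes down to a merge on that ID. The sketch below uses two invented tables to show the idea. `SEQN` is the respondent sequence number NHANES uses as its participant ID; the other column names (`age`, `phq9_total`) and all values are hypothetical placeholders, not actual NHANES variables.

```python
import pandas as pd

# Invented stand-ins for two NHANES modules keyed by participant ID
demo = pd.DataFrame({"SEQN": [1, 2, 3], "age": [52, 61, 47]})
dpq = pd.DataFrame({"SEQN": [2, 3, 4], "phq9_total": [4, 11, 0]})

# Inner merge keeps only participants present in both modules;
# validate raises an error if an ID is repeated in either table
merged = demo.merge(dpq, on="SEQN", how="inner", validate="one_to_one")

print(merged)
```

An inner merge drops participants who did not complete both modules; `how="left"` would keep every row of the first table instead, leaving missing values where the second module has no match.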

Citation: Centers for Disease Control and Prevention (CDC). National Center for Health Statistics (NCHS). National Health and Nutrition Examination Survey Data. Hyattsville, MD: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention, 2007-2008, https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2007.

Study of Women's Health Across the Nation (SWAN) 2006-2008: https://www.icpsr.umich.edu/web/ICPSR/studies/32961

This is the most recent iteration of the study. SWAN is an epidemiologic survey of women in their middle years and examines several aspects of health, including physical and psychological. Although this longitudinal study began in 1994, data will only be used from this iteration in a cross-sectional manner. Data is available for similar mental health, physical health, and demographic variables as in NHANES.

Citation: Sutton-Tyrrell, Kim, Selzer, Faith, Sowers, MaryFran, Finkelstein, Joel, Powell, Lynda, Gold, Ellen, … Brooks, Maria Mori. Study of Women’s Health Across the Nation (SWAN), 2006-2008: Visit 10 Dataset. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2018-11-15. https://doi.org/10.3886/ICPSR32961.v2
 

#### B. Data Formats

Your project should include data that comes in different file formats.  For example: HL7, EDI, HTML, CSV, Excel, JSON, XML.  List what data formats you're planning to use.

NHANES data is provided in SAS format as .XPT files. 

SWAN 2006-2008 can be exported in multiple file formats, including delimited (TSV), R, SPSS, SAS, and ASCII. I plan to use the TSV format so that the two sources arrive in different file formats, as NHANES data is only available in SAS and Stata formats.
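Both formats can be read with pandas. The following is a rough sketch, not a tested loader for the real files: the helper names `load_nhanes_xpt` and `load_swan_tsv` are hypothetical, and the small in-memory TSV exists only to show the call working without downloading anything.

```python
import io

import pandas as pd


def load_nhanes_xpt(path):
    """Read a SAS transport (.XPT) file into a data frame."""
    return pd.read_sas(path, format="xport")


def load_swan_tsv(path_or_buffer):
    """Read a tab-delimited file (or file-like object) into a data frame."""
    return pd.read_csv(path_or_buffer, sep="\t")


# Invented tab-delimited content standing in for a SWAN export
demo_tsv = io.StringIO("ID\tAGE\n1\t52\n2\t61\n")
swan_demo = load_swan_tsv(demo_tsv)

print(swan_demo)
```

`load_nhanes_xpt` would be called with the local path of a downloaded module file; it is defined but not run here because no such file is bundled with this notebook.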

#### C. Objective

What purpose would your project serve in a real work setting?  Take a couple of paragraphs to write down why this is an interesting product.

I will be working with a medical student for her 2022 Summer Research Fellowship to examine associations between osteoporosis and mental health, with a particular focus on mental health treatment and gender disparities. We plan to use secondary data from the National Health and Nutrition Examination Survey (NHANES), a nationally representative survey comprised of in-home questionnaires and physical examinations administered at a mobile site. We will examine a subset of individuals with osteoporosis for mental health outcomes and treatment. In particular, we will use the data on osteoporosis outcomes (such as broken bones and diagnosis), general impression of mental health, and PHQ-9 scores (a commonly used screening tool for depression). Treatment data could be gathered from reports of prescription medication and mental health encounters in other modules of NHANES not included above. The larger project will include NHANES data from multiple years, using those years in which NHANES collected data on osteoporosis. 

As no one dataset contains all the information one typically needs, cross-comparison between two similar data sources is often required. The current project will focus on proof of concept, data management, and exploratory analysis. In addition, I will compare data from NHANES with SWAN to determine whether NHANES gathers comparable information to SWAN, a longitudinal in-person study. From the current project, I will identify variables that could be harmonized and compared between the two data sources. This cross-sectional analysis will be the first step towards future comparisons between these two data sets, including later longitudinal study. This will be particularly useful as SWAN stopped collecting data in 2008, while NHANES continues to collect data. If the conclusions drawn from both studies are similar regarding women's mental health and osteoporosis outcomes, NHANES would be a good candidate data source to continue to monitor national trends in these areas. If these conclusions are different, the necessity of longitudinal studies such as SWAN would be underscored.



---



## Submitting Your Work

In order to submit your work, you'll need to use the `git` command line program to **add** your homework file (this file) to your local repository, **commit** your changes to your local repository, and then **push** those changes up to github.com.  From there, I'll be able to **pull** the changes down and do my grading.  I'll provide some feedback, **commit** and **push** my comments back to you.  Next week, I'll show you how to **pull** down my comments.

To run through everything one last time and submit your work:
1. Use the `Kernel` -> `Restart Kernel and Run All Cells` menu option to run everything from top to bottom and stop here.
2. Follow the instructions in the prompt below to either save and submit your work, or continue working.

If anything fails along the way with this submission part of the process, let me know.  I'll help you troubleshoot.

---

In [None]:
a=input('''
Are you ready to submit your work?
1. Click the Save icon (or do Ctrl-S / Cmd-S)
2. Type "yes" or "no" below
3. Press Enter

''')

if a=='yes':
    !git add week11_assignment_2.ipynb
    !git commit -a -m "Submitting the week 11 programming assignment"
    !git push
else:
    print('''
    
OK. We can wait.
''')