# Purpose

The goal of this notebook is to pull down Centers for Medicare and Medicaid Services (CMS) data on hospital-level variables (e.g. the number of times patients have serious falls in a given reporting period). This notebook focuses purely on the data wrangling and exploration steps of the Phase I analysis (please see the project's [README](README.md) for info on the phases of this project). A separate notebook handles the more advanced modeling work for both Phases I and II.

# Background

CMS collects all sorts of information on US hospitals as part of its role as a federally managed insurance organization. I was inspired to think about modeling hospital outcomes by the news in early 2019 that CMS had mandated that all US hospitals publish their chargemaster data online, in machine-readable format, on their websites. These tables provide the pre-insurance-negotiation prices for all procedures and consumables in a given hospital.

After digging into the chargemaster data I could find (and forking [a very helpful repo](https://github.com/vsoch/hospital-chargemaster) that had done a lot of the heavy lifting of parsing many hospitals' chargemaster data), I determined that cleaning those data would be a substantial task in its own right. In particular, the shorthand used to describe many items makes it hard to discern which procedures and consumables are equivalent. I therefore split the work into two phases, as described in the README.

This notebook, as mentioned before, is concerned with Phase I's data collection steps.

# Pulling Data from CMS API

The first step in this endeavor is to pull down as much relevant (and up-to-date) data as I can from [the CMS API covering hospitals](https://data.medicare.gov/data/hospital-compare).

In [13]:
# Package import

import json
import requests
import numpy as np
import pandas as pd
import plotly.express as px

In [6]:
# Pull app token to identify the analysis from private key file
# Note that data can be pulled without a token, but limits are put on the number of requests
# and the data stream is throttled

# I recommend that anyone attempting to reproduce this work generate their own token to do so

APP_TOKEN = pd.read_json("secure_keys/CMS_app_token.json").loc[0,'App Token']

In [70]:
def query_CMS(dataset_url, query_params={"$select": "*"}, app_token=APP_TOKEN):
    '''
    Queries the CMS API for a specific dataset and returns the data 
    from the query as a pandas DataFrame
    
    Inputs
    ------
    dataset_url: str. URL of the dataset being queried (CMS uses different source
        URLs for each dataset instead of exposing one big database via a single URL).
        Options for different dataset URLs can be found by exploring the documentation
        at https://data.medicare.gov/data/hospital-compare.
        
    query_params: dict. Parameters that can be used to narrow queries from the API.
        Don't explicitly include the app token in this, as it is added as part of the execution of
        this function. Also don't include a LIMIT parameter, as this will be calculated automatically
        and used to pull down the full dataset.
    
    
    Return
    ------
    pandas DataFrame with queried data and relevant metadata. 
        Will return None if query throws an error
    '''

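    # Copy the caller's parameters so neither their dict nor the shared default is mutated
    query_params = dict(query_params)

    # Query to figure out how many records there are and set LIMIT based on them
    # (uses the app_token argument rather than the global APP_TOKEN)
    r_count = requests.get(dataset_url, params={"$select": "COUNT(*)",
                                                "$$app_token": app_token})
    num_rows = int(r_count.json()[0]['COUNT'])
    query_params["$limit"] = num_rows
    
    # Add the app token to avoid throttled queries
    query_params["$$app_token"] = app_token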

    
    # Perform the main query
    r = requests.get(dataset_url, params=query_params)

    # Check that query didn't throw any errors
    if r.status_code == requests.codes.ok:
        print("Query successful!")
    else:
        print(f"Query failed with status code {r.status_code}")
        return None
    
        
    df = pd.DataFrame.from_dict(r.json())
        
    # Includes dataset last updated datetime as a column
    df['Dataset Last Updated'] = pd.to_datetime(r.headers['X-SODA2-Truth-Last-Modified'])
    
    return df
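
As a rough illustration of how `query_params` can narrow a query, the sketch below assembles a parameter dict offline, without calling the API. The helper `build_query_params`, the placeholder token, and the row count of 100 are assumptions made for this example and are not part of the notebook; `$select`, `$where` and `$limit` are standard SoQL parameters, and `provider_id` and `state` are columns that appear in the Hospital General Information output further down.

```python
def build_query_params(base_params, num_rows, app_token=None):
    """Hypothetical helper: return a NEW dict with $limit (and optionally the token) added."""
    params = dict(base_params)  # copy so the caller's dict is left untouched
    params["$limit"] = num_rows
    if app_token is not None:
        params["$$app_token"] = app_token
    return params


# Narrow the query to two columns for hospitals in one state
user_params = {"$select": "provider_id, state", "$where": "state = 'AL'"}

# 100 and "EXAMPLE_TOKEN" are placeholders, not real values
full_params = build_query_params(user_params, num_rows=100, app_token="EXAMPLE_TOKEN")

print(full_params)
print(user_params)  # unchanged by the helper
```

Copying the dict before adding `$limit` and the token keeps repeated calls from leaking parameters into one another, which matters when the same dict (or a default argument) is reused across queries.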

In [77]:
metadata = pd.read_csv('metadata.csv', encoding = 'latin-1')
metadata

Unnamed: 0,Dataset Name,API Endpoint URL,Description
0,Footnote Crosswalk,https://data.medicare.gov/resource/sbph-xiia.json,List of footnotes referenced in many datasets ...
1,Hospital General Information,https://data.medicare.gov/resource/rbry-mqwu.json,A list of all hospitals that have been registe...
2,Complications and Deaths - Hospital,https://data.medicare.gov/resource/ukfj-tt6v.json,Complications and deaths data provided by the ...
3,Healthcare Associated Infections - Hospital,https://data.medicare.gov/resource/ppaw-hhm5.json,Hospital-provided data. These measures are dev...
4,Hospital Readmissions Reduction Program,https://data.medicare.gov/resource/kac9-a9fp.json,Measures of frequency of patient readmissions ...
5,Medicare Spending Per Beneficiary - Hospital A...,https://data.medicare.gov/resource/8ckj-r4j6.json,The Medicare Spending Per Beneficiary (MSPB) M...
6,Outpatient Imaging Efficiency - Hospital,https://data.medicare.gov/resource/72af-b2t9.json,Hospital-provided data about the use of medica...
7,Patient survey (HCAHPS) - Hospital,https://data.medicare.gov/resource/rmgi-5fhi.json,A list of hospital ratings for the Hospital Co...
8,Structural Measures - Hospital,https://data.medicare.gov/resource/w5ci-7egs.json,A list of hospitals and the structural measure...


In [98]:
footnote_mapping = query_CMS(metadata.loc[0, 'API Endpoint URL'])
footnote_mapping.head()

Query successful!


Unnamed: 0,footnote,footnote_text,Dataset Last Updated
0,1,The number of cases/patients is too few to rep...,2019-07-30 06:19:20+00:00
1,2,Data submitted were based on a sample of cases...,2019-07-30 06:19:20+00:00
2,3,Results are based on a shorter time period tha...,2019-07-30 06:19:20+00:00
3,4,Data suppressed by CMS for one or more quarters.,2019-07-30 06:19:20+00:00
4,5,Results are not available for this reporting p...,2019-07-30 06:19:20+00:00


In [99]:
footnote_mapping.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 3 columns):
footnote                30 non-null object
footnote_text           30 non-null object
Dataset Last Updated    30 non-null datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](1), object(2)
memory usage: 800.0+ bytes


In [142]:
def parse_footnote(footnote_ids, footnote_defs):
    '''
    As some of the CMS datasets provide the footnote ID number(s) 
    as a comma-delimited list of integers (unless there is only one or none),
    it is necessary to use the footnote definitions DataFrame to decode them
    if we need to understand any caveats present

    Inputs
    ------
    footnote_ids: single int as a str or list of comma-delimited ints as a str. The ID(s) of 
        relevant footnotes for a record

    footnote_defs: pandas DataFrame with mapping of footnote IDs to human-readable
        footnote text


    Returns
    -------
    pandas DataFrame of descriptions with each ID and its corresponding description text
    '''
    # Remove any spaces and split into a list on commas
    # A single ID value becomes a one-element list, e.g. '1' -> ['1']
    footnote_ids = footnote_ids.replace(' ', '').split(',')

    return footnote_defs.loc[footnote_defs['footnote'].isin(footnote_ids),
                             ['footnote', 'footnote_text']]
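
A minimal usage sketch for `parse_footnote`, using a toy stand-in for `footnote_mapping`. The function body is repeated so the example runs on its own, and the footnote texts are placeholders invented for this example, not the real CMS wording. It assumes footnote IDs are stored as strings, as the `object` dtype in `footnote_mapping.info()` suggests.

```python
import pandas as pd


def parse_footnote(footnote_ids, footnote_defs):
    # Same logic as the function above, repeated so this example is self-contained
    ids = footnote_ids.replace(' ', '').split(',')
    return footnote_defs.loc[footnote_defs['footnote'].isin(ids),
                             ['footnote', 'footnote_text']]


# Toy definitions table; the texts are placeholders, not CMS footnote text
toy_defs = pd.DataFrame({
    'footnote': ['1', '2', '3'],
    'footnote_text': ['placeholder text one',
                      'placeholder text two',
                      'placeholder text three'],
})

# A comma-delimited string of IDs, with a stray space, as it might appear in a record
decoded = parse_footnote('1, 3', toy_defs)
print(decoded)
```

Because the lookup relies on `isin` with string IDs, an integer-typed `footnote` column would match nothing; casting it with `.astype(str)` first would be one way to guard against that.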


In [94]:
gen_info = query_CMS(metadata.loc[1, 'API Endpoint URL'])
gen_info.sort_values('provider_id').head(20)

Query successful!


Unnamed: 0,:@computed_region_csmy_5jwy,:@computed_region_f3tr_pr43,:@computed_region_nwen_78xc,address,city,county_name,effectiveness_of_care_national_comparison,effectiveness_of_care_national_comparison_footnote,efficient_use_of_medical_imaging_national_comparison,efficient_use_of_medical_imaging_national_comparison_footnote,...,provider_id,readmission_national_comparison,readmission_national_comparison_footnote,safety_of_care_national_comparison,safety_of_care_national_comparison_footnote,state,timeliness_of_care_national_comparison,timeliness_of_care_national_comparison_footnote,zip_code,Dataset Last Updated
3752,29.0,1551.0,1551.0,1108 ROSS CLARK CIRCLE,DOTHAN,HOUSTON,Same as the national average,,Same as the national average,,...,10001,Below the national average,,Above the national average,,AL,Above the national average,,36301,2019-07-30 06:20:01+00:00
1582,,,,2505 U S HIGHWAY 431 NORTH,BOAZ,MARSHALL,Above the national average,,Below the national average,,...,10005,Below the national average,,Below the national average,,AL,Above the national average,,35957,2019-07-30 06:20:01+00:00
351,29.0,1584.0,1584.0,1701 VETERANS DRIVE,FLORENCE,LAUDERDALE,Same as the national average,,Below the national average,,...,10006,Above the national average,,Above the national average,,AL,Above the national average,,35630,2019-07-30 06:20:01+00:00
1404,29.0,1539.0,1539.0,702 N MAIN ST,OPP,COVINGTON,Below the national average,,Not Available,Results are not available for this reporting p...,...,10007,Below the national average,,Same as the national average,,AL,Above the national average,,36467,2019-07-30 06:20:01+00:00
101,29.0,1540.0,1540.0,101 HOSPITAL CIRCLE,LUVERNE,CRENSHAW,Same as the national average,,Not Available,Results are not available for this reporting p...,...,10008,Above the national average,,Not Available,Results are not available for this reporting p...,AL,Above the national average,,36049,2019-07-30 06:20:01+00:00
2200,29.0,1583.0,1583.0,50 MEDICAL PARK EAST DRIVE,BIRMINGHAM,JEFFERSON,Below the national average,,Same as the national average,,...,10011,Below the national average,,Below the national average,,AL,Below the national average,,35235,2019-07-30 06:20:01+00:00
1043,,,,200 MED CENTER DRIVE,FORT PAYNE,DE KALB,Same as the national average,,Same as the national average,,...,10012,Below the national average,,Above the national average,,AL,Above the national average,,35968,2019-07-30 06:20:01+00:00
3700,29.0,1636.0,1636.0,1000 FIRST STREET NORTH,ALABASTER,SHELBY,Same as the national average,,Below the national average,,...,10016,Above the national average,,Below the national average,,AL,Same as the national average,,35007,2019-07-30 06:20:01+00:00
1305,29.0,1583.0,1583.0,"1720 UNIVERSITY BLVD, SUITE 500",BIRMINGHAM,JEFFERSON,Same as the national average,,Not Available,There are too few measures or measure groups r...,...,10018,Not Available,There are too few measures or measure groups r...,Not Available,Results are not available for this reporting p...,AL,Same as the national average,,35233,2019-07-30 06:20:01+00:00
1560,29.0,1498.0,1498.0,1300 SOUTH MONTGOMERY AVENUE,SHEFFIELD,COLBERT,Same as the national average,,Below the national average,,...,10019,Below the national average,,Above the national average,,AL,Above the national average,,35660,2019-07-30 06:20:01+00:00


In [91]:
gen_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5334 entries, 0 to 5333
Data columns (total 37 columns):
:@computed_region_csmy_5jwy                                      4879 non-null object
:@computed_region_f3tr_pr43                                      4930 non-null object
:@computed_region_nwen_78xc                                      4930 non-null object
address                                                          5334 non-null object
city                                                             5334 non-null object
county_name                                                      5334 non-null object
effectiveness_of_care_national_comparison                        5334 non-null object
effectiveness_of_care_national_comparison_footnote               1514 non-null object
efficient_use_of_medical_imaging_national_comparison             5334 non-null object
efficient_use_of_medical_imaging_national_comparison_footnote    2292 non-null object
emergency_services               

In [None]:
# TODO: make Provider ID the index and re-order columns more logically
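
One possible way to approach this TODO, sketched on a toy frame rather than the real `gen_info`. The choice of leading columns is an assumption made for illustration, and `provider_id` is treated as a string here, which may not match how it is ultimately stored.

```python
import pandas as pd

# Toy stand-in for gen_info, using two rows taken from the output above
toy_info = pd.DataFrame({
    'zip_code': ['35957', '36301'],
    'address': ['2505 U S HIGHWAY 431 NORTH', '1108 ROSS CLARK CIRCLE'],
    'city': ['BOAZ', 'DOTHAN'],
    'provider_id': ['10005', '10001'],
    'state': ['AL', 'AL'],
})

# Columns to show first (an illustrative ordering, not the author's final choice)
leading_cols = ['city', 'state', 'zip_code']

# Everything else keeps its existing relative order
remaining_cols = [col for col in toy_info.columns
                  if col not in leading_cols and col != 'provider_id']

# Index by provider ID, sort on it, then apply the new column order
tidy_info = (toy_info
             .set_index('provider_id')
             .sort_index()
             [leading_cols + remaining_cols])

print(tidy_info)
```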
