# Purpose

The code herein is used to pull together the chargemaster dataset

In [1]:
import pandas as pd
import requests

# The Process

1. Load up the metadata index of files
2. Extract embedded URLs wherever they may exist within the file
3. Merge extracted URLs and file format data into metadata DataFrame
4. Download files from URL column
5. Sit back and bask in the glow of a job well done.

## Load up the metadata index of files

In [165]:
metadata = pd.read_excel('chargemasters/chargemaster_index.xlsx', sheet_name = 'Sheet1',
                        index_col = 0)
metadata

Unnamed: 0_level_0,Hospital,URL,File Format,Notes,Secondary URL (e.g. common landing page)
Document ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
0,Atlanticare Regional Medical Center,https://www.atlanticare.org/assets/images/serv...,CSV,,
1,Aurora BayCare Medical Center,https://www.aurorahealthcare.org/-/media/auror...,CSV,,
2,Aurora Medical Center in Burlington,https://www.aurorahealthcare.org/-/media/auror...,CSV,,
3,Aurora Medical Center in Grafton,https://www.aurorahealthcare.org/-/media/auror...,CSV,,
4,Aurora Medical Center in Kenosha,https://www.aurorahealthcare.org/-/media/auror...,CSV,,
5,Aurora Lakeland Medical Center,https://www.aurorahealthcare.org/-/media/auror...,CSV,,
6,Aurora Medical Center in Manitowoc County,https://www.aurorahealthcare.org/-/media/auror...,CSV,,
7,Aurora Medical Center in Oshkosh,https://www.aurorahealthcare.org/-/media/auror...,CSV,,
8,Aurora Psychiatric Hospital,https://www.aurorahealthcare.org/-/media/auror...,CSV,,
9,Aurora Sheboygan Memorial Medical Center,https://www.aurorahealthcare.org/-/media/auror...,CSV,,


## Extract embedded URLs wherever they may exist within the file

First, we'll need to write up a function that can extract URLs that exist as embedded links within the metadata file. This is needed because sometimes copying and pasting proved simplest/quickest if I didn't stop to manually extract the links (e.g. for the hundreds of chargemasters available from the state of California).

In [211]:
import openpyxl
import pandas as pd


def get_URLs(cell_range, filepath='chargemasters/chargemaster_index.xlsx', sheet_name='Sheet1'):
    '''
    Extract embedded URLs from cells in an Excel workbook.

    Parameters
    ----------
    cell_range: two-element tuple of str. Defines the first and last cell in a column 
        from which you want to extract embedded URLs

    filepath: str. Relative (to working directory) filepath of workbook

    sheet_name: str. Name of worksheet in your workbook that contains the cells of interest

    Returns
    -------
    urls: pandas DataFrame with one string column called 'URL'. Each element is an extracted URL. 
        If a URL can't be found, element is None.
        Index of urls should match corresponding DataFrame row index for relevant subset of data if XLSX
        were imported as a DataFrame using read_excel(). Assumes that: Excel row number = (DataFrame index + 1)
    '''

    wb = openpyxl.load_workbook(filepath)
    ws = wb[sheet_name]
    cells = ws[cell_range[0]:cell_range[1]]
    index = [cell.row - 1 for e in cells for cell in e]

    # Whenever hyperlink is found, return URL as string, otherwise return None
    urls = [cell.hyperlink.target if cell.hyperlink is not None else None for e in cells for cell in e]

    return pd.DataFrame(urls, index=index, columns=['URL'])

## Merge extracted URLs and file format data into metadata DataFrame
**...and while you're at it, might as well download files from URL column too!**

In [221]:
import requests
import re

def download_file(row, filepath = 'chargemasters/'):
    '''
    Downloads file determined by row['URL'] to filepath, renaming it before saving. This function is expected
    to be used via apply(download_file, axis = 1) on a DataFrame containing 
    a column 'URL' generated by get_URLs()
    
    Parameters
    ----------
    row: pandas DataFrame row with a string representing a file download URL
    
    filepath: str. Dictates the directory into which the downloaded file will be stored.    
    
    
    Returns
    -------
    str. file extension downloaded in all caps (e.g. 'XLSX' or 'CSV')
    '''
    
    #Make sure URL not None
    if row['URL']:
        # response object
        r = requests.get(row['URL'], allow_redirects=True)
        file_info = r.headers.get('content-disposition')

        # Find the original filename and the file extension
        orig_filename = re.findall(r'filename="(.+)"', file_info)[0]
        file_ext = filename.rsplit(".")[1]

        #Name the downloaded file using the index of the URL from the DataFrame
        new_filename = str(row.name) + '.' + file_ext
        print(f"Downloading {orig_filename} as {new_filename}...")
        open('chargemasters/' + new_filename, 'wb').write(r.content)

        return file_ext.upper()

In [218]:
#Extract URLs and file formats and download chargemaster files
df = get_URLs(('C75', 'C846'))
df['File Format'] = temp_df.apply(download_file, axis = 1)
df

Downloading 106190812_CDM_All_2018.xlsx as 832.xlsx...
Downloading 106014050_CDM_All_2018.xlsx as 833.xlsx...
Downloading 106560481_Common25_2018.xlsx as 834.xlsx...
Downloading 106560481_PCT_CHG_2018.xlsx as 835.xlsx...
Downloading 106560481_CDM_2018.xlsx as 836.xlsx...
Downloading 106361370_Comments_2018.docx as 837.xlsx...
Downloading 106361370_Common25_2018.xlsx as 838.xlsx...
Downloading 106361370_CDM_2018.xlsx as 839.xlsx...
Downloading 106010987_CDM_All_2018.xlsx as 840.xlsx...
Downloading 106444013_CDM_All_2018.xlsx as 841.xlsx...
Downloading 106301379_CDM_All_2018.xlsx as 842.xlsx...
Downloading 106190883_CDM_All_2018.xlsx as 843.xlsx...
Downloading 106571086_Comments_2018.docx as 844.xlsx...
Downloading 106571086_CDM_All_2018.xls as 845.xlsx...
Downloading 106380939_CDM_All_2018.xlsx as 846.xlsx...


Unnamed: 0,URL,File Format
832,https://oshpd.ca.gov/ml/v1/resources/document?...,XLSX
833,https://oshpd.ca.gov/ml/v1/resources/document?...,XLSX
834,https://oshpd.ca.gov/ml/v1/resources/document?...,XLSX
835,https://oshpd.ca.gov/ml/v1/resources/document?...,XLSX
836,https://oshpd.ca.gov/ml/v1/resources/document?...,XLSX
837,https://oshpd.ca.gov/ml/v1/resources/document?...,XLSX
838,https://oshpd.ca.gov/ml/v1/resources/document?...,XLSX
839,https://oshpd.ca.gov/ml/v1/resources/document?...,XLSX
840,https://oshpd.ca.gov/ml/v1/resources/document?...,XLSX
841,https://oshpd.ca.gov/ml/v1/resources/document?...,XLSX


In [220]:
# Merge metadata df with output of download_files()
# (by overwriting existing data, if any, in URL and File Format columns)

metadata.loc[df.index, ['URL', 'File Format']] = temp_df
metadata.loc[df.index]

Unnamed: 0,Hospital,URL,File Format,Notes,Secondary URL (e.g. common landing page)
832,Valley Presbyterian Hospital,https://oshpd.ca.gov/ml/v1/resources/document?...,XLSX,CA mandated chargemaster,https://oshpd.ca.gov/data-and-reports/cost-tra...
833,Valleycare Medical Center,https://oshpd.ca.gov/ml/v1/resources/document?...,XLSX,CA mandated chargemaster,https://oshpd.ca.gov/data-and-reports/cost-tra...
834,Ventura County Medical Center,https://oshpd.ca.gov/ml/v1/resources/document?...,XLSX,CA mandated chargemaster,https://oshpd.ca.gov/data-and-reports/cost-tra...
835,Ventura County Medical Center,https://oshpd.ca.gov/ml/v1/resources/document?...,XLSX,CA mandated chargemaster,https://oshpd.ca.gov/data-and-reports/cost-tra...
836,Ventura County Medical Center,https://oshpd.ca.gov/ml/v1/resources/document?...,XLSX,CA mandated chargemaster,https://oshpd.ca.gov/data-and-reports/cost-tra...
837,Victor Valley Global Medical Center,https://oshpd.ca.gov/ml/v1/resources/document?...,XLSX,CA mandated chargemaster,https://oshpd.ca.gov/data-and-reports/cost-tra...
838,Victor Valley Global Medical Center,https://oshpd.ca.gov/ml/v1/resources/document?...,XLSX,CA mandated chargemaster,https://oshpd.ca.gov/data-and-reports/cost-tra...
839,Victor Valley Global Medical Center,https://oshpd.ca.gov/ml/v1/resources/document?...,XLSX,CA mandated chargemaster,https://oshpd.ca.gov/data-and-reports/cost-tra...
840,Washington Hospital - Fremont,https://oshpd.ca.gov/ml/v1/resources/document?...,XLSX,CA mandated chargemaster,https://oshpd.ca.gov/data-and-reports/cost-tra...
841,Watsonville Community Hospital,https://oshpd.ca.gov/ml/v1/resources/document?...,XLSX,CA mandated chargemaster,https://oshpd.ca.gov/data-and-reports/cost-tra...
