# Purpose

This notebook largely serves to allow me to sift through all of the chargemasters and metadata generated via the work already done in [this wonderful repo](https://github.com/vsoch/hospital-chargemaster) (from which I forked my repo). 

Based upon the originating repo's own README, there's at least some data collection that still needs to be done (e.g. [data from hospitalpriceindex.com](https://search.hospitalpriceindex.com/hospital/Barnes-Jewish-Hospital/5359?page=1) has to be scraped but they're denying IP addresses that try to do so). However, if any of that gets done in here, it won't be until after I've sifted through the current material to make sure I have a handle on what data are already available. 

It's fairly plausible that I'll then attempt to combine it all into a single SQLite or MongoDB database that can subsequently be analyzed. But I'm getting ahead of myself: first I need to figure out what we have to work with!

# Background

*Assume everything in this cell is quoted directly from the originating repo README, albeit with some extra content removed for the purposes of streamlining. Anything in italics like this should be assumed to be editorial additions by me.*

**From the original README:**

## Get List of Hospital Pages
We have compiled a list of hospitals and links in the [hospitals.tsv](hospitals.tsv) 
file, generated via the [0.get_hospitals.py](0.get_hospitals.py) script *which pulls these data from [a Quartz article](https://qz.com/1518545/price-lists-for-the-115-biggest-us-hospitals-new-transparency-law/) detailing ~115 hospital URLs from which the authors were able to find chargemasters in one form or another*. 

The file includes the following variables, separated by tabs:

 - **hospital_name** is the human friendly name
 - **hospital_url** is the human friendly URL, typically the page that includes a link to the data.
 - **hospital_id** is the unique identifier for the hospital, the hospital name, in lowercase, with spaces replaced with `-`
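
*A minimal sketch of that naming rule, assuming it is nothing more than lowercasing the name and swapping spaces for `-`. The function name `make_hospital_id` is mine and does not come from the originating repo:*

```python
def make_hospital_id(hospital_name: str) -> str:
    """Lowercase the name and replace spaces with '-' (assumed rule)."""
    return hospital_name.lower().replace(" ", "-")


# Punctuation such as parentheses and periods is left untouched.
print(make_hospital_id("Baptist Hospital (Miami)"))  # baptist-hospital-(miami)
```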
   
## Organize Data

Each hospital has records kept in a subfolder in the [data](data) folder. Specifically,
each subfolder is named according to the hospital name (made all lowercase, with spaces 
replaced with `-`). If a subfolder begins with an underscore, it means that I wasn't
able to find the charge list on the hospital site (and maybe you can help?) 
Within that folder, you will find:

 - `scrape.py`: A script to scrape the data
 - `browser.py`: If we need to interact with a browser, we use selenium to do this.
 - `latest`: a folder with the last scraped (latest) data files
 - `YYYY-MM-DD` folders, where each folder includes:
   - `records.json` the complete list of records scraped for a particular date
   - `*.csv` or `*.xlsx` or `*.json`: the scraped data files.
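
*To make the layout concrete, here is a rough sketch that lists the dated snapshot folders for one hospital. It assumes the dated folders are named exactly `YYYY-MM-DD`; the helper name `dated_snapshots` is hypothetical and not part of the originating repo:*

```python
import re
from pathlib import Path

# Matches folder names such as 2019-01-05 (format only, not calendar validity).
DATE_DIR = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def dated_snapshots(hospital_dir):
    """Return the YYYY-MM-DD subfolders of a hospital folder, oldest first."""
    root = Path(hospital_dir)
    # ISO-style dates sort chronologically when sorted as plain text.
    return sorted(
        p for p in root.iterdir() if p.is_dir() and DATE_DIR.match(p.name)
    )
```

*Calling `dated_snapshots("data/barnes-jewish-hospital")` would skip `latest`, `scrape.py`, and anything else that is not a dated folder.*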

## Parsing
This is likely one of the hardest steps. I wanted to see the extent to which I could
create a simple parser that would generate a single TSV (tab separated value) file
per hospital, with minimally an identifier for a charge, and a price in dollars. If
provided, I would also include a description and code:

 - **charge_code**
 - **price**
 - **description**
 - **hospital_id**
 - **filename**

Each of these parsers is also in the hospital subfolder, and named "parser.py". The parser would output a data-latest.tsv file at the top level of the folder, along with a copy dated by year (`data-<year>.tsv`). At some point
I realized that there were different kinds of charges, including inpatient, outpatient, DRG (diagnosis-related group) and others called
"standard" or "average." I then went back and added an additional column
to the data:

 - **charge_type** can be one of standard, average, inpatient, outpatient, drg, or (if more detail is supplied) insured, uninsured, pharmacy, or supply. This is not a gold standard labeling but a best effort. If not specified, I labeled as standard, because this would be a good assumption.
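
*A small sketch of how a parsed row could be checked against the columns and `charge_type` values listed above. It assumes each row arrives as a plain dict (for example from `csv.DictReader`) and that a blank `charge_type` means "standard"; the names `EXPECTED_COLUMNS`, `CHARGE_TYPES`, and `check_row` are my own:*

```python
EXPECTED_COLUMNS = {
    "charge_code", "price", "description",
    "hospital_id", "filename", "charge_type",
}
CHARGE_TYPES = {
    "standard", "average", "inpatient", "outpatient", "drg",
    "insured", "uninsured", "pharmacy", "supply",
}


def check_row(row):
    """Return a list of human-readable problems found in one parsed row."""
    problems = []
    missing = EXPECTED_COLUMNS - set(row)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    # Assumption: an empty or absent charge_type falls back to 'standard'.
    charge_type = row.get("charge_type") or "standard"
    if charge_type not in CHARGE_TYPES:
        problems.append(f"unexpected charge_type: {charge_type!r}")
    return problems
```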

# Exploratory Data Analysis (EDA)

OK, I think I have a handle on this, let's take a look at the data, starting with the hospital metadata.

It may be a bit confusing for anyone following along at home, but note that I had already started my own effort to download a bunch of these chargemasters in a far more manual approach than what @vsoch did with `BeautifulSoup` and all. As a result, I may be comparing her dataset at times in this notebook to the one I was developing. Mine was never going to have automated updates like hers however, hence why I'm deferring to her repo and data over my own (while mine may be a bit more comprehensive in terms of number of hospitals, I'd rather the data be up to date as much as possible). 

That said, as of this writing, I have more than 600 unique hospitals included in my chargemaster index, so I'm going to keep an eye out during my EDA to see if my more manual approach may still be useful.

## Metadata

In [1]:
#Make sure any changes to custom packages can be reflected immediately 
#in the notebook without kernel restart
#(the extension is loaded via the magic below; no separate import is needed)
%load_ext autoreload
%autoreload 2

In [2]:
#Import the hospital metadata

import pandas as pd
#Use a literal tab: a raw r'\t' is treated as a regex and forces the slower python engine
metadata = pd.read_csv('hospitals.tsv', sep='\t')
metadata



Unnamed: 0,hospital_name,hospital_id,hospital_url
0,Atlanticare Regional Medical Center,atlanticare-regional-medical-center,https://www.atlanticare.org/patients-and-visit...
1,Aurora Health Care Metro Inc.,aurora-health-care-metro-inc.,https://www.aurorahealthcare.org/patients-visi...
2,Baptist Health System (San Antonio),baptist-health-system-(san-antonio),https://www.baptisthealthsystem.com/for-patien...
3,Baptist Hospital (Miami),baptist-hospital-(miami),https://baptisthealth.net/en/facilities/baptis...
4,Baptist Medical Center (Jacksonville),baptist-medical-center-(jacksonville),https://www.baptistjax.com/patient-info/billin...
5,Barnes Jewish Hospital,barnes-jewish-hospital,https://www.bjc.org/For-Patients-Billing-Visit...
6,Brigham and Womens Hospital,brigham-and-womens-hospital,https://www.partners.org/for-patients/Patient-...
7,California Pacific Medical Center,california-pacific-medical-center,https://www.sutterhealth.org/cpmc/for-patients...
8,California Pacific Medical Center R.K. Davies ...,california-pacific-medical-center-r.k.-davies-...,https://www.sutterhealth.org/for-patients/heal...
9,Carolinas Medical Center,carolinas-medical-center,https://atriumhealth.org/for-patients-visitors...


In [3]:
#Do these data include the huge CA chargemaster dataset on oshpd.ca.gov?
#regex=False so the dots in the domain are matched literally
metadata[metadata['hospital_url'].str.contains('oshpd.ca.gov', regex=False)]

Unnamed: 0,hospital_name,hospital_id,hospital_url
34,Loma Linda University Medical Center,loma-linda-university-medical-center,https://oshpd.ca.gov/data-and-reports/cost-tra...
36,Lucile Packard Childrens Hospital,lucile-packard-childrens-hospital,https://oshpd.ca.gov/data-and-reports/cost-tra...
66,Ronald Reagan UCLA Medical Center,ronald-reagan-ucla-medical-center,https://oshpd.ca.gov/data-and-reports/cost-tra...
76,Stanford Hospitals and Clinics,stanford-hospitals-and-clinics,https://oshpd.ca.gov/data-and-reports/cost-tra...


*Interesting, it looks like this dataset may not include the full OSHPD chargemaster list.* I find that unlikely however, as @vsoch makes it clear in another part of her README that she built the `oshpd-ca` scraper and found 795+ hospitals in that dataset. **This suggests that this metadata file may not be complete, as it's really just an inventory of the links from the aforementioned Quartz article, and not necessarily a full accounting of the hospitals contained in the dataset.**

## Tabulated Data

OK, so the metadata table wasn't super useful (investigating `data/oshpd-ca` confirmed that indeed there are far more files in this repo than hospitals listed in `hospitals.tsv`), but there are **a lot** of files to plow through here! And vsoch was kind enough to try and compile them whenever appropriate in the various hospital/site-specific folders within `data` as `data-latest[-n].tsv` (`-n` indicates that, if the file gets above 100 MB, it's split into `data-latest-1.tsv`, `data-latest-2.tsv`, etc. to avoid going over the GitHub per-file size limit).
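
Since a large hospital or site may be spread over several `data-latest-n.tsv` parts, it can help to group the matching paths by folder and put the parts in numeric order (so that `-10` comes after `-2`). The sketch below is only an illustration under that assumption; `group_by_hospital` is a hypothetical helper, and it expects forward-slash paths like the ones `iglob` returns on macOS/Linux.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# data-latest.tsv has no part number; data-latest-3.tsv is part 3.
PART = re.compile(r"^data-latest(?:-(\d+))?\.tsv$")


def group_by_hospital(paths):
    """Map each hospital folder name to its TSV parts in numeric order."""
    groups = defaultdict(list)
    for raw in paths:
        path = PurePosixPath(raw)
        match = PART.match(path.name)
        if match is None:
            continue  # not a compiled data-latest file
        part = int(match.group(1) or 0)
        groups[path.parent.name].append((part, raw))
    return {
        folder: [raw for _, raw in sorted(parts)]
        for folder, parts in groups.items()
    }
```

Passing it the output of `iglob('data/*/data-latest*.tsv')` gives one entry per folder, which makes it easier to see which sites were split.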

Let's try to parse all of these TSV files into a single coherent DataFrame! The entire `data` folder is less than 4 GB, and I'm confident that more than half of that is individual XLSX/CSV files, so I think this should be something we can hold in memory easily enough.

...still, we'll use some tricks (e.g. making the sub-dataframes as a generator instead of a list) to ensure optimal memory usage, just to be safe.

In [5]:
# Search through the data/hospital-id folders for data-latest[-n].tsv files
# so you can concatenate them into a single DataFrame
from glob import iglob

for name in iglob('data/*/data-latest*.tsv'):
    print(name)

data/university-of-virginia-medical-center/data-latest.tsv
data/university-hospitals-case-medical-center/data-latest.tsv
data/montefiore-medical-center/data-latest.tsv
data/swedish-medical-center/data-latest.tsv
data/temple-university-hospital/data-latest.tsv
data/rush-university-medical-center/data-latest.tsv
data/long-island-jewish-medical-center/data-latest.tsv
data/advent-health/data-latest.tsv
data/atlanticare-regional-medical-center/data-latest.tsv
data/northshore-university-health-system/data-latest.tsv
data/university-of-iowa-hospitals-and-clinics/data-latest.tsv
data/orlando-health/data-latest.tsv
data/north-shore-university-hospital/data-latest.tsv
data/jackson-memorial/data-latest.tsv
data/carolinas-medical-center/data-latest.tsv
data/st.-luke’s-hospital-(san-francisco)/data-latest.tsv
data/memorial-regional-hospital/data-latest.tsv
data/geisinger-medical-center/data-latest.tsv
data/aurora-health-care-metro-inc./data-latest.tsv
data/uc-irvine-medical-center/data-latest.tsv
...

In [11]:
# Setup the full dataframe using iterators/generators to save on memory
all_files = iglob('data/*/data-latest*.tsv')
individual_dfs = (pd.read_csv(f, delimiter = r'\t') for f in all_files)

import ipdb

ipdb.set_trace() # set breakpoint for debugging
df = pd.concat(individual_dfs, ignore_index=True)

df.info(memory_usage = 'deep')

--Return--
None
> <ipython-input-11-26e232755eb7>(7)<module>()
      6 
----> 7 ipdb.set_trace() # set breakpoint for debugging
      8 df = pd.concat(individual_dfs, ignore_index=True)

ipdb> n
> /Users/emigre459/anaconda3/envs/HospitalPricing/lib/python3.6/site-packages/IPython/core/interactiveshell.py(3299)run_code()
   3298                 # Reset our crash handler in place
-> 3299                 sys.excepthook = old_excepthook
   3300         except 

ParserError: Expected 6 fields in line 608919, saw 7. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.

> /Users/emigre459/anaconda3/envs/HospitalPricing/lib/python3.6/site-packages/IPython/core/interactiveshell.py(3215)run_ast_nodes()
   3214                     if (yield from self.run_code(code, result)):
-> 3215                         return True
   3216 

ipdb> n
Internal StopIteration: True
> /Users/emigre459/anaconda3/envs/HospitalPricing/lib/python3.6/site-packages/IPython/core/interactiveshell.py(3049)run_cell_async()
   3048                 has_raised = yield from self.run_ast_nodes(code_ast.body, cell_name,
-> 3049                        interactivity=interactivity, compiler=compiler, result=result)

In [None]:
# TODO: need to figure out which characters are breaking the parse (e.g. double quotes)
# and what file they're in, so I can manually correct the file in question

# Looking into line 608,919. Which file is that? Assume they're being accessed in same order
# as iglob-generated filepath list
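
The error message itself points at a likely cause. `delimiter = r'\t'` is a two-character string (a backslash and a `t`), which pandas interprets as a regular expression, and regex delimiters are prone to ignoring quoted data. A tab inside a quoted description would then count as an extra field. Also, the line number in a `ParserError` is counted within whichever file was being read at the time, not cumulatively across files, so every file needs checking rather than just the one that would contain a running total of 608,919 lines.

Below is a standard-library sketch for hunting down such rows. It is only an approximation of what pandas does: it compares a naive split on tabs (roughly what a regex delimiter does) with a quote-aware parse from the `csv` module. The function name `find_ragged_lines` is hypothetical, and it assumes the first line of each file is the header.

```python
import csv


def find_ragged_lines(paths, delimiter="\t"):
    """Yield (path, line_number, naive_fields, quoted_fields) for suspect rows.

    A row is suspect when a plain split on the delimiter gives a different
    number of fields than the header has.
    """
    for path in paths:
        with open(path, newline="", encoding="utf-8", errors="replace") as fh:
            header = fh.readline().rstrip("\r\n")
            expected = len(header.split(delimiter))
            # Data rows start on line 2 because line 1 is the header.
            for line_number, line in enumerate(fh, start=2):
                text = line.rstrip("\r\n")
                naive = len(text.split(delimiter))
                if naive == expected:
                    continue
                # Re-parse the same row while honoring double quotes.
                quoted = len(next(csv.reader([text], delimiter=delimiter)))
                yield path, line_number, naive, quoted
```

Running `list(find_ragged_lines(iglob('data/*/data-latest*.tsv')))` should list the candidate rows. Where `quoted_fields` matches the header but `naive_fields` does not, the row is probably fine and only the delimiter choice is at fault; in that case switching to `sep='\t'` in `pd.read_csv` (a single character, so quotes are honored) may be enough without editing any file by hand.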

