# Transparency in Coverage Data with Python: Part 1

On July 1, 2022, the [Centers for Medicare & Medicaid Services](https://www.cms.gov/) (CMS) began enforcing its new [Transparency in Coverage Rule](https://www.cms.gov/healthplan-price-transparency), which requires almost all US health plans to post very detailed pricing data about the vast majority of the services and procedures these plans cover.

In this Jupyter Notebook series, I will provide code and commentary that enables mere mortals like you and me to draw insights from the Transparency in Coverage datasets that payers have released. Some of these datasets are quite large - and that's no problem - we'll go through methods using the Open Source programming language [Python](https://www.python.org/) to efficiently crawl through these files and extract exactly what we need.

*(The target audience for this notebook series is actuaries who have some familiarity with the Python language, but you can probably also learn as you go if you've not seen it before.)*

## Transparency in Coverage Schema
The [CMS has published a guide on Github](https://github.com/CMSgov/price-transparency-guide/) that defines how the data required for compliance with the new Transparency in Coverage Rule should be packaged by payers, in accordance with [85 FR 72158](https://www.federalregister.gov/documents/2019/11/27/2019-25011/transparency-in-coverage).

Our first step will be to select an arbitrary "Table of Contents" (hereafter, "TOC") file from a payer to use as an example. The [TOC file schema](https://github.com/CMSgov/price-transparency-guide/tree/master/schemas/table-of-contents) is described here. 

Payer data is required to be stored without any "gatekeeping" of any kind. Therefore, in most cases, you can [Google Search](https://www.google.com/search?q=transparency+in+coverage+machine+readable+carefirst&sxsrf=ALiCzsbF3J4fo2uJfS8HNcU6iwmNNC-ikA%3A1661350949782&ei=JTQGY5KfL_GYptQP-ZivwAM&ved=0ahUKEwjSxdXJ1t_5AhVxjIkEHXnMCzgQ4dUDCA4&uact=5&oq=transparency+in+coverage+machine+readable+carefirst&gs_lcp=Cgdnd3Mtd2l6EAMyBQgAEKIEMgUIABCiBDIFCAAQogQ6BwgAEEcQsAM6BwgjELACECc6CAgAEB4QCBANOgUIABCGAzoHCAAQHhCiBDoECCEQCkoECEEYAEoECEYYAFDbCljSFWCZLWgFcAB4AIABZYgB4wqSAQQxOC4xmAEAoAEByAEIwAEB&sclient=gws-wiz) your way to the sections of large payer websites where this data is kept.

Some payers have elected to mask downloadable links through JavaScript wrappers. It is uncertain whether or not this practice will be tolerated as compliant activity by the CMS. 

For no particular reason, we'll select the TOC file for group HMO plans administered by CareFirst, the primary BCBS payer in Maryland.

In [433]:
import requests

URL = "https://mrf.carefirst.com/mrf-files/Table-Of-Content-Carefirst-HMO.json"
response = requests.get(URL)
open("toc_file.json", "wb").write(response.content)

542163

## A quick JSON primer

JSON introduction. Note nested data structure.
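
As a minimal sketch of what "nested" means here, the snippet below parses a tiny, made-up JSON document with Python's built-in `json` module. The field names mimic the TOC schema, but the values are invented for illustration and do not come from any payer file.

```python
import json

# A tiny, made-up document shaped like a TOC file (values are illustrative only)
raw = """
{
  "reporting_entity_name": "Example Payer",
  "reporting_entity_type": "HEALTH INSURANCE ISSUER",
  "reporting_structure": [
    {
      "reporting_plans": [
        {"plan_name": "Example HMO", "plan_id_type": "EIN",
         "plan_id": "00-0000000", "plan_market_type": "group"}
      ],
      "in_network_files": [
        {"description": "example in-network file",
         "location": "https://example.com/in-network.json"}
      ]
    }
  ]
}
"""

doc = json.loads(raw)

# JSON objects become dicts, JSON arrays become lists, and they nest freely
first_structure = doc["reporting_structure"][0]
first_plan = first_structure["reporting_plans"][0]
print(type(doc).__name__, type(doc["reporting_structure"]).__name__)
print(first_plan["plan_name"])
```

Note that `json.loads` reads the whole document into memory at once. That is fine for a toy example, but it is the reason the rest of this notebook uses a streaming parser instead.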

## Reading the Table of Contents File

Per the CMS Standard, the first level of the TOC Schema should consist of elements that contain basic information about the info we're reading in. Let's start with getting that info.

In [434]:
import ijson.backends.yajl2 as ijson
import pandas as pd

toc_file = open('./toc_file.json')

base_toc = [(prefix, event, value) for (prefix, event, value) in ijson.parse(toc_file) if prefix == '']

print('\n'.join('{}: {}'.format(*k) for k in enumerate(base_toc)))

toc_file.close()

0: ('', 'start_map', None)
1: ('', 'map_key', 'reporting_entity_name')
2: ('', 'map_key', 'reporting_entity_type')
3: ('', 'map_key', 'reporting_structure')
4: ('', 'end_map', None)


Note the line:

`base_toc = [(prefix, event, value) for (prefix, event, value) in ijson.parse(toc_file) if prefix == '']`

Here, we are using a special feature of Python (and some other languages) known as [List Comprehension](https://en.wikipedia.org/wiki/List_comprehension). This is a way of building a list containing a subset of information drawn from elements of a larger set. In plain language, what this means is:

1. We'd like to call `ijson.parse` on our entire `toc_file`, which is an open connection to our toc_file.json we downloaded in the previous section.
2. From this method, we want all the 3-tuple combinations of `(prefix, event, value)` that `ijson.parse` has returned.
3. We want to subset this further such that the list only contains 3-tuples where `prefix` is an empty string - meaning that we're looking at the root-level JSON elements for this file.
4. The square brackets `[]` around this code indicate we want the subset of 3-tuples to be expressed as a list, one that is assigned to the variable `base_toc`.
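
To make the comprehension less mysterious, here is a small sketch that applies the same filter to a hand-written list of `(prefix, event, value)` tuples. The tuples are typed in by hand to resemble what `ijson.parse` emitted above; they are an assumption for illustration, not actual parser output.

```python
# Hand-written stand-ins for the (prefix, event, value) tuples a streaming parser yields
events = [
    ("", "start_map", None),
    ("", "map_key", "reporting_entity_name"),
    ("reporting_entity_name", "string", "Example Payer"),
    ("", "map_key", "reporting_structure"),
    ("reporting_structure", "start_array", None),
    ("reporting_structure", "end_array", None),
    ("", "end_map", None),
]

# The list comprehension: keep only tuples whose prefix is the empty string
root_level = [(prefix, event, value) for (prefix, event, value) in events if prefix == ""]

# The equivalent explicit loop, for comparison
root_level_loop = []
for (prefix, event, value) in events:
    if prefix == "":
        root_level_loop.append((prefix, event, value))

print(root_level == root_level_loop)
print(len(root_level))
```

Both forms build the same list; the comprehension just says it in one line.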

We can easily compare the output here to the base-level elements we're expecting to see, as described by the [CMS published standard](https://github.com/CMSgov/price-transparency-guide/tree/master/schemas/table-of-contents) for this data.

![An inline image screenshot of the CMS published standard for table of contents files.](./images/CMS_TOC_Standard.png "CMS TOC root-level elements")


So, we're on the right track! Let's press on. Since `reporting_entity_name` and `reporting_entity_type` should have character string values, pull those out.

In [435]:
toc_file = open('./toc_file.json')
reporting_entity_name = [(prefix, event, value) for (prefix, event, value) in ijson.parse(toc_file) if prefix == 'reporting_entity_name']
print('\n'.join('{}: {}'.format(*k) for k in enumerate(reporting_entity_name)))
toc_file.close()

toc_file = open('./toc_file.json')
reporting_entity_type = [(prefix, event, value) for (prefix, event, value) in ijson.parse(toc_file) if prefix == 'reporting_entity_type']
print('\n'.join('{}: {}'.format(*k) for k in enumerate(reporting_entity_type)))
toc_file.close()


0: ('reporting_entity_name', 'string', 'CareFirst Inc')
0: ('reporting_entity_type', 'string', 'HEALTH INSURANCE ISSUER')


Hopefully this is as expected. Also note that there could be multiple entities in the TOC file, but it appears this particular TOC file has only one reporting entity.

Now let's peel the onion back a bit. According to the CMS standard, the `reporting_structure` item that is a root member of the TOC file is itself a JSON array ("\[a\]n array of reporting structure object types" per the standard), which we can express as a Python list. So let's investigate and see if the `reporting_structure` array is what we're expecting, which is this:

![An inline image screenshot of the CMS published standard for reporting_structure_objects.](./images/CMS_Reporting_Structure_Object.png "CMS Reporting Structure Object root-level elements")


In [436]:
toc_file = open('./toc_file.json')

reporting_structure_objects = [(prefix, event, value) for (prefix, event, value) in ijson.parse(toc_file) if prefix == 'reporting_structure.item' and event == 'start_map']
print('\n'.join('{}: {}'.format(*k) for k in enumerate(reporting_structure_objects)))

toc_file.close()


0: ('reporting_structure.item', 'start_map', None)
1: ('reporting_structure.item', 'start_map', None)
2: ('reporting_structure.item', 'start_map', None)
3: ('reporting_structure.item', 'start_map', None)
4: ('reporting_structure.item', 'start_map', None)
5: ('reporting_structure.item', 'start_map', None)


As expected, we see that there is an array of `reporting_structure` objects, as defined above. Are the contents as expected?

In [437]:
toc_file = open('./toc_file.json')

reporting_structure_contents = [{'key': event, 'value': value} for (prefix, event, value) in ijson.parse(toc_file) if prefix == 'reporting_structure.item' and event == 'map_key']
reporting_structure_contents_df = pd.DataFrame(reporting_structure_contents)

display(reporting_structure_contents_df)

toc_file.close()

Unnamed: 0,key,value
0,map_key,reporting_plans
1,map_key,in_network_files
2,map_key,allowed_amount_file
3,map_key,reporting_plans
4,map_key,in_network_files
5,map_key,allowed_amount_file
6,map_key,reporting_plans
7,map_key,in_network_files
8,map_key,allowed_amount_file
9,map_key,reporting_plans


That's right - each of the six `reporting_structure` objects contains the three objects expected.

But what we really came here for is the treasure trove of information in these files, right? So let's cut to the chase.

## Extracting the Reporting Structure Information

There is nothing in the Reporting Structure Object definition in the CMS standard that contains identifying information about the structure itself: in other words, there is no string value within `reporting_structure` that names the structure. Instead, the six `reporting_structure` objects here are provided, and each one has the following features:

1. `reporting_plans`: an array of Reporting Plan objects
2. `in_network_files`: an array of File Location objects associated with each item in `reporting_plans`
3. `allowed_amount_file`: a single File Location object associated with each item in `reporting_plans`

Knowing this up front makes it very easy for us to start scraping and organizing this data. Let's begin!

### Reporting Plans Array

We know this is an array of Reporting Plans objects. This is what we expect these objects to look like:

![An inline image screenshot of the CMS published standard for reporting_plan objects.](./images/CMS_Reporting_Plans_Object.png "CMS Reporting Plan Object root-level elements")

Note that this object is the lowest "level" we can peel back, since every data type here is basically a "primitive" data type - which is computer science-speak for a single-valued data type, vs. an array or an object containing other data inside of it. This means that we can make a table that represents all the Reporting Plans objects, because each item in this array of Reporting Plans objects can be represented as a [Python `dict` type](https://en.wikibooks.org/wiki/Python_Programming/Dictionaries), which can be thought of as an individual row in a data table. Since there are presumably *many* Reporting Plans objects, we can iterate over all of them with our parser and assemble them into a `pandas` DataFrame.

In [438]:
toc_file = open('./toc_file.json')

reporting_plans = pd.DataFrame(ijson.items(toc_file, 'reporting_structure.item.reporting_plans.item'))
reporting_plans['index'] = reporting_plans.reset_index().index
display(reporting_plans)

toc_file.close()

Unnamed: 0,plan_name,plan_id_type,plan_id,plan_market_type,index
0,BLUECHOICE ADVANTAGE_POS,EIN,52-6002033,group,0
1,BLUECHOICE HMO HDHP INTEG DED_HMO,HIOS,10207VA038,individual,1
2,BLUECHOICE HMO HDHP INTEG DED_HMO,HIOS,28137MD037,individual,2
3,BLUECHOICE HMO HDHP INTEG DED_HMO,HIOS,86052DC040,individual,3
4,BLUECHOICE HMO HDHP NON INTDED_HMO,HIOS,10207VA038,individual,4
...,...,...,...,...,...
134,DHMO - BlueDental HMO_HMO,EIN,83-4713006,group,134
135,DHMO - BlueDental HMO_HMO,EIN,87-0787360,group,135
136,DHMO - Dental HMO_HMO,EIN,52-0348850,group,136
137,DHMO - Dental HMO_HMO,EIN,52-2064235,group,137


This is very encouraging, finally some useful data! It looks like there are 139 `reporting_plans` objects. But one challenge here is that we don't know how these items relate to their parent objects, `reporting_structure`. We know there are 6 `reporting_structure` objects within this TOC file, but how do these 139 plans relate to each of the six reporting structures? We'll want to keep track of what `reporting_structure` each `reporting_plans` object belongs to!

In [439]:
toc_file = open('./toc_file.json')

reporting_plans = ijson.items(toc_file, 'reporting_structure.item.reporting_plans')
plan_frames = []
for (struct_num, rep_plan) in enumerate(reporting_plans):
    plan_details = pd.DataFrame([plan for plan in rep_plan])
    plan_details['reporting_structure_number'] = struct_num
    plan_frames.append(plan_details)
# DataFrame.append was removed in pandas 2.0, so concatenate the pieces once at the end
rps = pd.concat(plan_frames, ignore_index=True)
    
display(rps)

toc_file.close()

Unnamed: 0,plan_name,plan_id_type,plan_id,plan_market_type,reporting_structure_number
0,BLUECHOICE ADVANTAGE_POS,EIN,52-6002033,group,0
1,BLUECHOICE HMO HDHP INTEG DED_HMO,HIOS,10207VA038,individual,1
2,BLUECHOICE HMO HDHP INTEG DED_HMO,HIOS,28137MD037,individual,1
3,BLUECHOICE HMO HDHP INTEG DED_HMO,HIOS,86052DC040,individual,1
4,BLUECHOICE HMO HDHP NON INTDED_HMO,HIOS,10207VA038,individual,2
...,...,...,...,...,...
134,DHMO - BlueDental HMO_HMO,EIN,83-4713006,group,4
135,DHMO - BlueDental HMO_HMO,EIN,87-0787360,group,4
136,DHMO - Dental HMO_HMO,EIN,52-0348850,group,5
137,DHMO - Dental HMO_HMO,EIN,52-2064235,group,5


Whew, okay, that's much better! We can copy this approach when we take on the In-Network Files.
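
The same bookkeeping idea can be sketched without pandas or ijson. The nested list below is invented sample data; `enumerate` supplies the position of each parent so that every child row can carry it along.

```python
# Invented sample data: two reporting structures, each with its own list of plans
reporting_structure = [
    {"reporting_plans": [{"plan_name": "Plan A"}, {"plan_name": "Plan B"}]},
    {"reporting_plans": [{"plan_name": "Plan C"}]},
]

rows = []
for struct_num, structure in enumerate(reporting_structure):
    for plan in structure["reporting_plans"]:
        # Copy the plan and tag it with the index of the structure it came from
        rows.append({**plan, "reporting_structure_number": struct_num})

for row in rows:
    print(row)
```

Because the parent index travels with every row, the flattened table can always be regrouped or joined back to its `reporting_structure` later.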

### In-Network Files Array

Here again we have to be careful - just reading in all the items under "start_map" events in all items marked "in_network_files" gets us a giant list of 840 `file_location` objects. And we have no way to relate these back to the `reporting_structure` or `reporting_plans` objects we've already discovered.

In [440]:
toc_file = open('./toc_file.json')

in_network_files = pd.DataFrame(ijson.items(toc_file, 'reporting_structure.item.in_network_files.item'))
in_network_files['index'] = in_network_files.reset_index().index
display(in_network_files)

toc_file.close()

Unnamed: 0,description,location,index
0,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,0
1,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,1
2,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,2
3,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,3
4,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,4
...,...,...,...
835,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,835
836,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,836
837,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,837
838,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,838


Here, we see that there are 840 `file_location` objects across the `in_network_files` arrays, but there are only 139 `reporting_plans` - so we will need to take care to group the `in_network_files` array items alongside the `reporting_structure` items to which they belong, just as we did earlier with `reporting_plans`.

In [441]:
toc_file = open('./toc_file.json')

in_network_files = ijson.items(toc_file, 'reporting_structure.item.in_network_files')
inf_frames = []
for (struct_num, in_netw_f) in enumerate(in_network_files):
    in_net_f_detail = pd.DataFrame([inf for inf in in_netw_f])
    in_net_f_detail['reporting_structure_number'] = struct_num
    inf_frames.append(in_net_f_detail)
# DataFrame.append was removed in pandas 2.0, so concatenate the pieces once at the end
infs = pd.concat(inf_frames, ignore_index=True)
    
display(infs)
display(pd.DataFrame(infs.value_counts('reporting_structure_number')))

toc_file.close()

Unnamed: 0,description,location,reporting_structure_number
0,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,0
1,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,0
2,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,0
3,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,0
4,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,0
...,...,...,...
835,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,5
836,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,5
837,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,5
838,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,5


Unnamed: 0_level_0,0
reporting_structure_number,Unnamed: 1_level_1
5,140
4,140
3,140
2,140
1,140
0,140


Perfect, we have now collected all the `in_network_files` items, and furthermore have kept track of which `reporting_structure` they belong to. But it looks like there are 140 `file_location` objects attributed to each `reporting_structure`, which suggests that these `reporting_structure` objects and the plans therein all point to the exact same rate files.

Let's verify this with two tests:

In [442]:
# First, if we select distinct location values, out of the entire table, are there only 140?
infs_locs = infs[['location']]

infs_nodups = infs_locs.drop_duplicates().count()
print("{} == 140? {}".format(infs_nodups.values, infs_nodups.values==140))

# Second, does the list of distinct URLs survive an exclusion join with the entire original table?
infs_test2 = pd.merge(infs_locs.drop_duplicates(), infs_locs, on='location', how="outer", indicator=True)
infs_test2 = infs_test2[infs_test2['_merge'] == 'left_only']['location'].count()
print("{} == 0? {}".format(infs_test2, infs_test2==0))

[140] == 140? [ True]
0 == 0? True
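
Another way to frame the same question, sketched here on invented URLs with plain Python sets: group the locations by `reporting_structure_number` and confirm that every group holds an identical set. The pairs below are made up and merely stand in for the rows of `infs`.

```python
from collections import defaultdict

# Invented (reporting_structure_number, location) pairs standing in for the rows of `infs`
pairs = [
    (0, "https://example.com/a.json"), (0, "https://example.com/b.json"),
    (1, "https://example.com/a.json"), (1, "https://example.com/b.json"),
    (2, "https://example.com/b.json"), (2, "https://example.com/a.json"),
]

urls_by_structure = defaultdict(set)
for struct_num, location in pairs:
    urls_by_structure[struct_num].add(location)

# If every structure lists the same files, all of the sets equal the first one
url_sets = list(urls_by_structure.values())
all_identical = all(s == url_sets[0] for s in url_sets)
print(all_identical)
```

Comparing sets ignores ordering, so a structure that lists the same files in a different order still counts as identical.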


### Allowed Amount File Object

As we can see in the `reporting_structure` Object definition above, the `allowed_amount_file` item should refer to only one `file_location` object and not an array of them, as did the `in_network_files` item. This should make collecting the associated URL for the Allowed Amount File very easy - and we should expect only six of them.

In [443]:
toc_file = open('./toc_file.json')

allowed_amt_file = ijson.items(toc_file, 'reporting_structure.item.allowed_amount_file')

aafs = pd.DataFrame([aaf for aaf in allowed_amt_file])    
aafs['reporting_structure_number'] = aafs.index
display(aafs)

toc_file.close()

Unnamed: 0,description,location,reporting_structure_number
0,Carefirst allowed amount HMO file,https://mrf.carefirst.com/mrf-files/allowed-am...,0
1,Carefirst allowed amount HMO file,https://mrf.carefirst.com/mrf-files/allowed-am...,1
2,Carefirst allowed amount HMO file,https://mrf.carefirst.com/mrf-files/allowed-am...,2
3,Carefirst allowed amount HMO file,https://mrf.carefirst.com/mrf-files/allowed-am...,3
4,Carefirst allowed amount HMO file,https://mrf.carefirst.com/mrf-files/allowed-am...,4
5,Carefirst allowed amount HMO file,https://mrf.carefirst.com/mrf-files/allowed-am...,5


### This reporting structure reinforces redundancies

You may have heard people talking about how absolutely massive these Transparency in Coverage datasets are. For example, Cigna's in-network rates files can top several hundred gigabytes, and that's *before* you uncompress them.

But what we just saw here was the same 140 CareFirst in-network rate file URLs being repeated six times, once for each `reporting_structure` object. This isn't an error on CareFirst's part - they are merely following the standard set by CMS. In truth, a large part of the reason these files are so massive is that the data standard prescribed by CMS enforces redundancies like this. I'd hazard a guess that those responsible for writing the CMS standard did not expect the in-network rate data to be so large.

In the case of *this* particular TOC file, CareFirst was unable to bundle all of their `reporting_plans` and `in_network_files` objects under one single `reporting_structure` entry because the `allowed_amount_file` objects differ for each `reporting_structure`. This is due to the criteria for inclusion in the `allowed_amount_file`, which are listed below:

![An inline image screenshot of requirements concerning what allowed amount values to include in the data taken from the CMS reporting standard github site.](./images/CMS_Allowed_Amt_Additional_Note.png "CMS Allowed Amounts - Additional Note")

Pay attention to that last line - these variations need to be captured at the plan or issuer level! Since aggregate variations exist at the plan level, there must be six `reporting_structure` items instead of one, even though all six contain `reporting_plans` that refer to the same 140 `in_network_files` objects.

### Finishing up with a final join? Or how about not...

Let's finish our work here by collecting our `reporting_plans` data table as well as our `in_network_files` data table, knowing that we can join their respective subsets at any time on the `reporting_structure_number` field as the join key for both. Joining it all now would be redundant, as it would create a table where the 139 `reporting_plans` each have 140 `file_location` objects, for a total of 139 x 140 = 19,460 rows of mostly redundant info.

Since your goal at the end of this might be to parse this information and warehouse it in a more efficient storage format, it makes sense to be comfortable with having three datasets that can be joined at any time down the line, if required, vs. a much larger dataset.
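
To see where a row count like that comes from, here is a toy version of the join written in plain Python. The two small tables are invented; the point is only that joining on `reporting_structure_number` pairs every plan with every file in the same structure.

```python
# Invented toy tables: 3 plans across 2 structures, and 2 files per structure
plans = [
    {"plan_name": "Plan A", "reporting_structure_number": 0},
    {"plan_name": "Plan B", "reporting_structure_number": 0},
    {"plan_name": "Plan C", "reporting_structure_number": 1},
]
files = [
    {"location": "https://example.com/a.json", "reporting_structure_number": 0},
    {"location": "https://example.com/b.json", "reporting_structure_number": 0},
    {"location": "https://example.com/a.json", "reporting_structure_number": 1},
    {"location": "https://example.com/b.json", "reporting_structure_number": 1},
]

# Inner join on the shared key: each plan is repeated once per matching file
joined = [
    {**plan, **f}
    for plan in plans
    for f in files
    if plan["reporting_structure_number"] == f["reporting_structure_number"]
]

print(len(joined))  # 3 plans x 2 files per structure = 6 rows
```

Keeping `plans` and `files` separate stores 7 rows here instead of 6 joined ones, which is no saving at toy scale, but the gap widens quickly as the number of files per structure grows.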

In [444]:
final_toc_in_netw_out = pd.DataFrame(pd.merge(rps, infs, on='reporting_structure_number', how="inner", indicator=True).count())

display(final_toc_in_netw_out)

Unnamed: 0,0
plan_name,19460
plan_id_type,19460
plan_id,19460
plan_market_type,19460
reporting_structure_number,19460
description,19460
location,19460
_merge,19460


## TIC-TOC Helper Function

One of the best parts about having a published standard to which payers are (mostly) adhering is that the Transparency in Coverage (TIC) data should always be structured the same, no matter what insurer you're getting it from. Thus, all the TOC files in the TIC data should be readable with a generalized function that parses the data in the way we did today.

So, let's finish up by building a parse_tic_toc function that takes as input an open file connection, and runs the parse operations we've described here, outputting the items in a 3-tuple of objects for us to reference later.

### A Note on GZIP Compression
URLs from these TiC files generally come to us in two forms - .json files which are compressed in GZIP format, so they have an extension of `.json.gz`, and those which are not compressed, which merely have an extension of `.json`. At first, we want our helper function to have the ability to parse both kinds once the file is downloaded. 

*(Extra points if we can even skip downloading the file, and instead stream the contents of these URLs over the HTTPS protocol incrementally through our parser function. We're not there yet, but we'll tackle that task in the next TiC Tutorial Notebook.)*

### Why is GZIP Helpful Here?
I want to note that GZIP is a so-called "sliding window" compression method (please enjoy this [explainer video on the compression algorithm on which gzip is based](https://www.youtube.com/watch?v=goOa3DGezUA)). This means that we can actually parse JSON from within a GZIPped file as a stream, without uncompressing it entirely on disk first.
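
Here is a small sketch of what streaming from a GZIP file looks like with the standard library alone. It writes a throwaway compressed file to a temporary directory (the file name and contents are made up) and then reads it back in small chunks, so the full uncompressed text is never written to disk.

```python
import gzip
import os
import tempfile

# Write a throwaway .json.gz file; name and contents are made up for this sketch
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "example.json.gz")
payload = b'{"reporting_entity_name": "Example Payer"}' * 1000
with gzip.open(gz_path, "wb") as f:
    f.write(payload)

# Read it back 4 KiB at a time; gzip decompresses on the fly as we read
total_bytes = 0
with gzip.open(gz_path, "rb") as f:
    while True:
        chunk = f.read(4096)
        if not chunk:
            break
        total_bytes += len(chunk)

print(total_bytes == len(payload))
```

The handle returned by `gzip.open` behaves like any other file object, which is why it can be handed to a streaming parser in place of a regular file.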

While this might seem like trivial knowledge now, in the next TiC Tutorial notebook we will encounter some of these super-huge in-network rate files that you've heard about. We're going to tackle a few that are 30 GB compressed. 30 GB is already a file size that is hard to accommodate on most computers. Reinflating compressed GZIP files to their former glory can sometimes mean a file size that is 100 times the size of its compressed form. In this case, 30 GB might become 3,000 GB, or 3 terabytes. This creates challenges for parsing the file, because we'll need to find a way to extract the information we need from these files *without* ingesting them entirely into memory.

### Why is the Compression Ratio So Large?
As I've mentioned before, due to the highly redundant nature of the CMS standard, the GZIP compression ratio for these files is quite good, averaging around 95% - 99%. Generally, whenever you have lots of repeated information in a text file, GZIP compression ratios are higher.
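
The effect of redundancy on compression is easy to reproduce on a small scale. The sketch below repeats one made-up record many times and compresses the result with `gzip.compress`; the exact figure it prints depends on the data, and real files contain far more variety than this toy input.

```python
import gzip
import json

# Made-up record repeated many times, mimicking the redundancy described above
record = {
    "description": "example in-network file",
    "location": "https://example.com/in-network.json",
}
redundant_text = json.dumps([record] * 10000).encode("utf-8")

compressed = gzip.compress(redundant_text)
space_saved = 1 - len(compressed) / len(redundant_text)

print(len(redundant_text), len(compressed))
print("space saved: {:.1%}".format(space_saved))
```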

### Compressed vs. Uncompressed: ¿Por qué no los dos?
Ultimately, we want to build into our `parse_tic_toc` helper function the capability to parse JSON whether it encounters it in compressed or uncompressed format, so we've designed for that with the initial block of code:

    if file_path.endswith('.json.gz'):
        toc_file = gzip.open(file_path, 'rb')
    elif file_path.endswith('.json'):
        toc_file = open(file_path)
    else:
        raise ValueError("Expecting a .json or .json.gz file...")

In other words, we want our helper method to open a streaming connection to the file that varies based on the file extension - if it's `.json.gz` then we'll open the file with the method `gzip.open()`, if it's `.json` then we'll simply use the standard file reader, and if it's neither, we'll raise an error and halt execution.
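
Since this branching is needed every time the file is reopened, one option is to wrap it in a small function. The sketch below is a hypothetical helper (`open_toc` is a name invented here, not part of the notebook), and it opens both kinds of file in binary mode so callers get bytes either way, which is a choice made for this sketch rather than what the snippet above does.

```python
import gzip
import os
import tempfile

def open_toc(file_path):
    """Hypothetical helper: return a binary file handle chosen by file extension."""
    if file_path.endswith('.json.gz'):
        return gzip.open(file_path, 'rb')
    elif file_path.endswith('.json'):
        return open(file_path, 'rb')
    else:
        raise ValueError("Expecting a .json or .json.gz file...")

# Quick check on two throwaway files with identical (made-up) contents
tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "toc.json")
gz_path = os.path.join(tmp_dir, "toc.json.gz")
content = b'{"reporting_entity_name": "Example Payer"}'
with open(plain_path, "wb") as f:
    f.write(content)
with gzip.open(gz_path, "wb") as f:
    f.write(content)

# Both handles support the context-manager protocol and yield the same bytes
with open_toc(plain_path) as f_plain, open_toc(gz_path) as f_gz:
    same = f_plain.read() == f_gz.read()
print(same)
```

Using the handle in a `with` block also guarantees it is closed, even if parsing raises partway through.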

In [456]:
import gzip

def parse_tic_toc(file_path):
    
    ###### Do a first pass collecting Reporting Entity and Type
    if file_path.endswith('.json.gz'):
        toc_file = gzip.open(file_path, 'rb')
    elif file_path.endswith('.json'):
        toc_file = open(file_path)
    else:
        raise ValueError("Expecting a .json or .json.gz file...")

    parser = ijson.parse(toc_file)
    for prefix, event, value in parser:
        if prefix == 'reporting_entity_name':
            reporting_entity_name = value
        if prefix == 'reporting_entity_type':
            reporting_entity_type = value
    
    toc_file.close()
    
    ###### Next, collect Reporting Plans
    if file_path.endswith('.json.gz'):
        toc_file = gzip.open(file_path, 'rb')
    elif file_path.endswith('.json'):
        toc_file = open(file_path)
    else:
        raise ValueError("Expecting a .json or .json.gz file...")

    reporting_plans = ijson.items(toc_file, 'reporting_structure.item.reporting_plans')
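    plan_frames = []
    for (struct_num, rep_plan) in enumerate(reporting_plans):
        plan_details = pd.DataFrame([plan for plan in rep_plan])
        plan_details['reporting_structure_number'] = struct_num
        plan_frames.append(plan_details)
    # DataFrame.append was removed in pandas 2.0, so concatenate the pieces once at the end
    rps = pd.concat(plan_frames, ignore_index=True) if plan_frames else pd.DataFrame()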
        
    rps['reporting_entity_name'] = reporting_entity_name
    rps['reporting_entity_type'] = reporting_entity_type
    
    toc_file.close()
    
    ###### Next, collect In-Network File Location Objects
    if file_path.endswith('.json.gz'):
        toc_file = gzip.open(file_path, 'rb')
    elif file_path.endswith('.json'):
        toc_file = open(file_path)
    else:
        raise ValueError("Expecting a .json or .json.gz file...")
        
    in_network_files = ijson.items(toc_file, 'reporting_structure.item.in_network_files')
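    inf_frames = []
    for (struct_num, in_netw_f) in enumerate(in_network_files):
        in_net_f_detail = pd.DataFrame([inf for inf in in_netw_f])
        in_net_f_detail['reporting_structure_number'] = struct_num
        inf_frames.append(in_net_f_detail)
    # DataFrame.append was removed in pandas 2.0, so concatenate the pieces once at the end
    infs = pd.concat(inf_frames, ignore_index=True) if inf_frames else pd.DataFrame()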
        
    infs['reporting_entity_name'] = reporting_entity_name
    infs['reporting_entity_type'] = reporting_entity_type
    
    toc_file.close()
    
    ###### Finally, collect Allowed Amount File Location Objects
    if file_path.endswith('.json.gz'):
        toc_file = gzip.open(file_path, 'rb')
    elif file_path.endswith('.json'):
        toc_file = open(file_path)
    else:
        raise ValueError("Expecting a .json or .json.gz file...")
        
    allowed_amt_file = ijson.items(toc_file, 'reporting_structure.item.allowed_amount_file')
    aafs = pd.DataFrame([aaf for aaf in allowed_amt_file])    
    aafs['reporting_structure_number'] = aafs.index
    
    aafs['reporting_entity_name'] = reporting_entity_name
    aafs['reporting_entity_type'] = reporting_entity_type
    
    # Close the handle from the fourth pass before returning
    toc_file.close()
    
    return rps, infs, aafs

### Four Passes?!

Yeah, in our helper method above, we are making four passes over the data. This is a trade-off to consider. Parsing large files takes time, and crawling over a massive file doing string comparisons is time consuming. On the other hand, from a memory management perspective, if we attempt to extract too much data at one time during our pass over the file, we might max out available memory. We have to balance one constraint (time) with the other (memory).

In the next section, we'll be working with some pretty big files, and for smaller machines with less memory, it makes sense to spend the time doing multiple passes to extract smaller portions of data, versus extracting more information on a single pass.

### Testing

Finally, let's test our function on a number of files.

1. First, on our CareFirst file. 
2. Then, on a very large TOC file from Cigna. **Note**: *Cigna has elected to include ALL of its plans in one TOC file, which includes ASO plans for many employer plan sponsors, so this test takes about a minute to run.*
3. Then, on a very large TOC from Aetna that is a compressed `.json.gz` file
4. Finally, on an image file presented within this notebook, just to make sure our error handling is functioning as intended.


In [457]:
# Testing - our CareFirst File
toc_file = './toc_file.json'
rps, infs, aafs = parse_tic_toc(toc_file)
display(rps, infs, aafs)

# More testing - Cigna - uncomment the next three lines on the first run to download the file
#URL = "https://d25kgz5rikkq4n.cloudfront.net/cost_transparency/mrf/table-of-contents/reporting_month=2022-08/2022-08-01_cigna-health-life-insurance-company_index.json?Expires=1663258504&Policy=eyJTdGF0ZW1lbnQiOlt7IlJlc291cmNlIjoiaHR0cHM6Ly9kMjVrZ3o1cmlra3E0bi5jbG91ZGZyb250Lm5ldC9jb3N0X3RyYW5zcGFyZW5jeS9tcmYvdGFibGUtb2YtY29udGVudHMvcmVwb3J0aW5nX21vbnRoPTIwMjItMDgvMjAyMi0wOC0wMV9jaWduYS1oZWFsdGgtbGlmZS1pbnN1cmFuY2UtY29tcGFueV9pbmRleC5qc29uIiwiQ29uZGl0aW9uIjp7IkRhdGVMZXNzVGhhbiI6eyJBV1M6RXBvY2hUaW1lIjoxNjYzMjU4NTA0fX19XX0_&Signature=NucG2ID8F7zsGtIqNirj1uliIPIuFuhEapXIC3MjTN4cjvDwoJiZ0X2-4PRERH7i0Y-T99~xUFBsO~NjkegP4R2HGgcZygAT6C5T6NHl1UY-~qowDIl3KnujEvNJLvxOEYftbZsE7yfpPWXlV8sqM5dvItJrRuQEhP6Du9kBYA~SifOtfLUz-a6wn9QdVbsfPo80mUHqq~OBuk3HOJigBJbS0miiUHRhvEbTdMa9Nu5VSwMLTrod850P~kh~TgzEB4MTP-B-PrarwUCgsu4aYP3Eh2OMSIy4kxnL8xtlhBL7W0EiUUlpvVgsOTUScp43eyGC0Mmi5LMnEwqLD8HJsg__&Key-Pair-Id=K1NVBEPVH9LWJP"
#response = requests.get(URL)
#open("toc_file_cigna.json", "wb").write(response.content)

toc_file = './toc_file_cigna.json'
rps, infs, aafs = parse_tic_toc(toc_file)
display(rps, infs, aafs)

# More testing - Aetna - parsing through a GZIP'd file
# URL = "https://mrf.healthsparq.com/aetnacvs-egress.nophi.kyruushsq.com/prd/mrf/AETNACVS_I/ALICFI/2022-08-05/tableOfContents/2022-08-05_Aetna-Life-insurance-Company_index.json.gz"
# response = requests.get(URL)
# open("toc_file_aetna.json.gz", "wb").write(response.content)

toc_file = './toc_file_aetna.json.gz'
rps, infs, aafs = parse_tic_toc(toc_file)
display(rps, infs, aafs)

# Testing - an image (for which we expect an error)
toc_file = './images/CMS_TOC_Standard.png'
rps, infs, aafs = parse_tic_toc(toc_file) 

Unnamed: 0,plan_name,plan_id_type,plan_id,plan_market_type,reporting_structure_number,reporting_entity_name,reporting_entity_type
0,BLUECHOICE ADVANTAGE_POS,EIN,52-6002033,group,0,CareFirst Inc,HEALTH INSURANCE ISSUER
1,BLUECHOICE HMO HDHP INTEG DED_HMO,HIOS,10207VA038,individual,1,CareFirst Inc,HEALTH INSURANCE ISSUER
2,BLUECHOICE HMO HDHP INTEG DED_HMO,HIOS,28137MD037,individual,1,CareFirst Inc,HEALTH INSURANCE ISSUER
3,BLUECHOICE HMO HDHP INTEG DED_HMO,HIOS,86052DC040,individual,1,CareFirst Inc,HEALTH INSURANCE ISSUER
4,BLUECHOICE HMO HDHP NON INTDED_HMO,HIOS,10207VA038,individual,2,CareFirst Inc,HEALTH INSURANCE ISSUER
...,...,...,...,...,...,...,...
134,DHMO - BlueDental HMO_HMO,EIN,83-4713006,group,4,CareFirst Inc,HEALTH INSURANCE ISSUER
135,DHMO - BlueDental HMO_HMO,EIN,87-0787360,group,4,CareFirst Inc,HEALTH INSURANCE ISSUER
136,DHMO - Dental HMO_HMO,EIN,52-0348850,group,5,CareFirst Inc,HEALTH INSURANCE ISSUER
137,DHMO - Dental HMO_HMO,EIN,52-2064235,group,5,CareFirst Inc,HEALTH INSURANCE ISSUER


Unnamed: 0,description,location,reporting_structure_number,reporting_entity_name,reporting_entity_type
0,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,0,CareFirst Inc,HEALTH INSURANCE ISSUER
1,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,0,CareFirst Inc,HEALTH INSURANCE ISSUER
2,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,0,CareFirst Inc,HEALTH INSURANCE ISSUER
3,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,0,CareFirst Inc,HEALTH INSURANCE ISSUER
4,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,0,CareFirst Inc,HEALTH INSURANCE ISSUER
...,...,...,...,...,...
835,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,5,CareFirst Inc,HEALTH INSURANCE ISSUER
836,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,5,CareFirst Inc,HEALTH INSURANCE ISSUER
837,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,5,CareFirst Inc,HEALTH INSURANCE ISSUER
838,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,5,CareFirst Inc,HEALTH INSURANCE ISSUER


,description,location,reporting_structure_number,reporting_entity_name,reporting_entity_type
0,Carefirst allowed amount HMO file,https://mrf.carefirst.com/mrf-files/allowed-am...,0,CareFirst Inc,HEALTH INSURANCE ISSUER
1,Carefirst allowed amount HMO file,https://mrf.carefirst.com/mrf-files/allowed-am...,1,CareFirst Inc,HEALTH INSURANCE ISSUER
2,Carefirst allowed amount HMO file,https://mrf.carefirst.com/mrf-files/allowed-am...,2,CareFirst Inc,HEALTH INSURANCE ISSUER
3,Carefirst allowed amount HMO file,https://mrf.carefirst.com/mrf-files/allowed-am...,3,CareFirst Inc,HEALTH INSURANCE ISSUER
4,Carefirst allowed amount HMO file,https://mrf.carefirst.com/mrf-files/allowed-am...,4,CareFirst Inc,HEALTH INSURANCE ISSUER
5,Carefirst allowed amount HMO file,https://mrf.carefirst.com/mrf-files/allowed-am...,5,CareFirst Inc,HEALTH INSURANCE ISSUER


,plan_name,plan_id_type,plan_id,plan_market_type,reporting_structure_number,reporting_entity_name,reporting_entity_type
0,"NATIONAL OAP COMPETITION CAMS, INC.",ein,621009281,group,0,Cigna Health Life Insurance Company,Health Insurance Issuer
1,NATIONAL OAP CIGNA Health and Life Insurance C...,ein,59-1031071,group,1,Cigna Health Life Insurance Company,Health Insurance Issuer
2,LOCALPLUS Weld County Garage,ein,840348620,group,2,Cigna Health Life Insurance Company,Health Insurance Issuer
3,SAN FRANCISCO SACRAMENTO HMO CIGNA HEALTHCARE ...,ein,95-3310115,group,3,Cigna Health Life Insurance Company,Health Insurance Issuer
4,METRO NEW YORK GPPO CIGNA Health and Life Insu...,ein,59-1031071,group,4,Cigna Health Life Insurance Company,Health Insurance Issuer
...,...,...,...,...,...,...,...
34235,"NATIONAL OAP Sunshine Media Group, Inc.",ein,651064897,group,18233,Cigna Health Life Insurance Company,Health Insurance Issuer
34236,"LOCALPLUS C3 Industries, Inc.",ein,832247786,group,18234,Cigna Health Life Insurance Company,Health Insurance Issuer
34237,NATIONAL OAP CM & Associates Construction Mana...,ein,204915944,group,18235,Cigna Health Life Insurance Company,Health Insurance Issuer
34238,"NATIONAL OAP InterConnect Wiring, LLP",ein,752459439,group,18236,Cigna Health Life Insurance Company,Health Insurance Issuer


,description,location,reporting_structure_number,reporting_entity_name,reporting_entity_type
0,in-network file,https://d25kgz5rikkq4n.cloudfront.net/cost_tra...,0,Cigna Health Life Insurance Company,Health Insurance Issuer
1,in-network file,https://d25kgz5rikkq4n.cloudfront.net/cost_tra...,1,Cigna Health Life Insurance Company,Health Insurance Issuer
2,in-network file,https://d25kgz5rikkq4n.cloudfront.net/cost_tra...,2,Cigna Health Life Insurance Company,Health Insurance Issuer
3,in-network file,https://d25kgz5rikkq4n.cloudfront.net/cost_tra...,3,Cigna Health Life Insurance Company,Health Insurance Issuer
4,in-network file,https://d25kgz5rikkq4n.cloudfront.net/cost_tra...,4,Cigna Health Life Insurance Company,Health Insurance Issuer
...,...,...,...,...,...
2435,in-network file,https://d25kgz5rikkq4n.cloudfront.net/cost_tra...,2435,Cigna Health Life Insurance Company,Health Insurance Issuer
2436,in-network file,https://d25kgz5rikkq4n.cloudfront.net/cost_tra...,2436,Cigna Health Life Insurance Company,Health Insurance Issuer
2437,in-network file,https://d25kgz5rikkq4n.cloudfront.net/cost_tra...,2437,Cigna Health Life Insurance Company,Health Insurance Issuer
2438,in-network file,https://d25kgz5rikkq4n.cloudfront.net/cost_tra...,2438,Cigna Health Life Insurance Company,Health Insurance Issuer


,description,location,reporting_structure_number,reporting_entity_name,reporting_entity_type
0,allowed amount file,https://d25kgz5rikkq4n.cloudfront.net/cost_tra...,0,Cigna Health Life Insurance Company,Health Insurance Issuer
1,allowed amount file,https://d25kgz5rikkq4n.cloudfront.net/cost_tra...,1,Cigna Health Life Insurance Company,Health Insurance Issuer
2,allowed amount file,https://d25kgz5rikkq4n.cloudfront.net/cost_tra...,2,Cigna Health Life Insurance Company,Health Insurance Issuer
3,allowed amount file,https://d25kgz5rikkq4n.cloudfront.net/cost_tra...,3,Cigna Health Life Insurance Company,Health Insurance Issuer
4,allowed amount file,https://d25kgz5rikkq4n.cloudfront.net/cost_tra...,4,Cigna Health Life Insurance Company,Health Insurance Issuer
...,...,...,...,...,...
18185,allowed amount file,https://d25kgz5rikkq4n.cloudfront.net/cost_tra...,18185,Cigna Health Life Insurance Company,Health Insurance Issuer
18186,allowed amount file,https://d25kgz5rikkq4n.cloudfront.net/cost_tra...,18186,Cigna Health Life Insurance Company,Health Insurance Issuer
18187,allowed amount file,https://d25kgz5rikkq4n.cloudfront.net/cost_tra...,18187,Cigna Health Life Insurance Company,Health Insurance Issuer
18188,allowed amount file,https://d25kgz5rikkq4n.cloudfront.net/cost_tra...,18188,Cigna Health Life Insurance Company,Health Insurance Issuer


,plan_name,plan_id_type,plan_id,plan_market_type,reporting_structure_number,reporting_entity_name,reporting_entity_type
0,Aetna Bronze PPO5900 50/50 HSA OffMarketplace,HIOS,53357ME0040055,group,0,Aetna Life insurance Company,Health Insurance Issuer
1,Aetna Silver PPO 5500 70/50 Off Marketplace,HIOS,53357ME0040060,group,0,Aetna Life insurance Company,Health Insurance Issuer
2,Aetna Gold PPO 2500 70/50 Off Marketplace,HIOS,53357ME0040057,group,0,Aetna Life insurance Company,Health Insurance Issuer
3,Aetna Silver OAEPO 5500 80% PY,HIOS,39159CT0140001,group,1,Aetna Life insurance Company,Health Insurance Issuer
4,Aetna Gold PPO 750 70/50,HIOS,11082AK0060032,group,2,Aetna Life insurance Company,Health Insurance Issuer
...,...,...,...,...,...,...,...
815,Aetna OOS Broad PPO Gold 3000 80/50,HIOS,84251AZ0100190,group,28,Aetna Life insurance Company,Health Insurance Issuer
816,Aetna OOS Broad PPO Gold 3500 80/50,HIOS,84251AZ0100175,group,28,Aetna Life insurance Company,Health Insurance Issuer
817,Aetna OOS Broad PPO Silver 4000 70/50,HIOS,84251AZ0100176,group,28,Aetna Life insurance Company,Health Insurance Issuer
818,Aetna OOS Broad PPO Bronze 5700 70/50 HSA,HIOS,84251AZ0100185,group,28,Aetna Life insurance Company,Health Insurance Issuer


,description,location,reporting_structure_number,reporting_entity_name,reporting_entity_type
0,in-network file,https://mrf.healthsparq.com/aetnacvs-egress.no...,0,Aetna Life insurance Company,Health Insurance Issuer
1,in-network file,https://mrf.healthsparq.com/aetnacvs-egress.no...,1,Aetna Life insurance Company,Health Insurance Issuer
2,in-network file,https://mrf.healthsparq.com/aetnacvs-egress.no...,2,Aetna Life insurance Company,Health Insurance Issuer
3,in-network file,https://mrf.healthsparq.com/aetnacvs-egress.no...,3,Aetna Life insurance Company,Health Insurance Issuer
4,in-network file,https://mrf.healthsparq.com/aetnacvs-egress.no...,4,Aetna Life insurance Company,Health Insurance Issuer
5,in-network file,https://mrf.healthsparq.com/aetnacvs-egress.no...,5,Aetna Life insurance Company,Health Insurance Issuer
6,in-network file,https://mrf.healthsparq.com/aetnacvs-egress.no...,6,Aetna Life insurance Company,Health Insurance Issuer
7,in-network file,https://mrf.healthsparq.com/aetnacvs-egress.no...,7,Aetna Life insurance Company,Health Insurance Issuer
8,in-network file,https://mrf.healthsparq.com/aetnacvs-egress.no...,8,Aetna Life insurance Company,Health Insurance Issuer
9,in-network file,https://mrf.healthsparq.com/aetnacvs-egress.no...,9,Aetna Life insurance Company,Health Insurance Issuer


,description,location,reporting_structure_number,reporting_entity_name,reporting_entity_type
0,allowed amount file,https://mrf.healthsparq.com/aetnacvs-egress.no...,0,Aetna Life insurance Company,Health Insurance Issuer
1,allowed amount file,https://mrf.healthsparq.com/aetnacvs-egress.no...,1,Aetna Life insurance Company,Health Insurance Issuer
2,allowed amount file,https://mrf.healthsparq.com/aetnacvs-egress.no...,2,Aetna Life insurance Company,Health Insurance Issuer
3,allowed amount file,https://mrf.healthsparq.com/aetnacvs-egress.no...,3,Aetna Life insurance Company,Health Insurance Issuer
4,allowed amount file,https://mrf.healthsparq.com/aetnacvs-egress.no...,4,Aetna Life insurance Company,Health Insurance Issuer
5,allowed amount file,https://mrf.healthsparq.com/aetnacvs-egress.no...,5,Aetna Life insurance Company,Health Insurance Issuer
6,allowed amount file,https://mrf.healthsparq.com/aetnacvs-egress.no...,6,Aetna Life insurance Company,Health Insurance Issuer
7,allowed amount file,https://mrf.healthsparq.com/aetnacvs-egress.no...,7,Aetna Life insurance Company,Health Insurance Issuer
8,allowed amount file,https://mrf.healthsparq.com/aetnacvs-egress.no...,8,Aetna Life insurance Company,Health Insurance Issuer
9,allowed amount file,https://mrf.healthsparq.com/aetnacvs-egress.no...,9,Aetna Life insurance Company,Health Insurance Issuer
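
The plan table and the two file tables share a `reporting_structure_number` column, which suggests a way to tie each plan back to its files. Here is a minimal sketch, assuming that number identifies the TOC entry a row came from. The toy frames `plans` and `in_network_files` are hypothetical stand-ins built from a few rows of the Aetna output, and the truncated URL is kept exactly as displayed rather than completed.

```python
import pandas as pd

# Toy stand-ins for two of the three DataFrames (values copied from the Aetna output)
plans = pd.DataFrame({
    "plan_name": [
        "Aetna Bronze PPO5900 50/50 HSA OffMarketplace",
        "Aetna Silver PPO 5500 70/50 Off Marketplace",
        "Aetna Gold PPO 2500 70/50 Off Marketplace",
        "Aetna Silver OAEPO 5500 80% PY",
    ],
    "plan_id": ["53357ME0040055", "53357ME0040060", "53357ME0040057", "39159CT0140001"],
    "reporting_structure_number": [0, 0, 0, 1],
})

# The location is truncated in the displayed output, so it stays truncated here
truncated_url = "https://mrf.healthsparq.com/aetnacvs-egress.no..."
in_network_files = pd.DataFrame({
    "description": ["in-network file", "in-network file"],
    "location": [truncated_url, truncated_url],
    "reporting_structure_number": [0, 1],
})

# Left join: keep every plan and attach the file(s) listed under the same number
plan_files = plans.merge(in_network_files, on="reporting_structure_number", how="left")

# Look up the in-network file location(s) for a single plan ID
files_for_plan = plan_files.loc[
    plan_files["plan_id"] == "53357ME0040055", "location"
].tolist()
```

With the real `rps` and `infs` frames, the same `merge` call would produce one row per plan and file pair, which can grow quickly for payers such as CareFirst that list many files under one number.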


ValueError: Expecting a .json or .json.gz file...
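
The `ValueError` above is the expected result of handing the parser an image instead of a TOC file. For readers who want to see how such a guard might look, here is a minimal sketch of an extension check. The helper name `open_toc_file` is hypothetical and is not the notebook's `parse_tic_toc`; it only illustrates rejecting anything that is not `.json` or `.json.gz` before reading.

```python
import gzip
import json


def open_toc_file(path):
    """Hypothetical helper: load a TOC file, accepting only .json or .json.gz."""
    lower = path.lower()
    if lower.endswith(".json.gz"):
        # gzip.open in text mode decompresses while reading
        with gzip.open(path, "rt", encoding="utf-8") as f:
            return json.load(f)
    if lower.endswith(".json"):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    # Any other extension is rejected before the file is touched
    raise ValueError("Expecting a .json or .json.gz file...")


# Usage: catch the error so the rest of a notebook cell can keep running
try:
    open_toc_file("./images/CMS_TOC_Standard.png")
except ValueError as err:
    message = str(err)
    print(f"ValueError: {err}")
```

Checking the extension is a cheap first filter; it does not prove the contents are valid JSON, so `json.load` can still raise on a mislabeled file.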