# Transparency in Coverage Data with Python: Part 2

Hello there! If you are starting with this notebook and have not yet checked out the [introductory notebook in this series](./tic_start.ipynb), please do that first. It won't take long, and you'll be better prepared for the content in this notebook.

## In-Network Rate Files

At the end of the previous notebook, we defined a parser function to handle TOC files. I've saved that function in a separate Python file so we can import it and pick up where Part 1 left off. Let's quickly confirm that it works, because we'll be using it to fetch in-network rates.

In [25]:
from tic_toc import parse_tic_toc

rps, infs, aafs = parse_tic_toc('./toc_file.json')
display(rps, infs, aafs)

Unnamed: 0,plan_name,plan_id_type,plan_id,plan_market_type,reporting_structure_number,reporting_entity_name,reporting_entity_type
0,BLUECHOICE ADVANTAGE_POS,EIN,52-6002033,group,0,CareFirst Inc,HEALTH INSURANCE ISSUER
1,BLUECHOICE HMO HDHP INTEG DED_HMO,HIOS,10207VA038,individual,1,CareFirst Inc,HEALTH INSURANCE ISSUER
2,BLUECHOICE HMO HDHP INTEG DED_HMO,HIOS,28137MD037,individual,1,CareFirst Inc,HEALTH INSURANCE ISSUER
3,BLUECHOICE HMO HDHP INTEG DED_HMO,HIOS,86052DC040,individual,1,CareFirst Inc,HEALTH INSURANCE ISSUER
4,BLUECHOICE HMO HDHP NON INTDED_HMO,HIOS,10207VA038,individual,2,CareFirst Inc,HEALTH INSURANCE ISSUER
...,...,...,...,...,...,...,...
134,DHMO - BlueDental HMO_HMO,EIN,83-4713006,group,4,CareFirst Inc,HEALTH INSURANCE ISSUER
135,DHMO - BlueDental HMO_HMO,EIN,87-0787360,group,4,CareFirst Inc,HEALTH INSURANCE ISSUER
136,DHMO - Dental HMO_HMO,EIN,52-0348850,group,5,CareFirst Inc,HEALTH INSURANCE ISSUER
137,DHMO - Dental HMO_HMO,EIN,52-2064235,group,5,CareFirst Inc,HEALTH INSURANCE ISSUER


Unnamed: 0,description,location,reporting_structure_number,reporting_entity_name,reporting_entity_type
0,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,0,CareFirst Inc,HEALTH INSURANCE ISSUER
1,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,0,CareFirst Inc,HEALTH INSURANCE ISSUER
2,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,0,CareFirst Inc,HEALTH INSURANCE ISSUER
3,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,0,CareFirst Inc,HEALTH INSURANCE ISSUER
4,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,0,CareFirst Inc,HEALTH INSURANCE ISSUER
...,...,...,...,...,...
835,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,5,CareFirst Inc,HEALTH INSURANCE ISSUER
836,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,5,CareFirst Inc,HEALTH INSURANCE ISSUER
837,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,5,CareFirst Inc,HEALTH INSURANCE ISSUER
838,Carefirst in-network HMO file,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,5,CareFirst Inc,HEALTH INSURANCE ISSUER


Unnamed: 0,description,location,reporting_structure_number,reporting_entity_name,reporting_entity_type
0,Carefirst allowed amount HMO file,https://mrf.carefirst.com/mrf-files/allowed-am...,0,CareFirst Inc,HEALTH INSURANCE ISSUER
1,Carefirst allowed amount HMO file,https://mrf.carefirst.com/mrf-files/allowed-am...,1,CareFirst Inc,HEALTH INSURANCE ISSUER
2,Carefirst allowed amount HMO file,https://mrf.carefirst.com/mrf-files/allowed-am...,2,CareFirst Inc,HEALTH INSURANCE ISSUER
3,Carefirst allowed amount HMO file,https://mrf.carefirst.com/mrf-files/allowed-am...,3,CareFirst Inc,HEALTH INSURANCE ISSUER
4,Carefirst allowed amount HMO file,https://mrf.carefirst.com/mrf-files/allowed-am...,4,CareFirst Inc,HEALTH INSURANCE ISSUER
5,Carefirst allowed amount HMO file,https://mrf.carefirst.com/mrf-files/allowed-am...,5,CareFirst Inc,HEALTH INSURANCE ISSUER


## Unique In-Network Rates Only

In the last notebook, we discovered that the TOC file contains 140 x 6 = 840 File Location objects pointing to URLs with In-Network Rates, but only 140 of those URLs are distinct: the same 140 URLs are repeated for each of the 6 `reporting_structure` objects in the TOC file.

So, we only need the unique URLs. Let's get them now.

In [46]:
import pandas as pd

unique_infs = infs['location'].drop_duplicates()
display(pd.DataFrame(unique_infs))


Unnamed: 0,location
0,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...
1,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...
2,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...
3,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...
4,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...
...,...
135,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...
136,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...
137,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...
138,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...
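
To see why `drop_duplicates` is enough here, a toy sketch may help. The URLs and the `toy_infs` frame below are made up for illustration (they are not taken from the CareFirst TOC file); the same three locations are listed under each of six reporting structures, mirroring how the real file repeats its URLs.

```python
import pandas as pd

# Hypothetical stand-in for `infs`: 3 made-up URLs repeated under 6 reporting structures
toy_urls = [f"https://example.com/rates_{i}.json.gz" for i in range(3)]
toy_infs = pd.DataFrame(
    [
        {"location": u, "reporting_structure_number": n}
        for n in range(6)
        for u in toy_urls
    ]
)

# drop_duplicates keeps the first occurrence of each URL, along with its original index label
toy_unique = toy_infs["location"].drop_duplicates()

print(len(toy_infs), "rows,", len(toy_unique), "unique URLs")
```

Because the original index labels are kept, `reset_index(drop=True)` can be chained on if a clean 0..n-1 index is wanted.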


## HEAD Requests over HTTPS

You may have heard that these in-network rate files can be quite large. Before we go willy-nilly downloading them all, let's send a HEAD request over HTTPS for each URL. A HEAD request returns only the response headers, which tell us a number of things about the file, including its size in bytes.

Let's *also* remember that payers have to pay egress charges for these downloads, so when we download these large files, we need to be mindful not to spam them with download requests, either intentionally or unintentionally. (An issue has already been raised on the CMS GitHub site about exactly this.)

We're going to convert bytes to megabytes (MB) by dividing the header value returned for the key `Content-length` by 1000^2. Then, we'll combine this list with the URLs and sort the URLs from smallest file to largest.

(This will take a moment to run, as it sends a HEAD request over HTTPS for each of the 140 URLs and reads the response headers.)

In [54]:
import requests as rq

header_info = [rq.head(url).headers['Content-length'] for url in unique_infs]
header_info_mb = [ int(hi) / (1000**2) for hi in header_info]

unique_inf_urls = pd.DataFrame()
unique_inf_urls['location'] = unique_infs
unique_inf_urls['filesize'] = header_info_mb
unique_inf_urls = unique_inf_urls.sort_values(by=['filesize']).reset_index(drop=True)
display(unique_inf_urls)


Unnamed: 0,location,filesize
0,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,0.211853
1,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,0.211853
2,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,0.211853
3,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,0.211853
4,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,0.257835
...,...,...
135,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,13.788947
136,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,21.501631
137,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,21.501631
138,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,21.501631
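
As a more defensive variant of the size check above, here is a minimal sketch that uses only the standard library, tolerates a missing `Content-Length` header, and pauses between requests to stay polite. The function names, the `fetch` parameter, and the half-second pause are all assumptions made for illustration, not part of the original notebook.

```python
import time
import urllib.request


def head_content_length(url, timeout=30):
    """Send a HEAD request and return Content-Length in bytes, or None if the header is absent."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        length = resp.headers.get("Content-Length")
    return int(length) if length is not None else None


def bytes_to_mb(n_bytes):
    """Convert bytes to decimal megabytes (1 MB = 1000**2 bytes)."""
    return n_bytes / 1000**2


def file_sizes_mb(urls, fetch=head_content_length, pause_seconds=0.5):
    """Return {url: size in MB or None}, sleeping between requests to avoid hammering the host."""
    sizes = {}
    for url in urls:
        n_bytes = fetch(url)
        sizes[url] = None if n_bytes is None else bytes_to_mb(n_bytes)
        time.sleep(pause_seconds)
    return sizes
```

Calling `file_sizes_mb(unique_infs)` would then give a dictionary that can be turned into a DataFrame and sorted. Passing a different `fetch` callable makes it easy to try the logic without touching the network.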




In [259]:
display(unique_inf_urls.tail(n=50))

Unnamed: 0,location,filesize
90,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,10.833255
91,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,10.833255
92,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,11.08275
93,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,11.08275
94,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,11.08275
95,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,11.08275
96,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,11.825912
97,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,11.825912
98,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,11.825912
99,https://carefirstbcbs.mrf.bcbs.com/2022-08_690...,11.825912


## Downloading to disk vs. Streaming

Up until now, we have downloaded every file to disk before processing. However, some of these files are quite large, and even larger when uncompressed. We can avoid downloading large files by instead streaming them over HTTPS and through a gzip file reader, which decompresses the stream on the fly and lets us parse it as it arrives.
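
The idea of decompressing lazily can be tried without any network access. The sketch below is a stand-in under stated assumptions: a small made-up JSON payload is compressed in memory and plays the role of an HTTPS response body, and the 4096-byte chunk size is arbitrary.

```python
import gzip
import io
import json

# Made-up payload standing in for the body of an HTTPS response
payload = json.dumps({"in_network": [{"billing_code": str(n)} for n in range(1000)]}).encode()
compressed_bytes = gzip.compress(payload)

# gzip.open accepts an existing file-like object and decompresses as we read from it
total = 0
with gzip.open(io.BytesIO(compressed_bytes), "rb") as stream:
    while True:
        chunk = stream.read(4096)  # only this chunk of decompressed data is held at a time
        if not chunk:
            break
        total += len(chunk)

print(f"compressed: {len(compressed_bytes)} bytes, uncompressed: {total} bytes")
```

The uncompressed data is never materialized in one piece, which is what makes the same approach workable for files far larger than memory or free disk space.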

The `ijson` parser works unidirectionally, so there's no need to backtrack as long as we tell it up front exactly which JSON items we want to pull out of the file. Thanks to the CMS standard that explicitly describes the files' structure, we know ahead of time which items to look for.

While decompressing the stream on the fly takes a little longer, we don't have to worry about finding disk space for all these giant files.

In [261]:
import urllib.request
import ijson
import gzip

# Start with one file from our collection of URLs. unique_inf_urls is sorted by
# ascending filesize, so index 139 is the last (largest) entry; index 0 is the smallest.
url = unique_inf_urls['location'][139]
hdr = { 'Accept-encoding' : 'gzip' }

# Send HTTPS request
req = urllib.request.Request(url, headers=hdr)
response = urllib.request.urlopen(req)

# Stream compressed HTTPS data through open gzip file reader for decompression
uncompressed_stream = gzip.open(response, 'r')

# Stream decompressed output through ijson, searching for known items.
# use_float and multiple_values are parser options, so they are passed as keyword
# arguments; assigning them as attributes on the ijson module has no effect.
parser = ijson.parse(uncompressed_stream, use_float=True, multiple_values=True)

container_events = ('start_map', 'end_map', 'start_array', 'end_array', 'map_key')
item_prefix = 'in_network.item.'

base_rows = []        # top-level string fields, one single-key dict each
in_network_rows = []  # one dict of primitive fields per in_network item
in_network_row = {}
for prefix, event, value in parser:
    if event == 'string' and '.' not in prefix:
        base_rows.append({prefix: value})
    elif prefix == 'in_network.item' and event == 'start_map':
        # A new in_network item begins
        in_network_row = {}
    elif prefix == 'in_network.item' and event == 'end_map':
        # The item is complete; keep its row
        in_network_rows.append(in_network_row)
    elif prefix.startswith(item_prefix) and prefix.count('.') == 2 and event not in container_events:
        # A primitive directly under the item (e.g. billing_code). Nested structures such as
        # negotiated_rates, bundled_codes, and covered_services have longer prefixes and are skipped.
        in_network_row[prefix[len(item_prefix):]] = value

uncompressed_stream.close()

# Build the DataFrame once at the end (DataFrame.append was removed in pandas 2.0)
in_network_rows = pd.DataFrame(in_network_rows)
display(in_network_rows)


Unnamed: 0,billing_code,billing_code_type,billing_code_type_version,description,name,negotiation_arrangement
0,69710,CPT,2022,impltj/rplcmt emgnt bone cndj dev temporal bo,Surgery,bundle
1,69710,CPT,2022,impltj/rplcmt emgnt bone cndj dev temporal bo,Surgery,bundle
2,69711,CPT,2022,rmvl/rpr emgnt bone cndj dev temporal bone,Surgery,bundle
3,69711,CPT,2022,rmvl/rpr emgnt bone cndj dev temporal bone,Surgery,bundle
4,69714,CPT,2022,impltj oi implt skull perq attachment esp,Surgery,bundle
...,...,...,...,...,...,...
1070,78016,CPT,2022,thyroid carcinoma metastases img addl study,Radiology,bundle
1071,78018,CPT,2022,thyroid carcinoma metastases img whole body,Radiology,bundle
1072,78018,CPT,2022,thyroid carcinoma metastases img whole body,Radiology,bundle
1073,78020,CPT,2022,thyroid carcinoma metastases uptake,Radiology,bundle


In [253]:
in_network_details = pd.DataFrame({k: v for d in base_rows for k, v in d.items()}, index=[0])
display(in_network_details)


Unnamed: 0,reporting_entity_name,reporting_entity_type,last_updated_on,version
0,CareFirst BlueCross BlueShield,health insurance issuer,2020-08-27,1.0.0
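
The dictionary comprehension in the cell above flattens a list of single-key dicts into one dict, which then becomes a one-row DataFrame. A small sketch with placeholder values (not the real file's contents) shows the merge step on its own:

```python
# Shaped like `base_rows` above; the values here are placeholders
sample_rows = [
    {"reporting_entity_name": "Example Payer"},
    {"reporting_entity_type": "health insurance issuer"},
    {"version": "1.0.0"},
]

# The outer loop walks the list; the inner loop walks each dict's items
merged = {k: v for d in sample_rows for k, v in d.items()}
print(merged)
```

If the same key appeared in more than one dict, the later value would overwrite the earlier one, which is worth keeping in mind for files that repeat a top-level field.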


In [135]:
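# The stream opened earlier has already been read to the end and closed, so parsing it
# again fails with "IncompleteJSONError: parse error: premature EOF". Open a fresh one.
response = urllib.request.urlopen(urllib.request.Request(url, headers=hdr))
uncompressed_stream = gzip.open(response, 'r')
parser = ijson.parse(uncompressed_stream)

reporting_entity_name = reporting_entity_type = last_updated_on = version = None
for prefix, event, value in parser:
    if prefix == 'reporting_entity_name':
        reporting_entity_name = value
    if prefix == 'reporting_entity_type':
        reporting_entity_type = value
    if prefix == 'last_updated_on':
        last_updated_on = value
    if prefix == 'version':
        version = value

uncompressed_stream.close()
print(reporting_entity_name, reporting_entity_type, last_updated_on, version)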
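
A stream like the gzip reader used in this notebook is single-pass: once it has been read to the end, a second parser attached to the same object sees no data, which a JSON parser reports as a premature end of input. The sketch below reproduces that behavior in memory, using made-up data and the standard library's `json` in place of `ijson`.

```python
import gzip
import io
import json

compressed_bytes = gzip.compress(json.dumps({"version": "1.0.0"}).encode())

stream = gzip.open(io.BytesIO(compressed_bytes), "rb")
first_pass = stream.read()   # consumes the whole stream
second_pass = stream.read()  # nothing left to read: returns b""
stream.close()

# Opening a new reader starts again from the beginning
with gzip.open(io.BytesIO(compressed_bytes), "rb") as fresh:
    reparsed = json.load(fresh)

print(len(first_pass), len(second_pass), reparsed)
```

For an HTTPS source, "opening a new reader" means sending a new request, so it is usually better to collect everything needed in a single pass over the stream.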
