
Aggregate machine-readable provider network directories and drug formularies into tabular format #56

Open
dportnoy opened this issue Apr 11, 2016 · 100 comments

Comments

@dportnoy
Member

Putting out a call to those interested in making an impact by contributing to public data projects... Looking for somebody to create a new public dataset (and accompanying source code).

Background

In November 2015, the Centers for Medicare & Medicaid Services (CMS) enacted a new regulatory requirement for health insurers who list plans on the insurance marketplaces: they must now publish a machine-readable version of their provider network directory and drug formulary, conforming to a specified JSON standard, and update it at least monthly. This data has only recently become accessible to the public. Some of its uses can be found in the Bayes Impact hackathon "prompts" or in at least 7 DDOD use cases.

Challenge

While these newly available datasets can significantly benefit consumer health applications and a range of healthcare analytics, the format in which they're published doesn't lend itself to either use.

  • First, while there is a single "Machine-readable URL PUF" seed file, the actual data is scattered throughout thousands of URLs.
  • Second, for analytics and aggregate operations, it would be beneficial to convert the JSON schema into a tabular data format.

Request

Write code that does the following:

  1. Crawls the URLs starting with the "Machine-readable URL PUF" seed file found on this page: https://www.cms.gov/CCIIO/Resources/Data-Resources/marketplace-puf.html.
    • (As of 3/1/2016, there are 636 plan URLs, 23,711 provider URLs and 1,662 formulary URLs.)
  2. Converts the data from the JSON schema (https://github.com/CMSgov/QHP-provider-formulary-APIs) into a tabular format with the same fields.
    • (There may be some challenges converting multiple independent array fields from JSON to a tabular row.)
  3. Aggregates the results into CSV or other delimited text files, such that we have files for: Plans, Providers, and Formularies.

Run the code and let us know where to find the resulting files. We should be able to find a good home for them, so that they enjoy widespread use.
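For anyone picking this up, here's a minimal sketch of the crawl step in Python. It assumes the index.json files listed in the PUF expose plan_urls, provider_urls, and formulary_urls arrays per the QHP schema linked above, and that each linked document is a top-level JSON array; the index URL shown is a placeholder.

# Minimal crawl sketch. Assumes index.json exposes plan_urls, provider_urls,
# and formulary_urls arrays per the QHP schema, and that each linked document
# is a top-level JSON array. INDEX_URLS is a placeholder for the issuer URLs
# taken from the "Machine-readable URL PUF" spreadsheet.
import requests

INDEX_URLS = [
    "https://example-issuer.com/cms-data/index.json",  # hypothetical URL
]

def crawl(index_urls):
    docs = {"plans": [], "providers": [], "drugs": []}
    for index_url in index_urls:
        index = requests.get(index_url, timeout=60).json()
        for key, bucket in [("plan_urls", "plans"),
                            ("provider_urls", "providers"),
                            ("formulary_urls", "drugs")]:
            for url in index.get(key, []):
                try:
                    docs[bucket].extend(requests.get(url, timeout=120).json())
                except (ValueError, requests.RequestException) as err:
                    print(f"skipping {url}: {err}")
    return docs

if __name__ == "__main__":
    results = crawl(INDEX_URLS)
    print({kind: len(records) for kind, records in results.items()})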

If you can do this, you’ll be an official Open Data Hero! (Spandex optional.)

@dportnoy
Member Author

@marks, @ftrotter, @cornstein, @BAP-Jeff, @loranstefani, @Jordanrau, @HeatherBrotsos, @lloydbrodsky,
Since you're all past contributors, I wanted to see if any of you would take the challenge. Feel free to pass it on to your colleagues as well.

@dportnoy
Member Author

In case it's helpful, I started working on the code, but just haven't had enough time to get to it lately. Feel free to use. You can find it here: https://github.com/dportnoy/health-insurance-marketplace-analytics

@marks

marks commented Apr 12, 2016

@dportnoy - looks like you got pretty far... would you mind documenting what it does so far and what areas need attention? Would definitely help so others don't accidentally reinvent the wheel.

@dportnoy
Member Author

@marks You give me entirely too much credit. But point taken. I'll write up a summary and post it.

@dportnoy
Member Author

Suggestions for JSON to tabular (CSV/TSV) column mapping

Translation from the JSON schema into tabular form should be straightforward...

  • Files: Break into 3 core files by schema: plans, providers, drugs
  • Columns: Schema fields get translated into individual columns
  • Arrays: Biggest challenge is with arrays. Multiple independent arrays are especially problematic, since you can't just create a new row with that value
    • Option 1: Just leave an array as a list within a single field.
      • (Easiest to translate from JSON, but hard for analytics)
    • Option 2: Pivot array into separate columns.
      • (Easiest for analytics, but there's a long tail of typos and outliers.)
    • Option 3: Normalize it by breaking arrays into separate files with foreign keys, such as plan id and provider NPI.
      • (Great for loading into a relational database, but requires multiple steps before use for analytics.)
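To make these options concrete, here's a minimal sketch of Option 3 applied to a single array field, using the languages array from providers.json as the example (field names follow the QHP schema; the provider record is illustrative):

# Sketch of Option 3: break the "languages" array out of providers.json into
# its own delimited file keyed by NPI (the foreign key). The provider record
# below is illustrative; field names follow the QHP providers.json schema.
import csv

providers = [
    {"npi": "1234567890", "type": "INDIVIDUAL",
     "languages": ["English", "Spanish"]},
]

with open("providers.csv", "w", newline="") as p_out, \
     open("provider_languages.csv", "w", newline="") as l_out:
    p_writer = csv.writer(p_out)
    l_writer = csv.writer(l_out)
    p_writer.writerow(["npi", "type"])
    l_writer.writerow(["npi", "language"])  # one row per (npi, language) pair
    for prov in providers:
        p_writer.writerow([prov["npi"], prov["type"]])
        for lang in prov.get("languages", []):
            l_writer.writerow([prov["npi"], lang])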

@marks

marks commented Apr 16, 2016

I put something together really quickly that takes insurance providers' index.json URLs and flattens the data from the formulary_urls array.

Sample dataset: https://healthdata.demo.socrata.com/view/979j-m4qb
Code: marks/health-insurance-marketplace-analytics@3507b47

@cornstein

Really interesting output, but I wonder if there's a way to make this a bit
more useful. Perhaps we can have a brief chat?


@marks

marks commented Apr 16, 2016

@cornstein - definitely... That's just a quick visualization of the data. You can export it (all or a slice) or use the API as well, of course.

Kind of hectic day but happy to jump on a call tomorrow or continue to chat here if you're interested in a specific view

Mark Silverberg
(m) 512 826 7004
http://twitter.com/skram


@dportnoy
Member Author

dportnoy commented Apr 16, 2016

Here are some options for the end product, along with their uses. They are listed in increasing order of difficulty:

  1. AGGREGATE: Aggregate the crawled URLs into a small number of files, keeping the original QHP schema format in JSON: plans, providers, drugs.
    • Eliminates the need to crawl thousands of URLs.
    • Expect to run into file size challenges, since there are thousands of URLs and some files may be hundreds of megabytes.
    • This format may be adequate for loading into a NoSQL/document database for use with non-analytical consumer web apps.
  2. FLATTEN: Make the format tabular, in the easiest way possible.
    • Method: The files produced can line up exactly with the QHP schema entities pulled from URL links: plans, providers, drugs. For arrays, you could put multiple values in a single field.
    • The goal here is to create files that can be hosted ASAP and used for rudimentary analytics.
  3. NORMALIZE: Normalize the work product from item # 2 into relational tables, tied with foreign keys.
    • This format is useful for loading into a relational database for use in consumer apps and web services, especially if they require calculations and transformations. Convenient for codathons and challenges.
  4. JOIN & PIVOT: Make a tabular format that’s easiest for analytics by creating one row per entity with all possible attributes.
    • For example, in addition to the fields in the drugs.json schema and its plans subtype, include fields from "Machine-readable URL PUF" XLSX seed file: Marketplace Category, State, Issuer ID, Issuer Name. You can find other joins between the entities of the QHP schema.
    • Expect to run into challenges for entities with multiple arrays that are free-form or have a long tail of values.
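As a rough sketch of the join in item 4, flattened drug rows could be enriched with issuer metadata from the "Machine-readable URL PUF" spreadsheet roughly like this (the PUF column names and the convention that HIOS plan IDs begin with the 5-digit issuer ID are assumptions to verify against the actual file):

# Sketch of the JOIN & PIVOT idea: enrich flattened drug rows with issuer
# metadata from the "Machine-readable URL PUF" spreadsheet. The PUF column
# names ("Issuer ID", "State", "Issuer Name", "Marketplace Category") and the
# rule that a HIOS plan ID starts with the 5-digit issuer ID are assumptions.
import pandas as pd

drugs = pd.read_csv("drugs_flat.csv", dtype=str)   # output of the flatten step
puf = pd.read_excel("Machine_Readable_PUF.xlsx", dtype=str)

drugs["issuer_id"] = drugs["plan_id"].str[:5]
puf["issuer_id"] = puf["Issuer ID"].str.zfill(5)

enriched = drugs.merge(
    puf[["issuer_id", "State", "Issuer Name", "Marketplace Category"]]
       .drop_duplicates("issuer_id"),
    on="issuer_id",
    how="left",
)
enriched.to_csv("drugs_enriched.csv", index=False)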

@marks

marks commented Apr 16, 2016

I've made a bit more progress. The latest code is available at https://github.com/marks/health-insurance-marketplace-analytics/blob/master/flattener/flatten_from_index.py

I have code that starts* to flatten all three file types: drugs.json, plans.json, and providers.json. I have only run providers.json and drugs.json on a few index.json's as it can take a while to process.

The flattened data can be explored/downloaded/called-with-an-API from the following links:

For each file type: Flattened Data Explorer (you can export from here too), Data Grid (you can do roll-ups and filtering here), and API Docs.

  • plans.json: Explorer https://healthdata.demo.socrata.com/view/yjgn-upfm | Data Grid https://healthdata.demo.socrata.com/d/x6bt-9iym | API Docs https://dev.socrata.com/foundry/healthdata.demo.socrata.com/55ch-w329
  • providers.json: Explorer https://healthdata.demo.socrata.com/view/pqgg-pgsy | Data Grid https://healthdata.demo.socrata.com/d/axbq-xnwy | API Docs https://dev.socrata.com/foundry/healthdata.demo.socrata.com/5ht2-73gj
  • drugs.json: Explorer https://healthdata.demo.socrata.com/view/979j-m4qb | Data Grid https://healthdata.demo.socrata.com/d/jaa8-k3k2 | API Docs https://dev.socrata.com/foundry/healthdata.demo.socrata.com/6ngr-bz3d

* Not all data is flattened at this time, to decrease the number of rows. For example, languages in the provider file are serialized as a JSON array, as are addresses and specialties.

@dportnoy
Member Author

@BAP-Jeff, hi again. Welcome to use case #56! To answer your question from issue #52... Having a ready to go dataset would be ideal for the Bayes Hack codathon for next weekend. It's identified as the central dataset for building consumer healthcare apps. You'll see a comment above that identifies 4 possible file types: #56 (comment). It's good to have a few people working on this, because some of the aspects (such as file sizes and cleanup for analytics) are challenging. Having the code would be a useful resource for the community to refresh the data in the future.

@ghost

ghost commented Apr 17, 2016

Hello All,

As David knows, I have been playing with the drugs.json set for a while as I build some commercial applications for the formulary space. There are a ton of challenges in using the data as supplied, but overall it is a good first-year effort by the agency.

If I understand the challenges David laid out, I have been pursuing Option 4 for my app. I am using a couple of simple Python scripts to (1) read a list of URLs from the Machine Readable PUF and retrieve the drugs.json file for each, then (2) run a second script to rip apart the JSON documents into a flat pipe-delimited file.
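For illustration, here's a minimal sketch of that second step in Python. The field names follow the published QHP drugs.json schema (rxnorm_id, drug_name, plans[].plan_id, drug_tier, prior_authorization, step_therapy, quantity_limit); the actual script and its column order may differ.

# Sketch of the second script: flatten one drugs.json document into
# pipe-delimited rows, one row per (drug, plan) pair. Field names follow the
# QHP drugs.json schema; the real script's column order may differ.
import json
import sys

def flatten_drugs(path, out):
    with open(path) as f:
        drugs = json.load(f)          # drugs.json is a top-level array
    for drug in drugs:
        for plan in drug.get("plans", []):
            out.write("|".join(str(v) for v in [
                path,                                  # source file name
                plan.get("plan_id_type", ""),
                plan.get("plan_id", ""),
                plan.get("drug_tier", ""),             # drug status/tier
                drug.get("rxnorm_id", ""),             # drug ID
                plan.get("prior_authorization", ""),   # PA
                plan.get("step_therapy", ""),          # ST
                plan.get("quantity_limit", ""),        # QL
            ]) + "\n")

if __name__ == "__main__":
    flatten_drugs(sys.argv[1], sys.stdout)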

I am happy to share the code and/or the end result (warning: around 35 to 40 million records). What do we want for the hack-a-thon? I have to admit I have not participated in one before, so I am not sure if people are only interested in the end result (upload the dataset to, say, Socrata or Kaggle) or whether they want to understand the process so they can roll their own.

Let me know which way the group wants to go.

Jeff


@marks

marks commented Apr 17, 2016

@BAP-Jeff - awesome. I think both the data and the provenance (how you got it, including any scripts/documentation) would be much appreciated by participants as well as the community in general. Would be happy to look into hosting it onto Socrata if you can provide the raw data

Mark

@ghost

ghost commented Apr 17, 2016

Okay, thanks. I will put together some things tonight. Much of my time in working with this dataset is dealing with the data discrepancies (e.g. multiple statuses on a drug for a plan, use of NDCs vs. RxCUIs, missing restrictions, bogus URLs), but showing off the millions of records that are produced will be cool. Plus, hopefully making it more accessible will get more eyes on it... and we can get those things corrected.

Here is the "process" I would propose that we show off (for the drugs.json branch):

  • Download latest Readable PUF
  • Extract the details for the URLs and Plans, excluding dental and NOT SUBMITTED records (see the sketch after this list)
  • Iterate over that list, pulling down the valid json files
  • Then iterate over the json files and create csv files
  • Then push them to Socrata and/or other spots
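As a rough illustration of the first two bullets, something like the following could pull the usable index.json URLs out of the PUF. The column names "URL Submitted" and "Dental Only Plan" are hypothetical placeholders; check the actual spreadsheet headers before running.

# Sketch of the first two bullets: read the Machine-readable PUF and keep only
# rows with a usable URL. The column names "URL Submitted" and
# "Dental Only Plan" are hypothetical placeholders; check the real headers.
import pandas as pd

puf = pd.read_excel("Machine_Readable_PUF.xlsx", dtype=str)
usable = puf[
    (puf["URL Submitted"].str.upper() != "NOT SUBMITTED")
    & (puf["Dental Only Plan"].str.upper() != "YES")
]
urls = usable["URL Submitted"].dropna().unique().tolist()
print(f"{len(urls)} index.json URLs to crawl")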

David, do you need a "deck" or some sort of visual to present? Help me out
if I am going astray...

Jeff


@marks

marks commented Apr 17, 2016

I don't want to make more work for you, @BAP-Jeff, but I'd be very interested in what your CSVs are looking like. I am running some code on the index.jsons I found in a previously shared xlsx to flatten all three file types (but they may not have all the important info... I am not an SME in this area).

@ghost

ghost commented Apr 17, 2016

I actually use pipe delimiting but can easily change it out. I am attaching an example source JSON file and what I am outputting (tried to find a small one). It is very close to what you had shown off before, Mark. I can easily add some more columns to this from the Plan Attributes PUF file to be more complete (IssuerID, Region, etc.), but it sounded like you wanted to see what the current state was. I also included the Python script I used to parse the file.

The format of the output is:

  • Source JSON file name
  • PlanIDType
  • PlanID
  • DrugStatus
  • DrugID
  • PA
  • ST
  • QL

Archive.zip

HIXFormExtract.py.zip

@marks

marks commented Apr 17, 2016

@BAP-Jeff - Yup! Looks quite similar. I would think it would be very helpful to folks to include info from the Plan Attributes PUF. Will stand by for files so that we can get them on Socrata* (and Kaggle and other platforms) for dissemination/exploration/API access.

* Full disclosure: I work for Socrata and we're more than happy to help folks download the data in a number of formats. Can accept pipe delimited, no big deal.

@ghost

ghost commented Apr 17, 2016

Okay I will be gone most of the day but will get on this later today.


@dportnoy
Member Author

@BAP-Jeff, great to have you contribute! Let's see if I could answer most of your questions...

  • STORAGE: For now, the datasets generated can be stored wherever is most convenient for your methods. They can always be copied elsewhere later.
    • Ultimately, Kaggle agreed to put the datasets in their HHS section for use with the codathon and other future activity: https://www.kaggle.com/hhsgov/health-insurance-marketplace
    • If needed, I can provide you with an S3 bucket to dump it in. (But I still need to determine the most appropriate permissions and access mechanism.)
    • If Socrata is willing to host any portion of it, that's great too.
  • FILE SIZES: You asked about potential sizes. I think we need two sets of files:
    • The complete files (with millions of records), which are obviously needed for analytics. We may end up needing to break these into multiple files for manageability.
    • A small data sample subset, which can be loaded for proof of concepts with applications. Perhaps pick a few regions, but make sure the files are complete for a particular plan, so that joins are possible.
  • EXAMPLES: For your pipe-delimited file, we'd need to be sure to include either a header record or package a dictionary with it. For data.json processing, we can add more useful fields by joining with the "Machine-readable URL PUF" XLSX file and possibly attributes from other entities (plans and providers).
  • DECK: Yes, if you can put together a deck, it will be useful. I think data visual slides would be compelling -- both for confirming assumptions and demonstrating surprising "aha" moments.

@ghost

ghost commented Apr 17, 2016

David,

I think I can take on all the above but may need a little help with the Deck. Do you have a template you want to use - Powerpoint, Google, ??

I can produce the files tonight/tomorrow and upload them probably tomorrow.

@dportnoy give me the details on the kaggle stuff
@marks let me know where to drop the files for Socrata

David, did you want to reserve any slides for ..."Issues we found in the data" or does that diminish the good vibes?

Jeff

@marks

marks commented Apr 17, 2016

Jeff- if I can just get delimited files, I can take care of loading it for ya. Sounds like kaggle is top priority for this event though ;)

Mark Silverberg
(m) 512 826 7004
http://twitter.com/skram


@ghost

ghost commented Apr 17, 2016

Okay I will loop back around to you as time is short.


@dportnoy
Member Author

@BAP-Jeff Google slides. No specific template. Just the white default.

For Kaggle, the upload will be done manually by their staff (since they're still working on making that functionality self-serve), so we need a staging location before the data gets copied there.

@ghost

ghost commented Apr 18, 2016

Okay, I have the file ready to go. It is about 5 GB uncompressed (31 million records). The file has the following table def:

create table Hackathon
(StateCode varchar(2),
IssuerID char(5),
StandardComponentID varchar(14),
MarketCoverage varchar(18),
PlanMarketingName varchar(142),
HIOSProductID varchar(10),
ServiceAreaID varchar(6),
FormularyID varchar(6),
PlanType varchar(9),
MetalLevel varchar(12),
DrugStatus varchar(100),
RxCUI varchar(15),
DrugName varchar(2000),
PAStatus char(1),
STStatus char(1),
QLStatus char(1))

I pulled from Plan Attributes what I thought was the key information. Note: PlanID in the Formulary Files is not the same as PlanID in Plan Attributes; the analogous column is StandardComponentID. Furthermore, I ignored the CSR variations in Plan Attributes and joined the distinct records from Plan Attributes to the Formulary Files. If this didn't make sense... well, I guess trust me...

I have added a header record to the file and am now waiting for the zipping to finish up. I assume it will get down to the 2 GB range. I can throw it up into an S3 bucket and share it I guess, or GDrive. Let me know. Could be a few days on the slides...

Jeff
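For anyone picking up the extract, a chunked read keeps the ~31 million rows from having to fit in memory; the columns follow the table def above, and the file name here is just a placeholder.

# Sketch for consumers of the extract: stream the pipe-delimited file in
# chunks so the ~31 million rows never have to fit in memory at once.
# Columns match the CREATE TABLE above; the file name is a placeholder.
import pandas as pd

rows = 0
for chunk in pd.read_csv("hackathon_formulary.txt", sep="|", dtype=str,
                         chunksize=1_000_000):
    rows += len(chunk)        # or aggregate, e.g. chunk.groupby("StateCode")
print(f"total rows: {rows}")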

@dportnoy
Member Author

@BAP-Jeff, definitely throw it up on either S3 or gDrive, whichever works best! Ideally, you could also provide a smaller file that has a more manageable sample subset of this data, so that it can be used to test code.

@dportnoy
Member Author

@ftrotter, wanted to ping you and see if you're still working on any aspect of this.

@dportnoy
Member Author

That would be beyond fantastic!

On Apr 21, 2016, at 9:17 AM, BAP-Jeff wrote:

I could maybe get to it tomorrow morning if you are desperate.

On Thu, Apr 21, 2016 at 10:14 AM, David X Portnoy wrote:

Great, @marks. I appreciate it. I'm working on a page with all the links.

That said, we still need PROVIDERS files.



@ghost

ghost commented Apr 21, 2016

I just kicked off a process to download all the providers.json files. I have gotten through the first 10; the initial impression is that these files are MASSIVE.

This could take a while to process.

@marks

marks commented Apr 21, 2016

@BAP-Jeff - that's why I initially focused on formulary ;) - Any way I can help by running some scripts locally for you and/or on some cloud servers?

@ghost

ghost commented Apr 21, 2016

@marks, ha! Thanks for the offer. I can grab the files, but based on some back-of-the-envelope calculations this is going to be multiple terabytes of source data.

I'll check in later.

@dportnoy
Member Author

@BAP-Jeff, @marks, I was afraid of that. A few options:

  1. Break up into multiple files
  2. Normalize the heck out of the data
    • It should help that multiple plans use identical networks and providers.json URLs. Perhaps use a separate "issuer_provider" file that joins issuer_id to providers_url. Then the "providers" file doesn't need to repeat entries that come from the same providers_url.
    • But this approach is a pain for analytics purposes.
  3. Don't load all fields, focusing just on the ones not found anywhere else
    • While this doesn't change the cardinality and therefore the number of rows, it can reduce the file sizes significantly. Perhaps include key fields: npi, type, state, zip, specialty, accepting, plan_id, network_tier.
    • This approach may be sufficient for many analytics purposes. Unfortunately, it may not be as useful for consumer applications, which could benefit from fields such as address and language.

None of these are ideal. Perhaps to get something going, we can focus on loading a subset of the data. So pick either one interesting large state (like TX or FL) or small state (like AK or SD) to load. This would at least allow for use with consumer apps and certain types of analytics.
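One way to keep memory under control with the multi-gigabyte providers.json files is to stream them rather than load them whole. A minimal sketch with ijson, keeping only the key fields above for a single state (field names follow the QHP providers.json schema and may need adjusting against the real files):

# Sketch of a streaming one-state subset: iterate a multi-gigabyte
# providers.json with ijson so it never loads fully into memory, keeping only
# key fields for providers with an address in the target state. Field names
# follow the QHP providers.json schema; adjust against the real files.
import csv
import ijson  # pip install ijson

def subset_state(json_path, csv_path, state="ME"):
    with open(json_path, "rb") as f, open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["npi", "type", "state", "zip", "specialty",
                         "accepting", "plan_id", "network_tier"])
        for prov in ijson.items(f, "item"):    # top-level JSON array
            addrs = [a for a in prov.get("addresses", [])
                     if a.get("state") == state]
            if not addrs:
                continue
            for plan in prov.get("plans", []):
                writer.writerow([
                    prov.get("npi"), prov.get("type"),
                    addrs[0].get("state"), addrs[0].get("zip"),
                    ";".join(prov.get("specialty") or []),
                    prov.get("accepting"),
                    plan.get("plan_id"), plan.get("network_tier"),
                ])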

@marks

marks commented Apr 22, 2016

@BAP-Jeff, @dportnoy - the latest (#56 (comment)) formulary files are API enabled at the following link. Let me know how else I can help, of course.

https://healthdata.demo.socrata.com/view/xc22-8t66

@dportnoy
Member Author

Thanks! That leaves just the providers files.


@ghost

ghost commented Apr 22, 2016

I downloaded all the provider json files last night. Looks like maybe the "big" ones were at the top of the list and they got better after that. As we discussed/suggested, I am breaking them into a number of tables - indiv_languages, fac_addresses, etc... Still going to be big files, but seems like things are running....

@ghost

ghost commented Apr 22, 2016

FYI, we have some healthy sized files. I just exhausted the memory on my box. It will take me a few to spin up a big honker on AWS to run these big guys...

Maybe I should just give up on these for now and deal with them later. @dportnoy what is the timing on all this? Today right?

-rw-rw-r-- 1 ubuntu ubuntu 2267461069 Apr 21 21:11 PROVJSON3_201604212013.json
-rw-rw-r-- 1 ubuntu ubuntu 2170298888 Apr 21 22:29 PROVJSON2903_201604212013.json
-rw-rw-r-- 1 ubuntu ubuntu 1544176291 Apr 21 22:18 PROVJSON2891_201604212013.json
-rw-rw-r-- 1 ubuntu ubuntu 1456651939 Apr 21 23:08 PROVJSON2915_201604212013.json
-rw-rw-r-- 1 ubuntu ubuntu 1081970470 Apr 21 22:44 PROVJSON2910_201604212013.json
-rw-rw-r-- 1 ubuntu ubuntu 1073854978 Apr 22 01:46 PROVJSON3425_201604212013.json
-rw-rw-r-- 1 ubuntu ubuntu 1029692658 Apr 22 00:21 PROVJSON2987_201604212013.json
-rw-rw-r-- 1 ubuntu ubuntu 1014403038 Apr 21 23:15 PROVJSON2916_201604212013.json
-rw-rw-r-- 1 ubuntu ubuntu 976208707 Apr 21 22:36 PROVJSON2909_201604212013.json
-rw-rw-r-- 1 ubuntu ubuntu 965357940 Apr 21 21:58 PROVJSON2833_201604212013.json

@dportnoy
Member Author

@BAP-Jeff, for now could you create a sample file and pick one full state to crawl? (See the bottom of my note above.) Since the size would be manageable, you could create the simplest layout possible.


@ghost

ghost commented Apr 22, 2016

@dportnoy actually that is tougher than it sounds. To get a full state would imply that we would need to load all the files then query the state we want out of it. I can work towards that but it will take me some time to get a machine that can parse the 2GB files. What is the timing?

Here is what the headers will look like for the eight files (anything obvious I am missing?):

Plan Table - for Indiv and Facilities (I guess I could create separate ones...)

plan_id_type|plan_id|network_tier|npi

Individual Files

npi|prefix|first|middle|last|suffix|accepting|gender|last_updated_on

npi|language

npi|specialty

npi|address|address_2|city|state|zip|phone (I could combine with facility address file...)

Facility Files

npi|facility_name|last_updated_on

npi|facility_type

npi|address|address_2|city|state|zip|phone
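For what it's worth, the split could look roughly like this for two of the eight files (the plan table and the individual language file); the same pattern extends to the address, specialty, and facility files. Field names follow the QHP schema and are assumptions to verify.

# Sketch of the split into the normalized files, shown for two of the eight:
# the plan table (plan_id_type|plan_id|network_tier|npi) and the individual
# language file (npi|language). The same pattern extends to the address,
# specialty, and facility files. Field names follow the QHP schema.
def write_provider(prov, plan_out, lang_out):
    npi = str(prov.get("npi", ""))
    for plan in prov.get("plans", []):
        plan_out.write("|".join([
            plan.get("plan_id_type", ""), plan.get("plan_id", ""),
            plan.get("network_tier", ""), npi,
        ]) + "\n")
    if prov.get("type") == "INDIVIDUAL":
        for lang in prov.get("languages", []):
            lang_out.write(f"{npi}|{lang}\n")

with open("plan_table.txt", "w") as plans, \
     open("indiv_languages.txt", "w") as langs:
    plans.write("plan_id_type|plan_id|network_tier|npi\n")
    langs.write("npi|language\n")
    # call write_provider(record, plans, langs) for each streamed record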

@dportnoy
Member Author

@BAP-Jeff,

One-state subset: I could be wrong, but I thought you'd start with the state associated with each issuer in the Machine Readable PUF 02-22-2016.txt. Not sure if each shows a nationwide network, but I'm sure some have narrow networks that wouldn't be as big. If that logic is incorrect or doesn't work well from a workload perspective, please do what makes most sense.

Sample file: Besides the one-state option, it would be helpful to have a small sample file to work with -- regardless what data is in it.

Timing: Ideally by end of day. But I'll start writing up links to the files we already have in parallel. You've done so much already. I really appreciate it!

Looking at fields next...

@ghost

ghost commented Apr 22, 2016

@dportnoy, I think what you are saying is more or less right. What I am seeing is that an issuer can be operating across multiple states, so the URL listed for, say, Aetna/Texas is the same URL as Aetna/Florida. I could (as I think you suggest) go after the URLs listed for a state, and if there is spillover to other states, so be it. But it looks like national plans are just that... they show up everywhere, which makes the task tough. I am making progress on loading everything though.

Do you have a state that you would like me to grab?


@dportnoy
Member Author

@BAP-Jeff, on fields...

I think we need to add plan_id and network_tier to INDIVIDUAL and FACILITY files. Otherwise we'll end up with a lot of "duplicate" npi entries, each with slightly different information, including whether they're accepting new patients. We won't know which entry belongs to which plan.

Optional: Consider adding issuer_id (from the PUF), as that would be useful for some of the analytics.

@dportnoy
Member Author

#56 (comment) Which state? I'd love to get CA since BayesHack is there, but alas it's not included. If you want to be on the conservative side, you can pick a state with the fewest entries in the PUF, like AK or SD. When I did my last analysis a couple of months ago, both of these states had only one providers.json URL.

@ghost

ghost commented Apr 22, 2016

Got it. It will be interesting to see if the plans leverage their own internal address information or just use the NPPES file...

I'll take a look at AK, I might also look at ME.

@ghost

ghost commented Apr 23, 2016

I am focusing on pulling files for ME. Though it is only a few JSON files, the created tables are enormous. Just for ME we are looking at probably a bit over 5 GB. I am not sure what to do about that. I guess I will just zip them up and post them. Wow.

@dportnoy
Member Author

@BAP-Jeff, since the size is ending up so big anyway, what if we post a subset first? Perhaps pick some arbitrary way to grab a subset that people can easily work with?

@ghost

ghost commented Apr 23, 2016

@dportnoy, okay I got Maine done. It is posted here:

https://drive.google.com/file/d/0B9yZheZrBn54UFFBOGlLWmZVejQ/view?usp=sharing

Compressed down to about 130 MB. Very little QA on this guy. If we want to pull out something smaller, let me know. Not entirely obvious how to filter it... maybe by a couple of PlanIDs.

@dportnoy
Member Author

@BAP-Jeff nice! Let's go with it.

@dportnoy
Member Author

dportnoy commented Apr 23, 2016

Working on BayesImpact related posts now. (@BAP-Jeff, @marks, could you help me summarize the latest and best links we have for each category?)


Will finish tomorrow morning!

@marks

marks commented Apr 23, 2016

@dportnoy all I've got for ya is https://healthdata.demo.socrata.com/view/xc22-8t66 for the latest version of the Formulary data @BAP-Jeff scraped. Happy to upload anything else too though

@dportnoy
Member Author

@marks, it would be great if you could load the updated data. I'll proceed to publish the individual links in the meantime, as well as data dictionaries.

@marks

marks commented Apr 23, 2016

@dportnoy somehow missed Jeff's comment about ME. I'll see what I can do but please confirm that the Formulary file is as you'd expect it to be. Can definitely update title/description with whatever you'd like (perhaps a link back to the right place to see other resources)

@marks

marks commented Apr 23, 2016

@BAP-Jeff / @dportnoy - as you all know, providers are split into 6 files this time. Think it's worth creating one or two combined files for easier analysis? May need to leave something off like fac type or language or concatenate arrays into a string. Just a suggestion for making use easier like it is for formulary. Regardless, working on uploading to enable viz and APIs for these 6 files

@ghost

ghost commented Apr 23, 2016

Good ideas. I am out of pocket for most of the weekend. Maybe we see if anyone hits the APIs over the next couple of weeks and then look to improve things.

Will be interested to hear from @dportnoy what kind of interest, if any, this generates. Clearly there are not many people using these files.


@marks
Copy link

marks commented Apr 23, 2016

@dportnoy the following files are ready

All US Formulary file: https://healthdata.demo.socrata.com/view/xc22-8t66

Maine Provider Facility Type file: https://healthdata.demo.socrata.com/view/3juv-cnb4
Maine Provider Individual Specialty file: https://healthdata.demo.socrata.com/view/gz7p-vgqg
Maine Provider Individual Language file: https://healthdata.demo.socrata.com/view/fiik-zy3e

@dportnoy
Member Author

Update... @BAP-Jeff, @marks, thank you again! You guys were a huge help! Couldn't have done it without you. We still need to write up the activities, but there were 8 HHS teams at BayesHack, 4 of them specifically dealing with helping consumers find the right healthcare.

[image: bayeshack - 8 hacker teams for HHS]

@cornstein

Anything useful/shareable come out of it? I would love to see.


@Jordanrau

Ditto.

Jordan Rau | Senior Correspondent | Kaiser Health News (khn.org) | 202.654.1362 | @Jordanrau | jrau@kff.org

