Aggregate machine-readable provider network directories and drug formularies into tabular format #56
Comments
Participants: @marks, @ftrotter, @cornstein, @BAP-Jeff, @loranstefani, @Jordanrau, @HeatherBrotsos, @lloydbrodsky
In case it's helpful, I started working on the code, but just haven't had enough time to get to it lately. Feel free to use it. You can find it here: https://github.com/dportnoy/health-insurance-marketplace-analytics
@dportnoy - looks like you got pretty far.. would you mind documenting what it does so far and what areas need attention? Would definitely help so others dont accidentally reinvent the wheel |
@marks You give me entirely too much credit. But point taken. I'll write up a summary and post it.
Suggestions for JSON to tabular (CSV/TSV) column mapping: Translation from the JSON schema into tabular format should be straightforward...
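A minimal sketch of that JSON-to-tabular translation, in Python. The one-record drugs.json snippet below is a made-up example; field names like rxnorm_id, drug_name, and plan_id follow the general shape of the formulary schema discussed in this thread, but the values are invented.

```python
import json

def flatten(obj, prefix=""):
    """Recursively flatten nested dicts/lists into one flat dict with
    dotted/indexed keys, e.g. {"plans.0.plan_id": "..."}."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            flat.update(flatten(value, f"{prefix}{i}."))
    else:
        # Scalar leaf: the accumulated prefix (minus trailing dot) is the column name
        flat[prefix[:-1]] = obj
    return flat

# Hypothetical single formulary record for illustration
record = json.loads(
    '{"rxnorm_id": "209459", "drug_name": "Tylenol", '
    '"plans": [{"plan_id": "11111AZ0010001", "drug_tier": "GENERIC"}]}'
)
flat = flatten(record)
```

Each flattened dict could then be written out with csv.DictWriter (using delimiter='|' for the pipe-delimited style favored later in this thread).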
|
I put something together really quickly that takes an insurance provider's data... Sample dataset: https://healthdata.demo.socrata.com/view/979j-m4qb
Really interesting output, but I wonder if there's a way to make this a bit...
|
@cornstein - definitely... That's just a quick visualization of the data. You can export it (all or a slice) or use the API as well, of course. Kind of a hectic day, but happy to jump on a call tomorrow or continue to chat here if you're interested in a specific view. Mark Silverberg
|
Here are some options for the end product, along with their uses. They are listed in increasing order of difficulty:
|
I've made a bit more progress. The latest code is available at https://github.com/marks/health-insurance-marketplace-analytics/blob/master/flattener/flatten_from_index.py and it starts to flatten all three file types. The flattened data can be explored/downloaded/called-with-an-API from the following links:
|
@BAP-Jeff, hi again. Welcome to use case #56! To answer your question from issue #52: having a ready-to-go dataset would be ideal for the Bayes Hack codathon next weekend. It's identified as the central dataset for building consumer healthcare apps. You'll see a comment above that identifies 4 possible file types: #56 (comment). It's good to have a few people working on this, because some of the aspects (such as file sizes and cleanup for analytics) are challenging. Having the code would be a useful resource for the community to refresh the data in the future.
Hello all. As David knows, I have been playing with the drugs.json set for a while. If I understand the challenges David laid out, I have been pursuing Option 4 for my app. I am using a couple of simple Python scripts to (1) read a list of URLs from the Machine Readable PUF, then (2) retrieve the drugs.json files. I am happy to share the code and/or the end result (warning: around 35 to 40...). Let me know which way the group wants to go. Jeff
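Jeff's two-step process (read the list of URLs from the Machine Readable PUF, then retrieve each drugs.json) might be sketched roughly like this. The index payload below is a made-up example that mirrors the published index format with plan_urls/provider_urls/formulary_urls keys; no network call is made here.

```python
import json

# Hypothetical issuer index document, stand-in for what one of the URLs
# listed in the Machine Readable PUF actually returns
index_json = '''{
  "plan_urls": ["https://example.com/plans.json"],
  "provider_urls": ["https://example.com/providers.json"],
  "formulary_urls": ["https://example.com/drugs.json"]
}'''

def formulary_urls(index_text):
    """Parse one issuer index document and return its drugs.json URLs."""
    return json.loads(index_text).get("formulary_urls", [])

urls = formulary_urls(index_json)
```

In practice, each extracted URL would then be fetched, e.g. with urllib.request.urlretrieve(url, local_path), before being parsed into tabular output.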
|
@BAP-Jeff - awesome. I think both the data and the provenance (how you got it, including any scripts/documentation) would be much appreciated by participants as well as the community in general. Would be happy to look into hosting it on Socrata if you can provide the raw data. Mark
Okay, thanks. I will put together some things tonight. Here is the "process" I would propose that we show off (for the drugs.json files)...
David, do you need a "deck" or some sort of visual to present? Help me out. Jeff
|
I don't want to make more work for you, @BAP-Jeff, but I'd be very interested in what your CSVs are looking like. I am running some code on the...
I actually use pipe delimiting but can easily change it out. I am attaching an example source JSON file and what I am outputting (tried to find a small one). It is very close to what you had shown off before, Mark. I can easily add some more columns to this from the Plan Attributes PUF file to be more complete (IssuerID, Region, etc.), but it sounded like you wanted to see the current state. I also included the Python script I used to parse the file. The format of the output is:
|
@BAP-Jeff - Yup! Looks quite similar. I would think it would be very helpful to folks to include info from the Plan Attributes PUF. Will stand by for files so that we can get them on Socrata (and Kaggle and other platforms) for dissemination/exploration/API access.
|
Okay I will be gone most of the day but will get on this later today.
|
@BAP-Jeff, great to have you contribute! Let's see if I can answer most of your questions...
|
David, I think I can take on all of the above but may need a little help with the deck. Do you have a template you want to use - PowerPoint, Google, ...? I can produce the files tonight/tomorrow and upload them probably tomorrow. @dportnoy give me the details on the Kaggle stuff. David, did you want to reserve any slides for... "Issues we found in the data", or does that diminish the good vibes? Jeff
Jeff- if I can just get delimited files, I can take care of loading it for ya. Sounds like kaggle is top priority for this event though ;) Mark Silverberg
|
Okay, I will loop back around to you as time is short.
|
@BAP-Jeff Google Slides. No specific template, just the white default. For Kaggle, uploads will be done manually by their staff (since they're still working on making that functionality self-serve), so we need a staging location before the data gets copied there.
Okay, I have the file ready to go. It is about 5 GB uncompressed (31 million records). The file has the following table def: create table Hackathon... I pulled from Plan Attributes what I thought was the key information. Note: PlanID in the Formulary Files is not the same as PlanID in Plan Attributes; the analogous column is StandardComponentID. Furthermore, I ignored the CSR variations in Plan Attributes and joined the distinct records from Plan Attributes to the Formulary Files. If this didn't make sense... well, I guess trust me. I have added a header record to the file and am now waiting for the zipping to finish up. I assume it will get down to the 2 GB range. I can throw it up into an S3 bucket and share it, I guess, or GDrive. Let me know. Could be a few days on the slides... Jeff
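A tiny illustration of the join Jeff describes, with hypothetical rows rather than real PUF data: the formulary file's PlanID matches StandardComponentID in Plan Attributes (not its PlanId), and CSR variants are collapsed to distinct StandardComponentID records before joining.

```python
# Made-up Plan Attributes rows: two CSR variants of the same standard component
plan_attributes = [
    {"StandardComponentID": "11111AZ0010001", "PlanId": "11111AZ0010001-01", "StateCode": "AZ"},
    {"StandardComponentID": "11111AZ0010001", "PlanId": "11111AZ0010001-02", "StateCode": "AZ"},
]
# Made-up formulary row: its plan_id corresponds to StandardComponentID above
formulary_rows = [
    {"plan_id": "11111AZ0010001", "rxnorm_id": "209459", "drug_tier": "GENERIC"},
]

# Keep one distinct Plan Attributes record per StandardComponentID
attrs = {}
for row in plan_attributes:
    attrs.setdefault(row["StandardComponentID"], {"StateCode": row["StateCode"]})

# Join formulary rows to the deduplicated attributes
joined = [{**f, **attrs.get(f["plan_id"], {})} for f in formulary_rows]
```

Joining without the dedup step would double each formulary row once per CSR variant, which is presumably why Jeff took distinct records first.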
@BAP-Jeff, definitely throw it up on either S3 or GDrive, whichever works best! Ideally, you could also provide a smaller file with a more manageable sample subset of this data, so that it can be used to test code.
@ftrotter, wanted to ping you and see if you're still working on any aspect of this. |
That would be beyond fantastic!
|
I just kicked off a process to download all the provider.json files. I have gotten through the first 10, initial impression is that these files are MASSIVE. This could take a while to process. |
@BAP-Jeff - that's why I initially focused on formulary ;) - Any way I can help by running some scripts locally for you and/or on some cloud servers?
@marks, ha! Thanks for the offer. I can grab the files, but based on some back of the envelope calculations this is going to be multi-terabytes of source data. I'll check in later. |
@BAP-Jeff, @marks, I was afraid of that. A few options:
None of these are ideal. Perhaps to get something going, we can focus on loading a subset of the data. So pick either one interesting large state (like TX or FL) or small state (like AK or SD) to load. This would at least allow for use with consumer apps and certain types of analytics. |
@BAP-Jeff, @dportnoy - the latest (#56 (comment)) formulary files are API enabled at the following link. Let me know how else I can help, of course. |
Thanks! That leaves just the providers files.
|
I downloaded all the provider json files last night. Looks like maybe the "big" ones were at the top of the list and they got better after that. As we discussed/suggested, I am breaking them into a number of tables - indiv_languages, fac_addresses, etc... Still going to be big files, but seems like things are running.... |
FYI, we have some healthy sized files. I just exhausted the memory on my box. It will take me a few to spin up a big honker on AWS to run these big guys... Maybe I should just give up on these for now and deal with them later. @dportnoy what is the timing on all this? Today right? -rw-rw-r-- 1 ubuntu ubuntu 2267461069 Apr 21 21:11 PROVJSON3_201604212013.json |
@BAP-Jeff, for now could you create a sample file and pick one full state to crawl? (See the bottom of my note above.) Since the size would be manageable, you could use the simplest layout possible.
|
@dportnoy actually that is tougher than it sounds. To get a full state would imply that we would need to load all the files and then query the state we want out of them. I can work towards that, but it will take me some time to get a machine that can parse the 2 GB files. What is the timing? Here is what the headers will look like for the eight files (anything obvious I am missing?):

Plan Table (for Indiv and Facilities; I guess I could create separate ones...):
plan_id_type|plan_id|network_tier|npi

Individual Files:
npi|prefix|first|middle|last|suffix|accepting|gender|last_updated_on
npi|language
npi|specialty
npi|address|address_2|city|state|zip|phone (I could combine with the facility address file...)

Facility Files:
npi|facility_name|last_updated_on
npi|facility_type
npi|address|address_2|city|state|zip|phone
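A rough sketch of how one provider.json record could be split into a few of the pipe-delimited tables listed above. The record shape here is an assumed example (real provider.json entries vary), and only three of the eight tables are shown.

```python
# Hypothetical individual-provider record for illustration only
record = {
    "npi": "1234567890",
    "type": "INDIVIDUAL",
    "name": {"first": "Jane", "last": "Doe"},
    "languages": ["English", "Spanish"],
    "specialties": ["Cardiology"],
    "plans": [
        {"plan_id_type": "HIOS-PLAN-ID", "plan_id": "11111ME001", "network_tier": "PREFERRED"},
    ],
}

# Plan table: plan_id_type|plan_id|network_tier|npi
plan_rows = [
    "|".join([p["plan_id_type"], p["plan_id"], p["network_tier"], record["npi"]])
    for p in record["plans"]
]
# Language table: npi|language
language_rows = [f'{record["npi"]}|{lang}' for lang in record["languages"]]
# Specialty table: npi|specialty
specialty_rows = [f'{record["npi"]}|{s}' for s in record["specialties"]]
```

Normalizing the one-to-many arrays (plans, languages, specialties) into separate tables keyed by NPI is what keeps each flat file simple, at the cost of needing joins later.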
@BAP-Jeff,
One-state subset: I could be wrong, but I thought you would start with the...
Sample file: Besides the one-state option, it would be helpful to have a small sample file to work with -- regardless of what data is in it.
Timing: Ideally by end of day. But I'll start writing up links to the files we already have in parallel.
You've done so much already. I really appreciate it! Looking at fields next...
@dportnoy, I think what you are saying is more or less right. Do you have a state that you would like me to grab?
|
@BAP-Jeff, on fields... I think we need to add... Optional: consider adding...
#56 (comment) Which state? I'd love to get CA since BayesHack is there, but alas it's not included. If you want to be on the conservative side, you can pick a state with the fewest entries in the PUF, like AK or SD. When I did my last analysis a couple of months ago, both of these states had only one providers.json URL.
Got it. It will be interesting to see if the plans leverage their own internal address information or just use the NPPES file... I'll take a look at AK; I might also look at ME.
I am focusing on pulling files for ME. Though it is only a few JSON files, the created tables are enormous. Just for ME we are looking at probably a bit over 5 GB. I am not sure what to do about that. I guess I will just zip them up and post them. Wow.
@BAP-Jeff, since the size is ending up so big anyway, what if we post a subset first? Perhaps pick some arbitrary way to grab a subset that people can easily work with.
@dportnoy, okay, I got Maine done. It is posted here: https://drive.google.com/file/d/0B9yZheZrBn54UFFBOGlLWmZVejQ/view?usp=sharing Compressed down to about 130 MB. Very little QA on this guy. If we want to pull out something smaller, let me know. It's not entirely obvious how to filter it... maybe by a couple of PlanIDs...
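If filtering by a couple of PlanIDs turns out to be the way to cut a smaller sample, a streaming pass over the pipe-delimited extract keeps memory flat even on multi-gigabyte files. This sketch uses an in-memory stand-in for the real file and assumes a column order with plan_id first, which is an illustration rather than the actual layout.

```python
import io

# Stand-in for the real pipe-delimited extract; in practice this would be
# open("maine_providers.psv") or similar, read line by line
source = io.StringIO(
    "plan_id|npi|network_tier\n"
    "11111ME001|1234567890|PREFERRED\n"
    "22222ME002|2234567890|STANDARD\n"
)

keep = {"11111ME001"}           # the handful of PlanIDs to sample
header = next(source)           # pass the header row through unchanged
# Splitting only on the first delimiter avoids parsing the whole row
subset = [line for line in source if line.split("|", 1)[0] in keep]
```

The same loop works unchanged against a real file handle, writing matching lines to an output file instead of a list.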
@BAP-Jeff nice! Let's go with it. |
Working on BayesImpact related posts now. (@BAP-Jeff, @marks, could you help me summarize the latest and best links we have for each category?)
Will finish tomorrow morning! |
@dportnoy all I've got for ya is https://healthdata.demo.socrata.com/view/xc22-8t66 for the latest version of the Formulary data @BAP-Jeff scraped. Happy to upload anything else too, though.
@marks, it would be great if you could load the updated data. I'll proceed to publish individual links in the meantime, as well as data dictionaries.
@dportnoy somehow missed Jeff's comment about ME. I'll see what I can do but please confirm that the Formulary file is as you'd expect it to be. Can definitely update title/description with whatever you'd like (perhaps a link back to the right place to see other resources) |
@BAP-Jeff / @dportnoy - as you all know, providers are split into 6 files this time. Think it's worth creating one or two combined files for easier analysis? May need to leave something off like fac type or language or concatenate arrays into a string. Just a suggestion for making use easier like it is for formulary. Regardless, working on uploading to enable viz and APIs for these 6 files |
Good ideas. I am out of pocket for most of the weekend. Will be interested to hear from @dportnoy what kind of interest, if any, this...
|
@dportnoy the following files are ready:
All US Formulary file: https://healthdata.demo.socrata.com/view/xc22-8t66
Maine Provider Facility Type file: https://healthdata.demo.socrata.com/view/3juv-cnb4
Update... @BAP-Jeff, @marks, thank you again! You guys were a huge help! Couldn't have done it without you. We still need to write up the activities at BayesHack, but there were 8 HHS teams at BayesHack, 4 of them specifically dealing with helping consumers find the right healthcare. |
Anything useful/shareable come out of it? I would love to see.
|
Ditto. Jordan Rau | Senior Correspondent | Kaiser Health News (khn.org) | 202.654.1362 | @Jordanrau | jrau@kff.org
Putting out a call to those interested in making an impact by contributing to public data projects... Looking for somebody to create a new public dataset (and accompanying source code).
Background
In November 2015, the Centers for Medicare & Medicaid Services (CMS) enacted a new regulatory requirement for health insurers who list plans on insurance marketplaces. They must now publish a machine-readable version of their provider network directory and drug formulary, publish it to a specified JSON standard, and update it at least monthly. This data has just recently become accessible to the public. Some of its uses can be found in the Bayes Impact hackathon "prompts" or in at least 7 DDOD use cases.
Challenge
While these newly available datasets can significantly benefit consumer health applications and be used in a range of healthcare analytics, the current format doesn't lend itself to doing so.
Request
Write code that does the following:
Run the code and let us know where to find the resulting files. We should be able to find a good home for them, so that they enjoy widespread use.
If you can do this, you’ll be an official Open Data Hero! (Spandex optional.)