Document how to add new data sources #177

Closed
gschivley opened this issue Jun 13, 2018 · 10 comments

@gschivley
Contributor

I think it would be helpful if some documentation/guidance were added somewhere explaining how functions/scripts for new sources of data should be added to the project folders. Individual scripts appear to be well documented but it could save lots of time to have someone very familiar with the project write out where they would add a new script, where they would add functions to a script, and what existing functions might be helpful. Guidance on storing data in the pudl database vs creating a new database would also be great.

Example data sources I use and have in mind:

  • EIA state-level generation/fuel consumption (I get this from the API or bulk download)
  • EIA fuel prices (again the API)
  • EIA 860m (similar to 860 but only a subset of the data columns)
  • ERCOT hourly load history

My default would be to write a script for each of these sources and bring each one in as a table. But you've done such a great job designing the code and database that I'd like to follow a similar template and maybe PR some of the sources into the main project.

If someone can start a bare-bones structure I'm willing to help flesh it out.

@zaneselvans
Member

What do you see as the advantage of pulling some of the API / bulk accessible data into the database directly, rather than grabbing it on the fly to squish together with the stuff that's in the database?

The only separate database we've created is for FERC Form 1, and that's really just a clone of the FoxPro database FERC is using (but in postgres) for ETL purposes. We've talked about putting the CEMS data into its own DB because it's so big, relatively speaking, and because the ETL process already takes like 15-20 minutes... and we're doing it over and over again sometimes in testing / data cleaning. But that's probably not really a good reason. Also the CEMS data looks pretty clean so far, so maybe we can just pull in a small bit of it for structural purposes when we're playing with the FERC & EIA stuff... and pull it all in when we're creating the DB for actual use.

Thus far we've mostly been focused on trying (maybe unsuccessfully) to make PUDL usable by folks who want access to the data, who have so far tended not to be the coding types. Do you have any examples you like where this kind of contributor documentation/guidance has been laid out well in another project that we could learn from?

@gschivley
Contributor Author

Some backstory might help. I was helping design a system to load CEMS, subsets of 923/860, and a few other sources into a server database for my research group here at CMU. PUDL is much better designed so I'm trying to figure out a good way to integrate the additional data we need in a way that might be useful to your project.

You're right that the EIA API makes it easier to grab individual datasets on the fly. I can think of use cases where automatically combining/updating the database would be nice to avoid lots of API calls -- things like facility and state generation over the past 3 months. It might also involve pulling data out of the bulk file rather than the API (it's hard to ensure that we get every plant via the API without a full list of plant ids).
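
A minimal sketch of the kind of on-the-fly API pull described above (the endpoint, series id, and response shape are illustrative assumptions, not PUDL code):

```python
# Illustrative sketch only -- the endpoint, series id, and response shape are
# assumptions to show the pattern; check EIA's API docs for the real interface.
import requests

API_KEY = "YOUR_EIA_API_KEY"        # placeholder
SERIES_ID = "ELEC.GEN.ALL-TX-99.M"  # hypothetical series id

resp = requests.get(
    "https://api.eia.gov/series/",
    params={"api_key": API_KEY, "series_id": SERIES_ID},
    timeout=30,
)
resp.raise_for_status()
payload = resp.json()
# Assumed shape: a list of [period, value] pairs under payload["series"][0]["data"].
data = payload["series"][0]["data"]
```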

I don't have any examples I can point to. Maybe full documentation isn't worth the effort - I can always add sources on my own and then see if you're interested in them.

@gschivley
Contributor Author

With respect to usability/access by lay users, have you thought about a frontend with something like Dash (Python) or RShiny? It could let users do some simple selection and filtering and then provide the data as a csv. Not saying it would be simple to design, just curious if you have considered it.

@zaneselvans
Member

We've very much wanted to have exactly that kind of simple selection front end -- something to allow CSV export for relatively non-technical users -- but hadn't come across Dash. We looked at Data.world and Tableau a little bit as places where we could export occasional joinable table snapshots. Have you worked with Dash? We've been wanting to get the FERC & EIA stuff integrated well before moving on to other more outward facing aspects, and that has ended up being hard because the FERC data is so poorly quality controlled. But it's the only public record of CapEx and non-fuel OpEx, and that integration is central to what the folks we've been collaborating with are trying to do (make the financial case to shut down and replace fossil fuel power plants in a regulatory context).

We've been relatively insular thus far, but obviously this is all going to be much more useful and durable and interesting if we can engage other energy data nerds to help build it collaboratively, and different people have suggested lots of different data sets that would be nice to link in, so having some kind of guide on how to do that is probably important.

I'm glad you feel like it's well designed! We mostly come from a policy and advocacy background, but I used Python a bunch in my PhD (which was totally unrelated to energy and policy) and it seemed to be one of the main platforms for data science, so we just kind of dove in and tried to keep track of what the best practices are for various things.

@karldw is a PhD student at Berkeley also working on integrating CEMS data, along with Ariana Guo, an undergrad working with him for the summer. Then there's the Western Interstate Energy Board (WIEB) interns' work from last summer that we ported in when they were done, but it's pretty notebook focused and not meant to be terribly re-usable. Would you want to get on the phone at some point and chat briefly about how we might best work together to get CEMS in and useful?

@gschivley
Contributor Author

I've used Dash for a simple webapp. It didn't need much of a backend (just a few csv files), but Dash provides an easy way to make dropdowns/filters/etc. on the web using Python. I haven't tried to implement datatables yet, but there are some posts on their community forum about doing it.
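
A bare-bones sketch of the kind of Dash app described above (the DataFrame, column names, and component ids are invented for illustration -- this is not PUDL code):

```python
# Toy Dash app: one dropdown filter, output rendered as CSV text a user could
# copy out. Data and names are made up; just a sketch of the pattern.
import pandas as pd
from dash import Dash, dcc, html, Input, Output

df = pd.DataFrame({
    "state": ["TX", "TX", "CO", "CO"],
    "year": [2016, 2017, 2016, 2017],
    "net_generation_mwh": [1200.0, 1350.0, 800.0, 950.0],
})

app = Dash(__name__)
app.layout = html.Div([
    dcc.Dropdown(
        id="state-dropdown",
        options=[{"label": s, "value": s} for s in sorted(df["state"].unique())],
        value="TX",
    ),
    html.Pre(id="csv-preview"),
])

@app.callback(Output("csv-preview", "children"), Input("state-dropdown", "value"))
def show_csv(state):
    # Filter the table and render the result as CSV text.
    return df[df["state"] == state].to_csv(index=False)

if __name__ == "__main__":
    app.run(debug=True)
```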

I know that this is a solved problem in RShiny because another student implemented it.

Let me talk with my advisor about sharing code for CEMS. I think there is a case that integrating our work into yours will help everyone.

@karldw
Contributor

karldw commented Jun 13, 2018

@gschivley:

This description is going to be unsatisfyingly brief, but here's the process I've used as I've tried to load the CEMS data.

  • Add appropriate constants in pudl/constants.py. These are things like the download URL and the available years. Also add a DATA_DIR line for the new dataset in pudl/settings.py (a rough sketch of what this might look like follows the list).
  • Edit pudl/datastore.py to download all the data for the new source. This script is set up for downloading individual files, but I guess pulling from an API would be possible.
  • Add files in pudl/extract/, pudl/transform/ and pudl/model/ that will extract, transform, and load your data. (Starting from the raw data and ending in postgres.)
    • The pudl/model/ file defines the postgres specification, including the table name, the column names and types, as well as any indexes or constraints.
  • Edit pudl/init.py to run your new ETL code, importing as necessary from the subdirectories.
  • Write tests.
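
As a purely hypothetical illustration of the first step, the additions might look something like the snippet below -- the dict names, keys, and paths are invented, and the real pudl/constants.py / pudl/settings.py may be organized quite differently:

```python
# Hypothetical additions -- names and structure are illustrative only, not the
# actual layout of pudl/constants.py or pudl/settings.py.

# pudl/constants.py
data_years = {
    "eia860m": tuple(range(2015, 2019)),  # years available for the new source
}
data_urls = {
    "eia860m": "https://www.eia.gov/electricity/data/eia860m/",  # download landing page
}

# pudl/settings.py
import os
EIA860M_DATA_DIR = os.path.join("data", "eia", "form860m")  # where the raw files land
```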

Catalyst people, what have I missed?

@zaneselvans
Member

That sounds about right. There's a very abbreviated pudl/load.py module, because thus far we've been able to get the dataframes prepared in a pretty uniform way for loading, using the little sqlalchemy-postgres-copy package, which slurps directly from a CSV file and is very fast. Note that all of the database table definitions done within the pudl/models/datasource.py modules need to descend from the same SQLAlchemy declarative base object (PUDLBase) for them all to be part of the same database definition.
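
A rough sketch of what such a table definition could look like (the import path for PUDLBase and the table/column names are guesses for illustration, not actual PUDL models):

```python
# Illustrative sketch only: the PUDLBase import path and the table/column
# names are assumptions, not actual PUDL code.
from sqlalchemy import Column, DateTime, Float, Integer, String

from pudl.models.entities import PUDLBase  # wherever the declarative base lives


class HourlyLoadERCOT(PUDLBase):
    """Hypothetical table for ERCOT hourly load history."""
    __tablename__ = "hourly_load_ercot"

    id = Column(Integer, primary_key=True, autoincrement=True)
    weather_zone = Column(String, nullable=False)
    utc_datetime = Column(DateTime, nullable=False)
    load_mw = Column(Float)
```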

  • The pudl/extract/datasource.py modules should contain any functions required to get the data from its original format (CSV, the FERC FoxPro nightmare, Excel spreadsheets, etc.) into a raw dataframe that's probably pretty close in structure to the original data source. Sometimes a bit of re-arranging has been necessary to get things looking like meaningful individual records -- for example the pudl.extract.eia923.yearly_to_monthly_eia923 function takes the 12 monthly columns that are provided for many of the EIA923 data fields and turns them into 12 individual records (a toy version of that reshape is sketched after this list). But usually very little is changed structurally.
  • The pudl/transform/datasource.py modules should take those raw dataframes as input, and return a collection of dataframes that represent well normalized proto-database tables. This is also where any datatype issues get dealt with, where categorical simplification happens, and any other data cleaning is done. I could imagine wanting to split this into two steps, honestly, one for structural re-arrangement and normalization, and another for data content cleanup, but for now they are munged together.
  • Then in pudl/init.py there's a main-like function that calls datasource-specific extract, transform, and load functions, also defined within pudl/init.py, which make use of the functions that live in the modules discussed above to do the actual extraction, transformation, and loading.
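
To make the structural re-arrangement mentioned in the extract step concrete, here is a toy version of the monthly-columns-to-monthly-records reshape (column names are invented; this is not the actual pudl.extract.eia923.yearly_to_monthly_eia923 implementation):

```python
# Toy illustration of the "12 monthly columns -> 12 records" reshape described
# above. Column names are invented; this is not the actual PUDL function.
import pandas as pd

raw = pd.DataFrame({
    "plant_id": [1, 2],
    "report_year": [2017, 2017],
    "netgen_january": [100.0, 50.0],
    "netgen_february": [110.0, 55.0],
    # ... one column per month in the real data
})

tidy = raw.melt(
    id_vars=["plant_id", "report_year"],
    var_name="month",
    value_name="net_generation_mwh",
)
tidy["month"] = tidy["month"].str.replace("netgen_", "", regex=False)
# Each (plant, year, month) is now its own record, ready for further cleaning.
```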

Separate from these datasource specific ETL steps, we've got some "glue" that binds the different data sources together, especially EIA860 and EIA923 right now (someday, FERC, someday), but we're trying to do our best not to make the ETL for the individual data sources dependent on the glue -- so one can bring a new datasource in without necessarily needing to fully integrate it (which makes playing with it... and then integrating it, easier).

pudl/constants.py has really gotten out of control. It needs to be broken down into some more rationally organized set of modules or other data storage arrangement. It's too long, and has too many different kinds of things in it, and is pretty hard to keep track of.

If we can pull down the data that's available via the EIA API in the bulk files, and just have it locally available, that would be consistent with the rest of the system's structure. At some point, depending on which data sources we go after, this may not continue to be a rational strategy. Amazon's Redshift databases are also postgres based, and we've thought that eventually it might make sense to push the production database up there, and just do local development. But we haven't really played with it yet. It sounds like the WIEB interns are storing the CEMS data in AWS, and I wonder if it's in Redshift.

@cmgosnell has mostly completed a re-vamp of the overall ETL architecture, and is working on making the standalone entity tables now I think.

@gschivley
Contributor Author

This is a great overview. @priyald17 you might want to listen in too.

@cmgosnell
Member

Hey, I'm a little late to this party, but @zaneselvans explained the current overall ETL process pretty well. I'd expect a new module in extract, transform, and load for each new datasource. But as you probably expect, each datasource will have its own lovely quirks that need ironing out. Generally my hope is that we could continue to slurp in and transform each datasource in the same format. Right now it's all in dataframes, but as long as it's all consistent, we'd be able to make any interesting inter-datasource connections before everything is loaded into pudl. We aren't really doing that right now, but I'd like to preserve the ability. If you want a screen-share-y spin through any of the modules or datasource setup, let me know!

cmgosnell added a commit that referenced this issue Jul 6, 2018
Whoo. Closes Issue #177. This is a draft in many ways and we are super
willing to hear feedback from folks and modify standards, but this should
give a general outline of the pieces of the etl process and how we've
thought about them and want them implemented moving forward. There are a
few sections (such as glue) that are partially aspirational. But we'll
move towards implementation.
@zaneselvans zaneselvans added the docs Documentation for users and contributors. label Sep 28, 2018
@zaneselvans zaneselvans self-assigned this Sep 28, 2018
@zaneselvans zaneselvans added this to the v0.1.0 release milestone Sep 28, 2018
@gschivley
Contributor Author

Adding a new data source is now covered in CONTRIBUTING.md.
