Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read EIA860 data in parallel #2644

Merged
merged 1 commit into from
Jun 15, 2023

Conversation

dstansby
Copy link
Contributor

@dstansby dstansby commented Jun 8, 2023

PR Overview

This fixes (partially) #2385, by parallelising the reading in of EIA860 Excel spreadsheets. As suggested in that issue I've largely followed previous work in #2472 for the implementation.

I've tested this locally with the ETL_FAST job and it seems to work fine.

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@zaneselvans
Copy link
Member

Hey, thanks for taking this on! I think it'll only really be exercised fully in a multi-year run. Would you mind trying to generate the raw EIA-860 dataframes in the full ETL and see how it goes?

@zaneselvans zaneselvans self-assigned this Jun 8, 2023
@zaneselvans zaneselvans added eia860 Anything having to do with EIA Form 860 dagster Issues related to our use of the Dagster orchestrator labels Jun 8, 2023
@zaneselvans
Copy link
Member

Also we're super curious what you're up to with the PUDL data!

@dstansby
Copy link
Contributor Author

dstansby commented Jun 8, 2023

Would you mind trying to generate the raw EIA-860 dataframes in the full ETL and see how it goes?

👍 will do

Also we're super curious what you're up to with the PUDL data!

(perhaps disappointingly for you folks 😄) I'm not doing anything with the data. I've been doing Python OSS development for a while now, partly day job and partly for fun. Also worried about the climate, and looking for some fresh projects to contribute too, and this seemed like a great option!

@dstansby
Copy link
Contributor Author

dstansby commented Jun 8, 2023

FWIW it looks like doing a fast run loads data from 2020 and 2021:
Screenshot 2023-06-08 at 20 44 18

don't know if that's intended or not, but I'll do a full run anyway and report back.

@zaneselvans
Copy link
Member

That looks right. There's monthly data from the most recent year that's integrated into the EIA-860 extraction on top of the most recent year of annual data in the fast ETL.

It seems like there's an unexpected change to the contents of the data as a result of the parallel loading though, which is resulting in the generator technology_type field not being as consistent as we expect (but just by a small margin). I wonder what's going on there?

@zaneselvans
Copy link
Member

(perhaps disappointingly for you folks 😄) I'm not doing anything with the data. I've been doing Python OSS development for a while now, partly day job and partly for fun. Also worried about the climate, and looking for some fresh projects to contribute too, and this seemed like a great option!

Not at all! @rousik has also been helping us out on the side with his software and DevOps experience. If you'd ever like to hop on a call and get some more context on how the data feeds into the energy policy machine in the US or want help finding an appropriate issue to tackle given your skills, please let us know.

@dstansby
Copy link
Contributor Author

The full ETL seems to work fine through the new EIA-860 loading (at least up to the epacems steps where I get a bunch of timeouts from zenodo). I think I've fixed the problem with the data contents too, will wait and see what the integration tests say to be sure.

@codecov
Copy link

codecov bot commented Jun 12, 2023

Codecov Report

Patch coverage: 100.0% and no project coverage change.

Comparison is base (887d112) 87.1% compared to head (bd6ef93) 87.1%.

Additional details and impacted files
@@          Coverage Diff          @@
##             dev   #2644   +/-   ##
=====================================
  Coverage   87.1%   87.1%           
=====================================
  Files         86      86           
  Lines      10004   10027   +23     
=====================================
+ Hits        8719    8740   +21     
- Misses      1285    1287    +2     
Impacted Files Coverage Δ
src/pudl/extract/eia860.py 100.0% <100.0%> (ø)

... and 2 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@zaneselvans
Copy link
Member

zaneselvans commented Jun 12, 2023

What did you have to change to fix whatever was going wrong before? It all seems to be in one big commit. Was it just needing to add a pd.concat() to mush all the dataframes together?

Tagging @jdangerx for a review since this super slow Excel reading has been a pet peeve of his.

If this works well should we try and use it for the eia923 and eia861 spreadsheet loading as well?

@dstansby
Copy link
Contributor Author

By habit I squash stuff into a single commit until a proper review on pull requests - it was previously broken because in I was accidentally overriting previous years when merging the dataframes into a single dictionary. Now fixed by the new merge_eia860_years op.

If this works well should we try and use it for the eia923 and eia861 spreadsheet loading as well?

Sounds good - happy to do that in this PR or a subsequent PR if this workflow is good, let me know!

Copy link
Member

@jdangerx jdangerx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks good, thanks so much for the patch! But it'd be nice to catch that error you found in some sort of data validation test somewhere - hence "request changes."

@zaneselvans , you have a lot more context on our data validations than I do - do you have any thoughts on what to do here? If you feel comfortable pushing this through without data validation tests, too, I defer to your judgement and will happily approve the PR 😅

future/out-of-scope: if this pattern works well in our nightly build, it'd be nice to apply this everywhere we use that excel.GenericExtractor - then we could probably even delete a bunch of that GenericExtractor code...



@graph_asset
def eia860_raw_dfs() -> dict[str, pd.DataFrame]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pattern is pretty clean and I bet it would apply well to our other "read a lot of excel files in series" operations, as @zaneselvans pointed out! I don't think that needs to be part of this PR, though - happy to merge this chunk first and see how it performs before moving more into this pattern 😄

@zaneselvans
Copy link
Member

@jdangerx If you mean the accidental clobbering of other years bug, it broke a bunch of things downstream with more than one year of data, so it's probably okay as is? It wouldn't be a silent error.

@jdangerx
Copy link
Member

Sounds good! @dstansby any last thoughts before merge?

@dstansby
Copy link
Contributor Author

Nope, all good from my end - I'll look into rolling this out to other data sources in a follow up

@jdangerx jdangerx merged commit 0988cf5 into catalyst-cooperative:dev Jun 15, 2023
@dstansby dstansby deleted the eia860-parallel-load branch June 15, 2023 18:01
@dstansby dstansby mentioned this pull request Jun 15, 2023
8 tasks
@zaneselvans zaneselvans linked an issue Jun 16, 2023 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dagster Issues related to our use of the Dagster orchestrator eia860 Anything having to do with EIA Form 860
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Parallelize extraction of Excel spreadsheets
3 participants