Read EIA860 data in parallel #2644

dstansby · 2023-06-08T17:52:25Z

PR Overview

This fixes (partially) #2385, by parallelising the reading in of EIA860 Excel spreadsheets. As suggested in that issue I've largely followed previous work in #2472 for the implementation.

I've tested this locally with the ETL_FAST job and it seems to work fine.

PR Checklist

Merge the most recent version of the branch you are merging into (probably dev).
All CI checks are passing. Run tests locally to debug failures
Make sure you've included good docstrings.
For major data coverage & analysis changes, run data validation tests
Include unit tests for new functions and classes.
Defensive data quality/sanity checks in analyses & data processing functions.
Update the release notes and reference reference the PR and related issues.
Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

zaneselvans · 2023-06-08T18:32:30Z

Hey, thanks for taking this on! I think it'll only really be exercised fully in a multi-year run. Would you mind trying to generate the raw EIA-860 dataframes in the full ETL and see how it goes?

zaneselvans · 2023-06-08T19:12:20Z

Also we're super curious what you're up to with the PUDL data!

dstansby · 2023-06-08T19:38:54Z

Would you mind trying to generate the raw EIA-860 dataframes in the full ETL and see how it goes?

👍 will do

Also we're super curious what you're up to with the PUDL data!

(perhaps disappointingly for you folks 😄) I'm not doing anything with the data. I've been doing Python OSS development for a while now, partly day job and partly for fun. Also worried about the climate, and looking for some fresh projects to contribute too, and this seemed like a great option!

dstansby · 2023-06-08T19:45:31Z

FWIW it looks like doing a fast run loads data from 2020 and 2021:

don't know if that's intended or not, but I'll do a full run anyway and report back.

zaneselvans · 2023-06-08T21:04:48Z

That looks right. There's monthly data from the most recent year that's integrated into the EIA-860 extraction on top of the most recent year of annual data in the fast ETL.

It seems like there's an unexpected change to the contents of the data as a result of the parallel loading though, which is resulting in the generator technology_type field not being as consistent as we expect (but just by a small margin). I wonder what's going on there?

zaneselvans · 2023-06-08T21:17:37Z

(perhaps disappointingly for you folks 😄) I'm not doing anything with the data. I've been doing Python OSS development for a while now, partly day job and partly for fun. Also worried about the climate, and looking for some fresh projects to contribute too, and this seemed like a great option!

Not at all! @rousik has also been helping us out on the side with his software and DevOps experience. If you'd ever like to hop on a call and get some more context on how the data feeds into the energy policy machine in the US or want help finding an appropriate issue to tackle given your skills, please let us know.

dstansby · 2023-06-12T07:56:40Z

The full ETL seems to work fine through the new EIA-860 loading (at least up to the epacems steps where I get a bunch of timeouts from zenodo). I think I've fixed the problem with the data contents too, will wait and see what the integration tests say to be sure.

codecov · 2023-06-12T17:57:39Z

Codecov Report

Patch coverage: 100.0% and no project coverage change.

Comparison is base (887d112) 87.1% compared to head (bd6ef93) 87.1%.

Additional details and impacted files

@@          Coverage Diff          @@
##             dev   #2644   +/-   ##
=====================================
  Coverage   87.1%   87.1%           
=====================================
  Files         86      86           
  Lines      10004   10027   +23     
=====================================
+ Hits        8719    8740   +21     
- Misses      1285    1287    +2

Impacted Files	Coverage Δ
src/pudl/extract/eia860.py	`100.0% <100.0%> (ø)`

... and 2 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

zaneselvans · 2023-06-12T19:05:18Z

What did you have to change to fix whatever was going wrong before? It all seems to be in one big commit. Was it just needing to add a pd.concat() to mush all the dataframes together?

Tagging @jdangerx for a review since this super slow Excel reading has been a pet peeve of his.

If this works well should we try and use it for the eia923 and eia861 spreadsheet loading as well?

dstansby · 2023-06-12T20:00:58Z

By habit I squash stuff into a single commit until a proper review on pull requests - it was previously broken because in I was accidentally overriting previous years when merging the dataframes into a single dictionary. Now fixed by the new merge_eia860_years op.

If this works well should we try and use it for the eia923 and eia861 spreadsheet loading as well?

Sounds good - happy to do that in this PR or a subsequent PR if this workflow is good, let me know!

jdangerx

This all looks good, thanks so much for the patch! But it'd be nice to catch that error you found in some sort of data validation test somewhere - hence "request changes."

@zaneselvans , you have a lot more context on our data validations than I do - do you have any thoughts on what to do here? If you feel comfortable pushing this through without data validation tests, too, I defer to your judgement and will happily approve the PR 😅

future/out-of-scope: if this pattern works well in our nightly build, it'd be nice to apply this everywhere we use that excel.GenericExtractor - then we could probably even delete a bunch of that GenericExtractor code...

jdangerx · 2023-06-13T17:55:45Z

src/pudl/extract/eia860.py

+
+
+@graph_asset
+def eia860_raw_dfs() -> dict[str, pd.DataFrame]:


This pattern is pretty clean and I bet it would apply well to our other "read a lot of excel files in series" operations, as @zaneselvans pointed out! I don't think that needs to be part of this PR, though - happy to merge this chunk first and see how it performs before moving more into this pattern 😄

zaneselvans · 2023-06-13T19:08:07Z

@jdangerx If you mean the accidental clobbering of other years bug, it broke a bunch of things downstream with more than one year of data, so it's probably okay as is? It wouldn't be a silent error.

jdangerx · 2023-06-14T12:53:33Z

Sounds good! @dstansby any last thoughts before merge?

dstansby · 2023-06-14T13:04:54Z

Nope, all good from my end - I'll look into rolling this out to other data sources in a follow up

zaneselvans self-assigned this Jun 8, 2023

zaneselvans added eia860 Anything having to do with EIA Form 860 dagster Issues related to our use of the Dagster orchestrator labels Jun 8, 2023

dstansby force-pushed the eia860-parallel-load branch from 8f7cfce to 9441450 Compare June 12, 2023 07:55

Read EIA860 data in parallel

bd6ef93

dstansby force-pushed the eia860-parallel-load branch from 9441450 to bd6ef93 Compare June 12, 2023 15:03

zaneselvans requested a review from jdangerx June 12, 2023 19:03

jdangerx requested changes Jun 13, 2023

View reviewed changes

jdangerx approved these changes Jun 14, 2023

View reviewed changes

jdangerx merged commit 0988cf5 into catalyst-cooperative:dev Jun 15, 2023

dstansby deleted the eia860-parallel-load branch June 15, 2023 18:01

dstansby mentioned this pull request Jun 15, 2023

Parallel load EIA{861, 923} #2673

Closed

8 tasks

zaneselvans linked an issue Jun 16, 2023 that may be closed by this pull request

Parallelize extraction of Excel spreadsheets #2385

Closed

zaneselvans mentioned this pull request Jun 21, 2023

Merge dev into main for 2023-06-21 #2685

Merged

bendnorman mentioned this pull request Jun 29, 2023

Parallelize extraction of Excel spreadsheets #2385

Closed

e-belfer mentioned this pull request Oct 16, 2023

Parallelize extraction of Excel spreadsheets #2943

Merged

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Read EIA860 data in parallel #2644

Read EIA860 data in parallel #2644

dstansby commented Jun 8, 2023

zaneselvans commented Jun 8, 2023

zaneselvans commented Jun 8, 2023

dstansby commented Jun 8, 2023

dstansby commented Jun 8, 2023

zaneselvans commented Jun 8, 2023

zaneselvans commented Jun 8, 2023

dstansby commented Jun 12, 2023

codecov bot commented Jun 12, 2023

zaneselvans commented Jun 12, 2023 •

edited

Loading

dstansby commented Jun 12, 2023

jdangerx left a comment •

edited

Loading

jdangerx Jun 13, 2023

zaneselvans commented Jun 13, 2023

jdangerx commented Jun 14, 2023

dstansby commented Jun 14, 2023



		@graph_asset
		def eia860_raw_dfs() -> dict[str, pd.DataFrame]:

Read EIA860 data in parallel #2644

Read EIA860 data in parallel #2644

Conversation

dstansby commented Jun 8, 2023

PR Overview

PR Checklist

zaneselvans commented Jun 8, 2023

zaneselvans commented Jun 8, 2023

dstansby commented Jun 8, 2023

dstansby commented Jun 8, 2023

zaneselvans commented Jun 8, 2023

zaneselvans commented Jun 8, 2023

dstansby commented Jun 12, 2023

codecov bot commented Jun 12, 2023

Codecov Report

zaneselvans commented Jun 12, 2023 • edited Loading

dstansby commented Jun 12, 2023

jdangerx left a comment • edited Loading

Choose a reason for hiding this comment

jdangerx Jun 13, 2023

Choose a reason for hiding this comment

zaneselvans commented Jun 13, 2023

jdangerx commented Jun 14, 2023

dstansby commented Jun 14, 2023

zaneselvans commented Jun 12, 2023 •

edited

Loading

jdangerx left a comment •

edited

Loading