Parallel load EIA{861, 923} #2673

dstansby · 2023-06-15T20:39:35Z

PR Overview

This builds on top of #2644 to add parallel data loading to the EIA 861 and EIA 923 datasets. All of the logic introduced in #2644 has been factored out to excel.py so it can be re-used across each of the datasets.

I'll take this out of draft when I've finished the checklist below, but I think implementation is done so opening for early comments and a CI run.

This is more work towards #2385, but doesn't quite fix it because the EIA 860m data isn't being parallel loaded yet. That's currently done a bit differently to the other datasets in this PR, so I think it's best to tackle that in a follow-up PR.

PR Checklist

Merge the most recent version of the branch you are merging into (probably dev).
All CI checks are passing. Run tests locally to debug failures
Make sure you've included good docstrings.
For major data coverage & analysis changes, run data validation tests
Include unit tests for new functions and classes.
Defensive data quality/sanity checks in analyses & data processing functions.
Update the release notes and reference the PR and related issues.
Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

codecov · 2023-06-15T23:12:41Z

Codecov Report

Patch coverage: 100.0% and project coverage change: -0.1 ⚠️

Comparison is base (c1fdbae) 88.4% compared to head (9e9e65c) 88.4%.

❗ Current head 9e9e65c differs from pull request most recent head e367529. Consider uploading reports for the commit e367529 to get more accurate results

Additional details and impacted files

@@           Coverage Diff           @@
##             dev   #2673     +/-   ##
=======================================
- Coverage   88.4%   88.4%   -0.1%     
=======================================
  Files         87      87             
  Lines      10139   10142      +3     
=======================================
+ Hits        8971    8972      +1     
- Misses      1168    1170      +2

Impacted Files	Coverage Δ
src/pudl/extract/eia860.py	`100.0% <100.0%> (ø)`
src/pudl/extract/eia861.py	`94.4% <100.0%> (-0.3%)`	⬇️
src/pudl/extract/eia923.py	`100.0% <100.0%> (ø)`
src/pudl/extract/excel.py	`95.6% <100.0%> (+0.9%)`	⬆️

... and 2 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

jdangerx

This direction looks great so far! Might be nice to test the merging behavior a bit - I'm sure you've thought of that, though. Let me know if you want any help with testing Dagster machinery!

jdangerx · 2023-06-16T20:03:15Z

src/pudl/extract/excel.py

+
+
+@op
+def merge_yearly_dfs(


non-blocking: I suppose in theory these don't have to be _yearly_ - this is generic enough that it should work so long as everything we pass in has the same keys and the DFs that correspond have the same schema. That's pretty nice, we should maybe rename this though - maybe just merge_dfs_by_page or something?

At some point we'll need to generalize these annual things to work with arbitrary data partitions, since some of the other spreadsheets will be organized in other ways (the PHMSA data has weird, irregular multi-year blocks, I don't know why). We haven't done this yet, though, and have been very wedded to yearly chunks for the EIA spreadsheets so far.

Great shout, I've renamed as suggested and expanded the docstring to explain what's going on here.
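For readers following the thread, here is a minimal sketch of the page-wise merge idea described above. It is not the code from this PR: the function name `merge_by_page` and the sample data are hypothetical, and plain lists of row dicts stand in for DataFrames so the sketch has no dependencies. With pandas, the per-page step would typically be `pd.concat` over the frames for that page.

```python
from collections import defaultdict


def merge_by_page(partitions):
    """Merge per-partition {page: rows} dicts into one {page: rows} dict.

    Hypothetical stand-in: each "table" is a list of row dicts rather than
    a pandas DataFrame. Partitions can be years or any other chunking.
    """
    if not partitions:
        return {}
    expected_pages = set(partitions[0])
    merged = defaultdict(list)
    for part in partitions:
        # Every partition must provide exactly the same set of pages.
        if set(part) != expected_pages:
            mismatch = sorted(set(part) ^ expected_pages)
            raise ValueError(f"Page mismatch between partitions: {mismatch}")
        for page, rows in part.items():
            merged[page].extend(rows)
    return dict(merged)


# Made-up sample data for two partitions.
by_2020 = {"plant": [{"id": 1, "year": 2020}], "boiler": [{"id": 7, "year": 2020}]}
by_2021 = {"plant": [{"id": 1, "year": 2021}], "boiler": []}
merged = merge_by_page([by_2020, by_2021])
```

Nothing in the sketch depends on the partitions being years, which is the point made in the review comment: the only requirements are matching keys and matching schemas per key.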

src/pudl/extract/excel.py

dstansby · 2023-06-18T20:16:51Z

I've added some tests now. Hopefully they're self-explanatory, but if anything's confusing let me know and I can improve comments or docstrings.

jdangerx

Thanks for adding tests, and I'm glad you're looking out for future failing tests as well :) Do you think it's worth breaking that unit-test/real-world dependency more by sending in test data to the op config?

jdangerx · 2023-06-20T20:33:20Z

test/unit/extract/excel_test.py

+    with build_op_context(
+        resources={"dataset_settings": DatasetsSettings()}
+    ) as context:
+        # Assert actual years are a superset of expected. Instead of doing


Do you think we can get around this unit-test/real-world coupling by passing in expected years to DatasetsSettings() instead of using its default behavior?

Possibly, but I'm struggling to work out how to set years. The below didn't work, because the settings objects are immutable. I can't work out how to create them and pass years in though...

diff --git a/test/unit/extract/excel_test.py b/test/unit/extract/excel_test.py
index 767e1d04..1b2be0f7 100644
--- a/test/unit/extract/excel_test.py
+++ b/test/unit/extract/excel_test.py
@@ -123,16 +123,21 @@ class TestGenericExtractor(unittest.TestCase):
 @pytest.mark.parametrize(
     "dataset, expected_years",
     (
-        ("eia860", set(range(2001, 2022))),
-        ("eia861", set(range(2001, 2022))),
-        ("eia923", set(range(2001, 2022))),
+        ("eia860", [2001, 2002]),
+        ("eia861", [2003, 2004, 2005]),
+        ("eia923", [2022]),
     ),
 )
 def test_years_from_settings(dataset, expected_years):
     years_from_settings = excel.years_from_settings_factory(dataset)
 
+    settings = DatasetsSettings()
+    settings.eia.eia860.years = [2001, 2002]
+    settings.eia.eia861.years = [2003, 2004, 2005]
+    settings.eia.eia923.years = [2022]
+
     with build_op_context(
-        resources={"dataset_settings": DatasetsSettings()}
+        resources={"dataset_settings": settings}
     ) as context:
         # Assert actual years are a superset of expected. Instead of doing
         # an equality check, this avoids having to update expected years

Oh yeah, this datasource metadata thing is pretty weird and complicated - lots of stuff that reads from disk by default. I think we can keep your workaround for now 😰
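As a generic illustration of the immutability problem discussed in this thread, the sketch below uses stdlib frozen dataclasses. The classes `FormSettings` and `EiaSettings` are hypothetical stand-ins, not the PUDL settings classes, and the thread does not confirm whether the real settings can be built this way. The pattern shown is to build a modified copy instead of assigning to a field.

```python
from dataclasses import FrozenInstanceError, dataclass, field, replace


@dataclass(frozen=True)
class FormSettings:
    """Hypothetical stand-in for one dataset's settings (not a PUDL class)."""

    years: tuple = (2001, 2002)


@dataclass(frozen=True)
class EiaSettings:
    """Hypothetical stand-in for a container of per-form settings."""

    eia860: FormSettings = field(default_factory=FormSettings)
    eia923: FormSettings = field(default_factory=FormSettings)


defaults = EiaSettings()

# Assigning to a field of a frozen object fails, as in the attempt above.
try:
    defaults.eia923.years = (2022,)
    assignment_failed = False
except FrozenInstanceError:
    assignment_failed = True

# Building a modified copy leaves the defaults untouched.
custom = replace(defaults, eia923=FormSettings(years=(2022,)))

# Superset check in the spirit of the test: every expected year is present.
expected_years = {2022}
assert expected_years <= set(custom.eia923.years)
```

Whether an equivalent of `replace` exists for the real settings objects depends on the library they are built on, which is why the workaround in the test was kept.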

dstansby · 2023-06-20T21:41:59Z

Note to myself to look into this: the validation tests failed with

FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-bga_eia860-130326-130326-130326] - ValueError: bga_eia860: found 127418 rows, expected 130326. Off by -2.231%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-bga_eia860-130326-130326-130326] - ValueError: bga_eia860: found 127418 rows, expected 130326. Off by -2.231%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-gen_eia923-None-5171497-432570] - ValueError: gen_eia923: found 432561 rows, expected 432570. Off by -0.002%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-hr_by_unit-362381-30340] - ValueError: hr_by_unit: found 29697 rows, expected 30340. Off by -2.119%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-hr_by_gen-555119-46408] - ValueError: hr_by_gen: found 45332 rows, expected 46408. Off by -2.319%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-fuel_cost-555119-46408] - ValueError: fuel_cost: found 45332 rows, expected 46408. Off by -2.319%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-capacity_factor-5171497-432570] - ValueError: capacity_factor: found 432561 rows, expected 432570. Off by -0.002%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-mcoe-5171881-432602] - ValueError: mcoe: found 432591 rows, expected 432602. Off by -0.003%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-bga_eia860-130326-130326-130326] - ValueError: bga_eia860: found 127418 rows, expected 130326. Off by -2.231%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-gen_eia923-None-5171497-432570] - ValueError: gen_eia923: found 5171263 rows, expected 5171497. Off by -0.005%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-hr_by_unit-362381-30340] - ValueError: hr_by_unit: found 354711 rows, expected 362381. Off by -2.117%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-hr_by_gen-555119-46408] - ValueError: hr_by_gen: found 542243 rows, expected 555119. Off by -2.320%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-fuel_cost-555119-46408] - ValueError: fuel_cost: found 542243 rows, expected 555119. Off by -2.320%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-capacity_factor-5171497-432570] - ValueError: capacity_factor: found 5171263 rows, expected 5171497. Off by -0.005%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-mcoe-5171881-432602] - ValueError: mcoe: found 5171622 rows, expected 5171881. Off by -0.005%, allowed margin of 0.000%
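The percentages in these messages are relative deviations of the found row count from the expected one. A minimal sketch of that arithmetic follows. The function names are hypothetical and this is not the PUDL validation code; it only reproduces the numbers shown in the log.

```python
def relative_row_error(found, expected):
    """Signed deviation of the found row count, as a fraction of expected."""
    return (found - expected) / expected


def check_rows(name, found, expected, margin=0.0):
    """Raise ValueError if the row count deviates by more than the margin."""
    error = relative_row_error(found, expected)
    if abs(error) > margin:
        raise ValueError(
            f"{name}: found {found} rows, expected {expected}. "
            f"Off by {error:.3%}, allowed margin of {margin:.3%}"
        )


# bga_eia860 from the log above: 2908 rows missing out of 130326.
bga_error = f"{relative_row_error(127418, 130326):.3%}"

# gen_eia923 (annual) from the log above: 9 rows missing out of 432570.
gen_error = f"{relative_row_error(432561, 432570):.3%}"
```

With an allowed margin of zero, even the 9-row difference in gen_eia923 is enough to fail the check.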

zaneselvans · 2023-06-20T22:54:36Z

If I had to guess I'd say the changes are probably related to the boiler generator association table (bga_eia860) but I don't know why parallel loading of the input spreadsheets would have resulted in that output. This would probably be another good application of @rousik's output differ #2627

dstansby · 2023-06-21T12:57:30Z

I'll try it out. Is there a nightly or similar build of pudl.sqlite from the dev branch anywhere I can download? Would save me running the whole pipeline again on the dev branch to get something to compare against. (although maybe doing that would be quicker than downloading 5+GB...?)

zaneselvans · 2023-06-22T16:43:15Z

You can download a recent pudl.sqlite here from the nightly build on dev.
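One simple first pass at comparing two database files is to diff per-table row counts. The sketch below uses only the stdlib `sqlite3` module. It is not the output differ from #2627, the helper names are hypothetical, and the in-memory databases and their contents are made up for the demo.

```python
import sqlite3


def table_row_counts(conn):
    """Return {table_name: row_count} for every table in the database."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    # Names come from sqlite_master itself, so quoting them is enough here.
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }


def diff_row_counts(old_conn, new_conn):
    """Return {table: (old_count, new_count)} for tables whose counts differ."""
    old, new = table_row_counts(old_conn), table_row_counts(new_conn)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }


# Tiny in-memory demo with made-up tables and rows.
old_db = sqlite3.connect(":memory:")
new_db = sqlite3.connect(":memory:")
for db, n_rows in ((old_db, 3), (new_db, 2)):
    db.execute("CREATE TABLE bga_eia860 (plant_id INTEGER)")
    db.executemany("INSERT INTO bga_eia860 VALUES (?)", [(i,) for i in range(n_rows)])
    db.execute("CREATE TABLE unchanged (x INTEGER)")

changed = diff_row_counts(old_db, new_db)
```

For real files, the two connections would come from `sqlite3.connect(...)` pointed at the downloaded nightly database and the locally built one. Row counts only show where to look; finding which rows differ needs a comparison on each table's keys.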

jdangerx · 2023-07-11T20:27:06Z

@dstansby - did you ever get a chance to compare the old and new outputs?

e-belfer · 2023-10-12T13:59:07Z

Hi @dstansby! Just a heads up, I'm working on extraction of a new set of Excel spreadsheets over in #2848, and thought it'd be a good time to try and hammer out the validation issue while I'm at it.

dstansby · 2023-10-12T15:33:22Z

Thanks - sorry I lost track of this, if you or someone else wants to finish it off (or close it) feel free!

e-belfer · 2023-10-16T18:20:08Z

Closing this and picking it up in #2943

zaneselvans requested a review from jdangerx June 16, 2023 17:25

zaneselvans added the eia923, eia861, excel, and dagster labels Jun 16, 2023

zaneselvans linked an issue Jun 16, 2023 that may be closed by this pull request

Parallelize extraction of Excel spreadsheets #2385

Closed

jdangerx reviewed Jun 16, 2023

dstansby force-pushed the eia861-parallel-load branch from d1b8689 to 8e4de12 Compare June 19, 2023 09:13

jdangerx requested changes Jun 20, 2023

dstansby added 8 commits June 25, 2023 17:02

Parallel load EIA861

3840373

De-dupe years from settings

82cef3b

Add a raw_df factory

5bced34

Clean up new docs

b048e83

Apply parallel loading to eia923

f060d93

Add a unit test for years_from_settings

4002f15

Add test for yearly df merge

c04e98f

Add docstring to merge_dfs_by_page

e367529

dstansby force-pushed the eia861-parallel-load branch from 9e9e65c to e367529 Compare June 25, 2023 16:02

zaneselvans changed the title ~~Parallel load EIA{891, 923}~~ Parallel load EIA{861, 923} Jul 21, 2023

zaneselvans mentioned this pull request Jul 21, 2023

Parallelize extraction of Excel spreadsheets #2385

Closed

bendnorman added the requires-debug label Jul 24, 2023

Merge branch 'dev' into eia861-parallel-load

888307b

e-belfer mentioned this pull request Oct 16, 2023

Parallelize extraction of Excel spreadsheets #2943

Merged

8 tasks

e-belfer closed this Oct 16, 2023

dstansby deleted the eia861-parallel-load branch October 16, 2023 20:12
