Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallel load EIA{861, 923} #2673

Closed

Conversation

dstansby
Copy link
Contributor

@dstansby dstansby commented Jun 15, 2023

PR Overview

This builds on top of #2644 to add parallel data loading to EIA 861 and EIA 923 datasets. All of the logic introducted in #2644 has been factored out to excel.py so it can be re-used across each of the datasets.

I'll take this out of draft when I've finished the checklist below, but I think implementation is done so opening for early comments and a CI run.

This is more work towards #2385, but doesn't quite fix it because the EIA 860m data isn't being parallel loaded yet. That's done a bit differently to other datasets in this PR at the moment, so I think best to tackle that in a follow up PR.

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@codecov
Copy link

codecov bot commented Jun 15, 2023

Codecov Report

Patch coverage: 100.0% and project coverage change: -0.1 ⚠️

Comparison is base (c1fdbae) 88.4% compared to head (9e9e65c) 88.4%.

❗ Current head 9e9e65c differs from pull request most recent head e367529. Consider uploading reports for the commit e367529 to get more accurate results

Additional details and impacted files
@@           Coverage Diff           @@
##             dev   #2673     +/-   ##
=======================================
- Coverage   88.4%   88.4%   -0.1%     
=======================================
  Files         87      87             
  Lines      10139   10142      +3     
=======================================
+ Hits        8971    8972      +1     
- Misses      1168    1170      +2     
Impacted Files Coverage Δ
src/pudl/extract/eia860.py 100.0% <100.0%> (ø)
src/pudl/extract/eia861.py 94.4% <100.0%> (-0.3%) ⬇️
src/pudl/extract/eia923.py 100.0% <100.0%> (ø)
src/pudl/extract/excel.py 95.6% <100.0%> (+0.9%) ⬆️

... and 2 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@zaneselvans zaneselvans added eia923 Anything having to do with EIA Form 923 eia861 Anything having to do with EIA Form 861 excel Issues involving data in Microsoft Excel spreadsheets dagster Issues related to our use of the Dagster orchestrator labels Jun 16, 2023
@zaneselvans zaneselvans linked an issue Jun 16, 2023 that may be closed by this pull request
Copy link
Member

@jdangerx jdangerx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This direction looks great so far! Might be nice to test the merging behavior a bit - I'm sure you've thought of that, though. Let me know if you want any help with testing Dagster machinery!



@op
def merge_yearly_dfs(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking: I suppose in theory these don't have to be _yearly_ - this is generic enough that it should work so long as everything we pass in has the same keys and the DFs that correspond have the same schema. That's pretty nice, we should maybe rename this though - maybe just merge_dfs_by_page or something?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point we'll need to generalize these annual things to work with arbitrary data partitions since some of the other spreadsheets will have other ways of being organized (the PHMSA data has weird irregular multi-year blocks I don't know why) but we haven't done this yet and have been very wedded to yearly chunks for the EIA spreadsheets so far.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great shout, I've renamed as suggested and expanded the docstring to explain what's going on here.

src/pudl/extract/excel.py Outdated Show resolved Hide resolved
@dstansby
Copy link
Contributor Author

I've added some tests now. Hopefully they're self-explanatory, but if anything's confusing let me know and I can improve comments or docstrings.

Copy link
Member

@jdangerx jdangerx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding tests, and I'm glad you're looking out for future failing tests as well :) Do you think it's worth breaking that unit-test/real-world dependency more by sending in test data to the op config?

with build_op_context(
resources={"dataset_settings": DatasetsSettings()}
) as context:
# Assert actual years are a superset of expected. Instead of doing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we can get around this unit-test/real-world coupling by passing in expected years to DatasetsSettings() instead of using its default behavior?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly, but I'm struggling to work out how to set years. The below didn't work, because the settings objects are immutable. I can't work out how to create them and pass years in though...

diff --git a/test/unit/extract/excel_test.py b/test/unit/extract/excel_test.py
index 767e1d04..1b2be0f7 100644
--- a/test/unit/extract/excel_test.py
+++ b/test/unit/extract/excel_test.py
@@ -123,16 +123,21 @@ class TestGenericExtractor(unittest.TestCase):
 @pytest.mark.parametrize(
     "dataset, expected_years",
     (
-        ("eia860", set(range(2001, 2022))),
-        ("eia861", set(range(2001, 2022))),
-        ("eia923", set(range(2001, 2022))),
+        ("eia860", [2001, 2002]),
+        ("eia861", [2003, 2004, 2005]),
+        ("eia923", [2022]),
     ),
 )
 def test_years_from_settings(dataset, expected_years):
     years_from_settings = excel.years_from_settings_factory(dataset)
 
+    settings = DatasetsSettings()
+    settings.eia.eia860.years = [2001, 2002]
+    settings.eia.eia861.years = [2003, 2004, 2005]
+    settings.eia.eia923.years = [2022]
+
     with build_op_context(
-        resources={"dataset_settings": DatasetsSettings()}
+        resources={"dataset_settings": settings}
     ) as context:
         # Assert actual years are a superset of expected. Instead of doing
         # an equality check, this avoids having to update expected years

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, this datasource metadata thing is pretty weird and complicated - lots of stuff that reads from disk by default. I think we can keep your workaround for now 😰

@dstansby
Copy link
Contributor Author

Note to myself to look into, the validation tests failed with

FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-bga_eia860-130326-130326-130326] - ValueError: bga_eia860: found 127418 rows, expected 130326. Off by -2.231%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-bga_eia860-130326-130326-130326] - ValueError: bga_eia860: found 127418 rows, expected 130326. Off by -2.231%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-gen_eia923-None-5171497-432570] - ValueError: gen_eia923: found 432561 rows, expected 432570. Off by -0.002%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-hr_by_unit-362381-30340] - ValueError: hr_by_unit: found 29697 rows, expected 30340. Off by -2.119%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-hr_by_gen-555119-46408] - ValueError: hr_by_gen: found 45332 rows, expected 46408. Off by -2.319%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-fuel_cost-555119-46408] - ValueError: fuel_cost: found 45332 rows, expected 46408. Off by -2.319%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-capacity_factor-5171497-432570] - ValueError: capacity_factor: found 432561 rows, expected 432570. Off by -0.002%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-mcoe-5171881-432602] - ValueError: mcoe: found 432591 rows, expected 432602. Off by -0.003%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-bga_eia860-130326-130326-130326] - ValueError: bga_eia860: found 127418 rows, expected 130326. Off by -2.231%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-gen_eia923-None-5171497-432570] - ValueError: gen_eia923: found 5171263 rows, expected 5171497. Off by -0.005%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-hr_by_unit-362381-30340] - ValueError: hr_by_unit: found 354711 rows, expected 362381. Off by -2.117%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-hr_by_gen-555119-46408] - ValueError: hr_by_gen: found 542243 rows, expected 555119. Off by -2.320%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-fuel_cost-555119-46408] - ValueError: fuel_cost: found 542243 rows, expected 555119. Off by -2.320%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-capacity_factor-5171497-432570] - ValueError: capacity_factor: found 5171263 rows, expected 5171497. Off by -0.005%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-mcoe-5171881-432602] - ValueError: mcoe: found 5171622 rows, expected 5171881. Off by -0.005%, allowed margin of 0.000%

@zaneselvans
Copy link
Member

If I had to guess I'd say the changes are probably related to the boiler generator association table (bga_eia860) but I don't know why parallel loading of the input spreadsheets would have resulted in that output. This would probably be another good application of @rousik's output differ #2627

@dstansby
Copy link
Contributor Author

I'll try it out. Is there a nightly or similar build of pudl.sqlite from the dev branch anywhere I can download? Would save me running the whole pipeline again on the dev branch to get something to compare against. (although maybe doing that would be quicker than downloading 5+GB...?)

@zaneselvans
Copy link
Member

You can download a recent pudl.sqlite here from the nightly build on dev.

@jdangerx
Copy link
Member

@dstansby - did you ever get a chance to compare the old and new outputs?

@zaneselvans zaneselvans changed the title Parallel load EIA{891, 923} Parallel load EIA{861, 923} Jul 21, 2023
@bendnorman bendnorman added the requires-debug Things that have been worked on but hit an issue that requires debugging. label Jul 24, 2023
@e-belfer
Copy link
Member

e-belfer commented Oct 12, 2023

Hi @dstansby! Just a heads up, I'm working on extraction of a new set of Excel spreadsheets over in #2848, and thought it'd be a good time to try and hammer out the validation issue while I'm at it.

@dstansby
Copy link
Contributor Author

Thanks - sorry I lost track of this, if you or someone else wants to finish it off (or close it) feel free!

@e-belfer
Copy link
Member

Closing this and picking it up in #2943

@e-belfer e-belfer closed this Oct 16, 2023
@dstansby dstansby deleted the eia861-parallel-load branch October 16, 2023 20:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dagster Issues related to our use of the Dagster orchestrator eia861 Anything having to do with EIA Form 861 eia923 Anything having to do with EIA Form 923 excel Issues involving data in Microsoft Excel spreadsheets requires-debug Things that have been worked on but hit an issue that requires debugging.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Parallelize extraction of Excel spreadsheets
5 participants