Parallel load EIA{861, 923} #2673
Conversation
Codecov Report

```
@@           Coverage Diff           @@
##              dev    #2673   +/-  ##
=======================================
- Coverage    88.4%    88.4%   -0.1%
=======================================
  Files          87       87
  Lines       10139    10142      +3
=======================================
+ Hits         8971     8972      +1
- Misses       1168     1170      +2
```
This direction looks great so far! Might be nice to test the merging behavior a bit - I'm sure you've thought of that, though. Let me know if you want any help with testing Dagster machinery!
src/pudl/extract/excel.py (outdated):

```python
@op
def merge_yearly_dfs(
```
non-blocking: I suppose in theory these don't have to be _yearly_ - this is generic enough that it should work so long as everything we pass in has the same keys and the DFs that correspond have the same schema. That's pretty nice; we should maybe rename this though - maybe just `merge_dfs_by_page` or something?
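The rename suggestion fits the behavior described: the op only needs every input dict to share the same keys (pages) and the corresponding frames to share a schema. A minimal stdlib sketch of that shape (names here are illustrative, and plain lists stand in for the pandas DataFrames the real PUDL op handles):

```python
from collections import defaultdict


def merge_dfs_by_page(partition_dfs: list[dict[str, list]]) -> dict[str, list]:
    """Merge per-partition dicts keyed by page into one dict per page.

    Lists stand in for DataFrames here; the real op would call
    pd.concat on the per-page frames instead of extending lists.
    """
    merged: dict[str, list] = defaultdict(list)
    for dfs in partition_dfs:
        for page, df in dfs.items():
            merged[page].extend(df)  # pd.concat equivalent
    return dict(merged)


# Two "years" of data, both keyed by the same pages.
y2001 = {"generators": [1, 2], "plants": [10]}
y2002 = {"generators": [3], "plants": [20, 30]}
print(merge_dfs_by_page([y2001, y2002]))
# {'generators': [1, 2, 3], 'plants': [10, 20, 30]}
```

Nothing in this shape is year-specific, which is why a partition-agnostic name reads better.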
At some point we'll need to generalize these annual things to work with arbitrary data partitions, since some of the other spreadsheets are organized differently (the PHMSA data has weird irregular multi-year blocks, I don't know why), but we haven't done this yet and have been very wedded to yearly chunks for the EIA spreadsheets so far.
Great shout, I've renamed as suggested and expanded the docstring to explain what's going on here.
I've added some tests now. Hopefully they're self-explanatory, but if anything's confusing let me know and I can improve comments or docstrings.
Force-pushed from d1b8689 to 8e4de12 (Compare)
Thanks for adding tests, and I'm glad you're looking out for future failing tests as well :) Do you think it's worth breaking that unit-test/real-world dependency more by sending in test data to the op config?
```python
with build_op_context(
    resources={"dataset_settings": DatasetsSettings()}
) as context:
    # Assert actual years are a superset of expected. Instead of doing
```
Do you think we can get around this unit-test/real-world coupling by passing in expected years to `DatasetsSettings()` instead of using its default behavior?
Possibly, but I'm struggling to work out how to set years. The below didn't work, because the settings objects are immutable. I can't work out how to create them and pass years in though...
```diff
diff --git a/test/unit/extract/excel_test.py b/test/unit/extract/excel_test.py
index 767e1d04..1b2be0f7 100644
--- a/test/unit/extract/excel_test.py
+++ b/test/unit/extract/excel_test.py
@@ -123,16 +123,21 @@ class TestGenericExtractor(unittest.TestCase):
     @pytest.mark.parametrize(
         "dataset, expected_years",
         (
-            ("eia860", set(range(2001, 2022))),
-            ("eia861", set(range(2001, 2022))),
-            ("eia923", set(range(2001, 2022))),
+            ("eia860", [2001, 2002]),
+            ("eia861", [2003, 2004, 2005]),
+            ("eia923", [2022]),
         ),
     )
     def test_years_from_settings(dataset, expected_years):
         years_from_settings = excel.years_from_settings_factory(dataset)
+        settings = DatasetsSettings()
+        settings.eia.eia860.years = [2001, 2002]
+        settings.eia.eia861.years = [2003, 2004, 2005]
+        settings.eia.eia923.years = [2022]
+
         with build_op_context(
-            resources={"dataset_settings": DatasetsSettings()}
+            resources={"dataset_settings": settings}
         ) as context:
             # Assert actual years are a superset of expected. Instead of doing
             # an equality check, this avoids having to update expected years
```
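The attribute assignments in the diff above fail because the settings objects are immutable models: values have to be supplied when the object is constructed (or via a copy-with-updates), not mutated afterwards. A stdlib sketch of the same pattern using frozen dataclasses (the real PUDL settings are pydantic models, and these class names are stand-ins, not PUDL's actual API):

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Eia860Settings:  # hypothetical stand-in for the pydantic model
    years: tuple[int, ...] = (2001, 2002)


@dataclass(frozen=True)
class DatasetsSettings:  # hypothetical stand-in
    eia860: Eia860Settings = field(default_factory=Eia860Settings)


settings = DatasetsSettings()
try:
    settings.eia860 = Eia860Settings(years=(2022,))  # what the diff tried
except Exception as err:
    print(type(err).__name__)  # FrozenInstanceError

# Build the override at construction time instead of mutating:
custom = replace(settings, eia860=Eia860Settings(years=(2022,)))
print(custom.eia860.years)  # (2022,)
```

Pydantic models offer the analogous moves: pass the nested values as constructor kwargs, or use the model's copy-with-update mechanism to derive a new instance.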
Oh yeah, this datasource metadata thing is pretty weird and complicated - lots of stuff that reads from disk by default. I think we can keep your workaround for now 😰
Note to myself to look into: the validation tests failed with

```
FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-bga_eia860-130326-130326-130326] - ValueError: bga_eia860: found 127418 rows, expected 130326. Off by -2.231%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-bga_eia860-130326-130326-130326] - ValueError: bga_eia860: found 127418 rows, expected 130326. Off by -2.231%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-gen_eia923-None-5171497-432570] - ValueError: gen_eia923: found 432561 rows, expected 432570. Off by -0.002%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-hr_by_unit-362381-30340] - ValueError: hr_by_unit: found 29697 rows, expected 30340. Off by -2.119%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-hr_by_gen-555119-46408] - ValueError: hr_by_gen: found 45332 rows, expected 46408. Off by -2.319%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-fuel_cost-555119-46408] - ValueError: fuel_cost: found 45332 rows, expected 46408. Off by -2.319%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-capacity_factor-5171497-432570] - ValueError: capacity_factor: found 432561 rows, expected 432570. Off by -0.002%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-mcoe-5171881-432602] - ValueError: mcoe: found 432591 rows, expected 432602. Off by -0.003%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-bga_eia860-130326-130326-130326] - ValueError: bga_eia860: found 127418 rows, expected 130326. Off by -2.231%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-gen_eia923-None-5171497-432570] - ValueError: gen_eia923: found 5171263 rows, expected 5171497. Off by -0.005%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-hr_by_unit-362381-30340] - ValueError: hr_by_unit: found 354711 rows, expected 362381. Off by -2.117%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-hr_by_gen-555119-46408] - ValueError: hr_by_gen: found 542243 rows, expected 555119. Off by -2.320%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-fuel_cost-555119-46408] - ValueError: fuel_cost: found 542243 rows, expected 555119. Off by -2.320%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-capacity_factor-5171497-432570] - ValueError: capacity_factor: found 5171263 rows, expected 5171497. Off by -0.005%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-mcoe-5171881-432602] - ValueError: mcoe: found 5171622 rows, expected 5171881. Off by -0.005%, allowed margin of 0.000%
```
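Every failure above is the same kind of check: a found row count compared against an expected count with a zero allowed margin. A rough sketch of that style of check (the name and signature are illustrative, not PUDL's actual validation API):

```python
def check_minmax_rows(found: int, expected: int, margin_pct: float = 0.0) -> None:
    """Raise if found differs from expected by more than margin_pct percent."""
    off_pct = (found - expected) / expected * 100
    if abs(off_pct) > margin_pct:
        raise ValueError(
            f"found {found} rows, expected {expected}. "
            f"Off by {off_pct:.3f}%, allowed margin of {margin_pct:.3f}%"
        )


check_minmax_rows(432_570, 432_570)  # exact match passes
try:
    check_minmax_rows(127_418, 130_326)  # the bga_eia860 case from the log
except ValueError as err:
    print(err)  # Off by -2.231%, allowed margin of 0.000%
```

A 0% margin makes the check an exact-count assertion, which is why small ordering or merge changes in the extract step surface here as failures.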
I'll try it out. Is there a nightly or similar build of
You can download a recent
Force-pushed from 9e9e65c to e367529 (Compare)
@dstansby - did you ever get a chance to compare the old and new outputs?
Thanks - sorry I lost track of this; if you or someone else wants to finish it off (or close it), feel free!
Closing this and picking it up in #2943 |
PR Overview
This builds on top of #2644 to add parallel data loading to the EIA 861 and EIA 923 datasets. All of the logic introduced in #2644 has been factored out to excel.py so it can be re-used across each of the datasets. I'll take this out of draft when I've finished the checklist below, but I think the implementation is done, so I'm opening it for early comments and a CI run.
This is more work towards #2385, but doesn't quite fix it because the EIA 860m data isn't being parallel loaded yet. That's done a bit differently to the other datasets in this PR at the moment, so I think it's best to tackle that in a follow-up PR.
PR Checklist