-
-
Notifications
You must be signed in to change notification settings - Fork 106
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Untangle eia_transform / harvesting / bga multi-asset #2450
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some pointers / questions for both of you @bendnorman @cmgosnell
I still need to fix the ownership_eia860
issue, and I'd also like to:
- Get rid of the BGA wrapper
- Get rid of "keep_cols" and just never drop any columns from the dataframes being harvested (but if this breaks anything I will give up immediately and walk away)
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## dev #2450 +/- ##
=====================================
Coverage 86.7% 86.7%
=====================================
Files 81 81
Lines 9447 9453 +6
=====================================
+ Hits 8192 8203 +11
+ Misses 1255 1250 -5
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report in Codecov by Sentry. |
I ran the integration tests on the full DB and got some errors: pytest --live-dbs --etl-settings src/pudl/package_data/settings/etl_full.yml test/integration
Edit: these errors were cropping up because I hadn't yet merged in the settings fixes from #2424. All good now! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the devtools/harvesting_debug.ipynb
notebooks need to be updated? Other than that this looks ready.
Ah yeah you're right the harvesting notebook probably needs to be updated. Should we think about adding some/all of the devtools notebooks to be run automatically as part of the CI? |
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
Updated harvesting debug notebook added. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks really cool! I have a few suggestions but mostly questions to deepen my own understanding. Nothing blocking so I'm approving but I think a few things that you honed in on (re enumerating all of the assets) would make this more durable.
* Got rid of some unnecessary settings error checking which is now handled by the EiaSettings class upstream. * Incorporated `_restrict_years()` into the loop at the beginning of `boiler_generator_assn_eia860()` rather than doing it piecemeal later on. * Changed `_restrict_years()` to depend directly on an `EiaSettings` object rather than lists of years.
with nb_path.open() as f: | ||
nb = nbformat.read(f, as_version=4) | ||
ep = ExecutePreprocessor(timeout=600, kernel_name="python3") | ||
_ = ep.preprocess(nb, resources={"Application": {"log_level": 5}}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cannot for the life of me figure out how to turn off the DEBUG
level output coming from this test. The Application log_level thing is supposed to do it I think, but apparently I am wrong.
"devtools/inspect-assets.ipynb", | ||
"devtools/debug-eia-etl.ipynb", | ||
"devtools/debug-ferc1-etl.ipynb", | ||
"devtools/debug-harvesting.ipynb", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These all run pretty quickly now that they're relying on assets generated by Dagster for the most part, and they're prone to getting out of sync w/ the codebase, so I thought it made sense to add them into the tests finally. And also I just wanted to understand how to do that, since @katie-lamb and I are going to try and run some notebooks in CI for the record-linkage project.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding the notebook integration tests! I think once the harvesting function logic is deduplicated it can be merged in.
PR Overview
Owing to our early pattern of passing around huge dictionaries of all the dataframes and mutating them inside functions, we had kind of a hairball of implied dependencies within the
eia_transform
asset, which took care of harvesting and generating the boiler-generator association table. This PR breaks that asset into several pieces which can run in parallel, and reduces the inter-asset dependencies significantly.PR Checklist
dev
).MoveThere were other circular imports so this didn't work.occurrence_consistency
intopudl.helpers
and try to auto-generate the list of clean assets.EiaSettings
class.col_dfs
if not debug_restrict_years()
in dictionary comprehension at beginning of BGA processclean_*_dfs
rather thaneia_transformed_dfs
legacy name for multi-DF input assets.Enum
to define validentity
types.io_manager_key
is being passed in to all asset factories as a parameter.pudl.transform.eia
devtools
notebooks into the integration tests.