Untangle eia_transform / harvesting / bga multi-asset #2450

Merged
merged 26 commits into from
Mar 30, 2023
@zaneselvans zaneselvans commented Mar 24, 2023

PR Overview

Owing to our early pattern of passing around huge dictionaries of all the dataframes and mutating them inside functions, we had kind of a hairball of implied dependencies within the eia_transform asset, which took care of harvesting and generating the boiler-generator association table. This PR breaks that asset into several pieces which can run in parallel, and reduces the inter-asset dependencies significantly.
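To illustrate the shape of this change (this is not the actual PUDL code), here is a minimal sketch in plain Python. The step names and dependency edges are hypothetical; the real assets are defined with Dagster in src/pudl/transform/eia.py. The point is only that once each step declares its inputs explicitly, a scheduler can see which steps are independent of each other and could run side by side.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each key depends on the names in its set.
# These names are illustrative only, not the real PUDL asset names.
DEPENDENCIES = {
    "harvested_plants": {"clean_dfs"},
    "harvested_utilities": {"clean_dfs"},
    "harvested_generators": {"clean_dfs"},
    "harvested_boilers": {"clean_dfs"},
    "boiler_generator_assn": {"clean_dfs"},
}


def parallel_batches(graph: dict[str, set[str]]) -> list[list[str]]:
    """Group steps into batches whose members do not depend on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        # Everything returned by get_ready() has all of its inputs available.
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = parallel_batches(DEPENDENCIES)
```

With one monolithic step that mutates a shared dictionary, none of this structure is visible to the orchestrator; with explicit inputs, every step in the second batch is a candidate for parallel execution.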

PR Checklist

  1. Merge the most recent version of the branch you are merging into (probably dev).
  2. All CI checks are passing. Run tests locally to debug failures.
  3. For major data coverage & analysis changes, run data validation tests.
  4. Include unit tests for new functions and classes.
  5. Defensive data quality/sanity checks in analyses & data processing functions.
  6. Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.
  7. Move occurrence_consistency into pudl.helpers and try to auto-generate the list of clean assets. There were other circular imports so this didn't work.
  8. Remove superfluous error checking from BGA given that we now do all that work in the EiaSettings class.
  9. Return an empty dictionary for col_dfs if not debug.
  10. Do _restrict_years() in a dictionary comprehension at the beginning of the BGA process.
  11. Switch to using clean_*_dfs rather than the legacy name eia_transformed_dfs for multi-DF input assets.
  12. Use an Enum to define valid entity types.
  13. Make sure io_manager_key is being passed in to all asset factories as a parameter.
  14. Move the notebook function for harvesting debug into pudl.transform.eia.
  15. Add running of devtools notebooks to the integration tests.
  16. Update the release notes and reference the PR and related issues.
  17. Make sure you've included good docstrings.

@zaneselvans zaneselvans added eia923 Anything having to do with EIA Form 923 eia860 Anything having to do with EIA Form 860 harvest Normalization of poorly normalized inputs and reconciliation of internal inconsistencies dagster Issues related to our use of the Dagster orchestrator labels Mar 24, 2023
@zaneselvans zaneselvans added this to the 2023Q1 milestone Mar 24, 2023
@zaneselvans zaneselvans linked an issue Mar 24, 2023 that may be closed by this pull request
@zaneselvans zaneselvans self-assigned this Mar 24, 2023
@zaneselvans zaneselvans left a comment

Left some pointers / questions for both of you @bendnorman @cmgosnell

I still need to fix the ownership_eia860 issue, and I'd also like to:

  • Get rid of the BGA wrapper
  • Get rid of "keep_cols" and just never drop any columns from the dataframes being harvested (but if this breaks anything I will give up immediately and walk away)

codecov bot commented Mar 24, 2023

Codecov Report

Patch coverage: 98.6% and no project coverage change.

Comparison is base (3d12720) 86.7% compared to head (c4eae78) 86.7%.

Additional details and impacted files
@@          Coverage Diff          @@
##             dev   #2450   +/-   ##
=====================================
  Coverage   86.7%   86.7%           
=====================================
  Files         81      81           
  Lines       9447    9453    +6     
=====================================
+ Hits        8192    8203   +11     
+ Misses      1255    1250    -5     
Impacted Files                        Coverage Δ
src/pudl/etl/check_foreign_keys.py    68.1% <ø> (ø)
src/pudl/settings.py                  98.7% <ø> (ø)
src/pudl/transform/eia.py             97.2% <98.6%> (+1.8%) ⬆️
src/pudl/transform/eia923.py          85.1% <100.0%> (ø)


zaneselvans commented Mar 24, 2023

I ran the integration tests on the full DB and got some errors:

pytest --live-dbs --etl-settings src/pudl/package_data/settings/etl_full.yml test/integration
test/integration/glue_test.py::test_for_fk_validation_and_unmapped_ids[missing_utility_id_pudl_in_utilities_ferc1] ERROR
test/integration/glue_test.py::test_for_fk_validation_and_unmapped_ids[missing_utility_id_ferc1_in_utilities_ferc1_dbf] ERROR
test/integration/glue_test.py::test_for_fk_validation_and_unmapped_ids[missing_utility_id_ferc1_in_utilities_ferc1_xbrl] ERROR
test/integration/glue_test.py::test_for_fk_validation_and_unmapped_ids[missing_utility_id_ferc1_in_plants_ferc1] ERROR
test/integration/glue_test.py::test_for_fk_validation_and_unmapped_ids[missing_utility_id_ferc1_xbrl_in_raw_xbrl] ERROR
test/integration/glue_test.py::test_for_fk_validation_and_unmapped_ids[missing_utility_id_ferc1_dbf_in_raw_dbf] ERROR
test/integration/glue_test.py::test_for_fk_validation_and_unmapped_ids[missing_plant_id_pudl_in_plants_ferc1] ERROR
test/integration/glue_test.py::test_for_fk_validation_and_unmapped_ids[missing_plants_in_plants_ferc1] ERROR
test/integration/glue_test.py::test_for_fk_validation_and_unmapped_ids[missing_plants_in_plants_eia] ERROR
test/integration/glue_test.py::test_for_unmapped_ids_minus_one[check_for_unmmapped_plants_in_plants_ferc1] ERROR
test/integration/glue_test.py::test_for_unmapped_ids_minus_one[validate_utility_id_ferc1_in_utilities_ferc1_xbrl] ERROR
test/integration/glue_test.py::test_unmapped_utils_eia ERROR

Edit: these errors were cropping up because I hadn't yet merged in the settings fixes from #2424. All good now!

Base automatically changed from dagster-eia861 to dev March 27, 2023 19:14
@bendnorman bendnorman left a comment

Does the devtools/harvesting_debug.ipynb notebook need to be updated? Other than that, this looks ready.

@zaneselvans

Ah yeah you're right the harvesting notebook probably needs to be updated.

Should we think about adding some/all of the devtools notebooks to be run automatically as part of the CI?

@zaneselvans

Updated harvesting debug notebook added.

@cmgosnell cmgosnell left a comment

This looks really cool! I have a few suggestions but mostly questions to deepen my own understanding. Nothing blocking so I'm approving but I think a few things that you honed in on (re enumerating all of the assets) would make this more durable.

* Got rid of some unnecessary settings error checking which is now handled by the
  EiaSettings class upstream.
* Incorporated `_restrict_years()` into the loop at the beginning of
  `boiler_generator_assn_eia860()` rather than doing it piecemeal later on.
* Changed `_restrict_years()` to depend directly on an `EiaSettings` object rather than
  lists of years.
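The year-restriction pattern described in this commit message can be sketched roughly as follows. This is a stand-in, not the PUDL implementation: `YearSettings` is a hypothetical substitute for the settings object, and plain lists of dicts substitute for the dataframes so the example runs with only the standard library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class YearSettings:
    """Hypothetical stand-in for a settings object listing the years to process."""

    years: tuple[int, ...]


def restrict_years(records: list[dict], settings: YearSettings) -> list[dict]:
    """Keep only records whose report_year is listed in the settings.

    Records without a report_year field are passed through unchanged.
    """
    return [
        rec
        for rec in records
        if "report_year" not in rec or rec["report_year"] in settings.years
    ]


settings = YearSettings(years=(2020, 2021))
tables = {
    "generators": [
        {"generator_id": "1", "report_year": 2019},
        {"generator_id": "1", "report_year": 2020},
    ],
    "boilers": [{"boiler_id": "A", "report_year": 2021}],
}

# Apply the restriction once, up front, for every input table.
restricted = {name: restrict_years(recs, settings) for name, recs in tables.items()}
```

Doing the filtering in a single comprehension at the start means every later step sees consistently restricted inputs, rather than each step having to remember to filter its own table.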
with nb_path.open() as f:
    nb = nbformat.read(f, as_version=4)
ep = ExecutePreprocessor(timeout=600, kernel_name="python3")
_ = ep.preprocess(nb, resources={"Application": {"log_level": 5}})

I cannot for the life of me figure out how to turn off the DEBUG level output coming from this test. The Application log_level thing is supposed to do it I think, but apparently I am wrong.

Comment on lines +12 to +15
"devtools/inspect-assets.ipynb",
"devtools/debug-eia-etl.ipynb",
"devtools/debug-ferc1-etl.ipynb",
"devtools/debug-harvesting.ipynb",

These all run pretty quickly now that they're relying on assets generated by Dagster for the most part, and they're prone to getting out of sync w/ the codebase, so I thought it made sense to add them into the tests finally. And also I just wanted to understand how to do that, since @katie-lamb and I are going to try and run some notebooks in CI for the record-linkage project.

@bendnorman bendnorman left a comment

Thanks for adding the notebook integration tests! I think once the harvesting function logic is deduplicated it can be merged in.

Successfully merging this pull request may close these issues.

Simplify dependencies in entity resolution / harvesting
3 participants