Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Standardize asset group and asset module names #2411

Merged
merged 5 commits into from
Mar 18, 2023

Conversation

zaneselvans
Copy link
Member

@zaneselvans zaneselvans commented Mar 17, 2023

PR Overview

  • Made sure that modules and asset group names use our existing data source short codes wherever possible (eia860, eia923, ferc1, eia_bulk_elec, epacems, etc.)
  • Split "output_assets" into denormalized_assets and analysis_assets. These are still placeholders, but I think we'll want to differentiate between those categories since what's inside them is pretty different (simple combinations of other tables vs. complex, novel derived values).
  • Removed the _assets suffix from all the asset groups. I think it will be clear from the context that these identifiers are being used in that they pertain to asset groups. I could be convinced otherwise though.
  • Split EIA assets into "raw" "clean" and "norm" (normalized) groups. Split the pre-harvesting EIA groups by original data source (eia860 vs. eia923)
  • I think there's a bunch of unnecessary tangling of dependencies inside the eia_transform function that we should untangle, but not right now.
  • I tried and failed (again) to find a way to grab all the asset keys for assets by asset group (without running into circular import issues) and so ended up enumerating the individual EIA-860 and EIA-923 input assets for the eia_transform function. This let me split the raw & clean EIA assets out into their own groups and seemed more readable. It also removed some top-level definitions that would have been globally accessible, and moved all of the asset group definitions into the pudl.etl module.
  • I noticed that the censusdp1tract database doesn't seem to show up as a resource or an asset anywhere. Should we add that to keep things uniform? It only has 3-4 tables, so maybe each of them can be its own asset? Or should it be analogous to the FERC 1 DBs?

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

* Made sure that modules and asset group names use our existing data source short codes
  wherever possible (eia860, eia923, ferc1, eia_bulk_elec, epacems, etc.)
* Split "output_assets" into denormalized_assets and analysis_assets.  These are still
  placeholders, but I think we'll want to differentiate between those categories since
  what's inside them is pretty different (simple combinations of other tables vs.
  complex, novel derived values).
* Removed the _assets suffix from all the asset groups. I think it will be clear from
  the context that these identifiers are being used in that they pertain to asset
  groups. I could be convinced otherwise though.
* Split EIA assets into "raw" "clean" and "norm" (normalized) groups. Split the
  pre-harvesting EIA groups by original data source (eia860 vs. eia923)
* I think there's a bunch of unnecessary tangling of dependencies inside the
  eia_transform function that we should untangle, but not right now.
* I tried and failed (again) to find a way to grab all the asset keys for assets by
  asset group (without running into circular import issues) and so ended up enumerating
  the individual EIA-860 and EIA-923 input assets for the eia_transform function. This
  let me split the raw & clean EIA assets out into their own groups and seemed more
  readable. It also removed some top-level definitions that would have been globally
  accessible, and moved all of the asset group definitions into the pudl.etl module.
* I noticed that the censusdp1tract database doesn't seem to show up as a resource or an
  asset anywhere. Should we add that to keep things uniform? It only has 3-4 tables, so
  maybe each of them can be its own asset? Or should it be analogous to the FERC 1 DBs?
@zaneselvans zaneselvans added the dagster Issues related to our use of the Dagster orchestrator label Mar 17, 2023
@zaneselvans zaneselvans added this to the 2023Q1 milestone Mar 17, 2023
@zaneselvans zaneselvans self-assigned this Mar 17, 2023
@zaneselvans zaneselvans marked this pull request as ready for review March 17, 2023 04:54
@codecov
Copy link

codecov bot commented Mar 17, 2023

Codecov Report

Patch coverage: 87.5% and no project coverage change.

Comparison is base (a361342) 86.1% compared to head (4fc99af) 86.1%.

Additional details and impacted files
@@                Coverage Diff                @@
##           dagster-asset-etl   #2411   +/-   ##
=================================================
  Coverage               86.1%   86.1%           
=================================================
  Files                     80      81    +1     
  Lines                   9551    9552    +1     
=================================================
+ Hits                    8232    8233    +1     
  Misses                  1319    1319           
Impacted Files Coverage Δ
src/pudl/etl/denormalized_assets.py 71.4% <ø> (ø)
src/pudl/etl/eia_bulk_elec_assets.py 100.0% <ø> (ø)
src/pudl/helpers.py 86.1% <ø> (ø)
src/pudl/etl/analysis_assets.py 85.7% <85.7%> (ø)
src/pudl/transform/eia.py 95.4% <100.0%> (-0.1%) ⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

Copy link
Member

@jdangerx jdangerx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is easier to understand for a policy/data neophyte like me :) And the ETL appears to still run. LGTM!

@@ -0,0 +1,33 @@
"""Derived / analysis assets that aren't simple to construct.

This is really too large & generic of a category. Should we have an asset group for each
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think, once we have more assets to actually group!

Comment on lines +3 to +6
Right now these will mostly be tables defined in `pudl.output`. Ultimately it would
also include any tables built by joining and aggregating normalized and analysis
tables. These are probably data warehouse style tables with a lot of duplicated
information, ready for analysts to use.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now we have pudl.etl.output_assets and pudl.etl.analysis_assets. I was thinking these modules would be place holders until we start converting output and analysis tables. Might be easier to just wrap the functions in pudl.output instead of moving things into these place holder modules. (Not a requested change or blocker for this PR just a thought.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just wanted to conceptually separate "outputs" (which we get fuzzy with) into the denormalized vs. analysis categories from the beginning. Agree that for now we're probably going to just define these assets over wherever the existing functions are, and we can give them explicit asset groups over there.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the transition to a more data-warehouse style architecture, I think that they day-to-day use is mostly going to be using the denormalized tables (as we do now, via PudlTabl) so maybe those are the tables that end up not having a funny prefix?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me! I like the separation of analysis and output for now.

You're suggesting adding a funny prefix to the normalized tables to differentiate them from the denormalized tables because most folks will be interacting with the denormed ones?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm imagining something like raw_generation_eia923, clean_generation_eia923, norm_generation_eia923 (or other meaningful indicators of what step you're at) and then just have generation_eia923 for the easy-to-use denormalized table.

Comment on lines +1154 to +1168
asset_id: AssetIn()
for asset_id in [
"clean_boiler_fuel_eia923",
"clean_boiler_generator_assn_eia860",
"clean_boilers_eia860",
"clean_coalmine_eia923",
"clean_fuel_receipts_costs_eia923",
"clean_generation_eia923",
"clean_generation_fuel_eia923",
"clean_generation_fuel_nuclear_eia923",
"clean_generators_eia860",
"clean_ownership_eia860",
"clean_plants_eia860",
"clean_utilities_eia860",
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you decide to list the tables out instead of dynamically loading them using:

eia_assets = load_assets_from_modules([eia860, eia923])

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Loading the assets from those modules in this module, and then loading the assets defined in this module at the top level meant that all of the eia_transform, eia860, and eia923 modules had to part of the same asset group, which obscured both the clean vs. raw distinction in the eia860 and eia923 data sources (raw existed, but clean never did) and also combined the eia860 and eia923 asset groups.

I tried to figure out a way to construct the list of assets in this module based on the asset group definitions in pudl.etl but couldn't get it to work due to circular imports.

We can simplify the DAG and make the true dependencies clearer within the eia_transform group by using the clean_eia860 and clean_eia923 groups as inputs to eia_transform but restricting its outputs to the entity & annual tables. I think the non-entity table assets in norm_eia923 and norm_eia860 should only depend only on their clean_* siblings upstream (since all that happens to them in the harvesting step is columns that don't belong in them in the database get dropped, after their useful information has been used to compile the entity & annual tables).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh I think I ran into this issue in the devtool notebook PR. My workaround was to call load_assets_from_modules in the multi_asset decorator:

@multi_asset(
    ins={
        asset_key.to_python_identifier(): AssetIn()
        for eia_asset in load_assets_from_modules([eia860, eia923])
        for asset_key in eia_asset.keys
    },
   ...
)

This way the eia860 and eia923 clean assets can be in their own group and you don't have to enumerate all the table names in the decorator.

Good point about the nonentity tables only depending on the upstream clean tables. I've been meaning to create an issue for disentangling the eia_transform assets to reflect the real dependencies and maybe get some more parallelism going.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you saying that if I load_assets_from_modules() in the decorator (i.e NOT at the top level of the module), then they don't end up having their asset group defined by the top level load_assets_from_modules(["eia_transform"]) and could be loaded independently with their own asset groups in pudl.etl?

I created an issue for the EIA Transform cleanup #2387 that's part of a more general pudl.etl DAG cleanup #2386.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is correct. The assets returned by load_assets_from_modules in the decorator aren't included in the module's attributes.

Oh yay thank you for creating those issues!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately this more elegant and dynamic solution for defining the upstream dependencies results in a (totally unrelated) circular import error, related to the use of occurrence_consistency() in pudl.output.eia860` so we need to do some untangling. 🤦🏼 I'm going to punt for the moment and use the hard-coded list of upstream assets.

AttributeError: partially initialized module 'pudl' has no attribute 'transform' (most likely due to a circular import)
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.10/site-packages/dagster/_grpc/server.py", line 267, in __init__
    self._loaded_repositories: Optional[LoadedRepositories] = LoadedRepositories(
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.10/site-packages/dagster/_grpc/server.py", line 116, in __init__
    loadable_targets = get_loadable_targets(
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.10/site-packages/dagster/_grpc/utils.py", line 47, in get_loadable_targets
    else loadable_targets_from_python_module(module_name, working_directory)
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.10/site-packages/dagster/_core/workspace/autodiscovery.py", line 36, in loadable_targets_from_python_module
    module = load_python_module(
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.10/site-packages/dagster/_core/code_pointer.py", line 138, in load_python_module
    return importlib.import_module(module_name)
  File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 992, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/Users/zane/code/catalyst/pudl/src/pudl/__init__.py", line 5, in <module>
    from . import (  # noqa: F401
  File "/Users/zane/code/catalyst/pudl/src/pudl/analysis/__init__.py", line 8, in <module>
    from . import (  # noqa: F401
  File "/Users/zane/code/catalyst/pudl/src/pudl/analysis/state_demand.py", line 38, in <module>
    import pudl.output.pudltabl
  File "/Users/zane/code/catalyst/pudl/src/pudl/output/__init__.py", line 13, in <module>
    from . import (  # noqa: F401
  File "/Users/zane/code/catalyst/pudl/src/pudl/output/eia860.py", line 8, in <module>
    from pudl.transform.eia import occurrence_consistency
  File "/Users/zane/code/catalyst/pudl/src/pudl/transform/__init__.py", line 62, in <module>
    from . import (  # noqa: F401
  File "/Users/zane/code/catalyst/pudl/src/pudl/transform/eia.py", line 1163, in <module>
    [pudl.transform.eia860, pudl.transform.eia923]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah bummer :/ the hard coded list is fine for now.

@bendnorman
Copy link
Member

Thanks for standardizing these asset group names! Left a couple of comments.

@zaneselvans zaneselvans merged commit c97e7f3 into dagster-asset-etl Mar 18, 2023
@zaneselvans zaneselvans deleted the rename-asset-groups branch March 18, 2023 01:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dagster Issues related to our use of the Dagster orchestrator
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

3 participants