Standardize asset group and asset module names #2411

zaneselvans · 2023-03-17T04:51:24Z

PR Overview

Made sure that modules and asset group names use our existing data source short codes wherever possible (eia860, eia923, ferc1, eia_bulk_elec, epacems, etc.)
Split "output_assets" into denormalized_assets and analysis_assets. These are still placeholders, but I think we'll want to differentiate between those categories since what's inside them is pretty different (simple combinations of other tables vs. complex, novel derived values).
Removed the _assets suffix from all the asset groups. I think it will be clear from the context that these identifiers are being used in that they pertain to asset groups. I could be convinced otherwise though.
Split EIA assets into "raw" "clean" and "norm" (normalized) groups. Split the pre-harvesting EIA groups by original data source (eia860 vs. eia923)
I think there's a bunch of unnecessary tangling of dependencies inside the eia_transform function that we should untangle, but not right now.
I tried and failed (again) to find a way to grab all the asset keys for assets by asset group (without running into circular import issues) and so ended up enumerating the individual EIA-860 and EIA-923 input assets for the eia_transform function. This let me split the raw & clean EIA assets out into their own groups and seemed more readable. It also removed some top-level definitions that would have been globally accessible, and moved all of the asset group definitions into the pudl.etl module.
I noticed that the censusdp1tract database doesn't seem to show up as a resource or an asset anywhere. Should we add that to keep things uniform? It only has 3-4 tables, so maybe each of them can be its own asset? Or should it be analogous to the FERC 1 DBs?

PR Checklist

Merge the most recent version of the branch you are merging into (probably dev).
All CI checks are passing. Run tests locally to debug failures
Make sure you've included good docstrings.
For major data coverage & analysis changes, run data validation tests
Include unit tests for new functions and classes.
Defensive data quality/sanity checks in analyses & data processing functions.
Update the release notes and reference reference the PR and related issues.
Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

* Made sure that modules and asset group names use our existing data source short codes wherever possible (eia860, eia923, ferc1, eia_bulk_elec, epacems, etc.) * Split "output_assets" into denormalized_assets and analysis_assets. These are still placeholders, but I think we'll want to differentiate between those categories since what's inside them is pretty different (simple combinations of other tables vs. complex, novel derived values). * Removed the _assets suffix from all the asset groups. I think it will be clear from the context that these identifiers are being used in that they pertain to asset groups. I could be convinced otherwise though. * Split EIA assets into "raw" "clean" and "norm" (normalized) groups. Split the pre-harvesting EIA groups by original data source (eia860 vs. eia923) * I think there's a bunch of unnecessary tangling of dependencies inside the eia_transform function that we should untangle, but not right now. * I tried and failed (again) to find a way to grab all the asset keys for assets by asset group (without running into circular import issues) and so ended up enumerating the individual EIA-860 and EIA-923 input assets for the eia_transform function. This let me split the raw & clean EIA assets out into their own groups and seemed more readable. It also removed some top-level definitions that would have been globally accessible, and moved all of the asset group definitions into the pudl.etl module. * I noticed that the censusdp1tract database doesn't seem to show up as a resource or an asset anywhere. Should we add that to keep things uniform? It only has 3-4 tables, so maybe each of them can be its own asset? Or should it be analogous to the FERC 1 DBs?

codecov · 2023-03-17T06:27:47Z

Codecov Report

Patch coverage: 87.5% and no project coverage change.

Comparison is base (a361342) 86.1% compared to head (4fc99af) 86.1%.

Additional details and impacted files

@@                Coverage Diff                @@
##           dagster-asset-etl   #2411   +/-   ##
=================================================
  Coverage               86.1%   86.1%           
=================================================
  Files                     80      81    +1     
  Lines                   9551    9552    +1     
=================================================
+ Hits                    8232    8233    +1     
  Misses                  1319    1319

Impacted Files	Coverage Δ
src/pudl/etl/denormalized_assets.py	`71.4% <ø> (ø)`
src/pudl/etl/eia_bulk_elec_assets.py	`100.0% <ø> (ø)`
src/pudl/helpers.py	`86.1% <ø> (ø)`
src/pudl/etl/analysis_assets.py	`85.7% <85.7%> (ø)`
src/pudl/transform/eia.py	`95.4% <100.0%> (-0.1%)`	⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

jdangerx

This is easier to understand for a policy/data neophyte like me :) And the ETL appears to still run. LGTM!

jdangerx · 2023-03-17T19:31:50Z

src/pudl/etl/analysis_assets.py

@@ -0,0 +1,33 @@
+"""Derived / analysis assets that aren't simple to construct.
+
+This is really too large & generic of a category. Should we have an asset group for each


Yes, I think, once we have more assets to actually group!

bendnorman · 2023-03-17T19:52:46Z

src/pudl/etl/denormalized_assets.py

+Right now these will mostly be tables defined in `pudl.output`. Ultimately it would
+also include any tables built by joining and aggregating normalized and analysis
+tables. These are probably data warehouse style tables with a lot of duplicated
+information, ready for analysts to use.


Right now we have pudl.etl.output_assets and pudl.etl.analysis_assets. I was thinking these modules would be place holders until we start converting output and analysis tables. Might be easier to just wrap the functions in pudl.output instead of moving things into these place holder modules. (Not a requested change or blocker for this PR just a thought.)

I just wanted to conceptually separate "outputs" (which we get fuzzy with) into the denormalized vs. analysis categories from the beginning. Agree that for now we're probably going to just define these assets over wherever the existing functions are, and we can give them explicit asset groups over there.

After the transition to a more data-warehouse style architecture, I think that they day-to-day use is mostly going to be using the denormalized tables (as we do now, via PudlTabl) so maybe those are the tables that end up not having a funny prefix?

Sounds good to me! I like the separation of analysis and output for now.

You're suggesting adding a funny prefix to the normalized tables to differentiate them from the denormalized tables because most folks will be interacting with the denormed ones?

I'm imagining something like raw_generation_eia923, clean_generation_eia923, norm_generation_eia923 (or other meaningful indicators of what step you're at) and then just have generation_eia923 for the easy-to-use denormalized table.

bendnorman · 2023-03-17T19:54:21Z

src/pudl/transform/eia.py

+        asset_id: AssetIn()
+        for asset_id in [
+            "clean_boiler_fuel_eia923",
+            "clean_boiler_generator_assn_eia860",
+            "clean_boilers_eia860",
+            "clean_coalmine_eia923",
+            "clean_fuel_receipts_costs_eia923",
+            "clean_generation_eia923",
+            "clean_generation_fuel_eia923",
+            "clean_generation_fuel_nuclear_eia923",
+            "clean_generators_eia860",
+            "clean_ownership_eia860",
+            "clean_plants_eia860",
+            "clean_utilities_eia860",
+        ]


Why did you decide to list the tables out instead of dynamically loading them using:

eia_assets = load_assets_from_modules([eia860, eia923])

Loading the assets from those modules in this module, and then loading the assets defined in this module at the top level meant that all of the eia_transform, eia860, and eia923 modules had to part of the same asset group, which obscured both the clean vs. raw distinction in the eia860 and eia923 data sources (raw existed, but clean never did) and also combined the eia860 and eia923 asset groups.

I tried to figure out a way to construct the list of assets in this module based on the asset group definitions in pudl.etl but couldn't get it to work due to circular imports.

We can simplify the DAG and make the true dependencies clearer within the eia_transform group by using the clean_eia860 and clean_eia923 groups as inputs to eia_transform but restricting its outputs to the entity & annual tables. I think the non-entity table assets in norm_eia923 and norm_eia860 should only depend only on their clean_* siblings upstream (since all that happens to them in the harvesting step is columns that don't belong in them in the database get dropped, after their useful information has been used to compile the entity & annual tables).

Ohh I think I ran into this issue in the devtool notebook PR. My workaround was to call load_assets_from_modules in the multi_asset decorator:

@multi_asset( ins={ asset_key.to_python_identifier(): AssetIn() for eia_asset in load_assets_from_modules([eia860, eia923]) for asset_key in eia_asset.keys }, ... )

This way the eia860 and eia923 clean assets can be in their own group and you don't have to enumerate all the table names in the decorator.

Good point about the nonentity tables only depending on the upstream clean tables. I've been meaning to create an issue for disentangling the eia_transform assets to reflect the real dependencies and maybe get some more parallelism going.

Are you saying that if I load_assets_from_modules() in the decorator (i.e NOT at the top level of the module), then they don't end up having their asset group defined by the top level load_assets_from_modules(["eia_transform"]) and could be loaded independently with their own asset groups in pudl.etl?

I created an issue for the EIA Transform cleanup #2387 that's part of a more general pudl.etl DAG cleanup #2386.

Yes, that is correct. The assets returned by load_assets_from_modules in the decorator aren't included in the module's attributes.

Oh yay thank you for creating those issues!

Unfortunately this more elegant and dynamic solution for defining the upstream dependencies results in a (totally unrelated) circular import error, related to the use of occurrence_consistency() in pudl.output.eia860` so we need to do some untangling. 🤦🏼 I'm going to punt for the moment and use the hard-coded list of upstream assets.

AttributeError: partially initialized module 'pudl' has no attribute 'transform' (most likely due to a circular import) File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.10/site-packages/dagster/_grpc/server.py", line 267, in __init__ self._loaded_repositories: Optional[LoadedRepositories] = LoadedRepositories( File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.10/site-packages/dagster/_grpc/server.py", line 116, in __init__ loadable_targets = get_loadable_targets( File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.10/site-packages/dagster/_grpc/utils.py", line 47, in get_loadable_targets else loadable_targets_from_python_module(module_name, working_directory) File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.10/site-packages/dagster/_core/workspace/autodiscovery.py", line 36, in loadable_targets_from_python_module module = load_python_module( File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.10/site-packages/dagster/_core/code_pointer.py", line 138, in load_python_module return importlib.import_module(module_name) File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.10/importlib/__init__.py", line 126, in import_module return _bootstrap._gcd_import(name[level:], package, level) File "<frozen importlib._bootstrap>", line 1050, in _gcd_import File "<frozen importlib._bootstrap>", line 1027, in _find_and_load File "<frozen importlib._bootstrap>", line 992, in _find_and_load_unlocked File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed File "<frozen importlib._bootstrap>", line 1050, in _gcd_import File "<frozen importlib._bootstrap>", line 1027, in _find_and_load File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked File "<frozen importlib._bootstrap>", line 688, in _load_unlocked File "<frozen importlib._bootstrap_external>", line 883, in exec_module File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed File "/Users/zane/code/catalyst/pudl/src/pudl/__init__.py", line 5, in <module> from . import ( # noqa: F401 File "/Users/zane/code/catalyst/pudl/src/pudl/analysis/__init__.py", line 8, in <module> from . import ( # noqa: F401 File "/Users/zane/code/catalyst/pudl/src/pudl/analysis/state_demand.py", line 38, in <module> import pudl.output.pudltabl File "/Users/zane/code/catalyst/pudl/src/pudl/output/__init__.py", line 13, in <module> from . import ( # noqa: F401 File "/Users/zane/code/catalyst/pudl/src/pudl/output/eia860.py", line 8, in <module> from pudl.transform.eia import occurrence_consistency File "/Users/zane/code/catalyst/pudl/src/pudl/transform/__init__.py", line 62, in <module> from . import ( # noqa: F401 File "/Users/zane/code/catalyst/pudl/src/pudl/transform/eia.py", line 1163, in <module> [pudl.transform.eia860, pudl.transform.eia923]

Ah bummer :/ the hard coded list is fine for now.

bendnorman · 2023-03-17T19:56:30Z

Thanks for standardizing these asset group names! Left a couple of comments.

zaneselvans requested a review from bendnorman March 17, 2023 04:51

zaneselvans added the dagster Issues related to our use of the Dagster orchestrator label Mar 17, 2023

zaneselvans added this to the 2023Q1 milestone Mar 17, 2023

zaneselvans self-assigned this Mar 17, 2023

zaneselvans marked this pull request as ready for review March 17, 2023 04:54

jdangerx approved these changes Mar 17, 2023

View reviewed changes

bendnorman reviewed Mar 17, 2023

View reviewed changes

bendnorman approved these changes Mar 17, 2023

View reviewed changes

zaneselvans added 4 commits March 17, 2023 16:27

Attempt dynamic generation of upstream assets in eia_transform

7701a8c

Merge branch 'dagster-asset-etl' into rename-asset-groups

33c4f29

Add type hints to analysis assets.

0845d85

Remove example denormalized & analysis assets for the moment.

4fc99af

zaneselvans merged commit c97e7f3 into dagster-asset-etl Mar 18, 2023

zaneselvans deleted the rename-asset-groups branch March 18, 2023 01:07

bendnorman mentioned this pull request Mar 20, 2023

Convert EIA-861 ETL to use Dagster #2403

Merged

9 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Standardize asset group and asset module names #2411

Standardize asset group and asset module names #2411

zaneselvans commented Mar 17, 2023 •

edited

codecov bot commented Mar 17, 2023 •

edited

jdangerx left a comment

jdangerx Mar 17, 2023

bendnorman Mar 17, 2023

zaneselvans Mar 17, 2023

zaneselvans Mar 17, 2023

bendnorman Mar 17, 2023

zaneselvans Mar 17, 2023

bendnorman Mar 17, 2023

zaneselvans Mar 17, 2023

bendnorman Mar 17, 2023

zaneselvans Mar 17, 2023

bendnorman Mar 17, 2023

zaneselvans Mar 17, 2023

bendnorman Mar 17, 2023

bendnorman commented Mar 17, 2023

		@@ -0,0 +1,33 @@
		"""Derived / analysis assets that aren't simple to construct.

		This is really too large & generic of a category. Should we have an asset group for each

Standardize asset group and asset module names #2411

Standardize asset group and asset module names #2411

Conversation

zaneselvans commented Mar 17, 2023 • edited

PR Overview

PR Checklist

codecov bot commented Mar 17, 2023 • edited

Codecov Report

jdangerx left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bendnorman commented Mar 17, 2023

zaneselvans commented Mar 17, 2023 •

edited

codecov bot commented Mar 17, 2023 •

edited