-
-
Notifications
You must be signed in to change notification settings - Fork 106
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Standardize asset group and asset module names #2411
Conversation
* Made sure that modules and asset group names use our existing data source short codes wherever possible (eia860, eia923, ferc1, eia_bulk_elec, epacems, etc.) * Split "output_assets" into denormalized_assets and analysis_assets. These are still placeholders, but I think we'll want to differentiate between those categories since what's inside them is pretty different (simple combinations of other tables vs. complex, novel derived values). * Removed the _assets suffix from all the asset groups. I think it will be clear from the context that these identifiers are being used in that they pertain to asset groups. I could be convinced otherwise though. * Split EIA assets into "raw" "clean" and "norm" (normalized) groups. Split the pre-harvesting EIA groups by original data source (eia860 vs. eia923) * I think there's a bunch of unnecessary tangling of dependencies inside the eia_transform function that we should untangle, but not right now. * I tried and failed (again) to find a way to grab all the asset keys for assets by asset group (without running into circular import issues) and so ended up enumerating the individual EIA-860 and EIA-923 input assets for the eia_transform function. This let me split the raw & clean EIA assets out into their own groups and seemed more readable. It also removed some top-level definitions that would have been globally accessible, and moved all of the asset group definitions into the pudl.etl module. * I noticed that the censusdp1tract database doesn't seem to show up as a resource or an asset anywhere. Should we add that to keep things uniform? It only has 3-4 tables, so maybe each of them can be its own asset? Or should it be analogous to the FERC 1 DBs?
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## dagster-asset-etl #2411 +/- ##
=================================================
Coverage 86.1% 86.1%
=================================================
Files 80 81 +1
Lines 9551 9552 +1
=================================================
+ Hits 8232 8233 +1
Misses 1319 1319
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is easier to understand for a policy/data neophyte like me :) And the ETL appears to still run. LGTM!
@@ -0,0 +1,33 @@ | |||
"""Derived / analysis assets that aren't simple to construct. | |||
|
|||
This is really too large & generic of a category. Should we have an asset group for each |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think, once we have more assets to actually group!
Right now these will mostly be tables defined in `pudl.output`. Ultimately it would | ||
also include any tables built by joining and aggregating normalized and analysis | ||
tables. These are probably data warehouse style tables with a lot of duplicated | ||
information, ready for analysts to use. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now we have pudl.etl.output_assets
and pudl.etl.analysis_assets
. I was thinking these modules would be place holders until we start converting output and analysis tables. Might be easier to just wrap the functions in pudl.output
instead of moving things into these place holder modules. (Not a requested change or blocker for this PR just a thought.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just wanted to conceptually separate "outputs" (which we get fuzzy with) into the denormalized vs. analysis categories from the beginning. Agree that for now we're probably going to just define these assets over wherever the existing functions are, and we can give them explicit asset groups over there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After the transition to a more data-warehouse style architecture, I think that they day-to-day use is mostly going to be using the denormalized tables (as we do now, via PudlTabl
) so maybe those are the tables that end up not having a funny prefix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me! I like the separation of analysis and output for now.
You're suggesting adding a funny prefix to the normalized tables to differentiate them from the denormalized tables because most folks will be interacting with the denormed ones?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm imagining something like raw_generation_eia923
, clean_generation_eia923
, norm_generation_eia923
(or other meaningful indicators of what step you're at) and then just have generation_eia923
for the easy-to-use denormalized table.
asset_id: AssetIn() | ||
for asset_id in [ | ||
"clean_boiler_fuel_eia923", | ||
"clean_boiler_generator_assn_eia860", | ||
"clean_boilers_eia860", | ||
"clean_coalmine_eia923", | ||
"clean_fuel_receipts_costs_eia923", | ||
"clean_generation_eia923", | ||
"clean_generation_fuel_eia923", | ||
"clean_generation_fuel_nuclear_eia923", | ||
"clean_generators_eia860", | ||
"clean_ownership_eia860", | ||
"clean_plants_eia860", | ||
"clean_utilities_eia860", | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you decide to list the tables out instead of dynamically loading them using:
eia_assets = load_assets_from_modules([eia860, eia923])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Loading the assets from those modules in this module, and then loading the assets defined in this module at the top level meant that all of the eia_transform
, eia860
, and eia923
modules had to part of the same asset group, which obscured both the clean vs. raw distinction in the eia860
and eia923
data sources (raw existed, but clean never did) and also combined the eia860
and eia923
asset groups.
I tried to figure out a way to construct the list of assets in this module based on the asset group definitions in pudl.etl
but couldn't get it to work due to circular imports.
We can simplify the DAG and make the true dependencies clearer within the eia_transform
group by using the clean_eia860
and clean_eia923
groups as inputs to eia_transform
but restricting its outputs to the entity & annual tables. I think the non-entity table assets in norm_eia923
and norm_eia860
should only depend only on their clean_*
siblings upstream (since all that happens to them in the harvesting step is columns that don't belong in them in the database get dropped, after their useful information has been used to compile the entity & annual tables).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohh I think I ran into this issue in the devtool notebook PR. My workaround was to call load_assets_from_modules
in the multi_asset decorator:
@multi_asset(
ins={
asset_key.to_python_identifier(): AssetIn()
for eia_asset in load_assets_from_modules([eia860, eia923])
for asset_key in eia_asset.keys
},
...
)
This way the eia860 and eia923 clean assets can be in their own group and you don't have to enumerate all the table names in the decorator.
Good point about the nonentity tables only depending on the upstream clean tables. I've been meaning to create an issue for disentangling the eia_transform assets to reflect the real dependencies and maybe get some more parallelism going.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you saying that if I load_assets_from_modules()
in the decorator (i.e NOT at the top level of the module), then they don't end up having their asset group defined by the top level load_assets_from_modules(["eia_transform"])
and could be loaded independently with their own asset groups in pudl.etl
?
I created an issue for the EIA Transform cleanup #2387 that's part of a more general pudl.etl
DAG cleanup #2386.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is correct. The assets returned by load_assets_from_modules
in the decorator aren't included in the module's attributes.
Oh yay thank you for creating those issues!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately this more elegant and dynamic solution for defining the upstream dependencies results in a (totally unrelated) circular import error, related to the use of occurrence_consistency() in
pudl.output.eia860` so we need to do some untangling. 🤦🏼 I'm going to punt for the moment and use the hard-coded list of upstream assets.
AttributeError: partially initialized module 'pudl' has no attribute 'transform' (most likely due to a circular import)
File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.10/site-packages/dagster/_grpc/server.py", line 267, in __init__
self._loaded_repositories: Optional[LoadedRepositories] = LoadedRepositories(
File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.10/site-packages/dagster/_grpc/server.py", line 116, in __init__
loadable_targets = get_loadable_targets(
File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.10/site-packages/dagster/_grpc/utils.py", line 47, in get_loadable_targets
else loadable_targets_from_python_module(module_name, working_directory)
File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.10/site-packages/dagster/_core/workspace/autodiscovery.py", line 36, in loadable_targets_from_python_module
module = load_python_module(
File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.10/site-packages/dagster/_core/code_pointer.py", line 138, in load_python_module
return importlib.import_module(module_name)
File "/Users/zane/mambaforge/envs/pudl-dev/lib/python3.10/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 992, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/Users/zane/code/catalyst/pudl/src/pudl/__init__.py", line 5, in <module>
from . import ( # noqa: F401
File "/Users/zane/code/catalyst/pudl/src/pudl/analysis/__init__.py", line 8, in <module>
from . import ( # noqa: F401
File "/Users/zane/code/catalyst/pudl/src/pudl/analysis/state_demand.py", line 38, in <module>
import pudl.output.pudltabl
File "/Users/zane/code/catalyst/pudl/src/pudl/output/__init__.py", line 13, in <module>
from . import ( # noqa: F401
File "/Users/zane/code/catalyst/pudl/src/pudl/output/eia860.py", line 8, in <module>
from pudl.transform.eia import occurrence_consistency
File "/Users/zane/code/catalyst/pudl/src/pudl/transform/__init__.py", line 62, in <module>
from . import ( # noqa: F401
File "/Users/zane/code/catalyst/pudl/src/pudl/transform/eia.py", line 1163, in <module>
[pudl.transform.eia860, pudl.transform.eia923]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah bummer :/ the hard coded list is fine for now.
Thanks for standardizing these asset group names! Left a couple of comments. |
PR Overview
PR Checklist
dev
).