Dagster asset etl #2104
Conversation
…inal tables [no ci] This version of the EIA ETL uses the fs_io_manager for the interim tables and the SQLiteIOManager for the tables that are in our metadata. This way we don't have to worry about adding interim table metadata right now. This version produces a sqlite constraint error when adding the ownership_eia860.owner_zip_code field. All of the values are 5 digits so I'm not sure what's causing the error.
For some reason moving the dtype conversion to the transform function fixed the owner_zip_code constraint failure.
…ine attribute [no ci]
…IO manager [no ci]
Background
Here is my initial conversion of the EIA, static, EIA API and glue ETLs. Tables we want to persist are now assets. For example, the raw and pre-harvested and final EIA tables are now assets. The raw and pre-harvested tables are not saved to the database because I haven't created metadata for them yet. Tables we want to persist in the database need metadata so we can properly maintain column types as they move in and out of the database.
To test this branch out you'll need to recreate your environment and set two environment variables. Set the PUDL_CACHE env var to your .../pudl-work/data directory. Create a new directory for ETL outputs and set the PUDL_OUTPUT env var to its path.
To start dagit, run:
dagit -f src/pudl/repository.py
To run the fast ETL, click on the etl_fast job in the pudl dagster repository. Then click "Materialize All". The final database will be at "$PUDL_OUTPUT/sqlite/pudl.sqlite". You can rerun subsets of the ETL by clicking on the assets you want to recreate and clicking "Materialize Selected".
I left a handful of # TODO (bendnorman): comments that might provide some additional context or future improvements.
Discussion topics
- Clobbering the database isn't necessary when you want to recreate a table, because the contents of the table are deleted before the new data is loaded into the database. However, you'll need to manually delete the database when you make a metadata change, because the db metadata is created when the sqlite file is created. I think we have two options: 1) just make it clear in documentation that you need to delete the sqlite file when you change the metadata, or 2) start using database migrations with something like alembic. Database migrations allow you to version control database schemas and apply changes incrementally.
- For this iteration, I tried to keep the package structure mostly the same so the PR diff doesn't get crazy. I could imagine organizing the package differently in the future so it's clearer how all of the assets and modules relate to one another.
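On the migrations point in the discussion topics above: here is a toy, dependency-free sketch of what incremental schema versioning buys us. Alembic automates this bookkeeping; this sketch just uses SQLite's user_version pragma, and the table/column names are made up for illustration.

```python
import sqlite3

# Ordered list of schema changes. Alembic would store these as versioned
# migration scripts; here they're just SQL strings. Names are hypothetical.
MIGRATIONS = [
    "CREATE TABLE ownership_eia860 (owner_zip_code TEXT)",
    "ALTER TABLE ownership_eia860 ADD COLUMN owner_state TEXT",
]


def migrate(conn: sqlite3.Connection) -> int:
    """Apply any migrations newer than the db's recorded version.

    Re-running is a no-op, so a metadata change never requires deleting
    the sqlite file by hand.
    """
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    for version, sql in enumerate(MIGRATIONS[current:], start=current + 1):
        conn.execute(sql)
        conn.execute(f"PRAGMA user_version = {version}")
    conn.commit()
    return conn.execute("PRAGMA user_version").fetchone()[0]
```

Alembic layers revision hashes, upgrade/downgrade scripts, and autogeneration from the SQLAlchemy metadata on top of this same idea.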
Next steps
- Convert the FERC ETL to use assets.
- Remove deprecated code like most of the functions in pudl.etl and the pudl_settings object.
- Start to convert the test suite to accommodate the dagster concepts.
setup.py
Outdated
"coloredlogs>=14.0,<15.1",  # Dagster requires 14.0
"catalystcoop.ferc_xbrl_extractor @ git+https://github.com/catalyst-cooperative/ferc-xbrl-extractor.git@dev",
I relaxed the coloredlogs version range for the ferc_xbrl_extractor package because dagster requires 14.0.
This change is on main now, so you can use version 0.8.0 for ferc_xbrl_extractor if you want.
src/pudl/etl.py
Outdated
@asset(
    io_manager_key="pudl_sqlite_io_manager",
    required_resource_keys={"datastore"},
    group_name="eia_api",
)
def fuel_receipts_costs_aggs_eia(context):
This could be a multi_asset if we are expecting to extract additional tables from the EIA API.
If this was defined as a multi_asset instead of an asset, does that mean that all the assets in this eia_api group would have to be materialized together, and you wouldn't be able to just get the fuel_receipts_costs_aggs_eia table? Although I guess these API tables are independent in the DAG, so it seems like you are able to materialize them individually?
Yes, if this was defined as a multi_asset, all of the assets in the eia_api group would have to be materialized together.
# if we're not ingesting eia, exclude eia only tables
if not eia:
    glue_dfs = {name: df for (name, df) in glue_dfs.items() if "_eia" not in name}
# if we're not ingesting ferc, exclude ferc1 only tables
if not ferc1:
    glue_dfs = {name: df for (name, df) in glue_dfs.items() if "_ferc1" not in name}
Removed this logic because dagster expects a multi_asset to return the same set of tables each run.
src/pudl/io_managers.py
Outdated
"pudl_output_path": Field(
    str,
    description="Path of directory to store the database in.",
    default_value=os.environ.get("PUDL_OUTPUT"),
I'm thinking we can replace the pudl_settings object by letting users specify output and input directories as environment vars.
I agree that this seems more approachable to users than the pudl_settings object. Agree an error here if those vars aren't set would be nice, as I tried to import pudl without setting them and got a DagsterInvalidConfigError.
I resolved this issue by creating an EnvVar config type (see previous commit). However, I reverted back to using the pudl_settings object because PudlTabl, our examples, and our documentation expect it. I think it makes sense to deprecate the pudl_settings object once all of PUDL has been converted to use dagster.
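The fail-fast behavior discussed here can be sketched in plain Python. This is just the idea behind the EnvVar config type (the function name is illustrative, not the actual PUDL API): surface a readable error at import/config time instead of a DagsterInvalidConfigError when PUDL_INPUT / PUDL_OUTPUT are unset.

```python
import os


def get_required_env(name: str) -> str:
    """Read an environment variable, failing fast with a clear message."""
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(
            f"The {name} environment variable must be set. "
            "See the PUDL setup docs for details."
        )
    return value
```

Usage would then be something like `pudl_output = get_required_env("PUDL_OUTPUT")` wherever the output path is needed.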
src/pudl/settings.py
Outdated
# We do not want to configure tables so remove tables from the dict
d.pop("tables", None)
Dagster expects assets to produce the same tables each time, so we don't want users selecting subsets of tables using a config file. If folks want to create or recreate subsets of tables, they should use the UI.
We could also remove the table argument from the settings objects so we don't have to remove them from this dictionary.
Seems like the tables argument is used in a few different contexts in the settings object but maybe can become obsolete? There could also be a message attached to this if/when the tables key is removed from the dictionary that prompts peeps to use the UI
I think the tables argument can become obsolete by the end of this refactor. If not, I like the idea of adding a warning or info message.
I think using the UI to select subsets of tables will generally simplify things substantially
Wow @bendnorman super cool! This will be a huge improvement in all of our lives. I had fun playing around with the UI and everything worked pretty well for me. I left a few comments in response to your comments. Most of your TODOs felt pretty on point with the feedback I'd have for you just from my initial poking around. In particular, I think allowing for loading subsets of the glue tables (and some of the other multi assets) would be worthwhile. It doesn't take very long to create all the glue tables, but I do think it would be nice to load subsets. Alternatively, separating them out into different assets would also make the DAG more readable in the UI and makes more sense with the actual dependencies.
I think that moving towards database migrations seems like a nice thing to have. Maybe doesn't need to happen right alongside the Dagster refactor, but generally having schema changes more explicitly version controlled would probably be really nice for any PUDL users.
Looks really good! I just have a few questions here, but happy to continue to give feedback as the design moves along. So far though, I think this all makes sense and will be really nice once everything is converted over!
src/pudl/etl.py
Outdated
    Returns:
        A tuple of extracted EIA dataframes.
    """
    eia_settings = context.resources.dataset_settings.eia
Not having to directly pass around settings and resources like the datastore will be super handy
"""Dagster IO Managers.""" |
Is this something that can/should be generalized and pulled out of PUDL? It seems like a SQLiteIOManager could be a pretty general tool that other Dagster users might want to use.
I'd love to create a community SQLiteIOManager eventually. Maybe once PUDL is up and running with dagster we can make a contribution to the dagster codebase.
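For a sense of how general the idea is: the sqlite plumbing at the heart of such an IO manager can be sketched without any dagster dependency. The real thing would subclass dagster's IOManager, derive the table name from the asset key, and enforce the PUDL metadata schemas, but the core read/write logic is roughly this (class and method names here are illustrative):

```python
import sqlite3

import pandas as pd


class SQLiteTableStore:
    """Core read/write logic a generic SQLiteIOManager could wrap."""

    def __init__(self, db_path: str):
        self.db_path = db_path

    def handle_output(self, table_name: str, df: pd.DataFrame) -> None:
        # Replace the table's contents, matching the delete-then-reload
        # behavior discussed earlier in this PR.
        with sqlite3.connect(self.db_path) as conn:
            df.to_sql(table_name, conn, if_exists="replace", index=False)

    def load_input(self, table_name: str) -> pd.DataFrame:
        with sqlite3.connect(self.db_path) as conn:
            return pd.read_sql_query(f"SELECT * FROM {table_name}", conn)
```

The PUDL version adds the important part a generic tool would also want: applying a declared schema so column types survive the round trip through the database.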
I decided to create a dynamic graph that produces the hourly_emissions_epacems asset. I created two IO Managers to support producing a single combined parquet file and a collection of partitioned parquet files. Using pd.concat() to combine the partitions is problematic because it might blow up memory trying to concat >5GB of dataframes together. Will remove this step and have an IO Manager write each partition to the combined parquet file in parallel.
I honestly couldn't get the partitioned assets to work. It took almost a minute to launch subprocesses for each partition, and duplicate partitions were being launched. Users would also have to deal with the dagster daemon, and they couldn't configure resources for partition runs unless a configured job is created. I'm committing these changes in case we want to refer back to them.
…le without an IO Manager [no ci] This strategy is simpler than the other options laid out in #2084. The downside is that we can't leverage IO Managers to write the data. Not a huge con compared to the other options.
* Made sure that modules and asset group names use our existing data source short codes wherever possible (eia860, eia923, ferc1, eia_bulk_elec, epacems, etc.)
* Split "output_assets" into denormalized_assets and analysis_assets. These are still placeholders, but I think we'll want to differentiate between those categories since what's inside them is pretty different (simple combinations of other tables vs. complex, novel derived values).
* Removed the _assets suffix from all the asset groups. I think it will be clear from the context that these identifiers are being used in that they pertain to asset groups. I could be convinced otherwise though.
* Split EIA assets into "raw", "clean", and "norm" (normalized) groups. Split the pre-harvesting EIA groups by original data source (eia860 vs. eia923).
* I think there's a bunch of unnecessary tangling of dependencies inside the eia_transform function that we should untangle, but not right now.
* I tried and failed (again) to find a way to grab all the asset keys for assets by asset group (without running into circular import issues) and so ended up enumerating the individual EIA-860 and EIA-923 input assets for the eia_transform function. This let me split the raw & clean EIA assets out into their own groups and seemed more readable. It also removed some top-level definitions that would have been globally accessible, and moved all of the asset group definitions into the pudl.etl module.
* I noticed that the censusdp1tract database doesn't seem to show up as a resource or an asset anywhere. Should we add that to keep things uniform? It only has 3-4 tables, so maybe each of them can be its own asset? Or should it be analogous to the FERC 1 DBs?
…anager-tests Throw error when db is missing table schema
Standardize asset group and asset module names
Rename check_pudl_fks to pudl_check_fks to align w/ other CLI names
…onda Pin grpcio==1.46.1 (arm64 compatible version available on conda-forge)
…ebooks Convert devtools notebooks to use dagster concepts
Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks.
* parameterize `pudl.workspace.setup.get_defaults` so we can override behavior
* test settings overriding behavior
* override settings in centralized locations for test instead of ad-hoc, everywhere

Co-authored-by: zschira <zach.schira@catalyst.coop>
* Use PUDL_INPUT/PUDL_OUTPUT environment variables in Docker
* Make IO directories *before* chowning so that the catalyst user can still use them later.
* Format pytest args so they don't mis-parse
* Fix side effects of pudl_environment test; add tests for EnvVar configuration; remove obsolete database_test.py