Update integration tests to work with Dagster ETL #2299

zschira · 2023-02-09T20:48:05Z

Scope

This PR updates the integration tests and test fixtures so they work with the Dagster ETL. All tests now pass except the glue tests, which will be addressed by #2303.

bendnorman

This is looking good! Creating fixtures for job configuration seems to be the method Dagster recommends for testing jobs. I wonder if it makes sense to create fixtures for the actual Dagster resources, not just the configs.

bendnorman · 2023-02-09T21:39:21Z

test/conftest.py

-            batch_size=20,
-        )
-    engine = sa.create_engine(pudl_settings_fixture["ferc1_xbrl_db"])
+    engine = sa.create_engine(f"sqlite:///{os.getenv('PUDL_OUTPUT')}/ferc1_xbrl.sqlite")


Instead of creating the engine we could access it from the ferc1_xbrl_sqlite_io_manager:

    from pudl.io_managers import ferc1_xbrl_sqlite_io_manager
    from dagster import build_init_resource_context


    @pytest.fixture(scope="session", name="ferc1_engine_xbrl")
    def ferc1_xbrl_sql_engine(ferc_to_sqlite, etl_settings):
        context = build_init_resource_context(
            resources={"dataset_settings": etl_settings.dataset_settings}
        )
        return ferc1_xbrl_sqlite_io_manager(context).engine

Much cleaner, fixed
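
One reason the f-string version in the diff is fragile: `os.getenv` returns `None` when the variable is unset, so the URL silently becomes `sqlite:///None/ferc1_xbrl.sqlite`. For anyone who still needs to build the URL by hand, here is a minimal sketch of a guard. The helper name `sqlite_url_from_env` and its `env` argument are hypothetical and not part of PUDL; the IO manager approach above remains the one adopted in this PR.

```python
import os
from pathlib import Path


def sqlite_url_from_env(db_name, env=os.environ, env_var="PUDL_OUTPUT"):
    """Build a SQLite URL from an output directory stored in an env mapping.

    Hypothetical helper for illustration only; it is not part of PUDL.
    """
    output_dir = env.get(env_var)
    if not output_dir:
        # Fail loudly instead of producing "sqlite:///None/<db_name>".
        raise RuntimeError(f"{env_var} must be set before building a database URL")
    return f"sqlite:///{Path(output_dir) / db_name}"


# Demonstration with a plain dict standing in for os.environ.
demo_env = {"PUDL_OUTPUT": "pudl_output"}
url = sqlite_url_from_env("ferc1_xbrl.sqlite", env=demo_env)

try:
    sqlite_url_from_env("ferc1_xbrl.sqlite", env={})
    unset_raises = False
except RuntimeError:
    unset_raises = True
```

Passing the mapping explicitly keeps the helper easy to test without mutating the real process environment.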

bendnorman · 2023-02-09T21:46:03Z

test/conftest.py

        )
    # Grab a connection to the freshly populated PUDL DB, and hand it off.
    # All the hard work here is being done by the datapkg and
    # datapkg_to_sqlite fixtures, above.
-    engine = sa.create_engine(pudl_settings_fixture["pudl_db"])
+    engine = sa.create_engine(f"sqlite:///{os.getenv('PUDL_OUTPUT')}/pudl.sqlite")


We could get the engine from the pudl_sqlite_io_manager.

Also, if we have the pudl_sqlite_io_manager we could run pudl_sqlite_io_manager.check_foreign_keys() once the ETL is finished so we know the FKs are working. However, it might make sense to check the foreign keys in a separate test so it's clear what is failing.

Good call, I added this check in test_pudl_engine

bendnorman · 2023-02-09T21:58:11Z

test/conftest.py

+    tmpdir = tmpdir_factory.mktemp("PUDL_OUTPUT")
+    os.environ["PUDL_OUTPUT"] = str(tmpdir)
+    os.environ["DAGSTER_HOME"] = str(tmpdir)
+
+    # In CI we want a hard-coded path for input caching purposes:
+    if os.environ.get("GITHUB_ACTIONS", False):
+        os.environ["PUDL_CACHE"] = str(Path(os.environ["HOME"]) / "pudl-work")
+    # If --tmp-data is set, create a disposable temporary datastore:
+    elif request.config.getoption("--tmp-data"):
+        os.environ["PUDL_CACHE"] = str(tmpdir)


Does the pudl-work directory need to be created in the GitHub Action?

I think the pudl-work directory may previously have been created by pudl_setup, so it probably does need to be created somewhere in the new arrangement. As the comment says, it's hard-coded so that if none of the DOIs have changed between CI runs, we can just use the previously cached raw inputs from the datastore.

You might want to look at how those DOI changes are detected in the caching step within the tox-pytest workflow. I'm not sure if it needs to be adapted to the new setup.

It looks like previously pudl-work was just being created directly by the action, and instead of using pudl_setup, the paths are just configured in the pudl_settings fixture.
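
If the fixture ever needs to guarantee the cache directory itself, rather than relying on the workflow to create it, the logic in the diff could be extended roughly as follows. This is only a sketch under assumptions: `configure_test_paths` is a hypothetical name, the environment is passed in as a plain mapping so the example can run without touching `os.environ`, and the `mkdir` call is the only behavior added beyond what the diff shows.

```python
import tempfile
from pathlib import Path


def configure_test_paths(tmpdir, env, use_tmp_data=False):
    """Mirror the env var setup from the diff, creating the CI cache dir if needed.

    Hypothetical helper; `env` is any mutable mapping such as os.environ.
    """
    env["PUDL_OUTPUT"] = str(tmpdir)
    env["DAGSTER_HOME"] = str(tmpdir)

    if env.get("GITHUB_ACTIONS"):
        # In CI, use a fixed path so cached raw inputs can be reused across runs.
        cache_dir = Path(env["HOME"]) / "pudl-work"
        # Assumption: creating the directory here is harmless if it already exists.
        cache_dir.mkdir(parents=True, exist_ok=True)
        env["PUDL_CACHE"] = str(cache_dir)
    elif use_tmp_data:
        # Disposable datastore that lives alongside the temporary outputs.
        env["PUDL_CACHE"] = str(tmpdir)
    return env


# Demonstration with temporary directories standing in for $HOME and the pytest tmpdir.
with tempfile.TemporaryDirectory() as fake_home, tempfile.TemporaryDirectory() as out:
    ci_env = configure_test_paths(
        Path(out), {"GITHUB_ACTIONS": "true", "HOME": fake_home}
    )
    cache_created = Path(ci_env["PUDL_CACHE"]).is_dir()
```

Using `exist_ok=True` means the call is safe whether or not the workflow has already restored the directory from the cache.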

bendnorman

Looks good! Just a couple of questions and suggestions.

src/pudl/etl/__init__.py

bendnorman · 2023-02-14T18:53:52Z

src/pudl/metadata/classes.py

@@ -1950,7 +1950,7 @@ def from_data_source_ids(
        xbrl_resources = {}
        for xbrl_id in xbrl_ids:
            # Read JSON Package descriptor from file
-            with open(pudl_settings[f"{xbrl_id}_datapackage"]) as f:
+            with open(output_path / f"{xbrl_id}_datapackage.json") as f:


We'll have to change this back to use pudl_settings for #2301

True, unless we wanted to go for some hybrid approach to solve #2301

bendnorman · 2023-02-14T18:55:02Z

test/conftest.py

    If we are using the test database, we initialize it from scratch first. If we're
    using the live database, then we just yield a conneciton to it.


Should this portion of the docstring be moved to the ferc_to_sqlite fixture?

bendnorman · 2023-02-14T19:01:33Z

test/integration/epacems_test.py

-    settings_tmp = {k: v for k, v in pudl_settings_fixture.items()}
-    settings_tmp["parquet_dir"] = parquet_tmp
-    etl_epacems(epacems_settings, settings_tmp, pudl_ds_kwargs)
+    epacems_path = Path(os.getenv("PUDL_OUTPUT")) / "hourly_emissions_epacems"


Why not use the epacems_parquet_path fixture here? Also, does this depend on pudl_engine to ensure the CEMS ETL has run?

bendnorman · 2023-02-14T19:03:30Z

test/integration/etl_test.py

@@ -17,13 +17,17 @@
 logger = logging.getLogger(__name__)


-def test_pudl_engine(pudl_engine):
+def test_pudl_engine(pudl_engine, pudl_sql_io_manager, check_foreign_keys):
    """Try creating a pudl_engine...."""


I think we should add a more verbose docstring explaining how the foreign keys are checked. Something like "By default the foreign key checks are not enabled in pudl.sqlite. Run this test to check if there are any foreign key errors."
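
For readers unfamiliar with why a separate check is needed: SQLite does not enforce foreign key constraints unless enforcement is switched on for the connection, so a load can succeed even when rows reference missing parents. The sketch below illustrates the idea with the standard library and SQLite's `PRAGMA foreign_key_check`. It is not the implementation of `check_foreign_keys()`, which this PR does not show; the table names and the `find_foreign_key_violations` helper are made up for illustration.

```python
import sqlite3


def find_foreign_key_violations(conn):
    """Return one row per violation: (table, rowid, parent table, fk index)."""
    return conn.execute("PRAGMA foreign_key_check").fetchall()


conn = sqlite3.connect(":memory:")
# Make the usual default explicit: constraints are declared but not enforced.
conn.execute("PRAGMA foreign_keys = OFF")
conn.executescript(
    """
    CREATE TABLE plants (plant_id INTEGER PRIMARY KEY);
    CREATE TABLE generators (
        generator_id INTEGER PRIMARY KEY,
        plant_id INTEGER REFERENCES plants (plant_id)
    );
    INSERT INTO plants VALUES (1);
    INSERT INTO generators VALUES (10, 1);   -- valid parent
    INSERT INTO generators VALUES (11, 99);  -- no such plant, but the insert succeeds
    """
)

violations = find_foreign_key_violations(conn)
conn.close()
```

A test built on this pattern can fail with the list of offending rows, which makes it clearer what went wrong than a generic ETL failure would.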


codecov · 2023-02-16T17:13:04Z

Codecov Report

Base: 85.2% // Head: 85.7% // Increases project coverage by +0.4% 🎉

Coverage data is based on head (ae1395e) compared to base (caa88a4).
Patch coverage: 94.5% of modified lines in pull request are covered.

Additional details and impacted files

@@                 Coverage Diff                 @@
##           dagster-asset-etl   #2299     +/-   ##
===================================================
+ Coverage               85.2%   85.7%   +0.4%     
===================================================
  Files                     72      79      +7     
  Lines                   8279    8906    +627     
===================================================
+ Hits                    7060    7633    +573     
- Misses                  1219    1273     +54

Impacted Files	Coverage Δ
src/pudl/analysis/allocate_net_gen.py	`96.7% <ø> (ø)`
src/pudl/analysis/epacamd_eia.py	`21.5% <0.0%> (ø)`
src/pudl/extract/eia860m.py	`100.0% <ø> (ø)`
src/pudl/extract/epacems.py	`97.3% <ø> (ø)`
src/pudl/metadata/fields.py	`100.0% <ø> (ø)`
src/pudl/metadata/classes.py	`82.4% <61.5%> (-0.1%)`	⬇️
src/pudl/extract/ferc1.py	`81.2% <68.6%> (-6.4%)`	⬇️
src/pudl/convert/datasette_metadata_to_yml.py	`76.1% <75.0%> (+11.1%)`	⬆️
src/pudl/helpers.py	`85.8% <78.5%> (ø)`
src/pudl/cli.py	`87.5% <85.7%> (+17.8%)`	⬆️
... and 45 more


bendnorman

🎉

bendnorman · 2023-02-16T20:41:19Z

test/integration/etl_test.py

+    if check_foreign_keys:
+        # Raises ForeignKeyErrors if there are any
+        pudl_sql_io_manager.check_foreign_keys()
+


Now that we have this check we can delete test/validate/database_test.py. We could also do it in another PR so we don't have to rerun the CI for this PR.

zschira added 2 commits February 9, 2023 15:38

Get epacems test working with dagster

6f16c54

Fix handling of env variables

20f93f6

zschira requested a review from bendnorman February 9, 2023 21:49

bendnorman reviewed Feb 9, 2023


jdangerx added the dagster and inframundo labels Feb 10, 2023

jdangerx linked an issue Feb 10, 2023 that may be closed by this pull request

Update tests to reflect dagster changes #1874

Closed

This was referenced Feb 10, 2023

Convert epacems_to_parquet command to run dagster asset #2300

Merged

Bridge between PUDL_CACHE/PUDL_OUT world and pudl_settings world #2301

Closed

zschira added 4 commits February 14, 2023 10:09

Update ferc1_xbrl_metadata and datastore fixtures

f5ecb74

Fix datasette metadata integration tests

d107eab

Fix epacems parquet path

8a03fc1

Change test sql engines to use io managers

c4a1d2e

zschira changed the title ~~Get epacems test working with dagster~~ Update integration tests to work with Dagster ETL Feb 14, 2023

zschira added 3 commits February 14, 2023 12:48

Fix package_data sphinx import error

e987185

Fix ferc1 transform args

57f6935

Add check_foreign_keys test

e95fc84

bendnorman reviewed Feb 14, 2023


zschira and others added 12 commits February 14, 2023 15:00

Use epacems parquet path fixture

da2f32a

Improve test_pudl_engine docstring

b7ee290

Set dagster required env vars in tox-pytest workflow

455809b

Get tox -e get_unmapped_ids to run at all

34b623e

Fix dagster required env vars in tox-pytest workflow

631c5d5

Pass PUDL env vars through tox

7e912e6

WIP: turn get_plants_ferc1_raw() into a Dagster job

05d0513

Convert plants_ferc1_raw to dagster job

715012b

Fix failing ferc transforms

4cf50e0

Pass io manager key to plants_steam_ferc1 asset

4aad100

Remove old ferc extract functions

adefcd9

Change get_unmapped_ids to use full etl settings

5abd8c4

bendnorman and others added 2 commits February 15, 2023 15:03

Remove tox notebook ignore

e4c015c

Merge pull request #2309 from catalyst-cooperative/daz/2303-get-unmap…

ae1395e

…ped-ids Convert FERC1 -> EIA missing ID validation ET[L] to Dagster

bendnorman approved these changes Feb 16, 2023


zschira merged commit 8dfef94 into dagster-asset-etl Feb 16, 2023

zschira deleted the dagster_integration_tests branch February 16, 2023 21:33

bendnorman mentioned this pull request Feb 17, 2023

Update tests to reflect dagster changes #1874

Closed
