Update integration tests to work with Dagster ETL #2299

Merged
zschira merged 23 commits into dagster-asset-etl from dagster_integration_tests
Feb 16, 2023

Conversation

Member

@zschira commented Feb 9, 2023

Scope

This PR updates the integration tests and test fixtures so that they work with the Dagster ETL. All tests now pass except the glue tests, which will be addressed by #2303.

Member

@bendnorman left a comment

This is looking good! Creating fixtures for job configuration seems to be the Dagster-recommended method for testing jobs. I wonder if it makes sense to create fixtures for the actual Dagster resources, not just the configs.

test/conftest.py Outdated
batch_size=20,
)
-engine = sa.create_engine(pudl_settings_fixture["ferc1_xbrl_db"])
+engine = sa.create_engine(f"sqlite:///{os.getenv('PUDL_OUTPUT')}/ferc1_xbrl.sqlite")
Member

Instead of creating the engine we could access it from the ferc1_xbrl_sqlite_io_manager:

from pudl.io_managers import ferc1_xbrl_sqlite_io_manager
from dagster import build_init_resource_context

@pytest.fixture(scope="session", name="ferc1_engine_xbrl")
def ferc1_xbrl_sql_engine(ferc_to_sqlite, etl_settings):
    # Build a standalone resource context so the IO manager can be
    # initialized outside of a Dagster run.
    context = build_init_resource_context(
        resources={"dataset_settings": etl_settings.dataset_settings}
    )
    return ferc1_xbrl_sqlite_io_manager(context).engine

Member Author

Much cleaner, fixed

test/conftest.py Outdated
)
# Grab a connection to the freshly populated PUDL DB, and hand it off.
# All the hard work here is being done by the datapkg and
# datapkg_to_sqlite fixtures, above.
-engine = sa.create_engine(pudl_settings_fixture["pudl_db"])
+engine = sa.create_engine(f"sqlite:///{os.getenv('PUDL_OUTPUT')}/pudl.sqlite")
Member

We could get the engine from the pudl_sqlite_io_manager.

Member

@bendnorman Feb 9, 2023

Also, if we have the pudl_sqlite_io_manager we could run pudl_sqlite_io_manager.check_foreign_keys() once the ETL is finished so we know the FKs are working. However, it might make sense to check the foreign keys in a separate test so it's clear what is failing.

Member Author

Good call, I added this check in test_pudl_engine

Comment on lines +89 to +98
tmpdir = tmpdir_factory.mktemp("PUDL_OUTPUT")
os.environ["PUDL_OUTPUT"] = str(tmpdir)
os.environ["DAGSTER_HOME"] = str(tmpdir)

# In CI we want a hard-coded path for input caching purposes:
if os.environ.get("GITHUB_ACTIONS", False):
    os.environ["PUDL_CACHE"] = str(Path(os.environ["HOME"]) / "pudl-work")
# If --tmp-data is set, create a disposable temporary datastore:
elif request.config.getoption("--tmp-data"):
    os.environ["PUDL_CACHE"] = str(tmpdir)
Member

@bendnorman Feb 9, 2023

Does the pudl-work directory need to be created in the GitHub Action?

Member

I think the pudl-work directory may previously have been created by pudl_setup, so it probably does need to be created somewhere in the new arrangement. As the comment says, it's hard-coded so that if none of the DOIs have changed between CI runs, we can just use the previously cached raw inputs from the datastore.

You might want to look at how those DOI changes are detected in the caching step within the tox-pytest workflow. I'm not sure if it needs to be adapted to the new setup.

Member Author

It looks like previously pudl-work was just being created directly by the action, and instead of using pudl_setup, the paths are just configured in the pudl_settings fixture.
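
The cache-path logic discussed in this thread can be sketched as a small standalone helper. This is an illustration only: `resolve_cache_dir` is a hypothetical name, not a function in PUDL, and the explicit `mkdir` call reflects the assumption that nothing else in the workflow has created `pudl-work` beforehand.

```python
import tempfile
from pathlib import Path


def resolve_cache_dir(env, tmpdir, use_tmp_data):
    """Pick the raw-input cache directory, mirroring the fixture logic above.

    Hypothetical helper for illustration; not part of PUDL. Returns None
    when neither branch applies, leaving any existing setting untouched.
    """
    if env.get("GITHUB_ACTIONS"):
        # Hard-coded path so cached inputs can be reused between CI runs.
        cache = Path(env["HOME"]) / "pudl-work"
    elif use_tmp_data:
        # Disposable datastore that lives alongside the test outputs.
        cache = Path(tmpdir)
    else:
        return None
    # Assumption: create the directory here instead of relying on the workflow.
    cache.mkdir(parents=True, exist_ok=True)
    return cache


with tempfile.TemporaryDirectory() as home, tempfile.TemporaryDirectory() as tmp:
    ci_cache = resolve_cache_dir({"GITHUB_ACTIONS": "true", "HOME": home}, tmp, False)
    ci_cache_name = ci_cache.name
    ci_cache_exists = ci_cache.is_dir()
    local_is_tmp = resolve_cache_dir({}, tmp, True) == Path(tmp)
    untouched = resolve_cache_dir({}, tmp, False)
```

Passing the environment in as a plain mapping, rather than reading `os.environ` directly, keeps the helper easy to exercise in a unit test without mutating global state.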

@jdangerx added the dagster (Issues related to our use of the Dagster orchestrator) and inframundo labels Feb 10, 2023
@jdangerx linked an issue Feb 10, 2023 that may be closed by this pull request
@zschira changed the title from "Get epacems test working with dagster" to "Update integration tests to work with Dagster ETL" Feb 14, 2023
Member

@bendnorman left a comment

Looks good! Just a couple of questions and suggestions.

src/pudl/etl/__init__.py
@@ -1950,7 +1950,7 @@ def from_data_source_ids(
 xbrl_resources = {}
 for xbrl_id in xbrl_ids:
     # Read JSON Package descriptor from file
-    with open(pudl_settings[f"{xbrl_id}_datapackage"]) as f:
+    with open(output_path / f"{xbrl_id}_datapackage.json") as f:
Member

We'll have to change this back to use pudl_settings for #2301

Member Author

True, unless we wanted to go for some hybrid approach to solve #2301

test/conftest.py Outdated
Comment on lines 228 to 229
If we are using the test database, we initialize it from scratch first. If we're
using the live database, then we just yield a connection to it.
Member

Should this portion of the doc string be moved to the ferc_to_sqlite fixture?

Member Author

Good call

settings_tmp = {k: v for k, v in pudl_settings_fixture.items()}
settings_tmp["parquet_dir"] = parquet_tmp
etl_epacems(epacems_settings, settings_tmp, pudl_ds_kwargs)
epacems_path = Path(os.getenv("PUDL_OUTPUT")) / "hourly_emissions_epacems"
Member

Why not use the epacems_parquet_path fixture here? Also, does this depend on pudl_engine to ensure the cems etl has run?

Member Author

Fixed

@@ -17,13 +17,17 @@
logger = logging.getLogger(__name__)


-def test_pudl_engine(pudl_engine):
+def test_pudl_engine(pudl_engine, pudl_sql_io_manager, check_foreign_keys):
     """Try creating a pudl_engine...."""
Member

I think we should add a more verbose docstring explaining how the foreign keys are checked. Something like "By default the foreign key checks are not enabled in pudl.sqlite. Run this test to check if there are any foreign key errors."
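
To illustrate why an explicit check is needed, here is a minimal standalone sketch using Python's built-in `sqlite3` module rather than PUDL's IO manager. The table names are made up, and `PRAGMA foreign_key_check` is used only as one way such a check could work; this thread does not show how `check_foreign_keys()` is actually implemented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Enforcement is normally off by default in SQLite; it is set explicitly
# here so the sketch behaves the same on every build.
conn.execute("PRAGMA foreign_keys = OFF")
conn.executescript(
    """
    CREATE TABLE plants (plant_id INTEGER PRIMARY KEY);
    CREATE TABLE generators (
        generator_id INTEGER PRIMARY KEY,
        plant_id INTEGER REFERENCES plants (plant_id)
    );
    INSERT INTO plants VALUES (1);
    -- With enforcement off, the orphaned row (plant_id 999) is accepted.
    INSERT INTO generators VALUES (10, 1), (11, 999);
    """
)

# Each row is (child table, rowid, parent table, foreign key index).
violations = conn.execute("PRAGMA foreign_key_check").fetchall()
conn.close()
print(violations)
```

Because the insert succeeds silently, a bad reference only surfaces when a check like this is run after loading, which is the motivation for a dedicated test.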

Member Author

Updated

bendnorman and others added 2 commits February 15, 2023 15:03
…ped-ids

Convert FERC1 -> EIA missing ID validation ET[L] to Dagster

codecov bot commented Feb 16, 2023

Codecov Report

Base: 85.2% // Head: 85.7% // Increases project coverage by +0.4% 🎉

Coverage data is based on head (ae1395e) compared to base (caa88a4).
Patch coverage: 94.5% of modified lines in pull request are covered.

Additional details and impacted files
@@                 Coverage Diff                 @@
##           dagster-asset-etl   #2299     +/-   ##
===================================================
+ Coverage               85.2%   85.7%   +0.4%     
===================================================
  Files                     72      79      +7     
  Lines                   8279    8906    +627     
===================================================
+ Hits                    7060    7633    +573     
- Misses                  1219    1273     +54     
Impacted Files Coverage Δ
src/pudl/analysis/allocate_net_gen.py 96.7% <ø> (ø)
src/pudl/analysis/epacamd_eia.py 21.5% <0.0%> (ø)
src/pudl/extract/eia860m.py 100.0% <ø> (ø)
src/pudl/extract/epacems.py 97.3% <ø> (ø)
src/pudl/metadata/fields.py 100.0% <ø> (ø)
src/pudl/metadata/classes.py 82.4% <61.5%> (-0.1%) ⬇️
src/pudl/extract/ferc1.py 81.2% <68.6%> (-6.4%) ⬇️
src/pudl/convert/datasette_metadata_to_yml.py 76.1% <75.0%> (+11.1%) ⬆️
src/pudl/helpers.py 85.8% <78.5%> (ø)
src/pudl/cli.py 87.5% <85.7%> (+17.8%) ⬆️
... and 45 more

Member

@bendnorman left a comment

🎉

Comment on lines +31 to +34
if check_foreign_keys:
    # Raises ForeignKeyErrors if there are any
    pudl_sql_io_manager.check_foreign_keys()

Member

Now that we have this check we can delete test/validate/database_test.py. We could also do it in another PR so we don't have to rerun the CI for this PR.

@zschira zschira merged commit 8dfef94 into dagster-asset-etl Feb 16, 2023
@zschira zschira deleted the dagster_integration_tests branch February 16, 2023 21:33
Labels
dagster (Issues related to our use of the Dagster orchestrator), inframundo
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Update tests to reflect dagster changes
4 participants