Update integration tests to work with Dagster ETL #2299
Conversation
This is looking good! Creating fixtures for job configuration seems like the dagster recommended method for testing jobs. I wonder if it makes sense to create fixtures for the actual dagster resources not just the configs.
test/conftest.py
Outdated
      batch_size=20,
  )
- engine = sa.create_engine(pudl_settings_fixture["ferc1_xbrl_db"])
+ engine = sa.create_engine(f"sqlite:///{os.getenv('PUDL_OUTPUT')}/ferc1_xbrl.sqlite")
Instead of creating the engine, we could access it from the ferc1_xbrl_sqlite_io_manager:

import pytest
from dagster import build_init_resource_context

from pudl.io_managers import ferc1_xbrl_sqlite_io_manager


@pytest.fixture(scope="session", name="ferc1_engine_xbrl")
def ferc1_xbrl_sql_engine(ferc_to_sqlite, etl_settings):
    context = build_init_resource_context(
        resources={"dataset_settings": etl_settings.dataset_settings}
    )
    return ferc1_xbrl_sqlite_io_manager(context).engine
Much cleaner, fixed
test/conftest.py
Outdated
  )
  # Grab a connection to the freshly populated PUDL DB, and hand it off.
  # All the hard work here is being done by the datapkg and
  # datapkg_to_sqlite fixtures, above.
- engine = sa.create_engine(pudl_settings_fixture["pudl_db"])
+ engine = sa.create_engine(f"sqlite:///{os.getenv('PUDL_OUTPUT')}/pudl.sqlite")
We could get the engine from the pudl_sqlite_io_manager.
Also, if we have the pudl_sqlite_io_manager, we could run pudl_sqlite_io_manager.check_foreign_keys() once the ETL is finished so we know the FKs are working. However, it might make sense to check the foreign keys in a separate test so it's clear what is failing.
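For context on why an explicit check matters: SQLite does not enforce foreign keys by default, so orphan rows can slip into the database silently. A minimal, self-contained sketch of what such a check boils down to, using the stdlib sqlite3 module (the table names here are made up; pudl's check_foreign_keys() wraps this kind of PRAGMA-based scan):

```python
import sqlite3

# In-memory DB standing in for pudl.sqlite; FK enforcement is OFF by default.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE plants (id INTEGER PRIMARY KEY);
    CREATE TABLE generators (
        id INTEGER PRIMARY KEY,
        plant_id INTEGER REFERENCES plants (id)
    );
    INSERT INTO plants (id) VALUES (1);
    -- Orphan row: plant 99 does not exist, but SQLite accepts it anyway.
    INSERT INTO generators (id, plant_id) VALUES (1, 99);
    """
)

# PRAGMA foreign_key_check reports every row that violates an FK constraint,
# as (table, rowid, referred_table, fk_index) tuples.
violations = conn.execute("PRAGMA foreign_key_check").fetchall()
for table, rowid, referred_table, fk_index in violations:
    print(f"{table} rowid {rowid} -> missing row in {referred_table}")
```

Running the check as its own test (or its own assertion) means a failure points straight at the broken constraint rather than at whatever downstream query happened to trip over it.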
Good call, I added this check in test_pudl_engine.
  tmpdir = tmpdir_factory.mktemp("PUDL_OUTPUT")
  os.environ["PUDL_OUTPUT"] = str(tmpdir)
  os.environ["DAGSTER_HOME"] = str(tmpdir)

  # In CI we want a hard-coded path for input caching purposes:
  if os.environ.get("GITHUB_ACTIONS", False):
      os.environ["PUDL_CACHE"] = str(Path(os.environ["HOME"]) / "pudl-work")
  # If --tmp-data is set, create a disposable temporary datastore:
  elif request.config.getoption("--tmp-data"):
      os.environ["PUDL_CACHE"] = str(tmpdir)
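The environment-variable plumbing in this fixture can be sketched in isolation. A toy version, using tempfile as a stand-in for pytest's tmpdir_factory (only the PUDL_OUTPUT variable name comes from the PR; the rest is illustrative):

```python
import os
import sqlite3
import tempfile
from pathlib import Path

# Stand-in for tmpdir_factory.mktemp("PUDL_OUTPUT") in the real fixture:
tmpdir = tempfile.mkdtemp(prefix="PUDL_OUTPUT")
os.environ["PUDL_OUTPUT"] = tmpdir

# Downstream code never sees the tmpdir object, only the env var, so any
# test or module can reconstruct the same paths independently:
db_url = f"sqlite:///{os.getenv('PUDL_OUTPUT')}/pudl.sqlite"
db_path = Path(os.getenv("PUDL_OUTPUT")) / "pudl.sqlite"

# Touch the database file the way the ETL eventually would.
sqlite3.connect(db_path).close()
print(db_url, db_path.exists())
```

The design tradeoff is that env vars are process-global mutable state: convenient for decoupling fixtures from the code under test, but tests that change them should restore or isolate them to avoid leaking into other sessions.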
Does the pudl-work directory need to be created on the GitHub Action?
I think the pudl-work directory may have previously been created by pudl_setup, so it probably does need to be created somewhere in the new arrangement. As the comment says, it's hard-coded so that if none of the DOIs have changed between CI runs, we can reuse the previously cached raw inputs from the datastore.
You might want to look at how those DOI changes are detected in the caching step within the tox-pytest workflow. I'm not sure if it needs to be adapted to the new setup.
It looks like previously pudl-work was just being created directly by the action, and instead of using pudl_setup, the paths are now configured in the pudl_settings fixture.
Looks good! Just a couple of questions and suggestions.
@@ -1950,7 +1950,7 @@ def from_data_source_ids(
      xbrl_resources = {}
      for xbrl_id in xbrl_ids:
          # Read JSON Package descriptor from file
-         with open(pudl_settings[f"{xbrl_id}_datapackage"]) as f:
+         with open(output_path / f"{xbrl_id}_datapackage.json") as f:
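The new lookup is plain pathlib arithmetic instead of a settings-dict key. A self-contained sketch of the pattern (the xbrl_id value and descriptor contents are made up for illustration):

```python
import json
import tempfile
from pathlib import Path

# Stand-in for the real PUDL output directory:
output_path = Path(tempfile.mkdtemp())

# Pretend ferc_to_sqlite already wrote a descriptor for one XBRL dataset.
xbrl_id = "ferc1_xbrl"  # hypothetical id
(output_path / f"{xbrl_id}_datapackage.json").write_text(
    json.dumps({"name": xbrl_id, "resources": []})
)

# The pattern from the diff: derive the path from a convention
# (<output_path>/<xbrl_id>_datapackage.json) rather than a settings dict.
with open(output_path / f"{xbrl_id}_datapackage.json") as f:
    descriptor = json.load(f)
print(descriptor["name"])
```

The upside is one less dict to thread through the call stack; the downside, as noted below, is that anything still keyed off pudl_settings has to agree on the same convention.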
We'll have to change this back to use pudl_settings for #2301.
True, unless we wanted to go for some hybrid approach to solve #2301.
test/conftest.py
Outdated
  If we are using the test database, we initialize it from scratch first. If we're
  using the live database, then we just yield a connection to it.
Should this portion of the docstring be moved to the ferc_to_sqlite fixture?
Good call
test/integration/epacems_test.py
Outdated
- settings_tmp = {k: v for k, v in pudl_settings_fixture.items()}
- settings_tmp["parquet_dir"] = parquet_tmp
- etl_epacems(epacems_settings, settings_tmp, pudl_ds_kwargs)
+ epacems_path = Path(os.getenv("PUDL_OUTPUT")) / "hourly_emissions_epacems"
Why not use the epacems_parquet_path fixture here? Also, does this depend on pudl_engine to ensure the CEMS ETL has run?
Fixed
test/integration/etl_test.py
Outdated
@@ -17,13 +17,17 @@
  logger = logging.getLogger(__name__)


- def test_pudl_engine(pudl_engine):
+ def test_pudl_engine(pudl_engine, pudl_sql_io_manager, check_foreign_keys):
      """Try creating a pudl_engine...."""
I think we should add a more verbose docstring explaining how the foreign keys are checked. Something like: "By default the foreign key checks are not enabled in pudl.sqlite. Run this test to check if there are any foreign key errors."
Updated
…ped-ids: Convert FERC1 -> EIA missing ID validation ETL to Dagster
Codecov Report
Base: 85.2% // Head: 85.7% // Increases project coverage by +0.4%.

Additional details and impacted files:

@@           Coverage Diff            @@
##  dagster-asset-etl   #2299   +/- ##
=======================================
+ Coverage      85.2%   85.7%   +0.4%
=======================================
  Files            72      79      +7
  Lines          8279    8906    +627
=======================================
+ Hits           7060    7633    +573
- Misses         1219    1273     +54

☔ View full report at Codecov.
🎉
  if check_foreign_keys:
      # Raises ForeignKeyErrors if there are any
      pudl_sql_io_manager.check_foreign_keys()
Now that we have this check we can delete test/validate/database_test.py. We could also do it in another PR so we don't have to rerun the CI for this PR.
Scope
This PR updates integration tests and test fixtures to get them working with the Dagster ETL. All tests now pass except the glue tests, which will be addressed by #2303.