Prototype dagster-pandera integration #3282

Merged
12 commits merged into main from prototype-dagster-pandera on Feb 20, 2024

Conversation

@jdangerx (Member) commented Jan 24, 2024

Overview

Relevant to #1572, but doesn't close it. Opens the possibility of closing it.

We need to integrate:

  • our existing resource schema definition system
  • Pandera
  • dagster

Fortunately, there's already integration between pandera and dagster. We just need a shim layer between our existing schema definition system and pandera - hence, this PR.

Changes:

  • shim layer between our schema definitions & pandera
  • create dagster asset checks for all assets that correspond to a Resource
  • port boiler fuel validation checks to resource schema

See details for the original rant, but I ended up introducing some classes to define our existing RESOURCE_METADATA data structure. That lays the groundwork for moving towards using the frictionless library in the future, as well as further refactoring to make our metadata handling interoperate with various different libraries like we discussed in https://github.com/orgs/catalyst-cooperative/discussions/2546.
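To give a flavor of what the shim layer does, here is a simplified sketch (not the actual code in classes.py; the dtype mapping and field shape are stand-ins): each field in a resource's schema gets translated into a pandera column, and the columns get bundled into a DataFrameSchema that an asset check can run.

```python
import pandera as pa

# Simplified stand-in for the PUDL-type -> pandas/pandera dtype mapping; the real
# shim derives dtypes and constraints from our Field/Schema metadata classes.
PANDERA_DTYPES = {"integer": "Int64", "number": "float64", "string": "string", "boolean": "boolean"}


def field_to_pandera_column(dtype: str, required: bool = False) -> pa.Column:
    """Translate one field definition into a pandera Column."""
    return pa.Column(PANDERA_DTYPES[dtype], nullable=not required)


def schema_to_pandera(fields: dict[str, str], primary_key: list[str]) -> pa.DataFrameSchema:
    """Translate a resource schema (field name -> type) into a pandera DataFrameSchema."""
    return pa.DataFrameSchema(
        columns={
            name: field_to_pandera_column(dtype, required=name in primary_key)
            for name, dtype in fields.items()
        },
        unique=primary_key,  # primary key columns must be jointly unique
    )
```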

Our existing schema definition system is hard to change for many reasons; one of which is that the shape of the data in RESOURCE_METADATA is actually not documented anywhere! Instead we have all sorts of weird logic scattered throughout to turn that undocumented shape into something resembling Frictionless Packages/Resources/Fields (but that are distinct classes...)

One thing we could do to make this whole thing a bit more tractable is to actually define some sort of ResourceMetadata or ResourceSpec class (and, potentially, companion SchemaSpec, FieldSpec classes) that:

  • describes what we expect from RESOURCE_METADATA
  • knows how to translate from itself into various different formats: frictionless, pandera, pyarrow, pandas, etc.

Right now, our Resource (and companion) classes happen to be "a copy of the frictionless class, sort of, that knows how to translate from the undefined shape of what's in RESOURCE_METADATA into our ersatz frictionless class." Which causes much entanglement/confusion.

It doesn't seem like it would be too big of a refactor, to split up Resource into ResourceSpec and the official frictionless.Resource, so maybe we should do that sometime soon.
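Concretely, the translation layer could look something like this hypothetical interface (none of these names exist yet; the pyarrow / pandas targets come from the wishlist above):

```python
from typing import Protocol

import pandera as pa
import pyarrow


class TranslatableSpec(Protocol):
    """What a ResourceSpec / SchemaSpec / FieldSpec could promise to provide."""

    def to_frictionless_descriptor(self) -> dict: ...

    def to_pandera(self) -> pa.DataFrameSchema: ...

    def to_pyarrow(self) -> pyarrow.Schema: ...

    def to_pandas_dtypes(self) -> dict[str, str]: ...
```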

Testing

How did you make sure this worked? How can a reviewer verify this?

I ran the whole ETL, and saw that all the asset checks passed. I also happened to pick an asset which has some validation checks we expect to fail - so I saw that failing checks did actually show up in Dagster, before then adding machinery to xfail certain checks.

To-do list

  1. use Pandera to express real data type and validation tests for an asset in Dagster
  2. Make sure full ETL runs & make pytest-integration-full passes locally
  3. Update the release notes: reference the PR and related issues.
  4. Review the PR yourself and call out any questions or issues you have

@jdangerx (Member, Author) left a comment:

some preliminary comments

pyproject.toml (resolved review thread)
src/pudl/metadata/classes.py (resolved review thread)
src/pudl/metadata/classes.py (resolved review thread)
src/pudl/metadata/resources/eia.py (resolved review thread)
@jdangerx (Member, Author) commented Jan 24, 2024:

cc @zaneselvans, who I think has touched a lot of this recently, and @cmgosnell + @e-belfer, who have recently expressed some desire for using our hard-won schema information to annotate our asset types - not tagging for review since it's not ready yet, but just an FYI.

@bendnorman (Member) commented:
Something to note is dagster's addition of asset checks, which allow you to run arbitrary tests after an asset executes. The feature is still experimental but seems pretty useful. I asked how asset checks and something like dagster-pandera compare. If we want to use pandera + dagster asset checks, they recommend validating the asset using pandera inside of the asset check instead of using dagster-pandera.
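A minimal sketch of that recommended pattern, i.e. running a pandera schema inside a dagster asset check (the asset key, columns, and bounds below are illustrative, not taken from this PR):

```python
import pandas as pd
import pandera as pa
from dagster import AssetCheckResult, asset_check

# Illustrative schema; in this PR the schema would be generated from our Resource metadata.
boiler_fuel_schema = pa.DataFrameSchema(
    {
        "fuel_consumed_units": pa.Column(float, pa.Check.ge(0), nullable=True),
        "fuel_mmbtu_per_unit": pa.Column(float, pa.Check.ge(0), nullable=True),
    }
)


@asset_check(asset="core_eia923__monthly_boiler_fuel")  # illustrative asset key
def boiler_fuel_schema_check(core_eia923__monthly_boiler_fuel: pd.DataFrame) -> AssetCheckResult:
    """Validate the materialized dataframe with pandera inside the check body."""
    try:
        # lazy=True collects every failure instead of stopping at the first one.
        boiler_fuel_schema.validate(core_eia923__monthly_boiler_fuel, lazy=True)
    except pa.errors.SchemaErrors as errs:
        return AssetCheckResult(passed=False, metadata={"failures": str(errs.failure_cases)})
    return AssetCheckResult(passed=True)
```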

@jdangerx (Member, Author) commented:
Ooh, thanks @bendnorman! Using something that's more "core" dagster is better than using something that's less core :)

@zaneselvans (Member) commented:
I'm not sure if it matters but it seems like there's a conflation of 2 related functions here: asset type checking, and more complex data validations that look at the data contents, beyond just the schema (though I guess you could consider constraints on the values and relationships between columns part of a more complicated "type" with the validations that Pydantic does).

But if we use an asset check and happen to use Pandera inside it, will the dataframe typing information still be available to Dagster? Or to us in the IDE? It sounded like there was a vague plan to update dagster-pandera to incorporate asset checks:

> For now you should use pandera directly. We'll update dagster-pandera to add checks support but don't have a date yet

@jdangerx force-pushed the prototype-dagster-pandera branch 4 times, most recently from 6230349 to 7157ca3 on February 15, 2024 19:25
src/pudl/etl/__init__.py (resolved review thread)
src/pudl/metadata/classes.py (resolved review thread)
@@ -1001,6 +1058,69 @@ class ResourceHarvest(PudlMeta):
     """Fraction of invalid fields above which result is considered invalid."""


+class PudlResourceDescriptor(PudlMeta):
@jdangerx (Member, Author) commented:

Added this class so that we have a proper description/enforcement of the shape of RESOURCE_METADATAs. This should make it easier for someone to understand how to add a new resource.

Ran into a couple things that are a little funky, documented as TODOs.
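Roughly, the idea is that each entry in RESOURCE_METADATA gets parsed straight into a nested pydantic model, so malformed entries fail loudly up front. An abbreviated sketch of the shape (the real class subclasses PudlMeta and covers more keys; assumes pydantic v2):

```python
from pydantic import BaseModel, Field


class _SchemaDescriptor(BaseModel):
    fields: list[str]
    primary_key: list[str] = []


class _ResourceDescriptorSketch(BaseModel):
    """Abbreviated stand-in for PudlResourceDescriptor."""

    description: str
    table_schema: _SchemaDescriptor = Field(alias="schema")
    field_namespace: str
    sources: list[str] = []
    etl_group: str


# Validate one raw RESOURCE_METADATA entry (illustrative resource name):
# _ResourceDescriptorSketch.model_validate(RESOURCE_METADATA["core_eia923__monthly_boiler_fuel"])
```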

@@ -1185,8 +1305,18 @@ def _check_harvest_primary_key(cls, value, info: ValidationInfo):
         return value

     @staticmethod
-    def dict_from_id(x: str) -> dict:  # noqa: C901
-        """Construct dictionary from PUDL identifier (`resource.name`).
+    def dict_from_id(resource_id: str) -> dict:
@jdangerx (Member, Author) commented:

It was surprisingly simple to run all of our RESOURCE_METADATA through the new type and validation machinery!

Member:

So this uses the new high-level composite PudlResourceDescriptor to validate whatever we've encoded in our giant dictionary of doom, rather than waiting for each of the individual subcomponents to get instantiated and validated? The main goal being to ensure that we have a complete and explicit, rather than implicit, description of what needs to be defined for PUDL at the resource / table level?

@jdangerx (Member, Author) replied:

Yes! Basically:

  • It's helpful to have something like PudlResourceDescriptor to explicitly encode our bespoke resource descriptor data structure!
  • It's... much less helpful if we don't actually enforce that that's what we're working with in the rest of the code.
  • This should also blow up on incorrectly defined resources at validation time, instead of later, when we're trying to run all this transformation logic on them.

@@ -1533,6 +1554,7 @@ def plot_vs_agg(orig_df, agg_df, validation_cases):
         "low_bound": 0.95,
         "data_col": "fuel_mmbtu_per_unit",
         "weight_col": "fuel_consumed_units",
+        "xfail": True,
@jdangerx (Member, Author) commented:

It might be worthwhile to define a BoundsCheckParams class that all these bounds checks conform to, instead of having them be dictionaries that can be anything. It's not necessary for this code to work, though. What do people think?
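For instance, something like this (hypothetical class; the field set would mirror whatever keys these dicts actually use, plus the new xfail flag):

```python
from pydantic import BaseModel


class BoundsCheckParams(BaseModel):
    """Hypothetical typed container for the weighted-quantile bounds checks."""

    title: str
    query: str = ""
    data_col: str
    weight_col: str
    low_q: float | None = None
    low_bound: float | None = None
    hi_q: float | None = None
    hi_bound: float | None = None
    xfail: bool = False  # expected-to-fail cases still run but don't fail the check
```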

Member:

This module is a horror show written long ago, before we knew about Pydantic or data classes or typed dicts. It's long overdue for an overhaul. I don't even know if it should continue existing after we rip out PudlTabl and change how the tests and validations access the database.

Member:

If we are going to be running the validations during the ETL as soon as the relevant table has been created, where does it make the most sense to define the validations that apply to that table? Should it be adjacent to the asset definitions (where all of the cleaning / munging is actually happening)? And then referenced from the Resource definition? Or should it actually be in the resource definition? Or would we want to continue to have a separate module or set of modules dedicated to defining the validations like we do now?

At the same time we probably want to develop a library of generic kinds-of-checks that we'll end up using in many different contexts, instantiating them with appropriate parameters when they're defined for the individual tables that they apply to. So maybe we turn pudl.validations into the place where those validation classes / types are defined?

@jdangerx (Member, Author) replied:

I think what makes the most sense is:

  1. Use the PUDL resource schemas to define the data type validations. If e.g. "everything in this column should be >10" that seems like it lives in the schema.
  2. More complicated validations should be defined alongside the assets - you can just define an @asset_check right in the module, in which case we'll have to add some load_asset_checks_from_module calls to etl/__init__.py.
  3. We should have utility functions + types related to validations in the pudl.validations module.

I don't think we need to then reference the more complicated validations alongside the schema in RESOURCE_METADATA.
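Roughly, the wiring for (2) in etl/__init__.py could look like this (assumed dagster API; the transform module named here is just an example of where checks might live):

```python
import dagster

import pudl.transform.eia923  # example module where @asset_check functions would live

# Collect every @asset_check defined in the listed modules; dagster matches each
# check to its asset via the asset key passed to @asset_check, not by list ordering.
module_asset_checks = dagster.load_asset_checks_from_modules([pudl.transform.eia923])

defs = dagster.Definitions(
    assets=default_assets,  # already built earlier in etl/__init__.py
    asset_checks=[*default_asset_checks, *module_asset_checks],
)
```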

@jdangerx marked this pull request as ready for review February 15, 2024 19:39
@zaneselvans (Member) left a comment:

Wow this is great! Huge progress toward a better validation setup, and yet not huge changes. I had some minor clarifying questions and requests for docstrings, but I think we should get this in as incremental progress and think about whether we want to move on to anything mentioned below:

Since we have so much new data coming into the DB right now, I think it would be helpful to create some more aspirational templates / examples of how we actually want the data validations to look going forward, so that as we add new cases, we can be doing it the Right Way. The factory wrapper for converting the old validation feels a little opaque for this purpose.

So it would be nice to have examples of both new df_checks and field_checks which are not tied to the old data validation implementation (though the underlying content could be drawn from those validations if we want)

Probably this would mean implementing at least one parametrized check class and migrating at least some of the existing data validations into that container completely, not just by reference to pudl.validate

Running the validations in the ETL will require more compute, but I think there's likely to be a big overall performance speedup for the validations, given that the way the validations are currently organized, the same dataframes are being read out of the DB and into memory repeatedly (since the tests are grouped by type of validation, not by which dataframe they apply to). This also results in memory usage blowing up if you try and run the data validations on a machine with lots of cores with pytest-xdist.

And then of course we'll also discover violations of our validation checks immediately in the ETL instead of in the nightly builds, which means we'll fix them sooner.

IIRC by default failing asset checks don't cause the ETL to fail, do they? How will we be notified when a check fails? Do we want to set them to ERROR out on check failure? I think I saw that was a new option as of the most recent dagster release.

Are there any checks that we can't encode this way? I guess any check (like referential integrity) that involves more than one table won't work.

src/pudl/etl/__init__.py (resolved review thread)
def asset_check_from_schema(
    asset_key: AssetKey,
    package: pudl.metadata.classes.Package,
) -> AssetChecksDefinition | None:
    """Create a dagster asset check based on the resource schema, if defined."""
Member:

Do we expect there to be cases in which a Resource does not have a Schema defined?

@jdangerx (Member, Author) replied:

No, I don't think so. But I do expect there to be assets that are built without corresponding Resources - e.g. raw_*.

Member:

Ah, okay. I think the docstring could be clearer then. "if defined" is not "if there's a schema defined for this resource" but "if there's a resource defined for this asset."

Comment on lines +148 to +180
warnings.filterwarnings("ignore", category=ExperimentalWarning)
_package = pudl.metadata.classes.Package.from_resource_ids()
_asset_keys = itertools.chain.from_iterable(
    _get_keys_from_assets(asset_def) for asset_def in default_assets
)
default_asset_checks = [
    check
    for check in (
        asset_check_from_schema(asset_key, _package) for asset_key in _asset_keys
    )
    if check is not None
]

Member:

How do the default asset checks relate to the default assets / how are they associated with each other? Are they just identified with each other based on the ordering of these lists that are passed into the definition of defs? Or is there some key within the assets and the checks that matches them up?

@jdangerx (Member, Author) replied:

They're associated by explicitly linking them in the asset param you pass to @asset_check.

src/pudl/metadata/classes.py (resolved review thread)
Comment on lines 776 to 777
df_checks: list[Callable] = []
field_checks: dict[SnakeCase, list[Callable]] = {}
Member:

Maybe this shows up below, but where do you envision these additional (beyond-the-schema) table/column level checks being defined?

src/pudl/metadata/resources/eia923.py (resolved review thread)

src/pudl/validate.py (resolved review thread)
src/pudl/workspace/resource_cache.py (resolved review thread)
test/unit/metadata_test.py (resolved review thread)
@zaneselvans added the "testing" label (Writing tests, creating test data, automating testing, etc.) on Feb 16, 2024
@zaneselvans added the "metadata" label (Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages.) on Feb 16, 2024
@jdangerx (Member, Author) commented Feb 16, 2024:

> Wow this is great! Huge progress toward a better validation setup, and yet not huge changes. I had some minor clarifying questions and requests for docstrings, but I think we should get this in as incremental progress and think about whether we want to move on to anything mentioned below:

Sweet! Yeah, I was a bit bummed that I let the scope creep up (from Do One Asset to Do A Bunch of Assets + Make Some Ergonomic Improvements), but I think the extra work was very high bang for buck.

> Since we have so much new data coming into the DB right now, I think it would be helpful to create some more aspirational templates / examples of how we actually want the data validations to look going forward, so that as we add new cases, we can be doing it the Right Way. The factory wrapper for converting the old validation feels a little opaque for this purpose.
>
> So it would be nice to have examples of both new df_checks and field_checks which are not tied to the old data validation implementation (though the underlying content could be drawn from those validations if we want)

Agreed - I think this would be a great follow-up! We should define those directly alongside the assets, I think, as mentioned above. Since @asset_check(asset=foo) will associate the function with the asset, we can define asset checks wherever we want and they'll get picked up. We'll just have to remember to run load_asset_checks_from_module() in etl/__init__.py. Might make sense to have those module-crawlers crawl everything automatically...

> Probably this would mean implementing at least one parametrized check class and migrating at least some of the existing data validations into that container completely, not just by reference to pudl.validate

Yeah, sounds right.

> Running the validations in the ETL will require more compute, but I think there's likely to be a big overall performance speedup for the validations, given that the way the validations are currently organized, the same dataframes are being read out of the DB and into memory repeatedly (since the tests are grouped by type of validation, not by which dataframe they apply to). This also results in memory usage blowing up if you try and run the data validations on a machine with lots of cores with pytest-xdist.
>
> And then of course we'll also discover violations of our validation checks immediately in the ETL instead of in the nightly builds, which means we'll fix them sooner.

Yes! Agreed on these benefits :)

> IIRC by default failing asset checks don't cause the ETL to fail, do they? How will we be notified when a check fails? Do we want to set them to ERROR out on check failure? I think I saw that was a new option as of the most recent dagster release.

I think they don't cause the ETL to stop running but they do cause the whole run to be marked as a "failure". I can make a bogus failing test and run pudl_etl to see if things failed.

> Are there any checks that we can't encode this way? I guess any check (like referential integrity) that involves more than one table won't work.

Asset checks have additional_ins which should let us encode checks across multiple tables.
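A sketch of what a cross-table (referential integrity) check could look like with additional_ins (assumed dagster API usage; the table and column names are illustrative):

```python
import pandas as pd
from dagster import AssetCheckResult, AssetIn, asset_check


@asset_check(
    asset="core_eia923__monthly_boiler_fuel",  # illustrative asset keys
    additional_ins={"boilers": AssetIn("core_eia__entity_boilers")},
)
def boiler_ids_exist(
    core_eia923__monthly_boiler_fuel: pd.DataFrame, boilers: pd.DataFrame
) -> AssetCheckResult:
    """Every boiler_id in the fuel table should appear in the boiler entity table."""
    missing = ~core_eia923__monthly_boiler_fuel["boiler_id"].isin(boilers["boiler_id"])
    return AssetCheckResult(
        passed=not bool(missing.any()),
        metadata={"n_missing": int(missing.sum())},
    )
```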

I think the upshot of all this is that we need the following before we merge:

  • check that a failing asset check does cause pudl_etl to exit with a non-zero status
  • fix docstrings

And we want to encode some validations in-line instead of in-schema as a follow-up PR. For that one, we could even rip out the "df_checks" and "field_checks" machinery in this PR, and say:

  • complicated checks live in @asset_checks that are defined alongside each asset
  • simple dtype/min-max checks are auto-generated based on the field descriptions.
  • add data types for our different frequent validation check parameters (e.g. BoundsCase)

If we agree on ^, should we then rip out the df_checks and field_checks machinery in this PR pre-emptively?

@jdangerx force-pushed the prototype-dagster-pandera branch 2 times, most recently from 391b637 to afa314b on February 20, 2024 20:18
@jdangerx (Member, Author) commented:

Made a "minimal checks" issue here: #3412 - @zaneselvans this should be ready for re-review!

@jdangerx added this pull request to the merge queue Feb 20, 2024
Merged via the queue into main with commit eba35e0 Feb 20, 2024
12 checks passed
@jdangerx deleted the prototype-dagster-pandera branch February 20, 2024 23:10
@e-belfer linked an issue Feb 22, 2024 that may be closed by this pull request
Labels
  • metadata: Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages.
  • testing: Writing tests, creating test data, automating testing, etc.
Development

Successfully merging this pull request may close these issues.

Enable Pandera dtype and validation checks
3 participants