Separate data validation from tests #941

Open
zaneselvans opened this issue Mar 5, 2021 · 0 comments
Labels
cli Scripts and other command line interfaces to PUDL. testing Writing tests, creating test data, automating testing, etc.

zaneselvans commented Mar 5, 2021

After talking through some data validation / testing issues with @rousik in the context of doing a data release, it seems worth replacing the data validation tests with a standalone script that is distributed with the package.

  • Would allow users and the build system to run the data validation without needing to clone the GitHub repo.
  • Would encapsulate all of the data validation test cases in the package itself, instead of having some of them in the validate.py module and some (like the expected number of rows) in the test/validate/*.py modules.
  • Most of the logic and validation test case descriptions are already contained in the main codebase, not the tests.
  • A pudl_validate script could have command line args that let the user choose a particular table to validate, since we have a lot of table-specific validations.
  • Could produce a validation report / data profile in JSON or a similar format for display and programmatic processing. Maybe look at the Great Expectations + Pandas Profiling integration for ideas.
  • If it can run against PostgreSQL as well as SQLite, the data validation could be parallelized in the Cloud ETL using Prefect.
  • I think it would also make sense to split the information specifying the data validations out into data files that are separate from the code and easily tracked on their own in revision control. That would make it easy to see where the validations are and what they all are, and to add or modify individual validations over time as we discover new things that should be true. My guess is that this will need to be a JSON-like structure rather than a CSV, and that there will be several different types of things we need to specify. More research required. Maybe look at Pandera and Hypothesis for inspiration.
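
To make the list above more concrete, here is a rough sketch of how the pieces might fit together: validations described as JSON-like data, a runner that applies them to a SQLite database, a JSON report, and a command line entry point with a table selector. Everything in it is an assumption made for illustration, not existing PUDL code: the `plants` table and its columns, the three check types (`row_count`, `not_null`, `bounds`), the `--table` flag, and the function names are all hypothetical, and it uses only the Python standard library rather than Pandera, Hypothesis, or Great Expectations.

```python
import argparse
import json
import sqlite3

# Hypothetical validation specs. In practice these might live in a JSON file
# tracked separately from the code; the table and column names are made up.
VALIDATION_SPECS = json.loads("""
{
  "plants": [
    {"type": "row_count", "min": 2, "max": 10},
    {"type": "not_null", "column": "plant_id"},
    {"type": "bounds", "column": "capacity_mw", "low": 0, "high": 5000}
  ]
}
""")


def run_check(conn, table, spec):
    """Run one validation spec against one table and return a JSON-ready result."""
    kind = spec["type"]
    if kind == "row_count":
        (n,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
        passed = spec["min"] <= n <= spec["max"]
        detail = {"rows": n}
    elif kind == "not_null":
        col = spec["column"]
        query = f'SELECT COUNT(*) FROM "{table}" WHERE "{col}" IS NULL'
        (n,) = conn.execute(query).fetchone()
        passed = n == 0
        detail = {"null_rows": n}
    elif kind == "bounds":
        col = spec["column"]
        # NULL values are not counted here; the not_null check covers them.
        query = f'SELECT COUNT(*) FROM "{table}" WHERE "{col}" < ? OR "{col}" > ?'
        (n,) = conn.execute(query, (spec["low"], spec["high"])).fetchone()
        passed = n == 0
        detail = {"out_of_bounds_rows": n}
    else:
        raise ValueError(f"Unknown validation type: {kind!r}")
    return {"table": table, "check": spec, "passed": passed, "detail": detail}


def validate(conn, specs, tables=None):
    """Run the specs for the selected tables (default: all) and build a report."""
    selected = tables or sorted(specs)
    results = [run_check(conn, t, s) for t in selected for s in specs[t]]
    return {"passed": all(r["passed"] for r in results), "results": results}


def main(argv=None):
    """Hypothetical entry point for a pudl_validate console script."""
    parser = argparse.ArgumentParser(prog="pudl_validate")
    parser.add_argument("db_path", help="Path to the SQLite database to validate.")
    parser.add_argument(
        "--table",
        action="append",
        choices=sorted(VALIDATION_SPECS),
        help="Validate only this table; may be given more than once.",
    )
    args = parser.parse_args(argv)
    conn = sqlite3.connect(args.db_path)
    try:
        report = validate(conn, VALIDATION_SPECS, args.table)
    finally:
        conn.close()
    print(json.dumps(report, indent=2))
    return 0 if report["passed"] else 1
```

If `main` were registered as a console script, an invocation might look like `pudl_validate some.sqlite --table plants`, with a nonzero exit code signalling at least one failed validation. Keeping the specs as plain data means a new check on an existing table is a one-line change to the data file, while a new check type needs a new branch in `run_check`. Running against PostgreSQL would need a different connection layer, which this sketch does not attempt.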
@zaneselvans zaneselvans added testing Writing tests, creating test data, automating testing, etc. cli Scripts and other command line interfaces to PUDL. labels Mar 5, 2021