Separate data validation from tests #941

zaneselvans · 2021-03-05T15:30:56Z

After talking through some data validation / testing issues with @rousik in the context of doing a data release, it seems like it might be desirable to replace the data validation tests with a standalone script which is distributed with the package.

It would allow users and the build system to run the data validation without needing to pull the GitHub repo.
It would encapsulate all of the data validation test cases in the package itself, instead of having some of them in the validate.py module and some (like the expected number of rows) in the test/validate/*.py modules.
Most of the logic and validation test case descriptions are already contained in the main codebase, not the tests.
A pudl_validate script could have command line args that allow the user to choose a particular table to validate, since we have a lot of table-specific validations.
It could produce a validation report / data profile in JSON or a similar format for display and programmatic processing. Maybe look at the Great Expectations + Pandas Profiling integration for ideas.
If it can run against PostgreSQL as well as SQLite, the data validation could be parallelized in the Cloud ETL using Prefect.
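To make the idea concrete, here is a rough sketch of what a standalone validation script with a table selector and a JSON report might look like. Everything in it is an assumption for illustration: the `EXPECTED_MIN_ROWS` values, the table names, and the function names are hypothetical and not taken from the existing validate.py module. It only covers the SQLite case and a single row-count check.

```python
import argparse
import json
import sqlite3

# Hypothetical expectations for illustration only; real values would ship with the package.
EXPECTED_MIN_ROWS = {"plants": 1, "generators": 1}


def validate_table(conn, table, min_rows):
    """Run one row-count check and return a JSON-serializable result."""
    # Table names are restricted to the keys of EXPECTED_MIN_ROWS, so quoting is safe here.
    n_rows = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    return {
        "table": table,
        "check": "min_rows",
        "expected": min_rows,
        "observed": n_rows,
        "passed": n_rows >= min_rows,
    }


def run(conn, tables=None):
    """Validate the requested tables (or all known tables) and build a report."""
    tables = tables or sorted(EXPECTED_MIN_ROWS)
    results = [validate_table(conn, t, EXPECTED_MIN_ROWS[t]) for t in tables]
    return {"passed": all(r["passed"] for r in results), "results": results}


def parse_args(argv=None):
    """Parse command line args; --table may be repeated to pick several tables."""
    parser = argparse.ArgumentParser(description="Sketch of a standalone validation script.")
    parser.add_argument("db_path", help="Path to the SQLite database to validate.")
    parser.add_argument(
        "--table",
        action="append",
        choices=sorted(EXPECTED_MIN_ROWS),
        help="Table to validate; defaults to all known tables.",
    )
    return parser.parse_args(argv)


def main(argv=None):
    """Print the JSON report and return a shell-style exit code."""
    args = parse_args(argv)
    conn = sqlite3.connect(args.db_path)
    try:
        report = run(conn, args.table)
    finally:
        conn.close()
    print(json.dumps(report, indent=2))
    return 0 if report["passed"] else 1
```

If `main` were registered as a console script entry point, something like `pudl_validate pudl.sqlite --table plants` would validate a single table, and a non-zero exit code would let the build system fail on a bad data release.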
I think it would also make sense to split out all of the information specifying the data validations into data files which are separate from the code, and which are easily tracked independently in revision control -- so it's easy to see where they are, what they all are, and add or modify individual validations over time as we discover new things that should be true. My guess is this will need to be a JSON-like structure, not a CSV, and that there will be several different types of things which we need to specify. More research required. Maybe look at Pandera and Hypothesis for inspiration.
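As a starting point for that research, here is a minimal sketch of validations stored as JSON-like data and dispatched by type. The spec schema (`type`, `column`, `low`, `high`), the check names, and the example table are all hypothetical, and the rows are plain dictionaries rather than dataframes to keep the example dependency-free.

```python
import json

# Hypothetical spec; in practice it would live in its own JSON file under revision control.
SPEC_JSON = """
[
  {"table": "generators", "type": "min_rows", "value": 2},
  {"table": "generators", "type": "not_null", "column": "plant_id"},
  {"table": "generators", "type": "bounds", "column": "capacity_mw", "low": 0, "high": 10000}
]
"""


def check_min_rows(rows, spec):
    """Pass if the table has at least the expected number of rows."""
    return len(rows) >= spec["value"]


def check_not_null(rows, spec):
    """Pass if no row has a missing value in the given column."""
    return all(row.get(spec["column"]) is not None for row in rows)


def check_bounds(rows, spec):
    """Pass if every non-null value lies within [low, high]."""
    values = [row[spec["column"]] for row in rows if row.get(spec["column"]) is not None]
    return all(spec["low"] <= v <= spec["high"] for v in values)


# Each spec "type" maps to one function, so new kinds of validation are added in one place.
CHECKS = {
    "min_rows": check_min_rows,
    "not_null": check_not_null,
    "bounds": check_bounds,
}


def run_specs(tables, specs):
    """Apply every spec to its table and return the specs annotated with results."""
    results = []
    for spec in specs:
        passed = CHECKS[spec["type"]](tables[spec["table"]], spec)
        results.append({**spec, "passed": passed})
    return results
```

With this layout, adding or changing an individual validation is a one-line diff in the data file, and the several "types of things" we need to specify each correspond to one entry in `CHECKS`.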


zaneselvans added the testing and cli labels Mar 5, 2021

zaneselvans mentioned this issue Mar 5, 2021

Simplify the organization of our test suites #942

Closed

9 tasks

zaneselvans mentioned this issue Sep 17, 2021

Implement CHECK(type) constraints in SQLite ETL/schema #1197

Closed

5 tasks
