After talking through some data validation / testing issues with @rousik in the context of doing a data release, it seems like it might be desirable to replace the data validation tests with a standalone script which is distributed with the package.
Would allow users and the build system to run the data validation without needing to pull the GitHub repo.
Would encapsulate all of the data validation test cases in the package itself, instead of having some of them in the validate.py module and some (like the number of expected rows) in the test/validate/*.py modules.
Most of the logic and validation test case descriptions are already contained in the main codebase, not the tests.
A pudl_validate script could take command line arguments that let the user choose a particular table to validate, since we have a lot of table-specific validations.
If it can run against PostgreSQL as well as SQLite, the data validation could be parallelized in the Cloud ETL using Prefect.
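As a rough sketch of what the table-selection part of such a script might look like (SQLite only, using argparse): the table names, the `--table` flag, the `EXPECTED_MIN_ROWS` mapping, and all function names here are made up for illustration and are not existing PUDL code.

```python
import argparse
import sqlite3

# Hypothetical per-table expectations; real values would come from the package.
EXPECTED_MIN_ROWS = {
    "example_plants": 1,
    "example_generators": 1,
}


def build_parser():
    """Build the argument parser for a hypothetical pudl_validate script."""
    parser = argparse.ArgumentParser(
        prog="pudl_validate",
        description="Run data validations against a database.",
    )
    parser.add_argument("db_path", help="Path to the SQLite database file.")
    parser.add_argument(
        "--table",
        action="append",
        choices=sorted(EXPECTED_MIN_ROWS),
        help="Table to validate; may be repeated. Defaults to all tables.",
    )
    return parser


def validate(conn, tables):
    """Return {table_name: passed} for a minimum row count check."""
    results = {}
    for table in tables:
        # Table names are restricted to known keys, so quoting them is safe here.
        count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        results[table] = count >= EXPECTED_MIN_ROWS[table]
    return results


def main(argv=None):
    """Parse arguments, run the selected validations, return an exit code."""
    args = build_parser().parse_args(argv)
    tables = args.table or sorted(EXPECTED_MIN_ROWS)
    conn = sqlite3.connect(args.db_path)
    try:
        results = validate(conn, tables)
    finally:
        conn.close()
    for table, passed in results.items():
        print(f"{table}: {'ok' if passed else 'FAILED'}")
    return 0 if all(results.values()) else 1
```

Returning a nonzero exit code on failure would let the build system treat a failed validation like a failed test, and validating one table per invocation is what would make it easy to fan the work out across parallel tasks.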
I think it would also make sense to split all of the information specifying the data validations out into data files that are separate from the code and easily tracked independently in revision control, so it's easy to see where they are, what they all are, and to add or modify individual validations over time as we discover new things that should be true. My guess is this will need to be a JSON-like structure, not a CSV, and that there will be several different types of things we need to specify. More research required. Maybe look at Pandera and Hypothesis for inspiration.
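To make the idea concrete, here is one possible shape for a JSON-like validation spec and a small dispatcher that applies it. Everything in it is an assumption for illustration: the spec keys (`table`, `type`, `column`, `value`, `low`, `high`), the three check types, and the table and column names. It does not use Pandera or Hypothesis, and tables are represented as plain lists of dicts to keep the sketch dependency-free.

```python
import json

# Hypothetical spec; in practice this would live in its own tracked data file.
SPEC_JSON = """
[
  {"table": "example_plants", "type": "min_rows", "value": 2},
  {"table": "example_plants", "type": "not_null", "column": "plant_id"},
  {"table": "example_plants", "type": "bounds", "column": "capacity_mw",
   "low": 0, "high": 10000}
]
"""


def check_min_rows(rows, spec):
    """Pass if the table has at least spec['value'] rows."""
    return len(rows) >= spec["value"]


def check_not_null(rows, spec):
    """Pass if no row has a missing value in spec['column']."""
    return all(row.get(spec["column"]) is not None for row in rows)


def check_bounds(rows, spec):
    """Pass if every non-null value lies within [low, high]."""
    values = (row.get(spec["column"]) for row in rows)
    return all(
        spec["low"] <= value <= spec["high"]
        for value in values
        if value is not None
    )


# Each kind of validation maps to one function, so adding a new kind
# means adding one entry here plus spec records that refer to it.
CHECKS = {
    "min_rows": check_min_rows,
    "not_null": check_not_null,
    "bounds": check_bounds,
}


def run_validations(specs, tables):
    """Return the list of spec records whose check failed."""
    failures = []
    for spec in specs:
        rows = tables[spec["table"]]
        if not CHECKS[spec["type"]](rows, spec):
            failures.append(spec)
    return failures
```

With a layout like this, a change to an expected row count or an allowed range shows up in revision control as a one-line diff in the data file, separate from any change to the checking logic.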
zaneselvans added the testing (Writing tests, creating test data, automating testing, etc.) and cli (Scripts and other command line interfaces to PUDL.) labels on Mar 5, 2021.