Implement CHECK(type) constraints in SQLite ETL/schema #1197
Comments
Hmm, I would discourage relying on SQLite for validation. Sure, we can construct CHECK constraints from the metadata, and this could be nice to have as a failsafe. But these checks may not work the same way in different SQL databases, and they disappear if you drop the SQL output in the future. Maybe the biggest issue, for development, is that SQL databases fail on the first error, and getting them to scan for and report all validation errors requires near-impossible SQL-fu (and will be very slow if you succeed). I suspect in-Python validation is the way to go. For the WGMS, I need a tool for multi-table parsing and validation that supports arbitrary checks at different levels – on a column (e.g. …
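The "report all errors, not just the first" point can be sketched in plain Python. This is a hypothetical illustration (the column names and checks are made up, not PUDL's actual validation code): unlike a SQL CHECK constraint, which aborts on the first bad row, a Python validator can scan every row and collect every failure.

```python
# Hypothetical sketch of in-Python validation that scans all rows and
# reports every failure, instead of aborting on the first bad value the
# way a SQL CHECK constraint does. Names and checks are illustrative.
def validate(rows, checks):
    """Return a list of all validation errors, not just the first one."""
    errors = []
    for i, row in enumerate(rows):
        for column, check in checks.items():
            if not check(row[column]):
                errors.append(f"row {i}: {column}={row[column]!r} failed")
    return errors

rows = [
    {"year": 2019, "mwh": 1.5},
    {"year": 2020, "mwh": -2.0},  # bad mwh
    {"year": 1800, "mwh": 3.0},   # bad year
]
errors = validate(rows, {
    "year": lambda y: 1900 <= y <= 2100,
    "mwh": lambda x: x >= 0,
})
print(len(errors))  # → 2 (both problems reported at once)
```

A database engine stops at the first violation; collecting everything in one pass like this is what makes the resulting error report actionable.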
I added all the possible SQL constraints to pudl/src/pudl/metadata/classes.py (line 403 in 3824005).
You may find it useful to add arguments to control the types of checks which are included.
I've added some switches to the … For loading from pandas into SQL, it looks like we can (and probably should) use https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html. Hopefully this will address the conversion of …
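For the loading step, a minimal sketch of passing an explicit dtype mapping to DataFrame.to_sql() so the declared SQL column types are what we want, rather than what pandas guesses. This assumes pandas with the stdlib sqlite3 driver (in "legacy" mode the dtype values are type-name strings); the table and column names here are made up:

```python
# Sketch: pass an explicit dtype mapping to DataFrame.to_sql() so the
# declared SQL types (e.g. DATE for a datetime64[ns] column) are set
# explicitly. Table/column names are hypothetical examples.
import sqlite3
import pandas as pd

df = pd.DataFrame({
    "report_date": pd.to_datetime(["2020-01-01", "2020-02-01"]),
    "capacity_mw": [100.0, 250.5],
})
conn = sqlite3.connect(":memory:")
df.to_sql(
    "plants",
    conn,
    index=False,
    # With a raw sqlite3 connection, dtype values are SQL type strings;
    # with a SQLAlchemy engine they would be SQLAlchemy type objects.
    dtype={"report_date": "DATE", "capacity_mw": "REAL"},
)
# SQLite records the declared types in its schema table:
ddl = conn.execute(
    "SELECT sql FROM sqlite_master WHERE name = 'plants'"
).fetchone()[0]
print(ddl)  # declared types include DATE and REAL
```

With a SQLAlchemy engine (as PUDL uses), the same idea applies but with `sqlalchemy.Date()` / `sqlalchemy.Float()` instead of strings.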
@ezwelty We looked at pandera last year and it definitely looked like it was pointed in the right direction, but it didn't cover the kinds of data validations that we wanted to do without being pretty messy. For instance, sometimes we need to check weighted averages, which need to look at a value column and a weighting column. I'm sure the validations we have kludged together right now under … The SQL instant-fail on a single issue is pretty annoying. Getting a readable report that tells you everything that's wrong, so you can go address a bunch of issues all at once, would be so much more helpful. Putting the straightforward things into …
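To illustrate the kind of multi-column check described above — one that a single-column constraint can't express — here is a hypothetical weighted-average validation in pandas. The column names and bounds are made up for the example:

```python
# Illustrative multi-column check: validating a weighted average requires
# looking at both a value column and a weighting column, which is awkward
# to express as a per-column constraint. Names and bounds are hypothetical.
import pandas as pd

def check_weighted_avg(df, value_col, weight_col, lo, hi):
    """True if the weighted average of value_col lies within [lo, hi]."""
    wavg = (df[value_col] * df[weight_col]).sum() / df[weight_col].sum()
    return lo <= wavg <= hi

df = pd.DataFrame({
    "heat_rate": [9.0, 11.0, 10.0],
    "net_gen_mwh": [100.0, 100.0, 200.0],
})
# Weighted average = (9*100 + 11*100 + 10*200) / 400 = 10.0
ok = check_weighted_avg(df, "heat_rate", "net_gen_mwh", 8.0, 12.0)
print(ok)  # → True
```

Checks like this operate on the whole table at once, which is why a dataframe-level validation layer fits better than per-column SQL constraints.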
* Added an explicit dtype dictionary in the SQLite load call to df.to_sql() that pulls the column types from the SQLAlchemy metadata object we've just used to create the database, so they should always line up.
* Turned on check_types and check_values in the integration tests.

Other minor stuff:

* Omitted deprecated datapackage modules from the test coverage reporting.
* Replaced instances of in-place list.sort() with sorted(list) in a few places, since it's idiomatically more similar to what we're used to with dataframes, and I've been bitten several times by in-place sorting returning None when I didn't expect it.

Closes: #1197
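The list.sort() gotcha mentioned in the last bullet is easy to demonstrate: the in-place method mutates the list and returns None, while sorted() returns a new list, matching the non-mutating style of most dataframe methods.

```python
# list.sort() sorts in place and returns None -- assigning its result
# is a classic bug. sorted() returns a new sorted list instead.
xs = [3, 1, 2]
result = xs.sort()
print(result)  # → None (the sort happened, but the return value is None)
print(xs)      # → [1, 2, 3]

ys = sorted([3, 1, 2])
print(ys)      # → [1, 2, 3]
```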
When we were loading into datapackage outputs, we did a bunch of data type and constraint checking via goodtables-pandas-py. Now that we are outputting directly to a database, it would be nice if we could have the database do these kinds of checks rather than inserting another step and set of tools. However, SQLite is notoriously lazy about this kind of thing (compared to, say, Postgres): it doesn't check types or foreign key constraints by default.
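Both lax defaults are easy to demonstrate with the stdlib sqlite3 module. In this sketch (table names are made up), a dangling foreign key and a string in a REAL column are both accepted until PRAGMA foreign_keys is explicitly turned on:

```python
# Demonstration of SQLite's lax defaults: foreign keys are not enforced
# unless PRAGMA foreign_keys is on, and declared column types do not
# prevent mismatched inserts. Table/column names are hypothetical.
import sqlite3

# isolation_level=None -> autocommit, so the PRAGMA below takes effect
# immediately (it is silently ignored inside an open transaction).
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
    CREATE TABLE plant (id INTEGER PRIMARY KEY);
    CREATE TABLE generator (
        plant_id INTEGER REFERENCES plant (id),
        capacity_mw REAL
    );
""")
# Both problems are accepted: no plant 999 exists, and a string
# lands happily in the REAL column.
conn.execute("INSERT INTO generator VALUES (999, 'not a number')")

conn.execute("PRAGMA foreign_keys = ON")
fk_error = None
try:
    conn.execute("INSERT INTO generator VALUES (888, 1.0)")
except sqlite3.IntegrityError as e:
    fk_error = e  # FOREIGN KEY constraint failed
print(fk_error)
```

Note that the PRAGMA is per-connection, so every connection that writes to the database has to remember to set it.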
Get type checks working:

* Add CHECK statements to the SQL generated by pudl.metadata.classes.Field.to_sql(), depending on the Field data type.
* Pass a dtype dictionary to the df.to_sql() calls which load the database, so that the appropriate type conversions are used (e.g. datetime64[ns] to DATE).
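The CHECK(type) constraints the issue title refers to can be sketched with SQLite's typeof() function. This is a hypothetical example (not the DDL that Field.to_sql() actually emits): since SQLite won't enforce declared column types on its own, a CHECK on typeof() rejects mistyped values at insert time.

```python
# Sketch of a CHECK(typeof(...)) constraint in SQLite: the declared
# INTEGER type alone would not reject a non-numeric string, but the
# CHECK does. Table/column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE plant (
        plant_id INTEGER CHECK (typeof(plant_id) = 'integer'),
        name TEXT CHECK (name IS NULL OR typeof(name) = 'text')
    )
""")
conn.execute("INSERT INTO plant VALUES (1, 'Comanche')")  # accepted

check_error = None
try:
    # 'oops' cannot be coerced to an integer, so it is stored as text
    # and the typeof() CHECK fails.
    conn.execute("INSERT INTO plant VALUES ('oops', 'Bad')")
except sqlite3.IntegrityError as e:
    check_error = e  # CHECK constraint failed
print(check_error)
```

One caveat with this approach: SQLite's type affinity still silently coerces numeric-looking strings (e.g. '1' becomes the integer 1 in an INTEGER column), so the CHECK only catches values that cannot be coerced at all.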
Debug and Evaluate

* check_types: 16s
* check_types and check_values: 23s