Implement CHECK(type) constraints in SQLite ETL/schema #1197

Closed
5 tasks done
Tracked by #1212
zaneselvans opened this issue Sep 6, 2021 · 4 comments
Labels: data-types (Dtype conversions, standardization and implications of data types), sqlite (Issues related to interacting with sqlite databases)

Comments

zaneselvans commented Sep 6, 2021

When we were loading into datapackage outputs, we did a bunch of data type and constraint checking via goodtables-pandas-py. Now that we are outputting directly to a database, it would be nice to have the database do these kinds of checks rather than inserting another step and set of tools. However, SQLite is notoriously lax about this kind of thing (compared to, say, Postgres): it doesn't check types or foreign key constraints by default.
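As a concrete illustration of that laxness, here is a minimal sketch using only the standard-library sqlite3 module (the table and column names are made up): by default SQLite will happily store a string in an INTEGER column, while a CHECK built on typeof() rejects it.

import sqlite3

con = sqlite3.connect(":memory:")

# Without a constraint, SQLite stores a string in an INTEGER column without complaint.
con.execute("CREATE TABLE loose (plant_id INTEGER)")
con.execute("INSERT INTO loose VALUES ('not-a-number')")  # succeeds silently

# With a CHECK on typeof(), the same insert fails with an IntegrityError.
con.execute(
    "CREATE TABLE strict (plant_id INTEGER "
    "CHECK (plant_id IS NULL OR typeof(plant_id) = 'integer'))"
)
try:
    con.execute("INSERT INTO strict VALUES ('not-a-number')")
except sqlite3.IntegrityError as err:
    print(err)  # CHECK constraint failed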

Get type checks working...

  • Investigate the extent to which data type compatibility is something we can check within SQLite. (it is)
  • Append appropriate CHECK statements to the SQL generated by pudl.metadata.classes.Field.to_sql() depending on the Field data type.
  • Add a dtype dictionary to the df.to_sql() calls which load the database, so that the appropriate type conversions are used, e.g. datetime64[ns] to DATE.

Debug and Evaluate

  • Benchmark ETL with type checking turned on vs. off. With 2019 data in the tests, just for the load step, I saw:
    • no checks: 13s
    • check_types: 16s
    • check_types and check_values: 23s
  • Fix any true type incompatibilities which become apparent (None! 🎉 )
zaneselvans added the sqlite and data-types labels on Sep 6, 2021

ezwelty commented Sep 8, 2021

Hmm, I would discourage relying on SQLite for validation. Sure, we can construct check constraints from the metadata, and this could be nice to have as a failsafe. But these checks may not work the same way in different SQL databases, and the checks disappear if you drop the SQL output in the future. Maybe the biggest issue, for development, is that SQL databases fail on the first error, and getting them to scan for and report all validation errors requires near-impossible SQL-fu (and will be very slow if you succeed). I suspect in-Python validation is the way to go.

For the WGMS, I need a tool for multi-table parsing and validation that supports arbitrary checks at different levels – on a column (e.g. x < 1), on a table row-wise (e.g. df.x < df.y) or group-wise (e.g. df.groupby(['y']).apply(lambda g: g['x'].sum() > 0)), and across tables (including foreign key checks). frictionless cannot handle groups (but can handle joined tables if you construct a descriptor for the join result) and, as we've found, it is slow for large data. I am looking at extending pandera to meet my needs (https://pandera.readthedocs.io), or rolling my own... Either way, converting frictionless-style metadata to a pandas-based validation library is fairly straightforward.
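For concreteness, here is a rough sketch of how those three levels of checks might look in pandera (the columns x, y, and z and the bounds are hypothetical; df is an existing DataFrame):

import pandera as pa

schema = pa.DataFrameSchema(
    columns={
        "x": pa.Column(float, pa.Check.lt(1)),  # column-level: x < 1
        "y": pa.Column(str),
        "z": pa.Column(float),
    },
    checks=[
        # row-wise across columns: x must be less than z
        pa.Check(lambda df: df["x"] < df["z"], error="x < z"),
        # group-wise: the sum of x within each y group must be positive
        pa.Check(lambda df: df.groupby("y")["x"].transform("sum") > 0,
                 error="sum of x per y group > 0"),
    ],
)

validated = schema.validate(df, lazy=True)  # lazy=True collects all failures rather than stopping at the first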

zaneselvans changed the title from "Implement type consistency checking in direct to SQLite ETL" to "Implement CHECK(type) constraints in SQLite ETL/schema" on Sep 10, 2021

ezwelty commented Sep 15, 2021

I added all the possible SQL constraints to Field.to_sql(), for now written for SQLite only:

def to_sql(self, dialect: Literal["sqlite"] = "sqlite") -> sa.Column: # noqa: C901

You may find it useful to add arguments to control the types of checks which are included.
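As a rough illustration (not necessarily the exact constraint text that Field.to_sql() emits), a type check can be attached as a column-level SQLAlchemy CheckConstraint built on SQLite's typeof():

import sqlalchemy as sa

def integer_column(name: str) -> sa.Column:
    # Reject any non-NULL value that SQLite does not actually store as an integer.
    check = sa.CheckConstraint(f"{name} IS NULL OR typeof({name}) = 'integer'")
    return sa.Column(name, sa.Integer, check)

A check_types style flag on to_sql() could then simply skip building these constraints when checking is turned off.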

zaneselvans (Member Author) commented:

I've added some switches to the to_sql() methods to turn foreign key, type, and value constraint checking on or off, since we currently have many failures that we need to work through. But the database as it loads right now is pretty close to what we were getting before.

For loading from pandas into SQL, it looks like we can (and probably should) use the dtype argument to df.to_sql(), which lets you explicitly specify the SQLAlchemy type for each column.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html

Hopefully this will address the conversion of datetime64[ns] columns in pandas into the appropriate DATE or DATETIME types within SQLite, which was the first type checking issue I came across.
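A minimal sketch of what that could look like, assuming a SQLAlchemy engine and an illustrative table/column name (df is an existing DataFrame with a datetime64[ns] report_date column):

import sqlalchemy as sa

engine = sa.create_engine("sqlite:///pudl.sqlite")
df.to_sql(
    "plants_eia860",
    engine,
    if_exists="append",
    index=False,
    dtype={"report_date": sa.Date()},  # store as DATE rather than a generic timestamp
)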

zaneselvans (Member Author) commented:

@ezwelty We looked at pandera last year and it definitely looked like it was pointed in the right direction, but it didn't cover the kinds of data validations we wanted to do without getting pretty messy. For instance, sometimes we need to check weighted averages, which means looking at both a value column and a weighting column.
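For example, a weighted-average check roughly like the ones under test/validate has to look at two columns at once (the function, column names, and bounds here are just for illustration):

import pandas as pd

def weighted_average_in_bounds(
    df: pd.DataFrame, value_col: str, weight_col: str, lo: float, hi: float
) -> bool:
    # Weighted mean of the value column, using the weight column as weights.
    wavg = (df[value_col] * df[weight_col]).sum() / df[weight_col].sum()
    return lo <= wavg <= hi

# e.g. weighted_average_in_bounds(plants, "heat_rate_mmbtu_mwh", "capacity_mw", 6.0, 13.0)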

I'm sure the validations we have kludged together right now under test/validate could be much more elegant and faster, and we definitely want to compile a richer library of validations over time. So let us know what direction you take! Getting better dataframe validations integrated into data processing pipelines seems to be a pain point for a lot of people.

The SQL instant fail on a single issue is pretty annoying. Getting a readable report that tells you everything that's wrong, so you can go address a bunch of issues all at once, would be so much more helpful. Putting the straightforward things into CHECK constraints like you've done, so that at least they are always associated with the data, and using Python for the more complicated stuff on top of that, seems great. That's basically what we've cobbled together now in a messy way with the tox -e validate tests. We want to move the validation out of the testing framework so that it can be deployed easily as part of the Python package without all the additional test dependencies. See #941.

zaneselvans added a commit that referenced this issue Sep 17, 2021
* Added an explicit dtype dictionary in the SQLite load call to df.to_sql() that pulls
  the column types from the SQLAlchemy metadata object we've just used to create the
  database, so they should always line up (see the sketch below).
* Turned on check_types and check_values in the integration tests.
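A sketch of the idea in the first bullet, assuming a SQLAlchemy MetaData object (md) that already holds the table definitions used to create the database:

import sqlalchemy as sa

def sqlite_dtypes(md: sa.MetaData, table_name: str) -> dict:
    # Map each column name to its SQLAlchemy type so that df.to_sql(dtype=...)
    # and the database schema can never drift apart.
    return {col.name: col.type for col in md.tables[table_name].columns}

# df.to_sql(table_name, engine, if_exists="append", index=False,
#           dtype=sqlite_dtypes(md, table_name))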

Other minor stuff:
* Omitted deprecated datapackage modules from the test coverage reporting.
* Replaced instances of in-place list.sort() with sorted(list) in a few places, since
  it's idiomatically more similar to what we're used to with dataframes, and I've been
  bitten several times by in-place sorting returning None when I didn't expect it.
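A two-line illustration of that pitfall:

names = ["utility", "plant", "generator"]
print(names.sort())   # None -- sorts in place and returns nothing
print(sorted(names))  # ['generator', 'plant', 'utility']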

Closes: #1197