Implement CHECK(type) constraints in SQLite ETL/schema #1197

Closed
5 tasks done
Tracked by #1212
zaneselvans opened this issue Sep 6, 2021 · 4 comments
Labels: data-types (Dtype conversions, standardization and implications of data types), sqlite (Issues related to interacting with sqlite databases)

Comments

zaneselvans commented Sep 6, 2021

When we were loading into datapackage outputs, we did a bunch of data type and constraint checking via goodtables-pandas-py. Now that we are outputting directly to a database, it would be nice to have the database do these kinds of checks rather than inserting another step and set of tools. However, SQLite is notoriously lax about this kind of thing (compared to, say, Postgres): it doesn't check types or foreign key constraints by default.
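As a concrete illustration of that laxness, here is a minimal sketch using only the standard-library sqlite3 module (the table and column names are made up): by default SQLite will happily store a string in an INTEGER column, while a CHECK built on typeof() rejects it.

import sqlite3

con = sqlite3.connect(":memory:")

# Without a constraint, SQLite stores a string in an INTEGER column without complaint.
con.execute("CREATE TABLE loose (plant_id INTEGER)")
con.execute("INSERT INTO loose VALUES ('not-a-number')")  # succeeds silently

# With a CHECK on typeof(), the same insert fails with an IntegrityError.
con.execute(
    "CREATE TABLE strict (plant_id INTEGER "
    "CHECK (plant_id IS NULL OR typeof(plant_id) = 'integer'))"
)
try:
    con.execute("INSERT INTO strict VALUES ('not-a-number')")
except sqlite3.IntegrityError as err:
    print(err)  # CHECK constraint failed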

Get type checks working...

  • Investigate the extent to which data type compatibility is something we can check within SQLite. (it is)
  • Append appropriate CHECK statements to the SQL generated by pudl.metadata.classes.Field.to_sql() depending on the Field data type.
  • Add a dtype dictionary to the df.to_sql() calls which load the database, so that the appropriate type conversions are used, e.g. datetime64[ns] to DATE.

Debug and Evaluate

  • Benchmark ETL with type checking turned on vs. off. With 2019 data in the tests, just for the load step, I saw:
    • no checks: 13s
    • check_types: 16s
    • check_types and check_values: 23s
  • Fix any true type incompatibilities which become apparent (None! 🎉 )
zaneselvans added the sqlite and data-types labels on Sep 6, 2021

ezwelty commented Sep 8, 2021

Hmm, I would discourage relying on SQLite for validation. Sure, we can construct check constraints from the metadata, and this could be nice to have as a failsafe. But these checks may not work the same way in different SQL databases, and the checks disappear if you drop the SQL output in the future. Maybe the biggest issue, for development, is that SQL databases fail on the first error, and getting them to scan for and report all validation errors requires near-impossible SQL-fu (and will be very slow if you succeed). I suspect in-Python validation is the way to go.

For the WGMS, I need a tool for multi-table parsing and validation that supports arbitrary checks at different levels – on a column (e.g. x < 1), on a table row-wise (e.g. df.x < df.y) or group-wise (e.g. df.groupby(['y']).apply(lambda g: g['x'].sum() > 0)), and across tables (including foreign key checks). frictionless cannot handle groups (but can handle joined tables if you construct a descriptor for the join result) and, as we've found, it is slow for large data. I am looking at extending pandera to meet my needs (https://pandera.readthedocs.io), or rolling my own... Either way, converting frictionless-style metadata to a pandas-based validation library is fairly straightforward.
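For concreteness, here is a rough sketch of how those three levels of checks might look in pandera (the columns x, y, and z and the bounds are hypothetical; df is an existing DataFrame):

import pandera as pa

schema = pa.DataFrameSchema(
    columns={
        "x": pa.Column(float, pa.Check.lt(1)),  # column-level: x < 1
        "y": pa.Column(str),
        "z": pa.Column(float),
    },
    checks=[
        # row-wise across columns: x must be less than z
        pa.Check(lambda df: df["x"] < df["z"], error="x < z"),
        # group-wise: the sum of x within each y group must be positive
        pa.Check(lambda df: df.groupby("y")["x"].transform("sum") > 0,
                 error="sum of x per y group > 0"),
    ],
)

validated = schema.validate(df, lazy=True)  # lazy=True collects all failures rather than stopping at the first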

zaneselvans changed the title from "Implement type consistency checking in direct to SQLite ETL" to "Implement CHECK(type) constraints in SQLite ETL/schema" on Sep 10, 2021

ezwelty commented Sep 15, 2021

I added all the possible SQL constraints to Field.to_sql(), for now written for SQLite only:

def to_sql(self, dialect: Literal["sqlite"] = "sqlite") -> sa.Column: # noqa: C901

You may find it useful to add arguments to control the types of checks which are included.
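As a rough illustration (not necessarily the exact constraint text that Field.to_sql() emits), a type check can be attached as a column-level SQLAlchemy CheckConstraint built on SQLite's typeof():

import sqlalchemy as sa

def integer_column(name: str) -> sa.Column:
    # Reject any non-NULL value that SQLite does not actually store as an integer.
    check = sa.CheckConstraint(f"{name} IS NULL OR typeof({name}) = 'integer'")
    return sa.Column(name, sa.Integer, check)

A check_types style flag on to_sql() could then simply skip building these constraints when checking is turned off.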

zaneselvans (Member Author) commented:

I've added some switches to the to_sql() methods to turn foreign key, type, and value constraint checking on or off, since we currently have many failures that we need to work through. But the database as it loads right now is pretty close to what we were getting before.

For loading from pandas into SQL, it looks like we can (and probably should) use the dtype argument to df.to_sql(), which lets you explicitly specify the SQLAlchemy type for each column.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html

Hopefully this will address the conversion of datetime64[ns] columns in pandas into the appropriate DATE or DATETIME types within SQLite, which was the first type checking issue I came across.
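A minimal sketch of what that could look like, assuming a SQLAlchemy engine and an illustrative table/column name (df is an existing DataFrame with a datetime64[ns] report_date column):

import sqlalchemy as sa

engine = sa.create_engine("sqlite:///pudl.sqlite")
df.to_sql(
    "plants_eia860",
    engine,
    if_exists="append",
    index=False,
    dtype={"report_date": sa.Date()},  # store as DATE rather than a generic timestamp
)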

zaneselvans (Member Author) commented:

@ezwelty We looked at pandera last year and it definitely looked like it was pointed in the right direction, but it didn't cover the kinds of data validations we wanted to do without getting pretty messy. For instance, sometimes we need to check weighted averages, which means looking at both a value column and a weighting column.
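For example, a weighted-average check roughly like the ones under test/validate has to look at two columns at once (the function, column names, and bounds here are just for illustration):

import pandas as pd

def weighted_average_in_bounds(
    df: pd.DataFrame, value_col: str, weight_col: str, lo: float, hi: float
) -> bool:
    # Weighted mean of the value column, using the weight column as weights.
    wavg = (df[value_col] * df[weight_col]).sum() / df[weight_col].sum()
    return lo <= wavg <= hi

# e.g. weighted_average_in_bounds(plants, "heat_rate_mmbtu_mwh", "capacity_mw", 6.0, 13.0)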

I'm sure the validations we have kludged together right now under test/validate could be much more elegant and faster, and we definitely want to compile a richer library of validations over time. So let us know what direction you take! Getting better dataframe validations integrated into data processing pipelines seems to be a pain point for a lot of people.

The SQL instant fail on a single issue is pretty annoying. Getting a readable report that tells you everything that's wrong, so you can go address a bunch of issues all at once, would be so much more helpful. Putting the straightforward things into CHECK constraints like you've done, so that at least they are always associated with the data, and using Python for the more complicated stuff on top of that, seems great. That's basically what we've cobbled together now in a messy way with the tox -e validate tests. We want to move the validation out of the testing framework so that it can be deployed easily as part of the Python package without all the additional test dependencies. See #941.

zaneselvans added a commit that referenced this issue Sep 17, 2021
* Added an explicit dtype dictionary in the SQLite load call to df.to_sql() that pulls
  the column types from the SQLAlchemy metadata object we've just used to create the
  database, so they should always line up (see the sketch below).
* Turned on check_types and check_values in the integration tests.
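A sketch of the idea in the first bullet, assuming a SQLAlchemy MetaData object (md) that already holds the table definitions used to create the database:

import sqlalchemy as sa

def sqlite_dtypes(md: sa.MetaData, table_name: str) -> dict:
    # Map each column name to its SQLAlchemy type so that df.to_sql(dtype=...)
    # and the database schema can never drift apart.
    return {col.name: col.type for col in md.tables[table_name].columns}

# df.to_sql(table_name, engine, if_exists="append", index=False,
#           dtype=sqlite_dtypes(md, table_name))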

Other minor stuff:
* Omitted deprecated datapackage modules from the test coverage reporting.
* Replaced instances of in-place list.sort() with sorted(list) in a few places, since
  it's idiomatically more similar to what we're used to with dataframes, and I've been
  bitten several times by in-place sorting returning None when I didn't expect it.
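A two-line illustration of that pitfall:

names = ["utility", "plant", "generator"]
print(names.sort())   # None -- sorts in place and returns nothing
print(sorted(names))  # ['generator', 'plant', 'utility']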

Closes: #1197