Handle bad records in "data contract" mode #135
Comments
May be related to #40. IMO, generating the contract should happen during development, not on a production system. so I'd suggest that a complete
The bad data table should probably also say the table and field name. What are the bad records?
I'd like to describe the behavior we implement here better (and also take some community feedback):
The interface
Contract settings should be stored in the schema. you can drop
We can also implement hooks, but not in this ticket.
Contract modes
Following community feedback, I'd add a mode where we do not allow variants to appear. My take is that we have three types of contracts, so we should accept three (slightly different) types of settings:
Contract vs schema inference
The current PR has a global setting that can be set to the current values:
Right now our PR will evolve the schema in the first run if only an empty schema is present, regardless of the schema mode. Based on what you wrote, @rudolfix, I would specify settings that can be set on the source or the resource level, as well as a global override, as follows:
For new tables and new columns that are encountered, we can each specify one of the above settings.
For new column variants that are encountered, we could use the same settings: evolve allows new variants, and the other settings work the same way, so variants are either ignored, fail the whole load, or the rows that would produce variants are discarded completely.
We could think about an option for automatic casting: say we have a column that is a string, it would convert an incoming integer for this column to a string. But this will not work in many other cases, say a column is a date time and some random string comes in...
We could also have a specific setting that controls the creation of new subtables, but I am not sure whether this is needed.
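To make the proposed behaviour for new columns easier to discuss, here is a minimal sketch in plain Python. It is not dlt's API and not what the PR implements: the mode names `ignore`, `fail` and `discard_row`, the function `apply_column_contract` and the exception `ContractViolation` are all hypothetical; only `evolve` is a name used in this thread.

```python
from typing import Any, Dict, Optional, Set

# Hypothetical mode names; only "evolve" appears in the discussion above.
MODES = {"evolve", "ignore", "fail", "discard_row"}


class ContractViolation(Exception):
    """Raised in "fail" mode when a row carries columns missing from the schema."""


def apply_column_contract(
    row: Dict[str, Any], known_columns: Set[str], mode: str
) -> Optional[Dict[str, Any]]:
    """Return the row to load, or None if the whole row is discarded.

    known_columns is mutated only in "evolve" mode.
    """
    if mode not in MODES:
        raise ValueError(f"unknown contract mode: {mode}")

    new_columns = [name for name in row if name not in known_columns]
    if not new_columns:
        return row  # row already conforms to the schema

    if mode == "evolve":
        # Accept the new columns and remember them in the schema.
        known_columns.update(new_columns)
        return row
    if mode == "ignore":
        # Keep the row but drop the values of unknown columns.
        return {k: v for k, v in row.items() if k in known_columns}
    if mode == "discard_row":
        # Drop the whole row.
        return None
    # mode == "fail": abort the load.
    raise ContractViolation(f"new columns not allowed: {new_columns}")
```

The same function shape could be reused for new tables and for variants, with the setting resolved from the resource, then the source, then the global override; that resolution order is an assumption, not something decided in this thread.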
We need to be able to enforce schemas. We should be able to set a pipeline to a "fixed schema" mode or "data contract" mode.
In this mode, the schema persisted at destination becomes the "valid" schema. If there is no schema yet, then a schema is created by parsing some data. This would enable a person wanting to track data to generate a schema from an event, and enforce it.
When attempting to load the records into the frozen schema, the pipeline should route bad records into a "bad records" table, with the info
Once the feature is implemented, please also create documentation or an example.
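As a starting point for such an example, here is a rough sketch of the flow described above: infer a schema from one event, freeze it, then split later records into loadable rows and "bad records" entries. This is plain Python rather than dlt code. The helpers `infer_schema` and `split_records` are hypothetical, and the fields of a bad-record entry (`table`, `field`, `value`, `reason`) are an assumption based on the suggestion in the comments that the table and field name be included.

```python
from typing import Any, Dict, List, Tuple

Schema = Dict[str, type]  # column name -> expected Python type


def infer_schema(record: Dict[str, Any]) -> Schema:
    """Derive a schema from a single sample event (the first-run case)."""
    return {name: type(value) for name, value in record.items()}


def split_records(
    table: str, records: List[Dict[str, Any]], schema: Schema
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Split records into rows that fit the frozen schema and bad-record entries."""
    good: List[Dict[str, Any]] = []
    bad: List[Dict[str, Any]] = []
    for record in records:
        problems = []
        for field, value in record.items():
            if field not in schema:
                problems.append((field, "unknown column"))
            elif not isinstance(value, schema[field]):
                problems.append((field, f"expected {schema[field].__name__}"))
        if not problems:
            good.append(record)
            continue
        # One bad-record entry per offending field (assumed layout).
        for field, reason in problems:
            bad.append(
                {"table": table, "field": field, "value": record[field], "reason": reason}
            )
    return good, bad


# Usage: freeze the schema from one event, then check later events against it.
schema = infer_schema({"id": 1, "name": "a"})
good, bad = split_records(
    "events",
    [{"id": 2, "name": "b"}, {"id": "x", "name": "c"}, {"id": 3, "extra": 1}],
    schema,
)
```

Here only the first record lands in `good`; the other two each produce one entry in `bad`. Missing columns are not flagged in this sketch, which is a simplification.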