🐛 [CDK, Declarative Source]: fix bug when type is missing for anyOf in nested arrays #40667
  if isinstance(node, dict):
      if "anyOf" in node:
-         if len(node["anyOf"]) == 2 and {"type": _NULL_TYPE} in node["anyOf"]:
+         if len(node["anyOf"]) == 2 and self._null_type_in_any_of(node):
Somewhat tangential question: why is this checking that the length of node["anyOf"] is 2? What if there are three or more values in the anyOf? It seems like we'd want to handle that case here as well, right?
@lmossman do we allow for multiple types? My understanding is that len == 2 means there is a type + null
In the OC issue that prompted this PR, the input data is
{
  "attributes": [
    {
      "title": "Usual Clothing Size",
      "type": "multi-value",
      "value": [
        "XL"
      ]
    },
    {
      "title": "Where do you live?",
      "type": "location",
      "value": {
        "countryCode": "GB",
        "countryName": "United Kingdom"
      }
    }
  ]
}
which is resulting in this inferred schema for the attributes property:
{
  "anyOf": [
    {
      "type": "array",
      "items": {
        "type": "string"
      }
    },
    {
      "type": "object",
      "properties": {
        "countryCode": {
          "type": "string"
        },
        "countryName": {
          "type": "string"
        }
      }
    }
  ]
}
so that is an example where we have an anyOf with one of the values not being type: null. So I could imagine a case where that value field had yet another value type in the data (e.g. a number) that would result in the inferred anyOf containing 3 different values
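To make the concern concrete, here is a minimal Python sketch of the len == 2 guard applied to the inferred schema from this comment. The helper null_type_in_any_of is a hypothetical stand-in for the PR's _null_type_in_any_of method, whose body is not shown in this thread. Only the _NULL_TYPE name comes from the diff; its value and the function is_nullable_pair are assumptions made for illustration.

```python
_NULL_TYPE = "null"  # name taken from the diff; the value "null" is an assumption


def null_type_in_any_of(node: dict) -> bool:
    """Hypothetical stand-in: True if one anyOf member is exactly {"type": "null"}."""
    return {"type": _NULL_TYPE} in node.get("anyOf", [])


def is_nullable_pair(node: dict) -> bool:
    """The guard under discussion: exactly two anyOf members, one of them null."""
    return len(node.get("anyOf", [])) == 2 and null_type_in_any_of(node)


# Schema inferred from the data in this comment: two members, neither is null.
inferred = {
    "anyOf": [
        {"type": "array", "items": {"type": "string"}},
        {
            "type": "object",
            "properties": {
                "countryCode": {"type": "string"},
                "countryName": {"type": "string"},
            },
        },
    ]
}

# A hypothetical third value type (e.g. a number) would give three members.
three_way = {"anyOf": inferred["anyOf"] + [{"type": "number"}]}

# The usual "type + null" case that the guard is meant to catch.
nullable_string = {"anyOf": [{"type": "string"}, {"type": _NULL_TYPE}]}

print(is_nullable_pair(inferred))         # False
print(is_nullable_pair(three_way))        # False
print(is_nullable_pair(nullable_string))  # True
```

Under these assumptions, neither the two-member schema without null nor a three-member union passes the guard, so both would fall through to whatever handling comes next.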
tl;dr I think this is fine, with some caveats
DV2 destinations current behavior:
- {oneOf: [...]} is translated to the destination's JSON type (jsonb/super/variant/etc). So this will work reasonably well, though the UX isn't the greatest
- {anyOf: [...]} will be treated as an "unrecognized" type, which also just defaults to the destination's JSON type. So it'll behave the same as oneOf, just for different reasons
- {type: [something_not_null, something_else_not_null, ...]} is handled using legacy logic ported from normalization (it picks the "widest" type amongst the options)
  - this logic is kind of dumb, e.g. type: [boolean, timestamp] becomes a timestamp
  - but even if it picks a dumb type, we'll just null out offending values and add an entry to airbyte_meta, so this doesn't strictly block syncs
in the distant future we'll probably (a) make all of those behave identically, and (b) pick the smallest type that can hold all of the unioned types. Not happening anytime soon though.
statically-typed file destinations (i.e. gcs/s3 in avro/parquet mode) already have handling for oneOf (... and we want to rewrite these destinations anyway).
references:
- DV2 JsonSchema parser https://github.com/airbytehq/airbyte/blob/master/airbyte-cdk/java/airbyte-cdk/typing-deduping/src/main/kotlin/io/airbyte/integrations/base/destination/typing_deduping/AirbyteType.kt#L25
- file destinations JsonSchema parser (we want to kill this and switch all destinations to use the DV2 schema parser, just fyi) https://github.com/airbytehq/airbyte/blob/master/airbyte-cdk/java/airbyte-cdk/s3-destinations/src/main/kotlin/io/airbyte/cdk/integrations/destination/s3/avro/JsonToAvroSchemaConverter.kt
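As a rough illustration of the three DV2 cases listed above, the following Python sketch reports which path a schema fragment would take. It is not the Kotlin parser linked in the references: the function name describe_dv2_handling, the returned labels, and the order of the checks are all assumptions, and the "widest type" ordering itself is deliberately left out because it is not spelled out in this thread.

```python
def describe_dv2_handling(schema: dict) -> str:
    """Return a label for the path a schema fragment would take (illustrative only)."""
    if "oneOf" in schema:
        # Described above: translated to the destination's JSON type.
        return "destination JSON type (oneOf)"
    if "anyOf" in schema:
        # Described above: "unrecognized", which also defaults to the JSON type.
        return "destination JSON type (unrecognized)"
    declared = schema.get("type")
    if isinstance(declared, list):
        non_null = [t for t in declared if t != "null"]
        if len(non_null) > 1:
            # Described above: legacy logic that picks the "widest" type.
            return "legacy widest-type logic"
    # Anything else is outside the three cases discussed in this comment.
    return "single-type handling"


print(describe_dv2_handling({"oneOf": [{"type": "string"}, {"type": "number"}]}))
print(describe_dv2_handling({"anyOf": [{"type": "array"}, {"type": "object"}]}))
print(describe_dv2_handling({"type": ["boolean", "string"]}))
print(describe_dv2_handling({"type": ["string", "null"]}))
```

The point of the sketch is only that oneOf and anyOf end up in the same place for different reasons, while a multi-valued type list goes through separate logic.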
# populate `type` for `anyOf` if it's not present to pass all other checks
elif len(node["anyOf"]) == 2 and not self._null_type_in_any_of(node):
    node["type"] = [_NULL_TYPE]
It feels strange to me to add the null type to the anyOf field, and then remove it at the end of this method.
It seems like it would make more sense to do one of the following:
- Update the other checks to account for the anyOf case so that they allow the node to not have a type set, or
- Add the type: [object, null] to the anyOf field and leave it there in the output schema
I'm not sure which approach is preferred. @girarda @maxi297 do either of you have an opinion here?
Always adding type: [object, null] feels ok to me since we implicitly allow all columns to be nullable
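For reference, the second option (always adding type: [object, null] to an anyOf node and leaving it in the output) could look roughly like the Python sketch below. This is not the PR's implementation: the function add_type_to_any_of and the constants _ANY_OF and _OBJECT_TYPE are hypothetical, and only the names _NULL_TYPE and _TYPE appear in the PR description.

```python
from typing import Any

_TYPE = "type"           # name mentioned in the PR description; value assumed
_NULL_TYPE = "null"      # name mentioned in the PR description; value assumed
_ANY_OF = "anyOf"        # hypothetical constant for this sketch
_OBJECT_TYPE = "object"  # hypothetical constant for this sketch


def add_type_to_any_of(node: Any) -> Any:
    """Recursively set type: [object, null] on anyOf nodes lacking a type (in place)."""
    if isinstance(node, dict):
        if _ANY_OF in node and _TYPE not in node:
            node[_TYPE] = [_OBJECT_TYPE, _NULL_TYPE]
        for value in node.values():
            add_type_to_any_of(value)
    elif isinstance(node, list):
        for item in node:
            add_type_to_any_of(item)
    return node


# Nested-array shape similar to the case discussed in this PR.
schema = {
    "type": ["object", "null"],
    "properties": {
        "attributes": {
            "type": ["array", "null"],
            "items": {
                "type": ["object", "null"],
                "properties": {
                    "value": {
                        "anyOf": [
                            {"type": "array", "items": {"type": "string"}},
                            {"type": "object"},
                        ]
                    }
                },
            },
        }
    },
}

add_type_to_any_of(schema)
value_node = schema["properties"]["attributes"]["items"]["properties"]["value"]
print(value_node["type"])  # ['object', 'null']
```

With this approach nothing has to be removed again at the end of the method, at the cost of the output schema carrying both type and anyOf on the same node.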
@edgao @girarda @lmossman @maxi297 Do we have any action items here?
any reason you want to use anyOf instead of oneOf? (I mildly prefer oneOf for stricter semantics, but otherwise no concerns)
your call, I don't know the LOE here, and I don't feel strongly 🤷 like my comment says - they'll both work, but
lgtm ✅
good scout work @bazarnov! Thanks for the refactor into smaller functions to make the code easier to read
What
Resolving:
How
- type when there is nested anyOf with no type expected further
- unit_test
- _clean method to make it easier to follow:
  - _clean method into multiple smaller ones, so we can narrow down the errors when they occur next time (much easier to know the root cause of the issue with schema)
  - reusable code parts, such as Literals to be re-usable (_NULL_TYPE, _TYPE, etc)
User Impact
No impact is expected.
Can this PR be safely reverted and rolled back?