Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 [CDK, Declarative Source]: fix bug when type is missing for anyOf in nested arrays #40667

Merged
merged 10 commits into from
Jul 8, 2024

Conversation

bazarnov
Copy link
Collaborator

@bazarnov bazarnov commented Jul 1, 2024

What

Resolving:

How

  • added the check for missing type when there is nested anyOf with no type expected further
  • covered the case with unit_test
  • refactored the _clean method to make it easier to follow:
    • split the responsibilities of the single _clean method into multiple smaller ones, so we can narrow down the errors when they will occur next time (much easier to know the root cause of the issue with schema)
    • made reusable code parts, such as Literals to be re-usable (_NULL_TYPE, _TYPE, etc)

User Impact

No impact is expected.

Can this PR be safely reverted and rolled back?

  • YES 💚
  • NO ❌

@bazarnov bazarnov self-assigned this Jul 1, 2024
@bazarnov bazarnov requested review from girarda and maxi297 July 1, 2024 21:10
@octavia-squidington-iii octavia-squidington-iii added the CDK Connector Development Kit label Jul 1, 2024
@bazarnov bazarnov requested review from lmossman and a team July 1, 2024 21:10
@bazarnov bazarnov marked this pull request as ready for review July 1, 2024 21:12
@bazarnov bazarnov requested a review from a team as a code owner July 1, 2024 21:12
if isinstance(node, dict):
if "anyOf" in node:
if len(node["anyOf"]) == 2 and {"type": _NULL_TYPE} in node["anyOf"]:
if len(node["anyOf"]) == 2 and self._null_type_in_any_of(node):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhat tangential question: why is this checking that the length of node["anyOf"] is 2? What if there are three or more values in the anyOf? It seems like we'd want to handle that case here as well, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lmossman do we allow for multiple types? My understanding is that len == 2 means there is a type + null

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the OC issue that prompted this PR, the input data is

{
    "attributes": [
        {
            "title": "Usual Clothing Size",
            "type": "multi-value",
            "value": [
                "XL"
            ]
        },
        {
            "title": "Where do you live?",
            "type": "location",
            "value": {
                "countryCode": "GB",
                "countryName": "United Kingdom"
            }
        }
    ]
}

which is resulting in this inferred schema for the attributes property:

{
    "anyOf": [
        {
            "type": "array",
            "items": {
                "type": "string"
            }
        },
        {
            "type": "object",
            "properties": {
                "countryCode": {
                    "type": "string"
                },
                "countryName": {
                    "type": "string"
                }
            }
        }
    ]
}

so that is an example where we have an anyOf with one of the values not being type: null. So I could imagine a case where that value field had yet another value type in the data (e.g. a number) that would result in the inferred anyOf containing 3 different values

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lmossman is that even supported by destinations? My feeling is the connector's output should be consistent - in this case, emitting only arrays of strings, and wrapping the singleton strings as necessary.

CC @edgao do destinations support fields with multiple types (eg Union[str, List[str]])?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tl;dr I think this is fine, with some caveats

DV2 destinations current behavior:

  • {oneOf: [...]} is translated to the destination's JSON type (jsonb/super/variant/etc). So this will work reasonably well, though the UX isn't the greatest
  • {anyOf: [...]} will be treated as an "unrecognized" type, which also just defaults to the destination's JSON type. So it'll behave the same as oneOf, just for different reasons
  • {type: [something_not_null, something_else_not_null, ...]} is handled using legacy logic ported from normalization (it picks the "widest" type amongst the options)
    • this logic is kind of dumb, e.g. type: [boolean, timestamp] becomes a timestamp
    • but even if it picks a dumb type, we'll just null out offending values and add an entry to airbyte_meta, so this doesn't strictly block syncs

in the distant future we'll probably (a) make all of those behave identically, and (b) pick the smallest type that can hold all of the unioned types. Not happening anytime soon though.

statically-typed file destinations (i.e. gcs/s3 in avro/parquet mode) already have handling for oneOf (... and we want to rewrite these destinations anyway).

references:

Comment on lines 114 to 116
# populate `type` for `anyOf` if it's not present to pass all other checks
elif len(node["anyOf"]) == 2 and not self._null_type_in_any_of(node):
node["type"] = [_NULL_TYPE]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels strange to me to add the null type to the anyOf field, and then remove it at the end of this method.

It seems like it would make more sense to do one of the following:

  1. Update the other checks to account for the anyOf case so that they allow the node to not have a type set, or
  2. Add the type: [object, null] to the anyOf field and leave it there in the output schema

I'm not sure which approach is preferred. @girarda @maxi297 do either of you have an opinion here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always adding type: [object, null] feels ok to me since we implicitly allow all columns to be nullable

Copy link

vercel bot commented Jul 2, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
airbyte-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jul 5, 2024 2:40pm

@bazarnov
Copy link
Collaborator Author

bazarnov commented Jul 5, 2024

@edgao @girarda @lmossman @maxi297
I think we may proceed with what we have now; we should be fine, as far as this message explains how the destination works now.

Do have any action items to do here?

@edgao
Copy link
Contributor

edgao commented Jul 5, 2024

any reason you want to use anyOf instead of oneOf? (I mildly prefer oneOf for stricter semantics, but otherwise no concerns)

@bazarnov
Copy link
Collaborator Author

bazarnov commented Jul 5, 2024

I don't think i have any preferences on this subject, it's more a matter of the fix, rather than logic change.

The AnyOf was before this issue have occurred, and it was working, should we hold this fix and refactor all of this logic in the way it fits the destination needs?

@girarda @edgao ?

@edgao
Copy link
Contributor

edgao commented Jul 5, 2024

your call, I don't know the LOE here, and I don't feel strongly 🤷 like my comment says - they'll both work, but oneOf is the case that we have explicit handling for. allOf is just falling into the backstop handler.

Copy link
Contributor

@girarda girarda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm ✅

good scout work @bazarnov ! Thanks for the refactor into smaller functions to make the code easier to read

@bazarnov

This comment was marked as outdated.

@bazarnov bazarnov merged commit 0c237d8 into master Jul 8, 2024
30 checks passed
@bazarnov bazarnov deleted the baz/cdk/fix-any-of-for-nested-arrays branch July 8, 2024 07:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CDK Connector Development Kit
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants