CDK: Improve schema detection #26741

Merged
merged 6 commits into from
May 31, 2023

@flash1293 flash1293 commented May 30, 2023

What

Closes #25669

Improves the built-in schema inferrer to produce schemas that work better with the Airbyte platform by default:

  • Always assume a number is "number", not "integer", because the sample size is usually small
  • Ignore properties of type "null": that type is most likely incorrect and can cause errors later on when the schema gets refined. Such properties simply become unexpected fields, which the platform filters out
  • Do not emit types like this:
{
  "a": {
    "anyOf": [
      { "type": "null" },
      { "type": "object", "properties": { ... } }
    ]
  }
}

instead, do this:

{
  "a": {
    "type": ["null", "object"],
    "properties": { ... }
  }
}

as the platform is ignoring anyOf but can extract the nested object in the second case.

How

The number/integer case can be handled by adding another extra strategy.
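
To illustrate why always choosing "number" is the safer default, here is a minimal standalone sketch. It is not the genson strategy from this PR; `infer_numeric_type` and its `always_number` flag are hypothetical names used only for illustration. It contrasts the narrow rule with the widened rule on a small sample.

```python
from typing import Iterable, Union

Numeric = Union[int, float]


def infer_numeric_type(samples: Iterable[Numeric], always_number: bool = True) -> str:
    """Return a JSON Schema type name for numeric samples (hypothetical helper)."""
    if always_number:
        # Widened rule: never claim "integer", because a small sample
        # cannot rule out fractional values in later records.
        return "number"
    # Narrow rule: "integer" only if every observed sample is an int.
    return "integer" if all(type(s) is int for s in samples) else "number"


small_sample = [1, 2, 3]
narrow = infer_numeric_type(small_sample, always_number=False)
wide = infer_numeric_type(small_sample)
```

With the narrow rule, the three-record sample yields "integer", and a later record containing 2.5 would no longer match the schema. The widened rule accepts both.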

The null handling, however, is baked too deeply into genson to get the desired behavior through extra strategies: provided strategies cannot change the anyOf output, and the library always expects an output for a schema node, so an "ignore null" strategy cannot be implemented. To solve this, a post-processing function is introduced that traverses the built schema and rewrites the output as desired.
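
A post-processing pass of this kind could look roughly like the sketch below. It is an assumption-laden illustration, not the code merged in this PR: `clean_schema`, `_merge_nullable`, and `NULL_SCHEMA` are hypothetical names, and the sketch only handles the two-element anyOf shape shown in the description.

```python
from typing import Any

NULL_SCHEMA = {"type": "null"}  # hypothetical constant for the null-only node


def _merge_nullable(schema: dict) -> dict:
    """Collapse anyOf [null, X] into X with "null" added to its type list."""
    any_of = schema.get("anyOf")
    if not isinstance(any_of, list) or len(any_of) != 2:
        return schema
    others = [s for s in any_of if s != NULL_SCHEMA]
    if len(others) != 1 or "type" not in others[0]:
        return schema
    other = others[0]
    types = other["type"] if isinstance(other["type"], list) else [other["type"]]
    merged = {k: v for k, v in schema.items() if k != "anyOf"}
    merged.update(other)
    merged["type"] = ["null"] + [t for t in types if t != "null"]
    return merged


def clean_schema(schema: Any) -> Any:
    """Recursively rewrite an inferred schema without mutating the input."""
    if not isinstance(schema, dict):
        return schema
    result = dict(_merge_nullable(schema))
    if isinstance(result.get("properties"), dict):
        result["properties"] = {
            name: clean_schema(sub)
            for name, sub in result["properties"].items()
            # Drop properties whose only inferred type is "null".
            if not (isinstance(sub, dict) and sub.get("type") == "null")
        }
    if isinstance(result.get("items"), dict):
        result["items"] = clean_schema(result["items"])
    return result


inferred = {
    "type": "object",
    "properties": {
        "a": {"anyOf": [
            {"type": "null"},
            {"type": "object", "properties": {"b": {"type": "string"}, "c": {"type": "null"}}},
        ]},
        "d": {"type": "null"},
    },
}
cleaned = clean_schema(inferred)
```

Here `cleaned` keeps "a" as a nullable object with only the "b" property, while the null-only properties "c" and "d" are dropped. Running the rewrite after inference sidesteps the limitation that a strategy must always return some output for a node.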

🚨 User Impact 🚨

Schema detection in the connector builder will change for the better (existing connectors won't be affected)

@flash1293 flash1293 added the CDK Connector Development Kit label May 30, 2023
@flash1293 flash1293 marked this pull request as ready for review May 30, 2023 09:05
@flash1293 flash1293 requested a review from a team as a code owner May 30, 2023 09:05
@flash1293 flash1293 requested a review from girarda May 30, 2023 09:06
@flash1293 flash1293 enabled auto-merge (squash) May 31, 2023 11:03
@flash1293 flash1293 merged commit ec5aa7b into master May 31, 2023
16 checks passed
@flash1293 flash1293 deleted the flash1293/improve-schema-detection branch May 31, 2023 13:57
marcosmarxm pushed a commit to natalia-miinto/airbyte that referenced this pull request Jun 8, 2023
* improve schema detection

* improve schema detection

* review comment

* Automated Commit - Formatting Changes

---------

Co-authored-by: flash1293 <flash1293@users.noreply.github.com>
Labels
CDK Connector Development Kit
Development

Successfully merging this pull request may close these issues.

Connector builder: Improve schema detection
2 participants