fix: remove references resolution from normalization #812

maxi297 · 2025-10-23T15:14:46Z

What

Addresses https://github.com/airbytehq/airbyte-internal-issues/issues/14928

Reasoning

Schema Loaders

Here is an analysis of all the schema loaders. Overall, it does not seem possible for $def references to be returned as part of discover or used as part of read.

JsonFileSchemaLoader

This loader will break with jsonref.JsonRefError: FileNotFoundError: [Errno 2] No such file or directory: '<reference value>' if a reference is not a file.
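
To illustrate the failure mode, here is a minimal sketch that assumes the loader treats every reference value as a file name inside a shared directory. `load_ref_as_file` is a hypothetical helper written for this example, not the CDK's or jsonref's actual code.

```python
import json
import os
import tempfile


def load_ref_as_file(ref: str, shared_dir: str) -> dict:
    """Hypothetical file-based loader: every $ref value is assumed to be a file name."""
    with open(os.path.join(shared_dir, ref)) as handle:
        return json.load(handle)


with tempfile.TemporaryDirectory() as shared:
    # A reference that points at a real file resolves fine.
    with open(os.path.join(shared, "address.json"), "w") as handle:
        json.dump({"type": "object"}, handle)
    file_ref = load_ref_as_file("address.json", shared)

    # An internal pointer is not a file, so opening it fails.
    try:
        load_ref_as_file("#/definitions/link_id", shared)
        internal_ref_error = None
    except FileNotFoundError as error:
        internal_ref_error = type(error).__name__
```

Under that assumption, an internal pointer such as `#/definitions/link_id` can never resolve, which matches the error reported above.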

DefaultSchemaLoader

A JsonFileSchemaLoader that falls back to an empty schema.

SchemaLoaderCachingDecorator

It's just a caching layer

CompositeSchemaLoader

It aggregates multiple schemas without any opinion on the content so this layer should be fine.

InlineSchemaLoader

YAML references are resolved when loading the manifest, so the CDK would try to resolve internal references inside the schema_loader. That resolution fails because it uses the root of the YAML as a base, not the root of the schema. For example:

version: 7.3.6

type: DeclarativeSource

check:
  type: CheckStream
  stream_names:
    - test_ref_in_schema

streams:
  - type: DeclarativeStream
    name: test_ref_in_schema
    retriever:
      type: SimpleRetriever
      requester:
        type: HttpRequester
        url: >-
          https://test.toto.com
        http_method: GET
      record_selector:
        type: RecordSelector
        extractor:
          type: DpathExtractor
          field_path:
            - data
    primary_key:
      - - id
    schema_loader:
      type: InlineSchemaLoader
      schema:
        type: object
        $schema: http://json-schema.org/schema#
        definitions:
          link_id:
            type:
              - string
              - "null"
        properties:
          link_id:
            $ref: "#/definitions/link_id"
          guide_id:
            type:
              - string
              - "null"
        additionalProperties: true

spec:
  type: Spec
  documentation_url: https://test.toto.com
  connection_specification:
    type: object
    $schema: http://json-schema.org/draft-07/schema#
    additionalProperties: true

... will yield airbyte_cdk.sources.declarative.parsers.custom_exceptions.UndefinedReferenceException: Undefined reference #/definitions/link_id from definitions/link_id
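
The root mismatch can be sketched in a few lines. This is only an illustration: `resolve_pointer` is a hypothetical helper, and the manifest dict is a trimmed-down version of the YAML above, not how the CDK's reference resolver is implemented.

```python
from typing import Any


def resolve_pointer(document: dict, ref: str) -> Any:
    """Resolve a local pointer such as '#/definitions/link_id' against `document`."""
    node: Any = document
    for key in ref.removeprefix("#/").split("/"):
        node = node[key]  # raises KeyError when the path does not exist
    return node


schema = {
    "type": "object",
    "definitions": {"link_id": {"type": ["string", "null"]}},
    "properties": {"link_id": {"$ref": "#/definitions/link_id"}},
}
manifest = {
    "version": "7.3.6",
    "streams": [
        {
            "name": "test_ref_in_schema",
            "schema_loader": {"type": "InlineSchemaLoader", "schema": schema},
        }
    ],
}

# Using the schema as the base, the pointer resolves.
from_schema_root = resolve_pointer(schema, "#/definitions/link_id")

# Using the manifest as the base, there is no top-level "definitions" key.
try:
    resolve_pointer(manifest, "#/definitions/link_id")
    from_manifest_root = "resolved"
except KeyError:
    from_manifest_root = "undefined reference"
```

The same pointer is valid relative to the schema but undefined relative to the manifest root, which is the situation the exception above describes.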

DynamicSchemaLoader

Given the keys declared here, the only thing it could reference is another property which is not very likely.

Not low code

It is still possible that some "non low-code" connectors use schema references. Searching for TransformConfig, I see the following sources:

source-appflyers
source-amazon-seller-partner
source-facebook-marketing (non low-code but relies on ResourceSchemaLoader, which is the underlying component of JsonFileSchemaLoader mentioned above)
source-google-ads
source-hubspot
source-instagram
source-mixpanel
source-recharge (non low-code but relies on ResourceSchemaLoader, which is the underlying component of JsonFileSchemaLoader mentioned above)
source-salesforce (non low-code but relies on ResourceSchemaLoader, which is the underlying component of JsonFileSchemaLoader mentioned above)
source-survey-cto (non low-code but relies on ResourceSchemaLoader, which is the underlying component of JsonFileSchemaLoader mentioned above)
source-twilio
source-youtube-analytics (non low-code but relies on ResourceSchemaLoader, which is the underlying component of JsonFileSchemaLoader mentioned above)

Any source not marked in the list above is low-code and, per the Schema Loaders section above, can't be using $ref.

More notes

We still use expand_refs as part of airbyte_cdk.sources.config.BaseConfig. I don't know how relevant this code is, as I've only seen it used in the tests. I assume it might be affected by the same performance issues, but that does not really matter to us right now.

Remaining questions

Does the platform actually support schema references? Currently, we assume it does not.
Could TypeTransform be used for other things than normalizing schema? I personally don't know of other situations...
Why did we add this $ref resolving on the first iteration of the TypeTransform? I don't see a good documented reason so I assume it was preemptive optimization.

Summary by CodeRabbit

Chores
- Simplified internal schema handling to streamline transformation logic; no changes to public APIs.
Tests
- Updated test fixtures to inline previously referenced schema definitions.
Notes
- No user-facing behavior changes expected.

github-actions · 2025-10-23T15:15:07Z

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@issue_14928/normalization_performance#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch issue_14928/normalization_performance

Helpful Resources

CDK API Reference

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

/autofix - Fixes most formatting and linting issues
/poetry-lock - Updates poetry.lock file
/test - Runs connector tests with the updated CDK
/poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
/poe <command> - Runs any poe command in the CDK environment



coderabbitai · 2025-10-23T15:18:16Z

Walkthrough

Removed JSON Schema $ref-expansion preprocessing from the transformation pipeline in transform.py; unit tests updated to inline previously referenced schema definitions and imports adjusted in test_transform.py.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Removed ref-resolution preprocessing `airbyte_cdk/sources/utils/transform.py` | Removed imports and logic for ref-resolution and pre-expansion (`deepcopy`, `TYPE_CHECKING` typing, `Registry`, `Resource`, `Resolver`, `Unresolvable`, `DRAFT7`, ref-resolution helpers). Eliminated `expand_refs()` calls and the two-phase schema/value ref-expansion in `TransformConfig`, `TypeTransformer` initialization, and `normalizator()`. Public APIs remained unchanged. |
| Inlined test schemas and minor import change `unit_tests/sources/utils/test_transform.py` | Replaced schema `definitions` and `$ref` usages with inline schema objects (e.g., array items now `{ "type": "string" }`, nested objects inlined), removed top-level `definitions`, and adjusted test data; import changes (added `time`) updated accordingly. |
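
For readers who want to picture the inlining applied to the test schemas, here is a minimal sketch. `inline_local_refs` is a hypothetical helper written for this example (the PR edited the fixtures by hand), and it only handles non-recursive `#/definitions/<name>` references.

```python
from copy import deepcopy
from typing import Any


def inline_local_refs(schema: dict) -> dict:
    """Return a copy of `schema` with '#/definitions/<name>' refs replaced inline.

    Toy helper: recursive definitions would loop forever and are not supported.
    """
    definitions = schema.get("definitions", {})

    def _inline(node: Any) -> Any:
        if isinstance(node, dict):
            ref = node.get("$ref")
            if isinstance(ref, str) and ref.startswith("#/definitions/"):
                name = ref.removeprefix("#/definitions/")
                return _inline(deepcopy(definitions[name]))
            return {key: _inline(value) for key, value in node.items()}
        if isinstance(node, list):
            return [_inline(item) for item in node]
        return node

    # Drop only the top-level definitions block; everything else is kept.
    body = {key: value for key, value in schema.items() if key != "definitions"}
    return _inline(body)


schema_with_refs = {
    "type": "object",
    "definitions": {"str_item": {"type": "string"}},
    "properties": {
        "tags": {"type": "array", "items": {"$ref": "#/definitions/str_item"}},
    },
}
inlined = inline_local_refs(schema_with_refs)
```

The result has the same shape as the updated fixtures: array items become `{"type": "string"}` and the top-level `definitions` block is gone, while the input schema is left untouched.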

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant TypeTransformer
    participant Validator
    rect #F0F8FF
        Note right of Caller: Old flow (with pre-expansion)
    end
    Caller->>TypeTransformer: provide schema + value
    TypeTransformer->>TypeTransformer: expand_refs(schema, value)
    TypeTransformer->>Validator: validate(expanded_schema, value)
    Validator-->>TypeTransformer: validation result
    TypeTransformer-->>Caller: normalized value

sequenceDiagram
    participant Caller
    participant TypeTransformer
    participant Validator
    rect #FFF7F0
        Note right of Caller: New flow (no pre-expansion)
    end
    Caller->>TypeTransformer: provide schema + value
    TypeTransformer->>Validator: validate(schema, value)
    Validator-->>TypeTransformer: validation result
    TypeTransformer-->>Caller: normalized value
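
As a rough illustration of the new flow, the sketch below normalizes a value directly against an inline schema with no expansion step. `normalize` is a toy stand-in written for this example, not the CDK's `TypeTransformer`, and its casting rules are simplified assumptions.

```python
from typing import Any


def normalize(schema: dict, value: Any) -> Any:
    """Toy normalizer: cast `value` using only the types declared inline in `schema`.

    It never follows "$ref"; a node without an inline "type" is returned unchanged.
    """
    types = schema.get("type", [])
    if isinstance(types, str):
        types = [types]
    if value is None:
        return None
    if "object" in types and isinstance(value, dict):
        properties = schema.get("properties", {})
        return {
            key: normalize(properties[key], item) if key in properties else item
            for key, item in value.items()
        }
    if "array" in types and isinstance(value, list):
        return [normalize(schema.get("items", {}), item) for item in value]
    if "string" in types and not isinstance(value, str):
        return str(value)
    if "integer" in types and isinstance(value, str):
        try:
            return int(value)
        except ValueError:
            return value  # leave values that cannot be cast untouched
    return value


inline_schema = {
    "type": "object",
    "properties": {
        "id": {"type": ["null", "string"]},
        "count": {"type": "integer"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
}
normalized = normalize(inline_schema, {"id": 12, "count": "3", "tags": [1, "a"], "extra": 1.5})
unresolved = normalize({"$ref": "#/definitions/x"}, 5)
```

In this toy version a schema node that only carries a `$ref` has no inline type, so the value passes through unchanged; whether the real validator behaves the same way is not something this sketch establishes.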

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Would you like me to run a checklist of potential edge cases that may break without pre-expansion (external/$id refs, recursive refs, resolver-dependent schemas), wdyt?

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The PR title "fix: remove references resolution from normalization" is specific and directly aligned with the main changes in the changeset. The code modifications in `transform.py` clearly demonstrate the removal of $ref resolution mechanisms, specifically eliminating deepcopy imports, removing the expand_refs invocations, and deleting the pre-validation/ref-resolution steps that previously expanded references in schemas. The supporting test changes in `test_transform.py` inline previously referenced definitions, confirming that the removal of references resolution is the primary objective. A teammate scanning the git history would immediately understand that this PR removes $ref resolution from the normalization step, making the title both clear and specific to the changeset's main intent. |


📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3e86b19 and 32ca1d4.

📒 Files selected for processing (1)

unit_tests/sources/utils/test_transform.py (1 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Check: source-shopify

🔇 Additional comments (2)

unit_tests/sources/utils/test_transform.py (2)

21-22: LGTM! Schema definitions correctly inlined.

The array and nested properties now contain inline type definitions instead of JSON Schema $ref references, which aligns perfectly with the PR objective of removing reference resolution from the normalization step. The test cases continue to work as expected with these inlined definitions.

12-28: Quick question about test coverage for the removed field.

I noticed the COMPLEX_SCHEMA previously had a "def" field (per the AI summary) that was removed along with the definitions block. I don't see any test cases that were exercising this field—was this intentional, or was the "def" field just never covered by tests? Just curious, wdyt?


coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

unit_tests/sources/utils/test_transform.py (2)

385-386: Consider refactoring large inline constants.

The HUBSPOT_CONTACTS_SCHEMA and HUBSPOT_CONTACTS_RECORD are extremely large inline dictionaries that make the test function hard to read. Would it make sense to move these to module-level constants or a separate fixture file? This would improve maintainability and readability. Wdyt?

388-393: Enhance test documentation and isolation.

A few suggestions for the performance test:

Add a docstring explaining what performance characteristic this test validates

The 3-second threshold is acknowledged as arbitrary - could you document why this value was chosen, or make it configurable via pytest markers/environment variables?

Since transform() mutates the record in-place, should each iteration use a fresh copy to ensure consistent test conditions?

Wdyt?
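
A minimal sketch of the fresh-copy suggestion, assuming a stand-in `transform` that mutates the record in place. Both `transform` and `time_transform` are hypothetical names for this example; the real test would call `TypeTransformer.transform` instead.

```python
import copy
import time


def transform(record: dict, schema: dict) -> None:
    """Hypothetical in-place transform: cast values to str where the schema says string."""
    for key, prop in schema.get("properties", {}).items():
        if key in record and prop.get("type") == "string" and record[key] is not None:
            record[key] = str(record[key])


def time_transform(schema: dict, record: dict, iterations: int) -> float:
    """Time `iterations` transforms, each on a fresh deep copy of `record`.

    Copying happens outside the timed region so it does not skew the measurement.
    """
    elapsed = 0.0
    for _ in range(iterations):
        fresh = copy.deepcopy(record)
        start = time.perf_counter()
        transform(fresh, schema)
        elapsed += time.perf_counter() - start
    return elapsed


example_schema = {"type": "object", "properties": {"id": {"type": "string"}}}
example_record = {"id": 123, "name": "unchanged"}
elapsed_seconds = time_transform(example_schema, example_record, iterations=100)
```

Because each iteration works on its own copy, every run starts from the same untransformed input and the original record is never mutated.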

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 20ae208 and f0a52bb.

📒 Files selected for processing (2)

airbyte_cdk/sources/utils/transform.py (1 hunks)
unit_tests/sources/utils/test_transform.py (3 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

unit_tests/sources/utils/test_transform.py (1)

airbyte_cdk/sources/utils/transform.py (3)

TypeTransformer (67-266)

TransformConfig (48-64)

transform (218-236)

🪛 GitHub Actions: Linters

unit_tests/sources/utils/test_transform.py

[error] 382-382: ruff format check failed. 1 file would be reformatted. Run 'ruff format' to fix formatting.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)

GitHub Check: Check: destination-motherduck
GitHub Check: Check: source-hardcoded-records
GitHub Check: Check: source-intercom
GitHub Check: Check: source-shopify
GitHub Check: Check: source-pokeapi
GitHub Check: Pytest (Fast)
GitHub Check: Pytest (All, Python 3.13, Ubuntu)
GitHub Check: Pytest (All, Python 3.11, Ubuntu)
GitHub Check: Pytest (All, Python 3.10, Ubuntu)
GitHub Check: SDM Docker Image Build
GitHub Check: Pytest (All, Python 3.12, Ubuntu)
GitHub Check: Manifest Server Docker Image Build

🔇 Additional comments (4)

airbyte_cdk/sources/utils/transform.py (2)

7-7: LGTM! Clean import removal aligns with $ref resolution removal.

The import changes properly reflect the removal of $ref resolution preprocessing from the transformation pipeline, as described in the PR objectives.

86-86: Verify whether "$ref" should remain in the validator keys list.

Since this PR removes $ref resolution preprocessing, should "$ref" still be included in the validator keys on line 86? If TypeTransformer no longer processes $ref schemas, this key might be unnecessary. Wdyt?

unit_tests/sources/utils/test_transform.py (2)

6-6: LGTM! Time import supports the new performance test.

22-23: LGTM! Schema inlining removes $ref dependencies.

The transition from referenced definitions to inline schemas aligns with the PR objective and makes the test schemas more explicit and self-contained.


github-actions · 2025-10-23T15:22:37Z

PyTest Results (Fast)

3 813 tests +11 | 3 801 ✅ +11 | 6m 32s ⏱️ -11s
1 suites ±0 | 12 💤 ±0 |
1 files ±0 | 0 ❌ ±0 |

Results for commit 32ca1d4. ± Comparison against base commit 20ae208.

This pull request removes 1 and adds 12 tests. Note that renamed tests count towards both.

unit_tests.sources.declarative.requesters.query_properties.test_query_properties ‑ test_get_request_property_chunks_static_list_with_chunking

unit_tests.sources.declarative.parsers.test_model_to_component_factory ‑ test_create_component_with_configured_catalog
unit_tests.sources.declarative.requesters.query_properties.property_selector.test_json_schema_property_selector ‑ test_select_properties[configured_stream_is_none]
unit_tests.sources.declarative.requesters.query_properties.property_selector.test_json_schema_property_selector ‑ test_select_properties[configured_stream_no_properties_key_in_json_schema]
unit_tests.sources.declarative.requesters.query_properties.property_selector.test_json_schema_property_selector ‑ test_select_properties[test_select_properties_with_transformations]
unit_tests.sources.declarative.requesters.query_properties.test_property_chunking ‑ test_get_request_property_chunks_empty_configured_properties
unit_tests.sources.declarative.requesters.query_properties.test_property_chunking ‑ test_get_request_property_chunks_none_configured_properties
unit_tests.sources.declarative.requesters.query_properties.test_query_properties ‑ test_get_request_no_property_chunking_always_include_properties
unit_tests.sources.declarative.requesters.query_properties.test_query_properties ‑ test_get_request_no_property_chunking_selected_properties_always_include_properties
unit_tests.sources.declarative.requesters.query_properties.test_query_properties ‑ test_get_request_property_chunks_static_list_with_chunking_property_selection
unit_tests.sources.declarative.requesters.query_properties.test_query_properties ‑ test_get_request_property_chunks_with_configured_catalog_dynamic_endpoint
…

♻️ This comment has been updated with latest results.

github-actions · 2025-10-23T15:27:45Z

PyTest Results (Full)

3 816 tests | 3 804 ✅ | 11m 14s ⏱️
1 suites | 12 💤 |
1 files | 0 ❌ |

Results for commit 32ca1d4.

♻️ This comment has been updated with latest results.

brianjlai

🚤

some formatting on the test i think but not blocking

aaronsteers

🚀

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f0a52bb and 3e86b19.

📒 Files selected for processing (1)

unit_tests/sources/utils/test_transform.py (2 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)

GitHub Check: Check: source-shopify
GitHub Check: Check: source-pokeapi
GitHub Check: Check: source-hardcoded-records
GitHub Check: Check: destination-motherduck
GitHub Check: Check: source-intercom
GitHub Check: Pytest (Fast)
GitHub Check: SDM Docker Image Build
GitHub Check: Manifest Server Docker Image Build
GitHub Check: Pytest (All, Python 3.12, Ubuntu)
GitHub Check: Pytest (All, Python 3.10, Ubuntu)
GitHub Check: Pytest (All, Python 3.11, Ubuntu)
GitHub Check: Pytest (All, Python 3.13, Ubuntu)

🔇 Additional comments (1)

unit_tests/sources/utils/test_transform.py (1)

22-23: LGTM - schema inlining looks good!

The schema definitions have been correctly updated to inline the previously referenced definitions. This aligns perfectly with the PR objective to remove $ref resolution from the normalization step. The test cases should continue to work as expected with these inline definitions.

unit_tests/sources/utils/test_transform.py

The comment is outdated as I've removed the test that was making formatting fail

remove references from normalization

f0a52bb

github-actions bot added the bug and security labels Oct 23, 2025

maxi297 requested review from aaronsteers, brianjlai and tolik0 October 23, 2025 15:15

maxi297 commented Oct 23, 2025


coderabbitai bot requested changes Oct 23, 2025


brianjlai approved these changes Oct 23, 2025


aaronsteers approved these changes Oct 24, 2025


remove flaky test

3e86b19

coderabbitai bot previously requested changes Oct 24, 2025


remove unused import

32ca1d4

coderabbitai bot approved these changes Oct 24, 2025


maxi297 merged commit 59e4163 into main Oct 24, 2025
29 of 30 checks passed

maxi297 deleted the issue_14928/normalization_performance branch October 24, 2025 15:31
