@maxi297 maxi297 commented Oct 23, 2025

What

Addresses https://github.com/airbytehq/airbyte-internal-issues/issues/14928

Reasoning

Schema Loaders

Here is an analysis of all the schema loaders. Overall, it does not seem possible for $def references to be returned by discover or used during read.

JsonFileSchemaLoader

This loader breaks with jsonref.JsonRefError: FileNotFoundError: [Errno 2] No such file or directory: '<reference value>' when a reference is not a file.

DefaultSchemaLoader

A JsonFileSchemaLoader that falls back to an empty schema.

SchemaLoaderCachingDecorator

It's just a caching layer

CompositeSchemaLoader

It aggregates multiple schemas without any opinion on the content so this layer should be fine.

InlineSchemaLoader

YAML references are resolved when the manifest is loaded, so the CDK would also try to resolve internal references inside the schema_loader. That resolution fails because it uses the root of the YAML manifest as the base, not the root of the schema. For example:

version: 7.3.6

type: DeclarativeSource

check:
  type: CheckStream
  stream_names:
    - test_ref_in_schema

streams:
  - type: DeclarativeStream
    name: test_ref_in_schema
    retriever:
      type: SimpleRetriever
      requester:
        type: HttpRequester
        url: >-
          https://test.toto.com
        http_method: GET
      record_selector:
        type: RecordSelector
        extractor:
          type: DpathExtractor
          field_path:
            - data
    primary_key:
      - - id
    schema_loader:
      type: InlineSchemaLoader
      schema:
        type: object
        $schema: http://json-schema.org/schema#
        definitions:
          link_id:
            type:
              - string
              - "null"
        properties:
          link_id:
            $ref: "#/definitions/link_id"
          guide_id:
            type:
              - string
              - "null"
        additionalProperties: true

spec:
  type: Spec
  documentation_url: https://test.toto.com
  connection_specification:
    type: object
    $schema: http://json-schema.org/draft-07/schema#
    additionalProperties: true

... will yield airbyte_cdk.sources.declarative.parsers.custom_exceptions.UndefinedReferenceException: Undefined reference #/definitions/link_id from definitions/link_id
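
To illustrate why the base matters, here is a minimal sketch of local reference resolution. The resolve_pointer helper and the trimmed-down manifest dict are hypothetical and written for this example only; this is not the CDK's resolver, just the same failure mode in miniature.

```python
from typing import Any, Mapping


def resolve_pointer(root: Mapping[str, Any], ref: str) -> Any:
    """Resolve a local reference such as '#/definitions/link_id' against `root`."""
    if not ref.startswith("#/"):
        raise ValueError(f"Only local references are handled in this sketch: {ref}")
    node: Any = root
    for key in ref[2:].split("/"):
        if not isinstance(node, Mapping) or key not in node:
            raise KeyError(f"Undefined reference {ref}")
        node = node[key]
    return node


schema = {
    "type": "object",
    "definitions": {"link_id": {"type": ["string", "null"]}},
    "properties": {"link_id": {"$ref": "#/definitions/link_id"}},
}

# Hypothetical, trimmed-down stand-in for the manifest: the schema is nested
# under a stream, so "definitions" is not a top-level key of the manifest.
manifest = {
    "type": "DeclarativeSource",
    "streams": [{"name": "test_ref_in_schema", "schema_loader": {"schema": schema}}],
}

ref = schema["properties"]["link_id"]["$ref"]

# Resolving against the schema root finds the definition.
resolved = resolve_pointer(schema, ref)

# Resolving against the manifest root fails because the path does not exist there.
try:
    resolve_pointer(manifest, ref)
    failed_at_manifest_root = False
except KeyError:
    failed_at_manifest_root = True
```

With the schema as the base the lookup succeeds; with the manifest as the base the same pointer has nothing to point at.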

DynamicSchemaLoader

Given the keys declared here, the only thing it could reference is another property which is not very likely.

Not low code

It is still possible that some "non low-code" connectors use schema references. Searching for TransformConfig, I see the following sources:

  • source-appsflyer
  • source-amazon-seller-partner
  • source-facebook-marketing (non low-code, but relies on ResourceSchemaLoader, which is the underlying component of JsonFileSchemaLoader mentioned above)
  • source-google-ads
  • source-hubspot
  • source-instagram
  • source-mixpanel
  • source-recharge (non low-code, but relies on ResourceSchemaLoader, which is the underlying component of JsonFileSchemaLoader mentioned above)
  • source-salesforce (non low-code, but relies on ResourceSchemaLoader, which is the underlying component of JsonFileSchemaLoader mentioned above)
  • source-survey-cto (non low-code, but relies on ResourceSchemaLoader, which is the underlying component of JsonFileSchemaLoader mentioned above)
  • source-twilio
  • source-youtube-analytics (non low-code, but relies on ResourceSchemaLoader, which is the underlying component of JsonFileSchemaLoader mentioned above)

Any source without a note in the list above is low-code and, per the Schema Loaders section above, cannot be using $ref.

More notes

We still use expand_refs as part of airbyte_cdk.sources.config.BaseConfig. I don't know how relevant this code is, since I've only seen it used in tests. It may be affected by the same performance issues, but that does not really matter to us right now.

Remaining questions

  • Does the platform actually support schema references? Currently, we assume it does not.
  • Could TypeTransform be used for other things than normalizing schema? I personally don't know of other situations...
  • Why did we add this $ref resolving on the first iteration of the TypeTransform? I don't see a good documented reason so I assume it was preemptive optimization.

Summary by CodeRabbit

  • Chores

    • Simplified internal schema handling to streamline transformation logic; no changes to public APIs.
  • Tests

    • Updated test fixtures to inline previously referenced schema definitions.
  • Notes

    • No user-facing behavior changes expected.

@github-actions github-actions bot added the bug (Something isn't working) and security labels Oct 23, 2025
@github-actions

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@issue_14928/normalization_performance#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch issue_14928/normalization_performance

Helpful Resources

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment

coderabbitai bot commented Oct 23, 2025

📝 Walkthrough

Walkthrough

Removed JSON Schema $ref-expansion preprocessing from the transformation pipeline in transform.py; unit tests updated to inline previously referenced schema definitions and imports adjusted in test_transform.py.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Removed ref-resolution preprocessing: airbyte_cdk/sources/utils/transform.py | Removed imports and logic for ref-resolution and pre-expansion (deepcopy, TYPE_CHECKING typing, Registry, Resource, Resolver, Unresolvable, DRAFT7, ref-resolution helpers). Eliminated expand_refs() calls and the two-phase schema/value ref-expansion in TransformConfig, TypeTransformer initialization, and normalizator(). Public APIs remained unchanged. |
| Inlined test schemas and minor import change: unit_tests/sources/utils/test_transform.py | Replaced schema definitions and $ref usages with inline schema objects (e.g., array items now { "type": "string" }, nested objects inlined), removed top-level definitions, and adjusted test data; import changes (added time) updated accordingly. |
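
As an illustration of what inlining means here, the sketch below replaces local #/definitions/... references with the definitions they point to and drops the top-level definitions block. The inline_local_refs helper and the sample schema are hypothetical and not part of the CDK or of this PR; the sketch assumes local, non-recursive references only.

```python
from copy import deepcopy
from typing import Any, Dict

_PREFIX = "#/definitions/"


def inline_local_refs(schema: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `schema` with local '#/definitions/<name>' refs inlined.

    Assumptions of this sketch: references are local and non-recursive.
    A recursive definition would make this function recurse forever.
    """
    definitions = schema.get("definitions", {})

    def _inline(node: Any) -> Any:
        if isinstance(node, dict):
            ref = node.get("$ref")
            if isinstance(ref, str) and ref.startswith(_PREFIX):
                # Recurse so that definitions referencing other definitions are inlined too.
                return _inline(deepcopy(definitions[ref[len(_PREFIX):]]))
            return {key: _inline(value) for key, value in node.items()}
        if isinstance(node, list):
            return [_inline(item) for item in node]
        return node

    # Only the top-level "definitions" block is dropped; a property that happens
    # to be named "definitions" deeper in the schema is left alone.
    body = {key: value for key, value in schema.items() if key != "definitions"}
    return _inline(body)


schema_with_refs = {
    "type": "object",
    "definitions": {"my_type": {"type": "string"}},
    "properties": {
        "value": {"type": "array", "items": {"$ref": "#/definitions/my_type"}},
    },
}

inlined = inline_local_refs(schema_with_refs)
```

The input schema is left untouched, and the array items in the result carry the inline { "type": "string" } definition instead of a reference.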

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant TypeTransformer
    participant Validator
    rect #F0F8FF
        Note right of Caller: Old flow (with pre-expansion)
    end
    Caller->>TypeTransformer: provide schema + value
    TypeTransformer->>TypeTransformer: expand_refs(schema, value)
    TypeTransformer->>Validator: validate(expanded_schema, value)
    Validator-->>TypeTransformer: validation result
    TypeTransformer-->>Caller: normalized value
sequenceDiagram
    participant Caller
    participant TypeTransformer
    participant Validator
    rect #FFF7F0
        Note right of Caller: New flow (no pre-expansion)
    end
    Caller->>TypeTransformer: provide schema + value
    TypeTransformer->>Validator: validate(schema, value)
    Validator-->>TypeTransformer: validation result
    TypeTransformer-->>Caller: normalized value
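
The new flow can be pictured with a deliberately simplified, standard-library-only sketch. The normalize function below is a hypothetical stand-in and not the CDK's TypeTransformer; it only shows a record being cast against the schema exactly as given, with no expansion step beforehand.

```python
from typing import Any, Dict, Mapping


def normalize(record: Dict[str, Any], schema: Mapping[str, Any]) -> Dict[str, Any]:
    """Cast values in `record` in place to the types declared in `schema`.

    Hypothetical simplification: only "string", "integer" and nested "object"
    types are handled, and the schema is used as-is (no $ref expansion).
    """
    for name, subschema in schema.get("properties", {}).items():
        value = record.get(name)
        if value is None:
            continue
        declared = subschema.get("type", [])
        types = [declared] if isinstance(declared, str) else list(declared)
        # Use the first non-null declared type as the cast target.
        target = next((t for t in types if t != "null"), None)
        if target == "object" and isinstance(value, dict):
            normalize(value, subschema)
        elif target == "string" and not isinstance(value, str):
            record[name] = str(value)
        elif target == "integer" and isinstance(value, str):
            try:
                record[name] = int(value)
            except ValueError:
                pass  # leave values that cannot be cast untouched
    return record


schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": ["null", "string"]},
        "nested": {
            "type": "object",
            "properties": {"count": {"type": ["null", "integer"]}},
        },
    },
}

record = {"id": "12", "name": 34, "nested": {"count": "5"}, "extra": 1.5}
normalized = normalize(record, schema)
```

Fields that are absent from the schema, such as extra here, pass through unchanged.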

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Would you like me to run a checklist of potential edge cases that may break without pre-expansion (external/$id refs, recursive refs, resolver-dependent schemas), wdyt?

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The PR title "fix: remove references resolution from normalization" is specific and directly aligned with the main changes in the changeset. The code modifications in transform.py clearly demonstrate the removal of $ref resolution mechanisms, specifically eliminating deepcopy imports, removing the expand_refs invocations, and deleting the pre-validation/ref-resolution steps that previously expanded references in schemas. The supporting test changes in test_transform.py inline previously referenced definitions, confirming that the removal of references resolution is the primary objective. A teammate scanning the git history would immediately understand that this PR removes $ref resolution from the normalization step, making the title both clear and specific to the changeset's main intent. |

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3e86b19 and 32ca1d4.

📒 Files selected for processing (1)
  • unit_tests/sources/utils/test_transform.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Check: source-shopify
🔇 Additional comments (2)
unit_tests/sources/utils/test_transform.py (2)

21-22: LGTM! Schema definitions correctly inlined.

The array and nested properties now contain inline type definitions instead of JSON Schema $ref references, which aligns perfectly with the PR objective of removing reference resolution from the normalization step. The test cases continue to work as expected with these inlined definitions.


12-28: Quick question about test coverage for the removed field.

I noticed the COMPLEX_SCHEMA previously had a "def" field (per the AI summary) that was removed along with the definitions block. I don't see any test cases that were exercising this field—was this intentional, or was the "def" field just never covered by tests? Just curious, wdyt?


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
unit_tests/sources/utils/test_transform.py (2)

385-386: Consider refactoring large inline constants.

The HUBSPOT_CONTACTS_SCHEMA and HUBSPOT_CONTACTS_RECORD are extremely large inline dictionaries that make the test function hard to read. Would it make sense to move these to module-level constants or a separate fixture file? This would improve maintainability and readability. Wdyt?


388-393: Enhance test documentation and isolation.

A few suggestions for the performance test:

  1. Add a docstring explaining what performance characteristic this test validates
  2. The 3-second threshold is acknowledged as arbitrary - could you document why this value was chosen, or make it configurable via pytest markers/environment variables?
  3. Since transform() mutates the record in-place, should each iteration use a fresh copy to ensure consistent test conditions?

Wdyt?
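
The three suggestions above could be combined roughly as follows. This is a sketch only: fake_transform is a hypothetical stand-in for the real transform call, and the TRANSFORM_PERF_THRESHOLD_SECONDS environment variable is an assumed name, not an existing setting.

```python
import copy
import os
import time
from typing import Any, Callable, Dict, Mapping

# Hypothetical override so the threshold can be tuned per environment (default: 3 seconds).
THRESHOLD_SECONDS = float(os.environ.get("TRANSFORM_PERF_THRESHOLD_SECONDS", "3"))


def measure_transform_seconds(
    transform: Callable[[Dict[str, Any], Mapping[str, Any]], None],
    record: Dict[str, Any],
    schema: Mapping[str, Any],
    iterations: int,
) -> float:
    """Return the total time spent inside `transform`, excluding the copy overhead."""
    elapsed = 0.0
    for _ in range(iterations):
        # transform mutates the record in place, so each iteration gets a fresh copy.
        fresh = copy.deepcopy(record)
        start = time.perf_counter()
        transform(fresh, schema)
        elapsed += time.perf_counter() - start
    return elapsed


def fake_transform(record: Dict[str, Any], schema: Mapping[str, Any]) -> None:
    """Hypothetical stand-in for the real transform: casts "id" to a string in place."""
    record["id"] = str(record["id"])


def test_transform_performance() -> None:
    """Guard against large slowdowns when normalizing many records with one schema."""
    record = {"id": 1}
    elapsed = measure_transform_seconds(fake_transform, record, {}, iterations=1000)
    assert record == {"id": 1}  # the original record was never mutated
    assert elapsed < THRESHOLD_SECONDS
```

Timing only the transform call keeps the deepcopy cost out of the measurement, so the threshold reflects the code under test rather than the fixture setup.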

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 20ae208 and f0a52bb.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/utils/transform.py (1 hunks)
  • unit_tests/sources/utils/test_transform.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
unit_tests/sources/utils/test_transform.py (1)
airbyte_cdk/sources/utils/transform.py (3)
  • TypeTransformer (67-266)
  • TransformConfig (48-64)
  • transform (218-236)
🪛 GitHub Actions: Linters
unit_tests/sources/utils/test_transform.py

[error] 382-382: ruff format check failed. 1 file would be reformatted. Run 'ruff format' to fix formatting.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: Check: destination-motherduck
  • GitHub Check: Check: source-hardcoded-records
  • GitHub Check: Check: source-intercom
  • GitHub Check: Check: source-shopify
  • GitHub Check: Check: source-pokeapi
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.13, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: SDM Docker Image Build
  • GitHub Check: Pytest (All, Python 3.12, Ubuntu)
  • GitHub Check: Manifest Server Docker Image Build
🔇 Additional comments (4)
airbyte_cdk/sources/utils/transform.py (2)

7-7: LGTM! Clean import removal aligns with $ref resolution removal.

The import changes properly reflect the removal of $ref resolution preprocessing from the transformation pipeline, as described in the PR objectives.


86-86: Verify whether "$ref" should remain in the validator keys list.

Since this PR removes $ref resolution preprocessing, should "$ref" still be included in the validator keys on line 86? If TypeTransformer no longer processes $ref schemas, this key might be unnecessary. Wdyt?

unit_tests/sources/utils/test_transform.py (2)

6-6: LGTM! Time import supports the new performance test.


22-23: LGTM! Schema inlining removes $ref dependencies.

The transition from referenced definitions to inline schemas aligns with the PR objective and makes the test schemas more explicit and self-contained.

github-actions bot commented Oct 23, 2025

PyTest Results (Fast)

3 813 tests  +11   3 801 ✅ +11   6m 32s ⏱️ -11s
    1 suites ± 0      12 💤 ± 0 
    1 files   ± 0       0 ❌ ± 0 

Results for commit 32ca1d4. ± Comparison against base commit 20ae208.

This pull request removes 1 and adds 12 tests. Note that renamed tests count towards both.
unit_tests.sources.declarative.requesters.query_properties.test_query_properties ‑ test_get_request_property_chunks_static_list_with_chunking
unit_tests.sources.declarative.parsers.test_model_to_component_factory ‑ test_create_component_with_configured_catalog
unit_tests.sources.declarative.requesters.query_properties.property_selector.test_json_schema_property_selector ‑ test_select_properties[configured_stream_is_none]
unit_tests.sources.declarative.requesters.query_properties.property_selector.test_json_schema_property_selector ‑ test_select_properties[configured_stream_no_properties_key_in_json_schema]
unit_tests.sources.declarative.requesters.query_properties.property_selector.test_json_schema_property_selector ‑ test_select_properties[test_select_properties_with_transformations]
unit_tests.sources.declarative.requesters.query_properties.test_property_chunking ‑ test_get_request_property_chunks_empty_configured_properties
unit_tests.sources.declarative.requesters.query_properties.test_property_chunking ‑ test_get_request_property_chunks_none_configured_properties
unit_tests.sources.declarative.requesters.query_properties.test_query_properties ‑ test_get_request_no_property_chunking_always_include_properties
unit_tests.sources.declarative.requesters.query_properties.test_query_properties ‑ test_get_request_no_property_chunking_selected_properties_always_include_properties
unit_tests.sources.declarative.requesters.query_properties.test_query_properties ‑ test_get_request_property_chunks_static_list_with_chunking_property_selection
unit_tests.sources.declarative.requesters.query_properties.test_query_properties ‑ test_get_request_property_chunks_with_configured_catalog_dynamic_endpoint
…

♻️ This comment has been updated with latest results.

github-actions bot commented Oct 23, 2025

PyTest Results (Full)

3 816 tests   3 804 ✅  11m 14s ⏱️
    1 suites     12 💤
    1 files        0 ❌

Results for commit 32ca1d4.

♻️ This comment has been updated with latest results.

@brianjlai brianjlai left a comment

🚤

some formatting on the test i think but not blocking

@aaronsteers aaronsteers left a comment

🚀

coderabbitai bot previously requested changes Oct 24, 2025
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f0a52bb and 3e86b19.

📒 Files selected for processing (1)
  • unit_tests/sources/utils/test_transform.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: Check: source-shopify
  • GitHub Check: Check: source-pokeapi
  • GitHub Check: Check: source-hardcoded-records
  • GitHub Check: Check: destination-motherduck
  • GitHub Check: Check: source-intercom
  • GitHub Check: Pytest (Fast)
  • GitHub Check: SDM Docker Image Build
  • GitHub Check: Manifest Server Docker Image Build
  • GitHub Check: Pytest (All, Python 3.12, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.13, Ubuntu)
🔇 Additional comments (1)
unit_tests/sources/utils/test_transform.py (1)

22-23: LGTM - schema inlining looks good!

The schema definitions have been correctly updated to inline the previously referenced definitions. This aligns perfectly with the PR objective to remove $ref resolution from the normalization step. The test cases should continue to work as expected with these inline definitions.

@maxi297 maxi297 dismissed coderabbitai[bot]’s stale review October 24, 2025 15:31

The comment is outdated as I've removed the test that was making formatting fail

@maxi297 maxi297 merged commit 59e4163 into main Oct 24, 2025
29 of 30 checks passed
@maxi297 maxi297 deleted the issue_14928/normalization_performance branch October 24, 2025 15:31
Labels

bug (Something isn't working), security

4 participants