- 
                Notifications
    
You must be signed in to change notification settings  - Fork 31
 
fix: remove references resolution from normalization #812
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
          👋 Greetings, Airbyte Team Member!Here are some helpful tips and reminders for your convenience. Testing This CDK VersionYou can test this version of the CDK using the following: # Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@issue_14928/normalization_performance#egg=airbyte-python-cdk[dev]' --help
# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch issue_14928/normalization_performanceHelpful ResourcesPR Slash CommandsAirbyte Maintainers can execute the following slash commands on your PR: 
  | 
    
          
📝 WalkthroughWalkthroughRemoved JSON Schema $ref-expansion preprocessing from the transformation pipeline in transform.py; unit tests updated to inline previously referenced schema definitions and imports adjusted in test_transform.py. Changes
 Sequence Diagram(s)sequenceDiagram
    participant Caller
    participant TypeTransformer
    participant Validator
    rect #F0F8FF
        Note right of Caller: Old flow (with pre-expansion)
    end
    Caller->>TypeTransformer: provide schema + value
    TypeTransformer->>TypeTransformer: expand_refs(schema, value)
    TypeTransformer->>Validator: validate(expanded_schema, value)
    Validator-->>TypeTransformer: validation result
    TypeTransformer-->>Caller: normalized value
    sequenceDiagram
    participant Caller
    participant TypeTransformer
    participant Validator
    rect #FFF7F0
        Note right of Caller: New flow (no pre-expansion)
    end
    Caller->>TypeTransformer: provide schema + value
    TypeTransformer->>Validator: validate(schema, value)
    Validator-->>TypeTransformer: validation result
    TypeTransformer-->>Caller: normalized value
    Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Would you like me to run a checklist of potential edge cases that may break without pre-expansion (external/$id refs, recursive refs, resolver-dependent schemas), wdyt? Pre-merge checks and finishing touches❌ Failed checks (1 warning)
 ✅ Passed checks (2 passed)
 ✨ Finishing touches
 🧪 Generate unit tests (beta)
 📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (1)
 ⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
 🔇 Additional comments (2)
 Comment   | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (2)
unit_tests/sources/utils/test_transform.py (2)
385-386: Consider refactoring large inline constants.The HUBSPOT_CONTACTS_SCHEMA and HUBSPOT_CONTACTS_RECORD are extremely large inline dictionaries that make the test function hard to read. Would it make sense to move these to module-level constants or a separate fixture file? This would improve maintainability and readability. Wdyt?
388-393: Enhance test documentation and isolation.A few suggestions for the performance test:
- Add a docstring explaining what performance characteristic this test validates
 - The 3-second threshold is acknowledged as arbitrary - could you document why this value was chosen, or make it configurable via pytest markers/environment variables?
 - Since
 transform()mutates the record in-place, should each iteration use a fresh copy to ensure consistent test conditions?Wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
airbyte_cdk/sources/utils/transform.py(1 hunks)unit_tests/sources/utils/test_transform.py(3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
unit_tests/sources/utils/test_transform.py (1)
airbyte_cdk/sources/utils/transform.py (3)
TypeTransformer(67-266)TransformConfig(48-64)transform(218-236)
🪛 GitHub Actions: Linters
unit_tests/sources/utils/test_transform.py
[error] 382-382: ruff format check failed. 1 file would be reformatted. Run 'ruff format' to fix formatting.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: Check: destination-motherduck
 - GitHub Check: Check: source-hardcoded-records
 - GitHub Check: Check: source-intercom
 - GitHub Check: Check: source-shopify
 - GitHub Check: Check: source-pokeapi
 - GitHub Check: Pytest (Fast)
 - GitHub Check: Pytest (All, Python 3.13, Ubuntu)
 - GitHub Check: Pytest (All, Python 3.11, Ubuntu)
 - GitHub Check: Pytest (All, Python 3.10, Ubuntu)
 - GitHub Check: SDM Docker Image Build
 - GitHub Check: Pytest (All, Python 3.12, Ubuntu)
 - GitHub Check: Manifest Server Docker Image Build
 
🔇 Additional comments (4)
airbyte_cdk/sources/utils/transform.py (2)
7-7: LGTM! Clean import removal aligns with $ref resolution removal.The import changes properly reflect the removal of $ref resolution preprocessing from the transformation pipeline, as described in the PR objectives.
86-86: Verify whether "$ref" should remain in the validator keys list.Since this PR removes $ref resolution preprocessing, should "$ref" still be included in the validator keys on line 86? If TypeTransformer no longer processes $ref schemas, this key might be unnecessary. Wdyt?
unit_tests/sources/utils/test_transform.py (2)
6-6: LGTM! Time import supports the new performance test.
22-23: LGTM! Schema inlining removes $ref dependencies.The transition from referenced definitions to inline schemas aligns with the PR objective and makes the test schemas more explicit and self-contained.
          PyTest Results (Fast)3 813 tests  +11   3 801 ✅ +11   6m 32s ⏱️ -11s Results for commit 32ca1d4. ± Comparison against base commit 20ae208. This pull request removes 1 and adds 12 tests. Note that renamed tests count towards both.♻️ This comment has been updated with latest results.  | 
    
          PyTest Results (Full)3 816 tests   3 804 ✅  11m 14s ⏱️ Results for commit 32ca1d4. ♻️ This comment has been updated with latest results.  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚤
some formatting on the test i think but not blocking
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
unit_tests/sources/utils/test_transform.py(2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: Check: source-shopify
 - GitHub Check: Check: source-pokeapi
 - GitHub Check: Check: source-hardcoded-records
 - GitHub Check: Check: destination-motherduck
 - GitHub Check: Check: source-intercom
 - GitHub Check: Pytest (Fast)
 - GitHub Check: SDM Docker Image Build
 - GitHub Check: Manifest Server Docker Image Build
 - GitHub Check: Pytest (All, Python 3.12, Ubuntu)
 - GitHub Check: Pytest (All, Python 3.10, Ubuntu)
 - GitHub Check: Pytest (All, Python 3.11, Ubuntu)
 - GitHub Check: Pytest (All, Python 3.13, Ubuntu)
 
🔇 Additional comments (1)
unit_tests/sources/utils/test_transform.py (1)
22-23: LGTM - schema inlining looks good!The schema definitions have been correctly updated to inline the previously referenced definitions. This aligns perfectly with the PR objective to remove
$refresolution from the normalization step. The test cases should continue to work as expected with these inline definitions.
The comment is outdated as I've removed the test that was making formatting fail
What
Addresses https://github.com/airbytehq/airbyte-internal-issues/issues/14928
Reasoning
Schema Loaders
Here is an analysis of all the schema loaders. Overall, it does not seem possible to have $def references returned as part of the discover or used as part of the read.
JsonFileSchemaLoader
This loader will break with
jsonref.JsonRefError: FileNotFoundError: [Errno 2] No such file or directory: '<reference value>'is a reference is not a fileDefaultSchemaLoader
JsonFileSchemaLoader that defaults on an empty schema
SchemaLoaderCachingDecorator
It's just a caching layer
CompositeSchemaLoader
It aggregates multiple schemas without any opinion on the content so this layer should be fine.
InlineSchemaLoader
Yaml references are resolved when loading the manifest so internal references to the schema_loader would be resolved by the CDK but would fail because it uses the root of the yaml as a base, not the root of the schema. For example:
... will yield
airbyte_cdk.sources.declarative.parsers.custom_exceptions.UndefinedReferenceException: Undefined reference #/definitions/link_id from definitions/link_idDynamicSchemaLoader
Given the keys declared here, the only thing it could reference is another property which is not very likely.
Not low code
It is still possible that some "non low-code" connectors would use this referencing stuff. Searching for
TransformConfig, I see the following sources:ResourceSchemaLoaderwhich is the underlying component ofJsonFileSchemaLoadermentioned above)ResourceSchemaLoaderwhich is the underlying component ofJsonFileSchemaLoadermentioned above)ResourceSchemaLoaderwhich is the underlying component ofJsonFileSchemaLoadermentioned above)ResourceSchemaLoaderwhich is the underlying component ofJsonFileSchemaLoadermentioned above)ResourceSchemaLoaderwhich is the underlying component ofJsonFileSchemaLoadermentioned above)If not marked in the list above, the source is low-code and can't be using ref based on the
Schema Loaderssection above.More notes
We still use
expand_refsas part ofairbyte_cdk.source.config.BaseConfig. I don't know how relevant this code is as I've only seen usage in the tests so I assume it might be affected by performance issues but that it does not really matter to us right now.Remaining questions
Summary by CodeRabbit
Chores
Tests
Notes