build: add ijson as a runtime dependency #1011
devin-ai-integration[bot] wants to merge 2 commits into
Adds the ijson streaming JSON parser as a direct dependency so connectors that ship inside the source-declarative-manifest base image can stream-parse very large JSON response bodies without materializing the full document in memory. Motivation: source-amazon-seller-partner currently OOMs while reading GET_BRAND_ANALYTICS_SEARCH_TERMS_REPORT documents that can exceed 3 GB uncompressed. See airbytehq/oncall#12143.
Testing This CDK Version
You can test this version of the CDK using the following:
# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1777722612-add-ijson-dep#egg=airbyte-python-cdk[dev]' --help
# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1777722612-add-ijson-dep
PyTest Results (Fast): 4 040 tests (±0), 4 028 ✅ (-1), 6m 41s ⏱️ (-1m 4s). Results for commit d2cd1ac. ± Comparison against base commit 886fcf8. This pull request skips 1 test.
↩️ Triggering Reason: CDK PR is linked to a connector oncall issue, CI is passing or already has local/full pytest evidence, and no AI review marker is present.
Correction: attempted to trigger the CDK This PR remains marked for human review/next-step decision.
Summary
Adds ijson as a direct runtime dependency of airbyte-cdk. ijson is a streaming JSON parser. Adding it as a CDK dep makes it available inside the source-declarative-manifest (SDM) base image so that manifest-only connectors with custom components can stream-parse very large JSON response bodies without materializing the entire document in memory.
Motivation
source-amazon-seller-partner is currently OOMing on GET_BRAND_ANALYTICS_SEARCH_TERMS_REPORT (and the other Brand Analytics reports) when reports exceed a few GB uncompressed. Its custom GzipJsonDecoder does:
1. response.content: buffers the full compressed payload.
2. gzip.decompress(...): materializes the full decompressed bytes (~3 GB).
3. .decode("iso-8859-1"): copies that into a Python string.
4. json.loads(document): parses the entire JSON tree.
For a ~3.2 GB report this easily peaks at 12–20 GB of memory, well past the 8 GB cap the customer is allocating. See airbytehq/oncall#12143 for the customer-facing context.
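The four steps above can be sketched with the standard library alone to make each full-size copy explicit. This illustrates the pattern described, not the connector's actual code; the function name `buffered_decode` and the sample payload are hypothetical.

```python
import gzip
import json


def buffered_decode(compressed: bytes) -> object:
    """Decode a gzipped JSON payload the non-streaming way.

    Each step below keeps another full-size copy of the document alive.
    """
    # Step 1 happened before this call: `compressed` stands in for
    # response.content, the fully buffered compressed payload.
    raw = gzip.decompress(compressed)      # step 2: full decompressed bytes
    document = raw.decode("iso-8859-1")    # step 3: full copy as a str
    return json.loads(document)            # step 4: full parsed object tree


# Tiny stand-in for a report body (hypothetical shape).
sample = gzip.compress(
    json.dumps({"records": [{"id": 1}, {"id": 2}]}).encode("iso-8859-1")
)
result = buffered_decode(sample)
```

Because the compressed bytes, the decompressed bytes, the decoded string, and the parsed tree can all be alive at once, peak memory is a multiple of the uncompressed document size.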
The follow-up connector PR will add a streaming GzipJsonDecoder variant in source-amazon-seller-partner/components.py that uses ijson to yield records one at a time. That PR depends on this CDK release shipping in a new SDM base image.
Why a direct dep, not optional
ijson is a small (single-digit MB) wheel with prebuilt binaries for the platforms the CDK supports. Several connectors already depend on it transitively (e.g. via unstructured/document loaders), so making it explicit is low-risk and unblocks streaming use cases for any connector that needs them.
Review & Testing Checklist for Human
- ijson shows up in the new poetry.lock and that no other dependency had its resolved version changed.
- import ijson and use the C backend (ijson.backends.yajl2_c).
Notes
- composite_raw_decoder.py (e.g., a JsonItemsParser powered by ijson) so connectors don't have to roll their own. Intentionally deferred to keep this change minimal.
Link to Devin session: https://app.devin.ai/sessions/e31a7df6ebe54ce4a68e0eecc7117555