fix: hubspot property chunking #797

maxi297 · 2025-10-15T21:53:44Z

What

Follow up to https://airbytehq-team.slack.com/archives/C02U9R3AF37/p1760561662456189

Summary by CodeRabbit

Bug Fixes
- Always applies additional query parameter chunks for paginated requests when such parameters are present.
- Ensures records without a merge key are still emitted and tracked, preventing lost or unreported data.
- Improves reliability of parameter-driven pagination to reduce missing or incorrectly merged records during syncs.

coderabbitai · 2025-10-15T21:53:54Z

📝 Walkthrough

Walkthrough

Always iterate request-property chunks when additional_query_properties exists; compute merge_key only if property_chunking is truthy; ensure records without a merge_key are still observed by the pagination tracker and yielded; add a unit test for additional_query_properties without property_chunking.

Changes

Cohort / File(s)	Change summary
Retriever pagination & record emission `airbyte_cdk/sources/declarative/retrievers/simple_retriever.py`	Refactored `_read_pages` to iterate `get_request_property_chunks` whenever `additional_query_properties` exists; compute `merge_key` only when `property_chunking` is truthy; always observe and yield records lacking a `merge_key`.
Unit tests — simple retriever `unit_tests/sources/declarative/retrievers/test_simple_retriever.py`	Added `test_simple_retriever_with_additional_query_properties_but_without_property_chunking` validating two pages read, expected records returned, and that `stream_slice` includes `extra_fields` on first requester call when `property_chunking` is None.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant C as Caller
  participant SR as SimpleRetriever
  participant AQP as AdditionalQueryProperties
  participant PG as Paginator
  participant PT as PaginationTracker

  C->>SR: read()
  SR->>SR: init stream_slice
  alt additional_query_properties present
    SR->>AQP: get_request_property_chunks(stream_slice)
    AQP-->>SR: chunks
    loop for each chunk
      SR->>SR: extend stream_slice with chunk
      SR->>PG: next_page(stream_slice)
      PG-->>SR: page (records)
      loop for each record
        alt property_chunking truthy
          SR->>SR: compute merge_key
          alt merge_key present
            SR->>PT: observe merged record
            SR-->>C: yield merged record
          else merge_key missing
            SR->>PT: observe record
            SR-->>C: yield raw record
          end
        else property_chunking falsy
          SR->>PT: observe record
          SR-->>C: yield raw record
        end
      end
    end
  else
    SR->>PG: next_page(stream_slice)
    PG-->>SR: page (records)
    SR->>PT: observe records
    SR-->>C: yield records
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

fix(property chunking): Switch the ordering page iteration and property chunking process chunks first instead of pages first #487: Modifies SimpleRetriever pagination/property-chunking logic in _read_pages, closely related to these changes.

Suggested reviewers

darynaishchenko
tolik0

Would you like a short unit-test checklist covering records-without-merge-key and chunked vs non-chunked paths, wdyt?

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The title "fix: hubspot property chunking" correctly points to the area of the bug but is fairly broad and doesn’t capture that the change updates simple_retriever to handle additional_query_properties even when property_chunking is unset. As a result, a reviewer scanning history might not realize the fix now processes request property chunks without requiring the property_chunking flag. Despite this, the title still directly relates to the core issue being addressed.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch maxi297/fix_hubspot_property_chunking

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2025-10-15T21:53:56Z

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@maxi297/fix_hubspot_property_chunking#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch maxi297/fix_hubspot_property_chunking

Helpful Resources

CDK API Reference

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

/autofix - Fixes most formatting and linting issues
/poetry-lock - Updates poetry.lock file
/test - Runs connector tests with the updated CDK
/poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
/poe <command> - Runs any poe command in the CDK environment

📝 Edit this welcome message.

github-actions · 2025-10-15T22:01:10Z

PyTest Results (Fast)

3 802 tests +6 3 790 ✅ +6 6m 28s ⏱️ +2s
1 suites ±0 12 💤 ±0
1 files ±0 0 ❌ ±0

Results for commit 5816cd8. ± Comparison against base commit f360d91.

♻️ This comment has been updated with latest results.

maxi297 · 2025-10-15T22:06:53Z

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formating issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

✅ Changes applied successfully.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (1)

409-419: Consider extracting the duplicate record emission logic, wdyt?

Lines 409-414 and 416-419 perform identical operations (observe, increment, set last_record, yield). Could these be unified to reduce duplication?

For example, you could extract this into a helper:

def _emit_record(self, record: Record, pagination_tracker: PaginationTracker) -> Tuple[int, Record]:
    """Emit a record and update tracking state."""
    pagination_tracker.observe(record)
    return 1, record  # page_size_increment, last_record

Then use it in both places:

                                if merge_key:
                                    _deep_merge(merged_records[merge_key], current_record)
                                else:
-                                    # We should still emit records even if the record did not have a merge key
-                                    pagination_tracker.observe(current_record)
-                                    last_page_size += 1
-                                    last_record = current_record
-                                    yield current_record
+                                    page_size_increment, last_record = self._emit_record(current_record, pagination_tracker)
+                                    last_page_size += page_size_increment
+                                    yield current_record
                            else:
-                                pagination_tracker.observe(current_record)
-                                last_page_size += 1
-                                last_record = current_record
-                                yield current_record
+                                page_size_increment, last_record = self._emit_record(current_record, pagination_tracker)
+                                last_page_size += page_size_increment
+                                yield current_record

This is just a suggestion to improve maintainability - the current implementation is functionally correct!

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e2de3d4 and 0bf488e.

📒 Files selected for processing (1)

airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (2 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (2)

airbyte_cdk/sources/declarative/requesters/query_properties/property_chunking.py (1)

get_merge_key (70-71)

airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1)

observe (510-543)

🪛 GitHub Actions: Linters

airbyte_cdk/sources/declarative/retrievers/simple_retriever.py

[error] 399-406: ruff format --diff detected formatting changes needed in this file. Exit code 1 after showing the diff; please run the formatter to fix code style issues.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)

GitHub Check: Check: destination-motherduck
GitHub Check: Check: source-intercom
GitHub Check: Check: source-pokeapi
GitHub Check: Check: source-shopify
GitHub Check: Check: source-hardcoded-records
GitHub Check: SDM Docker Image Build
GitHub Check: Manifest Server Docker Image Build
GitHub Check: Pytest (Fast)
GitHub Check: Pytest (All, Python 3.11, Ubuntu)
GitHub Check: Pytest (All, Python 3.12, Ubuntu)
GitHub Check: Pytest (All, Python 3.13, Ubuntu)
GitHub Check: Pytest (All, Python 3.10, Ubuntu)

🔇 Additional comments (2)

airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (2)

387-387: LGTM! The condition simplification makes sense.

Checking only for additional_query_properties allows property chunking to occur even when property_chunking is not defined, which aligns with the PR's intent to fix the chunking behavior.

401-414: No additional tests needed – existing test test_simple_retriever_still_emit_records_if_no_merge_key covers this edge case.

airbyte_cdk/sources/declarative/retrievers/simple_retriever.py

github-actions · 2025-10-15T22:19:42Z

PyTest Results (Full)

3 805 tests +6 3 793 ✅ +6 11m 2s ⏱️ +2s
1 suites ±0 12 💤 ±0
1 files ±0 0 ❌ ±0

Results for commit 5816cd8. ± Comparison against base commit f360d91.

♻️ This comment has been updated with latest results.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (3)

unit_tests/sources/declarative/retrievers/test_simple_retriever.py (3)
1121-1121: Add a docstring to explain the test's specific purpose?

This test covers an important edge case (additional_query_properties without property_chunking + pagination). A docstring would help future maintainers quickly understand what scenario this test validates, wdyt?

1174-1176: Consider setting get_initial_token return value for test robustness?

Other pagination tests (e.g., lines 662, 1548) explicitly set paginator.get_initial_token.return_value. While the test might work without it due to Mock's behavior, explicitly setting it to None would make the test more explicit and prevent potential issues if the retriever calls this method, wdyt?

Apply this diff:
 paginator = _mock_paginator()
+paginator.get_initial_token.return_value = None
 paginator.next_page_token.side_effect = [{"next_page_token": 1}, None]
1194-1194: Strengthen the extra_fields assertion to verify content?

The current assertion only checks that extra_fields is truthy. Consider verifying:

The actual content of extra_fields (should contain {"query_properties": ...})

Both send_request calls have extra_fields populated (both pages in the same property chunk)

This would make the test more thorough and catch potential regressions, wdyt?

Apply this diff:
-assert requester.send_request.call_args_list[0].kwargs["stream_slice"].extra_fields
+# Verify both pages have extra_fields populated (same property chunk)
+first_call_extra_fields = requester.send_request.call_args_list[0].kwargs["stream_slice"].extra_fields
+second_call_extra_fields = requester.send_request.call_args_list[1].kwargs["stream_slice"].extra_fields
+assert first_call_extra_fields.get("query_properties") == ["first_name", "last_name", "nonary", "bracelet"]
+assert second_call_extra_fields.get("query_properties") == ["first_name", "last_name", "nonary", "bracelet"]

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 69a490b and 5816cd8.

📒 Files selected for processing (1)

unit_tests/sources/declarative/retrievers/test_simple_retriever.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

unit_tests/sources/declarative/retrievers/test_simple_retriever.py (4)

airbyte_cdk/sources/types.py (7)

Record (21-72)

data (35-36)

associated_slice (39-40)

StreamSlice (75-169)

cursor_slice (107-112)

partition (99-104)

extra_fields (115-117)

airbyte_cdk/sources/declarative/requesters/requester.py (1)

send_request (138-156)

airbyte_cdk/sources/declarative/requesters/query_properties/query_properties.py (1)

QueryProperties (14-48)

airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (2)

name (118-126)

name (129-131)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)

GitHub Check: Check: source-intercom
GitHub Check: Check: destination-motherduck
GitHub Check: Check: source-shopify
GitHub Check: Check: source-pokeapi
GitHub Check: Check: source-hardcoded-records
GitHub Check: Pytest (All, Python 3.13, Ubuntu)
GitHub Check: Pytest (All, Python 3.12, Ubuntu)
GitHub Check: Pytest (All, Python 3.11, Ubuntu)
GitHub Check: Pytest (All, Python 3.10, Ubuntu)
GitHub Check: Pytest (Fast)
GitHub Check: Manifest Server Docker Image Build
GitHub Check: SDM Docker Image Build
GitHub Check: Analyze (python)

brianjlai

📦 lgtm

… for missing custom properties during incremental syncs (#68159) ## What We had a bug where during incremental syncs Hubspot CRM Search streams were not including the custom properties in the json body of the POST request so they were not getting received and emitted with records. ## How The bug was in the CDK and it was fixed in version `7.3.7` in this airbytehq/airbyte-python-cdk#797 We need to bump the version of SDM to get the fix, but in addition, we need to upgrade the unit_test `pyproject.toml` which is still on v6. I've also added a new test that validates that properties are indeed populated in the outbound request. And with the bump from v6 to v7 I fixed the tests which have now changed. **Note**: It does feel like we have something of a gap where our unit tests don't properly test CDK changes since the two are independently versioned... This is something we may want to investigate and solve so these types of things don't happen again ## Can this PR be safely reverted and rolled back? - [ ] YES 💚 - [ ] NO ❌ Kind of... If we do this wrong then we have to reset customers back to their previous state, but this is no different than the state we were previously in

fix hubspot property chunking

e2de3d4

github-actions bot added bug Something isn't working security labels Oct 15, 2025

coderabbitai bot approved these changes Oct 15, 2025

View reviewed changes

emit record when not chunking

0bf488e

maxi297 requested a review from brianjlai October 15, 2025 22:06

Auto-fix lint and format issues

69a490b

coderabbitai bot requested changes Oct 15, 2025

View reviewed changes

airbyte_cdk/sources/declarative/retrievers/simple_retriever.py Show resolved Hide resolved

coderabbitai bot approved these changes Oct 15, 2025

View reviewed changes

adding test for query properties without propertychunking

5816cd8

coderabbitai bot reviewed Oct 15, 2025

View reviewed changes

brianjlai approved these changes Oct 15, 2025

View reviewed changes

brianjlai merged commit 55ea028 into main Oct 16, 2025
28 of 29 checks passed

brianjlai deleted the maxi297/fix_hubspot_property_chunking branch October 16, 2025 16:41

brianjlai mentioned this pull request Oct 17, 2025

fix(source-hubspot): bump to latest version of CDK to get the bug fix for missing custom properties during incremental syncs airbytehq/airbyte#68159

Merged

2 tasks

coderabbitai bot mentioned this pull request Oct 23, 2025

feat: cache properties from endpoint #808

Merged

fix: hubspot property chunking #797

fix: hubspot property chunking #797

Uh oh!

Conversation

maxi297 commented Oct 15, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Oct 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Pre-merge checks and finishing touches

Uh oh!

github-actions bot commented Oct 15, 2025

👋 Greetings, Airbyte Team Member!

Testing This CDK Version

Helpful Resources

PR Slash Commands

Uh oh!

github-actions bot commented Oct 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PyTest Results (Fast)

Uh oh!

maxi297 commented Oct 15, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

github-actions bot commented Oct 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PyTest Results (Full)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

brianjlai left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

maxi297 commented Oct 15, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Oct 15, 2025 •

edited

Loading

github-actions bot commented Oct 15, 2025 •

edited

Loading

maxi297 commented Oct 15, 2025 •

edited by github-actions bot

Loading

github-actions bot commented Oct 15, 2025 •

edited

Loading