Skip to content

Conversation

@maxi297
Copy link
Contributor

@maxi297 maxi297 commented Oct 15, 2025

What

Follow up to https://airbytehq-team.slack.com/archives/C02U9R3AF37/p1760561662456189

Summary by CodeRabbit

  • Bug Fixes
    • Always applies additional query parameter chunks for paginated requests when such parameters are present.
    • Ensures records without a merge key are still emitted and tracked, preventing lost or unreported data.
    • Improves reliability of parameter-driven pagination to reduce missing or incorrectly merged records during syncs.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Oct 15, 2025

📝 Walkthrough

Walkthrough

Always iterate request-property chunks when additional_query_properties exists; compute merge_key only if property_chunking is truthy; ensure records without a merge_key are still observed by the pagination tracker and yielded; add a unit test for additional_query_properties without property_chunking.

Changes

Cohort / File(s) Change summary
Retriever pagination & record emission
airbyte_cdk/sources/declarative/retrievers/simple_retriever.py
Refactored _read_pages to iterate get_request_property_chunks whenever additional_query_properties exists; compute merge_key only when property_chunking is truthy; always observe and yield records lacking a merge_key.
Unit tests — simple retriever
unit_tests/sources/declarative/retrievers/test_simple_retriever.py
Added test_simple_retriever_with_additional_query_properties_but_without_property_chunking validating two pages read, expected records returned, and that stream_slice includes extra_fields on first requester call when property_chunking is None.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant C as Caller
  participant SR as SimpleRetriever
  participant AQP as AdditionalQueryProperties
  participant PG as Paginator
  participant PT as PaginationTracker

  C->>SR: read()
  SR->>SR: init stream_slice
  alt additional_query_properties present
    SR->>AQP: get_request_property_chunks(stream_slice)
    AQP-->>SR: chunks
    loop for each chunk
      SR->>SR: extend stream_slice with chunk
      SR->>PG: next_page(stream_slice)
      PG-->>SR: page (records)
      loop for each record
        alt property_chunking truthy
          SR->>SR: compute merge_key
          alt merge_key present
            SR->>PT: observe merged record
            SR-->>C: yield merged record
          else merge_key missing
            SR->>PT: observe record
            SR-->>C: yield raw record
          end
        else property_chunking falsy
          SR->>PT: observe record
          SR-->>C: yield raw record
        end
      end
    end
  else
    SR->>PG: next_page(stream_slice)
    PG-->>SR: page (records)
    SR->>PT: observe records
    SR-->>C: yield records
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • darynaishchenko
  • tolik0

Would you like a short unit-test checklist covering records-without-merge-key and chunked vs non-chunked paths, wdyt?

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title "fix: hubspot property chunking" correctly points to the area of the bug but is fairly broad and doesn’t capture that the change updates simple_retriever to handle additional_query_properties even when property_chunking is unset. As a result, a reviewer scanning history might not realize the fix now processes request property chunks without requiring the property_chunking flag. Despite this, the title still directly relates to the core issue being addressed.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch maxi297/fix_hubspot_property_chunking

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions bot added bug Something isn't working security labels Oct 15, 2025
@github-actions
Copy link

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@maxi297/fix_hubspot_property_chunking#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch maxi297/fix_hubspot_property_chunking

Helpful Resources

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment

📝 Edit this welcome message.

@github-actions
Copy link

github-actions bot commented Oct 15, 2025

PyTest Results (Fast)

3 802 tests  +6   3 790 ✅ +6   6m 28s ⏱️ +2s
    1 suites ±0      12 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit 5816cd8. ± Comparison against base commit f360d91.

♻️ This comment has been updated with latest results.

@maxi297 maxi297 requested a review from brianjlai October 15, 2025 22:06
@maxi297
Copy link
Contributor Author

maxi297 commented Oct 15, 2025

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formating issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

✅ Changes applied successfully.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (1)

409-419: Consider extracting the duplicate record emission logic, wdyt?

Lines 409-414 and 416-419 perform identical operations (observe, increment, set last_record, yield). Could these be unified to reduce duplication?

For example, you could extract this into a helper:

def _emit_record(self, record: Record, pagination_tracker: PaginationTracker) -> Tuple[int, Record]:
    """Emit a record and update tracking state."""
    pagination_tracker.observe(record)
    return 1, record  # page_size_increment, last_record

Then use it in both places:

                                if merge_key:
                                    _deep_merge(merged_records[merge_key], current_record)
                                else:
-                                    # We should still emit records even if the record did not have a merge key
-                                    pagination_tracker.observe(current_record)
-                                    last_page_size += 1
-                                    last_record = current_record
-                                    yield current_record
+                                    page_size_increment, last_record = self._emit_record(current_record, pagination_tracker)
+                                    last_page_size += page_size_increment
+                                    yield current_record
                            else:
-                                pagination_tracker.observe(current_record)
-                                last_page_size += 1
-                                last_record = current_record
-                                yield current_record
+                                page_size_increment, last_record = self._emit_record(current_record, pagination_tracker)
+                                last_page_size += page_size_increment
+                                yield current_record

This is just a suggestion to improve maintainability - the current implementation is functionally correct!

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e2de3d4 and 0bf488e.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (2)
airbyte_cdk/sources/declarative/requesters/query_properties/property_chunking.py (1)
  • get_merge_key (70-71)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1)
  • observe (510-543)
🪛 GitHub Actions: Linters
airbyte_cdk/sources/declarative/retrievers/simple_retriever.py

[error] 399-406: ruff format --diff detected formatting changes needed in this file. Exit code 1 after showing the diff; please run the formatter to fix code style issues.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: Check: destination-motherduck
  • GitHub Check: Check: source-intercom
  • GitHub Check: Check: source-pokeapi
  • GitHub Check: Check: source-shopify
  • GitHub Check: Check: source-hardcoded-records
  • GitHub Check: SDM Docker Image Build
  • GitHub Check: Manifest Server Docker Image Build
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.12, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.13, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (2)
airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (2)

387-387: LGTM! The condition simplification makes sense.

Checking only for additional_query_properties allows property chunking to occur even when property_chunking is not defined, which aligns with the PR's intent to fix the chunking behavior.


401-414: No additional tests needed – existing test test_simple_retriever_still_emit_records_if_no_merge_key covers this edge case.

@github-actions
Copy link

github-actions bot commented Oct 15, 2025

PyTest Results (Full)

3 805 tests  +6   3 793 ✅ +6   11m 2s ⏱️ +2s
    1 suites ±0      12 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit 5816cd8. ± Comparison against base commit f360d91.

♻️ This comment has been updated with latest results.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
unit_tests/sources/declarative/retrievers/test_simple_retriever.py (3)

1121-1121: Add a docstring to explain the test's specific purpose?

This test covers an important edge case (additional_query_properties without property_chunking + pagination). A docstring would help future maintainers quickly understand what scenario this test validates, wdyt?


1174-1176: Consider setting get_initial_token return value for test robustness?

Other pagination tests (e.g., lines 662, 1548) explicitly set paginator.get_initial_token.return_value. While the test might work without it due to Mock's behavior, explicitly setting it to None would make the test more explicit and prevent potential issues if the retriever calls this method, wdyt?

Apply this diff:

 paginator = _mock_paginator()
+paginator.get_initial_token.return_value = None
 paginator.next_page_token.side_effect = [{"next_page_token": 1}, None]

1194-1194: Strengthen the extra_fields assertion to verify content?

The current assertion only checks that extra_fields is truthy. Consider verifying:

  1. The actual content of extra_fields (should contain {"query_properties": ...})
  2. Both send_request calls have extra_fields populated (both pages in the same property chunk)

This would make the test more thorough and catch potential regressions, wdyt?

Apply this diff:

-assert requester.send_request.call_args_list[0].kwargs["stream_slice"].extra_fields
+# Verify both pages have extra_fields populated (same property chunk)
+first_call_extra_fields = requester.send_request.call_args_list[0].kwargs["stream_slice"].extra_fields
+second_call_extra_fields = requester.send_request.call_args_list[1].kwargs["stream_slice"].extra_fields
+assert first_call_extra_fields.get("query_properties") == ["first_name", "last_name", "nonary", "bracelet"]
+assert second_call_extra_fields.get("query_properties") == ["first_name", "last_name", "nonary", "bracelet"]
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 69a490b and 5816cd8.

📒 Files selected for processing (1)
  • unit_tests/sources/declarative/retrievers/test_simple_retriever.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
unit_tests/sources/declarative/retrievers/test_simple_retriever.py (4)
airbyte_cdk/sources/types.py (7)
  • Record (21-72)
  • data (35-36)
  • associated_slice (39-40)
  • StreamSlice (75-169)
  • cursor_slice (107-112)
  • partition (99-104)
  • extra_fields (115-117)
airbyte_cdk/sources/declarative/requesters/requester.py (1)
  • send_request (138-156)
airbyte_cdk/sources/declarative/requesters/query_properties/query_properties.py (1)
  • QueryProperties (14-48)
airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (2)
  • name (118-126)
  • name (129-131)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: Check: source-intercom
  • GitHub Check: Check: destination-motherduck
  • GitHub Check: Check: source-shopify
  • GitHub Check: Check: source-pokeapi
  • GitHub Check: Check: source-hardcoded-records
  • GitHub Check: Pytest (All, Python 3.13, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.12, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Manifest Server Docker Image Build
  • GitHub Check: SDM Docker Image Build
  • GitHub Check: Analyze (python)

Copy link
Contributor

@brianjlai brianjlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📦 lgtm

@brianjlai brianjlai merged commit 55ea028 into main Oct 16, 2025
28 of 29 checks passed
@brianjlai brianjlai deleted the maxi297/fix_hubspot_property_chunking branch October 16, 2025 16:41
brianjlai added a commit to airbytehq/airbyte that referenced this pull request Oct 20, 2025
… for missing custom properties during incremental syncs (#68159)

## What

We had a bug where during incremental syncs Hubspot CRM Search streams
were not including the custom properties in the json body of the POST
request so they were not getting received and emitted with records.

## How

The bug was in the CDK and it was fixed in version `7.3.7` in this
airbytehq/airbyte-python-cdk#797

We need to bump the version of SDM to get the fix, but in addition, we
need to upgrade the unit_test `pyproject.toml` which is still on v6.
I've also added a new test that validates that properties are indeed
populated in the outbound request. And with the bump from v6 to v7 I
fixed the tests which have now changed.

**Note**: It does feel like we have something of a gap where our unit
tests don't properly test CDK changes since the two are independently
versioned... This is something we may want to investigate and solve so
these types of things don't happen again

## Can this PR be safely reverted and rolled back?

- [ ] YES 💚
- [ ] NO ❌

Kind of... If we do this wrong then we have to reset customers back to
their previous state, but this is no different than the state we were
previously in
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working security

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants