
🐛 [airbyte-cdk] Fix bug where substreams depending on an RFR parent stream don't paginate or use existing state #40671

Merged
merged 14 commits on Jul 11, 2024

Conversation

brianjlai (Contributor) commented Jul 2, 2024

What

We identified a bug related to RFR that was blocking low-code migrations and error fixes. The specific case was for a substream that depended on a parent stream that was configured to use RFR. The two issues were:

  • The parent RFR stream did not paginate beyond one page because paging was moved out of read_records() and into read(). This led to missing records because the substream only saw the first page of parent records.
  • When a parent stream was already synced or shared between substreams, the up-to-date state would cause no new records to be retrieved.

How

There are two aspects to how we're solving the above problems. To solve the pagination issue, for both the low-code and Python CDKs, we need a new entrypoint for triggering a read that reuses the more complex pagination logic originally implemented for RFR streams: read_stateless(), which is mostly a convenience method so that every place we invoke read() for substreams doesn't require creating a dummy configured catalog and connector state manager.

As for the shared state issue, we've made a decision to push for all substreams to instantiate independent parent streams to avoid state collisions. This also should help us in a concurrent world where streams should be as independent as possible. This PR fix doesn't strictly address this aspect. See this PR for an example of independent substreams: https://github.com/airbytehq/airbyte/pull/39559/files
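To illustrate the shape of the entrypoint described above, here is a minimal, self-contained sketch. DummyStateManager and ParentStream are toy stand-ins for illustration only, not the actual CDK classes or signatures:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Optional


@dataclass
class DummyStateManager:
    """Toy stand-in for a connector state manager with no prior state."""
    state: Dict[str, Any] = field(default_factory=dict)


class ParentStream:
    """Toy stream: read() is the full entrypoint that handles pagination;
    read_stateless() builds the dummy collaborators so callers don't have to."""

    def __init__(self, pages: List[List[Dict[str, Any]]]):
        self._pages = pages

    def read(self, state_manager: Optional[DummyStateManager]) -> Iterable[Dict[str, Any]]:
        # In the real CDK, this path contains the RFR pagination logic,
        # so it iterates across *all* pages rather than just the first.
        for page in self._pages:
            yield from page

    def read_stateless(self) -> Iterable[Dict[str, Any]]:
        # Convenience entrypoint: no configured catalog or connector state
        # manager needs to be constructed at the call site.
        yield from self.read(state_manager=DummyStateManager())


parent = ParentStream(pages=[[{"id": 1}, {"id": 2}], [{"id": 3}]])
records = list(parent.read_stateless())
```

The point of the convenience wrapper is that every substream call site gets all parent pages without repeating the dummy-collaborator setup.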

Tested against source-jira using RFR streams and substreams depending on RFR streams. The original CDK w/ the bug resulted in substreams that only had parent records from the first page. I've confirmed that each of the substreams had child records across the entire parent stream:

  • boards
  • board_issues
  • filters
  • filter_sharing
  • users
  • user_groups_detailed

Review guide

User Impact

Ideally none

Can this PR be safely reverted and rolled back?

  • YES 💚
  • NO ❌

…stead of using stream_slices() + read_records() which doesn't work with RFR

@octavia-squidington-iii octavia-squidington-iii added the CDK Connector Development Kit label Jul 2, 2024
@brianjlai brianjlai changed the title [airbyte-cdk] Fix bug where substreams depending on an RFR parent stream don't paginate or use existing state 🐛 [airbyte-cdk] Fix bug where substreams depending on an RFR parent stream don't paginate or use existing state Jul 2, 2024
if incremental_dependency:
self._parent_state[parent_stream.name] = parent_stream.state
# update the parent state, as parent stream read all record for current slice and state is already updated
if incremental_dependency:
brianjlai (Contributor Author):
One drawback is that we no longer checkpoint per slice like we used to, although I'm not sure this is something we must retain, given we lose some context going from read_records() to read().

Contributor:
this seems fairly undesirable to me. Will we only checkpoint at the end of the sync?

naive question: could we instead keep the iteration on the stream_slices and expose a method to read a single, but complete slice?

Contributor:
It feels like in order to do this, we would have to track in the model_to_component_factory which stream is a parent and if it is a parent, avoid instantiating ResumableFullRefreshCursor. It feels possible to me but it would require passing a new parameter all the way to _merge_stream_slicers. This seems fair to me though

Contributor:
We should definitely yield state after every slice. Otherwise, we risk having stuck syncs, where some transient error will result in a failed sync without any progress.

brianjlai (Contributor Author):
Yep, Alex and I discussed a bit on Wednesday. It's a bit hacky, but one way we can do this is summarized in the above comment: by inspecting the associated_slice, we can tell we've moved onto the next slice when it changes, and checkpoint + emit the current set of records as mentioned in the new comment above.

If this approach I implemented seems too crazy or prone to failure, then @maxi297's suggestion to just not use RFR might be reasonable. Although the drawback is that we'd effectively have two different implementations of the parent stream at runtime. Since incremental_dependency will switch out the cursor for the substream, it's a small gotcha.
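The slice-change detection idea can be sketched as follows. Slice, Record, and group_and_checkpoint are hypothetical stand-ins for illustration, not the CDK's actual types:

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Slice:
    """Toy slice: a hashable, comparable partition key."""
    partition: Tuple[Tuple[str, Any], ...]


@dataclass
class Record:
    data: Dict[str, Any]
    associated_slice: Slice


def group_and_checkpoint(records: List[Record]):
    """Buffer records until associated_slice changes, then 'checkpoint' the
    finished slice. Relies on records arriving grouped by slice, in order."""
    checkpoints: List[Slice] = []
    grouped: List[List[Record]] = []
    current: Optional[Slice] = None
    buffer: List[Record] = []
    for record in records:
        if current is not None and record.associated_slice != current:
            # Slice boundary detected: the previous slice is fully consumed.
            checkpoints.append(current)
            grouped.append(buffer)
            buffer = []
        current = record.associated_slice
        buffer.append(record)
    if buffer:
        checkpoints.append(current)
        grouped.append(buffer)
    return grouped, checkpoints


s1 = Slice(partition=(("id", 1),))
s2 = Slice(partition=(("id", 2),))
recs = [Record({"a": 1}, s1), Record({"a": 2}, s1), Record({"b": 1}, s2)]
grouped, checkpoints = group_and_checkpoint(recs)
```

Note the ordering assumption: this only works if all records for one slice are emitted before any record of the next, which is exactly the concurrency constraint discussed later in this thread.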

airbyte_state_message = self._checkpoint_state(checkpoint, state_manager=state_manager)
yield airbyte_state_message

def read_stateless(  # type: ignore  # ignoring typing for ConnectorStateManager because of circular dependencies
    self,
    connector_state_manager=None,
brianjlai (Contributor Author) commented Jul 2, 2024:
This is something of a frustrating pattern we have. I can't actually instantiate a ConnectorStateManager instance within this method because I can't import it, due to a circular dependency between ConnectorStateManager and the Stream class.

This is starting to poke at us a bit and becoming an annoying pattern, so we may want to see if we can refactor or adjust the code in connector_state_manager.py to not reference the Stream class. But for now, I'm not sure of a good way to avoid this parameter.

The alternative is we pass in None with the existing code and make sure not to attempt any connector_state_manager operations within core.py if state_manager is None, as the code currently does.
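The workaround described here, accepting an optional collaborator and guarding every use instead of importing it (which would be circular), can be sketched with toy classes. StreamCore and RecordingManager are hypothetical names for illustration:

```python
from typing import Any, Dict, Iterable, List, Optional, Tuple


class StreamCore:
    """Toy core stream: the state manager is injected rather than imported,
    sidestepping a circular dependency between the two modules."""

    def read(self, state_manager: Optional[Any] = None) -> Iterable[Dict[str, Any]]:
        records = [{"id": 1}, {"id": 2}]
        for record in records:
            yield record
            if state_manager is not None:
                # Only touch the manager when the caller supplied one;
                # the stateless path simply skips checkpointing.
                state_manager.update_state_for_stream("my_stream", {"last_id": record["id"]})


class RecordingManager:
    """Toy manager that records checkpoint calls for inspection."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Dict[str, Any]]] = []

    def update_state_for_stream(self, name: str, state: Dict[str, Any]) -> None:
        self.calls.append((name, state))


no_manager = list(StreamCore().read())        # stateless path: no checkpoints attempted
mgr = RecordingManager()
with_manager = list(StreamCore().read(mgr))   # stateful path: checkpoints recorded
```

The records are identical either way; only the checkpointing side effects differ, which is the behavior the PR relies on in core.py.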

Contributor:
Can you move or copy this comment into the code so we have a trace? That's a pretty big gotcha.

@brianjlai brianjlai requested review from girarda and a team July 3, 2024 00:03
@brianjlai brianjlai marked this pull request as ready for review July 3, 2024 00:03
@brianjlai brianjlai requested a review from a team as a code owner July 3, 2024 00:03
@brianjlai brianjlai requested a review from maxi297 July 3, 2024 00:06
@girarda girarda requested a review from lazebnyi July 3, 2024 03:18
airbyte_state_message = self._checkpoint_state(checkpoint, state_manager=state_manager)
yield airbyte_state_message

if internal_config.is_limit_reached(record_counter):
break
self._observe_state(checkpoint_reader)
checkpoint_state = checkpoint_reader.get_checkpoint()
if checkpoint_state is not None:
if state_manager and checkpoint_state is not None:
Contributor:
nit: we could refactor this predicate in a method _should_checkpoint to ensure the logic stays aligned with line 215

brianjlai (Contributor Author):
Given the simplicity of the predicate, how about we just define a variable at the beginning called should_checkpoint and pass that in? A subtle difference from what we have, but it still conveys the same idea that each condition is based on a single predicate defined at the beginning.
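A toy version of that suggestion, evaluating the predicate once up front instead of repeating it at each call site. read_with_checkpoints is a hypothetical function for illustration, not CDK code:

```python
from typing import Any, List, Optional


def read_with_checkpoints(checkpoint_states: List[Optional[dict]], state_manager: Any) -> List[dict]:
    """Collect the checkpoint states that would actually be emitted.

    The predicate 'do we have a state manager?' is computed once as
    should_checkpoint, so every later condition stays aligned with it."""
    should_checkpoint = state_manager is not None
    emitted = []
    for checkpoint_state in checkpoint_states:
        # Each emission site combines the shared predicate with its own check.
        if should_checkpoint and checkpoint_state is not None:
            emitted.append(checkpoint_state)
    return emitted


with_manager = read_with_checkpoints([{"a": 1}, None, {"b": 2}], state_manager=object())
without_manager = read_with_checkpoints([{"a": 1}], state_manager=None)
```

Compared with a _should_checkpoint method, a local variable keeps the logic in one place at the cost of being recomputed per call rather than shared across methods.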


ChristoGrab (Contributor) left a comment:
@brianjlai I won't merge yet in case there are any last changes still to be made, but approving to unblock you when you're ready. I was able to test the behavior using the builder server locally; substreams that were previously stuck on the first parent page are now returning records from multiple parent slices 🙌

brianjlai (Contributor Author):
@ChristoGrab awesome thanks for confirming!

girarda (Contributor) left a comment:
looks great @brianjlai ! :shipit:

stream_slices_for_parent.append(
StreamSlice(partition={partition_field: partition_value, "parent_slice": parent_partition}, cursor_slice={})
continue
elif isinstance(parent_record, Record):
Contributor:
should we raise an exception if parent_record is neither an AirbyteMessage nor a Record?

brianjlai (Contributor Author):
Since we are effectively calling the top-level read() command, which allows for StreamData that could be a Mapping, I think we also need to account for that type. But if it's not one of those three (Record, AirbyteMessage, Mapping), then we can throw the error. Will add.
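A sketch of that three-way dispatch. The real Record, AirbyteMessage, and StreamData types live in the CDK; these are minimal stand-ins to show the shape of the check:

```python
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Any, Dict, Union


@dataclass
class Record:
    """Toy record with the data payload."""
    data: Dict[str, Any]


@dataclass
class AirbyteMessage:
    """Toy message wrapper around a record payload."""
    type: str
    record: Dict[str, Any]


StreamData = Union[Record, AirbyteMessage, Mapping]


def extract_record_data(parent_record: StreamData) -> Dict[str, Any]:
    """Handle all three StreamData shapes; anything else is an error."""
    if isinstance(parent_record, AirbyteMessage):
        return parent_record.record
    elif isinstance(parent_record, Record):
        return parent_record.data
    elif isinstance(parent_record, Mapping):
        return dict(parent_record)
    raise ValueError(f"Unexpected parent record type: {type(parent_record)}")
```

A plain Mapping carries no associated slice, which is why (as discussed below) the parent_slice can't be populated in that case.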

Contributor:
how will we map the record to the slice when parent_record is a Mapping?

brianjlai (Contributor Author):
We can't, in the end. Granted, like we discussed above, it's only really for custom components that don't return the right Record interface, but for a majority of our cases this should not be an issue.

maxi297 (Contributor) left a comment:
I have a couple of questions but I think there shouldn't be much left before this gets approved


stream_slices_for_parent = []
previous_associated_slice = None
for parent_record in parent_stream.read_stateless():
Contributor:
This implementation has the constraint that the parent stream can't be concurrent as this code assumes the slices are consumed one after the other. I think that is a fair constraint in order for us to unblock a couple of things we are working on but I think we need to acknowledge that this might introduce debt that we will need to pay once we move to concurrent. For me, this is worth it as this will unblock a couple of sources to update to the latest CDK version.

brianjlai (Contributor Author):
Agree with everything here. We're putting a dependency on slices being ordered, but isn't this a fundamental problem that already exists in the legacy CDK? Since concurrent doesn't support substream parallelization, we're not worse off than we were before the fix.

Contributor:
Since concurrent doesn't support substream parallelization, we're not worse off than we were before the fix

It doesn't support it but I don't think there is code in the concurrent part that assumes that. It's just that we haven't implemented a PartitionGenerator that consumes a parent concurrently.

The point I want to make is that if we start having streams where stream.read_stateless() is concurrent, we will need to:

  • Implement the concurrent part in read_stateless
  • Know where the code assumes read_stateless is not concurrent and fix it

The first part is fine: whatever we do, we will need to do it. However, the second part is a bit more challenging as it is not explicit, and I'm not sure we have tests to signal that this will break if it is concurrent, so we rely solely on the fact that we will hopefully think about it. This feels dangerous to me, although I don't have anything better to propose, so what I'm trying to do is socialize this problem so that you can also raise the flag if this happens.

brianjlai (Contributor Author):
Ah, I see what you mean. It's not a solution, but I can at least add some comments in the code to make readers aware that concurrent substreams will require additional investigation to find a solution.

)

# iterate over all parent stream_slices
for stream_slice in parent_stream_slices:
Contributor:
I still see cases where some sources use stream_slices and read_records afterward (like this). Are we fine with those because they don't rely on low-code, so they shouldn't be affected by the RFR cursor?

brianjlai (Contributor Author):
Good point, and that's correct. This bug was only for low-code, because all non-incremental, non-substreams in low-code automatically turned on RFR, whereas for Python-based sources, streams currently opt in via a code change. I think we can live with this for now, but as we look into implementing auto-RFR for concurrent/Python, we need to be aware.

@@ -232,7 +232,7 @@ def test_substream_without_input_state():

stream_instance = test_source.streams({})[1]

stream_slice = StreamSlice(partition={"parent_id": "1"},
stream_slice = StreamSlice(partition={},
Contributor:
This seems like a breaking change. Why has this changed?

brianjlai (Contributor Author) commented Jul 9, 2024:
@maxi297 ah yes, sorry, I should've added a PR comment ahead of time for this. It's a bit of a tale.

We mocked output for the parent stream's read_records() method (it overwrites _read_pages(), but basically equates to the response used by the SubstreamPartitionRouter's call to parent_stream.read_records()).

The problem is that this stream_slice is actually defined incorrectly. I dug into the code, and in the dependent incremental parent stream, calling stream_slices() returns StreamSlices with a time window in cursor_slice and no partition, which is expected since it's just a plain incremental stream.

This only went uncaught because in the old implementation of SubstreamPartitionRouter.stream_slices(), we assigned the parent_slice of the resulting StreamSlices by calling parent_stream.stream_slices() and extracting parent_stream_slice.partition. This correctly returned:

StreamSlice(partition={}, cursor_slice={"start_time": "2022-01-01", "end_time": "2022-01-31"})

and this would get added to the resulting final slices as parent_slice.

However, with the change where we no longer get parent records by calling parent_stream.stream_slices() + parent_stream.read_records() in favor of the higher-level read_stateless(), we now have to inspect the record's associated_slice to populate parent_slice:

Record({"id": "1", CURSOR_FIELD: "2022-01-15"}, StreamSlice(partition={"parent_id": 1},
                               cursor_slice={"start_time": "2022-01-01", "end_time": "2022-01-31"}))

In the mocked parent record output, by accidentally populating partition in the slice, we were breaking this test, because the new SubstreamPartitionRouter.stream_slices() relies on inspecting record.associated_slice.

TLDR: we were incorrectly populating the partition field in the mocked records, which caused the test to fail. So I fixed the incorrect mocked input, which only became apparent with this new change.
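To illustrate the approach of populating parent_slice from record.associated_slice, here is a simplified sketch. StreamSlice, Record, and slices_for_substream are toy stand-ins modeled loosely on the names in this thread, not the actual CDK implementation:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class StreamSlice:
    """Toy slice with a partition and a cursor window."""
    partition: Dict[str, Any] = field(default_factory=dict)
    cursor_slice: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Record:
    data: Dict[str, Any]
    associated_slice: Optional[StreamSlice] = None


def slices_for_substream(parent_records: List[Record], partition_field: str, parent_key: str) -> List[StreamSlice]:
    """Build substream slices by inspecting each record's associated_slice
    instead of calling parent_stream.stream_slices() separately."""
    slices = []
    for record in parent_records:
        parent_partition = record.associated_slice.partition if record.associated_slice else {}
        slices.append(
            StreamSlice(
                partition={partition_field: record.data[parent_key], "parent_slice": parent_partition},
                cursor_slice={},
            )
        )
    return slices


# A plain incremental parent yields slices with an empty partition and a time window.
parent_slice = StreamSlice(partition={}, cursor_slice={"start_time": "2022-01-01", "end_time": "2022-01-31"})
records = [Record({"id": "1"}, parent_slice)]
result = slices_for_substream(records, partition_field="parent_id", parent_key="id")
```

This is why a mocked record with a non-empty partition breaks the test: the partition flows straight through into parent_slice.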

Contributor:
What a journey! Thanks to your detailed comment, I get it now. I think the test setup confused me a bit even though it is appropriate. Can we add a comment over the patch.object to make it explicit that it will mock the parent stream's Rates HTTP responses? Or rename stream_slice to parent_stream_slice? If you see any way to make the test more explicit as well, feel free to chime in!

brianjlai (Contributor Author):
Yep, I'll add a comment. It'll definitely be helpful, given even I had to refresh my own memory yesterday on why I did this a week ago.

)

expected_counter = 0
for actual_slice in partition_router.stream_slices():
Contributor:
nit: This seems a bit weird to me. As a user, it means that I can't call list(partition_router.stream_slices()), else it would update the state even though the slice has not been consumed. Should we document somewhere that we assume the processing of a slice is expected to be done before the user calls next when incremental_dependency = True?
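The gotcha can be demonstrated with a toy generator that mutates state as a side effect of yielding, which is purely illustrative (stream_slices here is not the CDK's implementation):

```python
from typing import Any, Dict, Iterable, List


def stream_slices(slices: List[str], state: Dict[str, Any]) -> Iterable[str]:
    """Toy generator mirroring the gotcha: state advances the moment a slice
    is yielded, not when the caller finishes processing it."""
    for s in slices:
        state["last_slice"] = s  # side effect happens on yield
        yield s


state: Dict[str, Any] = {}
gen = stream_slices(["a", "b", "c"], state)

first = next(gen)           # only one slice consumed so far
after_first = dict(state)   # snapshot: state reflects just that slice

rest = list(gen)            # draining the generator fast-forwards state
```

So list(stream_slices()) advances state past slices nobody has processed, which is exactly why the "process each slice before calling next" assumption deserves documentation.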

maxi297 (Contributor) left a comment:
Thanks @brianjlai for diligently answering all my questions and educating me. I'm good with this change!

@brianjlai brianjlai merged commit 9e23b3f into master Jul 11, 2024
32 checks passed
@brianjlai brianjlai deleted the brian/rfr_fix_substream_depends_on_rfr_parent branch July 11, 2024 06:53
xiaohansong pushed a commit that referenced this pull request Jul 12, 2024
Labels
CDK Connector Development Kit
Development

Successfully merging this pull request may close these issues.

[connector-builder] Substream Only Fetches Records from the First Page of Parent Stream
6 participants