Concurrent CDK: fix state message ordering #34131
Conversation
Force-pushed f64538b → d40a7ba
Can you confirm that what I understand is fair? If so, I think we need to change the logic a bit.
```python
        return first_interval
    merged_intervals = self.merge_intervals(slices)
    first_interval = merged_intervals[0]
    if start < first_interval[self.START_KEY]:
```
Given that `start` is defined as `self.parse_timestamp(state["start"]) if "start" in state else self.zero_value` here, I think this will never update the state.

Assuming it is the first incremental sync, `start = self.zero_value`, which is `0001-01-01T00:00:00.000Z`, so we would have the following scenario:

- Start an incremental concurrent sync with `config["start"] = 2023-01-01`
- Async generation of partitions: `{"start": "2023-01-01", "end": "2023-12-31"}`
- Async processing of `{"start": "2023-01-01", "end": "2023-12-31"}`, then closing this partition. At that point, the state is:

```
{
    "state_type": ConcurrencyCompatibleStateType.date_range.value,
    "metadata": { … },
    "start": "0001-01-01T00:00:00.000Z",
    "slices": [
        {"start": "2023-01-01", "end": "2023-12-31"}
    ]
}
```

- Merging intervals (not relevant here)
- `start < first_interval[self.START_KEY] == True`, so we return `0001-01-01T00:00:00.000Z` as the latest completed time

Outcome:
- Next sync starts from `"0001-01-01T00:00:00.000Z"`

Expected result:
- Next sync starts from `"2023-12-31"`

It feels like the value of `start` should be initialized as the lower boundary of all the slices. Based on @girarda's comment yesterday, the easiest way to have that might be to ask the developer to provide it.
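The failure mode described above can be reduced to a few lines. This is a hypothetical simplification for illustration only; `get_latest_complete_time`, the `slices` layout, and `ZERO_VALUE` are assumed names, not the actual CDK code:

```python
# Hypothetical reduction of the cursor logic under discussion; names and
# structure are illustrative, not the actual CDK implementation.
from datetime import datetime, timezone

ZERO_VALUE = datetime(1, 1, 1, tzinfo=timezone.utc)

def get_latest_complete_time(start, merged_intervals):
    first_interval = merged_intervals[0]
    # If `start` precedes the first synced interval, the gap before that
    # interval was never synced, so the cursor cannot advance past `start`.
    if start < first_interval["start"]:
        return start
    return first_interval["end"]

slices = [
    {"start": datetime(2023, 1, 1, tzinfo=timezone.utc),
     "end": datetime(2023, 12, 31, tzinfo=timezone.utc)}
]

# With start == ZERO_VALUE the cursor never advances past 0001-01-01,
# even though the 2023 slice completed successfully.
stuck = get_latest_complete_time(ZERO_VALUE, slices)

# Seeding start from config["start"] instead lets the cursor advance
# to the end of the completed slice.
advanced = get_latest_complete_time(
    datetime(2023, 1, 1, tzinfo=timezone.utc), slices
)
```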
Right, this should be taking the start from the config into account too. Will make that update!
```python
@@ -81,26 +82,34 @@ def convert_from_sequential_state(self, cursor_field: CursorField, stream_state:
{
    "state_type": ConcurrencyCompatibleStateType.date_range.value,
    "metadata": { … },
    "start": <timestamp representing the latest date before which all records were synced>
```
Do we need to set this as part of the state? Could it simply be a field in `DateTimeStreamStateConverter`? I'm wary of adding things to the state, as it means possible breaking changes when we want to change it later.
One thing to note is that this is just in the internal state; it isn't part of the state that we're emitting.
It isn't essential that it's part of the internal state, but it would have to be passed into a lot of functions instead. I don't have a strong feeling either way. Given all that, do you still prefer that it's not part of the state?
Ok! I think I have more context given your comment. In my mind, the `_state` was something we could emit when we move from the sequential state to the new fancy concurrent state, hence why I was surprised. Will having `start` as part of the response from `convert_from_sequential_state` create more work than having an internal field when we do this switch?
```python
    self.parse_timestamp(stream_state[cursor_field.cursor_field_key])
    if cursor_field.cursor_field_key in stream_state
    else self.zero_value
)
if cursor_field.cursor_field_key in stream_state:
```
We have the condition `if cursor_field.cursor_field_key in stream_state` both in the definition of `low_water_mark` and on this line. Can it be grouped?
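One way to group the duplicated check into a single branch, sketched with assumed names (`compute_low_water_mark` and its parameters are illustrative, not the CDK's real API):

```python
# Illustrative refactor: the membership check appears exactly once, and any
# work that only applies when the key is present lives in the same branch.
from typing import Any, Callable, MutableMapping

def compute_low_water_mark(
    stream_state: MutableMapping[str, Any],
    cursor_key: str,
    zero_value: Any,
    parse: Callable[[Any], Any],
) -> Any:
    if cursor_key in stream_state:
        # Key present: parse it (and do any other present-only work here).
        return parse(stream_state[cursor_key])
    # Key absent: fall back to the zero value.
    return zero_value
```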
Force-pushed f28b7a1 → d310df8
I'm missing context on the edge cases that were identified. Can you add more information? We can also sync if you prefer
```python
        return first_interval
    merged_intervals = self.merge_intervals(slices)
    first_interval = merged_intervals[0]
    if previous_sync_end < first_interval[self.START_KEY]:
```
I would have assumed that nothing would have changed except for https://github.com/airbytehq/airbyte/pull/34131/files#diff-44999ccc78e8a64a5a79f8352f8aa2a45d8dde6ecaa92d3dff0cc773de104197R118. How can this case be possible? It feels like since we add an interval with `low_water_mark` as an upper boundary, we should never hit this case. Do we fear that connectors might start syncing earlier slices?
This is handling what I assume would be an unusual case, which is that the start date was changed to something more recent.
```python
    # `previous_sync_end` falls outside of the first interval; this is unexpected because we shouldn't have tried
    # to sync anything before `previous_sync_end`, but we can handle it anyway.
    return self._get_latest_complete_time(previous_sync_end, merged_intervals[1:])
```
This would mean that the first interval is completely before `previous_sync_end`. In which case can this occur?
Like the comment mentions, we don't really expect this to happen. Since it's unexpected I can modify this to raise an exception instead.
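A sketch of the raise-instead-of-handle alternative discussed here; the function name, key names, and recursion-free shape are assumptions for illustration, not the actual change:

```python
# Fail loudly on the unexpected gap instead of silently recursing past it.
from typing import Any, List, Mapping

def latest_complete_time(
    previous_sync_end: Any, merged_intervals: List[Mapping[str, Any]]
) -> Any:
    if not merged_intervals:
        return previous_sync_end
    first = merged_intervals[0]
    if previous_sync_end < first["start"]:
        # Unexpected: nothing before `previous_sync_end` should have been
        # scheduled, so a gap here indicates a bug upstream.
        raise ValueError(
            f"previous_sync_end {previous_sync_end!r} precedes the first "
            f"synced interval {first!r}"
        )
    return max(previous_sync_end, first["end"])
```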
```python
else:
    return None
# Nothing has been synced so we don't advance
```
Shouldn't we have at least one slice here, since when we call `convert_from_sequential_state` we always create one?
Yes, true, but I didn't want to have a condition that wasn't handled in this method. Would it make more sense to you to have this throw an exception instead?
...n/airbyte_cdk/sources/streams/concurrent/state_converters/abstract_stream_state_converter.py (outdated; resolved)
Force-pushed f20dc58 → 99cd60f
I'm a bit confused, as there seems to be some code still fetching `stream_state["low_water_mark"]`. Is that expected? The last commit is very clean though, and I really like it.
airbyte-cdk/python/airbyte_cdk/sources/streams/concurrent/cursor.py (outdated; resolved)
```python
else:
    self._actual_sync_start = start
```
Should we log if `prev_sync_low_water_mark and prev_sync_low_water_mark < sync_start`?
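A minimal sketch of the suggested logging guard; the function and variable names are assumed for illustration:

```python
# Surface the surprising "low water mark earlier than sync start" case
# instead of silently rewinding.
import logging

logger = logging.getLogger("concurrent_cursor")

def resolve_actual_sync_start(prev_sync_low_water_mark, sync_start):
    if prev_sync_low_water_mark and prev_sync_low_water_mark < sync_start:
        logger.warning(
            "Previous sync's low water mark %s is earlier than the "
            "configured sync start %s; resuming from the low water mark.",
            prev_sync_low_water_mark,
            sync_start,
        )
        return prev_sync_low_water_mark
    return sync_start
```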
Force-pushed cdbaf23 → 77bebfe
Force-pushed c0cf6b7 → 8ebca1b
Force-pushed acd9c73 → 1d196ab
Force-pushed 7dd181d → 291c851
Two very small comments which I'll let you decide if it's worth changing. Good job on this very annoying time-based brain melting problem. Thanks Catherine!
```python
self.start, self._concurrent_state = self._get_concurrent_state(stream_state)

@property
def state(self) -> MutableMapping[str, Any]:
```
Should this be private? From a very quick look, I don't see this used outside of the class (which I think is very nice, as it means we don't expose a non-domain object like `MutableMapping`).
It will actually be used externally, as you'll see in the follow up Salesforce PR.
Can you show me where? I can't seem to identify this change on https://github.com/airbytehq/airbyte/pull/33522/files
Sorry that was out of date. It's there now.
```python
@abstractmethod
def deserialize(self, state: MutableMapping[str, Any]) -> MutableMapping[str, Any]:
    """
    Perform any transformations needed for compatibility with the converter.
    """
    ...

@abstractmethod
def get_sync_start(self, cursor_field: "CursorField", stream_state: MutableMapping[str, Any], start: Optional[Any]) -> Any:
```
nit: Is there a case where we would call `get_sync_start` without calling `convert_from_sequential_state`? Otherwise, I think I would merge the two to expose as few things as possible.
I like this idea! Updated to make this change.
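A hypothetical shape of the merged entry point: `convert_from_sequential_state` also returns the sync start, so `get_sync_start` no longer needs to be exposed separately. The names and the state layout below are illustrative assumptions, not the actual CDK change:

```python
# Single entry point returning both the sync start and the concurrent state.
from typing import Any, MutableMapping, Optional, Tuple

def convert_from_sequential_state(
    cursor_key: str,
    stream_state: MutableMapping[str, Any],
    start: Optional[Any],
) -> Tuple[Any, MutableMapping[str, Any]]:
    # Sync start: the prior cursor value if present, else the configured start.
    sync_start = stream_state.get(cursor_key, start)
    concurrent_state = {
        "state_type": "date-range",
        # One zero-width slice marks everything before sync_start as synced.
        "slices": [{"start": sync_start, "end": sync_start}],
    }
    return sync_start, concurrent_state
```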
```diff
             merged_end_time = max(last_end_time, interval[self.END_KEY])
             merged_intervals[-1][self.END_KEY] = merged_end_time
         else:
             merged_intervals.append(interval)

         return merged_intervals

-    def compare_intervals(self, end_time: Any, start_time: Any) -> bool:
+    def _compare_intervals(self, end_time: Any, start_time: Any) -> bool:
```
❤️
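The merging step above can be sketched end to end. This is a self-contained reduction to integer bounds for illustration; the real converter operates on timestamps and uses its own increment and comparison helpers:

```python
# Collapse overlapping or directly adjacent intervals into one.
from typing import Any, List, MutableMapping

def merge_intervals(
    intervals: List[MutableMapping[str, Any]]
) -> List[MutableMapping[str, Any]]:
    if not intervals:
        return []
    ordered = sorted(intervals, key=lambda i: i["start"])
    merged = [dict(ordered[0])]
    for interval in ordered[1:]:
        last = merged[-1]
        # Adjacency check: with integer bounds, end + 1 touching the next
        # start means there is no gap between the two intervals.
        if interval["start"] <= last["end"] + 1:
            last["end"] = max(last["end"], interval["end"])
        else:
            merged.append(dict(interval))
    return merged
```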
```python
slices = []

# Create a slice to represent the records synced during prior syncs.
# The start and end are the same to avoid confusion as to whether the records for this slice
```
❤️
Don't clear out `slices` when converting to sequential state
This reverts commit 04d6089.
Force-pushed 152d464 → 1578550