
Emit final state message for full refresh syncs and consolidate read flows #35622

Merged: 14 commits merged into master on Mar 5, 2024

Conversation

@brianjlai (Contributor) commented Feb 26, 2024

Closes https://github.com/airbytehq/airbyte-internal-issues/issues/2925

First pass at adding state for full refresh syncs: consolidating read flows, removing legacy state emission, adding integration tests, and fixing a lot of tests.

What

At a high level, this change makes it so that regardless of the sync mode a stream runs in, we always emit at least one state message upon successful completion of a sync. For incremental streams running as full refresh with valid record cursors, the final state message will be the last record cursor value observed. For full refresh streams, the final state message will be a dummy value {"sync_mode": "full_refresh"}.
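The rule above can be sketched as a small helper (a hypothetical distillation for illustration, not the actual CDK code; the function name is made up):

```python
from typing import Any, Mapping, Optional


def final_state_for_stream(last_record_cursor: Optional[Mapping[str, Any]]) -> Mapping[str, Any]:
    # Hypothetical helper: if the sync produced a usable cursor value (e.g. an
    # incremental stream running as full refresh), that becomes the final state;
    # otherwise a dummy placeholder is emitted so every sync still ends with state.
    return dict(last_record_cursor) if last_record_cursor else {"sync_mode": "full_refresh"}
```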

How

  • Removed all code related to connectors emitting legacy state. I confirmed with destinations that they no longer need this. Accepting incoming legacy state is still needed for connectors that have not been published since 2022
  • Incremental streams running as full refresh do not periodically checkpoint state, but will emit a final state w/ the stream defined cursor value
  • Full refresh streams do not checkpoint state and emit a placeholder value for final state
  • At the AbstractSource level, consolidates full refresh and incremental helper functions into a single one
  • At the Stream level, removes the read_full_refresh and read_incremental methods from the Stream interface. This would be classified as breaking. We could opt for just marking this as deprecated. However, this might be even more misleading because the AbstractSource flow is now using the Stream.read() method. I didn't see any usage of this in our repo beyond concurrent or file-based, but technically OSS customers could have done something like this.
  • CDK-level integration tests (mock server tests) using a full refresh stream, an incremental stream, and an incremental stream running in full refresh mode
  • This is already compatible with the synchronous (concurrency == None) file-based CDK because cursor state is managed regardless of sync_mode. I've tested this on an S3 pre-release where I disabled concurrency and verified state emission
  • The changes to file-based sources are removing legacy state and deleting the old read_incremental and read_full_refresh method implementations
  • The changes to concurrent sources is just the cleanup for removing legacy state and the interfaces. Full refresh state messages are not included in this change.
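The consolidated behavior described in the bullets above can be sketched as a toy generator (illustrative only; the names and shapes are simplified stand-ins for the real Stream.read() flow):

```python
def read(records, sync_mode, cursor_field=None, checkpoint_interval=2):
    # Toy version of the consolidated read flow (not the real CDK code):
    # periodic checkpoints happen only in incremental mode, and exactly one
    # final state message is always emitted at the end of a successful sync.
    state = {}
    for i, record in enumerate(records, start=1):
        if cursor_field and cursor_field in record:
            state = {cursor_field: record[cursor_field]}
        yield ("RECORD", record)
        if sync_mode == "incremental" and i % checkpoint_interval == 0:
            yield ("STATE", dict(state))
    # Streams with no cursor (pure full refresh) get a placeholder final state.
    yield ("STATE", state or {"sync_mode": "full_refresh"})
```

Note how an incremental stream run in full refresh mode skips the periodic checkpoints but still ends with a cursor-bearing state message.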

Testing and Validation

Integration test (mock server) cases:

  • Full refresh stream emits final state message
  • Incremental stream running as incremental emits multiple state messages
  • Incremental stream running as full refresh emits 1 state message
  • Legacy incremental stream using get_updated_state() as incremental updates state
  • Sync a catalog of two streams, one full refresh and one incremental
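The first case above can be approximated as a self-contained toy test (hypothetical helper names; the real tests use the CDK's mock server fixtures):

```python
def run_sync(records, sync_mode):
    # Hypothetical stand-in for invoking a source against a mocked API:
    # returns the emitted message stream as (type, payload) tuples.
    messages = [("RECORD", r) for r in records]
    if sync_mode == "full_refresh":
        final_state = {"sync_mode": "full_refresh"}
    else:
        final_state = {"cursor": records[-1]["id"]}
    messages.append(("STATE", final_state))
    return messages


# A full refresh sync should end with exactly one trailing state message.
messages = run_sync([{"id": 1}, {"id": 2}], "full_refresh")
state_messages = [m for m in messages if m[0] == "STATE"]
assert len(state_messages) == 1
assert state_messages[0][1] == {"sync_mode": "full_refresh"}
```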

Pre-release testing:

  • source-greenhouse: Tested combinations of full refresh only, incremental syncs, and both. Verified that state messages were emitted properly via extra logging and the final state in the database
  • S3 w/ concurrency disabled. Full refresh and incremental syncs
  • source-stripe: Sanity check that it still works, not expecting behavior to differ. I tested w/ some full refresh streams (which run concurrently) and some incremental streams whose behavior causes them to be excluded from the concurrent workflow

Recommended reading order

  1. abstract_source.py
  2. core.py
  3. connector_state_manager.py
  4. integration_source_fixture.py
  5. test_integration_abstract_source.py
  6. streams/concurrent/adapters.py
  7. file_based/stream/concurrent/adapters.py

🚨 User Impact 🚨

Are there any breaking changes? What is the end result perceived by the user?

We are deprecating two methods on the Stream interface: read_full_refresh() and read_incremental(). No sources in our repository override these methods, but OSS customers who've written a highly customized connector could be impacted. We've moved to a consolidated flow using the Stream.read() method going forward, and in order to use the current version of the CDK, they must migrate their connector to use this new method.

@octavia-squidington-iii octavia-squidington-iii added the CDK Connector Development Kit label Feb 26, 2024
if sync_mode == SyncMode.full_refresh:
# We use a dummy state if there is no suitable value provided by full_refresh streams that do not have a valid cursor.
# Incremental streams running full_refresh mode emit a meaningful state
stream_state = stream_state or {"sync_mode": "full_refresh"}
@brianjlai (author) commented Feb 27, 2024:

After testing on the platform: the platform will persist stream state regardless of the sync mode it's running in. It might actually be better to store {} rather than a bogus value that isn't usable.

The other alternative is a best attempt at a cursor using the emitted_at of the last record. My big reservation, and why I think this is a bad idea, is that it gives a false perception of an accurate cursor value, which is misleading for all our full refresh streams. And this value is not reliable because it's our internal clock, not the API's.

  slices = self.stream_slices(
      cursor_field=cursor_field,
-     sync_mode=SyncMode.incremental,
+     sync_mode=sync_mode,  # todo: change this interface to no longer rely on sync_mode for behavior
@brianjlai (author) commented Feb 28, 2024:

Leaving the todo was intentional. I don't think stream_slices() or read_records() should be designed to rely on a sync_mode, but I want to reduce the complexity and breaking changes in this one PR.

That being said, I didn't see any usage of sync_mode in the internal implementations of the connectors I glanced at.

@@ -21,7 +21,7 @@


  class SourceS3(FileBasedSource):
-     _concurrency_level = DEFAULT_CONCURRENCY
+     _concurrency_level = None
@brianjlai (author):

Note: i'll remove this before merging, but I turned this off to test that this worked on the platform for synchronous file-based sources

@brianjlai brianjlai marked this pull request as ready for review February 28, 2024 23:09
@brianjlai brianjlai requested a review from a team as a code owner February 28, 2024 23:09
@octavia-squidington-iv octavia-squidington-iv requested a review from a team February 28, 2024 23:10
@maxi297 (Contributor) left a comment:

I got one concern about emitting dummy state messages that could be re-used by the platform. Apart from that, my comments are mostly nits

stream_state = state_manager.get_stream_state(stream_name, stream_instance.namespace)

if stream_state and "state" in dir(stream_instance):
stream_instance.state = stream_state # type: ignore # we check that state in the dir(stream_instance)
Contributor:

Can the platform use the state {"sync_mode": "full_refresh"} to pass it to the source? I would fear some sources might break like this

Contributor:

+1. I think we should filter out the dummy state objects before setting stream states to ensure we only set valid states

@brianjlai (author):

Discussed offline: we'll use a sentinel value to indicate how this state was set so we can avoid triggering a bad setter value.

stream_state,
state_manager,
internal_config,
)

record_counter = 0
stream_name = configured_stream.stream.name
@brianjlai (author):

will remove!

),
)
return AirbyteMessage(type=MessageType.STATE, state=AirbyteStateMessage(data=dict(self._get_legacy_state())))
# According to the Airbyte protocol, the StreamDescriptor namespace field is not required. However, the platform will throw
Contributor:

Is this only for state messages? I don't see this logic being applied here

@brianjlai (author):

Honestly, I'm not sure. This comment was written when I did per-stream status 1.5 years ago, and it might've been fixed in the platform in the time since. What I can do is republish a pre-release version and test it with this streamlined version. Depending on how that goes, I may or may not update this.

@brianjlai (author):

I retested this on a pre-release and it looks like the original hack is no longer needed; a descriptor with namespace None works how we'd expect it to.

I'll clean this up and the test. Thanks for the suggestion!

state_value = (
airbyte_state_message.state.stream.stream_state.dict() if airbyte_state_message.state.stream else {}
)
logger.info(
Contributor:

I think this is a nice log to have, but I fear it might be quite noisy. Do we know of a source that usually has big syncs and a checkpoint_interval that could blast this log message a ton, so we can see whether this is acceptable?

@brianjlai (author):

Ah yeah, I plan to remove this after finishing all functional validations. It just helps me validate the behavior on the source side against what makes it to the DB. Thanks for the reminder!

- checkpoint = self._checkpoint_state(stream_state, state_manager, per_stream_state_enabled)
- yield checkpoint
+ airbyte_state_message = self._checkpoint_state(stream_state, state_manager)
+ state_value = airbyte_state_message.state.stream.stream_state.dict() if airbyte_state_message.state.stream else {}
Contributor:

nit: should we extract an accessor for state_value, since we seem to repeat the same logic multiple times? The maintenance cost is probably nil, though, as changing it would mean a breaking change in the Airbyte protocol lib

@brianjlai (author):

Actually, as per the earlier comment, I'm going to get rid of the logging statements, which were just used for testing. I only had this long string of accessors to populate the log statement, so we won't need the helper method.

return create_response_builder(
response_template=RESPONSE_TEMPLATE,
records_path=FieldPath("data"),
# pagination_strategy=StripePaginationStrategy()
Contributor:

nit: should we remove this?

def test_full_refresh_sync(self, http_mocker):
start_datetime = _NOW - timedelta(days=14)
config = {
"start_date": start_datetime.isoformat()[:-13]+"Z",
Contributor:

This is dangerous: if _NOW has microseconds == 0, this will give an unexpected format.

>>> datetime(2024, 1, 1, 2, 3, 4).isoformat()
'2024-01-01T02:03:04'

>>> datetime(2024, 1, 1, 2, 3, 4, 5).isoformat()
'2024-01-01T02:03:04.000005'

We should probably define the format we expect instead
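The pitfall and the explicit-format fix can be demonstrated with a fixed, hypothetical _NOW (the value here is made up for illustration):

```python
from datetime import datetime, timedelta

# isoformat() silently changes shape depending on whether microseconds are zero,
# so slicing a fixed number of trailing characters off the string is fragile:
assert datetime(2024, 1, 1, 2, 3, 4).isoformat() == "2024-01-01T02:03:04"
assert datetime(2024, 1, 1, 2, 3, 4, 5).isoformat() == "2024-01-01T02:03:04.000005"

# Spelling out the expected format avoids the ambiguity entirely:
_NOW = datetime(2024, 3, 1, 2, 3, 4)  # fixed value for illustration
start_date = (_NOW - timedelta(days=14)).strftime("%Y-%m-%dT%H:%M:%SZ")
```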


def _incremental_concurrent_stream(slice_to_partition_mapping, slice_logger, logger, message_repository, cursor):
stream = _concurrent_stream(slice_to_partition_mapping, slice_logger, logger, message_repository, cursor)
# stream.state = {"created_at": timestamp}
Contributor:

Is there a reason why this is commented out?

@brianjlai (author):

Me being careless, probably. Removed.

@@ -107,39 +107,24 @@ def get_error_display_message(self, exception: BaseException) -> Optional[str]:
"""
return None

- def read_full_refresh(
+ def read(  # type: ignore  # ignoring typing for ConnectorStateManager because of circular dependencies
Contributor:

nit: an alternative is to only import the type when running mypy

example:

if typing.TYPE_CHECKING:
    from airbyte_cdk.sources import Source
    from airbyte_cdk.sources.streams.availability_strategy import AvailabilityStrategy


if not has_slices:
# Safety net to ensure we always emit at least one state message even if there are no slices
Contributor:

nit: we should update this comment

@octavia-squidington-iii octavia-squidington-iii removed the area/connectors Connector related issues label Mar 4, 2024
@girarda (Contributor) left a comment:

:shipit: !

@@ -31,6 +31,8 @@

JsonSchema = Mapping[str, Any]

FULL_REFRESH_SENTINEL_STATE_KEY = "__ab_full_refresh_state_message"
Contributor:

let's add a comment explaining why this is needed
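One way the sentinel could be used downstream (a hypothetical helper for illustration; only the key name comes from this PR):

```python
FULL_REFRESH_SENTINEL_STATE_KEY = "__ab_full_refresh_state_message"


def is_resumable_state(stream_state):
    # State blobs carrying the sentinel key were emitted as full refresh
    # placeholders and should not be fed back into a stream's state setter.
    return bool(stream_state) and FULL_REFRESH_SENTINEL_STATE_KEY not in stream_state
```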

@brianjlai brianjlai merged commit ef98194 into master Mar 5, 2024
26 of 27 checks passed
@brianjlai brianjlai deleted the brian/emit_state_messages_for_full_refresh branch March 5, 2024 06:05
sentry-io bot commented Mar 6, 2024:

This pull request was deployed and Sentry observed suspect issues.

Labels: CDK (Connector Development Kit), connectors/source/s3