File-based CDK: make incremental syncs concurrent #34540

clnoll · 2024-01-25T20:08:35Z

Provides the ability for connector developers to make incremental syncs concurrent, by using the FileBasedConcurrentCursor.

This has largely the same logic as the DefaultFileBasedCursor, with the exception of the cursor value. In DefaultFileBasedCursor, the cursor is set to the most recently synced file. Because we sync files in order, this is an appropriate value. To handle the edge case where a file with an older last-modified time finishes uploading between syncs, we don't simply resume from this cursor value during the next sync; instead, we look back some amount of time.
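
As a rough illustration of the look-back behavior described above, here is a minimal sketch. The three-day window and the function name `compute_start_time` are assumptions made for the example; the description only says "some amount of time".

```python
from datetime import datetime, timedelta

# Hypothetical look-back window; the actual duration is not stated in this PR.
LOOKBACK = timedelta(days=3)


def compute_start_time(cursor_value: datetime, lookback: timedelta = LOOKBACK) -> datetime:
    """Return the earliest last-modified time to consider on the next sync."""
    return cursor_value - lookback


# Cursor left behind by the previous sync (most recently synced file).
cursor_value = datetime(2024, 1, 25, 12, 0, 0)
start = compute_start_time(cursor_value)

# A file that is older than the cursor but finished uploading between syncs.
late_file_last_modified = datetime(2024, 1, 24, 9, 0, 0)

# It is older than the cursor, yet still inside the look-back window.
picked_up = late_file_last_modified >= start
```

Resuming strictly from `cursor_value` would have skipped the late file; starting from `start` keeps it in scope.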

In the FileBasedConcurrentCursor, because we don't know the order in which the files are being synced, we have to keep track of the low water mark instead: the time before which all files have been synced. To do this, we keep track of all pending files. When a file has been synced, we pop it off the list of pending files. This way, to determine the low water mark, we can simply look at the oldest pending file. Note: in cases where the sync doesn't complete, the file behind the cursor value will not be in history (because it's not put into history until it's popped out of the pending list). However, since we still check whether the file is in history before deciding whether to sync it based on the cursor value, we will not skip the file.
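
The pending-file bookkeeping can be sketched roughly as follows. `PendingFileTracker` is a hypothetical stand-in, not the CDK class; the method names `set_pending_files` and `add_file` are borrowed from the review discussion, and falling back to the newest synced file once nothing is pending is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, Optional


@dataclass
class PendingFileTracker:
    """Hypothetical stand-in for the pending-file bookkeeping described above."""

    pending: Dict[str, datetime] = field(default_factory=dict)  # uri -> last_modified
    history: Dict[str, datetime] = field(default_factory=dict)  # files already synced

    def set_pending_files(self, files: Dict[str, datetime]) -> None:
        self.pending = dict(files)

    def add_file(self, uri: str) -> None:
        # A synced file leaves the pending set and only then enters history.
        self.history[uri] = self.pending.pop(uri)

    def low_water_mark(self) -> Optional[datetime]:
        # Oldest pending file if any remain; otherwise (assumption) the newest synced file.
        if self.pending:
            return min(self.pending.values())
        if self.history:
            return max(self.history.values())
        return None


tracker = PendingFileTracker()
tracker.set_pending_files(
    {
        "a.csv": datetime(2024, 1, 1),
        "b.csv": datetime(2024, 1, 2),
        "c.csv": datetime(2024, 1, 3),
    }
)

tracker.add_file("c.csv")  # the newest file happens to finish first
mark_after_c = tracker.low_water_mark()  # still held back by a.csv

tracker.add_file("a.csv")
mark_after_a = tracker.low_water_mark()  # now held back by b.csv
```

Because the mark is always the oldest pending file, files finishing out of order never move the cursor past work that is still outstanding.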

TODO

Test against S3.

maxi297

I think the file_based_concurrent_cursor code makes sense, but I would like to come back to it later with some added context.

maxi297 · 2024-01-26T14:07:14Z

.../python/unit_tests/sources/file_based/stream/concurrent/test_file_based_concurrent_cursor.py

+    expected_cursor_value: str,
+):
+    cursor = _make_cursor(initial_state)
+    cursor._pending_files = {uri: RemoteFile(uri=uri, last_modified=datetime.strptime(timestamp, DATE_TIME_FORMAT)) for uri, timestamp in pending_files}


I don't like tests that set up internal parameters and assert on them because:

They do not describe very well how the user would use the object and what they can expect from it, since users should not use internal fields. As a reviewer, I saw a relatively high number of lines added in the PR, so I wanted to see how this code would be used by checking the tests, and I'm still unsure how it should be used after reading this one.

They produce tests that can go red even though the behavior under test still works. For example, if we change the internals of the cursor because we want to add a feature, this test might break even though the behavior still works fine.

My not liking this should not be a reason to change it, but do the reasons make sense to you? If so, how can we change the test so that it represents user usage?

Would it address your concern if I called set_pending_files instead of setting cursor._pending_files?

If it represents the user usage, yes! I would also challenge the assertions though as the user would probably not call cursor._pending_files as well

I would also challenge the assertions though as the user would probably not call cursor._pending_files as well

True, but I think I'm okay with that in this case because the assertion is there to ensure that the internal state has been set up appropriately when add_file is called.
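
To make the trade-off in this thread concrete, here is a hedged sketch of a test that drives the cursor only through public methods. `SketchCursor` and `get_cursor_value` are hypothetical names invented for the example; `set_pending_files` and `add_file` are the method names mentioned in the comments above, but their real signatures may differ.

```python
from datetime import datetime
from typing import Dict, Optional


class SketchCursor:
    """Hypothetical minimal cursor used only to illustrate the testing style."""

    def __init__(self) -> None:
        self._pending_files: Dict[str, datetime] = {}

    def set_pending_files(self, files: Dict[str, datetime]) -> None:
        self._pending_files = dict(files)

    def add_file(self, uri: str) -> None:
        # Marking a file as synced removes it from the pending set.
        self._pending_files.pop(uri)

    def get_cursor_value(self) -> Optional[datetime]:
        # Hypothetical public accessor: the oldest pending file, or None if nothing is pending.
        return min(self._pending_files.values(), default=None)


def test_cursor_value_tracks_oldest_pending_file() -> None:
    cursor = SketchCursor()
    # Arrange and act through the public interface only.
    cursor.set_pending_files({"a.csv": datetime(2024, 1, 1), "b.csv": datetime(2024, 1, 2)})
    cursor.add_file("a.csv")
    # Assert on observable behavior rather than on cursor._pending_files.
    assert cursor.get_cursor_value() == datetime(2024, 1, 2)


test_cursor_value_tracks_oldest_pending_file()
```

A test written this way keeps passing if the internal container changes, which is the concern raised in the review; asserting on `_pending_files` directly remains a reasonable choice when the goal is specifically to pin down the internal state.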

airbyte-cdk/python/airbyte_cdk/sources/file_based/stream/default_file_based_stream.py

maxi297 · 2024-01-26T14:36:22Z

airbyte-cdk/python/airbyte_cdk/sources/file_based/stream/concurrent/adapters.py

    ) -> "FileBasedStreamFacade":
        """
        Create a ConcurrentStream from a FileBasedStream object.
        """
        pk = get_primary_key_from_stream(stream.primary_key)
        cursor_field = get_cursor_field_from_stream(stream)
+        stream.cursor = cursor


I have a couple of questions regarding this line:

AbstractFileBasedStream does not seem to have a cursor field. Am I missing something?

Could stream.cursor be set at the init time of the stream? It feels like this makes the stream stateful which adds complexity to the code

We definitely could (should?) init AbstractFileBasedStream with a cursor, but with the way the code is now it would effectively still be None.

The problem here is that the ConnectorStateManager requires us to give it a mapping of stream name to stream (on init), but the cursor requires a state manager. So we can either init the stream with the cursor and set the state manager on the cursor separately, or init the cursor with the state manager and set it on the stream separately. Do you think it would be preferable to set the state manager on the cursor separately?

That's unfortunate.

Looking at how the connector state manager uses the stream name to instance mapping, I think it only needs an AirbyteStream, not the actual stream instance, so I think the path forward should be:

Modify the state manager so it depends on AirbyteStreams instead of on Streams

Instantiate the state manager before instantiating the cursor

Instantiate the cursor from the state manager

@girarda that doesn't quite work either because we get the AirbyteStreams from the catalog, but we don't always have a catalog - e.g. during check.

It looks like the only pieces of information the state manager actually needs are the name and namespace, which I can get off of the config. WDYT about that modification? My worry is that we might eventually want/need other info in the ConnectorStateManager.

Isn't the namespace also defined in the configured_catalog?

The check operation shouldn't need the namespace or the state manager. Maybe the state manager should only be instantiated for reads.

That would mean having different behavior in the streams method depending on who the caller is, which I don't love. But that seems like a reasonable tradeoff to be able to init the stream with a cursor.
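
The ordering problem discussed in this thread can be sketched as follows. All class names here (`StreamDescriptor`, `SketchStateManager`, `SketchCursor`, `SketchStream`) are hypothetical and only illustrate the idea that a state manager keyed on name and namespace can be built before any stream exists; this is not the CDK's actual API.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class StreamDescriptor:
    """Hypothetical key: the only stream information the state manager needs."""

    name: str
    namespace: Optional[str] = None


class SketchStateManager:
    def __init__(self, descriptors: List[StreamDescriptor]) -> None:
        # No stream instances are required here, so there is no circular dependency.
        self._state: Dict[StreamDescriptor, Dict[str, Any]] = {d: {} for d in descriptors}

    def update(self, descriptor: StreamDescriptor, state: Dict[str, Any]) -> None:
        self._state[descriptor] = state

    def get(self, descriptor: StreamDescriptor) -> Dict[str, Any]:
        return self._state[descriptor]


class SketchCursor:
    def __init__(self, descriptor: StreamDescriptor, state_manager: SketchStateManager) -> None:
        self._descriptor = descriptor
        self._state_manager = state_manager

    def emit(self, state: Dict[str, Any]) -> None:
        self._state_manager.update(self._descriptor, state)


class SketchStream:
    def __init__(self, descriptor: StreamDescriptor, cursor: SketchCursor) -> None:
        # The cursor is passed at init time, so the stream is never mutated afterwards.
        self.descriptor = descriptor
        self.cursor = cursor


# Construction order: state manager, then cursor, then stream.
descriptor = StreamDescriptor(name="files")
manager = SketchStateManager([descriptor])
cursor = SketchCursor(descriptor, manager)
stream = SketchStream(descriptor, cursor)

cursor.emit({"cursor": "2024-01-25T00:00:00Z"})
```

With this ordering, the `stream.cursor = cursor` assignment questioned above is no longer needed, at the cost of the state manager knowing less about each stream.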

airbyte-cdk/python/airbyte_cdk/sources/file_based/stream/concurrent/adapters.py

airbyte-cdk/python/airbyte_cdk/sources/streams/concurrent/default_stream.py

airbyte-cdk/python/airbyte_cdk/sources/file_based/stream/concurrent/adapters.py

maxi297 · 2024-01-26T14:48:16Z

airbyte-cdk/python/airbyte_cdk/sources/file_based/stream/concurrent/adapters.py

        self._cursor.set_pending_partitions(pending_partitions)
+        if not pending_partitions:


I think this is a concern shared by all concurrent sources and hence should be fixed for every concurrent case by the action item from this discussion. Does that make sense?

Makes sense to me, I'll keep an eye out for your change.

Should this be removed following #34605?

Yes, good call. Removed.

...thon/airbyte_cdk/sources/file_based/stream/concurrent/cursor/file_based_concurrent_cursor.py

airbyte-cdk/python/airbyte_cdk/sources/concurrent_source/concurrent_source_adapter.py

maxi297

Checking the PR, I can't spot any logical issues so I think I'm fine with approving the PR as is. However, I feel we should be going in a different direction eventually. My concerns are:

Given the lock on set_pending_partitions and the fact that it supports only one call, it seems like the partition generation for a specific stream/cursor can't be concurrent (not on the horizon now but this would be an added blocker to enabling that)
There is a lot of work to maintain _ab_source_file_last_modified which I don't see much value in except from helping debug
The cursor basically seems like per-partition with a limit of partitions (correct me if I'm wrong). If we were to implement this in a generic manner, it feels like many sources could leverage this

maxi297 · 2024-02-05T16:00:59Z

...cdk/python/airbyte_cdk/sources/file_based/stream/concurrent/cursor/file_based_noop_cursor.py

+    def close_partition(self, partition: "FileBasedStreamPartition") -> None:
+        pass
+
+    def set_pending_partitions(self, partitions: Dict[str, "FileBasedStreamPartition"]) -> None:


nit: The type here does not match the parent

...thon/airbyte_cdk/sources/file_based/stream/concurrent/cursor/file_based_concurrent_cursor.py

clnoll

"Given the lock on set_pending_partitions and the fact that it supports only one call, it seems like the partition generation for a specific stream/cursor can't be concurrent (not on the horizon now but this would be an added blocker to enabling that)"
This is an interesting point, but as of right now file-based streams don't support substreams, and I'm not sure in what other scenario we'd want partition generation to be concurrent. So I'm okay punting until we need it, especially since this implementation is pretty consistent with the non-concurrent cursor.

"There is a lot of work to maintain _ab_source_file_last_modified which I don't see much value in except from helping debug"
This actually has another role: as the sync progresses, it gets bumped up when all files before it have been synced. If history is full because we synced a bunch of newer files first, but there are a bunch of older files that haven't been synced yet, the cursor value should reflect the fact that we need to go back and sync those older files.

"The cursor basically seems like per-partition with a limit of partitions (correct me if I'm wrong). If we were to implement this in a generic manner, it feels like many sources could leverage this"
Interesting thought, I could see that.
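
A simplified sketch of why the cursor value matters when newer files land in history first. The function `should_sync` and its two rules are assumptions for illustration, loosely following the PR description (check history first, then fall back on the cursor value); the real decision logic in the CDK may include more cases.

```python
from datetime import datetime
from typing import Dict


def should_sync(
    uri: str,
    last_modified: datetime,
    history: Dict[str, datetime],
    cursor_value: datetime,
) -> bool:
    """Hypothetical decision rule: history first, then the cursor value."""
    if uri in history:
        # Already synced: only re-sync if the file changed since then.
        return last_modified > history[uri]
    # Not in history: sync anything at or after the low water mark.
    return last_modified >= cursor_value


# The previous sync finished two newer files before it was interrupted.
history = {"new_1.csv": datetime(2024, 2, 1), "new_2.csv": datetime(2024, 2, 2)}

# The cursor value stayed at the oldest file that was still pending.
cursor_value = datetime(2024, 1, 10)

resync_old = should_sync("old.csv", datetime(2024, 1, 10), history, cursor_value)
resync_new = should_sync("new_1.csv", datetime(2024, 2, 1), history, cursor_value)
```

Here the older, unsynced file is picked up again while the already-synced newer file is skipped. Had the cursor been set to the newest synced file instead, `old.csv` would have fallen below it.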

...thon/airbyte_cdk/sources/file_based/stream/concurrent/cursor/file_based_concurrent_cursor.py

clnoll requested a review from a team as a code owner January 25, 2024 20:08

octavia-squidington-iii added the area/connectors, CDK, and connectors/source/s3 labels Jan 25, 2024

clnoll force-pushed the fcdk-with-ccdk-incremental branch from dc60267 to 77c09d4 Compare January 25, 2024 20:09

octavia-squidington-iii removed the area/connectors label Jan 25, 2024

clnoll requested review from girarda and maxi297 January 25, 2024 20:11

maxi297 reviewed Jan 26, 2024

octavia-squidington-iv requested review from a team January 26, 2024 15:18

clnoll force-pushed the fcdk-with-ccdk branch from 01b2cc5 to fbce5ce Compare January 28, 2024 19:30

girarda reviewed Jan 28, 2024

airbyte-cdk/python/airbyte_cdk/sources/concurrent_source/concurrent_source_adapter.py

clnoll force-pushed the fcdk-with-ccdk-incremental branch from d117085 to 17423b7 Compare January 29, 2024 15:21

octavia-squidington-iii added the area/connectors label Jan 29, 2024

clnoll force-pushed the fcdk-with-ccdk branch from c9b2752 to afb9578 Compare January 30, 2024 00:14

Base automatically changed from fcdk-with-ccdk to master January 30, 2024 00:33

clnoll mentioned this pull request Jan 31, 2024

Source Azure Table: update use of ConnectorStateManager #34716

Closed

clnoll force-pushed the fcdk-with-ccdk-incremental branch from 17423b7 to aceeb2f Compare January 31, 2024 21:21

clnoll requested review from girarda and maxi297 February 1, 2024 20:08

octavia-squidington-iii removed the area/connectors label Feb 1, 2024

clnoll force-pushed the fcdk-with-ccdk-incremental branch 2 times, most recently from a3cdfdf to 8a99006 Compare February 1, 2024 20:37

maxi297 approved these changes Feb 5, 2024

clnoll force-pushed the fcdk-with-ccdk-incremental branch from 8a99006 to bd8908c Compare February 6, 2024 00:49

clnoll commented Feb 6, 2024

clnoll mentioned this pull request Feb 6, 2024

Source S3: run incremental syncs with concurrency #34895

Merged

File-based CDK: add option to make incremental syncs concurrent

2d0ecd9

clnoll force-pushed the fcdk-with-ccdk-incremental branch from d5c7989 to 2d0ecd9 Compare February 8, 2024 01:15

clnoll merged commit e8910e4 into master Feb 8, 2024
22 checks passed

clnoll deleted the fcdk-with-ccdk-incremental branch February 8, 2024 01:41

clnoll mentioned this pull request Feb 8, 2024

CDK: allow ConnectorStateManager stream_instance_map to take ConfiguredAirbyteStream or Stream #35000

Merged

xiaohansong pushed a commit that referenced this pull request Feb 13, 2024

File-based CDK: make incremental syncs concurrent (#34540)

4f740d9

jatinyadav-cc pushed a commit to ollionorg/datapipes-airbyte that referenced this pull request Feb 21, 2024

File-based CDK: make incremental syncs concurrent (airbytehq#34540)

8407296

jatinyadav-cc pushed a commit to ollionorg/datapipes-airbyte that referenced this pull request Feb 26, 2024

File-based CDK: make incremental syncs concurrent (airbytehq#34540)

f8567de

jatinyadav-cc pushed a commit to ollionorg/datapipes-airbyte that referenced this pull request Feb 26, 2024

File-based CDK: make incremental syncs concurrent (airbytehq#34540)

c5dd63d

jatinyadav-cc pushed a commit to ollionorg/datapipes-airbyte that referenced this pull request Feb 26, 2024

File-based CDK: make incremental syncs concurrent (airbytehq#34540)

ee3fdd0

xiaohansong pushed a commit that referenced this pull request Feb 27, 2024

File-based CDK: make incremental syncs concurrent (#34540)

d82f06e
