
concurrent cdk: improve resource usage and stop waiting on the main thread #33669

Merged: 57 commits into master from alex/concurrent_no_wait_main on Jan 18, 2024

Conversation

@girarda (Contributor) commented Dec 20, 2023

What

  • A Stripe connection has been experiencing OOM failures.
  • This was caused by the concurrent_read_processor keeping all partitions in memory.

Two more issues were uncovered along the way:

  • While the partition enqueuer and the partition reader should back off, the main thread shouldn't, because it should process records as fast as possible.
    • Instead of backing off on the main thread, we can back off on the task threads. PartitionEnqueuer and PartitionReader now wait before enqueuing elements if there are too many futures in the threadpool's task list.
  • futures.exception() is a blocking call. Since it was called on the main thread, pruning the futures blocked the main thread until the futures completed. This is resolved by checking whether a future raised an exception only once it has completed (see the sketch below).
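
A minimal sketch of that non-blocking check, assuming a plain list of futures (the helper name prune_completed_futures is illustrative, not the CDK's actual API):

from concurrent.futures import Future
from typing import Any, List


def prune_completed_futures(futures: List[Future[Any]]) -> None:
    """Drop completed futures from the list without ever blocking the caller."""
    for future in list(futures):
        if not future.done():
            continue  # skip still-running futures instead of blocking on exception()
        exception = future.exception()  # non-blocking because the future is already done
        if exception:
            raise exception
        futures.remove(future)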

How

  • Instead of keeping all partitions in memory and checking whether they are done, only keep the partitions that are currently running and remove them when they complete.
  • Before putting new items on the queue, check whether we're already at the maximum number of futures; if we are, wait. The queue was wrapped in a ThrottledQueue for ease of use (see the sketch below).
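
A minimal sketch of the throttling idea, assuming the worker threads can see the thread pool's list of pending futures; the class names mirror the ones introduced in this PR, but the bodies are illustrative rather than the CDK's actual implementation:

import time
from concurrent.futures import Future
from queue import Queue
from typing import Any, List


class Throttler:
    def __init__(self, futures_list: List[Future[Any]], sleep_time: float, max_concurrent_tasks: int):
        self._futures_list = futures_list
        self._sleep_time = sleep_time
        self._max_concurrent_tasks = max_concurrent_tasks

    def wait_and_acquire(self) -> None:
        # Block the calling worker thread (never the main thread) while the
        # thread pool already has too many pending tasks.
        while len(self._futures_list) >= self._max_concurrent_tasks:
            time.sleep(self._sleep_time)


class ThrottledQueue:
    def __init__(self, queue: Queue, throttler: Throttler, timeout: float):
        self._queue = queue
        self._throttler = throttler
        self._timeout = timeout

    def put(self, item: Any) -> None:
        # Workers wait here, applying back pressure before enqueuing.
        self._throttler.wait_and_acquire()
        self._queue.put(item)

    def get(self) -> Any:
        # The main thread reads without throttling so it drains the queue quickly.
        return self._queue.get(block=True, timeout=self._timeout)

The key point is that put() runs on the partition enqueuer and partition reader threads while get() runs on the main thread, so only the worker threads ever sleep.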

Recommended reading order

  1. airbyte-cdk/python/airbyte_cdk/sources/concurrent_source/concurrent_read_processor.py
  2. airbyte-cdk/python/airbyte_cdk/sources/concurrent_source/thread_pool_manager.py
  3. airbyte-cdk/python/airbyte_cdk/sources/concurrent_source/throttler.py
  4. airbyte-cdk/python/airbyte_cdk/sources/streams/concurrent/partitions/throttled_queue.py


@octavia-squidington-iii added the area/connectors and CDK (Connector Development Kit) labels Dec 20, 2023
Contributor
Before Merging a Connector Pull Request

Wow! What a great pull request you have here! 🎉

To merge this PR, ensure the following has been done/considered for each connector added or updated:

  • PR name follows PR naming conventions
  • Breaking changes are considered. If a Breaking Change is being introduced, ensure an Airbyte engineer has created a Breaking Change Plan.
  • Connector version has been incremented in the Dockerfile and metadata.yaml according to our Semantic Versioning for Connectors guidelines
  • You've updated the connector's metadata.yaml file with any other relevant changes, including a breakingChanges entry for major version bumps. See metadata.yaml docs
  • Secrets in the connector's spec are annotated with airbyte_secret
  • All documentation files are up to date. (README.md, bootstrap.md, docs.md, etc...)
  • Changelog updated in docs/integrations/<source or destination>/<name>.md with an entry for the new version. See changelog example
  • Migration guide updated in docs/integrations/<source or destination>/<name>-migrations.md with an entry for the new version, if the version is a breaking change. See migration guide example
  • If set, you've ensured the icon is present in the platform-internal repo. (Docs)

If the checklist is complete, but the CI check is failing,

  1. Check for hidden checklists in your PR description

  2. Toggle the github label checklist-action-run on/off to re-run the checklist CI.

for record in parent_records:
    self.logger.info(f"Fetching parent stream slices for stream {self.name}.")
    yield {"parent": record}
# def stream_slices(
Contributor Author

The Stripe connector was modified to add log statements. This had a significant impact on the connector's performance because it logs one line per parent record; this should be reverted before updating the connector again.

@@ -32,6 +37,8 @@ def process_partition(self, partition: Partition) -> None:
    """
    try:
        for record in partition.read():
            while self._queue.qsize() > self._max_size:
Contributor Author

This should have a maximum number of attempts.
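
A hedged sketch of what a bounded wait could look like (max_attempts and the TimeoutError are hypothetical additions; the PR itself only notes that the bound is missing):

import time
from typing import Callable


def wait_for_queue_capacity(qsize: Callable[[], int], max_size: int, sleep_time: float, max_attempts: int = 100) -> None:
    # Hypothetical bounded version of the busy-wait above: give up after
    # max_attempts checks instead of potentially spinning forever.
    for _ in range(max_attempts):
        if qsize() <= max_size:
            return
        time.sleep(sleep_time)
    raise TimeoutError("Queue did not drain below its maximum size in time")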

@@ -34,6 +39,8 @@ def generate_partitions(self, stream: AbstractStream) -> None:
    """
    try:
        for partition in stream.generate_partitions():
            while self._queue.qsize() >= self._max_size:
Contributor Author

This should have a maximum number of attempts.

Contributor

Can we explain why this is needed on top of the priority? It's not so obvious why both would be needed

if len(futures) < self._max_concurrent_tasks:
    break
self._logger.info("Main thread is sleeping because the task queue is full...")
time.sleep(self._sleep_time)
Contributor Author

avoid waiting on the main thread
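
A hedged before/after sketch of the change being called out here; the function names are illustrative, not the ThreadPoolManager's actual API:

import time
from concurrent.futures import Future
from typing import Any, List


def prune_with_wait(futures: List[Future[Any]], max_concurrent_tasks: int, sleep_time: float) -> None:
    # Old behaviour: the main thread sleeps until enough futures have completed.
    while True:
        futures[:] = [f for f in futures if not f.done()]
        if len(futures) < max_concurrent_tasks:
            break
        time.sleep(sleep_time)


def prune_without_wait(futures: List[Future[Any]]) -> None:
    # New behaviour: drop completed futures and return immediately; back
    # pressure is applied by the worker threads before they enqueue instead.
    futures[:] = [f for f in futures if not f.done()]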

Contributor

@maxi297 left a comment

In general, the idea of waiting in the child threads instead of the main one makes sense to me. I'll continue thinking about this until our sync

class QueueItemObject:
    def __init__(self, item: QueueItem):
        self.value = item
        self._order = {
Contributor

nit: Could be a constant
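
A hypothetical illustration of that nit, lifting the ordering map into a module-level constant; the concrete types and priorities below are placeholders, not the CDK's real QueueItem types:

from typing import Any, Mapping

# Placeholder priorities; the real mapping would list the CDK's own QueueItem types.
_ORDER_BY_TYPE: Mapping[type, int] = {
    str: 0,
    dict: 1,
}


class QueueItemObject:
    def __init__(self, item: Any):
        self.value = item
        # Looking up a shared constant avoids rebuilding the dict per instance.
        self._order = _ORDER_BY_TYPE.get(type(item), len(_ORDER_BY_TYPE))

    def __lt__(self, other: "QueueItemObject") -> bool:
        # PriorityQueue compares items with <, so a lower order is dequeued first.
        return self._order < other._order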


@@ -34,7 +39,9 @@ def generate_partitions(self, stream: AbstractStream) -> None:
    """
    try:
        for partition in stream.generate_partitions():
            self._queue.put(partition)
Contributor

It feels like there is some logic associated with how we access the queue. I wonder if it would make sense to regroup this in a class. We have similar logic in PartitionReader for example

Contributor Author

for sure! I actually ran into another bottleneck preventing the sync from succeeding in a reasonable amount of time. I'll clean this up before asking for a formal review

actual_partitions.append(partition)

assert actual_partitions == partitions
assert queue.put.has_calls([call(p) for p in partitions] + [call(PartitionGenerationCompletedSentinel(stream))])
Contributor Author

verify that the partitions and the sentinel were put on the queue

@@ -1,7 +1,7 @@
#
# Copyright (c) 2023 Airbyte, Inc., all rights reserved.
#

from queue import Queue
Contributor Author

unused import


def __init__(self, futures_list: List[Future[Any]], sleep_time: float, max_concurrent_tasks: int):
    """
    :param futures_list: The list of futures to monitor
Contributor

Maybe this is obvious, but why do we need to throttle workers enqueuing items based on the number of pending tasks? I'm wondering if we can solve some of these memory issues by setting queue.maxsize, which prevents items from being added until something else is dequeued, creating a sort of back pressure.

Contributor Author

Good question!

The issue I ran into with blocking on the size of the queue is that the main thread will remove elements from the queue before potentially adding them to the list of futures. Since the main thread doesn't wait, it'll be able to remove items from the queue even if the list of futures is large, so the tasks won't wait.
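
A small illustration of that point under toy assumptions (this stand-in loop is not the real concurrent_read_processor): even with queue.maxsize set, the main thread frees queue slots immediately and resubmits the work, so the backlog just moves from the queue to the futures list.

import time
from concurrent.futures import ThreadPoolExecutor
from queue import Queue

queue: Queue = Queue(maxsize=10)  # bounded queue
pool = ThreadPoolExecutor(max_workers=2)
futures = []


def produce() -> None:
    for i in range(100):
        queue.put(i)  # only blocks while the queue itself is full
    queue.put(None)


pool.submit(produce)
while (item := queue.get()) is not None:
    # The main thread never waits on the futures list, so it keeps freeing
    # queue slots; the pending work piles up in `futures` instead.
    futures.append(pool.submit(time.sleep, 0.01))

print(len(futures))  # roughly 100 tasks despite maxsize=10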


Contributor

Cool, thanks for the clarification! I think we should document the reason that we need this. Mind adding a docstring on this file?
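
One possible shape for such a docstring for throttler.py, based on the explanation above (the wording is only a suggestion):

"""
Throttles the rate at which worker threads enqueue new work.

Bounding the queue size alone is not enough: the main thread dequeues items
immediately and turns them into futures, so memory pressure would simply move
from the queue to the thread pool's task list. Instead, workers call
wait_and_acquire() before enqueuing and sleep while the number of pending
futures exceeds the configured maximum, keeping the main thread free to
process records as fast as possible.
"""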

Contributor

@clnoll left a comment

Looks good @girarda! Just one small documentation request.

Would you also mind doing some memory profiling of this branch versus master to show that it fixes the problem?
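
For reference, one minimal way to capture peak memory on each branch with the standard library (run_sync below is a placeholder for however the read is launched locally; a dedicated profiler such as memray would work just as well):

import tracemalloc


def run_sync() -> None:
    # Placeholder: launch the concurrent source read being profiled here.
    pass


tracemalloc.start()
run_sync()
current, peak = tracemalloc.get_traced_memory()
print(f"current={current / 2**20:.1f} MiB, peak={peak / 2**20:.1f} MiB")
tracemalloc.stop()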

for record in parent_records:
    self.logger.info(f"Fetching parent stream slices for stream {self.name}.")
    yield {"parent": record}
# def stream_slices(
Contributor Author

TODO @girarda: revert this change before merging

    self._sleep_time = sleep_time
    self._max_concurrent_tasks = max_concurrent_tasks

def wait_and_acquire(self) -> None:
Contributor

Should we have unit tests for that?

Contributor Author

yes, done!
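
A sketch of what such tests could look like, assuming a Throttler class lives in the throttler.py module from the reading order, takes the constructor arguments shown above, and calls time.sleep via "import time" (all of which may differ from the final code):

from unittest import mock
from unittest.mock import Mock

from airbyte_cdk.sources.concurrent_source.throttler import Throttler


def test_wait_and_acquire_returns_immediately_when_under_the_limit():
    throttler = Throttler(futures_list=[], sleep_time=0.1, max_concurrent_tasks=5)
    with mock.patch("time.sleep") as sleep_mock:
        throttler.wait_and_acquire()
    sleep_mock.assert_not_called()


def test_wait_and_acquire_sleeps_while_too_many_tasks_are_pending():
    throttler = Throttler(futures_list=[Mock(), Mock()], sleep_time=0.1, max_concurrent_tasks=1)
    # The futures list never shrinks in this test, so interrupt the loop after two sleeps.
    with mock.patch("time.sleep", side_effect=[None, InterruptedError]) as sleep_mock:
        try:
            throttler.wait_and_acquire()
        except InterruptedError:
            pass
    assert sleep_mock.call_count == 2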

    self._throttler = throttler
    self._timeout = timeout

def put(self, item: QueueItem) -> None:
Contributor

Should we have unit tests for that?

Contributor Author

yes, done!
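
And similarly for ThrottledQueue.put, assuming the constructor takes the wrapped queue, the throttler, and a timeout as suggested by the snippet above (module path taken from the recommended reading order; the exact signature is an assumption):

from unittest.mock import Mock

from airbyte_cdk.sources.streams.concurrent.partitions.throttled_queue import ThrottledQueue


def test_put_waits_on_the_throttler_before_enqueuing():
    queue, throttler = Mock(), Mock()
    throttled_queue = ThrottledQueue(queue, throttler, timeout=1)
    throttled_queue.put("an_item")
    throttler.wait_and_acquire.assert_called_once()
    queue.put.assert_called_once_with("an_item")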

@girarda commented Jan 17, 2024

@girarda merged commit 0faa69d into master Jan 18, 2024
25 checks passed
@girarda deleted the alex/concurrent_no_wait_main branch January 18, 2024 07:54
jatinyadav-cc pushed a commit to ollionorg/datapipes-airbyte that referenced this pull request Feb 26, 2024
Labels: CDK (Connector Development Kit), connectors/source/stripe