concurrent cdk: improve resource usage and stop waiting on the main thread #33669
Conversation
for record in parent_records:
    self.logger.info(f"Fetching parent stream slices for stream {self.name}.")
    yield {"parent": record}
# def stream_slices(
The Stripe connector was modified to add prints. This had a significant impact on the connector's performance because it prints one line per parent record. This should be reverted before updating the connector again.
@@ -32,6 +37,8 @@ def process_partition(self, partition: Partition) -> None:
        """
        try:
            for record in partition.read():
                while self._queue.qsize() > self._max_size:
This should have a max number of attempts.
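A minimal sketch of what a bounded wait could look like; `MAX_ATTEMPTS`, `SLEEP_SECONDS`, and the helper name are hypothetical, not taken from this diff:

```python
import time
from queue import Queue

MAX_ATTEMPTS = 100    # hypothetical bound; the real value would need tuning
SLEEP_SECONDS = 0.1   # hypothetical back-off interval

def wait_for_queue_capacity(queue: Queue, max_size: int) -> None:
    """Block until the queue drains to max_size or fewer items, giving up after MAX_ATTEMPTS."""
    for _ in range(MAX_ATTEMPTS):
        if queue.qsize() <= max_size:
            return
        time.sleep(SLEEP_SECONDS)
    raise RuntimeError(f"Queue still holds {queue.qsize()} items after {MAX_ATTEMPTS} attempts")
```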
@@ -34,6 +39,8 @@ def generate_partitions(self, stream: AbstractStream) -> None:
        """
        try:
            for partition in stream.generate_partitions():
                while self._queue.qsize() >= self._max_size:
This should have a max number of attempts.
Can we explain why this is needed on top of the priority? It's not so obvious why both would be needed.
if len(futures) < self._max_concurrent_tasks:
    break
self._logger.info("Main thread is sleeping because the task queue is full...")
time.sleep(self._sleep_time)
avoid waiting on the main thread
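The direction the PR takes instead is to move this wait off the main thread and into the worker threads. A minimal sketch of such a throttler, built around the constructor signature quoted later in this conversation; the loop body here is an assumption, not the actual implementation:

```python
import time
from concurrent.futures import Future
from typing import Any, List

class Throttler:
    """Blocks the calling (worker) thread while too many tasks are in flight,
    so the main thread never has to sleep."""

    def __init__(self, futures_list: List[Future[Any]], sleep_time: float, max_concurrent_tasks: int):
        self._futures_list = futures_list
        self._sleep_time = sleep_time
        self._max_concurrent_tasks = max_concurrent_tasks

    def wait_and_acquire(self) -> None:
        # Called from worker threads before they enqueue more work; the main
        # thread keeps draining the queue and pruning completed futures.
        while len(self._futures_list) >= self._max_concurrent_tasks:
            time.sleep(self._sleep_time)
```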
In general, the idea of waiting in the child threads instead of the main one makes sense to me. I'll continue thinking about this until our sync.
class QueueItemObject:
    def __init__(self, item: QueueItem):
        self.value = item
        self._order = {
nit: Could be a constant
@@ -34,7 +39,9 @@ def generate_partitions(self, stream: AbstractStream) -> None:
        """
        try:
            for partition in stream.generate_partitions():
                self._queue.put(partition)
It feels like there is some logic associated with how we access the queue. I wonder if it would make sense to regroup this in a class. We have similar logic in PartitionReader, for example.
For sure! I actually ran into another bottleneck preventing the sync from succeeding in a reasonable amount of time. I'll clean this up before asking for a formal review.
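One possible shape for that wrapper, sketched from the throttled_queue.py fields and the put signature quoted further down in this thread; the method bodies are assumptions:

```python
from queue import Queue
from typing import Any

QueueItem = Any  # stands in for the CDK's QueueItem type alias

class ThrottledQueue:
    """Groups the queue-access logic (throttle on put, timeout on get) in one place
    instead of duplicating it across PartitionEnqueuer and PartitionReader."""

    def __init__(self, queue: Queue, throttler: "Throttler", timeout: float):
        self._queue = queue
        self._throttler = throttler
        self._timeout = timeout

    def put(self, item: QueueItem) -> None:
        # The wait happens on the worker thread that produces the item.
        self._throttler.wait_and_acquire()
        self._queue.put(item)

    def get(self) -> QueueItem:
        return self._queue.get(block=True, timeout=self._timeout)
```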
    actual_partitions.append(partition)

assert actual_partitions == partitions
assert queue.put.has_calls([call(p) for p in partitions] + [call(PartitionGenerationCompletedSentinel(stream))])
verify that the partitions and the sentinel were put on the queue
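Note that `Mock.has_calls` is not an assertion method (it just returns a child mock, so the `assert` always passes); the standard-library call that actually verifies the puts is `assert_has_calls`. A small self-contained sketch, with the names mirroring the test above:

```python
from unittest.mock import Mock, call

queue = Mock()
partitions = [Mock(), Mock()]
sentinel = Mock()  # stands in for PartitionGenerationCompletedSentinel(stream)

for p in partitions:
    queue.put(p)
queue.put(sentinel)

# Raises AssertionError if any expected call is missing, unlike a bare has_calls(...).
queue.put.assert_has_calls([call(p) for p in partitions] + [call(sentinel)])
```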
@@ -1,7 +1,7 @@
#
# Copyright (c) 2023 Airbyte, Inc., all rights reserved.
#

from queue import Queue
unused import
…irbyte into alex/concurrent_no_wait_main
def __init__(self, futures_list: List[Future[Any]], sleep_time: float, max_concurrent_tasks: int):
    """
    :param futures_list: The list of futures to monitor
Maybe this is obvious, but why do we need to throttle workers enqueuing items based on the number of pending tasks? I'm wondering if we can solve some of these memory issues by setting queue.maxsize, which prevents items from being added until something else is dequeued, creating a sort of back pressure.
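For reference, this is the standard-library behavior being referred to: a bounded `queue.Queue` blocks producers inside `put()` until a consumer takes something off. A small runnable demo:

```python
import threading
import time
from queue import Queue

q: Queue = Queue(maxsize=2)  # put() blocks once two items are waiting

def producer() -> None:
    for i in range(5):
        q.put(i)  # blocks while the queue is full -> back pressure on this thread
        print(f"produced {i}")

t = threading.Thread(target=producer)
t.start()

time.sleep(0.5)      # the producer fills the queue and then blocks
for _ in range(5):   # draining slowly releases one pending put() at a time
    print(f"consumed {q.get()}")
    time.sleep(0.1)
t.join()
```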
Good question!
The issue I ran into with blocking on the size of the queue is that the main thread will remove elements from the queue before potentially adding them to the list of futures. Since the main thread doesn't wait, it'll be able to remove items from the queue even if the list of futures is large, so the tasks won't wait.
Cool, thanks for the clarification! I think we should document the reason that we need this. Mind adding a docstring on this file?
Looks good @girarda! Just one small documentation request.
Would you also mind doing some memory profiling of this branch versus master to show that it fixes the problem?
for record in parent_records:
    self.logger.info(f"Fetching parent stream slices for stream {self.name}.")
    yield {"parent": record}
# def stream_slices(
TODO @girarda: revert this change before merging
        self._sleep_time = sleep_time
        self._max_concurrent_tasks = max_concurrent_tasks

    def wait_and_acquire(self) -> None:
Should we have unit tests for that?
yes, done!
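For illustration, the kind of test that could cover it, written against the Throttler sketch above rather than the real CDK module:

```python
import threading
import time

def test_wait_and_acquire_blocks_until_a_future_completes():
    futures = [object(), object()]  # stand-ins for two in-flight futures
    throttler = Throttler(futures, sleep_time=0.01, max_concurrent_tasks=2)

    released = threading.Event()

    def worker():
        throttler.wait_and_acquire()  # should block while len(futures) >= 2
        released.set()

    threading.Thread(target=worker, daemon=True).start()
    time.sleep(0.05)
    assert not released.is_set()  # still throttled

    futures.pop()                 # simulate one task completing
    assert released.wait(timeout=1)
```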
        self._throttler = throttler
        self._timeout = timeout

    def put(self, item: QueueItem) -> None:
Should we have unit tests for that?
yes, done!
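And a similar sketch for put, mocking out the underlying queue and throttler (again written against the ThrottledQueue sketch above, not the real module):

```python
from unittest.mock import Mock

def test_put_waits_on_the_throttler_before_enqueuing():
    underlying_queue = Mock()
    throttler = Mock()
    throttled_queue = ThrottledQueue(underlying_queue, throttler, timeout=5)

    throttled_queue.put("a_record")

    throttler.wait_and_acquire.assert_called_once()
    underlying_queue.put.assert_called_once_with("a_record")
```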
This reverts commit 2609043.
…irbyte into alex/concurrent_no_wait_main
…hread (airbytehq#33669) Co-authored-by: Augustin <augustin@airbyte.io>
What
Two more issues were uncovered along the way:
How
Recommended reading order
airbyte-cdk/python/airbyte_cdk/sources/concurrent_source/concurrent_read_processor.py
airbyte-cdk/python/airbyte_cdk/sources/concurrent_source/thread_pool_manager.py
airbyte-cdk/python/airbyte_cdk/sources/concurrent_source/throttler.py
airbyte-cdk/python/airbyte_cdk/sources/streams/concurrent/partitions/throttled_queue.py