Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed race condition on multi-topic consumer #11764

Merged
merged 1 commit into from
Aug 25, 2021

Conversation

merlimat
Copy link
Contributor

@merlimat merlimat commented Aug 25, 2021

Motivation

Under certain conditions applications using the multi-topic consumers might get the consumption stalled:

The conditions to reproduce the issue are:

  • Consumer is subscribed to multiple topics, but only 1 topic has traffic
  • Messages are published in batches (no repro if no batches)
  • Receiver queue size == 1 (or small, in order to exercise race condition)

The problem is that there is race condition between 2 threads when we're deciding to put one of the individual consumers in "paused" state, when the shared queue is full.

What happens is that, just after we checked the conditions and we decide to mark the consumer as paused, the application has emptied the shared queue completely. From that point on, there is no re-attempt to check whether we need to unblock that consumer.

Modification

Instead of introducing a sync block (contended by many consumers), we just double check the state of the shared queue after marking the consumer as "paused". If the other thread has emptied the queue in the meantime, we'll be guaranteed to unblock the consumer.

@merlimat merlimat added type/bug The PR fixed a bug or issue reported a bug release/2.8.1 doc-not-needed Your PR changes do not impact docs release/2.7.4 labels Aug 25, 2021
@merlimat merlimat added this to the 2.9.0 milestone Aug 25, 2021
@merlimat merlimat self-assigned this Aug 25, 2021
Copy link
Contributor

@hangc0276 hangc0276 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great catch!

Copy link
Contributor

@315157973 315157973 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

pendingReceives also seems to have a race condition, two threads will operate pendingReceives :

receiveMessageFromConsumer is executed by client.getInternalExecutorService().execute
internalReceiveAsync is executed by the user's thread

It is better to put them all in pinnedExecutor, I am working on it. And it can solve the above scenario at the same time

@merlimat
Copy link
Contributor Author

LGTM

pendingReceives also seems to have a race condition, two threads will operate pendingReceives :

receiveMessageFromConsumer is executed by client.getInternalExecutorService().execute
internalReceiveAsync is executed by the user's thread

I think the usage on pendingReceives is correct even if used by different threads. Can you describe the particular race condition and the effects that it would have?

It is better to put them all in pinnedExecutor, I am working on it. And it can solve the above scenario at the same time

My concern there is that if you have a multi-topic consumer pulling from many topics, we shouldn't be bottlenecked by 1 single thread.

Copy link
Contributor

@codelipenghui codelipenghui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻

@codelipenghui codelipenghui merged commit f1d66d1 into apache:master Aug 25, 2021
hangc0276 pushed a commit that referenced this pull request Aug 25, 2021
### Motivation

Under certain conditions applications using the multi-topic consumers might get the consumption stalled:

The conditions to reproduce the issue are:
 * Consumer is subscribed to multiple topics, but only 1 topic has traffic
 * Messages are published in batches (no repro if no batches)
 * Receiver queue size == 1 (or small, in order to exercise race condition)

The problem is that there is race condition between 2 threads when we're deciding to put one of the individual consumers in "paused" state, when the shared queue is full.

What happens is that, just after we checked the conditions and we decide to mark the consumer as paused, the application  has emptied the shared queue completely. From that point on, there is no re-attempt to check whether we need to unblock that consumer.

### Modification

Instead of introducing a sync block (contended by many consumers), we just double check the state of the shared queue after marking the consumer as "paused". If the other thread has emptied the queue in the meantime, we'll be guaranteed to unblock the consumer.

(cherry picked from commit f1d66d1)
@hangc0276 hangc0276 added the cherry-picked/branch-2.8 Archived: 2.8 is end of life label Aug 25, 2021
@merlimat merlimat deleted the multi-consumer-stuck branch August 25, 2021 13:41
codelipenghui pushed a commit that referenced this pull request Dec 11, 2021
### Motivation

Under certain conditions applications using the multi-topic consumers might get the consumption stalled:

The conditions to reproduce the issue are:
 * Consumer is subscribed to multiple topics, but only 1 topic has traffic
 * Messages are published in batches (no repro if no batches)
 * Receiver queue size == 1 (or small, in order to exercise race condition)

The problem is that there is race condition between 2 threads when we're deciding to put one of the individual consumers in "paused" state, when the shared queue is full.

What happens is that, just after we checked the conditions and we decide to mark the consumer as paused, the application  has emptied the shared queue completely. From that point on, there is no re-attempt to check whether we need to unblock that consumer.

### Modification

Instead of introducing a sync block (contended by many consumers), we just double check the state of the shared queue after marking the consumer as "paused". If the other thread has emptied the queue in the meantime, we'll be guaranteed to unblock the consumer.

(cherry picked from commit f1d66d1)
@codelipenghui codelipenghui added the cherry-picked/branch-2.7 Archived: 2.7 is end of life label Dec 11, 2021
bharanic-dev pushed a commit to bharanic-dev/pulsar that referenced this pull request Mar 18, 2022
### Motivation

Under certain conditions applications using the multi-topic consumers might get the consumption stalled: 

The conditions to reproduce the issue are:
 * Consumer is subscribed to multiple topics, but only 1 topic has traffic
 * Messages are published in batches (no repro if no batches)
 * Receiver queue size == 1 (or small, in order to exercise race condition)
 
The problem is that there is race condition between 2 threads when we're deciding to put one of the individual consumers in "paused" state, when the shared queue is full. 

What happens is that, just after we checked the conditions and we decide to mark the consumer as paused, the application  has emptied the shared queue completely. From that point on, there is no re-attempt to check whether we need to unblock that consumer. 

### Modification

Instead of introducing a sync block (contended by many consumers), we just double check the state of the shared queue after marking the consumer as "paused". If the other thread has emptied the queue in the meantime, we'll be guaranteed to unblock the consumer.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherry-picked/branch-2.7 Archived: 2.7 is end of life cherry-picked/branch-2.8 Archived: 2.8 is end of life doc-not-needed Your PR changes do not impact docs release/2.7.4 release/2.8.1 type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants