[Client] Fix ConcurrentModificationException in sendAsync #11884
@merlimat @codelipenghui does this PR make sense? Any recommendations on how to add tests?
Can we do that?
and add msgs to |
Guaranteeing the thread safety of the queue alone is not enough; if it were, we could simply use a concurrent queue. The ProducerImpl object was locked before because other operations must also be atomic, for example ProducerImpl#ackReceived, where we need to peek first
In this case it doesn't seem to be about thread safety, and there aren't multiple threads. It's about adding an item to a queue while the same thread is inside the forEach loop. That fails with a ConcurrentModificationException.
This solution looks a bit hacky, as it fixes one specific case and sequence of method calls.
To fix the issue, wouldn't it be better to handle concurrent access to the pendingMessagesQueue correctly?
If we want to go down this path, I believe we at least have to:
- create a new interface QueueThatCanBeAddedWhileExecutingForEachXX...
- create a specific subclass
- use QueueThatCanBeAddedWhileExecutingForEachXX and not Queue
Otherwise it may be dangerous to modify the code that accesses the pendingMessagesQueue, as access patterns may change in the future.
agree, |
Thanks for your contribution. For this PR, do we need to update docs? (The PR template contains info about doc, which helps others know more about the changes. Can you provide doc-related info in this and future PR descriptions? Thanks)
Several comments have been about using a concurrent queue implementation. I don't think that is necessary as long as the access of the queue is properly synchronized. The solution in this PR demonstrates in a unit test how the ConcurrentModificationException can happen in a single thread. When using an ordinary ArrayDeque the test would fail. The root cause of #11783 is that while the current thread is executing the action in the forEach block, the callback code might try to add a new OpSendMsg entry in the calling thread. Some locations: https://github.com/apache/pulsar/blob/d86db3f4ec4fb6bd04216a123cde2fee5c43f9d9/pulsar- pulsar/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ProducerImpl.java Lines 1638 to 1657 in d86db3f
I guess it's a valid case that, after all pending messages have been failed, client code is allowed to access the producer in the callback to send more messages (there are more details in #11783). @merlimat @codelipenghui Does the approach in this PR make sense, i.e. postponing elements that are added while the forEach loop is in progress?
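The single-threaded failure described above can be reproduced outside Pulsar. The following is a minimal sketch, not code from this PR: the class name `SingleThreadCmeDemo` and the message strings are made up for illustration, and a plain `ArrayDeque` stands in for the producer's pending message queue.

```java
import java.util.ArrayDeque;
import java.util.ConcurrentModificationException;
import java.util.Queue;

// Hypothetical demo class; it only illustrates the failure mode, not ProducerImpl itself.
final class SingleThreadCmeDemo {

    /** Returns true if adding to the queue from inside forEach raised a CME. */
    static boolean addDuringForEachThrows() {
        Queue<String> pending = new ArrayDeque<>();
        pending.add("msg-1");
        pending.add("msg-2");
        try {
            pending.forEach(msg -> {
                // Simulates a send callback that publishes a new message
                // while the pending messages are being iterated.
                if (msg.equals("msg-1")) {
                    pending.add("msg-3");
                }
            });
            return false;
        } catch (ConcurrentModificationException e) {
            // Only one thread is involved; the modification happened
            // inside the iteration itself.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("CME thrown: " + addDuringForEachThrows());
    }
}
```

No second thread is needed here, which is why swapping in a lock alone would not change the outcome: the thread that iterates is the same one that adds.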
+1
@merlimat @codelipenghui @315157973
I'm not sure whether this is a good approach; waiting for others to review. @codelipenghui @hangc0276 @merlimat
The change looks good. I just left a couple of comments.
pulsar-client/src/main/java/org/apache/pulsar/client/impl/ProducerImpl.java
…edOpSendMsgs list
Lgtm
(cherry picked from commit a1c1028)
### Background

The issue is a race condition introduced by apache#11884, which added a structure to maintain pending messages. That structure has a race between `foreach()` and `peek`: if, during the `foreach` process, a new message is sent to the broker and its receipt is received from the broker, the producer peeks a null message from the `OpSendMsgQueue`. I have added some logs to confirm the issue.

Here are the logs which explain the bug:

```
2022-02-11T12:40:00,571+0800 [pulsar-timer-5-1] ERROR org.apache.pulsar.client.impl.ProducerImpl - For each OpSendMsgQueue, 0
2022-02-11T12:40:00,572+0800 [pulsar-timer-5-1] INFO org.apache.pulsar.client.impl.ProducerStatsRecorderImpl - [public/default/t_topic] [standalone-0-3] Pending messages: 1 --- Publish throughput: 0.99 msg/s --- 0.00 Mbit/s --- Latency: med: 0.000 ms - 95pct: 0.000 ms - 99pct: 0.000 ms - 99.9pct: 0.000 ms - max: -∞ ms --- Ack received rate: 0.00 ack/s --- Failed messages: 0
2022-02-11T12:40:01,564+0800 [pulsar-client-io-1-1] ERROR org.apache.pulsar.client.impl.ProducerImpl - Add message to OpSendMsgQueue.postponedOpSendMgs, 1, 182801
2022-02-11T12:40:01,566+0800 [pulsar-client-io-1-1] INFO org.apache.pulsar.client.impl.ProducerImpl - [public/default/s_topic] [standalone-0-6] Got ack for timed out msg 182801 - 182900
2022-02-11T12:40:01,573+0800 [pulsar-timer-5-1] ERROR org.apache.pulsar.client.impl.ProducerImpl - For each OpSendMsgQueue, 0
2022-02-11T12:40:01,573+0800 [pulsar-timer-5-1] INFO org.apache.pulsar.client.impl.ProducerImpl - Put the opsend back to deque of sequenceID 182801
```

From the logs, you can see that the message with sequence ID 182801 is added to the `OpSendMsgQueue` first. After the producer receives the receipt, the log shows `Got ack for timed out msg`, which means the peek returned null. Afterwards the message is added back to the internal queue, but the producer side is blocked at this point.

### Modification

1. Avoid using `foreach` to iterate over the pending ops; use an iterator instead.
2. Keep using `OpSendMsgQueue`, to avoid exposing the `foreach` method.
Fixes #11783
Motivation
See #11783 for details.
Modifications
Create a subclass for ArrayDeque that can postpone the adding of items that are added during a forEach iteration. To support all possible cases, this checks for nested calls too although that might not happen in this case.
The items added during a forEach loop will be added after the loop completes.
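The idea described above could be sketched roughly as follows. This is an illustration under assumptions, not the code merged in this PR: the class name `PostponingDeque`, its fields, and the `demo()` helper are hypothetical, and only `add` is intercepted (other mutators such as `offer` or `push` are left untouched for brevity).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: an ArrayDeque that defers add() calls made during forEach.
final class PostponingDeque<E> extends ArrayDeque<E> {
    // Depth counter so that nested forEach calls are handled too.
    private int forEachDepth = 0;
    // Items added while a forEach loop is in progress; null when there are none.
    private List<E> postponed;

    @Override
    public void forEach(Consumer<? super E> action) {
        forEachDepth++;
        try {
            super.forEach(action);
        } finally {
            forEachDepth--;
            // Flush only when the outermost loop has completed.
            if (forEachDepth == 0 && postponed != null) {
                List<E> toAdd = postponed;
                postponed = null;
                for (E item : toAdd) {
                    super.add(item);
                }
            }
        }
    }

    @Override
    public boolean add(E item) {
        if (forEachDepth > 0) {
            if (postponed == null) {
                postponed = new ArrayList<>();
            }
            return postponed.add(item);
        }
        return super.add(item);
    }

    // Small usage example: "c" is added while "a" is being visited.
    static List<String> demo() {
        PostponingDeque<String> queue = new PostponingDeque<>();
        queue.add("a");
        queue.add("b");
        queue.forEach(item -> {
            if (item.equals("a")) {
                queue.add("c");
            }
        });
        return new ArrayList<>(queue);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

One consequence of this design is that a postponed item is not visible to `peek` or `size` until the loop finishes, so callers that rely on those methods during an iteration need to account for that.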
Todo
Find a way to create a failing test for the current issue.