
Fix race condition of OpSendMsgQueue when publishing messages #14231

Conversation

codelipenghui (Contributor) commented Feb 11, 2022

Background

The issue is a race condition related to #11884 and #12704

Calling getPendingQueueSize() runs pendingMessages.forEach() without any lock protection. When a send receipt comes back, the producer peeks from pendingMessages, and while the forEach is in progress the peek might return null. I have added some logs to confirm the issue; to reproduce it more easily, use statsInterval(1, TimeUnit.SECONDS) so that the stats are updated more frequently.

Here are the logs that illustrate the bug:

2022-02-11T12:40:00,571+0800 [pulsar-timer-5-1] INFO org.apache.pulsar.client.impl.ProducerImpl - For each OpSendMsgQueue, 0
2022-02-11T12:40:00,572+0800 [pulsar-timer-5-1] INFO  org.apache.pulsar.client.impl.ProducerStatsRecorderImpl - [public/default/t_topic] [standalone-0-3] Pending messages: 1 --- Publish throughput: 0.99 msg/s --- 0.00 Mbit/s --- Latency: med: 0.000 ms - 95pct: 0.000 ms - 99pct: 0.000 ms - 99.9pct: 0.000 ms - max: -∞ ms --- Ack received rate: 0.00 ack/s --- Failed messages: 0
2022-02-11T12:40:01,564+0800 [pulsar-client-io-1-1] INFO org.apache.pulsar.client.impl.ProducerImpl - Add message to OpSendMsgQueue.postponedOpSendMgs, 1, 182801
2022-02-11T12:40:01,566+0800 [pulsar-client-io-1-1] INFO  org.apache.pulsar.client.impl.ProducerImpl - [public/default/s_topic] [standalone-0-6] Got ack for timed out msg 182801 - 182900
2022-02-11T12:40:01,573+0800 [pulsar-timer-5-1] INFO org.apache.pulsar.client.impl.ProducerImpl - For each OpSendMsgQueue, 0
2022-02-11T12:40:01,573+0800 [pulsar-timer-5-1] INFO  org.apache.pulsar.client.impl.ProducerImpl - Put the opsend back to deque of sequenceID 182801

From the logs, you can see that the message with sequence ID 182801 is first added to the OpSendMsgQueue, but when the producer receives the receipt, the log shows Got ack for timed out msg, which means peeking the queue returned null. The message is then added back to the internal queue, but by that point the producer side is already blocked.
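
To make the failure mode concrete, below is a rough, hypothetical sketch of the postponement mechanism inside OpSendMsgQueue, as suggested by the log messages above and by the forEachDepth/postponedOpSendMgs discussion later in this thread; the exact fields and structure are approximations, not the real Pulsar source. The key point is that forEachDepth is updated by the stats thread's forEach() and read by the I/O thread's add() without synchronization, so a newly sent message can be parked in postponedOpSendMgs, and the ack handler's peek() then returns null.

```java
// Rough approximation of OpSendMsgQueue for illustration only; not the exact Pulsar source.
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Consumer;

class OpSendMsgQueueSketch {

    static class OpSendMsg { }   // placeholder for the real ProducerImpl.OpSendMsg

    private final Queue<OpSendMsg> delegate = new ArrayDeque<>();
    private final List<OpSendMsg> postponedOpSendMgs = new ArrayList<>();
    private int forEachDepth = 0;   // written during forEach(), read by add(), no synchronization

    // Called via getPendingQueueSize() on the stats timer thread.
    public void forEach(Consumer<? super OpSendMsg> action) {
        try {
            forEachDepth++;
            delegate.forEach(action);
        } finally {
            forEachDepth--;
            if (forEachDepth == 0 && !postponedOpSendMgs.isEmpty()) {
                // "Put the opsend back to deque" in the logs above
                delegate.addAll(postponedOpSendMgs);
                postponedOpSendMgs.clear();
            }
        }
    }

    // Called on the client I/O thread when a message is sent; if it observes
    // forEachDepth > 0 while the stats thread is iterating, the op is parked in
    // postponedOpSendMgs ("Add message to OpSendMsgQueue.postponedOpSendMgs" in the logs).
    public boolean add(OpSendMsg op) {
        if (forEachDepth > 0) {
            return postponedOpSendMgs.add(op);
        }
        return delegate.add(op);
    }

    // Called on the I/O thread when a send receipt arrives; if the op was parked,
    // this returns null and the producer logs "Got ack for timed out msg".
    public OpSendMsg peek() {
        return delegate.peek();
    }
}
```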

Documentation

Check the box below or label this PR directly (if you have committer privilege).

Need to update docs?

  • doc-required

    (If you need help on updating docs, create a doc issue)

  • no-need-doc

    (Please explain why)

  • doc

    (If this PR contains doc changes)

@codelipenghui codelipenghui self-assigned this Feb 11, 2022
@codelipenghui codelipenghui added this to the 2.10.0 milestone Feb 11, 2022
@github-actions github-actions bot added the doc-not-needed label Feb 11, 2022
@codelipenghui codelipenghui added the release/blocker label Feb 11, 2022
@codelipenghui codelipenghui force-pushed the penghui/fix-pending-message-race-condition branch from 10b88f5 to dca2efa on February 11, 2022 07:50
@codelipenghui codelipenghui changed the title from "Fix race condtion of OpSendMsgQueue which publishing messages" to "Fix race condition of OpSendMsgQueue when publishing messages" Feb 11, 2022

lhotari (Member) left a comment

The queue implementation that backs pendingMessages isn't thread safe.


Isn't the correct solution to make the getPendingQueueSize method synchronized? There's a similar thread-safety issue in the getDelayInMillis method, which should also be synchronized.
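
For reference, a minimal sketch of what that suggestion could look like, assuming both accessors take the same monitor that already guards the writers of the pending-message queue; the OpSendMsg placeholder and its fields (numMessagesInBatch, createdAt) are illustrative stand-ins, not the real Pulsar internals.

```java
// Minimal sketch of the suggested fix; not the real ProducerImpl code.
import java.util.ArrayDeque;
import java.util.concurrent.TimeUnit;

class SynchronizedAccessorsSketch {

    static class OpSendMsg {
        int numMessagesInBatch = 1;            // assumed per-op message count
        long createdAt = System.nanoTime();    // assumed creation timestamp (nanoseconds)
    }

    private final ArrayDeque<OpSendMsg> pendingMessages = new ArrayDeque<>();

    // synchronized: iteration can no longer interleave with the I/O thread's
    // add/peek/poll, assuming those paths hold the same monitor.
    synchronized long getPendingQueueSize() {
        long count = 0;
        for (OpSendMsg op : pendingMessages) {
            count += op.numMessagesInBatch;
        }
        return count;
    }

    synchronized long getDelayInMillis() {
        OpSendMsg firstMsg = pendingMessages.peek();
        return firstMsg == null
                ? 0L
                : TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - firstMsg.createdAt);
    }
}
```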

codelipenghui (Contributor, Author) commented Feb 11, 2022

@lhotari Yes, I tried synchronizing getPendingQueueSize, but it looks like we don't have to synchronize here, because we can accept a momentarily incorrect pending-messages count. That way the stats-printing task won't hold the lock when there are many pending messages.

And yes, getDelayInMillis also has the thread-safety issue, but it looks like it only affects the accuracy of the replication delay.

lhotari (Member) commented Feb 11, 2022

it looks like we don't have to synchronize here, because we can accept a momentarily incorrect pending-messages count

That is risky. It's undefined what the behavior of ArrayDeque would be if it's accessed without proper synchronization. Usually it would be "just" a data consistency issue. In the case of java.util.HashMap, concurrent modifications and access could lead to an infinite loop before Java 8.

michaeljmarshall (Member) left a comment

I would prefer to use the synchronized keyword to ensure correctness. If we don't synchronize our access and we continue to use the forEach method, I think we need to modify the implementation of OpSendMsgQueue#forEach. As it is currently implemented, concurrent calls to forEach could lead to different views of the forEachDepth variable, which might lead to a postponedOpSendMgs that is never removed.

If we really want to avoid adding the synchronized keyword, we could use a LongAdder in the OpSendMsgQueue. Then, anytime an object is added or removed, we can increment/decrement accordingly, and the getPendingQueueSize will return the sum of the LongAdder.

One advantage of using the synchronized keyword is that it only affects producers that are running with stats logging or applications that are calling the getPendingQueueSize method.
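
A rough sketch of the LongAdder alternative described above; the OpSendMsg placeholder and the field names are illustrative and do not claim to match the merged implementation.

```java
// Rough sketch of the LongAdder alternative; not the merged implementation.
import java.util.ArrayDeque;
import java.util.concurrent.atomic.LongAdder;

class OpSendMsgQueueWithAdder {

    static class OpSendMsg {
        int numMessagesInBatch = 1;            // assumed per-op message count
    }

    private final ArrayDeque<OpSendMsg> delegate = new ArrayDeque<>();
    private final LongAdder messageCount = new LongAdder();

    // Mutations still happen only on the producer's event-loop thread.
    void add(OpSendMsg op) {
        delegate.add(op);
        messageCount.add(op.numMessagesInBatch);
    }

    OpSendMsg poll() {
        OpSendMsg op = delegate.poll();
        if (op != null) {
            messageCount.add(-op.numMessagesInBatch);
        }
        return op;
    }

    // Safe to call from the stats thread: it never touches the non-thread-safe deque.
    long messagesCount() {
        return messageCount.sum();
    }
}
```

This keeps the hot add/poll path on the event-loop thread cheap, while the stats thread can read a message count without ever iterating the deque.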

merlimat (Contributor):

If we really want to avoid adding the synchronized keyword, we could use a LongAdder in the OpSendMsgQueue. Then, anytime an object is added or removed, we can increment/decrement accordingly, and the getPendingQueueSize will return the sum of the LongAdder.

I agree that it would be better to have a messageCount method in OpSendMsgQueue, so that we can encapsulate the logic in there (and we don't have to deal with batching enabled/disabled cases either).

merlimat (Contributor):

@codelipenghui @lhotari @michaeljmarshall Changed the logic to use a counter. PTAL.
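
Judging from the commit messages later in this thread ("Use counter to keep track of messages count", "Changed to int"), the counter seems to end up as a plain int inside OpSendMsgQueue. A minimal, hypothetical sketch of that variant, assuming all queue mutations stay on the producer's event-loop thread:

```java
// Illustrative only; not the exact merged code.
class MessagesCountSketch {

    private int messagesCount = 0;   // mutated only by the event-loop thread

    void onAdd(int numMessagesInBatch) {      // numMessagesInBatch: assumed per-op message count
        messagesCount += numMessagesInBatch;
    }

    void onRemove(int numMessagesInBatch) {
        messagesCount -= numMessagesInBatch;
    }

    int messagesCount() {
        // A reader on another thread (e.g. the stats task) may see a slightly
        // stale value, which is tolerable for stats reporting.
        return messagesCount;
    }
}
```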

michaeljmarshall (Member) left a comment

LGTM

@@ -1743,7 +1748,7 @@ private void resendMessages(ClientCnx cnx, long expectedEpoch) {
             cnx.channel().close();
             return;
         }
-        int messagesToResend = pendingMessages.size();
+        long messagesToResend = pendingMessages.messagesCount();

Member:

Nit: technically, users might notice a change in the Re-Sending log line below when running with batch messaging enabled. I don't think this is a problem, but I wanted to mention the behavior change.

Contributor:

Yes, it was a left-over. fixed

michaeljmarshall (Member) commented Feb 11, 2022

Oh, sorry, I was too vague. I just meant that the log line's content will be different. Instead of logging X batch messages, it'll log the total number of messages (regardless of batching). This is fine, but it's a change.

Contributor:

Yes, though the message is more accurate now compared to before :)

codelipenghui (Contributor, Author):

LGTM

codelipenghui (Contributor, Author):

@michaeljmarshall We don't need to cherry-pick this commit to 2.8.3; #12704 is only in branch-2.9, so 2.8.x is not affected by this problem.

codelipenghui (Contributor, Author):

@lhotari Please help review again, thanks.

@merlimat merlimat merged commit 2c9af46 into apache:master Feb 12, 2022

lhotari (Member) left a comment

Elegant solution, very good!

merlimat added a commit that referenced this pull request Feb 12, 2022
* Add synchronized for getPendingQueueSize();

* Use iterator instead.

* Use counter to keep track of messages count

* Changed to int

Co-authored-by: Matteo Merli <mmerli@apache.org>
@merlimat merlimat added the cherry-picked/branch-2.9 label and removed the release/2.8.3 label Feb 12, 2022
@codelipenghui codelipenghui deleted the penghui/fix-pending-message-race-condition branch February 13, 2022 10:22

michaeljmarshall (Member):

@michaeljmarshall We don't need to cherry-pick this commit to 2.8.3; #12704 is only in branch-2.9, so 2.8.x is not affected by this problem.

@codelipenghui - I agree that the specific bug that led to this fix is not in 2.8. However, in 2.8 we still access the ArrayDeque from multiple threads, and the data structure is explicitly documented as not thread-safe. The solution in this PR would remove that unsafe access in branch-2.8, so I think we should cherry-pick it to 2.8.

michaeljmarshall (Member):

@codelipenghui - I discussed this with @lhotari offline, and he pointed out that we've used the current paradigm for a while. I will not cherry-pick this commit to branch-2.8. Please ignore my comment above.

Nicklee007 pushed a commit to Nicklee007/pulsar that referenced this pull request Apr 20, 2022
…#14231)

* Add synchronized for getPendingQueueSize();

* Use iterator instead.

* Use counter to keep track of messages count

* Changed to int

Co-authored-by: Matteo Merli <mmerli@apache.org>
@codelipenghui codelipenghui restored the penghui/fix-pending-message-race-condition branch May 17, 2022 01:20
@codelipenghui codelipenghui deleted the penghui/fix-pending-message-race-condition branch May 17, 2022 01:29
Labels

  • cherry-picked/branch-2.9 (Archived: 2.9 is end of life)

  • doc-not-needed (Your PR changes do not impact docs)

  • release/blocker (Indicate the PR or issue that should block the release until it gets resolved)

  • release/2.9.2