
[Java Client] Use epoch to version producer's cnx to prevent early delivery of messages #12779

Merged

Conversation

michaeljmarshall (Member)

Motivation

I discovered a race condition in the Pulsar Java client's ProducerImpl that can lead to messages being persisted out of order for a single producer sending to a non-partitioned topic. This PR removes the race condition by tracking the current cnx and by performing an additional state update while synchronized on ProducerImpl.this.

Reproducing the issue

I can consistently reproduce the issue by interrupting an active producer that is sending messages to Pulsar, either by restarting the broker hosting the producer's topic or by unloading the producer's topic/namespace. Because of the nature of this race condition, the out-of-order delivery does not happen every time, but it happens frequently enough to be noticeable. To increase the probability of hitting the error, my test setup produced to 100 topics on a 3-broker Pulsar cluster at a total rate of 50k msgs/sec. I determined that messages were out of order in two ways: first, by producing messages from a single thread and putting my own monotonically increasing sequence id in the message payload; second, by inspecting the sequence id assigned to each message by the Pulsar Java client. Reading the messages back with the Reader API revealed sequences like 1,4,2,3,4. The 4 in that sequence is duplicated because when the client receives an ack for an unexpected message, it re-sends all pending messages.
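
As a sketch of the ordering check described above (not code from this PR; the service URL and topic name are placeholders), a reader can walk the topic and flag any payload counter or sequence id that does not strictly increase:

```java
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Reader;
import org.apache.pulsar.client.api.Schema;

public class OrderingCheck {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")               // placeholder broker URL
                .build();
        Reader<Long> reader = client.newReader(Schema.INT64)
                .topic("persistent://public/default/ordering-test")  // placeholder topic
                .startMessageId(MessageId.earliest)
                .create();
        long lastPayload = -1;
        long lastSeqId = -1;
        while (reader.hasMessageAvailable()) {
            Message<Long> msg = reader.readNext();
            // For a single producer, both the application counter in the payload
            // and the client-assigned sequence id should be strictly increasing.
            if (msg.getValue() <= lastPayload || msg.getSequenceId() <= lastSeqId) {
                System.out.printf("out of order: payload %d after %d (seqId %d after %d)%n",
                        msg.getValue(), lastPayload, msg.getSequenceId(), lastSeqId);
            }
            lastPayload = msg.getValue();
            lastSeqId = msg.getSequenceId();
        }
        reader.close();
        client.close();
    }
}
```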

Description of the problem

The problem comes from the way the Pulsar producer changes state when a connection is closed and when a connection is opened. Here are the state changes for each path (a simplified model of both paths follows the second list):

Connection closed:

  1. Set cnx to null, as long as the current cnx is the connection being closed.
  2. Set the state to Connecting.
  3. Schedule a runnable to get a new cnx.
  4. Once the task from step 3 is run, asynchronously get a cnx and then call ProducerImpl#connectionOpened.

ProducerImpl#connectionOpened:

  1. Set cnx to the new connection.
  2. Send Producer command to the broker to register the new producer.
  3. Once the producer is registered, schedule a runnable to redeliver pendingMessages.
  4. Once the task from step 3 is run, go to Ready, as long as the current state is Uninitialized, Connecting, or RegisteringSchema.
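
To make the interleaving easier to see, here is a toy model of the two paths above. It is a deliberately simplified paraphrase, not the actual ProducerImpl code; the class and method names are illustrative:

```java
import java.util.concurrent.atomic.AtomicReference;

// Toy model of the pre-fix state machine, reduced to the two paths above
// so the unguarded transition is visible.
public class PreFixProducerModel {
    enum State { Uninitialized, Connecting, RegisteringSchema, Ready }

    private volatile State state = State.Uninitialized;
    private final AtomicReference<Object> clientCnx = new AtomicReference<>();

    // Connection closed, steps 1-2: clear cnx only if it is still the one
    // being closed, then unconditionally move to Connecting. Steps 3-4
    // (scheduling the reconnect) are omitted here.
    void connectionClosed(Object cnx) {
        clientCnx.compareAndSet(cnx, null);
        state = State.Connecting;
    }

    // connectionOpened, step 1: adopt the new connection. Steps 2-3
    // (registering the producer, scheduling the resend) run asynchronously.
    void connectionOpened(Object cnx) {
        clientCnx.set(cnx);
    }

    // connectionOpened, step 4, run later on another thread. Nothing here
    // verifies that the connection it was started for is still current, so
    // a close that raced in between leaves state == Connecting and the
    // check still passes -- producing Ready with clientCnx == null.
    void onPendingMessagesResent() {
        if (state == State.Uninitialized || state == State.Connecting
                || state == State.RegisteringSchema) {
            state = State.Ready;
        }
    }
}
```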

Nothing prevents a connection from being closed while another connection is being established in the connectionOpened method, and this is exactly what exposes the race condition fixed by this PR. In the race, a connection is established, connectionOpened is called, and the producer is successfully registered with the broker. Then, that connection is closed before step 4 of connectionOpened runs, and the close handler sets the state to Connecting (which it already was). Because step 4 only checks that the current state is Uninitialized, Connecting, or RegisteringSchema, it still updates the state to Ready. With some extra logging, I could see that we changed the state to Ready while cnx() returned null. At this point, messages are not yet delivered. When the new connection is established, connectionOpened is called and we set the new cnx. Since our state is still Ready, we start delivering any new messages received by the client; these are out of order. We asynchronously register the producer with the broker, and then messages start to be persisted.

Modifications

  1. Update where the producer's epoch value is incremented. It was previously incremented before getting a connection, which is prone to races. By incrementing it once we have the connection, and within a lock on the producer, we ensure that the connection gets the next highest epoch number.
  2. Set the state to Connecting in the connectionOpened method. This is important because the connection could be closed after the check for cnx() != null in the recoverProcessOpSendMsgFrom method but before that method sets the state to Ready. Since we update the state within the lock on ProducerImpl.this, we won't deliver any messages. There is still a chance that the producer will have state Ready while cnx() == null.
  3. Use the producer's epoch value to ensure the cnx reference passed to recoverProcessOpSendMsgFrom is still the current cnx, and ensure that cnx() != null. Together these checks make it safe to update the state to Ready (see the sketch after this list).
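
A toy model of the fixed flow, under the same simplifications as the sketch above (illustrative names, not the committed code):

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

// Toy model of the fix: the epoch is bumped under the producer lock when a
// connection is adopted, and the recovery callback only moves to Ready if
// both the epoch and the cnx reference are still current.
public class FixedProducerModel {
    enum State { Uninitialized, Connecting, RegisteringSchema, Ready }

    private volatile State state = State.Uninitialized;
    private final AtomicReference<Object> clientCnx = new AtomicReference<>();
    private final AtomicLong epoch = new AtomicLong(-1); // incremented before first use

    // Modifications 1 and 2: reset the state and version the connection
    // inside the lock, so the new connection owns the next highest epoch.
    long connectionOpened(Object cnx) {
        synchronized (this) {
            state = State.Connecting;
            clientCnx.set(cnx);
            return epoch.incrementAndGet();
        }
    }

    void connectionClosed(Object cnx) {
        clientCnx.compareAndSet(cnx, null);
        state = State.Connecting;
    }

    // Modification 3: the callback carries the epoch it was started with.
    // A connection that was closed (or replaced) in the meantime fails one
    // of the two checks, so a stale callback never flips the state to Ready.
    void recoverProcessOpSendMsgFrom(Object cnx, long expectedEpoch) {
        synchronized (this) {
            if (expectedEpoch != epoch.get() || clientCnx.get() == null) {
                return; // stale: wait for the next connectionOpened to recover
            }
            state = State.Ready;
        }
    }
}
```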

Alternatives

Instead of using the epoch value, we could have checked that cnx() == cnx in recoverProcessOpSendMsgFrom. This check would be valid in all cases except the one where the next connection is the same connection object, which I think can happen if a topic is unloaded from a broker and then loaded back onto the same broker. The epoch value gives a consistent way to know that the cnx is the current cnx.
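
In the toy model above, the rejected check would look like this (again illustrative, and it assumes, as described, that a reconnect to the same broker can hand back the same connection object):

```java
// Rejected alternative: reference equality on the connection object.
void recoverProcessOpSendMsgFrom(Object cnx) {
    synchronized (this) {
        if (clientCnx.get() != cnx) {
            return; // usually detects a stale connection...
        }
        // ...but if an unload/reload lands the topic back on the same broker
        // and the same connection object is reused, a stale callback passes
        // this check. The monotonically increasing epoch has no such case.
        state = State.Ready;
    }
}
```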

We could have added a new state called RegisteringProducer. I investigated this option, but it seemed more complicated and less elegant than the approach taken here, so I chose the simpler solution.

Verifying this change

I tried to implement a unit test that would consistently reproduce this issue. I was only able to do so after introducing a 50 millisecond delay in the scheduling of the runnable within resendMessages; with that delay in place, this PR prevented the race. I also built a custom client with this PR, and I can confirm that I observed the log line indicating "Producer epoch mismatch or the current connection is null." This log confirms that the race would have happened but was avoided.
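
For reference, the kind of test-only tweak described above might look like this (a paraphrase, not committed code; the 50 ms figure comes from the description):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Delaying the resend task widens the window in which a connection close
// can land between producer registration and the transition to Ready,
// making the race reproducible in a unit test.
class DelayedResend {
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();

    void resendMessages(Runnable recoverProcessOpSendMsgFrom) {
        // The real code runs this promptly; the 50 ms delay is the only
        // change needed to reproduce the race consistently.
        executor.schedule(recoverProcessOpSendMsgFrom, 50, TimeUnit.MILLISECONDS);
    }
}
```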

Also, I provided extra detail in this PR description since I wasn't able to add a new test to specifically verify this change. I think the change is fairly straightforward, and I have tried to add enough context to make it easy to understand.

Does this pull request potentially affect one of the following parts:

If yes was chosen, please highlight the changes

  • Dependencies (does it add or upgrade a dependency): no
  • The public API: no
  • The schema: no
  • The default values of configurations: no
  • The wire protocol: no
  • The rest endpoints: no
  • The admin cli options: no
  • Anything that affects deployment: no

Documentation

  • no-need-doc

    This is an internal change to the client. We should make sure to include this fix in the release notes, but no documentation changes need to be made.

@michaeljmarshall (Member, Author)

I sent a note to the mailing list to discuss this change: https://lists.apache.org/thread/6y6jfdx432j2gqxgk9cnhdw48fq1m6b1.

@michaeljmarshall (Member, Author)

/pulsarbot run-failure-checks

@Jason918 (Contributor) left a comment

Can you add a test to cover this case?
Not sure if it's easy to do.

```diff
@@ -37,7 +37,8 @@
     protected final Backoff backoff;
     private static final AtomicLongFieldUpdater<ConnectionHandler> EPOCH_UPDATER = AtomicLongFieldUpdater
             .newUpdater(ConnectionHandler.class, "epoch");
-    private volatile long epoch = 0L;
+    // Start with -1L because it gets incremented before sending on the first connection
+    private volatile long epoch = -1L;
```
Contributor:

Does the starting value of epoch mean anything? In my understanding, it's just used for comparison, right?

Member Author:

I don't believe the starting value inherently means anything. However, the test ProducerCreationTest#testGeneratedNameProducerReconnect asserts that the epoch value is 2 after a single producer reconnect. I could have updated the test or the starting value. I chose to update the starting value here to maintain the original behavior.

Contributor:

OK, I see. I am OK with this new starting value.
But IMHO, it's better to assert that the epoch value increased after a single producer reconnect, to avoid a flaky test.
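
A sketch of the assertion style being suggested; the accessors (getConnectionHandler, getEpoch), the reconnect trigger, and the test fixtures are assumptions, not verified API:

```java
import static org.testng.Assert.assertTrue;

import org.awaitility.Awaitility;
import org.testng.annotations.Test;

public class ProducerEpochTest /* extends a broker test base */ {
    @Test
    public void testEpochIncreasesOnReconnect() throws Exception {
        // `producer`, `admin`, and `topic` are assumed test fixtures.
        long before = producer.getConnectionHandler().getEpoch();
        admin.topics().unload(topic); // force a reconnect
        Awaitility.await().until(producer::isConnected);
        long after = producer.getConnectionHandler().getEpoch();
        // Assert relative growth instead of pinning an absolute value like 2,
        // so extra reconnects during the test cannot make it flaky.
        assertTrue(after > before, "epoch should increase after a reconnect");
    }
}
```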

@michaeljmarshall (Member, Author)

Can you add a test to cover this case?
Not sure if it's easy to do.

@Jason918 - as I mentioned in the PR description, I was only able to get a unit test to consistently reproduce the underlying issue by modifying the client code (scheduling a delay for one of the callbacks). I am pretty sure we have tests that verify producer (re)connection, which will verify that this code path works for the happy path. Also, I verified that this change correctly removes the race condition by testing in the k8s environment when I discovered the race. At this point, I'm not exactly sure how to add a test, but I am open to suggestions.

@Jason918 (Contributor)

Great. Thanks for the detailed explanation.

@eolivelli (Contributor) left a comment

+1

@eolivelli merged commit ab652b8 into apache:master Dec 7, 2021
@michaeljmarshall deleted the use-epoch-to-version-producer-cnx branch December 7, 2021 19:19
@michaeljmarshall added the type/bug label Dec 7, 2021
michaeljmarshall added a commit that referenced this pull request Dec 7, 2021
* [Java Client] Use epoch to version producer's cnx to prevent early delivery of messages (#12779)

* Update initial epoch value: it now gets incremented before the first connection

(cherry picked from commit ab652b8)
michaeljmarshall added a commit to datastax/pulsar that referenced this pull request Dec 8, 2021
* [Java Client] Use epoch to version producer's cnx to prevent early delivery of messages (apache#12779)

* Update initial epoch value: it now gets incremented before the first connection

(cherry picked from commit ab652b8)
lhotari pushed a commit that referenced this pull request Dec 9, 2021
* [Java Client] Use epoch to version producer's cnx to prevent early delivery of messages (#12779)

* Update initial epoch value: it now gets incremented before the first connection

(cherry picked from commit ab652b8)
@lhotari added the cherry-picked/branch-2.8 label Dec 9, 2021
@eolivelli changed the title [Java Client] Use epoch to version producer's cnx to prevent early de… [Java Client] Use epoch to version producer's cnx to prevent early delivery of messages Dec 15, 2021
fxbing pushed a commit to fxbing/pulsar that referenced this pull request Dec 19, 2021
* [Java Client] Use epoch to version producer's cnx to prevent early delivery of messages (apache#12779)

* Update initial epoch value: it now gets incremented before the first connection
Labels
area/client, cherry-picked/branch-2.8, cherry-picked/branch-2.9, doc-not-needed, release/2.8.2, release/2.9.1, type/bug