PIP-91: Separate lookup timeout from operation timeout #11627
Conversation
This patch contains a number of changes:

TooManyRequests is retried for partition metadata and lookups.

Lookup timeout configuration has been added. By default it matches the operation timeout.

Partition metadata timeout calculation has been fixed to calculate the elapsed time correctly.

Small refactor on broker construction to allow a mocked ServerCnx implementation for testing. Unfortunately, the test takes over 50 seconds, but this is unavoidable since we're working with timeouts here.

PulsarClientExceptions have been reworked to contain more context (remote/local/reqid) and any previous exceptions that triggered retries. The previous exceptions must be manually recorded, so this only applies to lookups on the consumer side for now.
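For illustration, a minimal sketch of how the new setting might be configured from the client builder (assuming it is exposed as `lookupTimeout` alongside the existing `operationTimeout`; the service URL and values are placeholders):

```java
import java.util.concurrent.TimeUnit;

import org.apache.pulsar.client.api.PulsarClient;

public class LookupTimeoutExample {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")   // placeholder broker URL
                .operationTimeout(30, TimeUnit.SECONDS)  // budget for individual operations
                // lookup/partition-metadata requests get their own budget, so
                // TooManyRequests retries during a broker cold restart aren't cut short;
                // if unset, it defaults to the operation timeout
                .lookupTimeout(120, TimeUnit.SECONDS)
                .build();

        client.close();
    }
}
```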
Great work!
LGTM
Some tests are failing. At least one of the failures is legit, so I'll look into it and post an update.
pulsar-client/src/main/java/org/apache/pulsar/client/impl/conf/ClientConfigurationData.java
pulsar-client-api/src/main/java/org/apache/pulsar/client/api/PulsarClientException.java
LGTM
The history behind introducing the TooManyRequest error is to handle backpressure for ZooKeeper by throttling a large number of concurrent topic loads during a broker cold restart. Therefore, Pulsar has lookup throttling at both the client and server side that slows down lookups, because a lookup ultimately triggers topic loading on the server side. So, when a client sees TooManyRequest errors, the client should retry the operation, and the client will eventually reconnect to the broker. TooManyRequest cannot harm the broker because the broker already has a safeguard to reject a flood of requests.
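As a rough illustration of that retry contract, a hypothetical helper (not the actual client internals) that keeps retrying a lookup on TooManyRequests with capped exponential backoff until its own deadline expires:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

import org.apache.pulsar.client.api.PulsarClientException;

// hypothetical sketch: retry a lookup on TooManyRequests until the lookup deadline passes
final class LookupRetrySketch {
    static <T> T lookupWithRetries(Supplier<CompletableFuture<T>> lookup, long lookupTimeoutMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + lookupTimeoutMs;
        long backoffMs = 100;
        while (true) {
            try {
                return lookup.get().join();
            } catch (CompletionException e) {
                boolean throttled = e.getCause() instanceof PulsarClientException.TooManyRequestsException;
                if (!throttled || System.currentTimeMillis() >= deadline) {
                    throw e;                                   // non-retriable, or out of time
                }
                TimeUnit.MILLISECONDS.sleep(backoffMs);        // back off before retrying
                backoffMs = Math.min(backoffMs * 2, 5_000);    // capped exponential backoff
            }
        }
    }
}
```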
@rdhabalia There is very little complexity added to separate the timeout. The complexity I think you are referring to is the record-keeping so that if an exception is thrown, it contains information about previous failures, not just the last failure. Business logic never actually looks at the List of exceptions.
BrokerClientIntegrationTest#testCloseConnectionOnBrokerRejected was depending on the fact that TooManyRequests was previously fatal for partition metadata requests. Now that it retries, that test was failing. It's a bad test anyhow, depending on thread interactions and whatnot. I've rewritten it to use the ServerCnx mock. It now actually tests for the thing it should, that clients close the connection after the max rejects. The schema tests were failing because they expected a certain exception message which has been extended; I changed endsWith to contains. I also added Producer retries similar to the Consumer ones. I was going to do that as a follow-on PR, but decided to put it in this one.
Thanks for your contribution. For this PR, do we need to update docs? (The PR template contains info about doc, which helps others know more about the changes. Can you provide doc-related info in this and future PR descriptions? Thanks)
}

@Override
public String toString() {
Why do we need to print out the previously encountered exceptions every time we log an exception? We already log every exception in the client, can't we just search the logs for the history?
@jerrypeng we don't print it when we log. The previous exceptions only get attached when the exception is propagated to the client. It's useful because it gives you more info to correlate on the broker side.
@ivankelly so when will we print out the previous exceptions?
@ivankelly aren't we printing lookup errors here:
https://github.com/apache/pulsar/blob/master/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ConnectionHandler.java#L81
Looking through the code, we do log exceptions returned from the broker.
What I meant is, we don't print the exception with the previous exceptions attached every time we log. We only print that at the point where we're about to complete the subscribeFuture or producerCreatedFuture with an exception, which is when the exception gets passed to the client. For me, the logging of that exception is incidental. What I want is for the client code to get an exception that has context about the retries.
Take for example the case of a customer who has a flink pipeline, and they get a TooManyRequestsException. They take a screenshot of the exception in the flink dashboard and send it to us. I want all the information to be in that screenshot, and not have to ask them to dig around in flink logs to get it.
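A rough sketch of that flow, with hypothetical names rather than the actual ConsumerImpl/ProducerImpl code: per-retry failures are only recorded, and the accumulated history is attached to the exception that finally completes the caller-visible future:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

// hypothetical sketch of the record-keeping described above
final class RetryHistorySketch<T> {
    private final List<Throwable> previousFailures = new CopyOnWriteArrayList<>();

    // called on each failed retry; nothing is surfaced to the caller yet
    void recordFailure(Throwable t) {
        previousFailures.add(t);
    }

    // called once, when we give up: the history travels with the final exception
    void failFuture(CompletableFuture<T> future, Exception last) {
        previousFailures.forEach(last::addSuppressed);
        future.completeExceptionally(last);
    }
}
```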
if (nonRetriableError) {
    log.info("[{}] Consumer creation failed for consumer {} with unretriableError {}", topic, consumerId, exception);
} else {
    log.info("[{}] Consumer creation failed for consumer {} after timeout", topic, consumerId);
I think it will still be useful to log the exception even if we timed out
I didn't change this code.
Can we add it?
it wouldn't add any more information. TimeoutExceptions all come from
pulsar/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ClientCnx.java
Line 1149 in 05827ae
if (requestFuture.completeExceptionally(new TimeoutException(timeoutMessage))) {
So the timeout exception message and stack would be exactly the same every time.
Also, note that with the non-timeout exception it's not printing the stacktrace.
Generally looks good to me. Left a couple of comments
@Anonymitaet there's javadoc for the new configuration option.
@@ -243,7 +245,7 @@ protected ConsumerImpl(PulsarClientImpl client, String topic, ConsumerConfigurat
     this.initialStartMessageId = this.startMessageId;
     this.startMessageRollbackDurationInSec = startMessageRollbackDurationInSec;
     AVAILABLE_PERMITS_UPDATER.set(this, 0);
-    this.subscribeTimeout = System.currentTimeMillis() + client.getConfiguration().getOperationTimeoutMs();
+    this.subscribeTimeout = System.currentTimeMillis() + client.getConfiguration().getLookupTimeoutMs();
The time to subscribe will include the time to do a lookup + the time to create a connection (if the connection to the broker is not established yet). However, with our current code we are including the connection establishment time within our lookup time. This makes the timeout here confusing and hard to reason about, as it may or may not include the time to establish a connection. Also, establishing a connection has its own timeout, which defaults to 10 seconds. I think we should clearly separate the two timeouts so one is not just overlapping with the other, and we can clearly understand if subscribe failed because of a lookup timeout or a connection timeout.
Same for producers.
It also includes the time to do CommandSubscribe and CommandPublish.
Separating the timeout to do the lookup from the time to establish the correct connection is a major rework of how timeouts work. The lookup timeout and retry are handled in ConsumerImpl and ProducerImpl, and these only get signals via the connectionFailed and connectionOpen callbacks. So to separate it out, we'd need to refactor how the Impls get a connection. Currently it goes from Impl->ConnectionHandler->PulsarClientImpl->LookupService. I don't think it's worth it. It's already clear if subscribe failed due to a lookup timeout or a connection timeout: the exception returned is different, PulsarClientException.TimeoutException for the former, netty's ConnectTimeoutException for the latter. If you want to know which node you failed to connect to, it's there in the exception message.
java.util.concurrent.CompletionException: org.apache.pulsar.client.api.PulsarClientException: java.util.concurrent.CompletionException: io.netty.channel.ConnectTimeoutException: connection timed out: /192.168.1.34:5432
w.r.t. including the CommandSubscribe and CommandProducer, I can change this, but it would create a behavioral change by default as then the operationTimeout for these commands only starts counting down after lookup has succeeded. i.e. the whole operation could take twice as long. I guess this isn't a major issue though.
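For reference, a small sketch of how calling code can already tell the two failure modes apart by inspecting the cause chain (topic, subscription, and service URL are placeholders):

```java
import java.util.concurrent.CompletionException;

import io.netty.channel.ConnectTimeoutException;
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.PulsarClientException;

public class SubscribeFailureExample {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")   // placeholder
                .build();
        try (Consumer<byte[]> consumer = client.newConsumer()
                .topic("my-topic")                       // placeholder topic
                .subscriptionName("my-sub")
                .subscribeAsync()
                .join()) {
            // use the consumer ...
        } catch (CompletionException e) {
            // the client wraps failures, so walk the cause chain (see the example above)
            for (Throwable t = e; t != null; t = t.getCause()) {
                if (t instanceof PulsarClientException.TimeoutException) {
                    System.err.println("lookup/subscribe timed out inside the Pulsar client");
                } else if (t instanceof ConnectTimeoutException) {
                    System.err.println("TCP connect to the resolved broker timed out: " + t.getMessage());
                }
            }
            throw e;
        } finally {
            client.close();
        }
    }
}
```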
left some additional comments. Thanks.
@BewareMyPower do you have any additional concerns?
IIUC @rdhabalia left some comments on the ML. @rdhabalia, do you mind officially writing up your position about this PR? It also looks like @jerrypeng initially approved the PR and then added more comments; if you can, @jerrypeng, please add your review as well.
@eolivelli this has 4 approvals and no changes requested. IMO it's ready to merge.
agreed. merging now
@eolivelli thanks