
Request time out in JobActivation request (and others) is ignored #9276

Closed
pihme opened this issue May 3, 2022 · 11 comments
Labels
area/reliability: Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected)
area/ux: Marks an issue as related to improving the user experience
component/gateway
kind/bug: Categorizes an issue or PR as a bug
scope/gateway: Marks an issue or PR to appear in the gateway section of the changelog

Comments

@pihme
Contributor

pihme commented May 3, 2022

Describe the bug

Setting a high request timeout on job activation (and other) requests has no effect. Further down the line, the request timeout is overwritten by the default timeout.

To Reproduce

  • Set request timeout for job worker
  • Either bring your system into an unresponsive state, or point your debugger to this position:

[Screenshot: debugger paused at the position in question, with two timeout values highlighted]

Note the highlighted sections. The upper section is the 5-minute timeout set on the client. The lower section is the default request timeout set in BrokerRequestManager at construction time.
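For context, the request timeout in question is the one configured on the job worker. A minimal sketch of such a worker (gateway address, job type, and handler are illustrative placeholders, not part of the original report):

    final ZeebeClient client =
        ZeebeClient.newClientBuilder().gatewayAddress("localhost:26500").usePlaintext().build();

    client
        .newWorker()
        .jobType("benchmark-task") // example job type, also used later in this thread
        .handler((jobClient, job) -> jobClient.newCompleteCommand(job.getKey()).send().join())
        .requestTimeout(Duration.ofMinutes(5)) // the timeout that appears to be ignored
        .maxJobsActive(1000)
        .open();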

Expected behavior

  • request timeout is obeyed

Environment:

  • Zeebe Version: 8.1.0-SNAPSHOT
@pihme pihme added kind/bug, team/distributed labels May 3, 2022
@npepinpe npepinpe added area/ux, area/reliability, scope/gateway, team/process-automation and removed team/distributed labels May 3, 2022
@npepinpe npepinpe self-assigned this May 3, 2022
@npepinpe npepinpe added this to the 8.1 milestone May 3, 2022
@npepinpe
Member

npepinpe commented May 3, 2022

Looking into it, it seems this is the current behavior (whether it's expected or not is a different thing):

  • requestTimeout is a property on all client commands
  • For most commands, it's a purely local property, i.e. it's not part of the request, but is used to add a timeout on the client side. For such commands, the only server-side timeout is the gateway-to-broker timeout, which is configurable in the gateway itself (but not on a per-request basis).
  • For CreateProcessInstanceWithResult, it is also used as the timeout on the gateway, since the request waits for the result before returning; there it overrides the gateway-to-broker timeout configured on the gateway.
  • For job activation, it's used as the total timeout for long polling. When long polling is disabled, it's unused. In all cases, the gateway-to-broker timeout is used by the gateway (and cannot be overridden per request). When long polling is disabled, the timeout on the server side is then gatewayToBrokerTimeout * numberOfPartitions.

I think the property name is just confusing; it's really the longPollingTimeout (which is incidentally what it ends up being called on the server side 😄). So it's my bad, I misunderstood (or rather forgot) how the timeouts were supposed to work. In our case we would've wanted to increase the gateway-to-broker timeout (e.g. ZEEBE_GATEWAY_CLUSTER_REQUESTTIMEOUT = 5m) AND the client request timeout.
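For illustration, a minimal sketch of that combination (the 5-minute values are examples and the job type is the one used elsewhere in this thread; this is not a verified fix):

    // On the gateway, raise the gateway-to-broker timeout (environment variable from above):
    //   ZEEBE_GATEWAY_CLUSTER_REQUESTTIMEOUT=5m
    // On the client, raise the per-request timeout, which for job activation acts as the
    // long polling timeout:
    final ActivateJobsResponse response =
        client
            .newActivateJobsCommand()
            .jobType("benchmark-task")
            .maxJobsToActivate(5)
            .requestTimeout(Duration.ofMinutes(5))
            .send()
            .join();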

I would propose adding a comment in the protocol explaining this, but would refrain from changing the name as that would be a breaking change. Alternatively, we could already deprecate requestTimeout and add a longPollingTimeout (mapping requestTimeout to longPollingTimeout for backwards compatibility), and slate the removal of requestTimeout for a later date.

WDYT?

@pihme
Contributor Author

pihme commented May 4, 2022

That doesn't really explain it. Long polling is active by default:
public static final boolean DEFAULT_LONG_POLLING_ENABLED = true;

So by the logic described so clearly above, it should have an effect.

@pihme
Contributor Author

pihme commented May 4, 2022

Also regarding:

For most commands, it's a purely local property, i.e. it's not part of the request, but is used to add a timeout on the client side. For such commands, the only timeout server is the gateway to broker timeout configurable in the gateway itself (but not on a per request basis).

What is the point of it then? It seems I can only make the timeout on the client shorter than the value configured on the gateway?

So if I, e.g., try to send a new deployment command, or want to cancel a process, I send this command and then it times out between gateway and broker. I tested this with a standalone broker, so clearly we are not talking about network latency.

My assumption is that the timeout then happens due to processing latency, i.e. if the broker cannot process the command fast enough, the request times out. But this is really unfortunate, e.g. in the context of #8991: the broker is busy and there seems to be no way to submit any commands at all.

Also, whenever we talk about backpressure and wanting to let certain commands through to "make progress": if we cannot get those commands through because of the timeout, we can never get any commands added to the log when, e.g., the number of unprocessed records is around 1 million.

Is this really what is happening? Or did I get things mixed up and I missed something?

@pihme pihme changed the title Request time out in JobActivation request is ignored Request time out in JobActivation request (and others) is ignored May 4, 2022
@npepinpe
Member

npepinpe commented May 4, 2022

I wrote a small test to check if the long polling timeout has an effect. In io.camunda.zeebe.it.client.command.LongPollingActivateJobsTest (where long polling is enabled):

  @Test
  public void shouldTimeoutRequest() {
    // given
    final var activateJobsResponse =
        CLIENT_RULE
            .getClient()
            .newActivateJobsCommand()
            .jobType(jobType)
            .maxJobsToActivate(5)
            .workerName("open")
            .requestTimeout(Duration.ofMinutes(10))
            .send();

    // then
    assertThat((Future<? extends Object>) activateJobsResponse)
        .failsWithin(Duration.ofMinutes(11))
        .withThrowableOfType(ClientStatusException.class)
        .asInstanceOf(InstanceOfAssertFactories.type(ClientStatusException.class))
        .extracting(ClientStatusException::getStatusCode)
        .isEqualTo(Code.DEADLINE_EXCEEDED);
  }

As far as I can tell, it seems to work fine. I'll try with the worker, perhaps there's something weird going on there. Can you share the code you were using to get a DEADLINE_EXCEEDED from the worker?

Re your other points: I think it's a bit more nuanced. The whitelisted commands are added to the command queue and will be processed - but their response won't be received. Other, non-whitelisted commands will be rejected immediately. The hope is that the command queue will shrink enough that new commands can be accepted later. But clearly with a fork bomb this isn't the case. I think this shows our backpressure strategy - having whitelisted commands, and also having the internal/engine-produced commands bypass backpressure - is flawed. With a fork bomb, I think there's a good chance you will never make any progress.

Re the timeouts: this is also correct. Since there's no form of context propagation, timeouts only apply to certain parts - the one set in the client applies only to the client (except for long polling and create instance with result), and the gateway one only to gateway-to-broker requests. There's no real way to set a timeout in the broker. This has plenty of downsides, e.g. if the client times out, it's very likely the command is still processed - the client just never knows about it. This is a problem for example with job activation, where jobs are activated but never worked on.
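As a sketch of what that looks like from the client's side (the values are illustrative; the status-code handling mirrors the test above):

    try {
      client
          .newActivateJobsCommand()
          .jobType("benchmark-task")
          .maxJobsToActivate(1)
          .requestTimeout(Duration.ofSeconds(10))
          .send()
          .join();
    } catch (final ClientStatusException e) {
      if (e.getStatusCode() == Code.DEADLINE_EXCEEDED) {
        // The client has given up here, but the broker may still process the command later;
        // any jobs it activates would then time out server-side without being worked on.
      }
    }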

I think we need to revisit our backpressure strategy, but last time we talked about it no one had any ideas on how to deal with fork bombs. Maybe being able to blacklist an instance/process definition without having to go through the command queue, as a sort of fail-safe?

@pihme
Contributor Author

pihme commented May 4, 2022

This is the code for the worker:

    client.newWorker().jobType("benchmark-task").handler(new JobHandler() {
      @Override
      public void handle(JobClient client, ActivatedJob job) throws Exception {
        System.out.println(job.getKey());
        client.newCompleteCommand(job.getKey()).send().join();
      }
    }).requestTimeout(Duration.ofMinutes(5)).maxJobsActive(1000).open();

No exceptions are logged on the client side. On server side I see these log messages:

08:52:29.464 [Broker-0-ZeebePartition-1] [Broker-0-zb-actors-0] DEBUG io.camunda.zeebe.broker.system - Detected 'UNHEALTHY' components. The current health status of components: [ZeebePartition-1{status=HEALTHY}, raft-partition-partition-1{status=HEALTHY}, Broker-0-Exporter-1{status=HEALTHY}, Broker-0-StreamProcessor-1{status=UNHEALTHY, issue=HealthIssue[message=actor appears blocked, throwable=null, cause=null]}, Broker-0-LogStream-1{status=HEALTHY}, Broker-0-SnapshotDirector-1{status=HEALTHY}]
08:52:29.464 [Broker-0-HealthCheckService] [Broker-0-zb-actors-0] WARN  io.camunda.zeebe.broker.system - Partition-1 failed, marking it as unhealthy: Broker-0{status=HEALTHY}
08:52:29.464 [Broker-0-HealthCheckService] [Broker-0-zb-actors-0] DEBUG io.camunda.zeebe.broker.system - Detected 'UNHEALTHY' components. The current health status of components: [Partition-1{status=UNHEALTHY, issue=HealthIssue[message=null, throwable=null, cause=Broker-0-StreamProcessor-1{status=UNHEALTHY, issue=HealthIssue[message=actor appears blocked, throwable=null, cause=null]}]}]
08:52:29.523 [GatewayTopologyManager] [Broker-0-zb-actors-1] DEBUG io.camunda.zeebe.gateway - Received metadata change from Broker 0, partitions {1=LEADER}, terms {1=19} and health {1=UNHEALTHY}.
Allow new timer 2251799815167539
08:52:34.261 [Broker-0-StreamProcessor-1] [Broker-0-zb-actors-0] DEBUG io.camunda.zeebe.broker.transport - Expected to have a rate limiter listener for request-1-185, but none found. (This can happen during fail over.)
-1 JobIntent.TIME_OUT
08:52:34.263 [ActivateJobsHandler] [Broker-0-zb-actors-0] WARN  io.camunda.zeebe.gateway - Failed to activate jobs for type benchmark-task from partition 1
java.util.concurrent.TimeoutException: Request ProtocolRequest{id=4, subject=command-api-1, sender=0.0.0.0:26502, payload=byte[]{length=131, hash=238853500}} to 0.0.0.0:26501 timed out in PT15S
	at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$sendAndReceive$4(NettyMessagingService.java:230) ~[classes/:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.76.Final.jar:4.1.76.Final]
	at java.lang.Thread.run(Thread.java:833) ~[?:?]

(Additional System.out statements were added to get some insight into the black box.)

@pihme
Contributor Author

pihme commented May 4, 2022

This has plenty of downsides, e.g. if the client times out, it's very likely the command is still processed - just the client never knows about it. This is a problem for example with job activation, where jobs are activated but never worked on.

This checks out with what I am seeing. The activate job commands time out from the point of view of the client, and yet I see JobIntent.TIME_OUT commands being processed by the engine. So it seems they were accepted by the server.

However, I cannot say the same for the deploy process command. I would hope this one is also whitelisted. Here it doesn't appear that it was written on the server. Or maybe it is too early to tell. I can keep the server running for a couple of hours to see whether it was written to the log and will be executed eventually.

@npepinpe
Member

npepinpe commented May 4, 2022

Ah, then it's behaving as "expected" (whether that's the right behavior or not is questionable, of course). The request timeout is applied as the total timeout for long polling, i.e. how long the request will be re-enqueued and retried by the LongPollingActivateJobsHandler. However, each individual broker request will only use the gateway-to-broker timeout configured in the gateway 😞

So if we give a request timeout of 5 minutes, then the long polling activate jobs handler should be retrying this request for at least 5 minutes without closing the StreamObserver on the server side (or until it has activated enough jobs).
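In other words, each broker attempt is bounded by the gateway-to-broker timeout (the PT15S default visible in the stack trace above), while the client's request timeout only bounds the overall long polling window. A rough model of the two bounds, assuming the defaults seen in this thread (not actual gateway code):

    final Duration clientRequestTimeout = Duration.ofMinutes(5);    // total long polling window
    final Duration gatewayToBrokerTimeout = Duration.ofSeconds(15); // bound on each broker attempt
    // Raising only clientRequestTimeout keeps the long polling window open longer,
    // but every individual broker request still fails after gatewayToBrokerTimeout.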

The deploy command is not whitelisted, only the job complete or fail commands are.

@pihme
Contributor Author

pihme commented May 4, 2022

The deploy command is not whitelisted, only the job complete or fail commands are

Maybe we should consider whitelisting it. It would be one way to potentially fix all process instances of a certain process model.

@pihme
Contributor Author

pihme commented May 4, 2022

I now changed my worker to wait 30 mins for a single job to be activated:
}).requestTimeout(Duration.ofMinutes(30)).maxJobsActive(1).open();

I still cannot activate any jobs at all. Whether that is expected or not, at least it is not useful.

@npepinpe
Member

I'm putting this back in the backlog; to be honest, I didn't make much progress on this anyway.

@npepinpe npepinpe removed this from the 8.1 milestone Jun 24, 2022
@npepinpe
Member

I think it's safe to close this. Request timeouts work pretty well; the issue is that we don't propagate cancellation on the server side, but there is another issue for that.
