
Request time out in JobActivation request (and others) is ignored #9276

Closed
pihme opened this issue May 3, 2022 · 11 comments
Labels
area/reliability: Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected)
area/ux: Marks an issue as related to improving the user experience
component/gateway
kind/bug: Categorizes an issue or PR as a bug
scope/gateway: Marks an issue or PR to appear in the gateway section of the changelog

Comments

@pihme
Contributor

pihme commented May 3, 2022

Describe the bug

Setting a high request timeout on job activation (and other) requests has no effect. Further down the line, the request timeout is overwritten by the default timeout.

To Reproduce

  • Set request timeout for job worker
  • Either bring your system into an unresponsive state, or point your debugger to this position:

[Screenshot: debugger paused at the position in question, with two timeout values highlighted]

Note the highlighted sections. The upper section is the 5-minute timeout set on the client. The lower section is the default request timeout set in BrokerRequestManager at construction time.
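For context, the request timeout in question is the one configured on the job worker. A minimal sketch of such a worker (gateway address, job type, and handler are illustrative placeholders, not part of the original report):

    final ZeebeClient client =
        ZeebeClient.newClientBuilder().gatewayAddress("localhost:26500").usePlaintext().build();

    client
        .newWorker()
        .jobType("benchmark-task") // example job type, also used later in this thread
        .handler((jobClient, job) -> jobClient.newCompleteCommand(job.getKey()).send().join())
        .requestTimeout(Duration.ofMinutes(5)) // the timeout that appears to be ignored
        .maxJobsActive(1000)
        .open();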

Expected behavior

  • request timeout is obeyed

Environment:

  • Zeebe Version: 8.1.0-SNAPSHOT
@pihme pihme added kind/bug, team/distributed labels May 3, 2022
@npepinpe npepinpe added area/ux, area/reliability, scope/gateway, team/process-automation and removed team/distributed labels May 3, 2022
@npepinpe npepinpe self-assigned this May 3, 2022
@npepinpe npepinpe added this to the 8.1 milestone May 3, 2022
@npepinpe
Member

npepinpe commented May 3, 2022

Looking into it, it seems this is the current behavior (whether it's expected or not is a different thing):

  • requestTimeout is a property on all client commands
  • For most commands, it's a purely local property, i.e. it's not part of the request, but is used to add a timeout on the client side. For such commands, the only server-side timeout is the gateway-to-broker timeout, which is configurable in the gateway itself (but not on a per-request basis).
  • For CreateProcessInstanceWithResult, it is also used as the timeout on the gateway, since the request waits for the result before returning; there it overrides the gateway-to-broker timeout configured on the gateway.
  • For job activation, it's used as the total timeout for long polling. When long polling is disabled, it's unused. In all cases, the gateway-to-broker timeout is used by the gateway (and cannot be overridden per request). When long polling is disabled, the timeout on the server side is then gatewayToBrokerTimeout * numberOfPartitions.

I think the property name is just confusing; it's really the longPollingTimeout (which is incidentally what it ends up being called on the server side 😄). So it's my bad, I misunderstood (or rather forgot) how the timeouts were supposed to work. In our case we would've wanted to increase the gateway-to-broker timeout (e.g. ZEEBE_GATEWAY_CLUSTER_REQUESTTIMEOUT = 5m) AND the client request timeout.
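For illustration, a minimal sketch of that combination (the 5-minute values are examples and the job type is the one used elsewhere in this thread; this is not a verified fix):

    // On the gateway, raise the gateway-to-broker timeout (environment variable from above):
    //   ZEEBE_GATEWAY_CLUSTER_REQUESTTIMEOUT=5m
    // On the client, raise the per-request timeout, which for job activation acts as the
    // long polling timeout:
    final ActivateJobsResponse response =
        client
            .newActivateJobsCommand()
            .jobType("benchmark-task")
            .maxJobsToActivate(5)
            .requestTimeout(Duration.ofMinutes(5))
            .send()
            .join();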

I would propose adding a comment in the protocol explaining this, but would refrain from changing the name as that would be a breaking change. Alternatively, we could already deprecate requestTimeout and add a longPollingTimeout (mapping requestTimeout to longPollingTimeout for backwards compatibility), and slate the removal of requestTimeout for a later date.

WDYT?

@pihme
Contributor Author

pihme commented May 4, 2022

That doesn't really explain it. Long polling is active by default:
public static final boolean DEFAULT_LONG_POLLING_ENABLED = true;

So by the logic described so clearly above, it should have an effect.

@pihme
Contributor Author

pihme commented May 4, 2022

Also regarding:

For most commands, it's a purely local property, i.e. it's not part of the request, but is used to add a timeout on the client side. For such commands, the only timeout server is the gateway to broker timeout configurable in the gateway itself (but not on a per request basis).

What is the point of it then? It seems I can only make the timeout on the client shorter than the value configured on the gateway?

So if I, e.g., try to send a new deployment command, or want to cancel a process, I send this command and then it times out between gateway and broker. I tested this with a standalone broker, so clearly we are not talking about network latency.

My assumption is that the timeout then happens due to processing latency, i.e. if the broker cannot process the command fast enough, the request times out. But this is really unfortunate, e.g. in the context of #8991: the broker is busy and there seems to be no way to submit any commands at all.

Also, whenever we talk about backpressure and wanting to let certain commands through to "make progress": if we cannot get those commands through because of the timeout, we can never get any commands added to the log when, e.g., the number of unprocessed records is around 1 million.

Is this really what is happening? Or did I get things mixed up and I missed something?

@pihme pihme changed the title Request time out in JobActivation request is ignored Request time out in JobActivation request (and others) is ignored May 4, 2022
@npepinpe
Member

npepinpe commented May 4, 2022

I wrote a small test to check if the long polling timeout has an effect. In io.camunda.zeebe.it.client.command.LongPollingActivateJobsTest (where long polling is enabled):

  @Test
  public void shouldTimeoutRequest() {
    // given
    final var activateJobsResponse =
        CLIENT_RULE
            .getClient()
            .newActivateJobsCommand()
            .jobType(jobType)
            .maxJobsToActivate(5)
            .workerName("open")
            .requestTimeout(Duration.ofMinutes(10))
            .send();

    // then
    assertThat((Future<? extends Object>) activateJobsResponse)
        .failsWithin(Duration.ofMinutes(11))
        .withThrowableOfType(ClientStatusException.class)
        .asInstanceOf(InstanceOfAssertFactories.type(ClientStatusException.class))
        .extracting(ClientStatusException::getStatusCode)
        .isEqualTo(Code.DEADLINE_EXCEEDED);
  }

As far as I can tell, it seems to work fine. I'll try with the worker, perhaps there's something weird going on there. Can you share the code you were using to get a DEADLINE_EXCEEDED from the worker?

Re your other points: I think it's a bit more nuanced. The whitelisted commands are added to the command queue and will be processed - but their response won't be received. Other, non-whitelisted commands will be rejected immediately. The hope is that the command queue will shrink enough that new commands can be accepted later. But clearly with a fork bomb this isn't the case. I think this shows our backpressure strategy - having whitelisted commands, and also having the internal/engine-produced commands bypass backpressure - is flawed. With a fork bomb, I think there's a good chance you will never make any progress.

Re the timeouts: this is also correct. Since there's no form of context propagation, timeouts only apply to certain parts - the one set in the client applies only to the client (except for long polling and create instance with result), and the gateway one only to gateway-to-broker requests. There's no real way to set a timeout in the broker. This has plenty of downsides, e.g. if the client times out, it's very likely the command is still processed - the client just never knows about it. This is a problem for example with job activation, where jobs are activated but never worked on.
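As a sketch of what that looks like from the client's side (the values are illustrative; the status-code handling mirrors the test above):

    try {
      client
          .newActivateJobsCommand()
          .jobType("benchmark-task")
          .maxJobsToActivate(1)
          .requestTimeout(Duration.ofSeconds(10))
          .send()
          .join();
    } catch (final ClientStatusException e) {
      if (e.getStatusCode() == Code.DEADLINE_EXCEEDED) {
        // The client has given up here, but the broker may still process the command later;
        // any jobs it activates would then time out server-side without being worked on.
      }
    }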

I think we need to revisit our backpressure strategy, but last time we talked about it no one had any ideas on how to deal with fork bombs. Maybe being able to blacklist an instance/process definition without having to go through the command queue, as a sort of fail-safe?

@pihme
Contributor Author

pihme commented May 4, 2022

This is the code for the worker:

    client.newWorker().jobType("benchmark-task").handler(new JobHandler() {
      @Override
      public void handle(JobClient client, ActivatedJob job) throws Exception {
        System.out.println(job.getKey());
        client.newCompleteCommand(job.getKey()).send().join();
      }
    }).requestTimeout(Duration.ofMinutes(5)).maxJobsActive(1000).open();

No exceptions are logged on the client side. On server side I see these log messages:

08:52:29.464 [Broker-0-ZeebePartition-1] [Broker-0-zb-actors-0] DEBUG io.camunda.zeebe.broker.system - Detected 'UNHEALTHY' components. The current health status of components: [ZeebePartition-1{status=HEALTHY}, raft-partition-partition-1{status=HEALTHY}, Broker-0-Exporter-1{status=HEALTHY}, Broker-0-StreamProcessor-1{status=UNHEALTHY, issue=HealthIssue[message=actor appears blocked, throwable=null, cause=null]}, Broker-0-LogStream-1{status=HEALTHY}, Broker-0-SnapshotDirector-1{status=HEALTHY}]
08:52:29.464 [Broker-0-HealthCheckService] [Broker-0-zb-actors-0] WARN  io.camunda.zeebe.broker.system - Partition-1 failed, marking it as unhealthy: Broker-0{status=HEALTHY}
08:52:29.464 [Broker-0-HealthCheckService] [Broker-0-zb-actors-0] DEBUG io.camunda.zeebe.broker.system - Detected 'UNHEALTHY' components. The current health status of components: [Partition-1{status=UNHEALTHY, issue=HealthIssue[message=null, throwable=null, cause=Broker-0-StreamProcessor-1{status=UNHEALTHY, issue=HealthIssue[message=actor appears blocked, throwable=null, cause=null]}]}]
08:52:29.523 [GatewayTopologyManager] [Broker-0-zb-actors-1] DEBUG io.camunda.zeebe.gateway - Received metadata change from Broker 0, partitions {1=LEADER}, terms {1=19} and health {1=UNHEALTHY}.
Allow new timer 2251799815167539
08:52:34.261 [Broker-0-StreamProcessor-1] [Broker-0-zb-actors-0] DEBUG io.camunda.zeebe.broker.transport - Expected to have a rate limiter listener for request-1-185, but none found. (This can happen during fail over.)
-1 JobIntent.TIME_OUT
08:52:34.263 [ActivateJobsHandler] [Broker-0-zb-actors-0] WARN  io.camunda.zeebe.gateway - Failed to activate jobs for type benchmark-task from partition 1
java.util.concurrent.TimeoutException: Request ProtocolRequest{id=4, subject=command-api-1, sender=0.0.0.0:26502, payload=byte[]{length=131, hash=238853500}} to 0.0.0.0:26501 timed out in PT15S
	at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$sendAndReceive$4(NettyMessagingService.java:230) ~[classes/:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.76.Final.jar:4.1.76.Final]
	at java.lang.Thread.run(Thread.java:833) ~[?:?]

(Additional System.out statements were added to get some insight into the black box.)

@pihme
Contributor Author

pihme commented May 4, 2022

This has plenty of downsides, e.g. if the client times out, it's very likely the command is still processed - just the client never knows about it. This is a problem for example with job activation, where jobs are activated but never worked on.

This checks out with what I am seeing. The activate job commands time out from the point of view of the client, and yet I see JobIntent.TIME_OUT commands being processed by the engine. So it seems they were accepted by the server.

However, I cannot say the same for the deploy process command. I would hope this one is also whitelisted. Here it doesn't appear that it was written on the server. Or maybe it is too early to tell. I can keep the server running for a couple of hours to see whether it was written to the log and will be executed eventually.

@npepinpe
Member

npepinpe commented May 4, 2022

Ah, then it's behaving as "expected" (whether that's the right behavior or not is questionable, of course). The request timeout is applied as the total timeout for long polling, i.e. how long the request will be re-enqueued and retried by the LongPollingActivateJobsHandler. However, each individual broker request will only use the gateway-to-broker timeout configured in the gateway 😞

So if we give a request timeout of 5 minutes, then the long polling activate jobs handler should be retrying this request for at least 5 minutes without closing the StreamObserver on the server side (or until it has activated enough jobs).
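In other words, each broker attempt is bounded by the gateway-to-broker timeout (the PT15S default visible in the stack trace above), while the client's request timeout only bounds the overall long polling window. A rough model of the two bounds, assuming the defaults seen in this thread (not actual gateway code):

    final Duration clientRequestTimeout = Duration.ofMinutes(5);    // total long polling window
    final Duration gatewayToBrokerTimeout = Duration.ofSeconds(15); // bound on each broker attempt
    // Raising only clientRequestTimeout keeps the long polling window open longer,
    // but every individual broker request still fails after gatewayToBrokerTimeout.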

The deploy command is not whitelisted, only the job complete or fail commands are.

@pihme
Contributor Author

pihme commented May 4, 2022

The deploy command is not whitelisted, only the job complete or fail commands are

Maybe we should consider whitelisting it. It would be one way to potentially fix all process instances of a certain process model.

@pihme
Contributor Author

pihme commented May 4, 2022

I now changed my worker to wait 30 mins for a single job to be activated:
}).requestTimeout(Duration.ofMinutes(30)).maxJobsActive(1).open();

I still cannot activate any jobs at all. Whether that is expected or not, at least it is not useful.

@npepinpe
Member

I'm putting this back in the backlog; to be honest, I didn't make much progress on this anyway.

@npepinpe npepinpe removed this from the 8.1 milestone Jun 24, 2022
@npepinpe
Member

I think it's safe to close this. Request timeouts work pretty well; the issue is that we don't propagate cancellation on the server side, but there is another issue for that.
