Request time out in JobActivation request (and others) is ignored #9276
Comments
Looking into it, it seems this is the current behavior (whether it's expected or not is a different thing):
I think the property name is just confusing; it's really the long polling timeout. I would propose to add a comment in the protocol explaining this, but would refrain from changing the name as that would be a breaking change. Alternatively, we could deprecate it. WDYT?
That doesn't really explain it. Long polling is active by default, so by the logic described above, it should have an effect.
Also regarding:
What is the point of this then? It seems I can only make the timeout shorter on the client than the value configured on the Gateway.

So if I try to e.g. send a new deployment command, or want to cancel the process, I send this command and then it times out between Gateway and Broker. I tested this with a standalone broker, so clearly we are not talking about network latency. My assumption is that the timeout then happens due to processing latency, i.e. if the Broker cannot process the command fast enough, it times out.

But this is really unfortunate, e.g. in the context of #8991. The Broker is busy and there seems to be no way to submit any commands at all. Also, whenever we talk about backpressure we say that we want to let certain commands through to "make progress", but if we cannot get the command through due to a timeout, this would mean we can never get any commands added to the log if e.g. the number of unprocessed records is around 1 million. Is this really what is happening? Or did I get things mixed up and miss something?
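To make the scenario concrete, here is a minimal sketch (not from the issue; gateway address, resource name, and timeout values are assumptions) of sending a deploy command with a generous client-side request timeout via the Zeebe Java client:

```java
import io.camunda.zeebe.client.ZeebeClient;
import java.time.Duration;

public class DeployTimeoutRepro {
  public static void main(String[] args) {
    try (final ZeebeClient client =
        ZeebeClient.newClientBuilder()
            .gatewayAddress("localhost:26500") // placeholder address
            .usePlaintext()
            .build()) {
      // The client-side timeout below only bounds the client-to-gateway
      // call; it is not propagated downstream, so the gateway-to-broker
      // request can still time out on its own (shorter) default.
      client
          .newDeployResourceCommand()
          .addResourceFromClasspath("process.bpmn") // placeholder resource
          .requestTimeout(Duration.ofMinutes(5))
          .send()
          .join();
    }
  }
}
```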
I wrote a small test to check if the long polling timeout has an effect:

```java
@Test
public void shouldTimeoutRequest() {
  // given
  final var activateJobsResponse =
      CLIENT_RULE
          .getClient()
          .newActivateJobsCommand()
          .jobType(jobType)
          .maxJobsToActivate(5)
          .workerName("open")
          .requestTimeout(Duration.ofMinutes(10))
          .send();

  // then
  assertThat((Future<? extends Object>) activateJobsResponse)
      .failsWithin(Duration.ofMinutes(11))
      .withThrowableOfType(ClientStatusException.class)
      .asInstanceOf(InstanceOfAssertFactories.type(ClientStatusException.class))
      .extracting(ClientStatusException::getStatusCode)
      .isEqualTo(Code.DEADLINE_EXCEEDED);
}
```

As far as I can tell, it seems to work fine. I'll try with the worker; perhaps there's something weird going on there. Can you share the code you were using to get a DEADLINE_EXCEEDED from the worker?

Re your other points: I think it's a bit more nuanced. The whitelisted commands are added to the command queue and will be processed - but their response won't be received. Other, non-whitelisted commands will be rejected immediately. The hope is that the command queue will shrink such that you will be able to accept new commands later. But clearly with a fork bomb this isn't the case. I think this shows that our backpressure strategy - having whitelisted commands, and also having the internal/engine-produced commands bypass backpressure - is flawed. With a fork bomb, there's a good chance you will never make any progress.

Re the timeouts: this is also correct. Since there's no form of context propagation, timeouts only apply to certain parts - the one set in the client only to the client (except for long polling and create instance with result), and the gateway one only for gateway-to-broker requests. There's no real way to set a timeout in the broker. This has plenty of downsides, e.g. if the client times out, it's very likely the command is still processed - just the client never knows about it. This is a problem for example with job activation, where jobs are activated but never worked on.

I think we need to revisit our backpressure strategy, but last time we talked about it no one had any ideas on how to deal with fork bombs. Maybe being able to blacklist an instance/process definition without having to go through the command queue, as a sort of fail safe?
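As an illustration of that scoping, a hedged sketch (process id and gateway address are placeholders) of one of the two cases where the client-side timeout is forwarded, create instance with result:

```java
import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.response.ProcessInstanceResult;
import java.time.Duration;

public class CreateWithResultTimeout {
  public static void main(String[] args) {
    try (final ZeebeClient client =
        ZeebeClient.newClientBuilder()
            .gatewayAddress("localhost:26500") // placeholder address
            .usePlaintext()
            .build()) {
      // For withResult() (and long polling job activation) the gateway
      // keeps the request open until the result arrives, so the
      // client-side request timeout effectively bounds the whole wait.
      final ProcessInstanceResult result =
          client
              .newCreateInstanceCommand()
              .bpmnProcessId("my-process") // hypothetical process id
              .latestVersion()
              .withResult()
              .requestTimeout(Duration.ofMinutes(2))
              .send()
              .join();
      System.out.println("variables: " + result.getVariables());
    }
  }
}
```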
This is the code for the worker:
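(The original snippet is not reproduced here; below is an illustrative sketch of such a worker, with job type, timeout values, and gateway address as assumptions.)

```java
import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.worker.JobWorker;
import java.time.Duration;

public class WorkerExample {
  public static void main(String[] args) throws InterruptedException {
    try (final ZeebeClient client =
        ZeebeClient.newClientBuilder()
            .gatewayAddress("localhost:26500") // placeholder address
            .usePlaintext()
            .build()) {
      // requestTimeout here is the long polling timeout applied to each
      // ActivateJobs request the worker sends.
      final JobWorker worker =
          client
              .newWorker()
              .jobType("test") // hypothetical job type
              .handler((jobClient, job) ->
                  jobClient.newCompleteCommand(job.getKey()).send().join())
              .requestTimeout(Duration.ofMinutes(5))
              .open();
      Thread.sleep(Duration.ofMinutes(10).toMillis());
      worker.close();
    }
  }
}
```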
No exceptions are logged on the client side. On the server side I see these log messages:

(Additional System.out calls were added to have some insight into the black box.)
This checks out with what I am seeing. The activate job commands time out from the point of view of the client, and yet I see them written on the server. However, I cannot say the same for the deploy process command. I would hope this one is also whitelisted. Here it doesn't appear that it was written on the server. Or maybe it is too early to tell. I can keep the server running for a couple of hours to see whether it was written to the log and will be executed eventually.
Ah, then it's behaving as "expected" (whether that's the right behavior or not is questionable, of course). The request timeout is applied as the total timeout for long polling, i.e. how long the request will be re-enqueued and retried by the gateway's long polling handler. So if we give a request timeout of 5 minutes, then the long polling activate jobs handler should be retrying this request for at least 5 minutes without closing it.

The deploy command is not whitelisted; only the job complete and fail commands are.
Maybe we should consider whitelisting it. It would be one way to potentially fix all process instances of a certain process model.
I now changed my worker to wait 30 minutes for a single job to be activated. I still cannot activate any jobs at all. Whether that is expected or not, at least it is not useful.
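(Presumably the change amounts to raising the worker's request timeout. The original snippet is not reproduced here; under the same assumptions as the worker sketch above, it would look roughly like this.)

```java
// Illustrative only: a 30 minute long polling timeout per
// ActivateJobs request sent by the worker.
final JobWorker worker =
    client
        .newWorker()
        .jobType("test") // hypothetical job type
        .handler((jobClient, job) ->
            jobClient.newCompleteCommand(job.getKey()).send().join())
        .requestTimeout(Duration.ofMinutes(30))
        .open();
```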
I'm putting this back in the backlog; to be honest, I didn't make much progress on this anyway.
I think it's safe to close this. Request timeouts work pretty well; the issue is that we don't propagate cancellation on the server side, but there is another issue for that.
Describe the bug
Setting a high request timeout on job activation (and other) requests has no effect. Later down the line, the request timeout is overwritten by the default timeout.
To Reproduce
Note the highlighted sections. The upper section is the 5 minute timeout set on the client. The lower section is the default request timeout set in `BrokerRequestManager` at construction time.
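A hedged client-side sketch of the reproduction (assuming the usual `ZeebeClient` and `java.time.Duration` imports; job type and gateway address are placeholders):

```java
// Set a request timeout well above the gateway default and observe
// whether it is honored end to end, or overwritten downstream.
try (final ZeebeClient client =
    ZeebeClient.newClientBuilder()
        .gatewayAddress("localhost:26500") // placeholder address
        .usePlaintext()
        .build()) {
  client
      .newActivateJobsCommand()
      .jobType("test") // hypothetical job type
      .maxJobsToActivate(1)
      .requestTimeout(Duration.ofMinutes(5))
      .send()
      .join();
}
```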
Expected behavior
Environment: