Multi-instance parallel subprocess times out and never completes for large input collections #8687
Comments
When calling the process with 5,000 input collection variables, I do not see the blacklist log. I keep seeing these timeout errors repeating over and over: Stacktrace
When calling the process with 20,000 input collection variables, I do see the blacklist log: Stacktrace
Let's try to reproduce it and timebox root-causing it as we would normally in our medic process.
@lfourky Thanks to https://github.com/lfourky/zeebe-multiinstance-issue I was able to reproduce the issue locally. It looks like most of the time is spent in `evaluateArrayExpression`.
I've looked into this a bit more: I tried to improve the performance of `evaluateArrayExpression`, but to me this looks like a limitation of how `MultiInstanceBodyProcessor` is implemented: for each child instance that is created, it has to evaluate the entire input collection. /cc @npepinpe
Could this be a related issue? #8622
That one is a bit different: it had to do with jobs being activated but not making it to the worker, then timing out, as far as I remember. /cc @saig0 / @korthout: are there any low-hanging fruit to help improve the evaluation of the expression? If not, then I'd classify this as a current limitation of multi-instance (for now anyway). Is there any workaround we can suggest for users running into this?
I briefly discussed it with @saig0. We wonder what the actual numbers are for the performance of `evaluateArrayExpression`, and whether this is really the reason that the process can't make progress. If it were just slow, processing would simply take some additional time to finish, but this shouldn't block anything. However, the stacktrace shows that the activate job batch request times out after 15 seconds.

@oleschoenburg Did you already try increasing the request timeout or decreasing the job timeout?

Finally, if slowness is the reason, then we have some ideas for improving the performance of this specific expression evaluation. But again, we should measure first before making changes. Currently, it evaluates the expression and then transforms each element in the array to MessagePack, even though on activation of a child only one specific element from this list is needed (the one at the loop counter's index). If that part is taking too much time, we could change how we retrieve this one item from the expression result.
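To make the suspected cost concrete, here is a minimal sketch (hypothetical Go code with illustrative names, not Zeebe's actual implementation) of the difference between re-evaluating and converting the whole collection for every child versus extracting only the element at the loop counter's index:

```go
package main

import "fmt"

// activateChildrenNaive models the current behavior described above: for
// every child instance, the whole input collection is evaluated and every
// element is converted, even though only one element is needed.
func activateChildrenNaive(collection []string) int {
	conversions := 0
	for loopCounter := range collection {
		for range collection { // convert the entire collection again
			conversions++
		}
		_ = collection[loopCounter] // only this element is actually used
	}
	return conversions // n*n conversions in total
}

// activateChildrenIndexed models the proposed idea: retrieve only the
// element at the loop counter's index for each child.
func activateChildrenIndexed(collection []string) int {
	conversions := 0
	for loopCounter := range collection {
		_ = collection[loopCounter] // single indexed lookup
		conversions++
	}
	return conversions // n conversions in total
}

func main() {
	collection := make([]string, 5000)
	fmt.Println("naive:", activateChildrenNaive(collection))     // 25000000
	fmt.Println("indexed:", activateChildrenIndexed(collection)) // 5000
}
```

With 5,000 elements that is 25 million element conversions versus 5,000, which is exactly the kind of blow-up the proposed measurement should confirm or rule out.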
I've tried playing around with the timeouts in the Go worker:

```diff
- client.NewJobWorker().JobType("second_data_processor").Handler(secondParallelMultiInstance).Open()
+ client.NewJobWorker().JobType("second_data_processor").Handler(secondParallelMultiInstance).RequestTimeout(60 * time.Second).Timeout(5 * time.Second).PollInterval(1 * time.Second).Concurrency(1).Open()
```

but that did not improve the situation at all: no jobs of the multi-instance were activated (or at least they didn't reach the worker). I waited for ~10 minutes.

So as a sanity check I implemented the worker with the Java client:

```java
import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.response.ActivatedJob;
import io.camunda.zeebe.client.api.worker.JobClient;
import java.time.Duration;

public class App {

  public static void main(String[] args) throws InterruptedException {
    try (final var client =
        ZeebeClient.newClientBuilder().gatewayAddress("localhost:26500").usePlaintext().build()) {
      try (final var worker =
          client
              .newWorker()
              .jobType("second_data_processor")
              .handler(
                  (JobClient jobClient, ActivatedJob job) -> {
                    System.out.println("Handling job: " + job.getKey());
                    client.newCompleteCommand(job.getKey()).send().join();
                  })
              .timeout(Duration.ofSeconds(5))
              .requestTimeout(Duration.ofSeconds(60))
              .maxJobsActive(1)
              .pollInterval(Duration.ofSeconds(1))
              .open()) {
        System.out.println("Job worker started");
        Thread.currentThread().join();
      }
    }
  }
}
```

I'm using the same settings for timeout, requestTimeout, concurrency and pollInterval. Now we are finally activating jobs! So this leads me to believe that there is either a bug in the Go worker implementation, a bug in the Go client, or a subtle difference in the (default) configuration used for the Java and Go clients.
@oleschoenburg can you share some numbers? It would be interesting to see how long it takes to complete the process instance, also for different configurations. This could guide users in the right direction. We could tweak the job worker a bit by increasing the `maxJobsActive` setting.
We had some issues in the C# clients with job activation, which we only discovered by using Wireshark: we could then confirm the gateway was sending a response with the jobs, but somehow the job handlers on the C# side never observed them. Could be one lead here to figure out whether the bug is on the client or gateway side.
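A lighter-weight way to gather similar evidence on the Go side (a sketch, assuming the grpc-go based Zeebe Go client; this is not something from the thread) is to enable grpc-go's internal logging and check whether the ActivateJobs responses ever reach the client's transport layer:

```go
package main

import (
	"os"

	"google.golang.org/grpc/grpclog"
)

func init() {
	// Route grpc-go's internal logs to stdout/stderr. Together with the
	// environment variables GRPC_GO_LOG_SEVERITY_LEVEL=info and
	// GRPC_GO_LOG_VERBOSITY_LEVEL=2, this logs connection and stream events,
	// which helps distinguish a gateway-side problem from a client-side one.
	grpclog.SetLoggerV2(grpclog.NewLoggerV2(os.Stdout, os.Stdout, os.Stderr))
}

func main() {
	// ... set up the Zeebe client and job worker as in the repro repository.
	// If the gateway sends the jobs but the worker never sees them, the
	// transport-level logs should still show the incoming stream activity.
}
```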
@saig0 Increasing `maxJobsActive` didn't help here.
When using those settings, here's what I see. Worker logs:
Zeebe logs:
Yes, that's a known issue, but unless we know of a quick fix I'd rather try to convince our product team to allocate resources to redesign the job worker end-to-end (i.e. including gateway/broker) for a smoother experience. I guess my point here is, don't dig too deep into that unless it's immediately relevant to the issue at hand, if that makes sense 😄
I'll leave it in the backlog for now. The issues seem to be mostly with job activation, and I would like us to work on the job activation pipeline in general, as it has more than one such issue and no real quick fixes.
Would this be fixed by #8879? What was the end consensus: was the issue with the expression or with job activation?
I think it was mostly the job activation that was the problem here. Thanks for the reminder, I actually wanted to test this again 👍
For an input collection of 1,000 elements everything looks as expected, but for 2,000 I'm still getting pretty much the same behavior. Zeebe logs a bunch of errors.
Yes, but now the jobs should be returned to the broker and become available for activation again, so the hope is that overall perceived performance is better now. Is that the case?
I don't think so. There's just nothing happening. I'd expect that the same worker gets the jobs once they are re-activated, right?
Yeah, that's the expectation 😞
This issue came up in a support case. As a workaround, users can try to reduce the number of jobs activated at once (e.g. via the worker's `maxJobsActive` setting).
Can you refresh my memory here: the issue is that we spend too long busy-looping over the same multi-instance activity because the collection is too big, correct?
I'm not sure; I just tried the workaround and thought it made sense to highlight it here. @oleschoenburg do you remember the cause?
If I remember correctly, the cause was unclear to me. I had initially suspected that evaluating the input collection (which involves FEEL) was too slow, but I later discarded that theory, probably based on what I saw when I experimented with smaller input collections.
Marking priority accordingly. Note that we're currently already improving multi-instance for large collections, so we could consider increasing the priority later. Please comment if you think this should have a higher priority.
@lfourky I've had another look at this bug. Using your example, I'm able to reproduce this on Zeebe version 8.2.11. I had to make some small adjustments:
Running it, I noticed that the problem is not the large input collection but the large number of variables passed along with each job. The case works as follows:
**Solution 1:**

So, then I got to the first idea: just don't fetch all the variables, only the one that's needed:

```go
client.NewJobWorker().JobType("second_data_processor").Handler(secondParallelMultiInstance).FetchVariables("inputElement").Open()
```

This solves the problem by reducing the amount of data that needs to be transmitted to the worker for each batch of activated jobs.

**Solution 2:**

But I wondered: how can it be that the gateway cannot send the jobs to the client when there are large variables sent along with the activated jobs? I was able to complete the process by raising the client's inbound MaxMsgSize to 8 MB:

```go
client, err := zbc.NewClient(&zbc.ClientConfig{
	GatewayAddress:         gatewayAddress,
	UsePlaintextConnection: true,
	DialOpts:               []grpc.DialOption{grpc.WithDefaultCallOptions(grpc.MaxCallRecvMsgSize(8 * 1024 * 1024))},
})
```

So, it seems that the problem is that the broker can produce responses larger than the expected maximum of 4 MB (the default). I'm not yet sure whether this is a bug or expected behavior. @npepinpe What do you think?

@lfourky You can find a working version of your example using solution 2 here.
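For context (an assumption based on library defaults rather than something stated in this thread): 4 MB is grpc-go's default `MaxCallRecvMsgSize`, and it matches Zeebe's default `maxMessageSize`, so any activation response whose gRPC encoding exceeds that limit would be rejected on the client side before the worker ever sees the jobs.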
So, we have two workarounds available:

- limit the variables the worker fetches, e.g. with `FetchVariables` (solution 1)
- increase the client's inbound max message size, e.g. with `grpc.MaxCallRecvMsgSize` (solution 2)
🤔 However, I'm still unsure whether it is expected that the gateway can send the client an ActivateJobBatch response that is larger than the default 4 MB.
Since the response goes through different encodings, it's possible that it was very close to 4 MB on the broker/Raft side but, once transformed to gRPC, larger than 4 MB. I can't really think of another reason (other than a misconfiguration with different max message sizes, but I assume that's not it here). I would expect that if we fail to forward the activated job, we would yield it back to the engine.
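If that's the failure mode, it should be observable on the client: grpc-go rejects oversized responses with a `ResourceExhausted` status. A small sketch of checking for it (hypothetical usage; the Zeebe Go client's internal stream handling is not shown here):

```go
package main

import (
	"errors"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// checkRecvErr inspects an error returned from a gRPC receive call (e.g. the
// ActivateJobs stream) and reports whether the response exceeded the client's
// configured MaxCallRecvMsgSize (4 MiB by default in grpc-go).
func checkRecvErr(err error) {
	if err == nil {
		return
	}
	if s, ok := status.FromError(err); ok && s.Code() == codes.ResourceExhausted {
		// Typical message: "grpc: received message larger than max (N vs. M)"
		fmt.Println("response exceeded the client's max receive size:", s.Message())
		return
	}
	fmt.Println("other error:", err)
}

func main() {
	// Hypothetical usage: simulate the error a too-large ActivateJobs
	// response would produce on the client side.
	err := status.Error(codes.ResourceExhausted,
		"grpc: received message larger than max (5242880 vs. 4194304)")
	checkRecvErr(err)
	checkRecvErr(errors.New("connection reset"))
}
```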
That happens as expected; yielding this way is still implemented using `FailJob`. In the situation above, this simply leads to loads of traffic:
@npepinpe Do you think that's expected behavior? Or should we prevent this from happening? For example, we could further limit the max size of jobs collected in a job batch record. That would come at the cost of payload size (users might expect to activate a single job with a single large variable near the configured max message size).

EDIT: If we believe this is expected behavior, then I'll close this issue.
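A minimal sketch of what such a limit could look like (hypothetical types and threshold; this is not the broker's actual batching code): stop adding jobs to the batch once the next job would push the estimated encoded size past a budget set below the max message size, leaving headroom for the re-encoding overhead mentioned above.

```go
package main

import "fmt"

// job is a stand-in for an activatable job; size is its estimated encoded
// size in bytes, variables included.
type job struct {
	key  int64
	size int
}

// collectBatch adds jobs to a batch until the next job would push the batch
// past maxBatchBytes; the remainder stays pending for a later activation.
func collectBatch(pending []job, maxBatchBytes int) (batch []job, rest []job) {
	total := 0
	for i, j := range pending {
		if total+j.size > maxBatchBytes && len(batch) > 0 {
			return batch, pending[i:] // batch full; leave the rest for later
		}
		batch = append(batch, j)
		total += j.size
	}
	return batch, nil
}

func main() {
	pending := []job{{1, 1 << 20}, {2, 2 << 20}, {3, 2 << 20}}
	// e.g. a 4 MiB max message size minus ~25% headroom for re-encoding
	batch, rest := collectBatch(pending, 3*(1<<20))
	fmt.Println("batched:", len(batch), "remaining:", len(rest)) // batched: 2 remaining: 1
}
```

The `len(batch) > 0` guard preserves the expectation mentioned above: a single job with one large variable near the limit can still be activated on its own rather than being starved forever.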
This sounds a lot like what I did recently for a different bug 😄
Describe the bug
When running a multi-instance parallel subprocess, I've experienced issues such as stalling. For example, with an input count of 5,000 elements, Zeebe seems to process a few, then never processes the rest.
I think this can sometimes stall all future processes for the same process ID as well.
Hint: when run as a sequential subprocess, it runs fine.
To Reproduce
https://github.com/lfourky/zeebe-multiinstance-issue
I've created a repository to demonstrate this issue. The BPMN is quite simple, and the repository uses the Go client.
Assuming you've got zbctl installed, you can run the commands from that repository to reproduce this locally.
Expected behavior
Expected the process to complete in the near future.
Log/Stacktrace
Full Stacktrace
Environment: