Potential duplicate PI creation across partitions in case of request timeouts #17333
Comments
Input for triage: Likelihood depends on the setup and request timeout config. |
ZPA-Triage:
|
In the particular case a
could happen on any broker crash, and in such cases the Gateway shouldn't retry. I need to validate this by provoking the situation again and obtaining broker and gateway logs on such an occasion. |
We throw |
## Description

This replaces `ConnectException` throws in cases where a connection had actually been established previously with the more appropriate `MessagingException.ConnectionClosed`. `ConnectException` is only supposed to be used when a connection could not be established at all, see https://docs.oracle.com/javase%2F7%2Fdocs%2Fapi%2F%2F/java/net/ConnectException.html.

## Related issues

closes #17333
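The distinction drawn in this description can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `ChannelFailureSketch`, `ConnectionClosedSketch`, and `failureFor` are hypothetical names, and `ConnectionClosedSketch` is only a stand-in for `MessagingException.ConnectionClosed`. The only real API used is `java.net.ConnectException`.

```java
import java.net.ConnectException;

/** Hypothetical sketch; names are illustrative and not taken from the Zeebe code base. */
final class ChannelFailureSketch {

  /** Stand-in for MessagingException.ConnectionClosed (assumed shape). */
  static final class ConnectionClosedSketch extends RuntimeException {
    ConnectionClosedSketch(String message) {
      super(message);
    }
  }

  /**
   * Picks the exception that describes a failed request.
   *
   * @param wasEstablished true if the connection had been established before it failed
   * @param address hypothetical remote address, used only for the message
   */
  static Exception failureFor(boolean wasEstablished, String address) {
    if (wasEstablished) {
      // The connection existed, so the request may already have reached the remote side.
      return new ConnectionClosedSketch("Connection to " + address + " was closed");
    }
    // The connection was never established, so nothing can have been sent.
    return new ConnectException("Could not connect to " + address);
  }
}
```

Keeping the two cases apart lets a caller tell "the request was never sent" from "the request may have been sent", which is the ambiguity at the heart of this issue.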
…nection (#18286)

# Description

Manual backport of #18264 to `stable/8.2`. There were conflicts in [MessagingException.java](https://github.com/camunda/zeebe/pull/18286/files#diff-cfd7f2bc0198b1461470a353efa3fbebce2cec3c7ca71c96513eefc316b61245), which contains one subtype fewer than on all newer branches.

relates to #17333

original author: @megglos
Describe the bug
The following log is from a sample application that starts 50 PIs concurrently and does not retry failed requests. It assigns a unique variable to each instance and has a stateful worker that keeps track of the variable values to detect duplicate instances.
According to the log, the client got a timeout error. In that case we would expect at most one instance to exist afterwards, or none, because after a timeout we don't know whether the broker actually received the creation command.
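The duplicate detection performed by the stateful worker described above could look roughly like the following sketch. It is an in-memory illustration under assumptions, not the code of the sample application: `DuplicateTracker`, `record`, and `instanceCount` are hypothetical names, and the worker is assumed to know both the unique variable value and the process instance key of each job it handles.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory tracker; not taken from the sample application. */
final class DuplicateTracker {

  // unique variable value -> keys of all process instances seen with that value
  private final Map<String, Set<Long>> seen = new ConcurrentHashMap<>();

  /**
   * Records that a job of the given process instance carried the given unique value.
   *
   * @return true if the value has now been seen on more than one process instance
   */
  boolean record(String uniqueValue, long processInstanceKey) {
    Set<Long> keys = seen.computeIfAbsent(uniqueValue, v -> ConcurrentHashMap.newKeySet());
    keys.add(processInstanceKey);
    return keys.size() > 1;
  }

  /** Returns how many distinct process instances carried the given unique value. */
  int instanceCount(String uniqueValue) {
    Set<Long> keys = seen.get(uniqueValue);
    return keys == null ? 0 : keys.size();
  }
}
```

With a tracker like this, a single creation request that ends up as three instances shows up as `instanceCount(value) == 3` for that request's unique value.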
In this particular case, three instances were actually created and were present in Operate afterwards:
Looking at the Zeebe records, there is one instance per partition:
This indicates a bug in the gateway: when creating an instance on one partition fails due to a timeout or another connection-related error, the gateway tries the next partition, even though the request may still have reached the broker.
Enabling trace logging in the Gateway, I could also verify that the logic to retry a request on other partitions is indeed triggered.
Kudos to @pierre-yves-monnet for preparing the setup to easily reproduce the issue!
To Reproduce
You can use https://github.com/pierre-yves-monnet/c8-highcreation to provoke this easily.
Expected behavior
In overload or timeout situations between the gateway and the broker, at most one PI should be created. The Gateway should not retry on other partitions unless the broker returned a qualified error.
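The expected behavior can be sketched as a small retry decision. This is an illustration under assumptions, not the gateway's implementation: the `Outcome` values, `PartitionRetrySketch`, and the `send` function are hypothetical, and `BROKER_REJECTED` stands in for whatever the gateway treats as a qualified broker error.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

/** Hypothetical outcome of sending a create command to one partition. */
enum Outcome {
  CREATED,
  BROKER_REJECTED, // assumed: the broker answered with an explicit, qualified error
  TIMEOUT, // no answer; the command may or may not have been processed
  CONNECTION_CLOSED // an established connection dropped; the command may have been sent
}

final class PartitionRetrySketch {

  /** Only an explicit broker answer shows that the command was not processed. */
  static boolean mayRetryOnNextPartition(Outcome outcome) {
    return outcome == Outcome.BROKER_REJECTED;
  }

  /** Tries the partitions in order and returns the partitions that were attempted. */
  static List<Integer> attempt(List<Integer> partitions, IntFunction<Outcome> send) {
    List<Integer> attempted = new ArrayList<>();
    for (int partition : partitions) {
      attempted.add(partition);
      Outcome outcome = send.apply(partition);
      if (!mayRetryOnNextPartition(outcome)) {
        break; // success or an ambiguous failure: stop so that at most one PI can exist
      }
    }
    return attempted;
  }
}
```

Under this rule a timeout on the first partition ends the attempt there, so the client sees the timeout and at most one instance can have been created.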
Log/Stacktrace
Full Stacktrace
Environment:
Support: