Potential duplicate PI creation across partitions in case of request timeouts #17333

Closed
megglos opened this issue Apr 5, 2024 · 4 comments · Fixed by #18264
Assignees
Labels
component/zeebe - Related to the Zeebe component/team
kind/bug - Categorizes an issue or PR as a bug
likelihood/low - Observed rarely / rather unlikely edge-case
severity/high - Marks a bug as having a noticeable impact on the user with no known workaround
support - Marks an issue as related to a customer support request
version:8.2.27 - Marks an issue as being completely or in parts released in 8.2.27
version:8.3.11 - Marks an issue as being completely or in parts released in 8.3.11
version:8.4.7 - Marks an issue as being completely or in parts released in 8.4.7
version:8.5.1 - Marks an issue as being completely or in parts released in 8.5.1
version:8.6.0 - Label that represents issues released on version 8.6.0

Comments


megglos commented Apr 5, 2024

Describe the bug

The following log is from a sample application that starts 50 PIs concurrently and does not retry failed requests. It assigns a unique variable to each instance and has a stateful worker that keeps track of the variable values to detect duplicate instances (a sketch of such a worker follows the log).

08:33:56.615 [pool-3-thread-36] INFO com.camunda.highcreation.GenerateProcessInstance -- Created instance with tid 20240405_083348-52
08:33:56.616 [pool-3-thread-36] ERROR com.camunda.highcreation.GenerateProcessInstance -- OVERTIME the timeout 2s : 4087 ms result? [null] exception? [Time out between gateway and broker: Request command-api-3 to zeebe-2.zeebe-broker-service.408112bf-acf5-4ad4-a479-a6dcf91776b3-zeebe.svc.cluster.local:26501 timed out in PT1.999S] tid: [20240405_083348-52]
....
08:34:13.730 [pool-2-thread-1] ERROR com.camunda.highcreation.worker.CheckUniqueTidWorker -- >>>>>>>>> #10 DUPLICATE PID detected 20240405_083348-52 PI:6755399441056627 and 2251799813685893
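
For context, here is a minimal sketch of such a duplicate-detecting worker, assuming the Zeebe Java client; the class name, job type, and gateway address are illustrative assumptions, and only the `tid` variable name is taken from the sample application:

```java
import io.camunda.zeebe.client.ZeebeClient;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CheckUniqueTidWorker {

  // Remembers every tid seen so far together with the PI key that first carried it.
  private static final Map<String, Long> SEEN_TIDS = new ConcurrentHashMap<>();

  public static void main(String[] args) throws InterruptedException {
    try (final ZeebeClient client =
        ZeebeClient.newClientBuilder()
            .gatewayAddress("localhost:26500") // assumption: local gateway, no TLS
            .usePlaintext()
            .build()) {

      client
          .newWorker()
          .jobType("check-unique-tid") // hypothetical job type
          .handler(
              (jobClient, job) -> {
                final String tid = (String) job.getVariablesAsMap().get("tid");
                final Long firstKey = SEEN_TIDS.putIfAbsent(tid, job.getProcessInstanceKey());
                if (firstKey != null && firstKey != job.getProcessInstanceKey()) {
                  // A second PI carrying the same tid means the creation command was duplicated.
                  System.err.printf(
                      ">>>>>>>>> DUPLICATE PID detected %s PI:%d and %d%n",
                      tid, firstKey, job.getProcessInstanceKey());
                }
                jobClient.newCompleteCommand(job.getKey()).send().join();
              })
          .open();

      Thread.sleep(Long.MAX_VALUE); // keep the worker running
    }
  }
}
```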

From the log, the client got a timeout error. Afterwards we would expect at most one instance to exist, or none, as in case of a timeout we don't know whether the broker actually received the creation command.

In this particular case, three instances were actually created and present in Operate afterwards (see the attached screenshot).

Looking at the Zeebe records, there is one instance per partition (screenshot: zeebe-record_process-instance-creation_8.4.5_2024-04-05).

This indicates a bug in the gateway: it tries the next partition if creating an instance on one partition failed due to a timeout or another connection-related error, even though the request may still have reached the broker.

With trace logging enabled in the gateway, I could also verify that the logic to retry a request on other partitions is indeed triggered.

Kudos to @pierre-yves-monnet for preparing the setup to easily reproduce the issue!

To Reproduce

  1. Create a lot of PIs concurrently so that the request timeout is hit (ideally providing a unique identifier as a variable to each instance; see the sketch below)
  2. Observe timeout exception in your client
  3. Observe duplicate PIs (via the unique variable identifiers)

You can use https://github.com/pierre-yves-monnet/c8-highcreation to provoke this easily.
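
For illustration, a minimal sketch of such a load generator, assuming the Zeebe Java client; the process id, gateway address, and timeout value are placeholders and not taken from the reproducer project:

```java
import io.camunda.zeebe.client.ZeebeClient;
import java.time.Duration;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class GenerateProcessInstances {

  public static void main(String[] args) throws InterruptedException {
    try (final ZeebeClient client =
        ZeebeClient.newClientBuilder()
            .gatewayAddress("localhost:26500") // assumption: local gateway, no TLS
            .usePlaintext()
            .defaultRequestTimeout(Duration.ofSeconds(2)) // short timeout to provoke timeouts under load
            .build()) {

      final ExecutorService pool = Executors.newFixedThreadPool(50);
      for (int i = 0; i < 50; i++) {
        pool.submit(
            () -> {
              final String tid = UUID.randomUUID().toString(); // unique identifier per instance
              try {
                client
                    .newCreateInstanceCommand()
                    .bpmnProcessId("high-creation-process") // hypothetical process id
                    .latestVersion()
                    .variables(Map.of("tid", tid))
                    .send()
                    .join(); // no retry on failure, by design
                System.out.println("Created instance with tid " + tid);
              } catch (Exception e) {
                // A timeout here leaves it unknown whether the broker received the command.
                System.err.println("Creation failed for tid " + tid + ": " + e.getMessage());
              }
            });
      }
      pool.shutdown();
      pool.awaitTermination(1, TimeUnit.MINUTES);
    }
  }
}
```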

Expected behavior

In overload or timeout situations between the gateway and a broker, at most one PI should be created. The gateway should not retry on other partitions unless the broker returned an error that qualifies for a retry.
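
To make "qualified error" concrete, here is a purely hypothetical sketch of the intended retry decision; this is not the actual gateway code, and the error classification is an assumption: only errors that guarantee the broker did not accept the command are safe to retry on another partition.

```java
// Hypothetical classification of broker errors for the gateway's partition-retry logic.
// Only errors where the command is known NOT to have been processed are safe to retry
// on another partition; timeouts and closed connections are ambiguous and must not be.
enum BrokerErrorKind {
  PARTITION_NOT_FOUND, // broker rejected the request before processing it
  RESOURCE_EXHAUSTED,  // broker rejected the request due to backpressure
  REQUEST_TIMEOUT,     // unknown whether the broker received the command
  CONNECTION_CLOSED,   // connection was established, then closed: also unknown
  CONNECTION_REFUSED   // connection was never established, command never sent
}

final class PartitionRetryPolicy {

  /** Returns true only if retrying on another partition cannot create a duplicate PI. */
  static boolean isSafeToRetryOnNextPartition(final BrokerErrorKind error) {
    return switch (error) {
      case PARTITION_NOT_FOUND, RESOURCE_EXHAUSTED, CONNECTION_REFUSED -> true;
      case REQUEST_TIMEOUT, CONNECTION_CLOSED -> false;
    };
  }
}
```

In the scenario above, the request timeout and the closed connection fall into the "outcome unknown" bucket, so retrying them on the next partition is exactly what produces the duplicate instances.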

Log/Stacktrace

Full Stacktrace

{
  "insertId": "rxqycbmysvpzunhh",
  "jsonPayload": {
    "serviceContext": {
      "version": "8.4.5",
      "service": "zeebe"
    },
    "context": {
      "actor-name": "BrokerRequestManager",
      "threadName": "zb-actors-0",
      "threadId": 16,
      "threadPriority": 5,
      "loggerName": "io.camunda.zeebe.gateway",
      "actor-scheduler": "Gateway-zeebe-gateway-c5dccc6f6-rtlz4"
    },
    "exception": "java.net.ConnectException: Connection RemoteClientConnection{channel=[id: 0x197e8716, L:/10.64.47.13:50458 ! R:zeebe-2.zeebe-broker-service.408112bf-acf5-4ad4-a479-a6dcf91776b3-zeebe.svc.cluster.local/10.64.42.5:26501]} was closed\n\tat io.atomix.cluster.messaging.impl.AbstractClientConnection.close(AbstractClientConnection.java:76) ~[zeebe-atomix-cluster-8.4.5.jar:8.4.5]\n\tat io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$getOrCreateClientConnection$38(NettyMessagingService.java:702) ~[zeebe-atomix-cluster-8.4.5.jar:8.4.5]\n\tat io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590) ~[netty-common-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:583) ~[netty-common-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:559) ~[netty-common-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492) ~[netty-common-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636) ~[netty-common-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.util.concurrent.DefaultPromise.setSuccess0(DefaultPromise.java:625) ~[netty-common-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:105) ~[netty-common-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:84) ~[netty-transport-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.channel.AbstractChannel$CloseFuture.setClosed(AbstractChannel.java:1164) ~[netty-transport-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.channel.AbstractChannel$AbstractUnsafe.doClose0(AbstractChannel.java:755) ~[netty-transport-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:731) ~[netty-transport-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:620) ~[netty-transport-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.channel.DefaultChannelPipeline$HeadContext.close(DefaultChannelPipeline.java:1352) ~[netty-transport-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeClose(AbstractChannelHandlerContext.java:755) ~[netty-transport-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.channel.AbstractChannelHandlerContext.access$1200(AbstractChannelHandlerContext.java:61) ~[netty-transport-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.channel.AbstractChannelHandlerContext$11.run(AbstractChannelHandlerContext.java:738) ~[netty-transport-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173) ~[netty-common-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166) ~[netty-common-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470) ~[netty-common-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:413) ~[netty-transport-classes-epoll-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[netty-common-4.1.107.Final.jar:4.1.107.Final]\n\tat 
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.107.Final.jar:4.1.107.Final]\n\tat java.base/java.lang.Thread.run(Unknown Source) ~[?:?]\n",
    "message": "Failed to create process on partition 3"
  },
  "resource": {
    "type": "k8s_container",
    "labels": {
      "project_id": "camunda-saas-int-chaos",
      "namespace_name": "408112bf-acf5-4ad4-a479-a6dcf91776b3-zeebe",
      "pod_name": "zeebe-gateway-c5dccc6f6-rtlz4",
      "location": "europe-west1",
      "cluster_name": "worker-chaos-1",
      "container_name": "zeebe-gateway"
    }
  },
  "timestamp": "2024-04-05T07:22:15.833760259Z",
  "severity": "DEBUG",
  "labels": {
    "k8s-pod/pod-template-hash": "c5dccc6f6",
    "compute.googleapis.com/resource_name": "gke-worker-chaos-1-spot-v3-2bef5425-fb7x",
    "k8s-pod/app_kubernetes_io/component": "standalone-gateway",
    "k8s-pod/cloud_camunda_io/orgId": "b48317de_6daf_4ea2_a64a_84ed0b016bd5",
    "k8s-pod/cloud_camunda_io/clusterPlan": "G3_S_HA",
    "k8s-pod/app_kubernetes_io/app": "zeebe-gateway",
    "k8s-pod/cloud_camunda_io/generation": "Camunda_8_4_5",
    "k8s-pod/cloud_camunda_io/salesPlanType": "internal",
    "k8s-pod/cloud_camunda_io/channel": "Stable",
    "k8s-pod/cloud_camunda_io/internalSalesPlan": "true",
    "k8s-pod/cloud_camunda_io/clusterPlanType": "G3_S",
    "k8s-pod/cloud_camunda_io/salesPlan": "Internal_with_Low_Hardware"
  },
  "logName": "projects/camunda-saas-int-chaos/logs/stdout",
  "sourceLocation": {
    "file": "RequestRetryHandler.java",
    "line": "108",
    "function": "lambda$sendRequestWithRetry$1"
  },
  "receiveTimestamp": "2024-04-05T07:22:20.572787794Z"
}

Environment:

  • OS:
  • Zeebe Version: 8.4+ (likely affects older releases as well)
  • Configuration:

**Support:**

@megglos megglos added kind/bug Categorizes an issue or PR as a bug component/zeebe Related to the Zeebe component/team severity/high Marks a bug as having a noticeable impact on the user with no known workaround likelihood/low Observed rarely / rather unlikely edge-case labels Apr 5, 2024
@github-actions github-actions bot added the support Marks an issue as related to a customer support request label Apr 5, 2024

megglos commented Apr 5, 2024

Input for triage: likelihood depends on the setup and the request timeout configuration.
I would argue that if a setup is tuned to handle the majority of requests within the configured request timeout, this shouldn't occur too often and only on load peaks, thus classifying as low likelihood. If it does happen it's rather bad, though: a user would need to implement safety mechanisms like an artificial deduplication task, which is not really an acceptable workaround, thus severity high.

@megglos megglos changed the title Potential duplicate PI creation in case a broker is overloaded Potential duplicate PI creation in case of request timeouts Apr 5, 2024
@megglos megglos changed the title Potential duplicate PI creation in case of request timeouts Potential duplicate PI creation across partitions in case of request timeouts Apr 5, 2024

megglos commented Apr 8, 2024

ZPA-Triage:

  • appears to be ZDP domain (gateway-to-broker communication); might need clarification/alignment with ZPA on what can actually be retried


megglos commented Apr 23, 2024

In this particular case, a java.net.ConnectException was thrown by Netty because the broker closed the connection:

java.net.ConnectException: Connection RemoteClientConnection{...} was closed

This could happen on any broker crash, and in such cases the gateway shouldn't retry.
This means a retry on any ConnectException is not appropriate.

I need to validate this by provoking the situation again and obtaining broker and gateway logs on such an occasion.


megglos commented May 6, 2024

We throw ConnectException in cases where it shouldn't be used; we need to replace this with a more appropriate exception.
We may only be able to cover this with a unit test on the NettyMessagingService.

github-merge-queue bot pushed a commit that referenced this issue May 6, 2024
## Description

This replaces `ConnectException` throws in cases where a connection
was actually established previously with a more appropriate
`MessagingException.ConnectionClosed`, as `ConnectException` is only
supposed to be used in cases where a connection couldn't be established
at all, see
https://docs.oracle.com/javase%2F7%2Fdocs%2Fapi%2F%2F/java/net/ConnectException.html.

## Related issues

closes #17333
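
For illustration, a simplified sketch of the idea behind the change; the classes below are stand-ins, not the actual Atomix/Zeebe code: a dedicated `ConnectionClosed` subtype lets callers distinguish "the connection was never established" from "the connection was closed after the request may already have been sent".

```java
import java.net.ConnectException;

// Simplified stand-in for io.atomix.cluster.messaging.MessagingException (assumption).
class MessagingException extends RuntimeException {
  MessagingException(final String message) {
    super(message);
  }

  /** Thrown when an already established connection is closed before a response arrives. */
  static final class ConnectionClosed extends MessagingException {
    ConnectionClosed(final String message) {
      super(message);
    }
  }
}

final class ConnectionErrors {

  // Before the fix: a closed connection surfaced as ConnectException, which looks like
  // "the command never reached the broker" and therefore invited an unsafe retry.
  static Exception beforeFix(final String connection) {
    return new ConnectException("Connection " + connection + " was closed");
  }

  // After the fix: the same situation surfaces as ConnectionClosed, so retry logic can
  // treat it as "outcome unknown" and must not resend the command to another partition.
  static Exception afterFix(final String connection) {
    return new MessagingException.ConnectionClosed(
        "Connection " + connection + " was closed");
  }
}
```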
github-merge-queue bot pushed a commit that referenced this issue May 6, 2024
…nection (#18282)

# Description
Backport of #18264 to `stable/8.3`.

relates to #17333
original author: @megglos
github-merge-queue bot pushed a commit that referenced this issue May 6, 2024
…nection (#18284)

# Description
Backport of #18264 to `stable/8.5`.

relates to #17333
original author: @megglos
github-merge-queue bot pushed a commit that referenced this issue May 6, 2024
…nection (#18286)

# Description
Manual backport of #18264 to `stable/8.2`.

There were conflicts in
[MessagingException.java](https://github.com/camunda/zeebe/pull/18286/files#diff-cfd7f2bc0198b1461470a353efa3fbebce2cec3c7ca71c96513eefc316b61245)
which contains one subtype less than on all newer branches.

relates to #17333
original author: @megglos
github-merge-queue bot pushed a commit that referenced this issue May 7, 2024
…nection (#18283)

# Description
Backport of #18264 to `stable/8.4`.

relates to #17333
original author: @megglos
@Zelldon Zelldon added version:8.2.27 Marks an issue as being completely or in parts released in 8.2.27 version:8.4.7 Marks an issue as being completely or in parts released in 8.4.7 version:8.5.1 Marks an issue as being completely or in parts released in 8.5.1 version:8.3.11 Marks an issue as being completely or in parts released in 8.3.11 labels May 7, 2024
@tmetzke tmetzke added the version:8.6.0 Label that represents issues released on version 8.6.0 label Oct 2, 2024