
Potential duplicate PI creation across partitions in case of request timeouts #17333

Closed
megglos opened this issue Apr 5, 2024 · 4 comments · Fixed by #18264
Labels
  • component/zeebe: Related to the Zeebe component/team
  • kind/bug: Categorizes an issue or PR as a bug
  • likelihood/low: Observed rarely / rather unlikely edge-case
  • severity/high: Marks a bug as having a noticeable impact on the user with no known workaround
  • support: Marks an issue as related to a customer support request
  • version:8.2.27: Marks an issue as being completely or in parts released in 8.2.27
  • version:8.3.11: Marks an issue as being completely or in parts released in 8.3.11
  • version:8.4.7: Marks an issue as being completely or in parts released in 8.4.7
  • version:8.5.1: Marks an issue as being completely or in parts released in 8.5.1

Comments

megglos commented Apr 5, 2024

Describe the bug

The following log is from a sample application that starts 50 PIs concurrently; it does not retry failed requests. It assigns a unique variable to each instance and has a stateful worker that keeps track of the variable values to detect duplicate instances.

08:33:56.615 [pool-3-thread-36] INFO com.camunda.highcreation.GenerateProcessInstance -- Created instance with tid 20240405_083348-52
08:33:56.616 [pool-3-thread-36] ERROR com.camunda.highcreation.GenerateProcessInstance -- OVERTIME the timeout 2s : 4087 ms result? [null] exception? [Time out between gateway and broker: Request command-api-3 to zeebe-2.zeebe-broker-service.408112bf-acf5-4ad4-a479-a6dcf91776b3-zeebe.svc.cluster.local:26501 timed out in PT1.999S] tid: [20240405_083348-52]
....
08:34:13.730 [pool-2-thread-1] ERROR com.camunda.highcreation.worker.CheckUniqueTidWorker -- >>>>>>>>> #10 DUPLICATE PID detected 20240405_083348-52 PI:6755399441056627 and 2251799813685893

From the log, the client got a timeout error. Afterwards we would expect at most one instance to exist, or none, because in case of a timeout we don't know whether the broker actually received the creation command.

In this particular case, three instances were actually created and present in Operate afterwards:
[Screenshot: the three process instances shown in Operate]

Looking at the Zeebe records, there is one instance per partition:
[Screenshot: records from the zeebe-record_process-instance-creation_8.4.5_2024-04-05 index, one instance per partition]

This indicates a bug in the gateway: it tries the next partition if creating an instance on one partition failed due to a timeout or another connection-related error, even though the request may still have reached the broker.

Enabling trace logging in the Gateway, I could also verify that the logic to retry a request on other partitions is indeed triggered.
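
For illustration only, here is a simplified sketch of such partition-cycling retry logic; the method shape and names are assumptions, not the actual `RequestRetryHandler` code:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.IntFunction;

// Illustrative sketch of the suspected behavior, not the actual RequestRetryHandler code.
public final class RetryOnNextPartitionSketch {

  // Tries the given partitions one after another; ANY failure moves on to the
  // next partition, even a timeout where the previous broker may already have
  // accepted the command -- which is how duplicates can appear.
  static CompletableFuture<String> sendWithRetry(
      IntFunction<CompletableFuture<String>> sendToPartition,
      List<Integer> partitions,
      int index) {
    if (index >= partitions.size()) {
      return CompletableFuture.failedFuture(new IllegalStateException("all partitions tried"));
    }
    return sendToPartition
        .apply(partitions.get(index))
        .exceptionallyCompose(error -> sendWithRetry(sendToPartition, partitions, index + 1));
  }
}
```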

Kudos to @pierre-yves-monnet for preparing the setup to easily reproduce the issue!

To Reproduce

  1. Create a lot of PIs concurrently so the request timeout is hit (best while providing a unique identifier as a variable to each instance)
  2. Observe timeout exception in your client
  3. Observe duplicate PIs (via the unique variable identifiers)

You can use https://github.com/pierre-yves-monnet/c8-highcreation to provoke this easily.
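
For illustration, a minimal repro sketch with the Zeebe Java client; the gateway address, the process id `highCreation`, and the variable name `tid` are assumptions, not taken from the linked project:

```java
import io.camunda.zeebe.client.ZeebeClient;
import java.time.Duration;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class DuplicatePiRepro {
  public static void main(String[] args) throws InterruptedException {
    try (ZeebeClient client =
        ZeebeClient.newClientBuilder()
            .gatewayAddress("localhost:26500")
            .usePlaintext()
            .defaultRequestTimeout(Duration.ofSeconds(2)) // short timeout to provoke the race
            .build()) {

      ExecutorService pool = Executors.newFixedThreadPool(50);
      for (int i = 0; i < 50; i++) {
        pool.submit(() -> {
          String tid = UUID.randomUUID().toString(); // unique id to detect duplicates later
          client
              .newCreateInstanceCommand()
              .bpmnProcessId("highCreation") // assumed process id
              .latestVersion()
              .variables(Map.of("tid", tid))
              .send()
              .whenComplete((result, error) -> {
                if (error != null) {
                  // no client-side retry: a timed-out request may still have created an instance
                  System.err.println("creation failed for tid " + tid + ": " + error.getMessage());
                }
              });
        });
      }
      pool.shutdown();
      pool.awaitTermination(5, TimeUnit.MINUTES);
    }
  }
}
```

Running this against a small cluster and then checking Operate (or a worker like the one above) for repeated `tid` values reproduces the duplicates described in this report.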

Expected behavior

In overload/timeout situations between gateway and broker, at most one PI should be created. The Gateway should not retry on other partitions unless the broker returned a qualified error.
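
As a hedged sketch of that expectation, a retry guard could look roughly like this; the helper methods are hypothetical placeholders, not existing gateway code:

```java
import java.util.concurrent.TimeoutException;

// Sketch of the expected guard; the classification below is illustrative,
// not the actual gateway logic.
final class RetryDecisionSketch {

  static boolean safeToRetryOnAnotherPartition(Throwable error) {
    if (error instanceof TimeoutException) {
      // Outcome unknown: the broker may already have written the command,
      // so re-sending it to another partition risks a duplicate instance.
      return false;
    }
    if (isConnectionLostAfterRequestWasSent(error)) {
      // Same reasoning: the command may have been received before the connection dropped.
      return false;
    }
    // Only explicit broker rejections that guarantee nothing was written
    // (e.g. the partition currently has no leader) are safe to retry elsewhere.
    return isExplicitRetryableRejection(error);
  }

  // Placeholders: a real implementation would inspect the concrete exception
  // type or the broker's error code.
  private static boolean isConnectionLostAfterRequestWasSent(Throwable error) {
    return false;
  }

  private static boolean isExplicitRetryableRejection(Throwable error) {
    return false;
  }
}
```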

Log/Stacktrace

Full Stacktrace

{
  "insertId": "rxqycbmysvpzunhh",
  "jsonPayload": {
    "serviceContext": {
      "version": "8.4.5",
      "service": "zeebe"
    },
    "context": {
      "actor-name": "BrokerRequestManager",
      "threadName": "zb-actors-0",
      "threadId": 16,
      "threadPriority": 5,
      "loggerName": "io.camunda.zeebe.gateway",
      "actor-scheduler": "Gateway-zeebe-gateway-c5dccc6f6-rtlz4"
    },
    "exception": "java.net.ConnectException: Connection RemoteClientConnection{channel=[id: 0x197e8716, L:/10.64.47.13:50458 ! R:zeebe-2.zeebe-broker-service.408112bf-acf5-4ad4-a479-a6dcf91776b3-zeebe.svc.cluster.local/10.64.42.5:26501]} was closed\n\tat io.atomix.cluster.messaging.impl.AbstractClientConnection.close(AbstractClientConnection.java:76) ~[zeebe-atomix-cluster-8.4.5.jar:8.4.5]\n\tat io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$getOrCreateClientConnection$38(NettyMessagingService.java:702) ~[zeebe-atomix-cluster-8.4.5.jar:8.4.5]\n\tat io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590) ~[netty-common-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:583) ~[netty-common-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:559) ~[netty-common-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492) ~[netty-common-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636) ~[netty-common-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.util.concurrent.DefaultPromise.setSuccess0(DefaultPromise.java:625) ~[netty-common-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:105) ~[netty-common-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:84) ~[netty-transport-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.channel.AbstractChannel$CloseFuture.setClosed(AbstractChannel.java:1164) ~[netty-transport-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.channel.AbstractChannel$AbstractUnsafe.doClose0(AbstractChannel.java:755) ~[netty-transport-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:731) ~[netty-transport-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:620) ~[netty-transport-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.channel.DefaultChannelPipeline$HeadContext.close(DefaultChannelPipeline.java:1352) ~[netty-transport-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeClose(AbstractChannelHandlerContext.java:755) ~[netty-transport-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.channel.AbstractChannelHandlerContext.access$1200(AbstractChannelHandlerContext.java:61) ~[netty-transport-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.channel.AbstractChannelHandlerContext$11.run(AbstractChannelHandlerContext.java:738) ~[netty-transport-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173) ~[netty-common-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166) ~[netty-common-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470) ~[netty-common-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:413) ~[netty-transport-classes-epoll-4.1.107.Final.jar:4.1.107.Final]\n\tat io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[netty-common-4.1.107.Final.jar:4.1.107.Final]\n\tat 
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.107.Final.jar:4.1.107.Final]\n\tat java.base/java.lang.Thread.run(Unknown Source) ~[?:?]\n",
    "message": "Failed to create process on partition 3"
  },
  "resource": {
    "type": "k8s_container",
    "labels": {
      "project_id": "camunda-saas-int-chaos",
      "namespace_name": "408112bf-acf5-4ad4-a479-a6dcf91776b3-zeebe",
      "pod_name": "zeebe-gateway-c5dccc6f6-rtlz4",
      "location": "europe-west1",
      "cluster_name": "worker-chaos-1",
      "container_name": "zeebe-gateway"
    }
  },
  "timestamp": "2024-04-05T07:22:15.833760259Z",
  "severity": "DEBUG",
  "labels": {
    "k8s-pod/pod-template-hash": "c5dccc6f6",
    "compute.googleapis.com/resource_name": "gke-worker-chaos-1-spot-v3-2bef5425-fb7x",
    "k8s-pod/app_kubernetes_io/component": "standalone-gateway",
    "k8s-pod/cloud_camunda_io/orgId": "b48317de_6daf_4ea2_a64a_84ed0b016bd5",
    "k8s-pod/cloud_camunda_io/clusterPlan": "G3_S_HA",
    "k8s-pod/app_kubernetes_io/app": "zeebe-gateway",
    "k8s-pod/cloud_camunda_io/generation": "Camunda_8_4_5",
    "k8s-pod/cloud_camunda_io/salesPlanType": "internal",
    "k8s-pod/cloud_camunda_io/channel": "Stable",
    "k8s-pod/cloud_camunda_io/internalSalesPlan": "true",
    "k8s-pod/cloud_camunda_io/clusterPlanType": "G3_S",
    "k8s-pod/cloud_camunda_io/salesPlan": "Internal_with_Low_Hardware"
  },
  "logName": "projects/camunda-saas-int-chaos/logs/stdout",
  "sourceLocation": {
    "file": "RequestRetryHandler.java",
    "line": "108",
    "function": "lambda$sendRequestWithRetry$1"
  },
  "receiveTimestamp": "2024-04-05T07:22:20.572787794Z"
}

Environment:

  • OS:
  • Zeebe Version: 8.4+ (likely affects older releases as well)
  • Configuration:

**Support:**

@megglos megglos added kind/bug Categorizes an issue or PR as a bug component/zeebe Related to the Zeebe component/team severity/high Marks a bug as having a noticeable impact on the user with no known workaround likelihood/low Observed rarely / rather unlikely edge-case labels Apr 5, 2024
@github-actions github-actions bot added the support Marks an issue as related to a customer support request label Apr 5, 2024
megglos commented Apr 5, 2024

Input for triage: likelihood depends on the setup and the request timeout config.
I would argue that if a setup is tuned to handle the majority of requests within the configured request timeout, this shouldn't occur too often and only on load peaks, thus classifying as low likelihood. If it happens it's rather bad though, and a user would need to implement safety mechanisms like an artificial deduplication task, which is not really an acceptable workaround, thus severity high.

@megglos megglos changed the title Potential duplicate PI creation in case a broker is overloaded Potential duplicate PI creation in case of request timeouts Apr 5, 2024
@megglos megglos changed the title Potential duplicate PI creation in case of request timeouts Potential duplicate PI creation across partitions in case of request timeouts Apr 5, 2024
megglos commented Apr 8, 2024

ZPA-Triage:

  • appears to be ZDP domain (gateway-to-broker communication); might need clarification/alignment with ZPA on what can actually be retried

megglos commented Apr 23, 2024

In this particular case a java.net.ConnectException was thrown from Netty because the broker closed the connection:

java.net.ConnectException: Connection RemoteClientConnection{...} was closed

This could happen on any broker crash, and in such cases the Gateway shouldn't retry.
This means a retry on any ConnectException is not appropriate.

I need to validate this by provoking the situation again and obtaining broker and gateway logs on such an occasion.

megglos commented May 6, 2024

We throw ConnectException in cases where it shouldn't be used; we need to replace it with a more appropriate exception.
We may only be able to cover this with a unit test on the NettyMessagingService.
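
A minimal sketch of that direction, grounded in the PR description below; the class here is only a stand-in for `MessagingException.ConnectionClosed`, and the method shapes are assumptions rather than the actual NettyMessagingService code:

```java
import java.net.ConnectException;
import java.util.concurrent.CompletableFuture;

// Sketch only; the class and method shapes are assumptions, not the actual
// NettyMessagingService implementation.
final class ConnectionCloseSketch {

  /** Stand-in for MessagingException.ConnectionClosed: the connection existed and was closed. */
  static final class ConnectionClosedException extends RuntimeException {
    ConnectionClosedException(String message) {
      super(message);
    }
  }

  // Before: pending requests were failed with ConnectException, which by contract
  // means "the connection could not be established at all".
  static void failPendingBefore(CompletableFuture<byte[]> pending, String connection) {
    pending.completeExceptionally(new ConnectException("Connection " + connection + " was closed"));
  }

  // After: a dedicated exception signals that the connection was established and then
  // closed, i.e. the request may already have reached the broker, so callers such as
  // the gateway retry logic must not treat this as safe to retry on another partition.
  static void failPendingAfter(CompletableFuture<byte[]> pending, String connection) {
    pending.completeExceptionally(
        new ConnectionClosedException("Connection " + connection + " was closed"));
  }
}
```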

github-merge-queue bot pushed a commit that referenced this issue May 6, 2024
## Description

This replaces `ConnectException` throws in cases where a connection
was actually established previously with a more appropriate
`MessagingException.ConnectionClosed`, as `ConnectException` is only
supposed to be used in cases where a connection couldn't be established
at all, see
https://docs.oracle.com/javase/7/docs/api/java/net/ConnectException.html.

## Related issues

closes #17333
github-merge-queue bot pushed a commit that referenced this issue May 6, 2024
…nection (#18282)

# Description
Backport of #18264 to `stable/8.3`.

relates to #17333
original author: @megglos
github-merge-queue bot pushed a commit that referenced this issue May 6, 2024
…nection (#18284)

# Description
Backport of #18264 to `stable/8.5`.

relates to #17333
original author: @megglos
github-merge-queue bot pushed a commit that referenced this issue May 6, 2024
…nection (#18286)

# Description
Manual backport of #18264 to `stable/8.2`.

There were conflicts in
[MessagingException.java](https://github.com/camunda/zeebe/pull/18286/files#diff-cfd7f2bc0198b1461470a353efa3fbebce2cec3c7ca71c96513eefc316b61245)
which contains one subtype less than on all newer branches.

relates to #17333
original author: @megglos
github-merge-queue bot pushed a commit that referenced this issue May 7, 2024
…nection (#18283)

# Description
Backport of #18264 to `stable/8.4`.

relates to #17333
original author: @megglos
@Zelldon Zelldon added version:8.2.27 Marks an issue as being completely or in parts released in 8.2.27 version:8.4.7 Marks an issue as being completely or in parts released in 8.4.7 version:8.5.1 Marks an issue as being completely or in parts released in 8.5.1 version:8.3.11 Marks an issue as being completely or in parts released in 8.3.11 labels May 7, 2024