
Disconnects from cloud #99

Closed
jwulf opened this issue Oct 28, 2019 · 12 comments · Fixed by #188

Comments

jwulf (Member) commented Oct 28, 2019

Since Camunda Cloud moved to Zeebe 0.21.1, this has been happening every day:

{"context":"/server/endpoint/node_modules/@magikcraft/nestjs-zeebe/node_modules/zeebe-node/dist/zb/ZBWorker.js:122","id":"retrieve-spells_103","level":50,"message":"Stalled on gRPC error","pollMode":"Long Poll","taskType":"retrieve-spells","time":"2019 Oct-28 01:48:50AM","timestamp":"2019-10-28T01:48:50.672Z"}
{"context":"/server/endpoint/node_modules/@magikcraft/nestjs-zeebe/node_modules/zeebe-node/dist/lib/GRPCClient.js:71","id":"gRPC Channel","level":50,"message":"GRPC ERROR: 1 CANCELLED: Received http2 header with status: 502","pollMode":"Long Poll","taskType":"gRPC Channel","time":"2019 Oct-28 01:48:50AM","timestamp":"2019-10-28T01:48:50.672Z"}
{"context":"/server/endpoint/node_modules/@magikcraft/nestjs-zeebe/node_modules/zeebe-node/dist/lib/GRPCClient.js:242","id":"gRPC Channel","level":50,"message":"GRPC Channel State: READY","pollMode":"Long Poll","taskType":"gRPC Channel","time":"2019 Oct-28 01:48:50AM","timestamp":"2019-10-28T01:48:50.673Z"}
{"context":"/server/endpoint/node_modules/@magikcraft/nestjs-zeebe/node_modules/zeebe-node/dist/lib/GRPCClient.js:77","id":"gRPC Channel","level":30,"message":"gRPC Channel reconnected","pollMode":"Long Poll","taskType":"gRPC Channel","time":"2019 Oct-28 01:48:50AM","timestamp":"2019-10-28T01:48:50.673Z"}
{"context":"/server/endpoint/node_modules/@magikcraft/nestjs-zeebe/node_modules/zeebe-node/dist/lib/GRPCClient.js:258","id":"gRPC Channel","level":30,"message":"gRPC Channel State: READY","pollMode":"Long Poll","taskType":"gRPC Channel","time":"2019 Oct-28 01:48:50AM","timestamp":"2019-10-28T01:48:50.855Z"}
{"context":"/server/endpoint/node_modules/@magikcraft/nestjs-zeebe/node_modules/zeebe-node/dist/lib/GRPCClient.js:259","id":"gRPC Channel","level":30,"message":"gRPC Retry count: 2","pollMode":"Long Poll","taskType":"gRPC Channel","time":"2019 Oct-28 01:48:50AM","timestamp":"2019-10-28T01:48:50.855Z"}
{"context":"/server/endpoint/node_modules/@magikcraft/nestjs-zeebe/node_modules/zeebe-node/dist/lib/GRPCClient.js:262","id":"gRPC Channel","level":30,"message":"gRPC reconnected","pollMode":"Long Poll","taskType":"gRPC Channel","time":"2019 Oct-28 01:48:50AM","timestamp":"2019-10-28T01:48:50.855Z"}
{"context":"/server/endpoint/node_modules/@magikcraft/nestjs-zeebe/node_modules/zeebe-node/dist/lib/GRPCClient.js:77","id":"gRPC Channel","level":30,"message":"gRPC Channel reconnected","pollMode":"Long Poll","taskType":"gRPC Channel","time":"2019 Oct-28 01:48:50AM","timestamp":"2019-10-28T01:48:50.856Z"}

The client reports it is connected, but does not retrieve any jobs.

Could this be due to pod rescheduling?
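For reference, here is a rough sketch of the kind of long-poll worker setup that produces logs like the above. It uses roughly the zeebe-node API of this era (0.21.x); the Camunda Cloud credentials and the long-poll value are placeholders, and the exact option names may differ slightly between versions.

```typescript
import { ZBClient } from 'zeebe-node'

// Placeholder Camunda Cloud credentials; not values from this deployment.
const zbc = new ZBClient({
  camundaCloud: {
    clientId: 'my-client-id',
    clientSecret: 'my-client-secret',
    clusterId: 'my-cluster-id',
  },
  longPoll: 600000, // 10-minute long poll, assumed to be in milliseconds here
})

// Task type and worker id taken from the log lines above; the handler body is illustrative.
zbc.createWorker('retrieve-spells_103', 'retrieve-spells', (job, complete) => {
  // ...do the actual work with job.variables...
  complete.success()
})
```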

jwulf (Member, Author) commented Oct 29, 2019

Running the 0.21.3 branch with the heartbeat code.

Connected fine. Ran at DEBUG log level for several hours, then switched to INFO to surface errors.

Load on the broker is next to nothing. Workers polling at 30 second intervals. ~12 workers connected.

  • 09:49 UTC: The cluster rescheduled while I was watching the log stream. The client failed for a minute or so with a 503 response from the nginx proxy:
[2019 Oct-29 09:49:22AM] ERROR:
    context: "/server/node_modules/@magikcraft/nestjs-zeebe/node_modules/zeebe-node/dist/lib/GRPCClient.js:71"
    id: "gRPC Channel"
    message: "GRPC ERROR: 1 CANCELLED: Received http2 header with status: 503"
    pollMode: "Long Poll"
    taskType: "gRPC Channel"
[2019 Oct-29 09:49:22AM] ERROR:
    context: "/server/node_modules/@magikcraft/nestjs-zeebe/node_modules/zeebe-node/dist/lib/GRPCClient.js:242"
    id: "gRPC Channel"
    message: "GRPC Channel State: READY"
    pollMode: "Long Poll"
    taskType: "gRPC Channel"

When the cluster reschedule finished, the client came back up and serviced tasks.

  • 10:06:34: GRPC ERROR: 14 UNAVAILABLE: GOAWAY received three times

  • 10:10:36: GRPC ERROR: 14 UNAVAILABLE: GOAWAY received once.

  • 10:29:15: Worker still working.

  • 10:43:43: GRPC ERROR: 14 UNAVAILABLE: GOAWAY received once.

  • 10:43:48: GRPC ERROR: 14 UNAVAILABLE: GOAWAY received once.

  • 10:52:53: Worker still working.

  • 11:06:32: GRPC ERROR: 14 UNAVAILABLE: GOAWAY received once.

  • 11:10:19: GRPC ERROR: 14 UNAVAILABLE: GOAWAY received once.

  • 11:10:33: GRPC ERROR: 14 UNAVAILABLE: GOAWAY received twice.

  • 11:29:01: Worker still working.

  • 11:38:33: GRPC ERROR: 14 UNAVAILABLE: GOAWAY received twice.

  • 11:47:57: Worker still working.

  • 12:06:37: GRPC ERROR: 14 UNAVAILABLE: GOAWAY received twice.

  • 12:10:12: GRPC ERROR: 14 UNAVAILABLE: GOAWAY received twice.

The 12:06 and 12:10 events were predicted ahead of time, so there is a pattern (a quick interval check is sketched after this list).

  • 12:17:00: Restarted workers to determine whether the pattern is absolute-timed (server-dependent) or relative-timed (client-elapsed-time-dependent).

  • 12:43:41: GRPC ERROR: 14 UNAVAILABLE: GOAWAY received five times.
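Not part of the log analysis above, but a quick way to check the spacing of those GOAWAY events is to feed the observed timestamps in and print the gaps (dates assumed to be 29 Oct, times assumed UTC as per the 09:49 entry):

```typescript
// Interval check over the GOAWAY timestamps listed above.
const goawayTimes = [
  '2019-10-29T10:06:34Z',
  '2019-10-29T10:10:36Z',
  '2019-10-29T10:43:43Z',
  '2019-10-29T10:43:48Z',
  '2019-10-29T11:06:32Z',
  '2019-10-29T11:10:19Z',
  '2019-10-29T11:10:33Z',
  '2019-10-29T11:38:33Z',
  '2019-10-29T12:06:37Z',
  '2019-10-29T12:10:12Z',
]

const ms = goawayTimes.map((t) => new Date(t).getTime())
ms.forEach((t, i) => {
  if (i > 0) {
    console.log(`${goawayTimes[i]}  +${((t - ms[i - 1]) / 60000).toFixed(1)} min`)
  }
})
```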

jwulf (Member, Author) commented Nov 1, 2019

Whether I run the client against a local broker in Docker, against a remote broker in the same data centre (AWS US East), or against a K8s cluster via port-forwarding (AWS US East to GKE AU South East), none of these errors occur.

Therefore it is either the proxy or the broker config on Camunda Cloud.

jwulf (Member, Author) commented Nov 11, 2019

It's still disconnecting and requiring a reboot. Next thing to try: if the channel is down for a set amount of time, destroy and recreate the channel. A rough sketch of that approach follows.
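This sketch uses roughly the zeebe-node API of this era. The threshold, the worker, and the "proof of life" heuristic are all illustrative; in a quiet system with no incoming jobs you would need a different liveness signal than job activations.

```typescript
import { ZBClient } from 'zeebe-node'

const STALE_MS = 5 * 60 * 1000 // treat 5 minutes of silence as "down" (illustrative)
let lastActivity = Date.now()

function createClientAndWorkers(): ZBClient {
  // With no arguments the client reads its connection config from the
  // environment (e.g. ZEEBE_ADDRESS).
  const client = new ZBClient()
  client.createWorker('watchdog-demo', 'retrieve-spells', (job, complete) => {
    lastActivity = Date.now() // any activated job counts as proof of life
    complete.success()
  })
  return client
}

let zbc = createClientAndWorkers()

setInterval(async () => {
  if (Date.now() - lastActivity > STALE_MS) {
    // The channel looks stale: tear the whole client down and recreate it.
    await zbc.close()
    zbc = createClientAndWorkers()
    lastActivity = Date.now()
  }
}, 30 * 1000)
```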

jwulf (Member, Author) commented Nov 18, 2019

It looks like the worker channels are durable, but the client channel used to send commands becomes stale.

At the moment you can't reliably inspect the state of the client channel because the worker channel state bubbles up through it. Will change that behaviour in #109.
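For reference, this is the kind of direct check on the command channel that #109 should make possible. A minimal sketch using the channel API of @grpc/grpc-js (the native grpc package exposes the same methods); `gatewayClient` stands in for the generated Gateway client held inside ZBClient and is not something the library currently exposes.

```typescript
import { connectivityState } from '@grpc/grpc-js'

// `gatewayClient` is a stand-in for the generated gRPC Gateway client that
// ZBClient holds internally; obtaining a reference to it is not shown here.
function describeCommandChannel(gatewayClient: any) {
  const channel = gatewayClient.getChannel()
  // `false` means: report the current state without triggering a reconnect.
  const state = channel.getConnectivityState(false)
  console.log('command channel state:', connectivityState[state])
}
```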

jwulf modified the milestone: 0.22.0-alpha.2 on Nov 18, 2019
MainAero (Contributor) commented:

Any news on this?

jwulf (Member, Author) commented Nov 27, 2019

Are you seeing this issue in production? I would be surprised if you see it with Lambdas. It seems to affect long-lived connections.

jwulf (Member, Author) commented Jan 22, 2020

The Camunda Cloud team are now using this in production for their own systems, and are looking into the source of these issues.

afirth commented Jan 23, 2020

Looked into this today with @ColRad. We believe nginx receives the keepalive but doesn't pass it to the backend. Because the nginx <-> backend connection has no data flowing on it, it gets killed after 60s by the grpc_read_timeout. We've increased that timeout to 601s (1s longer than the default long poll of 10 minutes) in https://github.com/camunda-cloud/zeebe-controller-k8s/pull/185. This should go live this week.
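For anyone running their own nginx in front of a gateway, the actual change is in the linked PR, but the relevant directive looks roughly like this (the location block and grpc_pass target are illustrative, not the Camunda Cloud config):

```nginx
# grpc_read_timeout defaults to 60s; an idle long-poll stream has no data
# flowing from the backend, so nginx kills it at that point. 601s is 1s
# longer than the client's default 10-minute long poll.
location / {
    grpc_pass grpc://zeebe-gateway:26500;
    grpc_read_timeout 601s;
}
```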

afirth commented Mar 25, 2020

Does https://trac.nginx.org/nginx/ticket/1555 sound familiar?
We are playing around with tuning http2_max_requests, but this comes with some caveats (basically it's a cap on a memory leak).
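For illustration, the knob in question is roughly this (the value is arbitrary; as noted above, the limit exists to bound per-connection memory growth in nginx, so raising it is a trade-off rather than a fix):

```nginx
# Each long-poll round trip counts against the per-connection request limit,
# and nginx sends GOAWAY once it is reached (default 1000).
http2_max_requests 10000;
```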

afirth commented Mar 25, 2020

Any chance you could put a debug counter on the requests?
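Something like this, for example: a per-connection request counter, logged whenever a GOAWAY arrives, to see whether the count lines up with nginx's http2_max_requests cap. The hook points (calling onRequest wherever the client issues a gRPC call, and onGoaway from its error handler) are hypothetical, not existing zeebe-node APIs.

```typescript
// Hypothetical instrumentation sketch; not part of zeebe-node.
class ConnectionRequestCounter {
  private count = 0
  private since = Date.now()

  // Call this wherever the client issues a gRPC request (e.g. each long poll).
  onRequest() {
    this.count++
  }

  // Call this from the gRPC error handler when a GOAWAY error is seen.
  onGoaway() {
    const seconds = Math.round((Date.now() - this.since) / 1000)
    console.log(`GOAWAY after ${this.count} requests in ${seconds}s on this connection`)
    this.count = 0 // a GOAWAY ends the HTTP/2 connection, so start a new count
    this.since = Date.now()
  }
}

export const requestCounter = new ConnectionRequestCounter()
```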

jwulf added a commit to jwulf/zeebe-client-node-js that referenced this issue May 6, 2020
jwulf (Member, Author) commented May 6, 2020

Fixed in 0.23.0.

jwulf closed this as completed on May 6, 2020
afirth commented May 6, 2020

@jwulf thanks for the fix! What was the root cause?
