
Disconnects from cloud #99

Closed
jwulf opened this issue Oct 28, 2019 · 12 comments · Fixed by #188

Comments

jwulf (Member) commented Oct 28, 2019

Since Camunda Cloud moved to Zeebe 0.21.1, this has been happening every day:

{"context":"/server/endpoint/node_modules/@magikcraft/nestjs-zeebe/node_modules/zeebe-node/dist/zb/ZBWorker.js:122","id":"retrieve-spells_103","level":50,"message":"Stalled on gRPC error","pollMode":"Long Poll","taskType":"retrieve-spells","time":"2019 Oct-28 01:48:50AM","timestamp":"2019-10-28T01:48:50.672Z"}
{"context":"/server/endpoint/node_modules/@magikcraft/nestjs-zeebe/node_modules/zeebe-node/dist/lib/GRPCClient.js:71","id":"gRPC Channel","level":50,"message":"GRPC ERROR: 1 CANCELLED: Received http2 header with status: 502","pollMode":"Long Poll","taskType":"gRPC Channel","time":"2019 Oct-28 01:48:50AM","timestamp":"2019-10-28T01:48:50.672Z"}
{"context":"/server/endpoint/node_modules/@magikcraft/nestjs-zeebe/node_modules/zeebe-node/dist/lib/GRPCClient.js:242","id":"gRPC Channel","level":50,"message":"GRPC Channel State: READY","pollMode":"Long Poll","taskType":"gRPC Channel","time":"2019 Oct-28 01:48:50AM","timestamp":"2019-10-28T01:48:50.673Z"}
{"context":"/server/endpoint/node_modules/@magikcraft/nestjs-zeebe/node_modules/zeebe-node/dist/lib/GRPCClient.js:77","id":"gRPC Channel","level":30,"message":"gRPC Channel reconnected","pollMode":"Long Poll","taskType":"gRPC Channel","time":"2019 Oct-28 01:48:50AM","timestamp":"2019-10-28T01:48:50.673Z"}
{"context":"/server/endpoint/node_modules/@magikcraft/nestjs-zeebe/node_modules/zeebe-node/dist/lib/GRPCClient.js:258","id":"gRPC Channel","level":30,"message":"gRPC Channel State: READY","pollMode":"Long Poll","taskType":"gRPC Channel","time":"2019 Oct-28 01:48:50AM","timestamp":"2019-10-28T01:48:50.855Z"}
{"context":"/server/endpoint/node_modules/@magikcraft/nestjs-zeebe/node_modules/zeebe-node/dist/lib/GRPCClient.js:259","id":"gRPC Channel","level":30,"message":"gRPC Retry count: 2","pollMode":"Long Poll","taskType":"gRPC Channel","time":"2019 Oct-28 01:48:50AM","timestamp":"2019-10-28T01:48:50.855Z"}
{"context":"/server/endpoint/node_modules/@magikcraft/nestjs-zeebe/node_modules/zeebe-node/dist/lib/GRPCClient.js:262","id":"gRPC Channel","level":30,"message":"gRPC reconnected","pollMode":"Long Poll","taskType":"gRPC Channel","time":"2019 Oct-28 01:48:50AM","timestamp":"2019-10-28T01:48:50.855Z"}
{"context":"/server/endpoint/node_modules/@magikcraft/nestjs-zeebe/node_modules/zeebe-node/dist/lib/GRPCClient.js:77","id":"gRPC Channel","level":30,"message":"gRPC Channel reconnected","pollMode":"Long Poll","taskType":"gRPC Channel","time":"2019 Oct-28 01:48:50AM","timestamp":"2019-10-28T01:48:50.856Z"}

The client reports it is connected, but does not retrieve any jobs.

Could this be due to pod rescheduling?
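For reference, here is a rough sketch of the kind of long-poll worker setup that produces logs like the above. It uses roughly the zeebe-node API of this era (0.21.x); the Camunda Cloud credentials and the long-poll value are placeholders, and the exact option names may differ slightly between versions.

```typescript
import { ZBClient } from 'zeebe-node'

// Placeholder Camunda Cloud credentials; not values from this deployment.
const zbc = new ZBClient({
  camundaCloud: {
    clientId: 'my-client-id',
    clientSecret: 'my-client-secret',
    clusterId: 'my-cluster-id',
  },
  longPoll: 600000, // 10-minute long poll, assumed to be in milliseconds here
})

// Task type and worker id taken from the log lines above; the handler body is illustrative.
zbc.createWorker('retrieve-spells_103', 'retrieve-spells', (job, complete) => {
  // ...do the actual work with job.variables...
  complete.success()
})
```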

jwulf (Member, Author) commented Oct 29, 2019

Running the 0.21.3 branch with the heartbeat code.

Connected fine. Ran at DEBUG log level for several hours, then switched to INFO to surface errors.

Load on the broker is next to nothing. Workers polling at 30 second intervals. ~12 workers connected.

  • 09:49 UTC: The cluster rescheduled while I was watching the log stream. The client failed for a minute or so with a 503 response from the nginx proxy:
[2019 Oct-29 09:49:22AM] ERROR:
    context: "/server/node_modules/@magikcraft/nestjs-zeebe/node_modules/zeebe-node/dist/lib/GRPCClient.js:71"
    id: "gRPC Channel"
    message: "GRPC ERROR: 1 CANCELLED: Received http2 header with status: 503"
    pollMode: "Long Poll"
    taskType: "gRPC Channel"
[2019 Oct-29 09:49:22AM] ERROR:
    context: "/server/node_modules/@magikcraft/nestjs-zeebe/node_modules/zeebe-node/dist/lib/GRPCClient.js:242"
    id: "gRPC Channel"
    message: "GRPC Channel State: READY"
    pollMode: "Long Poll"
    taskType: "gRPC Channel"

When the cluster reschedule finished, the client came back up and serviced tasks.

  • 10:06:34: GRPC ERROR: 14 UNAVAILABLE: GOAWAY received three times

  • 10:10:36: GRPC ERROR: 14 UNAVAILABLE: GOAWAY received once.

  • 10:29:15: Worker still working.

  • 10:43:43: GRPC ERROR: 14 UNAVAILABLE: GOAWAY received once.

  • 10:43:48: GRPC ERROR: 14 UNAVAILABLE: GOAWAY received once.

  • 10:52:53: Worker still working.

  • 11:06:32: GRPC ERROR: 14 UNAVAILABLE: GOAWAY received once.

  • 11:10:19: GRPC ERROR: 14 UNAVAILABLE: GOAWAY received once.

  • 11:10:33: GRPC ERROR: 14 UNAVAILABLE: GOAWAY received twice.

  • 11:29:01: Worker still working.

  • 11:38:33: GRPC ERROR: 14 UNAVAILABLE: GOAWAY received twice.

  • 11:47:57: Worker still working.

  • 12:06:37: GRPC ERROR: 14 UNAVAILABLE: GOAWAY received twice.

  • 12:10:12: GRPC ERROR: 14 UNAVAILABLE: GOAWAY received twice.

The 12:06 and 12:10 events were predicted ahead of time, so there is a pattern (a quick interval check is sketched after this list).

  • 12:17:00: Restarted workers to determine whether the pattern is absolute-timed (server-dependent) or relative-timed (client-elapsed-time-dependent).

  • 12:43:41: GRPC ERROR: 14 UNAVAILABLE: GOAWAY received five times.
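Not part of the log analysis above, but a quick way to check the spacing of those GOAWAY events is to feed the observed timestamps in and print the gaps (dates assumed to be 29 Oct, times assumed UTC as per the 09:49 entry):

```typescript
// Interval check over the GOAWAY timestamps listed above.
const goawayTimes = [
  '2019-10-29T10:06:34Z',
  '2019-10-29T10:10:36Z',
  '2019-10-29T10:43:43Z',
  '2019-10-29T10:43:48Z',
  '2019-10-29T11:06:32Z',
  '2019-10-29T11:10:19Z',
  '2019-10-29T11:10:33Z',
  '2019-10-29T11:38:33Z',
  '2019-10-29T12:06:37Z',
  '2019-10-29T12:10:12Z',
]

const ms = goawayTimes.map((t) => new Date(t).getTime())
ms.forEach((t, i) => {
  if (i > 0) {
    console.log(`${goawayTimes[i]}  +${((t - ms[i - 1]) / 60000).toFixed(1)} min`)
  }
})
```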

jwulf (Member, Author) commented Nov 1, 2019

Whether I run the client against a local broker in Docker, against a remote broker in the same data centre (AWS US East), or against a K8s cluster via port-forwarding (AWS US East to GKE AU South East), none of these errors occur.

Therefore it is either the proxy or the broker config on Camunda Cloud.

jwulf (Member, Author) commented Nov 11, 2019

It's still disconnecting and requiring a reboot. Next thing to try: if the channel is down for a set amount of time, destroy and recreate the channel. A rough sketch of that approach follows.
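This sketch uses roughly the zeebe-node API of this era. The threshold, the worker, and the "proof of life" heuristic are all illustrative; in a quiet system with no incoming jobs you would need a different liveness signal than job activations.

```typescript
import { ZBClient } from 'zeebe-node'

const STALE_MS = 5 * 60 * 1000 // treat 5 minutes of silence as "down" (illustrative)
let lastActivity = Date.now()

function createClientAndWorkers(): ZBClient {
  // With no arguments the client reads its connection config from the
  // environment (e.g. ZEEBE_ADDRESS).
  const client = new ZBClient()
  client.createWorker('watchdog-demo', 'retrieve-spells', (job, complete) => {
    lastActivity = Date.now() // any activated job counts as proof of life
    complete.success()
  })
  return client
}

let zbc = createClientAndWorkers()

setInterval(async () => {
  if (Date.now() - lastActivity > STALE_MS) {
    // The channel looks stale: tear the whole client down and recreate it.
    await zbc.close()
    zbc = createClientAndWorkers()
    lastActivity = Date.now()
  }
}, 30 * 1000)
```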

jwulf (Member, Author) commented Nov 18, 2019

It looks like the worker channels are durable, but the client channel used to send commands becomes stale.

At the moment you can't reliably inspect the state of the client channel because the worker channel state bubbles up through it. Will change that behaviour in #109.
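For reference, this is the kind of direct check on the command channel that #109 should make possible. A minimal sketch using the channel API of @grpc/grpc-js (the native grpc package exposes the same methods); `gatewayClient` stands in for the generated Gateway client held inside ZBClient and is not something the library currently exposes.

```typescript
import { connectivityState } from '@grpc/grpc-js'

// `gatewayClient` is a stand-in for the generated gRPC Gateway client that
// ZBClient holds internally; obtaining a reference to it is not shown here.
function describeCommandChannel(gatewayClient: any) {
  const channel = gatewayClient.getChannel()
  // `false` means: report the current state without triggering a reconnect.
  const state = channel.getConnectivityState(false)
  console.log('command channel state:', connectivityState[state])
}
```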

jwulf modified the milestone: 0.22.0-alpha.2 on Nov 18, 2019
MainAero (Contributor) commented:

Any news on this?

jwulf (Member, Author) commented Nov 27, 2019

Are you seeing this issue in production? I would be surprised if you see it with Lambdas. It seems to affect long-lived connections.

jwulf (Member, Author) commented Jan 22, 2020

The Camunda Cloud team are now using this in production for their own systems, and are looking into the source of these issues.

afirth commented Jan 23, 2020

Looked into this today with @ColRad. We believe nginx receives the keepalive but doesn't pass it to the backend. Because the nginx <-> backend connection has no data flowing on it, it gets killed after 60s by the grpc_read_timeout. We've increased that timeout to 601s (1s longer than the default long poll of 10 minutes) in https://github.com/camunda-cloud/zeebe-controller-k8s/pull/185. This should go live this week.
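For anyone running their own nginx in front of a gateway, the actual change is in the linked PR, but the relevant directive looks roughly like this (the location block and grpc_pass target are illustrative, not the Camunda Cloud config):

```nginx
# grpc_read_timeout defaults to 60s; an idle long-poll stream has no data
# flowing from the backend, so nginx kills it at that point. 601s is 1s
# longer than the client's default 10-minute long poll.
location / {
    grpc_pass grpc://zeebe-gateway:26500;
    grpc_read_timeout 601s;
}
```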

afirth commented Mar 25, 2020

Does https://trac.nginx.org/nginx/ticket/1555 sound familiar?
We are playing around with tuning http2_max_requests, but this comes with some caveats (basically it's a cap on a memory leak).
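For illustration, the knob in question is roughly this (the value is arbitrary; as noted above, the limit exists to bound per-connection memory growth in nginx, so raising it is a trade-off rather than a fix):

```nginx
# Each long-poll round trip counts against the per-connection request limit,
# and nginx sends GOAWAY once it is reached (default 1000).
http2_max_requests 10000;
```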

afirth commented Mar 25, 2020

Any chance you could put a debug counter on the requests?
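Something like this, for example: a per-connection request counter, logged whenever a GOAWAY arrives, to see whether the count lines up with nginx's http2_max_requests cap. The hook points (calling onRequest wherever the client issues a gRPC call, and onGoaway from its error handler) are hypothetical, not existing zeebe-node APIs.

```typescript
// Hypothetical instrumentation sketch; not part of zeebe-node.
class ConnectionRequestCounter {
  private count = 0
  private since = Date.now()

  // Call this wherever the client issues a gRPC request (e.g. each long poll).
  onRequest() {
    this.count++
  }

  // Call this from the gRPC error handler when a GOAWAY error is seen.
  onGoaway() {
    const seconds = Math.round((Date.now() - this.since) / 1000)
    console.log(`GOAWAY after ${this.count} requests in ${seconds}s on this connection`)
    this.count = 0 // a GOAWAY ends the HTTP/2 connection, so start a new count
    this.since = Date.now()
  }
}

export const requestCounter = new ConnectionRequestCounter()
```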

jwulf added a commit to jwulf/zeebe-client-node-js that referenced this issue May 6, 2020
jwulf (Member, Author) commented May 6, 2020

Fixed in 0.23.0.

jwulf closed this as completed on May 6, 2020
afirth commented May 6, 2020

@jwulf thanks for the fix! What was the root cause?
