Better recovery from unhealthy controller connection? #152
The initial observation from looking at the metrics is that orchard is not recovering from …
We are also investigating switching from …
Hello Rui 👋 I've tried reproducing your issue by running an Orchard Controller on an EC2 instance behind an NLB, and I think that the issue is not with the gRPC stream, because it works just fine even in the presence of the hard-coded 350-second timeout imposed by the NLB. There's a great article, "gRPC Keepalives and Transport Closing Errors on AWS Network Load Balancers", which discusses the communication problems that arise between an NLB and a gRPC server/client pair with improper keep-alive settings, and how they can be fixed. However, this doesn't seem to be our case, as we're actually (1) using streams and (2) have keep-alive configured on the server too (which should not result in …).
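For reference, the kind of keep-alive configuration discussed above looks roughly like this with grpc-go. The address, port and durations below are purely illustrative and are not Orchard's actual settings; the only real constraint is keeping the ping interval well below the NLB's 350-second idle timeout:

```go
package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Client side: ping the server often enough that the NLB's 350-second
	// idle timeout never fires on a quiet connection, and give up quickly
	// if a ping is never ACKed.
	conn, err := grpc.Dial("controller.example.com:6120", // placeholder address
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                30 * time.Second, // send a ping after 30s of inactivity
			Timeout:             10 * time.Second, // close the connection if no ACK within 10s
			PermitWithoutStream: true,             // ping even when no RPC is in flight
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Server side: allow frequent client pings instead of rejecting them,
	// and ping idle clients from the server as well.
	_ = grpc.NewServer(
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			MinTime:             15 * time.Second, // accept client pings at least this far apart
			PermitWithoutStream: true,
		}),
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:    30 * time.Second,
			Timeout: 10 * time.Second,
		}),
	)
}
```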
Looking at the last log lines that you've posted before the worker got stuck: …

The next lines should be …. The only reason I can see for these lines not being emitted is that Orchard's HTTP client is not making any progress, which should be addressed by #153.
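Sketching only the general idea here, not the actual change in #153: one common way to make sure an HTTP client cannot stall forever is to give it explicit timeouts, e.g. in Go:

```go
package worker

import (
	"net"
	"net/http"
	"time"
)

// newHTTPClient returns an HTTP client with explicit timeouts, so that a
// single stalled request can never block the worker's poll loop indefinitely.
// The values are illustrative, not Orchard's actual configuration.
func newHTTPClient() *http.Client {
	return &http.Client{
		Timeout: 60 * time.Second, // cap on the whole request, including reading the body
		Transport: &http.Transport{
			DialContext:           (&net.Dialer{Timeout: 10 * time.Second}).DialContext,
			TLSHandshakeTimeout:   10 * time.Second,
			ResponseHeaderTimeout: 30 * time.Second,
		},
	}
}
```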
Please check out the new …. Closing for now; please let us know in case you ever encounter this again.
Hi,
Recently, our workers started getting stuck after the keepalive ping fails due to an ACK timeout. This coincides with a migration on our side from an AWS Classic Load Balancer to a Network Load Balancer; there's something about that change that orchard is more sensitive to.
I'm wondering if this behavior can be improved so that the gRPC stream is automatically restarted when it fails to actually communicate? I'm assuming there is an issue preventing the retry logic from working, since a simple orchard restart immediately resolves the problem and there is no apparent loss of network connectivity at any stage.
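To illustrate the kind of behavior I have in mind, here's a rough sketch of a reconnecting watch loop with backoff. The `watchStream` interface and the function names are made up for illustration and don't correspond to Orchard's actual code:

```go
package worker

import (
	"context"
	"errors"
	"io"
	"log"
	"time"
)

// watchStream is a hypothetical stand-in for the worker's streaming RPC
// towards the controller.
type watchStream interface {
	Recv() (any, error)
}

// runWatchLoop keeps re-establishing the stream whenever it errors out,
// instead of waiting on a dead connection forever.
func runWatchLoop(ctx context.Context, open func(context.Context) (watchStream, error)) {
	backoff := time.Second

	for ctx.Err() == nil {
		stream, err := open(ctx)
		if err != nil {
			log.Printf("failed to open stream: %v, retrying in %v", err, backoff)
		} else {
			backoff = time.Second // reset backoff after a successful (re)connect

			for {
				msg, err := stream.Recv()
				if err != nil {
					if errors.Is(err, io.EOF) {
						log.Print("stream closed by the controller, reconnecting")
					} else {
						log.Printf("stream failed: %v, reconnecting", err)
					}
					break
				}
				_ = msg // handle the received message here
			}
		}

		// Wait before reconnecting, doubling the delay up to a cap.
		select {
		case <-ctx.Done():
			return
		case <-time.After(backoff):
		}
		if backoff < 30*time.Second {
			backoff *= 2
		}
	}
}
```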
That last message remains the final entry in the log until a restart is performed, but the orchard controller has not received any ping from the worker since that point in time (inclusive).
A typical successful log message group would show:
Thank you!