Better recovery from unhealthy controller connection? #152

Closed
ruimarinho opened this issue Jan 24, 2024 · 4 comments

@ruimarinho

Hi,

Recently, our workers started getting stuck after the keepalive ping fails due to an ACK timeout. This coincides with a migration on our side from an AWS Classic Load Balancer to a Network Load Balancer. There's something about that change that orchard is more sensitive to.

I'm wondering if this behavior can be improved so that the gRPC stream is automatically restarted when it stops communicating? I'm assuming there is an issue preventing the retry logic from actually working, since a simple orchard restart immediately resolves the problem and there is no apparent loss of network connectivity at any stage.

{"level":"info","ts":1706057743.510983,"msg":"syncing 0 local VMs against 0 remote VMs..."}
{"level":"debug","ts":1706057748.3017972,"msg":"got worker from the API"}
{"level":"debug","ts":1706057748.3993149,"msg":"updated worker in the API"}
{"level":"info","ts":1706057748.5070028,"msg":"syncing 0 local VMs against 0 remote VMs..."}
{"level":"warn","ts":1706057773.640198,"msg":"failed to watch RPC: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout"}
{"level":"info","ts":1706057773.9021702,"msg":"connecting to orchard-controller:443 over gRPC"}
{"level":"info","ts":1706057773.903039,"msg":"gRPC connection established, starting gRPC stream with the controller"}
{"level":"info","ts":1706057774.88195,"msg":"running gRPC stream with the controller"}

That last message remains in the log until a restart is performed, but the orchard controller does not receive any ping from the worker from that point in time onwards (inclusive).
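
For context, the "keepalive ping failed to receive ACK within timeout" error above is emitted by grpc-go's client-side keep-alive machinery. A minimal sketch of how such a client is typically wired up (the timing values are illustrative, not Orchard's actual configuration):

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

// dialController opens a gRPC client connection with keep-alive pings enabled.
// When a ping is not acknowledged within Timeout, grpc-go closes the transport
// with code Unavailable, which is the error seen in the worker log above.
func dialController(addr string) (*grpc.ClientConn, error) {
	return grpc.Dial(addr,
		grpc.WithTransportCredentials(insecure.NewCredentials()), // TLS omitted for brevity
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                20 * time.Second, // ping after 20s without activity
			Timeout:             5 * time.Second,  // give up if no ACK within 5s
			PermitWithoutStream: true,             // ping even when no RPC is active
		}),
	)
}
```

After such a failure the ClientConn reconnects transparently, but a long-lived stream opened on top of it has to be re-created by the application, which appears to be what the reconnect messages in the log above correspond to.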

A typical successful log message group would show:

{"level":"info","ts":1706057774.88195,"msg":"running gRPC stream with the controller"}
{"level":"info","ts":1706099569.125219,"msg":"registered worker ci-1"}
{"level":"info","ts":1706099569.1264691,"msg":"connecting to orchard-controller:443 over gRPC"}
{"level":"info","ts":1706099569.127172,"msg":"gRPC connection established, starting gRPC stream with the controller"}
{"level":"info","ts":1706099569.243507,"msg":"syncing on-disk VMs..."}
{"level":"debug","ts":1706099569.2437172,"msg":"running 'tart list --format json'"}

Thank you!

@ruimarinho
Author

An initial observation from looking at the metrics is that orchard is not recovering from TCP RST packets sent by the NLB.

@fkorotkov
Contributor

We are also investigating switching from grpc-go, which does its own networking, to https://connectrpc.com/, which is based on the standard Go HTTP client and provides gRPC compatibility.
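
A rough sketch of what building on the standard Go HTTP client implies for dead-connection detection (this is not Orchard's code, and the timing values are illustrative): with connect-style clients, connection health-checking is handled by net/http and x/net/http2 rather than by grpc-go's own transport.

```go
package main

import (
	"crypto/tls"
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

// newH2Client builds a plain *http.Client whose HTTP/2 transport sends
// health-check PING frames on idle connections, so a connection silently
// dropped by a middlebox (e.g. a load balancer idle timeout or RST) is
// detected, torn down, and re-dialed on the next request.
func newH2Client() *http.Client {
	return &http.Client{
		Transport: &http2.Transport{
			TLSClientConfig: &tls.Config{MinVersion: tls.VersionTLS12},
			ReadIdleTimeout: 30 * time.Second, // send a PING if nothing is read for 30s
			PingTimeout:     15 * time.Second, // close the connection if the PING isn't answered
		},
	}
}
```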

@edigaryev
Collaborator

Hello Rui 👋

I've tried reproducing your issue by running an Orchard Controller on an EC2 instance behind an NLB, and I think that the issue is not with the gRPC stream, because it works just fine in the presence of the hard-coded 350-second timeout imposed by the NLB.

There's a great article, gRPC Keepalives and Transport Closing Errors on AWS Network Load Balancers, which discusses the communication problems that occur behind an NLB when a gRPC server/client pair has improper keep-alive settings and how they can be fixed. However, this doesn't seem to be our case, as we're actually (1) using streams and (2) have keep-alive configured on the server too (which should not result in GOAWAY messages).
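
For reference, a minimal grpc-go sketch of the server-side keep-alive settings the article is about (values are illustrative, not Orchard's actual configuration). An enforcement policy stricter than the client's ping interval is what triggers GOAWAY ("too_many_pings"); server parameters additionally make the server probe idle connections itself:

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newServer configures server-side keep-alive so that client pings are tolerated
// and idle connections (e.g. ones silently dropped by an NLB) are probed and
// closed by the server rather than lingering forever.
func newServer() *grpc.Server {
	return grpc.NewServer(
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			MinTime:             20 * time.Second, // pings more frequent than this get GOAWAY
			PermitWithoutStream: true,             // allow pings even without active RPCs
		}),
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:    30 * time.Second, // server-initiated pings on idle connections
			Timeout: 10 * time.Second, // close the connection if the ping isn't acknowledged
		}),
	)
}
```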

Looking at the last log lines that you've posted before the worker got stuck:

{"level":"info","ts":1706057773.903039,"msg":"gRPC connection established, starting gRPC stream with the controller"}
{"level":"info","ts":1706057774.88195,"msg":"running gRPC stream with the controller"}

The next lines should be "got worker from the API", "updated worker in the API" and so on.

The only reason I see for these lines not being emitted is that Orchard's HTTP client is not making any progress, which should be addressed by #153.
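
For illustration only (this is not necessarily what #153 does), a Go HTTP client can be prevented from hanging indefinitely on a dead connection by bounding every phase of a request:

```go
package main

import (
	"net"
	"net/http"
	"time"
)

// newHTTPClient returns an HTTP client that fails fast instead of blocking
// forever when the remote side stops making progress.
func newHTTPClient() *http.Client {
	return &http.Client{
		Timeout: 60 * time.Second, // hard cap on the whole request, body included
		Transport: &http.Transport{
			DialContext: (&net.Dialer{
				Timeout:   10 * time.Second, // bound connection establishment
				KeepAlive: 30 * time.Second, // enable TCP-level keep-alive probes
			}).DialContext,
			TLSHandshakeTimeout:   10 * time.Second,
			ResponseHeaderTimeout: 30 * time.Second, // fail if the server goes silent mid-request
		},
	}
}
```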

@edigaryev
Collaborator

Please check out the new 0.15.1 release; it should help with your issue.

Closing for now; please let us know if you ever encounter this again.
