Better recovery from unhealthy controller connection? #152
The initial observation from looking at the metrics is that orchard is not recovering from …
We are also investigating switching from …
Hello Rui 👋 I've tried reproducing your issue by running an Orchard Controller on an EC2 instance behind an NLB, and I think that the issue is not with the gRPC stream, because it works just fine even in the presence of the hard-coded 350-second timeout imposed by the NLB. There's a great article, "gRPC Keepalives and Transport Closing Errors on AWS Network Load Balancers", which discusses the communication problems that arise between an NLB and a gRPC server/client pair with improper keep-alive settings, and how they can be fixed. However, this doesn't seem to be our case, as we're actually (1) using streams and (2) have keep-alive configured on the server too (which should not result in …).
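For reference, the kind of keep-alive configuration discussed above looks roughly like this with grpc-go. The address, port and durations below are purely illustrative and are not Orchard's actual settings; the only real constraint is keeping the ping interval well below the NLB's 350-second idle timeout:

```go
package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Client side: ping the server often enough that the NLB's 350-second
	// idle timeout never fires on a quiet connection, and give up quickly
	// if a ping is never ACKed.
	conn, err := grpc.Dial("controller.example.com:6120", // placeholder address
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                30 * time.Second, // send a ping after 30s of inactivity
			Timeout:             10 * time.Second, // close the connection if no ACK within 10s
			PermitWithoutStream: true,             // ping even when no RPC is in flight
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Server side: allow frequent client pings instead of rejecting them,
	// and ping idle clients from the server as well.
	_ = grpc.NewServer(
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			MinTime:             15 * time.Second, // accept client pings at least this far apart
			PermitWithoutStream: true,
		}),
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:    30 * time.Second,
			Timeout: 10 * time.Second,
		}),
	)
}
```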
Looking at the last log lines that you've posted before the worker got stuck: …

The next lines should be …. The only reason I can see for these lines not being emitted is that Orchard's HTTP client is not making any progress, which should be addressed by #153.
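Sketching only the general idea here, not the actual change in #153: one common way to make sure an HTTP client cannot stall forever is to give it explicit timeouts, e.g. in Go:

```go
package worker

import (
	"net"
	"net/http"
	"time"
)

// newHTTPClient returns an HTTP client with explicit timeouts, so that a
// single stalled request can never block the worker's poll loop indefinitely.
// The values are illustrative, not Orchard's actual configuration.
func newHTTPClient() *http.Client {
	return &http.Client{
		Timeout: 60 * time.Second, // cap on the whole request, including reading the body
		Transport: &http.Transport{
			DialContext:           (&net.Dialer{Timeout: 10 * time.Second}).DialContext,
			TLSHandshakeTimeout:   10 * time.Second,
			ResponseHeaderTimeout: 30 * time.Second,
		},
	}
}
```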
Please check out the new …. Closing for now; please let us know in case you ever encounter this again.
Hi,
Recently, our workers started getting stuck after the keepalive ping fails due to an ACK timeout. This coincides with a migration on our side from an AWS Classic Load Balancer to a Network Load Balancer; there's something about that change that orchard is more sensitive to.
I'm wondering if this behavior can be improved so that the gRPC stream is automatically restarted when it fails to actually communicate? I'm assuming there is an issue preventing the retry logic from working, since a simple orchard restart immediately resolves the problem and there is no apparent loss of network connectivity at any stage.
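To illustrate the kind of behavior I have in mind, here's a rough sketch of a reconnecting watch loop with backoff. The `watchStream` interface and the function names are made up for illustration and don't correspond to Orchard's actual code:

```go
package worker

import (
	"context"
	"errors"
	"io"
	"log"
	"time"
)

// watchStream is a hypothetical stand-in for the worker's streaming RPC
// towards the controller.
type watchStream interface {
	Recv() (any, error)
}

// runWatchLoop keeps re-establishing the stream whenever it errors out,
// instead of waiting on a dead connection forever.
func runWatchLoop(ctx context.Context, open func(context.Context) (watchStream, error)) {
	backoff := time.Second

	for ctx.Err() == nil {
		stream, err := open(ctx)
		if err != nil {
			log.Printf("failed to open stream: %v, retrying in %v", err, backoff)
		} else {
			backoff = time.Second // reset backoff after a successful (re)connect

			for {
				msg, err := stream.Recv()
				if err != nil {
					if errors.Is(err, io.EOF) {
						log.Print("stream closed by the controller, reconnecting")
					} else {
						log.Printf("stream failed: %v, reconnecting", err)
					}
					break
				}
				_ = msg // handle the received message here
			}
		}

		// Wait before reconnecting, doubling the delay up to a cap.
		select {
		case <-ctx.Done():
			return
		case <-time.After(backoff):
		}
		if backoff < 30*time.Second {
			backoff *= 2
		}
	}
}
```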
That last message remains the final entry in the log until a restart is performed, but the orchard controller has not received any ping from the worker since that point in time (inclusive).
A typical successful log message group would show:
Thank you!