System.IO.IOException: The request was aborted in the KubeOps.Operator.Kubernetes.ResourceWatcher #482
Comments
It is happening for me as well. I've double-checked, and this issue only occurs on my AKS clusters. Please take a look at this issue on the official k8s dotnet client repo: Watch Method stops watching. A workaround is suggested in that thread; I'll try it myself and come back with the results. @alexander-klang Is your environment by any chance running on AKS as well?
@Crespalvaro I'm running on GKE (also added version info to the original post). Checked the workaround - please let us know if it works!
Hi @alexander-klang. I've been doing multiple tests regarding the subject and unfortunately I cannot come up with a definitive answer. However, I will still post the results of my research here, so maybe we can work out a solution together.

First of all, the hack suggested on the official C# client GitHub is only a mitigation strategy; it won't completely fix this issue, and it didn't for me. I've also seen that this hack was integrated in version 7.0.13 and removed in version 7.1.x, apparently in favour of a better strategy using k8s.Autorest as a replacement for Microsoft.Rest. Thus, I don't think we should keep looking further in that direction.

I've also played a bit with the timeout configuration and the reconcile requeue interval; that seems to have helped, although I'm still facing the issue.

Finally, I dug into this repository's code and found out that it already deals not only with the server closing the connection but also with any exception thrown (ResourceWatcher.OnException). This disposes the k8s watcher and recreates it. I'm confused, because OnException (reconnecting with a backoff strategy) and OnClose do essentially the same thing. The only difference I see in the code is that _reconnectHandler awaits WatchResource, whereas OnClose calls RestartWatcher without await, so it won't wait, even though RestartWatcher itself also calls await WatchResource.

Find below some traces of the two cases (keep in mind the traces were captured in Grafana for in-depth tracking, so they are upside down):

Server Close Connection
Conn Reset Exception Caught

As you can see, OnClose successfully recovers, whereas OnException successfully restarts the watcher and catches a new "Received watch event "Added"" without propagating it to the EventQueue event mapping. It is frustrating, because it seems like we are properly dealing with the error, yet somehow something else remains broken and I could not identify it.

@buehler Does this investigation give you an idea of what could be wrong? I hope this is helpful and that I could explain myself in an understandable way.
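For illustration only (this is not the actual KubeOps code, and the method names are made up), the practical difference between awaiting the restart and firing it off without await can be sketched like this:

// Illustrative sketch, not the real ResourceWatcher implementation.
using System;
using System.Threading.Tasks;

public static class WatcherRestartDemo
{
    // Stand-in for RestartWatcher/WatchResource: pretend the restart itself fails.
    private static async Task RestartWatcherAsync()
    {
        await Task.Delay(100);
        throw new InvalidOperationException("restart failed");
    }

    // Fire-and-forget: the returned Task is discarded, so the failure above is
    // never observed by the caller and nothing reacts to the broken restart.
    public static void FireAndForgetRestart() => _ = RestartWatcherAsync();

    // Awaited: the failure surfaces here and can be handled, logged, or retried.
    public static async Task AwaitedRestart()
    {
        try
        {
            await RestartWatcherAsync();
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Restart failed and was observed: {ex.Message}");
        }
    }

    public static async Task Main()
    {
        FireAndForgetRestart();   // silently swallows the failure
        await AwaitedRestart();   // failure is visible here
        await Task.Delay(500);    // give the forgotten task time to fault before exit
    }
}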
We observe the same problem on our operator with KubeOps 5.4.1 as well as the latest version KubeOps 6.6.1 on Kubernetes 1.22. It is really difficult to debug as it is of such a sporadic nature, but it is a real show stopper for us, as the users of our operator rely on its stability without such manual interaction as restarts. Is there any progress on the issue? Thank you very much to all who contributed to the investigation already!
This is virtually impossible to retrace. When Kubernetes (or its API) closes the connection, you can do nothing. However, maybe we should actively let the server crash after some retries and trigger a redeployment of the pod... but that seems to be a rather harsh treatment...
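Purely as an illustration of that idea (the guard class, its threshold, and how it would be hooked into the watcher are all assumptions, not existing KubeOps APIs), letting the pod die after repeated failures could look roughly like this:

// Rough sketch: stop the host after too many consecutive watcher failures,
// so that Kubernetes restarts the pod.
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

public class WatcherFailureGuard
{
    private const int MaxConsecutiveFailures = 10; // assumed threshold

    private readonly IHostApplicationLifetime _lifetime;
    private readonly ILogger<WatcherFailureGuard> _logger;
    private int _consecutiveFailures;

    public WatcherFailureGuard(IHostApplicationLifetime lifetime, ILogger<WatcherFailureGuard> logger)
    {
        _lifetime = lifetime;
        _logger = logger;
    }

    public void OnWatcherFailed()
    {
        if (++_consecutiveFailures < MaxConsecutiveFailures) return;

        _logger.LogCritical(
            "Watcher failed {Count} times in a row, shutting down so the pod gets restarted.",
            _consecutiveFailures);

        // Graceful shutdown; with restartPolicy: Always the kubelet starts a fresh container.
        _lifetime.StopApplication();
    }

    public void OnWatcherRecovered() => _consecutiveFailures = 0;
}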
I wonder how this is done e.g. in the Golang Operator SDK? I agree it sounds like a harsh treatment, but it would still be better than having an operator that does not reconcile anymore... especially in cases like ours, where we provide the operator to be used by others on their Kubernetes cluster.
@anekdoti The best solution for us was to implement readiness and liveness probes. Basically, when the server closes the connection, the watcher remains broken and reconciles stop occurring. Everything in my prior investigation helps mitigate the occurrences, but these health checks automatically restart the operator whenever it happens, fixing the issue. An implementation example could be as follows (it is, moreover, the one I implemented): set a timestamp on the status of your custom resources on every reconcile; the health check then fetches the resources and checks whether they have been idle (no updates) for over x time (in my case 10 minutes). From there you can play with failureThreshold (max retries), periodSeconds (execution frequency), etc. - check the docs if needed: Liveness, Readiness Probes. I hope this helps you and anyone facing this issue now or in the future.
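To illustrate the idea, here is a simplified sketch that keeps the last reconcile time in memory rather than reading it back from the resources' status; the names ReconcileHeartbeat and OperatorLivenessCheck are made up, and the 10-minute threshold is just the value mentioned above:

// Simplified sketch of the liveness idea: record when the last reconcile happened
// and report unhealthy once the operator has been idle for too long.
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public class ReconcileHeartbeat
{
    private long _lastReconcileTicks = DateTime.UtcNow.Ticks;

    // Called from the controller on every successful reconcile.
    public void Beat() => Interlocked.Exchange(ref _lastReconcileTicks, DateTime.UtcNow.Ticks);

    public TimeSpan IdleFor =>
        DateTime.UtcNow - new DateTime(Interlocked.Read(ref _lastReconcileTicks), DateTimeKind.Utc);
}

public class OperatorLivenessCheck : IHealthCheck
{
    private static readonly TimeSpan MaxIdle = TimeSpan.FromMinutes(10); // assumed threshold

    private readonly ReconcileHeartbeat _heartbeat;

    public OperatorLivenessCheck(ReconcileHeartbeat heartbeat) => _heartbeat = heartbeat;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        var idle = _heartbeat.IdleFor;
        return Task.FromResult(idle > MaxIdle
            ? HealthCheckResult.Unhealthy($"No reconcile for {idle}.")
            : HealthCheckResult.Healthy($"Last reconcile {idle} ago."));
    }
}

The controller's reconcile method would call Beat() on each pass, the check would be registered via services.AddHealthChecks().AddCheck<OperatorLivenessCheck>("reconcile"), and the deployment's livenessProbe would point at the exposed health endpoint.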
Hi, occasionally (or is it rather exactly after 40 consecutive failed reconnection attempts?) _reconnectAttempts reaches the threshold of 40 retries, and then, as depicted below, an unhandled exception is thrown. After this, the watcher is not restarted anymore. A possible mitigation, capping the attempt count that is fed into the backoff strategy, could look like this:

const int MaxRetriesAttempts = 39;
++_reconnectAttempts;
_logger.LogError(e, @"There was an error while watching the resource ""{resource}"".", typeof(TEntity));
// Never pass more than MaxRetriesAttempts into the backoff strategy.
var backoff = _settings.ErrorBackoffStrategy(_reconnectAttempts > MaxRetriesAttempts ? MaxRetriesAttempts : _reconnectAttempts);
if (backoff.TotalSeconds > _settings.WatcherMaxRetrySeconds)
{
backoff = TimeSpan.FromSeconds(_settings.WatcherMaxRetrySeconds);
}

Of course, adjust it to fit the library best. Would it also make sense to surround some parts of the error handling with a try/catch/finally, so that the next reconnect is always scheduled? For example:

try
{
    //...
}
catch (Exception exception)
{
    _logger.LogError(exception, "Failed...");
}
finally
{
    // Always signal the reconnect handler, even if the code above threw.
    _reconnectHandler.OnNext(backoff);
}

The end result of the above-described issue is that the operator stops watching the resource type entirely. Thank you in advance!
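As a side note on why attempt number 40 could be the tipping point: assuming the error backoff strategy is roughly exponential in seconds (an assumption on my part, not something confirmed from the KubeOps source), the computed delay eventually no longer fits into a TimeSpan at all, which would explain an unhandled exception and why capping at 39 helps:

// Illustrative only; assumes a backoff of roughly 2^n seconds.
using System;

class BackoffOverflowDemo
{
    static void Main()
    {
        // 2^39 seconds (~17,000 years) is absurd but still representable as a TimeSpan.
        Console.WriteLine(TimeSpan.FromSeconds(Math.Pow(2, 39)));

        try
        {
            // 2^40 seconds exceeds TimeSpan.MaxValue and throws.
            Console.WriteLine(TimeSpan.FromSeconds(Math.Pow(2, 40)));
        }
        catch (OverflowException ex)
        {
            Console.WriteLine($"Attempt 40 would overflow: {ex.Message}");
        }
    }
}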
Hello @Crespalvaro and @tomasfabian! First, thank you very much, @Crespalvaro, for sharing your workaround. We were actually about to implement our interpretation of it, until @tomasfabian shared his analysis with me (thank you!). We actually had a call about it, and I think @tomasfabian is on to something - and the problem described is so subtle that I could easily imagine how it did not get caught earlier. What do you think (also @buehler)?
Wow..! I think the issue with the backoff strategy seems legit.
Hi @buehler @anekdoti @alexander-klang @Crespalvaro and everyone else! I prepared a PR which should fix the issue regarding the stopped reconciliations. Thank you @anekdoti for leading me toward the provided fix during our call!
Describe the bug
The app stops watching designated resources after encountering the System.IO.IOException in the KubeOps.Operator.Kubernetes.ResourceWatcher.

To Reproduce
It happens after running a pod for a while. Conditions are not clear to me.
Expected behavior
Pod handles transient exceptions and continues to watch designated resources.
Additional context
The callstack:
Edit:
KubeOps 6.6.0
k8s 1.21.14-gke.700