
[bug]: No retry is performed when the ResourceWatcher fails to watch a resource #739

Closed
gabrieledemaglie opened this issue Mar 20, 2024 · 12 comments · Fixed by #780
Labels
bug Something isn't working

Comments

@gabrieledemaglie

Describe the bug

Hello,

Sometimes when the operator starts, it does not succeed in watching a resource type on the first try.
The image below shows an example of the error that occasionally occurs and triggers the bug.

[screenshot: error logged while establishing the watch at operator startup]

With version 8 everything was fine because retries were performed until a watch was successfully established.
With version 9 no retries are performed in case of error, which means that when this situation occurs, the resource type is not watched at all (a restart of the operator is necessary).

To reproduce

Configure an operator that watches a particular resource type.
Restart it multiple times, checking the logs for an error during the resource watch process.
When the error occurs, try to create or modify a resource of the watched type: the operator will not start the reconcile function.

Expected behavior

The operator should retry watching the resource if establishing the watch fails.
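
For illustration, roughly the kind of retry I'd expect (just a sketch; EstablishAndConsumeWatchAsync, _logger, and the delay values are placeholders, not the actual KubeOps internals):

// Sketch only: keep retrying the watch with a capped backoff instead of giving up on the first failure.
private async Task WatchWithRetryAsync(CancellationToken stoppingToken)
{
    var delay = TimeSpan.FromSeconds(1);
    while (!stoppingToken.IsCancellationRequested)
    {
        try
        {
            // Placeholder for whatever call actually establishes and consumes the watch.
            await EstablishAndConsumeWatchAsync(stoppingToken);
        }
        catch (OperationCanceledException) when (stoppingToken.IsCancellationRequested)
        {
            return; // shutdown requested
        }
        catch (Exception ex)
        {
            _logger.LogWarning(ex, "Watch failed, retrying in {Delay}.", delay);
        }

        try
        {
            await Task.Delay(delay, stoppingToken);
        }
        catch (OperationCanceledException)
        {
            return; // shutdown requested while waiting
        }

        delay = TimeSpan.FromSeconds(Math.Min(delay.TotalSeconds * 2, 60)); // capped exponential backoff
    }
}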

Screenshots

No response

Additional Context

No response

@gabrieledemaglie gabrieledemaglie added the bug Something isn't working label Mar 20, 2024
@buehler
Owner

buehler commented Apr 19, 2024

Hey @gabrieledemaglie

Thank you for bringing up this matter.
This may be related to @nachtjasmin's implementation of the async methods.

WDYT @nachtjasmin?

@nachtjasmin
Contributor

nachtjasmin commented Apr 22, 2024

@buehler Oh yeah, that's a bug. There's no reassignment of _eventWatcher. Once it throws, it terminates, so the watcher does not restart automatically. I think that wrapping the whole logic of WatchClientEventsAsync in a while (!stoppingToken.IsCancellationRequested) loop, and therefore moving the try/catch inside the loop, should solve the issue.

@duke-bartholomew

@nachtjasmin Hi, I'm running into exactly the same issue: the watcher dies after some time.
Apparently it's normal for the Kubernetes API to drop the websocket connection after some idle time, but will simply recreating the watch be enough in this case?

I would assume that, inside ResourceWatcher, you then also need to keep track of the latest ResourceVersion of the watched resource, so that the watch can be re-established from that specific resource version and we avoid retriggering resources the operator has already processed?
And that would then also mean properly handling a "resource too old" exception?
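
Something along these lines is what I have in mind (sketch only; ListCurrentVersionAsync and WatchFromAsync are placeholders for the actual client calls):

// Sketch: resume watching from the last seen resource version; when the server
// reports it as too old (HTTP 410 Gone), fall back to a fresh list and start over.
string? resourceVersion = null;
while (!stoppingToken.IsCancellationRequested)
{
    try
    {
        if (resourceVersion is null)
        {
            // Placeholder: list the current state and remember the list's resource version.
            resourceVersion = await ListCurrentVersionAsync(stoppingToken);
        }

        // Placeholder: watch starting at resourceVersion, returning the last version seen.
        resourceVersion = await WatchFromAsync(resourceVersion, stoppingToken);
    }
    catch (KubernetesException e) when (e.Status.Code == 410)
    {
        // "resource too old": our version has been compacted away; re-list on the next iteration.
        resourceVersion = null;
    }
}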

@nachtjasmin
Contributor

Although I appreciate that y'all mention me in this issue, I'm not the primary maintainer. I'd love to fix the mentioned issue, but right now I don't have the capacity to do so.

From my point of view, the patch below should fix the problem (and also #763). But again, I don't have the capacity to work on it now. So, without any integration tests and all, I can only assume that it's going to work.

diff --git a/src/KubeOps.Operator/Watcher/ResourceWatcher{TEntity}.cs b/src/KubeOps.Operator/Watcher/ResourceWatcher{TEntity}.cs
index 58ba7b1..447b298 100644
--- a/src/KubeOps.Operator/Watcher/ResourceWatcher{TEntity}.cs
+++ b/src/KubeOps.Operator/Watcher/ResourceWatcher{TEntity}.cs
@@ -128,9 +128,9 @@ internal class ResourceWatcher<TEntity>(
 
     private async Task WatchClientEventsAsync(CancellationToken stoppingToken)
     {
-        try
+        while (!stoppingToken.IsCancellationRequested)
         {
-            while (!stoppingToken.IsCancellationRequested)
+            try
             {
                 await foreach ((WatchEventType type, TEntity? entity) in client.WatchAsync<TEntity>(
                                    settings.Namespace,
@@ -184,14 +184,14 @@ internal class ResourceWatcher<TEntity>(
                     }
                 }
             }
-        }
-        catch (OperationCanceledException) when (stoppingToken.IsCancellationRequested)
-        {
-            // Don't throw if the cancellation was indeed requested.
-        }
-        catch (Exception e)
-        {
-            await OnWatchErrorAsync(e);
+            catch (OperationCanceledException) when (stoppingToken.IsCancellationRequested)
+            {
+                // Don't throw if the cancellation was indeed requested.
+            }
+            catch (Exception e)
+            {
+                await OnWatchErrorAsync(e);
+            }
         }
     }
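
For readability, with that patch applied the method would read roughly like this (the elided part is the existing event handling, unchanged):

private async Task WatchClientEventsAsync(CancellationToken stoppingToken)
{
    while (!stoppingToken.IsCancellationRequested)
    {
        try
        {
            // ... the original `await foreach` over client.WatchAsync<TEntity>(...) and its
            // event handling, exactly as in the current implementation ...
        }
        catch (OperationCanceledException) when (stoppingToken.IsCancellationRequested)
        {
            // Don't throw if the cancellation was indeed requested.
        }
        catch (Exception e)
        {
            await OnWatchErrorAsync(e);
        }
    }
}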

@buehler
Owner

buehler commented May 22, 2024

Thank you ;-)
No worries, I know how it feels. I'll have a go at your suggestion as soon as I have the time. Thanks for giving it thought!

@HappyCodeSloth

Here's a PR to add the fix: #765

I tried to add an integration test for this case, but I couldn't figure out how to simulate the failure without either introducing a stub for the client to mock the failure or interfering with other tests by breaking the host's network.

Any suggestions welcome.
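
If a unit-level test were acceptable, one option could be a throwing stub along these lines (everything here is hypothetical and only illustrates the idea; IWatchSource is not an existing KubeOps interface):

// Hypothetical abstraction so the watcher under test can be pointed at a fake source.
public interface IWatchSource
{
    IAsyncEnumerable<(WatchEventType Type, V1ConfigMap Entity)> WatchAsync(CancellationToken cancellationToken = default);
}

// Hypothetical stub: the first watch attempt throws, later attempts yield events,
// so a test can assert that the watcher retries instead of giving up.
public sealed class FlakyWatchSource : IWatchSource
{
    private int _attempts;

    public async IAsyncEnumerable<(WatchEventType Type, V1ConfigMap Entity)> WatchAsync(
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        await Task.Yield();

        if (Interlocked.Increment(ref _attempts) == 1)
        {
            // Simulates the startup failure described in this issue.
            throw new HttpRequestException("simulated connection failure");
        }

        // Subsequent calls deliver a single event and complete.
        yield return (WatchEventType.Added, new V1ConfigMap { Metadata = new V1ObjectMeta { Name = "test" } });
    }
}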

@duke-bartholomew

duke-bartholomew commented May 23, 2024

@HappyCodeSloth @buehler
I'm not sure the suggested fix will be enough ...
For one, it might go into an infinite failure loop, since catch (Exception) catches any kind of failure.
Secondly, re-creating the watch will result in all entities being delivered again, because the ResourceVersion is not taken into account.
Since I've noticed the API server consistently dropping idle connections after 30 minutes (in my case), this approach would lead to a full reprocessing of all the watched entities I've already processed, every 30 minutes.

So, in my opinion, a better solution would look more along the lines of this:

private async Task DoWatch(CancellationToken stoppingToken, string? resourceVersion = null) {
  var currentVersion = resourceVersion;
  while (!stoppingToken.IsCancellationRequested) {
    try {
      await foreach ((WatchEventType type, TEntity? entity) in client.WatchAsync<TEntity>(resourceVersion: currentVersion, ...)) {
        currentVersion = entity.ResourceVersion();    // or rely on WatchEventType.Bookmark to update the resourceVersion?
        await OnEventAsync(type, entity, stoppingToken);
      }
    }
    catch (KubernetesException cause) when (cause.Status.Code == 504 || cause.Status.Code == 410) {
      currentVersion = ...;    // regex-extract the last resourceVersion from the exception message;
                               // bail out when no resourceVersion could be extracted
    }
    catch (OperationCanceledException) when (stoppingToken.IsCancellationRequested) { }
    catch (HttpOperationException cause) when (cause.Response.StatusCode == HttpStatusCode.Conflict) {
      break;    // bail out of the watch
    }
    catch (HttpRequestException) { // e.g. 'connection reset by peer'
      await Task.Delay(TimeSpan.FromSeconds(5), stoppingToken);  // wait a bit and retry the watch
    }
    // maybe other exceptions we can/need to catch to retry the current watch?
    catch (Exception) {
      throw;  // I guess, because something really wrong happened ...
    }
  }
}

private async Task<(string? ResourceVersion, IList<TEntity> Entities)> DoList(CancellationToken stoppingToken) {
  ...
  var resultList = await client.ApiClient.CustomObjects.ListClusterCustomObjectAsync<EntityList<TEntity>>(...);
  return (resultList.Metadata.ResourceVersion, resultList.Items);
}

private async Task WatchClientEventsAsync(CancellationToken stoppingToken) {
  while (!stoppingToken.IsCancellationRequested) {
    _entityCache.Clear();

    // re-list to get the current state and the resource-version to watch from
    var (resourceVersion, entities) = await DoList(stoppingToken);
    foreach (var entity in entities) {
      await OnEventAsync(WatchEventType.Added, entity, stoppingToken);
    }

    // start watching for changes from the known resource-version
    await DoWatch(stoppingToken, resourceVersion);
  }
}

This approach keeps track of the resource version provided by the watch and resumes from the last known resource version if the API server drops the connection.
Any (known) exception that leaves the watch unusable would trigger a full re-initialization (i.e. List() followed by a watch from the returned ResourceVersion, as described in the Kubernetes documentation).

Any thoughts?

@duke-bartholomew

duke-bartholomew commented May 27, 2024

I've given it a go here: https://github.com/duke-bartholomew/dotnet-operator-sdk
This version automatically recreates the watch when the API server drops the connection. The watch is recreated taking the correct ResourceVersion into account, so previously processed events are not processed again.
(Keep in mind: the KubernetesClient API changed in this version.)

I'm currently still testing this against a Kubernetes platform (on AKS), but it looks like it does the trick ...
This should also fix #763

duke-bartholomew added commits to duke-bartholomew/dotnet-operator-sdk that referenced this issue May 27, 2024
@buehler
Owner

buehler commented May 29, 2024

Hey @duke-bartholomew

Thanks for the investigation. Did you create a PR?

@slacki123
Contributor

@duke-bartholomew @buehler thank you for your work on this. I also see this issue come up quite frequently, which makes our application non-production-ready :(

How is the progress on this issue looking?

@rajaniraog
Contributor

We are facing the same issue and hoping an update will be released soon.

@buehler
Owner

buehler commented Jun 13, 2024

I'm currently at a music festival; I'll have a look at this next week.
