
Rollout failing with msg "the object has been modified; please apply your changes to the latest version" #3080

Closed
pdeva opened this issue Oct 6, 2023 · 41 comments · Fixed by #3200

@pdeva

pdeva commented Oct 6, 2023

Checklist:

  • [x] I've included steps to reproduce the bug.
  • [x] I've included the version of Argo Rollouts.

Describe the bug
Updates to services managed by Argo Rollouts suddenly started failing with this message for no apparent reason. The only change we made was updating the image tag of the Rollout.

To Reproduce
It fails and gets into this state when multiple Rollout image tags are updated at once. If we then retry the rollouts one service at a time, each service succeeds.

Expected behavior
The Rollout should succeed. It has no reason to fail, since the only thing that changed is the image tag.

Screenshots
[Screenshot 2023-10-05 at 8:01:19 PM]

[Screenshot 2023-10-05 at 8:05:02 PM]

Version
1.6.0

Logs

roCtx.reconcile err Operation cannot be fulfilled on replicasets.apps "pg-query-65bc4849f5": the object has been modified; please apply your changes to the latest version and try again


@pdeva pdeva added the bug Something isn't working label Oct 6, 2023
@bpoland
Contributor

bpoland commented Nov 8, 2023

Hi @zachaller, I am still seeing this "the object has been modified; please apply your changes to the latest version" error with ReplicaSets using 1.6.2, which includes the fix from #3091 :(

@bpoland bpoland mentioned this issue Nov 8, 2023
@zachaller
Collaborator

@bpoland It's a relatively normal log; Rollouts will retry the update. Is there an issue you think this might be causing?

@bpoland
Contributor

bpoland commented Nov 8, 2023

@bpoland It's a relatively normal log; Rollouts will retry the update. Is there an issue you think this might be causing?

Yeah our rollout got stuck and I saw this message over and over. The behaviour we saw was:

  1. The rollout was in a healthy state (or so it said)
  2. We updated the image on the Rollout (we are using workload referencing, in case that matters; see the sketch after this list)
  3. Rollouts created a new ReplicaSet with the updated image, but the new RS had 0 replicas (even though our first step is setWeight: 1) and nothing seemed to be happening
  4. I manually promoted to the next step which is a pause and then to the next step setWeight: 2 -- still the new RS was stuck at 0 replicas (Rollout was marked as "Progressing")
  5. I tried manually scaling up the new RS to 1 replica. The new pod started but the steps did not progress
  6. I checked rollout controller logs and saw it complaining about an RS for an old revision. I manually deleted that RS and then the rollouts controller immediately picked up with the next step and the issue was resolved
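
Since this setup (workload referencing plus a setWeight: 1 first step) comes up a few times in this thread, here is a minimal sketch of what such a spec roughly looks like; all names below are hypothetical placeholders, not taken from this issue:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: example-rollout              # hypothetical name
spec:
  replicas: 5
  selector:
    matchLabels:
      app: example-app               # hypothetical label
  workloadRef:                       # workload referencing: pod template comes from a Deployment
    apiVersion: apps/v1
    kind: Deployment
    name: example-deployment         # hypothetical name
  strategy:
    canary:
      steps:
        - setWeight: 1               # first step, as described above
        - pause: {}                  # manual promotion point
        - setWeight: 2

With a spec like this, bumping the image tag on the referenced Deployment is what creates the new ReplicaSet mentioned in step 3.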

@mclarke47
Contributor

mclarke47 commented Nov 9, 2023

I'm also seeing a LOT of these in the controller logs with a very similar setup to @bpoland, and we're also seeing rollouts getting stuck unless we manually retry the failures.

@zachaller
Collaborator

zachaller commented Nov 10, 2023

Do you have any type of policy agent that modifies the ReplicaSets, or possibly the pod spec in the ReplicaSet? I have not experienced this and have never been able to reproduce it. In most of the cases I have seen, people had some other controller fighting with the rollouts controller over the ReplicaSet. That's not to say there isn't some issue within the rollouts controller; I just need to be able to reproduce it.
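
For anyone checking their cluster for this: a policy agent of the kind being asked about is usually registered as a mutating admission webhook that matches ReplicaSets or Pods. A rough, hypothetical sketch of what such a registration looks like, with all names invented for illustration:

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: example-policy-agent            # hypothetical
webhooks:
  - name: mutate.policy.example.com     # hypothetical
    admissionReviewVersions: ["v1"]
    sideEffects: None
    clientConfig:
      service:
        name: policy-agent              # hypothetical
        namespace: policy-system        # hypothetical
        path: /mutate
    rules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["replicasets"]

Listing the cluster's MutatingWebhookConfigurations is a quick way to see whether anything like this touches ReplicaSets or Pods.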

@zachaller
Collaborator

@pdeva Are you able to reliably reproduce this? Also, your image shows a bunch of issues with the VirtualService, but you also show a log line about the ReplicaSet, so it could be something else also modifying the VirtualService.

@mclarke47
Contributor

mclarke47 commented Nov 13, 2023

@zachaller We have policy agents, but it seemed to work fine on v1.3.0, which we just upgraded from.

I managed to find a rollout stuck in progress because it seemed like it wasn't updating the replica count in the new replica set.

[Screenshots: 2023-11-13 at 8:31 AM]

controller logs

As a follow-up, we rolled back to v1.3.0 and everything started working again.

@bpoland
Contributor

bpoland commented Nov 13, 2023

Do you have any type of policy agent that modifies the ReplicaSets, or possibly the pod spec in the ReplicaSet? I have not experienced this and have never been able to reproduce it. In most of the cases I have seen, people had some other controller fighting with the rollouts controller over the ReplicaSet. That's not to say there isn't some issue within the rollouts controller; I just need to be able to reproduce it.

We have a Linkerd injector which adds a container; maybe that is related? Similar to @mclarke47 though, we have not experienced this previously (we're currently trying to upgrade from 1.4.1).

@DanTulovsky

We are also seeing this happen a lot more. Yesterday HPA increased the number of replicas, but the Rollout did not bring up more pods. The Rollout object itself had the correct number set; it's just that the new pods weren't coming up. Killing the Argo Rollouts controller always fixes these stuck cases.

It's definitely happening a lot more with the 1.6 version than before.

@DanTulovsky

Question: would something like HPA modifying the number of replicas count as something that modifies the ReplicaSet and might cause this issue?
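
For reference: when HPA is combined with Argo Rollouts, the HPA is normally pointed at the Rollout itself via scaleTargetRef (the Rollout exposes the scale subresource), and the controller then resizes the ReplicaSets. A minimal, hypothetical sketch with placeholder names:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-rollout-hpa          # hypothetical
spec:
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout                    # HPA scales the Rollout, not the ReplicaSet directly
    name: example-rollout            # hypothetical
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80

So the HPA writes to the Rollout's replica count, and it is the rollouts controller that then updates the ReplicaSets, which is where these conflict errors are reported.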

@DanTulovsky

Here's an example. We started seeing these messages at 2023-11-14T22:10:00Z:

time="2023-11-14T22:10:00Z" level=error msg="roCtx.reconcile err Operation cannot be fulfilled on replicasets.apps \"public-collector-saas-56577cf9c\": the object has been modified; please apply your changes to the latest version and try again" generation=670 namespace=public resourceVersion=432345210 rollout=public-collector-saas

time="2023-11-15T00:44:36Z" level=error msg="rollout syncHandler error: Operation cannot be fulfilled on replicasets.apps \"public-collector-saas-56577cf9c\": the object has been modified; please apply your changes to the latest version and try again" namespace=public rollout=public-collector-saas

And they continued; this is the last one, at 2023-11-15T00:44:37Z:

time="2023-11-15T00:44:37Z" level=error msg="Operation cannot be fulfilled on replicasets.apps \"public-collector-saas-56577cf9c\": the object has been modified; please apply your changes to the latest version and try again\n" error="<nil>"

That's two hours. And it only started working again when we killed the Argo controller pod. Would it be possible to include in the message what changed? Perhaps that would lead to some clue as to why this is happening.

This is the first message (referencing the same replicaset) after the controller restarted:

time="2023-11-15T00:44:38Z" level=info msg="Enqueueing parent of public/public-collector-saas-56577cf9c-15: Rollout public/public-collector-saas"

Can you please explain a bit more about these conflicts that cause the "the object has been modified" errors? What is a common cause? How is the controller meant to deal with them? Presumably nothing was modifying this ReplicaSet for 2 hours straight... is the idea that the controller modifies it, and then something else also modifies it (maybe reverting something), and that's what the controller notices?

This is now happening to us daily, so if there is anything we can do to help figure this out, please let us know. We are on 1.6.2.

Thank you
Dan

@DanTulovsky

By the way, we also run Gatekeeper, but it only has one mutating webhook, which has to do with HPA. This is what it looks like:

apiVersion: mutations.gatekeeper.sh/v1beta1
kind: Assign
metadata:
  name: hpa-assign-default-scale-down
spec:
  applyTo:
    - groups: ["autoscaling"]
      kinds: ["HorizontalPodAutoscaler"]
      versions: ["v2beta2"]
  match:
    scope: Namespaced
    kinds:
      - apiGroups: ["*"]
        kinds: ["HorizontalPodAutoscaler"]
  location: "spec.behavior.scaleDown"
  parameters:
    pathTests:
      - subPath: "spec.behavior.scaleDown"
        condition: MustNotExist
    assign:
      value:
        # wait this long for the largest recommendation and then scale down to that
        stabilizationWindowSeconds: 600  # default = 300
        policies:
          # Only take down ${value}% per ${periodSeconds}
          - periodSeconds: 300  # default = 60
            type: Percent
            value: 10

So in theory this shouldn't be touching the replicaset at all. The other webhooks are constraints that have to do with labels and annotations, nothing that would mess with a running pod.

@DanTulovsky

FWIW, this is what the API is showing related to that stuck ReplicaSet:

[screenshot]

@DanTulovsky

Finally, this is when the issue started, the view from the api controller:

[screenshot]

@DanTulovsky

Maybe this is helpful: this is the last successful update by the Argo controller, followed by changes from other components, and then the failed updates from Argo starting:

[screenshot]

@DanTulovsky

DanTulovsky commented Nov 15, 2023

So the last event that happened before the errors started was HPA taking down one replica. That was maybe the trigger and what changed in the ReplicaSet since Argo last saw it, but somehow the controller didn't manage to reconcile it properly.

This is the HPA view:

[screenshot]

It looks like it was trying to increase the number of replicas. I wonder if this is Argo and HPA fighting it out then?

@DanTulovsky

Note that while the HPA shows current and desired replicas = 124, the actual number of replicas was 112. So this is similar to what I saw a couple of days ago, where HPA said "bring up more replicas" and Argo did not.

I assume the "current replicas" comes from the controller (Argo in this case). And I can confirm that I did see the Rollout object have the correct desired number of pods while the number of running pods was smaller.

@zachaller
Collaborator

I just want to comment that I think we are also seeing some issues with one of our clusters in regard to this, so I'm spending some time looking into it.

@zachaller
Collaborator

zachaller commented Nov 28, 2023

Do any of you use notifications within your Rollout specs? Trying to see if there is a correlation with notifications updating the spec.
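
For anyone unsure what "notifications within your Rollout specs" means: they are configured as subscription annotations in the Rollout's metadata, roughly like the hypothetical snippet below (the channel name is a placeholder):

metadata:
  annotations:
    notifications.argoproj.io/subscribe.on-rollout-completed.slack: example-channel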

@DanTulovsky

DanTulovsky commented Nov 28, 2023 via email

@bpoland
Contributor

bpoland commented Nov 28, 2023

We don't use notifications currently.

@mclarke47
Contributor

No notification use here

@int128

int128 commented Dec 1, 2023

I have faced similar problems in our cluster.

1. Rollout is stuck while canary update

When a Rollout is updated, both the old and new ReplicaSets are running and then the Rollout is stuck.
I could see the following message in the status of the Rollout:

old replicas are pending termination

Here is a snippet of kubectl get pods. The pods with hash 676f9f555d are from the new ReplicaSet, and 7b7cdd9847 is from the old one.

worker-676f9f555d-25xd9                      1/1     Running     0          3h22m
worker-676f9f555d-dkmlt                      1/1     Running     0          3h22m
worker-7b7cdd9847-d4xl6                      1/1     Running     0          5h8m

I deleted the old ReplicaSet and then the Rollout status became Healthy.

2. Rollout status becomes Degraded even if pods are running

When a Rollout is updated, it goes into a Degraded status even though all the new pods are running.
I could see the following message in the status of the Rollout:

ProgressDeadlineExceeded: ReplicaSet "poller-64d95bc44b" has timed out progressing.

I could refresh the status of the Rollout by restarting the argo-rollouts controller.

@DanTulovsky

Thank you for fixing this (hopefully for good)! What is the ETA for when this might make it into a release?

@zachaller
Collaborator

zachaller commented Dec 5, 2023

Just released it; can you try it out?

It will still log the conflict but rollouts should not get stuck anymore.

@zachaller
Collaborator

I also probably found the root cause of the conflicts; I'm just not sure how to deal with it yet. They also should not cause any issues, because they do get retried and we have had that code for a while now: #3218

@int128

int128 commented Dec 8, 2023

I have updated argo-rollouts to v1.6.3 and this problem seems resolved.
Thank you very much for the quick fix.

@bpoland
Contributor

bpoland commented Dec 8, 2023

We've also upgraded and haven't seen the issue again since. Thanks!

@DanTulovsky

Hi. We are still seeing this in 1.6.4. In this case, a new rollout was triggered and showed up in the UI, but did not start rolling out. I clicked the promote button and then it went ahead.

[screenshots]

@NaurisSadovskis
Contributor

hey folks, we're also seeing this on the latest Argo Rollouts version (the unreleased 1.7.x).

In our case, we have a process which annotates (and labels) Rollout objects on each application deploy and we suspect the issue happens when:

  1. The underlying Deployment is modified (triggering a new rollout) and a new ReplicaSet is created with the same labels/annotations as the original Rollout (foo: bar)
  2. The Rollout progresses
  3. The original Rollout object is modified with a new label (foo: foobar) and the Deployment is modified (again, triggering a new rollout)
  4. This triggers the errors we've seen above.

In some cases, the Argo Rollouts controller seems to lock up and stop reporting any data for that particular Rollout (usually one out of the 15 we're running at a time), and the solution is to restart the controller to force it to reconcile the Rollout to the latest version it should be at.

@zachaller
Collaborator

@NaurisSadovskis Did you also see the issue on 1.5? The logs would be different and would not contain the error, but could you check whether rollouts got stuck there as well?

@int128

int128 commented Dec 25, 2023

I updated the controller to v1.6.4 and this problem occurs again. As a workaround, we run a CronJob to restart the controller every day.

@zachaller
Collaborator

@NaurisSadovskis would you be able to test a version with this patch: #3272

@NaurisSadovskis
Contributor

@zachaller Updated, and the problem persists. More specifically, the controller is active, but it gets stuck on rolling out the new ReplicaSet. @int128's solution of restarting the controller fixes it again.

@int128 int128 mentioned this issue Feb 2, 2024
@ajhodgson

ajhodgson commented Mar 4, 2024

Just experienced this on 1.6.6. It definitely seems related to HPA; the ReplicaSet was being scaled up at the time.

Edit: Argo CD 2.10.2, on EKS 1.28. Restarting the controller fixed it.

@mclarke47
Contributor

Does it make sense to reopen this?

@dodwmd

dodwmd commented Mar 14, 2024

We're also seeing this issue, using the latest release.

@Danny5487401

Danny5487401 commented Mar 18, 2024

Rollouts v1.6.6 still has the same issue:
[screenshot]

Later it times out progressing.

@marcusio888

marcusio888 commented Mar 18, 2024

Good morning. This issue should be reopened, since it continues to happen with the new version of Argo Rollouts, v1.6.6. In our case, from time to time we have to restart Argo Rollouts for it to be fixed. Logs attached.
[Screenshot 2024-03-18 at 09:52:32]
This error happens both with HPA and without HPA.

@bpoland
Contributor

bpoland commented Mar 26, 2024

@zachaller can we reopen this issue? We are also continuing to hit it

@omer2500

We are facing the same issue on v1.6.6
