
Rollout failing with msg "the object has been modified; please apply your changes to the latest version" #3080

Closed
pdeva opened this issue Oct 6, 2023 · 41 comments · Fixed by #3200

@pdeva

pdeva commented Oct 6, 2023

Checklist:

  • [x] I've included steps to reproduce the bug.
  • [x] I've included the version of Argo Rollouts.

Describe the bug
Updates to services managed by Argo Rollouts suddenly started failing with this message for no apparent reason. The only change we made was updating the image tag of the Rollout.

To Reproduce
It fails and gets into this state when multiple Rollout image tags are updated at once. If we then retry the rollouts one service at a time, each service succeeds.

Expected behavior
The Rollout should succeed. It has no reason to fail, since the only thing that changed is the image tag.

Screenshots
[Screenshot 2023-10-05 at 8:01:19 PM]

[Screenshot 2023-10-05 at 8:05:02 PM]

Version
1.6.0

Logs

roCtx.reconcile err Operation cannot be fulfilled on replicasets.apps "pg-query-65bc4849f5": the object has been modified; please apply your changes to the latest version and try again


@pdeva pdeva added the bug Something isn't working label Oct 6, 2023
@bpoland
Contributor

bpoland commented Nov 8, 2023

Hi @zachaller, I am still seeing this "the object has been modified; please apply your changes to the latest version" error with ReplicaSets using 1.6.2, which includes the fix from #3091 :(

@bpoland bpoland mentioned this issue Nov 8, 2023
@zachaller
Collaborator

@bpoland It's a relatively normal log; Rollouts will retry the update. Is there an issue you think this might be causing?

@bpoland
Contributor

bpoland commented Nov 8, 2023

@bpoland It's a relatively normal log; Rollouts will retry the update. Is there an issue you think this might be causing?

Yeah our rollout got stuck and I saw this message over and over. The behaviour we saw was:

  1. The rollout was in a healthy state (or so it said)
  2. We updated the image on the Rollout (we are using workload referencing, in case that matters; see the sketch after this list)
  3. Rollouts created a new ReplicaSet with the updated image, but the new RS had 0 replicas (even though our first step is setWeight: 1) and nothing seemed to be happening
  4. I manually promoted to the next step which is a pause and then to the next step setWeight: 2 -- still the new RS was stuck at 0 replicas (Rollout was marked as "Progressing")
  5. I tried manually scaling up the new RS to 1 replica. The new pod started but the steps did not progress
  6. I checked rollout controller logs and saw it complaining about an RS for an old revision. I manually deleted that RS and then the rollouts controller immediately picked up with the next step and the issue was resolved
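
Since this setup (workload referencing plus a setWeight: 1 first step) comes up a few times in this thread, here is a minimal sketch of what such a spec roughly looks like; all names below are hypothetical placeholders, not taken from this issue:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: example-rollout              # hypothetical name
spec:
  replicas: 5
  selector:
    matchLabels:
      app: example-app               # hypothetical label
  workloadRef:                       # workload referencing: pod template comes from a Deployment
    apiVersion: apps/v1
    kind: Deployment
    name: example-deployment         # hypothetical name
  strategy:
    canary:
      steps:
        - setWeight: 1               # first step, as described above
        - pause: {}                  # manual promotion point
        - setWeight: 2

With a spec like this, bumping the image tag on the referenced Deployment is what creates the new ReplicaSet mentioned in step 3.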

@mclarke47
Contributor

mclarke47 commented Nov 9, 2023

I'm also seeing a LOT of these in the controller logs with a very similar setup to @bpoland, and we're also seeing rollouts getting stuck unless we manually retry the failures.

@zachaller
Collaborator

zachaller commented Nov 10, 2023

Do you have any type of policy agent that modifies the ReplicaSets, or possibly the pod spec in the ReplicaSet? I have not experienced this and have never been able to reproduce it. In most of the cases I have seen, people had some other controller fighting with the rollouts controller over the ReplicaSet. That's not to say there isn't some issue within the rollouts controller; I just need to be able to reproduce it.
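
For anyone checking their cluster for this: a policy agent of the kind being asked about is usually registered as a mutating admission webhook that matches ReplicaSets or Pods. A rough, hypothetical sketch of what such a registration looks like, with all names invented for illustration:

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: example-policy-agent            # hypothetical
webhooks:
  - name: mutate.policy.example.com     # hypothetical
    admissionReviewVersions: ["v1"]
    sideEffects: None
    clientConfig:
      service:
        name: policy-agent              # hypothetical
        namespace: policy-system        # hypothetical
        path: /mutate
    rules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["replicasets"]

Listing the cluster's MutatingWebhookConfigurations is a quick way to see whether anything like this touches ReplicaSets or Pods.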

@zachaller
Collaborator

@pdeva Are you able to reliably reproduce this? Also, your image shows a bunch of issues with the VirtualService, but you also show a log line about the ReplicaSet, so it could be something else also modifying the VirtualService.

@mclarke47
Contributor

mclarke47 commented Nov 13, 2023

@zachaller We have policy agents, but it seemed to work fine on v1.3.0, which we just upgraded from.

I managed to find a rollout stuck in progress because it seemed like it wasn't updating the replica count in the new replica set.

[Screenshots: 2023-11-13 at 8:31 AM]

controller logs

As a follow-up, we rolled back to v1.3.0 and everything started working again.

@bpoland
Contributor

bpoland commented Nov 13, 2023

Do you have any type of policy agent that modifies the ReplicaSets, or possibly the pod spec in the ReplicaSet? I have not experienced this and have never been able to reproduce it. In most of the cases I have seen, people had some other controller fighting with the rollouts controller over the ReplicaSet. That's not to say there isn't some issue within the rollouts controller; I just need to be able to reproduce it.

We have a Linkerd injector which adds a container; maybe that is related? Similar to @mclarke47 though, we have not experienced this previously (we're currently trying to upgrade from 1.4.1).

@DanTulovsky

We are also seeing this happen a lot more. Yesterday HPA increased the number of replicas, but the Rollout did not bring up more pods. The Rollout object itself had the correct number set; it's just that the new pods weren't coming up. Killing the Argo Rollouts controller always fixes these stuck cases.

It's definitely happening a lot more with the 1.6 version than before.

@DanTulovsky

Question: would something like HPA modifying the number of replicas count as something that modifies the ReplicaSet and might cause this issue?
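
For reference: when HPA is combined with Argo Rollouts, the HPA is normally pointed at the Rollout itself via scaleTargetRef (the Rollout exposes the scale subresource), and the controller then resizes the ReplicaSets. A minimal, hypothetical sketch with placeholder names:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-rollout-hpa          # hypothetical
spec:
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout                    # HPA scales the Rollout, not the ReplicaSet directly
    name: example-rollout            # hypothetical
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80

So the HPA writes to the Rollout's replica count, and it is the rollouts controller that then updates the ReplicaSets, which is where these conflict errors are reported.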

@DanTulovsky

Here's an example. We started seeing these messages at 2023-11-14T22:10:00Z:

time="2023-11-14T22:10:00Z" level=error msg="roCtx.reconcile err Operation cannot be fulfilled on replicasets.apps \"public-collector-saas-56577cf9c\": the object has been modified; please apply your changes to the latest version and try again" generation=670 namespace=public resourceVersion=432345210 rollout=public-collector-saas

time="2023-11-15T00:44:36Z" level=error msg="rollout syncHandler error: Operation cannot be fulfilled on replicasets.apps \"public-collector-saas-56577cf9c\": the object has been modified; please apply your changes to the latest version and try again" namespace=public rollout=public-collector-saas

And they continued; this is the last one, at 2023-11-15T00:44:37Z:

time="2023-11-15T00:44:37Z" level=error msg="Operation cannot be fulfilled on replicasets.apps \"public-collector-saas-56577cf9c\": the object has been modified; please apply your changes to the latest version and try again\n" error="<nil>"

That's two hours. And it only started working again when we killed the Argo controller pod. Would it be possible to include in the message what changed? Perhaps that would lead to some clue as to why this is happening.

This is the first message (referencing the same replicaset) after the controller restarted:

time="2023-11-15T00:44:38Z" level=info msg="Enqueueing parent of public/public-collector-saas-56577cf9c-15: Rollout public/public-collector-saas"

Can you please explain a bit more about these conflicts that cause the "the object has been modified" errors? What is a common cause? How is the controller meant to deal with them? Presumably nothing was modifying this ReplicaSet for 2 hours straight... is the idea that the controller modifies it, and then something else also modifies it (maybe reverting something), and that's what the controller notices?

This is now happening to us daily, so if there is anything we can do to help figure this out, please let us know. We are on 1.6.2.

Thank you
Dan

@DanTulovsky

By the way, we also run Gatekeeper, but it only has one mutating webhook, which has to do with HPA. This is what it looks like:

apiVersion: mutations.gatekeeper.sh/v1beta1
kind: Assign
metadata:
  name: hpa-assign-default-scale-down
spec:
  applyTo:
    - groups: ["autoscaling"]
      kinds: ["HorizontalPodAutoscaler"]
      versions: ["v2beta2"]
  match:
    scope: Namespaced
    kinds:
      - apiGroups: ["*"]
        kinds: ["HorizontalPodAutoscaler"]
  location: "spec.behavior.scaleDown"
  parameters:
    pathTests:
      - subPath: "spec.behavior.scaleDown"
        condition: MustNotExist
    assign:
      value:
        # wait this long for the largest recommendation and then scale down to that
        stabilizationWindowSeconds: 600  # default = 300
        policies:
          # Only take down ${value}% per ${periodSeconds}
          - periodSeconds: 300  # default = 60
            type: Percent
            value: 10

So in theory this shouldn't be touching the replicaset at all. The other webhooks are constraints that have to do with labels and annotations, nothing that would mess with a running pod.

@DanTulovsky

FWIW, this is what the API is showing related to that stuck ReplicaSet:

[screenshot]

@DanTulovsky

Finally, this is when the issue started, the view from the api controller:

[screenshot]

@DanTulovsky

Maybe this is helpful: this is the last successful update by the Argo controller, followed by changes from other components, and then the failed updates from Argo starting:

[screenshot]

@DanTulovsky

DanTulovsky commented Nov 15, 2023

So the last event that happened before the errors started was HPA taking down one replica. That was maybe the trigger and what changed in the ReplicaSet since Argo last saw it, but somehow the controller didn't manage to reconcile it properly.

This is the HPA view:

[screenshot]

It looks like it was trying to increase the number of replicas. I wonder if this is Argo and HPA fighting it out then?

@DanTulovsky

Note that while the HPA shows current and desired replicas = 124, the actual number of replicas was 112. So this is similar to what I saw a couple of days ago, where HPA said "bring up more replicas" and Argo did not.

I assume the "current replicas" comes from the controller (Argo in this case). And I can confirm that I did see the Rollout object have the correct desired number of pods while the number of running pods was smaller.

@zachaller
Collaborator

I just want to comment that I think we are also seeing some issues with one of our clusters in regard to this, so I'm spending some time looking into it.

@zachaller
Collaborator

zachaller commented Nov 28, 2023

Do any of you use notifications within your Rollout specs? Trying to see if there is a correlation with notifications updating the spec.
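
For anyone unsure what "notifications within your Rollout specs" means: they are configured as subscription annotations in the Rollout's metadata, roughly like the hypothetical snippet below (the channel name is a placeholder):

metadata:
  annotations:
    notifications.argoproj.io/subscribe.on-rollout-completed.slack: example-channel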

@DanTulovsky

DanTulovsky commented Nov 28, 2023 via email

@bpoland
Contributor

bpoland commented Nov 28, 2023

We don't use notifications currently.

@mclarke47
Contributor

No notification use here

@int128

int128 commented Dec 1, 2023

I have faced similar problems in our cluster.

1. Rollout is stuck while canary update

When a Rollout is updated, both the old and new ReplicaSets are running and then the Rollout is stuck.
I could see the following message in the status of the Rollout:

old replicas are pending termination

Here is a snippet of kubectl get pods. The pods with hash 676f9f555d are from the new ReplicaSet, and 7b7cdd9847 is from the old one.

worker-676f9f555d-25xd9                      1/1     Running     0          3h22m
worker-676f9f555d-dkmlt                      1/1     Running     0          3h22m
worker-7b7cdd9847-d4xl6                      1/1     Running     0          5h8m

I deleted the old ReplicaSet and then the Rollout status became Healthy.

2. Rollout status becomes Degraded even if pods are running

When a Rollout is updated, it goes into a Degraded status even though all the new pods are running.
I could see the following message in the status of the Rollout:

ProgressDeadlineExceeded: ReplicaSet "poller-64d95bc44b" has timed out progressing.

I could refresh the status of the Rollout by restarting the argo-rollouts controller.

@DanTulovsky

Thank you for fixing this (hopefully for good)! What is the ETA for when this might make it into a release?

@zachaller
Collaborator

zachaller commented Dec 5, 2023

Just released it; can you try it out?

It will still log the conflict but rollouts should not get stuck anymore.

@zachaller
Collaborator

I also probably found the root cause of the conflicts; I'm just not sure how to deal with it yet. They also should not cause any issues, because they do get retried and we have had that code for a while now: #3218

@int128

int128 commented Dec 8, 2023

I have updated argo-rollouts to v1.6.3 and this problem seems resolved.
Thank you very much for the quick fix.

@bpoland
Contributor

bpoland commented Dec 8, 2023

We've also upgraded and haven't seen the issue again since. Thanks!

@DanTulovsky

Hi. We are still seeing this in 1.6.4. In this case, a new rollout was triggered and showed up in the UI, but did not start rolling out. I clicked the promote button and then it went ahead.

[screenshots]

@NaurisSadovskis
Contributor

hey folks, we're also seeing this on the latest Argo Rollouts version (the unreleased 1.7.x).

In our case, we have a process which annotates (and labels) Rollout objects on each application deploy and we suspect the issue happens when:

  1. The underlying Deployment is modified (triggering a new rollout) and a new ReplicaSet is created with the same labels/annotations as the original Rollout (foo: bar)
  2. The Rollout progresses
  3. The original Rollout object is modified with a new label (foo: foobar) and the Deployment is modified (again, triggering a new rollout)
  4. This triggers the errors we've seen above.

In some cases, the Argo Rollouts controller seems to lock up and stop reporting any data for that particular Rollout (usually one out of the 15 we're running at a time), and the solution is to restart the controller to force it to reconcile the Rollout to the latest version it should be at.

@zachaller
Collaborator

@NaurisSadovskis Did you also see the issue on 1.5? The logs would be different and would not contain the error, but could you check whether rollouts got stuck there as well?

@int128

int128 commented Dec 25, 2023

I updated the controller to v1.6.4 and this problem occurs again. As a workaround, we run a CronJob to restart the controller every day.

@zachaller
Collaborator

@NaurisSadovskis would you be able to test a version with this patch: #3272

@NaurisSadovskis
Contributor

@zachaller Updated, and the problem persists. More specifically, the controller is active, but it gets stuck on rolling out the new ReplicaSet. @int128's solution of restarting the controller fixes it again.

@int128 int128 mentioned this issue Feb 2, 2024
@ajhodgson

ajhodgson commented Mar 4, 2024

Just experienced this on 1.6.6. It definitely seems related to HPA; the ReplicaSet was being scaled up at the time.

Edit: Argo CD 2.10.2, on EKS 1.28. Restarting the controller fixed it.

@mclarke47
Contributor

Does it make sense to reopen this?

@dodwmd

dodwmd commented Mar 14, 2024

We're also seeing this issue, using the latest release.

@Danny5487401

Danny5487401 commented Mar 18, 2024

Rollouts v1.6.6 still has the same issue:
[screenshot]

Later it times out progressing.

@marcusio888

marcusio888 commented Mar 18, 2024

Good morning. This issue should be reopened, since it continues to happen with the new version of Argo Rollouts, v1.6.6. In our case, from time to time we have to restart Argo Rollouts for it to be fixed. Logs attached.
[Screenshot 2024-03-18 at 09:52:32]
This error happens both with HPA and without HPA.

@bpoland
Contributor

bpoland commented Mar 26, 2024

@zachaller can we reopen this issue? We are also continuing to hit it

@omer2500

We are facing the same issue on v1.6.6
