Lack of context for failing reconciliation #498
Open
tun0 opened this issue Mar 20, 2023 · 28 comments
@tun0 commented Mar 20, 2023

{
  "level": "error",
  "ts": "2023-03-20T14:08:31.180Z",
  "msg": "Reconciler error",
  "controller": "imageupdateautomation",
  "controllerGroup": "image.toolkit.fluxcd.io",
  "controllerKind": "ImageUpdateAutomation",
  "ImageUpdateAutomation": {
    "name": "apps",
    "namespace": "flux-system"
  },
  "namespace": "flux-system",
  "name": "apps",
  "reconcileID": "081a105f-7672-4fc7-b532-26be91972eeb",
  "error": "object not found"
}

This doesn't provide enough context to determine what is actually going wrong here.

@makkes (Member) commented Mar 20, 2023

Take a look at the events of that object using flux events or kubectl describe.

@tun0 (Author) commented Mar 20, 2023

I have a pretty good idea what the underlying issue is here, but it would still be nice to see more info in the error itself.

Current idea: we have one git repo shared by two clusters, both with webhooks for image updates. It's quite likely a race condition between the two Flux instances (one in each cluster): by the time one cluster tries to push, its HEAD is no longer up to date because the branch was already altered by the other cluster. We saw the same with Flux v1, but we didn't care about its logs as much as we do with v2 😉

Assuming the above is correct, it'd be nice for the controller to retry (basically fetch and rebase?) instead of failing; a rough sketch of that idea follows below. On the other hand, we should probably invest some time in proper monitoring & alerting instead of just dumping everything to Slack directly, since Kubernetes is largely built around "it's okay to fail sometimes", being (primarily) stateless and all that.
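
(For illustration only, a minimal sketch of what such a fetch-and-retry could look like with go-git, which Flux uses under the hood. This is not the controller's actual code: the applyImageUpdates helper, the hard-coded "main" branch, and matching on the error message are all assumptions, since go-git builds the "non-fast-forward update" error dynamically rather than exporting a sentinel.)

// Sketch: retry a rejected push by resetting onto the freshly fetched remote
// head and re-applying the image updates, instead of failing outright.
package main

import (
	"errors"
	"strings"

	git "github.com/go-git/go-git/v5"
	"github.com/go-git/go-git/v5/plumbing"
)

// applyImageUpdates is a hypothetical helper that would rewrite image tags in
// the worktree and commit the result.
func pushWithRetry(repo *git.Repository, applyImageUpdates func(*git.Worktree) error, attempts int) error {
	for i := 0; i < attempts; i++ {
		err := repo.Push(&git.PushOptions{RemoteName: "origin"})
		if err == nil || errors.Is(err, git.NoErrAlreadyUpToDate) {
			return nil // pushed, or nothing left to push
		}
		if !strings.Contains(err.Error(), "non-fast-forward update") {
			return err // not the race-condition case; give up
		}
		// Another cluster moved the branch: fetch the new head, reset onto it,
		// and redo the image updates before trying again.
		if ferr := repo.Fetch(&git.FetchOptions{RemoteName: "origin"}); ferr != nil && !errors.Is(ferr, git.NoErrAlreadyUpToDate) {
			return ferr
		}
		remoteRef, rerr := repo.Reference(plumbing.NewRemoteReferenceName("origin", "main"), true)
		if rerr != nil {
			return rerr
		}
		wt, werr := repo.Worktree()
		if werr != nil {
			return werr
		}
		if resetErr := wt.Reset(&git.ResetOptions{Commit: remoteRef.Hash(), Mode: git.HardReset}); resetErr != nil {
			return resetErr
		}
		if aerr := applyImageUpdates(wt); aerr != nil {
			return aerr
		}
	}
	return errors.New("push still rejected after retries")
}

func main() {
	repo, err := git.PlainOpen(".")
	if err != nil {
		panic(err)
	}
	// No-op updater for demonstration; a real one would rewrite tags and commit.
	noop := func(*git.Worktree) error { return nil }
	if err := pushWithRetry(repo, noop, 3); err != nil {
		panic(err)
	}
}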

@makkes (Member) commented Mar 20, 2023

Oh, I see. I agree that the error should contain more info about the root cause.

@mantasaudickas commented Apr 1, 2023

I have the same error... also no context, and I have no idea what's going on here :)
Everything seems to be deployed, updated, etc.
flux events does not show anything failing, and I have no idea what I could describe with kubectl.
Describing the image-automation-controller does not show any new events at the time the error is logged.
The issue started with the upgrade to version 0.41.1.

@kingdonb (Member) commented Apr 5, 2023

There are a couple of reports of this type of failure (or potentially unrelated failures, e.g. git error code 128) showing up in the Slack channel. I haven't seen them filter down to reports for IAC yet, but it's something to be aware of.

I will load up some Image Update Automation controls today or tomorrow and try to reproduce this issue (one or the other); there is not much context to go on for what is causing the failure. I understand this report is not about one specific failure, but about the general case of failures not being reported clearly, with an obvious link to a specific root cause.

{"level":"error","ts":"2023-04-03T19:27:38.561Z","msg":"Reconciler error","controller":"imageupdateautomation","controllerGroup":"image.toolkit.fluxcd.io","controllerKind":"ImageUpdateAutomation","ImageUpdateAutomation":{"name":"flux-system","namespace":"flux-system"},"namespace":"flux-system","name":"flux-system","reconcileID":"a43e903f-de19-4eaa-a7cd-e64a804d77fa","error":"malformed unpack status: \u00010069\u0001001dunpack index-pack failed\n0043ng refs/heads/main error processing packfiles: exit status 128\n0000"}

This is another example of that. This is the error returned from Git, and I'm not sure how much helpful parsing we can do. But to refocus, the subject of this report is making it clearer what has gone wrong when IAC fails. Maybe we can come up with some common failure scenarios and start classifying errors to raise them as conditions, based on pattern matching; a sketch of that idea follows.
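
(As an illustration of the classification idea, here is a rough sketch in Go. The categories, hint texts, and the classifyGitError helper are assumptions for discussion, not the controller's actual implementation.)

// Sketch: map known git failure messages to a reason and hint that could be
// surfaced as a condition on the ImageUpdateAutomation object.
package main

import (
	"errors"
	"fmt"
	"strings"
)

func classifyGitError(err error) (reason, hint string) {
	msg := err.Error()
	switch {
	case strings.Contains(msg, "object not found"):
		return "GitObjectNotFound", "a referenced commit or ref is missing; the branch may have been moved or rewritten by another writer"
	case strings.Contains(msg, "non-fast-forward update"):
		return "GitPushRejected", "the remote branch advanced since cloning; the push should succeed on the next reconciliation"
	case strings.Contains(msg, "index-pack failed") || strings.Contains(msg, "malformed unpack status"):
		return "GitServerError", "the git server failed while unpacking the pushed objects (often transient or provider-side)"
	default:
		return "ReconciliationFailed", msg
	}
}

func main() {
	reason, hint := classifyGitError(errors.New("object not found"))
	fmt.Printf("%s: %s\n", reason, hint)
}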

@mantasaudickas commented Sep 15, 2023

We have migrated our repository to another provider (from bitbucket.org to GitLab), and these errors seem to be gone.
Nothing else changed: cluster and Flux versions remained the same. All I did was flux uninstall and flux bootstrap with the new git repository URL.
So it seems bitbucket.org has some peculiarity that produces this error?
For completeness: I also tried uninstalling and reinstalling against bitbucket.org, but that did not help.

@PaulSxxxs

I'm having this same issue and went down the route of changing gitImplementation to use libgit2, but we're using source.toolkit.fluxcd.io/v1beta2, where this has been deprecated (https://github.com/fluxcd/source-controller/blob/main/docs/spec/v1beta1/gitrepositories.md#git-implementation).

Furthermore, v1beta2 recommends setting --feature-gates=OptimizedGitClones=false, which I don't know how to achieve. Any tips on how to set this? (https://github.com/fluxcd/source-controller/blob/main/docs/spec/v1beta2/gitrepositories.md#optimized-git-clones)

@mantasaudickas, we're also using Bitbucket and have the scenario of multiple clusters using the same repo. Are you having good results so far?

@mantasaudickas

@mantasaudickas, we're also using Bitbucket and have the scenario of multiple clusters using the same repo. Are you having good results so far?

I have switched one project to GitLab and another to GitHub (2 independent clients). So far the "object not found" error message is gone in both of them. I did not try the options you mentioned.

The reason for the switch was actually a Bitbucket issue: once FluxCD makes a push, it's not possible to fetch that last push anymore (it is visible in the UI, but not fetchable to local copies and not visible in the git command-line history)... I don't know whether it's a Flux or a Bitbucket issue, but it was solved by migrating to other providers.

@dewe commented Sep 20, 2023

... once FluxCD makes a push, it's not possible to fetch that last push anymore (it is visible in the UI, but not fetchable to local copies and not visible in the git command-line history)...

We've seen the same behaviour and have raised a support ticket with Bitbucket... still no solution though.

@mantasaudickas

have raised a support ticket with Bitbucket... still no solution though

Same here. Since it was blocking us, we switched the manifest repository to another provider... and are now thinking of switching everything :)

@tobiasjochheimenglund

Also getting the "object not found" error using Flux with Bitbucket. Image automation gets stuck, starting with an image-automation-controller error on refs/heads/master: failed to update ref. Manually pushing to the same repo seems to be a temporary fix.

@PaulSxxxs

@tobiasjochheimenglund We're seeing the same behaviour; the image updater is returning non-fast-forward update: refs/heads/master even with --feature-gates=OptimizedGitClones=false set, but images are being updated, so it does seem to be processing. I'm about to test this more thoroughly and will report back. For us too, the error disappears when someone makes a commit, which is thankfully quite frequent.

@dewe @mantasaudickas Do you have any more technical details you sent to Bitbucket to push the problem onto them, if it does seem to be Bitbucket-specific? Bitbucket didn't resolve it for us either, though they were helpful and pointed me towards git shallow clones as potentially causing the issue. My support ticket was less technical and more a query about shallow clones, repo health, and FluxCD.

I'll report back with any findings.

@mantasaudickas

They did not ask for any technical details. All their communication sounded more like: please check this or that, and maybe we can run a GC on your repo... and I have not heard from them since last Friday :)
Not sure how a shallow clone could cause such an issue, but it sounds like just another "we don't know what is happening, and Git is not supposed to keep your history at all" :D

@youest commented Sep 21, 2023

Hello there, we have the same issue. Our configuration is multi-cluster, with different branches on the same Bitbucket repo.
We are encountering this error only on one branch/cluster, not on the others, at least so far.
Is there any configuration we can add or change to get more details for investigating the problem? For example, does it make sense to increase the log level?

@dewe commented Sep 21, 2023

@PaulSxxxs At the same time we get object not found, the Bitbucket pipeline doesn't trigger automatically as expected. When trying to start the pipeline manually, it fails with "we couldn't clone the repository". There's definitely a correlation here. We have reported the pipeline triggering problem, but got no actual response other than "Our engineering team is currently investigating this further".

@PaulSxxxs

We had a similar issue committing from any git client for a time while "object not found" was occurring.
It definitely felt like some sort of lock, but we couldn't figure it out, perhaps because it's a Bitbucket issue.

@PaulSxxxs commented Sep 22, 2023

I received this message from Bitbucket:

Syahrul commented:

G'day, Paul

A quick update on this issue.

Our development team noticed a pattern with the FluxCD issue with Bitbucket cloud. After thorough analysis, it has been determined that the issue is most likely caused by the go-git library being used by FluxCD. This library prematurely closes the connection before a push operation is completed.

To address this matter, we will release a fix tomorrow to mitigate the problem. Once the mitigation process is complete, we will provide you with an update.

We appreciate your patience and encourage you to reach out if you have any additional questions.


Mohammad Syahrul
Support Engineer APAC, Bitbucket cloud

@mantasaudickas

Wondering whether the "object not found" issue will be fixed too, or whether it's related to something else :)

@PaulSxxxs

I specifically spoke to them about "object not found" and gave some technical details... I'm fairly sure it will fix this.

@dewe commented Sep 22, 2023

Can't find any apparently related issue over at go-git... 🤔

@hiddeco (Member) commented Sep 22, 2023

As I happen to be a go-git maintainer as well, we would be really happy to see an issue created in go-git with steps to reproduce (or any details they can share about how they determined that the connection is closed prematurely).

@gregawoods

We use both Flux and Bitbucket and have been absolutely pulling our hair out over this issue. For what it's worth, we found that moving from https:// to ssh:// git URLs seemed to make the behaviour go away. That isn't always practical to do, however, so here's hoping that Bitbucket's fix works out.

@PaulSxxxs

Bitbucket rolled out their fix, and for us everything has been working perfectly again.

@mantasaudickas

Yeah... I reverted my manifests back to Bitbucket as well, so that works, but I am again getting "object not found" messages :)

@tun0 (Author) commented Nov 16, 2023

Given that these object not found errors tend to be followed by a successful reconciliation, it seems there's already some retry mechanism in place. Depending on the details of that retry logic, it might make sense to just "ignore" the object not found error unless it persists after several retries?

@pjbgf (Member) commented Nov 17, 2023

@tun0 The Image Automation Controller will automatically retry on the next reconciliation, so yes, it should be safe to disregard a one-off "object not found" error.

However, if you do find a pattern where you can reliably reproduce the issue, please report it upstream so it can be investigated and fixed.

@mantasaudickas commented Nov 17, 2023

However, if you do find a pattern where you can reliably reproduce the issue

It is still happening with Bitbucket Cloud :)

@hiddeco (Member) commented Nov 17, 2023

Then please report it upstream with more details about any patterns you observe when the error occurs (for example, information about the contents of your repository, its size, etc.).

There is little we can do from within the context of this repository; it really has to be addressed there. Thanks for your cooperation.
