crossplane core fails to start up: "cannot apply crd" #4400
Thanks for reporting this @dee0sap - at first glance, I imagine this is a byproduct of the large jump from 1.7 to 1.12, where the API versions of this CRD changed. Can you share the current state of the CRD here?
I'm getting this too when upgrading from 1.12 to 1.13.
@danports do you happen to have any existing CompositionRevision instances? One way to check would be something like the sketch below.
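A minimal sketch of such a check, assuming the standard Crossplane CRD name:

```shell
# List all CompositionRevisions; a non-empty result means objects were
# persisted under whatever storage version the CRD had at the time.
kubectl get compositionrevisions.apiextensions.crossplane.io
```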
How can I check that?
You know who might know exactly what to look at here off the top of their head? @sttts 🤓
FWIW, just for sanity I installed v1.12.2, created some Compositions (and therefore CompositionRevisions), then upgraded to v1.13. That fairly vanilla upgrade scenario worked OK at least, just as a datapoint for the scope of this investigation 🕵️
Yeah, the cluster I tried upgrading today has been running Crossplane forever, so it gets to be the canary for issues like these... 😅
Thanks guys. Another datapoint from our side. Background:
In the case of our augmented Crossplane image, the tests that run make sure services come up and that we can provision a variety of things. They didn't detect a problem. So I am really thinking this has to do with the 'old' k8s versions having been upgraded.
Just attaching a survey of our 8 'long lived' dev deployments. Info was collected with a command like the one sketched below. We also have canary and prod deployments, but we haven't rolled out to those for obvious reasons. Also, as mentioned, our CI pipelines are always doing upgrades, but they didn't have a problem. The two deployments showing a problem are the oldest. I am going to try and collect some history so we can better understand how they are different from the others.
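A sketch of the kind of survey command that would collect this, assuming kubectl access to each cluster:

```shell
# Which Crossplane image is deployed, and which CompositionRevision
# versions are served vs. have ever been stored by the API server.
kubectl -n crossplane-system get deployment crossplane \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
kubectl get crd compositionrevisions.apiextensions.crossplane.io \
  -o jsonpath='served: {.spec.versions[*].name}{"\n"}stored: {.status.storedVersions}{"\n"}'
```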
Another survey; the command was like the one above. Interesting that it is the 1st and 3rd oldest Crossplane deployments that are showing a problem. Will dig further to see what is different about the 2nd oldest.
I tried the upgrade path 1.7.0 -> 1.12.2 with a CompositionRevision existing, on both k8s 1.25.2 and 1.27.3, but couldn't reproduce. Trying to focus on the error message here: it says the XP init container tried to patch the CRD for compositionrevisions.apiextensions.crossplane.io and was rejected because a version still listed in status.storedVersions was missing from spec.versions.
Wondering if something changed during this re-packaging? It is odd that it is happening only on some clusters and not all, though. Another question I have is whether any downgrades happened on those problematic ones?
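A reproduction sketch along those lines; the chart repository, the chart's `args` value, and the alpha feature-flag name are assumptions, not taken from the thread:

```shell
# Install Crossplane v1.7.0 with composition revisions enabled, make sure
# at least one CompositionRevision exists, then upgrade in place.
helm repo add crossplane-stable https://charts.crossplane.io/stable
helm install crossplane crossplane-stable/crossplane \
  --namespace crossplane-system --create-namespace --version 1.7.0 \
  --set 'args={--enable-composition-revisions}'  # flag name assumed for 1.7
# ...apply an XRD and a Composition here so a CompositionRevision is persisted...
helm upgrade crossplane crossplane-stable/crossplane \
  --namespace crossplane-system --version 1.12.2
```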
Don't know why this is happening. But the message says that the storage version of the XRD/CRD is v1, yet for some reason the actor tries to remove v1 from status.storedVersions.
Could it be the case that the above error message is being emitted from a pod with the old XP version (v1.7.0) after the new CRDs from v1.12.2 are applied? It would make more sense in that case.
Correction to what I wrote: the message means that v1 is still listed in status.storedVersions while the CRD being applied no longer has it in spec.versions.
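One way to surface the same rejection safely is a server-side dry-run of the incoming CRD manifest (file path assumed):

```shell
# Server-side dry-run runs full validation, so applying a manifest that
# drops a version still listed in status.storedVersions fails with
# '... must appear in spec.versions' without touching the live CRD.
kubectl apply --dry-run=server \
  -f cluster/crds/apiextensions.crossplane.io_compositionrevisions.yaml
```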
I can reproduce the issue with my test: https://github.com/haarchri/crossplane-issue-4400
Full log output and a dump of the CRD are in that repo as well.
I believe that everyone is affected if they installed Crossplane before version 1.11.0. In version 1.11.0, we introduced CompositionRevisions v1beta1 (https://github.com/crossplane/crossplane/releases/tag/v1.11.0) and changed the storage version (// +kubebuilder:storageversion) to it.
This has been validated with version v1.13.0; however, it introduces a compatibility issue with older installations; see https://github.com/haarchri/crossplane-issue-4400/blob/main/working-start-v1.11.0/setup.sh
Very doubtful that the repackaging is relevant. If that were the case then I would expect widespread problems. Still, I am attaching the Dockerfile in the spirit of full disclosure. The problematic systems didn't undergo any downgrades. FWIW, our CD infrastructure has checks in place to prevent downgrades, so it is not the case that someone could have just casually performed a downgrade without folks noticing.
Hey @sttts, @haarchri and everyone who has been contributing... I've read the comments but I am not sure what the way forward for my team is. Should we just wait for the fix from @haarchri? Or is that not going to work for us because these deploys began with pre-1.11.0 versions? Also, was it a timing issue? If yes, can we just delete the old deployment and then redeploy, and the problem will be resolved?
Thanks!
Given @haarchri's explanation and @turkenh's reproduction attempt above:
Do we understand why @turkenh wasn't able to reproduce, if we expect everyone who installed Crossplane before v1.11.0 to be affected? 🤔
Do we fully understand the root cause now? Is the race between old and new pods the prevailing theory? If not, can someone help me by succinctly explaining the root cause? Thank you very much! 🙇♂️
@jbw976 as I wrote: I think @turkenh tested v1.7.0 -> v1.12.2, which works - the error happens if you upgrade that to v1.13.0... In version v1.11.0 we introduced CompositionRevisions v1beta1 (https://github.com/crossplane/crossplane/releases/tag/v1.11.0) and changed the storage version to it (// +kubebuilder:storageversion, 48513d3#diff-702c83c3825e1d51189eca1a358b71f01ec8798cf91588f41cc029bdf09b80ccL616); before that we had v1alpha1. In v1.13.0 we decided to drop CompositionRevision v1alpha1 (94a5030), but we missed migrating status.storedVersions when removing v1alpha1.
When a Crossplane installation was initially done with v1.11.0, the upgrade to v1.13.0 is possible because the CompositionRevision storage version is v1beta1; all Crossplane versions before that use storage version v1alpha1.
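For context, the generic manual remediation for a stale storedVersions entry looks like the sketch below; this is not Crossplane's official procedure, and it should only be run after confirming no objects are still stored as v1alpha1:

```shell
# 1. See what the API server has ever persisted for this CRD.
kubectl get crd compositionrevisions.apiextensions.crossplane.io \
  -o jsonpath='{.status.storedVersions}{"\n"}'
# 2. Drop v1alpha1 from storedVersions once it is safe to do so.
#    (--subresource requires kubectl >= 1.24)
kubectl patch crd compositionrevisions.apiextensions.crossplane.io \
  --subresource=status --type=merge \
  -p '{"status":{"storedVersions":["v1beta1"]}}'
```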
FYI: just cut v1.13.1 with the fix.
So we're still in the process of rolling out 1.13.1 to the affected clusters. Some unrelated issues were introduced last week and they slowed us down. That said, an upper manager asked me to explain the root cause.
That request has caused me to read more carefully through this ticket, and I am now questioning how valid the diagnosis was for our clusters. In particular, we hit a problem upgrading from 1.7.0 to 1.12.2, but the comment above speaks of a problem upgrading to 1.13.0, which we didn't do. Also, the oldest and third oldest of our 8 dev clusters had a problem, but the second oldest did not. This makes me think that the race condition suggested by @turkenh might make more sense. Thoughts?
@dee0sap, you are right; we mainly fixed the issue reported by @danports above, which we managed to reproduce. Your issue was about v1 missing, and it's probably coming from an older pod trying to restore the old CRD, as @turkenh suggested. It would be really useful if you could check which version of Crossplane emitted that message by checking the source of the log you shared.
Thanks for the reply @phisco |
If you could check the source of the log, we would be able to understand whether it was an actual issue in the first place, because if the log came from an older version of XP that was trying to install the old CRD, the fact that it was blocked would actually be a good thing.
Yeah, I am trying to collect the log data from Grafana now.
Oh, a question... There is an init container and a normal container. Would just one or the other or both be a factor here?
Can you post the output of `kubectl get pods -n crossplane-system` and `kubectl describe deployment crossplane -n crossplane-system`?
Ok... good news, I think. Not 100% sure why both init containers would be running at the same time, but I can guess. What do you guys think about this hypothesis? Could this actually explain what we have observed? (Btw @haarchri, I won't be able to run those kubectl commands. However, I believe the above probably gives you what you're looking for.)
@dee0sap only init containers should be applying CRDs, so it could be due to an old pod being restarted.
Too many unknowns to be sure whether that's the actual root cause, but it feels plausible.
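One way to reduce the unknowns is to map each pod to the images its init containers run; an older image would identify the restarted pod (a sketch):

```shell
# Print each pod alongside its init container images; an old Crossplane
# image here points at the pod that emitted the rejected CRD apply.
kubectl get pods -n crossplane-system \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.initContainers[*].image}{"\n"}{end}'
```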
Wow @dee0sap, that process sounds like the perfect stress test for your control plane and all of your deployments... 😅 I can confirm that 1.13.1 fixed the issue I had upgrading from 1.12. 🎉
Hey @sttts @jbw976 @phisco @turkenh @haarchri - I spent most of the day poking around in Grafana looking at metrics and logs for one of the problematic clusters. I think I have a theory about what happened.
The problem with this theory is that @turkenh said he upgraded successfully from 1.7.0 to 1.12.2. @turkenh could it be the case that when you tested you didn't have any problematic objects in your k8s server? That first 'v1alpha1' message is actually about the Lock CRD. Maybe you didn't have a lock? Also, @haarchri, will 1.13.1 fix the upgrade of the Lock CRD as well? Or do we need a 1.13.2?
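The same storedVersions inspection works for the Lock CRD (CRD name taken from the Crossplane pkg API group):

```shell
# Check whether v1alpha1 is still recorded as a stored version for Lock.
kubectl get crd locks.pkg.crossplane.io \
  -o jsonpath='served: {.spec.versions[*].name}{"\n"}stored: {.status.storedVersions}{"\n"}'
```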
No, it won't fix the Locks issue. Can you open a dedicated issue for that?
Just a note: we dropped v1alpha1 for Lock as well.
Just created #4442 |
What happened?
After upgrading from 1.7.0 to 1.12.2, the Crossplane core doesn't start up. We see error messages like the "cannot apply crd" one in the title.
How can we reproduce it?
TBD. We have at least 1 deploy where the problem didn't happen and 1 where it did.
They both went through the same upgrade of Crossplane. However, the one that is showing the problem is at least a couple of years old and has gone through upgrades of k8s, while the other was created within the last month.
NOTE: We saw a similar situation with a service-catalog CRD in May of last year. What I am saying is: while the symptoms were very different back then, the determining factor for which deployments did or did not show the symptoms was whether k8s itself was an 'old' deploy that had been upgraded or a 'young' deploy that had not.
What environment did it happen in?
Crossplane version: 1.12.2
k8s: 1.25.10 (will try to get a history of k8s upgrades)
k8s distro: AWS (a teammate is checking how widespread this problem is; it could affect others as well)
os: Unavailable at this time
kernel: Unavailable at this time