
crossplane core fails to start up : "cannot apply crd" #4400

Closed
Tracked by #4372
dee0sap opened this issue Jul 28, 2023 · 40 comments · Fixed by #4402
Comments

@dee0sap
Contributor

dee0sap commented Jul 28, 2023

What happened?

After upgrading from 1.7.0 to 1.12.2, the Crossplane core doesn't start up. We see error messages like:

crossplane: error: cannot initialize core: cannot apply crd: 
cannot patch object: CustomResourceDefinition.apiextensions.k8s.io 
"compositionrevisions.apiextensions.crossplane.io" is invalid: 
status.storedVersions[1]: Invalid value: "v1": must appear in spec.versions

How can we reproduce it?

TBD. We have at least 1 deploy where the problem didn't happen and 1 where it did.
They both went through the same upgrade of Crossplane. However, the one showing the problem is at least a couple of years old and has gone through k8s upgrades, while the other was created within the last month.

NOTE: We saw a similar situation with the service-catalog CRD in May of last year. While the symptoms were very different back then, the determining factor for which deployments showed them was whether k8s itself was an 'old' deploy that had been upgraded or a 'young' deploy that had not.

What environment did it happen in?

Crossplane version: 1.12.2
k8s: 1.25.10 (will try to get the history of k8s upgrades)
k8s distro: AWS (a teammate is checking how widespread this problem is; it could be on others as well)
os: Unavailable at this time
kernel: Unavailable at this time

@dee0sap dee0sap added the bug Something isn't working label Jul 28, 2023
@jbw976
Member

jbw976 commented Jul 28, 2023

Thanks for reporting this @dee0sap - at first I imagine this is a byproduct of the large jump from 1.7 to 1.12, where the API versions of this compositionrevisions resource changed over time. But you're saying that you don't see this 100% across all control planes that underwent this exact same upgrade from 1.7 to 1.12? 🤔

Can you share here the current state of the storedVersions and versions for this CRD? This is what I see from a fresh (no upgrade) install of v1.13.0 in a kind cluster:

❯ kubectl get crd compositionrevisions.apiextensions.crossplane.io -o json | jq '.status.storedVersions'
[
  "v1"
]
 
❯ kubectl get crd compositionrevisions.apiextensions.crossplane.io -o json | jq '.spec.versions[].name'
"v1"
"v1beta1"

@danports

I'm getting this too when upgrading from 1.12 to 1.13:

crossplane: error: core.initCommand.Run(): cannot initialize core: cannot apply crd: cannot patch object: CustomResourceDefinition.apiextensions.k8s.io "compositionrevisions.apiextensions.crossplane.io" is invalid: status.storedVersions[0]: Invalid value: "v1alpha1": must appear in spec.versions

> kubectl get crd compositionrevisions.apiextensions.crossplane.io -o json | jq '.status.storedVersions'
[
   "v1alpha1",
   "v1beta1",
   "v1"
]

@jbw976
Member

jbw976 commented Jul 28, 2023

@danports do you happen to have any existing instances of compositionRevisions living in the cluster that are still at v1alpha1? e.g. this thread on reddit. That would make sense to me, but @dee0sap's scenario of v1 missing does not make sense to me 🤪

one way to check this would be:

kubectl get compositionrevision  -o json | jq '.items[].apiVersion'

@danports

How can I check that? kubectl get compositionrevisions.apiextensions.crossplane.io -o yaml says everything is v1, but I think that's because the k8s API server automatically converts stored objects to the requested version before returning them to the client.
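
For what it's worth, the only way to see the version an object was actually persisted with is to read etcd directly, which is only possible where you control the control plane (not on managed offerings such as EKS). A rough sketch, with the cert paths and /registry key layout being assumptions:

# Sketch only: run where etcd is reachable (e.g. exec'd into an etcd pod on a
# self-managed control plane); cert paths and key layout may differ per cluster.
ETCDCTL_API=3 etcdctl \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  get /registry/apiextensions.crossplane.io/compositionrevisions --prefix --keys-only

# Custom resources are stored as JSON, so dumping one key shows the apiVersion
# it was actually persisted with.
ETCDCTL_API=3 etcdctl \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  get /registry/apiextensions.crossplane.io/compositionrevisions/<name> | head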

@jbw976
Member

jbw976 commented Jul 28, 2023

you know who might know exactly what to look at here off the top of their head? @sttts 🤓

@jbw976
Member

jbw976 commented Jul 28, 2023

FWIW, just for sanity I installed v1.12.2, created some compositions (and therefore compositionRevisions), then upgraded to v1.13. That fairly vanilla upgrade scenario worked OK at least, just as a datapoint for the scope of this investigation 🕵️

@danports

Yeah, the cluster I tried upgrading today has been running Crossplane forever so it gets to be the canary for issues like these... 😅

@dee0sap
Contributor Author

dee0sap commented Jul 28, 2023

Thanks guys.

Another datapoint from our side.....

Background :

  • We have a CI pipeline that makes a new image based on the crossplane image with some definitely non-conflicting things added. ( It's a corporate compliance thing. )
  • We have a central repo that dictates the k8s version deployments should use and all of the versions of the services that should be in the deployment
  • The home grown testing framework we are using is always spinning up a cluster using the current revision of that central repo, upgrading the cluster with a freshly built version of the service, and then running integration tests.

In the case of our augmented crossplane image, the tests that run make sure services come up and that we can provision a variety of things. This didn't detect a problem.

So... I am really thinking this has to do with the 'old' k8s versions having been upgraded.

@dee0sap
Contributor Author

dee0sap commented Jul 28, 2023

survey-for-crossplane.txt

Just attaching a survey of our 8 'long lived' dev deployments. Info was collected with a command like
for i in kubeconfig-devoidc.yaml ; do echo "====== $i ====="; kubectl --kubeconfig=$i get crd compositionrevisions.apiextensions.crossplane.io -o yaml | yq '.status.storedVersions' -; kubectl --kubeconfig=$i get customresourcedefinition compositionrevisions.apiextensions.crossplane.io -o=yaml | yq '.spec.versions[].name' -; kubectl --kubeconfig=$i get -n mc-provisioner pod; kubectl --kubeconfig=$i version ;done |& tee survey-0.txt

We also have canary and prod deployments, but we haven't rolled out to them for obvious reasons.

Also as mentioned, our CI pipelines are always doing upgrades but they didn't have a problem.

The two deployments showing a problem are the oldest. I am going to try and collect some history so we can better understand how they are different from the others.

@dee0sap
Contributor Author

dee0sap commented Jul 28, 2023

Another survey. Command was like
kubectl get crd compositionrevisions.apiextensions.crossplane.io -o yaml | yq '.metadata.creationTimestamp' -

Interesting that it is the 1st and 3rd oldest crossplane deployments that are showing a problem. Will dig further to see what is different about the 2nd oldest.

survey-for-crossplane2.txt

@turkenh
Member

turkenh commented Jul 28, 2023

I tried the upgrade path 1.7.0 -> 1.12.2 with an existing composition revision, on both k8s 1.25.2 and 1.27.3, but couldn't reproduce.

Focusing on the error message here: it says the XP init container tried to patch the CRD for compositionrevisions.apiextensions.crossplane.io and that version v1 wasn't available in spec.versions, which is obviously not true 😅

cannot patch object: CustomResourceDefinition.apiextensions.k8s.io 
"compositionrevisions.apiextensions.crossplane.io" is invalid: 
status.storedVersions[1]: Invalid value: "v1": must appear in spec.versions

We have a CI pipeline that makes a new image based on the crossplane image with some definitely non-conflicting things added. ( It's a corporate compliance thing. )

Wondering if something changed during this re-packaging? It is odd that it is happening only for some clusters and not all, though.

Another question I have is whether any downgrades happened on those problematic ones?

@sttts
Contributor

sttts commented Jul 28, 2023

Don't know why this is happening. But the message says that the storage version of the XRD/CRD is v1, but for some reason the actor tries to remove v1 from status.storedVersions. That field is used to remember which versions of an object might be in etcd. It is possible to remove non-storage versions from it, but it is only safe if some object migration has been run, e.g. through https://github.com/kubernetes-sigs/kube-storage-version-migrator. If that is ignored, there is a risk of data loss.
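
For reference, triggering such a migration with kube-storage-version-migrator would look roughly like the sketch below. This assumes the migrator and its CRDs are installed in the cluster; the resource names are taken from the CRD discussed in this issue.

# Sketch, assuming kube-storage-version-migrator is installed. It asks the
# migrator to rewrite every stored CompositionRevision at the current storage
# version, after which stale entries can be dropped from status.storedVersions.
cat <<EOF | kubectl apply -f -
apiVersion: migration.k8s.io/v1alpha1
kind: StorageVersionMigration
metadata:
  name: compositionrevisions-storage-migration
spec:
  resource:
    group: apiextensions.crossplane.io
    version: v1
    resource: compositionrevisions
EOF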

@turkenh
Member

turkenh commented Jul 28, 2023

Could it be the case that the above error message is being emitted from a pod with the old XP version (v1.7.0) after the new CRDs from v1.12.2 are applied? It would make more sense in that case.

@sttts
Contributor

sttts commented Jul 28, 2023

Correction to what I wrote: the message means that v1 is neither the storage version, nor is it showing up as a version at all in that CRD. So this might be some race between old and new as @turkenh writes. Old pod trying to reapply old CRD.

@haarchri
Contributor

I can reproduce the issue with my test: https://github.com/haarchri/crossplane-issue-4400

kubectl logs -n crossplane-system crossplane-798547c55c-vz94z -c crossplane-init
{"level":"info","ts":"2023-07-28T07:47:16Z","logger":"crossplane","msg":"Given tls secret is already filled, skipping tls certificate generation","Step":"WebhookCertificateGenerator"}
{"level":"info","ts":"2023-07-28T07:47:16Z","logger":"crossplane","msg":"Step has been completed","Name":"WebhookCertificateGenerator"}
crossplane: error: core.initCommand.Run(): cannot initialize core: cannot apply crd: cannot patch object: CustomResourceDefinition.apiextensions.k8s.io "compositionrevisions.apiextensions.crossplane.io" is invalid: status.storedVersions[0]: Invalid value: "v1alpha1": must appear in spec.versions

full log output:
https://github.com/haarchri/crossplane-issue-4400/blob/main/log.txt#L183

@haarchri
Contributor

haarchri commented Jul 28, 2023

I believe that everyone is affected if they installed Crossplane before version 1.11.0. In version 1.11.0, we introduced composition revisions v1beta1 (https://github.com/crossplane/crossplane/releases/tag/v1.11.0) and changed // +kubebuilder:storageversion (48513d3#diff-702c83c3825e1d51189eca1a358b71f01ec8798cf91588f41cc029bdf09b80ccL616), while before that we had v1alpha1, which we decided to drop (94a5030).

@haarchri
Contributor

This has been validated with version v1.13.0; however, it introduces a compatibility issue with the compositionrevision CRD for users who had previously started Crossplane with versions before v1.11.0:

https://github.com/haarchri/crossplane-issue-4400/blob/main/working-start-v1.11.0/setup.sh

@dee0sap
Contributor Author

dee0sap commented Jul 28, 2023

I tried the upgrade path 1.7.0 -> 1.12.2 with an existing composition revision, on both k8s 1.25.2 and 1.27.3, but couldn't reproduce.

Focusing on the error message here: it says the XP init container tried to patch the CRD for compositionrevisions.apiextensions.crossplane.io and that version v1 wasn't available in spec.versions, which is obviously not true 😅

cannot patch object: CustomResourceDefinition.apiextensions.k8s.io
"compositionrevisions.apiextensions.crossplane.io" is invalid:
status.storedVersions[1]: Invalid value: "v1": must appear in spec.versions

We have a CI pipeline that makes a new image based on the crossplane image with some definitely non-conflicting things added. ( It's a corporate compliance thing. )

Wondering if something changed during this re-packaging? It is odd that it is happening only for some clusters and not all, though.

Another question I have is whether any downgrades happened on those problematic ones?

Very doubtful that the repackaging is relevant. If that were the case then I would expect widespread problems. Still, I am attaching the Dockerfile in the spirit of full disclosure.

The problematic systems didn't undergo any downgrades. Fwiw our CD infrastructure has checks in place to prevent downgrades, so it is not the case that someone could have just casually performed a downgrade without folks noticing.

Dockerfile.zip

@dee0sap
Contributor Author

dee0sap commented Jul 28, 2023

Hey @sttts , @haarchri and everyone who has been contributing...
First, thanks so much for jumping on this.

I've read the comments but I am not sure what the way forward for my team is.

Should we just wait for the fix from @haarchri? Or is that not going to work for us because these deploys began with pre-1.11.0 versions?

Also, was it a timing issue? If so, can we just delete the old deployment and then redeploy, and the problem will be resolved?

@turkenh
Member

turkenh commented Jul 28, 2023

@dee0sap we are planning the v1.13.1 patch with the fix.

I believe all you would need to do is upgrade to that version so that the newly introduced migrator can take care of the problem. @sttts or @phisco can correct me if I am wrong.
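
For reference, assuming Crossplane was installed with the official Helm chart (release name crossplane in the crossplane-system namespace, from the crossplane-stable repo; adjust to your setup), that upgrade would look roughly like:

# Sketch, assuming a Helm-based install with release "crossplane" in the
# "crossplane-system" namespace.
helm repo update
helm upgrade crossplane crossplane-stable/crossplane \
  --namespace crossplane-system \
  --version 1.13.1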

@dee0sap
Contributor Author

dee0sap commented Jul 28, 2023

Thanks!

@jbw976 jbw976 added this to the v1.14 milestone Jul 29, 2023
@jbw976
Member

jbw976 commented Jul 29, 2023

From @haarchri:

I believe that everyone is affected if they installed Crossplane before version 1.11.0.

From @turkenh:

I tried the upgrade path 1.7.0 -> 1.12.2 with a composition revision existing with both on k8s 1.25.2 and 1.27.3, but couldn't reproduce.

Do we understand why @turkenh wasn't able to reproduce, if we expect everyone who installed Crossplane before v1.11.0 to be affected? 🤔

So this might be some race between old and new as @turkenh writes. Old pod trying to reapply old CRD.

Do we fully understand the root cause now? Is this race between old/new pods the prevailing theory? If not, can someone help me understand this well by succinctly explaining the root cause? Thank you very much! 🙇‍♂️

@haarchri
Contributor

haarchri commented Jul 29, 2023

@jbw976 as I wrote:

I think @turkenh tested v1.7.0 -> v1.12.2, which works - the error happens if you upgrade that to v1.13.0...

In version v1.11.0, we introduced compositionrevisions v1beta1 (https://github.com/crossplane/crossplane/releases/tag/v1.11.0) and changed // +kubebuilder:storageversion (48513d3#diff-702c83c3825e1d51189eca1a358b71f01ec8798cf91588f41cc029bdf09b80ccL616), while before that we had v1alpha1; in v1.13.0 we decided to drop compositionrevision v1alpha1 (94a5030).

but we missed the following when removing v1alpha1 (see the command sketch at the end of this comment):

  • list all compositionrevisions
  • apply empty (!) patch against each
  • remove v1alpha1 from status.storedVersions in the CRD on cluster
  • update to the new CRD without v1alpha1 in spec.versions

When the Crossplane installation was initially done with v1.11.0, the upgrade to v1.13.0 is possible because the compositionrevision storage version is v1beta1 - all Crossplane versions before that use storage version v1alpha1.
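
Spelled out as a manual workaround, those missed steps would look roughly like the sketch below. This is not the official migrator that shipped later; resource names come from this thread, and patching the status subresource assumes a reasonably recent kubectl (>= 1.24 for --subresource).

# 1. Rewrite every CompositionRevision so it is re-persisted at the current
#    storage version (reading and replacing it unchanged is enough to convert).
kubectl get compositionrevisions.apiextensions.crossplane.io -o yaml | kubectl replace -f -

# 2. Drop stale entries from status.storedVersions on the CRD
#    (assuming v1 is the current storage version).
kubectl patch crd compositionrevisions.apiextensions.crossplane.io \
  --subresource=status --type=merge \
  -p '{"status":{"storedVersions":["v1"]}}'

# 3. Only then apply the new CRD manifest that no longer lists v1alpha1
#    in spec.versions.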

@phisco
Contributor

phisco commented Jul 31, 2023

FYI: just cut v1.13.1, we are still working on the release notes, but you should already be able to install it. 🙏

@dee0sap
Contributor Author

dee0sap commented Aug 1, 2023

@jbw976 @haarchri

So we're still in the process of rolling out 1.13.1 to the affected clusters. Some unrelated issues were introduced last week and they slowed us down.

That said, an upper manager asked me to

  • Survey our canary and prod clusters in an effort to identify ones we should monitor closely with the next upgrade
  • Figure out how we can put testing in place to make sure we catch this kind of problem by design

That request has caused me to read more carefully through this ticket and I am now questioning how valid the diagnosis was for our clusters. In particular, we hit a problem upgrading from 1.7.0 to 1.12.2, but this comment speaks of a problem upgrading to 1.13.0, which we didn't do.

Also, the oldest and third oldest of our 8 dev clusters had a problem. However the second oldest did not.
And when we started looking into this all 3 showed
.status.storedVersions
v1alpha1
v1
.spec.versions[].name
v1
v1alpha1
v1beta1

This makes me think that the race condition suggested by @turkenh might make more sense.

Thoughts?

@phisco
Contributor

phisco commented Aug 2, 2023

@dee0sap, you are right, we mainly fixed the issue reported by @danports above, which we managed to reproduce. Your issue was about v1 missing, and it is probably coming from an older pod trying to restore the old CRD as @turkenh suggested. It would be really useful if you could check which version of Crossplane emitted that error by checking the source of the log you shared.

@dee0sap
Contributor Author

dee0sap commented Aug 2, 2023

Thanks for the reply @phisco
So.....
Then should this issue still be open? :) I mean, the symptom in the description wasn't actually addressed.

@phisco
Contributor

phisco commented Aug 2, 2023

If you could check the source of the log we would be able to understand whether it was an actual issue in the first place, because if the log came from an older version of xp which was trying to install the old crd, the fact that it was blocked would actually be a good thing.

@dee0sap
Contributor Author

dee0sap commented Aug 2, 2023

Yeah I am trying to collect the log data from grafana now

@dee0sap
Contributor Author

dee0sap commented Aug 2, 2023

Oh, a question...

There is an init container and a normal container. Would just one or the other, or both, be a factor here?
I would expect the 1.12.2 init container to be a factor of course, but what about the 1.7 non-init container?

@haarchri
Contributor

haarchri commented Aug 2, 2023

can you post kubectl get pods -n crossplane-system and kubectl describe deployment crossplane -n crossplane-system ?

@dee0sap
Contributor Author

dee0sap commented Aug 2, 2023

Ok... good news I think.
I managed to get logs and kube_pod_container_info metrics from Grafana. Just looking at the timeframe when the v1 error was occurring, and only at init container logs, I see that a 1.7.0 init container was logging the "v1" must appear error while a 1.12.2 init container was logging the "v1alpha1" must appear error.

Not 100% sure why both init containers would be running at the same time but I can guess...
The cluster autoscaler is set up for these clusters. And our rollouts are goofy. Basically we have a process that makes 1 big yaml with absolutely everything for the cluster in it, and that is passed to kubectl apply. Which means absolutely everything is changing at once. And there is 1 service in particular that has tens (hundreds?) of workloads that can be updated in the process. This means that the cluster autoscaler goes into overdrive adding nodes and replacing pods as it tries to redistribute the workload across the nodes. ( And everything is put into the same node work group btw, just to maximize churn :) )
I could imagine that the autoscaler tried to replace the 1.7.0 pod on node X with a new pod on node Y while at the same time trying to spin up a 1.12.2 pod on node Z. And then hilarity ensued.

What do you guys think about this hypothesis? Could this actually explain what we have observed?

( Btw @haarchri I won't be able to run those kubectl commands. However I believe the above probably gives you what you're looking for. )
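
For anyone else debugging a similar overlap who can run kubectl, a quick way to see which image each crossplane init container is running (a sketch; assumes the default crossplane-system namespace):

# Lists each pod in crossplane-system together with the image(s) of its init
# container(s), to spot old and new versions running side by side.
kubectl get pods -n crossplane-system \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.initContainers[*].image}{"\n"}{end}'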

@phisco
Contributor

phisco commented Aug 2, 2023

@dee0sap only init containers should be applying CRDs, so it could be due to an old pod being restarted.

@phisco
Contributor

phisco commented Aug 2, 2023

Too many unknowns to be sure whether that's the actual root cause, but it feels plausible.

@danports

danports commented Aug 2, 2023

Wow @dee0sap that process sounds like the perfect stress test for your control plane and all of your deployments... 😅

I can confirm that 1.13.1 fixed the issue I had upgrading from 1.12. 🎉

@dee0sap
Contributor Author

dee0sap commented Aug 3, 2023

Hey @sttts @jbw976 @phisco @turkenh @haarchri

Spent most of the day poking around in Grafana looking at metrics and logs for one of the problematic clusters. I think this is what happened:

  • Something about v1alpha1 was removed between 1.7.0 and 1.12.2
  • We rolled out 1.12.2 at approximately 2023-07-26 02:29:00 UTC
  • The init container went into a crash loop complaining about v1alpha1
  • Because of the crash loop, k8s left the 1.7.0 pod running
  • At approximately 2023-07-27 00:47:37 UTC the cluster autoscaler eliminated the node hosting 1.7.0 as part of a rebalance.
  • As part of the above, the 1.7.0 pod on the node being dropped was replaced with a 1.7.0 pod on another node.
  • The init container ran for the new 1.7.0 pod and entered a crash loop complaining about v1

The problem with this theory is that @turkenh said he upgraded successfully from 1.7.0 to 1.12.2.

@turkenh Could it be the case that when you tested, you didn't have any problematic objects in your k8s cluster? That first 'v1alpha1' message is actually:
crossplane: error: core.initCommand.Run(): cannot initialize core: cannot apply crd: cannot patch object: CustomResourceDefinition.apiextensions.k8s.io "locks.pkg.crossplane.io" is invalid: status.storedVersions[0]: Invalid value: "v1alpha1": must appear in spec.versions

Maybe you didn't have a lock?

Also, @haarchri will 1.13.1 fix the upgrade of the lock CRD as well? Or do we need another 1.13.2?
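
For reference, the same kind of check used earlier for compositionrevisions should apply to the Lock CRD too (sketch; the CRD name is taken from the error above):

# Compare what is recorded as stored versus what the CRD currently serves.
kubectl get crd locks.pkg.crossplane.io -o json | jq '.status.storedVersions'
kubectl get crd locks.pkg.crossplane.io -o json | jq '.spec.versions[].name'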

@phisco
Contributor

phisco commented Aug 3, 2023

No, it won't fix the locks issue. Can you open a dedicated issue for that?

@haarchri
Contributor

haarchri commented Aug 3, 2023

Just a note: we dropped v1alpha1 for Lock in v1.11.0 (#3479).

@dee0sap
Contributor Author

dee0sap commented Aug 3, 2023

Just created #4442
