Crossplane Core Fails to Start up after upgrading from V1.7.0 to V1.13.2 #4557

Closed
Technovation22 opened this issue Sep 1, 2023 · 25 comments
Labels: bug (Something isn't working), stale

@Technovation22

Technovation22 commented Sep 1, 2023

What happened?

After upgrading from 1.7.0 to 1.13.2, the Crossplane core doesn't start up. We see error messages like:

crossplane: error: core.initCommand.Run(): cannot initialize core: cannot apply crd: cannot patch object: CustomResourceDefinition.apiextensions.k8s.io "compositionrevisions.apiextensions.crossplane.io" is invalid: status.storedVersions[0]: Invalid value: "v1alpha1": must appear in spec.versions.

This is a canary upgrade from 1.7.0 to 1.13.2. In the dev environment, we upgraded from 1.7.0 to 1.12.2 and then picked up the 1.13.2 fix release, and everything works fine. In canary, the upgrade goes directly from 1.7.0 to 1.13.2, and we again see an error similar to the one we saw before the 1.13.2 fix.

Something to note here is we did not have compositionrevisions enabled before the upgrade.

Logs from the init container before the upgrade and from the new init container after the upgrade are attached.

How can we reproduce it?

It happened in the transient test clusters used by our CI/CD pipelines, but it should be reproducible by upgrading from 1.7.0 to 1.13.2.

What environment did it happen in?

  • Crossplane version: 1.13.2
  • Cloud provider: AWS, GCP
  • Kubernetes version: v1.25.10
  • OS: Unavailable at this time
  • Kernel: Unavailable at this time

oldinit-vs-newinit.txt

@Technovation22 Technovation22 added the bug Something isn't working label Sep 1, 2023
@Technovation22
Author

Hi @sttts, @jbw976, @turkenh, any chance that #4402 somehow did not cover upgrades from very old versions?

@haarchri
Contributor

haarchri commented Sep 2, 2023

For reference, we tested the upgrade path v1.3.1 -> v1.7.0 -> v1.12.2 and then tested with the code from v1.13.2:

#4447

@haarchri
Contributor

haarchri commented Sep 2, 2023

Can you show the output of this command:

kubectl get crd compositionrevisions.apiextensions.crossplane.io -o json | jq '.status.storedVersions'
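For anyone scripting this check, the same information can be pulled out of the JSON that the command above prints. The following is a minimal Python sketch, not part of Crossplane: the function name `stale_stored_versions` and the trimmed-down sample document are assumptions made for illustration. It lists the stored versions that a target release no longer declares, which is the condition the API server rejects with "must appear in spec.versions".

```python
import json


def stale_stored_versions(crd_json: str, target_versions: list[str]) -> list[str]:
    """Return stored versions of the live CRD that the target CRD no longer declares.

    crd_json is the output of `kubectl get crd <name> -o json`;
    target_versions are the version names in the CRD shipped by the
    release you want to upgrade to.
    """
    crd = json.loads(crd_json)
    stored = crd.get("status", {}).get("storedVersions", [])
    return [v for v in stored if v not in target_versions]


# Hypothetical, trimmed-down CRD document standing in for real kubectl output.
sample = json.dumps({
    "metadata": {"name": "compositionrevisions.apiextensions.crossplane.io"},
    "spec": {"versions": [{"name": "v1alpha1", "storage": True}]},
    "status": {"storedVersions": ["v1alpha1"]},
})

# Target that dropped v1alpha1: the stale stored version is reported.
print(stale_stored_versions(sample, ["v1", "v1beta1"]))              # ['v1alpha1']
# Target that still declares v1alpha1: nothing is stale.
print(stale_stored_versions(sample, ["v1", "v1beta1", "v1alpha1"]))  # []
```

A non-empty result means the CRD apply would be expected to fail until the stored versions are cleaned up.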

@dee0sap
Contributor

dee0sap commented Sep 2, 2023

Hey @haarchri

Our CI pipeline collects an obj dump from the cluster before disposing of it. Looking in the dump from Parisa's last pipeline run, I found the following:
$ yq -C '.[] | select( .metadata.name == "compositionrevisions.apiextensions.crossplane.io" ) | .status.storedVersions' kind_CustomResourceDefinition_apiextensions.k8s.io_v1.yml

- v1alpha1

GLOBAL+I540621@W-R90YFPQE /cygdrive/c/dev/fpa131/2023-08-29-crossplane-canary-upgrade-problem/parisa-can-pr/b4/a5110b-dmi-resources/all_resources
$ yq -C '.[] | select( .metadata.name == "compositionrevisions.apiextensions.crossplane.io" ) | .spec.versions[].name' kind_CustomResourceDefinition_apiextensions.k8s.io_v1.yml
v1alpha1

@dee0sap
Contributor

dee0sap commented Sep 2, 2023

Hey @haarchri

As I mentioned in the Slack thread, we're very keen to get 1.13.2 out to canary quickly, so any workaround that would let us move forward would be much appreciated.

Looking at the timestamps on some of the objects in the obj dump, I see we have this sequence:
"2023-08-31T21:50:09Z" Initial deploy
"2023-08-31T21:50:26Z" Creation of CompositionRevision CRD
"2023-08-31T21:57:20Z" Upgrade deploy

Now, I also mentioned in Slack that one of the things that is different about our deployment of Crossplane is that we have an additional init container that runs before the init containers provided by Crossplane. Since compositionrevisions is not enabled for our 1.7 install, could we work around this situation by having our init container in the 1.13.2 deploy delete the compositionrevisions CRD? Or maybe conditionally delete the CRD if it finds that the only supported version is v1alpha1. That would probably be safer.

What do you think?

@turkenh
Member

turkenh commented Sep 4, 2023

I can reproduce this issue simply by trying an upgrade from v1.7.0 to v1.13.2:

  1. helm upgrade --install crossplane --namespace crossplane-system crossplane-stable/crossplane --version v1.7.0 --create-namespace
  2. helm upgrade --install crossplane --namespace crossplane-system crossplane-stable/crossplane --version v1.13.2 --create-namespace
  3. XP Core crashes with the below init container logs:
{"level":"info","ts":"2023-09-04T11:51:59Z","logger":"crossplane","msg":"Given tls secret is already filled, skipping tls certificate generation","Step":"WebhookCertificateGenerator"}
{"level":"info","ts":"2023-09-04T11:51:59Z","logger":"crossplane","msg":"Step has been completed","Name":"WebhookCertificateGenerator"}
{"level":"info","ts":"2023-09-04T11:52:00Z","logger":"crossplane","msg":"Step has been completed","Name":"CoreCRDsMigrator"}
{"level":"info","ts":"2023-09-04T11:52:00Z","logger":"crossplane","msg":"Step has been completed","Name":"CoreCRDsMigrator"}
crossplane: error: core.initCommand.Run(): cannot initialize core: cannot apply crd: cannot patch object: CustomResourceDefinition.apiextensions.k8s.io "compositionrevisions.apiextensions.crossplane.io" is invalid: status.storedVersions[0]: Invalid value: "v1alpha1": must appear in spec.versions

I'll investigate what's going on further...

@turkenh
Member

turkenh commented Sep 4, 2023

And it works fine if I go from v1.7.0 to v1.12.2 and then to v1.13.2, as @haarchri also pointed out.

@haarchri
Contributor

haarchri commented Sep 4, 2023

@turkenh I suppose the problem is that we should first apply the CRDs and then proceed with the migration part if needed?

@phisco
Contributor

phisco commented Sep 4, 2023

In 1.13.2 we upgrade the CRDs to the latest version available before installing the new CRDs. If I'm not wrong, the point is that when jumping from 1.7.x to 1.13.2 there is no v1beta1 to upgrade to first, so the init job does nothing, and then the CRD update that removes v1alpha1 fails. So yes, upgrading to 1.12 first should do the trick.

@turkenh
Member

turkenh commented Sep 4, 2023

@turkenh I suppose the problem is that we should first apply the CRDs and then proceed with the migration part if needed?

This was my first impression while reading the migrator code, but applying the CRD already fails with that error, so this is not possible; it is a chicken-and-egg problem.

In general, if we attempt to upgrade from a release whose storageVersion is an old version (e.g. v1alpha1) to a recent release that has already dropped that version, the migrator doesn't help (and indeed the migrator migrates CRs, not CRDs), because the storageVersion it finds on the cluster is still the old v1alpha1. It is also not possible to first apply the latest CRD with the new storageVersion, since that fails with the very same error, e.g. Invalid value: "v1alpha1": must appear in spec.versions.

So, I believe it is fair to expect users to upgrade via an intermediate version, since it is not feasible to support upgrading from any version to any other. When we go through an intermediate version:

v1.7.0 -> v1.12.2:

CRD apply works since v1alpha1 (i.e. the storageVersion) is still in the new list.

v1.12.2 -> v1.13.2:

CRD apply works since the new storageVersion, which is v1, is still in the new list.


For reference, composition revision CRDs have the following versions:

In v1.7.0:

versions:
- name: v1alpha1
  storage: true

In v1.12.2:

versions:
- name: v1
  storage: true
- name: v1beta1
  storage: false
- name: v1alpha1
  storage: false

In v1.13.2:

versions:
- name: v1
  storage: true
- name: v1beta1
  storage: false
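To make the difference between the two paths concrete, here is a small Python simulation built only from the version lists quoted above. It is a deliberately simplified model and not Crossplane's actual migrator or the API server's code: the `CRD` class and the `migrate`, `apply`, and `upgrade` helpers are hypothetical names, and the model assumes that the migrator trims storedVersions to the storage version it finds on the cluster and that applying a CRD records the new storage version.

```python
from dataclasses import dataclass, field


@dataclass
class CRD:
    versions: dict[str, bool]                       # version name -> storage flag
    stored: list[str] = field(default_factory=list)  # models status.storedVersions

    def storage_version(self) -> str:
        return next(name for name, storage in self.versions.items() if storage)


# Version lists as quoted in this thread.
RELEASES = {
    "v1.7.0": {"v1alpha1": True},
    "v1.12.2": {"v1": True, "v1beta1": False, "v1alpha1": False},
    "v1.13.2": {"v1": True, "v1beta1": False},
}


def migrate(crd: CRD) -> None:
    """Simplified migrator: CRs move to the live storage version, storedVersions is trimmed."""
    crd.stored = [crd.storage_version()]


def apply(crd: CRD, release: str) -> None:
    """Simplified CRD apply: rejected if a stored version is missing from the new spec."""
    new_versions = RELEASES[release]
    missing = [v for v in crd.stored if v not in new_versions]
    if missing:
        raise ValueError(f'Invalid value: "{missing[0]}": must appear in spec.versions')
    crd.versions = dict(new_versions)
    if crd.storage_version() not in crd.stored:
        crd.stored.append(crd.storage_version())


def upgrade(path: list[str]) -> str:
    """Install the first release in path, then upgrade through the rest in order."""
    crd = CRD(versions={})
    for release in path:
        if crd.versions:  # nothing to migrate on a fresh install
            migrate(crd)
        try:
            apply(crd, release)
        except ValueError as err:
            return f"{release}: {err}"
    return "ok"


print(upgrade(["v1.7.0", "v1.13.2"]))
print(upgrade(["v1.7.0", "v1.12.2", "v1.13.2"]))
```

Under these assumptions the direct path stops at v1.13.2 with the same "must appear in spec.versions" message, while the path through v1.12.2 completes.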

@jeanduplessis
Contributor

jeanduplessis commented Sep 4, 2023

I also believe it's fair to expect the end user to take an intermediary upgrade path.
I do think we should add an upgrade section to the docs and make this expectation known to users: https://docs.crossplane.io/software/

cc @plumbis

@dee0sap
Contributor

dee0sap commented Sep 4, 2023

Hey guys

Thanks for the analysis so far. At this point I don't think we will easily be able to roll out a version older than 1.13.2 to our canary and prod landscapes. The folks who own the CI/CD infrastructure enforce a rollout of dev, then canary, then production. So we would first have to downgrade the dev clusters to whatever intermediate version was selected, then roll that out to canary and production, and only then begin the rollout of 1.13.2.

That said, would the workaround I suggested here
#4557 (comment)
work? We are adding our own init container to the deployment, one that runs before the Crossplane core init container. Could that init container delete the v1alpha1 CRD and thereby unblock things?

@phisco
Contributor

phisco commented Sep 4, 2023

Deleting the v1alpha1 version could result in data loss. An option would be to let your init container add the v1 and v1beta1 versions to the existing CRD, so that the Crossplane init container can then move all resources to v1 and update the CRD to drop v1alpha1.
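As a rough illustration of the "add v1 and v1beta1 to the existing CRD" option, the Python sketch below merges two version lists so that the old version stays declared but is no longer the storage version. The helper name `with_added_versions` is hypothetical, and real CRD version entries also carry a schema and a served flag, which are omitted here.

```python
def with_added_versions(existing: list[dict], target: list[dict]) -> list[dict]:
    """Target versions first (their storage flags win), then leftover old versions, not stored."""
    target_names = {v["name"] for v in target}
    kept = [{**v, "storage": False} for v in existing if v["name"] not in target_names]
    return [dict(v) for v in target] + kept


# Version lists for v1.7.0 and v1.13.2 as quoted earlier in this thread.
v1_7_0 = [{"name": "v1alpha1", "storage": True}]
v1_13_2 = [{"name": "v1", "storage": True}, {"name": "v1beta1", "storage": False}]

merged = with_added_versions(v1_7_0, v1_13_2)
for version in merged:
    print(version)
```

The merged list has the same names and storage flags as the v1.12.2 list quoted earlier, which is the intermediate state that lets stored objects be moved to v1 before v1alpha1 is dropped.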

@dee0sap
Contributor

dee0sap commented Sep 4, 2023

Thanks for the response @phisco.

How would the deletion of the CRD result in data loss given that

  • we are starting from 1.7.0 and do not have the compositionrevisions feature enabled
  • our init container will run before the 'native' init container of 1.13.2
    That is, there will be no instances of compositionrevisions at the start of the upgrade to 1.13.2. (FWIW, we have surveyed all clusters and confirmed this to be the case.)
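The two conditions above can be written down as a guard that the extra init container evaluates before deleting anything. This is only a sketch in Python: `safe_to_delete` is a hypothetical helper, and it assumes the caller has already fetched the CRD as a dict and counted the existing compositionrevisions objects.

```python
def safe_to_delete(crd: dict, instance_count: int) -> bool:
    """True only if the CRD declares v1alpha1 alone and no objects of that kind exist."""
    names = [v["name"] for v in crd.get("spec", {}).get("versions", [])]
    return names == ["v1alpha1"] and instance_count == 0


# Hypothetical CRD states: a cluster still on the v1.7.0 CRD, and one already upgraded.
old_crd = {"spec": {"versions": [{"name": "v1alpha1", "storage": True}]}}
new_crd = {"spec": {"versions": [{"name": "v1", "storage": True},
                                 {"name": "v1beta1", "storage": False}]}}

print(safe_to_delete(old_crd, 0))  # True: only v1alpha1, nothing stored
print(safe_to_delete(old_crd, 3))  # False: deleting would drop existing objects
print(safe_to_delete(new_crd, 0))  # False: CRD is already on newer versions
```

Keeping the check conditional means the same init container can stay in the deployment after the upgrade without touching an already migrated CRD.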

@phisco
Contributor

phisco commented Sep 4, 2023

If that's the case, then there should be no problem.

@dee0sap
Contributor

dee0sap commented Sep 4, 2023

Thanks!

@dee0sap
Contributor

dee0sap commented Sep 4, 2023

@phisco

Just realized that you called out that there could be data loss but didn't indicate whether or not you thought my idea would actually get my team unstuck. What do you think?

@plumbis
Contributor

plumbis commented Sep 5, 2023

I also believe it's fair to expect the end user to take an intermediary upgrade path. I do think we should add an upgrade section to the docs and make this expectation known to users: docs.crossplane.io/software

Opened a discussion to make sure the maintainers come to an agreement on upgrade policies and we'll get that included in the docs.
#4569

@dee0sap
Contributor

dee0sap commented Sep 5, 2023

Hey everyone,

Just asking again if folks think the workaround mentioned here
#4557 (comment)
might work?
As mentioned in a subsequent comment, we haven't enabled compositionrevisions, so there is no chance of data loss.

@phisco
Contributor

phisco commented Sep 5, 2023

Yes, @dee0sap, it should.

@darkmuggle
Contributor

I've hit the same bug trying to go from 1.9 to 1.12.1:

I0919 16:18:19.925612       1 request.go:690] Waited for 1.04789089s due to client-side throttling, not priority and fairness, request: GET:https://10.24.8.1:443/apis/monitoring.coreos.com/v1?timeout=32s
{"level":"info","ts":"2023-09-19T16:18:25Z","logger":"crossplane","msg":"Given tls secret is already filled, skipping tls certificate generation","Step":"WebhookCertificateGenerator"}
{"level":"info","ts":"2023-09-19T16:18:25Z","logger":"crossplane","msg":"Step has been completed","Name":"WebhookCertificateGenerator"}
crossplane: error: core.initCommand.Run(): cannot initialize core: cannot apply crd: cannot patch object: CustomResourceDefinition.apiextensions.k8s.io "locks.pkg.crossplane.io" is invalid: status.storedVersions[0]: Invalid value: "v1alpha1": must appear in spec.versions

@haarchri
Contributor

@darkmuggle for locks.pkg.crossplane.io we fixed this in https://github.com/crossplane/crossplane/releases/tag/v1.13.2


Crossplane does not currently have enough maintainers to address every issue and pull request. This issue has been automatically marked as stale because it has had no activity in the last 90 days. It will be closed in 14 days if no further activity occurs. Leaving a comment starting with /fresh will mark this issue as not stale.

@github-actions github-actions bot added the stale label Dec 19, 2023
@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Jan 2, 2024
@holgerson97

I ran into the same issue. I resolved it by uninstalling the old Crossplane CRDs.

@SleepyBrett

I'm running into the same issue trying to go from 1.9.1 to 1.15.2.
