Releases implementation #96
base: main
Conversation
Can the same way of doing releases really work equally well for both a fast-growing startup and a large corporation?

Can the same way of doing releases really work equally well for, one one hand, a retail company that has fully embraced cloud-native approach, and on the other hand a manufacturer that is deploying Kubernetes clusters to multiple on-premise slow-changing and almost air-gapped environments like factories?
Suggested change: "…for, one one hand, a retail company…" → "…for, on one hand, a retail company…"
This document proposes to continue using the `giantswarm/releases` repository for creating and delivering releases, albeit in a simplified way when compared to the vintage releases. Release resources are used in a minimal way, and cluster-$provider apps continue to be used for deploying workload clusters.

Briefly put, cluster-$provider apps are still deployed in almost exactly the same way, with the following few differences:
- In the cluster-$provider app manifest, instead of specifying `.spec.version`, we MUST specify the `release.giantswarm.io/version` label.
What's the benefit of using a label rather than part of the spec? Labels can't have OpenAPI validation performed by the api-server.
Using a label for this seems like a hack and would require changes to our app platform to allow not specifying a version, but only for cluster apps.
+1. However, from batman times I remember that there are sometimes efficiency gains in label usage in terms of operator implementations, but I'd say that should not result in omitting the spec field, but rather in automatically adding/updating the label.
Cluster apps are still just Apps (Giant Swarm app platform App resource) and we apply those to deploy all required resources for the workload cluster. That does not change here.
We will still have to apply the real cluster-aws version. The release version is different from the cluster-aws version. Also, multiple releases can use the same cluster-aws version.
Today in App resources we don't have anything related to releases, so not sure we would put Release version in the App CR.
Using a label for this seems like a hack and would require changes to our app platform to allow not specifying a version but only for cluster apps.
The workflow is the following, and one change is needed in the app platform:
- We deploy the App CR with the `release.giantswarm.io/version` label, e.g. `release.giantswarm.io/version: 25.0.0`. It could be an annotation, or a new field in the spec, it doesn't matter; I was going for the simplest initial solution that works (it just cannot be `.spec.version`, because the cluster-aws version must be set there).
- app-admission-controller mutates the App CR to set the correct `.spec.version`:
  - it does this only for cluster apps (checking if an App is a cluster app is straightforward),
  - it reads the release version from the label and fetches the Release CR,
  - it reads the cluster-$provider app version from the Release CR,
  - it sets the App's `.spec.version`.
- The cluster-$provider app gets `.spec.version` set and is deployed regularly, like until now.
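As a sketch, the App CR before and after the mutation could look like this (the release version `25.0.0` comes from this thread; the cluster-aws version `1.2.3` and the App name are made-up illustration values):

```yaml
# As applied by the user / GitOps: no .spec.version, only the release label.
apiVersion: application.giantswarm.io/v1alpha1
kind: App
metadata:
  name: mycluster
  labels:
    release.giantswarm.io/version: "25.0.0"
spec:
  name: cluster-aws
  version: ""  # left empty on purpose
---
# After app-admission-controller mutation: version resolved from Release 25.0.0.
# (1.2.3 is a hypothetical cluster-aws version listed in that Release CR.)
spec:
  name: cluster-aws
  version: "1.2.3"
```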
Semi-related but it's just made me think - how would a change to the release version label trigger all the needed changes if the cluster app version remains the same? What would be responsible for updating the kubernetes version or flatcar version values for example?
Semi-related but it's just made me think - how would a change to the release version label trigger all the needed changes if the cluster app version remains the same?
Good question, gotta check how it behaves on the update (so far I successfully tested creation only).
Label update will definitely trigger App CR reconciliation, just not sure if that will re-render the templates (re-rendering of the templates would update e.g. OS version, k8s version, app versions).
In case that release label changes do not trigger re-rendering of the cluster-$provider app, an alternative solution would be to put the release version into cluster-$provider app's config (as changes in the config do trigger cluster-$provider app re-rendering). I thought of this also, but went with the simpler option with release label on the App CR (as having release label on the CRs is something that was common in vintage as well).
After looking into this again I have moved the release version from a cluster-$provider App label to a Helm value `global.release.version`.
Thanks for the feedback here @AverageMarcus and @puja108!
I believe it is better like this, so now:
- The change in the cluster chart is a bit simpler. OTOH the change in the mutating webhook has one more step, as it gets the release version from the ConfigMap, but this is IMO fine.
- We could also do some basic release version validation with JSON schema.
- Cleaner Helm code to use the release version in the templates, e.g. to set it as a label.
- I have tested the cluster upgrade, it works as expected :)
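The release version as a Helm value could then look like this in the cluster-$provider app's config (a sketch; the value path `global.release.version` is the one mentioned above, the version number is made up):

```yaml
# User config values for the cluster-$provider app
global:
  release:
    version: "25.0.0"  # release identifier, distinct from the chart version
```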
Will update the RFC.
I have tested the cluster upgrade, it works as expected :)
With the `release.version` as a Helm value, I see a problem when the upgrade includes a change of version in the `cluster-<provider>` app.
I believe that because the `release.version` is now in a ConfigMap, the app-admission-controller won't update the `cluster-<provider>` app version until the actual App CR is modified. That won't happen automatically by just modifying the associated ConfigMap.
App CRs have the ConfigMap version as an annotation. IIUC this gets updated when you change the app config (e.g. when you change the release version), so this App CR update should trigger the mutating webhook, which then updates the App version.

```yaml
apiVersion: application.giantswarm.io/v1alpha1
kind: App
metadata:
  annotations:
    app-operator.giantswarm.io/latest-configmap-version: "271755519"
```
I missed that annotation. So all clear from my end, thanks for digging that up!
Briefly put, cluster-$provider apps are still deployed in almost exactly the same way, with the following few differences:
- In the cluster-$provider app manifest, instead of specifying `.spec.version`, we MUST specify the `release.giantswarm.io/version` label.
- Information about app version, catalog and dependencies MUST BE obtained from the Release resource (during the Helm rendering phase) when deploying a production cluster.
I assume this refers to needing to make the change in app-operator (or similar) to lookup the correct values.
Have you considered how this might be handled in the future when we migrate to using Flux and HelmReleases rather than app-operator?
We should 100% avoid building anything custom into app operator for this, but AFAIK as long as these values are supplied through a configmap we should be fine no matter which helm rendering tool is used as they all support merging configs.
I thought app-operator will be there to manage Apps (transforming them into the right HelmRelease objects, offering the 3-level config), while chart-operator is going to disappear.
I assume this refers to needing to make the change in app-operator (or similar) to lookup the correct values.

No changes in app-operator are needed. The Release CR is read in the Helm template with the Helm `lookup` function, and then app and component versions are read and set in the templates.

Have you considered how this might be handled in the future when we migrate to using Flux and HelmReleases rather than app-operator?

AFAIK Team Honeybadger is working on replacing chart-operator with the Flux Helm controller, but app-operator is not going anywhere any time soon. Even if it did, that does not affect the work here in any way, as the change is just about Helm reading a Release CR, getting some value from it and putting that value into some field of some CR (e.g. the App version in `.spec.version`, the HelmRelease version in `.spec.chart.spec.version`, etc.).
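A sketch of what that template logic could look like, using Helm's upstream `lookup` function. The Release naming scheme, field paths (`.spec.components` as a list of name/version entries) and the emitted key are assumptions for illustration, not the actual cluster chart code:

```yaml
{{/* Sketch: resolve the Release CR for the configured release version. */}}
{{- $releaseName := printf "aws-%s" .Values.global.release.version }}
{{- $release := lookup "release.giantswarm.io/v1alpha1" "Release" "" $releaseName }}
{{- if not $release }}
{{- fail (printf "Release %s not found" $releaseName) }}
{{- end }}
{{/* Assumed shape: .spec.components is a list of {name, version} entries. */}}
{{- range $release.spec.components }}
{{- if eq .name "kubernetes" }}
kubernetesVersion: {{ .version }}
{{- end }}
{{- end }}
```

Note that `lookup` returns an empty dict when run with `helm template` (no cluster access), which is why the `fail` guard only makes sense for in-cluster rendering.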
- Kubernetes version MUST BE specified as a `.spec.components` entry,
- Flatcar version and image variant MUST BE specified as `.spec.components` entries,
Is this purely informational or is it used to decide what VM image to use? As we also bake Teleport agent into the node image does that also need to be included here?
It decides which image to use.
The idea is to get all versions of all apps / components from the Release, so:
- Kubernetes version
- OS version
- info needed for the whole image name, including the variant number that changes when we get e.g. a new Teleport version baked in
  - alternatively we can also put the entire image name, so it's more flexible (e.g. cluster-aws currently allows overriding the image name with a free-form string, or it looks up the image according to a defined format, which is what we usually use and what I also did here)
- app versions for apps that are deployed as App CRs
- app versions for apps that are deployed as HelmRelease CRs
- Kubernetes version MUST BE specified as a `.spec.components` entry,
- Flatcar version and image variant MUST BE specified as `.spec.components` entries,
- cluster-$provider app version MUST BE specified as a `.spec.components` entry.
What about versions of CAPx controllers? Do they have any bearing on a Release? E.g. does a specific release need to use a specific version of the CAPI and CAPA controllers, as changes between versions could affect the behaviour of how the WCs are created?
The CAPI & CAPA controllers are not part of the release because we can only run one set on a management cluster.
I thought the watch label allowed for running multiple on the same MC and each would deal with different resources?
What about versions of CAPx controllers? Do they have any bearing on a Release?

No.

E.g. does a specific release need to use a specific version of the CAPI and CAPA controllers, as changes between versions could affect the behaviour of how the WCs are created?

While changes in both CAPI and CAPA controllers should always be backward compatible, and you should always be able to use the same manifests even with newer CAPI/CAPA controllers, it is true that with a significantly newer CAPI/CAPA controller, even when you create a cluster with the same release, it might get created slightly differently compared to e.g. 3 months ago with older CAPI/CAPA controllers. This is true also today with cluster-$provider apps (for all providers), and it does not change with this RFC. IIRC this is very much intended, because we want to be able to update MC controllers continuously (mostly for the sake of fixes).
The CAPI & CAPA controllers are not part of the release because we can only run one set on a management cluster.
Yes, exactly this.
I thought the watch label allowed for running multiple on the same MC and each would deal with different resources?
It did, but IIRC when we implemented the watch label, it was back when we wanted to run CAPI/CAPx controllers on the existing vintage MCs, so that vintage MCs could run CAPI/CAPx controllers and deploy CAPI WCs. AFAIK we have abandoned that idea, and migration assumes that WCs are migrated to new CAPI-only MCs.
```yaml
kind: App
metadata:
  labels:
    app-operator.giantswarm.io/version: 0.0.0
```
Is this needed?
No idea, we render this in the App CR for the cluster app (I think it is needed for some reason). I just added what we already have.
```yaml
    configMap:
      name: mycluster-userconfig
      namespace: org-mycompany
  version: ""
```
Will this play nicely with GitOps? Do both Flux and Argo allow this to be empty in git and not try to re-apply it as empty after the mutation has happened?
AFAIK an empty field should not be a problem for Flux. We already leave some fields empty for some resources in our gitops setup.
Will check this though, thanks!
- after the cluster-$provider app is applied successfully, in order to render all templates and set the app version, catalog and depends-on properties, Helm will do the following:
  - it will look up the App resource (via the Helm `lookup` function) and read the release version,
  - then it will look up the Release resource and read the app version, catalog and depends-on properties from there, and
  - finally it will render all templates and apply them.
(Thinking out loud...) I wonder if it might be better / easier to instead have a ConfigMap deployed with each Release that acts like a values file that can be used as an overlay for each cluster-$provider App instead of needing to do multiple helm lookups.
+1, not sure if this is upstream Helm functionality and if it is supported by all Helm implementations (e.g. within Flux, Argo, ...).
I wonder if it might be better / easier to instead have a ConfigMap deployed with each Release that acts like a values file that can be used as an overlay for each cluster-$provider App instead of needing to do multiple helm lookups.
Better? Not sure, maybe yes, maybe not.
Easier? I don't think so, at least not now, because we already have processes, workflows and tooling for the releases repo and Release CRs, and one of the ideas here is to make use of at least some of that; otherwise we are taking more than a few steps back, as we would be dropping tooling and processes that have been working for many years.
Another downside of the ConfigMap idea and using it as an overlay is that k8s/OS/app versions are then Helm values again, and we don't want that, because it makes k8s/OS/app versions part of the workload cluster creation API, like they're just config.

not sure if this is upstream helm functionality

The Helm `lookup` function (which is basically a k8s client) is a regular upstream Helm function (https://helm.sh/docs/chart_template_guide/functions_and_pipelines/#using-the-lookup-function); we already use it in some places in cluster-$provider apps.
In both cases there is a single release identifier, just defined in different places.

Versioning logic, i.e. what and how can change in a major, minor or a patch version, would be the same in both cases.
##### 4.2.2.2. Scenario 1: new provider-independent app patch

Now in the case above, let’s say we want to release a new patch version for some provider-independent app, and we have to do that in 5 providers, and for 2 major releases for every provider. The process would look like the following:
- Renovate updates the version of the app in 2 git branches in the cluster chart. Here we have 2 pull requests, both created by Renovate.
For reference as I wasn't aware - you can specify multiple base branches in Renovate, that's very nice! 😁 https://docs.renovatebot.com/configuration-options/#basebranches
Yeah, if we keep the current way, then we would really have to do that. You still have PRs though.
And Renovate can't help if you have to cherry-pick entire changes/fixes/features to older release branches, and when you have to deal with git conflicts.
- For all 5 providers, Renovate opens PRs for the last 2 major versions. Here we have 10 PRs opened by Renovate.
- We release new patch versions for the last two major releases of all cluster-$provider apps. Here we have 10 release pull requests.

In total, in the above scenario, we have 12 Renovate PRs and 12 release PRs, so 24 PRs in total. And all these PRs are across multiple components owned by multiple teams, meaning the some actions should be taken by people from multiple teams.
Suggested change: "…meaning the some actions should be taken…" → "…meaning that some actions should be taken…"
Changes to the current e2e tests would be minimal initially.
- Testing of the cluster-$provider app would require one addition, which would be:
  - Looking up the latest release and setting the release label on the App CR.
- Like today, cluster-test-suites would use a custom version of the cluster-$provider app from the branch it is testing (in which case the mutating webhook would not set the cluster-$provider app version based on the Release CR).
How would the webhook know not to trigger?
A mutating webhook sets `.spec.version` only if it is not set already, which is useful and even needed for testing. Here is the code.
In addition to the mutating webhook, we would probably need a validating webhook, to check the cluster App `.spec.version` against the version of the `cluster-$provider` app in the Release (they must either be the same, or `.spec.version` is a dev build on top of the version that we have in the Release, to allow for overriding during testing).
A mutating webhook sets .spec.version only if it is not set already, which is useful and even needed for testing.
Gotta fix this so it also updates `.spec.version` when the release label is updated, except when we want to override the cluster-aws version for testing (which we can signal with a Helm value; that should be fine, as we already have some testing-related ephemeral values).
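The behaviour discussed here (mutate only when `.spec.version` is empty, so test runs can pin a dev build explicitly) can be sketched as a tiny decision function. The function name and shape are illustrative, not the actual app-admission-controller code:

```go
package main

import "fmt"

// desiredAppVersion returns the version the admission webhook should set
// on a cluster App: keep an explicitly set .spec.version (e.g. a dev build
// under test), otherwise fall back to the version from the Release CR.
func desiredAppVersion(specVersion, releaseComponentVersion string) string {
	if specVersion != "" {
		// Already set, e.g. a dev build pinned by cluster-test-suites.
		return specVersion
	}
	return releaseComponentVersion
}

func main() {
	fmt.Println(desiredAppVersion("", "1.2.3"))           // resolved from the Release
	fmt.Println(desiredAppVersion("1.3.0-dev1", "1.2.3")) // explicit override wins
}
```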
- Looking up the latest release and setting the release label on the App CR.
- Like today, cluster-test-suites would use a custom version of the cluster-$provider app from the branch it is testing (in which case the mutating webhook would not set the cluster-$provider app version based on the Release CR).
- The existing app testing, where app versions are overridden via ephemeral Helm values, would continue working as is.
- As a future improvement, cluster-test-suites could run all tests not only for the latest release, but for all releases which use the same major cluster-$provider app version.
Tests would also now need to be triggered from the `releases` repo if teams are able to bump app versions there. Each of those draft PRs for new releases needs to be tested. This would require a completely new pipeline in Tekton.
Yes, I had that on my mind, just forgot to write it, thanks for the reminder, will add that as well.
Since app and component versions are embedded in the cluster-$provider app, we cannot combine different versions of different apps in different ways (unless we make app and component versions configurable, which then opens the door to a whole new set of issues). We cannot have another release model where e.g. versions of Kubernetes, OS, CNI and CPI are a part of the release, and all other apps are always at their latest versions. Or a release model where we use LTS releases of the OS, CPI, CNI and other apps that provide an LTS release (or something similar).

OTOH with the releases repository, we can easily create different directories for different release models, where we can combine the versions of apps and components in whatever way, and where we can update different release models with different frequency and with different rules. Therefore it would be relatively easy to have one release model where app versions are not even part of the release, or are decoupled and continuously updated. Or another “slower” release model with long-term support, where apps and components are on their LTS versions and are updated more slowly.
I don't agree that this is "easy" in any way. It adds quite a bit of complexity to release management (what PRs need to be created when in what order), testing (each variation adds another suite of tests that need to be run each time) and around upgrades (can a cluster move to a different release channel? Can they then move back?)
I would guess that handling such things in folder structures is easier to implement and gives more flexibility than branches do, although with branches you can rely on existing git tooling that does sometimes already support such features (as many projects out there do work with branches).
Maybe at least for the simple cases that we are seeing already (releasing a new WC version based on a new app release or k8s/OS release), we could already make the process clear and think of how we would automate this.
That can however be separate from this RFC in general
We shouldn't add more complexity/features than what we offered with vintage releases. It will be very hard to implement and understand. Better to bring CAPI releases on par with vintage releases, and then see how it goes.
I've left a bunch of comments on the PR but I also have a few general questions / observations:
We have noticed that we miss a few aspects of the old releases, like a single release identifier, and being able to easily and clearly see which versions of which apps are part of some release and which versions of which apps should be deployed to a workload cluster.

We also miss a versioning scheme where it’s clear what we promise and what you can expect in a patch, minor or major release upgrade, which, although not strictly defined, was mostly clear for releases of the vintage product. And, equally important, we lack a mechanism to enforce this behaviour.
We have https://docs.giantswarm.io/vintage/platform-overview/cluster-management/releases/#versioning-conventions defining our convention in our docs for vintage. And I'd say we should not diverge too much from that definition going forward. Could make the breaking vs non-breaking distinction clearer in there, maybe.
## 2. Requirements language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119] [RFC8174].
View the "rich diff" and you'll see the links working.
Our product is being developed for multiple providers, where we currently support 5 of them: AWS (CAPA), EKS (CAPA), Azure (CAPZ), vSphere (CAPV), VMware Cloud Director (CAPVCD). Hopefully we will work with more providers soon (e.g. GCP/GKE).

We also need to support multiple major versions across those providers. Most managed Kubernetes products support all community-supported Kubernetes versions, meaning the latest 3 minor versions. In our case that would mean 3 major versions, and that is per provider.
Officially we support only 2; anything beyond that is currently an exception.
Assuming we would like to support at least 2 major versions per provider, and that we work with 5 providers, we would regularly support 10 major versions across all providers. With 6 providers and the 3 latest major releases, this number grows to 18 major releases across all providers.

Additionally, it might be possible that for every major release, we would support more than 1 minor in some cases (e.g. the last 2). Therefore, when there is a new patch version of some app (e.g. a security fix), we can easily be in a situation where we have to create a double-digit number of new patch releases in order to patch all affected minor releases.
Again, this is not officially supported in our guidelines, which state that each major is kept supported on its latest minor and patch only; anything else would be an exception (especially as minors and patches should not be breaking, and thus should not have blockers towards an upgrade to the latest minor and patch).
- provider-independent and provider-specific default configuration of apps,
- configuration of the operating system and different node components, such as systemd, containerd, etc.

Multiple teams and multiple people are continuously working on all of the above, and it is indispensable to ensure that all of them have a smooth and frictionless development, testing and release experience, so we can increase deployment frequency, reduce lead time for changes and reduce change failure rate. For this to work, we need to be able to develop, test and release almost every change independently of almost all other changes. The team that worked on a change should be able to release the change fully independently, without any intervention from the provider-integration or provider-independent KaaS teams.
❤️
That’s it - 1 PR. 1 PR for any number of releases across any number of providers.
We can also maintain a draft PR in the releases repository where we automatically create next draft releases for all providers, and then Renovate can bump version numbers in those draft releases, and we just decide when to cut the new release (and name it appropriately). |
Ideally it should be so cheap to cut a new release that we just do it on any change and not batch changes within versions, batching will happen naturally by consecutively releasing patches, but at least we'd have atomic updates, which could be tested independently.
components:
  - name: cluster-aws
    version: 1.0.0
  - name: flatcar
    version: 3815.2.2
  - name: flatcar-variant
    version: 3.0.0
  - name: kubernetes
    version: 1.25.16
so this would mean for an OS or K8s upgrade we would not need to release any new app/chart, but just bump the version here (and have the image available), right?
I'm not sure that'd work technically. Something would need to trigger the chart to look up the new versions and change the image reference in the rendered templates.
I just meant that with this proposal we would not need to release a new chart in many cases; only changes that need new values wired up would require a new chart.
> so this would mean for an OS or K8s upgrade we would not need to release any new app/chart
Yes, that is exactly the idea: when you need a new OS/k8s/app version, you don't have to touch the cluster-$provider app repo; no new version of it is needed.
Then the only scenario where cluster-$provider app changes would be needed is when some new k8s/OS/app version requires changes in some config, but this is usually rare (and probably means it is a breaking change). We can see how to improve this scenario later; e.g. if needed, we could decouple even the app config from the cluster-$provider app and take it out, but I don't think this is needed (at least not now).
> but just bump the version here (and have the image available), right?
Since releases are immutable, we would create a new Release with the new versions. Then we would bump the release version for the cluster.
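As a purely illustrative sketch of what "create a new Release with the new versions" could look like (the API group, release naming scheme and exact fields below are assumptions for illustration, not decisions made in this thread), a Kubernetes patch would add one new immutable Release manifest next to the existing one:

```yaml
# Hypothetical new Release created for a Kubernetes patch.
# The previous release (e.g. aws-25.0.0) stays untouched; clusters
# opt in by bumping their release version.
apiVersion: release.giantswarm.io/v1alpha1   # assumed group/version
kind: Release
metadata:
  name: aws-25.0.1
spec:
  components:
    - name: cluster-aws
      version: 1.0.0        # unchanged
    - name: flatcar
      version: 3815.2.2     # unchanged
    - name: kubernetes
      version: 1.25.17      # bumped patch version
```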
Sure, I did mean bumping and releasing as a new version, not bumping inside an existing version.
The current cluster-$provider releases are attractive thanks to their simplicity and straightforwardness. However, they have not been put to the test yet, as our customers are still not using the new product for their production workloads.
As soon as our customers move from our vintage product to Cluster API, we will need multiple major releases. Once the customers are using the new product more, they will put it to the test and we’ll have even more work on it compared to today, so there will be even more need for patch and minor releases across all providers.
This leaves me wondering whether we have a plan and a timeline for how quickly we can move to this model, as we already have a timeline for when customers will be moving, and we also need a plan for how to move customers that are already using the existing system. cc @alex-dabija
What about right away? 🙈 😬
Required changes are very small; I have a working POC already, with changes in the cluster chart, cluster-aws and app-admission-controller.
I have added an agenda item for this week's SIG Architecture to show it there.
We will definitely need to also look at how upgrades are handled, but generally good if we can move fast.
Upgrading existing clusters would take just two changes (slightly different compared to a regular version bump):
- Remove spec.version in the cluster-$provider app manifest.
- Set the release version.
Then for clusters on the new release system (with Release CRs) it's just a regular version bump, like for the cluster-aws version.
Huge thanks to @puja108, @AverageMarcus and @pipo02mix for the review! Will reply to all of your comments.
With the releases repository, release identifiers would be defined in the repository itself, in the same way as for vintage releases. In this process the cluster-$provider app is one of the release components.
In both cases there is a single release identifier, just defined in different places. |
What do you mean by different places? 🤔
Currently the release identifier (release version) is the cluster-$provider app release, so you can look up a release e.g. in the cluster-aws repo releases.
With the releases repo, you would look up a release in the releases repo itself (as the cluster-$provider release becomes just one part of the overall release).
- https://github.com/orgs/giantswarm/teams/sig-architecture | ||
- https://github.com/orgs/giantswarm/teams/team-turtles | ||
state: approved | ||
summary: Where and how we implement releases for workload clusters. |
This must summarize what the RFC proposes. Right now, it's very vague and not helpful in the RFC overview table.
summary: Where and how we implement releases for workload clusters. | ||
--- | ||
# Releases implementation |
# Releases implementation
# Cluster releases implementation |
#### 4.2.1. Release identifier and versioning | ||
With the current cluster-$provider app release process, the cluster-$provider app version is the release identifier.
Please replace all occurrences of "current", "old", "new", "today" etc. with absolute terms (e.g. "CAPI release for AWS", "Vintage release", "as of 2024-04", "before introduction of this release model", ...). This document must still make sense to readers next month and next year.
Since app and component versions are embedded in the cluster-$provider app, we cannot combine different versions of different apps in different ways (unless we make app and component versions configurable, which then opens the door to a whole new set of issues). We cannot have another release model where e.g. versions of Kubernetes, OS, CNI and CPI are part of the release and all other apps are always at their latest versions. Or a release model where we use LTS releases of the OS, CPI, CNI and other apps that provide an LTS release (or something similar).
On the other hand, with the releases repository, we can easily create different directories for different release models, where we can combine the versions of apps and components in whatever way we want, and where we can update different release models with different frequency and different rules. Therefore it would be relatively easy to have one release model where app versions are not even part of the release, or are decoupled and continuously updated. Or another, "slower" release model with long-term support, where apps and components are on their LTS versions and are updated more slowly.
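To illustrate (the directory names and version numbers below are invented for this sketch, not proposed anywhere in this document), a default and an LTS release model could live side by side in the releases repository:

```yaml
# Hypothetical repository layout for two release models:
#
#   aws/v25.0.0/release.yaml        # default model: latest versions
#   aws-lts/v24.0.3/release.yaml    # LTS model: slower-moving versions
#
# An LTS release would simply pin components to LTS versions, e.g.:
components:
  - name: kubernetes
    version: 1.24.17    # hypothetical LTS version
  - name: flatcar
    version: 3510.2.8   # hypothetical LTS version
```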
We shouldn't add more complexity/feature than what we offered with Vintage releases. It will be very hard to implement and understand. Better to bring CAPI releases on par with vintage releases, and then see how it goes.
Nothing more to add, other comments already cover my questions.
Some questions that have come to mind while working on the tests:
Helm rendering fails, so you cannot apply your cluster manifests. This is the error ATM (but it could be nicer, will improve it):
The idea is to add a finalizer to the Release CR for every cluster, in the cluster chart's post-create and post-upgrade hooks, and to delete it in the post-delete hook. This way you cannot accidentally delete a Release CR that is in use.
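A sketch of what that in-use protection could look like on a Release resource (the API group and especially the finalizer naming are assumptions, not something this thread settled on):

```yaml
# Hypothetical Release with one finalizer per cluster that uses it;
# added by the cluster chart's post-create/post-upgrade hooks and
# removed by its post-delete hook, so an in-use Release cannot be
# deleted accidentally.
apiVersion: release.giantswarm.io/v1alpha1   # assumed group/version
kind: Release
metadata:
  name: aws-25.0.0
  finalizers:
    - release.giantswarm.io/cluster-mycluster   # hypothetical name
```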
We should, but we're not doing it yet.
Good idea, thanks!
IMO releases should behave the same way across all providers, with only apps/components being different. Do you have some use case in mind for a different Release CRD per provider?
👍 Can we include this in the RFC then? :)
I was just thinking about how to easily get the latest available release for a provider. If we label the CRs then that solves the problem.
I don't think this is a good idea. The process for creating a workload cluster is complex enough as it currently is, and we are just adding more indirection and making it more complex and harder to reason about.

I'm not sure the number of PRs that we need to deal with when updating an app is a good metric to measure the approach by. Some of the PRs are automatically created by Renovate, and I'd invest time in automatically running tests and merging these PRs if the tests were successful. Also, when upgrading something like the Kubernetes version or Flatcar version, it may be just changing the version number in the

Having to maintain several branches will be challenging. But then again, I'd invest time in improving our tooling to deal with branches (keeping them up to date, cherry-picking changes automatically, etc.). Many open source projects already have automation around this, so it shouldn't be a huge effort. And we would get benefits not only for managing clusters, but for managing all our repositories that may need several branches.
@nprokopic what's the status here? i feel like we're doing this now :D
Yeah, we're definitely doing this, and it's still heavily in progress in Turtles (we moved CAPA and CAPZ to new releases, soon CAPV), which is why I didn't get around to picking up the RFC again and finishing this 🙈
Below are excerpts from a couple of sections.
TODOs:
## 1. Introduction

This RFC defines how we implement releases for workload clusters for the new KaaS product that is based on the Cluster API project. It covers multiple aspects of releases, such as creation, testing and delivery of releases. It does not cover how clusters are upgraded to new releases.

## 4. Proposal

This section defines how releases are implemented: where they are, and how they are developed, tested and delivered. It then compares the proposed solution to the current one that we have for our new KaaS product that is based on Cluster API.

### 4.1. Implementation of releases

This document proposes to continue using the `giantswarm/releases` repository for creating and delivering releases, albeit in a simplified way when compared to the vintage releases. Release resources are used in a minimal way, and cluster-$provider apps continue to be used for deploying workload clusters.

Briefly put, cluster-$provider apps are still deployed in almost exactly the same way, with the following few differences:

- Instead of specifying `.spec.version`, we MUST specify the `release.giantswarm.io/version` label.

The core of this proposal is the idea to decouple the app and component versions from the cluster-$provider apps, which will then enable us to have a very scalable process for working with releases, where we can easily create and manage many releases across many providers and develop mechanisms to enforce business logic across all of those.