Releases implementation #96
base: main
Conversation
Can the same way of doing releases really work equally well for both a fast-growing startup and a large corporation?

Can the same way of doing releases really work equally well for, one one hand, a retail company that has fully embraced cloud-native approach, and on the other hand a manufacturer that is deploying Kubernetes clusters to multiple on-premise slow-changing and almost air-gapped environments like factories?
Suggested change: "…for, one one hand, a retail company…" → "…for, on one hand, a retail company…"
This document proposes to continue using the `giantswarm/releases` repository for creating and delivering releases, albeit in a simplified way when compared to the vintage releases. Release resources are used in a minimal way, and cluster-$provider apps continue to be used for deploying workload clusters.

Briefly put, cluster-$provider apps are still deployed in almost exactly the same way, with the following few differences:
- In the cluster-$provider app manifest, instead of specifying `.spec.version`, we MUST specify the `release.giantswarm.io/version` label.
What's the benefit of using a label rather than part of the spec? Labels can't have OpenAPI validation performed by the api-server.
Using a label for this seems like a hack and would require changes to our app platform to allow not specifying a version, but only for cluster apps.
+1. However, from batman times I remember that there are sometimes efficiency gains in label usage in terms of operator implementations, but I'd say that should not result in omitting the spec field, but rather in automatically adding/updating the label.
Cluster apps are still just Apps (Giant Swarm app platform App resource) and we apply those to deploy all required resources for the workload cluster. That does not change here.
We will still have to apply the real cluster-aws version. The release version is different from the cluster-aws version. Also, multiple releases can use the same cluster-aws version.
Today in App resources we don't have anything related to releases, so not sure we would put Release version in the App CR.
Using a label for this seems like a hack and would require changes to our app platform to allow not specifying a version but only for cluster apps.
The workflow is the following, and one change is needed in the app platform:
- We deploy the App CR with the `release.giantswarm.io/version` label, e.g. `release.giantswarm.io/version: 25.0.0`. It could be an annotation, or a new field in the spec, it doesn't matter; I was going for the simplest initial solution that works (it just cannot be `.spec.version`, because the cluster-aws version must be set there).
- app-admission-controller mutates the App CR to set the correct `.spec.version`:
  - it does this only for cluster apps (checking if an App is a cluster app is straightforward),
  - it reads the release version from the label and fetches the Release CR,
  - it reads the cluster-$provider app version from the Release CR,
  - it sets the App's `.spec.version`.
- The cluster-$provider app gets `.spec.version` set and is deployed regularly, like until now.
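As a sketch, the App CR before and after the mutation could look like this (the release version `25.0.0` comes from this thread; the cluster-aws version `1.2.3` and the App name are made-up illustration values):

```yaml
# As applied by the user / GitOps: no .spec.version, only the release label.
apiVersion: application.giantswarm.io/v1alpha1
kind: App
metadata:
  name: mycluster
  labels:
    release.giantswarm.io/version: "25.0.0"
spec:
  name: cluster-aws
  version: ""  # left empty on purpose
---
# After app-admission-controller mutation: version resolved from Release 25.0.0.
# (1.2.3 is a hypothetical cluster-aws version listed in that Release CR.)
spec:
  name: cluster-aws
  version: "1.2.3"
```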
Semi-related but it's just made me think - how would a change to the release version label trigger all the needed changes if the cluster app version remains the same? What would be responsible for updating the kubernetes version or flatcar version values for example?
Semi-related but it's just made me think - how would a change to the release version label trigger all the needed changes if the cluster app version remains the same?
Good question, gotta check how it behaves on the update (so far I successfully tested creation only).
Label update will definitely trigger App CR reconciliation, just not sure if that will re-render the templates (re-rendering of the templates would update e.g. OS version, k8s version, app versions).
In case that release label changes do not trigger re-rendering of the cluster-$provider app, an alternative solution would be to put the release version into cluster-$provider app's config (as changes in the config do trigger cluster-$provider app re-rendering). I thought of this also, but went with the simpler option with release label on the App CR (as having release label on the CRs is something that was common in vintage as well).
After looking into this again I have moved the release version from a cluster-$provider App label to a Helm value `global.release.version`.
Thanks for the feedback here @AverageMarcus and @puja108!
I believe it is better like this, so now:
- The change in the cluster chart is a bit simpler. OTOH the change in the mutating webhook has one more step, as it gets the release version from the ConfigMap, but this is IMO fine.
- We could also do some basic release version validation with JSON schema.
- Cleaner Helm code to use the release version in the templates, e.g. to set it as a label.
- I have tested the cluster upgrade, it works as expected :)
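The release version as a Helm value could then look like this in the cluster-$provider app's config (a sketch; the value path `global.release.version` is the one mentioned above, the version number is made up):

```yaml
# User config values for the cluster-$provider app
global:
  release:
    version: "25.0.0"  # release identifier, distinct from the chart version
```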
Will update the RFC.
I have tested the cluster upgrade, it works as expected :)
With the `release.version` as a Helm value, I see a problem when the upgrade includes a change of version in the `cluster-<provider>` app.
I believe that because the `release.version` is now in a ConfigMap, the app-admission-controller won't update the `cluster-<provider>` app version until the actual App CR is modified. That won't happen automatically by just modifying the associated ConfigMap.
App CRs have the ConfigMap version as an annotation. IIUC this gets updated when you change the app config (e.g. when you change the release version), so this App CR update should trigger the mutating webhook, which then updates the App version.

```yaml
apiVersion: application.giantswarm.io/v1alpha1
kind: App
metadata:
  annotations:
    app-operator.giantswarm.io/latest-configmap-version: "271755519"
```
I missed that annotation. So all clear from my end, thanks for digging that up!
Briefly put, cluster-$provider apps are still deployed in almost exactly the same way, with the following few differences:
- In the cluster-$provider app manifest, instead of specifying `.spec.version`, we MUST specify the `release.giantswarm.io/version` label.
- Information about app version, catalog and dependencies MUST BE obtained from the Release resource (during the Helm rendering phase) when deploying a production cluster.
I assume this refers to needing to make the change in app-operator (or similar) to lookup the correct values.
Have you considered how this might be handled in the future when we migrate to using Flux and HelmReleases rather than app-operator?
We should 100% avoid building anything custom into app operator for this, but AFAIK as long as these values are supplied through a configmap we should be fine no matter which helm rendering tool is used as they all support merging configs.
I thought app-operator will be there to manage Apps (transforming them into the right HelmRelease objects, offering the 3-level config), while chart-operator is going to disappear.
I assume this refers to needing to make the change in app-operator (or similar) to lookup the correct values.

No changes in app-operator are needed. The Release CR is read in the Helm template with the Helm `lookup` function, and then app and component versions are read and set in the templates.

Have you considered how this might be handled in the future when we migrate to using Flux and HelmReleases rather than app-operator?

AFAIK Team Honeybadger is working on replacing chart-operator with the Flux Helm controller, but app-operator is not going anywhere any time soon. Even if it did, that does not affect the work here in any way, as the change is just about Helm reading a Release CR, getting some value from it and putting that value into some field of some CR (e.g. the App version in `.spec.version`, the HelmRelease version in `.spec.chart.spec.version`, etc.).
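A sketch of what that template logic could look like, using Helm's upstream `lookup` function. The Release naming scheme, field paths (`.spec.components` as a list of name/version entries) and the emitted key are assumptions for illustration, not the actual cluster chart code:

```yaml
{{/* Sketch: resolve the Release CR for the configured release version. */}}
{{- $releaseName := printf "aws-%s" .Values.global.release.version }}
{{- $release := lookup "release.giantswarm.io/v1alpha1" "Release" "" $releaseName }}
{{- if not $release }}
{{- fail (printf "Release %s not found" $releaseName) }}
{{- end }}
{{/* Assumed shape: .spec.components is a list of {name, version} entries. */}}
{{- range $release.spec.components }}
{{- if eq .name "kubernetes" }}
kubernetesVersion: {{ .version }}
{{- end }}
{{- end }}
```

Note that `lookup` returns an empty dict when run with `helm template` (no cluster access), which is why the `fail` guard only makes sense for in-cluster rendering.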
- Kubernetes version MUST BE specified as a `.spec.components` entry,
- Flatcar version and image variant MUST BE specified as `.spec.components` entries,
Is this purely informational or is it used to decide what VM image to use? As we also bake Teleport agent into the node image does that also need to be included here?
It decides which image to use.
The idea is to get all versions of all apps / components from the Release, so:
- Kubernetes version
- OS version
- info needed for the whole image name, including the variant number that changes when we get e.g. a new Teleport version baked in
  - alternatively we can also put the entire image name, so it's more flexible (e.g. cluster-aws currently allows overriding the image name with a free-form string, or it looks up the image according to a defined format, which is what we usually use and what I also did here)
- app versions for apps that are deployed as App CRs
- app versions for apps that are deployed as HelmRelease CRs
- Kubernetes version MUST BE specified as a `.spec.components` entry,
- Flatcar version and image variant MUST BE specified as `.spec.components` entries,
- cluster-$provider app version MUST BE specified as a `.spec.components` entry.
What about versions of CAPx controllers? Do they have any bearing on a Release? E.g. does a specific release need to use a specific version of the CAPI and CAPA controllers, as changes between versions could affect the behaviour of how the WCs are created?
The CAPI & CAPA controllers are not part of the release because we can only run one set on a management cluster.
I thought the watch label allowed for running multiple on the same MC and each would deal with different resources?
What about versions of CAPx controllers? Do they have any bearing on a Release?

No.

E.g. does a specific release need to use a specific version of the CAPI and CAPA controllers, as changes between versions could affect the behaviour of how the WCs are created?

While changes in both CAPI and CAPA controllers should always be backward compatible, and you should always be able to use the same manifests even with newer CAPI/CAPA controllers, it is true that with a significantly newer CAPI/CAPA controller, even when you create a cluster with the same release, it might get created slightly differently compared to e.g. 3 months ago with older CAPI/CAPA controllers. This is true also today with cluster-$provider apps (for all providers), and it does not change with this RFC. IIRC this is very much intended, because we want to be able to update MC controllers continuously (mostly for the sake of fixes).
The CAPI & CAPA controllers are not part of the release because we can only run one set on a management cluster.
Yes, exactly this.
I thought the watch label allowed for running multiple on the same MC and each would deal with different resources?
It did, but IIRC when we implemented the watch label, it was back when we wanted to run CAPI/CAPx controllers on the existing vintage MCs, so that vintage MCs could run CAPI/CAPx controllers and deploy CAPI WCs. AFAIK we have abandoned that idea, and migration assumes that WCs are migrated to new CAPI-only MCs.
```yaml
kind: App
metadata:
  labels:
    app-operator.giantswarm.io/version: 0.0.0
```
Is this needed?
No idea, we render this in the App CR for the cluster app (I think it is needed for some reason). I just added what we already have.
```yaml
    configMap:
      name: mycluster-userconfig
      namespace: org-mycompany
  version: ""
```
Will this play nicely with GitOps? Do both Flux and Argo allow this to be empty in git and not try to re-apply it as empty after the mutation has happened?
AFAIK an empty field should not be a problem for Flux. We already leave some fields empty for some resources in our gitops setup.
Will check this though, thanks!
- after the cluster-$provider app is applied successfully, in order to render all templates and set the app version, catalog and depends-on properties, Helm will do the following:
  - it will look up the App resource (via the Helm `lookup` function) and read the release version,
  - then it will look up the Release resource and read the app version, catalog and depends-on properties from there, and
  - finally it will render all templates and apply them.
(Thinking out loud...) I wonder if it might be better / easier to instead have a ConfigMap deployed with each Release that acts like a values file that can be used as an overlay for each cluster-$provider App instead of needing to do multiple helm lookups.
+1, not sure if this is upstream Helm functionality and if it is supported by all Helm implementations (e.g. within Flux, Argo, ...).
I wonder if it might be better / easier to instead have a ConfigMap deployed with each Release that acts like a values file that can be used as an overlay for each cluster-$provider App instead of needing to do multiple helm lookups.
Better? Not sure, maybe yes, maybe not.
Easier? I don't think so, at least not now, because we already have processes, workflows and tooling for the releases repo and Release CRs, and one of the ideas here is to make use of at least some of that; otherwise we are taking more than a few steps back, as we would be dropping tooling and processes that have been working for many years.
Another downside of the ConfigMap idea and using it as an overlay is that k8s/OS/app versions are then Helm values again, and we don't want that, because it makes k8s/OS/app versions part of the workload cluster creation API, like they're just config.

not sure if this is upstream helm functionality

The Helm `lookup` function (which is basically a k8s client) is a regular upstream Helm function (https://helm.sh/docs/chart_template_guide/functions_and_pipelines/#using-the-lookup-function); we already use it in some places in cluster-$provider apps.
In both cases there is a single release identifier, just defined in different places.

Versioning logic, i.e. what and how can change in a major, minor or a patch version, would be the same in both cases.
##### 4.2.2.2. Scenario 1: new provider-independent app patch

Now in the case above, let’s say we want to release a new patch version for some provider-independent app, and we have to do that in 5 providers, and for 2 major releases for every provider. The process would look like the following:
- Renovate updates the version of the app in 2 git branches in the cluster chart. Here we have 2 pull requests, both created by Renovate.
For reference as I wasn't aware - you can specify multiple base branches in Renovate, that's very nice! 😁 https://docs.renovatebot.com/configuration-options/#basebranches
Yeah, if we keep the current way, then we would really have to do that. You still have PRs though.
And Renovate can't help if you have to cherry-pick entire changes/fixes/features to older release branches, and when you have to deal with git conflicts.
- For all 5 providers, Renovate opens PRs for the last 2 major versions. Here we have 10 PRs opened by Renovate.
- We release new patch versions for the last two major releases of all cluster-$provider apps. Here we have 10 release pull requests.

In total, in the above scenario, we have 12 Renovate PRs and 12 release PRs, so 24 PRs in total. And all these PRs are across multiple components owned by multiple teams, meaning the some actions should be taken by people from multiple teams.
Suggested change: "…meaning the some actions should be taken…" → "…meaning that some actions should be taken…"
Changes to the current e2e tests would be minimal initially.
- Testing of the cluster-$provider app would require one addition, which would be:
  - Looking up the latest release and setting the release label on the App CR.
- Like today, cluster-test-suites would use a custom version of the cluster-$provider app from the branch it is testing (in which case the mutating webhook would not set the cluster-$provider app version based on the Release CR).
How would the webhook know not to trigger?
A mutating webhook sets `.spec.version` only if it is not set already, which is useful and even needed for testing. Here is the code.
In addition to the mutating webhook, we would probably need a validating webhook, to check the cluster App `.spec.version` against the version of the `cluster-$provider` app in the Release (they must either be the same, or `.spec.version` is a dev build on top of the version that we have in the Release, to allow for overriding during testing).
A mutating webhook sets .spec.version only if it is not set already, which is useful and even needed for testing.
Gotta fix this so it also updates `.spec.version` when the release label is updated, except when we want to override the cluster-aws version for testing (which we can signal with a Helm value; that should be fine, as we already have some testing-related ephemeral values).
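The behaviour discussed here (mutate only when `.spec.version` is empty, so test runs can pin a dev build explicitly) can be sketched as a tiny decision function. The function name and shape are illustrative, not the actual app-admission-controller code:

```go
package main

import "fmt"

// desiredAppVersion returns the version the admission webhook should set
// on a cluster App: keep an explicitly set .spec.version (e.g. a dev build
// under test), otherwise fall back to the version from the Release CR.
func desiredAppVersion(specVersion, releaseComponentVersion string) string {
	if specVersion != "" {
		// Already set, e.g. a dev build pinned by cluster-test-suites.
		return specVersion
	}
	return releaseComponentVersion
}

func main() {
	fmt.Println(desiredAppVersion("", "1.2.3"))           // resolved from the Release
	fmt.Println(desiredAppVersion("1.3.0-dev1", "1.2.3")) // explicit override wins
}
```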
- Looking up the latest release and setting the release label on the App CR.
- Like today, cluster-test-suites would use a custom version of the cluster-$provider app from the branch it is testing (in which case the mutating webhook would not set the cluster-$provider app version based on the Release CR).
- The existing app testing, where app versions are overridden via ephemeral Helm values, would continue working as is.
- As a future improvement, cluster-test-suites could run all tests not only for the latest release, but for all releases which use the same major cluster-$provider app version.
Tests would also now need to be triggered from the `releases` repo if teams are able to bump app versions there. Each of those draft PRs for new releases needs to be tested. This would require a completely new pipeline in Tekton.
Yes, I had that on my mind, just forgot to write it, thanks for the reminder, will add that as well.
Since app and component versions are embedded in the cluster-$provider app, we cannot combine different versions of different apps in different ways (unless we make app and component versions configurable, which then opens the door to a whole new set of issues). We cannot have another release model where e.g. versions of Kubernetes, OS, CNI and CPI are a part of the release, and all other apps are always at their latest versions. Or a release model where we use LTS releases of the OS, CPI, CNI and other apps that provide an LTS release (or something similar).

OTOH with the releases repository, we can easily create different directories for different release models, where we can combine the versions of apps and components in whatever way, and where we can update different release models with different frequency and with different rules. Therefore it would be relatively easy to have one release model where app versions are not even part of the release, or are decoupled and continuously updated. Or another “slower” release model with long-term support, where apps and components are on their LTS versions and are updated more slowly.
I don't agree that this is "easy" in any way. It adds quite a bit of complexity to release management (what PRs need to be created when in what order), testing (each variation adds another suite of tests that need to be run each time) and around upgrades (can a cluster move to a different release channel? Can they then move back?)
I would guess that handling such things in folder structures is easier to implement and gives more flexibility than branches do, although with branches you can rely on existing git tooling that does sometimes already support such features (as many projects out there do work with branches).
Maybe at least for the simple cases that we are seeing already (releasing a new WC version based on a new app release or k8s/OS release), we could already make the process clear and think of how we would automate this.
That can however be separate from this RFC in general
We shouldn't add more complexity/features than what we offered with vintage releases. It will be very hard to implement and understand. Better to bring CAPI releases on par with vintage releases, and then see how it goes.
I've left a bunch of comments on the PR but I also have a few general questions / observations:
We have noticed that we miss a few aspects of the old releases, like a single release identifier, and being able to easily and clearly see which versions of which apps are part of some release and which versions of which apps should be deployed to a workload cluster.

We also miss a versioning scheme where it’s clear what we promise and what you can expect in a patch, minor or major release upgrade, which, although not strictly defined, was mostly clear for releases of the vintage product. And, equally important, we lack a mechanism to enforce this behaviour.
We have https://docs.giantswarm.io/vintage/platform-overview/cluster-management/releases/#versioning-conventions defining our convention in our docs for vintage. And I'd say we should not diverge too much from that definition going forward. Could make the breaking vs non-breaking distinction clearer in there, maybe.
## 2. Requirements language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119] [RFC8174].
View the "rich diff" and you'll see the links working.
Our product is being developed for multiple providers, where we currently support 5 of them: AWS (CAPA), EKS (CAPA), Azure (CAPZ), vSphere (CAPV), VMware Cloud Director (CAPVCD). Hopefully we will work with more providers soon (e.g. GCP/GKE).

We also need to support multiple major versions across those providers. Most managed Kubernetes products support all community-supported Kubernetes versions, meaning the latest 3 minor versions. In our case that would mean 3 major versions, and that is per provider.
Officially we support only 2; anything beyond that is currently an exception.
Assuming we would like to support at least 2 major versions per provider, and that we work with 5 providers, we would regularly support 10 major versions across all providers. With 6 providers and the 3 latest major releases, this number grows to 18 major releases across all providers.

Additionally, it might be possible that for every major release, we would support more than 1 minor in some cases (e.g. the last 2). Therefore, when there is a new patch version of some app (e.g. a security fix), we can easily be in a situation where we have to create a double-digit number of new patch releases in order to patch all affected minor releases.
Again, this is not officially supported in our guidelines, which state that each major is kept supported on its latest minor and patch only; anything else would be an exception (especially as minors and patches should not be breaking, and thus should not have blockers towards an upgrade to the latest minor and patch).
- provider-independent and provider-specific default configuration of apps,
- configuration of the operating system and different node components, such as systemd, containerd, etc.

Multiple teams and multiple people are continuously working on all of the above, and it is indispensable to ensure that all of them have a smooth and frictionless development, testing and release experience, so we can increase deployment frequency, reduce lead time for changes and reduce change failure rate. For this to work, we need to be able to develop, test and release almost every change independently of almost all other changes. The team that worked on a change should be able to release the change fully independently, without any intervention from the provider-integration or provider-independent KaaS teams.
❤️
That’s it - 1 PR. 1 PR for any number of releases across any number of providers.
We can also maintain a draft PR in the releases repository where we automatically create next draft releases for all providers, and then Renovate can bump version numbers in those draft releases, and we just decide when to cut the new release (and name it appropriately). |
Ideally it should be so cheap to cut a new release that we just do it on any change and not batch changes within versions, batching will happen naturally by consecutively releasing patches, but at least we'd have atomic updates, which could be tested independently.
components:
  - name: cluster-aws
    version: 1.0.0
  - name: flatcar
    version: 3815.2.2
  - name: flatcar-variant
    version: 3.0.0
  - name: kubernetes
    version: 1.25.16
so this would mean for an OS or K8s upgrade we would not need to release any new app/chart, but just bump the version here (and have the image available), right?
I'm not sure that'd work technically. Something would need to trigger the chart to look up the new versions and change the image reference in the rendered templates.
I just meant that with this proposal we would not need to release a new chart in many cases; only changes that need new values wired up would require a new chart.
> so this would mean for an OS or K8s upgrade we would not need to release any new app/chart
Yes, that is exactly the idea: when you need a new OS/k8s/app version, you don't have to touch the cluster-$provider app repo; no new version of it is needed.
Then the only scenario where cluster-$provider app changes would be needed is when some new k8s/OS/app version requires changes in some config, but this is usually rare (and probably means it is a breaking change). We can see how to improve this scenario later; e.g. if needed, we could decouple even the app config from the cluster-$provider app and take it out, but I don't think this is needed (at least not now).
> but just bump the version here (and have the image available), right?
Since releases are immutable, we would create a new Release with the new versions. Then we would bump the release version for the cluster.
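As a purely illustrative sketch of what "create a new Release with the new versions" could look like (the API group, release naming scheme and exact fields below are assumptions for illustration, not decisions made in this thread), a Kubernetes patch would add one new immutable Release manifest next to the existing one:

```yaml
# Hypothetical new Release created for a Kubernetes patch.
# The previous release (e.g. aws-25.0.0) stays untouched; clusters
# opt in by bumping their release version.
apiVersion: release.giantswarm.io/v1alpha1   # assumed group/version
kind: Release
metadata:
  name: aws-25.0.1
spec:
  components:
    - name: cluster-aws
      version: 1.0.0        # unchanged
    - name: flatcar
      version: 3815.2.2     # unchanged
    - name: kubernetes
      version: 1.25.17      # bumped patch version
```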
Sure, I did mean bumping and releasing as a new version, not bumping inside an existing version.
The current cluster-$provider releases are attractive thanks to their simplicity and straightforwardness. However, they have not been put to the test yet, as our customers are still not using the new product for their production workloads.
As soon as our customers move from our vintage product to Cluster API, we will need multiple major releases. Once the customers are using the new product more, they will put it to the test and we’ll have even more work on it compared to today, so there will be even more need for patch and minor releases across all providers.
This leaves me wondering whether we have a plan and a timeline for how quickly we can move to this model, as we already have a timeline for when customers will be moving, and we also need a plan for how to move customers that are already using the existing system. cc @alex-dabija
What about right away? 🙈 😬
Required changes are very small; I have a working POC already, with changes in the cluster chart, cluster-aws and app-admission-controller.
I have added an agenda item for this week's SIG Architecture to show it there.
We will definitely need to also look at how upgrades are handled, but generally good if we can move fast.
Upgrading existing clusters would take just two changes (slightly different compared to a regular version bump):
- Remove spec.version in the cluster-$provider app manifest.
- Set the release version.
Then for clusters on the new release system (with Release CRs) it's just a regular version bump, like for the cluster-aws version.
Huge thanks to @puja108, @AverageMarcus and @pipo02mix for the review! Will reply to all of your comments.
With the releases repository, release identifiers would be defined in the repository itself, in the same way as for vintage releases. In this process the cluster-$provider app is one of the release components.
In both cases there is a single release identifier, just defined in different places. |
What do you mean by different places? 🤔
Currently the release identifier (release version) is the cluster-$provider app release, so you can look up a release e.g. in the cluster-aws repo releases.
With the releases repo, you would look up a release in the releases repo itself (as the cluster-$provider release becomes just one part of the overall release).
- https://github.com/orgs/giantswarm/teams/sig-architecture | ||
- https://github.com/orgs/giantswarm/teams/team-turtles | ||
state: approved | ||
summary: Where and how we implement releases for workload clusters. |
This must summarize what the RFC proposes. Right now, it's very vague and not helpful in the RFC overview table.
summary: Where and how we implement releases for workload clusters. | ||
--- | ||
# Releases implementation |
# Releases implementation
# Cluster releases implementation |
#### 4.2.1. Release identifier and versioning | ||
With the current cluster-$provider app release process, the cluster-$provider app version is the release identifier.
Please replace all occurrences of "current", "old", "new", "today" etc. with absolute terms (e.g. "CAPI release for AWS", "Vintage release", "as of 2024-04", "before introduction of this release model", ...). This document must still make sense to readers next month and next year.
Since app and component versions are embedded in the cluster-$provider app, we cannot combine different versions of different apps in different ways (unless we make app and component versions configurable, which then opens the door to a whole new set of issues). We cannot have another release model where e.g. versions of Kubernetes, OS, CNI and CPI are part of the release and all other apps are always at their latest versions. Or a release model where we use LTS releases of the OS, CPI, CNI and other apps that provide an LTS release (or something similar).
On the other hand, with the releases repository, we can easily create different directories for different release models, where we can combine the versions of apps and components in whatever way we want, and where we can update different release models with different frequency and different rules. Therefore it would be relatively easy to have one release model where app versions are not even part of the release, or are decoupled and continuously updated. Or another, "slower" release model with long-term support, where apps and components are on their LTS versions and are updated more slowly.
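To illustrate (the directory names and version numbers below are invented for this sketch, not proposed anywhere in this document), a default and an LTS release model could live side by side in the releases repository:

```yaml
# Hypothetical repository layout for two release models:
#
#   aws/v25.0.0/release.yaml        # default model: latest versions
#   aws-lts/v24.0.3/release.yaml    # LTS model: slower-moving versions
#
# An LTS release would simply pin components to LTS versions, e.g.:
components:
  - name: kubernetes
    version: 1.24.17    # hypothetical LTS version
  - name: flatcar
    version: 3510.2.8   # hypothetical LTS version
```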
We shouldn't add more complexity/feature than what we offered with Vintage releases. It will be very hard to implement and understand. Better to bring CAPI releases on par with vintage releases, and then see how it goes.
Nothing more to add, other comments already cover my questions.
Some questions that have come to mind while working on the tests:
Helm rendering fails, so you cannot apply your cluster manifests. This is the error ATM (but it could be nicer, will improve it):
The idea is to add a finalizer to the Release CR for every cluster, in the cluster chart's post-create and post-upgrade hooks, and to delete it in the post-delete hook. This way you cannot accidentally delete a Release CR that is in use.
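A sketch of what that in-use protection could look like on a Release resource (the API group and especially the finalizer naming are assumptions, not something this thread settled on):

```yaml
# Hypothetical Release with one finalizer per cluster that uses it;
# added by the cluster chart's post-create/post-upgrade hooks and
# removed by its post-delete hook, so an in-use Release cannot be
# deleted accidentally.
apiVersion: release.giantswarm.io/v1alpha1   # assumed group/version
kind: Release
metadata:
  name: aws-25.0.0
  finalizers:
    - release.giantswarm.io/cluster-mycluster   # hypothetical name
```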
We should, but we're not doing it yet.
Good idea, thanks!
IMO releases should behave the same way across all providers, with only apps/components being different. Do you have some use case in mind for a different Release CRD per provider?
👍 Can we include this in the RFC then? :)
I was just thinking about how to easily get the latest available release for a provider. If we label the CRs then that solves the problem.
I don't think this is a good idea. The process for creating a workload cluster is complex enough as it currently is, and we are just adding more indirection and making it more complex and harder to reason about.

I'm not sure the number of PRs that we need to deal with when updating an app is a good metric to measure the approach by. Some of the PRs are automatically created by Renovate, and I'd invest time in automatically running tests and merging these PRs if the tests were successful. Also, when upgrading something like the Kubernetes version or Flatcar version, it may be just changing the version number in the

Having to maintain several branches will be challenging. But then again, I'd invest time in improving our tooling to deal with branches (keeping them up to date, cherry-picking changes automatically, etc.). Many open source projects already have automation around this, so it shouldn't be a huge effort. And we would get benefits not only for managing clusters, but for managing all our repositories that may need several branches.
@nprokopic what's the status here? i feel like we're doing this now :D
Yeah, we're definitely doing this, and it's still heavily in progress in Turtles (we moved CAPA and CAPZ to new releases, soon CAPV), which is why I didn't get around to picking up the RFC again and finishing this 🙈
Below are excerpts from a couple of sections.
TODOs:
## 1. Introduction

This RFC defines how we implement releases for workload clusters for the new KaaS product that is based on the Cluster API project. It covers multiple aspects of releases, such as creation, testing and delivery of releases. It does not cover how clusters are upgraded to new releases.

## 4. Proposal

This section defines how releases are implemented: where they are, and how they are developed, tested and delivered. It then compares the proposed solution to the current one that we have for our new KaaS product that is based on Cluster API.

### 4.1. Implementation of releases

This document proposes to continue using the `giantswarm/releases` repository for creating and delivering releases, albeit in a simplified way when compared to the vintage releases. Release resources are used in a minimal way, and cluster-$provider apps continue to be used for deploying workload clusters.

Briefly put, cluster-$provider apps are still deployed in almost exactly the same way, with the following few differences:

- Instead of specifying `.spec.version`, we MUST specify the `release.giantswarm.io/version` label.

The core of this proposal is the idea to decouple the app and component versions from the cluster-$provider apps, which will then enable us to have a very scalable process for working with releases, where we can easily create and manage many releases across many providers and develop mechanisms to enforce business logic across all of those.