Kubernetes docs #188
Merged
21 commits
cb39104 Add Kubernetes Updates docs (eliaoggian)
23e471e Add Kubernetes cluster docs (eliaoggian)
4acd1aa Add Kubernetes Updates docs (eliaoggian)
2b66f84 Update docs/kubernetes/clusters.md (eliaoggian)
4248481 Update docs/kubernetes/clusters.md (eliaoggian)
ff673fe Update docs/kubernetes/clusters.md (eliaoggian)
1c4fbd7 Update docs/kubernetes/clusters.md (eliaoggian)
9d7ed89 Update docs/kubernetes/clusters.md (eliaoggian)
f34b986 Update docs/kubernetes/clusters.md (eliaoggian)
5a29ac5 Update docs/kubernetes/kubernetes-upgrades.md (eliaoggian)
115c9b4 Fix docs based on review (eliaoggian)
ddeffbc Merge changes (eliaoggian)
7ec5d89 Improve docs. Add spelling config. Add CODEOWNERS (eliaoggian)
c5f5d65 sort allowed words (eliaoggian)
1cb1877 Merge branch 'main' into kubernetes-docs (eliaoggian)
7ddcc1e Add Kubernetes to the list of services (eliaoggian)
482a6b6 Update docs/services/kubernetes/clusters.md (eliaoggian)
84c6942 Update docs/services/kubernetes/index.md (eliaoggian)
80a8a3f Update .github/actions/spelling/allow.txt (eliaoggian)
6c0c893 Apply requested changes. Remove Emojis from headers. (eliaoggian)
6f2dd34 Merge branch 'main' into kubernetes-docs (bcumming)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,221 @@ | ||
| [](){#ref-kubernetes-clusters} | ||
| # CSCS Kubernetes clusters | ||
|
|
||
| This document provides an overview of the Kubernetes clusters maintained by CSCS and offers step-by-step instructions for accessing and interacting with them. | ||
|
|
||
| ## Architecture | ||
|
|
||
| All Kubernetes clusters at CSCS are: | ||
|
|
||
| - Managed using **[Rancher](https://www.rancher.com)** | ||
| - Running **[RKE2 (Rancher Kubernetes Engine 2)](https://github.com/rancher/rke2)** | ||
|
|
||
| CSCS offers two types of Kubernetes clusters for partners: | ||
|
|
||
| - **Harvester-only clusters**: These clusters run exclusively on virtual machines provisioned by Harvester (SUSE Virtualization), providing a flexible and isolated environment suitable for most workloads. | ||
| - **Alpernetes clusters**: These clusters combine Harvester VMs with compute nodes from the Alps supercomputer. This hybrid setup, called *Alpernetes*, enables workloads to leverage both virtualized infrastructure and high-performance computing resources within the same Kubernetes environment. | ||
|
|
||
| ## Cluster Environments | ||
|
|
||
| Clusters are grouped into two main environments: | ||
|
|
||
| - **TDS** – Test and Development Systems | ||
| - **PROD** – Production | ||
|
|
||
| See [Kubernetes upgrades][ref-kubernetes-clusters-upgrades] for the detailed upgrade policy. | ||
|
|
||
| ## Kubernetes API Access | ||
|
|
||
| You can access the Kubernetes API in two main ways: | ||
|
|
||
| ### Direct Internet Access | ||
|
|
||
| - A Virtual IP is exposed for the API server. | ||
| - Access is restricted to the partner's source IP addresses. | ||
|
|
||
| ### Access via CSCS Jump Host | ||
|
|
||
| - Connect through a jump host (e.g., `ela.cscs.ch`). | ||
| - API calls are securely proxied through Rancher. | ||
|
|
||
| To check which method you are using, examine the `current-context` in your `kubeconfig` file. | ||
|
|
||
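As a rough illustration of where `current-context` lives, the fragment below sketches a kubeconfig with two contexts, one per access method. The context, cluster, and user names and both server URLs are hypothetical placeholders, so the file issued for your cluster will look different; only the overall layout follows the standard kubeconfig format.

```yaml
# Sketch of a kubeconfig; all names and URLs below are hypothetical placeholders.
apiVersion: v1
kind: Config
clusters:
- name: mycluster-direct            # assumed name: API server reached via its virtual IP
  cluster:
    server: https://api.mycluster.example:6443
- name: mycluster-rancher           # assumed name: API calls proxied through Rancher
  cluster:
    server: https://rancher.example/k8s/clusters/c-xxxxx
contexts:
- name: mycluster-direct
  context:
    cluster: mycluster-direct
    user: mycluster-user
- name: mycluster-rancher
  context:
    cluster: mycluster-rancher
    user: mycluster-user
# The context named here is the one kubectl uses by default.
current-context: mycluster-rancher
users:
- name: mycluster-user
  user:
    token: REDACTED
```

`kubectl config current-context` prints the active context, and `kubectl config use-context <name>` switches to another context defined in the same file.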
| ## Cluster Access | ||
|
|
||
| To interact with the cluster, you need the `kubectl` CLI: | ||
| 🔗 [Install kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) | ||
| ??? Note "`kubectl` is pre-installed on the CSCS jump host." | ||
|
|
||
|
|
||
| ### Retrieve your kubeconfig file | ||
|
|
||
| #### Internal CSCS Users | ||
| Access [Rancher](https://rancher.cscs.ch) and download the kubeconfig for your cluster. | ||
|
|
||
| #### External Users | ||
| A specific Rancher user and password should have been provided to the partner. | ||
|
|
||
| Use the `kcscs` tool installed on `ela.cscs.ch` to obtain the kubeconfig by following the steps below. | ||
|
|
||
| Download your SSH keys from [SSH Service](https://sshservice.cscs.ch) (and add them to the SSH agent). | ||
|
|
||
| SSH to the jump host using the downloaded SSH keys | ||
| ```bash | ||
| ssh ela.cscs.ch | ||
| ``` | ||
|
|
||
| Log in with `kcscs` using the provided Rancher credentials | ||
| ```bash | ||
| kcscs login | ||
| ``` | ||
|
|
||
| List the accessible clusters | ||
| ```bash | ||
| kcscs list | ||
| ``` | ||
|
|
||
| Retrieve the kubeconfig file for a specific cluster | ||
| ```bash | ||
| kcscs get | ||
| ``` | ||
|
|
||
|
|
||
| ### Store the kubeconfig file | ||
|
|
||
| ```bash | ||
| mv mykubeconfig.yaml ~/.kube/config | ||
| ``` | ||
| or | ||
| ```bash | ||
| export KUBECONFIG=/home/user/kubeconfig.yaml | ||
| ``` | ||
|
|
||
| ### Test connectivity | ||
| ```bash | ||
| kubectl get nodes | ||
| ``` | ||
|
|
||
| !!! warning | ||
|     The kubeconfig file contains credentials. Keep it secure. | ||
|
|
||
| ## Pre-installed Applications | ||
|
|
||
| All CSCS-provided clusters include a set of pre-installed tools and components, described below: | ||
|
|
||
| ### `ceph-csi` | ||
|
|
||
| Provides dynamic persistent volume provisioning via the Ceph Container Storage Interface (Ceph CSI) driver. | ||
|
|
||
| #### Storage Classes | ||
|
|
||
| - `cephfs` – ReadWriteMany (RWX), backed by HDD (large data volumes) | ||
| - `rbd-hdd` – ReadWriteOnce (RWO), backed by HDD | ||
| - `rbd-nvme` – RWO, backed by NVMe (high-performance workloads like databases) | ||
| - `*-retain` – Same classes, but retain the volume after PVC deletion | ||
|
|
||
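To make the storage classes above concrete, here is a minimal sketch of a claim against `rbd-nvme`. The claim name, namespace, and requested size are hypothetical; only the storage class name and its RWO access mode come from the list above.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data              # hypothetical claim name
  namespace: myapp           # hypothetical namespace
spec:
  storageClassName: rbd-nvme # NVMe-backed class from the list above
  accessModes:
  - ReadWriteOnce            # rbd-* classes are RWO
  resources:
    requests:
      storage: 10Gi          # illustrative size
```

For a volume shared by several pods, the same manifest would use `storageClassName: cephfs` with the `ReadWriteMany` access mode. Choosing the matching `*-retain` class keeps the underlying volume after the claim is deleted.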
| ### `external-dns` | ||
|
|
||
| Automatically manages DNS entries for: | ||
|
|
||
| - Ingress resources | ||
| - Services of type `LoadBalancer` (when annotated) | ||
|
|
||
| #### Example | ||
| ```bash | ||
| kubectl annotate service nginx "external-dns.alpha.kubernetes.io/hostname=nginx.mycluster.tds.cscs.ch." | ||
| ``` | ||
|
|
||
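The same annotation can also be set declaratively in the Service manifest, which is usually easier to track in version control. The sketch below assumes a Service named `nginx` selecting pods labelled `app: nginx`; the label and port numbers are illustrative, not taken from the CSCS setup.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx
  annotations:
    # Same annotation as in the kubectl command above.
    external-dns.alpha.kubernetes.io/hostname: nginx.mycluster.tds.cscs.ch.
spec:
  type: LoadBalancer         # external IP is assigned by MetalLB
  selector:
    app: nginx               # assumed pod label
  ports:
  - name: http
    protocol: TCP
    port: 80
    targetPort: 80
```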
| !!! Note "Use a valid name under the configured subdomain" | ||
|
|
||
| 🔗 [external-dns documentation](https://github.com/kubernetes-sigs/external-dns) | ||
|
|
||
| ### `cert-manager` | ||
|
|
||
| Handles automatic issuance of TLS certificates from Let's Encrypt. | ||
|
|
||
| #### Example | ||
| ```yaml | ||
| apiVersion: cert-manager.io/v1 | ||
| kind: Certificate | ||
| metadata: | ||
|   name: echo | ||
| spec: | ||
|   secretName: echo | ||
|   commonName: echo.mycluster.tds.cscs.ch | ||
|   dnsNames: | ||
|   - echo.mycluster.tds.cscs.ch | ||
|   issuerRef: | ||
|     kind: ClusterIssuer | ||
|     name: letsencrypt | ||
| ``` | ||
|
|
||
| You can also issue certificates automatically via Ingress annotations (see the `ingress-nginx` section). | ||
|
|
||
| 🔗 [cert-manager documentation](https://cert-manager.io) | ||
|
|
||
| ### `metallb` | ||
|
|
||
| Enables `LoadBalancer` service types by assigning public IPs. | ||
|
|
||
| !!! Warning "The public IP pool is limited. Prefer using `Ingress` unless you specifically need a `LoadBalancer` Service for TCP traffic." | ||
|
|
||
| 🔗 [MetalLB documentation](https://metallb.universe.tf) | ||
|
|
||
| ### `ingress-nginx` | ||
|
|
||
| Default Ingress controller with class `nginx`. | ||
| Supports automatic TLS via cert-manager annotations. | ||
|
|
||
| Example: | ||
|
|
||
| ```yaml | ||
| apiVersion: networking.k8s.io/v1 | ||
| kind: Ingress | ||
| metadata: | ||
|   name: myingress | ||
|   namespace: myingress | ||
|   annotations: | ||
|     cert-manager.io/cluster-issuer: letsencrypt | ||
| spec: | ||
|   rules: | ||
|   - host: example.tds.cscs.ch | ||
|     http: | ||
|       paths: | ||
|       - pathType: Prefix | ||
|         path: / | ||
|         backend: | ||
|           service: | ||
|             name: myservice | ||
|             port: | ||
|               number: 80 | ||
|   tls: | ||
|   - hosts: | ||
|     - example.tds.cscs.ch | ||
|     secretName: myingress-cert | ||
| ``` | ||
|
|
||
| 🔗 [ingress-nginx documentation](https://kubernetes.github.io/ingress-nginx/) | ||
| 🔗 [cert-manager Ingress Usage](https://cert-manager.io/docs/usage/ingress/) | ||
|
|
||
| ### `external-secrets` | ||
|
|
||
| Integrates with secret management tools like **HashiCorp Vault**. | ||
|
|
||
| Enables `ExternalSecret` resources, which fetch secrets from the backend referenced by a `SecretStore` or `ClusterSecretStore` and store them as `Secret` objects inside the cluster. | ||
|
|
||
| It helps to avoid storing secrets in the deployment manifests, especially in GitOps environments. | ||
|
|
||
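As a rough sketch of how this fits together, the manifest below asks the operator to copy one value from an external backend into a regular `Secret`. The store name `vault-backend`, the remote path `myapp/db`, and the namespace are hypothetical, and the `apiVersion` depends on the installed external-secrets release (newer releases serve `external-secrets.io/v1`).

```yaml
apiVersion: external-secrets.io/v1beta1   # assumption: may be `v1` on newer releases
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: myapp                # hypothetical namespace
spec:
  refreshInterval: 1h             # how often the value is re-synced
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault-backend           # hypothetical store name
  target:
    name: db-credentials          # Secret created inside the cluster
  data:
  - secretKey: password           # key in the resulting Secret
    remoteRef:
      key: myapp/db               # hypothetical path in the external backend
      property: password
```

Workloads then reference the generated `db-credentials` Secret as usual, so the Git repository only ever contains the `ExternalSecret` and never the secret value itself.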
| 🔗 [external-secrets documentation](https://external-secrets.io/) | ||
|
|
||
| ### `kured` | ||
|
|
||
| Responsible for automatic node reboots (e.g., after kernel updates). | ||
|
|
||
| 🔗 [kured documentation](https://kured.dev/) | ||
|
|
||
| ### Observability | ||
|
|
||
| Includes: | ||
|
|
||
| - **Beats agents** – Export logs and metrics to CSCS’s central log system | ||
| - **Prometheus** – Scrapes metrics and exports them to CSCS's central monitoring cluster |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,35 @@ | ||
| [](){#ref-kubernetes} | ||
| # Kubernetes | ||
|
|
||
| Kubernetes is only available for specific partners. | ||
|
|
||
| !!! Note | ||
|     Kubernetes is not available for normal users on Alps. | ||
|
|
||
| This documentation is designed to help partners who have been granted access to a Kubernetes cluster. | ||
|
|
||
| It explains how clusters are provisioned and maintained, and describes the policies in place for upgrades and updates. | ||
|
|
||
|
|
||
|
|
||
| <div class="grid cards" markdown> | ||
| - :fontawesome-solid-layer-group: __Cluster Architecture__ | ||
|
|
||
| Overview of the CSCS Kubernetes clusters: the main components and how to interact with them. | ||
|
|
||
| [:octicons-arrow-right-24: Clusters][ref-kubernetes-clusters] | ||
|
|
||
| - :fontawesome-solid-arrow-up-from-bracket: __Kubernetes Upgrades__ | ||
|
|
||
| Kubernetes Cluster upgrade policy (Kubernetes version upgrades) | ||
|
|
||
| [:octicons-arrow-right-24: Kubernetes Upgrades][ref-kubernetes-clusters-upgrades] | ||
|
|
||
| - :fontawesome-solid-shield-halved: __Node Updates__ | ||
|
|
||
| Cluster Nodes OS update policy (Regular Node Security Updates) | ||
|
|
||
| [:octicons-arrow-right-24: Node OS Updates][ref-kubernetes-node-updates] | ||
|
|
||
| </div> | ||
|
|
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,40 @@ | ||
| [](){#ref-kubernetes-clusters-upgrades} | ||
| # Kubernetes Cluster Upgrade Policy | ||
|
|
||
| To maintain a secure, stable, and supported platform, we regularly upgrade our Kubernetes clusters. We use **[RKE2](https://docs.rke2.io/)** as our Kubernetes distribution. | ||
|
|
||
| ## Upgrade Flow | ||
|
|
||
| **Phased Rollout** | ||
|
|
||
| - Upgrades are first applied to **TDS clusters** (Test and Development Systems). | ||
| - After a **minimum of 2 weeks**, if no critical issues are observed, the same upgrade will be applied to **PROD clusters**. | ||
|
|
||
| **No Fixed Schedule** | ||
|
|
||
| - Upgrades are not done on a strict calendar basis. | ||
| - Timing may depend on compatibility with **other infrastructure components** (e.g., storage, CNI plugins, monitoring tools). | ||
| - However, all clusters will be upgraded **before the current Kubernetes version reaches End of Life (EOL)**. | ||
|
|
||
| ## Upgrade Impact | ||
|
|
||
| The **impact of a Kubernetes upgrade can vary**, depending on the nature of the changes involved: | ||
|
|
||
| **Minimal Impact** | ||
|
|
||
| - For example, upgrades that affect only the `kubelet` may be **transparent to workloads**. | ||
| - Rolling restarts may occur, but no downtime is expected for well-configured applications. | ||
|
|
||
| **Potentially Disruptive** | ||
|
|
||
| - Upgrades involving components such as the **CNI (Container Network Interface)** may cause **temporary network interruptions**. | ||
| - Other control plane or critical component updates might cause short-lived disruption to scheduling or connectivity. | ||
|
|
||
| ??? Note "Applications that follow cloud-native best practices (e.g., readiness probes, multiple replicas, graceful shutdown handling) are **less likely to be impacted** by upgrades." | ||
|
|
||
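One common way to back up the practices named in the note is a `PodDisruptionBudget`, which limits how many replicas can be evicted at once when a node is drained. The sketch below assumes a workload labelled `app: myapp` running at least two replicas; the names are hypothetical, and the document does not state how drains are performed during upgrades, so treat this as a general Kubernetes precaution rather than a CSCS requirement.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb            # hypothetical name
spec:
  minAvailable: 1            # keep at least one replica running during evictions
  selector:
    matchLabels:
      app: myapp             # assumed pod label
```

With a single replica, `minAvailable: 1` would block eviction of that pod entirely, so a budget like this only makes sense together with multiple replicas.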
| ## What You Can Expect | ||
|
|
||
| - Upgrades are performed using safe, tested procedures with minimal risk to production workloads. | ||
| - TDS clusters serve as a **canary environment**, allowing us to identify issues early. | ||
| - All clusters are kept **aligned with supported Kubernetes versions**. | ||
|
|
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,46 @@ | ||
| [](){#ref-kubernetes-node-updates} | ||
| # Kubernetes Nodes OS Update Policy | ||
|
|
||
| To ensure the **security** and **stability** of our infrastructure, CSCS will perform **monthly OS updates** on all nodes of our Kubernetes clusters. | ||
|
|
||
| ## Maintenance Schedule | ||
|
|
||
| - **Frequency**: The **first week of every month** | ||
| - **Reboot Window**: **Monday to Friday**, between **09:00 and 15:00** | ||
| - **Time Zone**: Europe/Zurich | ||
|
|
||
| These updates include important security patches and system updates for the operating systems of cluster nodes. | ||
|
|
||
| ??? Note "Nodes will be rebooted only if required by the updates." | ||
|
|
||
| ## Urgent Security Patches | ||
|
|
||
| In the event of a **critical zero-day vulnerability**, we will apply patches and perform reboots (if required) **as soon as possible**, outside of the regular update schedule if needed. | ||
|
|
||
| - Affected nodes will be updated **immediately** to protect the platform. | ||
| - Users will be notified ahead of time **when possible**. | ||
| - Standard safety and rolling reboot practices will still be followed. | ||
|
|
||
| ## Reboot Management with Kured | ||
|
|
||
| We use [**Kured** (KUbernetes REboot Daemon)](https://github.com/kubereboot/kured) to safely automate the reboot process. Kured ensures that: | ||
|
|
||
| - Reboots are triggered **only when necessary** (e.g., after kernel updates). | ||
| - Nodes are rebooted **one at a time** to avoid service disruption. | ||
| - Reboots occur **only during the defined window**. | ||
| - Nodes are **cordoned**, **drained**, and **gracefully reintegrated** after reboot. | ||
|
|
||
| ## Application Requirements | ||
|
|
||
| To avoid service disruption during node maintenance, applications **must be designed for high availability**. Specifically: | ||
|
|
||
| - Use **multiple replicas** spread across nodes. | ||
| - Follow **cloud-native best practices**, including: | ||
|     - Proper **readiness** and **liveness probes** | ||
|     - **Graceful shutdown** support | ||
|     - **Stateless design** or resilient handling of state | ||
|     - Appropriate **resource requests and limits** | ||
|
|
||
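The requirements above can be sketched as a single Deployment. Everything specific here is an assumption for illustration: the image, port, probe paths, and resource figures are placeholders, and the `preStop` hook assumes the container image ships a `sleep` binary.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp                          # hypothetical name
spec:
  replicas: 3                          # multiple replicas
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      terminationGracePeriodSeconds: 30   # time allowed for graceful shutdown
      topologySpreadConstraints:          # spread replicas across nodes
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: myapp
      containers:
      - name: myapp
        image: registry.example/myapp:1.0   # hypothetical image
        ports:
        - containerPort: 8080
        readinessProbe:                 # gate traffic until the app is ready
          httpGet:
            path: /healthz/ready        # assumed endpoint
            port: 8080
          periodSeconds: 5
        livenessProbe:                  # restart the container if it hangs
          httpGet:
            path: /healthz/live         # assumed endpoint
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 10
        lifecycle:
          preStop:                      # short pause so in-flight requests can finish
            exec:
              command: ["sleep", "5"]
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 256Mi
```

When a node is cordoned and drained, the pods on it are evicted and rescheduled elsewhere; with replicas spread over several nodes, the remaining pods keep serving while the evicted one restarts.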
| !!! Warning | ||
|     Applications that do not meet these requirements **may experience temporary disruption** during node reboots. | ||
|
|