From cb391040729fd96368adfedc9c861760f77be3d8 Mon Sep 17 00:00:00 2001 From: Elia Oggian Date: Mon, 7 Jul 2025 15:30:39 +0200 Subject: [PATCH 01/18] Add Kubernetes Updates docs --- docs/kubernetes/kubernetes-upgrades.md | 48 +++++++++++++++++++++ docs/kubernetes/node-upgrades.md | 59 ++++++++++++++++++++++++++ mkdocs.yml | 3 ++ 3 files changed, 110 insertions(+) create mode 100644 docs/kubernetes/kubernetes-upgrades.md create mode 100644 docs/kubernetes/node-upgrades.md diff --git a/docs/kubernetes/kubernetes-upgrades.md b/docs/kubernetes/kubernetes-upgrades.md new file mode 100644 index 00000000..117db4f2 --- /dev/null +++ b/docs/kubernetes/kubernetes-upgrades.md @@ -0,0 +1,48 @@ +# Kubernetes Cluster Upgrade Policy + +To maintain a secure, stable, and supported platform, we regularly upgrade our Kubernetes clusters. We use **[RKE2](https://docs.rke2.io/)** as our Kubernetes distribution. + +--- + +## 🔄 Upgrade Flow + +- **Phased Rollout**: + - Upgrades are first applied to **TDS clusters** (Test and Development Systems). + - After a **minimum of 2 weeks**, if no critical issues are observed, the same upgrade will be applied to **PROD clusters**. + +- **No Fixed Schedule**: + - Upgrades are not done on a strict calendar basis. + - Timing may depend on compatibility with **other infrastructure components** (e.g., storage, CNI plugins, monitoring tools). + - However, all clusters will be upgraded **before the current Kubernetes version reaches End of Life (EOL)**. + +--- + +## ⚠️ Upgrade Impact + +The **impact of a Kubernetes upgrade can vary**, depending on the nature of the changes involved: + +- **Minimal Impact**: + - For example, upgrades that affect only the `kubelet` may be **transparent to workloads**. + - Rolling restarts may occur, but no downtime is expected for well-configured applications. + +- **Potentially Disruptive**: + - Upgrades involving components such as the **CNI (Container Network Interface)** may cause **temporary network interruptions**. 
+ - Other control plane or critical component updates might cause short-lived disruption to scheduling or connectivity. + +> 💡 Applications that follow cloud-native best practices (e.g., readiness probes, multiple replicas, graceful shutdown handling) are **less likely to be impacted** by upgrades. + +--- + +## ✅ What You Can Expect + +- Upgrades are performed using safe, tested procedures with minimal risk to production workloads. +- TDS clusters serve as a **canary environment**, allowing us to identify issues early. +- All clusters are kept **aligned with supported Kubernetes versions**. + +--- + +## 💬 Questions? + +If you have any questions about upcoming Kubernetes upgrades or want help verifying your application’s readiness, please contact the Network and Cloud team via Service Desk ticket. + +Thank you for your support and collaboration in keeping our platform secure and reliable. diff --git a/docs/kubernetes/node-upgrades.md b/docs/kubernetes/node-upgrades.md new file mode 100644 index 00000000..fa66631d --- /dev/null +++ b/docs/kubernetes/node-upgrades.md @@ -0,0 +1,59 @@ +# Kubernetes Nodes OS Update Policy + +To ensure the **security** and **stability** of our infrastructure, CSCS will perform **monthly OS updates** on all nodes of our Kubernetes clusters. + +--- + +## 🔄 Maintenance Schedule + +- **Frequency**: Every **first week of the month** +- **Reboot Window**: **Monday to Friday**, between **09:00 and 15:00** +- **Time Zone**: Europe/Zurich + +These updates include important security patches and system updates for the operating systems of cluster nodes. + +> ⚠️ **Note:** Nodes will be **rebooted only if required** by the updates. If no reboot is necessary, nodes will remain in service without disruption. + +--- + +## 🚨 Urgent Security Patches + +In the event of a **critical zero-day vulnerability**, we will apply patches and perform reboots (if required) **as soon as possible**, outside of the regular update schedule if needed. 
+ +- Affected nodes will be updated **immediately** to protect the platform. +- Users will be notified ahead of time **when possible**. +- Standard safety and rolling reboot practices will still be followed. + +--- + +## 🛠️ Reboot Management with Kured + +We use [**Kured** (KUbernetes REboot Daemon)](https://github.com/kubereboot/kured) to safely automate the reboot process. Kured ensures that: + +- Reboots are triggered **only when necessary** (e.g., after kernel updates). +- Nodes are rebooted **one at a time** to avoid service disruption. +- Reboots occur **only during the defined window** +- Nodes are **cordoned**, **drained**, and **gracefully reintegrated** after reboot. + +--- + +## ✅ Application Requirements + +To avoid service disruption during node maintenance, applications **must be designed for high availability**. Specifically: + +- Use **multiple replicas** spread across nodes. +- Follow **cloud-native best practices**, including: + - Proper **readiness** and **liveness probes** + - **Graceful shutdown** support + - **Stateless design** or resilient handling of state + - Appropriate **resource requests and limits** + +> ❗ Applications that do not meet these requirements **may experience temporary disruption** during node reboots. + +--- + +## 👩‍💻 Need Help? + +If you have questions or need help preparing your applications for rolling node maintenance, please contact the Network and Cloud team via Service Desk ticket. + +Thank you for your cooperation and commitment to building robust, cloud-native services. 
diff --git a/mkdocs.yml b/mkdocs.yml index 9ef866eb..42d1873c 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -118,6 +118,9 @@ nav: - 'LLM Inference': guides/mlp_tutorials/llm-inference.md - 'LLM Finetuning': guides/mlp_tutorials/llm-finetuning.md - 'LLM Training': guides/mlp_tutorials/llm-nanotron-training.md + - 'Kubernetes': + - 'Kubernetes Upgrades': kubernetes/kubernetes-upgrades.md + - 'Node OS Upgrades': kubernetes/node-upgrades.md - 'Policies': - policies/index.md - 'User Regulations': policies/regulations.md From 23e471eeb8a2a5b718faadb3e4dc2caec9f1dbbe Mon Sep 17 00:00:00 2001 From: Elia Oggian Date: Mon, 7 Jul 2025 16:16:38 +0200 Subject: [PATCH 02/18] Add Kubernetes cluster docs --- docs/kubernetes/clusters.md | 213 ++++++++++++++++++++++++++++++++++++ mkdocs.yml | 1 + 2 files changed, 214 insertions(+) create mode 100644 docs/kubernetes/clusters.md diff --git a/docs/kubernetes/clusters.md b/docs/kubernetes/clusters.md new file mode 100644 index 00000000..fcdb7b69 --- /dev/null +++ b/docs/kubernetes/clusters.md @@ -0,0 +1,213 @@ + +# CSCS Kubernetes Clusters + +This document provides an overview of the Kubernetes clusters maintained by CSCS and offers step-by-step instructions for accessing and interacting with them. + +--- + +## Architecture + +All Kubernetes clusters at CSCS are: + +- Managed using **[Rancher](https://www.rancher.com)** +- Running **[RKE2 (Rancher Kubernetes Engine 2)](https://github.com/rancher/rke2)** + +--- + +## Cluster Environments + +Clusters are grouped into two main environments: + +- **TDS** – Test and Development Systems +- **PROD** – Production + +TDS clusters receive updates first. If no issues arise, the same updates are then applied to PROD clusters. + +--- + +## Kubernetes API Access + +You can access the Kubernetes API in two main ways: + +### Direct Internet Access + +- A Virtual IP is exposed for the API server. +- Access can be restricted by source IP addresses. 
+ +### Access via CSCS Jump Host + +- Connect through a bastion host (e.g., `ela.cscs.ch`). +- API calls are securely proxied through Rancher. + +To check which method you are using, examine the `current-context` in your `kubeconfig` file. + +--- + +## Cluster Access + +To interact with the cluster, you need the `kubectl` CLI: +🔗 [Install kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) +> `kubectl` is pre-installed on the CSCS jump host. + +### Step-by-Step Access Guide + +#### Retrieve your kubeconfig file + - If you have a CSCS account and can access [Rancher](https://rancher.cscs.ch), download the kubeconfig for your cluster. + + - If you have a CSCS account but can't access [Rancher](https://rancher.cscs.ch), request a local Rancher user and use the **kcscs** tool installed on **ela.cscs.ch** to obtain the kubeconfig: + - Download your SSH keys from [SSH Service](https://sshservice.cscs.ch) + - SSH to `ela.cscs.ch` using the downloaded SSH keys + - Run `kcscs login` and insert your Rancher local user credentials (Supplied by CSCS) + - Run `kcscs list` to list the clusters you have access to + - Run `kcscs get` to get the kubeconfig file for a specific cluster + + - If you don't have a CSCS account, open a Service Desk ticket to ask support. + +#### Store the kubeconfig file + ```bash + mv mykubeconfig.yaml ~/.kube/config + # or + export KUBECONFIG=/home/user/kubeconfig.yaml + ``` + +#### Test connectivity + ```bash + kubectl get nodes + ``` + +> ⚠️ The kubeconfig file contains credentials. Keep it secure. + +--- + +## Pre-installed Applications + +All CSCS-provided clusters include a set of pre-installed tools and components, described below: + +--- + +### 📦 `ceph-csi` + +Provides **dynamic persistent volume provisioning** via the Ceph Container Storage Interface. 
+ +#### Storage Classes + +- `cephfs` – ReadWriteMany (RWX), backed by HDD (large data volumes) +- `rbd-hdd` – ReadWriteOnce (RWO), backed by HDD +- `rbd-nvme` – RWO, backed by NVMe (high-performance workloads like databases) +- `*-retain` – Same classes, but retain the volume after PVC deletion + +--- + +### 🌐 `external-dns` + +Automatically manages DNS entries for: + +- Ingress resources +- Services of type `LoadBalancer` (when annotated) + +#### Example +```bash +kubectl annotate service nginx "external-dns.alpha.kubernetes.io/hostname=nginx.mycluster.tds.cscs.ch." +``` + +> ✅ Use a valid name under the configured subdomain. +📄 [external-dns documentation](https://github.com/kubernetes-sigs/external-dns) + +--- + +### 🔐 `cert-manager` + +Handles automatic issuance of TLS certificates from Let's Encrypt. + +#### Example +```yaml +apiVersion: cert-manager.io/v1 +kind: Certificate +metadata: + name: echo +spec: + secretName: echo + commonName: echo.mycluster.tds.cscs.ch + dnsNames: + - echo.mycluster.tds.cscs.ch + issuerRef: + kind: ClusterIssuer + name: letsencrypt +``` + +You can also issue certs automatically via Ingress annotations (see `ingress-nginx` section). + +📄 [cert-manager documentation](https://cert-manager.io) + +--- + +### 📡 `metallb` + +Enables `LoadBalancer` service types by assigning public IPs. + +> ⚠️ The public IP pool is limited. +Prefer using `Ingress` unless you specifically need a `LoadBalancer`. +📄 [metallb documentation](https://metallb.universe.tf) + +--- + +### 🌍 `ingress-nginx` + +Default Ingress controller with class `nginx`. +Supports automatic TLS via cert-manager annotations. 
+ +#### Example\ +```yaml +apiVersion: networking.k8s.io/v1 +kind: Ingress +metadata: + name: myIngress + namespace: myIngress + annotations: + cert-manager.io/cluster-issuer: letsencrypt +spec: + rules: + - host: example.tds.cscs.ch + http: + paths: + - pathType: Prefix + path: / + backend: + service: + name: myservice + port: + number: 80 + tls: + - hosts: + - example.tds.cscs.ch + secretName: myingress-cert +``` + +📄 [NGINX Ingress Docs](https://docs.nginx.com/nginx-ingress-controller) +📄 [cert-manager Ingress Usage](https://cert-manager.io/docs/usage/ingress/) + +--- + +### 🔑 `external-secrets` + +Integrates with secret management tools like **HashiCorp Vault**. + +📄 [external-secrets documentation](https://external-secrets.io/) + +--- + +### 🔁 `kured` + +Responsible for automatic node reboots (e.g., after kernel updates). + +📄 [kured documentation](https://kured.dev/) + +--- + +### 📊 Observability + +Includes: + +- **ECK Operator** +- **Beats agents** – Export logs and metrics to CSCS’s central log system +- **Prometheus** – Scrapes metrics and exports them to CSCS's central monitoring cluster diff --git a/mkdocs.yml b/mkdocs.yml index 42d1873c..b9b7b648 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -119,6 +119,7 @@ nav: - 'LLM Finetuning': guides/mlp_tutorials/llm-finetuning.md - 'LLM Training': guides/mlp_tutorials/llm-nanotron-training.md - 'Kubernetes': + - 'Clusters': kubernetes/clusters.md - 'Kubernetes Upgrades': kubernetes/kubernetes-upgrades.md - 'Node OS Upgrades': kubernetes/node-upgrades.md - 'Policies': From 4acd1aa32815c0fed8101fb4d4a71786b15260f5 Mon Sep 17 00:00:00 2001 From: Elia Oggian Date: Mon, 7 Jul 2025 15:30:39 +0200 Subject: [PATCH 03/18] Add Kubernetes Updates docs From 2b66f84fb0d91c28137e1d6760c44d169327fb9a Mon Sep 17 00:00:00 2001 From: eliaoggian Date: Tue, 15 Jul 2025 16:55:27 +0200 Subject: [PATCH 04/18] Update docs/kubernetes/clusters.md Co-authored-by: Mikael Simberg --- docs/kubernetes/clusters.md | 1 - 1 
file changed, 1 deletion(-) diff --git a/docs/kubernetes/clusters.md b/docs/kubernetes/clusters.md index fcdb7b69..506a4ed5 100644 --- a/docs/kubernetes/clusters.md +++ b/docs/kubernetes/clusters.md @@ -1,4 +1,3 @@ - # CSCS Kubernetes Clusters This document provides an overview of the Kubernetes clusters maintained by CSCS and offers step-by-step instructions for accessing and interacting with them. From 42484811fde39d67eb31391d0fd1ceb8bb7f4875 Mon Sep 17 00:00:00 2001 From: eliaoggian Date: Tue, 15 Jul 2025 16:55:37 +0200 Subject: [PATCH 05/18] Update docs/kubernetes/clusters.md Co-authored-by: Mikael Simberg --- docs/kubernetes/clusters.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/kubernetes/clusters.md b/docs/kubernetes/clusters.md index 506a4ed5..1dcd69fc 100644 --- a/docs/kubernetes/clusters.md +++ b/docs/kubernetes/clusters.md @@ -1,4 +1,4 @@ -# CSCS Kubernetes Clusters +# CSCS Kubernetes clusters This document provides an overview of the Kubernetes clusters maintained by CSCS and offers step-by-step instructions for accessing and interacting with them. From ff673fe625ceecfe098aa74bd8444c80147b1b3c Mon Sep 17 00:00:00 2001 From: eliaoggian Date: Tue, 15 Jul 2025 16:56:10 +0200 Subject: [PATCH 06/18] Update docs/kubernetes/clusters.md Co-authored-by: Mikael Simberg --- docs/kubernetes/clusters.md | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/docs/kubernetes/clusters.md b/docs/kubernetes/clusters.md index 1dcd69fc..5d98a2ad 100644 --- a/docs/kubernetes/clusters.md +++ b/docs/kubernetes/clusters.md @@ -63,11 +63,14 @@ To interact with the cluster, you need the `kubectl` CLI: - If you don't have a CSCS account, open a Service Desk ticket to ask support. 
#### Store the kubeconfig file - ```bash - mv mykubeconfig.yaml ~/.kube/config - # or - export KUBECONFIG=/home/user/kubeconfig.yaml - ``` + +```bash +mv mykubeconfig.yaml ~/.kube/config +``` +or +```bash +export KUBECONFIG=/home/user/kubeconfig.yaml +``` #### Test connectivity ```bash From 1c4fbd769f187eae1e56745a3aa3ef9279221cbc Mon Sep 17 00:00:00 2001 From: eliaoggian Date: Tue, 15 Jul 2025 16:56:40 +0200 Subject: [PATCH 07/18] Update docs/kubernetes/clusters.md Co-authored-by: Mikael Simberg --- docs/kubernetes/clusters.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/kubernetes/clusters.md b/docs/kubernetes/clusters.md index 5d98a2ad..6a2d8ee3 100644 --- a/docs/kubernetes/clusters.md +++ b/docs/kubernetes/clusters.md @@ -112,8 +112,8 @@ Automatically manages DNS entries for: kubectl annotate service nginx "external-dns.alpha.kubernetes.io/hostname=nginx.mycluster.tds.cscs.ch." ``` -> ✅ Use a valid name under the configured subdomain. -📄 [external-dns documentation](https://github.com/kubernetes-sigs/external-dns) +!!! info "Use a valid name under the configured subdomain" + [external-dns documentation](https://github.com/kubernetes-sigs/external-dns) --- From 9d7ed89a6628f240d18b34b78021de09b1ab80b9 Mon Sep 17 00:00:00 2001 From: eliaoggian Date: Tue, 15 Jul 2025 17:01:11 +0200 Subject: [PATCH 08/18] Update docs/kubernetes/clusters.md Co-authored-by: Mikael Simberg --- docs/kubernetes/clusters.md | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/docs/kubernetes/clusters.md b/docs/kubernetes/clusters.md index 6a2d8ee3..72ae4bdb 100644 --- a/docs/kubernetes/clusters.md +++ b/docs/kubernetes/clusters.md @@ -60,7 +60,17 @@ To interact with the cluster, you need the `kubectl` CLI: - Run `kcscs list` to list the clusters you have access to - Run `kcscs get` to get the kubeconfig file for a specific cluster - - If you don't have a CSCS account, open a Service Desk ticket to ask support. 
+ +- If you have a CSCS account and can access [Rancher](https://rancher.cscs.ch), download the kubeconfig for your cluster. + +- If you have a CSCS account but can't access [Rancher](https://rancher.cscs.ch), request a local Rancher user and use the **kcscs** tool installed on **ela.cscs.ch** to obtain the kubeconfig: + - Download your SSH keys from [SSH Service](https://sshservice.cscs.ch) + - SSH to `ela.cscs.ch` using the downloaded SSH keys + - Run `kcscs login` and insert your Rancher local user credentials (Supplied by CSCS) + - Run `kcscs list` to list the clusters you have access to + - Run `kcscs get` to get the kubeconfig file for a specific cluster + +- If you don't have a CSCS account, open a Service Desk ticket to ask support. #### Store the kubeconfig file From f34b9863107b54d9b879d9a473263d34073f8f08 Mon Sep 17 00:00:00 2001 From: eliaoggian Date: Tue, 15 Jul 2025 17:01:47 +0200 Subject: [PATCH 09/18] Update docs/kubernetes/clusters.md Co-authored-by: Mikael Simberg --- docs/kubernetes/clusters.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/kubernetes/clusters.md b/docs/kubernetes/clusters.md index 72ae4bdb..6d56ae44 100644 --- a/docs/kubernetes/clusters.md +++ b/docs/kubernetes/clusters.md @@ -87,7 +87,8 @@ export KUBECONFIG=/home/user/kubeconfig.yaml kubectl get nodes ``` -> ⚠️ The kubeconfig file contains credentials. Keep it secure. +!!! warning + The kubeconfig file contains credentials. Keep it secure. 
--- From 5a29ac5cdba8e5c2f31144089e69d4a1589c6b78 Mon Sep 17 00:00:00 2001 From: eliaoggian Date: Tue, 15 Jul 2025 17:04:37 +0200 Subject: [PATCH 10/18] Update docs/kubernetes/kubernetes-upgrades.md Co-authored-by: Mikael Simberg --- docs/kubernetes/kubernetes-upgrades.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/kubernetes/kubernetes-upgrades.md b/docs/kubernetes/kubernetes-upgrades.md index 117db4f2..3db4c723 100644 --- a/docs/kubernetes/kubernetes-upgrades.md +++ b/docs/kubernetes/kubernetes-upgrades.md @@ -43,6 +43,6 @@ The **impact of a Kubernetes upgrade can vary**, depending on the nature of the ## 💬 Questions? -If you have any questions about upcoming Kubernetes upgrades or want help verifying your application’s readiness, please contact the Network and Cloud team via Service Desk ticket. +If you have any questions about upcoming Kubernetes upgrades or want help verifying your application’s readiness, please [contact the Network and Cloud team via Service Desk ticket][ref-get-in-touch]. Thank you for your support and collaboration in keeping our platform secure and reliable. From 115c9b4872ae189943e0c2e6c7450b8ff58dde41 Mon Sep 17 00:00:00 2001 From: Elia Oggian Date: Tue, 15 Jul 2025 17:05:40 +0200 Subject: [PATCH 11/18] Fix docs based on review --- docs/kubernetes/clusters.md | 42 +++++--------------------- docs/kubernetes/kubernetes-upgrades.md | 14 --------- docs/kubernetes/node-upgrades.md | 11 ------- 3 files changed, 8 insertions(+), 59 deletions(-) diff --git a/docs/kubernetes/clusters.md b/docs/kubernetes/clusters.md index 6a2d8ee3..4d114c2c 100644 --- a/docs/kubernetes/clusters.md +++ b/docs/kubernetes/clusters.md @@ -2,8 +2,6 @@ This document provides an overview of the Kubernetes clusters maintained by CSCS and offers step-by-step instructions for accessing and interacting with them. 
---- - ## Architecture All Kubernetes clusters at CSCS are: @@ -11,8 +9,6 @@ All Kubernetes clusters at CSCS are: - Managed using **[Rancher](https://www.rancher.com)** - Running **[RKE2 (Rancher Kubernetes Engine 2)](https://github.com/rancher/rke2)** ---- - ## Cluster Environments Clusters are grouped into two main environments: @@ -22,8 +18,6 @@ Clusters are grouped into two main environments: TDS clusters receive updates first. If no issues arise, the same updates are then applied to PROD clusters. ---- - ## Kubernetes API Access You can access the Kubernetes API in two main ways: @@ -40,8 +34,6 @@ You can access the Kubernetes API in two main ways: To check which method you are using, examine the `current-context` in your `kubeconfig` file. ---- - ## Cluster Access To interact with the cluster, you need the `kubectl` CLI: @@ -79,15 +71,11 @@ export KUBECONFIG=/home/user/kubeconfig.yaml > ⚠️ The kubeconfig file contains credentials. Keep it secure. ---- - ## Pre-installed Applications All CSCS-provided clusters include a set of pre-installed tools and components, described below: ---- - -### 📦 `ceph-csi` +### `ceph-csi` Provides **dynamic persistent volume provisioning** via the Ceph Container Storage Interface. @@ -98,9 +86,7 @@ Provides **dynamic persistent volume provisioning** via the Ceph Container Stora - `rbd-nvme` – RWO, backed by NVMe (high-performance workloads like databases) - `*-retain` – Same classes, but retain the volume after PVC deletion ---- - -### 🌐 `external-dns` +### `external-dns` Automatically manages DNS entries for: @@ -115,9 +101,7 @@ kubectl annotate service nginx "external-dns.alpha.kubernetes.io/hostname=nginx. !!! info "Use a valid name under the configured subdomain" [external-dns documentation](https://github.com/kubernetes-sigs/external-dns) ---- - -### 🔐 `cert-manager` +### `cert-manager` Handles automatic issuance of TLS certificates from Let's Encrypt. 
@@ -141,9 +125,7 @@ You can also issue certs automatically via Ingress annotations (see `ingress-ngi 📄 [cert-manager documentation](https://cert-manager.io) ---- - -### 📡 `metallb` +### `metallb` Enables `LoadBalancer` service types by assigning public IPs. @@ -151,9 +133,7 @@ Enables `LoadBalancer` service types by assigning public IPs. Prefer using `Ingress` unless you specifically need a `LoadBalancer`. 📄 [metallb documentation](https://metallb.universe.tf) ---- - -### 🌍 `ingress-nginx` +### `ingress-nginx` Default Ingress controller with class `nginx`. Supports automatic TLS via cert-manager annotations. @@ -188,25 +168,19 @@ spec: 📄 [NGINX Ingress Docs](https://docs.nginx.com/nginx-ingress-controller) 📄 [cert-manager Ingress Usage](https://cert-manager.io/docs/usage/ingress/) ---- - -### 🔑 `external-secrets` +### `external-secrets` Integrates with secret management tools like **HashiCorp Vault**. 📄 [external-secrets documentation](https://external-secrets.io/) ---- - -### 🔁 `kured` +### `kured` Responsible for automatic node reboots (e.g., after kernel updates). 📄 [kured documentation](https://kured.dev/) ---- - -### 📊 Observability +### Observability Includes: diff --git a/docs/kubernetes/kubernetes-upgrades.md b/docs/kubernetes/kubernetes-upgrades.md index 117db4f2..93a9fa40 100644 --- a/docs/kubernetes/kubernetes-upgrades.md +++ b/docs/kubernetes/kubernetes-upgrades.md @@ -2,8 +2,6 @@ To maintain a secure, stable, and supported platform, we regularly upgrade our Kubernetes clusters. We use **[RKE2](https://docs.rke2.io/)** as our Kubernetes distribution. ---- - ## 🔄 Upgrade Flow - **Phased Rollout**: @@ -15,8 +13,6 @@ To maintain a secure, stable, and supported platform, we regularly upgrade our K - Timing may depend on compatibility with **other infrastructure components** (e.g., storage, CNI plugins, monitoring tools). 
- However, all clusters will be upgraded **before the current Kubernetes version reaches End of Life (EOL)**. ---- - ## ⚠️ Upgrade Impact The **impact of a Kubernetes upgrade can vary**, depending on the nature of the changes involved: @@ -31,18 +27,8 @@ The **impact of a Kubernetes upgrade can vary**, depending on the nature of the > 💡 Applications that follow cloud-native best practices (e.g., readiness probes, multiple replicas, graceful shutdown handling) are **less likely to be impacted** by upgrades. ---- - ## ✅ What You Can Expect - Upgrades are performed using safe, tested procedures with minimal risk to production workloads. - TDS clusters serve as a **canary environment**, allowing us to identify issues early. - All clusters are kept **aligned with supported Kubernetes versions**. - ---- - -## 💬 Questions? - -If you have any questions about upcoming Kubernetes upgrades or want help verifying your application’s readiness, please contact the Network and Cloud team via Service Desk ticket. - -Thank you for your support and collaboration in keeping our platform secure and reliable. diff --git a/docs/kubernetes/node-upgrades.md b/docs/kubernetes/node-upgrades.md index fa66631d..f062cfc6 100644 --- a/docs/kubernetes/node-upgrades.md +++ b/docs/kubernetes/node-upgrades.md @@ -2,8 +2,6 @@ To ensure the **security** and **stability** of our infrastructure, CSCS will perform **monthly OS updates** on all nodes of our Kubernetes clusters. ---- - ## 🔄 Maintenance Schedule - **Frequency**: Every **first week of the month** @@ -14,8 +12,6 @@ These updates include important security patches and system updates for the oper > ⚠️ **Note:** Nodes will be **rebooted only if required** by the updates. If no reboot is necessary, nodes will remain in service without disruption. 
---- - ## 🚨 Urgent Security Patches In the event of a **critical zero-day vulnerability**, we will apply patches and perform reboots (if required) **as soon as possible**, outside of the regular update schedule if needed. @@ -24,8 +20,6 @@ In the event of a **critical zero-day vulnerability**, we will apply patches and - Users will be notified ahead of time **when possible**. - Standard safety and rolling reboot practices will still be followed. ---- - ## 🛠️ Reboot Management with Kured We use [**Kured** (KUbernetes REboot Daemon)](https://github.com/kubereboot/kured) to safely automate the reboot process. Kured ensures that: @@ -35,8 +29,6 @@ We use [**Kured** (KUbernetes REboot Daemon)](https://github.com/kubereboot/kure - Reboots occur **only during the defined window** - Nodes are **cordoned**, **drained**, and **gracefully reintegrated** after reboot. ---- - ## ✅ Application Requirements To avoid service disruption during node maintenance, applications **must be designed for high availability**. Specifically: @@ -50,10 +42,7 @@ To avoid service disruption during node maintenance, applications **must be desi > ❗ Applications that do not meet these requirements **may experience temporary disruption** during node reboots. ---- - ## 👩‍💻 Need Help? If you have questions or need help preparing your applications for rolling node maintenance, please contact the Network and Cloud team via Service Desk ticket. -Thank you for your cooperation and commitment to building robust, cloud-native services. From 7ec5d89f2a1bb9ded41c8bd55f52b2b16634b64e Mon Sep 17 00:00:00 2001 From: Elia Oggian Date: Wed, 16 Jul 2025 11:54:10 +0200 Subject: [PATCH 12/18] Improve docs. Add spelling config. 
Add CODEOWNERS --- .github/CODEOWNERS | 1 + .github/actions/spelling/allow.txt | 115 ++++++++++-------- docs/{ => services}/kubernetes/clusters.md | 97 +++++++++------ docs/services/kubernetes/index.md | 34 ++++++ .../kubernetes/kubernetes-upgrades.md | 20 +-- .../kubernetes/node-updates.md} | 10 +- mkdocs.yml | 9 +- 7 files changed, 176 insertions(+), 110 deletions(-) rename docs/{ => services}/kubernetes/clusters.md (55%) create mode 100644 docs/services/kubernetes/index.md rename docs/{ => services}/kubernetes/kubernetes-upgrades.md (76%) rename docs/{kubernetes/node-upgrades.md => services/kubernetes/node-updates.md} (79%) diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index 8e25a88d..0005da62 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -1,6 +1,7 @@ * @bcumming @msimberg @RMeli docs/access/jupyterlab.md @rsarm docs/services/firecrest @jpdorsch @ekouts +docs/services/kubernetes @eliaoggian docs/software/communication @Madeeks @msimberg docs/software/devtools/linaro @jgphpc docs/software/prgenv/linalg.md @finkandreas @msimberg diff --git a/.github/actions/spelling/allow.txt b/.github/actions/spelling/allow.txt index 34737137..1f95de0b 100644 --- a/.github/actions/spelling/allow.txt +++ b/.github/actions/spelling/allow.txt @@ -1,134 +1,137 @@ +aarch +aarch64 +acl ACLs ACR +Alpstein AMD AWS -Alpstein Balfrin +biomolecular +bristen Broyden +bytecode +capstor +Ceph CFLAGS CHARMM CHF +clariden +concretise +concretizer +Containerfile +containerised COSMA -CPE cpe +CPE CPMD CSCS +customised CWP CXI -capstor -Ceph -Containerfile +diagonalisation DNS EDF EDFs EDFs -EMPA -ETHZ Ehrenfest +eiger +EMPA Errigal +ETHZ FFT +filesystems Fock GAPW +Gaussian GCC GGA +Google GPFS GPG GPU GPUs GPW GROMACS +groundstate GTL -Gaussian -Google +Hartree HDD HPC HPCP HPE HSN -Hartree +inodes iopsstor Jax Jira +kcscs Keycloak +kubeconfig +KUbernetes +kured +Kured LAMMPS LDA -LOCALID -LUMI +lexer Libc +libfabric Linaro Linux +LOCALID +LUMI +metallb +MeteoSwiss MFA MLP 
MNDO MPICH MPS -MeteoSwiss +multitenancy NAMD NICs NVIDIA NVMe OTP OTPs +Parrinello PASC PBE PDUs PID -PMPI -POSIX -Parrinello Piz Plesset +PMPI +podman +POSIX +prgenv +prioritised +proactively Pulay +quickstart RCCL RDMA -ROCm -RPA +RKE Roboto +ROCm Roothaan -SSHService -STMV -Scopi -TOTP -UANs -UserLab -VASP -Waldur -Wannier -XDG -aarch -aarch64 -acl -biomolecular -bristen -bytecode -clariden -concretise -concretizer -containerised -customised -diagonalisation -eiger -filesystems -groundstate -inodes -lexer -libfabric -multitenancy -podman -prioritised -prgenv -proactively -quickstart +RPA +RWO +RWX santis sbatch +Scopi screenshot slurm smartphone squashfs srun ssh +SSHService stackinator stakeholders +STMV +subdomain subfolders subtable subtables @@ -140,23 +143,30 @@ tcsh testuser timeframe timelimit +TLS tmpfs todi toolbar toolset torchaudio torchvision +TOTP treesitter trilinos +UANs uarch uenv uenvs uids +UserLab +VASP vCluster vClusters venv versioned versioning +Waldur +Wannier webhooks webinar webpage @@ -166,5 +176,6 @@ workaround workflows xattr xattrs +XDG youtube zstd diff --git a/docs/kubernetes/clusters.md b/docs/services/kubernetes/clusters.md similarity index 55% rename from docs/kubernetes/clusters.md rename to docs/services/kubernetes/clusters.md index 94b503ad..4e18a796 100644 --- a/docs/kubernetes/clusters.md +++ b/docs/services/kubernetes/clusters.md @@ -1,3 +1,4 @@ +[](){#ref-kubernetes-clusters} # CSCS Kubernetes clusters This document provides an overview of the Kubernetes clusters maintained by CSCS and offers step-by-step instructions for accessing and interacting with them. 
@@ -9,6 +10,11 @@ All Kubernetes clusters at CSCS are: - Managed using **[Rancher](https://www.rancher.com)** - Running **[RKE2 (Rancher Kubernetes Engine 2)](https://github.com/rancher/rke2)** +CSCS offers two types of Kubernetes clusters for partners: + +- **Harvester-only clusters**: These clusters run exclusively on virtual machines provisioned by Harvester (SUSE Virtualization), providing a flexible and isolated environment suitable for most workloads. +- **Alpernetes clusters**: These clusters combine Harvester VMs with compute nodes from the Alps supercomputer. This hybrid setup, called *Alpernetes*, enables workloads to leverage both virtualized infrastructure and high-performance computing resources within the same Kubernetes environment. + ## Cluster Environments Clusters are grouped into two main environments: @@ -16,7 +22,7 @@ Clusters are grouped into two main environments: - **TDS** – Test and Development Systems - **PROD** – Production -TDS clusters receive updates first. If no issues arise, the same updates are then applied to PROD clusters. +See [Kubernetes upgrades][ref-kubernetes-clusters-upgrades] for detailed upgrade policy. ## Kubernetes API Access @@ -25,11 +31,11 @@ You can access the Kubernetes API in two main ways: ### Direct Internet Access - A Virtual IP is exposed for the API server. -- Access can be restricted by source IP addresses. +- Access is restricted by source IP addresses of the partner. ### Access via CSCS Jump Host -- Connect through a bastion host (e.g., `ela.cscs.ch`). +- Connect through a jump host (e.g., `ela.cscs.ch`). - API calls are securely proxied through Rancher. To check which method you are using, examine the `current-context` in your `kubeconfig` file. 
@@ -38,33 +44,43 @@ To check which method you are using, examine the `current-context` in your `kube To interact with the cluster, you need the `kubectl` CLI: 🔗 [Install kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) -> `kubectl` is pre-installed on the CSCS jump host. +??? Note "`kubectl` is pre-installed on the CSCS jump host." + -### Step-by-Step Access Guide +### Retrieve your kubeconfig file -#### Retrieve your kubeconfig file - - If you have a CSCS account and can access [Rancher](https://rancher.cscs.ch), download the kubeconfig for your cluster. +#### Internal CSCS Users +Access [Rancher](https://rancher.cscs.ch) and download the kubeconfig for your cluster. - - If you have a CSCS account but can't access [Rancher](https://rancher.cscs.ch), request a local Rancher user and use the **kcscs** tool installed on **ela.cscs.ch** to obtain the kubeconfig: - - Download your SSH keys from [SSH Service](https://sshservice.cscs.ch) - - SSH to `ela.cscs.ch` using the downloaded SSH keys - - Run `kcscs login` and insert your Rancher local user credentials (Supplied by CSCS) - - Run `kcscs list` to list the clusters you have access to - - Run `kcscs get` to get the kubeconfig file for a specific cluster +#### External Users +A specific Rancher user and password should have been provided to the partner. +Use the `kcscs` tool installed on `ela.cscs.ch` to obtain the kubeconfig by following the next steps. -- If you have a CSCS account and can access [Rancher](https://rancher.cscs.ch), download the kubeconfig for your cluster. 
- -- If you have a CSCS account but can't access [Rancher](https://rancher.cscs.ch), request a local Rancher user and use the **kcscs** tool installed on **ela.cscs.ch** to obtain the kubeconfig: - - Download your SSH keys from [SSH Service](https://sshservice.cscs.ch) - - SSH to `ela.cscs.ch` using the downloaded SSH keys - - Run `kcscs login` and insert your Rancher local user credentials (Supplied by CSCS) - - Run `kcscs list` to list the clusters you have access to - - Run `kcscs get` to get the kubeconfig file for a specific cluster +Download your SSH keys from [SSH Service](https://sshservice.cscs.ch) (and add them to the SSH agent). + +SSH to the jump host using the downloaded SSH keys +```bash +ssh ela.cscs.ch +``` -- If you don't have a CSCS account, open a Service Desk ticket to ask support. +Login with `kcscs` with the provided Rancher credentials +```bash +kcscs login +``` -#### Store the kubeconfig file +List the accessible clusters +```bash +kcscs list +``` + +Retrieve the kubeconfig file for a specific cluster +```bash +kcscs get +``` + + +### Store the kubeconfig file ```bash mv mykubeconfig.yaml ~/.kube/config @@ -74,7 +90,7 @@ or export KUBECONFIG=/home/user/kubeconfig.yaml ``` -#### Test connectivity +### Test connectivity ```bash kubectl get nodes ``` @@ -88,7 +104,7 @@ All CSCS-provided clusters include a set of pre-installed tools and components, ### `ceph-csi` -Provides **dynamic persistent volume provisioning** via the Ceph Container Storage Interface. +Provides dynamic persistent volume provisioning via the Ceph Container Storage Interface (CEPH CSI). #### Storage Classes @@ -109,8 +125,9 @@ Automatically manages DNS entries for: kubectl annotate service nginx "external-dns.alpha.kubernetes.io/hostname=nginx.mycluster.tds.cscs.ch." ``` -!!! info "Use a valid name under the configured subdomain" - [external-dns documentation](https://github.com/kubernetes-sigs/external-dns) +!!! 
Note "Use a valid name under the configured subdomain" + +🔗 [external-dns documentation](https://github.com/kubernetes-sigs/external-dns) ### `cert-manager` @@ -132,24 +149,25 @@ spec: name: letsencrypt ``` -You can also issue certs automatically via Ingress annotations (see `ingress-nginx` section). +You can also issue certificates automatically via Ingress annotations (see `ingress-nginx` section). -📄 [cert-manager documentation](https://cert-manager.io) +🔗 [cert-manager documentation](https://cert-manager.io) ### `metallb` Enables `LoadBalancer` service types by assigning public IPs. -> ⚠️ The public IP pool is limited. -Prefer using `Ingress` unless you specifically need a `LoadBalancer`. -📄 [metallb documentation](https://metallb.universe.tf) +!!! Warning "The public IP pool is limited. Prefer using `Ingress` unless you specifically need a `LoadBalancer` Service for TCP traffic." + +🔗 [metallb documentation](https://metallb.universe.tf) ### `ingress-nginx` Default Ingress controller with class `nginx`. Supports automatic TLS via cert-manager annotations. -#### Example\ +Example: + ```yaml apiVersion: networking.k8s.io/v1 kind: Ingress @@ -176,25 +194,28 @@ spec: secretName: myingress-cert ``` -📄 [NGINX Ingress Docs](https://docs.nginx.com/nginx-ingress-controller) -📄 [cert-manager Ingress Usage](https://cert-manager.io/docs/usage/ingress/) +🔗 [NGINX Ingress Docs](https://docs.nginx.com/nginx-ingress-controller) +🔗 [cert-manager Ingress Usage](https://cert-manager.io/docs/usage/ingress/) ### `external-secrets` Integrates with secret management tools like **HashiCorp Vault**. -📄 [external-secrets documentation](https://external-secrets.io/) +Enables the use of `ExternalSecret` resources, which fetch secrets through `SecretStore` or `ClusterSecretStore` resources and store them as `Secret` objects inside the cluster. + +It helps to avoid storing secrets in the deployment manifests, especially in GitOps environments. 
+🔗 [external-secrets documentation](https://external-secrets.io/) ### `kured` Responsible for automatic node reboots (e.g., after kernel updates). -📄 [kured documentation](https://kured.dev/) +🔗 [kured documentation](https://kured.dev/) ### Observability Includes: -- **ECK Operator** - **Beats agents** – Export logs and metrics to CSCS’s central log system - **Prometheus** – Scrapes metrics and exports them to CSCS's central monitoring cluster diff --git a/docs/services/kubernetes/index.md b/docs/services/kubernetes/index.md new file mode 100644 index 00000000..78b36b7d --- /dev/null +++ b/docs/services/kubernetes/index.md @@ -0,0 +1,34 @@ +# Kubernetes + +Kubernetes is only available for specific partners. + +!!! Note + Kubernetes is not available for normal users on Alps. + +This documentation is designed to help partners who have been granted access to a Kubernetes cluster. + +It explains how clusters are provisioned, maintained, and the policies in place for upgrades and updates. + + +
+- :fontawesome-solid-layer-group: __Cluster Architecture__ + + CSCS Kubernetes cluster overview. What are the main components and how to interact with it. + + [:octicons-arrow-right-24: Clusters][ref-kubernetes-clusters] + +- :fontawesome-solid-arrow-up-from-bracket: __Kubernetes Upgrades__ + + Kuberenetes Cluster upgrade policy (Kubernetes version upgrades) + + [:octicons-arrow-right-24: Kubernetes Upgrades][ref-kubernetes-clusters-upgrades] + +- :fontawesome-solid-shield-halved: __Node Updates__ + + Cluster Nodes OS update policy (Regular Node Security Updates) + + [:octicons-arrow-right-24: Node OS Updates][ref-kubernetes-node-updates] + +
+ diff --git a/docs/kubernetes/kubernetes-upgrades.md b/docs/services/kubernetes/kubernetes-upgrades.md similarity index 76% rename from docs/kubernetes/kubernetes-upgrades.md rename to docs/services/kubernetes/kubernetes-upgrades.md index 498829be..ab077123 100644 --- a/docs/kubernetes/kubernetes-upgrades.md +++ b/docs/services/kubernetes/kubernetes-upgrades.md @@ -1,14 +1,17 @@ +[](){#ref-kubernetes-clusters-upgrades} # Kubernetes Cluster Upgrade Policy To maintain a secure, stable, and supported platform, we regularly upgrade our Kubernetes clusters. We use **[RKE2](https://docs.rke2.io/)** as our Kubernetes distribution. ## 🔄 Upgrade Flow -- **Phased Rollout**: +**Phased Rollout** + - Upgrades are first applied to **TDS clusters** (Test and Development Systems). - After a **minimum of 2 weeks**, if no critical issues are observed, the same upgrade will be applied to **PROD clusters**. -- **No Fixed Schedule**: +**No Fixed Schedule** + - Upgrades are not done on a strict calendar basis. - Timing may depend on compatibility with **other infrastructure components** (e.g., storage, CNI plugins, monitoring tools). - However, all clusters will be upgraded **before the current Kubernetes version reaches End of Life (EOL)**. @@ -17,15 +20,17 @@ To maintain a secure, stable, and supported platform, we regularly upgrade our K The **impact of a Kubernetes upgrade can vary**, depending on the nature of the changes involved: -- **Minimal Impact**: +**Minimal Impact** + - For example, upgrades that affect only the `kubelet` may be **transparent to workloads**. - Rolling restarts may occur, but no downtime is expected for well-configured applications. -- **Potentially Disruptive**: +**Potentially Disruptive** + - Upgrades involving components such as the **CNI (Container Network Interface)** may cause **temporary network interruptions**. - Other control plane or critical component updates might cause short-lived disruption to scheduling or connectivity. 
-> 💡 Applications that follow cloud-native best practices (e.g., readiness probes, multiple replicas, graceful shutdown handling) are **less likely to be impacted** by upgrades. +??? Note "Applications that follow cloud-native best practices (e.g., readiness probes, multiple replicas, graceful shutdown handling) are **less likely to be impacted** by upgrades." ## ✅ What You Can Expect @@ -33,8 +38,3 @@ The **impact of a Kubernetes upgrade can vary**, depending on the nature of the - TDS clusters serve as a **canary environment**, allowing us to identify issues early. - All clusters are kept **aligned with supported Kubernetes versions**. -## 💬 Questions? - -If you have any questions about upcoming Kubernetes upgrades or want help verifying your application’s readiness, please [contact the Network and Cloud team via Service Desk ticket][ref-get-in-touch]. - - diff --git a/docs/kubernetes/node-upgrades.md b/docs/services/kubernetes/node-updates.md similarity index 79% rename from docs/kubernetes/node-upgrades.md rename to docs/services/kubernetes/node-updates.md index f062cfc6..4a3cf339 100644 --- a/docs/kubernetes/node-upgrades.md +++ b/docs/services/kubernetes/node-updates.md @@ -1,3 +1,4 @@ +[](){#ref-kubernetes-node-updates} # Kubernetes Nodes OS Update Policy To ensure the **security** and **stability** of our infrastructure, CSCS will perform **monthly OS updates** on all nodes of our Kubernetes clusters. @@ -10,7 +11,7 @@ To ensure the **security** and **stability** of our infrastructure, CSCS will pe These updates include important security patches and system updates for the operating systems of cluster nodes. -> ⚠️ **Note:** Nodes will be **rebooted only if required** by the updates. If no reboot is necessary, nodes will remain in service without disruption. +??? Note "Nodes will be rebooted only if required by the updates." 
## 🚨 Urgent Security Patches @@ -40,9 +41,6 @@ To avoid service disruption during node maintenance, applications **must be desi - **Stateless design** or resilient handling of state - Appropriate **resource requests and limits** -> ❗ Applications that do not meet these requirements **may experience temporary disruption** during node reboots. - -## 👩‍💻 Need Help? - -If you have questions or need help preparing your applications for rolling node maintenance, please contact the Network and Cloud team via Service Desk ticket. +!!! Warning + Applications that do not meet these requirements **may experience temporary disruption** during node reboots. diff --git a/mkdocs.yml b/mkdocs.yml index b9b7b648..52a5b9a6 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -98,6 +98,11 @@ nav: - services/index.md - 'FirecREST': services/firecrest.md - 'CI/CD': services/cicd.md + - 'Kubernetes': + - services/kubernetes/index.md + - 'Clusters': services/kubernetes/clusters.md + - 'Kubernetes Upgrades': services/kubernetes/kubernetes-upgrades.md + - 'Node OS Updates': services/kubernetes/node-updates.md - 'Running Jobs': - running/index.md - 'Slurm': running/slurm.md @@ -118,10 +123,6 @@ nav: - 'LLM Inference': guides/mlp_tutorials/llm-inference.md - 'LLM Finetuning': guides/mlp_tutorials/llm-finetuning.md - 'LLM Training': guides/mlp_tutorials/llm-nanotron-training.md - - 'Kubernetes': - - 'Clusters': kubernetes/clusters.md - - 'Kubernetes Upgrades': kubernetes/kubernetes-upgrades.md - - 'Node OS Upgrades': kubernetes/node-upgrades.md - 'Policies': - policies/index.md - 'User Regulations': policies/regulations.md From c5f5d65c9967fefb7a155d6c0b6b9de35e5753a1 Mon Sep 17 00:00:00 2001 From: Elia Oggian Date: Wed, 16 Jul 2025 11:58:33 +0200 Subject: [PATCH 13/18] sort allowed words --- .github/actions/spelling/allow.txt | 117 +++++++++++++---------------- 1 file changed, 53 insertions(+), 64 deletions(-) diff --git a/.github/actions/spelling/allow.txt 
b/.github/actions/spelling/allow.txt index 1f95de0b..1efff1c6 100644 --- a/.github/actions/spelling/allow.txt +++ b/.github/actions/spelling/allow.txt @@ -1,137 +1,134 @@ -aarch -aarch64 -acl ACLs ACR -Alpstein AMD AWS +Alpstein Balfrin -biomolecular -bristen Broyden -bytecode -capstor -Ceph CFLAGS CHARMM CHF -clariden -concretise -concretizer -Containerfile -containerised COSMA -cpe CPE CPMD CSCS -customised CWP CXI -diagonalisation +Ceph +Containerfile DNS EDF EDFs EDFs -Ehrenfest -eiger EMPA -Errigal ETHZ +Ehrenfest +Errigal FFT -filesystems Fock GAPW -Gaussian GCC GGA -Google GPFS GPG GPU GPUs GPW GROMACS -groundstate GTL -Hartree +Gaussian +Google HDD HPC HPCP HPE HSN -inodes -iopsstor +Hartree Jax Jira -kcscs Keycloak -kubeconfig -KUbernetes -kured -Kured LAMMPS LDA -lexer +LOCALID +LUMI Libc -libfabric Linaro Linux -LOCALID -LUMI -metallb -MeteoSwiss MFA MLP MNDO MPICH MPS -multitenancy +MeteoSwiss NAMD NICs NVIDIA NVMe OTP OTPs -Parrinello PASC PBE PDUs PID -Piz -Plesset PMPI -podman POSIX -prgenv -prioritised -proactively +Parrinello +Piz +Plesset Pulay -quickstart RCCL RDMA -RKE -Roboto ROCm -Roothaan RPA -RWO -RWX +Roboto +Roothaan +SSHService +STMV +Scopi +TOTP +UANs +UserLab +VASP +Waldur +Wannier +XDG +aarch +aarch64 +acl +biomolecular +bristen +bytecode +capstor +clariden +concretise +concretizer +containerised +cpe +customised +diagonalisation +eiger +filesystems +groundstate +inodes +iopsstor +lexer +libfabric +multitenancy +podman +prgenv +prioritised +proactively +quickstart santis sbatch -Scopi screenshot slurm smartphone squashfs srun ssh -SSHService stackinator stakeholders -STMV -subdomain subfolders subtable subtables @@ -143,30 +140,23 @@ tcsh testuser timeframe timelimit -TLS tmpfs todi toolbar toolset torchaudio torchvision -TOTP treesitter trilinos -UANs uarch uenv uenvs uids -UserLab -VASP vCluster vClusters venv versioned versioning -Waldur -Wannier webhooks webinar webpage @@ -176,6 +166,5 @@ workaround workflows xattr xattrs -XDG 
youtube zstd From 7ddcc1e24ac2dabdc2215d77e57ac4d674caa25c Mon Sep 17 00:00:00 2001 From: Elia Oggian Date: Wed, 16 Jul 2025 12:11:39 +0200 Subject: [PATCH 14/18] Add Kubernetes to the list of services --- docs/services/index.md | 6 ++++++ docs/services/kubernetes/index.md | 1 + 2 files changed, 7 insertions(+) diff --git a/docs/services/index.md b/docs/services/index.md index e236f98b..c94b4708 100644 --- a/docs/services/index.md +++ b/docs/services/index.md @@ -12,5 +12,11 @@ FirecREST is a RESTful API for programmatically accessing High-Performance Computing resources. [:octicons-arrow-right-24: FirecREST][ref-firecrest] + +- :fontawesome-solid-dharmachakra: __Kubernetes__ + + Kubernetes platform for automating deployment, scaling, and management of containerized applications. + + [:octicons-arrow-right-24: Kubernetes][ref-kubernetes] diff --git a/docs/services/kubernetes/index.md b/docs/services/kubernetes/index.md index 78b36b7d..1c5bce89 100644 --- a/docs/services/kubernetes/index.md +++ b/docs/services/kubernetes/index.md @@ -1,3 +1,4 @@ +[](){#ref-kubernetes} # Kubernetes Kubernetes is only available for specific partners. From 482a6b6a6326a1fc247872d870513db92f107d78 Mon Sep 17 00:00:00 2001 From: eliaoggian Date: Mon, 28 Jul 2025 16:15:14 +0200 Subject: [PATCH 15/18] Update docs/services/kubernetes/clusters.md Co-authored-by: Mikael Simberg --- docs/services/kubernetes/clusters.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/services/kubernetes/clusters.md b/docs/services/kubernetes/clusters.md index 4e18a796..9880875a 100644 --- a/docs/services/kubernetes/clusters.md +++ b/docs/services/kubernetes/clusters.md @@ -159,7 +159,7 @@ Enables `LoadBalancer` service types by assigning public IPs. !!! Warning "The public IP pool is limited. Prefer using `Ingress` unless you specifically need a `LoadBalancer` Service for TCP traffic." 
-🔗 [metallb documentation](https://metallb.universe.tf) +🔗 [MetalLB documentation](https://metallb.universe.tf) ### `ingress-nginx` From 84c6942f05732076daf8dfae2effa9538349329e Mon Sep 17 00:00:00 2001 From: eliaoggian Date: Mon, 28 Jul 2025 16:15:22 +0200 Subject: [PATCH 16/18] Update docs/services/kubernetes/index.md Co-authored-by: Mikael Simberg --- docs/services/kubernetes/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/services/kubernetes/index.md b/docs/services/kubernetes/index.md index 1c5bce89..9fb2c620 100644 --- a/docs/services/kubernetes/index.md +++ b/docs/services/kubernetes/index.md @@ -21,7 +21,7 @@ It explains how clusters are provisioned, maintained, and the policies in place - :fontawesome-solid-arrow-up-from-bracket: __Kubernetes Upgrades__ - Kuberenetes Cluster upgrade policy (Kubernetes version upgrades) + Kubernetes Cluster upgrade policy (Kubernetes version upgrades) [:octicons-arrow-right-24: Kubernetes Upgrades][ref-kubernetes-clusters-upgrades] From 80a8a3ff5473557a4a9872ea25098e3c1e818448 Mon Sep 17 00:00:00 2001 From: eliaoggian Date: Mon, 28 Jul 2025 16:15:32 +0200 Subject: [PATCH 17/18] Update .github/actions/spelling/allow.txt Co-authored-by: Mikael Simberg --- .github/actions/spelling/allow.txt | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/.github/actions/spelling/allow.txt b/.github/actions/spelling/allow.txt index ef901a8b..595f788f 100644 --- a/.github/actions/spelling/allow.txt +++ b/.github/actions/spelling/allow.txt @@ -232,6 +232,17 @@ pmix podman prgenv preinstalled +rke +vms +alpernetes +kubeconfig +ceph +rwx +rwo +subdomain +tls +kured +KUbernetes prerelease prereleases prgenv From 6c0c89388d0939b214846a20e0bee9bf3bb8bae2 Mon Sep 17 00:00:00 2001 From: Elia Oggian Date: Mon, 28 Jul 2025 16:27:53 +0200 Subject: [PATCH 18/18] Apply requested changes. Remove Emojis from headers. 
--- docs/services/kubernetes/kubernetes-upgrades.md | 6 +++--- docs/services/kubernetes/node-updates.md | 8 ++++---- 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/services/kubernetes/kubernetes-upgrades.md b/docs/services/kubernetes/kubernetes-upgrades.md index ab077123..33903fd5 100644 --- a/docs/services/kubernetes/kubernetes-upgrades.md +++ b/docs/services/kubernetes/kubernetes-upgrades.md @@ -3,7 +3,7 @@ To maintain a secure, stable, and supported platform, we regularly upgrade our Kubernetes clusters. We use **[RKE2](https://docs.rke2.io/)** as our Kubernetes distribution. -## 🔄 Upgrade Flow +## Upgrade Flow **Phased Rollout** @@ -16,7 +16,7 @@ To maintain a secure, stable, and supported platform, we regularly upgrade our K - Timing may depend on compatibility with **other infrastructure components** (e.g., storage, CNI plugins, monitoring tools). - However, all clusters will be upgraded **before the current Kubernetes version reaches End of Life (EOL)**. -## ⚠️ Upgrade Impact +## Upgrade Impact The **impact of a Kubernetes upgrade can vary**, depending on the nature of the changes involved: @@ -32,7 +32,7 @@ The **impact of a Kubernetes upgrade can vary**, depending on the nature of the ??? Note "Applications that follow cloud-native best practices (e.g., readiness probes, multiple replicas, graceful shutdown handling) are **less likely to be impacted** by upgrades." -## ✅ What You Can Expect +## What You Can Expect - Upgrades are performed using safe, tested procedures with minimal risk to production workloads. - TDS clusters serve as a **canary environment**, allowing us to identify issues early. 
diff --git a/docs/services/kubernetes/node-updates.md b/docs/services/kubernetes/node-updates.md index 4a3cf339..ddc7672c 100644 --- a/docs/services/kubernetes/node-updates.md +++ b/docs/services/kubernetes/node-updates.md @@ -3,7 +3,7 @@ To ensure the **security** and **stability** of our infrastructure, CSCS will perform **monthly OS updates** on all nodes of our Kubernetes clusters. -## 🔄 Maintenance Schedule +## Maintenance Schedule - **Frequency**: Every **first week of the month** - **Reboot Window**: **Monday to Friday**, between **09:00 and 15:00** @@ -13,7 +13,7 @@ These updates include important security patches and system updates for the oper ??? Note "Nodes will be rebooted only if required by the updates." -## 🚨 Urgent Security Patches +## Urgent Security Patches In the event of a **critical zero-day vulnerability**, we will apply patches and perform reboots (if required) **as soon as possible**, outside of the regular update schedule if needed. @@ -21,7 +21,7 @@ In the event of a **critical zero-day vulnerability**, we will apply patches and - Users will be notified ahead of time **when possible**. - Standard safety and rolling reboot practices will still be followed. -## 🛠️ Reboot Management with Kured +## Reboot Management with Kured We use [**Kured** (KUbernetes REboot Daemon)](https://github.com/kubereboot/kured) to safely automate the reboot process. Kured ensures that: @@ -30,7 +30,7 @@ We use [**Kured** (KUbernetes REboot Daemon)](https://github.com/kubereboot/kure - Reboots occur **only during the defined window** - Nodes are **cordoned**, **drained**, and **gracefully reintegrated** after reboot. -## ✅ Application Requirements +## Application Requirements To avoid service disruption during node maintenance, applications **must be designed for high availability**. Specifically: