From cb391040729fd96368adfedc9c861760f77be3d8 Mon Sep 17 00:00:00 2001
From: Elia Oggian <elia.oggian@cscs.ch>
Date: Mon, 7 Jul 2025 15:30:39 +0200
Subject: [PATCH 01/18] Add Kubernetes Updates docs

---
 docs/kubernetes/kubernetes-upgrades.md | 48 +++++++++++++++++++++
 docs/kubernetes/node-upgrades.md       | 59 ++++++++++++++++++++++++++
 mkdocs.yml                             |  3 ++
 3 files changed, 110 insertions(+)
 create mode 100644 docs/kubernetes/kubernetes-upgrades.md
 create mode 100644 docs/kubernetes/node-upgrades.md
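
Reviewer note (not part of the commit): node-upgrades.md asks for multiple
replicas spread across nodes so that the one-node-at-a-time drains done by
Kured do not take a service down. A PodDisruptionBudget is one common way to
express that expectation to the drain. The fragment below is only a sketch:
the name `myapp` and the `app: myapp` label are placeholders, and it assumes
the workload runs at least two replicas (with a single replica, this budget
would block the drain instead of protecting the service).

```yaml
# Hypothetical example: keep at least one `myapp` pod running while a node is drained.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp              # placeholder name
spec:
  minAvailable: 1          # evictions that would leave fewer than 1 ready pod are refused
  selector:
    matchLabels:
      app: myapp           # must match the labels on the workload's pods
```

Whether a budget like this suits a given application depends on its replica
count and startup time, so treat the values as a starting point.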

diff --git a/docs/kubernetes/kubernetes-upgrades.md b/docs/kubernetes/kubernetes-upgrades.md
new file mode 100644
index 00000000..117db4f2
--- /dev/null
+++ b/docs/kubernetes/kubernetes-upgrades.md
@@ -0,0 +1,48 @@
+# Kubernetes Cluster Upgrade Policy
+
+To maintain a secure, stable, and supported platform, we regularly upgrade our Kubernetes clusters. We use **[RKE2](https://docs.rke2.io/)** as our Kubernetes distribution.
+
+---
+
+## 🔄 Upgrade Flow
+
+- **Phased Rollout**:
+  - Upgrades are first applied to **TDS clusters** (Test and Development Systems).
+  - After a **minimum of 2 weeks**, if no critical issues are observed, the same upgrade will be applied to **PROD clusters**.
+
+- **No Fixed Schedule**:
+  - Upgrades are not done on a strict calendar basis.
+  - Timing may depend on compatibility with **other infrastructure components** (e.g., storage, CNI plugins, monitoring tools).
+  - However, all clusters will be upgraded **before the current Kubernetes version reaches End of Life (EOL)**.
+
+---
+
+## ⚠️ Upgrade Impact
+
+The **impact of a Kubernetes upgrade can vary**, depending on the nature of the changes involved:
+
+- **Minimal Impact**:
+  - For example, upgrades that affect only the `kubelet` may be **transparent to workloads**.
+  - Rolling restarts may occur, but no downtime is expected for well-configured applications.
+
+- **Potentially Disruptive**:
+  - Upgrades involving components such as the **CNI (Container Network Interface)** may cause **temporary network interruptions**.
+  - Other control plane or critical component updates might cause short-lived disruption to scheduling or connectivity.
+
+> 💡 Applications that follow cloud-native best practices (e.g., readiness probes, multiple replicas, graceful shutdown handling) are **less likely to be impacted** by upgrades.
+
+---
+
+## ✅ What You Can Expect
+
+- Upgrades are performed using safe, tested procedures with minimal risk to production workloads.
+- TDS clusters serve as a **canary environment**, allowing us to identify issues early.
+- All clusters are kept **aligned with supported Kubernetes versions**.
+
+---
+
+## 💬 Questions?
+
+If you have any questions about upcoming Kubernetes upgrades or want help verifying your application’s readiness, please contact the Network and Cloud team via Service Desk ticket.
+
+Thank you for your support and collaboration in keeping our platform secure and reliable.
diff --git a/docs/kubernetes/node-upgrades.md b/docs/kubernetes/node-upgrades.md
new file mode 100644
index 00000000..fa66631d
--- /dev/null
+++ b/docs/kubernetes/node-upgrades.md
@@ -0,0 +1,59 @@
+# Kubernetes Nodes OS Update Policy
+
+To ensure the **security** and **stability** of our infrastructure, CSCS will perform **monthly OS updates** on all nodes of our Kubernetes clusters.
+
+---
+
+## 🔄 Maintenance Schedule
+
+- **Frequency**: Every **first week of the month**  
+- **Reboot Window**: **Monday to Friday**, between **09:00 and 15:00**  
+- **Time Zone**: Europe/Zurich
+
+These updates include important security patches and system updates for the operating systems of cluster nodes.
+
+> ⚠️ **Note:** Nodes will be **rebooted only if required** by the updates. If no reboot is necessary, nodes will remain in service without disruption.
+
+---
+
+## 🚨 Urgent Security Patches
+
+In the event of a **critical zero-day vulnerability**, we will apply patches and perform reboots (if required) **as soon as possible**, outside of the regular update schedule if needed.  
+
+- Affected nodes will be updated **immediately** to protect the platform.
+- Users will be notified ahead of time **when possible**.
+- Standard safety and rolling reboot practices will still be followed.
+
+---
+
+## 🛠️ Reboot Management with Kured
+
+We use [**Kured** (KUbernetes REboot Daemon)](https://github.com/kubereboot/kured) to safely automate the reboot process. Kured ensures that:
+
+- Reboots are triggered **only when necessary** (e.g., after kernel updates).
+- Nodes are rebooted **one at a time** to avoid service disruption.
+- Reboots occur **only during the defined window** 
+- Nodes are **cordoned**, **drained**, and **gracefully reintegrated** after reboot.
+
+---
+
+## ✅ Application Requirements
+
+To avoid service disruption during node maintenance, applications **must be designed for high availability**. Specifically:
+
+- Use **multiple replicas** spread across nodes.
+- Follow **cloud-native best practices**, including:
+  - Proper **readiness** and **liveness probes**
+  - **Graceful shutdown** support
+  - **Stateless design** or resilient handling of state
+  - Appropriate **resource requests and limits**
+
+> ❗ Applications that do not meet these requirements **may experience temporary disruption** during node reboots.
+
+---
+
+## 👩‍💻 Need Help?
+
+If you have questions or need help preparing your applications for rolling node maintenance, please contact the Network and Cloud team via Service Desk ticket.
+
+Thank you for your cooperation and commitment to building robust, cloud-native services.
diff --git a/mkdocs.yml b/mkdocs.yml
index 9ef866eb..42d1873c 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -118,6 +118,9 @@ nav:
       - 'LLM Inference': guides/mlp_tutorials/llm-inference.md
       - 'LLM Finetuning': guides/mlp_tutorials/llm-finetuning.md
       - 'LLM Training': guides/mlp_tutorials/llm-nanotron-training.md
+  - 'Kubernetes':
+    - 'Kubernetes Upgrades': kubernetes/kubernetes-upgrades.md
+    - 'Node OS Upgrades': kubernetes/node-upgrades.md
   - 'Policies':
     - policies/index.md
     - 'User Regulations': policies/regulations.md

From 23e471eeb8a2a5b718faadb3e4dc2caec9f1dbbe Mon Sep 17 00:00:00 2001
From: Elia Oggian <elia.oggian@cscs.ch>
Date: Mon, 7 Jul 2025 16:16:38 +0200
Subject: [PATCH 02/18] Add Kubernetes cluster docs

---
 docs/kubernetes/clusters.md | 213 ++++++++++++++++++++++++++++++++++++
 mkdocs.yml                  |   1 +
 2 files changed, 214 insertions(+)
 create mode 100644 docs/kubernetes/clusters.md
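
Reviewer note (not part of the commit): clusters.md tells readers to examine
the `current-context` in their kubeconfig to see which access method they are
using. A minimal sketch of that check without `kubectl` is shown below. The
function name `kubeconfig_current_context` is made up for illustration, and it
assumes a single kubeconfig file with `current-context:` as a top-level key.

```bash
# Print the current-context of a kubeconfig file.
# Assumption: $1 (or $KUBECONFIG, or ~/.kube/config) is a single file path,
# not a colon-separated list, and `current-context:` is a top-level key.
kubeconfig_current_context() {
  file="${1:-${KUBECONFIG:-$HOME/.kube/config}}"
  # Take the value after the key, strip optional double quotes, stop at the first match.
  awk '/^current-context:/ { v = $2; gsub(/"/, "", v); print v; exit }' "$file"
}

# Usage (path is an example):
#   kubeconfig_current_context ~/.kube/config
```

Where `kubectl` is installed, `kubectl config current-context` prints the same
value and is the simpler choice.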

diff --git a/docs/kubernetes/clusters.md b/docs/kubernetes/clusters.md
new file mode 100644
index 00000000..fcdb7b69
--- /dev/null
+++ b/docs/kubernetes/clusters.md
@@ -0,0 +1,213 @@
+
+# CSCS Kubernetes Clusters
+
+This document provides an overview of the Kubernetes clusters maintained by CSCS and offers step-by-step instructions for accessing and interacting with them.
+
+---
+
+## Architecture
+
+All Kubernetes clusters at CSCS are:
+
+- Managed using **[Rancher](https://www.rancher.com)**
+- Running **[RKE2 (Rancher Kubernetes Engine 2)](https://github.com/rancher/rke2)**
+
+---
+
+## Cluster Environments
+
+Clusters are grouped into two main environments:
+
+- **TDS** – Test and Development Systems  
+- **PROD** – Production
+
+TDS clusters receive updates first. If no issues arise, the same updates are then applied to PROD clusters.
+
+---
+
+## Kubernetes API Access
+
+You can access the Kubernetes API in two main ways:
+
+### Direct Internet Access
+
+- A Virtual IP is exposed for the API server.  
+- Access can be restricted by source IP addresses.
+
+### Access via CSCS Jump Host
+
+- Connect through a bastion host (e.g., `ela.cscs.ch`).
+- API calls are securely proxied through Rancher.
+
+To check which method you are using, examine the `current-context` in your `kubeconfig` file.
+
+---
+
+## Cluster Access
+
+To interact with the cluster, you need the `kubectl` CLI:  
+🔗 [Install kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl)  
+> `kubectl` is pre-installed on the CSCS jump host.
+
+### Step-by-Step Access Guide
+
+#### Retrieve your kubeconfig file
+   - If you have a CSCS account and can access [Rancher](https://rancher.cscs.ch), download the kubeconfig for your cluster.
+   
+   - If you have a CSCS account but can't access [Rancher](https://rancher.cscs.ch), request a local Rancher user and use the **kcscs** tool installed on **ela.cscs.ch** to obtain the kubeconfig:
+    - Download your SSH keys from [SSH Service](https://sshservice.cscs.ch)
+    - SSH to `ela.cscs.ch` using the downloaded SSH keys
+    - Run `kcscs login` and enter your Rancher local user credentials (supplied by CSCS)
+    - Run `kcscs list` to list the clusters you have access to
+    - Run `kcscs get` to get the kubeconfig file for a specific cluster
+
+   - If you don't have a CSCS account, open a Service Desk ticket to ask support.
+
+#### Store the kubeconfig file
+   ```bash
+   mv mykubeconfig.yaml ~/.kube/config
+   # or
+   export KUBECONFIG=/home/user/kubeconfig.yaml
+   ```
+
+#### Test connectivity
+   ```bash
+   kubectl get nodes
+   ```
+
+> ⚠️ The kubeconfig file contains credentials. Keep it secure.
+
+---
+
+## Pre-installed Applications
+
+All CSCS-provided clusters include a set of pre-installed tools and components, described below:
+
+---
+
+### 📦 `ceph-csi`
+
+Provides **dynamic persistent volume provisioning** via the Ceph Container Storage Interface.
+
+#### Storage Classes
+
+- `cephfs` – ReadWriteMany (RWX), backed by HDD (large data volumes)
+- `rbd-hdd` – ReadWriteOnce (RWO), backed by HDD
+- `rbd-nvme` – RWO, backed by NVMe (high-performance workloads like databases)
+- `*-retain` – Same classes, but retain the volume after PVC deletion
+
+---
+
+### 🌐 `external-dns`
+
+Automatically manages DNS entries for:
+
+- Ingress resources
+- Services of type `LoadBalancer` (when annotated)
+
+#### Example
+```bash
+kubectl annotate service nginx "external-dns.alpha.kubernetes.io/hostname=nginx.mycluster.tds.cscs.ch."
+```
+
+> ✅ Use a valid name under the configured subdomain.  
+📄 [external-dns documentation](https://github.com/kubernetes-sigs/external-dns)
+
+---
+
+### 🔐 `cert-manager`
+
+Handles automatic issuance of TLS certificates from Let's Encrypt.
+
+#### Example
+```yaml
+apiVersion: cert-manager.io/v1
+kind: Certificate
+metadata:
+  name: echo
+spec:
+  secretName: echo
+  commonName: echo.mycluster.tds.cscs.ch
+  dnsNames:
+    - echo.mycluster.tds.cscs.ch
+  issuerRef:
+    kind: ClusterIssuer
+    name: letsencrypt
+```
+
+You can also issue certs automatically via Ingress annotations (see `ingress-nginx` section).
+
+📄 [cert-manager documentation](https://cert-manager.io)
+
+---
+
+### 📡 `metallb`
+
+Enables `LoadBalancer` service types by assigning public IPs.
+
+> ⚠️ The public IP pool is limited.  
+Prefer using `Ingress` unless you specifically need a `LoadBalancer`.  
+📄 [metallb documentation](https://metallb.universe.tf)
+
+---
+
+### 🌍 `ingress-nginx`
+
+Default Ingress controller with class `nginx`.  
+Supports automatic TLS via cert-manager annotations.
+
+#### Example
+```yaml
+apiVersion: networking.k8s.io/v1
+kind: Ingress
+metadata:
+  name: myingress
+  namespace: myingress
+  annotations:
+    cert-manager.io/cluster-issuer: letsencrypt
+spec:
+  rules:
+    - host: example.tds.cscs.ch
+      http:
+        paths:
+          - pathType: Prefix
+            path: /
+            backend:
+              service:
+                name: myservice
+                port:
+                  number: 80
+  tls:
+    - hosts:
+        - example.tds.cscs.ch
+      secretName: myingress-cert
+```
+
+📄 [NGINX Ingress Docs](https://docs.nginx.com/nginx-ingress-controller)  
+📄 [cert-manager Ingress Usage](https://cert-manager.io/docs/usage/ingress/)
+
+---
+
+### 🔑 `external-secrets`
+
+Integrates with secret management tools like **HashiCorp Vault**.
+
+📄 [external-secrets documentation](https://external-secrets.io/)
+
+---
+
+### 🔁 `kured`
+
+Responsible for automatic node reboots (e.g., after kernel updates).
+
+📄 [kured documentation](https://kured.dev/)
+
+---
+
+### 📊 Observability
+
+Includes:
+
+- **ECK Operator**  
+- **Beats agents** – Export logs and metrics to CSCS’s central log system
+- **Prometheus** – Scrapes metrics and exports them to CSCS's central monitoring cluster
diff --git a/mkdocs.yml b/mkdocs.yml
index 42d1873c..b9b7b648 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -119,6 +119,7 @@ nav:
       - 'LLM Finetuning': guides/mlp_tutorials/llm-finetuning.md
       - 'LLM Training': guides/mlp_tutorials/llm-nanotron-training.md
   - 'Kubernetes':
+    - 'Clusters': kubernetes/clusters.md
     - 'Kubernetes Upgrades': kubernetes/kubernetes-upgrades.md
     - 'Node OS Upgrades': kubernetes/node-upgrades.md
   - 'Policies':

From 4acd1aa32815c0fed8101fb4d4a71786b15260f5 Mon Sep 17 00:00:00 2001
From: Elia Oggian <elia.oggian@cscs.ch>
Date: Mon, 7 Jul 2025 15:30:39 +0200
Subject: [PATCH 03/18] Add Kubernetes Updates docs


From 2b66f84fb0d91c28137e1d6760c44d169327fb9a Mon Sep 17 00:00:00 2001
From: eliaoggian <etuz93@gmail.com>
Date: Tue, 15 Jul 2025 16:55:27 +0200
Subject: [PATCH 04/18] Update docs/kubernetes/clusters.md

Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>
---
 docs/kubernetes/clusters.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/docs/kubernetes/clusters.md b/docs/kubernetes/clusters.md
index fcdb7b69..506a4ed5 100644
--- a/docs/kubernetes/clusters.md
+++ b/docs/kubernetes/clusters.md
@@ -1,4 +1,3 @@
-
 # CSCS Kubernetes Clusters
 
 This document provides an overview of the Kubernetes clusters maintained by CSCS and offers step-by-step instructions for accessing and interacting with them.

From 42484811fde39d67eb31391d0fd1ceb8bb7f4875 Mon Sep 17 00:00:00 2001
From: eliaoggian <etuz93@gmail.com>
Date: Tue, 15 Jul 2025 16:55:37 +0200
Subject: [PATCH 05/18] Update docs/kubernetes/clusters.md

Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>
---
 docs/kubernetes/clusters.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/kubernetes/clusters.md b/docs/kubernetes/clusters.md
index 506a4ed5..1dcd69fc 100644
--- a/docs/kubernetes/clusters.md
+++ b/docs/kubernetes/clusters.md
@@ -1,4 +1,4 @@
-# CSCS Kubernetes Clusters
+# CSCS Kubernetes clusters
 
 This document provides an overview of the Kubernetes clusters maintained by CSCS and offers step-by-step instructions for accessing and interacting with them.
 

From ff673fe625ceecfe098aa74bd8444c80147b1b3c Mon Sep 17 00:00:00 2001
From: eliaoggian <etuz93@gmail.com>
Date: Tue, 15 Jul 2025 16:56:10 +0200
Subject: [PATCH 06/18] Update docs/kubernetes/clusters.md

Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>
---
 docs/kubernetes/clusters.md | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/docs/kubernetes/clusters.md b/docs/kubernetes/clusters.md
index 1dcd69fc..5d98a2ad 100644
--- a/docs/kubernetes/clusters.md
+++ b/docs/kubernetes/clusters.md
@@ -63,11 +63,14 @@ To interact with the cluster, you need the `kubectl` CLI:
    - If you don't have a CSCS account, open a Service Desk ticket to ask support.
 
 #### Store the kubeconfig file
-   ```bash
-   mv mykubeconfig.yaml ~/.kube/config
-   # or
-   export KUBECONFIG=/home/user/kubeconfig.yaml
-   ```
+
+```bash
+mv mykubeconfig.yaml ~/.kube/config
+```
+or
+```bash
+export KUBECONFIG=/home/user/kubeconfig.yaml
+```
 
 #### Test connectivity
    ```bash

From 1c4fbd769f187eae1e56745a3aa3ef9279221cbc Mon Sep 17 00:00:00 2001
From: eliaoggian <etuz93@gmail.com>
Date: Tue, 15 Jul 2025 16:56:40 +0200
Subject: [PATCH 07/18] Update docs/kubernetes/clusters.md

Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>
---
 docs/kubernetes/clusters.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/kubernetes/clusters.md b/docs/kubernetes/clusters.md
index 5d98a2ad..6a2d8ee3 100644
--- a/docs/kubernetes/clusters.md
+++ b/docs/kubernetes/clusters.md
@@ -112,8 +112,8 @@ Automatically manages DNS entries for:
 kubectl annotate service nginx "external-dns.alpha.kubernetes.io/hostname=nginx.mycluster.tds.cscs.ch."
 ```
 
-> ✅ Use a valid name under the configured subdomain.  
-📄 [external-dns documentation](https://github.com/kubernetes-sigs/external-dns)
+!!! info "Use a valid name under the configured subdomain"
+    [external-dns documentation](https://github.com/kubernetes-sigs/external-dns)
 
 ---
 

From 9d7ed89a6628f240d18b34b78021de09b1ab80b9 Mon Sep 17 00:00:00 2001
From: eliaoggian <etuz93@gmail.com>
Date: Tue, 15 Jul 2025 17:01:11 +0200
Subject: [PATCH 08/18] Update docs/kubernetes/clusters.md

Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>
---
 docs/kubernetes/clusters.md | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/docs/kubernetes/clusters.md b/docs/kubernetes/clusters.md
index 6a2d8ee3..72ae4bdb 100644
--- a/docs/kubernetes/clusters.md
+++ b/docs/kubernetes/clusters.md
@@ -60,7 +60,17 @@ To interact with the cluster, you need the `kubectl` CLI:
     - Run `kcscs list` to list the clusters you have access to
     - Run `kcscs get` to get the kubeconfig file for a specific cluster
 
-   - If you don't have a CSCS account, open a Service Desk ticket to ask support.
+
+- If you have a CSCS account and can access [Rancher](https://rancher.cscs.ch), download the kubeconfig for your cluster.
+  
+- If you have a CSCS account but can't access [Rancher](https://rancher.cscs.ch), request a local Rancher user and use the **kcscs** tool installed on **ela.cscs.ch** to obtain the kubeconfig:
+    - Download your SSH keys from [SSH Service](https://sshservice.cscs.ch)
+    - SSH to `ela.cscs.ch` using the downloaded SSH keys
+    - Run `kcscs login` and enter your Rancher local user credentials (supplied by CSCS)
+    - Run `kcscs list` to list the clusters you have access to
+    - Run `kcscs get` to get the kubeconfig file for a specific cluster
+
+- If you don't have a CSCS account, open a Service Desk ticket to request support.
 
 #### Store the kubeconfig file
 

From f34b9863107b54d9b879d9a473263d34073f8f08 Mon Sep 17 00:00:00 2001
From: eliaoggian <etuz93@gmail.com>
Date: Tue, 15 Jul 2025 17:01:47 +0200
Subject: [PATCH 09/18] Update docs/kubernetes/clusters.md

Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>
---
 docs/kubernetes/clusters.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/docs/kubernetes/clusters.md b/docs/kubernetes/clusters.md
index 72ae4bdb..6d56ae44 100644
--- a/docs/kubernetes/clusters.md
+++ b/docs/kubernetes/clusters.md
@@ -87,7 +87,8 @@ export KUBECONFIG=/home/user/kubeconfig.yaml
    kubectl get nodes
    ```
 
-> ⚠️ The kubeconfig file contains credentials. Keep it secure.
+!!! warning
+    The kubeconfig file contains credentials. Keep it secure.
 
 ---
 

From 5a29ac5cdba8e5c2f31144089e69d4a1589c6b78 Mon Sep 17 00:00:00 2001
From: eliaoggian <etuz93@gmail.com>
Date: Tue, 15 Jul 2025 17:04:37 +0200
Subject: [PATCH 10/18] Update docs/kubernetes/kubernetes-upgrades.md

Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>
---
 docs/kubernetes/kubernetes-upgrades.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/kubernetes/kubernetes-upgrades.md b/docs/kubernetes/kubernetes-upgrades.md
index 117db4f2..3db4c723 100644
--- a/docs/kubernetes/kubernetes-upgrades.md
+++ b/docs/kubernetes/kubernetes-upgrades.md
@@ -43,6 +43,6 @@ The **impact of a Kubernetes upgrade can vary**, depending on the nature of the
 
 ## 💬 Questions?
 
-If you have any questions about upcoming Kubernetes upgrades or want help verifying your application’s readiness, please contact the Network and Cloud team via Service Desk ticket.
+If you have any questions about upcoming Kubernetes upgrades or want help verifying your application’s readiness, please [contact the Network and Cloud team via Service Desk ticket][ref-get-in-touch].
 
 Thank you for your support and collaboration in keeping our platform secure and reliable.

From 115c9b4872ae189943e0c2e6c7450b8ff58dde41 Mon Sep 17 00:00:00 2001
From: Elia Oggian <elia.oggian@cscs.ch>
Date: Tue, 15 Jul 2025 17:05:40 +0200
Subject: [PATCH 11/18] Fix docs based on review

---
 docs/kubernetes/clusters.md            | 42 +++++---------------------
 docs/kubernetes/kubernetes-upgrades.md | 14 ---------
 docs/kubernetes/node-upgrades.md       | 11 -------
 3 files changed, 8 insertions(+), 59 deletions(-)

diff --git a/docs/kubernetes/clusters.md b/docs/kubernetes/clusters.md
index 6a2d8ee3..4d114c2c 100644
--- a/docs/kubernetes/clusters.md
+++ b/docs/kubernetes/clusters.md
@@ -2,8 +2,6 @@
 
 This document provides an overview of the Kubernetes clusters maintained by CSCS and offers step-by-step instructions for accessing and interacting with them.
 
----
-
 ## Architecture
 
 All Kubernetes clusters at CSCS are:
@@ -11,8 +9,6 @@ All Kubernetes clusters at CSCS are:
 - Managed using **[Rancher](https://www.rancher.com)**
 - Running **[RKE2 (Rancher Kubernetes Engine 2)](https://github.com/rancher/rke2)**
 
----
-
 ## Cluster Environments
 
 Clusters are grouped into two main environments:
@@ -22,8 +18,6 @@ Clusters are grouped into two main environments:
 
 TDS clusters receive updates first. If no issues arise, the same updates are then applied to PROD clusters.
 
----
-
 ## Kubernetes API Access
 
 You can access the Kubernetes API in two main ways:
@@ -40,8 +34,6 @@ You can access the Kubernetes API in two main ways:
 
 To check which method you are using, examine the `current-context` in your `kubeconfig` file.
 
----
-
 ## Cluster Access
 
 To interact with the cluster, you need the `kubectl` CLI:  
@@ -79,15 +71,11 @@ export KUBECONFIG=/home/user/kubeconfig.yaml
 
 > ⚠️ The kubeconfig file contains credentials. Keep it secure.
 
----
-
 ## Pre-installed Applications
 
 All CSCS-provided clusters include a set of pre-installed tools and components, described below:
 
----
-
-### 📦 `ceph-csi`
+### `ceph-csi`
 
 Provides **dynamic persistent volume provisioning** via the Ceph Container Storage Interface.
 
@@ -98,9 +86,7 @@ Provides **dynamic persistent volume provisioning** via the Ceph Container Stora
 - `rbd-nvme` – RWO, backed by NVMe (high-performance workloads like databases)
 - `*-retain` – Same classes, but retain the volume after PVC deletion
 
----
-
-### 🌐 `external-dns`
+### `external-dns`
 
 Automatically manages DNS entries for:
 
@@ -115,9 +101,7 @@ kubectl annotate service nginx "external-dns.alpha.kubernetes.io/hostname=nginx.
 !!! info "Use a valid name under the configured subdomain"
     [external-dns documentation](https://github.com/kubernetes-sigs/external-dns)
 
----
-
-### 🔐 `cert-manager`
+### `cert-manager`
 
 Handles automatic issuance of TLS certificates from Let's Encrypt.
 
@@ -141,9 +125,7 @@ You can also issue certs automatically via Ingress annotations (see `ingress-ngi
 
 📄 [cert-manager documentation](https://cert-manager.io)
 
----
-
-### 📡 `metallb`
+### `metallb`
 
 Enables `LoadBalancer` service types by assigning public IPs.
 
@@ -151,9 +133,7 @@ Enables `LoadBalancer` service types by assigning public IPs.
 Prefer using `Ingress` unless you specifically need a `LoadBalancer`.  
 📄 [metallb documentation](https://metallb.universe.tf)
 
----
-
-### 🌍 `ingress-nginx`
+### `ingress-nginx`
 
 Default Ingress controller with class `nginx`.  
 Supports automatic TLS via cert-manager annotations.
@@ -188,25 +168,19 @@ spec:
 📄 [NGINX Ingress Docs](https://docs.nginx.com/nginx-ingress-controller)  
 📄 [cert-manager Ingress Usage](https://cert-manager.io/docs/usage/ingress/)
 
----
-
-### 🔑 `external-secrets`
+### `external-secrets`
 
 Integrates with secret management tools like **HashiCorp Vault**.
 
 📄 [external-secrets documentation](https://external-secrets.io/)
 
----
-
-### 🔁 `kured`
+### `kured`
 
 Responsible for automatic node reboots (e.g., after kernel updates).
 
 📄 [kured documentation](https://kured.dev/)
 
----
-
-### 📊 Observability
+### Observability
 
 Includes:
 
diff --git a/docs/kubernetes/kubernetes-upgrades.md b/docs/kubernetes/kubernetes-upgrades.md
index 117db4f2..93a9fa40 100644
--- a/docs/kubernetes/kubernetes-upgrades.md
+++ b/docs/kubernetes/kubernetes-upgrades.md
@@ -2,8 +2,6 @@
 
 To maintain a secure, stable, and supported platform, we regularly upgrade our Kubernetes clusters. We use **[RKE2](https://docs.rke2.io/)** as our Kubernetes distribution.
 
----
-
 ## 🔄 Upgrade Flow
 
 - **Phased Rollout**:
@@ -15,8 +13,6 @@ To maintain a secure, stable, and supported platform, we regularly upgrade our K
   - Timing may depend on compatibility with **other infrastructure components** (e.g., storage, CNI plugins, monitoring tools).
   - However, all clusters will be upgraded **before the current Kubernetes version reaches End of Life (EOL)**.
 
----
-
 ## ⚠️ Upgrade Impact
 
 The **impact of a Kubernetes upgrade can vary**, depending on the nature of the changes involved:
@@ -31,18 +27,8 @@ The **impact of a Kubernetes upgrade can vary**, depending on the nature of the
 
 > 💡 Applications that follow cloud-native best practices (e.g., readiness probes, multiple replicas, graceful shutdown handling) are **less likely to be impacted** by upgrades.
 
----
-
 ## ✅ What You Can Expect
 
 - Upgrades are performed using safe, tested procedures with minimal risk to production workloads.
 - TDS clusters serve as a **canary environment**, allowing us to identify issues early.
 - All clusters are kept **aligned with supported Kubernetes versions**.
-
----
-
-## 💬 Questions?
-
-If you have any questions about upcoming Kubernetes upgrades or want help verifying your application’s readiness, please contact the Network and Cloud team via Service Desk ticket.
-
-Thank you for your support and collaboration in keeping our platform secure and reliable.
diff --git a/docs/kubernetes/node-upgrades.md b/docs/kubernetes/node-upgrades.md
index fa66631d..f062cfc6 100644
--- a/docs/kubernetes/node-upgrades.md
+++ b/docs/kubernetes/node-upgrades.md
@@ -2,8 +2,6 @@
 
 To ensure the **security** and **stability** of our infrastructure, CSCS will perform **monthly OS updates** on all nodes of our Kubernetes clusters.
 
----
-
 ## 🔄 Maintenance Schedule
 
 - **Frequency**: Every **first week of the month**  
@@ -14,8 +12,6 @@ These updates include important security patches and system updates for the oper
 
 > ⚠️ **Note:** Nodes will be **rebooted only if required** by the updates. If no reboot is necessary, nodes will remain in service without disruption.
 
----
-
 ## 🚨 Urgent Security Patches
 
 In the event of a **critical zero-day vulnerability**, we will apply patches and perform reboots (if required) **as soon as possible**, outside of the regular update schedule if needed.  
@@ -24,8 +20,6 @@ In the event of a **critical zero-day vulnerability**, we will apply patches and
 - Users will be notified ahead of time **when possible**.
 - Standard safety and rolling reboot practices will still be followed.
 
----
-
 ## 🛠️ Reboot Management with Kured
 
 We use [**Kured** (KUbernetes REboot Daemon)](https://github.com/kubereboot/kured) to safely automate the reboot process. Kured ensures that:
@@ -35,8 +29,6 @@ We use [**Kured** (KUbernetes REboot Daemon)](https://github.com/kubereboot/kure
 - Reboots occur **only during the defined window** 
 - Nodes are **cordoned**, **drained**, and **gracefully reintegrated** after reboot.
 
----
-
 ## ✅ Application Requirements
 
 To avoid service disruption during node maintenance, applications **must be designed for high availability**. Specifically:
@@ -50,10 +42,7 @@ To avoid service disruption during node maintenance, applications **must be desi
 
 > ❗ Applications that do not meet these requirements **may experience temporary disruption** during node reboots.
 
----
-
 ## 👩‍💻 Need Help?
 
 If you have questions or need help preparing your applications for rolling node maintenance, please contact the Network and Cloud team via Service Desk ticket.
 
-Thank you for your cooperation and commitment to building robust, cloud-native services.

From 7ec5d89f2a1bb9ded41c8bd55f52b2b16634b64e Mon Sep 17 00:00:00 2001
From: Elia Oggian <elia.oggian@cscs.ch>
Date: Wed, 16 Jul 2025 11:54:10 +0200
Subject: [PATCH 12/18] Improve docs. Add spelling config. Add CODEOWNERS

---
 .github/CODEOWNERS                            |   1 +
 .github/actions/spelling/allow.txt            | 115 ++++++++++--------
 docs/{ => services}/kubernetes/clusters.md    |  97 +++++++++------
 docs/services/kubernetes/index.md             |  34 ++++++
 .../kubernetes/kubernetes-upgrades.md         |  20 +--
 .../kubernetes/node-updates.md}               |  10 +-
 mkdocs.yml                                    |   9 +-
 7 files changed, 176 insertions(+), 110 deletions(-)
 rename docs/{ => services}/kubernetes/clusters.md (55%)
 create mode 100644 docs/services/kubernetes/index.md
 rename docs/{ => services}/kubernetes/kubernetes-upgrades.md (76%)
 rename docs/{kubernetes/node-upgrades.md => services/kubernetes/node-updates.md} (79%)

diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
index 8e25a88d..0005da62 100644
--- a/.github/CODEOWNERS
+++ b/.github/CODEOWNERS
@@ -1,6 +1,7 @@
 * @bcumming @msimberg @RMeli
 docs/access/jupyterlab.md @rsarm
 docs/services/firecrest @jpdorsch @ekouts
+docs/services/kubernetes @eliaoggian
 docs/software/communication @Madeeks @msimberg
 docs/software/devtools/linaro @jgphpc
 docs/software/prgenv/linalg.md @finkandreas @msimberg
diff --git a/.github/actions/spelling/allow.txt b/.github/actions/spelling/allow.txt
index 34737137..1f95de0b 100644
--- a/.github/actions/spelling/allow.txt
+++ b/.github/actions/spelling/allow.txt
@@ -1,134 +1,137 @@
+aarch
+aarch64
+acl
 ACLs
 ACR
+Alpstein
 AMD
 AWS
-Alpstein
 Balfrin
+biomolecular
+bristen
 Broyden
+bytecode
+capstor
+Ceph
 CFLAGS
 CHARMM
 CHF
+clariden
+concretise
+concretizer
+Containerfile
+containerised
 COSMA
-CPE
 cpe
+CPE
 CPMD
 CSCS
+customised
 CWP
 CXI
-capstor
-Ceph
-Containerfile
+diagonalisation
 DNS
 EDF
 EDFs
 EDFs
-EMPA
-ETHZ
 Ehrenfest
+eiger
+EMPA
 Errigal
+ETHZ
 FFT
+filesystems
 Fock
 GAPW
+Gaussian
 GCC
 GGA
+Google
 GPFS
 GPG
 GPU
 GPUs
 GPW
 GROMACS
+groundstate
 GTL
-Gaussian
-Google
+Hartree
 HDD
 HPC
 HPCP
 HPE
 HSN
-Hartree
+inodes
 iopsstor
 Jax
 Jira
+kcscs
 Keycloak
+kubeconfig
+KUbernetes
+kured
+Kured
 LAMMPS
 LDA
-LOCALID
-LUMI
+lexer
 Libc
+libfabric
 Linaro
 Linux
+LOCALID
+LUMI
+metallb
+MeteoSwiss
 MFA
 MLP
 MNDO
 MPICH
 MPS
-MeteoSwiss
+multitenancy
 NAMD
 NICs
 NVIDIA
 NVMe
 OTP
 OTPs
+Parrinello
 PASC
 PBE
 PDUs
 PID
-PMPI
-POSIX
-Parrinello
 Piz
 Plesset
+PMPI
+podman
+POSIX
+prgenv
+prioritised
+proactively
 Pulay
+quickstart
 RCCL
 RDMA
-ROCm
-RPA
+RKE
 Roboto
+ROCm
 Roothaan
-SSHService
-STMV
-Scopi
-TOTP
-UANs
-UserLab
-VASP
-Waldur
-Wannier
-XDG
-aarch
-aarch64
-acl
-biomolecular
-bristen
-bytecode
-clariden
-concretise
-concretizer
-containerised
-customised
-diagonalisation
-eiger
-filesystems
-groundstate
-inodes
-lexer
-libfabric
-multitenancy
-podman
-prioritised
-prgenv
-proactively
-quickstart
+RPA
+RWO
+RWX
 santis
 sbatch
+Scopi
 screenshot
 slurm
 smartphone
 squashfs
 srun
 ssh
+SSHService
 stackinator
 stakeholders
+STMV
+subdomain
 subfolders
 subtable
 subtables
@@ -140,23 +143,30 @@ tcsh
 testuser
 timeframe
 timelimit
+TLS
 tmpfs
 todi
 toolbar
 toolset
 torchaudio
 torchvision
+TOTP
 treesitter
 trilinos
+UANs
 uarch
 uenv
 uenvs
 uids
+UserLab
+VASP
 vCluster
 vClusters
 venv
 versioned
 versioning
+Waldur
+Wannier
 webhooks
 webinar
 webpage
@@ -166,5 +176,6 @@ workaround
 workflows
 xattr
 xattrs
+XDG
 youtube
 zstd
diff --git a/docs/kubernetes/clusters.md b/docs/services/kubernetes/clusters.md
similarity index 55%
rename from docs/kubernetes/clusters.md
rename to docs/services/kubernetes/clusters.md
index 94b503ad..4e18a796 100644
--- a/docs/kubernetes/clusters.md
+++ b/docs/services/kubernetes/clusters.md
@@ -1,3 +1,4 @@
+[](){#ref-kubernetes-clusters}
 # CSCS Kubernetes clusters
 
 This document provides an overview of the Kubernetes clusters maintained by CSCS and offers step-by-step instructions for accessing and interacting with them.
@@ -9,6 +10,11 @@ All Kubernetes clusters at CSCS are:
 - Managed using **[Rancher](https://www.rancher.com)**
 - Running **[RKE2 (Rancher Kubernetes Engine 2)](https://github.com/rancher/rke2)**
 
+CSCS offers two types of Kubernetes clusters for partners:
+
+- **Harvester-only clusters**: These clusters run exclusively on virtual machines provisioned by Harvester (SUSE Virtualization), providing a flexible and isolated environment suitable for most workloads.
+- **Alpernetes clusters**: These clusters combine Harvester VMs with compute nodes from the Alps supercomputer. This hybrid setup, called *Alpernetes*, enables workloads to leverage both virtualized infrastructure and high-performance computing resources within the same Kubernetes environment.
+
 ## Cluster Environments
 
 Clusters are grouped into two main environments:
@@ -16,7 +22,7 @@ Clusters are grouped into two main environments:
 - **TDS** – Test and Development Systems  
 - **PROD** – Production
 
-TDS clusters receive updates first. If no issues arise, the same updates are then applied to PROD clusters.
+See [Kubernetes upgrades][ref-kubernetes-clusters-upgrades] for the detailed upgrade policy.
 
 ## Kubernetes API Access
 
@@ -25,11 +31,11 @@ You can access the Kubernetes API in two main ways:
 ### Direct Internet Access
 
 - A Virtual IP is exposed for the API server.  
-- Access can be restricted by source IP addresses.
+- Access is restricted to the partner's source IP addresses.
 
 ### Access via CSCS Jump Host
 
-- Connect through a bastion host (e.g., `ela.cscs.ch`).
+- Connect through a jump host (e.g., `ela.cscs.ch`).
 - API calls are securely proxied through Rancher.
 
 To check which method you are using, examine the `current-context` in your `kubeconfig` file.
@@ -38,33 +44,43 @@ To check which method you are using, examine the `current-context` in your `kube
 
 To interact with the cluster, you need the `kubectl` CLI:  
 🔗 [Install kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl)  
-> `kubectl` is pre-installed on the CSCS jump host.
+??? Note "`kubectl` is pre-installed on the CSCS jump host."
+
 
-### Step-by-Step Access Guide
+### Retrieve your kubeconfig file
 
-#### Retrieve your kubeconfig file
-   - If you have a CSCS account and can access [Rancher](https://rancher.cscs.ch), download the kubeconfig for your cluster.
+#### Internal CSCS Users
+Access [Rancher](https://rancher.cscs.ch) and download the kubeconfig for your cluster. 
    
-   - If you have a CSCS account but can't access [Rancher](https://rancher.cscs.ch), request a local Rancher user and use the **kcscs** tool installed on **ela.cscs.ch** to obtain the kubeconfig:
-    - Download your SSH keys from [SSH Service](https://sshservice.cscs.ch)
-    - SSH to `ela.cscs.ch` using the downloaded SSH keys
-    - Run `kcscs login` and insert your Rancher local user credentials (Supplied by CSCS)
-    - Run `kcscs list` to list the clusters you have access to
-    - Run `kcscs get` to get the kubeconfig file for a specific cluster
+#### External Users
+A specific Rancher user and password should have been provided to the partner.
 
+Use the `kcscs` tool installed on `ela.cscs.ch` to obtain the kubeconfig by following the steps below.
 
-- If you have a CSCS account and can access [Rancher](https://rancher.cscs.ch), download the kubeconfig for your cluster.
-  
-- If you have a CSCS account but can't access [Rancher](https://rancher.cscs.ch), request a local Rancher user and use the **kcscs** tool installed on **ela.cscs.ch** to obtain the kubeconfig:
-    - Download your SSH keys from [SSH Service](https://sshservice.cscs.ch)
-    - SSH to `ela.cscs.ch` using the downloaded SSH keys
-    - Run `kcscs login` and insert your Rancher local user credentials (Supplied by CSCS)
-    - Run `kcscs list` to list the clusters you have access to
-    - Run `kcscs get` to get the kubeconfig file for a specific cluster
+Download your SSH keys from [SSH Service](https://sshservice.cscs.ch) (and add them to the SSH agent).
+
+SSH to the jump host using the downloaded SSH keys:
+```bash
+ssh ela.cscs.ch
+```
 
-- If you don't have a CSCS account, open a Service Desk ticket to ask support.
+Log in with `kcscs` using the provided Rancher credentials:
+```bash
+kcscs login
+```
 
-#### Store the kubeconfig file
+List the accessible clusters:
+```bash
+kcscs list
+```
+
+Retrieve the kubeconfig file for a specific cluster:
+```bash
+kcscs get
+```
+
+
+### Store the kubeconfig file
 
 ```bash
 mv mykubeconfig.yaml ~/.kube/config
@@ -74,7 +90,7 @@ or
 export KUBECONFIG=/home/user/kubeconfig.yaml
 ```
 
-#### Test connectivity
+### Test connectivity
    ```bash
    kubectl get nodes
    ```
@@ -88,7 +104,7 @@ All CSCS-provided clusters include a set of pre-installed tools and components,
 
 ### `ceph-csi`
 
-Provides **dynamic persistent volume provisioning** via the Ceph Container Storage Interface.
+Provides dynamic persistent volume provisioning via the Ceph Container Storage Interface (Ceph CSI).
 
 #### Storage Classes
 
@@ -109,8 +125,9 @@ Automatically manages DNS entries for:
 kubectl annotate service nginx "external-dns.alpha.kubernetes.io/hostname=nginx.mycluster.tds.cscs.ch."
 ```
 
-!!! info "Use a valid name under the configured subdomain"
-    [external-dns documentation](https://github.com/kubernetes-sigs/external-dns)
+!!! Note "Use a valid name under the configured subdomain"
+    
+🔗 [external-dns documentation](https://github.com/kubernetes-sigs/external-dns)
 
 ### `cert-manager`
 
@@ -132,24 +149,25 @@ spec:
     name: letsencrypt
 ```
 
-You can also issue certs automatically via Ingress annotations (see `ingress-nginx` section).
+You can also issue certificates automatically via Ingress annotations (see `ingress-nginx` section).
 
-📄 [cert-manager documentation](https://cert-manager.io)
+🔗 [cert-manager documentation](https://cert-manager.io)
 
 ### `metallb`
 
 Enables `LoadBalancer` service types by assigning public IPs.
 
-> ⚠️ The public IP pool is limited.  
-Prefer using `Ingress` unless you specifically need a `LoadBalancer`.  
-📄 [metallb documentation](https://metallb.universe.tf)
+!!! Warning "The public IP pool is limited. Prefer using `Ingress` unless you specifically need a `LoadBalancer` Service for TCP traffic."
+
+🔗 [metallb documentation](https://metallb.universe.tf)
 
 ###  `ingress-nginx`
 
 Default Ingress controller with class `nginx`.  
 Supports automatic TLS via cert-manager annotations.
 
-#### Example\
+Example:
+
 ```yaml
 apiVersion: networking.k8s.io/v1
 kind: Ingress
@@ -176,25 +194,28 @@ spec:
       secretName: myingress-cert
 ```
 
-📄 [NGINX Ingress Docs](https://docs.nginx.com/nginx-ingress-controller)  
-📄 [cert-manager Ingress Usage](https://cert-manager.io/docs/usage/ingress/)
+🔗 [NGINX Ingress Docs](https://docs.nginx.com/nginx-ingress-controller)  
+🔗 [cert-manager Ingress Usage](https://cert-manager.io/docs/usage/ingress/)
 
 ### `external-secrets`
 
 Integrates with secret management tools like **HashiCorp Vault**.
 
-📄 [external-secrets documentation](https://external-secrets.io/)
+Enables the use of `ExternalSecret` resources, which fetch secrets via `SecretStore` or `ClusterSecretStore` resources and store them as `Secrets` inside the cluster.
+
+It helps to avoid storing secrets in the deployment manifests, especially in GitOps environments.
+
+🔗 [external-secrets documentation](https://external-secrets.io/)
 
 ### `kured`
 
 Responsible for automatic node reboots (e.g., after kernel updates).
 
-📄 [kured documentation](https://kured.dev/)
+🔗 [kured documentation](https://kured.dev/)
 
 ### Observability
 
 Includes:
 
-- **ECK Operator**  
 - **Beats agents** – Export logs and metrics to CSCS’s central log system
 - **Prometheus** – Scrapes metrics and exports them to CSCS's central monitoring cluster
diff --git a/docs/services/kubernetes/index.md b/docs/services/kubernetes/index.md
new file mode 100644
index 00000000..78b36b7d
--- /dev/null
+++ b/docs/services/kubernetes/index.md
@@ -0,0 +1,34 @@
+# Kubernetes
+
+Kubernetes is only available for specific partners. 
+
+!!! Note
+    Kubernetes is not available for normal users on Alps.
+
+This documentation is designed to help partners who have been granted access to a Kubernetes cluster. 
+
+It explains how clusters are provisioned, maintained, and the policies in place for upgrades and updates.
+
+
+
+<div class="grid cards" markdown>
+-   :fontawesome-solid-layer-group: __Cluster Architecture__
+
+    Overview of the CSCS Kubernetes clusters: their main components and how to interact with them.
+
+    [:octicons-arrow-right-24: Clusters][ref-kubernetes-clusters]
+
+-   :fontawesome-solid-arrow-up-from-bracket: __Kubernetes Upgrades__
+
+    Kuberenetes Cluster upgrade policy (Kubernetes version upgrades)
+
+    [:octicons-arrow-right-24: Kubernetes Upgrades][ref-kubernetes-clusters-upgrades]
+
+-   :fontawesome-solid-shield-halved: __Node Updates__
+
+    Cluster Nodes OS update policy (Regular Node Security Updates)
+
+    [:octicons-arrow-right-24: Node OS Updates][ref-kubernetes-node-updates]
+
+</div>
+
diff --git a/docs/kubernetes/kubernetes-upgrades.md b/docs/services/kubernetes/kubernetes-upgrades.md
similarity index 76%
rename from docs/kubernetes/kubernetes-upgrades.md
rename to docs/services/kubernetes/kubernetes-upgrades.md
index 498829be..ab077123 100644
--- a/docs/kubernetes/kubernetes-upgrades.md
+++ b/docs/services/kubernetes/kubernetes-upgrades.md
@@ -1,14 +1,17 @@
+[](){#ref-kubernetes-clusters-upgrades}
 # Kubernetes Cluster Upgrade Policy
 
 To maintain a secure, stable, and supported platform, we regularly upgrade our Kubernetes clusters. We use **[RKE2](https://docs.rke2.io/)** as our Kubernetes distribution.
 
 ## 🔄 Upgrade Flow
 
-- **Phased Rollout**:
+**Phased Rollout**
+
   - Upgrades are first applied to **TDS clusters** (Test and Development Systems).
   - After a **minimum of 2 weeks**, if no critical issues are observed, the same upgrade will be applied to **PROD clusters**.
 
-- **No Fixed Schedule**:
+**No Fixed Schedule**
+
   - Upgrades are not done on a strict calendar basis.
   - Timing may depend on compatibility with **other infrastructure components** (e.g., storage, CNI plugins, monitoring tools).
   - However, all clusters will be upgraded **before the current Kubernetes version reaches End of Life (EOL)**.
@@ -17,15 +20,17 @@ To maintain a secure, stable, and supported platform, we regularly upgrade our K
 
 The **impact of a Kubernetes upgrade can vary**, depending on the nature of the changes involved:
 
-- **Minimal Impact**:
+**Minimal Impact**
+
   - For example, upgrades that affect only the `kubelet` may be **transparent to workloads**.
   - Rolling restarts may occur, but no downtime is expected for well-configured applications.
 
-- **Potentially Disruptive**:
+**Potentially Disruptive**
+
   - Upgrades involving components such as the **CNI (Container Network Interface)** may cause **temporary network interruptions**.
   - Other control plane or critical component updates might cause short-lived disruption to scheduling or connectivity.
 
-> 💡 Applications that follow cloud-native best practices (e.g., readiness probes, multiple replicas, graceful shutdown handling) are **less likely to be impacted** by upgrades.
+??? Note "Applications that follow cloud-native best practices (e.g., readiness probes, multiple replicas, graceful shutdown handling) are **less likely to be impacted** by upgrades."
 
 ## ✅ What You Can Expect
 
@@ -33,8 +38,3 @@ The **impact of a Kubernetes upgrade can vary**, depending on the nature of the
 - TDS clusters serve as a **canary environment**, allowing us to identify issues early.
 - All clusters are kept **aligned with supported Kubernetes versions**.
 
-## 💬 Questions?
-
-If you have any questions about upcoming Kubernetes upgrades or want help verifying your application’s readiness, please [contact the Network and Cloud team via Service Desk ticket][ref-get-in-touch].
-
-
diff --git a/docs/kubernetes/node-upgrades.md b/docs/services/kubernetes/node-updates.md
similarity index 79%
rename from docs/kubernetes/node-upgrades.md
rename to docs/services/kubernetes/node-updates.md
index f062cfc6..4a3cf339 100644
--- a/docs/kubernetes/node-upgrades.md
+++ b/docs/services/kubernetes/node-updates.md
@@ -1,3 +1,4 @@
+[](){#ref-kubernetes-node-updates}
 # Kubernetes Nodes OS Update Policy
 
 To ensure the **security** and **stability** of our infrastructure, CSCS will perform **monthly OS updates** on all nodes of our Kubernetes clusters.
@@ -10,7 +11,7 @@ To ensure the **security** and **stability** of our infrastructure, CSCS will pe
 
 These updates include important security patches and system updates for the operating systems of cluster nodes.
 
-> ⚠️ **Note:** Nodes will be **rebooted only if required** by the updates. If no reboot is necessary, nodes will remain in service without disruption.
+??? Note "Nodes will be rebooted only if required by the updates."
 
 ## 🚨 Urgent Security Patches
 
@@ -40,9 +41,6 @@ To avoid service disruption during node maintenance, applications **must be desi
   - **Stateless design** or resilient handling of state
   - Appropriate **resource requests and limits**
 
-> ❗ Applications that do not meet these requirements **may experience temporary disruption** during node reboots.
-
-## 👩‍💻 Need Help?
-
-If you have questions or need help preparing your applications for rolling node maintenance, please contact the Network and Cloud team via Service Desk ticket.
+!!! Warning
+    Applications that do not meet these requirements **may experience temporary disruption** during node reboots.
 
diff --git a/mkdocs.yml b/mkdocs.yml
index b9b7b648..52a5b9a6 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -98,6 +98,11 @@ nav:
     - services/index.md
     - 'FirecREST': services/firecrest.md
     - 'CI/CD': services/cicd.md
+    - 'Kubernetes':
+      - services/kubernetes/index.md
+      - 'Clusters': services/kubernetes/clusters.md
+      - 'Kubernetes Upgrades': services/kubernetes/kubernetes-upgrades.md
+      - 'Node OS Updates': services/kubernetes/node-updates.md
   - 'Running Jobs':
     - running/index.md
     - 'Slurm': running/slurm.md
@@ -118,10 +123,6 @@ nav:
       - 'LLM Inference': guides/mlp_tutorials/llm-inference.md
       - 'LLM Finetuning': guides/mlp_tutorials/llm-finetuning.md
       - 'LLM Training': guides/mlp_tutorials/llm-nanotron-training.md
-  - 'Kubernetes':
-    - 'Clusters': kubernetes/clusters.md
-    - 'Kubernetes Upgrades': kubernetes/kubernetes-upgrades.md
-    - 'Node OS Upgrades': kubernetes/node-upgrades.md
   - 'Policies':
     - policies/index.md
     - 'User Regulations': policies/regulations.md

From c5f5d65c9967fefb7a155d6c0b6b9de35e5753a1 Mon Sep 17 00:00:00 2001
From: Elia Oggian <elia.oggian@cscs.ch>
Date: Wed, 16 Jul 2025 11:58:33 +0200
Subject: [PATCH 13/18] sort allowed words

---
 .github/actions/spelling/allow.txt | 117 +++++++++++++----------------
 1 file changed, 53 insertions(+), 64 deletions(-)

diff --git a/.github/actions/spelling/allow.txt b/.github/actions/spelling/allow.txt
index 1f95de0b..1efff1c6 100644
--- a/.github/actions/spelling/allow.txt
+++ b/.github/actions/spelling/allow.txt
@@ -1,137 +1,134 @@
-aarch
-aarch64
-acl
 ACLs
 ACR
-Alpstein
 AMD
 AWS
+Alpstein
 Balfrin
-biomolecular
-bristen
 Broyden
-bytecode
-capstor
-Ceph
 CFLAGS
 CHARMM
 CHF
-clariden
-concretise
-concretizer
-Containerfile
-containerised
 COSMA
-cpe
 CPE
 CPMD
 CSCS
-customised
 CWP
 CXI
-diagonalisation
+Ceph
+Containerfile
 DNS
 EDF
 EDFs
 EDFs
-Ehrenfest
-eiger
 EMPA
-Errigal
 ETHZ
+Ehrenfest
+Errigal
 FFT
-filesystems
 Fock
 GAPW
-Gaussian
 GCC
 GGA
-Google
 GPFS
 GPG
 GPU
 GPUs
 GPW
 GROMACS
-groundstate
 GTL
-Hartree
+Gaussian
+Google
 HDD
 HPC
 HPCP
 HPE
 HSN
-inodes
-iopsstor
+Hartree
 Jax
 Jira
-kcscs
 Keycloak
-kubeconfig
-KUbernetes
-kured
-Kured
 LAMMPS
 LDA
-lexer
+LOCALID
+LUMI
 Libc
-libfabric
 Linaro
 Linux
-LOCALID
-LUMI
-metallb
-MeteoSwiss
 MFA
 MLP
 MNDO
 MPICH
 MPS
-multitenancy
+MeteoSwiss
 NAMD
 NICs
 NVIDIA
 NVMe
 OTP
 OTPs
-Parrinello
 PASC
 PBE
 PDUs
 PID
-Piz
-Plesset
 PMPI
-podman
 POSIX
-prgenv
-prioritised
-proactively
+Parrinello
+Piz
+Plesset
 Pulay
-quickstart
 RCCL
 RDMA
-RKE
-Roboto
 ROCm
-Roothaan
 RPA
-RWO
-RWX
+Roboto
+Roothaan
+SSHService
+STMV
+Scopi
+TOTP
+UANs
+UserLab
+VASP
+Waldur
+Wannier
+XDG
+aarch
+aarch64
+acl
+biomolecular
+bristen
+bytecode
+capstor
+clariden
+concretise
+concretizer
+containerised
+cpe
+customised
+diagonalisation
+eiger
+filesystems
+groundstate
+inodes
+iopsstor
+lexer
+libfabric
+multitenancy
+podman
+prgenv
+prioritised
+proactively
+quickstart
 santis
 sbatch
-Scopi
 screenshot
 slurm
 smartphone
 squashfs
 srun
 ssh
-SSHService
 stackinator
 stakeholders
-STMV
-subdomain
 subfolders
 subtable
 subtables
@@ -143,30 +140,23 @@ tcsh
 testuser
 timeframe
 timelimit
-TLS
 tmpfs
 todi
 toolbar
 toolset
 torchaudio
 torchvision
-TOTP
 treesitter
 trilinos
-UANs
 uarch
 uenv
 uenvs
 uids
-UserLab
-VASP
 vCluster
 vClusters
 venv
 versioned
 versioning
-Waldur
-Wannier
 webhooks
 webinar
 webpage
@@ -176,6 +166,5 @@ workaround
 workflows
 xattr
 xattrs
-XDG
 youtube
 zstd

From 7ddcc1e24ac2dabdc2215d77e57ac4d674caa25c Mon Sep 17 00:00:00 2001
From: Elia Oggian <elia.oggian@cscs.ch>
Date: Wed, 16 Jul 2025 12:11:39 +0200
Subject: [PATCH 14/18] Add Kubernetes to the list of services

---
 docs/services/index.md            | 6 ++++++
 docs/services/kubernetes/index.md | 1 +
 2 files changed, 7 insertions(+)

diff --git a/docs/services/index.md b/docs/services/index.md
index e236f98b..c94b4708 100644
--- a/docs/services/index.md
+++ b/docs/services/index.md
@@ -12,5 +12,11 @@
     FirecREST is a RESTful API for programmatically accessing High-Performance Computing resources.
 
     [:octicons-arrow-right-24: FirecREST][ref-firecrest]
+
+-   :fontawesome-solid-dharmachakra: __Kubernetes__
+
+    Kubernetes platform for automating deployment, scaling, and management of containerized applications.
+
+    [:octicons-arrow-right-24: Kubernetes][ref-kubernetes]
 </div>
 
diff --git a/docs/services/kubernetes/index.md b/docs/services/kubernetes/index.md
index 78b36b7d..1c5bce89 100644
--- a/docs/services/kubernetes/index.md
+++ b/docs/services/kubernetes/index.md
@@ -1,3 +1,4 @@
+[](){#ref-kubernetes}
 # Kubernetes
 
 Kubernetes is only available for specific partners. 

From 482a6b6a6326a1fc247872d870513db92f107d78 Mon Sep 17 00:00:00 2001
From: eliaoggian <etuz93@gmail.com>
Date: Mon, 28 Jul 2025 16:15:14 +0200
Subject: [PATCH 15/18] Update docs/services/kubernetes/clusters.md

Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>
---
 docs/services/kubernetes/clusters.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/services/kubernetes/clusters.md b/docs/services/kubernetes/clusters.md
index 4e18a796..9880875a 100644
--- a/docs/services/kubernetes/clusters.md
+++ b/docs/services/kubernetes/clusters.md
@@ -159,7 +159,7 @@ Enables `LoadBalancer` service types by assigning public IPs.
 
 !!! Warning "The public IP pool is limited. Prefer using `Ingress` unless you specifically need a `LoadBalancer` Service for TCP traffic."
 
-🔗 [metallb documentation](https://metallb.universe.tf)
+🔗 [MetalLB documentation](https://metallb.universe.tf)
 
 ###  `ingress-nginx`
 

From 84c6942f05732076daf8dfae2effa9538349329e Mon Sep 17 00:00:00 2001
From: eliaoggian <etuz93@gmail.com>
Date: Mon, 28 Jul 2025 16:15:22 +0200
Subject: [PATCH 16/18] Update docs/services/kubernetes/index.md

Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>
---
 docs/services/kubernetes/index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/services/kubernetes/index.md b/docs/services/kubernetes/index.md
index 1c5bce89..9fb2c620 100644
--- a/docs/services/kubernetes/index.md
+++ b/docs/services/kubernetes/index.md
@@ -21,7 +21,7 @@ It explains how clusters are provisioned, maintained, and the policies in place
 
 -   :fontawesome-solid-arrow-up-from-bracket: __Kubernetes Upgrades__
 
-    Kuberenetes Cluster upgrade policy (Kubernetes version upgrades)
+    Kubernetes Cluster upgrade policy (Kubernetes version upgrades)
 
     [:octicons-arrow-right-24: Kubernetes Upgrades][ref-kubernetes-clusters-upgrades]
 

From 80a8a3ff5473557a4a9872ea25098e3c1e818448 Mon Sep 17 00:00:00 2001
From: eliaoggian <etuz93@gmail.com>
Date: Mon, 28 Jul 2025 16:15:32 +0200
Subject: [PATCH 17/18] Update .github/actions/spelling/allow.txt

Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>
---
 .github/actions/spelling/allow.txt | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/.github/actions/spelling/allow.txt b/.github/actions/spelling/allow.txt
index ef901a8b..595f788f 100644
--- a/.github/actions/spelling/allow.txt
+++ b/.github/actions/spelling/allow.txt
@@ -232,6 +232,17 @@ pmix
 podman
 prgenv
 preinstalled
+rke
+vms
+alpernetes
+kubeconfig
+ceph
+rwx
+rwo
+subdomain
+tls
+kured
+KUbernetes
 prerelease
 prereleases
 prgenv

From 6c0c89388d0939b214846a20e0bee9bf3bb8bae2 Mon Sep 17 00:00:00 2001
From: Elia Oggian <elia.oggian@cscs.ch>
Date: Mon, 28 Jul 2025 16:27:53 +0200
Subject: [PATCH 18/18] Apply requested changes. Remove Emojis from headers.

---
 docs/services/kubernetes/kubernetes-upgrades.md | 6 +++---
 docs/services/kubernetes/node-updates.md        | 8 ++++----
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/docs/services/kubernetes/kubernetes-upgrades.md b/docs/services/kubernetes/kubernetes-upgrades.md
index ab077123..33903fd5 100644
--- a/docs/services/kubernetes/kubernetes-upgrades.md
+++ b/docs/services/kubernetes/kubernetes-upgrades.md
@@ -3,7 +3,7 @@
 
 To maintain a secure, stable, and supported platform, we regularly upgrade our Kubernetes clusters. We use **[RKE2](https://docs.rke2.io/)** as our Kubernetes distribution.
 
-## 🔄 Upgrade Flow
+## Upgrade Flow
 
 **Phased Rollout**
 
@@ -16,7 +16,7 @@ To maintain a secure, stable, and supported platform, we regularly upgrade our K
   - Timing may depend on compatibility with **other infrastructure components** (e.g., storage, CNI plugins, monitoring tools).
   - However, all clusters will be upgraded **before the current Kubernetes version reaches End of Life (EOL)**.
 
-## ⚠️ Upgrade Impact
+## Upgrade Impact
 
 The **impact of a Kubernetes upgrade can vary**, depending on the nature of the changes involved:
 
@@ -32,7 +32,7 @@ The **impact of a Kubernetes upgrade can vary**, depending on the nature of the
 
 ??? Note "Applications that follow cloud-native best practices (e.g., readiness probes, multiple replicas, graceful shutdown handling) are **less likely to be impacted** by upgrades."
 
-## ✅ What You Can Expect
+## What You Can Expect
 
 - Upgrades are performed using safe, tested procedures with minimal risk to production workloads.
 - TDS clusters serve as a **canary environment**, allowing us to identify issues early.
diff --git a/docs/services/kubernetes/node-updates.md b/docs/services/kubernetes/node-updates.md
index 4a3cf339..ddc7672c 100644
--- a/docs/services/kubernetes/node-updates.md
+++ b/docs/services/kubernetes/node-updates.md
@@ -3,7 +3,7 @@
 
 To ensure the **security** and **stability** of our infrastructure, CSCS will perform **monthly OS updates** on all nodes of our Kubernetes clusters.
 
-## 🔄 Maintenance Schedule
+## Maintenance Schedule
 
 - **Frequency**: Every **first week of the month**  
 - **Reboot Window**: **Monday to Friday**, between **09:00 and 15:00**  
@@ -13,7 +13,7 @@ These updates include important security patches and system updates for the oper
 
 ??? Note "Nodes will be rebooted only if required by the updates."
 
-## 🚨 Urgent Security Patches
+## Urgent Security Patches
 
 In the event of a **critical zero-day vulnerability**, we will apply patches and perform reboots (if required) **as soon as possible**, outside of the regular update schedule if needed.  
 
@@ -21,7 +21,7 @@ In the event of a **critical zero-day vulnerability**, we will apply patches and
 - Users will be notified ahead of time **when possible**.
 - Standard safety and rolling reboot practices will still be followed.
 
-## 🛠️ Reboot Management with Kured
+## Reboot Management with Kured
 
 We use [**Kured** (KUbernetes REboot Daemon)](https://github.com/kubereboot/kured) to safely automate the reboot process. Kured ensures that:
 
@@ -30,7 +30,7 @@ We use [**Kured** (KUbernetes REboot Daemon)](https://github.com/kubereboot/kure
 - Reboots occur **only during the defined window** 
 - Nodes are **cordoned**, **drained**, and **gracefully reintegrated** after reboot.
 
-## ✅ Application Requirements
+## Application Requirements
 
 To avoid service disruption during node maintenance, applications **must be designed for high availability**. Specifically: