48 changes: 48 additions & 0 deletions docs/kubernetes/kubernetes-upgrades.md
@@ -0,0 +1,48 @@
# Kubernetes Cluster Upgrade Policy

To maintain a secure, stable, and supported platform, we regularly upgrade our Kubernetes clusters. We use **[RKE2](https://docs.rke2.io/)** as our Kubernetes distribution.

---

## 🔄 Upgrade Flow

- **Phased Rollout**:
- Upgrades are first applied to **TDS clusters** (Test and Development Systems).
- After a **minimum of 2 weeks**, if no critical issues are observed, the same upgrade will be applied to **PROD clusters**.

- **No Fixed Schedule**:
- Upgrades are not done on a strict calendar basis.
- Timing may depend on compatibility with **other infrastructure components** (e.g., storage, CNI plugins, monitoring tools).
- However, all clusters will be upgraded **before the current Kubernetes version reaches End of Life (EOL)**.

---

## ⚠️ Upgrade Impact

The **impact of a Kubernetes upgrade can vary**, depending on the nature of the changes involved:

- **Minimal Impact**:
- For example, upgrades that affect only the `kubelet` may be **transparent to workloads**.
- Rolling restarts may occur, but no downtime is expected for well-configured applications.

- **Potentially Disruptive**:
- Upgrades involving components such as the **CNI (Container Network Interface)** may cause **temporary network interruptions**.
- Other control plane or critical component updates might cause short-lived disruption to scheduling or connectivity.

> 💡 Applications that follow cloud-native best practices (e.g., readiness probes, multiple replicas, graceful shutdown handling) are **less likely to be impacted** by upgrades.
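As a point of reference, the sketch below shows what such a workload could look like. The names (`my-app`, the image) and the probe settings are illustrative placeholders, not a prescribed configuration:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                            # hypothetical workload name
spec:
  replicas: 3                             # multiple replicas so a single pod restart is not user-visible
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      terminationGracePeriodSeconds: 30   # time for in-flight requests to finish on shutdown
      containers:
        - name: my-app
          image: registry.example.com/my-app:1.0.0   # placeholder image
          ports:
            - containerPort: 8080
          readinessProbe:                 # traffic is routed only once the pod reports ready
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
```

With several replicas behind a readiness probe, a rolling restart of one pod at a time is generally invisible to clients, and the grace period gives the application time to finish in-flight work.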

---

## ✅ What You Can Expect

- Upgrades are performed using safe, tested procedures with minimal risk to production workloads.
- TDS clusters serve as a **canary environment**, allowing us to identify issues early.
- All clusters are kept **aligned with supported Kubernetes versions**.

---

## 💬 Questions?

If you have any questions about upcoming Kubernetes upgrades or want help verifying your application’s readiness, please contact the Network and Cloud team via a Service Desk ticket.

Thank you for your support and collaboration in keeping our platform secure and reliable.
59 changes: 59 additions & 0 deletions docs/kubernetes/node-upgrades.md
@@ -0,0 +1,59 @@
# Kubernetes Nodes OS Update Policy

To ensure the **security** and **stability** of our infrastructure, CSCS will perform **monthly OS updates** on all nodes of our Kubernetes clusters.

---

## 🔄 Maintenance Schedule

- **Frequency**: The **first week of every month**
- **Reboot Window**: **Monday to Friday**, between **09:00 and 15:00**
- **Time Zone**: Europe/Zurich

These updates include important security patches and system updates for the operating systems of cluster nodes.

> ⚠️ **Note:** Nodes will be **rebooted only if required** by the updates. If no reboot is necessary, nodes will remain in service without disruption.

---

## 🚨 Urgent Security Patches

In the event of a **critical zero-day vulnerability**, we will apply patches and perform reboots (if required) **as soon as possible**, outside of the regular update schedule if needed.

- Affected nodes will be updated **immediately** to protect the platform.
- Users will be notified ahead of time **when possible**.
- Standard safety and rolling reboot practices will still be followed.

---

## 🛠️ Reboot Management with Kured

We use [**Kured** (KUbernetes REboot Daemon)](https://github.com/kubereboot/kured) to safely automate the reboot process. Kured ensures that:

- Reboots are triggered **only when necessary** (e.g., after kernel updates).
- Nodes are rebooted **one at a time** to avoid service disruption.
- Reboots occur **only during the defined maintenance window** (see the configuration sketch below).
- Nodes are **cordoned**, **drained**, and **gracefully reintegrated** after reboot.
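For reference, the reboot window above maps onto Kured's scheduling flags. The snippet below is a minimal sketch of the relevant part of the Kured DaemonSet, assuming the upstream manifest layout; the image tag and values are illustrative rather than a copy of our actual configuration:

```yaml
# Sketch of the scheduling-related part of the Kured DaemonSet.
# Only the fields relevant to the reboot window are shown; the full manifest
# (RBAC, service account, security context, tolerations, ...) comes from the upstream install.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kured
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: kured
  template:
    metadata:
      labels:
        name: kured
    spec:
      containers:
        - name: kured
          image: ghcr.io/kubereboot/kured:1.16.0   # tag is illustrative; pin to the release you deploy
          args:
            - --reboot-days=mon,tue,wed,thu,fri    # weekdays only
            - --start-time=09:00                   # window opens
            - --end-time=15:00                     # window closes
            - --time-zone=Europe/Zurich            # matches the schedule above
            - --period=1h                          # how often kured checks for a pending reboot
```

Kured watches each node for a reboot-required sentinel (e.g. `/var/run/reboot-required`) and, within this window, acquires a cluster-wide lock so that only one node is drained and rebooted at a time.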

---

## ✅ Application Requirements

To avoid service disruption during node maintenance, applications **must be designed for high availability**. Specifically:

- Use **multiple replicas** spread across nodes.
- Follow **cloud-native best practices**, including:
- Proper **readiness** and **liveness probes**
- **Graceful shutdown** support
- **Stateless design** or resilient handling of state
- Appropriate **resource requests and limits**

> ❗ Applications that do not meet these requirements **may experience temporary disruption** during node reboots.
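One concrete way to protect a workload during node drains is a `PodDisruptionBudget`, which tells the drain operation how many pods of the application must remain available at any time. A minimal sketch, assuming a hypothetical `my-app` Deployment running several replicas:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb          # hypothetical name
spec:
  minAvailable: 2           # keep at least two pods running while nodes are drained
  selector:
    matchLabels:
      app: my-app           # must match the labels of the workload's pods
```

Note that a PodDisruptionBudget only helps if enough replicas exist and are spread across nodes; an overly strict budget (e.g. `minAvailable` equal to the replica count) can block node drains entirely.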

---

## 👩‍💻 Need Help?

If you have questions or need help preparing your applications for rolling node maintenance, please contact the Network and Cloud team via a Service Desk ticket.

Thank you for your cooperation and commitment to building robust, cloud-native services.
3 changes: 3 additions & 0 deletions mkdocs.yml
@@ -118,6 +118,9 @@ nav:
- 'LLM Inference': guides/mlp_tutorials/llm-inference.md
- 'LLM Finetuning': guides/mlp_tutorials/llm-finetuning.md
- 'LLM Training': guides/mlp_tutorials/llm-nanotron-training.md
- 'Kubernetes':
- 'Kubernetes Upgrades': kubernetes/kubernetes-upgrades.md
- 'Node OS Upgrades': kubernetes/node-upgrades.md
- 'Policies':
- policies/index.md
- 'User Regulations': policies/regulations.md