diff --git a/docs/kubernetes/kubernetes-upgrades.md b/docs/kubernetes/kubernetes-upgrades.md
new file mode 100644
index 00000000..117db4f2
--- /dev/null
+++ b/docs/kubernetes/kubernetes-upgrades.md
@@ -0,0 +1,48 @@
+# Kubernetes Cluster Upgrade Policy
+
+To maintain a secure, stable, and supported platform, we regularly upgrade our Kubernetes clusters. We use **[RKE2](https://docs.rke2.io/)** as our Kubernetes distribution.
+
+---
+
+## 🔄 Upgrade Flow
+
+- **Phased Rollout**:
+  - Upgrades are first applied to **TDS clusters** (Test and Development Systems).
+  - After a **minimum of 2 weeks**, if no critical issues are observed, the same upgrade will be applied to **PROD clusters**.
+
+- **No Fixed Schedule**:
+  - Upgrades are not done on a strict calendar basis.
+  - Timing may depend on compatibility with **other infrastructure components** (e.g., storage, CNI plugins, monitoring tools).
+  - However, all clusters will be upgraded **before the current Kubernetes version reaches End of Life (EOL)**.
+
+---
+
+## ⚠️ Upgrade Impact
+
+The **impact of a Kubernetes upgrade can vary**, depending on the nature of the changes involved:
+
+- **Minimal Impact**:
+  - For example, upgrades that affect only the `kubelet` may be **transparent to workloads**.
+  - Rolling restarts may occur, but no downtime is expected for well-configured applications.
+
+- **Potentially Disruptive**:
+  - Upgrades involving components such as the **CNI (Container Network Interface)** may cause **temporary network interruptions**.
+  - Other control plane or critical component updates might cause short-lived disruption to scheduling or connectivity.
+
+> 💡 Applications that follow cloud-native best practices (e.g., readiness probes, multiple replicas, graceful shutdown handling) are **less likely to be impacted** by upgrades.
+
+---
+
+## ✅ What You Can Expect
+
+- Upgrades are performed using safe, tested procedures with minimal risk to production workloads.
+- TDS clusters serve as a **canary environment**, allowing us to identify issues early.
+- All clusters are kept **aligned with supported Kubernetes versions**.
+
+---
+
+## 💬 Questions?
+
+If you have any questions about upcoming Kubernetes upgrades or want help verifying your application’s readiness, please contact the Network and Cloud team via Service Desk ticket.
+
+Thank you for your support and collaboration in keeping our platform secure and reliable.
diff --git a/docs/kubernetes/node-upgrades.md b/docs/kubernetes/node-upgrades.md
new file mode 100644
index 00000000..fa66631d
--- /dev/null
+++ b/docs/kubernetes/node-upgrades.md
@@ -0,0 +1,59 @@
+# Kubernetes Nodes OS Update Policy
+
+To ensure the **security** and **stability** of our infrastructure, CSCS will perform **monthly OS updates** on all nodes of our Kubernetes clusters.
+
+---
+
+## 🔄 Maintenance Schedule
+
+- **Frequency**: Every **first week of the month**
+- **Reboot Window**: **Monday to Friday**, between **09:00 and 15:00**
+- **Time Zone**: Europe/Zurich
+
+These updates include important security patches and system updates for the operating systems of cluster nodes.
+
+> ⚠️ **Note:** Nodes will be **rebooted only if required** by the updates. If no reboot is necessary, nodes will remain in service without disruption.
+
+---
+
+## 🚨 Urgent Security Patches
+
+In the event of a **critical zero-day vulnerability**, we will apply patches and perform reboots (if required) **as soon as possible**, outside of the regular update schedule if needed.
+
+- Affected nodes will be updated **immediately** to protect the platform.
+- Users will be notified ahead of time **when possible**.
+- Standard safety and rolling reboot practices will still be followed.
+
+---
+
+## 🛠️ Reboot Management with Kured
+
+We use [**Kured** (KUbernetes REboot Daemon)](https://github.com/kubereboot/kured) to safely automate the reboot process. Kured ensures that:
+
+- Reboots are triggered **only when necessary** (e.g., after kernel updates).
+- Nodes are rebooted **one at a time** to avoid service disruption.
+- Reboots occur **only during the defined window**.
+- Nodes are **cordoned**, **drained**, and **gracefully reintegrated** after reboot.
+
+---
+
+## ✅ Application Requirements
+
+To avoid service disruption during node maintenance, applications **must be designed for high availability**. Specifically:
+
+- Use **multiple replicas** spread across nodes.
+- Follow **cloud-native best practices**, including:
+  - Proper **readiness** and **liveness probes**
+  - **Graceful shutdown** support
+  - **Stateless design** or resilient handling of state
+  - Appropriate **resource requests and limits**
+
+> ❗ Applications that do not meet these requirements **may experience temporary disruption** during node reboots.
+
+---
+
+## 👩‍💻 Need Help?
+
+If you have questions or need help preparing your applications for rolling node maintenance, please contact the Network and Cloud team via Service Desk ticket.
+
+Thank you for your cooperation and commitment to building robust, cloud-native services.
diff --git a/mkdocs.yml b/mkdocs.yml
index 9ef866eb..42d1873c 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -118,6 +118,9 @@ nav:
       - 'LLM Inference': guides/mlp_tutorials/llm-inference.md
       - 'LLM Finetuning': guides/mlp_tutorials/llm-finetuning.md
       - 'LLM Training': guides/mlp_tutorials/llm-nanotron-training.md
+  - 'Kubernetes':
+    - 'Kubernetes Upgrades': kubernetes/kubernetes-upgrades.md
+    - 'Node OS Upgrades': kubernetes/node-upgrades.md
   - 'Policies':
     - policies/index.md
     - 'User Regulations': policies/regulations.md