Task Summary
During the dkNET-AI launch, we noticed that a computing unit (CU) keeps running when a user leaves the platform without terminating it. Because CUs are per-user compute pods, these idle CUs hold CPU and memory and pin their EKS nodes, which leaves resources underutilized and adds significant cost.
We need to (1) define what makes a CU "idle" and (2) add a mechanism that automatically terminates idle CUs.
Based on @chenlica's and my investigation, Kubernetes has no built-in mechanism to terminate a pod for inactivity. It automatically stops pods for health/resource/lifecycle reasons (eviction, OOM, node failure, activeDeadlineSeconds), but never simply because a workload is idle.
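To make the two goals concrete, here is a minimal sketch of what an idle check could look like. It is only an illustration, not a proposed design: `CUStatus`, `last_activity`, `running_jobs`, and the 30-minute default timeout are all hypothetical names and values, and the actual idle signals still need to be agreed on. The sketch only selects CUs that look idle; actually deleting the pods would be a separate step done by whatever component runs this check periodically.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class CUStatus:
    """Hypothetical snapshot of one CU; the field names are assumptions."""
    cu_id: str
    last_activity: datetime  # last user interaction observed for this CU
    running_jobs: int        # work currently executing on this CU


def is_idle(cu: CUStatus, now: datetime, idle_timeout: timedelta) -> bool:
    """Treat a CU as idle if nothing is running and the user has been away long enough."""
    if cu.running_jobs > 0:
        return False  # never reclaim a CU that is still doing work
    return now - cu.last_activity >= idle_timeout


def select_idle_cus(
    cus: Iterable[CUStatus],
    now: Optional[datetime] = None,
    idle_timeout: timedelta = timedelta(minutes=30),  # assumed default, to be tuned
) -> List[str]:
    """Return the IDs of CUs that a periodic reaper could terminate."""
    now = now or datetime.now(timezone.utc)
    return [cu.cu_id for cu in cus if is_idle(cu, now, idle_timeout)]
```

Keeping the idle decision in a pure function like this would let us unit-test the definition of "idle" separately from the Kubernetes calls that terminate the pod.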
Related links:
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/
https://cloud.google.com/blog/products/containers-kubernetes/scale-to-zero-on-gke-with-keda
https://kubernetes.io/docs/concepts/workloads/autoscaling/horizontal-pod-autoscale/
https://kubernetes.io/docs/concepts/workloads/controllers/job/
Task Type