Automatically terminate idle Computing Units to reclaim cluster resources #5362

@kunwp1

Description

Task Summary

During the dkNET-AI launch, we noticed that a computing unit (CU) keeps running when a user leaves the platform without terminating it. Because CUs are per-user compute pods, these idle CUs hold CPU and memory and pin their EKS nodes, causing significant resource underutilization and cost.

We need to (1) define what makes a CU "idle" and (2) add a mechanism that automatically terminates idle CUs.

Based on @chenlica's and my investigation, Kubernetes has no built-in mechanism to terminate a pod for inactivity. It stops pods automatically only for health, resource, or lifecycle reasons (eviction, OOM kill, node failure, activeDeadlineSeconds), never simply because a workload is idle.
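
To make the two parts concrete, here is a rough sketch of one possible idle rule and the selection step a periodic reaper could run. Everything in it is an assumption for discussion, not the platform's actual schema or design: the `ComputingUnit` fields, the "no running workflows plus a quiet period" definition of idle, and the 30-minute timeout are all hypothetical. The actual termination call (for example, deleting the pod through the Kubernetes API) is deliberately left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List


@dataclass(frozen=True)
class ComputingUnit:
    """Hypothetical view of a CU; field names are assumptions, not the real schema."""
    cu_id: str
    last_activity: datetime  # assumed signal: last time the CU did work for its user
    running_workflows: int   # assumed signal: executions currently in progress


def is_idle(cu: ComputingUnit, now: datetime, idle_timeout: timedelta) -> bool:
    """Idle means nothing is running and the CU has been quiet for at least idle_timeout."""
    if cu.running_workflows > 0:
        return False
    return now - cu.last_activity >= idle_timeout


def select_idle(cus: Iterable[ComputingUnit], now: datetime,
                idle_timeout: timedelta) -> List[str]:
    """Return the IDs of CUs that a periodic reaper pass would terminate."""
    return [cu.cu_id for cu in cus if is_idle(cu, now, idle_timeout)]


if __name__ == "__main__":
    now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
    cus = [
        ComputingUnit("cu-a", now - timedelta(hours=2), 0),    # quiet, nothing running
        ComputingUnit("cu-b", now - timedelta(hours=2), 1),    # still executing
        ComputingUnit("cu-c", now - timedelta(minutes=5), 0),  # recently active
    ]
    print(select_idle(cus, now, timedelta(minutes=30)))  # prints ['cu-a']
```

Keeping the idle decision a pure function of the CU state and the current time makes the rule easy to unit test and to tune, independently of however the termination itself ends up being wired.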

Related links:
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/
https://cloud.google.com/blog/products/containers-kubernetes/scale-to-zero-on-gke-with-keda
https://kubernetes.io/docs/concepts/workloads/autoscaling/horizontal-pod-autoscale/
https://kubernetes.io/docs/concepts/workloads/controllers/job/

Task Type

  • Refactor / Cleanup
  • DevOps / Deployment / CI
  • Testing / QA
  • Documentation
  • Performance
  • Other
