feat: add scale sub resource #474
Codecov Report: All modified and coverable lines are covered by tests.
Expose spec.replicas via the standard /scale subresource so that external autoscalers (KEDA, HPA) can target InferenceService directly. Without this, KEDA's operator immediately deletes any ScaledObject whose scaleTargetRef points at a CRD that does not implement /scale.

Changes:
- Add +kubebuilder:subresource:scale marker with specpath=.spec.replicas and statuspath=.status.replicas to InferenceService type
- Add status.Replicas int32 field (mirrors readyReplicas; this is the path the scale subresource reads to report current replica count)
- Populate status.Replicas = readyReplicas in updateStatusWithSchedulingInfo
- Regenerate config/crd/bases CRD YAML via make manifests generate
- Update charts/llmkube/templates/crds/inferenceservices.yaml to match

kubectl scale and KEDA ScaledObjects can now target InferenceService:

    kubectl scale inferenceservice/my-model -n ai --replicas=1

Signed-off-by: Mircea-Pavel ANTON <contact@mirceanton.com>
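For readers unfamiliar with the marker, the type change described in the commit message might look roughly like the sketch below. It is not the project's actual source: the struct layout, the pointer type of spec.replicas, the JSON tags, and the helper scaleView are assumptions made for illustration. Only the marker syntax and the two paths come from the commit message.

```go
package main

import "fmt"

// InferenceServiceSpec is a trimmed stand-in for the real spec type.
// Assumption: replicas is an optional *int32; the actual field may differ.
type InferenceServiceSpec struct {
	Replicas *int32 `json:"replicas,omitempty"`
}

// InferenceServiceStatus is a trimmed stand-in for the real status type.
type InferenceServiceStatus struct {
	// Replicas is the value the scale subresource reports via statuspath.
	Replicas int32 `json:"replicas,omitempty"`
	// ReadyReplicas remains a separate readiness signal.
	ReadyReplicas int32 `json:"readyReplicas,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:subresource:scale:specpath=.spec.replicas,statuspath=.status.replicas

// InferenceService is a trimmed stand-in for the real CRD type.
type InferenceService struct {
	Spec   InferenceServiceSpec   `json:"spec,omitempty"`
	Status InferenceServiceStatus `json:"status,omitempty"`
}

// scaleView is a hypothetical helper that mirrors which two fields a read of
// /scale would surface: the desired count from specpath and the current
// count from statuspath. A nil spec.replicas is reported as 0 here.
func scaleView(isvc InferenceService) (specReplicas, statusReplicas int32) {
	if isvc.Spec.Replicas != nil {
		specReplicas = *isvc.Spec.Replicas
	}
	return specReplicas, isvc.Status.Replicas
}

func main() {
	want := int32(1)
	isvc := InferenceService{
		Spec:   InferenceServiceSpec{Replicas: &want},
		Status: InferenceServiceStatus{Replicas: 1, ReadyReplicas: 1},
	}
	spec, status := scaleView(isvc)
	fmt.Printf("spec.replicas=%d status.replicas=%d\n", spec, status)
}
```

With controller-gen, the marker causes the generated CRD to declare a scale subresource pointing at those two JSON paths, which is what kubectl scale and KEDA rely on.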
Force-pushed from e545dd4 to fae4f76
Defilan left a comment
Thanks for this, and thanks for getting the commits signed off. A clean, well-scoped PR: clear What/Why/How, the CRD regen done correctly, and a sample, test, and docs included. CI is green.
One thing I'd like resolved before merge. status.replicas currently mirrors readyReplicas, but the scale subresource's statusReplicasPath is conventionally the total current replica count. HPA reads that field to compute its scaling ratio, so reporting "ready" instead of "total" can cause over-scaling during rollouts. The PR and the README mention HPA, but the subresource also lacks a selectorpath, which autoscaling/v2 HPA needs to resolve pods. Two clean ways forward: either (a) point status.replicas at the Deployment's replica count and add a status.selector plus selectorpath for real HPA support, or (b) scope this PR to KEDA, which needs neither, and adjust the README to say KEDA rather than KEDA/HPA, with HPA as a follow-up.
Also strongly suggested: a test that drives the /scale subresource directly (and a scale-to-0 case) would be more reassuring than the current field-copy assertion, and a one-line note that the scale subresource and spec.autoscaling are mutually exclusive would save users a footgun. Nice contribution overall, and happy to help land it.
isvc.Status.Phase = phase
isvc.Status.ModelReady = modelReady
isvc.Status.ReadyReplicas = readyReplicas
isvc.Status.Replicas = readyReplicas
Status.Replicas is the path statusReplicasPath reads for the scale subresource. By convention it should report the total current replica count (so HPA computes a correct scaling ratio), not the ready count. Reporting readyReplicas here can cause over-scaling during rollouts when pods aren't yet Ready. Consider sourcing this from the Deployment's *Spec.Replicas instead, keeping ReadyReplicas as the separate ready signal. If KEDA-only support is the intent, that's acceptable but should be documented and the README's HPA claim dropped.
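A minimal sketch of the suggestion above, assuming the controller already has the owned Deployment in hand. The types here are local stand-ins rather than the real appsv1.Deployment or the project's status struct, and desiredReplicas and fillReplicaStatus are hypothetical names; the follow-up commit may do this differently.

```go
package main

import "fmt"

// Local stand-ins for the relevant parts of appsv1.Deployment (assumption:
// the real controller would use k8s.io/api/apps/v1 types instead).
type deploymentSpec struct {
	Replicas *int32
}

type deploymentStatus struct {
	ReadyReplicas int32
}

type deployment struct {
	Spec   deploymentSpec
	Status deploymentStatus
}

// Stand-in for the two replica fields on the InferenceService status.
type inferenceServiceStatus struct {
	Replicas      int32
	ReadyReplicas int32
}

// desiredReplicas dereferences Spec.Replicas safely. A nil value is treated
// as 1, matching the apps/v1 default for Deployments.
func desiredReplicas(d deployment) int32 {
	if d.Spec.Replicas == nil {
		return 1
	}
	return *d.Spec.Replicas
}

// fillReplicaStatus reports the intended count on Replicas (the path the
// scale subresource reads) and keeps ReadyReplicas as the readiness signal.
func fillReplicaStatus(st *inferenceServiceStatus, d deployment) {
	st.Replicas = desiredReplicas(d)
	st.ReadyReplicas = d.Status.ReadyReplicas
}

func main() {
	three := int32(3)
	// Mid-rollout: three pods requested, none Ready yet.
	d := deployment{Spec: deploymentSpec{Replicas: &three}}

	var st inferenceServiceStatus
	fillReplicaStatus(&st, d)
	fmt.Printf("replicas=%d readyReplicas=%d\n", st.Replicas, st.ReadyReplicas)
}
```

The explicit zero case matters for scale-to-zero: a Deployment with Spec.Replicas set to 0 should report 0, while only a nil pointer falls back to 1.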
Something like this perhaps? d797121
If I understand correctly, this means that KEDA always sees the intended replica count rather than the ready replica count. During a rollout with pods not yet Ready, it reports 3 (desired) instead of 0 (ready), which prevents any false over-scaling signal.
Defilan left a comment
Excellent work getting this done so fast! You addressed everything I mentioned. I'll get this merged momentarily. Welcome to LLMKube!
What
Scale-to-zero support: keep InferenceService.spec.replicas: 0 at idle, use a KEDA ScaledObject (HTTP or external trigger) to scale to 1 on demand, and scale back to 0 after a cooldown period. This essentially lets me hot-swap models on limited hardware.
Why
This feature allows InferenceServices to be scaled up or down on demand via kubectl scale or via external scalers such as KEDA. This also allows scale-to-zero to work.

Fixes #473
How
Added scale sub-resource.
Checklist
- make test passes locally
- make lint passes locally
- Commits signed off (git commit -s) per DCO