feat: add scale sub resource #474

Merged
Defilan merged 9 commits into defilantech:main from mircea-pavel-anton:feat/scale-subresource on May 17, 2026
Contributor

@mircea-pavel-anton mircea-pavel-anton commented May 16, 2026

What

Scale-to-zero support: keep InferenceService.spec.replicas: 0 at idle, use a KEDA ScaledObject (HTTP or external trigger) to scale to 1 on demand, and scale back to 0 after a cooldown period. This essentially lets me hot-swap models on limited hardware.

Why

This feature allows InferenceServices to be scaled up and down on demand via kubectl scale or via external scalers such as KEDA. It also makes scale-to-zero work.

Fixes #473

How

Added the scale subresource to the InferenceService CRD.

Checklist

  • Tests added/updated
  • make test passes locally
  • make lint passes locally
  • Commit messages follow conventional commits
  • All commits are signed off (git commit -s) per DCO
  • Documentation updated (if user-facing change)

codecov Bot commented May 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Expose spec.replicas via the standard /scale subresource so that external
autoscalers (KEDA, HPA) can target InferenceService directly. Without this,
KEDA's operator immediately deletes any ScaledObject whose scaleTargetRef
points at a CRD that does not implement /scale.

Changes:
- Add +kubebuilder:subresource:scale marker with specpath=.spec.replicas and
  statuspath=.status.replicas to InferenceService type
- Add status.Replicas int32 field (mirrors readyReplicas; this is the path
  the scale subresource reads to report current replica count)
- Populate status.Replicas = readyReplicas in updateStatusWithSchedulingInfo
- Regenerate config/crd/bases CRD YAML via make manifests generate
- Update charts/llmkube/templates/crds/inferenceservices.yaml to match

kubectl scale and KEDA ScaledObjects can now target InferenceService:
  kubectl scale inferenceservice/my-model -n ai --replicas=1

Signed-off-by: Mircea-Pavel ANTON <contact@mirceanton.com>
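
The marker and status field listed in the commit message might look roughly like the following in api/v1alpha1/inferenceservice_types.go. This is a minimal sketch rather than the PR's actual code: the real types embed Kubernetes object metadata and carry many more fields, which are left out here so the snippet compiles with the standard library alone. The marker syntax is kubebuilder's; the pointer type on spec.replicas, the JSON tags, and the encode helper are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// InferenceServiceSpec is trimmed to the one field that /scale writes to.
type InferenceServiceSpec struct {
	// Replicas is the desired replica count (assumed to be a pointer here).
	Replicas *int32 `json:"replicas,omitempty"`
}

// InferenceServiceStatus is trimmed to the replica counters.
type InferenceServiceStatus struct {
	// ReadyReplicas is the number of pods currently Ready.
	ReadyReplicas int32 `json:"readyReplicas,omitempty"`
	// Replicas is the path the scale subresource reads for the current count.
	Replicas int32 `json:"replicas,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:subresource:scale:specpath=.spec.replicas,statuspath=.status.replicas
type InferenceService struct {
	Spec   InferenceServiceSpec   `json:"spec,omitempty"`
	Status InferenceServiceStatus `json:"status,omitempty"`
}

// encode is a hypothetical helper that shows which JSON paths the markers refer to.
func encode(isvc InferenceService) string {
	b, err := json.Marshal(isvc)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	one := int32(1)
	isvc := InferenceService{
		Spec:   InferenceServiceSpec{Replicas: &one},
		Status: InferenceServiceStatus{Replicas: 1},
	}
	fmt.Println(encode(isvc))
}
```

The specpath and statuspath values in the marker are JSON paths into the serialized object, so they have to match the JSON tags on the Go fields, not the Go field names.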
Member

@Defilan Defilan left a comment

Thanks for this, and thanks for getting the commits signed off. A clean, well-scoped PR: clear What/Why/How, the CRD regen done correctly, and a sample, test, and docs included. CI is green.

One thing I'd like resolved before merge. status.replicas currently mirrors readyReplicas, but the scale subresource's statusReplicasPath is conventionally the total current replica count. HPA reads that field to compute its scaling ratio, so reporting "ready" instead of "total" can cause over-scaling during rollouts. The PR and the README mention HPA, but the subresource also lacks a selectorpath, which autoscaling/v2 HPA needs to resolve pods. Two clean ways forward: either (a) point status.replicas at the Deployment's replica count and add a status.selector plus selectorpath for real HPA support, or (b) scope this PR to KEDA, which needs neither, and adjust the README to say KEDA rather than KEDA/HPA, with HPA as a follow-up.

Also strongly suggested: a test that drives the /scale subresource directly (and a scale-to-0 case) would be more reassuring than the current field-copy assertion, and a one-line note that the scale subresource and spec.autoscaling are mutually exclusive would save users a footgun. Nice contribution overall, and happy to help land it.
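
For option (a), the field behind selectorpath has to be a plain string holding a serialized label selector, not a structured selector object. A minimal sketch of producing such a string from a set of match labels is shown below. The function name selectorString and the label keys are hypothetical; a real controller would more likely call metav1.LabelSelectorAsSelector on the Deployment's selector and use its String() result, and the marker would gain a selectorpath=.status.selector argument.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// selectorString serializes equality-based match labels into the
// "key1=value1,key2=value2" form used for a status.selector string field.
// Keys are sorted so the output is stable across reconciles.
func selectorString(matchLabels map[string]string) string {
	keys := make([]string, 0, len(matchLabels))
	for k := range matchLabels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, k+"="+matchLabels[k])
	}
	return strings.Join(parts, ",")
}

func main() {
	// Hypothetical labels; the labels LLMKube actually sets may differ.
	labels := map[string]string{
		"app":  "my-model",
		"tier": "inference",
	}
	fmt.Println(selectorString(labels))
}
```

Sorting the keys matters because Go map iteration order is random; without it the status field would churn on every reconcile and trigger needless status updates.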

Comment thread internal/controller/status_builder.go Outdated
isvc.Status.Phase = phase
isvc.Status.ModelReady = modelReady
isvc.Status.ReadyReplicas = readyReplicas
isvc.Status.Replicas = readyReplicas
Member

Status.Replicas is the path statusReplicasPath reads for the scale subresource. By convention it should report the total current replica count (so HPA computes a correct scaling ratio), not the ready count. Reporting readyReplicas here can cause over-scaling during rollouts when pods aren't yet Ready. Consider sourcing this from the Deployment's *Spec.Replicas instead, keeping ReadyReplicas as the separate ready signal. If KEDA-only support is the intent, that's acceptable but should be documented and the README's HPA claim dropped.

Contributor Author

Something like this perhaps? d797121

If I understand correctly, this means that KEDA always sees the intended replica count rather than the ready replica count. During a rollout with pods not yet Ready, it reports 3 (desired) instead of 0 (ready), which prevents any false over-scaling signal.
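
The distinction discussed in this thread can be sketched as follows. This is an illustration under stated assumptions, not necessarily what commit d797121 does: the deployment and scaleStatus types are hypothetical stand-ins for the relevant fields of appsv1.Deployment and the InferenceService status, so that the snippet runs with the standard library alone.

```go
package main

import "fmt"

// deployment is a hypothetical stand-in for the appsv1.Deployment fields used here.
type deployment struct {
	SpecReplicas  *int32 // mirrors Deployment.Spec.Replicas; nil means the API default of 1
	ReadyReplicas int32  // mirrors Deployment.Status.ReadyReplicas
}

// scaleStatus is a hypothetical stand-in for the two InferenceService status counters.
type scaleStatus struct {
	Replicas      int32 // reported through the scale subresource
	ReadyReplicas int32 // kept as the separate readiness signal
}

// statusFromDeployment reports the desired replica count as Replicas and
// passes the ready count through unchanged.
func statusFromDeployment(d deployment) scaleStatus {
	desired := int32(1) // Deployments default to 1 replica when spec.replicas is unset
	if d.SpecReplicas != nil {
		desired = *d.SpecReplicas
	}
	return scaleStatus{Replicas: desired, ReadyReplicas: d.ReadyReplicas}
}

func main() {
	// Mid-rollout: three pods requested, none Ready yet.
	three := int32(3)
	s := statusFromDeployment(deployment{SpecReplicas: &three, ReadyReplicas: 0})
	fmt.Printf("replicas=%d ready=%d\n", s.Replicas, s.ReadyReplicas) // prints "replicas=3 ready=0"
}
```

Keeping both counters lets the scale subresource report the intended count while readiness stays observable separately. An explicit zero is preserved, which is the scale-to-zero case.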

Comment thread api/v1alpha1/inferenceservice_types.go
Comment thread config/samples/inferenceservice_scale_subresource.yaml Outdated
Comment thread config/samples/inferenceservice_scale_subresource.yaml
Comment thread README.md Outdated
Signed-off-by: Mircea-Pavel ANTON <contact@mirceanton.com>
Member

@Defilan Defilan left a comment

Excellent work getting this done so fast! You addressed everything I mentioned. I'll get this merged momentarily. Welcome to LLMKube!

@Defilan Defilan merged commit 73419a5 into defilantech:main May 17, 2026
21 checks passed
@github-actions github-actions Bot mentioned this pull request May 17, 2026
@mircea-pavel-anton mircea-pavel-anton deleted the feat/scale-subresource branch May 17, 2026 16:36

Development

Successfully merging this pull request may close these issues.

[FEATURE] InferenceService should implement the Kubernetes scale subresource

2 participants