
feat(observability): runtime label on inference pods + recording rules + starter dashboard (refs #409)#410

Merged
Defilan merged 3 commits into defilantech:main from Defilan:feat/inference-metrics-v0.1 on May 7, 2026

Conversation

Defilan commented on May 7, 2026

Summary

This is the v0.1 cut of #409: three small changes that open up per-runtime inference observability without touching the Go metrics surface:

  1. Pod metadata gains `inference.llmkube.dev/runtime: <name>`. (An empty `spec.runtime` maps to `llamacpp`, matching `resolveBackend`'s default.)
  2. PodMonitor relabelings promote the operator-managed pod labels (`service`, `namespace`, `model`, `runtime`) onto every scraped series, so dashboards and recording rules can filter without re-joining against `kube_pod_labels` (see the sketch after this list).
  3. PrometheusRule grows a `llmkube-inference-recording` group with recording rules for:
    • End-to-end request latency p95 (vLLM via `vllm:e2e_request_latency_seconds_bucket`, llama.cpp via `llamacpp:request_seconds_bucket`)
    • TTFT p95 for vLLM via `vllm:time_to_first_token_seconds_bucket` (llama.cpp lacks a first-token histogram in its /metrics output today)
    • Inference container restart rate as a coarse error proxy
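
A minimal sketch of how the three pieces fit together, as two Kubernetes objects. Object names, the selector, the port, and the record name/expression are illustrative assumptions inferred from this description, not the chart's literal output:

```yaml
# Illustrative only: names, selector, port, and record/expr are assumed,
# not the chart's verbatim output.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: llmkube-inference            # assumed name
spec:
  selector:
    matchLabels:
      app.kubernetes.io/managed-by: llmkube   # assumed selector
  podMetricsEndpoints:
    - port: metrics                  # assumed port name
      relabelings:
        # Promote the operator-managed pod label onto every scraped series.
        - sourceLabels: [__meta_kubernetes_pod_label_inference_llmkube_dev_runtime]
          targetLabel: runtime
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: llmkube-inference            # assumed name
spec:
  groups:
    - name: llmkube-inference-recording
      rules:
        # With `runtime` on the series, a dashboard can query e.g.
        #   sum by (runtime) (llmkube:inference:request_latency_seconds:p95)
        - record: llmkube:inference:request_latency_seconds:p95   # assumed record name
          expr: |
            histogram_quantile(0.95,
              sum by (service, namespace, model, runtime, le) (
                rate(vllm:e2e_request_latency_seconds_bucket[5m])))
          labels:
            latency_kind: end_to_end
```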

Plus a starter Grafana dashboard at `docs/grafana/llmkube-inference.json` that imports cleanly into Grafana 10+ and surfaces request latency / TTFT / queue wait / restart rate, all grouped by service + runtime + namespace.

What's deferred (follow-ups on #409)

  • Per-request 4xx/5xx counter (need to confirm upstream metric paths per runtime first)
  • TTFT alert rules (separate PR after baseline data exists)
  • vllm-swift native metrics support (currently skeletal in tree)
  • Per-runtime breakdown panels beyond the basic four (iterate post-merge)

Testing

Unit

  • `TestRuntimeNameLabel` covers the spec-runtime → label mapping for all current backends, empty/nil isvc, and unknown future runtimes
  • The existing "pod labels operator-only" assertion was widened from 3 to 4 keys to account for the new runtime label

Helm chart

  • `helm lint charts/llmkube` passes clean
  • `helm template charts/llmkube --set prometheus.inferencePodMonitor.enabled=true --show-only templates/inference-podmonitor.yaml` renders the relabelings as expected
  • `helm template charts/llmkube --set prometheus.prometheusRule.enabled=true --set prometheus.prometheusRule.rules.inference.enabled=true --show-only templates/prometheusrule.yaml` renders the new recording-rule group

Manual smoke (planned this week)

  • Deploy on a real cluster, drive load, confirm Grafana dashboard panels populate
  • Confirm `kube_pod_labels` no longer needs to be joined into PromQL for runtime breakdowns (see the before/after sketch below)
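
To make the second check concrete, the kind of before/after query this enables. The `_count` metric name is assumed as the sibling of the bucket metric named above; the PromQL is wrapped in YAML for readability:

```yaml
# Illustrative PromQL only; metric and label names are assumptions.
before: |
  # runtime breakdown previously needed a kube-state-metrics join
  sum by (label_inference_llmkube_dev_runtime) (
    rate(llamacpp:request_seconds_count[5m])
    * on (namespace, pod) group_left(label_inference_llmkube_dev_runtime)
      kube_pod_labels)
after: |
  # the relabeled series now carry `runtime` directly
  sum by (runtime) (rate(llamacpp:request_seconds_count[5m]))
```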

Compatibility

No CRD changes, no breaking changes. The new pod label is additive; existing PodMonitor scrapes that don't use the new relabelings continue to work. The recording rules introduce new series names (`llmkube:inference:*`), so there are no naming collisions with existing rules.

Refs #409

feat(observability): runtime label on inference pods + recording rules + starter dashboard (defilantech#409)

This is the v0.1 cut of defilantech#409. Three changes that open up per-runtime
inference observability without touching the Go metrics surface:

1. Pod metadata gains `inference.llmkube.dev/runtime: <name>` so Prometheus
   scrapes can attach a `runtime` series label without re-joining against
   `kube_pod_labels`. Empty `spec.runtime` maps to `llamacpp` (matches
   resolveBackend's default).

2. PodMonitor relabelings promote the operator-managed pod labels
   (`service`, `namespace`, `model`, `runtime`) onto every scraped series,
   so dashboards and recording rules can filter and group consistently.

3. PrometheusRule grows a `llmkube-inference-recording` group with
   recording rules for end-to-end request latency p95 (vLLM + llama.cpp
   variants), TTFT p95 (vLLM only — llama.cpp lacks a first-token
   histogram in its /metrics output today), and a coarse restart-rate
   error proxy. Per-request 4xx/5xx counters are deferred to a follow-up
   on defilantech#409 once the upstream metric paths are confirmed for each runtime.

A starter Grafana dashboard at `docs/grafana/llmkube-inference.json`
imports cleanly into Grafana 10+ and surfaces the four panels with the
new labels.

Tests: TestRuntimeNameLabel covers the spec-runtime → label mapping for
all current backends + nil isvc + unknown future runtimes. The existing
"pod labels operator-only" assertion was widened from 3 to 4 keys to
account for the new runtime label.

Refs defilantech#409

Signed-off-by: Christopher Maher <chris@mahercode.io>
Defilan added the enhancement (New feature or request), area/observability (Monitoring, metrics, logging, tracing), and priority/high (High priority) labels on May 7, 2026
Defilan self-assigned this on May 7, 2026
Defilan added the size/medium (Medium effort, 1-3 days) label on May 7, 2026

codecov bot commented on May 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


Defilan added 2 commits May 6, 2026 18:17
…metric names + comments

The previous secrets-scan regex (`grep -rE "password|secret|token"`) was
a substring match, which fired on any text containing those words,
including upstream metric names like
`vllm:time_to_first_token_seconds_bucket` and freeform comments like
`# first-token-specific histogram`. The result was that any PR adding a
TTFT recording rule, or a comment mentioning the word "token" or
"secret", got blocked at CI.

The new regex requires a YAML/env-style assignment (keyword + optional
whitespace + colon + optional whitespace + non-empty literal value), which
is what actual hard-coded secrets look like. The trailing `[^[:space:]{]`
exempts Helm template placeholders like `password: {{ .Values.foo }}` so
the check still ignores legitimate template references.
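
A sketch of the tightened check as a CI step; the step name and
scanned path are illustrative assumptions, only the regex shape
follows the description above:

```yaml
# Illustrative GitHub Actions step; name and path are assumptions.
- name: secrets-scan
  run: |
    # Flag `keyword: value` assignments whose value starts with a real
    # character. The trailing [^[:space:]{] lets Helm references like
    # `password: {{ .Values.foo }}` through.
    ! grep -rnE '(password|secret|token)[[:space:]]*:[[:space:]]*[^[:space:]{]' .
```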

Verified:
- `# first-token-specific` (comment): no longer flagged
- `vllm:time_to_first_token_seconds_bucket` (metric name): no longer flagged
- `password: hardcoded` (real bad pattern): still flagged
- `apitoken: "abc123"` (real bad pattern): still flagged
- `secret: {{ .Values.x }}` (template ref): not flagged

Refs defilantech#409

Signed-off-by: Christopher Maher <chris@mahercode.io>
Two cleanups from PR review:

- Add `latency_kind: ttft` label to the TTFT recording rule so all three
  recording rules in the llmkube-inference-recording group are queryable
  on the same dimension (see the sketch after this list). Without this,
  a query like `{__name__=~"llmkube:inference:.*", latency_kind="end_to_end"}`
  would silently exclude TTFT, which is the right outcome, but only if
  the label is present everywhere.

- Add a maintenance note next to the restart-rate container regex
  pointing at defilantech#409 for the vllm-swift addition, so future runtime work
  doesn't forget to update the alternation.
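
Illustratively, the TTFT rule now has the shape sketched below. The
record name and expression are assumptions from the PR description;
the `labels` stanza is the part this commit adds:

```yaml
# Record name and expr are assumed, not verbatim from the chart.
- record: llmkube:inference:ttft_seconds:p95
  expr: |
    histogram_quantile(0.95,
      sum by (service, namespace, model, runtime, le) (
        rate(vllm:time_to_first_token_seconds_bucket[5m])))
  labels:
    latency_kind: ttft
```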

No metric or dashboard breakage; the dashboard targets recording-rule
names directly, not the new label.

Refs defilantech#409

Signed-off-by: Christopher Maher <chris@mahercode.io>
Defilan merged commit 71743ed into defilantech:main on May 7, 2026
19 checks passed
Defilan added a commit to defilantech/infercost that referenced this pull request May 9, 2026
…as fallback (closes #45) (#56)

## Summary

Adds a third precedence rule to `ResolveBackend` so InferCost picks up
the right backend automatically when running alongside LLMKube. Closes
#45.

New precedence:

1. `infercost.ai/backend` annotation (explicit override; unchanged)
2. `infercost.ai/backend` label (explicit override at label level; unchanged)
3. `inference.llmkube.dev/runtime` label (NEW; LLMKube emits this on its inference pods per [LLMKube PR #410](defilantech/LLMKube#410); see the sketch after this list)
4. Default to llamacpp (unchanged)
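
For concreteness, a pod carrying the LLMKube label, as sketched below, now resolves to vllm via rule 3 with no InferCost-specific metadata. The pod name and everything except the label key/value are illustrative:

```yaml
# Illustrative pod metadata; only the runtime label matters for rule 3.
apiVersion: v1
kind: Pod
metadata:
  name: chat-svc-vllm-0                    # assumed name
  labels:
    inference.llmkube.dev/runtime: vllm    # emitted by LLMKube (PR #410)
    # no infercost.ai/backend annotation or label, so rules 1-2 don't fire
```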

## Background

LLMKube doesn't propagate arbitrary annotations to its pod templates
([LLMKube #326](defilantech/LLMKube#326)), so
users running both products historically had to `kubectl annotate` every
pod out of band, or accept that InferCost would default to llamacpp and
silently scrape the wrong endpoint when the pod was actually vLLM. This
bit us during a recent deploy: the scraper hammered port 8080 on a vLLM
pod, got connection-refused, reported zero tokens, and the UsageReport
sat with empty fields.

## Implementation

- `internal/scraper/backend.go` gains `LLMKubeRuntimeLabel` constant and
a `BackendSource` enum so callers can tell which rule fired without
re-running the resolution. `ResolveBackend` is preserved as a
backward-compatible thin wrapper around the new
`ResolveBackendWithSource`.
- The two scrape loops (UsageReport + CostProfile controllers) now call
`ResolveBackendWithSource` and emit an info-level log line when rule 3
fires: `inferred backend from LLMKube runtime label`. Not magic, just
convenient; explicit is better than implicit.
- Empty LLMKube label values fall through to the default; only non-empty
values participate in the inference.
- Unknown LLMKube runtime values (e.g. `tgi`, `personaplex`) are treated
as llamacpp via the existing `normalizeBackend`, which is non-fatal so
new LLMKube runtimes don't break InferCost installs that haven't bumped.

## Tests

`backend_test.go` adds 8 new cases covering:
- VLLM fallback path from LLMKube runtime label
- llamacpp fallback path from LLMKube runtime label
- Annotation precedence over LLMKube label (rule 1 still wins)
- InferCost label precedence over LLMKube label (rule 2 still wins)
- Unknown LLMKube runtime values → llamacpp default
- Empty LLMKube label value → default source (not LLMKube source)
- `ResolveBackendWithSource` reports the correct source for all four
rules

```
$ go test ./internal/scraper/... -count=1
ok      github.com/defilantech/infercost/internal/scraper       0.351s

$ go test ./internal/controller/... -count=1
ok      github.com/defilantech/infercost/internal/controller    6.888s
```

## Compatibility

Behavior change is additive: pods without `infercost.ai/backend`
annotation/label now resolve to a sensible backend instead of defaulting
to llamacpp; pods with the explicit override are unaffected.
`ResolveBackend` signature is preserved as a thin wrapper, so any
external callers continue to work.

Closes #45

Signed-off-by: Christopher Maher <chris@mahercode.io>