feat(observability): runtime label on inference pods + recording rules + starter dashboard (refs #409) #410
Merged
Defilan merged 3 commits into defilantech:main on May 7, 2026
Conversation
…ng rules + starter dashboard (defilantech#409)

This is the v0.1 cut of defilantech#409. Three changes that open up per-runtime inference observability without touching the Go metrics surface:

1. Pod metadata gains `inference.llmkube.dev/runtime: <name>` so Prometheus scrapes can attach a `runtime` series label without re-joining against `kube_pod_labels`. Empty `spec.runtime` maps to `llamacpp` (matching `resolveBackend`'s default).
2. PodMonitor relabelings promote the operator-managed pod labels (`service`, `namespace`, `model`, `runtime`) onto every scraped series, so dashboards and recording rules can filter and group consistently.
3. PrometheusRule grows a `llmkube-inference-recording` group with recording rules for end-to-end request latency p95 (vLLM + llama.cpp variants), TTFT p95 (vLLM only; llama.cpp lacks a first-token histogram in its `/metrics` output today), and a coarse restart-rate error proxy. Per-request 4xx/5xx counters are deferred to a follow-up on defilantech#409 once the upstream metric paths are confirmed for each runtime.

A starter Grafana dashboard at `docs/grafana/llmkube-inference.json` imports cleanly into Grafana 10+ and surfaces the four panels with the new labels.

Tests: `TestRuntimeNameLabel` covers the spec-runtime → label mapping for all current backends + nil isvc + unknown future runtimes. The existing "pod labels operator-only" assertion was widened from 3 to 4 keys to account for the new runtime label.

Refs defilantech#409

Signed-off-by: Christopher Maher <chris@mahercode.io>
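For illustration, a minimal sketch of the mapping in change 1 (the type and function names here are assumptions, not the operator's actual code; only the label key and the empty-runtime → `llamacpp` default come from the commit):

```go
package main

import "fmt"

// RuntimeLabelKey is the pod label this PR adds to inference pods.
const RuntimeLabelKey = "inference.llmkube.dev/runtime"

// InferenceService is a minimal stand-in for the operator's CRD type
// (hypothetical shape, for this sketch only).
type InferenceService struct {
	Spec struct {
		Runtime string
	}
}

// runtimeLabelValue maps spec.runtime to the label value. A nil service or
// empty spec.runtime falls back to llamacpp, matching resolveBackend's
// default; unknown future runtime names pass through unchanged.
func runtimeLabelValue(isvc *InferenceService) string {
	if isvc == nil || isvc.Spec.Runtime == "" {
		return "llamacpp"
	}
	return isvc.Spec.Runtime
}

func main() {
	fmt.Printf("%s=%s\n", RuntimeLabelKey, runtimeLabelValue(nil)) // inference.llmkube.dev/runtime=llamacpp
}
```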
Codecov Report: ✅ All modified and coverable lines are covered by tests.
…metric names + comments
The previous secrets-scan regex (`grep -r "password|secret|token"`) was a
substring match, which fired on any text containing those words, including
upstream metric names like `vllm:time_to_first_token_seconds_bucket` and
freeform comments like `# first-token-specific histogram`. The result was
that any PR adding a TTFT recording rule, or a comment mentioning the word
"token" or "secret", got blocked at CI.
The new regex requires a YAML/env-style assignment (keyword + optional
whitespace + colon + optional whitespace + non-empty literal value), which
is what actual hard-coded secrets look like. The trailing `[^[:space:]{]`
exempts Helm template placeholders like `password: {{ .Values.foo }}` so
the check still ignores legitimate template references.
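Roughly, the new check has this shape (a sketch only; the exact keyword list, scanned paths, and CI wiring may differ from what is in the repo):

```bash
#!/usr/bin/env bash
# Flag env/YAML-style assignments with a non-empty literal value.
# The final [^[:space:]{] rejects values that begin with "{", which lets
# Helm template references like `password: {{ .Values.foo }}` pass.
pattern='(password|secret|token)[[:space:]]*:[[:space:]]*[^[:space:]{]'
if grep -rEn "$pattern" charts/ config/; then
  echo "possible hard-coded secret found" >&2
  exit 1
fi
```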
Verified:
- `# first-token-specific` (comment): no longer flagged
- `vllm:time_to_first_token_seconds_bucket` (metric name): no longer flagged
- `password: hardcoded` (real bad pattern): still flagged
- `apitoken: "abc123"` (real bad pattern): still flagged
- `secret: {{ .Values.x }}` (template ref): not flagged
Refs defilantech#409
Signed-off-by: Christopher Maher <chris@mahercode.io>
Two cleanups from PR review:
- Add a `latency_kind: ttft` label to the TTFT recording rule so all three
  recording rules in the llmkube-inference-recording group are queryable
  on the same dimension (see the sketch below). Without this, a query like
  `{__name__=~"llmkube:inference:.*", latency_kind="end_to_end"}` would
  silently exclude TTFT (which is the right outcome, but only if the
  label is present everywhere).
- Add a maintenance note next to the restart-rate container regex
pointing at defilantech#409 for the vllm-swift addition, so future runtime work
doesn't forget to update the alternation.
No metric or dashboard breakage; the dashboard targets recording-rule
names directly, not the new label.
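For reference, the TTFT rule with the new label looks roughly like this (the record name and expression are illustrative; the group name, source metric, and `latency_kind` label are the parts established by this PR):

```yaml
groups:
  - name: llmkube-inference-recording
    rules:
      - record: llmkube:inference:ttft_seconds:p95   # illustrative name
        expr: |
          histogram_quantile(0.95, sum by (service, namespace, model, runtime, le) (
            rate(vllm:time_to_first_token_seconds_bucket[5m])
          ))
        labels:
          latency_kind: ttft
```

With the label in place everywhere, `{__name__=~"llmkube:inference:.*", latency_kind="ttft"}` selects exactly the TTFT series.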
Refs defilantech#409
Signed-off-by: Christopher Maher <chris@mahercode.io>
Defilan added a commit to defilantech/infercost that referenced this pull request on May 9, 2026
…as fallback (closes #45) (#56)

## Summary

Adds a third precedence rule to `ResolveBackend` so InferCost picks up the right backend automatically when running alongside LLMKube. Closes #45.

New precedence:
1. `infercost.ai/backend` annotation (explicit override; unchanged)
2. `infercost.ai/backend` label (explicit override at label level; unchanged)
3. `inference.llmkube.dev/runtime` label (NEW; LLMKube emits this on its inference pods per [LLMKube PR #410](defilantech/LLMKube#410))
4. default to llamacpp (unchanged)

## Background

LLMKube doesn't propagate arbitrary annotations to its pod templates ([LLMKube #326](defilantech/LLMKube#326)), so users running both products historically had to `kubectl annotate` every pod out of band, or accept that InferCost would default to llamacpp and silently scrape the wrong endpoint when the pod was actually vLLM. This bit us during a recent deploy: the scraper hammered port 8080 on a vLLM pod, got connection refused, reported zero tokens, and the UsageReport sat with empty fields.

## Implementation

- `internal/scraper/backend.go` gains a `LLMKubeRuntimeLabel` constant and a `BackendSource` enum so callers can tell which rule fired without re-running the resolution. `ResolveBackend` is preserved as a backward-compatible thin wrapper around the new `ResolveBackendWithSource`.
- The two scrape loops (UsageReport + CostProfile controllers) now call `ResolveBackendWithSource` and emit an info-level log line when rule 3 fires: `inferred backend from LLMKube runtime label`. Not magic, just convenient; explicit is better than implicit.
- Empty LLMKube label values fall through to the default; only non-empty values participate in the inference.
- Unknown LLMKube runtime values (e.g. `tgi`, `personaplex`) are treated as llamacpp via the existing `normalizeBackend`, which is non-fatal so new LLMKube runtimes don't break InferCost installs that haven't bumped.

## Tests

`backend_test.go` adds 8 new cases covering:
- vLLM fallback path from the LLMKube runtime label
- llamacpp fallback path from the LLMKube runtime label
- Annotation precedence over the LLMKube label (rule 1 still wins)
- InferCost label precedence over the LLMKube label (rule 2 still wins)
- Unknown LLMKube runtime values → llamacpp default
- Empty LLMKube label value → default source (not LLMKube source)
- `ResolveBackendWithSource` reports the correct source for all four rules

```
$ go test ./internal/scraper/... -count=1
ok      github.com/defilantech/infercost/internal/scraper      0.351s
$ go test ./internal/controller/... -count=1
ok      github.com/defilantech/infercost/internal/controller   6.888s
```

## Compatibility

The behavior change is additive: pods without an `infercost.ai/backend` annotation/label now resolve to a sensible backend instead of defaulting to llamacpp; pods with the explicit override are unaffected. The `ResolveBackend` signature is preserved as a thin wrapper, so any external callers continue to work.

Closes #45

Signed-off-by: Christopher Maher <chris@mahercode.io>
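A condensed sketch of the resolution order described above (the `BackendSource` constant names, the plain-string return type, and `normalizeBackend`'s internals are assumptions; the label/annotation keys and fallthrough behavior come from the commit message):

```go
package scraper

import corev1 "k8s.io/api/core/v1"

// BackendSource reports which precedence rule resolved the backend.
type BackendSource int

const (
	SourceAnnotation BackendSource = iota // rule 1: infercost.ai/backend annotation
	SourceLabel                           // rule 2: infercost.ai/backend label
	SourceLLMKube                         // rule 3: inference.llmkube.dev/runtime label
	SourceDefault                         // rule 4: llamacpp default
)

const (
	backendKey          = "infercost.ai/backend"
	LLMKubeRuntimeLabel = "inference.llmkube.dev/runtime"
)

// normalizeBackend treats unknown runtime values as llamacpp (non-fatal,
// so new LLMKube runtimes don't break older InferCost installs).
func normalizeBackend(v string) string {
	if v == "vllm" {
		return "vllm"
	}
	return "llamacpp"
}

// ResolveBackendWithSource applies the four precedence rules in order;
// empty values fall through to the next rule.
func ResolveBackendWithSource(pod *corev1.Pod) (string, BackendSource) {
	if v := pod.Annotations[backendKey]; v != "" {
		return normalizeBackend(v), SourceAnnotation
	}
	if v := pod.Labels[backendKey]; v != "" {
		return normalizeBackend(v), SourceLabel
	}
	if v := pod.Labels[LLMKubeRuntimeLabel]; v != "" {
		return normalizeBackend(v), SourceLLMKube
	}
	return "llamacpp", SourceDefault
}

// ResolveBackend is the preserved backward-compatible wrapper.
func ResolveBackend(pod *corev1.Pod) string {
	backend, _ := ResolveBackendWithSource(pod)
	return backend
}
```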
Summary
v0.1 cut of #409. Three small changes that open up per-runtime inference observability without touching the Go metrics surface:
- Pod metadata gains `inference.llmkube.dev/runtime: <name>`. (Empty `spec.runtime` maps to `llamacpp`, matching `resolveBackend`'s default.)
- PodMonitor relabelings promote the operator-managed pod labels (`service`, `namespace`, `model`, `runtime`) onto every scraped series, so dashboards and recording rules can filter without re-joining against `kube_pod_labels`.
- PrometheusRule gains a `llmkube-inference-recording` group with recording rules for:
  - end-to-end request latency p95 (vLLM via `vllm:e2e_request_latency_seconds_bucket` + llama.cpp via `llamacpp:request_seconds_bucket`)
  - TTFT p95 via `vllm:time_to_first_token_seconds_bucket` (llama.cpp lacks a first-token histogram in its `/metrics` output today)
  - a coarse restart-rate error proxy

Plus a starter Grafana dashboard at `docs/grafana/llmkube-inference.json` that imports cleanly into Grafana 10+ and surfaces request latency / TTFT / queue wait / restart rate, all grouped by service + runtime + namespace.

What's deferred (follow-ups on #409)

- Per-request 4xx/5xx error counters, once the upstream metric paths are confirmed for each runtime
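For orientation, the relabeling that promotes the runtime label looks roughly like this on the PodMonitor (the port name and the shape of the other entries are assumptions; the `__meta_` label name follows Prometheus's standard sanitization of `inference.llmkube.dev/runtime`):

```yaml
podMetricsEndpoints:
  - port: metrics   # port name is an assumption
    relabelings:
      # inference.llmkube.dev/runtime sanitizes to this discovery label.
      - sourceLabels: [__meta_kubernetes_pod_label_inference_llmkube_dev_runtime]
        targetLabel: runtime
      # service / namespace / model are promoted the same way from their
      # operator-managed pod labels.
```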
Testing

Unit
- `TestRuntimeNameLabel` covers the spec-runtime → label mapping for all current backends, empty/nil isvc, and unknown future runtimes

Helm chart
- `helm lint charts/llmkube` is clean
- `helm template charts/llmkube --set prometheus.inferencePodMonitor.enabled=true --show-only templates/inference-podmonitor.yaml` renders the relabelings as expected
- `helm template charts/llmkube --set prometheus.prometheusRule.enabled=true --set prometheus.prometheusRule.rules.inference.enabled=true --show-only templates/prometheusrule.yaml` renders the new recording-rule group

Manual smoke (planned this week)
- confirm `kube_pod_labels` no longer needs to be joined into PromQL for runtime breakdowns
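If that holds, a per-runtime breakdown reduces to grouping on the promoted label (the query is illustrative; the metric is the vLLM histogram cited above):

```promql
# p95 end-to-end latency per runtime, with no kube_pod_labels join
histogram_quantile(0.95,
  sum by (runtime, le) (rate(vllm:e2e_request_latency_seconds_bucket[5m]))
)
```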
Compatibility

No CRD changes, no breaking changes. The new pod label is additive; existing PodMonitor scrapes that don't use the new relabelings continue to work. Recording rules use new series names (`llmkube:inference:*`), so there are no naming collisions with existing rules.

Refs #409