
feat(observability): runtime label on inference pods + recording rules + starter dashboard (refs #409)#410

Merged
Defilan merged 3 commits into defilantech:main from Defilan:feat/inference-metrics-v0.1 on May 7, 2026

Conversation

Defilan commented on May 7, 2026

Summary

This is the v0.1 cut of #409: three small changes that open up per-runtime inference observability without touching the Go metrics surface:

  1. Pod metadata gains `inference.llmkube.dev/runtime: <name>`. (An empty `spec.runtime` maps to `llamacpp`, matching `resolveBackend`'s default.)
  2. PodMonitor relabelings promote the operator-managed pod labels (`service`, `namespace`, `model`, `runtime`) onto every scraped series, so dashboards and recording rules can filter without re-joining against `kube_pod_labels` (see the sketch after this list).
  3. PrometheusRule grows a `llmkube-inference-recording` group with recording rules for:
    • End-to-end request latency p95 (vLLM via `vllm:e2e_request_latency_seconds_bucket`, llama.cpp via `llamacpp:request_seconds_bucket`)
    • TTFT p95 for vLLM via `vllm:time_to_first_token_seconds_bucket` (llama.cpp lacks a first-token histogram in its /metrics output today)
    • Inference container restart rate as a coarse error proxy
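
A minimal sketch of how the three pieces fit together, as two Kubernetes objects. Object names, the selector, the port, and the record name/expression are illustrative assumptions inferred from this description, not the chart's literal output:

```yaml
# Illustrative only: names, selector, port, and record/expr are assumed,
# not the chart's verbatim output.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: llmkube-inference            # assumed name
spec:
  selector:
    matchLabels:
      app.kubernetes.io/managed-by: llmkube   # assumed selector
  podMetricsEndpoints:
    - port: metrics                  # assumed port name
      relabelings:
        # Promote the operator-managed pod label onto every scraped series.
        - sourceLabels: [__meta_kubernetes_pod_label_inference_llmkube_dev_runtime]
          targetLabel: runtime
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: llmkube-inference            # assumed name
spec:
  groups:
    - name: llmkube-inference-recording
      rules:
        # With `runtime` on the series, a dashboard can query e.g.
        #   sum by (runtime) (llmkube:inference:request_latency_seconds:p95)
        - record: llmkube:inference:request_latency_seconds:p95   # assumed record name
          expr: |
            histogram_quantile(0.95,
              sum by (service, namespace, model, runtime, le) (
                rate(vllm:e2e_request_latency_seconds_bucket[5m])))
          labels:
            latency_kind: end_to_end
```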

Plus a starter Grafana dashboard at `docs/grafana/llmkube-inference.json` that imports cleanly into Grafana 10+ and surfaces request latency / TTFT / queue wait / restart rate, all grouped by service + runtime + namespace.

What's deferred (follow-ups on #409)

  • Per-request 4xx/5xx counter (need to confirm upstream metric paths per runtime first)
  • TTFT alert rules (separate PR after baseline data exists)
  • vllm-swift native metrics support (currently skeletal in tree)
  • Per-runtime breakdown panels beyond the basic four (iterate post-merge)

Testing

Unit

  • `TestRuntimeNameLabel` covers the spec-runtime → label mapping for all current backends, empty/nil isvc, and unknown future runtimes
  • The existing "pod labels operator-only" assertion was widened from 3 to 4 keys to account for the new runtime label

Helm chart

  • `helm lint charts/llmkube` passes clean
  • `helm template charts/llmkube --set prometheus.inferencePodMonitor.enabled=true --show-only templates/inference-podmonitor.yaml` renders the relabelings as expected
  • `helm template charts/llmkube --set prometheus.prometheusRule.enabled=true --set prometheus.prometheusRule.rules.inference.enabled=true --show-only templates/prometheusrule.yaml` renders the new recording-rule group

Manual smoke (planned this week)

  • Deploy on a real cluster, drive load, confirm Grafana dashboard panels populate
  • Confirm `kube_pod_labels` no longer needs to be joined into PromQL for runtime breakdowns (see the before/after sketch below)
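
To make the second check concrete, the kind of before/after query this enables. The `_count` metric name is assumed as the sibling of the bucket metric named above; the PromQL is wrapped in YAML for readability:

```yaml
# Illustrative PromQL only; metric and label names are assumptions.
before: |
  # runtime breakdown previously needed a kube-state-metrics join
  sum by (label_inference_llmkube_dev_runtime) (
    rate(llamacpp:request_seconds_count[5m])
    * on (namespace, pod) group_left(label_inference_llmkube_dev_runtime)
      kube_pod_labels)
after: |
  # the relabeled series now carry `runtime` directly
  sum by (runtime) (rate(llamacpp:request_seconds_count[5m]))
```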

Compatibility

No CRD changes, no breaking changes. The new pod label is additive; existing PodMonitor scrapes that don't use the new relabelings continue to work. The recording rules introduce new series names (`llmkube:inference:*`), so there are no naming collisions with existing rules.

Refs #409

feat(observability): runtime label on inference pods + recording rules + starter dashboard (defilantech#409)

This is the v0.1 cut of defilantech#409. Three changes that open up per-runtime
inference observability without touching the Go metrics surface:

1. Pod metadata gains `inference.llmkube.dev/runtime: <name>` so Prometheus
   scrapes can attach a `runtime` series label without re-joining against
   `kube_pod_labels`. Empty `spec.runtime` maps to `llamacpp` (matches
   resolveBackend's default).

2. PodMonitor relabelings promote the operator-managed pod labels
   (`service`, `namespace`, `model`, `runtime`) onto every scraped series,
   so dashboards and recording rules can filter and group consistently.

3. PrometheusRule grows a `llmkube-inference-recording` group with
   recording rules for end-to-end request latency p95 (vLLM + llama.cpp
   variants), TTFT p95 (vLLM only — llama.cpp lacks a first-token
   histogram in its /metrics output today), and a coarse restart-rate
   error proxy. Per-request 4xx/5xx counters are deferred to a follow-up
   on defilantech#409 once the upstream metric paths are confirmed for each runtime.

A starter Grafana dashboard at `docs/grafana/llmkube-inference.json`
imports cleanly into Grafana 10+ and surfaces the four panels with the
new labels.

Tests: TestRuntimeNameLabel covers the spec-runtime → label mapping for
all current backends + nil isvc + unknown future runtimes. The existing
"pod labels operator-only" assertion was widened from 3 to 4 keys to
account for the new runtime label.

Refs defilantech#409

Signed-off-by: Christopher Maher <chris@mahercode.io>
Defilan added the enhancement (New feature or request), area/observability (Monitoring, metrics, logging, tracing), and priority/high (High priority) labels on May 7, 2026
Defilan self-assigned this on May 7, 2026
Defilan added the size/medium (Medium effort, 1-3 days) label on May 7, 2026

codecov bot commented on May 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


Defilan added 2 commits May 6, 2026 18:17
…metric names + comments

The previous secrets-scan regex (`grep -rE "password|secret|token"`) was
a substring match, which fired on any text containing those words,
including upstream metric names like
`vllm:time_to_first_token_seconds_bucket` and freeform comments like
`# first-token-specific histogram`. The result was that any PR adding a
TTFT recording rule, or a comment mentioning the word "token" or
"secret", got blocked at CI.

The new regex requires a YAML/env-style assignment (keyword + optional
whitespace + colon + optional whitespace + non-empty literal value), which
is what actual hard-coded secrets look like. The trailing `[^[:space:]{]`
exempts Helm template placeholders like `password: {{ .Values.foo }}` so
the check still ignores legitimate template references.
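
A sketch of the tightened check as a CI step; the step name and
scanned path are illustrative assumptions, only the regex shape
follows the description above:

```yaml
# Illustrative GitHub Actions step; name and path are assumptions.
- name: secrets-scan
  run: |
    # Flag `keyword: value` assignments whose value starts with a real
    # character. The trailing [^[:space:]{] lets Helm references like
    # `password: {{ .Values.foo }}` through.
    ! grep -rnE '(password|secret|token)[[:space:]]*:[[:space:]]*[^[:space:]{]' .
```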

Verified:
- `# first-token-specific` (comment): no longer flagged
- `vllm:time_to_first_token_seconds_bucket` (metric name): no longer flagged
- `password: hardcoded` (real bad pattern): still flagged
- `apitoken: "abc123"` (real bad pattern): still flagged
- `secret: {{ .Values.x }}` (template ref): not flagged

Refs defilantech#409

Signed-off-by: Christopher Maher <chris@mahercode.io>
Two cleanups from PR review:

- Add `latency_kind: ttft` label to the TTFT recording rule so all three
  recording rules in the llmkube-inference-recording group are queryable
  on the same dimension (see the sketch after this list). Without this,
  a query like `{__name__=~"llmkube:inference:.*", latency_kind="end_to_end"}`
  would silently exclude TTFT, which is the right outcome, but only if
  the label is present everywhere.

- Add a maintenance note next to the restart-rate container regex
  pointing at defilantech#409 for the vllm-swift addition, so future runtime work
  doesn't forget to update the alternation.
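
Illustratively, the TTFT rule now has the shape sketched below. The
record name and expression are assumptions from the PR description;
the `labels` stanza is the part this commit adds:

```yaml
# Record name and expr are assumed, not verbatim from the chart.
- record: llmkube:inference:ttft_seconds:p95
  expr: |
    histogram_quantile(0.95,
      sum by (service, namespace, model, runtime, le) (
        rate(vllm:time_to_first_token_seconds_bucket[5m])))
  labels:
    latency_kind: ttft
```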

No metric or dashboard breakage; the dashboard targets recording-rule
names directly, not the new label.

Refs defilantech#409

Signed-off-by: Christopher Maher <chris@mahercode.io>
Defilan merged commit 71743ed into defilantech:main on May 7, 2026
19 checks passed
Defilan added a commit to defilantech/infercost that referenced this pull request May 9, 2026
…as fallback (closes #45) (#56)

## Summary

Adds a third precedence rule to `ResolveBackend` so InferCost picks up
the right backend automatically when running alongside LLMKube. Closes
#45.

New precedence:

1. `infercost.ai/backend` annotation (explicit override; unchanged)
2. `infercost.ai/backend` label (explicit override at label level; unchanged)
3. `inference.llmkube.dev/runtime` label (NEW; LLMKube emits this on its inference pods per [LLMKube PR #410](defilantech/LLMKube#410); see the sketch after this list)
4. Default to llamacpp (unchanged)
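
For concreteness, a pod carrying the LLMKube label, as sketched below, now resolves to vllm via rule 3 with no InferCost-specific metadata. The pod name and everything except the label key/value are illustrative:

```yaml
# Illustrative pod metadata; only the runtime label matters for rule 3.
apiVersion: v1
kind: Pod
metadata:
  name: chat-svc-vllm-0                    # assumed name
  labels:
    inference.llmkube.dev/runtime: vllm    # emitted by LLMKube (PR #410)
    # no infercost.ai/backend annotation or label, so rules 1-2 don't fire
```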

## Background

LLMKube doesn't propagate arbitrary annotations to its pod templates
([LLMKube #326](defilantech/LLMKube#326)), so
users running both products historically had to `kubectl annotate` every
pod out of band, or accept that InferCost would default to llamacpp and
silently scrape the wrong endpoint when the pod was actually vLLM. This
bit us during a recent deploy: the scraper hammered port 8080 on a vLLM
pod, got connection-refused, reported zero tokens, and the UsageReport
sat with empty fields.

## Implementation

- `internal/scraper/backend.go` gains `LLMKubeRuntimeLabel` constant and
a `BackendSource` enum so callers can tell which rule fired without
re-running the resolution. `ResolveBackend` is preserved as a
backward-compatible thin wrapper around the new
`ResolveBackendWithSource`.
- The two scrape loops (UsageReport + CostProfile controllers) now call
`ResolveBackendWithSource` and emit an info-level log line when rule 3
fires: `inferred backend from LLMKube runtime label`. Not magic, just
convenient; explicit is better than implicit.
- Empty LLMKube label values fall through to the default; only non-empty
values participate in the inference.
- Unknown LLMKube runtime values (e.g. `tgi`, `personaplex`) are treated
as llamacpp via the existing `normalizeBackend`, which is non-fatal so
new LLMKube runtimes don't break InferCost installs that haven't bumped.

## Tests

`backend_test.go` adds 8 new cases covering:
- VLLM fallback path from LLMKube runtime label
- llamacpp fallback path from LLMKube runtime label
- Annotation precedence over LLMKube label (rule 1 still wins)
- InferCost label precedence over LLMKube label (rule 2 still wins)
- Unknown LLMKube runtime values → llamacpp default
- Empty LLMKube label value → default source (not LLMKube source)
- `ResolveBackendWithSource` reports the correct source for all four
rules

```
$ go test ./internal/scraper/... -count=1
ok      github.com/defilantech/infercost/internal/scraper       0.351s

$ go test ./internal/controller/... -count=1
ok      github.com/defilantech/infercost/internal/controller    6.888s
```

## Compatibility

Behavior change is additive: pods without `infercost.ai/backend`
annotation/label now resolve to a sensible backend instead of defaulting
to llamacpp; pods with the explicit override are unaffected.
`ResolveBackend` signature is preserved as a thin wrapper, so any
external callers continue to work.

Closes #45

Signed-off-by: Christopher Maher <chris@mahercode.io>