
feat!: multi-source observability and guardrails (v2.0.0 part 2, #34) #37

Merged
fxthiry merged 2 commits into release/2.0.0 from fix/multi-source-vl-observability
Apr 16, 2026


@fxthiry fxthiry commented Apr 16, 2026

Completes #34 — multi-source VictoriaLogs observability + guardrails. Builds on PR #36 (multi-source core). Last PR before the v2.0.0 release tag.

What ships

  • All per-rule Prometheus counters now also carry a vl_source label. Multi-source operators can attribute every alert, error, throttle hit, and reconnect to a specific source. valerter_queue_size (shared, not per-source) stays unlabeled.
  • valerter_victorialogs_up{rule_name} removed and replaced by valerter_vl_source_up{vl_source}. Per-source reachability gauge: 1 on connect success, 0 on permanent failure or stream error. Initialized to 0 for every configured source at startup.
  • defaults.max_streams: 50 (configurable). Total streams = sum of (rule, source) pairs spawned for enabled rules. Exceeding the cap fails the config at load, with an error that reports both the actual count and the cap value. max_streams: 0 is explicitly rejected.
  • ±10% uniform jitter on reconnect backoff per (rule, source) task. Sources behind a flapping load balancer no longer reconnect in lock-step.
  • tests/metrics_snapshot.rs integration test that scrapes /metrics after a 2-source 1-rule warmup and asserts the set of metric names + label keys against an inline expected list. Catches accidental relabel/rename in future PRs.
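
To illustrate the snapshot idea in the last bullet, here is a minimal sketch of how a `/metrics` body could be reduced to a set of metric names and label keys. It is not the code in tests/metrics_snapshot.rs: the function name `metric_shapes` is hypothetical, and the parser assumes label values contain no commas or braces.

```rust
use std::collections::BTreeSet;

/// Reduces a Prometheus text exposition body to `(metric name, sorted label keys)`.
/// Hypothetical helper; assumes label values contain no `,`, `{` or `}`.
fn metric_shapes(body: &str) -> BTreeSet<(String, Vec<String>)> {
    let mut shapes = BTreeSet::new();
    for line in body.lines() {
        let line = line.trim();
        // Skip blank lines and `# HELP` / `# TYPE` comments.
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        // Split `name{labels} value` into the name and the raw label list.
        let (name, labels) = match line.find('{') {
            Some(open) => {
                let rest = &line[open + 1..];
                let close = rest.rfind('}').unwrap_or(rest.len());
                (&line[..open], &rest[..close])
            }
            None => (line.split_whitespace().next().unwrap_or(""), ""),
        };
        // Keep only the label keys; values change between runs, keys should not.
        let mut keys: Vec<String> = labels
            .split(',')
            .filter_map(|pair| pair.split_once('='))
            .map(|(key, _)| key.trim().to_string())
            .collect();
        keys.sort();
        shapes.insert((name.to_string(), keys));
    }
    shapes
}
```

A test built this way would compare the returned set against an inline expected list, so a renamed metric or a dropped label key shows up as a set difference rather than a silent dashboard break.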

Validation

  • cargo test — 540+ tests pass (479 lib + integration suite, including new metrics_snapshot)
  • cargo clippy --all-targets --all-features -- -D warnings — clean
  • cargo fmt --check — clean
  • cargo run --bin valerter -- --validate -c config/config.example.yaml — exit 0
  • Manual live regression: 2-source config (one remote, one local docker container). Cap rejection verified with max_streams: 2 override (4 streams required > cap of 2). With max_streams: 50: /metrics scrape shows valerter_vl_source_up{vl_source} for both sources at 1, all 8 per-rule counters carry both rule_name and vl_source labels, and valerter_victorialogs_up series is absent.
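
The cap rejection exercised in the manual regression can be sketched as follows. This is an illustration under stated assumptions, not valerter's actual config code: the `Rule` struct, the `check_max_streams` name, and the error wording are hypothetical, and the 4-stream case is modelled here as two enabled rules that each fan out to both sources.

```rust
/// Hypothetical view of a rule: only the fields needed to count streams.
struct Rule {
    enabled: bool,
    /// Number of sources this rule fans out to.
    sources: usize,
}

/// Returns the total stream count, or an error naming both the count and the cap.
fn check_max_streams(rules: &[Rule], max_streams: usize) -> Result<usize, String> {
    // A cap of 0 would reject every config, so it is refused outright.
    if max_streams == 0 {
        return Err("defaults.max_streams must be greater than 0".to_string());
    }
    // Total streams = sum of (rule, source) pairs over enabled rules only.
    let total: usize = rules
        .iter()
        .filter(|rule| rule.enabled)
        .map(|rule| rule.sources)
        .sum();
    if total > max_streams {
        return Err(format!(
            "config requires {total} streams, exceeding defaults.max_streams = {max_streams}"
        ));
    }
    Ok(total)
}
```

With two enabled two-source rules, a cap of 2 yields an error mentioning both 4 and 2, while the default cap of 50 accepts the same config.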

Breaking changes summary (full text in CHANGELOG v2.0.0)

  • Every metric currently labeled by rule_name now also carries vl_source (additive).
  • valerter_victorialogs_up{rule_name} removed; replaced by valerter_vl_source_up{vl_source} (per-source semantics, not per-rule). PromQL migration examples in CHANGELOG.
  • defaults.max_streams cap introduced (default 50). Configs whose enabled-rule fan-out exceeds the cap fail at load.

Out of scope (deferred)

  • valerter_active_streams gauge (mentioned in earlier scoping; not needed for v2.0.0 acceptance, follow-up if requested).
  • Configurable jitter range (currently hardcoded at ±10%; a knob can be added if real-world operation shows 10% is the wrong value).
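
For readers unfamiliar with the jitter being discussed, a minimal sketch of ±10% uniform jitter on a backoff delay follows. The names `jittered_backoff` and `JITTER_FRACTION`, the `floor` parameter, and the choice to pass the random sample in from the caller are assumptions made for illustration; the PR does not show how valerter draws its randomness or what its floor value is.

```rust
use std::time::Duration;

/// Hardcoded jitter range: the delay is scaled by a factor in [0.9, 1.1].
const JITTER_FRACTION: f64 = 0.10;

/// Applies uniform jitter to `base`, never returning less than `floor`.
/// `unit` is a uniform sample in [0, 1) supplied by the caller's RNG (assumption).
fn jittered_backoff(base: Duration, unit: f64, floor: Duration) -> Duration {
    // Guard against out-of-range or non-finite samples.
    let unit = if unit.is_finite() { unit.clamp(0.0, 1.0) } else { 0.5 };
    // Map [0, 1) onto [1 - 0.10, 1 + 0.10).
    let factor = 1.0 - JITTER_FRACTION + 2.0 * JITTER_FRACTION * unit;
    // Clamp so a small base delay cannot be jittered below the floor.
    base.mul_f64(factor).max(floor)
}
```

Because each (rule, source) task would draw its own sample, tasks that lost their connection at the same instant end up with slightly different delays, which is what breaks the lock-step reconnects.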

François-Xavier THIRY added 2 commits April 16, 2026 14:36
Adds the `vl_source` Prometheus label to every per-rule counter, replaces
the legacy `valerter_victorialogs_up{rule_name}` gauge with a per-source
`valerter_vl_source_up{vl_source}`, introduces a `defaults.max_streams`
cap with load-time enforcement, applies +/-10% uniform jitter on reconnect
backoff per (rule, source) task to break thundering-herd alignment, and
ships a `/metrics` snapshot integration test that catches accidental
relabel or rename in future PRs.

Multi-source operators can now attribute every alert and error rate to a
specific source. The new gauge makes partial-source outages visible
without requiring per-rule reachability inference. The cap prevents an
accidental (rule x source) fan-out from DoS'ing a backend at startup.

Post-review hardening folded in: dated v1.2.1 in CHANGELOG, added
PromQL migration snippet for dashboards, extended snapshot coverage
to include lines_discarded / query_duration / last_query_timestamp /
per-notifier sentinel counters, made the jitter floor test exercise the
actual clamp branch, demoted backoff_delay_default to pub(crate),
extended ReconnectCallback::on_reconnect to receive vl_source, and
rejected max_streams=0 at load with a clear error.

Cargo.toml not bumped here; version bump lives in the release PR.
…dy_html

Documents a pre-existing concern made wider by the v1.2.0 #26 fix: the
example config now actively pipes `{{ _msg }}` into `body`, and operators
who mirror this pattern in `email_body_html` may inadvertently render
unescaped HTML/script content from untrusted log fields. Hardening of the
email path is a follow-up; the advisory makes the limit visible at the
v2.0.0 release boundary.
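
As an illustration of the risk the advisory describes, the sketch below shows the kind of escaping an untrusted log field would need before being placed in an HTML body. It is a generic example, not valerter code: `escape_html` is a hypothetical helper, and the PR does not say how (or whether) the eventual hardening will be implemented.

```rust
/// Escapes the characters that let untrusted text break out of HTML content
/// or attribute values. Hypothetical helper, shown for illustration only.
fn escape_html(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        match ch {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&#x27;"),
            _ => out.push(ch),
        }
    }
    out
}
```

A log line such as `<script>alert('x')</script>` would then appear in the email as literal text instead of being interpreted as markup.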
@fxthiry fxthiry merged commit 70db8fe into release/2.0.0 Apr 16, 2026
@fxthiry fxthiry mentioned this pull request Apr 16, 2026
5 tasks