Observability
| Path | Purpose |
|---|---|
| `/` | CloudEvents receiver |
| `/metrics` | Prometheus metrics (also on `server.metrics_addr` if set; the chart ships a ServiceMonitor) |
| `/healthz` | Liveness: always 200 while the process serves |
| `/readyz` | Readiness + per-handler diagnostics (JSON) |
| `/api/v1/dlq` · `/api/v1/dlq/replay` | Dead letter queue |

`/readyz` returns 503 until handlers and decoders are registered, then 200 with live per-handler state. It is your first stop when "notifications stopped":
```json
{
  "status": "ok",
  "handlers": {
    "github": {
      "last_event_at": "2026-06-11T12:01:33Z",
      "succeeded": 1041,
      "failed": 3,
      "last_error": "github API returned 401: Bad credentials",
      "last_error_at": "2026-06-11T11:58:02Z"
    },
    "slack": { "last_event_at": "2026-06-11T12:01:33Z", "succeeded": 87, "failed": 0 }
  }
}
```

All relay metrics carry the `tekton_events_relay_` prefix.
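
Before moving on to the metrics themselves, here is a rough sketch of how the `/readyz` payload shown above could be triaged from a script. It assumes only the fields visible in that sample (`handlers`, `failed`, `last_error`, `last_event_at`); the function name, the 30-minute staleness threshold, and the idea of treating silence as a problem are illustrative assumptions, not relay behaviour.

```python
import json
from datetime import datetime, timedelta, timezone


def parse_ts(value):
    """Parse an RFC 3339 timestamp such as 2026-06-11T12:01:33Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def unhealthy_handlers(readyz_body, now, stale_after=timedelta(minutes=30)):
    """Return {handler_name: [reasons]} for handlers that look unhealthy.

    The staleness threshold is a hypothetical choice for this sketch.
    """
    report = json.loads(readyz_body)
    problems = {}
    for name, state in report.get("handlers", {}).items():
        reasons = []
        if state.get("failed", 0) > 0 and state.get("last_error"):
            reasons.append(f"{state['failed']} failed, last error: {state['last_error']}")
        last_event = state.get("last_event_at")
        if last_event is None or now - parse_ts(last_event) > stale_after:
            reasons.append("no recent events")
        if reasons:
            problems[name] = reasons
    return problems


# The sample body is the /readyz example from this page.
SAMPLE = """
{
  "status": "ok",
  "handlers": {
    "github": {
      "last_event_at": "2026-06-11T12:01:33Z",
      "succeeded": 1041,
      "failed": 3,
      "last_error": "github API returned 401: Bad credentials",
      "last_error_at": "2026-06-11T11:58:02Z"
    },
    "slack": {"last_event_at": "2026-06-11T12:01:33Z", "succeeded": 87, "failed": 0}
  }
}
"""

now = datetime(2026, 6, 11, 12, 5, tzinfo=timezone.utc)
result = unhealthy_handlers(SAMPLE, now)
for handler, reasons in result.items():
    print(handler, "->", "; ".join(reasons))
```

In practice the body would come from an HTTP GET against `/readyz`; the parsing is kept separate here so it can be exercised without a running relay.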

| Metric | Labels | Meaning |
|---|---|---|
| `events_received_total` | `type`, `source` | CloudEvents accepted by the receiver |
| `events_processed_total` | `handler`, `status` | Chain-step / handler outcomes (success/error) |
| `events_filtered_total` | `reason` | Dropped by the resource-type filter |
| `events_unsupported_type_total` | `type` | CloudEvent types with no decoder (watch after Tekton upgrades) |
| `events_backpressure_total` | none | Events answered with 503 (Tekton will retransmit) |
| `errors_permanent_total` | `reason` | Non-retryable chain failures (these go to the DLQ when enabled) |
| `pipeline_errors_total` | `stage` | Internal chain errors per stage |

| Metric | Labels |
|---|---|
| `chain_duration_seconds` | `result` |
| `handler_duration_seconds` | `handler` |
| `notifier_latency_seconds` | `handler`, `action` |
| `handler_timeouts_total` | `handler` |

| Metric | Labels | Meaning |
|---|---|---|
| `deduper_hits_total` | none | Duplicates dropped |
| `dedupe_cache_size` | none | Current entries (memory backend) |
| `deduper_evictions_total` | none | LRU evictions; sustained growth means `dedupe_size` is too small |
| `store_errors_total` | `backend`, `op` | State-backend failures (relay failed open) |

| Metric | Labels | Meaning |
|---|---|---|
| `notifier_retries_total` | `host`, `reason` | Retries (`rate_limit`, `server_error`, `timeout`, `network_error`) |
| `notifier_rate_limit_hits_total` | `host` | HTTP 429 received per destination |
| `dlq_size` / `dlq_enqueued_total` | none | Dead letter queue depth / inflow |

| Metric | Labels | Meaning |
|---|---|---|
| `config_reloads_total` | `result` | Hot reload attempts (success/failure) |
| `handlers_registered` | none | Handlers built from config |

Standard HTTP server metrics (http_request_duration_seconds, http_requests_total, http_requests_in_flight) and Go runtime/process collectors are also exported.
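
For a quick look at relay-only series without a Prometheus server, the `/metrics` text can be filtered on the `tekton_events_relay_` prefix. The following is a minimal sketch, not a full parser for the Prometheus text exposition format: it ignores `# HELP`/`# TYPE` lines and does not handle every escaping corner case. The sample input is made up for illustration, and the function name is hypothetical.

```python
import re

PREFIX = "tekton_events_relay_"

# name, optional {labels}, value, optional integer timestamp
SAMPLE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def relay_samples(exposition_text):
    """Return (short_name, labels, value) for every relay-prefixed sample."""
    samples = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match or not match.group(1).startswith(PREFIX):
            continue  # not a relay metric (e.g. Go runtime collectors)
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name[len(PREFIX):], labels, float(value)))
    return samples


# Illustrative scrape output; the values and HELP text are invented.
SCRAPE = """
# HELP tekton_events_relay_dlq_size Dead letter queue depth
# TYPE tekton_events_relay_dlq_size gauge
tekton_events_relay_dlq_size 2
tekton_events_relay_events_processed_total{handler="github",status="error"} 3
tekton_events_relay_events_processed_total{handler="github",status="success"} 1041
go_goroutines 14
"""

samples = relay_samples(SCRAPE)
for name, labels, value in samples:
    print(name, labels, value)
```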

```promql
# permanent failures appearing
increase(tekton_events_relay_errors_permanent_total[10m]) > 0

# DLQ filling up: broken credential or config
tekton_events_relay_dlq_size > 0

# being rate-limited by a provider
increase(tekton_events_relay_notifier_rate_limit_hits_total[5m]) > 10

# dedupe degraded (store down, failing open)
increase(tekton_events_relay_store_errors_total[5m]) > 0

# config rollout broke the config
increase(tekton_events_relay_config_reloads_total{result="failure"}[15m]) > 0
```

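
Most of these expressions compare a counter's growth over a window against a threshold. As an intuition aid only, the sketch below computes a simplified increase over a list of counter samples, treating any drop as a process restart. Prometheus' real `increase()` also extrapolates to the window boundaries, so its results can differ from this; the function name is hypothetical.

```python
def simple_increase(samples):
    """Sum the growth of a counter across samples, treating a drop as a reset.

    Simplified on purpose: no extrapolation to the window edges.
    """
    total = 0.0
    previous = None
    for value in samples:
        if previous is not None:
            if value >= previous:
                total += value - previous
            else:
                # Counter restarted from zero; count everything since the reset.
                total += value
        previous = value
    return total


# Counter went 3 -> 5, the process restarted, then it reached 2 again.
window = [3, 3, 5, 1, 2]
print(simple_increase(window) > 0)
```

A flat counter yields an increase of zero, which is why `> 0` only fires when new permanent errors, store errors, or failed reloads appear inside the window.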
Logs are structured JSON, emitted via zap. Setting `logging.level: debug` unlocks the verbose switches (`caller`, `http_calls`, `payloads`; known secret keys are redacted from payloads). Every request gets an `X-Request-ID` and trace/span IDs for correlation.
Log lines worth alerting/searching on: permanent error in pipeline chain, event preserved in DLQ for replay, dedupe store unavailable, processing event without deduplication, config reload: …, no decoder registered for event type.
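
Where no log platform is in place, the phrases above can be matched with a small script. This sketch assumes each log line is one JSON object and that the message lives under the `msg` key (zap's production default; the relay's actual key is not confirmed here, so it is a parameter). It matches by substring, so it works whether a phrase appears alone or as part of a longer message. The sample lines and their extra fields are invented.

```python
import json

ALERT_PHRASES = (
    "permanent error in pipeline chain",
    "event preserved in DLQ for replay",
    "dedupe store unavailable",
    "processing event without deduplication",
    "config reload:",
    "no decoder registered for event type",
)


def matching_log_lines(lines, message_key="msg"):
    """Return (phrase, entry) pairs for log entries containing an alert phrase."""
    hits = []
    for raw in lines:
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not a JSON log line
        if not isinstance(entry, dict):
            continue
        message = str(entry.get(message_key, ""))
        for phrase in ALERT_PHRASES:
            if phrase in message:
                hits.append((phrase, entry))
                break  # report each line once
    return hits


# Hypothetical log lines for illustration only.
LOG_LINES = [
    '{"level":"error","msg":"permanent error in pipeline chain","handler":"github"}',
    '{"level":"info","msg":"event processed"}',
    '{"level":"warn","msg":"no decoder registered for event type","type":"example.unknown"}',
    "not json at all",
]

hits = matching_log_lines(LOG_LINES)
for phrase, entry in hits:
    print(phrase, "|", entry.get("level"))
```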
Set tracing.endpoint to an OTLP HTTP collector (e.g. otel-collector:4318) and each event produces a trace: receiver span → chain → one handler.execute span per handler (with handler.name/handler.type attributes and recorded errors). Use it to answer "which provider made this event slow".
For engineering-metrics platforms, don't scrape — relay the events themselves with the generic webhook notifier and its gojq transform, shaping the payload to whatever schema the destination expects (e.g. Apache DevLake's deployments webhook). See Examples.