Observability

Endpoints

Path	Purpose
`/`	CloudEvents receiver
`/metrics`	Prometheus metrics (also on `server.metrics_addr` if set; the chart ships a ServiceMonitor)
`/healthz`	Liveness — always 200 while the process serves
`/readyz`	Readiness + per-handler diagnostics (JSON)
`/api/v1/dlq` · `/api/v1/dlq/replay`	Dead letter queue

`/readyz`

Returns 503 until handlers and decoders are registered, then 200 with live per-handler state — your first stop when "notifications stopped":

{
  "status": "ok",
  "handlers": {
    "github": {
      "last_event_at": "2026-06-11T12:01:33Z",
      "succeeded": 1041,
      "failed": 3,
      "last_error": "github API returned 401: Bad credentials",
      "last_error_at": "2026-06-11T11:58:02Z"
    },
    "slack": { "last_event_at": "2026-06-11T12:01:33Z", "succeeded": 87, "failed": 0 }
  }
}
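As a triage aid, the response above can be reduced to the handlers that need attention. The following is a minimal sketch, assuming the payload has exactly the shape shown; the function name `unhealthy_handlers` and the rule "any failure or any recorded error counts as unhealthy" are illustrative choices, not part of the relay.

```python
import json

# Sample body copied from the /readyz example above.
SAMPLE = """
{
  "status": "ok",
  "handlers": {
    "github": {
      "last_event_at": "2026-06-11T12:01:33Z",
      "succeeded": 1041,
      "failed": 3,
      "last_error": "github API returned 401: Bad credentials",
      "last_error_at": "2026-06-11T11:58:02Z"
    },
    "slack": { "last_event_at": "2026-06-11T12:01:33Z", "succeeded": 87, "failed": 0 }
  }
}
"""


def unhealthy_handlers(readyz_body: str) -> dict:
    """Return {handler_name: reason} for handlers reporting failures.

    Hypothetical helper: a handler is flagged when its `failed` counter is
    non-zero or it carries a `last_error` string.
    """
    report = json.loads(readyz_body)
    problems = {}
    for name, state in report.get("handlers", {}).items():
        failed = state.get("failed", 0)
        last_error = state.get("last_error")
        if failed > 0 or last_error:
            problems[name] = last_error or f"{failed} failed events"
    return problems
```

Because the counters are cumulative, a flagged handler may have recovered since; comparing `last_error_at` with `last_event_at` tells you whether the error is still current.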

Prometheus metrics

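All relay metrics carry the `tekton_events_relay_` prefix. The tables below list names without it, so `dlq_size` is exported as `tekton_events_relay_dlq_size`.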

Event flow

Metric	Labels	Meaning
`events_received_total`	`type`, `source`	CloudEvents accepted by the receiver
`events_processed_total`	`handler`, `status`	Chain-step / handler outcomes (`success`/`error`)
`events_filtered_total`	`reason`	Dropped by the resource-type filter
`events_unsupported_type_total`	`type`	CloudEvent types with no decoder (watch after Tekton upgrades)
`events_backpressure_total`	—	Events answered with 503 (Tekton will retransmit)
`errors_permanent_total`	`reason`	Non-retryable chain failures (these go to the DLQ when enabled)
`pipeline_errors_total`	`stage`	Internal chain errors per stage

Latency

Metric	Labels
`chain_duration_seconds`	`result`
`handler_duration_seconds`	`handler`
`notifier_latency_seconds`	`handler`, `action`
`handler_timeouts_total`	`handler`

Deduplication & state

Metric	Labels	Meaning
`deduper_hits_total`	—	Duplicates dropped
`dedupe_cache_size`	—	Current entries (memory backend)
`deduper_evictions_total`	—	LRU evictions — sustained growth means `dedupe_size` is too small
`store_errors_total`	`backend`, `op`	State-backend failures (relay failed open)
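To make the relationship between hits, cache size, and evictions concrete, here is a small illustrative LRU deduper. It is a sketch of the general technique only, not the relay's implementation; the class `LRUDeduper` and its attribute names are hypothetical.

```python
from collections import OrderedDict


class LRUDeduper:
    """Illustrative LRU dedupe cache with hit and eviction counters."""

    def __init__(self, max_size: int):
        self.max_size = max_size      # plays the role of `dedupe_size`
        self._seen = OrderedDict()    # event id -> None, oldest first
        self.hits = 0                 # analogous to deduper_hits_total
        self.evictions = 0            # analogous to deduper_evictions_total

    @property
    def size(self) -> int:
        """Current number of entries, analogous to dedupe_cache_size."""
        return len(self._seen)

    def is_duplicate(self, event_id: str) -> bool:
        if event_id in self._seen:
            # Seen before: refresh its recency and count a hit.
            self._seen.move_to_end(event_id)
            self.hits += 1
            return True
        self._seen[event_id] = None
        if len(self._seen) > self.max_size:
            # Cache is full: drop the least recently used id.
            self._seen.popitem(last=False)
            self.evictions += 1
        return False
```

Once an id has been evicted, a late retransmission of that event is no longer recognised as a duplicate, which is why sustained eviction growth is worth acting on.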

Outbound reliability

Metric	Labels	Meaning
`notifier_retries_total`	`host`, `reason`	Retries (`rate_limit`, `server_error`, `timeout`, `network_error`)
`notifier_rate_limit_hits_total`	`host`	HTTP 429 received per destination
`dlq_size` / `dlq_enqueued_total`	—	Dead letter queue depth / inflow
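The four `reason` values suggest a classification along the following lines. This is a hedged sketch: the exact rules the relay applies are not documented here, so the mapping below (429 to `rate_limit`, 5xx to `server_error`, and so on) and the function name `retry_reason` are assumptions for illustration.

```python
import socket
from typing import Optional


def retry_reason(status: Optional[int] = None,
                 error: Optional[BaseException] = None) -> Optional[str]:
    """Map an outbound call result to a retry reason, or None if not retryable.

    Assumed mapping, not taken from the relay's source.
    """
    if error is not None:
        # Check timeouts first: TimeoutError is itself a subclass of OSError.
        if isinstance(error, (TimeoutError, socket.timeout)):
            return "timeout"
        if isinstance(error, OSError):
            return "network_error"
        return None
    if status == 429:
        return "rate_limit"
    if status is not None and 500 <= status <= 599:
        return "server_error"
    # Other statuses (for example 401) are treated as non-retryable here.
    return None
```

Under this assumed mapping, a 401 such as the "Bad credentials" case is never retried, which is consistent with it surfacing as a permanent error instead.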

Operations

Metric	Labels	Meaning
`config_reloads_total`	`result`	Hot reload attempts (`success`/`failure`)
`handlers_registered`	—	Handlers built from config

Standard HTTP server metrics (`http_request_duration_seconds`, `http_requests_total`, `http_requests_in_flight`) and the Go runtime and process collectors are also exported.
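When inspecting a raw `/metrics` scrape by hand, the shared prefix makes it easy to separate relay series from the runtime collectors. The sketch below parses the Prometheus text exposition format in a deliberately simplified way (it assumes samples carry no trailing timestamp); the sample values are made up.

```python
PREFIX = "tekton_events_relay_"

# Hypothetical scrape excerpt; the values are invented for illustration.
SAMPLE_SCRAPE = """\
# HELP tekton_events_relay_dlq_size Dead letter queue depth
# TYPE tekton_events_relay_dlq_size gauge
tekton_events_relay_dlq_size 2
tekton_events_relay_events_processed_total{handler="github",status="error"} 3
go_goroutines 42
"""


def relay_samples(exposition: str) -> dict:
    """Return {series: value} for series whose name starts with PREFIX.

    Simplification: everything after the last space is taken as the value,
    so lines with an optional timestamp are not handled.
    """
    samples = {}
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        if series.startswith(PREFIX):
            samples[series] = float(value)
    return samples
```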

Alerting starting points

# permanent failures appearing
increase(tekton_events_relay_errors_permanent_total[10m]) > 0

# DLQ filling up — broken credential or config
tekton_events_relay_dlq_size > 0

# being rate-limited by a provider
increase(tekton_events_relay_notifier_rate_limit_hits_total[5m]) > 10

# dedupe degraded (store down, failing open)
increase(tekton_events_relay_store_errors_total[5m]) > 0

# config rollout broke the config
increase(tekton_events_relay_config_reloads_total{result="failure"}[15m]) > 0
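Before turning these expressions into alert rules, it can help to try them against a live Prometheus. The sketch below only builds the request URL for the Prometheus HTTP API instant-query endpoint (`/api/v1/query`); the base URL `http://prometheus:9090` and the dictionary keys are placeholders, and sending the request is left to whatever HTTP client you use.

```python
from urllib.parse import urlencode

# Expressions copied from the list above; the keys are illustrative names.
ALERT_EXPRESSIONS = {
    "permanent_failures": "increase(tekton_events_relay_errors_permanent_total[10m]) > 0",
    "dlq_not_empty": "tekton_events_relay_dlq_size > 0",
    "rate_limited": "increase(tekton_events_relay_notifier_rate_limit_hits_total[5m]) > 10",
    "dedupe_degraded": "increase(tekton_events_relay_store_errors_total[5m]) > 0",
    "reload_failed": 'increase(tekton_events_relay_config_reloads_total{result="failure"}[15m]) > 0',
}


def instant_query_url(base_url: str, expr: str) -> str:
    """Build a Prometheus instant-query URL for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


if __name__ == "__main__":
    for name, expr in ALERT_EXPRESSIONS.items():
        # "http://prometheus:9090" is a placeholder address.
        print(name, instant_query_url("http://prometheus:9090", expr))
```

An expression that returns an empty result means the condition is not currently firing.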

Logging

Structured JSON via zap. Setting `logging.level: debug` unlocks the verbose switches (`caller`, `http_calls`, `payloads`); payloads have known secret keys redacted. Every request gets an `X-Request-ID` plus trace/span IDs for correlation.

Log lines worth alerting/searching on: permanent error in pipeline chain, event preserved in DLQ for replay, dedupe store unavailable, processing event without deduplication, config reload: …, no decoder registered for event type.
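If your log platform has no saved searches yet, a quick filter over raw output can stand in. The sketch below assumes one JSON object per line with the message under `msg` (zap's default production key; the relay's encoder settings may differ), and it matches by substring so the elided part of `config reload: …` is left open.

```python
import json

# Message fragments taken from the list above.
WATCHED_MESSAGES = (
    "permanent error in pipeline chain",
    "event preserved in DLQ for replay",
    "dedupe store unavailable",
    "processing event without deduplication",
    "config reload:",
    "no decoder registered for event type",
)


def watched_lines(log_lines, message_key="msg"):
    """Return parsed log entries whose message contains a watched fragment.

    `message_key` is an assumption; adjust it if the logs use another field.
    """
    matches = []
    for raw in log_lines:
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # ignore non-JSON lines such as startup banners
        if not isinstance(entry, dict):
            continue
        message = str(entry.get(message_key, ""))
        if any(fragment in message for fragment in WATCHED_MESSAGES):
            matches.append(entry)
    return matches
```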

Tracing

Set `tracing.endpoint` to an OTLP HTTP collector (e.g. `otel-collector:4318`) and each event produces a trace: receiver span → chain → one `handler.execute` span per handler (with `handler.name`/`handler.type` attributes and recorded errors). Use it to answer "which provider made this event slow".

Exporting events elsewhere (DevLake, data lakes…)

For engineering-metrics platforms, don't scrape — relay the events themselves with the generic webhook notifier and its gojq transform, shaping the payload to whatever schema the destination expects (e.g. Apache DevLake's deployments webhook). See Examples.
