
Observability

Fábio Luciano edited this page Jun 11, 2026 · 1 revision


Endpoints

| Path | Purpose |
| --- | --- |
| / | CloudEvents receiver |
| /metrics | Prometheus metrics (also on server.metrics_addr if set; the chart ships a ServiceMonitor) |
| /healthz | Liveness: always 200 while the process serves |
| /readyz | Readiness + per-handler diagnostics (JSON) |
| /api/v1/dlq · /api/v1/dlq/replay | Dead letter queue |

/readyz

Returns 503 until handlers and decoders are registered, then 200 with live per-handler state. Check it first when notifications stop arriving:

{
  "status": "ok",
  "handlers": {
    "github": {
      "last_event_at": "2026-06-11T12:01:33Z",
      "succeeded": 1041,
      "failed": 3,
      "last_error": "github API returned 401: Bad credentials",
      "last_error_at": "2026-06-11T11:58:02Z"
    },
    "slack": { "last_event_at": "2026-06-11T12:01:33Z", "succeeded": 87, "failed": 0 }
  }
}
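
To triage a response like the one above from a script, a minimal sketch follows. It assumes only the field names visible in the sample (handlers, failed, last_error); the function name failing_handlers is hypothetical and not part of the relay.

```python
import json

# Sample body copied from the /readyz example above.
SAMPLE = """
{
  "status": "ok",
  "handlers": {
    "github": {
      "last_event_at": "2026-06-11T12:01:33Z",
      "succeeded": 1041,
      "failed": 3,
      "last_error": "github API returned 401: Bad credentials",
      "last_error_at": "2026-06-11T11:58:02Z"
    },
    "slack": { "last_event_at": "2026-06-11T12:01:33Z", "succeeded": 87, "failed": 0 }
  }
}
"""


def failing_handlers(readyz_body: str) -> dict[str, str]:
    """Map handler name -> last_error for every handler with failed > 0."""
    doc = json.loads(readyz_body)
    report = {}
    for name, state in doc.get("handlers", {}).items():
        if state.get("failed", 0) > 0:
            # last_error is absent for handlers that never failed (see "slack").
            report[name] = state.get("last_error", "(no last_error reported)")
    return report


print(failing_handlers(SAMPLE))
```

When fetching the body over HTTP, keep in mind that many clients (including Python's urllib) raise an error on the 503 returned before registration completes, so handle that case separately.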

Prometheus metrics

All relay metrics carry the tekton_events_relay_ prefix; the tables below omit it for brevity.

Event flow

| Metric | Labels | Meaning |
| --- | --- | --- |
| events_received_total | type, source | CloudEvents accepted by the receiver |
| events_processed_total | handler, status | Chain-step / handler outcomes (success/error) |
| events_filtered_total | reason | Dropped by the resource-type filter |
| events_unsupported_type_total | type | CloudEvent types with no decoder (watch after Tekton upgrades) |
| events_backpressure_total | | Events answered with 503 (Tekton will retransmit) |
| errors_permanent_total | reason | Non-retryable chain failures (these go to the DLQ when enabled) |
| pipeline_errors_total | stage | Internal chain errors per stage |

Latency

| Metric | Labels |
| --- | --- |
| chain_duration_seconds | result |
| handler_duration_seconds | handler |
| notifier_latency_seconds | handler, action |
| handler_timeouts_total | handler |

Deduplication & state

| Metric | Labels | Meaning |
| --- | --- | --- |
| deduper_hits_total | | Duplicates dropped |
| dedupe_cache_size | | Current entries (memory backend) |
| deduper_evictions_total | | LRU evictions; sustained growth means dedupe_size is too small |
| store_errors_total | backend, op | State-backend failures (relay failed open) |

Outbound reliability

| Metric | Labels | Meaning |
| --- | --- | --- |
| notifier_retries_total | host, reason | Retries (rate_limit, server_error, timeout, network_error) |
| notifier_rate_limit_hits_total | host | HTTP 429 received per destination |
| dlq_size / dlq_enqueued_total | | Dead letter queue depth / inflow |

Operations

| Metric | Labels | Meaning |
| --- | --- | --- |
| config_reloads_total | result | Hot reload attempts (success/failure) |
| handlers_registered | | Handlers built from config |

Standard HTTP server metrics (http_request_duration_seconds, http_requests_total, http_requests_in_flight) and Go runtime/process collectors are also exported.
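
For a quick look without a Prometheus server, the text served on /metrics can be read directly. The sketch below sums events_processed_total per status label. The parser is deliberately simplified (it is not a full implementation of the exposition format), the function name is hypothetical, and the sample values are made up to mirror the /readyz example.

```python
import re
from collections import defaultdict

PREFIX = "tekton_events_relay_"

# Simplified patterns for "name{labels} value" and 'key="value"' pairs.
LINE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)')
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

# Illustrative scrape output, not captured from a real relay.
SAMPLE = """
# TYPE tekton_events_relay_events_processed_total counter
tekton_events_relay_events_processed_total{handler="github",status="success"} 1041
tekton_events_relay_events_processed_total{handler="github",status="error"} 3
tekton_events_relay_events_processed_total{handler="slack",status="success"} 87
tekton_events_relay_dlq_size 0
"""


def processed_by_status(metrics_text: str) -> dict[str, float]:
    """Sum events_processed_total across handlers, grouped by status."""
    totals = defaultdict(float)
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = LINE.match(line)
        if not match or match["name"] != PREFIX + "events_processed_total":
            continue
        labels = dict(LABEL.findall(match["labels"] or ""))
        totals[labels.get("status", "")] += float(match["value"])
    return dict(totals)


print(processed_by_status(SAMPLE))
```

For anything beyond ad-hoc inspection, querying Prometheus itself is the better route, since counters only become meaningful as rates over time.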

Alerting starting points

# permanent failures appearing
increase(tekton_events_relay_errors_permanent_total[10m]) > 0

# DLQ filling up — broken credential or config
tekton_events_relay_dlq_size > 0

# being rate-limited by a provider
increase(tekton_events_relay_notifier_rate_limit_hits_total[5m]) > 10

# dedupe degraded (store down, failing open)
increase(tekton_events_relay_store_errors_total[5m]) > 0

# config rollout broke the config
increase(tekton_events_relay_config_reloads_total{result="failure"}[15m]) > 0
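
Before wiring these expressions into alert rules, it can help to run them ad hoc against the Prometheus HTTP API (instant query endpoint /api/v1/query). The sketch below builds the query URL and interprets the JSON response; the helper names and the Prometheus address are assumptions for illustration. An expression with a comparison such as "> 0" returns only the series that satisfy it, so a non-empty result means the condition holds right now.

```python
from urllib.parse import urlencode

# Two of the expressions from the list above, keyed by a short label.
CHECKS = {
    "permanent_failures": "increase(tekton_events_relay_errors_permanent_total[10m]) > 0",
    "dlq_not_empty": "tekton_events_relay_dlq_size > 0",
}


def query_url(base_url: str, expr: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


def matching_series(response: dict) -> list[dict]:
    """Return the label sets of all series present in a query response."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    return [series["metric"] for series in response["data"]["result"]]


# "http://prometheus:9090" is a placeholder address.
for label, expr in CHECKS.items():
    print(label, query_url("http://prometheus:9090", expr))
```

Fetch each URL with any HTTP client, decode the JSON body, and pass it to matching_series; every returned label set is a series for which the alert condition is currently true.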

Logging

Logs are structured JSON, emitted via zap. Setting logging.level: debug unlocks the verbose switches (caller, http_calls, payloads); payloads are logged with known secret keys redacted. Every request gets an X-Request-ID and trace/span IDs for correlation.

Log lines worth alerting/searching on: permanent error in pipeline chain, event preserved in DLQ for replay, dedupe store unavailable, processing event without deduplication, config reload: …, no decoder registered for event type.
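
A small filter over the JSON log stream can surface exactly these messages. The sketch below assumes the message lives under the "msg" key, which is zap's default for its production JSON encoder; the relay may configure a different key, so it is a parameter. The function name and the sample lines are hypothetical.

```python
import json

# Substrings taken from the list above; "config reload:" is matched as a prefix
# because the rest of that message varies.
WATCHED = (
    "permanent error in pipeline chain",
    "event preserved in DLQ for replay",
    "dedupe store unavailable",
    "processing event without deduplication",
    "config reload:",
    "no decoder registered for event type",
)

# Made-up log lines for illustration only.
SAMPLE_LINES = [
    '{"level":"info","msg":"event received"}',
    '{"level":"error","msg":"permanent error in pipeline chain"}',
    'not json at all',
    '{"level":"warn","msg":"no decoder registered for event type"}',
]


def watched_entries(lines, message_key="msg"):
    """Return parsed log entries whose message contains a watched substring."""
    hits = []
    for raw in lines:
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON lines mixed into the stream
        if not isinstance(entry, dict):
            continue
        message = str(entry.get(message_key, ""))
        if any(needle in message for needle in WATCHED):
            hits.append(entry)
    return hits


for entry in watched_entries(SAMPLE_LINES):
    print(entry["level"], entry["msg"])
```

The same substrings work as search terms in a log aggregator, which is usually a better home for long-term alerting than a script.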

Tracing

Set tracing.endpoint to an OTLP HTTP collector (e.g. otel-collector:4318) and each event produces a trace: receiver span → chain → one handler.execute span per handler (with handler.name/handler.type attributes and recorded errors). Use it to answer "which provider made this event slow".

Exporting events elsewhere (DevLake, data lakes…)

For engineering-metrics platforms, don't scrape — relay the events themselves with the generic webhook notifier and its gojq transform, shaping the payload to whatever schema the destination expects (e.g. Apache DevLake's deployments webhook). See Examples.
