Operations

Fábio Luciano edited this page Jun 16, 2026 · 4 revisions

Everything you need to run the relay reliably in production: scaling, shared state, the DLQ, hot reload, security hardening and shutdown behavior.

State backends

Two components hold state: the deduper (drops Tekton retransmissions by CloudEvent ID) and the accumulator (buffers TaskRuns per PipelineRun for summary comments). The store config selects where that state lives:

| | memory (default) | valkey | olric |
|---|---|---|---|
| Extra infrastructure | none | 1 small Valkey/RESP server | none (embedded in relay pods) |
| Correct with N replicas | ❌ per-pod | ✅ | ✅ |
| Survives pod restart | ❌ | ✅ | partial (rolling updates yes, full restart no) |
| Network requirements | none | egress to Valkey | pod-to-pod ports 3320/tcp + 3322/tcp+udp (chart wires NetworkPolicy + headless Service) |

Why this matters: with memory, each replica has its own dedupe cache. Tekton retransmits events (the relay even asks it to, via 503 back-pressure) — and the retransmission can land on another replica, slipping past dedup and duplicating comments; the accumulator likewise fragments summaries across pods. Run one replica with memory, or pick a shared backend before scaling.
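The failure mode is easy to reproduce in a toy simulation. The sketch below is illustrative only: `Replica`, `deliver` and the round-robin routing are hypothetical stand-ins, not the relay's code.

```python
class Replica:
    """A relay replica that dedupes by CloudEvent ID against a given cache."""

    def __init__(self, seen):
        self.seen = seen  # a set: private to one pod, or shared by all pods

    def handle(self, event_id):
        """Return True if the event is processed, False if dropped as a duplicate."""
        if event_id in self.seen:
            return False
        self.seen.add(event_id)
        return True


def deliver(replicas, event_ids):
    """Route events round-robin (a simple stand-in for load balancing)."""
    processed = 0
    for i, event_id in enumerate(event_ids):
        if replicas[i % len(replicas)].handle(event_id):
            processed += 1
    return processed


# Tekton sends event "a", then retransmits it; the retry lands on the other pod.
events = ["a", "a"]

per_pod = [Replica(set()), Replica(set())]               # memory backend, 2 replicas
shared_cache = set()
shared = [Replica(shared_cache), Replica(shared_cache)]  # shared backend, 2 replicas

duplicates_memory = deliver(per_pod, events) - 1
duplicates_shared = deliver(shared, events) - 1
print(duplicates_memory, duplicates_shared)  # 1 0
```

With private caches the retransmission is processed a second time; with one shared cache it is dropped.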

store:
  backend: valkey
  ttl: 1h
  valkey:
    address: valkey.tekton-events-relay.svc:6379

A 64Mi Valkey without persistence is enough — losing the cache only risks a rare duplicate notification. Any RESP-compatible server works (Valkey, KeyDB, …).

olric trades the external server for an embedded gossip cluster between the relay pods themselves (heavier dependency, state lost if all pods restart simultaneously).

All backends fail open: if the store is unreachable, events are processed without deduplication instead of being dropped, and tekton_events_relay_store_errors_total{backend,op} counts the failures.

Comments stay duplicate-free even during a store outage if you use mode: upsert, because idempotency then lives in the PR itself.
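The fail-open rule described above can be sketched as follows. This is a minimal illustration, not the relay's implementation: `StoreUnavailable`, `MemoryStore`, `BrokenStore` and `should_process` are hypothetical names, and the dictionary counter only stands in for the `tekton_events_relay_store_errors_total` metric.

```python
class StoreUnavailable(Exception):
    """Raised by the (hypothetical) store client when the backend is unreachable."""


class MemoryStore:
    def __init__(self):
        self._ids = set()

    def seen_before(self, event_id):
        """Record the ID and report whether it was already present."""
        if event_id in self._ids:
            return True
        self._ids.add(event_id)
        return False


class BrokenStore:
    def seen_before(self, event_id):
        raise StoreUnavailable("connection refused")


def should_process(store, event_id, error_counts):
    """Fail open: a store error means 'process anyway', never 'drop'."""
    try:
        return not store.seen_before(event_id)
    except StoreUnavailable:
        # Stand-in for incrementing the store error metric.
        error_counts["seen_before"] = error_counts.get("seen_before", 0) + 1
        return True


errors = {}
healthy = MemoryStore()
first = should_process(healthy, "evt-1", errors)          # True: new event
second = should_process(healthy, "evt-1", errors)         # False: duplicate dropped
degraded = should_process(BrokenStore(), "evt-1", errors)  # True: store down, fail open
print(first, second, degraded, errors)
```

The trade-off is deliberate: during an outage a duplicate notification is possible, but no event is silently lost.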

Dead letter queue

Permanent failures (expired token, deleted repo, 4xx from the provider) are not retried by Tekton — the relay acks them with 200. Without the DLQ they'd be lost; with it they're preserved for inspection and replay:

dlq:
  enabled: true

API (behind server.auth when enabled):

# inspect (oldest first; ?limit=N)
curl -s http://relay:8080/api/v1/dlq | jq

# replay everything: re-runs the chain; successes are removed,
# still-failing events stay with retry_count bumped
curl -s -X POST http://relay:8080/api/v1/dlq/replay

Entries carry the full envelope, failure cause, timestamp and replay count. Storage is a size-bounded JSONL file (dlq.max_size_bytes, oldest dropped first) on an emptyDir the chart mounts. Watch tekton_events_relay_dlq_size — a growing DLQ means a broken credential or config.
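To make "size-bounded, oldest dropped first" concrete, here is a small in-memory sketch of that eviction policy. It assumes one compact JSON object per line and counts the trailing newline toward the size; the function name and the entry shape are hypothetical, and the relay's actual file handling may differ.

```python
import json


def append_bounded(lines, entry, max_size_bytes):
    """Append one JSONL entry, then drop the oldest lines until the total fits."""
    lines.append(json.dumps(entry, separators=(",", ":")))
    # +1 per line accounts for the newline that terminates it in the file.
    while len(lines) > 1 and sum(len(l.encode()) + 1 for l in lines) > max_size_bytes:
        lines.pop(0)
    return lines


dlq = []
for n in (1, 2, 3):
    # Each entry serializes to 11 bytes plus a newline, so only two fit in 30 bytes.
    append_bounded(dlq, {"id": f"e{n}"}, max_size_bytes=30)

kept = [json.loads(l)["id"] for l in dlq]
print(kept)  # ['e2', 'e3']
```

Because the oldest entries go first, an undersized `dlq.max_size_bytes` loses exactly the events that have been waiting longest for a replay.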

Typical flow: token expires at 02:00 → statuses fail permanently → events accumulate in the DLQ → you rotate the secret at 09:00 → POST /api/v1/dlq/replay → all of the morning's statuses are delivered.
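The replay semantics (successes removed, still-failing events kept with retry_count bumped) can be sketched like this. Only `retry_count` comes from the text above; the `event` and `cause` field names, `replay` and `run_chain` are hypothetical.

```python
def replay(dlq, run_chain):
    """Re-run every entry; keep only those that still fail, with retry_count bumped."""
    remaining = []
    for entry in dlq:
        try:
            run_chain(entry["event"])
        except Exception as exc:
            remaining.append(
                {**entry, "cause": str(exc), "retry_count": entry["retry_count"] + 1}
            )
    return remaining


def run_chain(event):
    # Hypothetical chain: the token was rotated, but one repository no longer exists.
    if event["repo"] == "deleted/repo":
        raise RuntimeError("404 repository not found")


dlq = [
    {"event": {"repo": "org/app"}, "cause": "401 token expired", "retry_count": 0},
    {"event": {"repo": "deleted/repo"}, "cause": "404 repository not found", "retry_count": 0},
]
dlq = replay(dlq, run_chain)
print(len(dlq), dlq[0]["retry_count"])  # 1 1
```

Entries that keep failing after the underlying problem is fixed point at a second, separate cause and are worth inspecting individually.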

Configuration hot reload

The relay reloads its config without restart when the file changes (Kubernetes ConfigMap updates are detected, including the atomic symlink swap) or on SIGHUP:

kubectl exec deploy/tekton-events-relay -- kill -HUP 1

The new config is validated first; an invalid config is rejected and the current one stays active (tekton_events_relay_config_reloads_total{result="failure"}). Handlers and the chain are rebuilt and swapped atomically; in-flight events finish on the old set; the dedupe store is preserved across the swap. Secrets files are re-read too, so mounted Secret rotation propagates without restart.
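The validate-then-swap pattern can be sketched as below. This is an assumption-laden illustration rather than the relay's code: `ConfigHolder`, the `validate` rule and the `handler_timeout_sec` key are hypothetical, and the counter dictionary merely mirrors the `result` label of the reload metric.

```python
import threading


class ConfigHolder:
    """Holds the active config; reload() either swaps it fully or not at all."""

    def __init__(self, config, validate):
        self._lock = threading.Lock()
        self._config = config
        self._validate = validate
        self.reloads = {"success": 0, "failure": 0}

    def current(self):
        # Callers keep the reference they took, so in-flight work
        # finishes on the old config even after a swap.
        return self._config

    def reload(self, candidate):
        try:
            self._validate(candidate)
        except ValueError:
            with self._lock:
                self.reloads["failure"] += 1
            return False  # the current config stays active
        with self._lock:
            self._config = candidate  # one reference assignment: the swap
            self.reloads["success"] += 1
        return True


def validate(cfg):
    # Hypothetical validation rule, just for the sketch.
    if cfg.get("handler_timeout_sec", 0) <= 0:
        raise ValueError("handler_timeout_sec must be positive")


holder = ConfigHolder({"handler_timeout_sec": 10}, validate)
in_flight = holder.current()                       # an event starts processing
rejected = holder.reload({"handler_timeout_sec": 0})
accepted = holder.reload({"handler_timeout_sec": 30})
print(rejected, accepted, in_flight, holder.current())
```

The in-flight event still sees the config it started with, while new events pick up the accepted one.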

The webhook, grafana, sentry and jira notifiers go further: with a static credential they re-read the mounted secret file on every request, so a rotated Secret takes effect immediately — no config change or SIGHUP required. The webhook and Jira notifiers can also use OAuth2 client credentials (auth.oauth2 + token_url, grant_type client_credentials or refresh_token), where the access token is fetched and auto-refreshed before expiry, so a long-running pod never serves a stale token. (Grafana/Sentry have no OAuth2 client-credentials path — their APIs use a service-account / auth token — so they rely on the secret re-read.)
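"Auto-refreshed before expiry" usually means caching the token and refetching once it gets within a safety margin of its expiry. A minimal sketch under that assumption follows; `TokenCache`, the 30-second margin and the `fetch` callable are hypothetical, and a real client would call the token endpoint where `fake_fetch` is used here.

```python
import time


class TokenCache:
    """Caches an access token and refreshes it shortly before it expires."""

    def __init__(self, fetch, skew_sec=30, clock=time.monotonic):
        self._fetch = fetch      # callable returning (token, expires_in_seconds)
        self._skew = skew_sec    # refresh this many seconds before expiry
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._skew:
            token, expires_in = self._fetch()
            self._token = token
            self._expires_at = now + expires_in
        return self._token


# Drive the cache with a fake clock and a fake token endpoint.
now = [0.0]
calls = []


def fake_fetch():
    calls.append(1)
    return f"token-{len(calls)}", 300


cache = TokenCache(fake_fetch, skew_sec=30, clock=lambda: now[0])
t0 = cache.get()     # first use: fetches token-1, valid until t=300
now[0] = 200.0
t200 = cache.get()   # still outside the 30 s margin: cached token-1
now[0] = 271.0
t271 = cache.get()   # inside the margin: refreshes to token-2
print(t0, t200, t271)
```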

Sections that still require a restart (a warning is logged if they change): server, store, dlq, logging, tracing.

Security hardening

  • Inbound auth: server.auth with hmac-sha256 (validates X-Hub-Signature-256 over the body) or bearer. Add validate_timestamp: true for replay protection (X-Webhook-Timestamp within timestamp_tolerance, default 5m).
  • Native TLS: server.tls.cert_file/key_file to serve HTTPS directly instead of relying on ingress termination.
  • Outbound TLS: prefer mounting a custom CA over insecure_skip_verify for self-hosted SCMs.
  • NetworkPolicy: enabled by default in the chart; it opens Valkey egress / Olric pod-to-pod ports only when the matching backend is selected.
  • Supply chain: images and charts are Cosign-signed (keyless); verify with the commands in the repository README.
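For reference, inbound HMAC validation with an optional freshness check looks roughly like this. The sketch assumes the GitHub-style header value `sha256=<hex digest>` and a Unix-seconds timestamp; neither is confirmed here for the relay, and `verify` is a hypothetical helper.

```python
import hashlib
import hmac
import time


def verify(secret, body, signature_header, timestamp_header=None,
           tolerance_sec=300, now=None):
    """Return True if the HMAC matches and (optionally) the timestamp is fresh."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, so it does not leak the match position.
    if not hmac.compare_digest(expected.encode(), signature_header.encode()):
        return False
    if timestamp_header is not None:
        now = time.time() if now is None else now
        try:
            sent_at = int(timestamp_header)  # assumed format: Unix seconds
        except ValueError:
            return False
        if abs(now - sent_at) > tolerance_sec:
            return False  # too old (or too far in the future): possible replay
    return True


secret = b"s3cret"
body = b'{"id":"1"}'
signature = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()

ok = verify(secret, body, signature)
tampered = verify(secret, body + b"x", signature)
fresh = verify(secret, body, signature, timestamp_header="1000", now=1200)
stale = verify(secret, body, signature, timestamp_header="1000", now=1301)
print(ok, tampered, fresh, stale)  # True False True False
```

The signature alone proves who sent the body; the timestamp window is what limits how long a captured request can be replayed.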

Scaling & resources

  • The relay is I/O-bound on provider APIs; CPU/memory needs are small (the chart's defaults fit most clusters; GOMEMLIMIT is derived from the memory limit).
  • max_concurrency bounds parallel handler executions per event; handler_timeout (default 10s) keeps one slow provider from stalling dispatch (tekton_events_relay_handler_timeouts_total).
  • Outbound retry policy honors provider rate limits (429 + Retry-After); watch tekton_events_relay_notifier_rate_limit_hits_total per host.
  • Before replicaCount > 1 or HPA: configure a shared state backend.
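How a concurrency bound and a per-handler timeout interact can be sketched with asyncio. This is only an illustration of the two knobs, not the relay's dispatcher: `dispatch`, the handler names and the returned status strings are hypothetical.

```python
import asyncio


async def dispatch(event, handlers, max_concurrency=4, handler_timeout=10.0):
    """Run all handlers for one event, at most max_concurrency at a time."""
    gate = asyncio.Semaphore(max_concurrency)

    async def run_one(name, handler):
        async with gate:
            try:
                await asyncio.wait_for(handler(event), timeout=handler_timeout)
                return name, "ok"
            except asyncio.TimeoutError:
                # Stand-in for incrementing the handler timeout metric.
                return name, "timeout"

    results = await asyncio.gather(*(run_one(n, h) for n, h in handlers.items()))
    return dict(results)


async def fast(event):
    await asyncio.sleep(0)


async def slow(event):
    await asyncio.sleep(1)  # a provider that is far slower than the timeout


outcome = asyncio.run(
    dispatch({"id": "e1"}, {"github": fast, "jira": slow}, handler_timeout=0.05)
)
print(outcome)  # {'github': 'ok', 'jira': 'timeout'}
```

The slow handler is cut off at the timeout, so the fast one's result is not held back by it.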

Lifecycle & shutdown

  • Back-pressure protocol: retryable trouble → HTTP 503 → Tekton retransmits later. Permanent failure → 200 + DLQ. So brief outages self-heal: events queued by Tekton are simply re-delivered.
  • Graceful shutdown: SIGTERM → preStop sleep → server drain within shutdown_timeout_sec → store closed. Rolling updates don't lose events: anything in flight is either completed or re-sent by Tekton to the next pod.
  • Probes: /healthz (liveness), /readyz (readiness, with per-handler diagnostics — see Observability).
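The back-pressure protocol above boils down to a mapping from failure class to HTTP status. A minimal sketch, assuming failures are already classified as retryable or permanent; the exception classes, `respond` and the DLQ entry shape are hypothetical.

```python
class RetryableError(Exception):
    """Transient trouble, e.g. a provider outage or a timeout."""


class PermanentError(Exception):
    """Will never succeed on retry, e.g. an expired token or a deleted repo."""


def respond(event, run_chain, dlq):
    """Map the outcome of processing one event to the status returned to Tekton."""
    try:
        run_chain(event)
    except RetryableError:
        return 503  # ask Tekton to retransmit later
    except PermanentError as exc:
        dlq.append({"event": event, "cause": str(exc)})
        return 200  # ack so Tekton stops retrying; the DLQ keeps the event
    return 200


def succeed(event):
    pass


def flaky(event):
    raise RetryableError("provider unavailable")


def broken(event):
    raise PermanentError("401 token expired")


dlq = []
statuses = [respond({"id": "e1"}, chain, dlq) for chain in (succeed, flaky, broken)]
print(statuses, len(dlq))  # [200, 503, 200] 1
```

Only the permanent failure reaches the DLQ; the retryable one is left to Tekton's retransmission.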
