Operations
Everything you need to run the relay reliably in production: scaling, shared state, the DLQ, hot reload, security hardening and shutdown behavior.
Two components hold state: the deduper (drops Tekton retransmissions by CloudEvent ID) and the accumulator (buffers TaskRuns per PipelineRun for summary comments). The store config selects where that state lives:
| | memory (default) | valkey | olric |
|---|---|---|---|
| Extra infrastructure | none | 1 small Valkey/RESP server | none (embedded in relay pods) |
| Correct with N replicas | ❌ per-pod | ✅ | ✅ |
| Survives pod restart | ❌ | ✅ | partial (rolling updates yes, full restart no) |
| Network requirements | none | egress to Valkey | pod-to-pod ports 3320/tcp + 3322/tcp+udp (chart wires NetworkPolicy + headless Service) |
Why this matters: with memory, each replica has its own dedupe cache. Tekton retransmits events (the relay even asks it to, via 503 back-pressure), and a retransmission can land on another replica, slipping past dedup and duplicating comments; the accumulator likewise fragments summaries across pods. Run one replica with memory, or pick a shared backend before scaling.

    store:
      backend: valkey
      ttl: 1h
      valkey:
        address: valkey.tekton-events-relay.svc:6379

A 64Mi Valkey without persistence is enough: losing the cache only risks a rare duplicate notification. Any RESP-compatible server works (Valkey, KeyDB, …).
olric trades the external server for an embedded gossip cluster between the relay pods themselves (heavier dependency, state lost if all pods restart simultaneously).
All backends fail open: if the store is unreachable, events are processed without deduplication instead of being dropped, and tekton_events_relay_store_errors_total{backend,op} counts the failures.
Comments survive even that, if you use mode: upsert — idempotency then lives in the PR itself.
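The interaction between shared state and fail-open behavior can be sketched in a few lines of Python. This is only an illustrative model, not the relay's code: `MemoryStore`, `Replica`, `StoreUnavailable` and `deliveries` are hypothetical names, and `store_errors` merely stands in for the `tekton_events_relay_store_errors_total` counter.

```python
class StoreUnavailable(Exception):
    """Raised by a store backend that cannot be reached (hypothetical)."""


class MemoryStore:
    """Stand-in for any backend: remembers the event IDs it has seen."""

    def __init__(self, healthy=True):
        self.ids = set()
        self.healthy = healthy

    def add_if_absent(self, event_id):
        """Return True if event_id is new, False if it was already recorded."""
        if not self.healthy:
            raise StoreUnavailable("store unreachable")
        if event_id in self.ids:
            return False
        self.ids.add(event_id)
        return True


class Replica:
    """One relay pod: processes an event unless the store reports a duplicate."""

    def __init__(self, store):
        self.store = store
        self.processed = []
        self.store_errors = 0  # stands in for the store error counter

    def handle(self, event_id):
        try:
            is_new = self.store.add_if_absent(event_id)
        except StoreUnavailable:
            self.store_errors += 1
            is_new = True  # fail open: process the event rather than drop it
        if is_new:
            self.processed.append(event_id)
        return is_new


def deliveries(replica_a, replica_b, event_id):
    """Deliver an event to A, then its retransmission to B; count processings."""
    return int(replica_a.handle(event_id)) + int(replica_b.handle(event_id))
```

With one `MemoryStore` per replica, `deliveries` reports two processings for the same event ID; with a single store shared by both replicas it reports one. An unhealthy store makes every call count as new, which is the duplicate risk that fail-open accepts in exchange for never dropping an event.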
Permanent failures (expired token, deleted repo, 4xx from the provider) are not retried by Tekton, because the relay acks them with 200. Without the DLQ they would be lost; with it they are preserved for inspection and replay:

    dlq:
      enabled: true

API (behind server.auth when enabled):

    # inspect (oldest first; ?limit=N)
    curl -s http://relay:8080/api/v1/dlq | jq
    # replay everything: re-runs the chain; successes are removed,
    # still-failing events stay with retry_count bumped
    curl -s -X POST http://relay:8080/api/v1/dlq/replay

Entries carry the full envelope, failure cause, timestamp and replay count. Storage is a size-bounded JSONL file (dlq.max_size_bytes, oldest dropped first) on an emptyDir the chart mounts. Watch tekton_events_relay_dlq_size: a growing DLQ means a broken credential or config.
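The storage and replay rules described here can be modeled as follows. This is a rough in-memory sketch under stated assumptions: the class `DeadLetterQueue` and the entry field names other than `retry_count` are hypothetical, and the real file format may differ.

```python
import json


class DeadLetterQueue:
    """In-memory stand-in for a size-bounded JSONL file."""

    def __init__(self, max_size_bytes):
        self.max_size_bytes = max_size_bytes
        self.lines = []  # one JSON document per entry, oldest first

    def size(self):
        # +1 per entry for the newline a JSONL file would contain
        return sum(len(line.encode("utf-8")) + 1 for line in self.lines)

    def add(self, envelope, cause, timestamp):
        entry = {"envelope": envelope, "cause": cause,
                 "timestamp": timestamp, "retry_count": 0}
        self.lines.append(json.dumps(entry))
        # Enforce the bound by dropping the oldest entries first.
        while self.lines and self.size() > self.max_size_bytes:
            self.lines.pop(0)

    def replay(self, deliver):
        """Re-run deliver(envelope) per entry; keep only entries that still fail."""
        remaining = []
        for line in self.lines:
            entry = json.loads(line)
            try:
                deliver(entry["envelope"])
            except Exception:
                entry["retry_count"] += 1  # still failing: stays in the queue
                remaining.append(json.dumps(entry))
        self.lines = remaining
        return len(remaining)
```

Calling `replay` with a delivery function that now succeeds empties the queue, which corresponds to the state after a rotated credential and a `POST /api/v1/dlq/replay`.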
Typical flow: token expires at 02:00 → statuses fail permanently → events accumulate in the DLQ → you rotate the secret at 09:00 → POST /api/v1/dlq/replay → all morning's statuses are delivered.
The relay reloads its config without restart when the file changes (Kubernetes ConfigMap updates are detected, including the atomic symlink swap) or on SIGHUP:

    kubectl exec deploy/tekton-events-relay -- kill -HUP 1

The new config is validated first; an invalid config is rejected and the current one stays active (counted in tekton_events_relay_config_reloads_total{result="failure"}). Handlers and the chain are rebuilt and swapped atomically; in-flight events finish on the old set; the dedupe store is preserved across the swap. Secret files are re-read too, so mounted Secret rotation propagates without restart.
The webhook, grafana, sentry and jira notifiers go further: with a static credential they re-read the mounted secret file on every request, so a rotated Secret takes effect immediately — no config change or SIGHUP required. The webhook and Jira notifiers can also use OAuth2 client credentials (auth.oauth2 + token_url, grant_type client_credentials or refresh_token), where the access token is fetched and auto-refreshed before expiry, so a long-running pod never serves a stale token. (Grafana/Sentry have no OAuth2 client-credentials path — their APIs use a service-account / auth token — so they rely on the secret re-read.)
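The two credential strategies above can be illustrated with a small sketch. It is not the relay's implementation: `read_static_credential`, `TokenCache`, the `fetch` callback and the 60-second refresh margin are all assumptions chosen for the example.

```python
import time
from pathlib import Path


def read_static_credential(path):
    """Read the mounted secret file on every call, so a rotation is seen at once."""
    return Path(path).read_text(encoding="utf-8").strip()


class TokenCache:
    """Caches an OAuth2 access token and refreshes it shortly before expiry."""

    def __init__(self, fetch, refresh_margin=60.0, clock=time.monotonic):
        self._fetch = fetch            # returns (access_token, expires_in_seconds)
        self._margin = refresh_margin  # assumed safety margin before expiry
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def token(self):
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._margin:
            self._token, expires_in = self._fetch()
            self._expires_at = now + expires_in
        return self._token
```

Reading the file per request costs one small read per notification but needs no reload signal; the token cache trades that simplicity for fewer round trips to the token endpoint.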
Sections that still require a restart (a warning is logged if they change): server, store, dlq, logging, tracing.
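The validate-then-swap behavior, together with the restart-only sections, might look roughly like this. The sketch is hypothetical throughout: `Reloader`, `validate` and its single rule are invented for illustration, and only the section names in `RESTART_ONLY` come from the text above.

```python
import threading

RESTART_ONLY = ("server", "store", "dlq", "logging", "tracing")


class ConfigError(ValueError):
    """Raised when a candidate config fails validation."""


def validate(config):
    # Hypothetical rule for the sketch; real validation would check far more.
    if not config.get("handlers"):
        raise ConfigError("at least one handler is required")


class Reloader:
    def __init__(self, config):
        validate(config)
        self._lock = threading.Lock()
        self._active = config
        self.reloads = {"success": 0, "failure": 0}

    def current(self):
        """Snapshot used by one event; it keeps the old set if a swap happens later."""
        with self._lock:
            return self._active

    def reload(self, candidate):
        """Validate first, then swap; return the restart-only sections that changed."""
        try:
            validate(candidate)
        except ConfigError:
            with self._lock:
                self.reloads["failure"] += 1
            return None  # rejected: the active config is untouched
        with self._lock:
            old, self._active = self._active, candidate
            self.reloads["success"] += 1
        return [s for s in RESTART_ONLY if old.get(s) != candidate.get(s)]
```

A caller could log a warning for every section name that `reload` returns, since those changes would only take effect after a restart.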
- Inbound auth: `server.auth` with `hmac-sha256` (validates `X-Hub-Signature-256` over the body) or `bearer`. Add `validate_timestamp: true` for replay protection (`X-Webhook-Timestamp` within `timestamp_tolerance`, default 5m).
- Native TLS: `server.tls.cert_file` / `key_file` to serve HTTPS directly instead of relying on ingress termination.
- Outbound TLS: prefer mounting a custom CA over `insecure_skip_verify` for self-hosted SCMs.
- NetworkPolicy: enabled by default in the chart; it opens Valkey egress / Olric pod-to-pod ports only when the matching backend is selected.
- Supply chain: images and charts are Cosign-signed (keyless); verify with the commands in the repository README.
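For senders that need to produce a matching signature, the inbound HMAC check can be approximated as below. Several details are assumptions rather than documented relay behavior: the `sha256=<hex>` header format follows the GitHub convention for `X-Hub-Signature-256`, the timestamp is taken to be Unix seconds, and the timestamp is checked separately rather than being part of the signed payload.

```python
import hashlib
import hmac
import time


def sign(secret: bytes, body: bytes) -> str:
    """Return the header value for a body, in the assumed 'sha256=<hex>' format."""
    return "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret, body, signature_header, timestamp_header=None,
           tolerance=300, now=None):
    """Check the signature and, if a timestamp is given, its freshness."""
    expected = sign(secret, body).encode("utf-8")
    received = (signature_header or "").encode("utf-8")
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(expected, received):
        return False
    if timestamp_header is not None:
        try:
            sent_at = int(timestamp_header)  # assumed: Unix seconds
        except ValueError:
            return False
        now = time.time() if now is None else now
        if abs(now - sent_at) > tolerance:  # 300 s mirrors the 5m default
            return False
    return True
```

A sender would compute `sign(secret, body)` over the exact bytes it posts and send the result in `X-Hub-Signature-256`; re-serializing the JSON after signing would change the bytes and invalidate the signature.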
- The relay is I/O-bound on provider APIs; CPU/memory needs are small (the chart's defaults fit most clusters; `GOMEMLIMIT` is derived from the memory limit).
- `max_concurrency` bounds parallel handler executions per event; `handler_timeout` (default 10s) keeps one slow provider from stalling dispatch (`tekton_events_relay_handler_timeouts_total`).
- Outbound retry policy honors provider rate limits (429 + `Retry-After`); watch `tekton_events_relay_notifier_rate_limit_hits_total` per host.
- Before `replicaCount > 1` or HPA: configure a shared state backend.
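How `max_concurrency` and `handler_timeout` interact can be shown with a small asyncio model. It is a sketch of the general technique (a semaphore plus a per-handler deadline), not the relay's dispatcher; the function `dispatch` and its result labels are hypothetical.

```python
import asyncio


async def dispatch(event, handlers, max_concurrency=4, handler_timeout=10.0):
    """Run every handler for one event; return {name: "ok" | "timeout" | "error"}."""
    limit = asyncio.Semaphore(max_concurrency)  # bounds parallel executions
    results = {}

    async def run(name, handler):
        async with limit:
            try:
                # The deadline covers only the handler's own execution time.
                await asyncio.wait_for(handler(event), timeout=handler_timeout)
                results[name] = "ok"
            except asyncio.TimeoutError:
                results[name] = "timeout"
            except Exception:
                results[name] = "error"

    await asyncio.gather(*(run(name, h) for name, h in handlers.items()))
    return results
```

Because each handler has its own deadline, a slow provider ends up as a `"timeout"` entry while the other handlers still complete.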
- Back-pressure protocol: retryable trouble → HTTP 503 → Tekton retransmits later. Permanent failure → 200 + DLQ. So brief outages self-heal: events queued by Tekton are simply re-delivered.
- Graceful shutdown: SIGTERM → preStop sleep → server drain within `shutdown_timeout_sec` → store closed. Rolling updates don't lose events: anything in flight is either completed or re-sent by Tekton to the next pod.
- Probes: `/healthz` (liveness), `/readyz` (readiness, with per-handler diagnostics; see Observability).
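The back-pressure rule from the first bullet reduces to a small decision function. The sketch below assumes two hypothetical exception types, `RetryableError` and `PermanentError`, and assumes that a successful delivery is also answered with 200; the relay's actual error classification is not shown in this page.

```python
class RetryableError(Exception):
    """Transient trouble, e.g. a provider outage (hypothetical classification)."""


class PermanentError(Exception):
    """A failure that retrying cannot fix, e.g. an expired token (hypothetical)."""


def respond(event, process, dlq):
    """Return the HTTP status to send back to Tekton for one event."""
    try:
        process(event)
    except RetryableError:
        return 503  # ask Tekton to retransmit later
    except PermanentError as exc:
        if dlq is not None:  # DLQ enabled: keep the event for later replay
            dlq.append({"envelope": event, "cause": str(exc)})
        return 200  # ack so Tekton stops retrying
    return 200  # assumed success status
```

Only the 503 path relies on Tekton for recovery; the permanent path relies on the DLQ, which is why an acked failure is lost when the DLQ is disabled.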