Troubleshooting

Fábio Luciano edited this page Jun 16, 2026 · 3 revisions

Work top-down: is the event arriving → is it passing the chain → is the handler firing → is the provider accepting it? Three tools answer almost everything: pod logs, /readyz, and the metrics.

kubectl logs -n tekton-events-relay deploy/tekton-events-relay -f
kubectl exec -n tekton-events-relay deploy/tekton-events-relay -- wget -qO- localhost:8080/readyz

Nothing happens at all

| Check | How | Fix |
| --- | --- | --- |
| Tekton is sending events | events_received_total increasing? Any cloudevent_request_started logs? | Set default-cloud-events-sink in the config-defaults ConfigMap (tekton-pipelines ns) to the relay Service URL; check NetworkPolicies between namespaces. |
| Events arrive but are dropped early | Log no decoder registered for event type + events_unsupported_type_total | Unknown CloudEvent type (often after a Tekton upgrade); open an issue with the type. |
| Resource type filtered | events_filtered_total | Enable the type under filter: (allow_taskrun: true, …). |
| Missing annotations | Log missing annotation tekton.dev/tekton-events-relay.scm.provider | Annotate the PipelineRun in your TriggerTemplate. |
| Wrong provider name | Dispatcher log no handlers processed event | scm.provider must equal a configured instance name exactly. |
| Handler exists but when never matches | Enable logging.level: debug | Test the CEL expression; remember states are lowercase. |
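
The sink and annotation checks in the table above can be wrapped in two small shell helpers. This is a sketch, not part of the relay: the function names are made up for illustration, and it assumes kubectl is already pointed at the right cluster. The ConfigMap name, key, and namespace are the ones given in the table.

```shell
# Print the CloudEvents sink Tekton is configured with.
# Empty output means the key is unset, so Tekton sends no events.
tekton_sink() {
  kubectl get configmap config-defaults -n tekton-pipelines \
    -o jsonpath='{.data.default-cloud-events-sink}'
}

# Print all annotations of one PipelineRun so you can confirm the
# tekton.dev/tekton-events-relay.scm.provider annotation is present.
# Arguments: <namespace> <pipelinerun-name>
pipelinerun_annotations() {
  kubectl get pipelinerun "$2" -n "$1" -o jsonpath='{.metadata.annotations}'
}
```

Compare the output of tekton_sink with the relay Service URL, and the scm.provider value with the configured instance name; both must match exactly.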

Provider rejects the call

/readyz shows the last error per handler — start there.

| Symptom | Cause | Fix |
| --- | --- | --- |
| 401 Bad credentials / 403 | expired or under-scoped token | Rotate the Secret (hot reload picks it up; the webhook/grafana/sentry/jira notifiers re-read the mounted secret per request, so the new value applies immediately). For webhook/jira you can instead use OAuth2 client credentials (auth.oauth2 + token_url) so the relay auto-refreshes the token before expiry. Check scopes on the provider page. |
| 404 on status/comment | wrong owner/name/project annotations, or token can't see the repo | Compare annotations against the repo URL; remember GitLab prefers repo-id. |
| 422 / validation error | field limits (context/description/label length) | Shorten the template; limits are provider-specific. |
| Frequent 429 | provider rate limiting | Watch notifier_rate_limit_hits_total{host}; the retry policy honors Retry-After. If sustained, reduce event volume per action with when/filters, or use a GitHub App (higher limits). |
| Self-signed TLS errors | private CA | Mount the CA and configure the client; avoid insecure_skip_verify. |

When the DLQ is enabled, permanent failures are preserved there. After fixing credentials, replay them with POST /api/v1/dlq/replay.
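
One possible way to issue the replay from a workstation is a temporary port-forward plus curl. This is a sketch under assumptions: it assumes the replay endpoint is served on port 8080 (the port this page uses for /readyz) and needs no extra authentication, neither of which is confirmed here. The function name replay_dlq is made up for illustration.

```shell
# Forward the relay port locally, POST the replay request, then clean up.
# ASSUMPTION: /api/v1/dlq/replay is served on 8080, like /readyz.
replay_dlq() {
  kubectl port-forward -n tekton-events-relay deploy/tekton-events-relay \
    8080:8080 >/dev/null &
  pf_pid=$!
  sleep 2   # give the port-forward a moment to establish
  curl -fsS -X POST http://localhost:8080/api/v1/dlq/replay
  rc=$?
  kill "$pf_pid" 2>/dev/null
  wait "$pf_pid" 2>/dev/null
  return "$rc"
}
```

The function returns curl's exit status, so a non-zero result means the request failed or the server answered with an HTTP error.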

Duplicate or missing notifications

| Symptom | Cause | Fix |
| --- | --- | --- |
| Duplicate comments, replicaCount > 1 | per-pod memory store: retransmissions land on another replica | Use a shared store (valkey/olric), or 1 replica. mode: upsert also neutralizes duplicates for comments. |
| Duplicates after pod restart | memory store lost | Same as above. |
| store_errors_total rising + occasional duplicates | store backend down; relay fails open | Fix Valkey/Olric connectivity; events were delivered, only dedup degraded. |
| deduper_evictions_total climbing | cache smaller than event volume | Raise dedupe_size (memory) / rely on TTL-based remote backends. |
| Events silently missing under load | back-pressure | events_backpressure_total: these return 503 and Tekton retransmits; check what's slow via notifier_latency_seconds. |
| One slow provider delays everything | no | It can't: handler_timeout (default 10s) bounds each handler; see handler_timeouts_total. |
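
To look at the counters from the table above in one pass, a small filter over the metrics text is enough. This is a sketch: the function name is made up for illustration, and it assumes the metrics are exposed in the Prometheus text format. It matches the metric names anywhere in a sample line, so it still works if the real names carry a prefix.

```shell
# Read Prometheus-style metrics text on stdin and keep only the sample
# lines for the dedup and back-pressure counters named in the table.
relay_health_counters() {
  grep -v '^#' |
    grep -E 'store_errors_total|deduper_evictions_total|events_backpressure_total|handler_timeouts_total'
}

# Example use. ASSUMPTION: metrics are served at /metrics on port 8080;
# this page does not state the path or port.
#   kubectl exec -n tekton-events-relay deploy/tekton-events-relay -- \
#     wget -qO- localhost:8080/metrics | relay_health_counters
```

Run it twice a minute or so apart; what matters is whether a counter is rising, not its absolute value.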

Config & deploy issues

| Symptom | Fix |
| --- | --- |
| Pod CrashLoops at start | Run tekton-events-relay --validate --config … against the rendered ConfigMap; the error message names the bad key. helm install already schema-validates values. |
| Edited the ConfigMap, nothing changed | Hot reload only applies valid configs; check config_reloads_total{result="failure"} and the config reload: log line. server/store/dlq/logging/tracing changes need a restart. |
| verbose options require logging.level to be 'debug' | Exactly that: set the level or drop the verbose flags. |
| Olric pods don't form a cluster | Pod-to-pod 3320/tcp + 3322/tcp+udp must be open (the chart's NetworkPolicy handles it when backend: olric; check other policies/CNI). |

Debugging one event end-to-end

  1. logging.level: debug (+ verbose.payloads: true if needed — secrets are redacted).
  2. Trigger the run; grep the logs for the CloudEvent ID (ce_id).
  3. You'll see: received → decoder → each chain step → per-handler success/failure with the provider's response.
  4. With tracing.endpoint set, the same journey is one trace with per-handler spans.
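
Step 2 can be scripted as a one-line helper. This is a sketch: the function name trace_event is made up for illustration, and the one-hour window is an arbitrary choice you may want to narrow.

```shell
# Print every relay log line from the last hour that mentions one
# CloudEvent ID. Argument: <ce_id>
trace_event() {
  kubectl logs -n tekton-events-relay deploy/tekton-events-relay --since=1h |
    grep -F -- "$1"
}
```

grep -F treats the ID as a literal string, so IDs containing dots or other regex characters match as written. With more than one replica, kubectl logs deploy/… reads a single pod, so repeat the search per pod if the event is not found.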

Still stuck? Open an issue with the relay version, the ce_id log excerpt and your (redacted) config.