
sizing guide

Fábio Luciano edited this page Jun 29, 2026 · 1 revision

Sizing Guide and Capacity Planning

This guide helps you pick the right resource limits, store backend, and replica count for tekton-events-relay based on your event volume.

Quick reference

| Setting | <100 events/s | 100-1000 events/s | >1000 events/s |
|---|---|---|---|
| Replicas | 1 | 2-3 | 3-5+ |
| CPU | 100m-250m | 250m-500m | 500m-1000m |
| Memory | 128Mi-256Mi | 256Mi-512Mi | 512Mi-1Gi |
| max_concurrency | 50 (default: 100) | 100-200 | 200-500 |
| handler_timeout | 10s (default) | 10s | 10-15s |
| store.backend | memory | valkey or olric | valkey |
| dedupe_size | 10000 (default) | 50000 | 100000 |
| retry.max_attempts | 4 (default) | 4 | 4-6 |

How to estimate your event volume

Each Tekton PipelineRun emits multiple CloudEvents: one per state transition (queued, started, running, succeeded/failed/canceled). A typical 5-task pipeline generates roughly 10-15 events. Multiply by your pipeline frequency:

events/s ≈ (pipelines/hour × avg_events_per_pipeline) / 3600

Example: 200 pipelines/hour with 12 events each = ~0.67 events/s. That's the low tier.
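
The estimate above can be sketched as a small helper. This is an illustrative Python snippet, not part of tekton-events-relay; the function names are hypothetical, and the tier boundaries are taken from the quick reference table. Note that the formula gives an average rate, so pipelines triggered in bursts will peak higher than the result.

```python
def estimate_events_per_second(pipelines_per_hour: float,
                               avg_events_per_pipeline: float) -> float:
    """Average CloudEvents per second, using the formula from this guide."""
    return pipelines_per_hour * avg_events_per_pipeline / 3600


def volume_tier(events_per_second: float) -> str:
    """Map a rate to the three columns of the quick reference table."""
    if events_per_second < 100:
        return "low"
    if events_per_second <= 1000:
        return "medium"
    return "high"


# The worked example from the text: 200 pipelines/hour, 12 events each.
rate = estimate_events_per_second(200, 12)
print(f"{rate:.2f} events/s -> {volume_tier(rate)} tier")  # 0.67 events/s -> low tier
```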

Store backends

The store backend controls deduplication and run-buffer state. Pick based on your replica count and durability needs.

memory (default)

  • Per-pod LRU cache. State is lost on pod restart.
  • Works fine for single-replica deployments.
  • No external dependencies.
  • After a restart, events may briefly be processed twice until the dedup cache is repopulated.

valkey

  • Any RESP-compatible server (Redis, Valkey).
  • Shared across all replicas. Correct dedup at scale.
  • Requires an external server or the embedded Valkey subchart (store.valkey.embedded.enabled: true in Helm).
  • Best for 2+ replicas where you already run Redis/Valkey.

olric

  • Embedded distributed cache via memberlist gossip.
  • Relay pods form the cluster themselves. No extra deployment.
  • Peers discovered via headless K8s service.
  • Good middle ground: shared state without external infrastructure.
  • Set store.olric.env to lan when all relay pods share a node pool, or to wan when they span zones.

Decision matrix

| Scenario | Backend | Why |
|---|---|---|
| Single replica, dev/staging | memory | Zero config, no deps |
| 2-3 replicas, no Redis | olric | Self-contained, no extra infra |
| 2+ replicas, Redis available | valkey | Battle-tested, simplest ops |
| >1000 events/s, 3+ replicas | valkey | Best throughput under contention |
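
As a rough aid, the matrix can be written as a function. This is a hypothetical Python helper, not something the relay or its Helm chart provides. Combinations the matrix does not list (for example, more than 3 replicas at low volume without Redis) fall through to olric here, which is an assumption rather than a recommendation from this guide.

```python
def pick_backend(replicas: int, redis_available: bool,
                 events_per_second: float) -> str:
    """Return a store.backend value following the decision matrix."""
    if replicas <= 1:
        return "memory"   # single replica: zero config, no deps
    if events_per_second > 1000 and replicas >= 3:
        return "valkey"   # best throughput under contention
    if redis_available:
        return "valkey"   # reuse the Redis/Valkey you already run
    return "olric"        # shared state without external infrastructure


print(pick_backend(1, False, 5))     # memory
print(pick_backend(3, False, 200))   # olric
print(pick_backend(3, True, 200))    # valkey
print(pick_backend(5, False, 1500))  # valkey
```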

max_concurrency and errgroup.SetLimit

The dispatcher fans out to all matched handlers concurrently using errgroup.Group. max_concurrency controls how many handlers execute in parallel for a single event via g.SetLimit(d.maxConcurrency).

How it works

When an event arrives, the dispatcher:

  1. Filters handlers by provider name and type.
  2. Spawns one goroutine per matched handler, up to max_concurrency.
  3. Each handler gets its own timeout context (handler_timeout).
  4. Errors are collected but don't stop other handlers.

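If 15 registered handlers match an event (e.g., 8 GitHub actions + 3 Slack notifiers + 2 Grafana + 2 webhook), all 15 are scheduled, but at most max_concurrency of them run at a time. Once the limit is reached, g.Go blocks until a running handler finishes, so the remaining handlers wait for a free slot.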
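
The behavior described above (bounded fan-out, a per-handler deadline, and errors collected without stopping other handlers) can be illustrated with a small model. The relay itself is written in Go and uses errgroup; the sketch below is a Python analogy that swaps in asyncio.Semaphore for g.SetLimit and asyncio.wait_for for the timeout context. All names in it (dispatch, ok, slow, broken) are hypothetical, and the filtering step is left out.

```python
import asyncio


async def dispatch(handlers, event, max_concurrency=100, handler_timeout=10.0):
    """Run every handler for one event, at most max_concurrency at a time.

    Returns (handler_name, error) pairs for handlers that failed or timed out.
    """
    slots = asyncio.Semaphore(max_concurrency)
    errors = []

    async def run(name, handler):
        async with slots:  # wait here until a slot is free
            try:
                # Each handler gets its own deadline.
                await asyncio.wait_for(handler(event), timeout=handler_timeout)
            except asyncio.TimeoutError:
                errors.append((name, "timeout"))
            except Exception as exc:  # collect the error, keep the others running
                errors.append((name, f"{type(exc).__name__}: {exc}"))

    await asyncio.gather(*(run(name, h) for name, h in handlers.items()))
    return errors


async def ok(event):
    await asyncio.sleep(0.01)


async def slow(event):
    await asyncio.sleep(1)  # exceeds the 0.1s deadline used below


async def broken(event):
    raise RuntimeError("boom")


handlers = {"ok": ok, "slow": slow, "broken": broken}
errors = asyncio.run(
    dispatch(handlers, {"id": "evt-1"}, max_concurrency=2, handler_timeout=0.1)
)
print(sorted(errors))
```

With max_concurrency=2, the third handler only starts after one of the first two finishes, and the slow handler's timeout does not prevent the others from completing.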

Tuning guidance

| Registered handlers | Recommended max_concurrency | Notes |
|---|---|---|
| 1-5 | 10-25 | Low contention, default is fine |
| 5-15 | 25-50 | Typical production setup |
| 15-30 | 50-100 | Many providers/actions |
| 30+ | 100-200 | Large multi-team deployments |

Rule of thumb: Set max_concurrency to roughly 2-3x your handler count. This keeps the pipeline responsive while avoiding goroutine explosion under burst traffic.

Why not the default 100?

The default of 100 works for most setups. Only change it if:

  • You see handler_timeout errors clustering on slow providers (lower it to protect faster ones).
  • You have 30+ handlers and events queue up (raise it).
  • You run on resource-constrained nodes (lower it to reduce memory pressure).

handler_timeout

Each handler runs with a per-event deadline. If a handler exceeds this, it's canceled and counted as a timeout (handler_timeouts metric).

| Setting | Use case |
|---|---|
| 10s (default) | Most setups. Covers normal API latency + retry backoff. |
| 15-30s | Providers with known slow APIs (GitLab self-managed, large repos). |
| 5s | High-throughput setups where you'd rather skip a slow handler than block. |

Don't set this below 5s. SCM API calls often take 1-3s even when healthy, and the retry policy adds up to 250ms initial backoff.

dedupe_size

The in-memory deduper uses an LRU cache to track CloudEvent IDs. The size controls how many unique events are remembered before old entries are evicted.

| Event rate | Window with dedupe_size 10000 (at the top of the rate range) | Recommended dedupe_size |
|---|---|---|
| <10/s | ~17 min | 10000 (default) |
| 10-100/s | ~2 min | 50000 |
| 100-1000/s | ~10s | 100000 |

For valkey and olric backends, dedupe_size controls the in-memory LRU only; remote entries use the store's TTL instead. You can keep the default.
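
The windows in the table above follow from dividing the cache size by the event rate. The Python sketch below reproduces that arithmetic. It assumes every event carries a unique ID and arrives at a steady rate, and it ignores any TTL; the function name is hypothetical.

```python
def dedupe_window_seconds(dedupe_size: int, events_per_second: float) -> float:
    """Seconds of history an LRU of dedupe_size entries holds before evicting."""
    if events_per_second <= 0:
        raise ValueError("events_per_second must be positive")
    return dedupe_size / events_per_second


# Default size at the top of each rate range, then the recommended sizes.
for rate, size in [(10, 10_000), (100, 10_000), (1000, 10_000),
                   (100, 50_000), (1000, 100_000)]:
    window = dedupe_window_seconds(size, rate)
    print(f"{rate:>5} events/s, dedupe_size={size:>6}: {window:6.0f} s")
```

At 1000 events/s the default size remembers only about 10 seconds of events, while 100000 entries stretches that to about 100 seconds.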

Replica sizing

Single replica

  • Use memory store.
  • Simplest deployment. Good for dev, staging, or low-traffic production.
  • Dedup state is lost on restart (brief window of possible duplicate processing).

2-3 replicas

  • Use valkey or olric store.
  • Set replicaCount to 2 or 3 in Helm values (for example, replicaCount: 3).
  • The Helm chart refuses multi-replica + memory store unless you set unsafe.allowMemoryStoreWithMultipleReplicas: true.
  • With olric, ensure the headless service is configured for peer discovery.

3+ replicas

  • Use valkey for best throughput.
  • Monitor deduper_hits and store_errors metrics to catch contention.
  • Each replica opens connections to the store. Size the store accordingly.

Resource requests and limits

These are starting points. Adjust based on your actual event volume and handler count.

Low volume (<100 events/s)

resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 250m
    memory: 256Mi

Medium volume (100-1000 events/s)

resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

High volume (>1000 events/s)

resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi

retry configuration

The retry policy applies to all outbound HTTP calls (SCM APIs, webhook notifiers). Defaults work for most setups.

| Parameter | Default | When to change |
|---|---|---|
| max_attempts | 4 | Raise to 6 if providers are flaky. Lower to 2 for fire-and-forget notifiers. |
| initial_backoff | 250ms | Rarely needs tuning. |
| max_backoff | 30s | Lower to 10s if you need faster failure detection. |

The retry policy honors Retry-After headers and applies exponential backoff with jitter.
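
To see how much waiting the retry settings can add, the sketch below computes the backoff ceiling before each retry. It is an illustration in Python, not the relay's implementation: the doubling multiplier and the "full jitter" strategy are assumptions (this guide only states exponential backoff with jitter), Retry-After is ignored, and the function names are hypothetical.

```python
import random


def backoff_schedule(max_attempts=4, initial_backoff=0.25,
                     max_backoff=30.0, multiplier=2.0):
    """Ceiling, in seconds, of the wait before each retry (before jitter)."""
    waits = []
    delay = initial_backoff
    for _ in range(max_attempts - 1):  # no wait after the final attempt
        waits.append(min(delay, max_backoff))
        delay *= multiplier            # assumed growth factor
    return waits


def with_full_jitter(wait, rng=random):
    """One possible jitter strategy: pick uniformly between 0 and the ceiling."""
    return rng.uniform(0, wait)


defaults = backoff_schedule()
print(defaults, sum(defaults))             # [0.25, 0.5, 1.0] 1.75
print(sum(backoff_schedule(max_attempts=6)))  # 7.75
```

Under these assumptions the defaults add at most 1.75s of waiting, while max_attempts: 6 can add up to 7.75s. If retries run inside the handler deadline, that is worth weighing against handler_timeout before raising max_attempts.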

Monitoring your sizing

After deployment, check these metrics to validate your sizing:

| Metric | What to watch |
|---|---|
| handler_timeouts | If non-zero, raise handler_timeout or investigate the slow handler. |
| handler_duration (p99) | Should stay well below handler_timeout. |
| deduper_hits | High rate means events are being retried. Normal, but watch for spikes. |
| store_errors | Any non-zero count means the store backend is failing. Deduper fails open, but accuracy drops. |
| events_backpressure | Non-zero means the relay returned 503 to Tekton. Events will be retried. |
| pipeline_errors | Per-step errors. Check which chain step is failing. |

Access metrics at GET /metrics (default port 9090).
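
A quick way to check the "non-zero means trouble" metrics is to sum each one across its label sets. The Python sketch below assumes the endpoint serves the Prometheus text exposition format without timestamps, and that metrics are exported under the bare names used in the table. Both are assumptions; the real names may carry a prefix, and the provider labels in the sample are made up.

```python
# Hypothetical sample of what GET /metrics might return.
SAMPLE = """\
# TYPE handler_timeouts counter
handler_timeouts{provider="github"} 3
handler_timeouts{provider="slack"} 0
store_errors 0
events_backpressure 2
"""

WATCH = ("handler_timeouts", "store_errors", "events_backpressure")


def sum_metrics(text):
    """Sum sample values per metric name, ignoring labels and comment lines."""
    totals = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")  # value is the last token
        name = series.split("{", 1)[0]           # drop the label set, if any
        totals[name] = totals.get(name, 0.0) + float(value)
    return totals


def sizing_warnings(totals):
    """Names of watched metrics whose total is above zero."""
    return [name for name in WATCH if totals.get(name, 0) > 0]


print(sizing_warnings(sum_metrics(SAMPLE)))  # ['handler_timeouts', 'events_backpressure']
```

To run it against a live pod, replace SAMPLE with the body of an HTTP GET to the metrics endpoint.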

Example configurations

Minimal single-replica (dev/staging)

replicaCount: 1
config:
  max_concurrency: 50
  handler_timeout: 10s
  store:
    backend: memory
  dedupe_size: 10000

Production with 3 replicas and Valkey

replicaCount: 3
config:
  max_concurrency: 100
  handler_timeout: 10s
  store:
    backend: valkey
    valkey:
      address: "valkey:6379"
      password_file: /etc/secrets/valkey/password
      key_prefix: "ter:"
  dedupe_size: 50000
  retry:
    max_attempts: 4
    initial_backoff: 250ms
    max_backoff: 30s

High-throughput with Olric

replicaCount: 5
config:
  max_concurrency: 200
  handler_timeout: 15s
  store:
    backend: olric
    olric:
      bind_addr: "0.0.0.0"
      bind_port: 12320
      memberlist_port: 12321
      env: lan
  dedupe_size: 100000
