sizing guide

Sizing Guide and Capacity Planning

This guide helps you pick the right resource limits, store backend, and replica count for tekton-events-relay based on your event volume.

Quick reference

Setting	<100 events/s	100-1000 events/s	>1000 events/s
Replicas	1	2-3	3-5+
CPU	100m-250m	250m-500m	500m-1000m
Memory	128Mi-256Mi	256Mi-512Mi	512Mi-1Gi
`max_concurrency`	50 (default 100)	100-200	200-500
`handler_timeout`	10s (default)	10s	10-15s
`store.backend`	`memory`	`valkey` or `olric`	`valkey`
`dedupe_size`	10000 (default)	50000	100000
`retry.max_attempts`	4 (default)	4	4-6

How to estimate your event volume

Each Tekton PipelineRun emits multiple CloudEvents: one per state transition (queued, started, running, succeeded/failed/canceled). A typical 5-task pipeline generates roughly 10-15 events. Multiply by your pipeline frequency:

events/s ≈ (pipelines/hour × avg_events_per_pipeline) / 3600

Example: 200 pipelines/hour with 12 events each = ~0.67 events/s. That's the low tier.
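The estimate above is easy to script when you want to try several pipeline frequencies. The following is a minimal sketch: the function names are illustrative, and the tier boundaries simply mirror the quick-reference columns (how exactly 100 and 1000 events/s are classified is an assumption).

```python
def events_per_second(pipelines_per_hour: float, avg_events_per_pipeline: float) -> float:
    """Average CloudEvent rate: (pipelines/hour * events per pipeline) / 3600."""
    return pipelines_per_hour * avg_events_per_pipeline / 3600


def volume_tier(rate: float) -> str:
    """Map a rate onto the quick-reference columns (boundary handling is assumed)."""
    if rate < 100:
        return "low"
    if rate <= 1000:
        return "medium"
    return "high"


if __name__ == "__main__":
    rate = events_per_second(200, 12)
    print(f"{rate:.2f} events/s -> {volume_tier(rate)} tier")
```

The formula yields an hourly average, so feeding it the numbers from your busiest hour gives a safer tier than a daily mean.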

Store backends

The store backend controls deduplication and run-buffer state. Pick based on your replica count and durability needs.

`memory` (default)

Per-pod LRU cache. State is lost on pod restart.
Works fine for single-replica deployments.
No external dependencies.
After a restart, some events may briefly be processed twice because the dedup cache starts empty.

`valkey`

Any RESP-compatible server (Redis, Valkey).
Shared across all replicas. Correct dedup at scale.
Requires an external server or the embedded Valkey subchart (`store.valkey.embedded.enabled: true` in Helm).
Best for 2+ replicas where you already run Redis/Valkey.

`olric`

Embedded distributed cache via memberlist gossip.
Relay pods form the cluster themselves. No extra deployment.
Peers discovered via headless K8s service.
Good middle ground: shared state without external infrastructure.
Set `store.olric.env` to `lan` when all pods share a node pool, or `wan` when they span zones.

Decision matrix

Scenario	Backend	Why
Single replica, dev/staging	`memory`	Zero config, no deps
2-3 replicas, no Redis	`olric`	Self-contained, no extra infra
2+ replicas, Redis available	`valkey`	Battle-tested, simplest ops
>1000 events/s, 3+ replicas	`valkey`	Best throughput under contention
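The matrix can be read as a small decision function. This sketch encodes only the four rows above; the function name and parameters are illustrative, and the fallback for combinations the matrix does not list (for example, 4+ replicas without Redis below 1000 events/s) is an assumption.

```python
def pick_backend(replicas: int, redis_available: bool, events_per_s: float) -> str:
    """Return a store backend name following the decision matrix."""
    if replicas <= 1:
        return "memory"  # single replica: zero config, no deps
    if events_per_s > 1000 and replicas >= 3:
        return "valkey"  # best throughput under contention
    if redis_available:
        return "valkey"  # 2+ replicas with Redis/Valkey already running
    return "olric"  # shared state without extra infrastructure (assumed fallback)


if __name__ == "__main__":
    print(pick_backend(replicas=3, redis_available=False, events_per_s=50))
```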

`max_concurrency` and `errgroup.SetLimit`

The dispatcher fans out to all matched handlers concurrently using `errgroup.Group`. `max_concurrency` sets how many handlers run in parallel for a single event; the dispatcher applies it with `g.SetLimit(d.maxConcurrency)`.

How it works

When an event arrives, the dispatcher:

Filters handlers by provider name and type.
Starts one goroutine per matched handler, with at most `max_concurrency` running at once.
Gives each handler its own timeout context (`handler_timeout`).
Collects errors without stopping the other handlers.

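For example, with 15 registered handlers (8 GitHub actions, 3 Slack notifiers, 2 Grafana, 2 webhook) that all match an event and `max_concurrency` set to 10, the first 10 handlers start immediately. The remaining 5 wait: once a limit is set, errgroup's `Go` blocks until a running handler finishes and frees a slot. At the default of 100, all 15 run at once.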
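To make the fan-out concrete, here is a small Python analogue of the behavior described above. It is a sketch, not the relay's Go code: `asyncio.Semaphore` stands in for `errgroup.SetLimit`, `asyncio.wait_for` stands in for the per-handler timeout context, and the `dispatch` and `demo` names and the handler set are hypothetical. One difference from errgroup is that asyncio creates all tasks up front and parks them on the semaphore, whereas errgroup's `Go` blocks the caller.

```python
import asyncio


async def dispatch(handlers, event, max_concurrency=100, handler_timeout=10.0):
    """Run every handler for one event, at most max_concurrency at a time.

    Returns (handler_name, exception) pairs for handlers that failed or timed out.
    """
    slots = asyncio.Semaphore(max_concurrency)
    errors = []

    async def run(name, handler):
        async with slots:  # wait for a free slot
            try:
                # each handler gets its own deadline
                await asyncio.wait_for(handler(event), timeout=handler_timeout)
            except Exception as exc:  # collect the error, keep siblings running
                errors.append((name, exc))

    await asyncio.gather(*(run(name, h) for name, h in handlers.items()))
    return errors


async def demo():
    running = 0
    peak = 0

    async def ok(event):
        nonlocal running, peak
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0.01)
        running -= 1

    async def slow(event):
        await asyncio.sleep(1)  # exceeds the 0.05s deadline used below

    async def broken(event):
        raise RuntimeError("provider returned 500")

    handlers = {f"ok-{i}": ok for i in range(13)}
    handlers["slow"] = slow
    handlers["broken"] = broken
    errors = await dispatch(handlers, {"id": "evt-1"}, max_concurrency=4, handler_timeout=0.05)
    return peak, sorted(name for name, _ in errors)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the demo, 15 handlers share 4 slots. The slow handler is cut off at its deadline and the broken one raises, yet both are only recorded while the other 13 complete.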

Tuning guidance

Registered handlers	Recommended `max_concurrency`	Notes
1-5	10-25	Low contention, default is fine
5-15	25-50	Typical production setup
15-30	50-100	Many providers/actions
30+	100-200	Large multi-team deployments

Rule of thumb: Set max_concurrency to roughly 2-3x your handler count. This keeps the pipeline responsive while avoiding goroutine explosion under burst traffic.

Why not the default 100?

The default of 100 works for most setups. Only change it if:

You see handler_timeout errors clustering on slow providers (lower it to protect faster ones).
You have 30+ handlers and events queue up (raise it).
You run on resource-constrained nodes (lower it to reduce memory pressure).

`handler_timeout`

Each handler runs with a per-event deadline. If a handler exceeds this, it's canceled and counted as a timeout (handler_timeouts metric).

Setting	Use case
`10s` (default)	Most setups. Covers normal API latency + retry backoff.
`15-30s`	Providers with known slow APIs (GitLab self-managed, large repos).
`5s`	High-throughput setups where you'd rather skip a slow handler than block.

Don't set this below 5s. SCM API calls often take 1-3s even when healthy, and each retry waits for a backoff that starts at 250ms.

`dedupe_size`

The in-memory deduper uses an LRU cache to track CloudEvent IDs. The size controls how many unique events are remembered before old entries are evicted.

Event rate	Dedup window with the default 10000 entries (at the top of the rate range)	Recommended `dedupe_size`
<10/s	~17 min	10000 (default)
10-100/s	~2 min	50000
100-1000/s	~10s	100000

For valkey and olric backends, dedupe_size controls the in-memory LRU only; remote entries use the store's TTL instead. You can keep the default.
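The windows in the table follow from dividing the cache size by the event rate. The sketch below reproduces that arithmetic, assuming every event carries a unique ID and occupies exactly one LRU slot; the function names are illustrative.

```python
import math


def dedupe_window_seconds(dedupe_size: int, events_per_s: float) -> float:
    """How long an event ID stays in the LRU before it is evicted."""
    return dedupe_size / events_per_s


def dedupe_size_for(window_seconds: float, events_per_s: float) -> int:
    """Smallest cache size that remembers IDs for at least window_seconds."""
    return math.ceil(window_seconds * events_per_s)


if __name__ == "__main__":
    for rate in (10, 100, 1000):
        print(f"{rate}/s -> {dedupe_window_seconds(10000, rate):.0f}s window at 10000 entries")
```

Working backwards is often more useful: decide how long a redelivered event should still be recognized, then call `dedupe_size_for` with your peak rate.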

Replica sizing

Single replica

Use memory store.
Simplest deployment. Good for dev, staging, or low-traffic production.
Dedup state is lost on restart (brief window of possible duplicate processing).

2-3 replicas

Use valkey or olric store.
Set `replicaCount` to 2 or 3 in Helm values.
The Helm chart rejects the `memory` store with more than one replica unless you set `unsafe.allowMemoryStoreWithMultipleReplicas: true`.
With olric, ensure the headless service is configured for peer discovery.

3+ replicas

Use valkey for best throughput.
Monitor deduper_hits and store_errors metrics to catch contention.
Each replica opens connections to the store. Size the store accordingly.

Resource requests and limits

These are starting points. Adjust based on your actual event volume and handler count.

Low volume (<100 events/s)

resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 250m
    memory: 256Mi

Medium volume (100-1000 events/s)

resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

High volume (>1000 events/s)

resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi

`retry` configuration

The retry policy applies to all outbound HTTP calls (SCM APIs, webhook notifiers). Defaults work for most setups.

Parameter	Default	When to change
`max_attempts`	4	Raise to 6 if providers are flaky. Lower to 2 for fire-and-forget notifiers.
`initial_backoff`	250ms	Rarely needs tuning.
`max_backoff`	30s	Lower to 10s if you need faster failure detection.

The retry policy honors Retry-After headers and applies exponential backoff with jitter.
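To see how the retry parameters interact with `handler_timeout`, it helps to write out the backoff schedule. This sketch assumes the backoff doubles on each attempt (the guide only says "exponential") and ignores jitter and Retry-After headers; the function names and the `multiplier` parameter are illustrative.

```python
def backoff_schedule(max_attempts=4, initial_backoff=0.25, max_backoff=30.0, multiplier=2.0):
    """Waits in seconds before each retry, capped at max_backoff (no jitter)."""
    waits = []
    delay = initial_backoff
    for _ in range(max_attempts - 1):  # no wait after the final attempt
        waits.append(min(delay, max_backoff))
        delay *= multiplier  # assumed growth factor
    return waits


def worst_case_seconds(call_latency, waits):
    """Total time if every attempt takes call_latency and all retries are used."""
    return call_latency * (len(waits) + 1) + sum(waits)


if __name__ == "__main__":
    waits = backoff_schedule()
    print(waits, worst_case_seconds(3.0, waits))
```

Under these assumptions the default policy waits 0.25s, 0.5s, and 1s between its four attempts. If each attempt takes 3s, the total is 13.75s, so a 10s `handler_timeout` would likely fire before the last retry completes.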

Monitoring your sizing

After deployment, check these metrics to validate your sizing:

Metric	What to watch
`handler_timeouts`	If non-zero, raise `handler_timeout` or investigate the slow handler.
`handler_duration` (p99)	Should stay well below `handler_timeout`.
`deduper_hits`	A high rate means events are being redelivered. This is normal, but watch for spikes.
`store_errors`	Any non-zero count means the store backend is failing. The deduper fails open, so events still flow, but duplicates may get through.
`events_backpressure`	Non-zero means the relay returned 503 to Tekton. Events will be retried.
`pipeline_errors`	Per-step errors. Check which chain step is failing.

Access metrics at GET /metrics (default port 9090).
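A quick way to check the table above is to scrape the endpoint and flag the counters that should stay at zero. The sketch below assumes the endpoint serves the Prometheus text format and that metric names appear exactly as listed in this guide (real names may carry a prefix or a `_total` suffix). The `localhost` address, the function names, and the `provider` label used in the test data are hypothetical.

```python
from urllib.request import urlopen

# Metrics from the table above where any non-zero value deserves a look.
WATCHED = ("handler_timeouts", "store_errors", "events_backpressure", "pipeline_errors")


def fetch_metrics(url="http://localhost:9090/metrics"):
    """Download the raw metrics text (host is an assumption, e.g. a port-forward)."""
    with urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text):
    """Sum sample values per metric name across all label sets."""
    totals = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        if "{" in line:
            name = line[: line.index("{")]
            fields = line[line.rindex("}") + 1:].split()
        else:
            name, *fields = line.split()
        if not fields:
            continue
        try:
            value = float(fields[0])
        except ValueError:
            continue  # ignore lines that are not valid samples
        totals[name] = totals.get(name, 0.0) + value
    return totals


def nonzero_watched(text, watched=WATCHED):
    """Return the watched metrics whose summed value is above zero."""
    totals = parse_metrics(text)
    return {name: totals[name] for name in watched if totals.get(name, 0) > 0}


if __name__ == "__main__":
    print(nonzero_watched(fetch_metrics()))
```

`deduper_hits` is left out of the watch list on purpose, since the table treats a non-zero value there as normal.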

Example configurations

Minimal single-replica (dev/staging)

replicaCount: 1
config:
  max_concurrency: 50
  handler_timeout: 10s
  store:
    backend: memory
  dedupe_size: 10000

Production with 3 replicas and Valkey

replicaCount: 3
config:
  max_concurrency: 100
  handler_timeout: 10s
  store:
    backend: valkey
    valkey:
      address: "valkey:6379"
      password_file: /etc/secrets/valkey/password
      key_prefix: "ter:"
  dedupe_size: 50000
  retry:
    max_attempts: 4
    initial_backoff: 250ms
    max_backoff: 30s

High-throughput with Olric

replicaCount: 5
config:
  max_concurrency: 200
  handler_timeout: 15s
  store:
    backend: olric
    olric:
      bind_addr: "0.0.0.0"
      bind_port: 12320
      memberlist_port: 12321
      env: lan
  dedupe_size: 100000
