sizing guide
This guide helps you pick the right resource limits, store backend, and replica count for tekton-events-relay based on your event volume.
| Setting | <100 events/s | 100-1000 events/s | >1000 events/s |
|---|---|---|---|
| Replicas | 1 | 2-3 | 3-5+ |
| CPU | 100m-250m | 250m-500m | 500m-1000m |
| Memory | 128Mi-256Mi | 256Mi-512Mi | 512Mi-1Gi |
| `max_concurrency` | 50 (default 100) | 100-200 | 200-500 |
| `handler_timeout` | 10s (default) | 10s | 10-15s |
| `store.backend` | `memory` | `valkey` or `olric` | `valkey` |
| `dedupe_size` | 10000 (default) | 50000 | 100000 |
| `retry.max_attempts` | 4 (default) | 4 | 4-6 |
Each Tekton PipelineRun emits multiple CloudEvents: one per state transition (queued, started, running, succeeded/failed/canceled). A typical 5-task pipeline generates roughly 10-15 events. Multiply by your pipeline frequency:
events/s ≈ (pipelines/hour × avg_events_per_pipeline) / 3600
Example: 200 pipelines/hour with 12 events each = ~0.67 events/s. That's the low tier.
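The estimate above can be sketched as a small helper. This is an illustrative calculation only, not part of tekton-events-relay: the function names are hypothetical, and the tier boundaries are taken from the columns of the sizing table.

```python
def events_per_second(pipelines_per_hour: float, avg_events_per_pipeline: float) -> float:
    """Average CloudEvent rate implied by pipeline frequency."""
    return pipelines_per_hour * avg_events_per_pipeline / 3600


def sizing_tier(rate: float) -> str:
    """Map an event rate to the sizing-table column it falls in."""
    if rate < 100:
        return "<100 events/s"
    if rate <= 1000:
        return "100-1000 events/s"
    return ">1000 events/s"


rate = events_per_second(pipelines_per_hour=200, avg_events_per_pipeline=12)
print(f"{rate:.2f} events/s -> {sizing_tier(rate)}")
```

The result is an hourly average. If your pipelines run in bursts, it may be safer to feed in the busiest hour rather than a daily mean.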
The store backend controls deduplication and run-buffer state. Pick based on your replica count and durability needs.
`memory`:
- Per-pod LRU cache. State is lost on pod restart.
- Works fine for single-replica deployments.
- No external dependencies.
- Deduplication may briefly double-process events after a restart.

`valkey`:
- Any RESP-compatible server (Redis, Valkey).
- Shared across all replicas. Correct dedup at scale.
- Requires an external server or the embedded Valkey subchart (`store.valkey.embedded.enabled: true` in Helm).
- Best for 2+ replicas where you already run Redis/Valkey.

`olric`:
- Embedded distributed cache via memberlist gossip.
- Relay pods form the cluster themselves. No extra deployment.
- Peers discovered via headless K8s service.
- Good middle ground: shared state without external infrastructure.
- Set `store.olric.env` to `lan` for same-nodepool, `wan` for cross-zone.

| Scenario | Backend | Why |
|---|---|---|
| Single replica, dev/staging | `memory` | Zero config, no deps |
| 2-3 replicas, no Redis | `olric` | Self-contained, no extra infra |
| 2+ replicas, Redis available | `valkey` | Battle-tested, simplest ops |
| >1000 events/s, 3+ replicas | `valkey` | Best throughput under contention |
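
The scenario table can be read as a short decision procedure. The sketch below is one possible encoding of it; `choose_backend` is a hypothetical helper, not something the relay or its Helm chart provides, and the order of the checks is an assumption about how overlapping rows should be resolved.

```python
def choose_backend(replicas: int, redis_available: bool, events_per_s: float) -> str:
    """Pick a store backend following the scenario table (hypothetical helper)."""
    if replicas <= 1:
        return "memory"  # single replica: zero config, no deps
    if events_per_s > 1000 and replicas >= 3:
        return "valkey"  # best throughput under contention
    if redis_available:
        return "valkey"  # reuse the Redis/Valkey you already run
    return "olric"       # shared state without extra infrastructure


print(choose_backend(replicas=3, redis_available=False, events_per_s=200))
```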
The dispatcher fans out to all matched handlers concurrently using errgroup.Group. max_concurrency controls how many handlers execute in parallel for a single event via g.SetLimit(d.maxConcurrency).
When an event arrives, the dispatcher:
- Filters handlers by provider name and type.
- Spawns one goroutine per matched handler, up to `max_concurrency` at a time.
- Gives each handler its own timeout context (`handler_timeout`).
- Collects errors without stopping the other handlers.
If you have 15 registered handlers (e.g., 8 GitHub actions + 3 Slack notifiers + 2 Grafana + 2 webhook), all 15 are scheduled for the event, but at most max_concurrency of them run at a time. The rest wait for a free slot.
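
To make the fan-out behavior concrete, here is a minimal Python analogue. The relay itself is described as using Go's errgroup with `SetLimit`; this sketch swaps in `asyncio.Semaphore` for the concurrency limit and `asyncio.wait_for` for the per-handler deadline, so it illustrates the pattern rather than the relay's actual code. The `dispatch` function and its parameters are hypothetical, and starting the deadline only once a handler has a slot is an assumption.

```python
import asyncio


async def dispatch(event, handlers, max_concurrency=100, handler_timeout=10.0):
    """Fan one event out to all handlers, at most max_concurrency at a time.

    Returns a list of (handler name, exception) pairs. A slow or failing
    handler never cancels the others.
    """
    slots = asyncio.Semaphore(max_concurrency)  # stands in for g.SetLimit
    errors = []

    async def run(handler):
        async with slots:  # wait here while every slot is busy
            try:
                # Assumption: the deadline starts once the handler has a slot.
                await asyncio.wait_for(handler(event), timeout=handler_timeout)
            except Exception as exc:  # timeouts and handler failures alike
                errors.append((handler.__name__, exc))

    await asyncio.gather(*(run(h) for h in handlers))
    return errors
```

Because errors are collected instead of raised, the caller sees every failure for the event in one place, which mirrors the "collect, don't stop" behavior described above.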
| Registered handlers | Recommended `max_concurrency` | Notes |
|---|---|---|
| 1-5 | 10-25 | Low contention, default is fine |
| 5-15 | 25-50 | Typical production setup |
| 15-30 | 50-100 | Many providers/actions |
| 30+ | 100-200 | Large multi-team deployments |
Rule of thumb: Set max_concurrency to roughly 2-3x your handler count. This keeps the pipeline responsive while avoiding goroutine explosion under burst traffic.
The default of 100 works for most setups. Only change it if:
- You see `handler_timeout` errors clustering on slow providers (lower it to protect faster ones).
- You have 30+ handlers and events queue up (raise it).
- You run on resource-constrained nodes (lower it to reduce memory pressure).
Each handler runs with a per-event deadline. If a handler exceeds this, it's canceled and counted as a timeout (handler_timeouts metric).
| Setting | Use case |
|---|---|
| 10s (default) | Most setups. Covers normal API latency + retry backoff. |
| 15-30s | Providers with known slow APIs (GitLab self-managed, large repos). |
| 5s | High-throughput setups where you'd rather skip a slow handler than block. |
Don't set this below 5s. SCM API calls often take 1-3s even when healthy, and the retry policy adds up to 250ms initial backoff.
The in-memory deduper uses an LRU cache to track CloudEvent IDs. The size controls how many unique events are remembered before old entries are evicted.
| Event rate | Window (at default TTL) | Recommended `dedupe_size` |
|---|---|---|
| <10/s | ~17 min at 10000 | 10000 (default) |
| 10-100/s | ~2 min at 10000 | 50000 |
| 100-1000/s | ~10s at 10000 | 100000 |
For valkey and olric backends, dedupe_size controls the in-memory LRU only; remote entries use the store's TTL instead. You can keep the default.
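
The windows in the table follow from dividing the cache size by the event rate. The sketch below reproduces that arithmetic under the assumption that every event carries a unique ID (the worst case for eviction); both function names are hypothetical.

```python
import math


def dedupe_window_seconds(dedupe_size: int, events_per_s: float) -> float:
    """Seconds until an LRU holding dedupe_size unique IDs starts evicting."""
    return dedupe_size / events_per_s


def size_for_window(events_per_s: float, window_s: float) -> int:
    """Smallest dedupe_size that remembers window_s seconds of unique events."""
    return math.ceil(events_per_s * window_s)


# Default size at the upper end of each rate band in the table.
for rate in (10, 100, 1000):
    print(rate, "events/s ->", dedupe_window_seconds(10000, rate), "s")
```

Going the other way, `size_for_window(100, 120)` gives the size needed to remember two minutes of events at 100 events/s.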
Single replica:
- Use `memory` store.
- Simplest deployment. Good for dev, staging, or low-traffic production.
- Dedup state is lost on restart (brief window of possible duplicate processing).

Multiple replicas:
- Use `valkey` or `olric` store.
- Set `replicaCount: 3` in Helm values.
- The Helm chart refuses multi-replica + `memory` store unless you set `unsafe.allowMemoryStoreWithMultipleReplicas: true`.
- With `olric`, ensure the headless service is configured for peer discovery.

High throughput (>1000 events/s):
- Use `valkey` for best throughput.
- Monitor `deduper_hits` and `store_errors` metrics to catch contention.
- Each replica opens connections to the store. Size the store accordingly.

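
The multi-replica guard described above can be mirrored in a pre-deploy check. This is a hypothetical sketch, not the chart's actual template logic: the function name is invented, and the `config.store.backend` path is an assumption about how the values file is nested.

```python
def check_store_config(values: dict) -> None:
    """Reject multi-replica + memory store unless the unsafe override is set."""
    replicas = values.get("replicaCount", 1)
    # Assumption: the backend lives at config.store.backend in the values file.
    backend = values.get("config", {}).get("store", {}).get("backend")
    override = values.get("unsafe", {}).get("allowMemoryStoreWithMultipleReplicas", False)
    if replicas > 1 and backend == "memory" and not override:
        raise ValueError(
            f"{replicas} replicas with the memory store: each pod would dedupe "
            "independently; use valkey or olric, or set the unsafe override"
        )


check_store_config({"replicaCount": 3, "config": {"store": {"backend": "valkey"}}})
```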
These are starting points. Adjust based on your actual event volume and handler count.
<100 events/s:

```yaml
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 250m
    memory: 256Mi
```

100-1000 events/s:

```yaml
resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
```

>1000 events/s:

```yaml
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi
```

The retry policy applies to all outbound HTTP calls (SCM APIs, webhook notifiers). Defaults work for most setups.
| Parameter | Default | When to change |
|---|---|---|
| `max_attempts` | 4 | Raise to 6 if providers are flaky. Lower to 2 for fire-and-forget notifiers. |
| `initial_backoff` | 250ms | Rarely needs tuning. |
| `max_backoff` | 30s | Lower to 10s if you need faster failure detection. |
The retry policy honors Retry-After headers and applies exponential backoff with jitter.
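
To see how these parameters interact with `handler_timeout`, it helps to compute the worst-case waiting time. The sketch below is an estimate under stated assumptions, not the relay's implementation: it assumes the backoff doubles on each retry, that `max_attempts` includes the first try, and that jitter is "full jitter" (a uniform wait between zero and the cap). `Retry-After` handling is not modeled, and the function names are hypothetical.

```python
import random


def backoff_caps(max_attempts=4, initial_backoff=0.25, max_backoff=30.0, factor=2.0):
    """Upper bound, in seconds, of the wait before each retry.

    max_attempts counts the first try, so there are max_attempts - 1 waits.
    """
    return [
        min(initial_backoff * factor**i, max_backoff)
        for i in range(max_attempts - 1)
    ]


def jittered(cap, rng=random):
    """Full jitter: a uniform wait between 0 and the cap."""
    return rng.uniform(0, cap)


print(backoff_caps(), sum(backoff_caps()))
print(sum(backoff_caps(max_attempts=6)))
```

Under these assumptions, four attempts wait at most 1.75s in total and six attempts at most 7.75s, before counting the time the requests themselves take.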
After deployment, check these metrics to validate your sizing:
| Metric | What to watch |
|---|---|
| `handler_timeouts` | If non-zero, raise `handler_timeout` or investigate the slow handler. |
| `handler_duration` (p99) | Should stay well below `handler_timeout`. |
| `deduper_hits` | High rate means events are being retried. Normal, but watch for spikes. |
| `store_errors` | Any non-zero count means the store backend is failing. Deduper fails open, but accuracy drops. |
| `events_backpressure` | Non-zero means the relay returned 503 to Tekton. Events will be retried. |
| `pipeline_errors` | Per-step errors. Check which chain step is failing. |
Access metrics at GET /metrics (default port 9090).
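
A quick way to check the "non-zero means trouble" metrics is to scrape the endpoint and sum each counter. The sketch below assumes the endpoint serves the Prometheus text exposition format and that the metric names appear exactly as listed in the table; real output may add a prefix or suffix, so treat the names and the sample text as hypothetical.

```python
def sum_metrics(text: str) -> dict:
    """Sum sample values per metric name from Prometheus-style text."""
    totals = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        if "{" in line:
            name = line.split("{", 1)[0]
            fields = line.rsplit("}", 1)[1].split()
        else:
            name, *fields = line.split()
        if not fields:
            continue
        try:
            value = float(fields[0])  # any trailing timestamp is ignored
        except ValueError:
            continue
        totals[name] = totals.get(name, 0.0) + value
    return totals


def sizing_warnings(totals: dict) -> list:
    """Names of watched counters that are non-zero."""
    watched = ("handler_timeouts", "store_errors", "events_backpressure")
    return [name for name in watched if totals.get(name, 0) > 0]


SAMPLE = """\
# TYPE handler_timeouts counter
handler_timeouts{provider="github"} 2
handler_timeouts{provider="slack"} 1
store_errors 0
events_backpressure 0
"""
print(sizing_warnings(sum_metrics(SAMPLE)))
```

In practice the text would come from the `/metrics` endpoint, for example through a port-forward to port 9090, rather than from a hard-coded sample.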
Single replica, `memory` store:

```yaml
replicaCount: 1
config:
  max_concurrency: 50
  handler_timeout: 10s
  store:
    backend: memory
  dedupe_size: 10000
```

Three replicas, `valkey` store:

```yaml
replicaCount: 3
config:
  max_concurrency: 100
  handler_timeout: 10s
  store:
    backend: valkey
    valkey:
      address: "valkey:6379"
      password_file: /etc/secrets/valkey/password
      key_prefix: "ter:"
  dedupe_size: 50000
  retry:
    max_attempts: 4
    initial_backoff: 250ms
    max_backoff: 30s
```

Five replicas, `olric` store:

```yaml
replicaCount: 5
config:
  max_concurrency: 200
  handler_timeout: 15s
  store:
    backend: olric
    olric:
      bind_addr: "0.0.0.0"
      bind_port: 12320
      memberlist_port: 12321
      env: lan
  dedupe_size: 100000
```