Summary
Assume promgithub is intended to support multi-instance and highly available deployments. This issue tracks the architecture and implementation work required to make webhook ingestion and metric production correct when multiple replicas are running simultaneously.
Today, stateless horizontal scaling would not be correct on its own: duplicate webhook deliveries can be processed by multiple replicas, in-memory workflow/job state does not compose across instances, and retries or out-of-order events can corrupt gauges.
This issue should define and drive the changes needed to make multi-instance deployment a first-class supported model.
Why this matters
- Operators should be able to run multiple replicas behind a load balancer.
- GitHub retries and repeated deliveries must not inflate counters or corrupt gauges.
- Workflow/job state must remain consistent across replicas.
- HA deployments need a clear storage and failure model.
Goals
- Make webhook processing idempotent across replicas.
- Support shared deduplication and shared workflow/job state.
- Define a recommended multi-instance deployment topology.
- Add observability for dedupe/state backend health and failure modes.
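As a rough illustration of the idempotency goal, the sketch below shows two replicas sharing one deduplication store keyed by the X-GitHub-Delivery header. Everything here is hypothetical and not existing promgithub code: `InMemoryDedupeStore`, `Replica`, and the one-hour TTL are assumptions for the example, and the in-memory dictionary only stands in for a real shared backend. With Redis, `claim()` would roughly correspond to `SET key value NX EX seconds`.

```python
import threading
import time


class InMemoryDedupeStore:
    """Hypothetical stand-in for a shared dedupe backend (not promgithub code)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expiry = {}  # delivery_id -> time at which the entry expires
        self._lock = threading.Lock()

    def claim(self, delivery_id):
        """Return True only for the first caller that presents delivery_id."""
        now = self._clock()
        with self._lock:
            # Bounded retention: drop expired entries before checking.
            for key in [k for k, t in self._expiry.items() if t <= now]:
                del self._expiry[key]
            if delivery_id in self._expiry:
                return False
            self._expiry[delivery_id] = now + self._ttl
            return True


class Replica:
    """Hypothetical receiver replica that counts each delivery at most once."""

    def __init__(self, store):
        self.store = store
        self.events_processed = 0

    def handle(self, headers, payload):
        delivery_id = headers["X-GitHub-Delivery"]
        if not self.store.claim(delivery_id):
            return False  # another replica (or an earlier retry) already handled it
        self.events_processed += 1
        return True


# Assumed retention of one hour; the right value depends on GitHub retry behavior.
store = InMemoryDedupeStore(ttl_seconds=3600)
replica_a, replica_b = Replica(store), Replica(store)
headers = {"X-GitHub-Delivery": "example-delivery-id"}
first = replica_a.handle(headers, {})
second = replica_b.handle(headers, {})
```

The important property is that the check and the write happen as one atomic step in the shared backend. A separate read followed by a write would let two replicas both claim the same delivery.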
Suggested scope
- Use X-GitHub-Delivery as a shared idempotency key.
- Add a shared backend for deduplication with bounded retention, likely Redis or equivalent.
- Move workflow/job state tracking to shared storage keyed by stable GitHub IDs such as workflow run_id and job id.
- Define how gauges are derived from shared state rather than per-process memory.
- Document the recommended HA deployment architecture, including receiver replicas, shared state backend, scrape behavior, and failure handling.
- Add internal metrics for duplicate deliveries, backend failures, backend latency, and state processing errors.
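One possible shape for the shared-state and gauge items above, sketched under assumptions: state is keyed by workflow run_id, statuses only move forward (queued, then in_progress, then completed), and gauges are computed from the stored state on demand. `SharedRunState` and `STATUS_RANK` are hypothetical names, the dictionary stands in for the shared backend, and only three statuses are modeled for brevity.

```python
from collections import Counter

# Assumed ordering of statuses; a later status must never be overwritten
# by an earlier one that arrives late or is redelivered.
STATUS_RANK = {"queued": 0, "in_progress": 1, "completed": 2}


class SharedRunState:
    """Hypothetical stand-in for shared workflow state keyed by run_id."""

    def __init__(self):
        self._status = {}  # run_id -> most advanced status seen so far

    def apply(self, run_id, status):
        """Record status for run_id; return False if the event is stale or repeated."""
        current = self._status.get(run_id)
        if current is not None and STATUS_RANK[status] <= STATUS_RANK[current]:
            return False
        self._status[run_id] = status
        return True

    def gauge_values(self):
        """Derive per-status gauge values from the stored state."""
        counts = Counter(self._status.values())
        return {status: counts.get(status, 0) for status in STATUS_RANK}


state = SharedRunState()
state.apply(101, "queued")
state.apply(101, "completed")
late = state.apply(101, "in_progress")  # out-of-order event for a finished run
state.apply(102, "in_progress")
gauges = state.gauge_values()
```

Because gauges are recomputed from state rather than incremented and decremented per event, a replayed or late event cannot push a gauge negative or leave it stuck. In a real backend the compare-and-update in `apply()` would need to be atomic, and completed runs would need some eviction policy.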
Child issues
- Add shared deduplication for GitHub delivery IDs.
- Move workflow/job state tracking to a shared backend.
- Make event handling idempotent across replicas.
- Document recommended HA deployment architecture.
- Add observability for dedupe/state backend behavior.
Acceptance criteria
- Running multiple promgithub replicas behind a load balancer is a supported and documented deployment mode.
- Duplicate deliveries do not inflate counters or corrupt workflow/job state.
- Workflow/job gauges are derived from shared state rather than relying on per-instance memory.
- Operators can observe failures and latency in the dedupe/state path through metrics and documentation.
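For the observability criterion, a minimal sketch of how calls into the dedupe/state backend could be wrapped to record call counts, failures, and latency. `BackendMetrics` and the two example backend functions are hypothetical; a real implementation would presumably export these through the Prometheus client rather than plain attributes.

```python
import time


class BackendMetrics:
    """Hypothetical in-process counters for the dedupe/state path."""

    def __init__(self):
        self.calls = 0
        self.failures = 0
        self.latency_seconds_total = 0.0

    def observe(self, fn, *args):
        """Call fn(*args), recording latency and whether it raised."""
        start = time.perf_counter()
        self.calls += 1
        try:
            return fn(*args)
        except Exception:
            self.failures += 1
            raise  # the caller still decides how to handle the failure
        finally:
            self.latency_seconds_total += time.perf_counter() - start


def healthy_claim(delivery_id):
    # Placeholder for a backend call that succeeds.
    return True


def broken_claim(delivery_id):
    # Placeholder for a backend call that fails.
    raise ConnectionError("backend unavailable")


metrics = BackendMetrics()
metrics.observe(healthy_claim, "id-1")
try:
    metrics.observe(broken_claim, "id-2")
except ConnectionError:
    pass
```

The wrapper re-raises on purpose: what a replica does when the backend is unavailable is a separate decision that the failure model in the HA documentation would need to spell out.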