Add multi-instance safe event processing and deployment model #45

@darthfork

Summary

This issue assumes promgithub is intended to support multi-instance, highly available deployments. It tracks the architecture and implementation work required to keep webhook ingestion and metric production correct when multiple replicas run simultaneously.

Today, stateless horizontal scaling would not be correct on its own: duplicate webhook deliveries can be processed by multiple replicas, in-memory workflow/job state does not compose across instances, and retries or out-of-order events can corrupt gauges.
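To make the failure mode concrete, here is a minimal sketch of two replicas that each deduplicate only in their own memory. The `Replica` class and the delivery ID `d-1` are hypothetical and used purely for illustration; this is not promgithub's actual code.

```python
class Replica:
    """Hypothetical receiver that deduplicates only in process memory."""

    def __init__(self) -> None:
        self.seen = set()
        self.events_total = 0

    def receive(self, delivery_id: str) -> None:
        # Per-process dedupe: only catches repeats that hit this same replica.
        if delivery_id in self.seen:
            return
        self.seen.add(delivery_id)
        self.events_total += 1


a, b = Replica(), Replica()
a.receive("d-1")  # original delivery lands on replica A
b.receive("d-1")  # the same delivery, redelivered, lands on replica B

# Summed across replicas, one logical event is counted twice.
total = a.events_total + b.events_total
```

Neither replica is wrong in isolation; the inflation only appears once the load balancer spreads repeats of the same delivery across processes.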

This issue should define and drive the changes needed to make multi-instance deployment a first-class supported model.

Why this matters

  • Operators should be able to run multiple replicas behind a load balancer.
  • GitHub retries and repeated deliveries must not inflate counters or corrupt gauges.
  • Workflow/job state must remain consistent across replicas.
  • HA deployments need a clear storage and failure model.

Goals

  • Make webhook processing idempotent across replicas.
  • Support shared deduplication and shared workflow/job state.
  • Define a recommended multi-instance deployment topology.
  • Add observability for dedupe/state backend health and failure modes.

Suggested scope

  • Use X-GitHub-Delivery as a shared idempotency key.
  • Add a shared backend for deduplication with bounded retention, likely Redis or equivalent.
  • Move workflow/job state tracking to shared storage keyed by stable GitHub IDs such as workflow run_id and job id.
  • Define how gauges are derived from shared state rather than per-process memory.
  • Document the recommended HA deployment architecture, including receiver replicas, shared state backend, scrape behavior, and failure handling.
  • Add internal metrics for duplicate deliveries, backend failures, backend latency, and state processing errors.
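As a starting point for discussion, the first two scope items could look roughly like the sketch below. It assumes an in-memory stand-in for the shared backend; `InMemoryDedupeStore`, `handle_delivery`, and the one-hour default retention are hypothetical names and values, not existing promgithub code.

```python
import time
from typing import Callable, Dict, Mapping


class InMemoryDedupeStore:
    """Stand-in for a shared backend with set-if-absent plus expiry semantics."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._expires_at: Dict[str, float] = {}
        self._clock = clock

    def claim(self, delivery_id: str, ttl_seconds: float) -> bool:
        """Return True only for the first claim of this ID within the TTL."""
        now = self._clock()
        # Drop expired entries so retention stays bounded.
        self._expires_at = {k: v for k, v in self._expires_at.items() if v > now}
        if delivery_id in self._expires_at:
            return False
        self._expires_at[delivery_id] = now + ttl_seconds
        return True


def handle_delivery(
    store: InMemoryDedupeStore,
    headers: Mapping[str, str],
    process: Callable[[], None],
    ttl_seconds: float = 3600.0,  # assumed retention window
) -> bool:
    """Run `process` once per delivery ID; return False for duplicates."""
    # A real HTTP framework would give case-insensitive header lookup.
    delivery_id = headers.get("X-GitHub-Delivery")
    if not delivery_id:
        raise ValueError("missing X-GitHub-Delivery header")
    if not store.claim(delivery_id, ttl_seconds):
        return False  # some replica already claimed this delivery
    process()
    return True
```

With Redis as the backend, `claim` would plausibly map to a single `SET key value NX EX ttl` command, which is atomic across replicas. One open design question: claiming before processing means a crash mid-processing drops the event, so the implementation may need to release the claim on failure or accept that trade-off explicitly.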

Child issues

  • Add shared deduplication for GitHub delivery IDs.
  • Move workflow/job state tracking to a shared backend.
  • Make event handling idempotent across replicas.
  • Document recommended HA deployment architecture.
  • Add observability for dedupe/state backend behavior.
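For the state-tracking and idempotency child issues, one possible shape is a monotonic status transition keyed by run ID, with gauges computed from the stored state at read time. The sketch below is an assumption-laden illustration: `SharedRunState`, `STATUS_RANK`, and the simplified three-status lifecycle are hypothetical, and a dict stands in for the shared backend.

```python
from collections import Counter
from typing import Dict

# Assumed, simplified lifecycle; later statuses have a higher rank.
STATUS_RANK = {"queued": 0, "in_progress": 1, "completed": 2}


class SharedRunState:
    """Stand-in for shared storage keyed by a stable workflow run ID."""

    def __init__(self) -> None:
        self._status_by_run: Dict[int, str] = {}

    def apply(self, run_id: int, status: str) -> bool:
        """Apply a status only if it moves the run forward.

        Duplicates and late (out-of-order) events return False and
        leave the stored state unchanged.
        """
        current = self._status_by_run.get(run_id)
        if current is not None and STATUS_RANK[status] <= STATUS_RANK[current]:
            return False
        self._status_by_run[run_id] = status
        return True

    def gauge_values(self) -> Dict[str, int]:
        """Derive gauge values from stored state, not per-process counters."""
        counts = Counter(self._status_by_run.values())
        return {s: counts.get(s, 0) for s in ("queued", "in_progress")}
```

Because gauges are recomputed from state, any replica answering a scrape reports the same values. In a real shared backend the read-compare-write in `apply` would have to be atomic (for example a transaction or server-side script), and completed runs would need eviction to keep storage bounded.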

Acceptance criteria

  • Running multiple promgithub replicas behind a load balancer is a supported and documented deployment mode.
  • Duplicate deliveries do not inflate counters or corrupt workflow/job state.
  • Workflow/job gauges are derived from shared state rather than relying on per-instance memory.
  • Operators can observe failures and latency in the dedupe/state path through metrics and documentation.
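The observability criterion could be met by wrapping the dedupe call so duplicates, backend failures, and latency are recorded in one place. This is a rough sketch using plain counters rather than a metrics library; `InstrumentedDedupe` and both metric names are hypothetical placeholders.

```python
import time
from collections import defaultdict
from typing import Callable, DefaultDict, List


class InstrumentedDedupe:
    """Wraps a claim function and records internal metrics about it."""

    def __init__(
        self,
        claim: Callable[[str], bool],
        clock: Callable[[], float] = time.perf_counter,
    ) -> None:
        self._claim = claim
        self._clock = clock
        self.counters: DefaultDict[str, int] = defaultdict(int)
        self.latencies: List[float] = []  # seconds per backend call

    def claim(self, delivery_id: str) -> bool:
        start = self._clock()
        try:
            first = self._claim(delivery_id)
        except Exception:
            # Count the failure, then let the caller decide how to react.
            self.counters["dedupe_backend_failures_total"] += 1
            raise
        finally:
            # Latency is recorded for successful and failed calls alike.
            self.latencies.append(self._clock() - start)
        if not first:
            self.counters["duplicate_deliveries_total"] += 1
        return first
```

Re-raising leaves the failure policy to the caller: rejecting the webhook so GitHub can redeliver it, or processing anyway and risking a double count, is a decision the HA documentation should state explicitly.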
