[Phase 4] Result aggregation pipeline for distributed tests #49

@cbaugus

Description

Aggregate metrics from all regional nodes into a single coherent result set, exposed on one Prometheus endpoint and printed in the final report.

Data flow

[Node us-central1]  ─┐
[Node us-east1]     ─┤── gRPC StreamMetrics ──► [Leader aggregator] ──► /metrics (Prometheus)
[Node europe-west1] ─┘                                              └──► Final report (stdout)

What gets aggregated

Counters (sum across all nodes)

  • requests_total, requests_status_codes_total, request_errors_by_category
  • scenario_requests_total
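A minimal sketch of the counter summation, assuming each node reports a mapping from a (metric name, labels) key to a cumulative count. The `sum_counters` helper, the key shape, and the sample values are hypothetical and only illustrate the idea.

```python
from collections import Counter

def sum_counters(per_node):
    """Sum counter samples across nodes.

    per_node maps a node name to {(metric_name, labels): value},
    where labels is a tuple of (key, value) pairs (hypothetical shape).
    """
    totals = Counter()
    for samples in per_node.values():
        totals.update(samples)  # Counter.update adds values key by key
    return dict(totals)

OK_200 = (("code", "200"),)
per_node = {
    "us-central1": {("requests_total", ()): 1200,
                    ("requests_status_codes_total", OK_200): 1180},
    "us-east1": {("requests_total", ()): 800,
                 ("requests_status_codes_total", OK_200): 790},
}
totals = sum_counters(per_node)
```

Keying on the full label set keeps series such as different status codes separate while still summing the same series across nodes.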

Histograms (merge bucket counts — NOT average of averages)

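  • request_duration_seconds: sum the raw histogram bucket counts from every node first, then compute percentiles from the merged buckets
  • This gives global P95/P99 values that weight every request equally; averaging per-node percentiles ignores how many requests each node served and can be badly skewed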
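The difference can be seen in a small sketch. It assumes every node uses the same bucket boundaries and reports per-bucket (non-cumulative) counts; the boundaries, the sample counts, and the helper names are hypothetical.

```python
def merge_histograms(histograms):
    """Element-wise sum of bucket counts; all inputs must share the same bounds."""
    return [sum(counts) for counts in zip(*histograms)]

def percentile(bounds, counts, q):
    """Upper bound of the first bucket whose cumulative count reaches q of the total."""
    target = q * sum(counts)
    running = 0
    for bound, count in zip(bounds, counts):
        running += count
        if running >= target:
            return bound
    return bounds[-1]

BOUNDS = [0.05, 0.1, 0.25, 0.5, 1.0]  # bucket upper bounds in seconds (hypothetical)
fast_node = [900, 80, 15, 5, 0]       # 1000 requests, mostly fast
slow_node = [10, 20, 30, 30, 10]      # 100 requests, mostly slow

merged = merge_histograms([fast_node, slow_node])
global_p95 = percentile(BOUNDS, merged, 0.95)
averaged_p95 = (percentile(BOUNDS, fast_node, 0.95)
                + percentile(BOUNDS, slow_node, 0.95)) / 2
```

With these numbers the merged histogram puts P95 in the 0.25 s bucket, while averaging the two per-node P95 values gives 0.55 s, because the slow node counts as much as the fast one despite serving a tenth of the traffic.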

Per-region labels preserved

  • Leader re-emits all metrics with a region label so per-region breakdown is still available in Grafana
  • Global metrics (no region label) are the merged totals
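As a sketch of what the leader's endpoint might return for one counter, the following renders per-region samples plus an unlabelled total in the Prometheus text exposition format. The `render_counter` helper and the sample values are hypothetical.

```python
def render_counter(name, per_region):
    """Render one counter as Prometheus text exposition lines.

    Emits one sample per region, then an unlabelled sample with the merged total.
    """
    lines = [f"# TYPE {name} counter"]
    for region, value in sorted(per_region.items()):
        lines.append(f'{name}{{region="{region}"}} {value}')
    lines.append(f"{name} {sum(per_region.values())}")
    return "\n".join(lines)

text = render_counter(
    "requests_total",
    {"us-central1": 1200, "us-east1": 800, "europe-west1": 500},
)
print(text)
```

Because the labelled and unlabelled series share one metric name, a dashboard query that sums over the whole metric would count every request twice, so queries need to select either the regional series or the global one.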

Partial failure handling

  • If a node goes silent (no metrics for CLUSTER_METRICS_TIMEOUT_SECS, default 15s), leader marks it degraded
  • Aggregation continues with remaining nodes
  • Final report flags which nodes were degraded and for how long
  • Leader does NOT abort the test — partial results are better than no results
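One way the degraded-node bookkeeping could work is sketched below. The `NodeHealth` class is hypothetical, and timestamps are passed in explicitly so the example is deterministic; a real leader would likely read them from a monotonic clock.

```python
CLUSTER_METRICS_TIMEOUT_SECS = 15.0  # default quoted in the issue

class NodeHealth:
    """Hypothetical tracker that marks nodes degraded after a silent period."""

    def __init__(self, nodes, now):
        self.last_seen = {node: now for node in nodes}
        self.degraded_since = {}  # node -> time it became degraded
        self.degraded_total = {node: 0.0 for node in nodes}

    def record_batch(self, node, now):
        """Call when a metrics batch arrives; clears any degraded state."""
        self.last_seen[node] = now
        started = self.degraded_since.pop(node, None)
        if started is not None:
            self.degraded_total[node] += now - started

    def check(self, now, timeout=CLUSTER_METRICS_TIMEOUT_SECS):
        """Mark silent nodes degraded and return the currently degraded nodes."""
        for node, seen in self.last_seen.items():
            if now - seen > timeout and node not in self.degraded_since:
                self.degraded_since[node] = seen + timeout
        return sorted(self.degraded_since)

    def degraded_seconds(self, node, now):
        """Total degraded time for the final report, including an ongoing outage."""
        started = self.degraded_since.get(node)
        ongoing = now - started if started is not None else 0.0
        return self.degraded_total[node] + ongoing

health = NodeHealth(["us-east1", "europe-west1"], now=0.0)
health.record_batch("us-east1", now=5.0)
health.record_batch("europe-west1", now=5.0)
health.record_batch("us-east1", now=20.0)
degraded_at_21 = health.check(now=21.0)      # europe-west1 silent for 16s
health.record_batch("europe-west1", now=30.0)
degraded_at_31 = health.check(now=31.0)
downtime = health.degraded_seconds("europe-west1", now=31.0)
```

The check never raises or stops anything; it only records state, which matches the requirement that aggregation continues with the remaining nodes.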

Streaming protocol

  • Workers send MetricsBatch via gRPC server-side stream every METRICS_FLUSH_INTERVAL_SECS (default: 5s)
  • Leader buffers the last N batches per node and re-aggregates on each Prometheus scrape
  • On test completion, leader collects a final flush from all nodes before printing the report
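A rough sketch of the leader-side buffer follows. The issue does not say whether a MetricsBatch carries cumulative values or deltas; this example assumes cumulative counter snapshots, so a scrape sums the newest batch from each node while older batches are simply retained. `BatchBuffer` and the batch shape are hypothetical.

```python
from collections import defaultdict, deque

class BatchBuffer:
    """Hypothetical per-node ring buffer of the last N metrics batches."""

    def __init__(self, max_batches):
        # deque(maxlen=N) silently drops the oldest batch once N are stored
        self._buffers = defaultdict(lambda: deque(maxlen=max_batches))

    def push(self, node, batch):
        """Store a batch ({metric_name: cumulative_value}) received from a node."""
        self._buffers[node].append(batch)

    def buffered(self, node):
        """Number of batches currently held for a node."""
        return len(self._buffers[node]) if node in self._buffers else 0

    def aggregate(self):
        """Called on each scrape: sum the newest snapshot from every node."""
        totals = defaultdict(int)
        for batches in self._buffers.values():
            for name, value in batches[-1].items():
                totals[name] += value
        return dict(totals)

buf = BatchBuffer(max_batches=3)
for value in (100, 250, 400, 520):  # four flushes from one node
    buf.push("us-central1", {"requests_total": value})
buf.push("us-east1", {"requests_total": 80})
scrape = buf.aggregate()
```

If batches carried deltas instead, the leader would need a running total per node rather than the newest snapshot, since evicting old batches would otherwise make counters decrease.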
