Closed
Labels
component:metrics (Metrics and observability), effort:l (Large - 1-2 weeks), phase-4:distributed (Phase 4: Distributed Testing, Months 7-8), priority:p1-high (High priority), type:feature (New feature or functionality)
Description
Aggregate metrics from all regional nodes into a single coherent result set, exposed on one Prometheus endpoint and printed in the final report.
Data flow
[Node us-central1]   ─┐
[Node us-east1]      ─┤── gRPC StreamMetrics ──► [Leader aggregator] ──► /metrics (Prometheus)
[Node europe-west1]  ─┘                                  └──► Final report (stdout)
What gets aggregated
Counters (sum across all nodes)
- requests_total, requests_status_codes_total, request_errors_by_category
- scenario_requests_total
Histograms (merge bucket counts — NOT average of averages)
- request_duration_seconds: merge the raw histogram buckets from each node before computing percentiles
- This gives correct global P95/P99 without the statistical error of averaging percentiles
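A minimal sketch of the bucket merge, assuming every node reports Prometheus-style cumulative bucket counts over identical bucket bounds. The function names and the linear interpolation inside a bucket are illustrative assumptions, not the project's implementation.

```python
def merge_histograms(per_node):
    """Sum cumulative bucket counts across nodes (assumes identical bounds)."""
    merged = {}
    for buckets in per_node:
        for upper_bound, count in buckets.items():
            merged[upper_bound] = merged.get(upper_bound, 0) + count
    return dict(sorted(merged.items()))


def estimate_quantile(cumulative, q):
    """Estimate quantile q (0..1) from cumulative buckets by linear interpolation."""
    bounds = sorted(cumulative)
    total = cumulative[bounds[-1]]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound in bounds:
        count = cumulative[bound]
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # cannot interpolate into the +Inf bucket
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


INF = float("inf")
# Hypothetical data: a busy fast node and a quiet slow node.
fast_node = {0.1: 90, 0.5: 100, 1.0: 100, INF: 100}
slow_node = {0.1: 0, 0.5: 5, 1.0: 10, INF: 10}

merged = merge_histograms([fast_node, slow_node])
global_p95 = estimate_quantile(merged, 0.95)
```

Because the merge weights each node by its actual request count, the global P95 here lands in the 0.1 to 0.5 second bucket, whereas averaging the two nodes' own P95 values would give the quiet slow node as much influence as the busy one.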
Per-region labels preserved
- Leader re-emits all metrics with a region label so per-region breakdown is still available in Grafana
- Global metrics (no region label) are the merged totals
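The labelling scheme above can be sketched as follows. This is an illustration under assumptions: samples arrive as hypothetical `(metric_name, region, value)` tuples, and `aggregate_counters` and `render` are made-up helper names.

```python
from collections import defaultdict


def aggregate_counters(samples):
    """Sum counters into per-region series plus one global series (region=None)."""
    totals = defaultdict(float)
    for name, region, value in samples:
        totals[(name, region)] += value  # per-region series
        totals[(name, None)] += value    # global series, no region label
    return dict(totals)


def render(totals):
    """Render the series in Prometheus text exposition style."""
    lines = []
    ordered = sorted(totals.items(), key=lambda kv: (kv[0][0], kv[0][1] or ""))
    for (name, region), value in ordered:
        labels = f'{{region="{region}"}}' if region else ""
        lines.append(f"{name}{labels} {value:g}")
    return "\n".join(lines)


# Hypothetical samples from two regional nodes.
samples = [
    ("requests_total", "us-east1", 100),
    ("requests_total", "europe-west1", 40),
]
totals = aggregate_counters(samples)
text = render(totals)
```

One consequence of exposing both forms under the same metric name is that a dashboard query summing over all series of `requests_total` would count every request twice, so queries need to select either the labelled or the unlabelled series.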
Partial failure handling
- If a node goes silent (no metrics for CLUSTER_METRICS_TIMEOUT_SECS, default 15s), leader marks it degraded
- Aggregation continues with remaining nodes
- Final report flags which nodes were degraded and for how long
- Leader does NOT abort the test — partial results are better than no results
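One way the degraded-node bookkeeping could look, as a sketch only. `HealthTracker` and its methods are hypothetical names, timestamps are passed in explicitly to keep the example deterministic, and it assumes a node counts as degraded from the moment its timeout expires until its next batch arrives.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeHealth:
    last_seen: float
    degraded_since: Optional[float] = None
    degraded_total: float = 0.0


class HealthTracker:
    def __init__(self, nodes, now, timeout_secs=15.0):
        # timeout_secs mirrors CLUSTER_METRICS_TIMEOUT_SECS (default 15s).
        self.timeout = timeout_secs
        self.nodes = {n: NodeHealth(last_seen=now) for n in nodes}

    def record_batch(self, node, now):
        """A batch arrived: close any open degraded interval for this node."""
        h = self.nodes[node]
        if h.degraded_since is not None:
            h.degraded_total += now - h.degraded_since
            h.degraded_since = None
        h.last_seen = now

    def check(self, now):
        """Mark silent nodes degraded and return the nodes still healthy."""
        healthy = []
        for name, h in self.nodes.items():
            if h.degraded_since is None and now - h.last_seen > self.timeout:
                h.degraded_since = h.last_seen + self.timeout
            if h.degraded_since is None:
                healthy.append(name)
        return healthy

    def report(self, now):
        """Degraded seconds per node, for the final report."""
        out = {}
        for name, h in self.nodes.items():
            open_interval = now - h.degraded_since if h.degraded_since is not None else 0.0
            total = h.degraded_total + open_interval
            if total > 0:
                out[name] = total
        return out


tracker = HealthTracker(["us-east1", "europe-west1"], now=0.0)
tracker.record_batch("us-east1", now=20.0)   # europe-west1 stays silent
healthy = tracker.check(now=25.0)            # aggregation continues with these
degraded = tracker.report(now=30.0)
```

Note that `check` never raises or stops the run: it only narrows the set of nodes used for aggregation, which matches the requirement that the leader keeps going with partial results.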
Streaming protocol
- Workers send MetricsBatch via gRPC server-side stream every METRICS_FLUSH_INTERVAL_SECS (default: 5s)
- Leader buffers the last N batches per node and re-aggregates on each Prometheus scrape
- On test completion, leader collects a final flush from all nodes before printing the report
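A rough sketch of the leader-side buffer, with two loud assumptions that the issue does not specify: each batch is a plain dict of cumulative counter snapshots, and a scrape sums the newest batch from each node. `BatchBuffer` is a hypothetical name, and the gRPC transport is omitted entirely.

```python
from collections import defaultdict, deque


class BatchBuffer:
    def __init__(self, max_batches=3):
        # Keep only the last N batches per node; older ones are dropped.
        self.by_node = defaultdict(lambda: deque(maxlen=max_batches))

    def push(self, node, batch):
        """Store a batch received from a node's metrics stream."""
        self.by_node[node].append(batch)

    def scrape(self):
        """Re-aggregate on demand: sum the newest snapshot from each node."""
        totals = defaultdict(float)
        for batches in self.by_node.values():
            if batches:
                for name, value in batches[-1].items():
                    totals[name] += value
        return dict(totals)


buf = BatchBuffer(max_batches=3)
for snapshot in (10, 20, 30, 40):  # four flushes from one node, 5s apart
    buf.push("us-east1", {"requests_total": snapshot})
buf.push("europe-west1", {"requests_total": 7})

scraped = buf.scrape()
```

If batches carried deltas rather than cumulative snapshots, a bounded buffer would lose counts once old batches were evicted, so the leader would need to fold each delta into a running total before dropping it.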