v0.5.5 — concurrency fix + adversarial validation
A second, harder adversarial pass covered DDoS/flood amplification, a cardinality bomb, a failing-job storm, a job-dispatching endpoint under flood, multi-driver supervision, and the public RUM endpoint. It surfaced one real concurrency bug, which is fixed in this release.
Fixed
- Failure-group occurrence counts undercounted under concurrent failures. The per-group `occurrences` counter was a read-modify-write, so simultaneous workers recording the same failure signature (a failing-job storm) clobbered each other's increments (~10% loss measured across 2000 failures on 3 workers). It now uses a race-safe `createOrFirst()` (on the unique `signature` index, so no duplicate groups are created) plus an atomic SQL increment, so the count is exact under any concurrency.
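
The difference between the old and new behaviour can be illustrated with a small, language-neutral sketch. This is not the package's code: it uses Python's built-in `sqlite3`, a hypothetical `failure_groups` table, and SQLite's `INSERT OR IGNORE` as a stand-in for `createOrFirst()`. The first demo replays the lost update deterministically (two workers read the same value before either writes). The second uses create-if-missing on the unique index followed by an in-database increment.

```python
import sqlite3


def make_db():
    # Hypothetical schema: one row per failure signature, unique on `signature`.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE failure_groups ("
        " id INTEGER PRIMARY KEY,"
        " signature TEXT NOT NULL UNIQUE,"
        " occurrences INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def read_count(conn, signature):
    row = conn.execute(
        "SELECT occurrences FROM failure_groups WHERE signature = ?",
        (signature,),
    ).fetchone()
    return row[0] if row else 0


def record_failure(conn, signature):
    """Create the group if missing, then increment inside the database."""
    with conn:  # commits on success, rolls back on error
        # The unique index makes a second insert a no-op, not a duplicate row.
        conn.execute(
            "INSERT OR IGNORE INTO failure_groups (signature) VALUES (?)",
            (signature,),
        )
        # The database computes occurrences + 1 itself, so no stale value
        # read by the application can overwrite another worker's increment.
        conn.execute(
            "UPDATE failure_groups SET occurrences = occurrences + 1"
            " WHERE signature = ?",
            (signature,),
        )


def lost_update_demo():
    """Replay the buggy interleaving: both workers read before either writes."""
    conn = make_db()
    conn.execute("INSERT INTO failure_groups (signature) VALUES ('sig')")
    seen_by_a = read_count(conn, "sig")  # worker A reads 0
    seen_by_b = read_count(conn, "sig")  # worker B also reads 0
    for seen in (seen_by_a, seen_by_b):
        conn.execute(
            "UPDATE failure_groups SET occurrences = ? WHERE signature = 'sig'",
            (seen + 1,),
        )
    return read_count(conn, "sig")


def atomic_demo():
    conn = make_db()
    record_failure(conn, "sig")
    record_failure(conn, "sig")
    return read_count(conn, "sig")


print(lost_update_demo())  # 1: one of the two increments was lost
print(atomic_demo())       # 2: both increments counted
```

Pushing the arithmetic into the `UPDATE` statement is what removes the race: the application never holds a copy of the counter that could go stale between the read and the write.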
Validated (no code change)
- No DDoS amplification — throughput/error-rate identical with Vigilance on or off under sustained flood; enabling it never introduced an error.
- Bounded cardinality — thousands of distinct URLs collapse to the route pattern in APM (1 key); random 404 floods write nothing (only matched routes are recorded). The aggregate tables can't be exploded.
- Job-dispatch storm — an endpoint enqueuing 10 jobs/request under flood captured every job exactly, no failed requests; `VIGILANCE_SAMPLE_RATE` throttles enqueue-write load.
- Multiple queue drivers at once — one `vigilance:supervise` drained database + Redis + beanstalkd concurrently, correct per-driver attribution.
- Public RUM endpoint — rate-limited (`rum.throttle`, default 120/min) and capped to ≤12 metrics + ≤5 errors per request with length-bounded fields; safe to expose.
- Extreme concurrency — graceful degradation + immediate recovery, no Vigilance-induced errors, ~one storage connection per worker.
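
As a rough illustration of the per-request caps on the public RUM endpoint, the sketch below clamps an incoming payload to at most 12 metrics and 5 errors and truncates string fields. It is a minimal sketch and not the package's implementation. The payload shape (`metrics`, `errors`, `name`, `value`, `message`), the function names, and the 255-character field limit are all assumptions, and this version truncates oversized input where the real endpoint may reject it instead.

```python
MAX_METRICS = 12        # per-request cap stated in the notes above
MAX_ERRORS = 5          # per-request cap stated in the notes above
MAX_FIELD_LENGTH = 255  # assumption: the notes give no concrete length bound


def clamp_text(value, limit=MAX_FIELD_LENGTH):
    """Coerce to string and cut to the length bound."""
    return str(value)[:limit]


def as_list(value):
    """Treat anything that is not a list as an empty list."""
    return value if isinstance(value, list) else []


def is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def sanitize_rum_payload(payload):
    """Return a bounded copy of a (hypothetical) RUM payload."""
    metrics = as_list(payload.get("metrics"))[:MAX_METRICS]
    errors = as_list(payload.get("errors"))[:MAX_ERRORS]
    return {
        "metrics": [
            {"name": clamp_text(m.get("name", "")), "value": m["value"]}
            for m in metrics
            if isinstance(m, dict) and is_number(m.get("value"))
        ],
        "errors": [
            {"message": clamp_text(e.get("message", ""))}
            for e in errors
            if isinstance(e, dict)
        ],
    }


# A deliberately oversized request: 50 metrics, 20 errors, 1000-char fields.
oversized = {
    "metrics": [{"name": "m" * 1000, "value": i} for i in range(50)],
    "errors": [{"message": "e" * 1000} for _ in range(20)],
}
bounded = sanitize_rum_payload(oversized)
print(len(bounded["metrics"]), len(bounded["errors"]))  # 12 5
```

Capping the item counts and field lengths before anything is stored means the worst-case write per request is fixed, which together with rate limiting is what makes an unauthenticated endpoint reasonable to expose.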
Full notes: CHANGELOG.md.