Before Creating the Enhancement Request
Summary
A three-phase initiative to systematically improve the RocketMQ test suite: (1) detect and quarantine flaky tests through large-scale repeated execution; (2) root-cause and fix the quarantined tests; (3) rework tests with performance and design problems such as excessive sleeping, overly long output, and slow execution.
Motivation
Flaky tests erode developer confidence in CI signals. When tests fail non-deterministically, developers start ignoring red builds, which masks real regressions. Beyond flakiness, some tests are poorly designed: hard-coded Thread.sleep() calls make them fragile and slow, excessive log output makes failures hard to diagnose, and unnecessary resource initialization inflates total CI time. All three layers need to be addressed to achieve a fast, reliable, and maintainable test suite.
Describe the Solution You'd Like
Phase 1: Detection & Quarantine
Approach: Run the full RocketMQ test suite 100× on each of 10 ECS nodes, using a three-layer funnel (module → class → method) to progressively narrow the scope and statistically identify non-deterministic failures. Quarantine methods with a failure rate of ≥1% by annotating them with @Ignore.
Reference: This follows the "deflake + quarantine" methodology described in Google's "Flaky Tests at Google and How We Mitigate Them" (2016).
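The quarantine rule above can be sketched as a small triage step over aggregated results. This is only an illustration, not the tooling used in Phase 1: the class name, the input map format, and the test method names are hypothetical, and the 1,000 runs per method are taken from the methodology note under Additional Context.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: select test methods whose observed failure rate is at least 1%. */
final class QuarantineTriage {

    /** True when failures / runs >= 1%. Integer arithmetic avoids floating-point edge cases. */
    static boolean isQuarantineCandidate(int failures, int runs) {
        if (runs <= 0) {
            return false; // a method that never ran gives no evidence either way
        }
        return failures * 100L >= runs;
    }

    /**
     * @param failuresByMethod failure count per test method (hypothetical input format)
     * @param runsPerMethod    total executions of each method across all nodes
     * @return methods to annotate with @Ignore, in input iteration order
     */
    static List<String> selectForQuarantine(Map<String, Integer> failuresByMethod, int runsPerMethod) {
        List<String> quarantined = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : failuresByMethod.entrySet()) {
            if (isQuarantineCandidate(entry.getValue(), runsPerMethod)) {
                quarantined.add(entry.getKey());
            }
        }
        return quarantined;
    }

    public static void main(String[] args) {
        // Hypothetical aggregated results after 1,000 runs per method.
        Map<String, Integer> failures = new LinkedHashMap<>();
        failures.put("FooTest#stable", 0);
        failures.put("FooTest#rare", 3);    // 0.3%, below the threshold
        failures.put("BarTest#flaky", 42);  // 4.2%, quarantined
        System.out.println(selectForQuarantine(failures, 1000));
    }
}
```

With 1,000 runs per method, the 1% threshold corresponds to 10 or more observed failures.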
Phase 2: Root-Cause & Fix
Approach: For each quarantined test, analyze the root cause (race conditions, resource conflicts, time-dependent assertions, shared mutable state, etc.) and apply a targeted fix. Priority: tests with a high failure rate (≥10%) first, then moderate ones (1% to 10%). Exit criteria: remove @Ignore and re-run 100× with zero failures.
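The priority buckets and the exit criterion can be expressed as a short sketch. In practice the re-runs would go through the project's normal test runner; here a plain `Callable<Boolean>` stands in for a test, and the class, enum, and method names are hypothetical.

```java
import java.util.concurrent.Callable;

/** Sketch: Phase 2 prioritization and the "100 runs, zero failures" exit check. */
final class DeflakeCheck {

    enum Priority { HIGH, MODERATE, BELOW_THRESHOLD }

    /** HIGH at a failure rate of 10% or more, MODERATE from 1% up to 10%. */
    static Priority priorityOf(int failures, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        long scaled = failures * 100L;
        if (scaled >= 10L * runs) {
            return Priority.HIGH;
        }
        if (scaled >= runs) {
            return Priority.MODERATE;
        }
        return Priority.BELOW_THRESHOLD;
    }

    /** Runs the check repeatedly; a run fails if it returns false or throws. */
    static int countFailures(Callable<Boolean> check, int iterations) {
        int failures = 0;
        for (int i = 0; i < iterations; i++) {
            try {
                if (!check.call()) {
                    failures++;
                }
            } catch (Exception | AssertionError e) {
                failures++;
            }
        }
        return failures;
    }

    /** Exit criterion from the proposal: 100 consecutive runs with zero failures. */
    static boolean meetsExitCriteria(Callable<Boolean> check) {
        return countFailures(check, 100) == 0;
    }
}
```

A fixed test would be wrapped as `DeflakeCheck.meetsExitCriteria(() -> runFixedTest())`, where `runFixedTest` is a placeholder for however the test is invoked.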
Phase 3: Deep Quality Improvement
Approach: Beyond flakiness, identify and improve tests with structural quality issues: excessive Thread.sleep (replace with event-driven waiting such as Awaitility), overly verbose output (tune test log levels), slow execution caused by unnecessary full-component startup (use targeted mocks), and resource leaks (enforce proper cleanup). The goal is a test suite that runs fast, fails clearly, and does not introduce new flakiness over time.
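To illustrate the Thread.sleep replacement, here is a minimal condition-polling helper built only on the JDK. It is a hand-rolled stand-in for what Awaitility provides, not code from RocketMQ; the class name and the `delivered` flag are hypothetical.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

/** Sketch: wait for a condition instead of sleeping for a fixed, guessed duration. */
final class WaitUntil {

    /**
     * Polls the condition until it holds or the timeout elapses.
     *
     * @return true if the condition became true in time, false on timeout or interruption
     */
    static boolean await(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(Math.max(1L, pollInterval.toMillis()));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }

    public static void main(String[] args) {
        AtomicBoolean delivered = new AtomicBoolean(false);
        // Stand-in for an asynchronous callback, e.g. a message listener firing.
        CompletableFuture.runAsync(() -> delivered.set(true));
        boolean ok = await(delivered::get, Duration.ofSeconds(5), Duration.ofMillis(10));
        System.out.println("delivered = " + ok);
    }
}
```

The test returns as soon as the condition holds, so the timeout is only an upper bound rather than a fixed cost on every run. With Awaitility on the test classpath, the same wait would look roughly like `await().atMost(5, TimeUnit.SECONDS).until(delivered::get)`.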
Additional Context
- Methodology: 100 iterations × 10 nodes ≈ 1,000 effective runs per test method.
- Industry reference: Google (deflake + quarantine, 2016), Meta (aggressive retry, 2018), Spotify (three-stage flaky test governance, 2019).
- Phased delivery: Phase 1 is complete. Phases 2 and 3 are tracked as follow-up work items, with an individual sub-issue per test.
- Success metric: CI green rate on the main branch improves from ~85% to >99%.