Summary
Automated scan of committed-code (Timer and Post Merge Action) gradle-check failures in the past 24 hours (2026-05-18 to 2026-05-19). Each failing test was rerun locally with its original CI seed; none reproduced, which is consistent with timing- or environment-sensitive flakes rather than deterministic failures.
CI runners are currently m7a.8xlarge (since mid-April 2026). Several tests show increased failure rates coinciding with this runner change.
Summary Table (sorted by total builds affected)
| # | Test | Builds Affected | First Failure | Pattern | Build |
|---|------|-----------------|---------------|---------|-------|
| 1 | RecoveryWhileUnderLoadIT.testRecoverWhileUnderLoadAllocateReplicasRelocatePrimariesTest | 282 | 2024-04-03 | Chronic, worsening since Apr 2026 | 77490 |
| 2 | WarmIndexSegmentReplicationIT.testReplicationAfterForceMerge | 235 | 2025-03-11 | Chronic, low-rate steady | 77440 |
| 3 | ShardIndexingPressureSettingsIT.testShardIndexingPressureEnforcedEnabledDisabledSetting | 202 | 2024-03-26 | Chronic, worsening in May 2026 | 77440 |
| 4 | RemoteStoreReplicationSourceTests.testGetMergedSegmentFilesDownloadTimeout | 163 | 2024-04-17 | Was improving, uptick in May 2026 | 77460 |
| 5 | IngestFromKafkaIT.testRawPayloadMapperIngestion | 131 | 2025-01-12 | Chronic (mostly PR builds) | 77460 |
| 6 | ConcurrentSeqNoVersioningIT.testSeqNoCASLinearizability | 118 | 2024-10-03 | Chronic, sharp spike Apr 2026 (35 builds) | 77456 |
| 7 | SystemIndexRestIT.classMethod | 94 | 2024-07-03 | Spike Nov 2025, worsening in 2026 | 77477 |
| 8 | DeleteSnapshotIT.testDeleteShallowCopySnapshot | 31 | 2024-04-06 | Low-rate chronic, stable | 77472 |
| 9 | IndexFieldDataServiceTests.testClearField | 12 | 2024-07-30 | Low-rate, slight uptick May 2026 | 77413 |
| 10 | SegmentReplicationUsingRemoteStoreIT.testPrimaryStopped_ReplicaPromoted | 9 | 2024-07-15 | New pattern since Apr 2026 | 77420 |
Detailed Findings
1. RecoveryWhileUnderLoadIT.testRecoverWhileUnderLoadAllocateReplicasRelocatePrimariesTest
- Build: 77490 (Post Merge Action)
- Seed: EB7480DD2E1D21AC
- Error: replica shards haven't caught up with primary expected:<20> but was:<15>
- Reproduced locally: No
- First failure: 2024-04-03
- Total builds affected: 282
- Pattern: Chronic flake. Major spike in Apr-Jul 2025 (21→38→21 builds/month), subsided, then resumed Feb 2026 onward (8→11→13→12). The recent rise overlaps with the m7a.8xlarge runner change but began before it (Feb 2026 versus mid-April 2026).
2. WarmIndexSegmentReplicationIT.testReplicationAfterForceMerge
- Build: 77440 (Timer)
- Seed: 26407BF518CD68CE
- Error: Expected: a value equal to or greater than <3L> but: <0L> was less than <3L>
- Reproduced locally: No
- First failure: 2025-03-11
- Total builds affected: 235 (note: class-level count includes other test methods)
- Pattern: Steady low-rate flake across many months (1-4 builds/month).
3. ShardIndexingPressureSettingsIT.testShardIndexingPressureEnforcedEnabledDisabledSetting
- Build: 77440 (Timer)
- Seed: 26407BF518CD68CE
- Error: Suite timeout exceeded (>= 1200000 msec); test timed out
- Reproduced locally: No
- First failure: 2024-03-26
- Total builds affected: 202 (class-level)
- Pattern: Chronic timeout flake. Steady 1-9 builds/month since Jan 2025. Sharp spike to 14 in May 2026 — likely environmental sensitivity to runner change.
4. RemoteStoreReplicationSourceTests.testGetMergedSegmentFilesDownloadTimeout
- Build: 77460 (Timer)
- Seed: 48E5F968DFEEDAF0
- Error: Suite timeout exceeded (>= 1200000 msec)
- Reproduced locally: No
- First failure: 2024-04-17
- Total builds affected: 163 (class-level)
- Pattern: Major spike Aug-Oct 2025 (37→32→18), then largely resolved. Uptick to 6 builds in May 2026.
5. IngestFromKafkaIT.testRawPayloadMapperIngestion
- Build: 77460 (Timer)
- Seed: 48E5F968DFEEDAF0
- Error: ConditionTimeoutException: Condition was not fulfilled within 1 [second]
- Reproduced locally: No
- First failure: 2025-01-12
- Total builds affected: 131 (class-level, includes other Kafka IT tests)
- Pattern: Kafka integration tests require embedded Kafka; timeout suggests environmental sensitivity.
6. ConcurrentSeqNoVersioningIT.testSeqNoCASLinearizability
- Build: 77456 (Post Merge Action)
- Seed: 893C85F2FC3AC7D5
- Error: ClusterHealthResponse has timed out
- Reproduced locally: No
- First failure: 2024-10-03
- Total builds affected: 118
- Pattern: Low-rate chronic (1-6 builds/month) until Apr 2026 when it spiked to 35 builds. Strong correlation with m7a.8xlarge runner change — faster CPUs likely amplify race conditions in this linearizability test.
7. SystemIndexRestIT.classMethod
- Build: 77477 (Post Merge Action)
- Seed: 3AB0B4BC0125C3CF
- Error: 1 channels still being tracked in RestCancellableNodeClient while there should be none expected:<0> but was:<1>
- Reproduced locally: No
- First failure: 2024-07-03
- Total builds affected: 94
- Pattern: Spike of 41 builds in Nov 2025, then subsided. Worsening again in 2026 (13 in Apr, 19 in May so far).
8. DeleteSnapshotIT.testDeleteShallowCopySnapshot
- Build: 77472 (Post Merge Action)
- Seed: 3D45AF61B849E410
- Error: Expected: is <4> but: was <3>
- Reproduced locally: No
- First failure: 2024-04-06
- Total builds affected: 31
- Pattern: Low-rate stable flake (0-3 builds/month). No significant trend change.
9. IndexFieldDataServiceTests.testClearField
- Build: 77413 (Post Merge Action)
- Seed: 25319181894D5792
- Error: expected:<0> but was:<1>
- Reproduced locally: No
- First failure: 2024-07-30
- Total builds affected: 12
- Pattern: Very low-rate (0-2 builds/month). Slight uptick to 3 in May 2026.
10. SegmentReplicationUsingRemoteStoreIT.testPrimaryStopped_ReplicaPromoted
- Build: 77420 (Post Merge Action)
- Seed: 4B76556EC34B0E76
- Error: Count is 2 hits but 3 was expected. Total shards: 1 Successful shards: 1
- Reproduced locally: No
- First failure: 2024-07-15
- Total builds affected: 9 (for this specific test method)
- Pattern: Newly active. 3 builds in Apr 2026 and 2 in May 2026 account for 5 of the 9 affected builds since the first failure on 2024-07-15. Possibly triggered by the runner change.
Reproduction Notes
All tests were run locally with their original CI seeds on the current main branch. None failed. This is expected for timing-sensitive flakes — the seed controls randomized parameters but not thread scheduling, GC pauses, or network timing.
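The distinction between what a seed fixes and what it leaves open can be illustrated with a small Python analogy. This is not the OpenSearch test framework's seeding mechanism; the function names and the way the seed is turned into parameters are hypothetical, chosen only to show the idea.

```python
import random
import threading
from typing import List


def seeded_parameters(seed_hex: str, count: int = 5) -> List[int]:
    """Derive stand-in 'randomized test parameters' from a hex seed.

    The same seed always yields the same values.
    """
    rng = random.Random(int(seed_hex, 16))
    return [rng.randint(1, 100) for _ in range(count)]


def completion_order(num_workers: int = 8) -> List[int]:
    """Record the order in which worker threads finish.

    The order is decided by the scheduler and machine load, not by any seed.
    """
    order: List[int] = []
    lock = threading.Lock()

    def worker(worker_id: int) -> None:
        sum(range(10_000))  # a little work so the threads can overlap
        with lock:
            order.append(worker_id)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order


if __name__ == "__main__":
    seed = "EB7480DD2E1D21AC"  # seed of finding 1, used here only as sample input
    # Same seed, same parameters on every run and every machine.
    print(seeded_parameters(seed) == seeded_parameters(seed))
    # Every worker finishes, but the order may differ between runs and hosts.
    print(completion_order())
```

Rerunning with the CI seed reproduces the first kind of input but not the second, which is why a clean local run does not rule out a real race.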
Notable Trends
- ConcurrentSeqNoVersioningIT had a dramatic spike from 1-6 builds/month to 35 in April 2026, strongly suggesting the m7a.8xlarge runner change amplified an existing race condition.
- ShardIndexingPressureSettingsIT and RemoteStoreReplicationSourceTests both fail with suite timeouts, suggesting they are sensitive to overall system load or scheduling.
- SystemIndexRestIT has a channel-leak assertion that is worsening month over month in 2026.
- SegmentReplicationUsingRemoteStoreIT.testPrimaryStopped_ReplicaPromoted is a re-emerging flake: its first failure was on 2024-07-15, but 5 of its 9 affected builds fall in Apr-May 2026.
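As a rough way to compare the spikes noted in this report, the sketch below divides each test's spike-month count by its earlier monthly maximum, using only figures quoted in the Detailed Findings. The function names and the choice of a simple ratio as the metric are illustrative assumptions, not part of the scan tooling.

```python
from typing import Dict, List, Tuple

# (earlier monthly maximum, spike-month count), as quoted in the Detailed Findings.
MONTHLY_COUNTS: Dict[str, Tuple[int, int]] = {
    "ConcurrentSeqNoVersioningIT": (6, 35),      # 1-6 builds/month, then 35 in Apr 2026
    "ShardIndexingPressureSettingsIT": (9, 14),  # 1-9 builds/month, then 14 in May 2026
    "IndexFieldDataServiceTests": (2, 3),        # 0-2 builds/month, then 3 in May 2026
}


def spike_ratio(baseline_max: int, spike_count: int) -> float:
    """Return how many times larger the spike month is than the earlier maximum."""
    if baseline_max <= 0:
        raise ValueError("baseline_max must be positive")
    return spike_count / baseline_max


def rank_by_spike(counts: Dict[str, Tuple[int, int]]) -> List[Tuple[str, float]]:
    """Sort tests by spike ratio, largest first."""
    ratios = [(name, spike_ratio(base, spike)) for name, (base, spike) in counts.items()]
    return sorted(ratios, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, ratio in rank_by_spike(MONTHLY_COUNTS):
        print(f"{name}: {ratio:.2f}x")
```

A ratio like this ignores how many builds ran each month, so it is only a first-pass ordering; a failure rate per build would be a sounder basis if build totals are available.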