Summary
Two distinct test failures were observed against committed code (Timer builds on main) in the past 24 hours (2026-05-23 to 2026-05-24). Neither failure reproduced locally with the original seed, indicating timing/scheduling-dependent flakiness.
Failures Summary Table
| Test | Builds Affected (total) | First Seen | Recent Trend | Build Link |
| --- | --- | --- | --- | --- |
| IngestFromKafkaIT.testAllActiveOffsetBasedLag | 40 | 2025-10-15 | Worsening (2→8→13→17/mo) | #78082 |
| RemoteShrinkIndexIT.testShrinkIndexPrimaryTerm | 12 | 2024-04-11 | Stable/low-rate | #78086 |
Detailed Findings
1. IngestFromKafkaIT.testAllActiveOffsetBasedLag
- Build: 78082 (Timer, main)
- Seed: 464A08CB052C469
- Error: java.lang.AssertionError at Assert.assertTrue (a polling/timing assertion failed)
- Reproduced locally: No. The failure is not deterministic for this seed.
- First failure: 2025-10-15
- Total unique builds affected: 40
- Pattern: Clearly worsening. Unique builds failing per month: Oct 2025 (2), Mar 2026 (8), Apr 2026 (13), May 2026 (17, month in progress). At the class level (all IngestFromKafkaIT methods), 138 builds are affected in total, with a similar acceleration pattern.
- Assessment: This test has a timing-dependent assertion that appears increasingly likely to fail as CI runners get faster (the rise correlates with the April 2026 m7a.8xlarge migration). The monthly failure count has grown by roughly 1.3x to 1.6x month over month since March 2026 (8 → 13 → 17, with May still in progress).
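The monthly counts above can be turned into growth ratios with a few lines of arithmetic. The following is a minimal sketch, not part of the CI tooling: the class and method names are hypothetical, and the full-month projection for May assumes that 24 of 31 days have elapsed and that failures arrive at a constant daily rate, which may not hold.

```java
final class FlakeTrend {
    // Growth factor from one month's unique failing builds to the next.
    static double ratio(int previous, int current) {
        return (double) current / previous;
    }

    // Naive linear projection of a partial month to a full month (an assumption).
    static double projectFullMonth(int countSoFar, int daysElapsed, int daysInMonth) {
        return countSoFar * (double) daysInMonth / daysElapsed;
    }

    public static void main(String[] args) {
        // Monthly unique failing builds taken from the report.
        int mar = 8, apr = 13, maySoFar = 17;

        System.out.printf("Mar -> Apr: %.2fx%n", ratio(mar, apr));
        System.out.printf("Apr -> May (so far): %.2fx%n", ratio(apr, maySoFar));

        // Assumes the report date (2026-05-24) means 24 of May's 31 days are counted.
        double mayProjected = projectFullMonth(maySoFar, 24, 31);
        System.out.printf("May projected: %.1f builds%n", mayProjected);
    }
}
```

Because May is incomplete, the Apr-to-May ratio computed from the raw count understates the eventual figure; the projection is only a rough guide.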
2. RemoteShrinkIndexIT.testShrinkIndexPrimaryTerm
- Build: 78086 (Timer, main)
- Seed: 6FF42AC7AA85FF11
- Error: java.lang.AssertionError at ReplicationTracker.getReplicationGroup(ReplicationTracker.java:1096), an assertion triggered inside SegmentReplicationSourceService.clusterChanged
- Reproduced locally: No. The failure is not deterministic for this seed.
- First failure: 2024-04-11
- Total unique builds affected: 12 (method-specific), 70 (class-level)
- Pattern: Stable, low-rate chronic flake. Sporadic failures across 2+ years with no clear trend. The class-level count spiked in Nov 2025 (41 builds), but the specific method has stayed at ~1-3 failures per active month.
- Assessment: Long-lived race condition in replication tracking during shard shrink operations. The assertion fires when clusterChanged is processed concurrently with shard state transitions. Low priority given the stable, low failure rate.
Reproduction Commands
# Test 1 (did not reproduce)
./gradlew ':plugins:ingestion-kafka:internalClusterTest' --tests 'org.opensearch.plugin.kafka.IngestFromKafkaIT.testAllActiveOffsetBasedLag' -Dtests.seed=464A08CB052C469
# Test 2 (did not reproduce)
./gradlew ':server:internalClusterTest' --tests 'org.opensearch.action.admin.indices.create.RemoteShrinkIndexIT.testShrinkIndexPrimaryTerm' -Dtests.seed=6FF42AC7AA85FF11
Notes
- Both failures are non-deterministic with respect to the seed, meaning the root cause involves thread scheduling, network timing, or other factors not controlled by RandomizedRunner.
- The IngestFromKafkaIT failure pattern strongly correlates with the CI runner migration to faster hardware (m7a.8xlarge, ~April 2026), suggesting a timing-sensitive polling assertion.
- Neither test was modified recently; these are latent flakes, not code regressions.
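The source of the failing assertion is not shown in this report, so the following is only a generic illustration of what a deadline-based polling check looks like, as opposed to a single check after a fixed wait. All names here (PollingSketch, waitUntil, lagCaughtUp) are hypothetical and do not come from the OpenSearch code base; the OpenSearch test framework provides its own assertBusy helper for this purpose, which this sketch deliberately avoids so that it stays self-contained.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

final class PollingSketch {
    // Re-evaluates the condition until it holds or the timeout elapses.
    // Returns true if the condition held, false on timeout or interruption.
    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false;
            }
            // Sleep for the poll interval, but never past the deadline.
            long sleepMillis = Math.min(interval.toMillis(), Math.max(1, remainingNanos / 1_000_000));
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        // Hypothetical condition: becomes true roughly 200 ms after start.
        BooleanSupplier lagCaughtUp =
            () -> System.nanoTime() - start > Duration.ofMillis(200).toNanos();

        boolean ok = waitUntil(lagCaughtUp, Duration.ofSeconds(5), Duration.ofMillis(50));
        if (!ok) {
            throw new AssertionError("condition not met within timeout");
        }
        System.out.println("condition met");
    }
}
```

A check of this shape depends on the timeout being generous enough rather than on the exact moment it runs, which is the property a timing-sensitive assertion lacks.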