Flaky test report: committed-code failures on 2026-05-10 #261

@andrross

Summary

Flaky test failures were observed in committed-code CI builds (Timer and Post Merge Action) during the 24-hour window ending 2026-05-10T10:00Z. None of the seven failures that were attempted locally reproduced with the original seed (the BWC test was skipped), which is consistent with timing- or environment-dependent flakes rather than deterministic failures.

Failing Tests

| # | Test | Build | Seed | Reproduced Locally | First Seen | Total Builds Affected | Trend |
|---|---|---|---|---|---|---|---|
| 1 | MixedClusterClientYamlTestSuiteIT (310_match_bool_prefix) | 76388 | 50E3E31F4E9E08C2 | Skipped (BWC test) | 2024-03-25 | 463 | Stable/chronic (~5-18 builds/month) |
| 2 | RemoteRestoreSnapshotIT.testClusterManagerFailoverDuringSnapshotCreation | 76405 | 51E20D91F32548B6 | No | 2024-09-02 | 215 | Stable (~6-17 builds/month) |
| 3 | RemoteRestoreSnapshotIT.classMethod (suite timeout) | 76405 | 51E20D91F32548B6 | No | 2024-08-30 | 135 | Stable (~5-16 builds/month) |
| 4 | RecoveryWhileUnderLoadIT.testRecoverWhileUnderLoadWithReducedAllowedNodes | 76419 | 3B49CE0DA9CDBF7C | No | 2024-04-03 | 108 | Worsening (0/month → 10-17/month since Feb 2026) |
| 5 | LangPainlessClientYamlTestSuiteIT (derived_fields search definition) | 76396 | 320563A8B9CEAE61 | No | 2024-05-15 | 63 | Stable/intermittent (~1-6 builds/month) |
| 6 | ReindexIT.testReindexTask | 76394, 76398 | CC8FB0BBA1B722D9, 703D20324B41038C | No | 2024-06-18 | 25 | Stable/low (~1-3 builds/month) |
| 7 | WarmIndexSegmentReplicationIT.testPrimaryReceivesDocsDuringReplicaRecovery | 76394 | CC8FB0BBA1B722D9 | No | 2025-03-17 | 11 | Slightly worsening (3 in Apr 2026) |
| 8 | IndicesRequestCacheCleanupIT.testDynamicStalenessThresholdUpdate | 76398 | 703D20324B41038C | No | 2024-08-08 | 10 | Stable/low (~1 build/month) |
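
The seeds listed in the table can be replayed locally using the command form given under Methodology. The following is a minimal sketch of assembling that command. The helper name `build_repro_command` is hypothetical, the hex-only seed check is an assumption based on the seeds shown in this report, and the module and task values in the usage example are placeholders for illustration rather than values taken from the report.

```python
import re
import shlex


def build_repro_command(module, test_task, test_class, method, seed):
    """Assemble the gradlew invocation as an argument list.

    Mirrors the form:
      ./gradlew :<module>:<testTask> --tests "<class>.<method>" -Dtests.seed=<SEED>
    """
    # Assumption: seeds are plain hexadecimal strings, as in the table.
    if not re.fullmatch(r"[0-9A-Fa-f]+", seed):
        raise ValueError(f"unexpected seed format: {seed!r}")
    return [
        "./gradlew",
        f":{module}:{test_task}",
        "--tests",
        f"{test_class}.{method}",
        f"-Dtests.seed={seed}",
    ]


if __name__ == "__main__":
    cmd = build_repro_command(
        module="server",                  # placeholder module (assumption)
        test_task="internalClusterTest",  # placeholder task (assumption)
        test_class="RecoveryWhileUnderLoadIT",
        method="testRecoverWhileUnderLoadWithReducedAllowedNodes",
        seed="3B49CE0DA9CDBF7C",
    )
    # shlex.join quotes arguments where needed for pasting into a shell.
    print(shlex.join(cmd))
```

For a row with more than one seed (such as ReindexIT.testReindexTask), the helper would be called once per seed. As the report notes, the seed fixes test randomization only, so a clean local run does not rule out a timing-dependent failure.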

Failure Details

MixedClusterClientYamlTestSuiteIT (310_match_bool_prefix)

  • Error: hits.hits.0._id: expected String [4] but was String [1] — document ordering mismatch in multi_match bool prefix queries
  • Pattern: Chronic flake since March 2024 with 463 affected builds. It peaked at 185 builds in Sep 2024 and has otherwise held steady at 5-18 builds/month. Because this is a backward-compatibility (BWC) test that requires a mixed-version cluster, it cannot be reproduced with a simple local Gradle command.

RemoteRestoreSnapshotIT.testClusterManagerFailoverDuringSnapshotCreation

  • Error: Test abandoned because suite timeout was reached (>= 1200000 msec)
  • Pattern: Consistent flake since Sep 2024. The test exercises cluster manager failover during snapshot creation, which is inherently timing-sensitive. The RemoteRestoreSnapshotIT.classMethod failure occurred in the same build (76405) and is a consequence of the same suite timeout (1200000 msec, i.e. 20 minutes).

RecoveryWhileUnderLoadIT.testRecoverWhileUnderLoadWithReducedAllowedNodes

  • Error: replica shards haven't caught up with primary expected:<27> but was:<24>
  • Pattern: The test was dormant from Jun 2024 through Jan 2025, then resurfaced in Feb 2026 and has failed far more often since (4→17→13→10 builds/month in Feb-May 2026, with May being a partial month at the time of this report). The elevated rate overlaps with the mid-April 2026 CI runner migration to m7a.8xlarge, and faster CPUs may be amplifying a latent race in segment replication recovery, although the increase began before the migration.
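
The monthly counts in the Pattern line are not directly comparable if the May figure covers only part of the month. The sketch below normalizes each count to builds per day. It assumes the May 2026 count includes data only up to the end of the reporting window (2026-05-10); the function name `builds_per_day` is hypothetical.

```python
import calendar
from datetime import date


def builds_per_day(monthly_counts, window_end):
    """Convert per-month affected-build counts into builds per day.

    monthly_counts maps (year, month) -> affected builds.
    The month containing window_end is treated as partial and divided
    by the days elapsed so far instead of the full month length.
    """
    rates = {}
    for (year, month), builds in monthly_counts.items():
        days = calendar.monthrange(year, month)[1]  # full length of the month
        if (year, month) == (window_end.year, window_end.month):
            days = window_end.day  # partial month: only days up to the window end
        rates[(year, month)] = builds / days
    return rates


# Counts quoted in the Pattern line (Feb-May 2026).
counts = {(2026, 2): 4, (2026, 3): 17, (2026, 4): 13, (2026, 5): 10}
rates = builds_per_day(counts, window_end=date(2026, 5, 10))
for (year, month), rate in sorted(rates.items()):
    print(f"{year}-{month:02d}: {rate:.2f} builds/day")
```

Under that assumption, May has the highest daily rate of the four months, which would make the trend look steeper than the raw monthly counts suggest.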

LangPainlessClientYamlTestSuiteIT (derived_fields)

  • Error: hits.total: expected Integer [4] but was Integer [3] — missing document in derived field search results
  • Pattern: Intermittent since May 2024 with gaps of several months between occurrences. Low overall impact.

ReindexIT.testReindexTask

  • Error: java.lang.AssertionError (assertTrue failure in task completion check)
  • Pattern: Low-frequency chronic flake since Jun 2024. It appeared in two separate builds (76394 and 76398) within the same 24-hour window, suggesting a possible environmental trigger.

WarmIndexSegmentReplicationIT.testPrimaryReceivesDocsDuringReplicaRecovery

  • Error: CorruptIndexException during refresh on replica — segment replication race during recovery
  • Pattern: Relatively new (first seen Mar 2025) and low frequency (11 builds total), but slightly increasing in recent months, with 3 affected builds in Apr 2026.

IndicesRequestCacheCleanupIT.testDynamicStalenessThresholdUpdate

  • Error: expected:<1> but was:<2> — cache entry count mismatch after staleness threshold update
  • Pattern: Very low frequency (10 builds total over 22 months). Rare but persistent.

Methodology

  1. Queried gradle-check-* index in the OpenSearch metrics cluster for failures in Timer/main and Post Merge Action builds within the past 24 hours
  2. Aggregated historical failures by month across all build types (including PR builds) using cardinality aggregation on build_number
  3. Extracted seeds from Jenkins test report stack traces
  4. Attempted local reproduction using ./gradlew :<module>:<testTask> --tests "<class>.<method>" -Dtests.seed=<SEED>
  5. None reproduced — consistent with timing-dependent flakes where the seed controls randomization but not thread scheduling or I/O timing
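
Step 2 can be sketched as an OpenSearch query body that buckets failures by calendar month and counts distinct builds in each bucket. Only `build_number` and the `gradle-check-*` index pattern come from this report. The other field names (`test_status`, `test_class`, `test_name`, `build_start_time`), the `FAILED` value, and the function name are assumptions about the index mapping and would need to be checked against the actual index.

```python
import json


def monthly_affected_builds_query(test_class, test_name=None):
    """Build a search body counting distinct failing builds per month."""
    # Assumed field names; adjust to the real gradle-check-* mapping.
    filters = [
        {"term": {"test_status": "FAILED"}},
        {"term": {"test_class": test_class}},
    ]
    if test_name is not None:
        filters.append({"term": {"test_name": test_name}})
    return {
        "size": 0,  # only aggregation results are needed, not individual hits
        "query": {"bool": {"filter": filters}},
        "aggs": {
            "per_month": {
                "date_histogram": {
                    "field": "build_start_time",
                    "calendar_interval": "month",
                },
                "aggs": {
                    # A build with several failing tests is counted once.
                    "affected_builds": {"cardinality": {"field": "build_number"}}
                },
            }
        },
    }


if __name__ == "__main__":
    body = monthly_affected_builds_query(
        "RecoveryWhileUnderLoadIT",
        "testRecoverWhileUnderLoadWithReducedAllowedNodes",
    )
    print(json.dumps(body, indent=2))
```

The cardinality aggregation is approximate, which is generally acceptable for monthly trend counts of this size. The printed body could be sent to the `_search` endpoint of the `gradle-check-*` index pattern.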

Recommendations

  • RecoveryWhileUnderLoadIT deserves priority attention given its worsening trend correlating with the CI runner migration
  • RemoteRestoreSnapshotIT and MixedClusterClientYamlTestSuiteIT are high-volume chronic flakes that contribute significant CI noise
  • WarmIndexSegmentReplicationIT is worth watching as a potentially worsening trend
