Flaky test report: committed-code failures on 2026-04-09 #235

@andrross

Summary

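13 test failures were detected across committed-code builds (Timer/main and Post Merge Action) in the past 24 hours, representing 10 distinct tests across 9 unique builds. None of the failures that were retried locally reproduced with the original seed, which is consistent with non-deterministic (flaky) behavior. Six of the top 10 tests were not retried because they need infrastructure that is unavailable locally or failed on a suite timeout.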

Methodology

  • Queried the OpenSearch metrics cluster (gradle-check-* indices) for FAILED tests in Timer/main and Post Merge Action builds from the past 24 hours
  • Extracted reproduction seeds from Jenkins console logs
  • Attempted local reproduction with the original seed for each test (where feasible)
  • Queried historical failure data across all build types (including PR builds) using monthly aggregations with unique build counts
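The two queries described above could be sketched as the following search bodies. This is a minimal sketch, not the queries actually used for this report: the field names (`test_status`, `invoke_type`, `build_start_time`, `test_name`, `build_number`) and the build-type values are assumptions about the gradle-check-* mapping and should be checked against the real index before use.

```python
from typing import Any, Dict, Sequence


def failed_tests_query(
    hours: int = 24,
    build_types: Sequence[str] = ("Timer", "Post Merge Action"),  # assumed values
) -> Dict[str, Any]:
    """Search body: FAILED tests in committed-code builds over the last N hours."""
    return {
        "size": 0,  # only the aggregation is needed, not the raw hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"test_status": "FAILED"}},           # assumed field
                    {"terms": {"invoke_type": list(build_types)}},  # assumed field
                    {"range": {"build_start_time": {"gte": f"now-{hours}h"}}},
                ]
            }
        },
        "aggs": {
            "by_test": {
                "terms": {"field": "test_name", "size": 100},
                "aggs": {
                    # count each build once, however many retries failed in it
                    "unique_builds": {"cardinality": {"field": "build_number"}}
                },
            }
        },
    }


def monthly_history_query(test_name: str) -> Dict[str, Any]:
    """Search body: unique failing builds per calendar month, all build types."""
    return {
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"test_status": "FAILED"}},
                    {"term": {"test_name": test_name}},
                ]
            }
        },
        "aggs": {
            "per_month": {
                "date_histogram": {
                    "field": "build_start_time",
                    "calendar_interval": "month",
                },
                "aggs": {
                    "unique_builds": {"cardinality": {"field": "build_number"}}
                },
            }
        },
    }
```

Either body can be passed to a search request against the gradle-check-* indices. Note that the `cardinality` aggregation is approximate, so very large build counts may be off by a small margin.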

Summary Table (sorted by total builds affected)

| # | Test | Builds affected | First seen | Recent build | Reproduced? | Pattern |
|---|------|-----------------|------------|--------------|-------------|---------|
| 1 | ClientYamlTestSuiteIT (string profiler via global ordinals) | 355 | 2024-03 | 74188 | No | Stable flake |
| 2 | IndexingIT.testIndexingWithSegRep | 244 | 2024-03 | 74175 | Skipped (rolling-upgrade) | Stable flake |
| 3 | FullRollingRestartIT.testFullRollingRestart (SEGMENT) | 116 | 2025-07 | 74235 | No | Worsening |
| 4 | AzureBlobStoreRepositoryTests.testContainerCreationAndDeletion | 114 | 2024-04 | 74233 | Skipped (Azure mock) | Stable flake |
| 5 | FullRollingRestartIT.testFullRollingRestart_withNoRecoveryPayloadAndSource (SEGMENT) | 107 | 2024-10 | 74203 | No | Worsening |
| 6 | SharedClusterSnapshotRestoreIT.testSnapshotFileFailureDuringSnapshot | 88 | 2024-08 | 74228 | No | Stable flake |
| 7 | AzureBlobStoreRepositoryTests.testWriteRead | 83 | 2024-04 | 74194 | Skipped (Azure mock) | Stable flake |
| 8 | RestoreShallowSnapshotV2IT.classMethod | 79 | 2025-05 | 74209 | Skipped (suite timeout) | Stable flake |
| 9 | RestoreShallowSnapshotV2IT.testContinuousIndexing | 75 | 2025-02 | 74209 | Skipped (suite timeout) | Stable flake |
| 10 | IngestFromKinesisIT.testPluginsAreInstalled | 62 | 2025-05 | 74201 | Skipped (Kinesis infra) | Worsening (spike in Mar 2026) |

Detailed Findings

1. ClientYamlTestSuiteIT — string profiler via global ordinals

  • Build: 74188 (Post Merge Action)
  • Seed: 82DE96594B7080D2
  • Error: field [profile.shards.0.aggregations.0.debug.segments_with_single_valued_ords] is not greater than [0]
  • Reproduced locally: No — passed with same seed
  • First seen: March 2024
  • Total unique builds: 355
  • Monthly trend (last 6 months): 10, 4, 24, 4, 12, 5
  • Pattern: Long-standing stable flake. Intermittent across all months with no clear trend. The January 2026 spike (24 builds) is notable but didn't persist.
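For reference, a seed like the one above can be pulled from a console log and replayed with the `-Dtests.seed` system property. The helper below is a rough sketch: the Gradle project path and task name are placeholders that differ per test, and the sample log line in the usage note is illustrative rather than copied from build 74188.

```python
import re
import shlex
from typing import List, Optional

# Matches the seed in a "-Dtests.seed=<HEX>" argument; also tolerates the
# "MASTER:METHOD" two-part form that randomized testing can print.
SEED_RE = re.compile(r"-Dtests\.seed=([0-9A-F]{1,16}(?::[0-9A-F]{1,16})*)")


def extract_seed(log_text: str) -> Optional[str]:
    """Return the first test seed found in a console log, or None."""
    match = SEED_RE.search(log_text)
    return match.group(1) if match else None


def repro_command(
    project: str,          # hypothetical Gradle project path, e.g. ":server"
    task: str,             # hypothetical task name, e.g. "internalClusterTest"
    test_filter: str,      # class or class.method passed to --tests
    seed: str,
) -> List[str]:
    """Build the argv for rerunning one test with a fixed seed."""
    return [
        "./gradlew",
        f"{project}:{task}",
        "--tests",
        test_filter,
        f"-Dtests.seed={seed}",
    ]


def as_shell(argv: List[str]) -> str:
    """Render the argv as a copy-pasteable shell line."""
    return shlex.join(argv)
```

Usage, with an illustrative log line: `extract_seed('... --tests "SomeIT.testFoo" -Dtests.seed=82DE96594B7080D2 -Dtests.locale=en-US')` returns `82DE96594B7080D2`, which can then be fed to `repro_command`. A pass under the same seed suggests the failure depends on timing or environment rather than on the randomized inputs alone.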

2. IndexingIT.testIndexingWithSegRep

  • Build: 74175 (Timer/main)
  • Seed: A2B0C92946BE16EC
  • Error: expected:<0> but was:<1>
  • Reproduced locally: Skipped — requires rolling-upgrade multi-version cluster infrastructure
  • First seen: March 2024
  • Total unique builds: 244
  • Monthly trend (last 6 months): 16, 4, 11, 16, 18, 2
  • Pattern: Persistent long-standing flake. Failed in 4-18 builds per month over the last five full months; the final value (2) covers only the first days of April 2026.

3. FullRollingRestartIT.testFullRollingRestart (SEGMENT)

  • Build: 74235 (Timer/main)
  • Seed: 1F5B6F8715288B33
  • Error: replica shards haven't caught up with primary expected:<22> but was:<17>
  • Reproduced locally: No — passed with same seed
  • First seen: July 2025
  • Total unique builds: 116
  • Monthly trend (last 6 months): 0, 0, 0, 20, 12, 5
  • Pattern: Worsening. No failures were recorded from November 2025 through January 2026, then failures resumed heavily in February 2026 (20 builds). Likely a regression introduced around that time.

4. AzureBlobStoreRepositoryTests.testContainerCreationAndDeletion

  • Build: 74233 (Post Merge Action)
  • Seed: 6DD75D74EB2363B1
  • Error: RepositoryVerificationException: path is not accessible on cluster-manager node
  • Reproduced locally: Skipped — requires Azure blob store mock infrastructure
  • First seen: April 2024
  • Total unique builds: 114
  • Monthly trend (last 6 months): 5, 12, 7, 11, 8, 10
  • Pattern: Stable flake. Consistently fails 5-12 times per month.

5. FullRollingRestartIT.testFullRollingRestart_withNoRecoveryPayloadAndSource (SEGMENT)

  • Build: 74203 (Timer/main)
  • Seed: 500B52348ACEC49B
  • Error: replica shards haven't caught up with primary expected:<18> but was:<15>
  • Reproduced locally: No — passed with same seed
  • First seen: October 2024
  • Total unique builds: 107
  • Monthly trend (last 6 months): 0, 0, 0, 17, 13, 3
  • Pattern: Worsening. Same pattern as test #3: no failures from November 2025 through January 2026, then failures resumed in February 2026. Both FullRollingRestartIT SEGMENT tests likely share the same root cause.

6. SharedClusterSnapshotRestoreIT.testSnapshotFileFailureDuringSnapshot

  • Build: 74228 (Timer/main)
  • Seed: BA8E73CB0D6E8F2B
  • Error: Expected: <0L> but: was <1L>
  • Reproduced locally: No — passed with same seed
  • First seen: August 2024
  • Total unique builds: 88
  • Monthly trend (last 6 months): 3, 4, 3, 0, 6, 4
  • Pattern: Stable low-frequency flake. Typically fails in 3-6 builds per month, with one failure-free month (February 2026).

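7. AzureBlobStoreRepositoryTests.testWriteRead

  • Build: 74194
  • Reproduced locally: Skipped (Azure mock)
  • First seen: April 2024
  • Total unique builds: 83
  • Pattern: Stable flake.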

8. RestoreShallowSnapshotV2IT.classMethod

  • Build: 74209 (Post Merge Action)
  • Seed: F2EFD250CE679D6E
  • Error: Suite timeout exceeded (>= 1200000 msec)
  • Reproduced locally: Skipped — suite timeout, not a deterministic test failure
  • First seen: May 2025
  • Total unique builds: 79
  • Monthly trend (last 6 months): 8, 10, 9, 7, 6, 4
  • Pattern: Stable flake. This is a suite-level timeout, not a specific test assertion failure.

9. RestoreShallowSnapshotV2IT.testContinuousIndexing

  • Build: 74209 (Post Merge Action)
  • Seed: F2EFD250CE679D6E
  • Error: Test abandoned because suite timeout was reached
  • Reproduced locally: Skipped — suite timeout
  • First seen: February 2025
  • Total unique builds: 75
  • Monthly trend (last 6 months): 6, 5, 8, 7, 8, 2
  • Pattern: Stable flake. Same suite timeout as test #8.

10. IngestFromKinesisIT.testPluginsAreInstalled

  • Build: 74201 (Post Merge Action)
  • Seed: A74676B701B5AD6E
  • Error: ResourceInUseException: Stream test already exists
  • Reproduced locally: Skipped — requires Kinesis infrastructure
  • First seen: May 2025
  • Total unique builds: 62
  • Monthly trend (last 6 months): 1, 1, 0, 0, 51, 4
  • Pattern: Worsening. Massive spike in March 2026 (51 builds) suggests an infrastructure or test isolation issue was introduced.
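The "Stable flake" versus "Worsening" labels in this report can be approximated from the six-month trend alone. The heuristic below is a hypothetical one, not necessarily how the labels were assigned: it flags a test as worsening when the later half of the window has more than twice the failing builds of the earlier half. It agrees with the Pattern label for the nine tests whose monthly trend is listed.

```python
from typing import Sequence


def classify_trend(monthly: Sequence[int], factor: float = 2.0) -> str:
    """Label a monthly failing-build series (oldest month first).

    Hypothetical rule: "Worsening" if the later half of the window has more
    than `factor` times the failing builds of the earlier half, otherwise
    "Stable flake". The threshold of 2.0 is an arbitrary illustrative choice.
    """
    half = len(monthly) // 2
    earlier = sum(monthly[:half])
    later = sum(monthly[half:])
    return "Worsening" if later > factor * earlier else "Stable flake"


# Trends copied from the detailed findings above.
trends = {
    "FullRollingRestartIT.testFullRollingRestart": [0, 0, 0, 20, 12, 5],
    "IngestFromKinesisIT.testPluginsAreInstalled": [1, 1, 0, 0, 51, 4],
    "ClientYamlTestSuiteIT": [10, 4, 24, 4, 12, 5],
    "IndexingIT.testIndexingWithSegRep": [16, 4, 11, 16, 18, 2],
}

for name, series in trends.items():
    print(f"{name}: {classify_trend(series)}")
```

Because the last month in each series is partial, a rule like this tends to understate recent activity. Comparing full months only would be a reasonable refinement.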

Additional tests not in top 10

  • IndexFieldDataServiceTests.testClearField (build 74250, 9 builds total, did not reproduce locally)
  • IndexFieldDataServiceTests.testExceptionWhileRemovingKey (build 74250, 3 builds total, did not reproduce locally)
  • RemoteSplitIndexIT.testSplitFromOneToN (build 74189, 29 builds total, did not reproduce locally)
