Skip to content

test(SuggestionsTest): wait for non-tag peer metadata in awaitConsistency#2506

Merged
cmgrote merged 2 commits into
mainfrom
fix-suggestions-await-consistency
May 18, 2026
Merged

test(SuggestionsTest): wait for non-tag peer metadata in awaitConsistency#2506
cmgrote merged 2 commits into
mainfrom
fix-suggestions-await-consistency

Conversation

@sriram-atlan
Copy link
Copy Markdown
Contributor

Summary

SuggestionsTest.awaitConsistency previously only waited for tag denormalisation via waitForTagsToSync. The three findSuggestions* test methods that follow aggregate on additional non-tag fields (ownerGroups, description, userDescription, assignedTerms) which weren't being waited on, so under CI matrix load the queries fired before the peer assets' metadata reached the ES search index → empty aggregations → expected [1] but found [0].

This PR extends awaitConsistency to also wait until the four metadata-bearing peers (table1, table3, t1c1, v1c1) show ownerGroups + assignedTerms visible in ES.

Diagnosis path

Step Result
SuggestionsTest run in isolation against leangraph-test 24/24 pass, 1m 39s
Daily Test (leangraph-test) workflow, full matrix (8/3/5) 3 sub-failures: findSuggestionsDefault, findSuggestionsAcrossTypes, findLimitedSuggestions
Same workflow, reduced matrix to 2/2/2 (run 26027064093) Same 3 sub-failures still fail — confirms even modest concurrent write contention triggers the race
Server-side trace ([ms-1268-trace] instrumentation) Confirms peer metadata is correctly persisted in Cassandra and emitted into the ES bulk update body — the race is purely about ES refresh visibility, not data loss

The 2-way reproduction was the deciding evidence: high parallelism isn't required; the race window between commit and search just needs to land below ES's refresh tick (default 1s) for any concurrent write workload.

Why test-side wait, not server-side

A server-side change (?refresh=wait_for on the inline ES bulk) was considered in atlas-metastore#6727. Closed without merging due to performance concerns under bulk-import load (the index.max_refresh_listeners ceiling of 1000 can flip wait_for into forced-refresh behaviour when many writers stack up concurrently, fragmenting segments). Tests should wait for their explicit preconditions rather than rely on implicit server timing guarantees.

What this is not trying to fix

  • Integration (PurposeTest) / asset-import: chunk 0/1/3 token-permission failures — separate
  • Integration (SearchTest) — pre-existing failure on both lean-graph and legacy tenants
  • Packages (duplicate-detector) / Packages (cube-assets-builder) — separate flake/permission territory

Test plan

  • CI build green (PR build)
  • After merge: Integration (SuggestionsTest) job green for 3 consecutive daily Test (leangraph-test) runs
  • Non-leangraph nightly Test workflow still passes Integration (SuggestionsTest) (no regression on the other tenant)

Linear

Resolves part of MS-1270 (the ES refresh race manifesting as Suggestions failures). MS-1270 to be closed once 3 consecutive daily runs are green.

🤖 Generated with Claude Code

…ency

awaitConsistency previously only called waitForTagsToSync, which covers
the classification denorm but not ownerGroups, descriptions, or
assignedTerms. The three find* tests that follow all aggregate on those
non-tag fields:

  - findSuggestionsDefault       — aggregates ownerGroups / descriptions
                                   / atlanTags / assignedTerms from peer
                                   columns (t1c1, v1c1)
  - findSuggestionsAcrossTypes   — same aggregations from peer table3
                                   (which shares VIEW_NAME with view1)
  - findLimitedSuggestions       — aggregates ownerGroups / system desc
                                   from peer table1 (shares TABLE_NAME
                                   with table2)

Under CI matrix load on the shared leangraph-test tenant the time
between the suggestions.update.column.* writes and the find* queries
shrinks below ES's natural refresh window (default 1s, sometimes more
on a contended cluster). The aggregations come back empty and the
three tests fail with "expected [1] but found [0]". Reproduced even at
reduced matrix parallelism (max-parallel: 2) in run 26027064093,
confirming the race is sensitive to even modest concurrent write
contention — not just high fan-out.

The local isolated run (no parallelism) passes 24/24 because the gap
between commit and query exceeds the ES refresh tick.

Fix: extend awaitConsistency with a retrySearchUntil that waits for
all four metadata-bearing peers (table1, table3, t1c1, v1c1) to show
ownerGroups + assignedTerms visible in ES. Once those four are
indexed, the downstream find* aggregations will see consistent state.
retrySearchUntil already encapsulates bounded exponential backoff —
this is the same wait pattern other integration tests already use.

Companion fix to atlan-java#2500 (InsightsTest retry threshold). The
broader server-side question of whether the inline ES bulk should
take refresh=wait_for for read-after-write semantics was deferred
(atlas-metastore#6727 closed) due to perf concerns under bulk-import
load — tests should wait for their preconditions explicitly rather
than depend on implicit server timing guarantees.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sriram-atlan sriram-atlan requested a review from cmgrote as a code owner May 18, 2026 11:45
Signed-off-by: Chris (He/Him) <cgrote@gmail.com>
@cmgrote cmgrote enabled auto-merge May 18, 2026 12:56
@cmgrote cmgrote merged commit cc0ca44 into main May 18, 2026
7 checks passed
@cmgrote cmgrote added the ignore Exclude from release notes label May 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ignore Exclude from release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants