test(SuggestionsTest): wait for non-tag peer metadata in awaitConsistency #2506
Merged
awaitConsistency previously only called waitForTagsToSync, which covers
the classification denorm but not ownerGroups, descriptions, or
assignedTerms. The three find* tests that follow all aggregate on those
non-tag fields:
- findSuggestionsDefault: aggregates ownerGroups / descriptions / atlanTags / assignedTerms from peer columns (t1c1, v1c1)
- findSuggestionsAcrossTypes: same aggregations from peer table3 (which shares VIEW_NAME with view1)
- findLimitedSuggestions: aggregates ownerGroups / system desc from peer table1 (shares TABLE_NAME with table2)
Under CI matrix load on the shared leangraph-test tenant the time
between the suggestions.update.column.* writes and the find* queries
shrinks below ES's natural refresh window (default 1s, sometimes more
on a contended cluster). The aggregations come back empty and the
three tests fail with "expected [1] but found [0]". Reproduced even at
reduced matrix parallelism (max-parallel: 2) in run 26027064093,
confirming the race is sensitive to even modest concurrent write
contention — not just high fan-out.
The local isolated run (no parallelism) passes 24/24 because the gap
between commit and query exceeds the ES refresh tick.
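The race can be pictured with a toy model of a near-real-time index. This is only an illustrative sketch, not Elasticsearch code: `ToyIndex` and its methods are hypothetical names, and the explicit `refresh()` call stands in for the periodic refresh tick.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (hypothetical): writes are buffered and only become searchable after refresh(). */
final class ToyIndex {
    private final List<String> buffered = new ArrayList<>();
    private final List<String> searchable = new ArrayList<>();

    /** Accepts a write; the document is not yet visible to count(). */
    void write(String doc) {
        buffered.add(doc);
    }

    /** Stands in for the periodic refresh: makes all buffered writes visible. */
    void refresh() {
        searchable.addAll(buffered);
        buffered.clear();
    }

    /** Counts only documents made visible by a previous refresh(). */
    long count(String doc) {
        return searchable.stream().filter(doc::equals).count();
    }
}
```

In this model a query issued between `write` and `refresh` sees a count of 0, and the same query after `refresh` sees 1, which mirrors the "expected [1] but found [0]" failure when the tests query inside the refresh window.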
Fix: extend awaitConsistency with a retrySearchUntil that waits for
all four metadata-bearing peers (table1, table3, t1c1, v1c1) to show
ownerGroups + assignedTerms visible in ES. Once those four are
indexed, the downstream find* aggregations will see consistent state.
retrySearchUntil already encapsulates bounded exponential backoff —
this is the same wait pattern other integration tests already use.
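For readers unfamiliar with the pattern, bounded exponential backoff can be sketched as below. This is a minimal illustration under assumptions, not the SDK's retrySearchUntil: the class `BoundedBackoff`, the method `waitUntil`, and all parameters are hypothetical names chosen for the example.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical helper illustrating bounded exponential backoff; not the SDK's retrySearchUntil. */
final class BoundedBackoff {
    private BoundedBackoff() {}

    /**
     * Polls the condition until it holds or the attempts run out.
     * The delay doubles after each failed attempt, capped at maxDelay.
     *
     * @return true if the condition held within maxAttempts, false otherwise
     */
    static boolean waitUntil(BooleanSupplier condition, int maxAttempts,
                             Duration initialDelay, Duration maxDelay) {
        long delayMs = initialDelay.toMillis();
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMs);
                } catch (InterruptedException e) {
                    // Preserve the interrupt flag and give up waiting.
                    Thread.currentThread().interrupt();
                    return false;
                }
                delayMs = Math.min(delayMs * 2, maxDelay.toMillis());
            }
        }
        return false;
    }
}
```

In a test, the condition would be a search that checks whether every peer (table1, table3, t1c1, v1c1) already shows ownerGroups and assignedTerms. The cap on attempts keeps a genuinely broken precondition from hanging the run.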
Companion fix to atlan-java#2500 (InsightsTest retry threshold). The
broader server-side question of whether the inline ES bulk should
take refresh=wait_for for read-after-write semantics was deferred
(atlas-metastore#6727 closed) due to perf concerns under bulk-import
load — tests should wait for their preconditions explicitly rather
than depend on implicit server timing guarantees.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Chris (He/Him) <cgrote@gmail.com>
cmgrote approved these changes on May 18, 2026
Summary
SuggestionsTest.awaitConsistency previously only waited for tag denormalisation via waitForTagsToSync. The three find* test methods that follow aggregate on additional non-tag fields (ownerGroups, description, userDescription, assignedTerms) which weren't being waited on, so under CI matrix load the queries fired before the peer assets' metadata reached the ES search index → empty aggregations → "expected [1] but found [0]".
This PR extends awaitConsistency to also wait until the four metadata-bearing peers (table1, table3, t1c1, v1c1) show ownerGroups + assignedTerms visible in ES.
Diagnosis path
- SuggestionsTest run in isolation against leangraph-test
- Test (leangraph-test) workflow, full matrix (8/3/5): findSuggestionsDefault, findSuggestionsAcrossTypes, findLimitedSuggestions
- … [ms-1268-trace] instrumentation)
The 2-way reproduction was the deciding evidence: high parallelism isn't required; the race window between commit and search just needs to land below ES's refresh tick (default 1s) for any concurrent write workload.
Why test-side wait, not server-side
A server-side change (?refresh=wait_for on the inline ES bulk) was considered in atlas-metastore#6727. It was closed without merging due to performance concerns under bulk-import load (the index.max_refresh_listeners ceiling of 1000 can flip wait_for into forced-refresh behaviour when many writers stack up concurrently, fragmenting segments). Tests should wait for their explicit preconditions rather than rely on implicit server timing guarantees.
What this is not trying to fix
- Integration (PurposeTest) / asset-import: chunk 0/1/3 token-permission failures (separate)
- Integration (SearchTest): pre-existing failure on both lean-graph and legacy tenants
- Packages (duplicate-detector) / Packages (cube-assets-builder): separate flake/permission territory
Test plan
- Integration (SuggestionsTest) job green for 3 consecutive daily Test (leangraph-test) runs
- Test workflow still passes Integration (SuggestionsTest) (no regression on the other tenant)
Linear
Resolves part of MS-1270 (the ES refresh race manifesting as Suggestions failures). MS-1270 to be closed once 3 consecutive daily runs are green.
🤖 Generated with Claude Code