test(SuggestionsTest): wait for non-tag peer metadata in awaitConsistency by sriram-atlan · Pull Request #2506 · atlanhq/atlan-java

sriram-atlan · 2026-05-18T11:45:54Z

Summary

SuggestionsTest.awaitConsistency previously only waited for tag denormalisation via waitForTagsToSync. The three findSuggestions* test methods that follow aggregate on additional non-tag fields (ownerGroups, description, userDescription, assignedTerms) which weren't being waited on, so under CI matrix load the queries fired before the peer assets' metadata reached the ES search index → empty aggregations → expected [1] but found [0].

This PR extends awaitConsistency to also wait until the four metadata-bearing peers (table1, table3, t1c1, v1c1) show ownerGroups + assignedTerms visible in ES.

Diagnosis path

Step	Result
`SuggestionsTest` run in isolation against `leangraph-test`	24/24 pass, 1m 39s
Daily `Test (leangraph-test)` workflow, full matrix (8/3/5)	3 sub-failures: `findSuggestionsDefault`, `findSuggestionsAcrossTypes`, `findLimitedSuggestions`
Same workflow, reduced matrix to 2/2/2 (run 26027064093)	Same 3 sub-failures still fail — confirms even modest concurrent write contention triggers the race
Server-side trace (`[ms-1268-trace]` instrumentation)	Confirms peer metadata is correctly persisted in Cassandra and emitted into the ES bulk update body — the race is purely about ES refresh visibility, not data loss

The 2-way reproduction was the deciding evidence: high parallelism isn't required; the race window between commit and search just needs to land below ES's refresh tick (default 1s) for any concurrent write workload.

Why test-side wait, not server-side

A server-side change (?refresh=wait_for on the inline ES bulk) was considered in atlas-metastore#6727. Closed without merging due to performance concerns under bulk-import load (the index.max_refresh_listeners ceiling of 1000 can flip wait_for into forced-refresh behaviour when many writers stack up concurrently, fragmenting segments). Tests should wait for their explicit preconditions rather than rely on implicit server timing guarantees.

What this is not trying to fix

Integration (PurposeTest) / asset-import: chunk 0/1/3 token-permission failures — separate
Integration (SearchTest) — pre-existing failure on both lean-graph and legacy tenants
Packages (duplicate-detector) / Packages (cube-assets-builder) — separate flake/permission territory

Test plan

CI build green (PR build)
After merge: Integration (SuggestionsTest) job green for 3 consecutive daily Test (leangraph-test) runs
Non-leangraph nightly Test workflow still passes Integration (SuggestionsTest) (no regression on the other tenant)

Linear

Resolves part of MS-1270 (the ES refresh race manifesting as Suggestions failures). MS-1270 to be closed once 3 consecutive daily runs are green.

🤖 Generated with Claude Code

…ency awaitConsistency previously only called waitForTagsToSync, which covers the classification denorm but not ownerGroups, descriptions, or assignedTerms. The three find* tests that follow all aggregate on those non-tag fields: - findSuggestionsDefault — aggregates ownerGroups / descriptions / atlanTags / assignedTerms from peer columns (t1c1, v1c1) - findSuggestionsAcrossTypes — same aggregations from peer table3 (which shares VIEW_NAME with view1) - findLimitedSuggestions — aggregates ownerGroups / system desc from peer table1 (shares TABLE_NAME with table2) Under CI matrix load on the shared leangraph-test tenant the time between the suggestions.update.column.* writes and the find* queries shrinks below ES's natural refresh window (default 1s, sometimes more on a contended cluster). The aggregations come back empty and the three tests fail with "expected [1] but found [0]". Reproduced even at reduced matrix parallelism (max-parallel: 2) in run 26027064093, confirming the race is sensitive to even modest concurrent write contention — not just high fan-out. The local isolated run (no parallelism) passes 24/24 because the gap between commit and query exceeds the ES refresh tick. Fix: extend awaitConsistency with a retrySearchUntil that waits for all four metadata-bearing peers (table1, table3, t1c1, v1c1) to show ownerGroups + assignedTerms visible in ES. Once those four are indexed, the downstream find* aggregations will see consistent state. retrySearchUntil already encapsulates bounded exponential backoff — this is the same wait pattern other integration tests already use. Companion fix to atlan-java#2500 (InsightsTest retry threshold). The broader server-side question of whether the inline ES bulk should take refresh=wait_for for read-after-write semantics was deferred (atlas-metastore#6727 closed) due to perf concerns under bulk-import load — tests should wait for their preconditions explicitly rather than depend on implicit server timing guarantees. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Chris (He/Him) <cgrote@gmail.com>

sriram-atlan requested a review from cmgrote as a code owner May 18, 2026 11:45

chore: linting

633b5d1

Signed-off-by: Chris (He/Him) <cgrote@gmail.com>

cmgrote enabled auto-merge May 18, 2026 12:56

cmgrote approved these changes May 18, 2026

View reviewed changes

cmgrote merged commit cc0ca44 into main May 18, 2026
7 checks passed

cmgrote added the ignore Exclude from release notes label May 20, 2026

sriram-atlan mentioned this pull request May 21, 2026

test: address 3 remaining leangraph CI flakes (TableSearchTest, AvoidDuplicateLinkTest, SearchTest source-synced) #2518

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(SuggestionsTest): wait for non-tag peer metadata in awaitConsistency#2506

test(SuggestionsTest): wait for non-tag peer metadata in awaitConsistency#2506
cmgrote merged 2 commits into
mainfrom
fix-suggestions-await-consistency

sriram-atlan commented May 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

sriram-atlan commented May 18, 2026

Summary

Diagnosis path

Why test-side wait, not server-side

What this is not trying to fix

Test plan

Linear

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants