ATLAS-5337: Fix Trino Extractor Jersey client failures for AtlasEntityWithExtInfo POST/GET #690
Open
ramackri wants to merge 5 commits into
added 5 commits
July 4, 2026 21:54
…yWithExtInfo Remove jersey-client 1.9 pin from trino-extractor; JSON-serialize entity POSTs via AtlasType.toJson() in AtlasClientV2; parse model GET responses with AtlasJson in AtlasBaseClient; support ATLAS_USERNAME/ATLAS_PASSWORD env vars in AuthenticationUtil for non-interactive runs.
QuickStartIT and QuickStartV2IT assumed fixed column/process input ordering from graph traversal, which fails intermittently on Java 17 CI.
…docker. Retry transient JanusGraph errors during glossary test setup, scope BasicSearchIT hive queries to the imported @cl1 dataset, and harden atlas-hbase container startup and health checks so docker compose --wait does not fail spuriously.
Summary
Fix Trino Extractor standalone tarball failures when importing trino_* metadata into Atlas via AtlasClientV2. The bug has existed since ATLAS-5021 (PR #428, commit 4c49d3933, Sep 2025) and is not a regression from the Kafka 3.9.1 upgrade.

Also fixes flaky QuickStartIT / QuickStartV2IT CI failures caused by non-deterministic graph edge ordering (unrelated to the Jersey client changes in this PR).

Also fixes additional unrelated CI flakes in GlossaryServiceTest, BasicSearchIT, and the atlas-hbase docker smoke test.

Symptoms (before fix)
- POST createEntity: failed with a Jersey MessageBodyWriter error.
- GET getEntityByAttribute: failed with a Jersey MessageBodyReader error (seen once the POST was fixed).
- Non-interactive auth: 401 Unauthorized when AuthenticationUtil could not read from System.console() (CI / scripted runs).

Workaround used before fix: manual Atlas REST import (curl), which bypasses AtlasClientV2.createEntity() entirely.

Root cause
- The jersey-client 1.9 pin in addons/trino-extractor/pom.xml conflicted with atlas-client-v2 (1.19), so both jersey-client-1.9.jar and jersey-client-1.19.jar ended up in the distro lib/.
- Serialization of AtlasEntity$AtlasEntityWithExtInfo (inner class) via Jersey POJO mapping failed in the standalone lib/ layout.
- AuthenticationUtil only read credentials from System.console().

Type-def APIs already used AtlasType.toJson(); entity APIs did not. That asymmetry is why type defs could work while entity POSTs failed.

Why undetected until now: TrinoExtractorIT.java is a placeholder with no @Test methods, so the tarball path was never exercised in CI. Most deployments use the Hive hook or manual REST import.

Not caused by Kafka 3.9.1: Dependabot commit 6709f6459 only changed <kafka.version> in the root pom.xml. The Trino extractor has no Kafka client dependency.

Changes
| File | Change |
| --- | --- |
| addons/trino-extractor/pom.xml | Remove jersey-client 1.9 pin; inherit jersey.version 1.19 from parent |
| client/client-v2/.../AtlasClientV2.java | JSON-serialize request bodies via AtlasType.toJson() in createEntity, createEntities, updateEntity, updateEntities, updateEntityByAttribute |
| client/common/.../AtlasBaseClient.java | For org.apache.atlas.model.* response types, read body as String and parse with AtlasJson.fromJson() |
| intg/.../AuthenticationUtil.java | Support ATLAS_USERNAME / ATLAS_PASSWORD env vars before console prompt |
| webapp/.../QuickStartIT.java | Look up the time_id column by name; check process inputs by GUID set (not get(0) order) |
| webapp/.../QuickStartV2IT.java | testProcessIsAdded: assert input GUIDs via set membership (not list index) |
| repository/.../GlossaryServiceTest.java | Retry loadAllModels / createTypesDef on transient JanusGraph errors during @BeforeClass setup |
| webapp/.../BasicSearchIT.java | Scope hive_table / hive_column searches to imported @cl1 dataset to avoid suite pollution |
| dev-support/atlas-docker/scripts/atlas-hbase.sh | Wait for the HMaster PID before tail --pid; exit 1 if never found (fixes exited (0) race) |
| dev-support/atlas-docker/docker-compose.atlas-hbase.yml | Healthcheck probes RegionServer 16030/rs-status and HMaster 16010/master-status; add restart: unless-stopped |
| .github/workflows/ci.yml | On failure, dump docker logs --tail 200 for all expected containers before teardown |

Was AtlasClientV2.createEntity(object) working before?

Yes, in most Atlas deployments. Bridges (Hive, Kafka, HBase), webapp ITs (EntityV2JerseyResourceIT), and the sample app used object-passing successfully where Jersey POJO mapping had a clean classpath (bridge tarballs curate lib/ with jersey-json + jersey-client 1.19).

No, for the Trino extractor tarball. It uses the same client code, but copy-dependencies plus the jersey-client 1.9 pin produced an inconsistent lib/ layout.

Pre-serializing with AtlasType.toJson() produces the same wire JSON and aligns entity APIs with type-def APIs, so there is no behaviour change for working callers.

QuickStart IT flakiness (CI fix)
Why this is unrelated to the Jersey client changes
This PR modifies AtlasClientV2, AtlasBaseClient (v2 model JSON parsing), and the Trino extractor classpath. The failing QuickStart tests do not exercise that code path:

| Test | Client used | Relation to this PR |
| --- | --- | --- |
| QuickStartIT | AtlasClientV1 (atlasClientV1.getEntity(...)) | Referenceable API, not AtlasClientV2 |
| QuickStartV2IT | AtlasClientV2 for reads only in testProcessIsAdded | Does not exercise createEntity / JSON-serialization paths |

On the same CI run that failed QuickStartIT, AtlasClientV2Test (125 tests) passed. The QuickStart failure is a pre-existing brittle assertion, not a regression from ATLAS-5337.

Observed CI failures
Failures on docker-build (Java 17):

| Test | Failure |
| --- | --- |
| QuickStartIT.testTablesAreAdded | expected [time_id] but found [customer_id] at verifyColumnsAreAddedToTable:143 |
| QuickStartV2IT.testProcessIsAdded | expected [guid-A] but found [guid-B] at testProcessIsAdded:122 |

Both are order-assumption failures: the test expected a specific element at list index 0, but Atlas returned a different (equally valid) element first.
Root cause: non-deterministic relationship list order
When a table entity is read back from Atlas, its columns attribute and a process entity's inputs attribute are lists of related entities built from graph edges. The order of those lists depends on graph traversal order.

Atlas can persist a stable order by setting ATTRIBUTE_INDEX_PROPERTY_KEY on relationship edges (see EntityGraphMapper, GraphHelper). That index is written when entities are created through the v2 store with explicit ordering. The QuickStart example programs (QuickStart.runQuickstart / QuickStartV2.runQuickstart) create metadata via the v1 client and do not set relationship indexes. As a result, column and input edge order is implementation-dependent and can vary between JVM versions, JanusGraph traversal order, or CI runs.

This is not a data correctness bug: all four columns and both process inputs are present, and only the list position is unstable.
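To make the distinction concrete, here is a minimal, self-contained sketch (not the actual test code) contrasting a position-based comparison with a membership-based one. The class name, method names, and GUID strings are hypothetical placeholders chosen for illustration; the real tests read GUIDs from Atlas responses.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

public class OrderIndependentCheck {
    /** Brittle: true only when the GUIDs arrive in exactly the expected order. */
    public static boolean matchesByPosition(List<String> actual, List<String> expected) {
        return actual.equals(expected);
    }

    /** Robust: true when the same GUIDs are present, regardless of order. */
    public static boolean matchesByMembership(List<String> actual, List<String> expected) {
        return actual.size() == expected.size()
                && new HashSet<>(actual).equals(new HashSet<>(expected));
    }

    public static void main(String[] args) {
        // Hypothetical GUIDs standing in for the two process inputs.
        List<String> expected = Arrays.asList("guid-sales-fact", "guid-time-dim");
        // Same two inputs, but the graph traversal returned them in the other order.
        List<String> returned = Arrays.asList("guid-time-dim", "guid-sales-fact");

        System.out.println(matchesByPosition(returned, expected));   // false
        System.out.println(matchesByMembership(returned, expected)); // true
    }
}
```

The membership check still fails if an input is missing or an unexpected one appears, so it does not weaken what the test verifies about content.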
What each test assumed (before fix)
QuickStartIT (v1 API)

- verifyColumnsAreAddedToTable: called columns.get(0) and asserted its name was time_id. On failure, index 0 held customer_id instead. The sales_fact table is created with four columns (time_id, product_id, customer_id, sale_price); all four were present (assertEquals(columns.size(), 4) passed), but first-element identity varied.
- testProcessIsAdded: asserted inputs.get(0) was the sales_fact table GUID and inputs.get(1) was the time_dim table GUID. The process genuinely has two inputs; only their order in the returned list was unstable.

QuickStartV2IT (v2 API)

- verifyColumnsAreAddedToTable: already order-independent (checks count == 4 and that each column GUID is a valid UUID). No change needed.
- testProcessIsAdded: same index-0 / index-1 assumption as v1, using ((Map) inputs.get(0)).get("guid").

Notably, QuickStartIT.testLineageIsMaintained already handled this correctly: it uses assertTrue(salesFactTableId.equals(i1) || salesFactTableId.equals(i2)) rather than assuming a fixed slot. The column and process tests now follow the same pattern.

Fix applied
Assertions now verify membership and attributes, not list position:
| Test | Before | After |
| --- | --- | --- |
| QuickStartIT.verifyColumnsAreAddedToTable | columns.get(0).get("name") == "time_id" | Find the column with name == time_id; assert dataType == "int" |
| QuickStartIT.testProcessIsAdded | inputs.get(0) / inputs.get(1) exact GUID match | Collect input GUIDs into a Set; assertTrue both expected table GUIDs are present |
| QuickStartV2IT.testProcessIsAdded | inputs.get(0) / inputs.get(1) exact GUID match | Assert input GUIDs via set membership |

Single-output assertions (outputs.get(0)) are unchanged: each process has exactly one output, so index 0 is unambiguous.

Why this surfaced on Java 17 CI
Commit 488bdaae8 (ATLAS-5002) added a CI matrix leg docker-build (17) alongside Java 8. The QuickStart ITs are long-standing; the Java 17 leg runs the full embedded Jetty + Solr integration suite (mvn -Pdist,embedded-solr-it verify) and exposed intermittent ordering differences that the Java 8 leg had not surfaced consistently. The failures predate this PR and appear on upstream master runs as well.

Additional CI flake fixes (unrelated to Jersey / QuickStart)
These failures blocked CI on PR #690 but are not caused by the Trino extractor client changes.
Observed CI failures
| CI job | Failure |
| --- | --- |
| docker-build (8) | GlossaryServiceTest.setupSampleGlossary: JanusGraph Could not start new transaction / Cursor has been closed at line 166 |
| docker-build (17) | atlas-hbase exited (0) during docker compose --wait after Maven passed |
| docker-build (8) | BasicSearchIT.testDiscoveryWithSearchParameters: expected [3] but found [4] at line 148 |

1. GlossaryServiceTest: retry transient JanusGraph setup errors

What this test is. GlossaryServiceTest is a repository-layer unit test (not a webapp IT). It runs with Guice + TestModules.TestOnlyModule, which boots an embedded JanusGraph + Solr in-process, the same backend style used by many other repository tests.

What @BeforeClass does. Before any @Test runs, setupSampleGlossary() must:

- Load addons/models/0000-Area0 via TestLoadModelUtils.loadAllModels(...)
- Create TestClassification via typeDefStore.createTypesDef(...), which is where CI failed (old line 166)

CI failure. On run 28730263409 (Java 8), setup failed with JanusGraph errors Could not start new transaction and Cursor has been closed. These are backend timing errors, not assertion failures; the class never reached any glossary test logic.

Why it flakes.

- … repository tests during mvn verify
- @BeforeClass is all-or-nothing: a failure in createTypesDef skips the entire class

Atlas production code already retries similar graph errors (EntityConsumer.commitWithRetry, AsyncImportTaskExecutor locking retries). Test setup lacked the same protection.

Fix applied. loadAllModelsWithRetry() and createTestClassificationWithRetry() wrap the two fragile setup steps:

- Retried errors: could not start new transaction, cursor has been closed, permanentlockingexception
- … SkipException with the error message

Why this is safe. Retries apply only to @BeforeClass setup, not test assertions. Creating TestClassification twice would surface a real duplicate-type error, not a silent pass. No glossary business logic is changed.

2. BasicSearchIT: scope hive searches to imported @cl1 dataset

What this test is. BasicSearchIT is a webapp integration test against a live Atlas server (embedded Jetty + Solr) shared by the entire IT suite (~202 tests in one mvn verify run).

In @BeforeClass setUp() it imports hive-db-50-tables.zip, creates an hdfs_path entity for special-character search cases, then sleeps 5s for Solr indexing. Tests then run search cases from JSON fixtures (e.g. entity-filters.json) with hard-coded expectedCount values.

What the fixture contains. Despite the filename, hive-db-50-tables.zip is a small replication export with exactly three hive_table entities whose names contain testtable:

| name | qualifiedName |
| --- | --- |
| testtable_0 | default.testtable_0@cl1 |
| testtable_1 | default.testtable_1@cl1 |
| testtable_3 | default.testtable_3@cl1 |

(testtable_2 is intentionally absent; the fixture is curated, not a 0–49 sequence.)

The @cl1 suffix is a replication cluster marker from the export (cl1 = source server name in replication metadata). Every entity in this zip carries it.

What the test searches for. The first case in entity-filters.json searches for name contains "testtable" and expects a count of 3. That count is correct for the imported fixture, not for the entire Atlas graph.
CI failure. On run 28735146924 (Java 8), BasicSearchIT.testDiscoveryWithSearchParameters failed with expected [3] but found [4].
Root cause: shared Atlas instance + global search. All IT classes share one Atlas process for the full mvn verify run. Other ITs (EntityJerseyResourceIT, EntityV2JerseyResourceIT, EntityNotificationIT, etc.) create hive_table entities during the suite. Any extra table whose name contains the substring testtable matches the search, even if it was not part of hive-db-50-tables.zip. That polluter almost certainly has a different qualifiedName (no @cl1 suffix) because live ITs create primary-cluster entities, not replication exports.

The test was accidentally asserting global graph state when it meant to assert fixture state.
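The pollution effect can be reproduced in miniature. The sketch below is an in-memory illustration only: the Table class, both method names, and the fourth table are hypothetical stand-ins, and no Atlas search API is involved. It shows how a name-only filter counts a stray table, while AND-combining it with a qualifiedName marker restores the fixture count.

```java
import java.util.Arrays;
import java.util.List;

public class ScopedSearchSketch {
    /** Hypothetical stand-in for a hive_table entity; not an Atlas class. */
    public static final class Table {
        final String name;
        final String qualifiedName;

        public Table(String name, String qualifiedName) {
            this.name = name;
            this.qualifiedName = qualifiedName;
        }
    }

    /** Unscoped search: every table whose name contains the substring. */
    public static long countByName(List<Table> tables, String nameSubstring) {
        return tables.stream().filter(t -> t.name.contains(nameSubstring)).count();
    }

    /** Scoped search: the name filter AND-combined with a qualifiedName marker. */
    public static long countByNameInDataset(List<Table> tables, String nameSubstring, String marker) {
        return tables.stream()
                .filter(t -> t.name.contains(nameSubstring) && t.qualifiedName.contains(marker))
                .count();
    }

    public static void main(String[] args) {
        List<Table> graph = Arrays.asList(
                new Table("testtable_0", "default.testtable_0@cl1"),
                new Table("testtable_1", "default.testtable_1@cl1"),
                new Table("testtable_3", "default.testtable_3@cl1"),
                // Hypothetical table left behind by another IT in the shared instance.
                new Table("testtable_x", "default.testtable_x@primary"));

        System.out.println(countByName(graph, "testtable"));                  // 4
        System.out.println(countByNameInDataset(graph, "testtable", "@cl1")); // 3
    }
}
```

The name-only count depends on whatever else the suite has created, whereas the scoped count depends only on the imported fixture.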
Why @cl1 is the right filter. Entities from hive-db-50-tables.zip (for example default.testtable_0@cl1) match both name contains testtable and qualifiedName contains @cl1. Entities created elsewhere in the suite (db.table@primary or random qualified names) may match the name filter but do not carry the @cl1 suffix.

Fix applied. scopeToImportedDataset() AND-combines the JSON filter with qualifiedName contains @cl1 for hive_table and hive_column searches.

Applied in testDiscoveryWithSearchParameters, testAttributeSearch, and testSavedSearch (so saved-search execute/update tests stay consistent). Not applied to hdfs_path searches, negative validation tests, or quick-search tests.

Example: first failing case.

| | Before | After |
| --- | --- | --- |
| Filter | name contains "testtable" only | qualifiedName contains "@cl1" AND name contains "testtable" |

Sort assertion (testtable_3 first in DESC order) still holds: all three fixture tables remain in the result set.

Why not other approaches?

- … expectedCount to 4
- … @BeforeClass

3. atlas-hbase docker stack + CI diagnostics

What this is. After mvn verify succeeds, the docker-build job brings up the full Atlas docker smoke stack (dev-support/atlas-docker). With the default .env (ATLAS_BACKEND=hbase), the atlas service depends on atlas-backend → hbase, which extends docker-compose.atlas-hbase.yml. The atlas-hbase container runs HBase + the Atlas HBase hook and must stay up for the post-build container health check in .github/workflows/ci.yml.

CI failure. On run 28732248560 (Java 17), Maven passed but docker compose ... up -d --wait failed because atlas-hbase exited with code 0, which is a clean exit, not a crash. The job then reported the container as not running in the manual container-status step.

Root cause: startup race in atlas-hbase.sh. The container entrypoint starts HBase, then immediately snapshots the HMaster PID and runs tail --pid=$HBASE_MASTER_PID -f /dev/null to keep the container alive. start-hbase.sh can return before HMaster is registered in ps. On a slow CI host, the one-shot grep can return an empty PID. tail --pid= then exits immediately (exit 0), the entrypoint script ends, and Docker marks the container exited (0). From the outside this looks like a spurious flake: HBase may still be starting, or may never have been pinned as the watched process.

A second weakness was in the docker-compose.atlas-hbase.yml healthcheck. The old check only probed the RegionServer status page. RegionServer can report healthy while HMaster is still starting or has already died. docker compose --wait could pass (or flap) on RS alone even though the process the entrypoint watches (HMaster) was not stable, or the container had already exited because of the empty-PID race.

Fix applied across three files, with one goal: keep HBase alive and report true readiness.
dev-support/atlas-docker/scripts/atlas-hbase.sh

- Before: ps | grep HMaster ran once, immediately after start-hbase.sh; an empty PID made tail --pid= return at once, so the container exited 0.
- After: the script waits for the HMaster PID; if it is never found, it reports HBase HMaster failed to start and runs exit 1 (explicit failure, not silent exit).

The tail --pid=$HBASE_MASTER_PID keep-alive pattern is unchanged once a valid PID is found; we only fixed the race before that line.

dev-support/atlas-docker/docker-compose.atlas-hbase.yml

- healthcheck.test: was wget → 16030/rs-status only (RegionServer); now CMD-SHELL checks both 16030/rs-status and 16010/master-status, so --wait requires both the RS and HMaster HTTP status endpoints.
- restart: restart: unless-stopped
- interval / timeout / retries / start_period: …

Port reference:

- 16010/master-status: HMaster
- 16030/rs-status: RegionServer

.github/workflows/ci.yml: container failure diagnostics

The Check status of containers and remove them step already iterated the expected container list (atlas-zk, atlas-solr, atlas-kafka, atlas-db, atlas-hadoop, atlas-hbase, atlas-hive, atlas) and failed the job if any were not running. On failure it previously only printed which container was down, then stopped/removed everything, without collecting any logs.

Change: when any container is not running, dump docker logs --tail 200 for every expected container before teardown.

This does not change pass/fail behaviour; it makes the next atlas-hbase exited (0) flake self-explanatory in the Actions log (HMaster wait timeout, RS/Master healthcheck failure, Hadoop dependency, etc.) without reproducing locally.

Why these fixes are safe.

- … start_period + health retries
- restart: unless-stopped masks real bugs? … --wait; restart helps transient races only
- … atlas-hbase.sh actually watches

Testing
Build
Confirm that the tarball ships only jersey-client-1.19.jar (not 1.9).

Manual: Trino extractor smoke test
1. Unpack apache-atlas-*-trino-extractor.tar.gz; configure atlas.rest.address, Trino JDBC URL, namespace, and catalog in atlas-trino-extractor.properties.
2. Run the extractor with ATLAS_USERNAME / ATLAS_PASSWORD set (no TTY required).
3. Verify that the trino_column entity is present via Atlas REST unique-attribute lookup (qualifiedName=hive.hr.trino_pii_hive_v2.ssn@dev).

Result: PASS. No MessageBodyWriter / MessageBodyReader errors; trino_* entities imported successfully.

Manual: Trino tag-auth E2E
- … trino_* entities (not manual REST fallback)
- … trino_column via Atlas REST

Overall: PASS. Extractor-first metadata path; tag-based deny/mask enforced on Trino queries.
CI — QuickStart IT
Result: PASS. QuickStartIT: 5 tests, 0 failures.

Related