ATLAS-5337: Fix Trino Extractor Jersey client failures for AtlasEntityWithExtInfo POST/GET by ramackri · Pull Request #690 · apache/atlas

ramackri · 2026-07-04T16:24:57Z

Summary

Fix Trino Extractor standalone tarball failures when importing trino_* metadata into Atlas via AtlasClientV2. The bug has existed since ATLAS-5021 (PR #428, commit 4c49d3933, Sep 2025) and is not a regression from the Kafka 3.9.1 upgrade.

Also fixes flaky QuickStartIT / QuickStartV2IT CI failures caused by non-deterministic graph edge ordering (unrelated to the Jersey client changes in this PR).

Also fixes additional unrelated CI flakes in GlossaryServiceTest, BasicSearchIT, and the atlas-hbase docker smoke test.

Symptoms (before fix)

POST createEntity — MessageBodyWriter:

com.sun.jersey.api.client.ClientHandlerException:
  A message body writer for Java type, class
    org.apache.atlas.model.instance.AtlasEntity$AtlasEntityWithExtInfo,
  and MIME media type application/json; charset=UTF-8, was not found

GET getEntityByAttribute — MessageBodyReader (after POST fix):

ClientHandlerException:
  A message body reader for Java class
    org.apache.atlas.model.instance.AtlasEntity$AtlasEntityWithExtInfo,
  and MIME media type application/json; charset=utf-8 was not found

Non-interactive auth: 401 Unauthorized when AuthenticationUtil could not read from System.console() (CI / scripted runs).

Workaround used before fix: manual Atlas REST import (curl) — bypasses AtlasClientV2.createEntity() entirely.

Root cause

The explicit jersey-client 1.9 pin in addons/trino-extractor/pom.xml conflicted with the 1.19 version used by atlas-client-v2, so the distro lib/ shipped both jersey-client-1.9.jar and jersey-client-1.19.jar.
Entity POST APIs passed Java objects to Jersey. That works on the full server and on curated bridge classpaths via POJO mapping, but fails in the minimal extractor tarball.
Entity GET APIs could not deserialize AtlasEntity$AtlasEntityWithExtInfo (a nested class) via Jersey POJO mapping in the standalone lib/ layout.
AuthenticationUtil read credentials only from System.console().

Type-def APIs already used AtlasType.toJson(); entity APIs did not — that asymmetry is why type defs could work while entity POSTs failed.

Why undetected until now: TrinoExtractorIT.java is a placeholder with no @Test methods; tarball path never exercised in CI. Most deployments use Hive hook or manual REST import.

Not caused by Kafka 3.9.1: Dependabot commit 6709f6459 only changed <kafka.version> in root pom.xml. Trino extractor has no Kafka client dependency.

Changes

File	Change
`addons/trino-extractor/pom.xml`	Remove explicit `jersey-client` 1.9 pin; inherit `jersey.version` 1.19 from parent
`client/client-v2/.../AtlasClientV2.java`	Entity mutation APIs send JSON via `AtlasType.toJson()` — `createEntity`, `createEntities`, `updateEntity`, `updateEntities`, `updateEntityByAttribute`
`client/common/.../AtlasBaseClient.java`	For `org.apache.atlas.model.*` response types, read body as String and parse with `AtlasJson.fromJson()`
`intg/.../AuthenticationUtil.java`	Support `ATLAS_USERNAME` / `ATLAS_PASSWORD` env vars before console prompt
`webapp/.../QuickStartIT.java`	Fix flaky assertions: find `time_id` column by name; check process inputs by GUID set (not `get(0)` order)
`webapp/.../QuickStartV2IT.java`	Fix flaky `testProcessIsAdded`: assert input GUIDs via set membership (not list index)
`repository/.../GlossaryServiceTest.java`	Retry `loadAllModels` / `createTypesDef` on transient JanusGraph errors during `@BeforeClass` setup
`webapp/.../BasicSearchIT.java`	Scope `hive_table` / `hive_column` searches to imported `@cl1` dataset to avoid suite pollution
`dev-support/atlas-docker/scripts/atlas-hbase.sh`	Poll up to 120s for HMaster PID before `tail --pid`; `exit 1` if never found (fixes `exited (0)` race)
`dev-support/atlas-docker/docker-compose.atlas-hbase.yml`	Dual healthcheck: RegionServer `16030/rs-status` and HMaster `16010/master-status`; add `restart: unless-stopped`
`.github/workflows/ci.yml`	On container check failure, dump `docker logs --tail 200` for all expected containers before teardown
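
The `AuthenticationUtil` row above describes an "environment first, console second" lookup. The following is a minimal sketch of that ordering, not the code in this PR: the class name `CredentialLookup`, the method names, and the prompt strings are hypothetical, and only the `ATLAS_USERNAME` / `ATLAS_PASSWORD` variable names come from the description.

```java
import java.io.Console;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper illustrating the lookup order; not the actual AuthenticationUtil.
public class CredentialLookup {

    /** Returns {username, password} from the environment map, or null if either is missing or empty. */
    public static String[] fromEnv(Map<String, String> env) {
        String user = env.get("ATLAS_USERNAME");
        String pass = env.get("ATLAS_PASSWORD");
        if (user == null || user.isEmpty() || pass == null || pass.isEmpty()) {
            return null;
        }
        return new String[] {user, pass};
    }

    /** Tries the environment first; prompts on the console only when one is attached. */
    public static String[] resolve(Map<String, String> env, Console console) {
        String[] creds = fromEnv(env);
        if (creds != null) {
            return creds;
        }
        if (console == null) {
            // Non-interactive run (CI, scripts) with no env vars: fail loudly instead of sending empty credentials.
            throw new IllegalStateException("ATLAS_USERNAME/ATLAS_PASSWORD not set and no console available");
        }
        String user = console.readLine("Username: ");      // prompt text is an assumption
        char[] pass = console.readPassword("Password: ");  // prompt text is an assumption
        return new String[] {user, pass == null ? "" : new String(pass)};
    }

    public static void main(String[] args) {
        // Simulated environment so the demo runs without a TTY.
        Map<String, String> env = new HashMap<>();
        env.put("ATLAS_USERNAME", "admin");
        env.put("ATLAS_PASSWORD", "secret");
        String[] creds = resolve(env, null);
        System.out.println("Resolved user: " + creds[0]);
    }
}
```

In a real run the caller would pass `System.getenv()` and `System.console()`; taking both as parameters keeps the lookup testable without a terminal.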

Was `AtlasClientV2.createEntity(object)` working before?

Yes — in most Atlas deployments. Bridges (Hive, Kafka, HBase), webapp ITs (EntityV2JerseyResourceIT), and the sample app used object-passing successfully where Jersey POJO mapping had a clean classpath (bridge tarballs curate lib/ with jersey-json + jersey-client 1.19).

No — for the Trino extractor tarball. Same client code, but copy-dependencies + jersey-client 1.9 pin produced an inconsistent lib/ layout.

Pre-serializing with AtlasType.toJson() produces the same wire JSON and aligns entity APIs with type-def APIs — no behaviour change for working callers.

QuickStart IT flakiness (CI fix)

Why this is unrelated to the Jersey client changes

This PR modifies AtlasClientV2, AtlasBaseClient (v2 model JSON parsing), and the Trino extractor classpath. The failing QuickStart tests do not exercise that code path:

Test class	Client used	Touches changed Jersey code?
`QuickStartIT`	`AtlasClientV1` (`atlasClientV1.getEntity(...)`)	No — v1 `Referenceable` API, not `AtlasClientV2`
`QuickStartV2IT`	`AtlasClientV2` for reads only in `testProcessIsAdded`	No — reads existing entities; does not call the changed `createEntity` / JSON-serialization paths

On the same CI run that failed QuickStartIT, AtlasClientV2Test (125 tests) passed. The QuickStart failure is a pre-existing brittle assertion, not a regression from ATLAS-5337.

Observed CI failures

Date	Test	Failure
2026-07-04 (`docker-build` Java 17)	`QuickStartIT.testTablesAreAdded`	`expected [time_id] but found [customer_id]` at `verifyColumnsAreAddedToTable:143`
2026-06-30 / 2026-06-24 (Java 17)	`QuickStartV2IT.testProcessIsAdded`	`expected [guid-A] but found [guid-B]` at `testProcessIsAdded:122`

Both are order-assumption failures: the test expected a specific element at list index 0, but Atlas returned a different (equally valid) element first.

Root cause: non-deterministic relationship list order

When a table entity is read back from Atlas, its columns attribute and a process entity's inputs attribute are lists of related entities built from graph edges. The order of those lists depends on graph traversal order.

Atlas can persist a stable order by setting ATTRIBUTE_INDEX_PROPERTY_KEY on relationship edges (see EntityGraphMapper, GraphHelper). That index is written when entities are created through the v2 store with explicit ordering. The QuickStart example programs (QuickStart.runQuickstart / QuickStartV2.runQuickstart) create metadata via the v1 client and do not set relationship indexes. As a result, column and input edge order is implementation-dependent and can vary between JVM versions, JanusGraph traversal order, or CI runs.

This is not a data correctness bug — all four columns and both process inputs are present — only the list position is unstable.

What each test assumed (before fix)

QuickStartIT (v1 API)

verifyColumnsAreAddedToTable — called columns.get(0) and asserted its name was time_id. On failure, index 0 held customer_id instead. The sales_fact table is created with four columns (time_id, product_id, customer_id, sale_price); all four were present (assertEquals(columns.size(), 4) passed), but first-element identity varied.
testProcessIsAdded — asserted inputs.get(0) was the sales_fact table GUID and inputs.get(1) was the time_dim table GUID. The process genuinely has two inputs; only their order in the returned list was unstable.

QuickStartV2IT (v2 API)

verifyColumnsAreAddedToTable — already order-independent (checks count == 4 and each column GUID is a valid UUID). No change needed.
testProcessIsAdded — same index-0 / index-1 assumption as v1, using ((Map) inputs.get(0)).get("guid").

Notably, QuickStartIT.testLineageIsMaintained already handled this correctly — it uses assertTrue(salesFactTableId.equals(i1) || salesFactTableId.equals(i2)) rather than assuming a fixed slot. The column and process tests now follow the same pattern.

Fix applied

Assertions now verify membership and attributes, not list position:

Location	Before	After
`QuickStartIT.verifyColumnsAreAddedToTable`	`columns.get(0).get("name") == "time_id"`	Stream/filter to find column where `name == time_id`; assert `dataType == "int"`
`QuickStartIT.testProcessIsAdded`	`inputs.get(0)` / `inputs.get(1)` exact GUID match	Collect input GUIDs into a `Set`; `assertTrue` both expected table GUIDs are present
`QuickStartV2IT.testProcessIsAdded`	`inputs.get(0)` / `inputs.get(1)` exact GUID match	Same set-membership check on relationship-attribute GUIDs

Single-output assertions (outputs.get(0)) are unchanged — each process has exactly one output, so index 0 is unambiguous.
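
The pattern in the table above can be sketched in plain Java. This is an illustration under assumptions, not the test code from the PR: the class `OrderIndependentChecks`, its helpers, and the sample GUID strings are hypothetical, and related entities are modelled as simple maps with `guid`, `name`, and `dataType` keys.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helpers showing membership-based checks instead of index-based ones.
public class OrderIndependentChecks {

    /** Collects the "guid" value of every related-entity map into a set. */
    public static Set<String> guidsOf(List<Map<String, Object>> related) {
        return related.stream()
                .map(m -> String.valueOf(m.get("guid")))
                .collect(Collectors.toSet());
    }

    /** Finds the entry whose "name" matches, wherever it sits in the list. */
    public static Optional<Map<String, Object>> findByName(List<Map<String, Object>> items, String name) {
        return items.stream().filter(m -> name.equals(m.get("name"))).findFirst();
    }

    /** Builds a map from alternating key/value strings (demo convenience only). */
    public static Map<String, Object> row(String... kv) {
        Map<String, Object> m = new HashMap<>();
        for (int i = 0; i + 1 < kv.length; i += 2) {
            m.put(kv[i], kv[i + 1]);
        }
        return m;
    }

    public static void main(String[] args) {
        // Inputs returned in the "wrong" order: an index-0 assertion would fail here.
        List<Map<String, Object>> inputs = Arrays.asList(row("guid", "guid-B"), row("guid", "guid-A"));
        Set<String> expected = new HashSet<>(Arrays.asList("guid-A", "guid-B"));
        System.out.println("inputs ok: " + guidsOf(inputs).containsAll(expected));

        // Columns with customer_id first, as in the observed CI failure.
        List<Map<String, Object>> columns = Arrays.asList(
                row("name", "customer_id", "dataType", "int"),
                row("name", "time_id", "dataType", "int"));
        System.out.println("time_id found: " + findByName(columns, "time_id").isPresent());
    }
}
```

Both checks pass for any permutation of the lists, which is the property the index-based assertions lacked.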

Why this surfaced on Java 17 CI

Commit 488bdaae8 (ATLAS-5002) added a CI matrix leg docker-build (17) alongside Java 8. The QuickStart ITs are long-standing; the Java 17 leg runs the full embedded Jetty + Solr integration suite (mvn -Pdist,embedded-solr-it verify) and exposed intermittent ordering differences that the Java 8 leg had not surfaced consistently. The failures predate this PR and appear on upstream master runs as well.

Additional CI flake fixes (unrelated to Jersey / QuickStart)

These failures blocked CI on PR #690 but are not caused by the Trino extractor client changes.

Observed CI failures

Run	Job	Failure	Related to ATLAS-5337?
28730263409	`docker-build (8)`	`GlossaryServiceTest.setupSampleGlossary` — JanusGraph `Could not start new transaction` / `Cursor has been closed` at line 166	No
28732248560	`docker-build (17)`	`atlas-hbase exited (0)` during docker compose `--wait` after Maven passed	No
28735146924	`docker-build (8)`	`BasicSearchIT.testDiscoveryWithSearchParameters` — `expected [3] but found [4]` at line 148	No

1. `GlossaryServiceTest` — retry transient JanusGraph setup errors

What this test is. GlossaryServiceTest is a repository-layer unit test (not a webapp IT). It runs with Guice + TestModules.TestOnlyModule, which boots an embedded JanusGraph + Solr in-process — the same backend style used by many other repository tests.

What @BeforeClass does. Before any @Test runs, setupSampleGlossary() must:

Load Atlas type models from addons/models/0000-Area0 via TestLoadModelUtils.loadAllModels(...)
Create a one-off classification typedef TestClassification via typeDefStore.createTypesDef(...) — this is where CI failed (old line 166)
Build in-memory glossary/category/term fixtures for the rest of the class

CI failure. On run 28730263409 (Java 8), setup failed with JanusGraph errors Could not start new transaction and Cursor has been closed. These are backend timing errors, not assertion failures — the class never reached any glossary test logic.

Why it flakes.

Factor	Effect
Embedded graph in CI	JanusGraph runs in the same JVM as hundreds of other `repository` tests during `mvn verify`
`@BeforeClass` is all-or-nothing	One failed `createTypesDef` skips the entire class
CI resource pressure	Graph transactions can fail transiently under load
No retry before	A single transient error caused a hard failure

Atlas production code already retries similar graph errors (EntityConsumer.commitWithRetry, AsyncImportTaskExecutor locking retries). Test setup lacked the same protection.

Fix applied. loadAllModelsWithRetry() and createTestClassificationWithRetry() wrap the two fragile setup steps:

Up to 5 attempts
Backoff: 500ms × attempt (500ms, 1s, 1.5s, 2s, 2.5s)
Retries only transient errors: could not start new transaction, cursor has been closed, permanentlockingexception
Permanent errors (bad typedef, validation) fail immediately — no masking of real bugs
If all 5 attempts fail, setup still throws SkipException with the error message

Why this is safe. Retries apply only to @BeforeClass setup, not test assertions. Creating TestClassification twice would surface a real duplicate-type error, not a silent pass. No glossary business logic is changed.
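
A minimal sketch of the retry policy described above, assuming a generic `Callable`-based wrapper. The class `TransientRetry` and its method names are hypothetical and are not `loadAllModelsWithRetry()` or `createTestClassificationWithRetry()` themselves. Whether the real helpers inspect the cause chain or the exception class name is an assumption here, and this sketch rethrows the last error rather than wrapping it in TestNG's SkipException.

```java
import java.util.Locale;
import java.util.concurrent.Callable;

// Hypothetical retry wrapper: linear backoff, retries only errors that look transient.
public class TransientRetry {

    // Lower-cased markers taken from the PR description.
    private static final String[] TRANSIENT_MARKERS = {
        "could not start new transaction",
        "cursor has been closed",
        "permanentlockingexception"
    };

    /** True if any exception in the cause chain mentions a transient marker (assumed matching rule). */
    public static boolean isTransient(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            String text = (c.getClass().getName() + " " + c.getMessage()).toLowerCase(Locale.ROOT);
            for (String marker : TRANSIENT_MARKERS) {
                if (text.contains(marker)) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Runs the action up to maxAttempts times, sleeping baseDelayMs x attempt between tries. */
    public static <T> T withRetry(Callable<T> action, int maxAttempts, long baseDelayMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                if (!isTransient(e)) {
                    throw e; // permanent error: fail immediately, do not mask real bugs
                }
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(baseDelayMs * attempt); // 500ms, 1s, 1.5s, ... when baseDelayMs = 500
                }
            }
        }
        throw last; // all attempts hit transient errors
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        String result = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new RuntimeException("Could not start new transaction");
            }
            return "types created";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Passing the delay as a parameter keeps the demo fast; the values in the description correspond to `withRetry(action, 5, 500L)`.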

2. `BasicSearchIT` — scope hive searches to imported `@cl1` dataset

What this test is. BasicSearchIT is a webapp integration test against a live Atlas server (embedded Jetty + Solr) shared by the entire IT suite (~202 tests in one mvn verify run).

In @BeforeClass setUp() it imports hive-db-50-tables.zip, creates an hdfs_path entity for special-character search cases, then sleeps 5s for Solr indexing. Tests then run search cases from JSON fixtures (e.g. entity-filters.json) with hard-coded expectedCount values.

What the fixture contains. Despite the filename, hive-db-50-tables.zip is a small replication export with exactly three hive_table entities whose names contain testtable:

Name	qualifiedName
`testtable_0`	`default.testtable_0@cl1`
`testtable_1`	`default.testtable_1@cl1`
`testtable_3`	`default.testtable_3@cl1`

(testtable_2 is intentionally absent — the fixture is curated, not a 0–49 sequence.)

The @cl1 suffix is a replication cluster marker from the export (cl1 = source server name in replication metadata). Every entity in this zip carries it.

What the test searches for. The first case in entity-filters.json:

typeName = hive_table
entityFilters: name contains "testtable"
expectedCount: 3

That count is correct for the imported fixture — not for the entire Atlas graph.

CI failure. On run 28735146924 (Java 8):

BasicSearchIT.testDiscoveryWithSearchParameters — expected [3] but found [4] at line 148

Root cause: shared Atlas instance + global search. All IT classes share one Atlas process for the full mvn verify run. Other ITs (EntityJerseyResourceIT, EntityV2JerseyResourceIT, EntityNotificationIT, etc.) create hive_table entities during the suite. Any extra table whose name contains the substring testtable matches the search — even if it was not part of hive-db-50-tables.zip. That polluter almost certainly has a different qualifiedName (no @cl1 suffix) because live ITs create primary-cluster entities, not replication exports.

The test was accidentally asserting global graph state when it meant to assert fixture state.

Why @cl1 is the right filter.

Entity source	Typical qualifiedName	Matches `name contains testtable`?	Matches `qualifiedName contains @cl1`?
`hive-db-50-tables.zip`	`default.testtable_0@cl1`	Yes	Yes
Live IT-created tables	`db.table@primary` or random	Sometimes	No

Fix applied. scopeToImportedDataset() AND-combines the JSON filter with qualifiedName contains @cl1 for hive_table and hive_column searches:

typeName = hive_table
AND qualifiedName contains "@cl1"    ← added by fix
AND name contains "testtable"          ← from JSON fixture
→ 3 fixture tables only

Applied in testDiscoveryWithSearchParameters, testAttributeSearch, and testSavedSearch (so saved-search execute/update tests stay consistent). Not applied to hdfs_path searches, negative validation tests, or quick-search tests.

Example — first failing case.

	Filter	Result
Before	`name contains "testtable"` only	3 fixture + 1 polluter = 4 ❌
After	`qualifiedName contains "@cl1"` AND `name contains "testtable"`	3 ✅

Sort assertion (testtable_3 first in DESC order) still holds — all three fixture tables remain in the result set.
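
The effect of the added `@cl1` condition can be modelled with plain predicates. This is only an in-memory illustration, assuming a simplified `Table` record type. The real `scopeToImportedDataset()` works on Atlas search parameters, and the polluter entity `testtable_tmp` below is a made-up example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// In-memory model of "fixture filter AND qualifiedName contains @cl1"; not the Atlas search API.
public class ScopedSearchSketch {

    /** Simplified stand-in for a hive_table entity. */
    public static final class Table {
        final String name;
        final String qualifiedName;

        public Table(String name, String qualifiedName) {
            this.name = name;
            this.qualifiedName = qualifiedName;
        }
    }

    /** Filter taken from the JSON fixture: name contains the given substring. */
    public static Predicate<Table> nameContains(String fragment) {
        return t -> t.name.contains(fragment);
    }

    /** AND-combines any fixture filter with the replication-cluster marker. */
    public static Predicate<Table> scopeToImportedDataset(Predicate<Table> fixtureFilter) {
        return fixtureFilter.and(t -> t.qualifiedName.contains("@cl1"));
    }

    public static long count(List<Table> graph, Predicate<Table> filter) {
        return graph.stream().filter(filter).count();
    }

    public static void main(String[] args) {
        List<Table> graph = Arrays.asList(
                new Table("testtable_0", "default.testtable_0@cl1"),
                new Table("testtable_1", "default.testtable_1@cl1"),
                new Table("testtable_3", "default.testtable_3@cl1"),
                new Table("testtable_tmp", "db.testtable_tmp@primary")); // hypothetical polluter

        Predicate<Table> fixtureFilter = nameContains("testtable");
        System.out.println("unscoped: " + count(graph, fixtureFilter));                         // 4
        System.out.println("scoped:   " + count(graph, scopeToImportedDataset(fixtureFilter))); // 3
    }
}
```

Because the scope is added as an extra AND term, the JSON fixtures and their `expectedCount` values stay untouched.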

Why not other approaches?

Alternative	Why not used
Change `expectedCount` to 4	Masks the issue; count would vary run-to-run
Isolate BasicSearchIT in its own Jetty instance	Heavy CI change; slower
Rewrite all JSON fixtures	Large diff; one helper is sufficient
Delete polluting entities in `@BeforeClass`	Fragile; doesn't know which GUIDs to delete

3. `atlas-hbase` docker stack + CI diagnostics

What this is. After mvn verify succeeds, the docker-build job brings up the full Atlas docker smoke stack (dev-support/atlas-docker). With default .env (ATLAS_BACKEND=hbase), the atlas service depends on atlas-backend → hbase, which extends docker-compose.atlas-hbase.yml. The atlas-hbase container runs HBase + the Atlas HBase hook and must stay up for the post-build container health check in .github/workflows/ci.yml.

CI failure. On run 28732248560 (Java 17), Maven passed but docker compose ... up -d --wait failed because atlas-hbase exited with code 0 — a clean exit, not a crash. The job then reported the container as not running in the manual container-status step.

Root cause: startup race in atlas-hbase.sh. The container entrypoint starts HBase, then immediately snapshots the HMaster PID and runs tail --pid=$HBASE_MASTER_PID -f /dev/null to keep the container alive:

start-hbase.sh  →  ps/grep HMaster (single shot)  →  tail --pid=$PID

start-hbase.sh can return before HMaster is visible in ps. On a slow CI host, the one-shot grep can return an empty PID. tail --pid= then exits immediately (exit 0), the entrypoint script ends, and Docker marks the container exited (0). From the outside this looks like a spurious flake: HBase may still be starting, or HMaster may never have been pinned as the watched process.

A second weakness was in docker-compose.atlas-hbase.yml healthcheck. The old check only probed the RegionServer status page:

http://localhost:16030/rs-status

RegionServer can report healthy while HMaster is still starting or has already died. docker compose --wait could pass (or flap) on RS alone even though the process the entrypoint watches (HMaster) was not stable — or the container had already exited because of the empty-PID race.

Fix applied — three files, one goal: keep HBase alive and report true readiness.

`dev-support/atlas-docker/scripts/atlas-hbase.sh`

Before	After
Single `ps | grep HMaster` immediately after `start-hbase.sh`	Poll up to 60 attempts × 2s = 120s for HMaster PID
Empty PID → `tail --pid=` → container exits 0	If PID still empty after 120s → log `HBase HMaster failed to start` and `exit 1` (explicit failure, not silent exit)
No visibility into slow starts	Gives HBase time to finish SSH/setup and bring HMaster up on loaded CI VMs

The tail --pid=$HBASE_MASTER_PID keep-alive pattern is unchanged once a valid PID is found — we only fixed the race before that line.

`dev-support/atlas-docker/docker-compose.atlas-hbase.yml`

Setting	Before	After	Why
`healthcheck.test`	`wget` → `16030/rs-status` only (RegionServer)	`CMD-SHELL` checks both `16030/rs-status` and `16010/master-status`	Compose `--wait` now requires RS and HMaster HTTP status endpoints
`restart`	(none)	`restart: unless-stopped`	Transient startup failure or brief HMaster exit triggers Docker restart instead of leaving the stack down
`interval` / `timeout` / `retries` / `start_period`	30s / 10s / 30 / 40s	unchanged	~15 min worst-case health wait budget; sufficient for first HBase start on cold CI

Port reference:

Port	Endpoint	Component
`16010`	`/master-status`	HMaster
`16030`	`/rs-status`	RegionServer

`.github/workflows/ci.yml` — container failure diagnostics

The Check status of containers and remove them step already iterated the expected container list (atlas-zk, atlas-solr, atlas-kafka, atlas-db, atlas-hadoop, atlas-hbase, atlas-hive, atlas) and failed the job if any were not running. On failure it previously only printed which container was down, then stopped/removed everything — no logs.

Change: when any container is not running, dump docker logs --tail 200 for every expected container before teardown:

for container in "${containers[@]}"; do
    docker logs --tail 200 "$container" 2>&1 || true
done

This does not change pass/fail behaviour; it makes the next atlas-hbase exited (0) flake self-explanatory in the Actions log (HMaster wait timeout, RS/Master healthcheck failure, Hadoop dependency, etc.) without reproducing locally.

Why these fixes are safe.

Concern	Answer
Longer CI wait?	Only when HBase is genuinely slow; 120s PID wait runs inside the container entrypoint, parallel to the existing 40s `start_period` + health retries
`restart: unless-stopped` masks real bugs?	Persistent HMaster failure still fails healthcheck and fails `--wait`; restart helps transient races only
Dual healthcheck too strict?	Both endpoints are standard HBase admin URLs already used for ops; aligns health with what `atlas-hbase.sh` actually watches
Related to ATLAS-5337 Jersey changes?	No — docker/CI-only; HBase hook tarball is copied in an earlier CI step but this failure is container lifecycle, not client classpath

Testing

Build

mvn -pl addons/trino-extractor,client/client-v2,client/common,intg -am package -DskipTests
mvn -pl addons/trino-extractor,distro -am package -DskipTests -Pdist

Confirm tarball ships only jersey-client-1.19.jar (not 1.9):

tar tzf distro/target/apache-atlas-*-trino-extractor.tar.gz | grep jersey-client
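
For a broader check than a single grep, the listing can be scanned for any artifact that appears in more than one version. The sketch below assumes jar names of the form `artifact-version.jar`. The class `DuplicateJarCheck` and the sample paths are hypothetical, and the file-name pattern is a heuristic that can mis-split unusual names.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: groups jar file names by artifact and reports artifacts with several versions.
public class DuplicateJarCheck {

    // artifact = shortest prefix followed by "-<digit...>.jar"; heuristic, not a Maven coordinate parser.
    private static final Pattern JAR = Pattern.compile("^(.+?)-(\\d[\\w.\\-]*)\\.jar$");

    /** Returns artifact -> versions for every artifact present in two or more versions. */
    public static Map<String, Set<String>> conflicts(Collection<String> entries) {
        Map<String, Set<String>> versions = new TreeMap<>();
        for (String entry : entries) {
            String file = entry.substring(entry.lastIndexOf('/') + 1); // drop any directory prefix
            Matcher m = JAR.matcher(file);
            if (m.matches()) {
                versions.computeIfAbsent(m.group(1), k -> new TreeSet<>()).add(m.group(2));
            }
        }
        versions.values().removeIf(v -> v.size() < 2);
        return versions;
    }

    public static void main(String[] args) {
        // Sample listing resembling the pre-fix layout; the paths are made up for the demo.
        Map<String, Set<String>> found = conflicts(Arrays.asList(
                "lib/jersey-client-1.9.jar",
                "lib/jersey-client-1.19.jar",
                "lib/jersey-json-1.19.jar"));
        System.out.println(found.isEmpty() ? "no conflicts" : "conflicts: " + found);
    }
}
```

Feeding it the lines printed by `tar tzf` should report nothing for a correctly built tarball.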

Manual — Trino extractor smoke test

Extract apache-atlas-*-trino-extractor.tar.gz; configure atlas.rest.address, Trino JDBC URL, namespace, and catalog in atlas-trino-extractor.properties.
Run extractor for a single Hive-backed table with ATLAS_USERNAME / ATLAS_PASSWORD set (no TTY required).
Confirm trino_column entity present via Atlas REST unique-attribute lookup (qualifiedName=hive.hr.trino_pii_hive_v2.ssn@dev).

Result: PASS — no MessageBodyWriter / MessageBodyReader errors; trino_* entities imported successfully.

Manual — Trino tag-auth E2E

Step	What was verified	Result
Metadata import	Extractor imports `trino_*` entities (not manual REST fallback)	PASS
Classification	PII tag applied on `trino_column` via Atlas REST	PASS
TagSync → Ranger	Tag mapping propagated to Trino service in Ranger Admin	PASS
Tag enforcement	Admin sees raw value; denied user blocked; masked user sees masked SSN	PASS
Audit	Trino access audit entries recorded in Ranger	PASS

Overall: PASS — extractor-first metadata path; tag-based deny/mask enforced on Trino queries.

CI — QuickStart IT

cd webapp && mvn -Pdist,embedded-solr-it test-compile jetty:stop jetty:deploy-war \
  failsafe:integration-test failsafe:verify -Dit.test=QuickStartIT

Result: PASS — QuickStartIT: 5 tests, 0 failures.

ramackri wants to merge 5 commits into apache:master from ramackri:ATLAS-5337
