ATLAS-5337: Fix Trino Extractor Jersey client failures for AtlasEntityWithExtInfo POST/GET#690

Open
ramackri wants to merge 5 commits into apache:master from ramackri:ATLAS-5337

ramackri commented Jul 4, 2026

Summary

Fix Trino Extractor standalone tarball failures when importing trino_* metadata into Atlas via AtlasClientV2. The bug has existed since ATLAS-5021 (PR #428, commit 4c49d3933, Sep 2025) and is not a regression from the Kafka 3.9.1 upgrade.

Also fixes flaky QuickStartIT / QuickStartV2IT CI failures caused by non-deterministic graph edge ordering (unrelated to the Jersey client changes in this PR).

Also fixes additional unrelated CI flakes in GlossaryServiceTest, BasicSearchIT, and the atlas-hbase docker smoke test.

Symptoms (before fix)

POST createEntity — MessageBodyWriter:

com.sun.jersey.api.client.ClientHandlerException:
  A message body writer for Java type, class
    org.apache.atlas.model.instance.AtlasEntity$AtlasEntityWithExtInfo,
  and MIME media type application/json; charset=UTF-8, was not found

GET getEntityByAttribute — MessageBodyReader (after POST fix):

ClientHandlerException:
  A message body reader for Java class
    org.apache.atlas.model.instance.AtlasEntity$AtlasEntityWithExtInfo,
  and MIME media type application/json; charset=utf-8 was not found

Non-interactive auth: 401 Unauthorized when AuthenticationUtil could not read from System.console() (CI / scripted runs).

Workaround used before fix: manual Atlas REST import (curl) — bypasses AtlasClientV2.createEntity() entirely.

Root cause

  1. jersey-client 1.9 pin in addons/trino-extractor/pom.xml conflicted with atlas-client-v2 (1.19) → both jersey-client-1.9.jar and jersey-client-1.19.jar in distro lib/.
  2. Entity POST APIs passed Java objects to Jersey; works on full server / curated bridge classpaths via POJO mapping, fails in minimal extractor tarball.
  3. Entity GET APIs could not deserialize AtlasEntity$AtlasEntityWithExtInfo (inner class) via Jersey POJO mapping in standalone lib/ layout.
  4. AuthenticationUtil only read credentials from System.console().

Type-def APIs already used AtlasType.toJson(); entity APIs did not — that asymmetry is why type defs could work while entity POSTs failed.

Why undetected until now: TrinoExtractorIT.java is a placeholder with no @Test methods; tarball path never exercised in CI. Most deployments use Hive hook or manual REST import.

Not caused by Kafka 3.9.1: Dependabot commit 6709f6459 only changed <kafka.version> in root pom.xml. Trino extractor has no Kafka client dependency.

Changes

| File | Change |
| --- | --- |
| addons/trino-extractor/pom.xml | Remove explicit jersey-client 1.9 pin; inherit jersey.version 1.19 from parent |
| client/client-v2/.../AtlasClientV2.java | Entity mutation APIs send JSON via AtlasType.toJson(): createEntity, createEntities, updateEntity, updateEntities, updateEntityByAttribute |
| client/common/.../AtlasBaseClient.java | For org.apache.atlas.model.* response types, read body as String and parse with AtlasJson.fromJson() |
| intg/.../AuthenticationUtil.java | Support ATLAS_USERNAME / ATLAS_PASSWORD env vars before console prompt |
| webapp/.../QuickStartIT.java | Fix flaky assertions: find time_id column by name; check process inputs by GUID set (not get(0) order) |
| webapp/.../QuickStartV2IT.java | Fix flaky testProcessIsAdded: assert input GUIDs via set membership (not list index) |
| repository/.../GlossaryServiceTest.java | Retry loadAllModels / createTypesDef on transient JanusGraph errors during @BeforeClass setup |
| webapp/.../BasicSearchIT.java | Scope hive_table / hive_column searches to imported @cl1 dataset to avoid suite pollution |
| dev-support/atlas-docker/scripts/atlas-hbase.sh | Poll up to 120s for HMaster PID before tail --pid; exit 1 if never found (fixes exited (0) race) |
| dev-support/atlas-docker/docker-compose.atlas-hbase.yml | Dual healthcheck: RegionServer 16030/rs-status and HMaster 16010/master-status; add restart: unless-stopped |
| .github/workflows/ci.yml | On container check failure, dump docker logs --tail 200 for all expected containers before teardown |
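
The AuthenticationUtil change (environment variables first, console prompt second) can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the class name `CredentialResolver`, its method names, the prompt strings, and the decision to throw when no console is available are all assumptions made for the example. Only the variable names ATLAS_USERNAME and ATLAS_PASSWORD come from the description above.

```java
import java.io.Console;
import java.util.Map;

/** Hypothetical helper for illustration; not the AuthenticationUtil code from this PR. */
final class CredentialResolver {
    private CredentialResolver() { }

    /**
     * Returns {username, password} when both ATLAS_USERNAME and ATLAS_PASSWORD
     * are set and non-empty in the given environment map, otherwise null.
     */
    static String[] fromEnvironment(Map<String, String> env) {
        String user = env.get("ATLAS_USERNAME");
        String pass = env.get("ATLAS_PASSWORD");
        if (user == null || user.isEmpty() || pass == null || pass.isEmpty()) {
            return null;
        }
        return new String[] {user, pass};
    }

    /** Tries the environment first, then the console; fails loudly if neither works. */
    static String[] resolve(Map<String, String> env, Console console) {
        String[] fromEnv = fromEnvironment(env);
        if (fromEnv != null) {
            return fromEnv;
        }
        if (console == null) {
            // System.console() is null in CI and other non-interactive runs.
            // Throwing here is an assumption of this sketch, not confirmed PR behaviour.
            throw new IllegalStateException(
                    "No console available; set ATLAS_USERNAME and ATLAS_PASSWORD");
        }
        String user = console.readLine("Username: ");
        char[] pass = console.readPassword("Password: ");
        return new String[] {user, pass == null ? "" : new String(pass)};
    }
}
```

A caller would typically invoke `CredentialResolver.resolve(System.getenv(), System.console())`. Passing the environment as a map keeps the lookup testable without touching real process state.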

Was AtlasClientV2.createEntity(object) working before?

Yes — in most Atlas deployments. Bridges (Hive, Kafka, HBase), webapp ITs (EntityV2JerseyResourceIT), and the sample app used object-passing successfully where Jersey POJO mapping had a clean classpath (bridge tarballs curate lib/ with jersey-json + jersey-client 1.19).

No — for the Trino extractor tarball. Same client code, but copy-dependencies + jersey-client 1.9 pin produced an inconsistent lib/ layout.

Pre-serializing with AtlasType.toJson() produces the same wire JSON and aligns entity APIs with type-def APIs — no behaviour change for working callers.

QuickStart IT flakiness (CI fix)

Why this is unrelated to the Jersey client changes

This PR modifies AtlasClientV2, AtlasBaseClient (v2 model JSON parsing), and the Trino extractor classpath. The failing QuickStart tests do not exercise that code path:

| Test class | Client used | Touches changed Jersey code? |
| --- | --- | --- |
| QuickStartIT | AtlasClientV1 (atlasClientV1.getEntity(...)) | No: v1 Referenceable API, not AtlasClientV2 |
| QuickStartV2IT | AtlasClientV2 for reads only in testProcessIsAdded | No: reads existing entities; does not call the changed createEntity / JSON-serialization paths |

On the same CI run that failed QuickStartIT, AtlasClientV2Test (125 tests) passed. The QuickStart failure is a pre-existing brittle assertion, not a regression from ATLAS-5337.

Observed CI failures

| Date | Test | Failure |
| --- | --- | --- |
| 2026-07-04 (docker-build Java 17) | QuickStartIT.testTablesAreAdded | expected [time_id] but found [customer_id] at verifyColumnsAreAddedToTable:143 |
| 2026-06-30 / 2026-06-24 (Java 17) | QuickStartV2IT.testProcessIsAdded | expected [guid-A] but found [guid-B] at testProcessIsAdded:122 |

Both are order-assumption failures: the test expected a specific element at list index 0, but Atlas returned a different (equally valid) element first.

Root cause: non-deterministic relationship list order

When a table entity is read back from Atlas, its columns attribute and a process entity's inputs attribute are lists of related entities built from graph edges. The order of those lists depends on graph traversal order.

Atlas can persist a stable order by setting ATTRIBUTE_INDEX_PROPERTY_KEY on relationship edges (see EntityGraphMapper, GraphHelper). That index is written when entities are created through the v2 store with explicit ordering. The QuickStart example programs (QuickStart.runQuickstart / QuickStartV2.runQuickstart) create metadata via the v1 client and do not set relationship indexes. As a result, column and input edge order is implementation-dependent and can vary between JVM versions, JanusGraph traversal order, or CI runs.

This is not a data correctness bug — all four columns and both process inputs are present — only the list position is unstable.

What each test assumed (before fix)

QuickStartIT (v1 API)

  1. verifyColumnsAreAddedToTable — called columns.get(0) and asserted its name was time_id. On failure, index 0 held customer_id instead. The sales_fact table is created with four columns (time_id, product_id, customer_id, sale_price); all four were present (assertEquals(columns.size(), 4) passed), but first-element identity varied.

  2. testProcessIsAdded — asserted inputs.get(0) was the sales_fact table GUID and inputs.get(1) was the time_dim table GUID. The process genuinely has two inputs; only their order in the returned list was unstable.

QuickStartV2IT (v2 API)

  1. verifyColumnsAreAddedToTable — already order-independent (checks count == 4 and each column GUID is a valid UUID). No change needed.

  2. testProcessIsAdded — same index-0 / index-1 assumption as v1, using ((Map) inputs.get(0)).get("guid").

Notably, QuickStartIT.testLineageIsMaintained already handled this correctly — it uses assertTrue(salesFactTableId.equals(i1) || salesFactTableId.equals(i2)) rather than assuming a fixed slot. The column and process tests now follow the same pattern.

Fix applied

Assertions now verify membership and attributes, not list position:

| Location | Before | After |
| --- | --- | --- |
| QuickStartIT.verifyColumnsAreAddedToTable | columns.get(0).get("name") == "time_id" | Stream/filter to find column where name == time_id; assert dataType == "int" |
| QuickStartIT.testProcessIsAdded | inputs.get(0) / inputs.get(1) exact GUID match | Collect input GUIDs into a Set; assertTrue both expected table GUIDs are present |
| QuickStartV2IT.testProcessIsAdded | inputs.get(0) / inputs.get(1) exact GUID match | Same set-membership check on relationship-attribute GUIDs |

Single-output assertions (outputs.get(0)) are unchanged — each process has exactly one output, so index 0 is unambiguous.
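
The membership-based pattern can be sketched as below. This is an illustration only, assuming entities and input references are available as `Map<String, Object>` values with `name`, `dataType`, and `guid` keys; the class `OrderIndependentChecks` and its methods are hypothetical and are not the helpers used in QuickStartIT or QuickStartV2IT.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Hypothetical helpers showing order-independent checks; not the PR's test code. */
final class OrderIndependentChecks {
    private OrderIndependentChecks() { }

    /** Finds an entity by its "name" value, wherever it sits in the list. */
    static Optional<Map<String, Object>> findByName(List<Map<String, Object>> entities, String name) {
        return entities.stream()
                .filter(e -> name.equals(e.get("name")))
                .findFirst();
    }

    /** Collects the "guid" value of every reference into a set, discarding order. */
    static Set<String> guidsOf(List<Map<String, Object>> refs) {
        Set<String> guids = new HashSet<>();
        for (Map<String, Object> ref : refs) {
            guids.add(String.valueOf(ref.get("guid")));
        }
        return guids;
    }

    /** True when every expected GUID appears somewhere in the reference list. */
    static boolean hasAllGuids(List<Map<String, Object>> refs, String... expected) {
        return guidsOf(refs).containsAll(Arrays.asList(expected));
    }
}
```

In a test, `findByName(columns, "time_id")` would replace `columns.get(0)`, and `hasAllGuids(inputs, salesFactGuid, timeDimGuid)` would replace the two index-based comparisons, so the assertion holds for any edge ordering.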

Why this surfaced on Java 17 CI

Commit 488bdaae8 (ATLAS-5002) added a CI matrix leg docker-build (17) alongside Java 8. The QuickStart ITs are long-standing; the Java 17 leg runs the full embedded Jetty + Solr integration suite (mvn -Pdist,embedded-solr-it verify) and exposed intermittent ordering differences that the Java 8 leg had not surfaced consistently. The failures predate this PR and appear on upstream master runs as well.

Additional CI flake fixes (unrelated to Jersey / QuickStart)

These failures blocked CI on PR #690 but are not caused by the Trino extractor client changes.

Observed CI failures

| Run | Job | Failure | Related to ATLAS-5337? |
| --- | --- | --- | --- |
| 28730263409 | docker-build (8) | GlossaryServiceTest.setupSampleGlossary: JanusGraph Could not start new transaction / Cursor has been closed at line 166 | No |
| 28732248560 | docker-build (17) | atlas-hbase exited (0) during docker compose --wait after Maven passed | No |
| 28735146924 | docker-build (8) | BasicSearchIT.testDiscoveryWithSearchParameters: expected [3] but found [4] at line 148 | No |

1. GlossaryServiceTest — retry transient JanusGraph setup errors

What this test is. GlossaryServiceTest is a repository-layer unit test (not a webapp IT). It runs with Guice + TestModules.TestOnlyModule, which boots an embedded JanusGraph + Solr in-process — the same backend style used by many other repository tests.

What @BeforeClass does. Before any @Test runs, setupSampleGlossary() must:

  1. Load Atlas type models from addons/models/0000-Area0 via TestLoadModelUtils.loadAllModels(...)
  2. Create a one-off classification typedef TestClassification via typeDefStore.createTypesDef(...) — this is where CI failed (old line 166)
  3. Build in-memory glossary/category/term fixtures for the rest of the class

CI failure. On run 28730263409 (Java 8), setup failed with JanusGraph errors Could not start new transaction and Cursor has been closed. These are backend timing errors, not assertion failures — the class never reached any glossary test logic.

Why it flakes.

| Factor | Effect |
| --- | --- |
| Embedded graph in CI | JanusGraph runs in the same JVM as hundreds of other repository tests during mvn verify |
| @BeforeClass is all-or-nothing | One failed createTypesDef skips the entire class |
| CI resource pressure | Graph transactions can fail transiently under load |
| No retry before | A single transient error caused a hard failure |

Atlas production code already retries similar graph errors (EntityConsumer.commitWithRetry, AsyncImportTaskExecutor locking retries). Test setup lacked the same protection.

Fix applied. loadAllModelsWithRetry() and createTestClassificationWithRetry() wrap the two fragile setup steps:

  • Up to 5 attempts
  • Backoff: 500ms × attempt (500ms, 1s, 1.5s, 2s, 2.5s)
  • Retries only transient errors: could not start new transaction, cursor has been closed, permanentlockingexception
  • Permanent errors (bad typedef, validation) fail immediately — no masking of real bugs
  • After 5 failures, still throws SkipException with the error message
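
The retry policy in the list above can be sketched as a small generic helper. This is not the code of loadAllModelsWithRetry() or createTestClassificationWithRetry(); the class `TransientRetry`, the choice to match markers against both the exception class name and message along the cause chain, and rethrowing the last error (rather than TestNG's SkipException) are assumptions of this sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Supplier;

/** Hypothetical retry helper for illustration; not the PR's test-setup code. */
final class TransientRetry {
    // Marker substrings taken from the PR description, compared in lower case.
    private static final List<String> TRANSIENT_MARKERS = Arrays.asList(
            "could not start new transaction",
            "cursor has been closed",
            "permanentlockingexception");

    private TransientRetry() { }

    /** True if any exception in the cause chain mentions a transient marker. */
    static boolean isTransient(Throwable error) {
        for (Throwable cur = error; cur != null; cur = cur.getCause()) {
            String text = (cur.getClass().getName() + " " + cur.getMessage())
                    .toLowerCase(Locale.ROOT);
            for (String marker : TRANSIENT_MARKERS) {
                if (text.contains(marker)) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Runs the action, retrying transient failures with a delay of baseDelayMs x attempt. */
    static <T> T callWithRetry(Supplier<T> action, int maxAttempts, long baseDelayMs) {
        for (int attempt = 1; ; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                // Permanent errors and the final failed attempt propagate immediately.
                if (!isTransient(e) || attempt >= maxAttempts) {
                    throw e;
                }
                sleep(baseDelayMs * attempt);
            }
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting to retry", ie);
        }
    }
}
```

With `maxAttempts = 5` and `baseDelayMs = 500`, this sketch sleeps only between attempts, so it waits at most four times (500ms, 1s, 1.5s, 2s) before the fifth attempt's error propagates.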

Why this is safe. Retries apply only to @BeforeClass setup, not test assertions. Creating TestClassification twice would surface a real duplicate-type error, not a silent pass. No glossary business logic is changed.

2. BasicSearchIT — scope hive searches to imported @cl1 dataset

What this test is. BasicSearchIT is a webapp integration test against a live Atlas server (embedded Jetty + Solr) shared by the entire IT suite (~202 tests in one mvn verify run).

In @BeforeClass setUp() it imports hive-db-50-tables.zip, creates an hdfs_path entity for special-character search cases, then sleeps 5s for Solr indexing. Tests then run search cases from JSON fixtures (e.g. entity-filters.json) with hard-coded expectedCount values.

What the fixture contains. Despite the filename, hive-db-50-tables.zip is a small replication export with exactly three hive_table entities whose names contain testtable:

| Name | qualifiedName |
| --- | --- |
| testtable_0 | default.testtable_0@cl1 |
| testtable_1 | default.testtable_1@cl1 |
| testtable_3 | default.testtable_3@cl1 |

(testtable_2 is intentionally absent — the fixture is curated, not a 0–49 sequence.)

The @cl1 suffix is a replication cluster marker from the export (cl1 = source server name in replication metadata). Every entity in this zip carries it.

What the test searches for. The first case in entity-filters.json:

typeName = hive_table
entityFilters: name contains "testtable"
expectedCount: 3

That count is correct for the imported fixture — not for the entire Atlas graph.

CI failure. On run 28735146924 (Java 8):

BasicSearchIT.testDiscoveryWithSearchParameters — expected [3] but found [4] at line 148

Root cause: shared Atlas instance + global search. All IT classes share one Atlas process for the full mvn verify run. Other ITs (EntityJerseyResourceIT, EntityV2JerseyResourceIT, EntityNotificationIT, etc.) create hive_table entities during the suite. Any extra table whose name contains the substring testtable matches the search — even if it was not part of hive-db-50-tables.zip. That polluter almost certainly has a different qualifiedName (no @cl1 suffix) because live ITs create primary-cluster entities, not replication exports.

The test was accidentally asserting global graph state when it meant to assert fixture state.

Why @cl1 is the right filter.

| Entity source | Typical qualifiedName | Matches name contains testtable? | Matches qualifiedName contains @cl1? |
| --- | --- | --- | --- |
| hive-db-50-tables.zip | default.testtable_0@cl1 | Yes | Yes |
| Live IT-created tables | db.table@primary or random | Sometimes | No |

Fix applied. scopeToImportedDataset() AND-combines the JSON filter with qualifiedName contains @cl1 for hive_table and hive_column searches:

typeName = hive_table
AND qualifiedName contains "@cl1"    ← added by fix
AND name contains "testtable"          ← from JSON fixture
→ 3 fixture tables only

Applied in testDiscoveryWithSearchParameters, testAttributeSearch, and testSavedSearch (so saved-search execute/update tests stay consistent). Not applied to hdfs_path searches, negative validation tests, or quick-search tests.
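
The effect of the AND-combined filter can be modelled in a few lines. The sketch below does not use Atlas's SearchParameters API; it represents entities with a hypothetical `Entity` class and filters as `java.util.function.Predicate` values, purely to show how scoping by `@cl1` excludes a polluting table. The method name `scopeToImportedDataset` is borrowed from the description above, but its signature here is an assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/** Hypothetical in-memory model of the scoped search; not the BasicSearchIT code. */
final class ScopedSearchSketch {
    /** Minimal stand-in for an Atlas entity header. */
    static final class Entity {
        final String typeName;
        final String name;
        final String qualifiedName;

        Entity(String typeName, String name, String qualifiedName) {
            this.typeName = typeName;
            this.name = name;
            this.qualifiedName = qualifiedName;
        }
    }

    // Only hive searches are scoped; other types keep their original filter.
    private static final Set<String> SCOPED_TYPES =
            new HashSet<>(Arrays.asList("hive_table", "hive_column"));

    private ScopedSearchSketch() { }

    /** AND-combines the fixture filter with "qualifiedName contains @cl1" for hive types. */
    static Predicate<Entity> scopeToImportedDataset(String typeName, Predicate<Entity> fixtureFilter) {
        if (!SCOPED_TYPES.contains(typeName)) {
            return fixtureFilter;
        }
        Predicate<Entity> importedOnly = e -> e.qualifiedName.contains("@cl1");
        return importedOnly.and(fixtureFilter);
    }

    /** Simulates a typed search over the whole shared graph. */
    static List<Entity> search(List<Entity> graph, String typeName, Predicate<Entity> filter) {
        return graph.stream()
                .filter(e -> e.typeName.equals(typeName))
                .filter(filter)
                .collect(Collectors.toList());
    }
}
```

Given the three fixture tables plus one table created by another IT whose name also contains testtable, the unscoped filter matches four entities and the scoped filter matches three, which is the count the JSON fixture expects.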

Example — first failing case.

| | Filter | Result |
| --- | --- | --- |
| Before | name contains "testtable" only | 3 fixture + 1 polluter = 4 |
| After | qualifiedName contains "@cl1" AND name contains "testtable" | 3 |

Sort assertion (testtable_3 first in DESC order) still holds — all three fixture tables remain in the result set.

Why not other approaches?

| Alternative | Why not used |
| --- | --- |
| Change expectedCount to 4 | Masks the issue; count would vary run-to-run |
| Isolate BasicSearchIT in its own Jetty instance | Heavy CI change; slower |
| Rewrite all JSON fixtures | Large diff; one helper is sufficient |
| Delete polluting entities in @BeforeClass | Fragile; doesn't know which GUIDs to delete |

3. atlas-hbase docker stack + CI diagnostics

What this is. After mvn verify succeeds, the docker-build job brings up the full Atlas docker smoke stack (dev-support/atlas-docker). With default .env (ATLAS_BACKEND=hbase), the atlas service depends on atlas-backendhbase, which extends docker-compose.atlas-hbase.yml. The atlas-hbase container runs HBase + the Atlas HBase hook and must stay up for the post-build container health check in .github/workflows/ci.yml.

CI failure. On run 28732248560 (Java 17), Maven passed but docker compose ... up -d --wait failed because atlas-hbase exited with code 0 — a clean exit, not a crash. The job then reported the container as not running in the manual container-status step.

Root cause: startup race in atlas-hbase.sh. The container entrypoint starts HBase, then immediately snapshots the HMaster PID and runs tail --pid=$HBASE_MASTER_PID -f /dev/null to keep the container alive:

start-hbase.sh  →  ps/grep HMaster (single shot)  →  tail --pid=$PID

start-hbase.sh can return before HMaster is visible in ps. On a slow CI host, the one-shot grep can then return an empty PID. tail --pid= exits immediately (exit 0), the entrypoint script ends, and Docker marks the container exited (0). From the outside this looks like a spurious flake: HBase may still be starting, or HMaster may never have been pinned as the watched process.

A second weakness was in docker-compose.atlas-hbase.yml healthcheck. The old check only probed the RegionServer status page:

http://localhost:16030/rs-status

RegionServer can report healthy while HMaster is still starting or has already died. docker compose --wait could pass (or flap) on RS alone even though the process the entrypoint watches (HMaster) was not stable — or the container had already exited because of the empty-PID race.

Fix applied — three files, one goal: keep HBase alive and report true readiness.

dev-support/atlas-docker/scripts/atlas-hbase.sh

| Before | After |
| --- | --- |
| Single ps/grep HMaster check immediately after start-hbase.sh | Poll up to 60 attempts × 2s = 120s for HMaster PID |
| Empty PID → tail --pid= → container exits 0 | If PID still empty after 120s → log HBase HMaster failed to start and exit 1 (explicit failure, not silent exit) |
| No visibility into slow starts | Gives HBase time to finish SSH/setup and bring HMaster up on loaded CI VMs |

The tail --pid=$HBASE_MASTER_PID keep-alive pattern is unchanged once a valid PID is found — we only fixed the race before that line.
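
The actual fix is a shell loop in atlas-hbase.sh, which is not reproduced here. Purely to illustrate the control flow (bounded polling, then an explicit failure instead of a silent exit), the following Java sketch models it. The class `PidPoller`, its method, and the `Supplier` used as a stand-in for the ps/grep lookup are hypothetical.

```java
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical model of the bounded PID poll; the real fix is a shell loop. */
final class PidPoller {
    private PidPoller() { }

    /**
     * Calls the lookup up to maxAttempts times, waiting intervalMs between tries.
     * Returns the PID as soon as it is found, or an empty Optional if it never
     * appears, so the caller can fail explicitly (the script exits 1).
     */
    static Optional<Long> pollForPid(Supplier<Optional<Long>> lookup, int maxAttempts, long intervalMs) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Optional<Long> pid = lookup.get();
            if (pid.isPresent()) {
                return pid;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(intervalMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return Optional.empty();
                }
            }
        }
        return Optional.empty();
    }
}
```

With `maxAttempts = 60` and `intervalMs = 2000` this corresponds to the roughly 120s budget described above. An empty result maps to the explicit failure path rather than to watching an empty PID.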

dev-support/atlas-docker/docker-compose.atlas-hbase.yml

| Setting | Before | After | Why |
| --- | --- | --- | --- |
| healthcheck.test | wget 16030/rs-status only (RegionServer) | CMD-SHELL checks both 16030/rs-status and 16010/master-status | Compose --wait now requires RS and HMaster HTTP status endpoints |
| restart | (none) | restart: unless-stopped | Transient startup failure or brief HMaster exit triggers Docker restart instead of leaving the stack down |
| interval / timeout / retries / start_period | 30s / 10s / 30 / 40s | unchanged | ~15 min worst-case health wait budget; sufficient for first HBase start on cold CI |

Port reference:

| Port | Endpoint | Component |
| --- | --- | --- |
| 16010 | /master-status | HMaster |
| 16030 | /rs-status | RegionServer |

.github/workflows/ci.yml: container failure diagnostics

The Check status of containers and remove them step already iterated the expected container list (atlas-zk, atlas-solr, atlas-kafka, atlas-db, atlas-hadoop, atlas-hbase, atlas-hive, atlas) and failed the job if any were not running. On failure it previously only printed which container was down, then stopped/removed everything — no logs.

Change: when any container is not running, dump docker logs --tail 200 for every expected container before teardown:

for container in "${containers[@]}"; do
    docker logs --tail 200 "$container" 2>&1 || true
done

This does not change pass/fail behaviour; it makes the next atlas-hbase exited (0) flake self-explanatory in the Actions log (HMaster wait timeout, RS/Master healthcheck failure, Hadoop dependency, etc.) without reproducing locally.

Why these fixes are safe.

| Concern | Answer |
| --- | --- |
| Longer CI wait? | Only when HBase is genuinely slow; 120s PID wait runs inside the container entrypoint, parallel to the existing 40s start_period + health retries |
| restart: unless-stopped masks real bugs? | Persistent HMaster failure still fails healthcheck and fails --wait; restart helps transient races only |
| Dual healthcheck too strict? | Both endpoints are standard HBase admin URLs already used for ops; aligns health with what atlas-hbase.sh actually watches |
| Related to ATLAS-5337 Jersey changes? | No. Docker/CI-only; the HBase hook tarball is copied in an earlier CI step, but this failure is container lifecycle, not client classpath |

Testing

Build

mvn -pl addons/trino-extractor,client/client-v2,client/common,intg -am package -DskipTests
mvn -pl addons/trino-extractor,distro -am package -DskipTests -Pdist

Confirm tarball ships only jersey-client-1.19.jar (not 1.9):

tar tzf distro/target/apache-atlas-*-trino-extractor.tar.gz | grep jersey-client
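
The same check can be generalised to any artifact that ships in more than one version. The sketch below is not part of the PR: `DuplicateJarCheck` is a hypothetical helper that takes a list of jar paths (for example, the lines printed by tar tzf) and reports artifacts with more than one version. The file name pattern `<artifact>-<version>.jar`, with the version starting at the first dash followed by a digit, is an assumption that may not hold for every jar.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical classpath check for illustration; not part of the Atlas build. */
final class DuplicateJarCheck {
    // Assumes names of the form <artifact>-<version>.jar where the version starts with a digit.
    private static final Pattern JAR = Pattern.compile("^(.+?)-(\\d[\\w.\\-]*)\\.jar$");

    private DuplicateJarCheck() { }

    /** Maps each artifact that appears in two or more versions to its set of versions. */
    static Map<String, Set<String>> duplicateArtifacts(List<String> jarPaths) {
        Map<String, Set<String>> versions = new TreeMap<>();
        for (String path : jarPaths) {
            // Drop any directory prefix such as lib/ from archive listings.
            String file = path.substring(path.lastIndexOf('/') + 1);
            Matcher m = JAR.matcher(file);
            if (m.matches()) {
                versions.computeIfAbsent(m.group(1), k -> new TreeSet<>()).add(m.group(2));
            }
        }
        // Keep only artifacts present in more than one version.
        versions.values().removeIf(v -> v.size() < 2);
        return versions;
    }
}
```

For a listing that contains both jersey-client-1.9.jar and jersey-client-1.19.jar, the result has a single entry for jersey-client with both versions; after the fix the map should be empty for that artifact.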

Manual — Trino extractor smoke test

  1. Extract apache-atlas-*-trino-extractor.tar.gz; configure atlas.rest.address, Trino JDBC URL, namespace, and catalog in atlas-trino-extractor.properties.
  2. Run extractor for a single Hive-backed table with ATLAS_USERNAME / ATLAS_PASSWORD set (no TTY required).
  3. Confirm trino_column entity present via Atlas REST unique-attribute lookup (qualifiedName=hive.hr.trino_pii_hive_v2.ssn@dev).

Result: PASS — no MessageBodyWriter / MessageBodyReader errors; trino_* entities imported successfully.

Manual — Trino tag-auth E2E

| Step | What was verified | Result |
| --- | --- | --- |
| Metadata import | Extractor imports trino_* entities (not manual REST fallback) | PASS |
| Classification | PII tag applied on trino_column via Atlas REST | PASS |
| TagSync → Ranger | Tag mapping propagated to Trino service in Ranger Admin | PASS |
| Tag enforcement | Admin sees raw value; denied user blocked; masked user sees masked SSN | PASS |
| Audit | Trino access audit entries recorded in Ranger | PASS |

Overall: PASS — extractor-first metadata path; tag-based deny/mask enforced on Trino queries.

CI — QuickStart IT

cd webapp && mvn -Pdist,embedded-solr-it test-compile jetty:stop jetty:deploy-war \
  failsafe:integration-test failsafe:verify -Dit.test=QuickStartIT

Result: PASS. QuickStartIT: 5 tests, 0 failures.

Related

ramk added 5 commits July 4, 2026 21:54
…yWithExtInfo

Remove jersey-client 1.9 pin from trino-extractor; JSON-serialize entity POSTs
via AtlasType.toJson() in AtlasClientV2; parse model GET responses with
AtlasJson in AtlasBaseClient; support ATLAS_USERNAME/ATLAS_PASSWORD env vars
in AuthenticationUtil for non-interactive runs.
QuickStartIT and QuickStartV2IT assumed fixed column/process input ordering from graph traversal, which fails intermittently on Java 17 CI.
…docker.

Retry transient JanusGraph errors during glossary test setup, scope BasicSearchIT
hive queries to the imported @cl1 dataset, and harden atlas-hbase container startup
and health checks so docker compose --wait does not fail spuriously.