
feat(test): add Spark integration test infrastructure for Hadoop catalog #968

Merged
laskoviymishka merged 2 commits into apache:main from tanmayrauth:feat/hadoop-integration-infra on May 5, 2026

Conversation

tanmayrauth (Contributor) commented May 1, 2026:

Set up Docker and Spark infrastructure for Hadoop catalog cross-compatibility testing with Java's HadoopCatalog.

  • Add hadoop_validation.py: SparkSession configured with spark.sql.catalog.hadoop_test (type=hadoop, warehouse=/home/iceberg/hadoop-warehouse); a sketch follows below
  • Add shared volume mount in docker-compose.yml: /tmp/iceberg-hadoop-warehouse (host) <-> /home/iceberg/hadoop-warehouse (Spark)
  • Copy hadoop_validation.py into Spark container via Dockerfile
  • Add make integration-hadoop target

Depends on: #963
Relates to #798
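
For context, a minimal sketch of the SparkSession configuration the first bullet describes. The app name and table name are placeholders, and the actual hadoop_validation.py may differ; it also assumes the Iceberg Spark runtime jar is on the classpath.

```python
# Hypothetical sketch of the catalog wiring described above; not the PR's
# exact script. Assumes the Iceberg Spark runtime jar is available.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hadoop-catalog-validation")  # placeholder app name
    # Register an Iceberg Hadoop catalog named "hadoop_test" backed by the
    # warehouse directory mounted into the Spark container.
    .config("spark.sql.catalog.hadoop_test", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.hadoop_test.type", "hadoop")
    .config("spark.sql.catalog.hadoop_test.warehouse", "/home/iceberg/hadoop-warehouse")
    .getOrCreate()
)

# Read back a table the Go side is expected to write into the shared
# warehouse; the table name is a placeholder.
spark.sql("SELECT * FROM hadoop_test.default.test_table").show()
```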

@tanmayrauth tanmayrauth requested a review from zeroshade as a code owner May 1, 2026 21:54
Commit 1 of 2:

Set up Docker and Spark infrastructure for Hadoop catalog
cross-compatibility testing with Java's HadoopCatalog.

- Add hadoop_validation.py: SparkSession configured with
  spark.sql.catalog.hadoop_test (type=hadoop, warehouse=/home/iceberg/hadoop-warehouse)
- Add shared volume mount in docker-compose.yml:
  /tmp/iceberg-hadoop-warehouse (host) <-> /home/iceberg/hadoop-warehouse (Spark)
- Copy hadoop_validation.py into Spark container via Dockerfile
- Add make integration-hadoop target

No Go code — purely infrastructure so subsequent PRs can add
integration test cases that validate Go ↔ Spark interop.

Depends-on: nothing (parallel with PR 1)
Depended-on-by: PRs 4, 5, 6 (integration test cases)
tanmayrauth (Contributor, Author) commented:

@laskoviymishka @zeroshade can you please review this PR?

laskoviymishka (Contributor) left a comment:

I think this PR needs a bit more, mostly around making the integration setup actually usable and reliable outside CI.

The biggest issue is the warehouse path contract. The PR description says the host path is /tmp/iceberg-hadoop-warehouse, but docker-compose.yml mounts ./hadoop-warehouse into a different path inside the container. For HadoopCatalog round-trip tests, Go-on-host and Spark-in-container need to agree on the same physical path. I’d prefer matching the existing Hive pattern and using the same absolute path on both sides, e.g.:

/tmp/iceberg-hadoop-warehouse:/tmp/iceberg-hadoop-warehouse

That should also be the path documented in the PR/test config.

A few related DevEx/test reliability things would make this much easier to run locally:

  • Add cleanup/reset behavior. HadoopCatalog persists state on disk, so repeated local runs should not fail with “table already exists” (one possible reset approach is sketched after this list).
  • Avoid CI-only setup logic. Today CI computes env vars via docker inspect / docker ps, but the Makefile does not, so local make integration-setup && make integration-hadoop can behave differently from CI.
  • Replace fixed sleep with compose healthchecks / readiness checks where possible.
  • Add local helper targets like integration-down and integration-logs.
  • Make sure the Spark validation asserts a real round-trip result, not just “Spark did not crash.”
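
On the cleanup point, a minimal sketch of one way to make reruns idempotent from the Spark side. The table name and schema are placeholders, and the catalog config mirrors the /tmp path proposed above; a clean target that wipes the warehouse directory on the host would achieve the same thing.

```python
# Hypothetical reset sketch for repeated local runs; placeholders throughout.
# HadoopCatalog persists tables under the warehouse directory, so dropping
# with PURGE before re-creating gives each run a clean slate.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.catalog.hadoop_test", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.hadoop_test.type", "hadoop")
    .config("spark.sql.catalog.hadoop_test.warehouse", "/tmp/iceberg-hadoop-warehouse")
    .getOrCreate()
)

# PURGE also removes the data/metadata files under the warehouse path.
spark.sql("DROP TABLE IF EXISTS hadoop_test.default.test_table PURGE")
spark.sql("CREATE TABLE hadoop_test.default.test_table (id BIGINT, data STRING) USING iceberg")
```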

Longer term, I think testcontainers-go compose support would be a good fit here: keep docker-compose.yml as the source of truth, but let Go tests own lifecycle, ports, waits, and cleanup.

I would not block this PR on that refactor, but I’d at least fix the warehouse path mismatch now so a future testcontainers migration does not inherit the same ambiguity.

zeroshade (Member) commented:

I agree with @laskoviymishka's comments. Let's get the DevEx stuff in here before we merge this.

Commit 2 of 2:

Use same absolute path /tmp/iceberg-hadoop-warehouse on host and
container (matching Hive pattern). Replace sleep-based setup with
docker compose --wait + healthcheck. Add integration-down, logs,
env, and hadoop-clean helper targets. Add --assert-rows to
hadoop_validation.py for meaningful round-trip verification.
tanmayrauth (Contributor, Author) commented:

Addressed all points:

(1) Fixed the warehouse path mismatch: now using /tmp/iceberg-hadoop-warehouse:/tmp/iceberg-hadoop-warehouse (the same absolute path on both sides, matching the Hive pattern).
(2) Added make integration-hadoop-clean for resetting state between local runs.
(3) Replaced sleep 10 with docker compose up --wait plus a healthcheck on spark-iceberg, so local and CI behave identically.
(4) Added make integration-down, integration-logs, and integration-env helper targets.
(5) hadoop_validation.py now supports --assert-rows to verify actual query results, not just "Spark didn't crash" (sketched below).

Agreed on testcontainers-go as a good future direction; these changes should make that migration straightforward, since the path contract is now unambiguous.
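
For reference, --assert-rows could look something like this; the flag wiring and table name are illustrative assumptions, not the PR's exact code.

```python
# Hypothetical sketch of an --assert-rows check; illustrative only.
import argparse
import sys

from pyspark.sql import SparkSession

parser = argparse.ArgumentParser()
parser.add_argument("--assert-rows", type=int, default=None,
                    help="fail unless the table holds exactly this many rows")
args = parser.parse_args()

spark = (
    SparkSession.builder
    .config("spark.sql.catalog.hadoop_test", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.hadoop_test.type", "hadoop")
    .config("spark.sql.catalog.hadoop_test.warehouse", "/tmp/iceberg-hadoop-warehouse")
    .getOrCreate()
)

# COUNT(*) returns a single row with a single column.
rows = spark.sql("SELECT COUNT(*) FROM hadoop_test.default.test_table").collect()[0][0]
if args.assert_rows is not None and rows != args.assert_rows:
    print(f"round-trip failed: expected {args.assert_rows} rows, got {rows}", file=sys.stderr)
    sys.exit(1)
print(f"validation OK: {rows} rows")
```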

laskoviymishka (Contributor) left a comment:

LGTM!

There are still some gaps in local runs (outside of CI); integration-env is a good mitigation, so I think we can merge this.

@laskoviymishka laskoviymishka merged commit 330fcdf into apache:main May 5, 2026
14 checks passed