
feat(test): add Spark integration test infrastructure for Hadoop catalog #968

Merged
laskoviymishka merged 2 commits into apache:main from tanmayrauth:feat/hadoop-integration-infra on May 5, 2026

Conversation

tanmayrauth (Contributor) commented May 1, 2026:

Set up Docker and Spark infrastructure for Hadoop catalog cross-compatibility testing with Java's HadoopCatalog.

  • Add hadoop_validation.py: SparkSession configured with spark.sql.catalog.hadoop_test (type=hadoop, warehouse=/home/iceberg/hadoop-warehouse); a sketch follows below
  • Add shared volume mount in docker-compose.yml: /tmp/iceberg-hadoop-warehouse (host) <-> /home/iceberg/hadoop-warehouse (Spark)
  • Copy hadoop_validation.py into Spark container via Dockerfile
  • Add make integration-hadoop target

Depends on: #963
Relates to #798
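
For context, a minimal sketch of the SparkSession configuration the first bullet describes. The app name and table name are placeholders, and the actual hadoop_validation.py may differ; it also assumes the Iceberg Spark runtime jar is on the classpath.

```python
# Hypothetical sketch of the catalog wiring described above; not the PR's
# exact script. Assumes the Iceberg Spark runtime jar is available.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hadoop-catalog-validation")  # placeholder app name
    # Register an Iceberg Hadoop catalog named "hadoop_test" backed by the
    # warehouse directory mounted into the Spark container.
    .config("spark.sql.catalog.hadoop_test", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.hadoop_test.type", "hadoop")
    .config("spark.sql.catalog.hadoop_test.warehouse", "/home/iceberg/hadoop-warehouse")
    .getOrCreate()
)

# Read back a table the Go side is expected to write into the shared
# warehouse; the table name is a placeholder.
spark.sql("SELECT * FROM hadoop_test.default.test_table").show()
```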

@tanmayrauth tanmayrauth requested a review from zeroshade as a code owner May 1, 2026 21:54
Commit 1 of 2:

Set up Docker and Spark infrastructure for Hadoop catalog
cross-compatibility testing with Java's HadoopCatalog.

- Add hadoop_validation.py: SparkSession configured with
  spark.sql.catalog.hadoop_test (type=hadoop, warehouse=/home/iceberg/hadoop-warehouse)
- Add shared volume mount in docker-compose.yml:
  /tmp/iceberg-hadoop-warehouse (host) <-> /home/iceberg/hadoop-warehouse (Spark)
- Copy hadoop_validation.py into Spark container via Dockerfile
- Add make integration-hadoop target

No Go code — purely infrastructure so subsequent PRs can add
integration test cases that validate Go ↔ Spark interop.

Depends-on: nothing (parallel with PR 1)
Depended-on-by: PRs 4, 5, 6 (integration test cases)
tanmayrauth (Contributor, Author) commented:

@laskoviymishka @zeroshade can you please review this PR?

laskoviymishka (Contributor) left a comment:

I think this PR needs a bit more, mostly around making the integration setup actually usable and reliable outside CI.

The biggest issue is the warehouse path contract. The PR description says the host path is /tmp/iceberg-hadoop-warehouse, but docker-compose.yml mounts ./hadoop-warehouse into a different path inside the container. For HadoopCatalog round-trip tests, Go-on-host and Spark-in-container need to agree on the same physical path. I’d prefer matching the existing Hive pattern and using the same absolute path on both sides, e.g.:

/tmp/iceberg-hadoop-warehouse:/tmp/iceberg-hadoop-warehouse

That should also be the path documented in the PR/test config.

A few related DevEx/test reliability things would make this much easier to run locally:

  • Add cleanup/reset behavior. HadoopCatalog persists state on disk, so repeated local runs should not fail with “table already exists” (one possible reset approach is sketched after this list).
  • Avoid CI-only setup logic. Today CI computes env vars via docker inspect / docker ps, but the Makefile does not, so local make integration-setup && make integration-hadoop can behave differently from CI.
  • Replace fixed sleep with compose healthchecks / readiness checks where possible.
  • Add local helper targets like integration-down and integration-logs.
  • Make sure the Spark validation asserts a real round-trip result, not just “Spark did not crash.”
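
On the cleanup point, a minimal sketch of one way to make reruns idempotent from the Spark side. The table name and schema are placeholders, and the catalog config mirrors the /tmp path proposed above; a clean target that wipes the warehouse directory on the host would achieve the same thing.

```python
# Hypothetical reset sketch for repeated local runs; placeholders throughout.
# HadoopCatalog persists tables under the warehouse directory, so dropping
# with PURGE before re-creating gives each run a clean slate.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.catalog.hadoop_test", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.hadoop_test.type", "hadoop")
    .config("spark.sql.catalog.hadoop_test.warehouse", "/tmp/iceberg-hadoop-warehouse")
    .getOrCreate()
)

# PURGE also removes the data/metadata files under the warehouse path.
spark.sql("DROP TABLE IF EXISTS hadoop_test.default.test_table PURGE")
spark.sql("CREATE TABLE hadoop_test.default.test_table (id BIGINT, data STRING) USING iceberg")
```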

Longer term, I think testcontainers-go compose support would be a good fit here: keep docker-compose.yml as the source of truth, but let Go tests own lifecycle, ports, waits, and cleanup.

I would not block this PR on that refactor, but I’d at least fix the warehouse path mismatch now so a future testcontainers migration does not inherit the same ambiguity.

zeroshade (Member) commented:

I agree with @laskoviymishka's comments. Let's get the DevEx stuff in here before we merge this.

Commit 2 of 2:

Use same absolute path /tmp/iceberg-hadoop-warehouse on host and
container (matching Hive pattern). Replace sleep-based setup with
docker compose --wait + healthcheck. Add integration-down, logs,
env, and hadoop-clean helper targets. Add --assert-rows to
hadoop_validation.py for meaningful round-trip verification.
tanmayrauth (Contributor, Author) commented:

Addressed all points:

(1) Fixed the warehouse path mismatch: now using /tmp/iceberg-hadoop-warehouse:/tmp/iceberg-hadoop-warehouse (the same absolute path on both sides, matching the Hive pattern).
(2) Added make integration-hadoop-clean for resetting state between local runs.
(3) Replaced sleep 10 with docker compose up --wait plus a healthcheck on spark-iceberg, so local and CI behave identically.
(4) Added make integration-down, integration-logs, and integration-env helper targets.
(5) hadoop_validation.py now supports --assert-rows to verify actual query results, not just "Spark didn't crash" (sketched below).

Agreed on testcontainers-go as a good future direction; these changes should make that migration straightforward, since the path contract is now unambiguous.
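
For reference, --assert-rows could look something like this; the flag wiring and table name are illustrative assumptions, not the PR's exact code.

```python
# Hypothetical sketch of an --assert-rows check; illustrative only.
import argparse
import sys

from pyspark.sql import SparkSession

parser = argparse.ArgumentParser()
parser.add_argument("--assert-rows", type=int, default=None,
                    help="fail unless the table holds exactly this many rows")
args = parser.parse_args()

spark = (
    SparkSession.builder
    .config("spark.sql.catalog.hadoop_test", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.hadoop_test.type", "hadoop")
    .config("spark.sql.catalog.hadoop_test.warehouse", "/tmp/iceberg-hadoop-warehouse")
    .getOrCreate()
)

# COUNT(*) returns a single row with a single column.
rows = spark.sql("SELECT COUNT(*) FROM hadoop_test.default.test_table").collect()[0][0]
if args.assert_rows is not None and rows != args.assert_rows:
    print(f"round-trip failed: expected {args.assert_rows} rows, got {rows}", file=sys.stderr)
    sys.exit(1)
print(f"validation OK: {rows} rows")
```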

laskoviymishka (Contributor) left a comment:

LGTM!

There are still some gaps in local runs (outside of CI); integration-env is a good mitigation, so I think we can merge this.

@laskoviymishka laskoviymishka merged commit 330fcdf into apache:main May 5, 2026
14 checks passed