feat(test): add Spark integration test infrastructure for Hadoop catalog #968
Conversation
Set up Docker and Spark infrastructure for Hadoop catalog cross-compatibility testing with Java's HadoopCatalog.

- Add hadoop_validation.py: a SparkSession configured with spark.sql.catalog.hadoop_test (type=hadoop, warehouse=/home/iceberg/hadoop-warehouse)
- Add a shared volume mount in docker-compose.yml: /tmp/iceberg-hadoop-warehouse (host) <-> /home/iceberg/hadoop-warehouse (Spark)
- Copy hadoop_validation.py into the Spark container via the Dockerfile
- Add a make integration-hadoop target

No Go code; this is purely infrastructure so subsequent PRs can add integration test cases that validate Go ↔ Spark interop.

Depends-on: nothing (parallel with PR 1)
Depended-on-by: PRs 4, 5, 6 (integration test cases)
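For context, a minimal sketch of the SparkSession wiring described above, assuming hadoop_validation.py follows the standard Iceberg Spark catalog configuration. The catalog name, type, and warehouse path come from this description; the SparkCatalog class and the example query are illustrative.

```python
# Sketch only: the catalog name, type=hadoop, and warehouse path come from the PR
# description; the SparkCatalog class and the SHOW TABLES query are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hadoop-catalog-validation")
    .config("spark.sql.catalog.hadoop_test", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.hadoop_test.type", "hadoop")
    .config("spark.sql.catalog.hadoop_test.warehouse", "/home/iceberg/hadoop-warehouse")
    .getOrCreate()
)

# A table created by the Go side under the shared warehouse should resolve here.
spark.sql("SHOW TABLES IN hadoop_test.default").show()
```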
Force-pushed from b0b2ff6 to c76d9ab.
@laskoviymishka @zeroshade can you please review this PR?
laskoviymishka left a comment:
I think this PR needs a bit more, mostly around making the integration setup actually usable and reliable outside CI.
The biggest issue is the warehouse path contract. The PR description says the host path is /tmp/iceberg-hadoop-warehouse, but docker-compose.yml mounts ./hadoop-warehouse into a different path inside the container. For HadoopCatalog round-trip tests, Go-on-host and Spark-in-container need to agree on the same physical path. I’d prefer matching the existing Hive pattern and using the same absolute path on both sides, e.g.:
/tmp/iceberg-hadoop-warehouse:/tmp/iceberg-hadoop-warehouse

That should also be the path documented in the PR/test config.
A few related DevEx/test reliability things would make this much easier to run locally:

- Add cleanup/reset behavior. HadoopCatalog persists state on disk, so repeated local runs should not fail with "table already exists."
- Avoid CI-only setup logic. Today CI computes env vars via docker inspect / docker ps, but the Makefile does not, so a local make integration-setup && make integration-hadoop can behave differently from CI.
- Replace the fixed sleep with compose healthchecks / readiness checks where possible.
- Add local helper targets like integration-down and integration-logs.
- Make sure the Spark validation asserts a real round-trip result, not just "Spark did not crash."
Longer term, I think testcontainers-go compose support would be a good fit here: keep docker-compose.yml as the source of truth, but let Go tests own lifecycle, ports, waits, and cleanup.
I would not block this PR on that refactor, but I’d at least fix the warehouse path mismatch now so a future testcontainers migration does not inherit the same ambiguity.
I agree with @laskoviymishka's comments. Let's get the DevEx stuff in here before we merge this.
Use same absolute path /tmp/iceberg-hadoop-warehouse on host and container (matching Hive pattern). Replace sleep-based setup with docker compose --wait + healthcheck. Add integration-down, logs, env, and hadoop-clean helper targets. Add --assert-rows to hadoop_validation.py for meaningful round-trip verification.
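As an illustration of the kind of check --assert-rows enables, here is a hedged sketch of how hadoop_validation.py might verify the round trip. Only the flag name and the warehouse path come from the summary above; the table name, namespace, and argument handling are assumptions.

```python
# Sketch only: --assert-rows and the /tmp/iceberg-hadoop-warehouse path come from
# the change summary; the table name and CLI shape are illustrative assumptions.
import argparse
import sys

from pyspark.sql import SparkSession

parser = argparse.ArgumentParser()
parser.add_argument("--assert-rows", type=int, required=True,
                    help="expected row count in the table written by the Go side")
args = parser.parse_args()

spark = (
    SparkSession.builder.appName("hadoop-catalog-validation")
    .config("spark.sql.catalog.hadoop_test", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.hadoop_test.type", "hadoop")
    .config("spark.sql.catalog.hadoop_test.warehouse", "/tmp/iceberg-hadoop-warehouse")
    .getOrCreate()
)

# Read back the table written by the Go side and exit non-zero on a mismatch, so the
# Make target fails instead of merely reporting that Spark did not crash.
actual = spark.table("hadoop_test.default.go_written_table").count()
if actual != args.assert_rows:
    print(f"round-trip mismatch: expected {args.assert_rows} rows, got {actual}")
    sys.exit(1)
print(f"round-trip OK: {actual} rows")
```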
Addressed all points.
laskoviymishka left a comment:
LGTM!
There are still some gaps in the local run (outside of CI); integration-env is a good mitigation, so I think we can merge this.
Set up Docker and Spark infrastructure for Hadoop catalog cross-compatibility testing with Java's HadoopCatalog.
Depends on: #963
Relates to #798