Spark: Trim row-level test parameter rows from 6 to 3 by stevenzwu · Pull Request #16549 · apache/iceberg

stevenzwu · 2026-05-24T04:31:32Z

Summary

Reduces SparkRowLevelOperationsTestBase parameter rows from 6 to 3 in v4.0 / v4.1, and from 7 to 3 in v3.5, shifting from "test every catalog backend" to "test the catalogs that matter for production":

testhive (Hive) — kept as the Hive-metastore baseline. Two rows on testhive cover ORC + PARQUET formats and the joint axes (HASH/null/DISTRIBUTED) that previously sat on testhadoop.
spark_catalog (REST-backed) — repointed from Hive to REST. The SparkSessionCatalog row now exercises the REST commit path. Non-session REST is intentionally not added here — for row-level operation code paths, both Spark catalog wrappers return SparkTable and commit through the same MergingSnapshotProducer, so the SessionCatalog row covers what matters; non-session REST is already exercised by other tests (TestStructuredStreamingRead3, etc.).
testhadoop dropped — HadoopCatalog isn't recommended for production.

The trim affects 9 concrete subclasses that inherit this base: TestCopyOnWriteMerge / TestMergeOnReadMerge / TestCopyOnWriteUpdate / TestMergeOnReadUpdate / TestCopyOnWriteDelete / TestMergeOnReadDelete, both *MergeMetrics subclasses, and TestMergeSchemaEvolution. Each cuts test invocations by 50% (≈57% on Spark 3.5). TestCopyOnWriteWithLineage / TestMergeOnReadWithLineage are unaffected — TestRowLevelOperationsWithLineage redeclares parameters() locally.

Expected savings: ~6-7 minutes off each extensions matrix-cell wallclock on CI (out of ~33 min total).

A small RESTCatalogServer fixture fix is bundled to make this PR's REST row pass CI — see "Bundled fixture fix" below.

Axis coverage — before

v4.0 / v4.1 (6 rows):

#	catalog	format	vec	dist	branch	planning	fanout	fmtV
1	testhive	ORC	true	NONE	MAIN	LOCAL	true	2
2	testhive	PARQUET	true	NONE	test	DISTRIBUTED	false	2
3	testhadoop	PARQUET	random	HASH	null	LOCAL	true	2
4	spark_catalog (Hive)	AVRO	false	RANGE	test	DISTRIBUTED	false	2
5	testhadoop	PARQUET	random	HASH	null	LOCAL	true	3
6	spark_catalog (Hive)	AVRO	false	RANGE	test	DISTRIBUTED	false	3

v3.5 (7 rows): same as v4.x rows 1-4, plus testhadoop / PARQUET / vec=true / HASH / v3 and vec=false variants (rows 5+6), with the v3 spark_catalog row at position 7.

Axis	Values present (rows)
Catalog	testhive (1, 2) · testhadoop (3, 5) · spark_catalog/Hive (4, 6)
File format	ORC (1) · PARQUET (2, 3, 5) · AVRO (4, 6)
Vectorized	true (1, 2) · random (3, 5) · false (4, 6)
Distribution	NONE (1, 2) · HASH (3, 5) · RANGE (4, 6)
Branch	MAIN (1) · test (2, 4, 6) · null (3, 5)
Planning	LOCAL (1, 3, 5) · DISTRIBUTED (2, 4, 6)
Fanout	true (1, 3, 5) · false (2, 4, 6)
formatVersion	v2 (1, 2, 3, 4) · v3 (5, 6)

Two pairs are duplicates on every axis except formatVersion: rows 3+5 and rows 4+6.

Axis coverage — after

#	catalog	format	vec	dist	branch	planning	fanout	fmtV
1	testhive	ORC	true	NONE	MAIN	LOCAL	true	2
2	testhive	PARQUET	true	HASH	null	DISTRIBUTED	false	2
3	spark_catalog (REST)	AVRO	false	RANGE	test	DISTRIBUTED	false	3

Axis	Values present (rows)
Catalog	testhive (1, 2) · spark_catalog/REST (3)
File format	ORC (1) · PARQUET (2) · AVRO (3)
Vectorized	true (1, 2) · false (3)
Distribution	NONE (1) · HASH (2) · RANGE (3)
Branch	MAIN (1) · null (2) · test (3)
Planning	LOCAL (1) · DISTRIBUTED (2, 3)
Fanout	true (1) · false (2, 3)
formatVersion	v2 (1, 2) · v3 (3)

Every value of every axis is still covered. Three axes (file format, distribution, branch) have exactly one row per value — joint-axis coverage is intentionally thinner than before.

Design rationale

testhadoop dropped: HadoopCatalog isn't a production target; the row-level operation paths it exercises (commit through filesystem rename) aren't catalog-distinctive enough to justify the CI cost.
Two testhive rows: Hive metastore is a real production target. The first row keeps the original ORC + LOCAL + MAIN baseline; the second row absorbs the HASH / null / DISTRIBUTED axes that used to live on testhadoop.
spark_catalog repointed to REST: REST is the OSS-strategic catalog. The SessionCatalog wrapper test is more valuable when it exercises REST than when it duplicates Hive coverage already provided by row 1. For the row-level operation code paths these tests cover, MERGE/UPDATE/DELETE go through the same DSv2 path regardless of whether the catalog is wrapped in SparkSessionCatalog or registered as a custom SparkCatalog. The session-wrapper differences live in DDL / table-resolution paths, not in row-level execution.
formatVersion collapsed to one v3 row: v2 covered twice (testhive ORC, testhive PARQUET); v3 covered once. v3 introduces DV semantics that validateSnapshot checks via formatVersion >= 3, so at least one v3 row is required.

Bundled fixture fix

The default warehouse path in RESTCatalogServer (the test fixture used by RESTServerExtension) was set via new File(...).getAbsolutePath(), which returns plain filesystem paths without a URI scheme like /tmp/iceberg_warehouse/.../iceberg_data. HiveCatalog and HadoopCatalog return paths with file:// schemes via Hadoop's Path machinery, so test fixtures that consume catalog paths via Paths.get(URI.create(...)) work for those catalogs but break for REST with IllegalArgumentException: Missing scheme.

Without the fix, this PR fails CI: the new spark_catalog/REST row triggers the bug in TestDelete.testDeleteWithoutScanningTable and the MERGE-side equivalent (which call TestBase.move(...) → Paths.get(URI.create(...))).

The fix swaps getAbsolutePath() for toURI().toString() in RESTCatalogServer, making REST fixture paths match the Hive/Hadoop convention. One line in a shared fixture, addresses the root cause once, instead of patching every consumer.

There is one related precedent that would no longer be needed: CatalogTests.java:1028-1031 has an inline ternary that branches on startsWith("file:") for the same reason. Out of scope for this PR but a clean follow-up cleanup once the fixture is consistent.

Test plan

spotlessCheck + compileTestJava pass on all 3 Spark versions; :iceberg-open-api:spotlessCheck and :iceberg-open-api:compileTestFixturesJava pass.
Local smoke run on Spark 4.1 against TestCopyOnWriteDelete, TestMergeOnReadDelete, TestCopyOnWriteMerge, TestMergeOnReadMerge: all green, 0 failures across 759 invocations.
Verified no test method ends up with zero matching rows after the trim — the only stacked assumeThat predicates (!= testhadoop AND cachingCatalogEnabled()) match both testhive rows; all single-predicate cases match at least one row.
Full Spark CI run on this branch.

🤖 Generated with Claude Code

Reduces SparkRowLevelOperationsTestBase parameter rows from 6 to 3 in v4.0 / v4.1 (and from 7 to 3 in v3.5), shifting from "test every catalog backend" to "test the catalogs that matter for production": - testhive (Hive) — kept as the established Hive metastore baseline - testrest (REST) — added in place of testhadoop, since REST is the OSS-strategic catalog and testhadoop isn't recommended for prod - spark_catalog (REST-backed) — repointed from Hive to REST so the SessionCatalog row exercises the REST commit path instead of Hive formatVersion 2 covered by the testhive and testrest rows; formatVersion 3 covered by the spark_catalog/REST row, which exercises the DV (deletion-vector) path that validateSnapshot checks via formatVersion >= 3. The trim affects 9 concrete subclasses (TestCopyOnWriteMerge/Update/Delete, TestMergeOnReadMerge/Update/Delete, both *MergeMetrics, and TestMergeSchemaEvolution), each cutting test invocations by 50% (~57% on Spark 3.5). Note: TestCopyOnWriteWithLineage and TestMergeOnReadWithLineage are unaffected — TestRowLevelOperationsWithLineage redeclares parameters() with its own row set. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drops the testrest row in favor of restoring testhive as the carrier for the HASH/null/DISTRIBUTED axes that previously sat on testhadoop. REST catalog coverage is preserved by the spark_catalog (REST-backed) row. Rationale: for the row-level operation code paths these tests exercise, a non-session SparkCatalog wrapper around REST adds little beyond what the SessionCatalog wrapper already covers. Both return SparkTable; both commit through the same MergingSnapshotProducer; the differences live in table-resolution paths (DDL/aliasing), not in MERGE/UPDATE/DELETE. Other test classes (TestStructuredStreamingRead3 etc.) already exercise REST as a non-session catalog. Eliminates 2 of the 3 known TestBase.move() / metadata-delete fixture failures from the previous version. The remaining failure (testDeleteWithoutScanningTable on spark_catalog/REST) is a pre-existing TestBase.move() limitation with non-URI paths and needs a follow-up fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Paths.get(URI.create(location)) requires the URI to carry a scheme. HiveCatalog returns manifest paths with file:// schemes, so the existing move() works. RESTCatalog (as configured by RESTServerExtension in tests) returns plain local paths without a scheme, which makes Paths.get(URI) throw IllegalArgumentException: Missing scheme. Add a small toPath() helper that falls back to Paths.get(location) when URI.create(location).getScheme() is null. This unblocks testDeleteWithoutScanningTable and the equivalent MERGE-side test on spark_catalog/REST rows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Move the URI-scheme handling from TestBase.move() (Spark) to the source of the inconsistency: the REST catalog test fixture's default warehouse path. RESTCatalogServer used getAbsolutePath() which returns plain filesystem paths without a URI scheme; HiveCatalog and HadoopCatalog return file:// paths via Hadoop's Path machinery, so test fixtures that consume catalog paths (e.g. TestBase.move() which calls Paths.get(URI.create(...))) work for those catalogs but break for REST. Switch to toURI().toString() so the REST fixture matches the Hive and Hadoop convention. testDeleteWithoutScanningTable and the equivalent MERGE-side test now pass on the spark_catalog/REST row without needing scheme-handling logic at every consumer. Reverts the TestBase.move() change from the previous commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The test had a special-case branch that called file.getAbsolutePath() (no scheme) instead of file.toURI().toString() when catalogName was testrest, because the old REST fixture returned warehouse paths without a scheme and the test needed the deletes.parquet path to match. With RESTCatalogServer now using toURI().toString() for the default warehouse, table.location() returns a scheme-prefixed path on REST too, so the workaround is no longer needed - and is now incorrect, since RewriteTablePathUtil.newPositionDeleteEntry validates that delete file paths start with the table location prefix (which now includes file://). Removing the special case lets the simple toURI().toString() path apply uniformly across all catalogs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

stevenzwu · 2026-05-24T17:49:55Z

      File tmp = java.nio.file.Files.createTempDirectory("iceberg_warehouse").toFile();
      tmp.deleteOnExit();
-      warehouseLocation = new File(tmp, "iceberg_data").getAbsolutePath();
+      warehouseLocation = new File(tmp, "iceberg_data").toURI().toString();


Either we fix the RESTCatalogServer and add the file:// scheme in in the warehouse location, or we have to handle the path with and without scheme in read.

stevenzwu · 2026-05-24T17:51:10Z


    File file = new File(removePrefix(table.location()) + "/data/deletes.parquet");
    String filePath = file.toURI().toString();
-    if (SparkCatalogConfig.REST.catalogName().equals(catalogName)) {


we don't need this special handling anymore after adding the file:// scheme to the path in the RESTCatalogServer.

github-actions Bot added the spark label May 24, 2026

stevenzwu and others added 3 commits May 23, 2026 21:43

github-actions Bot added the OPENAPI label May 24, 2026

stevenzwu commented May 24, 2026

View reviewed changes

stevenzwu marked this pull request as ready for review May 24, 2026 20:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Spark: Trim row-level test parameter rows from 6 to 3#16549

Spark: Trim row-level test parameter rows from 6 to 3#16549
stevenzwu wants to merge 5 commits into
apache:mainfrom
stevenzwu:spark-rowlevel-trim-3rows

stevenzwu commented May 24, 2026 •

edited

Loading

Uh oh!

stevenzwu May 24, 2026 •

edited

Loading

Uh oh!

stevenzwu May 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

stevenzwu commented May 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Axis coverage — before

Axis coverage — after

Design rationale

Bundled fixture fix

Test plan

Uh oh!

stevenzwu May 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

stevenzwu May 24, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

stevenzwu commented May 24, 2026 •

edited

Loading

stevenzwu May 24, 2026 •

edited

Loading