Spark: Trim row-level test parameter rows from 6 to 3#16549
Open
stevenzwu wants to merge 5 commits into
Open
Conversation
Reduces SparkRowLevelOperationsTestBase parameter rows from 6 to 3 in
v4.0 / v4.1 (and from 7 to 3 in v3.5), shifting from "test every catalog
backend" to "test the catalogs that matter for production":
- testhive (Hive) — kept as the established Hive metastore baseline
- testrest (REST) — added in place of testhadoop, since REST is the
OSS-strategic catalog and testhadoop isn't recommended for prod
- spark_catalog (REST-backed) — repointed from Hive to REST so the
SessionCatalog row exercises the REST commit path instead of Hive
formatVersion 2 covered by the testhive and testrest rows; formatVersion
3 covered by the spark_catalog/REST row, which exercises the DV
(deletion-vector) path that validateSnapshot checks via
formatVersion >= 3.
The trim affects 9 concrete subclasses (TestCopyOnWriteMerge/Update/Delete,
TestMergeOnReadMerge/Update/Delete, both *MergeMetrics, and
TestMergeSchemaEvolution), each cutting test invocations by 50% (~57%
on Spark 3.5).
Note: TestCopyOnWriteWithLineage and TestMergeOnReadWithLineage are
unaffected — TestRowLevelOperationsWithLineage redeclares parameters()
with its own row set.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops the testrest row in favor of restoring testhive as the carrier for the HASH/null/DISTRIBUTED axes that previously sat on testhadoop. REST catalog coverage is preserved by the spark_catalog (REST-backed) row. Rationale: for the row-level operation code paths these tests exercise, a non-session SparkCatalog wrapper around REST adds little beyond what the SessionCatalog wrapper already covers. Both return SparkTable; both commit through the same MergingSnapshotProducer; the differences live in table-resolution paths (DDL/aliasing), not in MERGE/UPDATE/DELETE. Other test classes (TestStructuredStreamingRead3 etc.) already exercise REST as a non-session catalog. Eliminates 2 of the 3 known TestBase.move() / metadata-delete fixture failures from the previous version. The remaining failure (testDeleteWithoutScanningTable on spark_catalog/REST) is a pre-existing TestBase.move() limitation with non-URI paths and needs a follow-up fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Paths.get(URI.create(location)) requires the URI to carry a scheme. HiveCatalog returns manifest paths with file:// schemes, so the existing move() works. RESTCatalog (as configured by RESTServerExtension in tests) returns plain local paths without a scheme, which makes Paths.get(URI) throw IllegalArgumentException: Missing scheme. Add a small toPath() helper that falls back to Paths.get(location) when URI.create(location).getScheme() is null. This unblocks testDeleteWithoutScanningTable and the equivalent MERGE-side test on spark_catalog/REST rows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the URI-scheme handling from TestBase.move() (Spark) to the source of the inconsistency: the REST catalog test fixture's default warehouse path. RESTCatalogServer used getAbsolutePath() which returns plain filesystem paths without a URI scheme; HiveCatalog and HadoopCatalog return file:// paths via Hadoop's Path machinery, so test fixtures that consume catalog paths (e.g. TestBase.move() which calls Paths.get(URI.create(...))) work for those catalogs but break for REST. Switch to toURI().toString() so the REST fixture matches the Hive and Hadoop convention. testDeleteWithoutScanningTable and the equivalent MERGE-side test now pass on the spark_catalog/REST row without needing scheme-handling logic at every consumer. Reverts the TestBase.move() change from the previous commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The test had a special-case branch that called file.getAbsolutePath() (no scheme) instead of file.toURI().toString() when catalogName was testrest, because the old REST fixture returned warehouse paths without a scheme and the test needed the deletes.parquet path to match. With RESTCatalogServer now using toURI().toString() for the default warehouse, table.location() returns a scheme-prefixed path on REST too, so the workaround is no longer needed - and is now incorrect, since RewriteTablePathUtil.newPositionDeleteEntry validates that delete file paths start with the table location prefix (which now includes file://). Removing the special case lets the simple toURI().toString() path apply uniformly across all catalogs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stevenzwu
commented
May 24, 2026
| File tmp = java.nio.file.Files.createTempDirectory("iceberg_warehouse").toFile(); | ||
| tmp.deleteOnExit(); | ||
| warehouseLocation = new File(tmp, "iceberg_data").getAbsolutePath(); | ||
| warehouseLocation = new File(tmp, "iceberg_data").toURI().toString(); |
Contributor
Author
There was a problem hiding this comment.
Either we fix the RESTCatalogServer and add the file:// scheme in in the warehouse location, or we have to handle the path with and without scheme in read.
stevenzwu
commented
May 24, 2026
|
|
||
| File file = new File(removePrefix(table.location()) + "/data/deletes.parquet"); | ||
| String filePath = file.toURI().toString(); | ||
| if (SparkCatalogConfig.REST.catalogName().equals(catalogName)) { |
Contributor
Author
There was a problem hiding this comment.
we don't need this special handling anymore after adding the file:// scheme to the path in the RESTCatalogServer.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Reduces
SparkRowLevelOperationsTestBaseparameter rows from 6 to 3 in v4.0 / v4.1, and from 7 to 3 in v3.5, shifting from "test every catalog backend" to "test the catalogs that matter for production":SparkSessionCatalogrow now exercises the REST commit path. Non-session REST is intentionally not added here — for row-level operation code paths, both Spark catalog wrappers returnSparkTableand commit through the sameMergingSnapshotProducer, so the SessionCatalog row covers what matters; non-session REST is already exercised by other tests (TestStructuredStreamingRead3, etc.).HadoopCatalogisn't recommended for production.The trim affects 9 concrete subclasses that inherit this base:
TestCopyOnWriteMerge/TestMergeOnReadMerge/TestCopyOnWriteUpdate/TestMergeOnReadUpdate/TestCopyOnWriteDelete/TestMergeOnReadDelete, both*MergeMetricssubclasses, andTestMergeSchemaEvolution. Each cuts test invocations by 50% (≈57% on Spark 3.5).TestCopyOnWriteWithLineage/TestMergeOnReadWithLineageare unaffected —TestRowLevelOperationsWithLineageredeclaresparameters()locally.Expected savings: ~6-7 minutes off each
extensionsmatrix-cell wallclock on CI (out of ~33 min total).A small
RESTCatalogServerfixture fix is bundled to make this PR's REST row pass CI — see "Bundled fixture fix" below.Axis coverage — before
v4.0 / v4.1 (6 rows):
v3.5 (7 rows): same as v4.x rows 1-4, plus
testhadoop / PARQUET / vec=true / HASH / v3andvec=falsevariants (rows 5+6), with the v3 spark_catalog row at position 7.Two pairs are duplicates on every axis except
formatVersion: rows 3+5 and rows 4+6.Axis coverage — after
Every value of every axis is still covered. Three axes (file format, distribution, branch) have exactly one row per value — joint-axis coverage is intentionally thinner than before.
Design rationale
testhadoopdropped:HadoopCatalogisn't a production target; the row-level operation paths it exercises (commit through filesystem rename) aren't catalog-distinctive enough to justify the CI cost.testhiverows: Hive metastore is a real production target. The first row keeps the original ORC + LOCAL + MAIN baseline; the second row absorbs the HASH / null / DISTRIBUTED axes that used to live on testhadoop.spark_catalogrepointed to REST: REST is the OSS-strategic catalog. The SessionCatalog wrapper test is more valuable when it exercises REST than when it duplicates Hive coverage already provided by row 1. For the row-level operation code paths these tests cover, MERGE/UPDATE/DELETE go through the same DSv2 path regardless of whether the catalog is wrapped inSparkSessionCatalogor registered as a customSparkCatalog. The session-wrapper differences live in DDL / table-resolution paths, not in row-level execution.formatVersioncollapsed to one v3 row: v2 covered twice (testhive ORC, testhive PARQUET); v3 covered once. v3 introduces DV semantics thatvalidateSnapshotchecks viaformatVersion >= 3, so at least one v3 row is required.Bundled fixture fix
The default warehouse path in
RESTCatalogServer(the test fixture used byRESTServerExtension) was set vianew File(...).getAbsolutePath(), which returns plain filesystem paths without a URI scheme like/tmp/iceberg_warehouse/.../iceberg_data.HiveCatalogandHadoopCatalogreturn paths withfile://schemes via Hadoop'sPathmachinery, so test fixtures that consume catalog paths viaPaths.get(URI.create(...))work for those catalogs but break for REST withIllegalArgumentException: Missing scheme.Without the fix, this PR fails CI: the new
spark_catalog/RESTrow triggers the bug inTestDelete.testDeleteWithoutScanningTableand the MERGE-side equivalent (which callTestBase.move(...)→Paths.get(URI.create(...))).The fix swaps
getAbsolutePath()fortoURI().toString()inRESTCatalogServer, making REST fixture paths match the Hive/Hadoop convention. One line in a shared fixture, addresses the root cause once, instead of patching every consumer.There is one related precedent that would no longer be needed:
CatalogTests.java:1028-1031has an inline ternary that branches onstartsWith("file:")for the same reason. Out of scope for this PR but a clean follow-up cleanup once the fixture is consistent.Test plan
spotlessCheck+compileTestJavapass on all 3 Spark versions;:iceberg-open-api:spotlessCheckand:iceberg-open-api:compileTestFixturesJavapass.TestCopyOnWriteDelete,TestMergeOnReadDelete,TestCopyOnWriteMerge,TestMergeOnReadMerge: all green, 0 failures across 759 invocations.assumeThatpredicates (!= testhadoopANDcachingCatalogEnabled()) match both testhive rows; all single-predicate cases match at least one row.🤖 Generated with Claude Code