Skip to content

Spark: Trim row-level test parameter rows from 6 to 3#16549

Open
stevenzwu wants to merge 5 commits into
apache:mainfrom
stevenzwu:spark-rowlevel-trim-3rows
Open

Spark: Trim row-level test parameter rows from 6 to 3#16549
stevenzwu wants to merge 5 commits into
apache:mainfrom
stevenzwu:spark-rowlevel-trim-3rows

Conversation

@stevenzwu
Copy link
Copy Markdown
Contributor

@stevenzwu stevenzwu commented May 24, 2026

Summary

Reduces SparkRowLevelOperationsTestBase parameter rows from 6 to 3 in v4.0 / v4.1, and from 7 to 3 in v3.5, shifting from "test every catalog backend" to "test the catalogs that matter for production":

  • testhive (Hive) — kept as the Hive-metastore baseline. Two rows on testhive cover ORC + PARQUET formats and the joint axes (HASH/null/DISTRIBUTED) that previously sat on testhadoop.
  • spark_catalog (REST-backed) — repointed from Hive to REST. The SparkSessionCatalog row now exercises the REST commit path. Non-session REST is intentionally not added here — for row-level operation code paths, both Spark catalog wrappers return SparkTable and commit through the same MergingSnapshotProducer, so the SessionCatalog row covers what matters; non-session REST is already exercised by other tests (TestStructuredStreamingRead3, etc.).
  • testhadoop droppedHadoopCatalog isn't recommended for production.

The trim affects 9 concrete subclasses that inherit this base: TestCopyOnWriteMerge / TestMergeOnReadMerge / TestCopyOnWriteUpdate / TestMergeOnReadUpdate / TestCopyOnWriteDelete / TestMergeOnReadDelete, both *MergeMetrics subclasses, and TestMergeSchemaEvolution. Each cuts test invocations by 50% (≈57% on Spark 3.5). TestCopyOnWriteWithLineage / TestMergeOnReadWithLineage are unaffected — TestRowLevelOperationsWithLineage redeclares parameters() locally.

Expected savings: ~6-7 minutes off each extensions matrix-cell wallclock on CI (out of ~33 min total).

A small RESTCatalogServer fixture fix is bundled to make this PR's REST row pass CI — see "Bundled fixture fix" below.

Axis coverage — before

v4.0 / v4.1 (6 rows):

# catalog format vec dist branch planning fanout fmtV
1 testhive ORC true NONE MAIN LOCAL true 2
2 testhive PARQUET true NONE test DISTRIBUTED false 2
3 testhadoop PARQUET random HASH null LOCAL true 2
4 spark_catalog (Hive) AVRO false RANGE test DISTRIBUTED false 2
5 testhadoop PARQUET random HASH null LOCAL true 3
6 spark_catalog (Hive) AVRO false RANGE test DISTRIBUTED false 3

v3.5 (7 rows): same as v4.x rows 1-4, plus testhadoop / PARQUET / vec=true / HASH / v3 and vec=false variants (rows 5+6), with the v3 spark_catalog row at position 7.

Axis Values present (rows)
Catalog testhive (1, 2) · testhadoop (3, 5) · spark_catalog/Hive (4, 6)
File format ORC (1) · PARQUET (2, 3, 5) · AVRO (4, 6)
Vectorized true (1, 2) · random (3, 5) · false (4, 6)
Distribution NONE (1, 2) · HASH (3, 5) · RANGE (4, 6)
Branch MAIN (1) · test (2, 4, 6) · null (3, 5)
Planning LOCAL (1, 3, 5) · DISTRIBUTED (2, 4, 6)
Fanout true (1, 3, 5) · false (2, 4, 6)
formatVersion v2 (1, 2, 3, 4) · v3 (5, 6)

Two pairs are duplicates on every axis except formatVersion: rows 3+5 and rows 4+6.

Axis coverage — after

# catalog format vec dist branch planning fanout fmtV
1 testhive ORC true NONE MAIN LOCAL true 2
2 testhive PARQUET true HASH null DISTRIBUTED false 2
3 spark_catalog (REST) AVRO false RANGE test DISTRIBUTED false 3
Axis Values present (rows)
Catalog testhive (1, 2) · spark_catalog/REST (3)
File format ORC (1) · PARQUET (2) · AVRO (3)
Vectorized true (1, 2) · false (3)
Distribution NONE (1) · HASH (2) · RANGE (3)
Branch MAIN (1) · null (2) · test (3)
Planning LOCAL (1) · DISTRIBUTED (2, 3)
Fanout true (1) · false (2, 3)
formatVersion v2 (1, 2) · v3 (3)

Every value of every axis is still covered. Three axes (file format, distribution, branch) have exactly one row per value — joint-axis coverage is intentionally thinner than before.

Design rationale

  • testhadoop dropped: HadoopCatalog isn't a production target; the row-level operation paths it exercises (commit through filesystem rename) aren't catalog-distinctive enough to justify the CI cost.
  • Two testhive rows: Hive metastore is a real production target. The first row keeps the original ORC + LOCAL + MAIN baseline; the second row absorbs the HASH / null / DISTRIBUTED axes that used to live on testhadoop.
  • spark_catalog repointed to REST: REST is the OSS-strategic catalog. The SessionCatalog wrapper test is more valuable when it exercises REST than when it duplicates Hive coverage already provided by row 1. For the row-level operation code paths these tests cover, MERGE/UPDATE/DELETE go through the same DSv2 path regardless of whether the catalog is wrapped in SparkSessionCatalog or registered as a custom SparkCatalog. The session-wrapper differences live in DDL / table-resolution paths, not in row-level execution.
  • formatVersion collapsed to one v3 row: v2 covered twice (testhive ORC, testhive PARQUET); v3 covered once. v3 introduces DV semantics that validateSnapshot checks via formatVersion >= 3, so at least one v3 row is required.

Bundled fixture fix

The default warehouse path in RESTCatalogServer (the test fixture used by RESTServerExtension) was set via new File(...).getAbsolutePath(), which returns plain filesystem paths without a URI scheme like /tmp/iceberg_warehouse/.../iceberg_data. HiveCatalog and HadoopCatalog return paths with file:// schemes via Hadoop's Path machinery, so test fixtures that consume catalog paths via Paths.get(URI.create(...)) work for those catalogs but break for REST with IllegalArgumentException: Missing scheme.

Without the fix, this PR fails CI: the new spark_catalog/REST row triggers the bug in TestDelete.testDeleteWithoutScanningTable and the MERGE-side equivalent (which call TestBase.move(...)Paths.get(URI.create(...))).

The fix swaps getAbsolutePath() for toURI().toString() in RESTCatalogServer, making REST fixture paths match the Hive/Hadoop convention. One line in a shared fixture, addresses the root cause once, instead of patching every consumer.

There is one related precedent that would no longer be needed: CatalogTests.java:1028-1031 has an inline ternary that branches on startsWith("file:") for the same reason. Out of scope for this PR but a clean follow-up cleanup once the fixture is consistent.

Test plan

  • spotlessCheck + compileTestJava pass on all 3 Spark versions; :iceberg-open-api:spotlessCheck and :iceberg-open-api:compileTestFixturesJava pass.
  • Local smoke run on Spark 4.1 against TestCopyOnWriteDelete, TestMergeOnReadDelete, TestCopyOnWriteMerge, TestMergeOnReadMerge: all green, 0 failures across 759 invocations.
  • Verified no test method ends up with zero matching rows after the trim — the only stacked assumeThat predicates (!= testhadoop AND cachingCatalogEnabled()) match both testhive rows; all single-predicate cases match at least one row.
  • Full Spark CI run on this branch.

🤖 Generated with Claude Code

Reduces SparkRowLevelOperationsTestBase parameter rows from 6 to 3 in
v4.0 / v4.1 (and from 7 to 3 in v3.5), shifting from "test every catalog
backend" to "test the catalogs that matter for production":

  - testhive (Hive) — kept as the established Hive metastore baseline
  - testrest (REST) — added in place of testhadoop, since REST is the
    OSS-strategic catalog and testhadoop isn't recommended for prod
  - spark_catalog (REST-backed) — repointed from Hive to REST so the
    SessionCatalog row exercises the REST commit path instead of Hive

formatVersion 2 covered by the testhive and testrest rows; formatVersion
3 covered by the spark_catalog/REST row, which exercises the DV
(deletion-vector) path that validateSnapshot checks via
formatVersion >= 3.

The trim affects 9 concrete subclasses (TestCopyOnWriteMerge/Update/Delete,
TestMergeOnReadMerge/Update/Delete, both *MergeMetrics, and
TestMergeSchemaEvolution), each cutting test invocations by 50% (~57%
on Spark 3.5).

Note: TestCopyOnWriteWithLineage and TestMergeOnReadWithLineage are
unaffected — TestRowLevelOperationsWithLineage redeclares parameters()
with its own row set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the spark label May 24, 2026
stevenzwu and others added 3 commits May 23, 2026 21:43
Drops the testrest row in favor of restoring testhive as the carrier for
the HASH/null/DISTRIBUTED axes that previously sat on testhadoop. REST
catalog coverage is preserved by the spark_catalog (REST-backed) row.

Rationale: for the row-level operation code paths these tests exercise,
a non-session SparkCatalog wrapper around REST adds little beyond what
the SessionCatalog wrapper already covers. Both return SparkTable; both
commit through the same MergingSnapshotProducer; the differences live
in table-resolution paths (DDL/aliasing), not in MERGE/UPDATE/DELETE.
Other test classes (TestStructuredStreamingRead3 etc.) already exercise
REST as a non-session catalog.

Eliminates 2 of the 3 known TestBase.move() / metadata-delete fixture
failures from the previous version. The remaining failure
(testDeleteWithoutScanningTable on spark_catalog/REST) is a pre-existing
TestBase.move() limitation with non-URI paths and needs a follow-up fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Paths.get(URI.create(location)) requires the URI to carry a scheme.
HiveCatalog returns manifest paths with file:// schemes, so the
existing move() works. RESTCatalog (as configured by RESTServerExtension
in tests) returns plain local paths without a scheme, which makes
Paths.get(URI) throw IllegalArgumentException: Missing scheme.

Add a small toPath() helper that falls back to Paths.get(location) when
URI.create(location).getScheme() is null. This unblocks
testDeleteWithoutScanningTable and the equivalent MERGE-side test on
spark_catalog/REST rows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the URI-scheme handling from TestBase.move() (Spark) to the source
of the inconsistency: the REST catalog test fixture's default warehouse
path. RESTCatalogServer used getAbsolutePath() which returns plain
filesystem paths without a URI scheme; HiveCatalog and HadoopCatalog
return file:// paths via Hadoop's Path machinery, so test fixtures
that consume catalog paths (e.g. TestBase.move() which calls
Paths.get(URI.create(...))) work for those catalogs but break for REST.

Switch to toURI().toString() so the REST fixture matches the Hive and
Hadoop convention. testDeleteWithoutScanningTable and the equivalent
MERGE-side test now pass on the spark_catalog/REST row without needing
scheme-handling logic at every consumer.

Reverts the TestBase.move() change from the previous commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The test had a special-case branch that called file.getAbsolutePath() (no
scheme) instead of file.toURI().toString() when catalogName was testrest,
because the old REST fixture returned warehouse paths without a scheme
and the test needed the deletes.parquet path to match.

With RESTCatalogServer now using toURI().toString() for the default
warehouse, table.location() returns a scheme-prefixed path on REST too,
so the workaround is no longer needed - and is now incorrect, since
RewriteTablePathUtil.newPositionDeleteEntry validates that delete file
paths start with the table location prefix (which now includes file://).

Removing the special case lets the simple toURI().toString() path apply
uniformly across all catalogs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
File tmp = java.nio.file.Files.createTempDirectory("iceberg_warehouse").toFile();
tmp.deleteOnExit();
warehouseLocation = new File(tmp, "iceberg_data").getAbsolutePath();
warehouseLocation = new File(tmp, "iceberg_data").toURI().toString();
Copy link
Copy Markdown
Contributor Author

@stevenzwu stevenzwu May 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either we fix the RESTCatalogServer and add the file:// scheme in in the warehouse location, or we have to handle the path with and without scheme in read.


File file = new File(removePrefix(table.location()) + "/data/deletes.parquet");
String filePath = file.toURI().toString();
if (SparkCatalogConfig.REST.catalogName().equals(catalogName)) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't need this special handling anymore after adding the file:// scheme to the path in the RESTCatalogServer.

@stevenzwu stevenzwu marked this pull request as ready for review May 24, 2026 20:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant