
Conversation

@aokolnychyi (Contributor)

This PR contains a subset of changes for #15240 that moves around some scan objects.

A few notable changes:

  • Standardize on description(), which is what Spark plans actually use; there is no need for an almost identical but unused toString() (see the sketch after this list).
  • No need to include the schema as it is already supplied by Spark. See Javadoc.
  • No need to include useless info like case sensitivity.
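
For a concrete sense of the change, roughly the kind of string being dropped versus what Spark plans now render (illustrative values, not exact output; the "before" line is reconstructed from the bullets above):

  Before (unused toString(), roughly):        IcebergScan(table=db.t, type=struct<...>, filters=[...], caseSensitive=false)
  After (description(), rendered in plans):   IcebergScan(table=db.t, branch=null, filters=id > 0, runtimeFilters=, groupedBy=data)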

@github-actions bot added the spark label on Feb 10, 2026.
@singhpk234 (Contributor) left a comment:


LGTM, thanks @aokolnychyi!

@manuzhang requested a review from Copilot on February 10, 2026.
Copilot AI left a comment:


Pull request overview

Standardizes Spark v4.1 scan string representations by focusing on Scan.description() output (used by Spark plans) and de-emphasizing redundant/unused toString() content, while updating tests to match the new formatting.

Changes:

  • Simplified and standardized scan description strings (drop schema/case-sensitivity noise; include key fields like table, filters, grouping).
  • Adjusted equals/hashCode implementations to use the new filter description formatting in some scans (see the sketch after this list).
  • Updated and added tests to assert the new description() output.
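
As a rough sketch of what that equality change amounts to (the class and accessor names below are assumptions inferred from the file summaries, not the exact code):

@Override
public boolean equals(Object other) {
  // sketch only: real implementations compare more state than shown here
  if (this == other) {
    return true;
  }
  if (other == null || getClass() != other.getClass()) {
    return false;
  }
  SparkBatchQueryScan that = (SparkBatchQueryScan) other;
  // key on the described filter string rather than the expression objects
  return table().name().equals(that.table().name())
      && filtersDesc().equals(that.filtersDesc());
}

@Override
public int hashCode() {
  return java.util.Objects.hash(table().name(), filtersDesc());
}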

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.

Summary per file:

  • spark/v4.1/spark/src/test/java/org/apache/iceberg/spark/sql/TestFilterPushDown.java: Updates plan-string assertions to match the new scan description formatting.
  • spark/v4.1/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkScan.java: Adds tests validating Scan.description() for batch and copy-on-write scans.
  • spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkStagedScan.java: Fixes hashCode inputs and replaces the verbose toString() with a simpler description().
  • spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java: Introduces the helper filtersDesc() and changes toString() to delegate to description().
  • spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkPartitioningAwareScan.java: Adds a helper to stringify grouping key fields.
  • spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkLocalScan.java: Simplifies description() and makes toString() delegate to it.
  • spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkCopyOnWriteScan.java: Switches to description() and updates equality/hash to use described filters/grouping.
  • spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkChangelogScan.java: Simplifies description() and updates equality/hash to use described filters.
  • spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatchQueryScan.java: Switches to description() and updates equality/hash to use described filters/runtime filters/grouping.
  • spark/v4.1/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMerge.java: Updates plan-string assertions to match the new scan description formatting.


protected String filtersDesc() {
  return Spark3Util.describe(filterExpressions);
}

Copilot AI Feb 10, 2026


Using a human-readable description string as a proxy for filter identity is risky when it is later used by equals()/hashCode() in subclasses (as introduced in this PR). Different filter expression trees can potentially serialize to the same description string (lossy formatting, reordered terms, elided parentheses), which can make equals() return true for non-equal scans and break hash-based collections/caches. Prefer comparing the actual expression objects (or a canonical, non-lossy representation) in equals()/hashCode() rather than Spark3Util.describe(...) output.

Suggested change:
- return Spark3Util.describe(filterExpressions);
+ return filterExpressions.stream()
+     .map(Expression::toString)
+     .collect(Collectors.joining(" AND "));
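
To make the risk concrete, a contrived, self-contained illustration (the rendered strings below are hypothetical, not actual Spark3Util.describe output):

public class DescriptionCollisionDemo {
  public static void main(String[] args) {
    // Hypothetical renderings: a lossy formatter that elides quoting could
    // produce the same string for structurally different filter lists.
    String twoFilters = "a = 1, b = 2"; // filters a = 1 and b = 2
    String oneFilter = "a = 1, b = 2";  // single filter a = "1, b = 2", quotes lost
    // If equals()/hashCode() key on these strings, the two scans compare equal.
    System.out.println(twoFilters.equals(oneFilter)); // prints: true
  }
}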

assertThat(planAsString)
    .as("Pushed filters must match")
-   .contains("[filters=" + icebergFilters + ",");
+   .contains("filters=" + icebergFilters + ",");

Copilot AI Feb 10, 2026


This assertion can produce false positives because runtimeFilters= contains the substring filters=. If icebergFilters happens to match the runtime filter string, the test could pass even if the main filters= section is missing/incorrect. Make the match more specific (e.g., include a delimiter like ", filters=" or match the full scan prefix such as "IcebergScan(" + "filters=...") so it cannot accidentally match runtimeFilters=.

Suggested change:
- .contains("filters=" + icebergFilters + ",");
+ .contains("filters=" + icebergFilters + ", runtimeFilters=");

assertThat(planAsString)
    .as("Pushed filters must match")
-   .contains("[filters=" + icebergFilters + ",");
+   .contains("filters=" + icebergFilters + ",");

Copilot AI Feb 10, 2026


Same issue as above: runtimeFilters= includes filters= as a substring, so this can accidentally match the wrong section of the plan string. Tighten the assertion to include an unambiguous delimiter (e.g., ", filters=") or context around the expected field.

Suggested change:
- .contains("filters=" + icebergFilters + ",");
+ .contains(", filters=" + icebergFilters + ",");

Comment on lines +1041 to +1044
assertThat(description).contains("IcebergScan");
assertThat(description).contains(tableName);
assertThat(description).contains("filters=id = 1, id > 0");
assertThat(description).contains("groupedBy=data");

Copilot AI Feb 10, 2026


These assertions bake in the exact formatting and ordering of the rendered filter description (e.g., "id = 1, id > 0"). If Spark3Util.describe(...) changes formatting (commas vs AND, spacing, predicate order), the tests will fail even though behavior is correct. Consider asserting each predicate independently (e.g., contains "id = 1" and contains "id > 0") and separately asserting the filters= label, rather than asserting the full combined string in a single contains check.
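
Applied to the batch scan assertions above, that might look like this (a sketch in the same AssertJ style as the test):

assertThat(description).contains("filters=");
assertThat(description).contains("id = 1");
assertThat(description).contains("id > 0");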


assertThat(description).contains("IcebergCopyOnWriteScan");
assertThat(description).contains(tableName);
assertThat(description).contains("filters=id = 2, id < 10");

Copilot AI Feb 10, 2026


These assertions bake in the exact formatting and ordering of the rendered filter description (e.g., "id = 2, id < 10"). If Spark3Util.describe(...) changes formatting (commas vs AND, spacing, predicate order), the tests will fail even though behavior is correct. Consider asserting each predicate independently (e.g., contains "id = 2" and contains "id < 10") and separately asserting the filters= label, rather than asserting the full combined string in a single contains check.

Suggested change:
- assertThat(description).contains("filters=id = 2, id < 10");
+ assertThat(description).contains("filters=");
+ assertThat(description).contains("id = 2");
+ assertThat(description).contains("id < 10");

@pan3793 (Member) commented on Feb 10, 2026:

@aokolnychyi, would you mind providing some concrete examples (screenshots) of how this change affects Spark UI / EXPLAIN output? Display formatting seems not to have been properly considered or documented in the Spark DSv2 API; we internally made some hacky changes to make the UI display nicely...

@aokolnychyi (Contributor, Author)

@pan3793, if I recall correctly, Scan.description() is only used in scan exec nodes. This PR removes the projection schema, per the Javadoc instructions in Spark, and adds the scan class to the name (I am planning to add more scan types later, so it is helpful to see which scan we are dealing with). I debated dropping the runtime filters field but decided to keep it as is for now, even though it somewhat duplicates what Spark itself prints.

== Physical Plan ==
*(1) ColumnarToRow
+- BatchScan testhive.default.table[id#39, dep#40] IcebergScan(table=testhive.default.table, branch=null, filters=, runtimeFilters=, groupedBy=) RuntimeFilters: []

== Physical Plan ==
ReplaceData IcebergWrite(table=testhive.default.table, format=ORC)
+- AdaptiveSparkPlan isFinalPlan=false
   +- MergeRowsExec[__row_operation#55, id#56, dep#57, _file#58]
      +- SortMergeJoin [id#39], [id#33], FullOuter
         :- Sort [id#39 ASC NULLS FIRST], false, 0
         :  +- Exchange hashpartitioning(id#39, 4), ENSURE_REQUIREMENTS, [plan_id=91]
         :     +- Project [id#39, dep#40, _file#48, true AS __row_from_target#51, monotonically_increasing_id() AS __row_id#52L]
         :        +- BatchScan testhive.default.table[id#39, dep#40, _file#48] IcebergCopyOnWriteScan(table=testhive.default.table, filters=true, groupedBy=) RuntimeFilters: []
         +- Sort [id#33 ASC NULLS FIRST], false, 0
            +- Exchange hashpartitioning(id#33, 4), ENSURE_REQUIREMENTS, [plan_id=92]
               +- Project [id#33, dep#34, true AS __row_from_source#53]
                  +- Scan ExistingRDD[id#33,dep#34]
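
From the output above, the batch scan's new description() presumably boils down to something like this (a sketch; the accessor names are hypothetical stand-ins):

@Override
public String description() {
  // mirrors the rendered form: IcebergScan(table=..., branch=..., filters=...,
  // runtimeFilters=..., groupedBy=...)
  return String.format(
      "IcebergScan(table=%s, branch=%s, filters=%s, runtimeFilters=%s, groupedBy=%s)",
      table().name(), branch(), filtersDesc(), runtimeFiltersDesc(), groupedByDesc());
}

@Override
public String toString() {
  // toString() now delegates, so logs and plan output show the same string
  return description();
}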

@aokolnychyi (Contributor, Author)

@pan3793, I am going to merge this one so I can rebase my other PR, but I will follow up on any additional feedback you may have. Let me know what you think and whether there are gaps we should fix, either in Iceberg or in Spark.

@aokolnychyi merged commit 71b05af into apache:main on Feb 10, 2026 (22 checks passed).