
Conversation

@aokolnychyi (Contributor)

This PR contains a subset of changes for #15240 that moves around some scan objects.

A few notable changes:

  • Standardize on description(), which is what Spark plans actually use; there is no need for an almost identical but unused toString() (see the sketch after this list).
  • No need to include the schema as it is already supplied by Spark. See Javadoc.
  • No need to include useless info like case sensitivity.
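
For a concrete sense of the change, roughly the kind of string being dropped versus what Spark plans now render (illustrative values, not exact output; the "before" line is reconstructed from the bullets above):

  Before (unused toString(), roughly):        IcebergScan(table=db.t, type=struct<...>, filters=[...], caseSensitive=false)
  After (description(), rendered in plans):   IcebergScan(table=db.t, branch=null, filters=id > 0, runtimeFilters=, groupedBy=data)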

@github-actions bot added the spark label on Feb 10, 2026.
@singhpk234 (Contributor) left a comment:


LGTM, thanks @aokolnychyi!

@manuzhang requested a review from Copilot on February 10, 2026.
Copilot AI left a comment:


Pull request overview

Standardizes Spark v4.1 scan string representations by focusing on Scan.description() output (used by Spark plans) and de-emphasizing redundant/unused toString() content, while updating tests to match the new formatting.

Changes:

  • Simplified and standardized scan description strings (drop schema/case-sensitivity noise; include key fields like table, filters, grouping).
  • Adjusted equals/hashCode implementations to use the new filter description formatting in some scans (see the sketch after this list).
  • Updated and added tests to assert the new description() output.
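
As a rough sketch of what that equality change amounts to (the class and accessor names below are assumptions inferred from the file summaries, not the exact code):

@Override
public boolean equals(Object other) {
  // sketch only: real implementations compare more state than shown here
  if (this == other) {
    return true;
  }
  if (other == null || getClass() != other.getClass()) {
    return false;
  }
  SparkBatchQueryScan that = (SparkBatchQueryScan) other;
  // key on the described filter string rather than the expression objects
  return table().name().equals(that.table().name())
      && filtersDesc().equals(that.filtersDesc());
}

@Override
public int hashCode() {
  return java.util.Objects.hash(table().name(), filtersDesc());
}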

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.

Summary per file:

  • spark/v4.1/spark/src/test/java/org/apache/iceberg/spark/sql/TestFilterPushDown.java: Updates plan-string assertions to match the new scan description formatting.
  • spark/v4.1/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkScan.java: Adds tests validating Scan.description() for batch and copy-on-write scans.
  • spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkStagedScan.java: Fixes hashCode inputs and replaces the verbose toString() with a simpler description().
  • spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java: Introduces the helper filtersDesc() and changes toString() to delegate to description().
  • spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkPartitioningAwareScan.java: Adds a helper to stringify grouping key fields.
  • spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkLocalScan.java: Simplifies description() and makes toString() delegate to it.
  • spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkCopyOnWriteScan.java: Switches to description() and updates equality/hash to use described filters/grouping.
  • spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkChangelogScan.java: Simplifies description() and updates equality/hash to use described filters.
  • spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatchQueryScan.java: Switches to description() and updates equality/hash to use described filters/runtime filters/grouping.
  • spark/v4.1/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMerge.java: Updates plan-string assertions to match the new scan description formatting.


protected String filtersDesc() {
  return Spark3Util.describe(filterExpressions);
}

Copilot AI Feb 10, 2026


Using a human-readable description string as a proxy for filter identity is risky when it is later used by equals()/hashCode() in subclasses (as introduced in this PR). Different filter expression trees can potentially serialize to the same description string (lossy formatting, reordered terms, elided parentheses), which can make equals() return true for non-equal scans and break hash-based collections/caches. Prefer comparing the actual expression objects (or a canonical, non-lossy representation) in equals()/hashCode() rather than Spark3Util.describe(...) output.

Suggested change:
- return Spark3Util.describe(filterExpressions);
+ return filterExpressions.stream()
+     .map(Expression::toString)
+     .collect(Collectors.joining(" AND "));
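
To make the risk concrete, a contrived, self-contained illustration (the rendered strings below are hypothetical, not actual Spark3Util.describe output):

public class DescriptionCollisionDemo {
  public static void main(String[] args) {
    // Hypothetical renderings: a lossy formatter that elides quoting could
    // produce the same string for structurally different filter lists.
    String twoFilters = "a = 1, b = 2"; // filters a = 1 and b = 2
    String oneFilter = "a = 1, b = 2";  // single filter a = "1, b = 2", quotes lost
    // If equals()/hashCode() key on these strings, the two scans compare equal.
    System.out.println(twoFilters.equals(oneFilter)); // prints: true
  }
}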

assertThat(planAsString)
    .as("Pushed filters must match")
-   .contains("[filters=" + icebergFilters + ",");
+   .contains("filters=" + icebergFilters + ",");

Copilot AI Feb 10, 2026


This assertion can produce false positives because runtimeFilters= contains the substring filters=. If icebergFilters happens to match the runtime filter string, the test could pass even if the main filters= section is missing/incorrect. Make the match more specific (e.g., include a delimiter like ", filters=" or match the full scan prefix such as "IcebergScan(" + "filters=...") so it cannot accidentally match runtimeFilters=.

Suggested change:
- .contains("filters=" + icebergFilters + ",");
+ .contains("filters=" + icebergFilters + ", runtimeFilters=");

assertThat(planAsString)
    .as("Pushed filters must match")
-   .contains("[filters=" + icebergFilters + ",");
+   .contains("filters=" + icebergFilters + ",");

Copilot AI Feb 10, 2026


Same issue as above: runtimeFilters= includes filters= as a substring, so this can accidentally match the wrong section of the plan string. Tighten the assertion to include an unambiguous delimiter (e.g., ", filters=") or context around the expected field.

Suggested change:
- .contains("filters=" + icebergFilters + ",");
+ .contains(", filters=" + icebergFilters + ",");

Comment on lines +1041 to +1044
assertThat(description).contains("IcebergScan");
assertThat(description).contains(tableName);
assertThat(description).contains("filters=id = 1, id > 0");
assertThat(description).contains("groupedBy=data");

Copilot AI Feb 10, 2026


These assertions bake in the exact formatting and ordering of the rendered filter description (e.g., "id = 1, id > 0"). If Spark3Util.describe(...) changes formatting (commas vs AND, spacing, predicate order), the tests will fail even though behavior is correct. Consider asserting each predicate independently (e.g., contains "id = 1" and contains "id > 0") and separately asserting the filters= label, rather than asserting the full combined string in a single contains check.
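
Applied to the batch scan assertions above, that might look like this (a sketch in the same AssertJ style as the test):

assertThat(description).contains("filters=");
assertThat(description).contains("id = 1");
assertThat(description).contains("id > 0");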


assertThat(description).contains("IcebergCopyOnWriteScan");
assertThat(description).contains(tableName);
assertThat(description).contains("filters=id = 2, id < 10");

Copilot AI Feb 10, 2026


These assertions bake in the exact formatting and ordering of the rendered filter description (e.g., "id = 2, id < 10"). If Spark3Util.describe(...) changes formatting (commas vs AND, spacing, predicate order), the tests will fail even though behavior is correct. Consider asserting each predicate independently (e.g., contains "id = 2" and contains "id < 10") and separately asserting the filters= label, rather than asserting the full combined string in a single contains check.

Suggested change:
- assertThat(description).contains("filters=id = 2, id < 10");
+ assertThat(description).contains("filters=");
+ assertThat(description).contains("id = 2");
+ assertThat(description).contains("id < 10");

@pan3793 (Member) commented on Feb 10, 2026:

@aokolnychyi, would you mind providing some concrete examples (screenshots) of how this change affects Spark UI / EXPLAIN output? Display formatting seems not to have been properly considered or documented in the Spark DSv2 API; we internally made some hacky changes to make the UI display nicely...

@aokolnychyi (Contributor, Author)

@pan3793, if I recall correctly, Scan.description() is only used in scan exec nodes. This PR removes the projection schema, per the Javadoc instructions in Spark, and adds the scan class to the name (I am planning to add more scan types later, so it is helpful to see which scan we are dealing with). I debated dropping the runtime filters field but decided to keep it as is for now, even though it somewhat duplicates what Spark itself prints.

== Physical Plan ==
*(1) ColumnarToRow
+- BatchScan testhive.default.table[id#39, dep#40] IcebergScan(table=testhive.default.table, branch=null, filters=, runtimeFilters=, groupedBy=) RuntimeFilters: []

== Physical Plan ==
ReplaceData IcebergWrite(table=testhive.default.table, format=ORC)
+- AdaptiveSparkPlan isFinalPlan=false
   +- MergeRowsExec[__row_operation#55, id#56, dep#57, _file#58]
      +- SortMergeJoin [id#39], [id#33], FullOuter
         :- Sort [id#39 ASC NULLS FIRST], false, 0
         :  +- Exchange hashpartitioning(id#39, 4), ENSURE_REQUIREMENTS, [plan_id=91]
         :     +- Project [id#39, dep#40, _file#48, true AS __row_from_target#51, monotonically_increasing_id() AS __row_id#52L]
         :        +- BatchScan testhive.default.table[id#39, dep#40, _file#48] IcebergCopyOnWriteScan(table=testhive.default.table, filters=true, groupedBy=) RuntimeFilters: []
         +- Sort [id#33 ASC NULLS FIRST], false, 0
            +- Exchange hashpartitioning(id#33, 4), ENSURE_REQUIREMENTS, [plan_id=92]
               +- Project [id#33, dep#34, true AS __row_from_source#53]
                  +- Scan ExistingRDD[id#33,dep#34]
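
From the output above, the batch scan's new description() presumably boils down to something like this (a sketch; the accessor names are hypothetical stand-ins):

@Override
public String description() {
  // mirrors the rendered form: IcebergScan(table=..., branch=..., filters=...,
  // runtimeFilters=..., groupedBy=...)
  return String.format(
      "IcebergScan(table=%s, branch=%s, filters=%s, runtimeFilters=%s, groupedBy=%s)",
      table().name(), branch(), filtersDesc(), runtimeFiltersDesc(), groupedByDesc());
}

@Override
public String toString() {
  // toString() now delegates, so logs and plan output show the same string
  return description();
}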

@aokolnychyi (Contributor, Author)

@pan3793, I am going to merge this one so I can rebase my other PR, but I will follow up on any additional feedback you may have. Let me know what you think and whether there are gaps we should fix, either in Iceberg or in Spark.

@aokolnychyi merged commit 71b05af into apache:main on Feb 10, 2026 (22 checks passed).