Spark 4.1: Simplify description and toString in scans #15281
Conversation
singhpk234 left a comment
LGTM, thanks @aokolnychyi !
Pull request overview
Standardizes Spark v4.1 scan string representations by focusing on Scan.description() output (used by Spark plans) and de-emphasizing redundant/unused toString() content, while updating tests to match the new formatting.
Changes:
- Simplified and standardized scan description strings (drop schema/case-sensitivity noise; include key fields like table, filters, grouping).
- Adjusted `equals`/`hashCode` implementations to use the new filter description formatting in some scans.
- Updated and added tests to assert the new `description()` output.
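For a sense of the result, here is a purely illustrative rendering consistent with the test assertions quoted later in this review (the table name is a placeholder, and the actual field order and punctuation are not shown in this excerpt):

```
IcebergScan demo.db.table (filters=id = 1, id > 0, groupedBy=data)
```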
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| spark/v4.1/spark/src/test/java/org/apache/iceberg/spark/sql/TestFilterPushDown.java | Updates plan-string assertions to match new scan description formatting. |
| spark/v4.1/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkScan.java | Adds tests validating Scan.description() for batch and copy-on-write scans. |
| spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkStagedScan.java | Fixes hashCode inputs and replaces verbose toString() with simpler description(). |
| spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java | Introduces helper filtersDesc() and changes toString() to delegate to description(). |
| spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkPartitioningAwareScan.java | Adds helper to stringify grouping key fields. |
| spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkLocalScan.java | Simplifies description() and makes toString() delegate to it. |
| spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkCopyOnWriteScan.java | Switches to description() and updates equality/hash to use described filters/grouping. |
| spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkChangelogScan.java | Simplifies description() and updates equality/hash to use described filters. |
| spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatchQueryScan.java | Switches to description() and updates equality/hash to use described filters/runtime filters/grouping. |
| spark/v4.1/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMerge.java | Updates plan-string assertions to match new scan description formatting. |
spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java
```java
protected String filtersDesc() {
  return Spark3Util.describe(filterExpressions);
}
```
Copilot AI commented on Feb 10, 2026
Using a human-readable description string as a proxy for filter identity is risky when it is later used by equals()/hashCode() in subclasses (as introduced in this PR). Different filter expression trees can potentially serialize to the same description string (lossy formatting, reordered terms, elided parentheses), which can make equals() return true for non-equal scans and break hash-based collections/caches. Prefer comparing the actual expression objects (or a canonical, non-lossy representation) in equals()/hashCode() rather than Spark3Util.describe(...) output.
Suggested change:
```diff
- return Spark3Util.describe(filterExpressions);
+ return filterExpressions.stream()
+     .map(Expression::toString)
+     .collect(Collectors.joining(" AND "));
```
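A minimal, self-contained sketch of the object-based identity the comment recommends. `ExampleScan` and `Filter` are hypothetical stand-ins for illustration only (they are not Iceberg classes), and it is assumed that the real filter expression type either offers value-based `equals()`/`hashCode()` or can be reduced to a canonical, non-lossy form:

```java
import java.util.List;
import java.util.Objects;

// Minimal stand-in for a filter expression with value-based equality.
record Filter(String column, String op, Object value) {}

// Hypothetical scan class; only the identity logic is sketched.
final class ExampleScan {
  private final String tableName;
  private final List<Filter> filterExpressions;

  ExampleScan(String tableName, List<Filter> filterExpressions) {
    this.tableName = tableName;
    this.filterExpressions = filterExpressions;
  }

  @Override
  public boolean equals(Object other) {
    if (this == other) {
      return true;
    }
    if (!(other instanceof ExampleScan that)) {
      return false;
    }
    // Compare the expression objects themselves rather than a rendered
    // description, so scans whose filters merely *print* the same can
    // never be conflated by hash-based collections or caches.
    return tableName.equals(that.tableName)
        && filterExpressions.equals(that.filterExpressions);
  }

  @Override
  public int hashCode() {
    return Objects.hash(tableName, filterExpressions);
  }
}
```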
```diff
  assertThat(planAsString)
      .as("Pushed filters must match")
-     .contains("[filters=" + icebergFilters + ",");
+     .contains("filters=" + icebergFilters + ",");
```
Copilot AI commented on Feb 10, 2026
This assertion can produce false positives because runtimeFilters= contains the substring filters=. If icebergFilters happens to match the runtime filter string, the test could pass even if the main filters= section is missing/incorrect. Make the match more specific (e.g., include a delimiter like ", filters=" or match the full scan prefix such as "IcebergScan(" + "filters=...") so it cannot accidentally match runtimeFilters=.
| .contains("filters=" + icebergFilters + ","); | |
| .contains("filters=" + icebergFilters + ", runtimeFilters="); |
```diff
  assertThat(planAsString)
      .as("Pushed filters must match")
-     .contains("[filters=" + icebergFilters + ",");
+     .contains("filters=" + icebergFilters + ",");
```
Copilot AI commented on Feb 10, 2026
Same issue as above: runtimeFilters= includes filters= as a substring, so this can accidentally match the wrong section of the plan string. Tighten the assertion to include an unambiguous delimiter (e.g., ", filters=") or context around the expected field.
| .contains("filters=" + icebergFilters + ","); | |
| .contains(", filters=" + icebergFilters + ","); |
| assertThat(description).contains("IcebergScan"); | ||
| assertThat(description).contains(tableName); | ||
| assertThat(description).contains("filters=id = 1, id > 0"); | ||
| assertThat(description).contains("groupedBy=data"); |
Copilot AI commented on Feb 10, 2026
These assertions bake in the exact formatting and ordering of the rendered filter description (e.g., "id = 1, id > 0"). If Spark3Util.describe(...) changes formatting (commas vs AND, spacing, predicate order), the tests will fail even though behavior is correct. Consider asserting each predicate independently (e.g., contains "id = 1" and contains "id > 0") and separately asserting the filters= label, rather than asserting the full combined string in a single contains check.
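For the batch-scan assertions above, a sketch of the looser style the comment suggests (AssertJ as in the existing tests; the predicate strings are taken from the snippet, and `description` stands for the scan description under test):

```java
import static org.assertj.core.api.Assertions.assertThat;

// Assert the label and each predicate independently so the test is not
// coupled to Spark3Util.describe(...)'s separators, spacing, or ordering.
assertThat(description).contains("filters=");
assertThat(description).contains("id = 1");
assertThat(description).contains("id > 0");
```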
```java
assertThat(description).contains("IcebergCopyOnWriteScan");
assertThat(description).contains(tableName);
assertThat(description).contains("filters=id = 2, id < 10");
```
Copilot AI commented on Feb 10, 2026
Same issue as above: these assertions bake in the exact formatting and ordering of the rendered filter description (e.g., "id = 2, id < 10"). If Spark3Util.describe(...) changes formatting, the test will fail even though behavior is correct; assert each predicate independently and assert the filters= label separately.
| assertThat(description).contains("filters=id = 2, id < 10"); | |
| assertThat(description).contains("filters="); | |
| assertThat(description).contains("id = 2"); | |
| assertThat(description).contains("id < 10"); |
This reverts commit b19aea0.
pan3793 left a comment
@aokolnychyi, would you mind providing some concrete examples (screenshots) of how this change affects Spark UI / EXPLAIN outputs? Display formatting seems not to have been properly considered and documented in the Spark DSv2 API; we internally made some hacky changes to make the UI display pretty ...
aokolnychyi left a comment
@pan3793, if I recall correctly, …
aokolnychyi left a comment
@pan3793, I am going to merge this one to rebase my other PR. However, I will check in for any additional feedback you may have. Let me know what you think and whether there are gaps we should fix, either in Iceberg or in Spark.
This PR contains a subset of changes for #15240 that moves around some scan objects.
A few notable changes:
- Standardized `description()`, which is actually used in Spark plans; there is no need for an almost identical, unused `toString()`.
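As a rough sketch of that delegation pattern (hypothetical class and accessors; the PR's actual format string is not reproduced here):

```java
// Hypothetical scan illustrating toString() delegating to description().
abstract class DelegatingScan {
  abstract String table();

  abstract String filtersDesc();

  // The single, Spark-plan-facing string: key fields only.
  public String description() {
    return String.format(
        "%s %s (filters=%s)", getClass().getSimpleName(), table(), filtersDesc());
  }

  @Override
  public String toString() {
    // Delegate rather than maintain a second, nearly identical rendering.
    return description();
  }
}
```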