[SPARK-49110][SQL] Fix reading metadata columns for tables with CHAR columns #53846
base: master
JIRA Issue Information: Improvement SPARK-49110 (this comment was automatically generated by GitHub Actions)
502a294 to 61e96a8
  def readSideCharPadding: Boolean = getConf(SQLConf.READ_SIDE_CHAR_PADDING)

  def readSideCharPaddingAfterAlias: Boolean =
We usually don't add a method if it's only called once.
Instead of adding the char padding Project between SubqueryAlias and the data source scan and then moving it afterwards, use a two-pass approach:
- First pass: match SubqueryAlias with a data source scan, and add the padding Project AFTER SubqueryAlias to preserve metadata column access.
- Second pass: match the data source scan alone (for cases like spark.read.format(...).load()).

For idempotence, when adding the char padding Project, the rule also clears out the char type in the output attributes, so the second pass does nothing if the first pass matches. This removes the need for the READ_SIDE_PADDING_PROJECT_TAG.
Use two-pass approach for char padding with SubqueryAlias
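The two-pass idea in the commit message above can be illustrated with a toy model. This is only a sketch under loud assumptions: `Plan`, `Attr`, `Scan`, `Alias`, `PadProject`, and `TwoPassPadding` are hypothetical stand-ins invented for this example, not Spark's classes, and the real rule in the PR may differ in every detail. The point is to show why clearing the char flag makes the second pass a no-op once the first pass has matched.

```scala
// Toy plan model. None of these types are Spark's real classes.
sealed trait Plan
case class Attr(name: String, isChar: Boolean)
case class Scan(output: Seq[Attr]) extends Plan
case class Alias(name: String, child: Plan) extends Plan
case class PadProject(padded: Seq[String], child: Plan) extends Plan

object TwoPassPadding {
  private def charCols(s: Scan): Seq[String] =
    s.output.filter(_.isChar).map(_.name)

  // Clearing the char flag marks the scan as already handled (idempotence).
  private def clearChar(s: Scan): Scan =
    Scan(s.output.map(_.copy(isChar = false)))

  // First pass: an alias directly over a scan. The padding projection is
  // placed ABOVE the alias, so the alias still sits right on top of the scan.
  def firstPass(plan: Plan): Plan = plan match {
    case Alias(n, s: Scan) if charCols(s).nonEmpty =>
      PadProject(charCols(s), Alias(n, clearChar(s)))
    case Alias(n, c)      => Alias(n, firstPass(c))
    case PadProject(p, c) => PadProject(p, firstPass(c))
    case s: Scan          => s
  }

  // Second pass: a bare scan that the first pass did not handle.
  def secondPass(plan: Plan): Plan = plan match {
    case s: Scan if charCols(s).nonEmpty =>
      PadProject(charCols(s), clearChar(s))
    case s: Scan          => s
    case Alias(n, c)      => Alias(n, secondPass(c))
    case PadProject(p, c) => PadProject(p, secondPass(c))
  }

  def apply(plan: Plan): Plan = secondPass(firstPass(plan))
}
```

In this model, applying `TwoPassPadding` to an aliased scan yields a `PadProject` above the `Alias`, and applying it again leaves the plan unchanged because no scan attribute is still flagged as char.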
sql(s"CREATE TABLE $tbl (id bigint, data char(1)) PARTITIONED BY (bucket(4, id), id)")
sql(s"INSERT INTO $tbl VALUES (1, 'a'), (2, 'b'), (3, 'c')")
val sqlQuery = sql(s"SELECT id, data, index, _partition FROM $tbl")
val sqlQueryWithAlias = sql(s"SELECT t.id, t.data, t.index, t._partition FROM $tbl t")
A user-specified subquery alias always works; can we test SELECT $tbl.id, ... FROM $tbl?
What changes were proposed in this pull request?
This PR modifies SubqueryAlias to propagate the metadata output of its child if the child is a Project injected by ApplyCharTypePadding.
Why are the changes needed?
Without these changes, it is not possible to read metadata columns from a table with a CHAR column when read-side padding is enabled. In that case the plan is SubqueryAlias(Project(LeafNode)), so without this patch the metadata output is not propagated.
Does this PR introduce any user-facing change?
Yes, there will be more cases in which the metadata columns can be accessed, and users will no longer get an exception in these cases.
How was this patch tested?
Added a case to MetadataColumnSuite.
Was this patch authored or co-authored using generative AI tooling?
No
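To make the SubqueryAlias(Project(LeafNode)) problem from the description concrete, here is a minimal toy model. Everything in it is an assumption made for illustration: `Node`, `Leaf`, `Project`, `SubqueryAlias`, and the `injectedByPadding` and `propagateThroughPadding` flags are hypothetical and only mimic the behaviour the description states, namely that the alias does not expose metadata columns when a padding Project sits between it and the leaf. It is not Spark's implementation.

```scala
// Toy model of metadata-column propagation; not Spark's real classes.
sealed trait Node { def metadataOutput: Seq[String] }

// A leaf (for example a table scan) that exposes metadata columns.
case class Leaf(metadataOutput: Seq[String]) extends Node

// Assumption: a projection exposes no metadata columns of its own.
case class Project(child: Node, injectedByPadding: Boolean) extends Node {
  val metadataOutput: Seq[String] = Nil
}

// The alias qualifies and forwards metadata columns from a leaf child.
// With propagateThroughPadding = true it also looks through a Project
// that was injected for char padding, as the description proposes.
case class SubqueryAlias(
    alias: String,
    child: Node,
    propagateThroughPadding: Boolean) extends Node {
  def metadataOutput: Seq[String] = child match {
    case leaf: Leaf =>
      leaf.metadataOutput.map(c => s"$alias.$c")
    case Project(leaf: Leaf, true) if propagateThroughPadding =>
      leaf.metadataOutput.map(c => s"$alias.$c")
    case _ => Nil
  }
}
```

In this model, an alias over a padding Project reports no metadata columns unless `propagateThroughPadding` is set, which mirrors the "metadata output is not propagated" situation. The two-pass commit discussed in the conversation avoids the shape altogether by putting the Project above the alias.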