@tomvanbussel

What changes were proposed in this pull request?

This PR modifies SubqueryAlias to propagate the metadata output of its child if the child is a Project injected by ApplyCharTypePadding.

Why are the changes needed?

This is needed because without these changes it is not possible to read metadata columns from a table with a CHAR column when read-side padding is enabled. In this case the plan is SubqueryAlias(Project(LeafNode)), so without this patch the metadata output is not propagated.

Does this PR introduce any user-facing change?

Yes, there will be more cases in which the metadata columns can be accessed, and users will no longer get an exception in these cases.

How was this patch tested?

Added a case to MetadataColumnSuite.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Jan 18, 2026
@github-actions

JIRA Issue Information

=== Improvement SPARK-49110 ===
Summary: Unable to access _metadata column of tables with CHAR column with reader side padding enabled
Assignee: None
Status: Open
Affected: ["3.5.1","4.0.0"]


This comment was automatically generated by GitHub Actions


def readSideCharPadding: Boolean = getConf(SQLConf.READ_SIDE_CHAR_PADDING)

def readSideCharPaddingAfterAlias: Boolean =

we usually don't add a method if it's only called once.

cloud-fan and others added 3 commits January 20, 2026 00:23
Instead of adding char padding Project between SubqueryAlias and the
data source scan then moving it afterwards, use a two-pass approach:

- First pass: match SubqueryAlias with data source scan, add padding
  Project AFTER SubqueryAlias to preserve metadata column access
- Second pass: match data source scan alone (for cases like
  spark.read.format(...).load())

For idempotence, when adding char padding Project, the rule also clears
out char type in the output attributes, so the second pass does nothing
if the first pass matches.

This removes the need for the READ_SIDE_PADDING_PROJECT_TAG.
Use two-pass approach for char padding with SubqueryAlias
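
The two-pass idea from the commit message above can be illustrated with a small standalone sketch. This is a toy model under stated assumptions, not Spark's Catalyst code: `Plan`, `Attr`, `Scan`, `Project`, `SubqueryAlias`, `TwoPassPadding` and `metadataOutput` are hypothetical stand-ins defined here, and the propagation rules in `metadataOutput` are simplified assumptions chosen only to show why placing the padding Project after the alias keeps metadata columns reachable.

```scala
// Toy plan nodes (hypothetical; these are NOT Spark's Catalyst classes).
sealed trait Plan
final case class Attr(name: String, isChar: Boolean)
final case class Scan(output: Seq[Attr], metadataOutput: Seq[String]) extends Plan
final case class Project(paddedCols: Seq[String], child: Plan) extends Plan
final case class SubqueryAlias(alias: String, child: Plan) extends Plan

object TwoPassPadding {
  private def charCols(scan: Scan): Seq[String] =
    scan.output.filter(_.isChar).map(_.name)

  // Clearing the char flag is what makes the rule idempotent in this model.
  private def clearChar(scan: Scan): Scan =
    scan.copy(output = scan.output.map(_.copy(isChar = false)))

  // First pass: alias directly over a scan, so the Project goes ABOVE the alias.
  def firstPass(plan: Plan): Plan = plan match {
    case SubqueryAlias(alias, scan: Scan) if charCols(scan).nonEmpty =>
      Project(charCols(scan), SubqueryAlias(alias, clearChar(scan)))
    case SubqueryAlias(alias, child) => SubqueryAlias(alias, firstPass(child))
    case Project(cols, child)        => Project(cols, firstPass(child))
    case scan: Scan                  => scan
  }

  // Second pass: any scan that still has char columns (no alias above it).
  def secondPass(plan: Plan): Plan = plan match {
    case scan: Scan if charCols(scan).nonEmpty =>
      Project(charCols(scan), clearChar(scan))
    case scan: Scan                  => scan
    case SubqueryAlias(alias, child) => SubqueryAlias(alias, secondPass(child))
    case Project(cols, child)        => Project(cols, secondPass(child))
  }

  def apply(plan: Plan): Plan = secondPass(firstPass(plan))

  // Simplified assumption: an alias only exposes metadata columns of a scan
  // or another alias directly below it; a Project passes its child's through.
  def metadataOutput(plan: Plan): Seq[String] = plan match {
    case s: Scan => s.metadataOutput
    case SubqueryAlias(_, child @ (_: Scan | _: SubqueryAlias)) => metadataOutput(child)
    case SubqueryAlias(_, _) => Nil
    case Project(_, child)   => metadataOutput(child)
  }
}
```

In this model, `SubqueryAlias("t", Project(Seq("data"), scan))` exposes no metadata columns, while `TwoPassPadding(SubqueryAlias("t", scan))` yields `Project(Seq("data"), SubqueryAlias("t", ...))`, which still exposes them. Running the rule a second time leaves the plan unchanged because the first run cleared the char flag on the scan output.
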
sql(s"CREATE TABLE $tbl (id bigint, data char(1)) PARTITIONED BY (bucket(4, id), id)")
sql(s"INSERT INTO $tbl VALUES (1, 'a'), (2, 'b'), (3, 'c')")
val sqlQuery = sql(s"SELECT id, data, index, _partition FROM $tbl")
val sqlQueryWithAlias = sql(s"SELECT t.id, t.data, t.index, t._partition FROM $tbl t")

A user-specified subquery alias always works; can we test SELECT $tbl.id, ... FROM $tbl?
