[SPARK-43123][SQL] Internal field metadata should not be leaked to catalogs#40776
Closed
cloud-fan wants to merge 2 commits intoapache:masterfrom
Closed
[SPARK-43123][SQL] Internal field metadata should not be leaked to catalogs#40776cloud-fan wants to merge 2 commits intoapache:masterfrom
cloud-fan wants to merge 2 commits intoapache:masterfrom
Conversation
Contributor
Author
| def removeInternalMetadata(schema: StructType): StructType = { | ||
| StructType(schema.map { field => | ||
| val newMetadata = new MetadataBuilder().withMetadata(field.metadata) | ||
| .remove(METADATA_COL_ATTR_KEY) |
Member
There was a problem hiding this comment.
Nit: shalll we define an array for all the internal metadata outside the method?
E.g.
val internalMetaData = Seq(
METADATA_COL_ATTR_KEY,
QUALIFIED_ACCESS_ONLY,
...
Contributor
Author
|
thanks for review, merging to master! |
zsxwing
reviewed
Apr 17, 2023
| val AUTO_GENERATED_ALIAS = "__autoGeneratedAlias" | ||
|
|
||
| val INTERNAL_METADATA_KEYS = Seq( | ||
| AUTO_GENERATED_ALIAS, |
Member
There was a problem hiding this comment.
Are these metadata keys only in the top level columns?
Contributor
Author
There was a problem hiding this comment.
I believe so, after checking the code that generates them. For example, https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala#L171-L175
vkorukanti
pushed a commit
to delta-io/delta
that referenced
this pull request
Apr 20, 2023
[SPARK-43123](apache/spark#40776) fixed an issue that Spark might leak internal metadata, which caused Delta to store Spark's internal metadata in its table schema. Spark's internal metadata may trigger special behaviors. For example, if a column metadata has `__metadata_col`, it cannot be selected by star. If we leak `__metadata_col` to any column in a Delta table, we won't be able to query this column when using `SELECT *`. Although [SPARK-43123](apache/spark#40776) fixes the issue in new Spark versions, we might have already persisted internal metadata in some Delta tables. To make these Delta tables readable again, this PR adds an extra step to clean up internal metadata before returning the table schema to Spark. GitOrigin-RevId: 60eb4046d55e955379c98e409993b33e753c5256
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
In Spark, we have defined some internal field metadata to help query resolution and compilation. For example, there are quite some field metadata that are related to metadata columns.
However, when we create tables, these internal field metadata can be leaked. This PR updates CTAS/RTAS commands to remove these internal field metadata before creating tables. CREATE/REPLACE TABLE command is fine as users can't generate these internal field metadata via the type string.
Why are the changes needed?
to avoid potential issues, like mistakenly treating a data column as metadata column
Does this PR introduce any user-facing change?
no
How was this patch tested?
new test