-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-38159][SQL] Add a new FileSourceMetadataAttribute for the Hidden File Metadata #35459
Conversation
Hi @cloud-fan, an XS PR to refactor the |
@@ -437,16 +437,20 @@ object VirtualColumn { | |||
* The internal representation of the hidden metadata struct: | |||
* set `__metadata_col` to `true` in AttributeReference metadata | |||
* - apply() will create a metadata attribute reference | |||
* - unapply() will check if an attribute reference is the metadata attribute reference | |||
* - unapply() will check if an attribute reference is the file source metadata attribute reference | |||
*/ | |||
object MetadataAttribute { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should have two objects
MetadataAttribute
: This is used to create and extract general metadata attributes (not only for file source)FileSourceMetadataAttribute
: To extract file source metadata attributes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sense!
@cloud-fan addressed your comment by adding a |
val ATTRIBUTE_NAME = "_metadata" | ||
|
||
def unapply(attr: AttributeReference): Option[AttributeReference] = { | ||
if (attr.metadata.contains(METADATA_COL_ATTR_KEY) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
attr match {
case MetadataAttribute(a) if a.name.equalsIgnoreCase(ATTRIBUTE_NAME) => Some(attr)
case _ => None
}
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
Show resolved
Hide resolved
@@ -366,9 +364,8 @@ case class FileSourceScanExec( | |||
@transient | |||
private lazy val pushedDownFilters = { | |||
val supportNestedPredicatePushdown = DataSourceUtils.supportNestedPredicatePushdown(relation) | |||
// TODO: should be able to push filters containing metadata columns down to skip files | |||
dataFilters.filterNot(_.references.exists { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better to remove the metadata col filter in FileSourceStrategy
, as that's where we make the metadata col filter invalid: we remove the struct-type metadata col and convert its fields to attributes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed, it is better to remove in FileSourceStrategy
, but we actually need to use the intact dataFilters
here in FileSourceScanExec
to filter files to read (i.e. listFiles(..., dataFilters)
).
added some comments here, do you think it is OK for now?
*/ | ||
object FileSourceMetadataAttribute { | ||
|
||
val ATTRIBUTE_NAME = "_metadata" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After more thought, I think it's more robust to not rely on the attribute name to extract file source metadata attrs. We can add a new metadata key to indicate file source metadata attributes. Sorry for the back and forth!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np! sounds good to me!
Can one of the admins verify this patch? |
* - apply() will create a file source metadata attribute reference | ||
* - unapply() will check if an attribute reference is the file source metadata attribute reference | ||
*/ | ||
object FileSourceMetadataAttribute { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still keep the METADATA_COL_ATTR_KEY
in FileSourceMetadataAttribute
: FileSourceMetadataAttribute
<= MetadataAttribute
, please let me know WDYT
@cloud-fan addressed all your comments:
ready for another pass, thanks! |
@@ -196,12 +196,10 @@ case class FileSourceScanExec( | |||
optionalNumCoalescedBuckets: Option[Int], | |||
dataFilters: Seq[Expression], | |||
tableIdentifier: Option[TableIdentifier], | |||
disableBucketedScan: Boolean = false) | |||
disableBucketedScan: Boolean = false, | |||
metadataColumns: Seq[AttributeReference]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now we can avoid adding this new parameter. When convert the fields of struct-type metadata col, we can use FileSourceMetadataAttribute
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, forgot to revert it back...
document building fails. Can you rerun the Github Action job? |
done! thanks |
thanks, merging to master! |
What changes were proposed in this pull request?
Add a new
FileSourceMetadataAttribute
object with anapply
method to create aFileSourceMetadataAttribute
, and anunapply
method to match only file source metadata attribute.Why are the changes needed?
Extra safeguard to make sure it matches file source metadata attribute.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing UTs