Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-38159][SQL] Add a new FileSourceMetadataAttribute for the Hidden File Metadata #35459

Closed
wants to merge 5 commits into from

Conversation

Yaohua628
Copy link
Contributor

@Yaohua628 Yaohua628 commented Feb 9, 2022

What changes were proposed in this pull request?

Add a new FileSourceMetadataAttribute object with an apply method to create a FileSourceMetadataAttribute, and an unapply method to match only file source metadata attribute.

Why are the changes needed?

Extra safeguard to make sure it matches file source metadata attribute.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing UTs

@github-actions github-actions bot added the SQL label Feb 9, 2022
@Yaohua628
Copy link
Contributor Author

Hi @cloud-fan, an XS PR to refactor the MetadataAttribute unapply method, please take a look whenever you have time, thanks a lot!

@@ -437,16 +437,20 @@ object VirtualColumn {
* The internal representation of the hidden metadata struct:
* set `__metadata_col` to `true` in AttributeReference metadata
* - apply() will create a metadata attribute reference
* - unapply() will check if an attribute reference is the metadata attribute reference
* - unapply() will check if an attribute reference is the file source metadata attribute reference
*/
object MetadataAttribute {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should have two objects

  1. MetadataAttribute: This is used to create and extract general metadata attributes (not only for file source)
  2. FileSourceMetadataAttribute: To extract file source metadata attributes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sense!

@HyukjinKwon HyukjinKwon changed the title [SPARK-38159] Minor refactor of MetadataAttribute unapply method [SPARK-38159][SQL] Minor refactor of MetadataAttribute unapply method Feb 10, 2022
@Yaohua628
Copy link
Contributor Author

@cloud-fan addressed your comment by adding a FileSourceMetadataAttribute, thanks!

val ATTRIBUTE_NAME = "_metadata"

def unapply(attr: AttributeReference): Option[AttributeReference] = {
if (attr.metadata.contains(METADATA_COL_ATTR_KEY)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

attr match {
  case MetadataAttribute(a) if a.name.equalsIgnoreCase(ATTRIBUTE_NAME) => Some(attr)
  case _ => None
}

@@ -366,9 +364,8 @@ case class FileSourceScanExec(
@transient
private lazy val pushedDownFilters = {
val supportNestedPredicatePushdown = DataSourceUtils.supportNestedPredicatePushdown(relation)
// TODO: should be able to push filters containing metadata columns down to skip files
dataFilters.filterNot(_.references.exists {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to remove the metadata col filter in FileSourceStrategy, as that's where we make the metadata col filter invalid: we remove the struct-type metadata col and convert its fields to attributes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed, it is better to remove in FileSourceStrategy, but we actually need to use the intact dataFilters here in FileSourceScanExec to filter files to read (i.e. listFiles(..., dataFilters)).

added some comments here, do you think it is OK for now?

*/
object FileSourceMetadataAttribute {

val ATTRIBUTE_NAME = "_metadata"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After more thought, I think it's more robust to not rely on the attribute name to extract file source metadata attrs. We can add a new metadata key to indicate file source metadata attributes. Sorry for the back and forth!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np! sounds good to me!

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

* - apply() will create a file source metadata attribute reference
* - unapply() will check if an attribute reference is the file source metadata attribute reference
*/
object FileSourceMetadataAttribute {
Copy link
Contributor Author

@Yaohua628 Yaohua628 Feb 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still keep the METADATA_COL_ATTR_KEY in FileSourceMetadataAttribute: FileSourceMetadataAttribute <= MetadataAttribute, please let me know WDYT

@Yaohua628
Copy link
Contributor Author

@cloud-fan addressed all your comments:

  • added a new key for FileSourceMetadataAttribute
  • addressed some nits
  • replied one of your suggestions

ready for another pass, thanks!

@@ -196,12 +196,10 @@ case class FileSourceScanExec(
optionalNumCoalescedBuckets: Option[Int],
dataFilters: Seq[Expression],
tableIdentifier: Option[TableIdentifier],
disableBucketedScan: Boolean = false)
disableBucketedScan: Boolean = false,
metadataColumns: Seq[AttributeReference])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now we can avoid adding this new parameter. When convert the fields of struct-type metadata col, we can use FileSourceMetadataAttribute

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, forgot to revert it back...

@Yaohua628 Yaohua628 changed the title [SPARK-38159][SQL] Minor refactor of MetadataAttribute unapply method [SPARK-38159][SQL] Add a new FileSourceMetadataAttribute for the Hidden File Metadata Feb 10, 2022
@cloud-fan
Copy link
Contributor

document building fails. Can you rerun the Github Action job?

@Yaohua628
Copy link
Contributor Author

document building fails. Can you rerun the Github Action job?

done! thanks

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 7ed51bb Feb 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants