
[SPARK-37273][SQL] Support hidden file metadata columns in Spark SQL #34575

Closed
wants to merge 22 commits

Conversation

@Yaohua628 (Contributor) commented Nov 12, 2021

What changes were proposed in this pull request?

This PR proposes a new interface in Spark SQL that allows users to query the metadata of the input files for all file formats. Spark SQL exposes them as built-in hidden columns, meaning users can only see them when they explicitly reference them. Currently, this PR proposes to support the following metadata columns inside a metadata struct _metadata:

| Name | Type | Description | Example |
| --- | --- | --- | --- |
| _metadata.file_path | String | The absolute path of the input file. | file:/tmp/spark-7f600b30-b3ec-43a8-8cd2-686491654f9b/f0.csv |
| _metadata.file_name | String | The name of the input file, including its extension. | f0.csv |
| _metadata.file_size | Long | The length of the input file, in bytes. | 628 |
| _metadata.file_modification_time | Timestamp | The modification timestamp of the input file. | 2021-12-20 20:05:21 |

This proposed hidden file metadata interface has the following behaviors:

  • Hidden: metadata columns are hidden. They do not show up when only data columns are selected or when selecting all columns (SELECT *). In other words, they are not returned unless explicitly referenced (see the sketch after this list).
  • Does not overwrite the data schema: in the case of a name collision with a data column, the data column is returned instead of the metadata column. In other words, metadata columns cannot overwrite user data in any case.
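
For illustration, a minimal sketch of the hidden behavior (the file path and schema are made up for this example):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val df = spark.read.format("csv")
  .schema("name STRING, age INT")
  .load("file:/tmp/people/*")

// Metadata columns are hidden: SELECT * returns only the data columns.
df.select("*").columns
// Array(name, age)

// They are returned only when explicitly referenced.
df.select("name", "_metadata.file_name", "_metadata.file_size").show()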

Why are the changes needed?

To improve Spark SQL observability for all file formats that still leverage DSv1 (Data Source V1).

Does this PR introduce any user-facing change?

Yes.

spark.read.format("csv")
     .schema(schema)
     .load("file:/tmp/*")
     .select("name", "age",
             "_metadata.file_path", "_metadata.file_name",
             "_metadata.file_size", "_metadata.file_modification_time")

Example return:

| name | age | file_path | file_name | file_size | file_modification_time |
| --- | --- | --- | --- | --- | --- |
| Debbie | 18 | file:/tmp/f0.csv | f0.csv | 12 | 2021-07-02 01:05:21 |
| Frank | 24 | file:/tmp/f1.csv | f1.csv | 11 | 2021-12-20 02:06:21 |
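
The metadata fields can also be used in ordinary expressions such as filters. A hedged sketch, reusing the schema from the example above (the cutoff date is illustrative):

import org.apache.spark.sql.functions.{col, lit}

spark.read.format("csv")
  .schema(schema)
  .load("file:/tmp/*")
  // keep only rows coming from files modified after the (illustrative) cutoff
  .where(col("_metadata.file_modification_time") > lit("2021-12-01").cast("timestamp"))
  .select("name", "_metadata.file_name")
  .show()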

How was this patch tested?

Added a new test suite: FileMetadataColumnsSuite.

@github-actions bot added the SQL label on Nov 12, 2021
/**
* The internal representation of the hidden metadata column
*/
class MetadataAttribute(

Contributor Author:
Will think about this new class. Maybe have something like AttributeReferenceBase trait.

@Yaohua628 (Contributor Author):

@cloud-fan @brkyvz It would be great if you can take a look! Thanks!

@brkyvz (Contributor) commented Nov 13, 2021:

ok to test

override val metadata: Metadata = Metadata.empty)(
override val exprId: ExprId = NamedExpression.newExprId,
override val qualifier: Seq[String] = Seq.empty[String])
extends AttributeReference(name, dataType, nullable, metadata)(exprId, qualifier) {

Contributor:
Let's not extend AttributeReference, otherwise copy can cause issues

@@ -276,3 +276,10 @@ object LogicalPlanIntegrity {
checkIfSameExprIdNotReused(plan) && hasUniqueExprIdsForOutput(plan)
}
}

/**
* A logical plan node with exposed metadata columns

Contributor:
A logical plan node that can generate metadata columns
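
For orientation, a rough sketch of what such a node-level contract could look like; the trait and method names here are assumptions for illustration, not necessarily the merged API:

import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Sketch: a logical plan node that can generate metadata columns on demand.
// When a query explicitly references a metadata column, the analyzer can ask
// such a node to re-expose its output with the metadata attributes appended.
trait CanGenerateMetadataColumnsSketch { self: LogicalPlan =>
  def withMetadataColumns(): LogicalPlan
}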

@HyukjinKwon (Member):

ok to test

@SparkQA commented Nov 13, 2021:

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49652/

@SparkQA commented Nov 13, 2021:

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49652/

@SparkQA commented Nov 13, 2021:

Test build #145183 has finished for PR 34575 at commit fc043fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 16, 2021:

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49723/

@SparkQA commented Nov 16, 2021:

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49726/

@SparkQA commented Nov 16, 2021:

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49723/

@SparkQA commented Nov 16, 2021:

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49726/

@SparkQA commented Nov 16, 2021:

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49733/

@SparkQA commented Nov 16, 2021:

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49733/

@SparkQA commented Nov 16, 2021:

Test build #145253 has finished for PR 34575 at commit 170378b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 16, 2021:

Test build #145256 has finished for PR 34575 at commit 73593c5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 16, 2021:

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49738/

@SparkQA commented Nov 16, 2021:

Test build #145263 has finished for PR 34575 at commit c531300.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 16, 2021:

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49738/

@SparkQA commented Nov 16, 2021:

Test build #145268 has finished for PR 34575 at commit bd28eb7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

*/
object MetadataAttribute {
def apply(name: String, dataType: DataType): AttributeReference =
AttributeReference(name, dataType, true,

Contributor:
Shall we allow a non-nullable metadata attribute? We should probably add one more parameter to apply: nullable: Boolean.
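
A rough sketch of the suggested shape, with nullability exposed and a matching extractor; the "__metadata_col" marker key and the defaults are illustrative, not necessarily what the PR uses:

import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.types.{DataType, MetadataBuilder}

object MetadataAttributeSketch {
  private val METADATA_COL_KEY = "__metadata_col" // illustrative marker key

  // Create a metadata attribute; nullability is a caller choice instead of hard-coded `true`.
  def apply(name: String, dataType: DataType, nullable: Boolean = true): AttributeReference =
    AttributeReference(name, dataType, nullable,
      new MetadataBuilder().putBoolean(METADATA_COL_KEY, true).build())()

  // Match attributes that carry the marker, e.g. `case MetadataAttributeSketch(attr) => ...`
  def unapply(attr: AttributeReference): Option[AttributeReference] =
    if (attr.metadata.contains(METADATA_COL_KEY) &&
        attr.metadata.getBoolean(METADATA_COL_KEY)) Some(attr) else None
}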

.add(StructField(FILE_PATH, StringType))
.add(StructField(FILE_NAME, StringType))
.add(StructField(FILE_SIZE, LongType))
.add(StructField(FILE_MODIFICATION_TIME, LongType))

Contributor:
should this be TimestampType?

@Yaohua628 (Contributor Author) commented Dec 21, 2021:
I think it's more of a design choice; both are fine and I don't have a strong opinion on it.
  • long matches what the file modification time gives you directly;
  • timestamp is more readable.
WDYT?

Contributor:
I think this one is an easy decision. Timestamp type is much better as people can do WHERE _metadata.modificationTime < TIMESTAMP'2020-12-12 12:12:12' or other datetime operations. And df.show can also display the value in a more user-readable format.

Contributor Author:
got it, it makes sense! addressed.
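
For context, a minimal sketch of the resulting metadata struct once the modification time uses TimestampType (field names as in the table above; the actual constants in the PR may differ):

import org.apache.spark.sql.types._

val metadataStruct: StructType = new StructType()
  .add(StructField("file_path", StringType))
  .add(StructField("file_name", StringType))
  .add(StructField("file_size", LongType))
  .add(StructField("file_modification_time", TimestampType)) // was LongType before this change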

@cloud-fan (Contributor) left a comment:
LGTM, only one comment about the data type of one metadata column.

@SparkQA commented Dec 21, 2021:

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50904/

@SparkQA commented Dec 21, 2021:

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50903/

@SparkQA commented Dec 21, 2021:

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50907/

@SparkQA commented Dec 21, 2021:

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50904/

@@ -45,7 +45,7 @@ import org.apache.spark.util.NextIterator
* @param filePath URI of the file to read
* @param start the beginning offset (in bytes) of the block.
* @param length number of bytes to read.
- * @param modificationTime The modification time of the input file, in milliseconds.
+ * @param modificationTime The modification time of the input file, in microseconds.

Contributor:
nit: I think we can still put milliseconds here, as it matches file.getModificationTime. We can * 1000 in FileScanRDD when we set the value to the internal row.

Contributor Author:
sure! done
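
A small sketch of the agreed conversion, assuming Spark's internal timestamp representation of microseconds since the epoch (variable names are illustrative):

// PartitionedFile keeps the value from file.getModificationTime, in milliseconds.
val modificationTimeMs: Long = 1640044800000L // illustrative value

// When FileScanRDD materializes _metadata.file_modification_time into the internal row,
// it converts to microseconds, the unit of Spark's internal TimestampType.
val modificationTimeMicros: Long = modificationTimeMs * 1000L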

@SparkQA commented Dec 21, 2021:

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50903/

@SparkQA commented Dec 21, 2021:

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50907/

@SparkQA commented Dec 21, 2021:

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50911/

@SparkQA commented Dec 21, 2021:

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50912/

@SparkQA commented Dec 21, 2021:

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50911/

@SparkQA commented Dec 21, 2021:

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50912/

@SparkQA commented Dec 21, 2021:

Test build #146428 has finished for PR 34575 at commit 00bda90.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 21, 2021:

Test build #146429 has finished for PR 34575 at commit afa0a83.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 21, 2021:

Test build #146431 has finished for PR 34575 at commit 65e79ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

thanks, merging to master!

@cloud-fan closed this in 62cf4d4 on Dec 21, 2021
@SparkQA commented Dec 21, 2021:

Test build #146436 has finished for PR 34575 at commit 3b3d635.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 21, 2021:

Test build #146437 has finished for PR 34575 at commit 4400f6a.

  • This patch fails from timeout after a configured wait of 500m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@c21 (Contributor) left a comment:
Sorry for the late review, and thanks to @Yaohua628 for the work! Just have some questions. Thanks.

Comment on lines +164 to +171
val columnVector = new OnHeapColumnVector(c.numRows(), StringType)
rowId = 0
// use a tight-loop for better performance
while (rowId < c.numRows()) {
  columnVector.putByteArray(rowId, filePathBytes)
  rowId += 1
}
columnVector

Contributor:
It looks like for each batch of input rows, we need to recreate a new on-heap column vector and write the same constant value for every row (i.e. file path, file name, file size, etc.). Just wondering about the performance penalty when reading a large table; how big a table have we tested?

Maybe a simple optimization here is to come up with something like a ConstantColumnVector, where every row has the same value and we only need to store one copy of it.
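
A rough sketch of the idea; this does not extend Spark's real ColumnVector API, it just illustrates the "store the value once" shape such a ConstantColumnVector could take:

import org.apache.spark.unsafe.types.UTF8String

// Sketch: every row reports the same value, so populating a batch with a constant
// (file path, file name, ...) no longer requires writing it once per row.
class ConstantStringVectorSketch(numRows: Int, value: UTF8String) {
  def getUTF8String(rowId: Int): UTF8String = value // same value for any rowId < numRows
}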

Contributor Author:
Thanks for reviewing! Makes sense; also Bart mentioned something about optimizing putByteArray for all rows: #34575 (comment)

val filePathBytes = path.toString.getBytes
val fileNameBytes = path.getName.getBytes
var rowId = 0
metadataColumns.map(_.name).map {

Contributor:
We already know how to fill the column vector for each metadata column, so the pattern matching can be done outside of execution; it does not need to be done per batch here.

Contributor:
Some small per-batch overhead should be fine.

Contributor:
Yeah, I agree it's not a huge issue since it's per batch, not per row. But I also think it's not hard to organize the code in the most efficient way.

Contributor Author:
Thanks for the comments, really appreciate it!

But we have to do something per batch, right? We cannot be sure of c.numRows (a small file or the last batch), and different file formats can have different configurable max rows per batch (Parquet, ORC).

Unless we had a ConstantColumnVector, as you mentioned, or something else that only needs c.numRows for each batch?
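
As a sketch of moving the per-column decision out of the batch loop (the column names and writer shape are illustrative, not the PR's code): resolve once per file how each requested metadata column is filled, so each batch only needs c.numRows():

import org.apache.spark.sql.execution.vectorized.WritableColumnVector

// Decide once (per file) how a given metadata column is written; the per-batch loop
// then just invokes the writer for rowId = 0 until numRows.
def writerFor(
    name: String,
    filePathBytes: Array[Byte],
    fileNameBytes: Array[Byte],
    fileSize: Long): (WritableColumnVector, Int) => Unit = name match {
  case "file_path" => (v, rowId) => v.putByteArray(rowId, filePathBytes)
  case "file_name" => (v, rowId) => v.putByteArray(rowId, fileNameBytes)
  case "file_size" => (v, rowId) => v.putLong(rowId, fileSize)
  case other => throw new IllegalArgumentException(s"unexpected metadata column: $other")
}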


val FILE_PATH = "file_path"

val FILE_NAME = "file_name"

Contributor:
Wondering, do we also plan to deprecate the existing InputFileName expression in Spark?

Contributor:
Good point. I think we should, as InputFileName is really fragile and can't be used with joins, for example.
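
For comparison, a hedged sketch of the two approaches (paths and column names are illustrative); per the comment above, the expression-based variant is fragile because its value depends on which file the running task happens to be reading, while the metadata column resolves like an ordinary hidden column of the file scan:

import org.apache.spark.sql.functions.{col, input_file_name}

// Expression-based (existing): value comes from task-local state during the scan.
val viaExpression = spark.read.parquet("file:/tmp/events")
  .select(col("id"), input_file_name().as("path"))

// Metadata-column-based (this PR): behaves like a regular column of the file scan,
// so it composes with later operators such as joins.
val viaMetadata = spark.read.parquet("file:/tmp/events")
  .select(col("id"), col("_metadata.file_path").as("path"))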

case MetadataAttribute(attr) => attr
}

// TODO (yaohua): should be able to prune the metadata struct only containing what needed

Contributor:
nit: Shall we file a JIRA?

cloud-fan pushed a commit that referenced this pull request Jan 18, 2022
### What changes were proposed in this pull request?
Follow-up PR of #34575. Support the metadata struct schema pruning for all file formats.

### Why are the changes needed?
Performance improvements.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs and a new UT.

Closes #35147 from Yaohua628/spark-37768.

Authored-by: yaohua <yaohua.zhao@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Jan 19, 2022
…present in the data filter

### What changes were proposed in this pull request?
Follow-up PR of #34575. Filtering files if metadata columns are present in the data filter.

### Why are the changes needed?
Performance improvements.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs and a new UT.

Closes #35055 from Yaohua628/spark-37769.

Authored-by: yaohua <yaohua.zhao@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>