
[SPARK-37273][SQL] Support hidden file metadata columns in Spark SQL #34575

Closed
wants to merge 22 commits

Conversation

@Yaohua628 (Contributor) commented Nov 12, 2021

What changes were proposed in this pull request?

This PR proposes a new interface in Spark SQL that allows users to query the metadata of the input files for all file formats. Spark SQL exposes them as built-in hidden columns, meaning users can only see them when they explicitly reference them. Currently, this PR proposes to support the following metadata columns inside a metadata struct _metadata:

| Name | Type | Description | Example |
| --- | --- | --- | --- |
| _metadata.file_path | String | The absolute path of the input file. | file:/tmp/spark-7f600b30-b3ec-43a8-8cd2-686491654f9b/f0.csv |
| _metadata.file_name | String | The name of the input file, including its extension. | f0.csv |
| _metadata.file_size | Long | The length of the input file, in bytes. | 628 |
| _metadata.file_modification_time | Timestamp | The modification timestamp of the input file. | 2021-12-20 20:05:21 |

This proposed hidden file metadata interface has the following behaviors:

  • Hidden: metadata columns are hidden. They do not show up when only data columns are selected or when selecting all columns (SELECT *). In other words, they are not returned unless explicitly referenced (see the sketch after this list).
  • Does not overwrite the data schema: in the case of a name collision with a data column, the data column is returned instead of the metadata column. In other words, metadata columns cannot overwrite user data in any case.
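
For illustration, a minimal sketch of the hidden behavior (the file path and schema are made up for this example):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val df = spark.read.format("csv")
  .schema("name STRING, age INT")
  .load("file:/tmp/people/*")

// Metadata columns are hidden: SELECT * returns only the data columns.
df.select("*").columns
// Array(name, age)

// They are returned only when explicitly referenced.
df.select("name", "_metadata.file_name", "_metadata.file_size").show()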

Why are the changes needed?

To improve Spark SQL observability for all file formats that still leverage DSv1 (Data Source V1).

Does this PR introduce any user-facing change?

Yes.

spark.read.format("csv")
     .schema(schema)
     .load("file:/tmp/*")
     .select("name", "age",
             "_metadata.file_path", "_metadata.file_name",
             "_metadata.file_size", "_metadata.file_modification_time")

Example return:

| name | age | file_path | file_name | file_size | file_modification_time |
| --- | --- | --- | --- | --- | --- |
| Debbie | 18 | file:/tmp/f0.csv | f0.csv | 12 | 2021-07-02 01:05:21 |
| Frank | 24 | file:/tmp/f1.csv | f1.csv | 11 | 2021-12-20 02:06:21 |
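
The metadata fields can also be used in ordinary expressions such as filters. A hedged sketch, reusing the schema from the example above (the cutoff date is illustrative):

import org.apache.spark.sql.functions.{col, lit}

spark.read.format("csv")
  .schema(schema)
  .load("file:/tmp/*")
  // keep only rows coming from files modified after the (illustrative) cutoff
  .where(col("_metadata.file_modification_time") > lit("2021-12-01").cast("timestamp"))
  .select("name", "_metadata.file_name")
  .show()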

How was this patch tested?

Added a new test suite: FileMetadataColumnsSuite.

@github-actions bot added the SQL label on Nov 12, 2021
/**
* The internal representation of the hidden metadata column
*/
class MetadataAttribute(

Contributor Author:
Will think about this new class. Maybe have something like AttributeReferenceBase trait.

@Yaohua628 (Contributor Author):

@cloud-fan @brkyvz It would be great if you can take a look! Thanks!

@brkyvz (Contributor) commented Nov 13, 2021:

ok to test

override val metadata: Metadata = Metadata.empty)(
override val exprId: ExprId = NamedExpression.newExprId,
override val qualifier: Seq[String] = Seq.empty[String])
extends AttributeReference(name, dataType, nullable, metadata)(exprId, qualifier) {

Contributor:
Let's not extend AttributeReference, otherwise copy can cause issues

@@ -276,3 +276,10 @@ object LogicalPlanIntegrity {
checkIfSameExprIdNotReused(plan) && hasUniqueExprIdsForOutput(plan)
}
}

/**
* A logical plan node with exposed metadata columns

Contributor:
A logical plan node that can generate metadata columns
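
For orientation, a rough sketch of what such a node-level contract could look like; the trait and method names here are assumptions for illustration, not necessarily the merged API:

import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Sketch: a logical plan node that can generate metadata columns on demand.
// When a query explicitly references a metadata column, the analyzer can ask
// such a node to re-expose its output with the metadata attributes appended.
trait CanGenerateMetadataColumnsSketch { self: LogicalPlan =>
  def withMetadataColumns(): LogicalPlan
}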

@HyukjinKwon (Member):

ok to test

@SparkQA commented Nov 13, 2021:

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49652/

@SparkQA commented Nov 13, 2021:

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49652/

@SparkQA commented Nov 13, 2021:

Test build #145183 has finished for PR 34575 at commit fc043fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 16, 2021:

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49723/

@SparkQA commented Nov 16, 2021:

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49726/

@SparkQA commented Nov 16, 2021:

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49723/

@SparkQA commented Nov 16, 2021:

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49726/

@SparkQA commented Nov 16, 2021:

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49733/

@SparkQA commented Nov 16, 2021:

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49733/

@SparkQA commented Nov 16, 2021:

Test build #145253 has finished for PR 34575 at commit 170378b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 16, 2021:

Test build #145256 has finished for PR 34575 at commit 73593c5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 16, 2021:

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49738/

@SparkQA commented Nov 16, 2021:

Test build #145263 has finished for PR 34575 at commit c531300.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 16, 2021:

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49738/

@SparkQA commented Nov 16, 2021:

Test build #145268 has finished for PR 34575 at commit bd28eb7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

*/
object MetadataAttribute {
def apply(name: String, dataType: DataType): AttributeReference =
AttributeReference(name, dataType, true,

Contributor:
Shall we allow a non-nullable metadata attribute? We should probably add one more parameter to apply: nullable: Boolean.
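
A rough sketch of the suggested shape, with nullability exposed and a matching extractor; the "__metadata_col" marker key and the defaults are illustrative, not necessarily what the PR uses:

import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.types.{DataType, MetadataBuilder}

object MetadataAttributeSketch {
  private val METADATA_COL_KEY = "__metadata_col" // illustrative marker key

  // Create a metadata attribute; nullability is a caller choice instead of hard-coded `true`.
  def apply(name: String, dataType: DataType, nullable: Boolean = true): AttributeReference =
    AttributeReference(name, dataType, nullable,
      new MetadataBuilder().putBoolean(METADATA_COL_KEY, true).build())()

  // Match attributes that carry the marker, e.g. `case MetadataAttributeSketch(attr) => ...`
  def unapply(attr: AttributeReference): Option[AttributeReference] =
    if (attr.metadata.contains(METADATA_COL_KEY) &&
        attr.metadata.getBoolean(METADATA_COL_KEY)) Some(attr) else None
}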

.add(StructField(FILE_PATH, StringType))
.add(StructField(FILE_NAME, StringType))
.add(StructField(FILE_SIZE, LongType))
.add(StructField(FILE_MODIFICATION_TIME, LongType))

Contributor:
should this be TimestampType?

@Yaohua628 (Contributor Author) commented Dec 21, 2021:
I think it's more of a design choice; both are fine and I don't have a strong opinion on it.
  • long matches what the file modification time gives you directly;
  • timestamp is more readable.
WDYT?

Contributor:
I think this one is an easy decision. Timestamp type is much better as people can do WHERE _metadata.modificationTime < TIMESTAMP'2020-12-12 12:12:12' or other datetime operations. And df.show can also display the value in a more user-readable format.

Contributor Author:
got it, it makes sense! addressed.
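
For context, a minimal sketch of the resulting metadata struct once the modification time uses TimestampType (field names as in the table above; the actual constants in the PR may differ):

import org.apache.spark.sql.types._

val metadataStruct: StructType = new StructType()
  .add(StructField("file_path", StringType))
  .add(StructField("file_name", StringType))
  .add(StructField("file_size", LongType))
  .add(StructField("file_modification_time", TimestampType)) // was LongType before this change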

@cloud-fan (Contributor) left a comment:
LGTM, only one comment about the data type of one metadata column.

@SparkQA commented Dec 21, 2021:

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50904/

@SparkQA commented Dec 21, 2021:

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50903/

@SparkQA commented Dec 21, 2021:

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50907/

@SparkQA commented Dec 21, 2021:

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50904/

@@ -45,7 +45,7 @@ import org.apache.spark.util.NextIterator
* @param filePath URI of the file to read
* @param start the beginning offset (in bytes) of the block.
* @param length number of bytes to read.
- * @param modificationTime The modification time of the input file, in milliseconds.
+ * @param modificationTime The modification time of the input file, in microseconds.

Contributor:
nit: I think we can still put milliseconds here, as it matches file.getModificationTime. We can * 1000 in FileScanRDD when we set the value to the internal row.

Contributor Author:
sure! done
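
A small sketch of the agreed conversion, assuming Spark's internal timestamp representation of microseconds since the epoch (variable names are illustrative):

// PartitionedFile keeps the value from file.getModificationTime, in milliseconds.
val modificationTimeMs: Long = 1640044800000L // illustrative value

// When FileScanRDD materializes _metadata.file_modification_time into the internal row,
// it converts to microseconds, the unit of Spark's internal TimestampType.
val modificationTimeMicros: Long = modificationTimeMs * 1000L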

@SparkQA commented Dec 21, 2021:

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50903/

@SparkQA commented Dec 21, 2021:

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50907/

@SparkQA commented Dec 21, 2021:

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50911/

@SparkQA commented Dec 21, 2021:

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50912/

@SparkQA commented Dec 21, 2021:

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50911/

@SparkQA commented Dec 21, 2021:

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50912/

@SparkQA commented Dec 21, 2021:

Test build #146428 has finished for PR 34575 at commit 00bda90.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 21, 2021:

Test build #146429 has finished for PR 34575 at commit afa0a83.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 21, 2021:

Test build #146431 has finished for PR 34575 at commit 65e79ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

thanks, merging to master!

@cloud-fan closed this in 62cf4d4 on Dec 21, 2021
@SparkQA commented Dec 21, 2021:

Test build #146436 has finished for PR 34575 at commit 3b3d635.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 21, 2021:

Test build #146437 has finished for PR 34575 at commit 4400f6a.

  • This patch fails from timeout after a configured wait of 500m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@c21 (Contributor) left a comment:
Sorry for the late review, and thanks to @Yaohua628 for the work! Just have some questions. Thanks.

Comment on lines +164 to +171
val columnVector = new OnHeapColumnVector(c.numRows(), StringType)
rowId = 0
// use a tight-loop for better performance
while (rowId < c.numRows()) {
  columnVector.putByteArray(rowId, filePathBytes)
  rowId += 1
}
columnVector

Contributor:
It looks like for each batch of input rows, we need to recreate a new on-heap column vector and write the same constant value for every row (i.e. file path, file name, file size, etc.). Just wondering about the performance penalty when reading a large table; how big a table have we tested?

Maybe a simple optimization here is to come up with something like a ConstantColumnVector, where every row has the same value and we only need to store one copy of it.
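
A rough sketch of the idea; this does not extend Spark's real ColumnVector API, it just illustrates the "store the value once" shape such a ConstantColumnVector could take:

import org.apache.spark.unsafe.types.UTF8String

// Sketch: every row reports the same value, so populating a batch with a constant
// (file path, file name, ...) no longer requires writing it once per row.
class ConstantStringVectorSketch(numRows: Int, value: UTF8String) {
  def getUTF8String(rowId: Int): UTF8String = value // same value for any rowId < numRows
}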

Contributor Author:
Thanks for reviewing! Makes sense; also Bart mentioned something about optimizing putByteArray for all rows: #34575 (comment)

val filePathBytes = path.toString.getBytes
val fileNameBytes = path.getName.getBytes
var rowId = 0
metadataColumns.map(_.name).map {

Contributor:
We already know how to fill the column vector for each metadata column, so the pattern matching can be done outside of execution; it does not need to be done per batch here.

Contributor:
Some small per-batch overhead should be fine.

Contributor:
Yeah, I agree it's not a huge issue since it's per batch, not per row. But I also think it's not hard to organize the code in the most efficient way.

Contributor Author:
Thanks for the comments, really appreciate it!

But we have to do something per batch, right? We cannot be sure of c.numRows (a small file or the last batch), and different file formats can have different configurable max rows per batch (Parquet, ORC).

Unless we had a ConstantColumnVector, as you mentioned, or something else that only needs c.numRows for each batch?
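
As a sketch of moving the per-column decision out of the batch loop (the column names and writer shape are illustrative, not the PR's code): resolve once per file how each requested metadata column is filled, so each batch only needs c.numRows():

import org.apache.spark.sql.execution.vectorized.WritableColumnVector

// Decide once (per file) how a given metadata column is written; the per-batch loop
// then just invokes the writer for rowId = 0 until numRows.
def writerFor(
    name: String,
    filePathBytes: Array[Byte],
    fileNameBytes: Array[Byte],
    fileSize: Long): (WritableColumnVector, Int) => Unit = name match {
  case "file_path" => (v, rowId) => v.putByteArray(rowId, filePathBytes)
  case "file_name" => (v, rowId) => v.putByteArray(rowId, fileNameBytes)
  case "file_size" => (v, rowId) => v.putLong(rowId, fileSize)
  case other => throw new IllegalArgumentException(s"unexpected metadata column: $other")
}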


val FILE_PATH = "file_path"

val FILE_NAME = "file_name"

Contributor:
Wondering, do we also plan to deprecate the existing InputFileName expression in Spark?

Contributor:
Good point. I think we should, as InputFileName is really fragile and can't be used with joins, for example.
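
For comparison, a hedged sketch of the two approaches (paths and column names are illustrative); per the comment above, the expression-based variant is fragile because its value depends on which file the running task happens to be reading, while the metadata column resolves like an ordinary hidden column of the file scan:

import org.apache.spark.sql.functions.{col, input_file_name}

// Expression-based (existing): value comes from task-local state during the scan.
val viaExpression = spark.read.parquet("file:/tmp/events")
  .select(col("id"), input_file_name().as("path"))

// Metadata-column-based (this PR): behaves like a regular column of the file scan,
// so it composes with later operators such as joins.
val viaMetadata = spark.read.parquet("file:/tmp/events")
  .select(col("id"), col("_metadata.file_path").as("path"))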

case MetadataAttribute(attr) => attr
}

// TODO (yaohua): should be able to prune the metadata struct only containing what needed

Contributor:
nit: Shall we file a JIRA?

cloud-fan pushed a commit that referenced this pull request Jan 18, 2022
### What changes were proposed in this pull request?
Follow-up PR of #34575. Support the metadata struct schema pruning for all file formats.

### Why are the changes needed?
Performance improvements.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs and a new UT.

Closes #35147 from Yaohua628/spark-37768.

Authored-by: yaohua <yaohua.zhao@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Jan 19, 2022
…present in the data filter

### What changes were proposed in this pull request?
Follow-up PR of #34575. Filtering files if metadata columns are present in the data filter.

### Why are the changes needed?
Performance improvements.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs and a new UT.

Closes #35055 from Yaohua628/spark-37769.

Authored-by: yaohua <yaohua.zhao@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>