
[SPARK-30428][SQL] File source V2: support partition pruning #27112

Closed

Conversation

gengliangwang
Member

@gengliangwang gengliangwang commented Jan 7, 2020

What changes were proposed in this pull request?

File source V2: support partition pruning.
Note: subquery predicates are still not pushed down for partition pruning after this PR, due to limitations in the current data source V2 API and framework. The rule PlanSubqueries requires the subquery expression to appear in the children or class parameters of a SparkPlan, and BatchScanExec does not satisfy that condition.

Why are the changes needed?

Partition pruning avoids listing and reading files from partitions that cannot match the query, which is important for read performance.

Does this PR introduce any user-facing change?

No

How was this patch tested?

New unit tests for all the V2 file sources
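To make the feature concrete: partition pruning means the scan lists files only under partition directories that can satisfy the partition filters, instead of all of them. The idea can be sketched outside Spark in plain Python (the Hive-style paths and helper names below are illustrative, not Spark's actual API):

```python
# Sketch of partition pruning over a Hive-style partitioned layout.
# Illustrative only: Spark's real implementation lives in FileIndex.listFiles.

def parse_partition(path):
    """Extract partition column values from a path like 'data/p=1/part-0.avro'."""
    values = {}
    for segment in path.split("/"):
        if "=" in segment:
            col, val = segment.split("=", 1)
            values[col] = val
    return values

def prune(files, partition_filter):
    """Keep only files whose partition values satisfy the filter."""
    return [f for f in files if partition_filter(parse_partition(f))]

files = [
    "data/p=0/part-0.avro",
    "data/p=0/part-1.avro",
    "data/p=1/part-0.avro",
]

# WHERE p = 1 touches only one of the three files.
selected = prune(files, lambda part: part.get("p") == "1")
```

Without pruning, all three files would be handed to the scan; with it, the two `p=0` files are never listed for reading.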

@gengliangwang
Member Author

I will add more test cases; marking this one as WIP for now.

@SparkQA

SparkQA commented Jan 7, 2020

Test build #116200 has finished for PR 27112 at commit 51a42a0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang gengliangwang changed the title [WIP][SPARK-30428][SQL] File source V2: support partition pruning [SPARK-30428][SQL] File source V2: support partition pruning Jan 7, 2020
@gengliangwang
Member Author

@cloud-fan

}
    options: CaseInsensitiveStringMap,
    partitionFilters: Seq[Expression] = Seq.empty) extends FileScan {
  override def isSplitable(path: Path): Boolean = true
Member Author

I found the indentation in AvroScan is wrong, so this change fixes it as well.

@SparkQA

SparkQA commented Jan 7, 2020

Test build #116254 has finished for PR 27112 at commit 65200b6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 8, 2020

Test build #116260 has finished for PR 27112 at commit 13e9535.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 8, 2020

Test build #116264 has finished for PR 27112 at commit 7652c35.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 8, 2020

Test build #116271 has finished for PR 27112 at commit 58a4a07.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (partitionKeyFilters.nonEmpty) {
  val prunedFileIndex = catalogFileIndex.filterPartitions(partitionKeyFilters.toSeq)
  val prunedFsRelation =
-   fsRelation.copy(location = prunedFileIndex)(sparkSession)
+   fsRelation.copy(location = prunedFileIndex)(fsRelation.sparkSession)
Contributor

@guykhazma guykhazma Jan 8, 2020

I suggest passing the dataFilters as well.
This is useful for FileIndex implementations that use the dataFilters to do the file listing.
For example, we use this to provide data skipping for all file-based data sources.
I suggest something like guykhazma@de3415b

Member Author

@guykhazma Thanks for the suggestion.
However, PartitioningAwareFileIndex doesn't use the data filters for listing files. Could you provide an example where the data filters would be useful here?
Also, the data filters are supposed to be pushed down in FileScanBuilder (e.g. ORC/Parquet).

Contributor

@guykhazma guykhazma Jan 9, 2020

@gengliangwang this is useful for enabling data skipping on all file formats, including formats that don't support pushdown (e.g. CSV, JSON), by replacing the FileIndex implementation with one that also uses the dataFilters to filter the file listing.

Contributor

This is the old v1 code path, let's not touch it in this PR.

Contributor

And for the v2 code path, the data filters are already pushed down in the rule V2ScanRelationPushDown.

override protected def sparkConf: SparkConf =
  super
    .sparkConf
    .set(SQLConf.USE_V1_SOURCE_LIST, "")

test("Avro source v2: support partition pruning") {
Contributor

Not related to this PR, but we should think about how to share test cases between the Avro suite and FileBasedDataSourceSuite.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 94fc0e3 Jan 9, 2020
@@ -71,7 +103,7 @@ abstract class FileScan(
   }

   protected def partitions: Seq[FilePartition] = {
-    val selectedPartitions = fileIndex.listFiles(Seq.empty, Seq.empty)
+    val selectedPartitions = fileIndex.listFiles(partitionFilters, Seq.empty)
Contributor

@guykhazma guykhazma Jan 9, 2020

@gengliangwang @cloud-fan continuing the discussion from above (the earlier comment was on the wrong line).
The V2ScanRelationPushDown rule will push the dataFilters down only to data sources that support pushdown by implementing the SupportsPushDownFilters trait.
Data sources such as CSV and JSON do not implement the SupportsPushDownFilters trait. In order to support data skipping uniformly for all file-based data sources, we override the listFiles method in a FileIndex implementation, which consults external metadata and prunes the list of files.
The suggestion is to make the necessary changes so that the dataFilters are passed to listFiles as well.
Otherwise, one would have to create a new datasource implementation for each file-based datasource that doesn't have a built-in pushdown mechanism.
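The data-skipping scheme being proposed here — a FileIndex whose listFiles consults external metadata to drop files that the dataFilters rule out — can be sketched outside Spark. Plain Python below; the per-file min/max statistics and the simple range predicate stand in for the external metadata store and the Catalyst Expression filters, and are purely illustrative:

```python
# Sketch of data skipping during file listing, assuming per-file min/max
# statistics are available from some external metadata store (hypothetical).

FILE_STATS = {
    "part-0.json": {"min": 0, "max": 99},
    "part-1.json": {"min": 100, "max": 199},
    "part-2.json": {"min": 200, "max": 299},
}

def list_files(data_filter_lo, data_filter_hi):
    """Return only the files whose [min, max] value range can overlap the
    requested range; the rest are skipped without ever being read."""
    return [
        name
        for name, stats in FILE_STATS.items()
        if stats["max"] >= data_filter_lo and stats["min"] <= data_filter_hi
    ]

# A query filtering on values in [150, 180] only needs part-1.json.
survivors = list_files(150, 180)
```

This is why passing the dataFilters to listFiles matters even for formats like CSV and JSON that have no built-in filter pushdown: the skipping happens at listing time, before any file is opened.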

Contributor

This makes sense to me. @gengliangwang what do you think?

At least you can disable v2 file source to bring back this feature.

Member Author

Yes, it makes sense if there is a FileIndex implementation that can use the dataFilters.
@guykhazma could you create a PR for this?

Contributor

@gengliangwang @cloud-fan sure, thanks.
I have opened this PR

@MaxGekk
Member

MaxGekk commented Jan 10, 2020

After rebasing on the changes, my PR #26973 started failing at

@gengliangwang Did I miss something while merging this? Any help is welcome.

op
}

case op @ PhysicalOperation(projects, filters,
Member

The CSV datasource in #26973 doesn't fall into this case, but Parquet/ORC do, so withPartitionFilters is not invoked for CSV. What goes wrong with CSV when filter pushdown is enabled?

gengliangwang pushed a commit that referenced this pull request Jan 21, 2020
### What changes were proposed in this pull request?
Follow up on [SPARK-30428](#27112) which added support for partition pruning in File source V2.
This PR implements the necessary changes in order to pass the `dataFilters` to the `listFiles`. This enables having `FileIndex` implementations which use the `dataFilters` for further pruning the file listing (see the discussion [here](#27112 (comment))).

### Why are the changes needed?
Datasources such as `csv` and `json` do not implement the `SupportsPushDownFilters` trait. In order to support data skipping uniformly for all file based data sources, one can override the `listFiles` method in a `FileIndex` implementation, which consults external metadata and prunes the list of files.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Modifying the unit tests for v2 file sources to verify the `dataFilters` are passed

Closes #27157 from guykhazma/PushdataFiltersInFileListing.

Authored-by: Guy Khazma <guykhag@gmail.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
cloud-fan pushed a commit that referenced this pull request Mar 23, 2021
### What changes were proposed in this pull request?

This bug was introduced by SPARK-30428 at Apache Spark 3.0.0.
This PR fixes `FileScan.equals()`.

### Why are the changes needed?
- Without this fix `FileScan.equals` doesn't take `fileIndex` and `readSchema` into account.
- The partition filters and data filters added to `FileScan` (in #27112 and #27157) meant that the canonicalized forms of some `BatchScanExec` nodes no longer matched, which prevented some reuse opportunities.

### Does this PR introduce _any_ user-facing change?
Yes. Before this fix, incorrect reuse of `FileScan`, and hence `BatchScanExec`, could have happened, causing correctness issues.

### How was this patch tested?
Added new UTs.

Closes #31848 from peter-toth/SPARK-34756-fix-filescan-equality-check.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
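The failure mode this commit fixes — an equals() that ignores fields affecting a scan's output, so a reuse mechanism conflates two different scans — can be reproduced in miniature. Illustrative plain Python, not Spark's actual classes:

```python
# Miniature reproduction of the equals() pitfall: if equality ignores a
# field that affects the scan's output, a reuse cache would conflate two
# different scans. Illustrative only.

class BuggyScan:
    def __init__(self, file_index, read_schema):
        self.file_index = file_index
        self.read_schema = read_schema

    def __eq__(self, other):
        # BUG: ignores file_index and read_schema entirely.
        return isinstance(other, BuggyScan)

    def __hash__(self):
        return 0

class FixedScan(BuggyScan):
    def __eq__(self, other):
        # Fix: equality covers every field that changes the scan's result.
        return (isinstance(other, FixedScan)
                and self.file_index == other.file_index
                and self.read_schema == other.read_schema)

    def __hash__(self):
        return hash((self.file_index, self.read_schema))

a = BuggyScan("index_a", ("col1",))
b = BuggyScan("index_b", ("col2",))
buggy_equal = (a == b)   # the buggy version wrongly treats them as the same scan

c = FixedScan("index_a", ("col1",))
d = FixedScan("index_b", ("col2",))
fixed_equal = (c == d)   # the fixed version tells them apart
```

With the buggy equality, a planner caching scans by equality could substitute one scan's output for the other's — the correctness issue described above.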
cloud-fan pushed a commit that referenced this pull request Mar 23, 2021
cloud-fan pushed a commit that referenced this pull request Mar 23, 2021
peter-toth added a commit to peter-toth/spark that referenced this pull request Mar 24, 2021
cloud-fan pushed a commit that referenced this pull request Mar 24, 2021
### What changes were proposed in this pull request?
This bug was introduced by SPARK-30428 at Apache Spark 3.0.0.
This PR fixes `FileScan.equals()`.

### Why are the changes needed?
- Without this fix `FileScan.equals` doesn't take `fileIndex` and `readSchema` into account.
- The partition filters and data filters added to `FileScan` (in #27112 and #27157) meant that the canonicalized forms of some `BatchScanExec` nodes no longer matched, which prevented some reuse opportunities.

### Does this PR introduce _any_ user-facing change?
Yes. Before this fix, incorrect reuse of `FileScan`, and hence `BatchScanExec`, could have happened, causing correctness issues.

### How was this patch tested?
Added new UTs.

Closes #31952 from peter-toth/SPARK-34756-fix-filescan-equality-check-3.0.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
fishcus pushed a commit to fishcus/spark that referenced this pull request Jan 12, 2022
leejaywei pushed a commit to Kyligence/spark that referenced this pull request Feb 3, 2023
leejaywei pushed a commit to Kyligence/spark that referenced this pull request May 15, 2023
leejaywei pushed a commit to Kyligence/spark that referenced this pull request May 16, 2023
leejaywei pushed a commit to Kyligence/spark that referenced this pull request Jun 16, 2023