[SPARK-32792][SQL] Improve Parquet In filter pushdown #29642
Conversation
Test build #128266 has finished for PR 29642 at commit
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
@wangyum Do you have any further comments? If not, shall we close this one?
Kubernetes integration test starting
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
Kubernetes integration test status success
Kubernetes integration test starting
Kubernetes integration test status success
Test build #130244 has finished for PR 29642 at commit
Test build #130245 has finished for PR 29642 at commit
Kubernetes integration test starting
Kubernetes integration test status success
Test build #131236 has finished for PR 29642 at commit
case Some(dataType) =>
  val sortedValues = values.sorted(TypeUtils.getInterpretedOrdering(dataType))
  createFilterHelper(
    sources.And(sources.GreaterThanOrEqual(name, sortedValues.head),
      sources.LessThanOrEqual(name, sortedValues.last)),
    canPartialPushDownConjuncts)
The logic is the same as HiveShim.scala#L746-L750.
spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala
Lines 746 to 750 in 09bb9be
case InSet(child, values) if useAdvanced && values.size > inSetThreshold =>
  val dataType = child.dataType
  val sortedValues = values.toSeq.sorted(TypeUtils.getInterpretedOrdering(dataType))
  convert(And(GreaterThanOrEqual(child, Literal(sortedValues.head, dataType)),
    LessThanOrEqual(child, Literal(sortedValues.last, dataType))))
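The rewrite in both snippets can be sketched without any Spark dependencies: sort the IN values with the type's ordering and keep only the two endpoints. A minimal sketch, where `Predicate` and the case classes are illustrative stand-ins for Spark's `sources.Filter` hierarchy, not Spark's actual API:

```scala
// Stand-ins for Spark's sources.Filter hierarchy (illustrative only).
sealed trait Predicate
case class GreaterThanOrEqual(attr: String, value: Int) extends Predicate
case class LessThanOrEqual(attr: String, value: Int) extends Predicate
case class And(left: Predicate, right: Predicate) extends Predicate

// Replace `attr IN (values)` with `attr >= min AND attr <= max`.
// Spark sorts with TypeUtils.getInterpretedOrdering(dataType); for this
// sketch a plain Int ordering is enough.
def rewriteInToRange(attr: String, values: Seq[Int]): Predicate = {
  val sorted = values.sorted
  And(GreaterThanOrEqual(attr, sorted.head), LessThanOrEqual(attr, sorted.last))
}
```

For large IN lists this turns N comparisons into two, which is what lets Parquet's row-group min/max statistics skip data effectively.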
@cloud-fan @dongjoon-hyun @HyukjinKwon It can be improved by 6.6X for InSet -> InFilters (values count: 100, distribution: 10):
Parquet Vectorized (Pushdown)    9520    9560    27    1.7    605.3    1.0X
Parquet Vectorized (Pushdown)     873     885    11   18.0     55.5    6.6X
ah, then can we turn it into a util method and use it in all the filter pushdown places?
ok, added a new function to TypeUtils.
Kubernetes integration test starting
Kubernetes integration test status success
Test build #131289 has finished for PR 29642 at commit
shall we implement the logic in
It seems only Parquet is not well supported. This is the benchmark of CSV:
val rowsNum = 100 * 1000
val numIters = 3
val colsNum = 100
val fields = Seq.tabulate(colsNum)(i => StructField(s"col$i", TimestampType))
val schema = StructType(StructField("key", IntegerType) +: fields)
def columns(): Seq[Column] = {
val ts = Seq.tabulate(colsNum) { i =>
lit(Instant.ofEpochSecond(i * 12345678)).as(s"col$i")
}
($"id" % 1000).as("key") +: ts
}
withTempPath { path =>
spark.range(rowsNum).select(columns(): _*)
.write.option("header", true)
.csv(path.getAbsolutePath)
def readback = {
spark.read
.option("header", true)
.schema(schema)
.csv(path.getAbsolutePath)
}
def withFilter(filter: String, configEnabled: Boolean): Unit = {
withSQLConf(SQLConf.CSV_FILTER_PUSHDOWN_ENABLED.key -> configEnabled.toString()) {
readback.filter(filter).noop()
}
}
Seq(5, 10, 50, 100, 500).foreach { count =>
Seq(10, 50).foreach { distribution =>
val title = s"InSet -> InFilters (values count: $count, distribution: $distribution)"
val benchmark = new Benchmark(title, rowsNum, output = output)
Seq(false, true).foreach { pushDownEnabled =>
val name = s"Native CSV Vectorized ${if (pushDownEnabled) s"(Pushdown)" else ""}"
benchmark.addCase(name, numIters) { _ =>
val filter =
Range(0, count).map(_ => scala.util.Random.nextInt(rowsNum * distribution / 100))
val whereExpr = s"key in(${filter.mkString(",")})"
withFilter(whereExpr, configEnabled = pushDownEnabled)
}
}
benchmark.run()
}
}
}
Result:
makeEq.lift(nameToParquetField(name).fieldType)
  .map(_(nameToParquetField(name).fieldNames, v))
}.reduceLeftOption(FilterApi.or)
case sources.In(name, values) if pushDownInFilterThreshold > 0 &&
@wangyum, the Impala reference sounds good. Can we make it general and push the range filter to other data sources as well?
If this is supposed to be beneficial in other sources too, I think it makes more sense to push it to the other sources as well anyway.
It seems only Parquet does not support In predicate pushdown well.
Parquet vs ORC:
spark/sql/core/benchmarks/FilterPushdownBenchmark-results.txt
Lines 439 to 482 in f5118f8
OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
InSet -> InFilters (values count: 50, distribution: 50): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized 9281 9298 12 1.7 590.1 1.0X
Parquet Vectorized (Pushdown) 9546 9561 17 1.6 606.9 1.0X
Native ORC Vectorized 6877 6897 18 2.3 437.2 1.3X
Native ORC Vectorized (Pushdown) 661 668 15 23.8 42.0 14.0X
OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
InSet -> InFilters (values count: 50, distribution: 90): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized 9322 9335 22 1.7 592.7 1.0X
Parquet Vectorized (Pushdown) 9551 9573 18 1.6 607.2 1.0X
Native ORC Vectorized 6902 6915 13 2.3 438.8 1.4X
Native ORC Vectorized (Pushdown) 659 680 25 23.9 41.9 14.1X
OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
InSet -> InFilters (values count: 100, distribution: 10): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized 9278 9294 18 1.7 589.9 1.0X
Parquet Vectorized (Pushdown) 9520 9560 27 1.7 605.3 1.0X
Native ORC Vectorized 6855 6870 16 2.3 435.9 1.4X
Native ORC Vectorized (Pushdown) 795 808 16 19.8 50.5 11.7X
OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
InSet -> InFilters (values count: 100, distribution: 50): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized 9306 9311 4 1.7 591.6 1.0X
Parquet Vectorized (Pushdown) 9529 9551 16 1.7 605.8 1.0X
Native ORC Vectorized 6875 6882 7 2.3 437.1 1.4X
Native ORC Vectorized (Pushdown) 853 865 15 18.4 54.2 10.9X
OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
InSet -> InFilters (values count: 100, distribution: 90): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized 9256 9271 9 1.7 588.5 1.0X
Parquet Vectorized (Pushdown) 9500 9520 13 1.7 604.0 1.0X
Native ORC Vectorized 6843 6857 9 2.3 435.1 1.4X
Native ORC Vectorized (Pushdown) 858 870 14 18.3 54.6 10.8X
CSV:
#29642 (comment)
@@ -704,8 +704,8 @@ object SQLConf {
  val PARQUET_FILTER_PUSHDOWN_INFILTERTHRESHOLD =
    buildConf("spark.sql.parquet.pushdown.inFilterThreshold")
      .doc("The maximum number of values to filter push-down optimization for IN predicate. " +
        "Large threshold won't necessarily provide much better performance. " +
        "The experiment argued that 300 is the limit threshold. " +
        "Spark will push-down a value greater than or equal to its minimum value and " +
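The threshold's role can be illustrated with a small, Spark-free sketch (the function and strategy names are hypothetical; this only mirrors the decision described in the doc string above, not Spark's actual implementation):

```scala
// Illustrative gating logic: small IN lists are pushed as OR'ed equality
// filters; lists larger than the threshold fall back to a min/max range.
def pushdownStrategy(numValues: Int, threshold: Int): String =
  if (numValues == 0) "none"
  else if (numValues <= threshold) "or-of-eq"   // one Eq filter per value, combined with OR
  else "min-max-range"                          // only >= min and <= max are pushed
```

A larger threshold means more values are pushed as individual equality filters before the range fallback kicks in, which is why raising it does not necessarily improve performance.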
I think the default value 10 is small here. What is the default threshold in Impala?
Impala only optimizes it to >= minimum value and <= maximum value: apache/impala@aa05c64
Parquet Vectorized 10287 10449 144 1.5 654.0 1.0X
Parquet Vectorized (Pushdown) 467 494 20 33.7 29.7 22.0X
Native ORC Vectorized 6781 6848 58 2.3 431.1 1.5X
Native ORC Vectorized (Pushdown) 428 440 10 36.8 27.2 24.1X
ditto. 17 vs 17 -> 22 vs 24.
No. GitHub Actions runs on different machines; there is a performance difference between them.
According to the benchmark result, I'm a little confused about whether this PR is an improvement or not. Could you add some explanation about the improvement part, @wangyum? Maybe is it affected by the master branch instead of this PR?
Also, cc @huaxingao since this is Parquet filter pushdown.
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #138307 has finished for PR 29642 at commit
@dongjoon-hyun This PR only improves the
@dongjoon-hyun Do you have more comments?
No, @wangyum. I mean the ratio between ORC and Parquet on the same machine run. Previously, ORC and Parquet showed similar performance, but now Parquet looks slower than ORC after this PR, which increases the gap. For example, the following.
Although the values are too small, this generated result shows a slowdown of Parquet compared with ORC. That was my question.
@dongjoon-hyun I think this performance issue is not caused by this change. This PR only changes the
Yea, these benchmark results are not updated in time. Let's post the benchmark results before and after this PR in the PR description.
@dongjoon-hyun @cloud-fan Please see the latest benchmark result: 27a2bf6
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #138557 has finished for PR 29642 at commit
Thank you for updating, @wangyum. At the last commit, yes, I agree that it looks like there is no regression by this PR. One last question: could you spot what the improvement from this PR is in the last commit? It's not clear to me. Do we need to add some specific additional benchmark case for your contribution?
Kubernetes integration test unable to build dist. exiting with code: 1
Test build #138582 has finished for PR 29642 at commit
@dongjoon-hyun I think the current benchmark is enough. I have updated the benchmark in the PR description.
+1, LGTM. Thank you so much, @wangyum and all!
Merged to master.
cc @aokolnychyi
uhoh .. seems like there's a logical conflict with #31776:
@wangyum are you online? can you take a quick look and fix or revert?
Oops.
### What changes were proposed in this pull request?
This fixes the compilation error due to the logical conflicts between #31776 and #29642.
### Why are the changes needed?
To recover compilation.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Closes #32568 from wangyum/HOT-FIX.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?
Support pushing down GreaterThanOrEqual the minimum value and LessThanOrEqual the maximum value for Parquet when sources.In's values exceed spark.sql.optimizer.inSetRewriteMinMaxThreshold. For example, we will push down id >= 1 and id <= 15.
Impala also has this improvement: https://issues.apache.org/jira/browse/IMPALA-3654
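As a concrete illustration of the example above (using a hypothetical helper, not code from this PR), the IN list is reduced to its endpoints when rendered as the pushed predicate:

```scala
// Render an IN list as the equivalent range predicate that gets pushed down.
// `inToRangeSql` is an illustrative helper, not part of Spark.
def inToRangeSql(column: String, values: Seq[Int]): String =
  s"$column >= ${values.min} and $column <= ${values.max}"
```

For an IN list containing 1, 5, and 15, this yields `id >= 1 and id <= 15`, matching the example in the description.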
Why are the changes needed?
Improve query performance.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Unit test, manual test and benchmark test.
Before this PR:
After this PR: