
[SPARK-24638][SQL] StringStartsWith support push down #21623

Closed

wants to merge 6 commits into from

Conversation

@wangyum (Member) commented Jun 23, 2018

What changes were proposed in this pull request?

Add push down support for StringStartsWith filters. This gives about 50% savings in compute time in the benchmark below.

How was this patch tested?

Unit tests, manual tests, and a performance test:

cat <<EOF > SPARK-24638.scala
def benchmark(func: () => Unit): Long = {
  val start = System.currentTimeMillis()
  for(i <- 0 until 100) { func() }
  val end = System.currentTimeMillis()
  end - start
}
val path = "/tmp/spark/parquet/string/"
spark.range(10000000).selectExpr("concat(id, 'str', id) as id").coalesce(1).write.mode("overwrite").option("parquet.block.size", 1048576).parquet(path)
val df = spark.read.parquet(path)

spark.sql("set spark.sql.parquet.filterPushdown.string.startsWith=true")
val pushdownEnable = benchmark(() => df.where("id like '999998%'").count())

spark.sql("set spark.sql.parquet.filterPushdown.string.startsWith=false")
val pushdownDisable = benchmark(() => df.where("id like '999998%'").count())

val improvements = pushdownDisable - pushdownEnable
println(s"improvements: $improvements")
EOF

bin/spark-shell -i SPARK-24638.scala

result:

Loading SPARK-24638.scala...
benchmark: (func: () => Unit)Long
path: String = /tmp/spark/parquet/string/
df: org.apache.spark.sql.DataFrame = [id: string]                               
res1: org.apache.spark.sql.DataFrame = [key: string, value: string]
pushdownEnable: Long = 11608
res2: org.apache.spark.sql.DataFrame = [key: string, value: string]
pushdownDisable: Long = 31981
improvements: Long = 20373

@gatorsmile (Member) commented

cc @rdblue

@SparkQA commented Jun 23, 2018

Test build #92258 has finished for PR 21623 at commit 5b52ace.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros (Contributor) left a comment
Just a question regarding sources.StringStartsWith("_1", null): if you have a nullable string column and some of the values are null, will this operator (parameterized with null) match against them?

LGTM otherwise.

@@ -270,6 +272,29 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) {
case sources.Not(pred) =>
createFilter(schema, pred).map(FilterApi.not)

case sources.StringStartsWith(name, prefix) if canMakeFilterOn(name) =>

Member:

What do you think about adding a configuration to control this, set to true by default? It basically depends on a user-defined predicate that we wrote manually here.

@wangyum (Member Author) commented Jun 26, 2018

cc @gszadovszky @nandorKollar

assertResult(None) {
  parquetFilters.createFilter(
    df.schema,
    sources.StringStartsWith("_1", null))

Member Author:

Thanks @attilapiros. sources.StringStartsWith("_1", null) will not match them, same as before.

comparator.compare(min.slice(0, math.min(size, min.length)), strToBinary) > 0
}

override def inverseCanDrop(statistics: Statistics[Binary]): Boolean = false

Contributor:

Why can't this evaluate the inverse of StartsWith? If the min and max values exclude the prefix, then this should be able to filter.

Contributor:

No.

Let me illustrate with an example: assume min="BBB" and max="DDD". canDrop() means that if your prefix sorts before "BBB" (like "A"), we can stop, as there is no reason to search within this range. The same is true for prefixes after "DDD" (like "E").

Now suppose the operator is negated. What can you say when your prefix is "C" and the range is ["BBB", "DDD"]? Can you drop it? No. And if the prefix is "A" or "E"? Still no. You see, you would have to check the range either way.
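The canDrop() logic above can be sketched with plain strings (canDropSketch is a hypothetical helper for illustration; the patch itself compares Binary statistics with Parquet's unsigned lexicographic comparator):

// Simplified sketch of canDrop(): a row group is skippable when its
// [min, max] range cannot contain any value starting with the prefix.
// The real code truncates min/max to the prefix length before comparing.
def canDropSketch(min: String, max: String, prefix: String): Boolean = {
  val len = prefix.length
  // Drop if every value sorts before the prefix (truncated max < prefix)
  // or after it (truncated min > prefix).
  max.take(len) < prefix || min.take(len) > prefix
}

// min="BBB", max="DDD": prefixes "A" and "E" fall outside the range,
// so the row group can be dropped; prefix "C" cannot be ruled out.
assert(canDropSketch("BBB", "DDD", "A"))
assert(canDropSketch("BBB", "DDD", "E"))
assert(!canDropSketch("BBB", "DDD", "C"))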

Contributor:

Sorry, I meant that if min and max both include the prefix, then we should be able to drop the range. The situation is one where both min and max match, so all values must also match the filter. If we are looking for values that do not match the filter, then we can eliminate the row group.

The example is prefix=CCC with values between min=CCCa and max=CCCZ: all values start with CCC, so the entire row group can be skipped.
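In the same string-based sketch (again a hypothetical helper, not the patch code), the droppable case for the negated filter looks like this:

// If both min and max start with the prefix, every value in [min, max]
// must also start with it, so Not(StartsWith(prefix)) matches nothing
// in the row group and the whole group can be skipped.
def inverseCanDropSketch(min: String, max: String, prefix: String): Boolean =
  min.startsWith(prefix) && max.startsWith(prefix)

// prefix="CCC", min="CCCa", max="CCCZ": every value starts with "CCC".
assert(inverseCanDropSketch("CCCa", "CCCZ", "CCC"))
// prefix="C", range ["BBB", "DDD"]: values may or may not match, so keep it.
assert(!inverseCanDropSketch("BBB", "DDD", "C"))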

Contributor:

There is one rare case where you can drop it with the inverse: when both min and max start with the prefix. @wangyum, please correct me if I am wrong.

Contributor:

@rdblue oh, sorry, I had not seen your reply. Yes, in that case we can, and you are right that it is worth doing.

@SparkQA commented Jun 26, 2018

Test build #92342 has finished for PR 21623 at commit 02f41cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 27, 2018

Test build #92362 has finished for PR 21623 at commit 4f25a33.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -660,6 +661,56 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
assert(df.where("col > 0").count() === 2)
}
}

test("filter pushdown - StringStartsWith") {
withParquetDataFrame((1 to 4).map(i => Tuple1(i + "str" + i))) { implicit df =>

Contributor:

I think all of these tests go through the keep method rather than canDrop and inverseCanDrop, and those methods need to be tested as well. You could do that by constructing a Parquet file with row groups that have predictable statistics, but that would be difficult. An easier way is to define the predicate class elsewhere and create a unit test for it that passes in different statistics values.

Member Author:

Added testStringStartsWith to exercise exactly canDrop and inverseCanDrop.
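As a rough illustration of the statistics-driven test suggested above, synthetic min/max values can be fed straight to the drop logic (reusing the hypothetical canDropSketch/inverseCanDropSketch helpers from earlier, not the real Parquet Statistics API):

// Table-driven check of the drop logic against synthetic row-group stats.
val cases = Seq(
  // (min,   max,    prefix, expect canDrop, expect inverseCanDrop)
  ("BBB",  "DDD",  "A",   true,  false),
  ("BBB",  "DDD",  "C",   false, false),
  ("BBB",  "DDD",  "E",   true,  false),
  ("CCCa", "CCCZ", "CCC", false, true)
)
cases.foreach { case (min, max, prefix, drop, invDrop) =>
  assert(canDropSketch(min, max, prefix) == drop, s"canDrop($min, $max, $prefix)")
  assert(inverseCanDropSketch(min, max, prefix) == invDrop, s"inverseCanDrop($min, $max, $prefix)")
}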

private[parquet] class ParquetFilters(pushDownDate: Boolean) {
private[parquet] class ParquetFilters() {

val sqlConf: SQLConf = SQLConf.get

Contributor:

This should pass in pushDownDate and pushDownStartWith like the previous version did with just the date setting.

The SQLConf is already available in ParquetFileFormat and it would be better to pass it in. The problem is that this class is instantiated in the function ((file: PartitionedFile) => { ... }) that gets serialized and sent to executors. That means we don't want SQLConf and its references in the function's closure. The way we got around this before was to put boolean config vals in the closure instead. I think you should go with that approach.

I'm not sure what SQLConf.get is for or what a correct use would be. @gatorsmile, can you comment on use of SQLConf.get?
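The closure pattern described above can be sketched generically (FakeConf and buildReader are illustrative stand-ins, not the actual ParquetFileFormat code):

// Stand-in for SQLConf: a config object we do NOT want serialized.
class FakeConf { val pushDownDate = true; val pushDownStartWith = true }

def buildReader(conf: FakeConf): String => Unit = {
  // Copy the primitives out *before* constructing the function, so the
  // returned closure captures two booleans rather than the conf object.
  val pushDownDate = conf.pushDownDate
  val pushDownStartWith = conf.pushDownStartWith
  (file: String) =>
    println(s"reading $file: date=$pushDownDate, startsWith=$pushDownStartWith")
}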

Member Author:

You are right. I hit a bug here.

@rdblue (Contributor) commented Jun 27, 2018

Overall, I think this is close. The tests need to cover the row group stats case and we should update how configuration is passed to the filters. Thanks for working on this, @wangyum!

@@ -378,6 +378,14 @@ object SQLConf {
.booleanConf
.createWithDefault(true)

val PARQUET_FILTER_PUSHDOWN_STRING_STARTSWITH_ENABLED =
buildConf("spark.sql.parquet.filterPushdown.string.startsWith")

Contributor:

It would be better if we added an .enabled suffix.

@SparkQA commented Jun 29, 2018

Test build #92449 has finished for PR 21623 at commit e959d1a.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 29, 2018

Test build #92454 has finished for PR 21623 at commit 536610e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum (Member Author) commented Jun 29, 2018

Jenkins, retest this please.

@@ -378,6 +378,14 @@ object SQLConf {
.booleanConf
.createWithDefault(true)

val PARQUET_FILTER_PUSHDOWN_STRING_STARTSWITH_ENABLED =
buildConf("spark.sql.parquet.filterPushdown.string.startsWith")
.doc("If true, enables Parquet filter push-down optimization for string starts with. " +

Contributor:

for string startsWith function

}

override def keep(value: Binary): Boolean = {
  UTF8String.fromBytes(value.getBytes).startsWith(UTF8String.fromString(v))

Contributor:

UTF8String.fromString(v) -> UTF8String.fromBytes(strToBinary.getBytes)?


val df = spark.read.parquet(path).filter(filter)
df.foreachPartition((it: Iterator[Row]) => it.foreach(v => accu.add(0)))
df.collect

Contributor:

What does this collect do? foreachPartition is already an action.


test("filter pushdown - StringStartsWith") {
withParquetDataFrame((1 to 4).map(i => Tuple1(i + "str" + i))) { implicit df =>
// Test canDrop()

Contributor:

To confirm: do these test canDrop or keep?

Member Author:

Both methods are executed, but it can't be confirmed which method actually took effect.

@SparkQA commented Jun 29, 2018

Test build #92463 has finished for PR 21623 at commit 536610e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 29, 2018

Test build #92474 has finished for PR 21623 at commit 800fde7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum (Member Author) commented Jun 30, 2018

Benchmark result:

###########################[ Pushdown benchmark for StringStartsWith ]###########################
Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz

StringStartsWith filter: (value like '10%'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Parquet Vectorized                          10104 / 11125          1.6         642.4       1.0X
Parquet Vectorized (Pushdown)                 3002 / 3608          5.2         190.8       3.4X
Native ORC Vectorized                        9589 / 10454          1.6         609.7       1.1X
Native ORC Vectorized (Pushdown)             9798 / 10509          1.6         622.9       1.0X

StringStartsWith filter: (value like '1000%'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Parquet Vectorized                            8437 / 8563          1.9         536.4       1.0X
Parquet Vectorized (Pushdown)                  279 /  289         56.3          17.8      30.2X
Native ORC Vectorized                         7354 / 7568          2.1         467.5       1.1X
Native ORC Vectorized (Pushdown)              7730 / 7972          2.0         491.4       1.1X

StringStartsWith filter: (value like '786432%'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Parquet Vectorized                            8290 / 8510          1.9         527.0       1.0X
Parquet Vectorized (Pushdown)                  260 /  272         60.5          16.5      31.9X
Native ORC Vectorized                         7361 / 7395          2.1         468.0       1.1X
Native ORC Vectorized (Pushdown)              7694 / 7811          2.0         489.2       1.1X

@cloud-fan (Contributor) commented
thanks, merging to master!

@dongjoon-hyun is there something similar in ORC?

@asfgit closed this in 03545ce on Jun 30, 2018
@dongjoon-hyun (Member) commented
@cloud-fan AFAIK, ORC doesn't support custom filters yet. I'll follow up on that in ORC.
