
[SPARK-24638][SQL] StringStartsWith support push down #21623

Closed

wants to merge 6 commits into from

Conversation

@wangyum (Member) commented Jun 23, 2018

What changes were proposed in this pull request?

Add push down support for StringStartsWith filters. This gives about 50% savings in compute time in the benchmark below.

How was this patch tested?

Unit tests, manual tests, and a performance test:

cat <<EOF > SPARK-24638.scala
def benchmark(func: () => Unit): Long = {
  val start = System.currentTimeMillis()
  for(i <- 0 until 100) { func() }
  val end = System.currentTimeMillis()
  end - start
}
val path = "/tmp/spark/parquet/string/"
spark.range(10000000).selectExpr("concat(id, 'str', id) as id").coalesce(1).write.mode("overwrite").option("parquet.block.size", 1048576).parquet(path)
val df = spark.read.parquet(path)

spark.sql("set spark.sql.parquet.filterPushdown.string.startsWith=true")
val pushdownEnable = benchmark(() => df.where("id like '999998%'").count())

spark.sql("set spark.sql.parquet.filterPushdown.string.startsWith=false")
val pushdownDisable = benchmark(() => df.where("id like '999998%'").count())

val improvements = pushdownDisable - pushdownEnable
println(s"improvements: $improvements")
EOF

bin/spark-shell -i SPARK-24638.scala

result:

Loading SPARK-24638.scala...
benchmark: (func: () => Unit)Long
path: String = /tmp/spark/parquet/string/
df: org.apache.spark.sql.DataFrame = [id: string]                               
res1: org.apache.spark.sql.DataFrame = [key: string, value: string]
pushdownEnable: Long = 11608
res2: org.apache.spark.sql.DataFrame = [key: string, value: string]
pushdownDisable: Long = 31981
improvements: Long = 20373

@gatorsmile (Member) commented

cc @rdblue

@SparkQA commented Jun 23, 2018

Test build #92258 has finished for PR 21623 at commit 5b52ace.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros (Contributor) left a comment
Just a question regarding sources.StringStartsWith("_1", null): if you have a nullable string column and some of the values are null, will this operator (parameterized with null) match against them?

LGTM otherwise.

@@ -270,6 +272,29 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) {
case sources.Not(pred) =>
createFilter(schema, pred).map(FilterApi.not)

case sources.StringStartsWith(name, prefix) if canMakeFilterOn(name) =>

Member:

What do you think about adding a configuration to control this, set to true by default? It basically depends on a user-defined predicate that we wrote manually here.

@wangyum (Member Author) commented Jun 26, 2018

cc @gszadovszky @nandorKollar

assertResult(None) {
  parquetFilters.createFilter(
    df.schema,
    sources.StringStartsWith("_1", null))

Member Author:

Thanks @attilapiros. sources.StringStartsWith("_1", null) will not match them, same as before.

comparator.compare(min.slice(0, math.min(size, min.length)), strToBinary) > 0
}

override def inverseCanDrop(statistics: Statistics[Binary]): Boolean = false

Contributor:

Why can't this evaluate the inverse of StartsWith? If the min and max values exclude the prefix, then this should be able to filter.

Contributor:

No.

Let me illustrate with an example: assume min="BBB" and max="DDD". canDrop() means that if your prefix sorts before "BBB" (like "A"), we can stop, as there is no reason to search within this range. The same is true for prefixes after "DDD" (like "E").

Now suppose the operator is negated. What can you say when your prefix is "C" and the range is ["BBB", "DDD"]? Can you drop it? No. And if the prefix is "A" or "E"? Still no. You see, you would have to check the range either way.
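The canDrop() logic above can be sketched with plain strings (canDropSketch is a hypothetical helper for illustration; the patch itself compares Binary statistics with Parquet's unsigned lexicographic comparator):

// Simplified sketch of canDrop(): a row group is skippable when its
// [min, max] range cannot contain any value starting with the prefix.
// The real code truncates min/max to the prefix length before comparing.
def canDropSketch(min: String, max: String, prefix: String): Boolean = {
  val len = prefix.length
  // Drop if every value sorts before the prefix (truncated max < prefix)
  // or after it (truncated min > prefix).
  max.take(len) < prefix || min.take(len) > prefix
}

// min="BBB", max="DDD": prefixes "A" and "E" fall outside the range,
// so the row group can be dropped; prefix "C" cannot be ruled out.
assert(canDropSketch("BBB", "DDD", "A"))
assert(canDropSketch("BBB", "DDD", "E"))
assert(!canDropSketch("BBB", "DDD", "C"))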

Contributor:

Sorry, I meant that if min and max both include the prefix, then we should be able to drop the range. The situation is one where both min and max match, so all values must also match the filter. If we are looking for values that do not match the filter, then we can eliminate the row group.

The example is prefix=CCC with values between min=CCCa and max=CCCZ: all values start with CCC, so the entire row group can be skipped.
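In the same string-based sketch (again a hypothetical helper, not the patch code), the droppable case for the negated filter looks like this:

// If both min and max start with the prefix, every value in [min, max]
// must also start with it, so Not(StartsWith(prefix)) matches nothing
// in the row group and the whole group can be skipped.
def inverseCanDropSketch(min: String, max: String, prefix: String): Boolean =
  min.startsWith(prefix) && max.startsWith(prefix)

// prefix="CCC", min="CCCa", max="CCCZ": every value starts with "CCC".
assert(inverseCanDropSketch("CCCa", "CCCZ", "CCC"))
// prefix="C", range ["BBB", "DDD"]: values may or may not match, so keep it.
assert(!inverseCanDropSketch("BBB", "DDD", "C"))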

Contributor:

There is one rare case where you can drop it with the inverse: when both min and max start with the prefix. @wangyum, please correct me if I am wrong.

Contributor:

@rdblue oh, sorry, I had not seen your reply. Yes, in that case we can, and you are right that it is worth doing.

@SparkQA commented Jun 26, 2018

Test build #92342 has finished for PR 21623 at commit 02f41cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 27, 2018

Test build #92362 has finished for PR 21623 at commit 4f25a33.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -660,6 +661,56 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
assert(df.where("col > 0").count() === 2)
}
}

test("filter pushdown - StringStartsWith") {
withParquetDataFrame((1 to 4).map(i => Tuple1(i + "str" + i))) { implicit df =>

Contributor:

I think all of these tests go through the keep method rather than canDrop and inverseCanDrop, and those methods need to be tested as well. You could do that by constructing a Parquet file with row groups that have predictable statistics, but that would be difficult. An easier way is to define the predicate class elsewhere and create a unit test for it that passes in different statistics values.

Member Author:

Added testStringStartsWith to exercise exactly canDrop and inverseCanDrop.
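As a rough illustration of the statistics-driven test suggested above, synthetic min/max values can be fed straight to the drop logic (reusing the hypothetical canDropSketch/inverseCanDropSketch helpers from earlier, not the real Parquet Statistics API):

// Table-driven check of the drop logic against synthetic row-group stats.
val cases = Seq(
  // (min,   max,    prefix, expect canDrop, expect inverseCanDrop)
  ("BBB",  "DDD",  "A",   true,  false),
  ("BBB",  "DDD",  "C",   false, false),
  ("BBB",  "DDD",  "E",   true,  false),
  ("CCCa", "CCCZ", "CCC", false, true)
)
cases.foreach { case (min, max, prefix, drop, invDrop) =>
  assert(canDropSketch(min, max, prefix) == drop, s"canDrop($min, $max, $prefix)")
  assert(inverseCanDropSketch(min, max, prefix) == invDrop, s"inverseCanDrop($min, $max, $prefix)")
}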

private[parquet] class ParquetFilters(pushDownDate: Boolean) {
private[parquet] class ParquetFilters() {

val sqlConf: SQLConf = SQLConf.get

Contributor:

This should pass in pushDownDate and pushDownStartWith like the previous version did with just the date setting.

The SQLConf is already available in ParquetFileFormat and it would be better to pass it in. The problem is that this class is instantiated in the function ((file: PartitionedFile) => { ... }) that gets serialized and sent to executors. That means we don't want SQLConf and its references in the function's closure. The way we got around this before was to put boolean config vals in the closure instead. I think you should go with that approach.

I'm not sure what SQLConf.get is for or what a correct use would be. @gatorsmile, can you comment on use of SQLConf.get?
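The closure pattern described above can be sketched generically (FakeConf and buildReader are illustrative stand-ins, not the actual ParquetFileFormat code):

// Stand-in for SQLConf: a config object we do NOT want serialized.
class FakeConf { val pushDownDate = true; val pushDownStartWith = true }

def buildReader(conf: FakeConf): String => Unit = {
  // Copy the primitives out *before* constructing the function, so the
  // returned closure captures two booleans rather than the conf object.
  val pushDownDate = conf.pushDownDate
  val pushDownStartWith = conf.pushDownStartWith
  (file: String) =>
    println(s"reading $file: date=$pushDownDate, startsWith=$pushDownStartWith")
}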

Member Author:

You are right. I hit a bug here.

@rdblue (Contributor) commented Jun 27, 2018

Overall, I think this is close. The tests need to cover the row group stats case and we should update how configuration is passed to the filters. Thanks for working on this, @wangyum!

@@ -378,6 +378,14 @@ object SQLConf {
.booleanConf
.createWithDefault(true)

val PARQUET_FILTER_PUSHDOWN_STRING_STARTSWITH_ENABLED =
buildConf("spark.sql.parquet.filterPushdown.string.startsWith")

Contributor:

It would be better if we added an .enabled suffix.

@SparkQA commented Jun 29, 2018

Test build #92449 has finished for PR 21623 at commit e959d1a.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 29, 2018

Test build #92454 has finished for PR 21623 at commit 536610e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum (Member Author) commented Jun 29, 2018

Jenkins, retest this please.

@@ -378,6 +378,14 @@ object SQLConf {
.booleanConf
.createWithDefault(true)

val PARQUET_FILTER_PUSHDOWN_STRING_STARTSWITH_ENABLED =
buildConf("spark.sql.parquet.filterPushdown.string.startsWith")
.doc("If true, enables Parquet filter push-down optimization for string starts with. " +

Contributor:

for string startsWith function

}

override def keep(value: Binary): Boolean = {
  UTF8String.fromBytes(value.getBytes).startsWith(UTF8String.fromString(v))

Contributor:

UTF8String.fromString(v) -> UTF8String.fromBytes(strToBinary.getBytes)?


val df = spark.read.parquet(path).filter(filter)
df.foreachPartition((it: Iterator[Row]) => it.foreach(v => accu.add(0)))
df.collect

Contributor:

What does this collect do? foreachPartition is already an action.


test("filter pushdown - StringStartsWith") {
withParquetDataFrame((1 to 4).map(i => Tuple1(i + "str" + i))) { implicit df =>
// Test canDrop()

Contributor:

To confirm: do these test canDrop or keep?

Member Author:

Both methods are executed, but it can't be confirmed which method actually took effect.

@SparkQA commented Jun 29, 2018

Test build #92463 has finished for PR 21623 at commit 536610e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 29, 2018

Test build #92474 has finished for PR 21623 at commit 800fde7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum (Member Author) commented Jun 30, 2018

Benchmark result:

###########################[ Pushdown benchmark for StringStartsWith ]###########################
Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz

StringStartsWith filter: (value like '10%'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Parquet Vectorized                          10104 / 11125          1.6         642.4       1.0X
Parquet Vectorized (Pushdown)                 3002 / 3608          5.2         190.8       3.4X
Native ORC Vectorized                        9589 / 10454          1.6         609.7       1.1X
Native ORC Vectorized (Pushdown)             9798 / 10509          1.6         622.9       1.0X

StringStartsWith filter: (value like '1000%'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Parquet Vectorized                            8437 / 8563          1.9         536.4       1.0X
Parquet Vectorized (Pushdown)                  279 /  289         56.3          17.8      30.2X
Native ORC Vectorized                         7354 / 7568          2.1         467.5       1.1X
Native ORC Vectorized (Pushdown)              7730 / 7972          2.0         491.4       1.1X

StringStartsWith filter: (value like '786432%'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Parquet Vectorized                            8290 / 8510          1.9         527.0       1.0X
Parquet Vectorized (Pushdown)                  260 /  272         60.5          16.5      31.9X
Native ORC Vectorized                         7361 / 7395          2.1         468.0       1.1X
Native ORC Vectorized (Pushdown)              7694 / 7811          2.0         489.2       1.1X

@cloud-fan (Contributor) commented
thanks, merging to master!

@dongjoon-hyun is there something similar in ORC?

@asfgit closed this in 03545ce on Jun 30, 2018
@dongjoon-hyun (Member) commented
@cloud-fan AFAIK, ORC doesn't support custom filters yet. I'll follow up on that in ORC.
