[SPARK-26205][SQL] Optimize InSet Expression for bytes, shorts, ints, dates #23171
Conversation
}

private def isSwitchCompatible: Boolean = list.forall {
  case Literal(_, dt) => dt == ByteType || dt == ShortType || dt == IntegerType
case Literal(_, dt) if dt == ByteType || dt == ShortType || dt == IntegerType => true
is easier to read?
Can be simplified to?
private def isSwitchCompatible: Boolean = {
inSetConvertible && (value.dataType == ByteType || value.dataType == ShortType || value.dataType == IntegerType)
}
@gatorsmile @cloud-fan @dongjoon-hyun @viirya It would be great to have your feedback.
val (nullLiterals, nonNullLiterals) = list.partition {
  case Literal(null, _) => true
  case _ => false
}
If there is null in the list, there will be only one. As a result, we may not need nullLiterals:

val containNullInList = ...
val nonNullLiterals = ...
maybe we can follow InSet, define a hasNull ahead, and filter out null values from the list before processing.
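The suggested shape (a hasNull flag plus a null-filtered list) could look roughly like the sketch below; the types are simplified stand-ins, not Spark's actual Literal/expression classes:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NullPartition {
    // Track null presence with a single flag instead of keeping a
    // separate list of null literals (there can be at most one).
    static boolean hasNull(List<Integer> list) {
        return list.contains(null);
    }

    // Filter out null values from the list before processing.
    static List<Integer> nonNullLiterals(List<Integer> list) {
        List<Integer> result = new ArrayList<>();
        for (Integer v : list) {
            if (v != null) result.add(v);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> list = Arrays.asList(1, null, 3);
        System.out.println(hasNull(list));          // true
        System.out.println(nonNullLiterals(list));  // [1, 3]
    }
}
```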
The approach looks great and can significantly improve the performance. For Long, I agree that we should also implement a binary search approach. Wondering which one will be faster, binary search using arrays or rewriting the ...
Test build #99393 has finished for PR 23171 at commit ...
Also cc @ueshin
I'm wondering if this is still useful after we fix the boxing issue in ...
val listGen = nonNullLiterals.map(_.genCode(ctx))
val valueGen = value.genCode(ctx)

val caseBranches = listGen.map(literal =>
style:
listGen.map { literal =>
...
}
@cloud-fan, yeah, let’s see if this PR is useful. The original idea wasn’t to avoid fixing autoboxing in ... Once we solve autoboxing issues in ...
my (maybe stupid?) question is: once we do such a change, does it still make sense to convert In to InSet? Most likely now In is even more efficient. Shall we change the optimizer in order to reflect this? Maybe we can do this in a followup.
|${CodeGenerator.JAVA_BOOLEAN} ${ev.value} = false;
|if (!${valueGen.isNull}) {
|  switch (${valueGen.value}) {
|    ${caseBranches.mkString("")}
we should consider that if the number of items is very big, this can cause a compile exception due to the method size limit. So we should use the proper splitting methods for the generated code.
@aokolnychyi Could you please address @mgaido91's comment? The current code will throw an exception for a huge sequence of In.
Could you please add test cases that could cause more than 64KB Java bytecode size in one switch statement?
@cloud-fan as @aokolnychyi said, @mgaido91 ...
@dbtsai I see. It would be great, though, to check what this threshold is. My understanding is that the current solution has better performance even for several hundreds of items. If this number is in the thousands, and since it depends on the data type (so it is hard for users to control with a single config), it is arguable which solution is best: I don't think it is very common to have thousands of elements, while for lower (more common) numbers we would use the less efficient solution.
@dbtsai @mgaido91 I think we can come back to this question once SPARK-26203 is resolved. That JIRA will give us enough information about each data type.
To sum up, I would set the goal of this PR to make ... This approach sets a pretty high bar even for huge value lists, so it would be a nice basis to benchmark our solution for ...
yes @aokolnychyi, I agree that the work can be done later (not in the scope of this PR). We can maybe just open a new JIRA about it so we won't forget.
I'm not a big fan of making the physical implementation of an expression very different depending on the situation. It complicates the code base and makes things more difficult to reason about. Why can't we just make InSet efficient and convert these cases to that?
@rxin I proposed the same thing before, but one problem is that we only convert In to InSet when the length of the list reaches the threshold. If the switch way is faster than hash set when the list is small, it seems still worth optimizing In using switch.
That probably means we should just optimize InSet to have the switch version though? Rather than do it in In?
I think InSet is not an optimized version of In, but just a way to separate the implementation for different conditions (the length of the list). Maybe we should do the same thing here: create an InSwitch and convert In to it when meeting some conditions. One problem is, In and InSwitch are the same in the interpreted version, so maybe we should create a base class for them.
I thought InSwitch logically is the same as InSet, in which all the child expressions are literals?
How about, we create an ...
@rxin @cloud-fan do you suggest to create an OptimizeIn which has switch and hash set implementations based on the length of the elements and remove InSet? Basically, what we were thinking above.
Basically, logically there are only two expressions: In, which handles arbitrary expressions, and InSet, which handles expressions with literals. Both could work: (1) we provide two separate expressions for InSet, one using switch and one using hashset, or (2) we just provide one InSet and internally in InSet have two implementations ...
The downside of creating different expressions for the same logical expression is that potentially the downstream optimization rules would need to match more.
As @rxin said, if we introduce a separate expression for the switch-based approach, then we will need to modify other places. For example, I think we can move the switch-based logic to ...
@dbtsai @cloud-fan @mgaido91 @rxin @dongjoon-hyun @viirya @gatorsmile PR #23291 contains benchmarks for different data types. @rxin was your latest suggestion to convert ...
thanks @aokolnychyi, could you please post here the result of that benchmark after applying this patch? Just a quick question: can't we support timestamp too in the switch approach?
@mgaido91 It won't be possible to apply the switch-based approach on timestamps as they are represented as longs. We can try dates as they are represented as ints. Below is the result of that benchmark with this patch:
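As a side note on why dates fit the switch approach while timestamps do not: Spark stores DateType as an int (days since 1970-01-01) and TimestampType as a long (microseconds since the epoch), and a Java switch cannot dispatch on a long operand. A small illustrative sketch, not Spark code:

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

public class DateAsInt {
    // A date encodes losslessly as an int (days since the epoch), so it can
    // appear as a `case` label in generated Java switch code.
    static int daysSinceEpoch(LocalDate date) {
        return (int) ChronoUnit.DAYS.between(LocalDate.of(1970, 1, 1), date);
    }

    public static void main(String[] args) {
        int days = daysSinceEpoch(LocalDate.of(1970, 1, 11));
        switch (days) {  // legal: int operand; a long here would not compile
            case 10:
                System.out.println("matched day " + days);  // prints "matched day 10"
                break;
            default:
                System.out.println("no match");
        }
    }
}
```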
thanks @aokolnychyi. I just have a couple of comments on this:
""".stripMargin)
}

private def isSwitchCompatible: Boolean = list.forall {
Could you please take care of the following limitation of Java switch statement, too?
npairs pairs of signed 32-bit values
https://docs.oracle.com/javase/specs/jvms/se8/html/jvms-6.html#jvms-6.5.lookupswitch
https://docs.oracle.com/javase/specs/jvms/se8/html/jvms-6.html#jvms-6.5.tableswitch
Could you please elaborate a bit on this? I am not sure I got it. Shouldn't we be fine if we limit this approach to bytes/shorts/ints?
Sorry for missing some words. My comment is that isSwitchCompatible can be true only if list.size is less than or equal to INT.MAX. Otherwise, Janino will cause a failure.
withSQLConf(SQLConf.OPTIMIZER_INSET_SWITCH_THRESHOLD.key -> "20") {
  checkAllTypes()
}
}
Could you please add a test case where spark.sql.optimizer.inSetSwitchThreshold has the maximum value and this optimization calls genCodeWithSwitch()?
Do you mean testing that if the set size is 100 and spark.sql.optimizer.inSetSwitchThreshold is 100, then genCodeWithSwitch is still applied?
My question addressed what you are talking about here. The current implementation can accept a large int value (e.g., Integer.MAX_VALUE) for spark.sql.optimizer.inSetSwitchThreshold. Thus, I am afraid the switch code may require more than 64KB of Java bytecode.
If the option has an appropriate upper limit, it is fine.
I'm +1 for this approach. Thank you for updating, @aokolnychyi.
.internal()
.doc("Configures the max set size in InSet for which Spark will generate code with " +
  "switch statements. This is applicable only to bytes, shorts, ints, dates.")
.intConf
To prevent user configuration errors, can we have a meaningful min/max check?
.checkValue(v => v > 0 && v < ???, ...)
@kiszk @mgaido91 we had a discussion about generating code bigger than 64KB. I am wondering if we still want to split the switch-based logic into multiple methods if we have the check suggested by @dongjoon-hyun. I've implemented the split logic locally. However, the code looks more complicated and we would need some extensions to splitExpressionsWithCurrentInputs.
I am not sure why you'd need any extension. We have other parts of the code with switch which are split. I think in general it is safer to have it.
@mgaido91 could you point me to an example?
ah, you're right, sorry, I was remembering wrongly. There were switch-based expressions, and for splitting them we migrated them to a do-while approach. Since the whole point of this PR is to introduce the switch construct, I agree with you that the best way is to add a constraint here in order to keep the number small enough not to cause issues with code generation.
What about the default and max values then? The switch logic was faster than HashSet on 500 elements for every data type and on every machine I tested. In some cases, HashSet started to outperform on 550+. Also, I had to generate a set of 6000+ elements to hit the limit of 64KB. My proposal is to have 400 as default and 600 as max. Then we should be safe.
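The proposed bounds can be sketched as a plain validation predicate; the 400/600 values come from the discussion above, and the names are illustrative rather than Spark's actual config API:

```java
public class SwitchThreshold {
    // Illustrative bounds from the discussion: default 400, max 600,
    // and 0 allowed so the optimization can be disabled entirely.
    static final int DEFAULT_THRESHOLD = 400;
    static final int MAX_THRESHOLD = 600;

    static boolean isValidThreshold(int v) {
        return v >= 0 && v <= MAX_THRESHOLD;
    }

    public static void main(String[] args) {
        System.out.println(isValidThreshold(0));    // true (optimization disabled)
        System.out.println(isValidThreshold(400));  // true
        System.out.println(isValidThreshold(601));  // false
    }
}
```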
yes, sounds fine to me. Please add a comment in the codegen part in order to explain why we are not splitting the code. Thanks.
Yeah, I'll add a comment.
OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 3.10.0-862.3.2.el7.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz

Java HotSpot(TM) 64-Bit Server VM 1.8.0_192-b12 on Mac OS X 10.14.3
Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz

200 dates:
Ur, this PR is irrelevant to this ratio change, isn't it?
No, it has no effect on this. I assume we see such a difference because of machines. My original evaluation had a similar ratio as we see now.
Also, I re-tested this PR on a t2.xlarge EC2 instance.
OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 4.14.77-70.59.amzn1.x86_64
Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
200 structs: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
In expression 2614 / 2895 0.4 2614.5 1.0X
InSet expression 427 / 433 2.3 427.3 6.1X
Test build #102871 has finished for PR 23171 at commit ...
Retest this please.
"switch statements. This is applicable only to bytes, shorts, ints, dates.")
.intConf
.checkValue(threshold => threshold >= 0 && threshold <= 600, "The max set size " +
  "for using switch statements in InSet must be positive and less than or equal to 600")
Ur, the description does not match the condition check; must be positive -> threshold > 0?
Yeah, I've started with threshold > 0 but then changed it to threshold >= 0 and forgot to update the description. I kept 0 as a possible value to ensure we can disable this optimization if needed. Do you think it makes sense or shall we require threshold > 0?
Disabling is also a good idea if you state it clearly in the description.
val valueSQL = child.sql
val listSQL = hset.toSeq.map(Literal(_).sql).mkString(", ")
s"($valueSQL IN ($listSQL))"
}
This function is not changed. To reduce the code diff more clearly, could you move override def sql and private def canBeComputedUsingSwitch after genCodeWithSwitch?
@@ -241,6 +242,52 @@ class PredicateSuite extends SparkFunSuite with ExpressionEvalHelper {
  }
}

test("SPARK-26205: Optimize InSet for bytes, shorts, ints, dates using switch statements") {
Let's remove the SPARK-26205: prefix since this is an improvement. We use the JIRA ID only for bug fixes.
@@ -2,550 +2,739 @@
In Expression Benchmark
================================================================================================

OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 3.10.0-862.3.2.el7.x86_64
Recently, #23914 added Stdev to the benchmark result. We need to rerun this.
@aokolnychyi. After you update the PR code, I'll rerun the benchmark on EC2 and make a PR to you.
dateValues)
}

withSQLConf(SQLConf.OPTIMIZER_INSET_SWITCH_THRESHOLD.key -> "0") {
After https://github.com/apache/spark/pull/23171/files#r261888276, we need to increase this from 0 to 1.
@@ -413,6 +415,43 @@ class ColumnExpressionSuite extends QueryTest with SharedSQLContext {
  }
}

test("SPARK-26205: Optimize InSet for bytes, shorts, ints, dates") {
ditto. Let's remove the SPARK-26205: prefix.
sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala
}

spark.sessionState.conf.clear()
}
+1 for the intention, but I think we can skip this testing. :)
Could you revert the change on this file please?
${CodeGenerator.JAVA_BOOLEAN} ${ev.value} = false;
if (!${valueGen.isNull}) {
  switch (${valueGen.value}) {
    ${caseBranches.mkString("")}
Shall we add new lines?
- ${caseBranches.mkString("")}
+ ${caseBranches.mkString("\n")}
Otherwise, the readability is not good since it goes like the following (AS-IS).
/* 037 */ case 2:
/* 038 */ filter_value_0 = true;
/* 039 */ break;case 1:
...
I made a benchmark result PR to you, @aokolnychyi.
Test build #102954 has finished for PR 23171 at commit ...
Update the benchmark result on the same EC2 instance.
@dongjoon-hyun thanks for running the benchmarks! It's great to verify the performance benefit on one more machine.
+1, LGTM.
This PR has a clear benefit in terms of the performance. And, the generated code is also safe and clean.
LGTM too!
Test build #103006 has finished for PR 23171 at commit ...
Thank you, @aokolnychyi, @dbtsai, @gatorsmile, @cloud-fan, @rxin, @kiszk, @viirya, @mgaido91. Merged to master.
What changes were proposed in this pull request?

This PR optimizes InSet expressions for byte, short, integer, date types. It is a follow-up on PR #21442 from @dbtsai.

In expressions are compiled into a sequence of if-else statements, which results in O(n) time complexity. InSet is an optimized version of In, which is supposed to improve the performance if all values are literals and the number of elements is big enough. However, InSet actually worsens the performance in many cases due to various reasons.

The main idea of this PR is to use Java switch statements to significantly improve the performance of InSet expressions for bytes, shorts, ints, dates. All switch statements are compiled into tableswitch and lookupswitch bytecode instructions. We will have O(1) time complexity if our case values are compact and tableswitch can be used. Otherwise, lookupswitch will give us O(log n).

Locally, I tried Spark OpenHashSet and primitive collections from fastutils in order to solve the boxing issue in InSet. Both options significantly decreased the memory consumption and fastutils improved the time compared to HashSet from Scala. However, the switch-based approach was still more than two times faster even on 500+ non-compact elements.

I also noticed that applying the switch-based approach on less than 10 elements gives a relatively minor improvement compared to the if-else approach. Therefore, I placed the switch-based logic into InSet and added a new config to track when it is applied. Even if we migrate to primitive collections at some point, the switch logic will still be faster unless the number of elements is really big. Another option is to have a separate InSwitch expression. However, this would mean we need to modify other places (e.g., DataSourceStrategy).

See here and here for more information.
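The dispatch decision described above (switch-based codegen only for int-compatible types and small enough value sets) can be sketched as follows; the enum and method names are simplified stand-ins, not Spark's actual classes:

```java
public class SwitchDispatch {
    enum DataType { BYTE, SHORT, INT, DATE, LONG, STRING }

    // Use the switch-based path only when the value type maps to a Java int
    // (bytes, shorts, ints, dates) and the set is small enough to stay within
    // the configured threshold and bytecode limits.
    static boolean canBeComputedUsingSwitch(DataType type, int setSize, int threshold) {
        boolean switchCompatible =
            type == DataType.BYTE || type == DataType.SHORT
                || type == DataType.INT || type == DataType.DATE;
        return switchCompatible && setSize <= threshold;
    }

    public static void main(String[] args) {
        System.out.println(canBeComputedUsingSwitch(DataType.INT, 100, 400));  // true
        System.out.println(canBeComputedUsingSwitch(DataType.LONG, 100, 400)); // false
        System.out.println(canBeComputedUsingSwitch(DataType.INT, 500, 400));  // false
    }
}
```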
This PR does not cover long values as Java switch statements cannot be used on them. However, we can have a follow-up PR with an approach similar to binary search.

How was this patch tested?
There are new tests that verify the logic of the proposed optimization.
The performance was evaluated using existing benchmarks. This PR was also tested on an EC2 instance (OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 4.14.77-70.59.amzn1.x86_64, Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz).
Notes

... tableswitch and lookupswitch. The logic was re-used in the benchmarks. See the isLookupSwitch method.
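As a rough illustration of the tableswitch/lookupswitch distinction: javac emits tableswitch when the case values are dense (direct O(1) jump-table indexing) and lookupswitch when they are sparse (binary search over sorted value/target pairs). The density heuristic below is purely illustrative, not javac's actual cost model:

```java
import java.util.stream.IntStream;

public class SwitchShape {
    // tableswitch jumps directly via an index table (O(1)) but its size grows
    // with the span of the case values; lookupswitch stores sorted
    // (value, target) pairs and binary-searches them (O(log n)).
    static boolean looksLikeTableSwitch(int[] values) {
        if (values.length == 0) return false;
        long min = IntStream.of(values).min().getAsInt();
        long max = IntStream.of(values).max().getAsInt();
        long range = max - min + 1;
        return range <= 2L * values.length; // dense: table at most twice the case count
    }

    public static void main(String[] args) {
        System.out.println(looksLikeTableSwitch(new int[]{1, 2, 3, 4}));   // true
        System.out.println(looksLikeTableSwitch(new int[]{1, 1_000_000})); // false
    }
}
```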