[SPARK-26205][SQL] Optimize InSet Expression for bytes, shorts, ints, dates #23171
Conversation
}

private def isSwitchCompatible: Boolean = list.forall {
  case Literal(_, dt) => dt == ByteType || dt == ShortType || dt == IntegerType
case Literal(_, dt) if dt == ByteType || dt == ShortType || dt == IntegerType => true
is easier to read?
Can be simplified to?
private def isSwitchCompatible: Boolean = {
inSetConvertible && (value.dataType == ByteType || value.dataType == ShortType || value.dataType == IntegerType)
}
@gatorsmile @cloud-fan @dongjoon-hyun @viirya It would be great to have your feedback.
val (nullLiterals, nonNullLiterals) = list.partition {
  case Literal(null, _) => true
  case _ => false
}
If there is null in the list, there will be only one. As a result, we may not need nullLiterals:

val containNullInList = ...
val nonNullLiterals = ...
maybe we can follow InSet, define a hasNull ahead, and filter out null values from the list before processing.
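The suggested shape (a hasNull flag plus a null-filtered list) could look roughly like the sketch below; the types are simplified stand-ins, not Spark's actual Literal/expression classes:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NullPartition {
    // Track null presence with a single flag instead of keeping a
    // separate list of null literals (there can be at most one).
    static boolean hasNull(List<Integer> list) {
        return list.contains(null);
    }

    // Filter out null values from the list before processing.
    static List<Integer> nonNullLiterals(List<Integer> list) {
        List<Integer> result = new ArrayList<>();
        for (Integer v : list) {
            if (v != null) result.add(v);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> list = Arrays.asList(1, null, 3);
        System.out.println(hasNull(list));          // true
        System.out.println(nonNullLiterals(list));  // [1, 3]
    }
}
```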
The approach looks great and can significantly improve the performance. For Long, I agree that we should also implement a binary search approach. Wondering which one will be faster, binary search using arrays or rewriting the ...
Test build #99393 has finished for PR 23171 at commit ...
Also cc @ueshin
I'm wondering if this is still useful after we fix the boxing issue in ...
val listGen = nonNullLiterals.map(_.genCode(ctx))
val valueGen = value.genCode(ctx)

val caseBranches = listGen.map(literal =>
style:
listGen.map { literal =>
...
}
@cloud-fan, yeah, let’s see if this PR is useful. The original idea wasn’t to avoid fixing autoboxing in ... Once we solve autoboxing issues in ...
my (maybe stupid?) question is: once we do such a change, does it still make sense to convert In to InSet? Most likely now In is even more efficient. Shall we change the optimizer in order to reflect this? Maybe we can do this in a followup.
|${CodeGenerator.JAVA_BOOLEAN} ${ev.value} = false;
|if (!${valueGen.isNull}) {
|  switch (${valueGen.value}) {
|    ${caseBranches.mkString("")}
we should consider that if the number of items is very big, this can cause a compile exception due to the method size limit. So we should use the proper splitting methods for the generated code.
@aokolnychyi Could you please address @mgaido91's comment? The current code will throw an exception for a huge sequence of In.
Could you please add test cases that could cause more than 64KB Java bytecode size in one switch statement?
@cloud-fan as @aokolnychyi said, @mgaido91 ...
@dbtsai I see. It would be great, though, to check what this threshold is. My understanding is that the current solution has better performance even for several hundreds of items. If this number is in the thousands, and since it depends on the data type (so it is hard for users to control with a single config), it is arguable which solution is best: I don't think it is very common to have thousands of elements, while for lower (more common) numbers we would use the less efficient solution.
@dbtsai @mgaido91 I think we can come back to this question once SPARK-26203 is resolved. That JIRA will give us enough information about each data type.
To sum up, I would set the goal of this PR to make ... This approach sets a pretty high bar even for huge value lists, so it would be a nice basis to benchmark our solution for ...
yes @aokolnychyi, I agree that the work can be done later (not in the scope of this PR). We can maybe just open a new JIRA about it so we won't forget.
I'm not a big fan of making the physical implementation of an expression very different depending on the situation. It complicates the code base and makes things more difficult to reason about. Why can't we just make InSet efficient and convert these cases to that?
@rxin I proposed the same thing before, but one problem is that we only convert In to InSet when the length of the list reaches the threshold. If the switch way is faster than hash set when the list is small, it seems still worth optimizing In using switch.
That probably means we should just optimize InSet to have the switch version though? Rather than do it in In?
I think InSet is not an optimized version of In, but just a way to separate the implementation for different conditions (the length of the list). Maybe we should do the same thing here: create an InSwitch and convert In to it when meeting some conditions. One problem is, In and InSwitch are the same in the interpreted version, so maybe we should create a base class for them.
I thought InSwitch logically is the same as InSet, in which all the child expressions are literals?
How about, we create an ...
@rxin @cloud-fan do you suggest to create an OptimizeIn which has switch and hash set implementations based on the length of the elements and remove InSet? Basically, what we were thinking above.
Basically, logically there are only two expressions: In, which handles arbitrary expressions, and InSet, which handles expressions with literals. Both could work: (1) we provide two separate expressions for InSet, one using switch and one using hashset, or (2) we just provide one InSet and internally in InSet have two implementations ...
The downside of creating different expressions for the same logical expression is that potentially the downstream optimization rules would need to match more.
As @rxin said, if we introduce a separate expression for the switch-based approach, then we will need to modify other places. For example, I think we can move the switch-based logic to ...
@dbtsai @cloud-fan @mgaido91 @rxin @dongjoon-hyun @viirya @gatorsmile PR #23291 contains benchmarks for different data types. @rxin was your latest suggestion to convert ...
thanks @aokolnychyi, could you please post here the result of that benchmark after applying this patch? Just a quick question: can't we support timestamp too in the switch approach?
@mgaido91 It won't be possible to apply the switch-based approach on timestamps as they are represented as longs. We can try dates as they are represented as ints. Below is the result of that benchmark with this patch:
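As a side note on why dates fit the switch approach while timestamps do not: Spark stores DateType as an int (days since 1970-01-01) and TimestampType as a long (microseconds since the epoch), and a Java switch cannot dispatch on a long operand. A small illustrative sketch, not Spark code:

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

public class DateAsInt {
    // A date encodes losslessly as an int (days since the epoch), so it can
    // appear as a `case` label in generated Java switch code.
    static int daysSinceEpoch(LocalDate date) {
        return (int) ChronoUnit.DAYS.between(LocalDate.of(1970, 1, 1), date);
    }

    public static void main(String[] args) {
        int days = daysSinceEpoch(LocalDate.of(1970, 1, 11));
        switch (days) {  // legal: int operand; a long here would not compile
            case 10:
                System.out.println("matched day " + days);  // prints "matched day 10"
                break;
            default:
                System.out.println("no match");
        }
    }
}
```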
thanks @aokolnychyi. I just have a couple of comments on this:
""".stripMargin)
}

private def isSwitchCompatible: Boolean = list.forall {
Could you please take care of the following limitation of Java switch statement, too?
npairs pairs of signed 32-bit values
https://docs.oracle.com/javase/specs/jvms/se8/html/jvms-6.html#jvms-6.5.lookupswitch
https://docs.oracle.com/javase/specs/jvms/se8/html/jvms-6.html#jvms-6.5.tableswitch
Could you please elaborate a bit on this? I am not sure I got it. Shouldn't we be fine if we limit this approach to bytes/shorts/ints?
Sorry for missing some words. My comment is that isSwitchCompatible can be true only if list.size is less than or equal to INT.MAX. Otherwise, Janino will cause a failure.
withSQLConf(SQLConf.OPTIMIZER_INSET_SWITCH_THRESHOLD.key -> "20") {
  checkAllTypes()
}
}
Could you please add a test case where spark.sql.optimizer.inSetSwitchThreshold has the maximum value and this optimization calls genCodeWithSwitch()?
Do you mean testing that if the set size is 100 and spark.sql.optimizer.inSetSwitchThreshold is 100, then genCodeWithSwitch is still applied?
My question addressed what you are talking about here. The current implementation can accept a large int value (e.g., Integer.MAX_VALUE) for spark.sql.optimizer.inSetSwitchThreshold. Thus, I am afraid the switch code may require more than 64KB of Java bytecode.
If the option has an appropriate upper limit, it is fine.
I'm +1 for this approach. Thank you for updating, @aokolnychyi.
.internal()
.doc("Configures the max set size in InSet for which Spark will generate code with " +
  "switch statements. This is applicable only to bytes, shorts, ints, dates.")
.intConf
To prevent user configuration errors, can we have a meaningful min/max check?
.checkValue(v => v > 0 && v < ???, ...)
@kiszk @mgaido91 we had a discussion about generating code bigger than 64KB. I am wondering if we still want to split the switch-based logic into multiple methods if we have the check suggested by @dongjoon-hyun. I've implemented the split logic locally. However, the code looks more complicated and we would need some extensions to splitExpressionsWithCurrentInputs.
I am not sure why you'd need any extension. We have other parts of the code with switch which are split. I think in general it is safer to have it.
@mgaido91 could you point me to an example?
ah, you're right, sorry, I was remembering wrongly. There were switch-based expressions, and for splitting them we migrated them to a do-while approach. Since the whole point of this PR is to introduce the switch construct, I agree with you that the best way is to add a constraint here in order to keep the number small enough not to cause issues with code generation.
What about the default and max values then? The switch logic was faster than HashSet on 500 elements for every data type and on every machine I tested. In some cases, HashSet started to outperform on 550+. Also, I had to generate a set of 6000+ elements to hit the limit of 64KB. My proposal is to have 400 as default and 600 as max. Then we should be safe.
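The proposed bounds can be sketched as a plain validation predicate; the 400/600 values come from the discussion above, and the names are illustrative rather than Spark's actual config API:

```java
public class SwitchThreshold {
    // Illustrative bounds from the discussion: default 400, max 600,
    // and 0 allowed so the optimization can be disabled entirely.
    static final int DEFAULT_THRESHOLD = 400;
    static final int MAX_THRESHOLD = 600;

    static boolean isValidThreshold(int v) {
        return v >= 0 && v <= MAX_THRESHOLD;
    }

    public static void main(String[] args) {
        System.out.println(isValidThreshold(0));    // true (optimization disabled)
        System.out.println(isValidThreshold(400));  // true
        System.out.println(isValidThreshold(601));  // false
    }
}
```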
yes, sounds fine to me. Please add a comment in the codegen part in order to explain why we are not splitting the code. Thanks.
Yeah, I'll add a comment.
OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 3.10.0-862.3.2.el7.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz

Java HotSpot(TM) 64-Bit Server VM 1.8.0_192-b12 on Mac OS X 10.14.3
Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz

200 dates:
Ur, this PR is irrelevant to this ratio change, isn't it?
No, it has no effect on this. I assume we see such a difference because of machines. My original evaluation had a similar ratio as we see now.
Also, I re-tested this PR on a t2.xlarge EC2 instance.
OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 4.14.77-70.59.amzn1.x86_64
Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
200 structs: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
In expression 2614 / 2895 0.4 2614.5 1.0X
InSet expression 427 / 433 2.3 427.3 6.1X
Test build #102871 has finished for PR 23171 at commit ...
Retest this please.
"switch statements. This is applicable only to bytes, shorts, ints, dates.")
.intConf
.checkValue(threshold => threshold >= 0 && threshold <= 600, "The max set size " +
  "for using switch statements in InSet must be positive and less than or equal to 600")
Ur, the description does not match the condition check; must be positive -> threshold > 0?
Yeah, I've started with threshold > 0 but then changed it to threshold >= 0 and forgot to update the description. I kept 0 as a possible value to ensure we can disable this optimization if needed. Do you think it makes sense or shall we require threshold > 0?
Disabling is also a good idea if you state it clearly in the description.
val valueSQL = child.sql
val listSQL = hset.toSeq.map(Literal(_).sql).mkString(", ")
s"($valueSQL IN ($listSQL))"
}
This function is not changed. To reduce the code diff more clearly, could you move override def sql and private def canBeComputedUsingSwitch after genCodeWithSwitch?
@@ -241,6 +242,52 @@ class PredicateSuite extends SparkFunSuite with ExpressionEvalHelper {
  }
}

test("SPARK-26205: Optimize InSet for bytes, shorts, ints, dates using switch statements") {
Let's remove the SPARK-26205: prefix since this is an improvement. We use the JIRA ID only for bug fixes.
@@ -2,550 +2,739 @@
In Expression Benchmark
================================================================================================

OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 3.10.0-862.3.2.el7.x86_64
Recently, #23914 added Stdev to the benchmark result. We need to rerun this.
@aokolnychyi. After you update the PR code, I'll rerun the benchmark on EC2 and make a PR to you.
dateValues)
}

withSQLConf(SQLConf.OPTIMIZER_INSET_SWITCH_THRESHOLD.key -> "0") {
After https://github.com/apache/spark/pull/23171/files#r261888276, we need to increase this from 0 to 1.
@@ -413,6 +415,43 @@ class ColumnExpressionSuite extends QueryTest with SharedSQLContext {
  }
}

test("SPARK-26205: Optimize InSet for bytes, shorts, ints, dates") {
ditto. Let's remove the SPARK-26205: prefix.
sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala
}

spark.sessionState.conf.clear()
}
+1 for the intention, but I think we can skip this testing. :)
Could you revert the change on this file please?
${CodeGenerator.JAVA_BOOLEAN} ${ev.value} = false;
if (!${valueGen.isNull}) {
  switch (${valueGen.value}) {
    ${caseBranches.mkString("")}
Shall we add new lines?
- ${caseBranches.mkString("")}
+ ${caseBranches.mkString("\n")}
Otherwise, the readability is not good since it goes like the following (AS-IS).
/* 037 */ case 2:
/* 038 */ filter_value_0 = true;
/* 039 */ break;case 1:
...
I made a benchmark result PR to you, @aokolnychyi.
Test build #102954 has finished for PR 23171 at commit ...
Update the benchmark result on the same EC2 instance.
@dongjoon-hyun thanks for running the benchmarks! It's great to verify the performance benefit on one more machine.
+1, LGTM.
This PR has a clear benefit in terms of the performance. And, the generated code is also safe and clean.
LGTM too!
Test build #103006 has finished for PR 23171 at commit ...
Thank you, @aokolnychyi, @dbtsai, @gatorsmile, @cloud-fan, @rxin, @kiszk, @viirya, @mgaido91. Merged to master.
What changes were proposed in this pull request?

This PR optimizes InSet expressions for byte, short, integer, date types. It is a follow-up on PR #21442 from @dbtsai.

In expressions are compiled into a sequence of if-else statements, which results in O(n) time complexity. InSet is an optimized version of In, which is supposed to improve the performance if all values are literals and the number of elements is big enough. However, InSet actually worsens the performance in many cases due to various reasons.

The main idea of this PR is to use Java switch statements to significantly improve the performance of InSet expressions for bytes, shorts, ints, dates. All switch statements are compiled into tableswitch and lookupswitch bytecode instructions. We will have O(1) time complexity if our case values are compact and tableswitch can be used. Otherwise, lookupswitch will give us O(log n).

Locally, I tried Spark OpenHashSet and primitive collections from fastutils in order to solve the boxing issue in InSet. Both options significantly decreased the memory consumption and fastutils improved the time compared to HashSet from Scala. However, the switch-based approach was still more than two times faster even on 500+ non-compact elements.

I also noticed that applying the switch-based approach on less than 10 elements gives a relatively minor improvement compared to the if-else approach. Therefore, I placed the switch-based logic into InSet and added a new config to track when it is applied. Even if we migrate to primitive collections at some point, the switch logic will still be faster unless the number of elements is really big. Another option is to have a separate InSwitch expression. However, this would mean we need to modify other places (e.g., DataSourceStrategy).

See here and here for more information.
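The dispatch decision described above (switch-based codegen only for int-compatible types and small enough value sets) can be sketched as follows; the enum and method names are simplified stand-ins, not Spark's actual classes:

```java
public class SwitchDispatch {
    enum DataType { BYTE, SHORT, INT, DATE, LONG, STRING }

    // Use the switch-based path only when the value type maps to a Java int
    // (bytes, shorts, ints, dates) and the set is small enough to stay within
    // the configured threshold and bytecode limits.
    static boolean canBeComputedUsingSwitch(DataType type, int setSize, int threshold) {
        boolean switchCompatible =
            type == DataType.BYTE || type == DataType.SHORT
                || type == DataType.INT || type == DataType.DATE;
        return switchCompatible && setSize <= threshold;
    }

    public static void main(String[] args) {
        System.out.println(canBeComputedUsingSwitch(DataType.INT, 100, 400));  // true
        System.out.println(canBeComputedUsingSwitch(DataType.LONG, 100, 400)); // false
        System.out.println(canBeComputedUsingSwitch(DataType.INT, 500, 400));  // false
    }
}
```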
This PR does not cover long values as Java switch statements cannot be used on them. However, we can have a follow-up PR with an approach similar to binary search.

How was this patch tested?
There are new tests that verify the logic of the proposed optimization.
The performance was evaluated using existing benchmarks. This PR was also tested on an EC2 instance (OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 4.14.77-70.59.amzn1.x86_64, Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz).
Notes

... tableswitch and lookupswitch. The logic was re-used in the benchmarks. See the isLookupSwitch method.
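As a rough illustration of the tableswitch/lookupswitch distinction: javac emits tableswitch when the case values are dense (direct O(1) jump-table indexing) and lookupswitch when they are sparse (binary search over sorted value/target pairs). The density heuristic below is purely illustrative, not javac's actual cost model:

```java
import java.util.stream.IntStream;

public class SwitchShape {
    // tableswitch jumps directly via an index table (O(1)) but its size grows
    // with the span of the case values; lookupswitch stores sorted
    // (value, target) pairs and binary-searches them (O(log n)).
    static boolean looksLikeTableSwitch(int[] values) {
        if (values.length == 0) return false;
        long min = IntStream.of(values).min().getAsInt();
        long max = IntStream.of(values).max().getAsInt();
        long range = max - min + 1;
        return range <= 2L * values.length; // dense: table at most twice the case count
    }

    public static void main(String[] args) {
        System.out.println(looksLikeTableSwitch(new int[]{1, 2, 3, 4}));   // true
        System.out.println(looksLikeTableSwitch(new int[]{1, 1_000_000})); // false
    }
}
```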