-
Notifications
You must be signed in to change notification settings - Fork 980
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIRA DRILL-5879: Like operator performance improvements #1001
Conversation
@sachouche Do you have a breakdown of how much gain we got with 1 vs 2. Since the changes for 2 are not straightforward and easy to maintain, I am thinking performance gain vs. maintainability of the code. |
@sachouche, thanks for the first PR to Drill! Thanks for the detailed explanation! Before reviewing the code, a comment on the design:
The problem with this design is that there is no guarantee that the first value is representative of the other columns. Maybe my list looks like this:
The first value is ASCII. The second is not. So, we must treat each value as independent of the others. Or, maybe the design means that we do this per holder per row. In this case, do we know that the holder is reused for accesses to the same column in the same row, but is reinitialized for each new row? We'd have to look at the generated code (see the "Plain Java" feature) to see how this works. On the other hand, we can exploit the nature of UTF-8. The encoding is such that no valid UTF-8 character is a prefix of any other valid character. Thus, if a character is 0xXX 0xYY 0xZZ, then there can never be a valid character which is 0xXX 0xYY. As a result, starts-with, ends-width, equals and contains can be done without either converting to UTF-16 or even caring if the data is ASCII or not. What does this mean? It means that, for the simple operations:
Unlike other multi-byte encodings, UTF-8 was designed to make this possible. If we go this route, we would not need the ASCII mode flag. Note: all of this applies only to the "basic four" operations: if we do a real regex, then we must decode the Varchar into a Java UTF-16 string. |
|
Paul,
Note that the test-suite has similar test cases that you are proposing; they all passed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks much for the PR.
The two key comments are:
- The conversion to characters is unnecessary: work directly with bytes, even for UTF-8.
- Some of the pattern-length optimizations should not be checked per-row, but instead should be part of the pattern parser introduced in an earlier commit, since they test conditions that must be true for the pattern as a whole independent of the per-row values.
@@ -73,6 +73,9 @@ public void eval() { | |||
out.start = in.start; | |||
if (charCount <= length.value || length.value == 0 ) { | |||
out.end = in.end; | |||
if (charCount == (out.end-out.start)) { | |||
out.asciiMode = 1; // we can conclude this string is ASCII |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The values -1, 0, and 1 are magic numbers here. Because we want speed, we don't want to use an enum. But, can we define symbols for our values: IS_ASCII
, NOT_ASCII
, ASCII_UNKNOWN
? (You decide on the names...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do!
@@ -73,6 +73,9 @@ public void eval() { | |||
out.start = in.start; | |||
if (charCount <= length.value || length.value == 0 ) { | |||
out.end = in.end; | |||
if (charCount == (out.end-out.start)) { | |||
out.asciiMode = 1; // we can conclude this string is ASCII |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drill likes spaces around operators: out.end - out.start
} | ||
} | ||
|
||
private final int match_1() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See note about UTF-8. If we don't care about the match position (that is, we don't need strpos()
, and all we care is whether it matches or not, then we can do the work on the undecoded UTF-8 bytes, saving a large amount of complexity.
|
||
<#if minor.class == "VarChar"> | ||
// -1: unknown, 0: not ascii, 1: is ascii | ||
public int asciiMode = -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drill is complex, nothing is as it seems. Drill has a very elaborate mechanism to rewrite byte codes to do scalar replacement. That is, holders are added to the Java source during code generation, but then are removed during byte code transforms. (Yes, that is silly given that Java 8 does a fine job on its own; but it is how Drill works today...)
Will adding this variable change that behavior? Will the scalar replacement logic know to not replace this object? If so, will that hurt performance, or with the JVM go ahead and figure it out on its own? Is all the code that sets the variable inlined via code generation so that scalar replacement is possible?
Yes, indeed, nothing in Drill is as simple as it seems...
continue; | ||
} else { | ||
char ch2_1 = sequenceWrapper.charAt(idx+1); | ||
char ch2_2 = patternString.charAt(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want speed. Instead of getting the second character multiple times, is it better to get it once up front? I suppose that depends on the hit rate. Average hit rate may be 2 % (1/ ~64). So if our input is smaller than 64 characters, we'll have, on average one hit so we pay the second character cost once. At 128 or above, we'll pay the cost two or more times. But, maybe the JVM can optimize away the second and subsequent accesses?
Actually, let's take a step back. The pattern is fixed. We parsed the pattern to decide to use this particular class. Should we instead create a 1-char, 2-char and n-char matcher class so we get the second character (for the 2-char case) only once, and we eliminate the extra per-value if-check?
|
||
private final int match_N() { | ||
|
||
if (patternLength == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't this be optimized away in the pattern parser stage? We should not be calling this function if we had a zero-length pattern.
public final int match() { | ||
// The idea is to write loops with simple condition checks to allow the Java Hotspot vectorize | ||
// the generate code. | ||
if (patternLength == 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pattern length is fixed per expression; need not be checked per row. See more below.
Paul, again thanks for the detailed review; I was able to address most of the feedback except for one:
Byte Code Manipulation
|
@@ -73,6 +73,9 @@ public void eval() { | |||
out.start = in.start; | |||
if (charCount <= length.value || length.value == 0 ) { | |||
out.end = in.end; | |||
if (charCount == (out.end - out.start)) { | |||
out.asciiMode = org.apache.drill.exec.expr.holders.VarCharHolder.CHAR_MODE_IS_ASCII; // we can conclude this string is ASCII |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please add comments here ? I am not able to understand this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- As previously stated (when responding to Paul'd comment), the expression framework is able to use the same VarCharHolder input variable when it is shared amongst multiple expressions
- If the original column was of type var-binary, then the expression framework will include a cast to var-char
- The cast logic will also compute the string length
- Using this information to deduce whether the string is pure ASCII or not
- UTF-8 encoding uses 1 byte for ASCII and 2, 3, or 4 for other character sets
- If the encoded length and character length are equal, then this means this is an ASCII string
|
||
public CharSequenceWrapper() { | ||
} | ||
|
||
public CharSequenceWrapper(int start, int end, DrillBuf buffer) { | ||
setBuffer(start, end, buffer); | ||
setBuffer(start, end, buffer, -1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does -1 mean ? Shouldn't it be one of CHAR_MODE_IS_ASCII, CHAR_MODE_UNKNOWN or CHAR_MODE_NOT_ASCII ?
// The idea is to write loops with simple condition checks to allow the Java Hotspot achieve | ||
// better optimizations (especially vectorization) | ||
if (patternLength == 1) { | ||
matcherFcn = new Matcher1(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if it is a good idea to write special case code for each pattern length. It is not easy to maintain. Can you please give more details how this is improving performance ? Are we getting better performance for patternLength 1 or 2 or 3 or N or all ? If so, how much and why ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Padma, I have two reasons to follow the added complexity
- The new code is encapsulated within the Contains matching logic; doesn't increase code complexity
o I created a test with the original match logic, pattern and input were Strings though passed as CharSequence
o Ran the test with the new and old method (1 billion iterations) on MacOS
o pattern length
o The old match method performed in 43sec where as the new one performed in 15sec
o The reason for the speedup is the custom matcher functions have less instructions (load and comparison)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how does matcherN perform compared to matcher1, matcher2, matcher3 for pattern lengths 1, 2 and 3 ? If matcherN performs well for patternLengths 1, 2 and 3, we can just have one matcher instead of multiple for different pattern lengths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran two types of tests to evaluate the generic vs custom method:
Test1 - A match does exist
o Custom method is 3x faster because the code will go to the "else" part
o A nested loop is always slower than unrolled code (the loop is also correlated with the outer one); please refer to this article (https://en.wikipedia.org/wiki/Loop_unrolling) on the benefits of loop unrolling
o Older match function performed in 59sec
Test2- A match doesn't exist
o Custom method and generic one perform in 15sec; this is because both perform a comparison and proceed to the next iteration
o Older match function performed in 45sec
@ppadma @paul-rogers I see that @sachouche addressed the comments in the JIRA - is this one ready to merge? |
@priteshm @sachouche This PR needs to be updated on top of changes made for DRILL-5899. |
@sachouche can you update this PR? |
Created another pull request #1072to merge my changes with the one done with Padma's. |
Query: select from <table> where colA like '%a%' or colA like '%xyz%';
Improvement Opportunities
Avoid isAscii computation (full access of the input string) since we're dealing with the same column twice
Optimize the "contains" for-loop
Implementation Details
1)
Added a new integer variable "asciiMode" to the VarCharHolder class
The default value is -1 which indicates this info is not known
Otherwise this value will be set to either 1 or 0 based on the string being in ASCII mode or Unicode
The execution plan already shares the same VarCharHolder instance for all evaluations of the same column value
The asciiMode will be correctly set during the first LIKE evaluation and will be reused across other LIKE evaluations
2)
The "Contains" LIKE operation is quite expensive as the code needs to access the input string to perform character based comparisons
Created 4 versions of the same for-loop to a) make the loop simpler to optimize (Vectorization) and b) minimize comparisons
Benchmarks
Lineitem table 100GB
Query: select l_returnflag, count from dfs.
<source>
where l_comment not like '%a%' or l_comment like '%the%' group by l_returnflagBefore changes: 33sec
After changes : 27sec