JIRA DRILL-5879: Like operator performance improvements #1001

sachouche · 2017-10-18T19:08:12Z

Recently, custom code has been added to handle common search patterns (Like operator)
Contains, Starts With, and Ends With
Custom code is way faster than the generic RegEx based implementation
This pull request is another attempt to improve the Contains pattern since it is more CPU intensive

Query: select from <table> where colA like '%a%' or colA like '%xyz%';
Improvement Opportunities
Avoid isAscii computation (full access of the input string) since we're dealing with the same column twice
Optimize the "contains" for-loop
Implementation Details
1)
Added a new integer variable "asciiMode" to the VarCharHolder class
The default value is -1 which indicates this info is not known
Otherwise this value will be set to either 1 or 0 based on the string being in ASCII mode or Unicode
The execution plan already shares the same VarCharHolder instance for all evaluations of the same column value
The asciiMode will be correctly set during the first LIKE evaluation and will be reused across other LIKE evaluations
2)
The "Contains" LIKE operation is quite expensive as the code needs to access the input string to perform character based comparisons
Created 4 versions of the same for-loop to a) make the loop simpler to optimize (Vectorization) and b) minimize comparisons
Benchmarks
Lineitem table 100GB
Query: select l_returnflag, count from dfs.<source> where l_comment not like '%a%' or l_comment like '%the%' group by l_returnflag
Before changes: 33sec
After changes : 27sec

ppadma · 2017-10-19T00:00:30Z

@sachouche Do you have a breakdown of how much gain we got with 1 vs 2. Since the changes for 2 are not straightforward and easy to maintain, I am thinking performance gain vs. maintainability of the code.

paul-rogers · 2017-10-19T00:08:09Z

@sachouche, thanks for the first PR to Drill! Thanks for the detailed explanation!

Before reviewing the code, a comment on the design:

Added a new integer variable "asciiMode" ... this value will be set ... during the first LIKE evaluation and will be reused across other LIKE evaluations

The problem with this design is that there is no guarantee that the first value is representative of the other columns. Maybe my list looks like this:

Hello
你好

The first value is ASCII. The second is not. So, we must treat each value as independent of the others.

Or, maybe the design means that we do this per holder per row. In this case, do we know that the holder is reused for accesses to the same column in the same row, but is reinitialized for each new row? We'd have to look at the generated code (see the "Plain Java" feature) to see how this works.

On the other hand, we can exploit the nature of UTF-8. The encoding is such that no valid UTF-8 character is a prefix of any other valid character. Thus, if a character is 0xXX 0xYY 0xZZ, then there can never be a valid character which is 0xXX 0xYY. As a result, starts-with, ends-width, equals and contains can be done without either converting to UTF-16 or even caring if the data is ASCII or not.

What does this mean? It means that, for the simple operations:

Convert the Java UTF-16 string to UTF-8.
Do the classic byte comparison methods for starts with, ends with or contains. No special processing is needed for multi-byte

Unlike other multi-byte encodings, UTF-8 was designed to make this possible.

If we go this route, we would not need the ASCII mode flag.

Note: all of this applies only to the "basic four" operations: if we do a real regex, then we must decode the Varchar into a Java UTF-16 string.

sachouche · 2017-10-19T00:11:49Z

It was 50% for 1) and 50% for 2)
Notice this breakdown depends on
o The number of Contains pattern for the same value (impacts 1))
o The pattern length (impacts both 1) and 2))

sachouche · 2017-10-19T00:20:33Z

Paul,

I think you misunderstood the proposal
Let me use an example
select .. c1 like '%pattern1%' OR c1 like '%pattern2%'..
Assume c1 has 3 values [v1, v2, v3]
The generated code will create a new VarcharHolder instance for each value iteration
For v1: VarCharHolder vc1 is created, ascii-mode computed for pattern1, ascii-mode computation reused for pattern2, pattern3, etc since we're evaluating the same value
For v2: VarCharHolder vc2 is created, ascii-mode computed for pattern1, ascii-mode computation reused for pattern2, pattern3, etc since we're evaluating the same value
DITO for v3

Note that the test-suite has similar test cases that you are proposing; they all passed.

paul-rogers

Thanks much for the PR.
The two key comments are:

The conversion to characters is unnecessary: work directly with bytes, even for UTF-8.
Some of the pattern-length optimizations should not be checked per-row, but instead should be part of the pattern parser introduced in an earlier commit, since they test conditions that must be true for the pattern as a whole independent of the per-row values.

paul-rogers · 2017-10-19T00:11:44Z

exec/java-exec/src/main/codegen/templates/CastFunctionsSrcVarLenTargetVarLen.java

@@ -73,6 +73,9 @@ public void eval() {
    out.start =  in.start;
    if (charCount <= length.value || length.value == 0 ) {
      out.end = in.end;
+      if (charCount == (out.end-out.start)) {
+        out.asciiMode = 1; // we can conclude this string is ASCII


The values -1, 0, and 1 are magic numbers here. Because we want speed, we don't want to use an enum. But, can we define symbols for our values: IS_ASCII, NOT_ASCII, ASCII_UNKNOWN? (You decide on the names...)

paul-rogers · 2017-10-19T00:14:28Z

exec/java-exec/src/main/codegen/templates/CastFunctionsSrcVarLenTargetVarLen.java

@@ -73,6 +73,9 @@ public void eval() {
    out.start =  in.start;
    if (charCount <= length.value || length.value == 0 ) {
      out.end = in.end;
+      if (charCount == (out.end-out.start)) {
+        out.asciiMode = 1; // we can conclude this string is ASCII


Drill likes spaces around operators: out.end - out.start

paul-rogers · 2017-10-19T00:29:26Z

exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java

+    }
+  }
+
+  private final int match_1() {


See note about UTF-8. If we don't care about the match position (that is, we don't need strpos(), and all we care is whether it matches or not, then we can do the work on the undecoded UTF-8 bytes, saving a large amount of complexity.

paul-rogers · 2017-10-19T00:32:55Z

exec/vector/src/main/codegen/templates/ValueHolders.java

+
+    <#if minor.class == "VarChar">
+    // -1: unknown, 0: not ascii, 1: is ascii
+    public int asciiMode = -1;


Drill is complex, nothing is as it seems. Drill has a very elaborate mechanism to rewrite byte codes to do scalar replacement. That is, holders are added to the Java source during code generation, but then are removed during byte code transforms. (Yes, that is silly given that Java 8 does a fine job on its own; but it is how Drill works today...)

Will adding this variable change that behavior? Will the scalar replacement logic know to not replace this object? If so, will that hurt performance, or with the JVM go ahead and figure it out on its own? Is all the code that sets the variable inlined via code generation so that scalar replacement is possible?

Yes, indeed, nothing in Drill is as simple as it seems...

paul-rogers · 2017-10-19T00:39:55Z

exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java

+        continue;
+      } else {
+        char ch2_1 = sequenceWrapper.charAt(idx+1);
+        char ch2_2 = patternString.charAt(1);


We want speed. Instead of getting the second character multiple times, is it better to get it once up front? I suppose that depends on the hit rate. Average hit rate may be 2 % (1/ ~64). So if our input is smaller than 64 characters, we'll have, on average one hit so we pay the second character cost once. At 128 or above, we'll pay the cost two or more times. But, maybe the JVM can optimize away the second and subsequent accesses?

Actually, let's take a step back. The pattern is fixed. We parsed the pattern to decide to use this particular class. Should we instead create a 1-char, 2-char and n-char matcher class so we get the second character (for the 2-char case) only once, and we eliminate the extra per-value if-check?

paul-rogers · 2017-10-19T00:40:47Z

exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java

+
+  private final int match_N() {
+
+    if (patternLength == 0) {


Can't this be optimized away in the pattern parser stage? We should not be calling this function if we had a zero-length pattern.

paul-rogers · 2017-10-19T00:45:24Z

exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java

+  public final int match() {
+    // The idea is to write loops with simple condition checks to allow the Java Hotspot vectorize
+    // the generate code.
+    if (patternLength == 1) {


Pattern length is fixed per expression; need not be checked per row. See more below.

sachouche · 2017-10-23T18:24:38Z

Paul, again thanks for the detailed review; I was able to address most of the feedback except for one:
URF-8 Encoded Based Processing

I agree that expressions that can operate directly on the encoded UTF-8 string should ideally perform checks on bytes and not characters
Having said that, such a change is more involved and should be done properly
o The SqlPatternContainsMatcher currently gets a CharSequence as input
o We should enhance the expression framework so that matchers can a) express their capabilities and b) receive the expected data type (Character or Byte sequences)
o The compiler might also need an enhancement; currently, the optimizer adds a varchar cast to Parquet columns of binary data types which are involved with the LIKE operator (the cast logic will iterate through the whole byte array)
o Note also there is an impact on the test-suite since StringBuffer are being used to directly test the matcher functionality

Byte Code Manipulation

I looked at the Byte Code implementation to ensure the newly added scalar variable doesn't break the VarCharHolder optimization (stack allocation instead of heap)
The code uses reflection and as such is able to handle new member variables

ppadma · 2017-10-23T20:43:11Z

exec/java-exec/src/main/codegen/templates/CastFunctionsSrcVarLenTargetVarLen.java

@@ -73,6 +73,9 @@ public void eval() {
    out.start =  in.start;
    if (charCount <= length.value || length.value == 0 ) {
      out.end = in.end;
+      if (charCount == (out.end - out.start)) {
+        out.asciiMode = org.apache.drill.exec.expr.holders.VarCharHolder.CHAR_MODE_IS_ASCII; // we can conclude this string is ASCII


can you please add comments here ? I am not able to understand this change.

As previously stated (when responding to Paul'd comment), the expression framework is able to use the same VarCharHolder input variable when it is shared amongst multiple expressions

If the original column was of type var-binary, then the expression framework will include a cast to var-char

The cast logic will also compute the string length

Using this information to deduce whether the string is pure ASCII or not

UTF-8 encoding uses 1 byte for ASCII and 2, 3, or 4 for other character sets

If the encoded length and character length are equal, then this means this is an ASCII string

ppadma · 2017-10-23T20:52:18Z

exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/CharSequenceWrapper.java


  public CharSequenceWrapper() {
  }

  public CharSequenceWrapper(int start, int end, DrillBuf buffer) {
-    setBuffer(start, end, buffer);
+    setBuffer(start, end, buffer, -1);


what does -1 mean ? Shouldn't it be one of CHAR_MODE_IS_ASCII, CHAR_MODE_UNKNOWN or CHAR_MODE_NOT_ASCII ?

ppadma · 2017-10-23T20:59:28Z

exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java

+    // The idea is to write loops with simple condition checks to allow the Java Hotspot achieve
+    // better optimizations (especially vectorization)
+    if (patternLength == 1) {
+      matcherFcn = new Matcher1();


I am not sure if it is a good idea to write special case code for each pattern length. It is not easy to maintain. Can you please give more details how this is improving performance ? Are we getting better performance for patternLength 1 or 2 or 3 or N or all ? If so, how much and why ?

Padma, I have two reasons to follow the added complexity

The new code is encapsulated within the Contains matching logic; doesn't increase code complexity

o I created a test with the original match logic, pattern and input were Strings though passed as CharSequence
o Ran the test with the new and old method (1 billion iterations) on MacOS
o pattern length
o The old match method performed in 43sec where as the new one performed in 15sec
o The reason for the speedup is the custom matcher functions have less instructions (load and comparison)

how does matcherN perform compared to matcher1, matcher2, matcher3 for pattern lengths 1, 2 and 3 ? If matcherN performs well for patternLengths 1, 2 and 3, we can just have one matcher instead of multiple for different pattern lengths.

I ran two types of tests to evaluate the generic vs custom method:
Test1 - A match does exist
o Custom method is 3x faster because the code will go to the "else" part
o A nested loop is always slower than unrolled code (the loop is also correlated with the outer one); please refer to this article (https://en.wikipedia.org/wiki/Loop_unrolling) on the benefits of loop unrolling
o Older match function performed in 59sec

Test2- A match doesn't exist
o Custom method and generic one perform in 15sec; this is because both perform a comparison and proceed to the next iteration
o Older match function performed in 45sec

…cherry-pick

priteshm · 2017-11-17T05:06:53Z

@ppadma @paul-rogers I see that @sachouche addressed the comments in the JIRA - is this one ready to merge?

ppadma · 2017-11-17T05:20:24Z

@priteshm @sachouche This PR needs to be updated on top of changes made for DRILL-5899.

priteshm · 2018-01-09T23:44:32Z

@sachouche can you update this PR?

sachouche · 2018-01-10T17:06:44Z

Created another pull request #1072to merge my changes with the one done with Padma's.

JIRA 5879: Like operator performance improvements

c2b05b2

paul-rogers reviewed Oct 19, 2017

View reviewed changes

DRILL-5879: Like operator performance improvements

4d851cb

ppadma reviewed Oct 23, 2017

View reviewed changes

sachouche added 2 commits October 24, 2017 13:24

Merge branch 'master' of https://github.com/apache/drill into yodlee-…

5b0302f

…cherry-pick

DRILL-5879: Like operator performance improvements

2f527bd

sachouche closed this Jan 10, 2018

sachouche deleted the yodlee-cherry-pick branch January 10, 2018 17:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JIRA DRILL-5879: Like operator performance improvements #1001

JIRA DRILL-5879: Like operator performance improvements #1001

sachouche commented Oct 18, 2017

ppadma commented Oct 19, 2017

paul-rogers commented Oct 19, 2017 •

edited

Loading

sachouche commented Oct 19, 2017

sachouche commented Oct 19, 2017 •

edited

Loading

paul-rogers left a comment

paul-rogers Oct 19, 2017

sachouche Oct 19, 2017

paul-rogers Oct 19, 2017

paul-rogers Oct 19, 2017

paul-rogers Oct 19, 2017

paul-rogers Oct 19, 2017

paul-rogers Oct 19, 2017

paul-rogers Oct 19, 2017

sachouche commented Oct 23, 2017 •

edited

Loading

ppadma Oct 23, 2017

sachouche Oct 24, 2017

ppadma Oct 23, 2017

ppadma Oct 23, 2017

sachouche Oct 24, 2017

ppadma Oct 25, 2017

sachouche Oct 25, 2017

priteshm commented Nov 17, 2017

ppadma commented Nov 17, 2017

priteshm commented Jan 9, 2018

sachouche commented Jan 10, 2018

JIRA DRILL-5879: Like operator performance improvements #1001

JIRA DRILL-5879: Like operator performance improvements #1001

Conversation

sachouche commented Oct 18, 2017

ppadma commented Oct 19, 2017

paul-rogers commented Oct 19, 2017 • edited Loading

sachouche commented Oct 19, 2017

sachouche commented Oct 19, 2017 • edited Loading

paul-rogers left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sachouche commented Oct 23, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

priteshm commented Nov 17, 2017

ppadma commented Nov 17, 2017

priteshm commented Jan 9, 2018

sachouche commented Jan 10, 2018

paul-rogers commented Oct 19, 2017 •

edited

Loading

sachouche commented Oct 19, 2017 •

edited

Loading

sachouche commented Oct 23, 2017 •

edited

Loading