Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIRA DRILL-5879: Like operator performance improvements #1001

Closed
wants to merge 4 commits into from

Conversation

sachouche
Copy link
Contributor

  • Recently, custom code has been added to handle common search patterns (Like operator)
  • Contains, Starts With, and Ends With
  • Custom code is way faster than the generic RegEx based implementation
  • This pull request is another attempt to improve the Contains pattern since it is more CPU intensive

Query: select from <table> where colA like '%a%' or colA like '%xyz%';
Improvement Opportunities
Avoid isAscii computation (full access of the input string) since we're dealing with the same column twice
Optimize the "contains" for-loop
Implementation Details
1)
Added a new integer variable "asciiMode" to the VarCharHolder class
The default value is -1 which indicates this info is not known
Otherwise this value will be set to either 1 or 0 based on the string being in ASCII mode or Unicode
The execution plan already shares the same VarCharHolder instance for all evaluations of the same column value
The asciiMode will be correctly set during the first LIKE evaluation and will be reused across other LIKE evaluations
2)
The "Contains" LIKE operation is quite expensive as the code needs to access the input string to perform character based comparisons
Created 4 versions of the same for-loop to a) make the loop simpler to optimize (Vectorization) and b) minimize comparisons
Benchmarks
Lineitem table 100GB
Query: select l_returnflag, count from dfs.<source> where l_comment not like '%a%' or l_comment like '%the%' group by l_returnflag
Before changes: 33sec
After changes : 27sec

@ppadma
Copy link
Contributor

ppadma commented Oct 19, 2017

@sachouche Do you have a breakdown of how much gain we got with 1 vs 2. Since the changes for 2 are not straightforward and easy to maintain, I am thinking performance gain vs. maintainability of the code.

@paul-rogers
Copy link
Contributor

paul-rogers commented Oct 19, 2017

@sachouche, thanks for the first PR to Drill! Thanks for the detailed explanation!

Before reviewing the code, a comment on the design:

Added a new integer variable "asciiMode" ... this value will be set ... during the first LIKE evaluation and will be reused across other LIKE evaluations

The problem with this design is that there is no guarantee that the first value is representative of the other columns. Maybe my list looks like this:

Hello
你好

The first value is ASCII. The second is not. So, we must treat each value as independent of the others.

Or, maybe the design means that we do this per holder per row. In this case, do we know that the holder is reused for accesses to the same column in the same row, but is reinitialized for each new row? We'd have to look at the generated code (see the "Plain Java" feature) to see how this works.

On the other hand, we can exploit the nature of UTF-8. The encoding is such that no valid UTF-8 character is a prefix of any other valid character. Thus, if a character is 0xXX 0xYY 0xZZ, then there can never be a valid character which is 0xXX 0xYY. As a result, starts-with, ends-width, equals and contains can be done without either converting to UTF-16 or even caring if the data is ASCII or not.

What does this mean? It means that, for the simple operations:

  1. Convert the Java UTF-16 string to UTF-8.
  2. Do the classic byte comparison methods for starts with, ends with or contains. No special processing is needed for multi-byte

Unlike other multi-byte encodings, UTF-8 was designed to make this possible.

If we go this route, we would not need the ASCII mode flag.

Note: all of this applies only to the "basic four" operations: if we do a real regex, then we must decode the Varchar into a Java UTF-16 string.

@sachouche
Copy link
Contributor Author

  • It was 50% for 1) and 50% for 2)
  • Notice this breakdown depends on
    o The number of Contains pattern for the same value (impacts 1))
    o The pattern length (impacts both 1) and 2))

@sachouche
Copy link
Contributor Author

sachouche commented Oct 19, 2017

Paul,

  • I think you misunderstood the proposal
  • Let me use an example
  • select .. c1 like '%pattern1%' OR c1 like '%pattern2%'..
  • Assume c1 has 3 values [v1, v2, v3]
  • The generated code will create a new VarcharHolder instance for each value iteration
  • For v1: VarCharHolder vc1 is created, ascii-mode computed for pattern1, ascii-mode computation reused for pattern2, pattern3, etc since we're evaluating the same value
  • For v2: VarCharHolder vc2 is created, ascii-mode computed for pattern1, ascii-mode computation reused for pattern2, pattern3, etc since we're evaluating the same value
  • DITO for v3

Note that the test-suite has similar test cases that you are proposing; they all passed.

Copy link
Contributor

@paul-rogers paul-rogers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks much for the PR.
The two key comments are:

  • The conversion to characters is unnecessary: work directly with bytes, even for UTF-8.
  • Some of the pattern-length optimizations should not be checked per-row, but instead should be part of the pattern parser introduced in an earlier commit, since they test conditions that must be true for the pattern as a whole independent of the per-row values.

@@ -73,6 +73,9 @@ public void eval() {
out.start = in.start;
if (charCount <= length.value || length.value == 0 ) {
out.end = in.end;
if (charCount == (out.end-out.start)) {
out.asciiMode = 1; // we can conclude this string is ASCII
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The values -1, 0, and 1 are magic numbers here. Because we want speed, we don't want to use an enum. But, can we define symbols for our values: IS_ASCII, NOT_ASCII, ASCII_UNKNOWN? (You decide on the names...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do!

@@ -73,6 +73,9 @@ public void eval() {
out.start = in.start;
if (charCount <= length.value || length.value == 0 ) {
out.end = in.end;
if (charCount == (out.end-out.start)) {
out.asciiMode = 1; // we can conclude this string is ASCII
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drill likes spaces around operators: out.end - out.start

}
}

private final int match_1() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See note about UTF-8. If we don't care about the match position (that is, we don't need strpos(), and all we care is whether it matches or not, then we can do the work on the undecoded UTF-8 bytes, saving a large amount of complexity.


<#if minor.class == "VarChar">
// -1: unknown, 0: not ascii, 1: is ascii
public int asciiMode = -1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drill is complex, nothing is as it seems. Drill has a very elaborate mechanism to rewrite byte codes to do scalar replacement. That is, holders are added to the Java source during code generation, but then are removed during byte code transforms. (Yes, that is silly given that Java 8 does a fine job on its own; but it is how Drill works today...)

Will adding this variable change that behavior? Will the scalar replacement logic know to not replace this object? If so, will that hurt performance, or with the JVM go ahead and figure it out on its own? Is all the code that sets the variable inlined via code generation so that scalar replacement is possible?

Yes, indeed, nothing in Drill is as simple as it seems...

continue;
} else {
char ch2_1 = sequenceWrapper.charAt(idx+1);
char ch2_2 = patternString.charAt(1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want speed. Instead of getting the second character multiple times, is it better to get it once up front? I suppose that depends on the hit rate. Average hit rate may be 2 % (1/ ~64). So if our input is smaller than 64 characters, we'll have, on average one hit so we pay the second character cost once. At 128 or above, we'll pay the cost two or more times. But, maybe the JVM can optimize away the second and subsequent accesses?

Actually, let's take a step back. The pattern is fixed. We parsed the pattern to decide to use this particular class. Should we instead create a 1-char, 2-char and n-char matcher class so we get the second character (for the 2-char case) only once, and we eliminate the extra per-value if-check?


private final int match_N() {

if (patternLength == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't this be optimized away in the pattern parser stage? We should not be calling this function if we had a zero-length pattern.

public final int match() {
// The idea is to write loops with simple condition checks to allow the Java Hotspot vectorize
// the generate code.
if (patternLength == 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pattern length is fixed per expression; need not be checked per row. See more below.

@sachouche
Copy link
Contributor Author

sachouche commented Oct 23, 2017

Paul, again thanks for the detailed review; I was able to address most of the feedback except for one:
URF-8 Encoded Based Processing

  • I agree that expressions that can operate directly on the encoded UTF-8 string should ideally perform checks on bytes and not characters
  • Having said that, such a change is more involved and should be done properly
    o The SqlPatternContainsMatcher currently gets a CharSequence as input
    o We should enhance the expression framework so that matchers can a) express their capabilities and b) receive the expected data type (Character or Byte sequences)
    o The compiler might also need an enhancement; currently, the optimizer adds a varchar cast to Parquet columns of binary data types which are involved with the LIKE operator (the cast logic will iterate through the whole byte array)
    o Note also there is an impact on the test-suite since StringBuffer are being used to directly test the matcher functionality

Byte Code Manipulation

  • I looked at the Byte Code implementation to ensure the newly added scalar variable doesn't break the VarCharHolder optimization (stack allocation instead of heap)
  • The code uses reflection and as such is able to handle new member variables

@@ -73,6 +73,9 @@ public void eval() {
out.start = in.start;
if (charCount <= length.value || length.value == 0 ) {
out.end = in.end;
if (charCount == (out.end - out.start)) {
out.asciiMode = org.apache.drill.exec.expr.holders.VarCharHolder.CHAR_MODE_IS_ASCII; // we can conclude this string is ASCII
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please add comments here ? I am not able to understand this change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • As previously stated (when responding to Paul'd comment), the expression framework is able to use the same VarCharHolder input variable when it is shared amongst multiple expressions
  • If the original column was of type var-binary, then the expression framework will include a cast to var-char
  • The cast logic will also compute the string length
  • Using this information to deduce whether the string is pure ASCII or not
  • UTF-8 encoding uses 1 byte for ASCII and 2, 3, or 4 for other character sets
  • If the encoded length and character length are equal, then this means this is an ASCII string


public CharSequenceWrapper() {
}

public CharSequenceWrapper(int start, int end, DrillBuf buffer) {
setBuffer(start, end, buffer);
setBuffer(start, end, buffer, -1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does -1 mean ? Shouldn't it be one of CHAR_MODE_IS_ASCII, CHAR_MODE_UNKNOWN or CHAR_MODE_NOT_ASCII ?

// The idea is to write loops with simple condition checks to allow the Java Hotspot achieve
// better optimizations (especially vectorization)
if (patternLength == 1) {
matcherFcn = new Matcher1();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if it is a good idea to write special case code for each pattern length. It is not easy to maintain. Can you please give more details how this is improving performance ? Are we getting better performance for patternLength 1 or 2 or 3 or N or all ? If so, how much and why ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Padma, I have two reasons to follow the added complexity

  1. The new code is encapsulated within the Contains matching logic; doesn't increase code complexity

o I created a test with the original match logic, pattern and input were Strings though passed as CharSequence
o Ran the test with the new and old method (1 billion iterations) on MacOS
o pattern length
o The old match method performed in 43sec where as the new one performed in 15sec
o The reason for the speedup is the custom matcher functions have less instructions (load and comparison)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how does matcherN perform compared to matcher1, matcher2, matcher3 for pattern lengths 1, 2 and 3 ? If matcherN performs well for patternLengths 1, 2 and 3, we can just have one matcher instead of multiple for different pattern lengths.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran two types of tests to evaluate the generic vs custom method:
Test1 - A match does exist
o Custom method is 3x faster because the code will go to the "else" part
o A nested loop is always slower than unrolled code (the loop is also correlated with the outer one); please refer to this article (https://en.wikipedia.org/wiki/Loop_unrolling) on the benefits of loop unrolling
o Older match function performed in 59sec

Test2- A match doesn't exist
o Custom method and generic one perform in 15sec; this is because both perform a comparison and proceed to the next iteration
o Older match function performed in 45sec

@priteshm
Copy link

@ppadma @paul-rogers I see that @sachouche addressed the comments in the JIRA - is this one ready to merge?

@ppadma
Copy link
Contributor

ppadma commented Nov 17, 2017

@priteshm @sachouche This PR needs to be updated on top of changes made for DRILL-5899.

@priteshm
Copy link

priteshm commented Jan 9, 2018

@sachouche can you update this PR?

@sachouche
Copy link
Contributor Author

Created another pull request #1072to merge my changes with the one done with Padma's.

@sachouche sachouche closed this Jan 10, 2018
@sachouche sachouche deleted the yodlee-cherry-pick branch January 10, 2018 17:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants