[SPARK-34164][SQL] Improve write side varchar check to visit only last few tailing spaces

### What changes were proposed in this pull request?

For varchar(N), the write-side check currently trims all trailing spaces before checking whether the remaining length exceeds N. Visiting every trailing space is unnecessary: it is enough to examine at most the characters beyond position N.
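
To illustrate the idea, here is a minimal sketch on a plain byte array rather than Spark's `UTF8String`. The class and method names (`BoundedTrimSketch`, `fullTrimLength`, `boundedTrimLength`) are hypothetical and exist only for this example. It contrasts an unbounded trim with one that stops after a fixed number of trailing spaces.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not part of the Spark code base.
final class BoundedTrimSketch {

    /** Length left after scanning past every trailing ASCII space. */
    static int fullTrimLength(byte[] bytes) {
        int end = bytes.length - 1;
        while (end >= 0 && bytes[end] == 0x20) end--;
        return end + 1;
    }

    /** Length left after scanning past at most maxSpaces trailing ASCII spaces. */
    static int boundedTrimLength(byte[] bytes, int maxSpaces) {
        int end = bytes.length - 1;
        int trimTo = bytes.length - maxSpaces; // never look below this index
        while (end >= trimTo && bytes[end] == 0x20) end--;
        return end + 1;
    }

    public static void main(String[] args) {
        // 2 letters followed by 6 spaces: 8 bytes in total.
        byte[] s = "ab      ".getBytes(StandardCharsets.US_ASCII);
        int limit = 5; // varchar(5)

        System.out.println(fullTrimLength(s));                      // prints 2
        System.out.println(boundedTrimLength(s, s.length - limit)); // prints 5
    }
}
```

For an 8-byte value checked against varchar(5), the bounded version inspects only the last 3 bytes, which is all the length check needs. The saving grows with the number of trailing spaces in the input.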

### Why are the changes needed?

Improve varchar performance on the write side.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Benchmarks and existing unit tests.

Closes #31253 from yaooqinn/SPARK-34164.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
yaooqinn authored and cloud-fan committed Jan 21, 2021
1 parent faa4f0c commit d640631
Showing 4 changed files with 177 additions and 60 deletions.
@@ -683,6 +683,16 @@ public UTF8String trimRight() {
     return copyUTF8String(0, e);
   }

+  /**
+   * Trims at most `numSpaces` space characters (ASCII 32) from the end of this string.
+   */
+  public UTF8String trimTrailingSpaces(int numSpaces) {
+    int endIdx = numBytes - 1;
+    int trimTo = numBytes - numSpaces;
+    while (endIdx >= trimTo && getByte(endIdx) == 0x20) endIdx--;
+    return copyUTF8String(0, endIdx);
+  }
+
   /**
    * Trims instances of the given trim string from the end of this string.
    *
@@ -47,19 +47,23 @@ public static UTF8String charTypeReadSideCheck(UTF8String inputStr, int limit) {
   }

   public static UTF8String varcharTypeWriteSideCheck(UTF8String inputStr, int limit) {
-    if (inputStr != null && inputStr.numChars() <= limit) {
-      return inputStr;
-    } else if (inputStr != null) {
-      // Trailing spaces do not count in the length check. We need to retain the trailing spaces
-      // (truncate to length N), as there is no read-time padding for varchar type.
-      // TODO: create a special TrimRight function that can trim to a certain length.
-      UTF8String trimmed = inputStr.trimRight();
-      if (trimmed.numChars() > limit) {
-        throw new RuntimeException("Exceeds varchar type length limitation: " + limit);
-      }
-      return inputStr.substring(0, limit);
-    } else {
+    if (inputStr == null) {
       return null;
+    } else {
+      int numChars = inputStr.numChars();
+      if (numChars <= limit) {
+        return inputStr;
+      } else {
+        // Trailing spaces do not count in the length check. We need to retain the trailing spaces
+        // (truncate to length N), as there is no read-time padding for varchar type.
+        int maxAllowedNumTailSpaces = numChars - limit;
+        UTF8String trimmed = inputStr.trimTrailingSpaces(maxAllowedNumTailSpaces);
+        if (trimmed.numChars() > limit) {
+          throw new RuntimeException("Exceeds varchar type length limitation: " + limit);
+        } else {
+          return trimmed;
+        }
+      }
     }
   }
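
The control flow of the new check can be tried outside Spark with a rough sketch like the one below. It uses `java.lang.String` as a stand-in for `UTF8String`, which is an assumption made to keep the example self-contained. The class `VarcharWriteCheckSketch` and its helpers are hypothetical names, not Spark APIs. It shows the three outcomes: pass-through, truncation of surplus trailing spaces, and rejection.

```java
// Hypothetical illustration; String stands in for Spark's UTF8String.
final class VarcharWriteCheckSketch {

    /** Drops at most numSpaces trailing ' ' characters. */
    static String trimTrailingSpaces(String s, int numSpaces) {
        int end = s.length();
        int trimTo = s.length() - numSpaces;
        while (end > trimTo && s.charAt(end - 1) == ' ') end--;
        return s.substring(0, end);
    }

    /** Counts Unicode code points, analogous to numChars() in the diff. */
    static int numChars(String s) {
        return s.codePointCount(0, s.length());
    }

    static String varcharWriteSideCheck(String input, int limit) {
        if (input == null) return null;
        int n = numChars(input);
        if (n <= limit) return input;
        // Only the (n - limit) surplus characters may be removed, and only if they are spaces.
        String trimmed = trimTrailingSpaces(input, n - limit);
        if (numChars(trimmed) > limit) {
            throw new RuntimeException("Exceeds varchar type length limitation: " + limit);
        }
        return trimmed;
    }

    public static void main(String[] args) {
        System.out.println("[" + varcharWriteSideCheck("abc", 5) + "]");     // prints [abc]
        System.out.println("[" + varcharWriteSideCheck("abc    ", 5) + "]"); // prints [abc  ]
        try {
            varcharWriteSideCheck("abcdef ", 5);
        } catch (RuntimeException e) {
            System.out.println(e.getMessage()); // prints Exceeds varchar type length limitation: 5
        }
    }
}
```

When the surplus consists entirely of spaces, the trimmed value has exactly `limit` characters, so the retained trailing spaces match what the earlier `substring(0, limit)` call produced.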
