[SPARK-34130][SQL] Improve performance for char varchar padding and length check with StaticInvoke

### What changes were proposed in this pull request?

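This PR moves the char/varchar padding and length checks into static Java methods that are called through `StaticInvoke`. This can reduce the size of `generated.java` and so prevent the codegen fallback that causes a performance regression.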

Here is a TPC-DS case that this improvement could fix:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133964/testReport/org.apache.spark.sql.execution/LogicalPlanTagInSparkPlanSuite/q41/

The original case generates 20K bytes of code; we are trying to reduce it to less than 8K.

### Why are the changes needed?

Performance improvement: in the PR benchmark test, performance with codegen is 2~3x better than without codegen.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Yes. It's a code refactor, so the existing unit tests should be enough.

Cross-checked with #31012, where the TPC-DS tests should all pass.

Benchmark compared with master:

```logtalk
================================================================================================
Char Varchar Read Side Perf
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Read with length 20, hasSpaces: false:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 20                         1571           1667          83         63.6          15.7       1.0X
read char with length 20                           1710           1764          58         58.5          17.1       0.9X
read varchar with length 20                        1774           1792          16         56.4          17.7       0.9X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Read with length 40, hasSpaces: false:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 40                         1824           1927          91         54.8          18.2       1.0X
read char with length 40                           1788           1928         137         55.9          17.9       1.0X
read varchar with length 40                        1676           1700          40         59.7          16.8       1.1X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Read with length 60, hasSpaces: false:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 60                         1727           1762          30         57.9          17.3       1.0X
read char with length 60                           1628           1674          43         61.4          16.3       1.1X
read varchar with length 60                        1651           1665          13         60.6          16.5       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Read with length 80, hasSpaces: true:     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 80                         1748           1778          28         57.2          17.5       1.0X
read char with length 80                           1673           1678           9         59.8          16.7       1.0X
read varchar with length 80                        1667           1684          27         60.0          16.7       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Read with length 100, hasSpaces: true:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 100                        1709           1743          48         58.5          17.1       1.0X
read char with length 100                          1610           1664          67         62.1          16.1       1.1X
read varchar with length 100                       1614           1673          53         61.9          16.1       1.1X

================================================================================================
Char Varchar Write Side Perf
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Write with length 20, hasSpaces: false:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 20                        2277           2327          67          4.4         227.7       1.0X
write char with length 20                          2421           2443          19          4.1         242.1       0.9X
write varchar with length 20                       2393           2419          27          4.2         239.3       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Write with length 40, hasSpaces: false:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 40                        2249           2290          38          4.4         224.9       1.0X
write char with length 40                          2386           2444          57          4.2         238.6       0.9X
write varchar with length 40                       2397           2405          12          4.2         239.7       0.9X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Write with length 60, hasSpaces: false:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 60                        2326           2367          41          4.3         232.6       1.0X
write char with length 60                          2478           2501          37          4.0         247.8       0.9X
write varchar with length 60                       2475           2503          24          4.0         247.5       0.9X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Write with length 80, hasSpaces: true:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 80                        9367           9773         354          1.1         936.7       1.0X
write char with length 80                         10454          10621         238          1.0        1045.4       0.9X
write varchar with length 80                      18943          19503         571          0.5        1894.3       0.5X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Write with length 100, hasSpaces: true:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 100                      11055          11104          59          0.9        1105.5       1.0X
write char with length 100                        12204          12275          63          0.8        1220.4       0.9X
write varchar with length 100                     21737          22275         574          0.5        2173.7       0.5X

```
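
The derived columns in the tables above can be cross-checked with a little arithmetic. The sketch below is not part of the PR. It assumes that Rate and Per Row are derived from the best time, that Relative is the first row's best time divided by the current row's best time, and that the read benchmark processes roughly 100 million rows (inferred from 1571 ms at 15.7 ns per row). The class and method names are hypothetical.

```java
public class BenchmarkColumns {
    // Millions of rows per second, from a row count and the best time in milliseconds.
    public static double rateMillionsPerSec(long rows, double bestMs) {
        return rows / (bestMs / 1000.0) / 1e6;
    }

    // Nanoseconds spent per row, from a row count and the best time in milliseconds.
    public static double perRowNs(long rows, double bestMs) {
        return bestMs * 1e6 / rows;
    }

    // Speed relative to the baseline row (values below 1.0 mean slower than baseline).
    public static double relative(double baselineBestMs, double bestMs) {
        return baselineBestMs / bestMs;
    }

    public static void main(String[] args) {
        long readRows = 100_000_000L; // assumption: inferred from the table, not stated in the PR

        // "read string with length 20": best time 1571 ms.
        System.out.printf("rate    = %.2f M/s%n", rateMillionsPerSec(readRows, 1571));
        System.out.printf("per row = %.2f ns%n", perRowNs(readRows, 1571));

        // "read char with length 20" (1710 ms) against the string baseline (1571 ms).
        System.out.printf("relative (char 20)        = %.2f%n", relative(1571, 1710));

        // "write varchar with length 80" (18943 ms) against its string baseline (9367 ms).
        System.out.printf("relative (write varchar 80) = %.2f%n", relative(9367, 18943));
    }
}
```

The printed values only approximate the table, because the table reports best times rounded to whole milliseconds. The last ratio shows where the 0.5X for varchar writes with trailing spaces comes from.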

Closes #31199 from yaooqinn/SPARK-34130.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
yaooqinn authored and cloud-fan committed Jan 19, 2021
1 parent 0130a38 commit 6fa2fb9
Showing 4 changed files with 318 additions and 39 deletions.
@@ -0,0 +1,72 @@
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.catalyst.util;

import org.apache.spark.unsafe.types.UTF8String;

public class CharVarcharCodegenUtils {
  private static final UTF8String SPACE = UTF8String.fromString(" ");

  /**
   * Trailing spaces do not count in the length check. We don't need to retain the trailing
   * spaces, as we will pad char type columns/fields at read time.
   */
  public static UTF8String charTypeWriteSideCheck(UTF8String inputStr, int limit) {
    if (inputStr == null) {
      return null;
    } else {
      UTF8String trimmed = inputStr.trimRight();
      if (trimmed.numChars() > limit) {
        throw new RuntimeException("Exceeds char type length limitation: " + limit);
      }
      return trimmed;
    }
  }

  public static UTF8String charTypeReadSideCheck(UTF8String inputStr, int limit) {
    if (inputStr == null) return null;
    if (inputStr.numChars() > limit) {
      throw new RuntimeException("Exceeds char type length limitation: " + limit);
    }
    return inputStr.rpad(limit, SPACE);
  }

  public static UTF8String varcharTypeWriteSideCheck(UTF8String inputStr, int limit) {
    if (inputStr != null && inputStr.numChars() <= limit) {
      return inputStr;
    } else if (inputStr != null) {
      // Trailing spaces do not count in the length check. We need to retain the trailing spaces
      // (truncate to length N), as there is no read-time padding for varchar type.
      // TODO: create a special TrimRight function that can trim to a certain length.
      UTF8String trimmed = inputStr.trimRight();
      if (trimmed.numChars() > limit) {
        throw new RuntimeException("Exceeds varchar type length limitation: " + limit);
      }
      return inputStr.substring(0, limit);
    } else {
      return null;
    }
  }

  public static UTF8String varcharTypeReadSideCheck(UTF8String inputStr, int limit) {
    if (inputStr != null && inputStr.numChars() > limit) {
      throw new RuntimeException("Exceeds varchar type length limitation: " + limit);
    }
    return inputStr;
  }
}
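
To make the semantics of the four helpers easier to follow without a Spark dependency, here is a minimal sketch that mirrors them on plain `java.lang.String`. It is an illustration under assumptions, not the PR's code: the class name `CharVarcharSketch` and its method names are hypothetical, `String.length()` counts UTF-16 code units rather than the characters counted by `UTF8String.numChars()`, and `trimRight` is assumed to strip trailing space characters only.

```java
public class CharVarcharSketch {
    // Assumption: stands in for UTF8String.trimRight(); strips trailing ' ' only.
    static String trimRight(String s) {
        int end = s.length();
        while (end > 0 && s.charAt(end - 1) == ' ') {
            end--;
        }
        return s.substring(0, end);
    }

    // Write side, char(N): trailing spaces are dropped and not counted.
    public static String charWrite(String s, int limit) {
        if (s == null) return null;
        String trimmed = trimRight(s);
        if (trimmed.length() > limit) {
            throw new RuntimeException("Exceeds char type length limitation: " + limit);
        }
        return trimmed;
    }

    // Read side, char(N): check the length, then right-pad with spaces to N.
    public static String charRead(String s, int limit) {
        if (s == null) return null;
        if (s.length() > limit) {
            throw new RuntimeException("Exceeds char type length limitation: " + limit);
        }
        StringBuilder padded = new StringBuilder(s);
        while (padded.length() < limit) {
            padded.append(' ');
        }
        return padded.toString();
    }

    // Write side, varchar(N): excess trailing spaces are truncated to length N.
    public static String varcharWrite(String s, int limit) {
        if (s == null) return null;
        if (s.length() <= limit) return s;
        if (trimRight(s).length() > limit) {
            throw new RuntimeException("Exceeds varchar type length limitation: " + limit);
        }
        return s.substring(0, limit);
    }

    // Read side, varchar(N): length check only, no padding.
    public static String varcharRead(String s, int limit) {
        if (s != null && s.length() > limit) {
            throw new RuntimeException("Exceeds varchar type length limitation: " + limit);
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println("[" + charWrite("ab   ", 3) + "]");    // trailing spaces removed
        System.out.println("[" + charRead("ab", 5) + "]");        // padded to 5 characters
        System.out.println("[" + varcharWrite("abc  ", 4) + "]"); // truncated to 4 characters
    }
}
```

Note that every helper handles `null` itself, which matches the `propagateNull = false` argument passed to `StaticInvoke` in the Scala change.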
@@ -22,10 +22,10 @@ import scala.collection.mutable
 import org.apache.spark.internal.Logging
 import org.apache.spark.sql.AnalysisException
 import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.objects.StaticInvoke
 import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types._
-import org.apache.spark.unsafe.types.UTF8String

 object CharVarcharUtils extends Logging {

@@ -161,9 +161,20 @@ object CharVarcharUtils extends Logging {

   private def paddingWithLengthCheck(expr: Expression, dt: DataType): Expression = dt match {
     case CharType(length) =>
-      StringRPad(stringLengthCheck(expr, dt, needTrim = false), Literal(length))
-
-    case VarcharType(_) => stringLengthCheck(expr, dt, needTrim = false)
+      StaticInvoke(
+        classOf[CharVarcharCodegenUtils],
+        StringType,
+        "charTypeReadSideCheck",
+        expr :: Literal(length) :: Nil,
+        propagateNull = false)
+
+    case VarcharType(length) =>
+      StaticInvoke(
+        classOf[CharVarcharCodegenUtils],
+        StringType,
+        "varcharTypeReadSideCheck",
+        expr :: Literal(length) :: Nil,
+        propagateNull = false)

     case StructType(fields) =>
       val struct = CreateNamedStruct(fields.zipWithIndex.flatMap { case (f, i) =>
@@ -200,69 +211,54 @@ object CharVarcharUtils extends Logging {
    */
   def stringLengthCheck(expr: Expression, targetAttr: Attribute): Expression = {
     getRawType(targetAttr.metadata).map { rawType =>
-      stringLengthCheck(expr, rawType, needTrim = true)
+      stringLengthCheck(expr, rawType)
     }.getOrElse(expr)
   }

-  private def raiseError(typeName: String, length: Int): Expression = {
-    val errMsg = UTF8String.fromString(s"Exceeds $typeName type length limitation: $length")
-    RaiseError(Literal(errMsg, StringType), StringType)
-  }
-
-  private def stringLengthCheck(expr: Expression, dt: DataType, needTrim: Boolean): Expression = {
+  private def stringLengthCheck(expr: Expression, dt: DataType): Expression = {
     dt match {
       case CharType(length) =>
-        val trimmed = if (needTrim) StringTrimRight(expr) else expr
-        // Trailing spaces do not count in the length check. We don't need to retain the trailing
-        // spaces, as we will pad char type columns/fields at read time.
-        If(
-          GreaterThan(Length(trimmed), Literal(length)),
-          raiseError("char", length),
-          trimmed)
+        StaticInvoke(
+          classOf[CharVarcharCodegenUtils],
+          StringType,
+          "charTypeWriteSideCheck",
+          expr :: Literal(length) :: Nil,
+          propagateNull = false)

       case VarcharType(length) =>
-        if (needTrim) {
-          val trimmed = StringTrimRight(expr)
-          // Trailing spaces do not count in the length check. We need to retain the trailing spaces
-          // (truncate to length N), as there is no read-time padding for varchar type.
-          // TODO: create a special TrimRight function that can trim to a certain length.
-          If(
-            LessThanOrEqual(Length(expr), Literal(length)),
-            expr,
-            If(
-              GreaterThan(Length(trimmed), Literal(length)),
-              raiseError("varchar", length),
-              StringRPad(trimmed, Literal(length))))
-        } else {
-          If(GreaterThan(Length(expr), Literal(length)), raiseError("varchar", length), expr)
-        }
+        StaticInvoke(
+          classOf[CharVarcharCodegenUtils],
+          StringType,
+          "varcharTypeWriteSideCheck",
+          expr :: Literal(length) :: Nil,
+          propagateNull = false)

       case StructType(fields) =>
         val struct = CreateNamedStruct(fields.zipWithIndex.flatMap { case (f, i) =>
           Seq(Literal(f.name),
-            stringLengthCheck(GetStructField(expr, i, Some(f.name)), f.dataType, needTrim))
+            stringLengthCheck(GetStructField(expr, i, Some(f.name)), f.dataType))
         })
         if (expr.nullable) {
           If(IsNull(expr), Literal(null, struct.dataType), struct)
         } else {
           struct
         }

-      case ArrayType(et, containsNull) => stringLengthCheckInArray(expr, et, containsNull, needTrim)
+      case ArrayType(et, containsNull) => stringLengthCheckInArray(expr, et, containsNull)

       case MapType(kt, vt, valueContainsNull) =>
-        val newKeys = stringLengthCheckInArray(MapKeys(expr), kt, containsNull = false, needTrim)
-        val newValues = stringLengthCheckInArray(MapValues(expr), vt, valueContainsNull, needTrim)
+        val newKeys = stringLengthCheckInArray(MapKeys(expr), kt, containsNull = false)
+        val newValues = stringLengthCheckInArray(MapValues(expr), vt, valueContainsNull)
         MapFromArrays(newKeys, newValues)

       case _ => expr
     }
   }

   private def stringLengthCheckInArray(
-      arr: Expression, et: DataType, containsNull: Boolean, needTrim: Boolean): Expression = {
+      arr: Expression, et: DataType, containsNull: Boolean): Expression = {
     val param = NamedLambdaVariable("x", replaceCharVarcharWithString(et), containsNull)
-    val func = LambdaFunction(stringLengthCheck(param, et, needTrim), Seq(param))
+    val func = LambdaFunction(stringLengthCheck(param, et), Seq(param))
     ArrayTransform(arr, func)
   }
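
One detail in this hunk is easy to miss: on the varchar write path, the old expression tree produced `StringRPad(trimmed, Literal(length))`, while the new Java helper returns `inputStr.substring(0, limit)`. The sketch below, on plain `java.lang.String`, illustrates why the two agree on that branch, which is only reached when the input is longer than the limit and its right-trimmed form is not. Under that precondition every character between the trimmed length and the limit is a space. The class and method names are hypothetical, and `trimRight` is assumed to strip trailing spaces only.

```java
public class VarcharTruncationEquivalence {
    // Assumption: stands in for StringTrimRight / UTF8String.trimRight().
    static String trimRight(String s) {
        int end = s.length();
        while (end > 0 && s.charAt(end - 1) == ' ') {
            end--;
        }
        return s.substring(0, end);
    }

    // Old approach: trim the input, then right-pad the trimmed value back up to the limit.
    public static String padTrimmed(String input, int limit) {
        StringBuilder sb = new StringBuilder(trimRight(input));
        while (sb.length() < limit) {
            sb.append(' ');
        }
        return sb.toString();
    }

    // New approach: cut the untrimmed input at the limit.
    public static String truncate(String input, int limit) {
        return input.substring(0, limit);
    }

    public static void main(String[] args) {
        // Each input is longer than the limit, but its trimmed form fits.
        String[] inputs = {"abc   ", "ab      ", "      "};
        int limit = 4;
        for (String in : inputs) {
            boolean same = padTrimmed(in, limit).equals(truncate(in, limit));
            System.out.println("[" + in + "] same result: " + same);
        }
    }
}
```

The substring form avoids building a padded copy, though the helper still calls `trimRight()` for the length check, as the TODO in the Java file notes.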

90 changes: 90 additions & 0 deletions sql/core/benchmarks/CharVarcharBenchmark-results.txt
@@ -0,0 +1,90 @@
================================================================================================
Char Varchar Read Side Perf
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Read with length 20, hasSpaces: false:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 20                         1504           1508           4         66.5          15.0       1.0X
read char with length 20                           1680           1684           3         59.5          16.8       0.9X
read varchar with length 20                        1659           1682          26         60.3          16.6       0.9X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Read with length 40, hasSpaces: false:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 40                         1662           1678          15         60.2          16.6       1.0X
read char with length 40                           1721           1731           9         58.1          17.2       1.0X
read varchar with length 40                        1694           1706          12         59.0          16.9       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Read with length 60, hasSpaces: false:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 60                         1623           1643          23         61.6          16.2       1.0X
read char with length 60                           1644           1685          66         60.8          16.4       1.0X
read varchar with length 60                        1660           1680          18         60.2          16.6       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Read with length 80, hasSpaces: true:     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 80                         1629           1678          57         61.4          16.3       1.0X
read char with length 80                           1630           1667          65         61.3          16.3       1.0X
read varchar with length 80                        1664           1684          34         60.1          16.6       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Read with length 100, hasSpaces: true:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 100                        1594           1612          17         62.7          15.9       1.0X
read char with length 100                          1631           1642          11         61.3          16.3       1.0X
read varchar with length 100                       1635           1644          13         61.1          16.4       1.0X


================================================================================================
Char Varchar Write Side Perf
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Write with length 20, hasSpaces: false:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 20                        2760           2784          21          3.6         276.0       1.0X
write char with length 20                          2898           2917          22          3.5         289.8       1.0X
write varchar with length 20                       2876           2892          14          3.5         287.6       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Write with length 40, hasSpaces: false:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 40                        2726           2734           9          3.7         272.6       1.0X
write char with length 40                          2885           2898          16          3.5         288.5       0.9X
write varchar with length 40                       2844           2860          15          3.5         284.4       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Write with length 60, hasSpaces: false:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 60                        2724           2739          21          3.7         272.4       1.0X
write char with length 60                          2868           2912          44          3.5         286.8       0.9X
write varchar with length 60                       2870           2896          23          3.5         287.0       0.9X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Write with length 80, hasSpaces: true:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 80                        9094           9154          71          1.1         909.4       1.0X
write char with length 80                          9471           9489          19          1.1         947.1       1.0X
write varchar with length 80                      15099          15130          28          0.7        1509.9       0.6X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Write with length 100, hasSpaces: true:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 100                      10152          10253          94          1.0        1015.2       1.0X
write char with length 100                        10831          10834           3          0.9        1083.1       0.9X
write varchar with length 100                     19486          19560          73          0.5        1948.6       0.5X

