[SPARK-28083][SQL] Support LIKE ... ESCAPE syntax #25001
Conversation
Test build #106996 has finished for PR 25001 at commit
Test build #106999 has finished for PR 25001 at commit
Test build #107001 has finished for PR 25001 at commit
Test build #107012 has finished for PR 25001 at commit
Test build #107013 has finished for PR 25001 at commit
Also, you need to update
@maropu Thanks for your reminder. I have added the keyword.
docs/sql-keywords.md (Outdated)
@@ -103,6 +103,7 @@ Below is a list of all the keywords in Spark SQL.
<tr><td>DROP</td><td>non-reserved</td><td>non-reserved</td><td>reserved</td></tr>
<tr><td>ELSE</td><td>reserved</td><td>non-reserved</td><td>reserved</td></tr>
<tr><td>END</td><td>reserved</td><td>non-reserved</td><td>reserved</td></tr>
<tr><td>ESCAPE</td><td>non-reserved</td><td>non-reserved</td><td>non-reserved</td></tr>
ESCAPE is reserved in the standard: https://developer.mimer.com/wp-content/uploads/standard-sql-reserved-words-summary.pdf
Thanks for your reminder, I will change it.
@@ -65,6 +65,59 @@ abstract class StringRegexExpression extends BinaryExpression
override def sql: String = s"${left.sql} ${prettyName.toUpperCase(Locale.ROOT)} ${right.sql}"
}

abstract class StringRegexV2Expression extends TernaryExpression
Do we need this abstract class? I think we could make the fix simpler by just tweaking `StringUtils.escapeLikeRegex`. Anyway, the simpler, the better.
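For context, the tweak being suggested — threading an escape character through the LIKE-pattern-to-regex translation — can be sketched as follows. This is an illustrative Python approximation, not Spark's implementation (the real code is Scala's `StringUtils.escapeLikeRegex`):

```python
import re

def escape_like_regex(pattern: str, escape_char: str = "\\") -> str:
    """Translate a SQL LIKE pattern into a regex string.

    escape_char followed by any character matches that character
    literally; a bare '_' matches any single character and a bare
    '%' matches any sequence of characters.
    """
    out = []
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c == escape_char and i + 1 < len(pattern):
            # escaped character loses its wildcard meaning
            out.append(re.escape(pattern[i + 1]))
            i += 2
        elif c == "_":
            out.append(".")
            i += 1
        elif c == "%":
            out.append(".*")
            i += 1
        else:
            out.append(re.escape(c))
            i += 1
    return "(?s)" + "".join(out)

def like(s: str, pattern: str, escape_char: str = "\\") -> bool:
    return re.fullmatch(escape_like_regex(pattern, escape_char), s) is not None
```

With the default escape, `'%Spark/_%'` treats `/` as a literal; with `ESCAPE '/'` the `/_` pair collapses to a literal underscore, matching the truth table in the PR description.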
Because `StringRegexExpression` extends `BinaryExpression`, which only allows two input parameters. So I made `StringRegexV2Expression` extend `TernaryExpression`.
How about just making `StringRegexExpression` ternary?
Because `RLIKE` extends `StringRegexExpression` and only needs two input parameters.
Have you tried this?
case class Like(
inputExpr: Expression,
patternExpr: Expression,
escapeExpr: Option[String] = None) extends StringRegexExpression
like this? master...maropu:SPARK-28083
/* 066 */ String filter_rightStr_0 = scan_value_1.toString();
/* 067 */ java.util.regex.Pattern filter_pattern_0 = java.util.regex.Pattern.compile(org.apache.spark.sql.catalyst.util.StringUtils.escapeLikeRegex(filter_rightStr_0, "\"));
OK, let me have a try!
@maropu I made this try, but it did not pass the tests in `RegexpExpressionsSuite`.
The failure info:
- LIKE Pattern *** FAILED ***
Code generation of null LIKE input[0, string, true] ESCAPE \ failed:
java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 50, Column 0: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 50, Column 0: Line break in literal not allowed
The generated code:
/* 033 */ public java.lang.Object apply(java.lang.Object _i) {
/* 034 */ InternalRow i = (InternalRow) _i;
/* 035 */
/* 036 */
/* 037 */ boolean isNull_0 = true;
/* 038 */ boolean value_0 = false;
/* 039 */
/* 040 */ if (!true) {
/* 041 */ boolean isNull_2 = i.isNullAt(0);
/* 042 */ UTF8String value_2 = isNull_2 ?
/* 043 */ null : (i.getUTF8String(0));
/* 044 */ if (!isNull_2) {
/* 045 */
/* 046 */ isNull_0 = false; // resultCode could change nullability.
/* 047 */
/* 048 */ String rightStr_0 = value_2.toString();
/* 049 */ java.util.regex.Pattern pattern_0 = java.util.regex.Pattern.compile(org.apache.spark.sql.catalyst.util.StringUtils.escapeLikeRegex(rightStr_0, "\"));
/* 050 */ value_0 = pattern_0.matcher(((UTF8String)null).toString()).matches();
/* 051 */
/* 052 */
/* 053 */ }
/* 054 */
/* 055 */ }
/* 056 */ isNull_3 = isNull_0;
/* 057 */ value_3 = value_0;
/* 058 */
/* 059 */ // copy all the results into MutableRow
/* 060 */
/* 061 */ if (!isNull_3) {
/* 062 */ mutableRow.setBoolean(0, value_3);
/* 063 */ } else {
/* 064 */ mutableRow.setNullAt(0);
/* 065 */ }
/* 066 */
/* 067 */ return mutableRow;
/* 068 */ }
/* 069 */
/* 070 */
val pattern = ctx.freshName("pattern")
val rightStr = ctx.freshName("rightStr")
val escapeChar = escapeCharOpt.getOrElse("\\\\")
nullSafeCodeGen(ctx, ev, (eval1, eval2) => {
s"""
String $rightStr = $eval2.toString();
$patternClass $pattern = $patternClass.compile($escapeFunc($rightStr, "$escapeChar"));
${ev.value} = $pattern.matcher($eval1.toString()).matches();
"""
})
I changed `val escapeChar = escapeCharOpt.getOrElse("\\\\")` to `val escapeChar = escapeCharOpt.getOrElse("\\\\\\\\")`. The latter is OK.
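The reason the extra backslashes are needed: the escape character is interpolated as raw text into the generated Java source, so it passes through two compilers (Scala first, then janino compiling the generated Java), and each string-literal layer consumes one level of backslash escaping. A small Python illustration of the counting, where `unescape_once` stands in for what each compiler does to the body of a string literal:

```python
def unescape_once(literal_body: str) -> str:
    """Resolve one layer of backslash escaping, as a compiler
    does when reading the body of a string literal."""
    out, i = [], 0
    while i < len(literal_body):
        if literal_body[i] == "\\" and i + 1 < len(literal_body):
            out.append(literal_body[i + 1])  # '\x' collapses to 'x'
            i += 2
        else:
            out.append(literal_body[i])
            i += 1
    return "".join(out)

# Eight backslashes written in Scala source become four in the Scala
# runtime value, and two after the generated Java source is compiled:
# a well-formed escaped backslash for the layer underneath.
eight = "\\" * 8
four = unescape_once(eight)
two = unescape_once(four)

# With only four in the Scala source, the chain is 4 -> 2 -> 1, and a
# lone backslash right before the closing quote eats the quote, which
# is why janino reported "Line break in literal not allowed".
```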
Test build #107105 has finished for PR 25001 at commit
@beliefer Please re-generate golden files.
Please use the following command to rebuild the golden answer files, @beliefer .
Test build #107172 has finished for PR 25001 at commit
@wangyum @dongjoon-hyun Thanks for all your help and review.
Test build #107226 has finished for PR 25001 at commit
@@ -484,7 +484,7 @@ object LikeSimplification extends Rule[LogicalPlan] {
private val equalTo = "([^_%]*)".r

def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
-  case Like(input, Literal(pattern, StringType)) =>
+  case Like(input, Literal(pattern, StringType), opt) =>
`opt` => `escapeChar`
OK
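For context, `LikeSimplification` rewrites cheap pattern shapes into plain string operations instead of compiling a regex. A Python sketch of that idea, using regexes shaped like the rule's `equalTo = "([^_%]*)".r` (illustrative only — the real rule also has to account for the escape character, which the follow-up SPARK-30254 later synced):

```python
import re

# Pattern shapes with no '_' and only leading/trailing '%' can skip
# regex compilation entirely.
STARTS_WITH = re.compile(r"([^_%]*)%")
ENDS_WITH = re.compile(r"%([^_%]*)")
CONTAINS = re.compile(r"%([^_%]*)%")
EQUAL_TO = re.compile(r"([^_%]*)")

def simplify_like(s: str, pattern: str):
    """Evaluate a LIKE predicate via a cheap string operation when the
    pattern is simple; return None when the pattern is too complex and
    the full regex path must be used instead."""
    if m := EQUAL_TO.fullmatch(pattern):
        return s == m.group(1)
    if m := CONTAINS.fullmatch(pattern):
        return m.group(1) in s
    if m := STARTS_WITH.fullmatch(pattern):
        return s.startswith(m.group(1))
    if m := ENDS_WITH.fullmatch(pattern):
        return s.endswith(m.group(1))
    return None  # e.g. patterns containing '_' fall through
```

The third pattern-match component added by this PR is what forces the rule's `case Like(...)` clause to grow an extra field, as the diff above shows.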
@beliefer Thanks for changing the parameter data type. The code looks simpler now :)
Test build #114892 has finished for PR 25001 at commit
Test build #114894 has finished for PR 25001 at commit
retest this please
Test build #114898 has finished for PR 25001 at commit
override def matches(regex: Pattern, str: String): Boolean = regex.matcher(str).matches()

-override def toString: String = s"$left LIKE $right"
+override def toString: String = s"$left LIKE $right ESCAPE '$escapeChar'"
nit: we can skip printing `ESCAPE '$escapeChar'` if `escapeChar` is `\`.
OK
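The suggestion amounts to special-casing the default escape character when rendering the expression. A minimal Python sketch of that rendering logic (function and constant names are hypothetical, for illustration only):

```python
DEFAULT_ESCAPE = "\\"

def like_to_string(left: str, right: str, escape_char: str) -> str:
    """Render a Like expression, omitting the ESCAPE clause when the
    escape character is the default, so plan strings stay unchanged
    for the common case."""
    base = f"{left} LIKE {right}"
    if escape_char == DEFAULT_ESCAPE:
        return base
    return f"{base} ESCAPE '{escape_char}'"
```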
LGTM except one comment
Since Spark 2.0, string literals are unescaped in our SQL parser. For example, in order to match "\abc", the pattern should be "\\abc".

When SQL config 'spark.sql.parser.escapedStringLiterals' is enabled, it falls back to Spark 1.6 behavior regarding string literal parsing. For example, if the config is enabled, the pattern to match "\abc" should be "\abc".

* escape - a string added since Spark 3.0. The default escape character is the '\'.
nit: `string` or `character`?
OK.
Test build #114925 has finished for PR 25001 at commit
Thanks, merging to master
@maropu @cloud-fan @gengliangwang @gatorsmile @HyukjinKwon @dongjoon-hyun @Ngone51
## What changes were proposed in this pull request?

The syntax 'LIKE predicate: ESCAPE clause' is ANSI SQL. For example:

```
select 'abcSpark_13sd' LIKE '%Spark\\_%'; //true
select 'abcSpark_13sd' LIKE '%Spark/_%'; //false
select 'abcSpark_13sd' LIKE '%Spark"_%'; //false
select 'abcSpark_13sd' LIKE '%Spark/_%' ESCAPE '/'; //true
select 'abcSpark_13sd' LIKE '%Spark"_%' ESCAPE '"'; //true
select 'abcSpark%13sd' LIKE '%Spark\\%%'; //true
select 'abcSpark%13sd' LIKE '%Spark/%%'; //false
select 'abcSpark%13sd' LIKE '%Spark"%%'; //false
select 'abcSpark%13sd' LIKE '%Spark/%%' ESCAPE '/'; //true
select 'abcSpark%13sd' LIKE '%Spark"%%' ESCAPE '"'; //true
select 'abcSpark\\13sd' LIKE '%Spark\\\\_%'; //true
select 'abcSpark/13sd' LIKE '%Spark//_%'; //false
select 'abcSpark"13sd' LIKE '%Spark""_%'; //false
select 'abcSpark/13sd' LIKE '%Spark//_%' ESCAPE '/'; //true
select 'abcSpark"13sd' LIKE '%Spark""_%' ESCAPE '"'; //true
```

But Spark SQL only supports the 'LIKE predicate'.

Note: If the input string or the pattern string is null, then the result is null too.

Some mainstream databases support the syntax.

**PostgreSQL:** https://www.postgresql.org/docs/11/functions-matching.html
**Vertica:** https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/LanguageElements/Predicates/LIKE-predicate.htm?zoom_highlight=like%20escape
**MySQL:** https://dev.mysql.com/doc/refman/5.6/en/string-comparison-functions.html
**Oracle:** https://docs.oracle.com/en/database/oracle/oracle-database/19/jjdbc/JDBC-reference-information.html#GUID-5D371A5B-D7F6-42EB-8C0D-D317F3C53708 and https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Pattern-matching-Conditions.html#GUID-0779657B-06A8-441F-90C5-044B47862A0A
**Teradata:** https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/ZP3CE_cR~e7V50zVkzzeVQ
**Snowflake:** https://docs.snowflake.net/manuals/sql-reference/functions/like.html

## How was this patch tested?

Existing UT and new UT.

This PR was merged to my production environment and runs the above SQL:

```
spark-sql> select 'abcSpark_13sd' LIKE '%Spark\\_%';
true
Time taken: 0.119 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark_13sd' LIKE '%Spark/_%';
false
Time taken: 0.103 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark_13sd' LIKE '%Spark"_%';
false
Time taken: 0.096 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark_13sd' LIKE '%Spark/_%' ESCAPE '/';
true
Time taken: 0.096 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark_13sd' LIKE '%Spark"_%' ESCAPE '"';
true
Time taken: 0.092 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark%13sd' LIKE '%Spark\\%%';
true
Time taken: 0.109 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark%13sd' LIKE '%Spark/%%';
false
Time taken: 0.1 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark%13sd' LIKE '%Spark"%%';
false
Time taken: 0.081 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark%13sd' LIKE '%Spark/%%' ESCAPE '/';
true
Time taken: 0.095 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark%13sd' LIKE '%Spark"%%' ESCAPE '"';
true
Time taken: 0.113 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark\\13sd' LIKE '%Spark\\\\_%';
true
Time taken: 0.078 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark/13sd' LIKE '%Spark//_%';
false
Time taken: 0.067 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark"13sd' LIKE '%Spark""_%';
false
Time taken: 0.084 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark/13sd' LIKE '%Spark//_%' ESCAPE '/';
true
Time taken: 0.091 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark"13sd' LIKE '%Spark""_%' ESCAPE '"';
true
Time taken: 0.091 seconds, Fetched 1 row(s)
```

I created a table; its schema is:

```
spark-sql> desc formatted gja_test;
key  string  NULL
value  string  NULL
other  string  NULL

# Detailed Table Information
Database  test
Table  gja_test
Owner  test
Created Time  Wed Apr 10 11:06:15 CST 2019
Last Access  Thu Jan 01 08:00:00 CST 1970
Created By  Spark 2.4.1-SNAPSHOT
Type  MANAGED
Provider  hive
Table Properties  [transient_lastDdlTime=1563443838]
Statistics  26 bytes
Location  hdfs://namenode.xxx:9000/home/test/hive/warehouse/test.db/gja_test
Serde Library  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat  org.apache.hadoop.mapred.TextInputFormat
OutputFormat  org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Storage Properties  [field.delim= , serialization.format= ]
Partition Provider  Catalog
Time taken: 0.642 seconds, Fetched 21 row(s)
```

Table `gja_test` contains three rows of data.

```
spark-sql> select * from gja_test;
a  A  ao
b  B  bo
"__  """__  "
Time taken: 0.665 seconds, Fetched 3 row(s)
```

Finally, I tested this function:

```
spark-sql> select * from gja_test where key like value escape '"';
"__  """__  "
Time taken: 0.687 seconds, Fetched 1 row(s)
```

Closes apache#25001 from beliefer/ansi-sql-like.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
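The null-handling note above ("if the input string or the pattern string is null, then the result is null too") can be illustrated with a small Python sketch of the default-escape semantics (illustrative only, not Spark's implementation):

```python
import re
from typing import Optional

def sql_like(s: Optional[str], pattern: Optional[str]) -> Optional[bool]:
    """Null-propagating LIKE with the default '\\' escape character."""
    if s is None or pattern is None:
        return None  # SQL three-valued logic: NULL in, NULL out
    regex, i = [], 0
    while i < len(pattern):
        c = pattern[i]
        if c == "\\" and i + 1 < len(pattern):
            regex.append(re.escape(pattern[i + 1]))  # escaped literal
            i += 2
        elif c == "_":
            regex.append(".")
            i += 1
        elif c == "%":
            regex.append(".*")
            i += 1
        else:
            regex.append(re.escape(c))
            i += 1
    return re.fullmatch("(?s)" + "".join(regex), s) is not None
```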
Thanks, all.
### What changes were proposed in this pull request?
Since [25001](#25001), Spark supports the LIKE ... ESCAPE syntax. But '%' and '_' are reserved chars in the `Like` expression, so we cannot use them as the escape char.
### Why are the changes needed?
Avoid unexpected problems when using the LIKE ... ESCAPE syntax.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added UT.
Closes #26860 from ulysses-you/SPARK-30230. Authored-by: ulysses <youxiduo@weidian.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
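The SPARK-30230 restriction can be sketched as a simple validation step before the pattern is compiled (illustrative Python; the function name is hypothetical):

```python
# Wildcard characters already carry meaning inside a LIKE pattern,
# so they cannot double as the escape character.
RESERVED_WILDCARDS = {"%", "_"}

def check_escape_char(c: str) -> None:
    """Reject invalid choices for the ESCAPE character."""
    if len(c) != 1:
        raise ValueError("the escape character must be a single character")
    if c in RESERVED_WILDCARDS:
        raise ValueError(f"'{c}' cannot be used as the escape character")
```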
…capeChar
Since [25001](#25001), Spark supports the LIKE ... ESCAPE syntax. We should also sync the escape character used by `LikeSimplification`, to avoid the optimization failing. No user-facing change. Added UT. Closes #26880 from ulysses-you/SPARK-30254. Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR is a follow-up to #25001.
### Why are the changes needed?
No.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Pass the Jenkins with the newly updated test files.
Closes #26949 from beliefer/uncomment-like-escape-tests. Authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>