Skip to content

[CALCITE-6278] Add REGEXP, REGEXP_LIKE function (enabled in Spark library)#3712

Merged
macroguo-ghy merged 1 commit intoapache:mainfrom
YiwenWu:spark_func_regexp
Mar 12, 2024
Merged

[CALCITE-6278] Add REGEXP, REGEXP_LIKE function (enabled in Spark library)#3712
macroguo-ghy merged 1 commit intoapache:mainfrom
YiwenWu:spark_func_regexp

Conversation

@YiwenWu
Copy link
Contributor

@YiwenWu YiwenWu commented Mar 3, 2024

https://issues.apache.org/jira/browse/CALCITE-6278

  1. Add Function REGEXPREGEXP_LIKE enabled in the Spark library.
    Since this function has the same implementation as the Spark RLIKE function.
    The implementation can be reused.

Source Code
image

[undo] Discuss results in Jira:

  1. Since Spark 2.0, string literals (including regex patterns) are unescaped in SQL parser, also fix this bug in calcite.
  1. Add spark query tests in BabelQuidemTest to ensure the execution correctness.

@YiwenWu YiwenWu force-pushed the spark_func_regexp branch from 20f4d1e to b4fa622 Compare March 3, 2024 02:48
@YiwenWu YiwenWu changed the title [CALCITE-6278] Add REGEXP function (enabled in Spark library) [CALCITE-6278] Add REGEXP、REGEXP_LIKE function (enabled in Spark library) Mar 3, 2024
}

@Test void testRegexpFunc() {
final SqlOperatorFixture f = fixture().setFor(SqlLibraryOperators.REGEXP);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixture().setFor(SqlLibraryOperators.REGEXP).withLibrary(SqlLibrary.Spark) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

withLibrary is optional.
f0.forEachLibrary(list(functionAlias.libraries), consumer) will automatically load the corresponding Library type.

@@ -0,0 +1,243 @@
# spark.iq - Babel test for Spark dialect of SQL
Copy link
Member

@caicancai caicancai Mar 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a question.Are you sure this test only takes effect for semantics that conform to spark sql?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test will load Built-in functions and spark functions.

Copy link
Contributor

@JiajunBernoulli JiajunBernoulli Mar 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not against using this test.

But I want to know the necessity of using QuidemTest for Spark.
Because your code may become a future contribution specification.

Let's discuss two questions.

  1. Why need spark.iq?
  2. What features need to be tested and what features don't need to be tested?

Copy link
Member

@caicancai caicancai Mar 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 At present, calcite has adapted many spark functions. Do we need to add all spark functions? What are the benefits after adding them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Compared to SqlOperatorTest, I think the function tests in Babel are more accurate and complete. Therefore,
    referencesbig-query.iq and redshift.iq, spark.iq were added to verify the correctness of function execution.
  2. I think that unit tests can be added to this file when adding spark functions, but it is unnecessary.

/** SQL {@code RLIKE} function. */
public boolean rlike(String s, String pattern) {
return cache.getUnchecked(new Key(0, pattern)).matcher(s).find();
s = StringEscapeUtils.unescapeJava(s);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to add a little comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, add comment

@YiwenWu YiwenWu force-pushed the spark_func_regexp branch 2 times, most recently from d885c38 to d4bce2a Compare March 3, 2024 03:18
# REGEXP(str, regexp)
# Returns true if str matches regexp, or false otherwise.
#
# Returns STRING
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no it doesn't

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, fixed to return Boolean

| b | PARSE_TIMESTAMP(format, string[, timeZone]) | Uses format specified by *format* to convert *string* representation of timestamp to a TIMESTAMP WITH LOCAL TIME ZONE value in *timeZone*
| h s | PARSE_URL(urlString, partToExtract [, keyToExtract] ) | Returns the specified *partToExtract* from the *urlString*. Valid values for *partToExtract* include HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, and USERINFO. *keyToExtract* specifies which query to extract
| b s | POW(numeric1, numeric2) | Returns *numeric1* raised to the power *numeric2*
| s | REGEXP(string, regexp) | Returns true if *string* matches *regexp*, or false otherwise, string literals (including regex patterns) are unescaped
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't you just say 'alias for RLIKE'?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, Equivalent to string1 RLIKE string2

f.checkNull(fn + "(cast(null as varchar), 'abc')");
f.checkNull(fn + "(cast(null as varchar), cast(null as varchar))");
};
f0.forEachLibrary(list(functionAlias.libraries), consumer);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if RLIKE, REGEXP_LIKE and REGEXP are identical, you should devise a way to test all three with a single test

Copy link
Contributor Author

@YiwenWu YiwenWu Mar 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although RLIKE and REGEXP have the same implementation, their function calling methods differ, so they are divided into different tests.

  • RLIKE: 'str' RLIKE 'regex'
  • REGEXP|REGEXP_LIKE : fn('str', 'regex').

In addition, we can abstract a method to support different Function calling modes, but I may think it is not necessary at present.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you think it's better to combine them into a single test, I'll modify the test case.

Copy link
Contributor

@julianhyde julianhyde Mar 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is necessary. And it's easy to combine all three. BinaryOperator<String>. For RLIKE, pass (a, b) -> a + " RLIKE " b; for REGEXP, pass (a, b) -> "REGEXP(" + a + ", " + b + ")".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, combined test case of RLIKEREGEXPREGEXP_LIKE

@YiwenWu YiwenWu force-pushed the spark_func_regexp branch 2 times, most recently from 1545b81 to 6161df5 Compare March 6, 2024 04:12
@YiwenWu YiwenWu changed the title [CALCITE-6278] Add REGEXP、REGEXP_LIKE function (enabled in Spark library) [CALCITE-6278] Add REGEXP, REGEXP_LIKE function (enabled in Spark library) Mar 7, 2024
/** SQL {@code RLIKE} function. */
public boolean rlike(String s, String pattern) {
return cache.getUnchecked(new Key(0, pattern)).matcher(s).find();
// Since Spark 2.0, string literals (including regex patterns) are unescaped in SQL parser
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rlike also enabled in HIVE library, does the rlike function have the same semantics in hive as in spark sql?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, judging by the engine execution behavior, Hive also performs the unescape, and the result is consistent with Spark.
image

Copy link
Contributor

@macroguo-ghy macroguo-ghy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@YiwenWu YiwenWu force-pushed the spark_func_regexp branch from ef42d21 to dc4119d Compare March 9, 2024 03:57
@jduo
Copy link
Member

jduo commented Mar 9, 2024

Would it be possible to merge these changes? I'm looking to implement CALCITE-6309 on top of this.

| b m p s | RIGHT(string, length) | Returns the rightmost *length* characters from the *string*
| h s | string1 RLIKE string2 | Whether *string1* matches regex pattern *string2* (similar to `LIKE`, but uses Java regex)
| h s | string1 NOT RLIKE string2 | Whether *string1* does not match regex pattern *string2* (similar to `NOT LIKE`, but uses Java regex)
| h s | string1 RLIKE string2 | Whether *string1* matches regex pattern *string2* (similar to `LIKE`, but uses Java regex), string literals (including regex patterns) are unescaped
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we maintain alphabetical order?

re before than ri

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, order adjusted.

@YiwenWu
Copy link
Contributor Author

YiwenWu commented Mar 10, 2024

Would it be possible to merge these changes? I'm looking to implement CALCITE-6309 on top of this.

@jduo I think this pr will be able to be merged recently, the details of the relevant issues have been discussed clearly in Jira.

@YiwenWu YiwenWu force-pushed the spark_func_regexp branch from 6a91261 to 78acd2e Compare March 10, 2024 06:08
@sonarqubecloud
Copy link

@macroguo-ghy
Copy link
Contributor

If no other comment, I will merge it later.

@jduo
Copy link
Member

jduo commented Mar 11, 2024

If no other comment, I will merge it later.

Thanks, that'd be great. Once that's in I'll follow-up with the CALCITE-6309 work.

@macroguo-ghy macroguo-ghy merged commit d8804b4 into apache:main Mar 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants