
[SPARK-24884][SQL] Support regexp function regexp_extract_all #27507

Closed
wants to merge 17 commits

Conversation


@beliefer beliefer commented Feb 9, 2020

What changes were proposed in this pull request?

regexp_extract_all is a very useful function that expands the capabilities of regexp_extract.
Here are some examples of this function:

SELECT regexp_extract('1a 2b 14m', '\d+', 0); -- 1
SELECT regexp_extract_all('1a 2b 14m', '\d+', 0); -- [1, 2, 14]
SELECT regexp_extract('1a 2b 14m', '(\d+)([a-z]+)', 2); -- 'a'
SELECT regexp_extract_all('1a 2b 14m', '(\d+)([a-z]+)', 2); -- ['a', 'b', 'm']
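The intended semantics can be sketched in Python with the standard `re` module (an illustration only; Spark itself uses `java.util.regex`, and the function name here just mirrors the SQL one):

```python
import re

def regexp_extract_all(subject, pattern, idx=1):
    # Collect group `idx` from every match of `pattern` in `subject`.
    # Sketch of the SQL semantics above, not the Spark implementation.
    return [m.group(idx) for m in re.finditer(pattern, subject)]

print(regexp_extract_all('1a 2b 14m', r'\d+', 0))            # ['1', '2', '14']
print(regexp_extract_all('1a 2b 14m', r'(\d+)([a-z]+)', 2))  # ['a', 'b', 'm']
```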

Several mainstream databases support this syntax.
Presto:
https://prestodb.io/docs/current/functions/regexp.html

Pig:
https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html

BigQuery:
https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#regexp_extract_all

Note: This PR picks up the work of #21985.

Why are the changes needed?

regexp_extract_all is a very useful function and makes work easier.

Does this PR introduce any user-facing change?

No

How was this patch tested?

New UT


SparkQA commented Feb 9, 2020

Test build #118091 has finished for PR 27507 at commit 06bb690.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class RegExpExtractBase extends TernaryExpression with ImplicitCastInputTypes
  • case class RegExpExtractAll(subject: Expression, regexp: Expression, idx: Expression)


SparkQA commented Feb 9, 2020

Test build #118096 has finished for PR 27507 at commit 7a1c6d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer beliefer changed the title [SPARK-24884][SQL] Support string function regexp_extract_all [SPARK-24884][SQL] Support regexp function regexp_extract_all Feb 10, 2020
""
}

(classNamePattern, matcher, matchResult, termLastRegex, termPattern, setEvNotNull)
Contributor

this is a little hard to read at the caller side.

can we implement doGenCode in the base class, and have it call an abstract method that sub-classes implement?
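The suggestion is essentially the template-method pattern; a minimal Python sketch (all class and method names here are hypothetical, since the real code is Scala codegen):

```python
from abc import ABC, abstractmethod

class RegExpExtractBase(ABC):
    def do_gen_code(self, ev):
        # Shared scaffolding lives in the base class...
        prelude = "initLastMatcher();"
        # ...and each subclass only supplies its own extraction snippet.
        return prelude + self.gen_extract_code(ev)

    @abstractmethod
    def gen_extract_code(self, ev):
        ...

class RegExpExtractAll(RegExpExtractBase):
    def gen_extract_code(self, ev):
        return f"collectAllGroups({ev});"

print(RegExpExtractAll().do_gen_code("ev1"))
# initLastMatcher();collectAllGroups(ev1);
```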

Contributor Author

OK. Good idea.

* NOTE: this expression is not THREAD-SAFE, as it has some internal mutable status.
*/
@ExpressionDescription(
usage = "_FUNC_(str, regexp[, idx]) - Extracts all group that matches `regexp`.",
Contributor

shall we explain the semantics of idx?

Contributor Author

OK.


SparkQA commented Feb 12, 2020

Test build #118274 has finished for PR 27507 at commit 1ed159f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

s"${ev.isNull} = false;"
} else {
""
}
Contributor

TBH I don't think there is much common code to share. Maybe we can have a
protected def setNotNullCode(ev: ExprCode) = ... but that's all.

How about we just let each sub-class implement doGenCode individually?

Contributor Author

OK.

fallback to the Spark 1.6 behavior regarding string literal parsing. For example,
if the config is enabled, the `regexp` that can match "\abc" is "^\abc$".
* idx - a int expression. The regex maybe contains multiple groups. `idx` represents the
index of regex group.
Contributor

@cloud-fan cloud-fan Feb 12, 2020

idx indicates which regex group to extract.

Contributor Author

OK

/**
* Extract a specific(idx) group identified by a Java regex.
*
* NOTE: this expression is not THREAD-SAFE, as it has some internal mutable status.
*/
@ExpressionDescription(
usage = "_FUNC_(str, regexp[, idx]) - Extracts a group that matches `regexp`.",
arguments = """
Arguments:
* str - a string expression
Contributor

a string expression of the input string.

Contributor Author

OK

arguments = """
Arguments:
* str - a string expression
* regexp - a string expression. The regex string should be a Java regular expression.
Contributor

a string expression of the regex string.

Contributor

what is Java regular expression?

Contributor Author

OK. This just references the comment of RLIKE.

There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to
fallback to the Spark 1.6 behavior regarding string literal parsing. For example,
if the config is enabled, the `regexp` that can match "\abc" is "^\abc$".
* idx - a int expression. The regex maybe contains multiple groups. `idx` represents the
Contributor

an int expression of the regex group index.

Contributor Author

OK

@@ -508,3 +568,96 @@ case class RegExpExtract(subject: Expression, regexp: Expression, idx: Expressio
})
}
}

/**
* Extract all specific(idx) group identified by a Java regex.
Contributor

group -> groups

Contributor Author

OK

@@ -2383,6 +2383,17 @@ object functions {
RegExpExtract(e.expr, lit(exp).expr, lit(groupIdx).expr)
}

/**
* Extract all specific group matched by a Java regex, from the specified string column.
Contributor

groups

Contributor Author

OK

@@ -2383,6 +2383,17 @@ object functions {
RegExpExtract(e.expr, lit(exp).expr, lit(groupIdx).expr)
}

/**
* Extract all specific group matched by a Java regex, from the specified string column.
* If the regex did not match, or the specified group did not match, an empty array is returned.
Contributor

The behavior seems to be

  1. If the regex does not match, return an empty array
  2. if the specified group does not match, put an empty string to the result array.

Can we document the behavior in SQL expression? And can you verify this is the standard behavior in other databases?
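The two behaviors described above can be sketched in Python (an illustration under stated assumptions, not the Spark code; `re` stands in for `java.util.regex`):

```python
import re

def regexp_extract_all(subject, pattern, idx):
    results = []
    for m in re.finditer(pattern, subject):
        g = m.group(idx)
        # 2. A non-matching (optional) group contributes an empty string.
        results.append(g if g is not None else '')
    # 1. If the regex never matched, the loop never ran: empty array.
    return results

print(regexp_extract_all('abc', r'\d+', 0))              # [] — regex does not match
print(regexp_extract_all('1a 2', r'(\d+)([a-z]+)?', 2))  # ['a', ''] — optional group
```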

Contributor Author

  1. should throw an IllegalArgumentException.
    [SPARK-30763][SQL] Fix java.lang.IndexOutOfBoundsException No group 1 for regexp_extract #27508
    the behavior of Hive is :
    FAILED: SemanticException [Error 10014]: Line 1:7 Wrong arguments ‘2’: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public java.lang.String org.apache.hadoop.hive.ql.udf.UDFRegExpExtract.evaluate(java.lang.String,java.lang.String,java.lang.Integer) on object org.apache.hadoop.hive.ql.udf.UDFRegExpExtract@2cf5e0f0 of class org.apache.hadoop.hive.ql.udf.UDFRegExpExtract with arguments {x=a3&x=18abc&x=2&y=3&x=4:java.lang.String, x=([0-9]+)[a-z]:java.lang.String, 2:java.lang.Integer} of size 3

Contributor

let's document the behavior clearly.

Contributor Author

OK


SparkQA commented Feb 12, 2020

Test build #118297 has finished for PR 27507 at commit ea29d66.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to
fallback to the Spark 1.6 behavior regarding string literal parsing. For example,
if the config is enabled, the `regexp` that can match "\abc" is "^\abc$".
* idx - an int expression of the regex group index. The regex maybe contains multiple
Member

nit: maybe contains -> may contain

Contributor Author

OK


kiszk commented Feb 12, 2020

I have a high-level question. Do we have a huge advantage in generating Java code?

One advantage is to store the result of Pattern.compile() into each global variable for caching while the non-generated code shares one variable for cache.
On the other hand, the size of the result is not small. Which trade-off do we select? Space or performance?

override def nullSafeEval(s: Any, p: Any, r: Any): Any = {
val m = getLastMatcher(s, p)
val matchResults = new ArrayBuffer[UTF8String]()
val mr: MatchResult = m.toMatchResult
Contributor

where do we use this mr?

Contributor Author

I will remove it.


SparkQA commented Feb 12, 2020

Test build #118304 has finished for PR 27507 at commit e50f010.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Feb 12, 2020

Test build #118306 has finished for PR 27507 at commit 517251a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


beliefer commented Feb 13, 2020

I have a high-level question. Do we have a huge advantage in generating Java code?

One advantage is to store the result of Pattern.compile() into each global variable for caching while the non-generated code shares one variable for cache.
On the other hand, the size of the result is not small. Which trade-off do we select? Space or performance?

LIKE and RLIKE cache the result of Pattern.compile().
RegExpReplace and RegExpExtract use another way.

If the pattern string is a constant, the two approaches achieve the same goal.
If the pattern string is a variable, the performance issue seems unavoidable.
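The caching idea can be illustrated with a simple compiled-pattern cache (a Python sketch under stated assumptions; Spark's actual cache lives in the expression state / generated code, not in `lru_cache`):

```python
import re
from functools import lru_cache

@lru_cache(maxsize=None)
def compiled(pattern):
    # Compile once per distinct pattern string; repeated lookups hit the cache.
    # A constant pattern therefore pays the compilation cost only once,
    # while a variable pattern can defeat the cache on every row.
    return re.compile(pattern)

p1 = compiled(r'\d+')
p2 = compiled(r'\d+')
print(p1 is p2)  # True — the same compiled object is reused
```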

// invalid group index
val row8 = create_row("100-200,300-400,500-600", "(\\d+)-(\\d+)", 3)
val row9 = create_row("100-200,300-400,500-600", "(\\d+).*", 2)
val row10 = create_row("100-200,300-400,500-600", "\\d+", 1)
Contributor

how about negative group index?

Contributor Author

throw new IllegalArgumentException("The specified group index cannot be less than zero")
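A minimal sketch of that validation (the helper name `check_group_index` is hypothetical; the first message mirrors the exception text quoted above):

```python
def check_group_index(group_count, idx):
    # Reject bad group indexes up front, rather than failing
    # inside the regex engine with an obscure error.
    if idx < 0:
        raise ValueError("The specified group index cannot be less than zero")
    if idx > group_count:
        raise ValueError(f"Regex group count is {group_count}, "
                         f"but the specified group index is {idx}")

check_group_index(2, 1)  # OK: pattern has two groups, index 1 is valid
```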

Contributor

can we test it?

Contributor Author

OK

// This is a hack way to enable the codegen, thus the codegen is enable by default,
// it will still use the interpretProjection if projection followed by a LocalRelation,
// hence we add a filter operator.
// See the optimizer rule `ConvertToLocalRelation`
Contributor

This is outdated. We already disabled ConvertToLocalRelation in testing. We can remove this code.

Contributor Author

OK

SELECT regexp_extract_all('1a 2b 14m', '(\\d+)([a-z]+)', 0);
SELECT regexp_extract_all('1a 2b 14m', '(\\d+)([a-z]+)', 1);
SELECT regexp_extract_all('1a 2b 14m', '(\\d+)([a-z]+)', 2);
SELECT regexp_extract_all('1a 2b 14m', '(\\d+)([a-z]+)', 3);
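For reference, the four queries above can be mirrored with Python's `re.finditer` (a sketch, not the Spark engine); the last index fails because the pattern has only two groups:

```python
import re

def extract_all(subject, pattern, idx):
    # Hypothetical helper mirroring the SQL calls above.
    return [m.group(idx) for m in re.finditer(pattern, subject)]

s, p = '1a 2b 14m', r'(\d+)([a-z]+)'
print(extract_all(s, p, 0))  # ['1a', '2b', '14m']
print(extract_all(s, p, 1))  # ['1', '2', '14']
print(extract_all(s, p, 2))  # ['a', 'b', 'm']
try:
    extract_all(s, p, 3)     # only two groups exist
except IndexError:
    print('group index out of range')
```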
Contributor

can we test optional group here?

Contributor Author

OK


SparkQA commented Jul 31, 2020

Test build #126823 has finished for PR 27507 at commit c096de4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 31, 2020

Test build #126853 has finished for PR 27507 at commit af88ad1.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 31, 2020

Test build #126858 has finished for PR 27507 at commit 3a98bbe.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -322,6 +322,48 @@ class RegexpExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
RegExpExtract(Literal("\"quote"), Literal("\"quote"), Literal(1)) :: Nil)
}

test("RegexExtractAll") {
val row1 = create_row("100-200,300-400,500-600", "(\\d+)-(\\d+)", 1)
Contributor

can we test group 0?

Contributor Author

OK


SparkQA commented Jul 31, 2020

Test build #126891 has finished for PR 27507 at commit 82a384e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


beliefer commented Aug 1, 2020

retest this please


SparkQA commented Aug 1, 2020

Test build #126919 has finished for PR 27507 at commit 2125bff.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Aug 1, 2020

Test build #126916 has finished for PR 27507 at commit 82a384e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


beliefer commented Aug 2, 2020

retest this please


SparkQA commented Aug 2, 2020

Test build #126924 has finished for PR 27507 at commit 2125bff.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


kiszk commented Aug 2, 2020

retest this please


SparkQA commented Aug 2, 2020

Test build #126943 has finished for PR 27507 at commit 2125bff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan

thanks, merging to master!

@cloud-fan cloud-fan closed this in 42f9ee4 Aug 3, 2020

beliefer commented Aug 3, 2020

@cloud-fan @kiszk Thanks for your review.

maropu pushed a commit that referenced this pull request Sep 4, 2020
…aSuite to sql-expression-schema.md

### What changes were proposed in this pull request?
`sql-expression-schema.md` is automatically generated by `ExpressionsSchemaSuite`, but only expression entries are checked in `ExpressionsSchemaSuite`. So if we manually modify the contents of the file, `ExpressionsSchemaSuite` does not necessarily guarantee its correctness. For example, [Spark-24884](#27507) added `regexp_extract_all` expression support and manually modified `sql-expression-schema.md`, but did not update the `Number of queries` entry, causing a file content inconsistency.

Some additional checks have been added to `ExpressionsSchemaSuite` to improve the correctness guarantee of `sql-expression-schema.md`, as follows:

- `Number of queries` should equal the size of `expressions entries` in `sql-expression-schema.md`

- `Number of expressions that missing example` should equal the size of `Expressions missing examples` in `sql-expression-schema.md`

- `MissExamples` from the suite should be the same as `expectedMissingExamples` from `sql-expression-schema.md`

### Why are the changes needed?
Ensure the correctness of `sql-expression-schema.md` content.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Enhanced ExpressionsSchemaSuite

Closes #29608 from LuciferYang/sql-expression-schema.

Authored-by: yangjie <yangjie@MacintoshdeMacBook-Pro.local>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
dongjoon-hyun pushed a commit that referenced this pull request Jan 26, 2021
…t_all

### What changes were proposed in this pull request?
#27507 implemented `regexp_extract_all` and added the scala function version of it.
According to https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L41-L59, it seems good to remove the scala function version, even though I think `regexp_extract_all` is very useful, if we just follow the referenced guideline.

### Why are the changes needed?
`regexp_extract_all` is less common.

### Does this PR introduce _any_ user-facing change?
'No'. `regexp_extract_all` was added in Spark 3.1.0 which isn't released yet.

### How was this patch tested?
Jenkins test.

Closes #31346 from beliefer/SPARK-24884-followup.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
dongjoon-hyun pushed a commit that referenced this pull request Jan 26, 2021
…t_all

(cherry picked from commit 99b6af2)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
skestle pushed a commit to skestle/spark that referenced this pull request Feb 3, 2021
…t_all
