[SPARK-33045][SQL] Support build-in function like_all and fix StackOverflowError issue. #29999

beliefer · 2020-10-10T10:25:47Z

What changes were proposed in this pull request?

Spark already supports the LIKE ALL syntax, but it throws a StackOverflowError when the pattern list has many elements (more than 14378). We should implement a built-in function for LIKE ALL to fix this issue.

Why can the stack overflow happen in the current approach?
The current approach uses reduceLeft to connect each Like(e, p), which produces a deeply nested expression. With many patterns, the call depth of the thread becomes too large and causes the StackOverflowError.

Why the fix in this PR can avoid the error?
This PR adds a built-in function for LIKE ALL, which avoids this issue.
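To illustrate the depth problem described above, here is a minimal sketch using toy expression types. The names `Expr`, `Like`, `And`, `LikeAll`, `andReduced`, `flat`, and `depth` are assumptions made for this example and are not Spark's Catalyst classes; the sketch only shows that a reduceLeft-built tree grows one level per pattern, while a single node holding the whole pattern list stays flat.

```scala
object LikeAllDepthSketch {
  // Toy expression types for illustration only (not Catalyst classes).
  sealed trait Expr
  final case class Like(value: String, pattern: String) extends Expr
  final case class And(left: Expr, right: Expr) extends Expr
  final case class LikeAll(value: String, patterns: Seq[String]) extends Expr

  // Chains Like nodes with And via reduceLeft: And(And(And(l1, l2), l3), l4) ...
  def andReduced(value: String, patterns: Seq[String]): Expr =
    patterns.map(p => Like(value, p): Expr).reduceLeft[Expr]((l, r) => And(l, r))

  // Keeps all patterns in a single node.
  def flat(value: String, patterns: Seq[String]): Expr = LikeAll(value, patterns)

  // Iterative depth measurement, so measuring a deep tree does not itself
  // overflow the stack the way a recursive traversal could.
  def depth(root: Expr): Int = {
    var maxDepth = 0
    var pending: List[(Expr, Int)] = List((root, 1))
    while (pending.nonEmpty) {
      val (e, d) = pending.head
      pending = pending.tail
      if (d > maxDepth) maxDepth = d
      e match {
        case And(l, r) => pending = (l, d + 1) :: (r, d + 1) :: pending
        case _ => ()
      }
    }
    maxDepth
  }
}
```

Any recursive pass over the And-reduced tree needs one stack frame per level, which is why the nesting depth matters more than the total node count.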

Why are the changes needed?

1. Fix the StackOverflowError issue.
2. Support the built-in function like_all.

Does this PR introduce any user-facing change?

'No'.

How was this patch tested?

Jenkins test.

SparkQA · 2020-10-10T11:12:36Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34227/

SparkQA · 2020-10-10T11:36:28Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34227/

maropu · 2020-10-10T11:57:05Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

+              throw new ParseException("Expected something between '(' and ')'.", ctx)
+            }
+            ctx.NOT match {
+              case null => LikeAll(e, ctx.expression.asScala.map(expression))


Does this change disable the datasource pushdown for LIKE (e.g., StartsWith, EndsWith)? If so, we possibly get performance regression when reading datasources, I think.

I have improved the implementation. The expression is converted to LikeAll when it is judged likely to cause a StackOverflowError; otherwise, the current approach is still used.
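As background for the pushdown concern raised above, the following sketch shows how a simple LIKE pattern can be recognized as a prefix, suffix, or substring test. It is an illustration under stated assumptions, not Spark's implementation: the object `LikePushdownSketch`, the `SimpleFilter` types, and `classify` are hypothetical names, and the sketch conservatively gives up on any pattern containing `_` or a backslash escape.

```scala
object LikePushdownSketch {
  // Hypothetical filter shapes; real data source filters are richer than this.
  sealed trait SimpleFilter
  final case class StartsWith(prefix: String) extends SimpleFilter
  final case class EndsWith(suffix: String) extends SimpleFilter
  final case class Contains(infix: String) extends SimpleFilter

  // Returns a simple filter only when '%' appears at the ends of the pattern
  // and nowhere else, and the pattern has no '_' wildcard or escape character.
  def classify(pattern: String): Option[SimpleFilter] = {
    val body = pattern.stripPrefix("%").stripSuffix("%")
    val hasOtherWildcards =
      body.contains("%") || pattern.contains("_") || pattern.contains("\\")
    if (body.isEmpty || hasOtherWildcards) {
      None
    } else {
      (pattern.startsWith("%"), pattern.endsWith("%")) match {
        case (false, true) => Some(StartsWith(body))
        case (true, false) => Some(EndsWith(body))
        case (true, true) => Some(Contains(body))
        case (false, false) => None // no wildcard at all: plain equality, not covered here
      }
    }
  }
}
```

A predicate that stays as a chain of individual LIKE expressions can be inspected pattern by pattern in this way, which is the property the question above is concerned about losing.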

maropu · 2020-10-10T12:01:08Z

In the PR description, could you describe why the stack overflow can happen in the current approach and why the fix in this PR can avoid the error?

maropu · 2020-10-10T12:04:56Z

One more question: does the approach in this PR have the same performance as the current one when LIKE ALL has only a small number of elements?

SparkQA · 2020-10-10T13:00:16Z

Test build #129623 has finished for PR 29999 at commit 4163382.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
abstract class LikeAllBase extends Expression with ImplicitCastInputTypes with NullIntolerant
case class LikeAll(value: Expression, list: Seq[Expression]) extends LikeAllBase
case class NotLikeAll(value: Expression, list: Seq[Expression]) extends LikeAllBase

SparkQA · 2020-11-17T03:52:56Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35791/

SparkQA · 2020-11-17T03:59:56Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35793/

SparkQA · 2020-11-17T04:13:21Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35794/

SparkQA · 2020-11-17T04:24:24Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35793/

SparkQA · 2020-11-17T04:36:40Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35794/

SparkQA · 2020-11-17T07:35:57Z

Test build #131189 has finished for PR 29999 at commit 97c1c73.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-11-17T08:05:02Z

Test build #131191 has finished for PR 29999 at commit f0e3de1.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

beliefer · 2020-11-17T08:22:40Z

retest this please

SparkQA · 2020-11-17T09:52:19Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35814/

SparkQA · 2020-11-17T10:23:20Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35814/

SparkQA · 2020-11-17T12:59:55Z

Test build #131211 has finished for PR 29999 at commit f0e3de1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-11-19T05:55:59Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala

+    if (exprValue == null) {
+      null
+    } else {
+      val allMatched = if (isNotLikeAll) {


to improve readability:

val matchFunc: Pattern => Boolean = if (isNotLikeAll) {
  p => !p.matcher(exprValue.toString).matches()
} else {
  p => p.matcher(exprValue.toString).matches()
}
if (cache.forall(matchFunc)) {
  if (hasNull) null else true
} else {
  false
}
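A self-contained version of the suggested evaluation logic might look like the sketch below. It is an assumption-laden illustration, not the merged code: `Option[Boolean]` stands in for SQL's three-valued result (with `None` playing the role of NULL), and the function name `evalLikeAll` and its parameter list are made up for this example. The patterns are assumed to be already compiled to `java.util.regex.Pattern`.

```scala
import java.util.regex.Pattern

object LikeAllEvalSketch {
  // value:        the evaluated input, None when it is NULL
  // patterns:     the non-null patterns, already compiled
  // hasNull:      whether the original pattern list contained a NULL
  // isNotLikeAll: true for NOT LIKE ALL, false for LIKE ALL
  def evalLikeAll(
      value: Option[String],
      patterns: Seq[Pattern],
      hasNull: Boolean,
      isNotLikeAll: Boolean): Option[Boolean] = value match {
    case None => None
    case Some(v) =>
      val matchFunc: Pattern => Boolean =
        if (isNotLikeAll) p => !p.matcher(v).matches()
        else p => p.matcher(v).matches()
      if (patterns.forall(matchFunc)) {
        // Every known pattern passed; a NULL pattern makes the answer unknown.
        if (hasNull) None else Some(true)
      } else {
        // One definite failure decides the result regardless of NULL patterns.
        Some(false)
      }
  }
}
```

Because `forall` stops at the first failing pattern, a definite `false` is returned without consulting `hasNull`, which matches the ordering in the suggestion above.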

cloud-fan · 2020-11-19T05:57:20Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala

+    val valueArg = ctx.freshName("valueArg")
+    val patternCache = ctx.addReferenceObj("patternCache", cache.asJava)
+
+    val matchCode = if (isNotLikeAll) {


this is checkNotMatchCode


cloud-fan · 2020-11-19T06:19:44Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala

+    ev.copy(code =
+      code"""
+            |${eval.code}
+            |boolean $allMatched = true;


the code flow can be

boolean ${ev.isNull} = false;
boolean ${ev.value} = true;
if (${eval.isNull}) {
  ${ev.isNull} = true;
} else {
  $javaDataType $valueArg = ${eval.value};
  for ... {
    if (notMatched) {
      ${ev.value} = false;
      break;
    }
  }
  if (${ev.value} && hasNull) ${ev.isNull} = true;
}

I learned more! Thanks!

SparkQA · 2020-11-19T07:42:24Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35930/

SparkQA · 2020-11-19T08:05:03Z

Test build #131326 has finished for PR 29999 at commit 001eb38.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-11-19T08:08:47Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35930/

beliefer · 2020-11-19T08:09:35Z

retest this please

SparkQA · 2020-11-19T09:28:50Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35936/

SparkQA · 2020-11-19T09:52:29Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35936/

SparkQA · 2020-11-19T14:07:12Z

Test build #131331 has finished for PR 29999 at commit 001eb38.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-11-19T16:56:19Z

thanks, merging to master!

mridulm · 2020-11-19T19:18:31Z

@cloud-fan This is causing failures in scala-2.13 build
See this for example.

+CC @dongjoon-hyun, @srowen

I believe @sunchao's PR is attempting to address it here

beliefer · 2020-11-20T02:03:54Z

@cloud-fan @wangyum @maropu Thanks for all your work!

juliuszsompolski · 2020-12-17T15:41:52Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

+      .intConf
+      .checkValue(threshold => threshold >= 0, "The maximum size of pattern sequence " +
+        "in like all must be non-negative")
+      .createWithDefault(200)


A tree of 200 And-reduced expressions is already a huge expr tree.
I think this could be useful and helpful with a default threshold of 5 or so already.

We have removed this config: beliefer@9273d42#diff-13c5b65678b327277c68d17910ae93629801af00117a0e3da007afd95b6c6764L219

We will always use the new expression for LIKE ALL if values are all literal.
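The rule stated above (use the new expression whenever every pattern is a literal) can be sketched with toy types. Everything named here, including `LikeAllPlanningSketch`, `planLikeAll`, and the `Expr` hierarchy, is a hypothetical stand-in chosen for illustration; the fallback to an And-reduced chain for non-literal patterns is likewise an assumption about what "the current approach" means in this thread.

```scala
object LikeAllPlanningSketch {
  // Toy expression types for illustration only (not Catalyst classes).
  sealed trait Expr
  final case class Literal(value: String) extends Expr
  final case class Column(name: String) extends Expr
  final case class Like(value: Expr, pattern: Expr) extends Expr
  final case class And(left: Expr, right: Expr) extends Expr
  final case class LikeAll(value: Expr, patterns: Seq[String]) extends Expr

  def planLikeAll(value: Expr, patterns: Seq[Expr]): Expr = {
    require(patterns.nonEmpty, "LIKE ALL needs at least one pattern")
    val literals = patterns.collect { case Literal(s) => s }
    if (literals.length == patterns.length) {
      // All patterns are literals: one flat node, no threshold involved.
      LikeAll(value, literals)
    } else {
      // Assumed fallback: chain individual Like expressions with And.
      patterns.map(p => Like(value, p): Expr).reduceLeft[Expr]((l, r) => And(l, r))
    }
  }
}
```

With this rule there is no size threshold to tune, which is consistent with the config having been removed.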

beliefer and others added 22 commits June 19, 2020 10:36

Reuse completeNextStageWithFetchFailure

4a6f903

Merge remote-tracking branch 'upstream/master'

96456e2

Merge remote-tracking branch 'upstream/master'

4314005

Merge remote-tracking branch 'upstream/master'

d6af4a7

Merge remote-tracking branch 'upstream/master'

f69094f

Merge remote-tracking branch 'upstream/master'

b86a42d

Merge branch 'master' of github.com:beliefer/spark

2ac5159

Merge remote-tracking branch 'upstream/master'

9021d6c

Merge branch 'master' of github.com:beliefer/spark

74a2ef4

Merge remote-tracking branch 'upstream/master'

9828158

Merge remote-tracking branch 'upstream/master'

9cd1aaf

Merge remote-tracking branch 'upstream/master'

abfcbb9

Merge remote-tracking branch 'upstream/master'

07c6c81

Merge remote-tracking branch 'upstream/master'

580130b

Merge branch 'master' of github.com:beliefer/spark

3712808

Merge remote-tracking branch 'upstream/master'

6107413

Merge remote-tracking branch 'upstream/master'

4b799b4

Merge remote-tracking branch 'upstream/master'

ee0ecbf

Merge remote-tracking branch 'upstream/master'

596bc61

Merge remote-tracking branch 'upstream/master'

0164e2f

Merge remote-tracking branch 'upstream/master'

90b79fc

Support build-in LIKE_ALL function

4163382

maropu reviewed Oct 10, 2020


beliefer and others added 2 commits October 12, 2020 11:05

Fix schema issue.

1909298

Merge branch 'master' into SPARK-33045-like_all

054fc1b

cloud-fan reviewed Nov 19, 2020


cloud-fan reviewed Nov 19, 2020


Optimize code.

001eb38

cloud-fan approved these changes Nov 19, 2020


cloud-fan closed this in 3695e99 Nov 19, 2020

juliuszsompolski reviewed Dec 17, 2020


cloud-fan mentioned this pull request Jan 6, 2021

[SPARK-33938][SQL] Optimize Like Any/All by LikeSimplification #30975

Closed
