[SPARK-33045][SQL] Support build-in function like_all and fix StackOverflowError issue. #29999

Closed
wants to merge 60 commits into from

beliefer
Contributor

@beliefer beliefer commented Oct 10, 2020

What changes were proposed in this pull request?

Spark already supports the LIKE ALL syntax, but it throws a StackOverflowError when there are many elements (more than 14378). We should implement a built-in function for LIKE ALL to fix this issue.

Why can the stack overflow happen in the current approach?
The current approach uses reduceLeft to connect each Like(e, p) with And, which builds a deeply nested expression tree. Traversing that tree makes the thread's call depth too large and causes the StackOverflowError.

Why can the fix in this PR avoid the error?
This PR adds a built-in LikeAll expression that holds the whole pattern list in a single node, so the expression tree stays shallow and the deep recursion is avoided.
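
To make the depth argument concrete, here is a minimal sketch using toy expression classes. These are stand-ins I made up for illustration, not Catalyst's real `Like`, `And`, or `LikeAll`. It shows that the reduceLeft shape nests once per pattern, while a single list-holding node stays at depth 1. The depth is measured with an explicit stack so that the measurement itself does not recurse.

```scala
// Toy stand-ins for Catalyst expressions; these are NOT Spark's classes.
object LikeAllDepthSketch {
  sealed trait Expr
  final case class Like(pattern: String) extends Expr
  final case class And(left: Expr, right: Expr) extends Expr
  final case class LikeAll(patterns: Seq[String]) extends Expr

  // The old shape: fold the per-pattern predicates into a left-deep And chain.
  def reduced(patterns: Seq[String]): Expr =
    patterns.map(p => Like(p): Expr).reduceLeft[Expr](And(_, _))

  // Nesting depth, computed iteratively so this helper cannot overflow the stack.
  def depth(root: Expr): Int = {
    var maxDepth = 0
    var stack: List[(Expr, Int)] = List((root, 1))
    while (stack.nonEmpty) {
      val (e, d) = stack.head
      stack = stack.tail
      if (d > maxDepth) maxDepth = d
      e match {
        case And(l, r) => stack = (l, d + 1) :: (r, d + 1) :: stack
        case _ => () // leaves: Like and LikeAll have no child expressions here
      }
    }
    maxDepth
  }
}
```

Any recursive traversal of the reduced tree needs roughly one stack frame per pattern, which is where the overflow comes from once the list is long enough.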

Why are the changes needed?

1. Fix the StackOverflowError issue.
2. Support the built-in function like_all.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Jenkins test.

@SparkQA

SparkQA commented Oct 10, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34227/

@SparkQA

SparkQA commented Oct 10, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34227/

    throw new ParseException("Expected something between '(' and ')'.", ctx)
  }
  ctx.NOT match {
    case null => LikeAll(e, ctx.expression.asScala.map(expression))
Member

Does this change disable the datasource pushdown for LIKE (e.g., StartsWith, EndsWith)? If so, we might see a performance regression when reading from data sources, I think.
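
For context on the pushdown concern, a rough sketch of how a simple LIKE pattern can be reduced to a predicate such as StartsWith or EndsWith, which a data source can then evaluate itself. The classes and the `simplify` helper are hypothetical names for illustration, not Spark's own, and escape handling is deliberately left out. A single node holding all patterns would not be decomposed this way unless it is handled separately.

```scala
// Toy model of simple LIKE patterns mapping to pushdown-friendly predicates.
// All names here are illustrative assumptions, not Spark's classes.
object LikePushdownSketch {
  sealed trait Simple
  final case class StartsWith(prefix: String) extends Simple
  final case class EndsWith(suffix: String) extends Simple
  final case class Contains(infix: String) extends Simple
  final case class EqualTo(value: String) extends Simple

  // Returns None when the pattern needs a real regex match: a '_' wildcard,
  // a backslash escape, a '%' in the middle, or nothing but wildcards.
  def simplify(pattern: String): Option[Simple] = {
    val body = pattern.stripPrefix("%").stripSuffix("%")
    val needsRegex =
      pattern.contains("_") || pattern.contains("\\") || body.isEmpty || body.contains("%")
    if (needsRegex) None
    else Some((pattern.startsWith("%"), pattern.endsWith("%")) match {
      case (false, true) => StartsWith(body) // 'abc%'
      case (true, false) => EndsWith(body) // '%abc'
      case (true, true) => Contains(body) // '%abc%'
      case (false, false) => EqualTo(body) // 'abc'
    })
  }
}
```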

Contributor Author

I have improved the implementation. The expression is now converted to LikeAll only when the pattern list is judged large enough to cause a StackOverflowError; otherwise the current approach is still used.

@maropu
Member

maropu commented Oct 10, 2020

In the PR description, could you describe why the stack overflow can happen in the current approach and why the fix in this PR can avoid the error?

@maropu
Member

maropu commented Oct 10, 2020

One more question: does this PR's approach have the same performance as the current one when LIKE ALL has only a small number of elements?

@SparkQA

SparkQA commented Oct 10, 2020

Test build #129623 has finished for PR 29999 at commit 4163382.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class LikeAllBase extends Expression with ImplicitCastInputTypes with NullIntolerant
    • case class LikeAll(value: Expression, list: Seq[Expression]) extends LikeAllBase
    • case class NotLikeAll(value: Expression, list: Seq[Expression]) extends LikeAllBase

@SparkQA

SparkQA commented Nov 17, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35791/

@SparkQA

SparkQA commented Nov 17, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35793/

@SparkQA

SparkQA commented Nov 17, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35794/

@SparkQA

SparkQA commented Nov 17, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35793/

@SparkQA

SparkQA commented Nov 17, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35794/

@SparkQA

SparkQA commented Nov 17, 2020

Test build #131189 has finished for PR 29999 at commit 97c1c73.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 17, 2020

Test build #131191 has finished for PR 29999 at commit f0e3de1.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

retest this please

@SparkQA

SparkQA commented Nov 17, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35814/

@SparkQA

SparkQA commented Nov 17, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35814/

@SparkQA

SparkQA commented Nov 17, 2020

Test build #131211 has finished for PR 29999 at commit f0e3de1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (exprValue == null) {
  null
} else {
  val allMatched = if (isNotLikeAll) {
Contributor

to improve readability:

val matchFunc: Pattern => Boolean = if (isNotLikeAll) {
  p => !p.matcher(exprValue.toString).matches()
} else {
  p => p.matcher(exprValue.toString).matches()
}
if (cache.forall(matchFunc)) {
  if (hasNull) null else true
} else {
  false
}
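
A self-contained version of this suggestion, as a sketch under a few assumptions: `None` stands in for SQL NULL, `toRegex` is a made-up and deliberately naive LIKE-to-regex translation that ignores escape characters, and `likeAll` is a hypothetical helper rather than the method in this PR. It keeps the same three outcomes: false on the first failing pattern, NULL if everything matched but the list contained a NULL pattern, true otherwise.

```scala
import java.util.regex.Pattern

object LikeAllEvalSketch {
  // Naive LIKE-to-regex translation: '%' becomes ".*", '_' becomes ".",
  // and every other character is quoted. Escapes are not handled.
  def toRegex(like: String): Pattern = {
    val sb = new StringBuilder
    like.foreach {
      case '%' => sb.append(".*")
      case '_' => sb.append(".")
      case c => sb.append(Pattern.quote(c.toString))
    }
    Pattern.compile(sb.toString, Pattern.DOTALL)
  }

  // None models SQL NULL for the value, for a pattern, and for the result.
  def likeAll(
      value: Option[String],
      patterns: Seq[Option[String]],
      isNotLikeAll: Boolean): Option[Boolean] =
    value.flatMap { v =>
      val hasNull = patterns.exists(_.isEmpty)
      val cache = patterns.flatten.map(toRegex) // compile the non-null patterns once
      val matchFunc: Pattern => Boolean =
        if (isNotLikeAll) p => !p.matcher(v).matches()
        else p => p.matcher(v).matches()
      if (cache.forall(matchFunc)) {
        if (hasNull) None else Some(true)
      } else {
        Some(false)
      }
    }
}
```

For example, `likeAll(Some("foobar"), Seq(Some("foo%"), None), false)` yields `None`, while replacing `"foo%"` with `"x%"` yields `Some(false)`, because a definite mismatch wins over the unknown result of the NULL pattern.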

Contributor Author

OK

val valueArg = ctx.freshName("valueArg")
val patternCache = ctx.addReferenceObj("patternCache", cache.asJava)

val matchCode = if (isNotLikeAll) {
Contributor

this is checkNotMatchCode

Contributor Author

OK

ev.copy(code =
code"""
|${eval.code}
|boolean $allMatched = true;
Contributor

@cloud-fan cloud-fan Nov 19, 2020

the code flow can be

boolean ${ev.isNull} = false;
boolean ${ev.value} = true;
if (${eval.isNull}) {
  ${ev.isNull} = true;
} else {
  $javaDataType $valueArg = ${eval.value};
  for ... {
    if (notMatched) {
      ${ev.value} = false;
      break;
    }
  }
  if (${ev.value} && hasNull) ${ev.isNull} = true;
}
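
As a plain transliteration of that flow, here is a hand-written sketch of what the generated code would do for the LIKE ALL case at runtime. `Result`, `eval`, and the parameter names are hypothetical, chosen only to mirror the `isNull`/`value` pair in the template, and the loop condition plays the role of the `break`.

```scala
import java.util.regex.Pattern

object LikeAllCodegenFlowSketch {
  // Mirrors the (isNull, value) pair that generated code carries per expression.
  final case class Result(isNull: Boolean, value: Boolean)

  // `cache` holds the precompiled non-null patterns; `hasNull` records whether
  // the original pattern list contained a NULL.
  def eval(input: String, cache: Array[Pattern], hasNull: Boolean): Result = {
    var isNull = false
    var value = true
    if (input == null) {
      isNull = true
    } else {
      var i = 0
      // Stop at the first pattern that does not match.
      while (value && i < cache.length) {
        if (!cache(i).matcher(input).matches()) value = false
        i += 1
      }
      // All patterns matched, but a NULL pattern makes the overall result unknown.
      if (value && hasNull) isNull = true
    }
    Result(isNull, value)
  }
}
```

Initializing both flags up front means there is exactly one place where each of them changes, which is what makes the template easier to follow than tracking a separate `allMatched` variable.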

Contributor Author

@beliefer beliefer Nov 19, 2020

I learned more! Thanks!

@SparkQA

SparkQA commented Nov 19, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35930/

@SparkQA

SparkQA commented Nov 19, 2020

Test build #131326 has finished for PR 29999 at commit 001eb38.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 19, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35930/

@beliefer
Contributor Author

retest this please

@SparkQA

SparkQA commented Nov 19, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35936/

@SparkQA

SparkQA commented Nov 19, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35936/

@SparkQA

SparkQA commented Nov 19, 2020

Test build #131331 has finished for PR 29999 at commit 001eb38.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 3695e99 Nov 19, 2020
@mridulm
Contributor

mridulm commented Nov 19, 2020

@cloud-fan This is causing failures in the scala-2.13 build.
See this for example.

+CC @dongjoon-hyun, @srowen

I believe @sunchao's PR is attempting to address it here

@beliefer
Contributor Author

@cloud-fan @wangyum @maropu Thanks for all your work!

.intConf
.checkValue(threshold => threshold >= 0, "The maximum size of pattern sequence " +
"in like all must be non-negative")
.createWithDefault(200)
Contributor

A tree of 200 And-reduced expressions is already a huge expression tree.
I think the new expression would already be useful with a default threshold of 5 or so.

Contributor

@cloud-fan cloud-fan Dec 17, 2020

We have removed this config: beliefer@9273d42#diff-13c5b65678b327277c68d17910ae93629801af00117a0e3da007afd95b6c6764L219

We now always use the new expression for LIKE ALL when all the values are literals.
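
Roughly, that selection could look like the sketch below. The expression classes are toy stand-ins rather than Catalyst's, `build` is a hypothetical helper, and the fallback to an And-reduced chain for non-literal patterns is my assumption about what happens in that case.

```scala
// Toy stand-ins; NOT Spark's classes. The fallback branch is an assumption.
object LikeAllLiteralSketch {
  sealed trait Expr
  final case class Literal(value: String) extends Expr
  final case class Column(name: String) extends Expr
  final case class Like(left: Expr, right: Expr) extends Expr
  final case class And(left: Expr, right: Expr) extends Expr
  final case class LikeAll(child: Expr, patterns: Seq[String]) extends Expr

  def build(child: Expr, patterns: Seq[Expr]): Expr = {
    require(patterns.nonEmpty, "Expected something between '(' and ')'.")
    val literals = patterns.collect { case Literal(v) => v }
    if (literals.length == patterns.length) {
      // Every pattern is a literal: use the single flat expression.
      LikeAll(child, literals)
    } else {
      // At least one pattern is not a literal: keep one Like per pattern.
      patterns.map(p => Like(child, p): Expr).reduceLeft[Expr](And(_, _))
    }
  }
}
```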
