-
Notifications
You must be signed in to change notification settings - Fork 28k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-33045][SQL] Support build-in function like_all and fix StackOverflowError issue. #29999
Conversation
Kubernetes integration test starting |
Kubernetes integration test status success |
throw new ParseException("Expected something between '(' and ')'.", ctx) | ||
} | ||
ctx.NOT match { | ||
case null => LikeAll(e, ctx.expression.asScala.map(expression)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this change disable the datasource pushdown for LIKE (e.g., StartsWith, EndsWith)? If so, we possibly get performance regression when reading datasources, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have improved the implement. It will be converted to LikeAll
when it judges that it will cause StackOverflowError
, still use the current approach, otherwise.
In the PR description, could you describe why the stack overflow can happen in the current approach and why the fix in this PR can avoid the error? |
One more question; does this PR approach has the same performance with the current one in case of the small number of elements in |
Test build #129623 has finished for PR 29999 at commit
|
Kubernetes integration test status success |
Kubernetes integration test starting |
Kubernetes integration test starting |
Kubernetes integration test status failure |
Kubernetes integration test status success |
Test build #131189 has finished for PR 29999 at commit
|
Test build #131191 has finished for PR 29999 at commit
|
retest this please |
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #131211 has finished for PR 29999 at commit
|
if (exprValue == null) { | ||
null | ||
} else { | ||
val allMatched = if (isNotLikeAll) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to improve readability:
val matchFunc: Pattern => Booolean = if (isNotLikeAll) {
p => !p.matcher(exprValue.toString).matches()
} else {
p => p.matcher(exprValue.toString).matches()
}
if (cache.forall(matchFunc)) {
if (hasNull) null else true
} else {
false
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
val valueArg = ctx.freshName("valueArg") | ||
val patternCache = ctx.addReferenceObj("patternCache", cache.asJava) | ||
|
||
val matchCode = if (isNotLikeAll) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is checkNotMatchCode
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
Show resolved
Hide resolved
ev.copy(code = | ||
code""" | ||
|${eval.code} | ||
|boolean $allMatched = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the code flow can be
boolean ${ev.isNull} = false;
boolean ${ev.value} = true;
if (${eval.isNull}) {
${ev.isNull} = true;
} else {
$javaDataType $valueArg = ${eval.value};
for ... {
if (notMatched) {
$ev.value = false;
break;
}
}
if (${ev.value} && hasNull) ${ev.isNull} = true;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I learned more! Thanks!
Kubernetes integration test starting |
Test build #131326 has finished for PR 29999 at commit
|
Kubernetes integration test status failure |
retest this please |
Kubernetes integration test starting |
Kubernetes integration test status success |
Test build #131331 has finished for PR 29999 at commit
|
thanks, merging to master! |
@cloud-fan This is causing failures in scala-2.13 build +CC @dongjoon-hyun, @srowen |
@cloud-fan @wangyum @maropu Thanks for all your work! |
.intConf | ||
.checkValue(threshold => threshold >= 0, "The maximum size of pattern sequence " + | ||
"in like all must be non-negative") | ||
.createWithDefault(200) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A tree of 200 And-reduced expressions is already a huge expr tree.
I think this could be useful and helpful with a default threshold of 5 or so already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have removed this config: beliefer@9273d42#diff-13c5b65678b327277c68d17910ae93629801af00117a0e3da007afd95b6c6764L219
We will always use the new expression for LIKE ALL if values are all literal.
What changes were proposed in this pull request?
Spark already support
LIKE ALL
syntax, but it will throwStackOverflowError
if there are many elements(more than 14378 elements). We should implement built-in function for LIKE ALL to fix this issue.Why the stack overflow can happen in the current approach ?
The current approach uses reduceLeft to connect each
Like(e, p)
, this will lead the the call depth of the thread is too large, causingStackOverflowError
problems.Why the fix in this PR can avoid the error?
This PR support built-in function for
LIKE ALL
and avoid this issue.Why are the changes needed?
1.Fix the
StackOverflowError
issue.2.Support built-in function
like_all
.Does this PR introduce any user-facing change?
'No'.
How was this patch tested?
Jenkins test.