Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-22494][SQL] Fix 64KB limit exception with Coalesce and AtleastNNonNulls #19720

Closed
wants to merge 10 commits into from

Conversation

mgaido91
Copy link
Contributor

What changes were proposed in this pull request?

Both Coalesce and AtLeastNNonNulls can cause the 64KB limit exception when used with a lot of arguments and/or complex expressions.
This PR splits their expressions in order to avoid the issue.

How was this patch tested?

Added UTs

@SparkQA
Copy link

SparkQA commented Nov 10, 2017

Test build #83707 has finished for PR 19720 at commit f0edd7e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 11, 2017

Test build #83719 has finished for PR 19720 at commit 1722d12.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

ev.copy(code = s"""
${ev.isNull} = true;
${ev.value} = ${ctx.defaultValue(dataType)};
${ctx.splitExpressions(ctx.INPUT_ROW, evals)}""")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can only split the expressions when they are created from a row object.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, I don't see which is the problem here. I see that here the row object is null, but the goal is to set ev.isNull and ev.value, which is done. May you explain me if there is something I am missing? Thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

splitExpressions puts expression codes in individual functions. Only if those expressions's input is from a row object, we can do this. If their input is from currentVars, the splitting doesn't work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It won't cause problem in fact. In that case, it works as before, i.e., expressions.mkString("\n").

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the reason why I put ev.isNull and ev.value as attributes of the generated class, in this way they can be used as before in the individual functions. If you want, I can try and use the other overloaded method splitExpressions without passing the input row to it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. nvm. looks good.

}.mkString("\n")
}

val code = ctx.splitExpressions(ctx.INPUT_ROW, evals)
Copy link
Member

@viirya viirya Nov 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

${ctx.javaType(dataType)} ${ev.value} = ${firstEval.value};""" +
rest.map { e =>
ctx.addMutableState("boolean", ev.isNull, s"")
ctx.addMutableState(ctx.javaType(dataType), ev.value, s"")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s"" -> ""

@@ -357,7 +358,8 @@ case class AtLeastNNonNulls(n: Int, children: Seq[Expression]) extends Predicate

override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
val nonnull = ctx.freshName("nonnull")
val code = children.map { e =>
ctx.addMutableState("int", nonnull, s"")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s"" -> "".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! Anyway, I am refactoring this, since I figured out a way to avoid the declaration of a global attribute. I can't do the same for coalesce unfortunately, because there I'd need to return two values from the methods.

@viirya
Copy link
Member

viirya commented Nov 12, 2017

LGTM

@SparkQA
Copy link

SparkQA commented Nov 12, 2017

Test build #83742 has finished for PR 19720 at commit 911e172.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 12, 2017

Test build #83743 has finished for PR 19720 at commit e8320d6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member

kiszk commented Nov 12, 2017

LGTM

@SparkQA
Copy link

SparkQA commented Nov 13, 2017

Test build #83754 has finished for PR 19720 at commit 924aab9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 13, 2017

Test build #83781 has finished for PR 19720 at commit 423245e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Copy link
Contributor Author

I reviewed the PR according to what came out on the conversations on the other PRs by @kiszk .

@viirya @kiszk May I kindly ask you to review it again now? Thank you very much.

@SparkQA
Copy link

SparkQA commented Nov 13, 2017

Test build #83797 has finished for PR 19720 at commit a548ddb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

"""
},
foldFunctions = { funcCalls =>
funcCalls.map { funcCall => s"$nonnull = $funcCall;" }.mkString("\n")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

funcCalls.map(funcCall => s"$nonnull = $funcCall;").mkString("\n")

val code = if (ctx.INPUT_ROW == null || ctx.currentVars != null) {
evals.mkString("\n")
} else {
ctx.splitExpressions(evals, "atLeastNNonNull",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: atLeastNNonNulls

@@ -357,7 +358,7 @@ case class AtLeastNNonNulls(n: Int, children: Seq[Expression]) extends Predicate

override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
val nonnull = ctx.freshName("nonnull")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it really matter not to have nonnull as a global variable? The code is simpler if it is a global one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it does, because having it as a global variable may put more pressure on the constant pool (see SPARK-18016). Thus, whenever feasible, I do think that we should keep it local.

@SparkQA
Copy link

SparkQA commented Nov 14, 2017

Test build #83843 has finished for PR 19720 at commit 3a5c683.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member

viirya commented Nov 14, 2017

LGTM

boolean ${ev.isNull} = ${firstEval.isNull};
${ctx.javaType(dataType)} ${ev.value} = ${firstEval.value};""" +
rest.map { e =>
ctx.addMutableState("boolean", ev.isNull, "")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we ensure ev.isNull always has a variable name? In other words, ev.isNull never has true or false.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the previous code there is the same assumption, thus yes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is guaranteed by Expression.genCode.

asfgit pushed a commit that referenced this pull request Nov 16, 2017
…NNonNulls

## What changes were proposed in this pull request?

Both `Coalesce` and `AtLeastNNonNulls` can cause the 64KB limit exception when used with a lot of arguments and/or complex expressions.
This PR splits their expressions in order to avoid the issue.

## How was this patch tested?

Added UTs

Author: Marco Gaido <marcogaido91@gmail.com>
Author: Marco Gaido <mgaido@hortonworks.com>

Closes #19720 from mgaido91/SPARK-22494.

(cherry picked from commit 4e7f07e)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan
Copy link
Contributor

thanks, merging to master/2.2!

@asfgit asfgit closed this in 4e7f07e Nov 16, 2017
@mgaido91
Copy link
Contributor Author

@cloud-fan please do not backport this to 2.2. In 2.2 we don't have SPARK-18016 and this is adding new variables in the case of coalesce. Thus it can generate an higher pressure on the constant pool and this may even cause a regression IMHO.

@cloud-fan
Copy link
Contributor

hmm, isn't running slower better than can't run?

@mgaido91
Copy link
Contributor Author

It's not about running slower. This PR solves the problem which makes the user facing an exception if there are a lot of arguments in coalesce (or AtLestNNonNulls), but what I am doing in the coalesce function here is that I am adding to variables for each coalesce function. If there is a query with a lot of coalesce function (instead of a coalesce with a lot of parameters), this might result in having much more variables than before. This can cause the problem and the exception described in SPARK-18016. Thus a query that was previously running can fail.

The same thing is true for all the other PRs similar to this one submitted by @kiszk. Then, we should keep all these changes only on master, where part of SPARK-18016 is landing and hopefully soon it will be completely solved.

@cloud-fan
Copy link
Contributor

If there is a query with a lot of coalesce function, wouldn't it hit the 64kb issue?

@mgaido91
Copy link
Contributor Author

No, a query with a coalesce with many/complex parameters will hit this problem. A query with a lot of small coalesce will not have the problem.
For AtLeastNNonNulls the fix would be safe to be backported, because no class variables are defined, but for coalesce it is safer to fix it only with SPARK-18016. In particular, the ongoing PR will solve the issue.
The same is true also for all the other similar PRs.
Maybe what we can do to backport this to branch-2.2 is to do the splitting and define class level variables only after a threshold of parameter is met, otherwise we go on with the previous code generation (without splitting). In this way we don't have any regression.
Or maybe we can backport to 2.2 only those fix which are not introducing class level variables, like for AtLeastNNonNulls.
Actually I think that the most important of all of these fixes is AtLeastNNonNulls indeed, because it is used to drop rows containing all nulls and this fails with dataset with a lot of columns before this PR. All the other functions are less likely to have a huge amount of parameters, despite this may happen and we should support it.

@cloud-fan
Copy link
Contributor

I don't have a strong preference, but there were many 64kb compile error fixes for 2.2 or prior(e.g. CreateStruct, CreateArray, Invoke, CreateExternalRow, erc.). They all add more global variables, and it's werid to me that we stop doing this because the master branch has SPARK-18016.

@kiszk what do you think?

MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
…NNonNulls

## What changes were proposed in this pull request?

Both `Coalesce` and `AtLeastNNonNulls` can cause the 64KB limit exception when used with a lot of arguments and/or complex expressions.
This PR splits their expressions in order to avoid the issue.

## How was this patch tested?

Added UTs

Author: Marco Gaido <marcogaido91@gmail.com>
Author: Marco Gaido <mgaido@hortonworks.com>

Closes apache#19720 from mgaido91/SPARK-22494.

(cherry picked from commit 4e7f07e)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants