Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-37019][SQL] Add codegen support to array higher-order functions #34558

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

Kimahriman
Copy link
Contributor

@Kimahriman Kimahriman commented Nov 11, 2021

What changes were proposed in this pull request?

This PR adds codegen support to array based higher order functions except ArraySort. This is my first time playing around with codegen, so definitely looking for any feedback.

A few notes:

  • Disabled subexpression elimination for lambda functions (this already was the case because it was CodegenFallback). I plan to explore supprting subexpression elimination inside lambda functions later on, as it will require special handling.
  • I set the AtomicReference for all lambda values as well in case a child expression reverts to interpreted evaluation for any reason (CodegenFallback or otherwise)

Why are the changes needed?

To improve performance of array higher-order function operations, letting the children be codegen'd and participate in WholeStageCodegen

Does this PR introduce any user-facing change?

No, only performance improvements.

How was this patch tested?

Existing unit tests, let me know if there's other codegen-specific unit tests I should add.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@Kimahriman
Copy link
Contributor Author

@viirya @HyukjinKwon @cloud-fan any thoughts or know who might have thoughts?

@Tagar
Copy link

Tagar commented Mar 15, 2022

@Kimahriman just out of curiosity, how much did the performance improve?

@Kimahriman
Copy link
Contributor Author

It's hard to say because when I tested this out on my production jobs (actually still actively using it), I had several other changes too. I'm not sure if there are any benchmarks involving HOFs? Though it's highly dependent on what the lambda function is, and honestly that's one of the main benefits, the lambda functions themselves can be codegen'd instead of eval'd.

I also have a larger goal to support subexpression elimination inside lambda functions, because that's where I've found our biggest problem is. #34727 is also part of that goal.

@Kimahriman Kimahriman force-pushed the array-hof-codegen branch 2 times, most recently from 9cec788 to 8b898b0 Compare July 12, 2022 11:44
@Kimahriman Kimahriman force-pushed the array-hof-codegen branch 2 times, most recently from a565a82 to 92d9a9f Compare November 30, 2022 23:36
Copy link
Contributor

@jaceklaskowski jaceklaskowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be a lot of repetition. Wish it could be avoided somehow but can't help though (beside nit-picking).

}

val result = f(lambdaVars)
namedLambdas.foreach(v => currentLambdaVars.remove(v.variableName))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/v/lamba? Don't want to ask for namedLambda as used earlier, but v does not fit really.

I'd also consider namedLambdas.map(_.variableName).foreach(currentLambdaVars.remove)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I never know all the exact ways that using _ does and doesn't work when you use a function call, struggle with the cleanest way to write things like this. I like the second one I think

}

def getLambdaVar(name: String): ExprCode = {
currentLambdaVars.getOrElse(name, {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Are these curly brackets required?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw this format in other parts of the code, but I also have seen it without, so can remove


// We need to include the Expr ID in the Codegen variable name since several tests bypass
// `UnresolvedNamedLambdaVariable.freshVarName`
lazy val variableName = s"${name}_${exprId.id}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not seem consistent with simpleString. Is this intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm must have thought it was used in the actual generated code or something but doesn't look like it is. In fact I don't think variableName is really needed at all, can just use exprId.id as the map key, and name for the code comment

@@ -350,6 +445,49 @@ case class ArrayTransform(
result
}

override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
ctx.withLambdaVars(Seq(elementVar) ++ indexVar, { lambdaExprs =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Be consistent with {s; here, { is before an input argument to a function literal while a few lines below it's after =>.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I did that because of the instances where I used the case expression to pull out the single element list, will update these for consistency

@Kimahriman
Copy link
Contributor Author

There seems to be a lot of repetition. Wish it could be avoided somehow but can't help though (beside nit-picking).

Thanks for the review! I tried to get as much common code in the parent classes as I could, can take another pass to see if anything jumps out for deduping

@Kimahriman Kimahriman force-pushed the array-hof-codegen branch 2 times, most recently from dcfb17c to 05f05cf Compare April 29, 2023 13:30
private def childrenToRecurse(expr: Expression): Seq[Expression] = expr match {
case _: CodegenFallback => Nil
case c: ConditionalExpression => c.alwaysEvaluatedInputs.map(skipForShortcut)
case h: HigherOrderFunction => h.arguments
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to do the same for commonChildrenToRecurse?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. That only cares if the current expression is a ConditionalExpression. The default is Nil if it's not that so I don't think it needs any special handling for HOFs

@@ -130,6 +134,23 @@ case class LambdaFunction(

override def eval(input: InternalRow): Any = function.eval(input)

override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
val functionCode = function.genCode(ctx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems we can just return function.genCode(ctx)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code here is only to optimize the non-nullable case, which should be handled by the function body codegen already.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to wrap my ahead around this again. I think they might be doing different things? Like for ArrayTransform, the nullSafeCodeGen is only optimizing the non-nullable case of the result of ArrayTransform, whereas this is optimizing the non-nullable case of the result of the lambda function being called for each element in the ArrayTransform

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But maybe this already handles that:

        val resultNull = if (function.nullable) Some(functionCode.isNull.toString) else None
        val resultAssignment = CodeGenerator.setArrayElement(arrayData, dataType.elementType,
          i, copy, isNull = resultNull)

I'll need to think about this a little more

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay nevermind I see what you mean now, it's just assigning the same values to new variables for no reason. Changed to just return function.genCode(ctx)

elementVar: NamedLambdaVariable, index: String): String = {
val elementType = elementVar.dataType
val elementAtomic = ctx.addReferenceObj(elementVar.name, elementVar.value)
val extractElement = CodeGenerator.getValue(arrayName, elementType, index)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With codegen, I don't think we need a global atomic reference to represent the lambda variable. We can just generate local variables in the java code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The atomic references are for cases where you might have a CodegenFallback expression inside your lambda function. I think the only way an expression.eval(input) will work is if this atomic reference is set to the lambda variable value as well. I could potentially try to detect if the function has a CodegenFallback expression, but I wasn't sure if there were other reasons some expression would fall back to interpreted inside the lambda function that I couldn't detect.

Copy link
Contributor

@jaceklaskowski jaceklaskowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More nitting 😉

*/
var currentLambdaVars: mutable.Map[Long, ExprCode] = mutable.HashMap.empty

def withLambdaVars(namedLambdas: Seq[NamedLambdaVariable],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put namedLambdas... on a separate line?


def withLambdaVars(namedLambdas: Seq[NamedLambdaVariable],
f: Seq[ExprCode] => ExprCode): ExprCode = {
val lambdaVars = namedLambdas.map { namedLambda =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Replace namedLambda to lambda?

}

def getLambdaVar(id: Long): ExprCode = {
currentLambdaVars.getOrElse(id,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move id to its own new line? 🙏

protected def nullSafeCodeGen(
ctx: CodegenContext,
ev: ExprCode,
f: String => String): ExprCode = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be very happy if you use this parameter formatting style in the other places in your PR. Makes reading so much easier, esp. with functions with 5+ params. Can you make such change? 🙏

})
})
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a copy and paste of exists, doesn't it? Can we have a parent class for some sharing? Unless I'm mistaken, the generated code block is the exact copy except $forall (vs $exists).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have to look at this a little bit. It's tricky because there's a few places they are different (default true vs false, !'d for one and not the other, the followThreeValuedLogic flag for exists)

@Kimahriman Kimahriman closed this Jun 23, 2023
@Kimahriman Kimahriman reopened this Jun 23, 2023
@Kimahriman
Copy link
Contributor Author

More nitting 😉

Yeah I struggle with the right formatting for multiline things in scala, tried to update all the suggestions, thanks for the tips!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
5 participants