
[SPARK-29013][SQL] Structurally equivalent subexpression elimination #25717

Closed
wants to merge 7 commits

Conversation

viirya
Member

@viirya viirya commented Sep 6, 2019

What changes were proposed in this pull request?

We do semantically equivalent subexpression elimination in SparkSQL. However, for expressions that are not semantically equivalent but are structurally equivalent, the current subexpression elimination generates too many similar functions. These functions share the same computation structure and differ only in the input slots of the row being processed.

For example, expression a is input[1] + input[2] and expression b is input[3] + input[4]. They are not semantically equivalent in SparkSQL, but they perform the same computation on different input data.

For such expressions, we can generate just one function and pass the input slots in at runtime.

This can reduce the length of the generated code and save compilation time.
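To make the idea concrete, here is a rough hand-written sketch (not the actual generated Java code) of what sharing a single function with runtime input slots would look like; sharedSubexpr and the example row are made up for illustration:

import org.apache.spark.sql.catalyst.InternalRow

// Hypothetical hand-written equivalent of the shared function: the input
// ordinals become runtime parameters instead of being baked into two
// separately generated methods.
def sharedSubexpr(row: InternalRow, leftOrdinal: Int, rightOrdinal: Int): Int =
  row.getInt(leftOrdinal) + row.getInt(rightOrdinal)

val row = InternalRow(0, 10, 20, 30, 40)
val a = sharedSubexpr(row, 1, 2) // expression a: input[1] + input[2]
val b = sharedSubexpr(row, 3, 4) // expression b: input[3] + input[4]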

Why are the changes needed?

For complex queries, the current subexpression elimination can generate too many similar functions, which leads to long generated code and increased compilation time.

For example, run the following query:

val df = spark.range(2).selectExpr((0 to 5000).map(i => s"id as field_$i"): _*)
df.createOrReplaceTempView("spark64kb")
val data = spark.sql("select * from spark64kb limit 10")
data.describe()

The longest compilation time observed for this query is 25816.203394 ms. After this patch, the same compilation is reduced to 9143.778397 ms.

Does this PR introduce any user-facing change?

No, this doesn't introduce a user-facing change.
The feature is controlled by a SQL config, spark.sql.structuralSubexpressionElimination.enabled.
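For reference, the flag named above would presumably be toggled like this (a sketch based on the config name in this description, not verified against the final patch):

spark.conf.set("spark.sql.structuralSubexpressionElimination.enabled", "true")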

How was this patch tested?

Added tests.

@viirya
Member Author

viirya commented Sep 6, 2019

I observed this problem when running the test in #25642 .

This is WIP for now. I would like to see whether all tests pass and to hear opinions from others.

@viirya viirya force-pushed the structural-subexpr branch 2 times, most recently from 4be455c to c5d443a on September 6, 2019 21:12
@viirya
Member Author

viirya commented Sep 7, 2019

@maropu
Member

maropu commented Sep 7, 2019

Ur, it looks like a cool idea. I'll check the code tonight.

@mgaido91
Contributor

mgaido91 commented Sep 7, 2019

Thanks @viirya. I was actually thinking the same thing, so I like your proposal. Just one question: why are you limiting this to BoundReferences? Can't we generalize this?

@viirya
Member Author

viirya commented Sep 8, 2019

Thanks @maropu @mgaido91

For subexpression elimination, I observed that many of the generated similar functions differ only in their input slots. That is why this works on BoundReference: we can lift the ordinals out of the generated functions as parameters to achieve this.

I think it is probably very hard to generalize this.
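To illustrate what "structurally equivalent" means here, a minimal sketch that normalizes expressions by zeroing out BoundReference ordinals (structuralKey is a hypothetical helper for illustration, not part of this patch; the same normalization idea comes up later in the review):

import org.apache.spark.sql.catalyst.expressions.{Add, BoundReference, Expression}
import org.apache.spark.sql.types.IntegerType

// Zero out every BoundReference ordinal so that expressions differing only in
// their input slots compare as equal.
def structuralKey(expr: Expression): Expression = expr.transformUp {
  case b: BoundReference => b.copy(ordinal = 0)
}

val a = Add(BoundReference(1, IntegerType, nullable = false),
            BoundReference(2, IntegerType, nullable = false))
val b = Add(BoundReference(3, IntegerType, nullable = false),
            BoundReference(4, IntegerType, nullable = false))

assert(!a.semanticEquals(b))                              // not semantically equivalent
assert(structuralKey(a).semanticEquals(structuralKey(b))) // but structurally equivalent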

* structurally equivalent expressions. Non-recursive.
*/
def addStructExpr(ctx: CodegenContext, expr: Expression): Unit = {
if (expr.deterministic) {
Member
Can't we share a function even for non-deterministic cases? e.g.,

int subExpr1 = input[0] + random();
int subExpr2 = input[1] + random();
=>
int subExpr1 = subExpr(input[0]);
int subExpr2 = subExpr(input[1]);

int subExpr(int v) { return v + random(); }

Member Author
Non-deterministic expressions can't do sub-expression elimination.

Member
oh, I see.

Member
btw, is this idea limited to common subexprs? For example, could it cover a case like the following?

select sum(a + b), sum(b + c), sum(c + d), sum(d + e) from values (1, 1, 1, 1, 1) t(a, b, c, d, e)


Member Author
It is probably applicable, provided we want these functions (sum(a + b), etc.) to be called in split functions. Their inputs can be parameterized.

* but different slots of input tuple, we replace `BoundReference` with this parameterized
* version. The slot position is parameterized and is given at runtime.
*/
case class ParameterizedBoundReference(parameter: String, dataType: DataType, nullable: Boolean)
Member
Since this is only used for codegen, how about moving this to org.apache.spark.sql.catalyst.expressions.codegen?

Member
@maropu maropu Sep 8, 2019

nit: parameter -> variableNameForOrdinal or paramNameForOrdinal?

Member Author
Yeah, ok, I will.

private def parameterizedBoundReferences(ctx: CodegenContext, expr: Expression): Expression = {
expr.transformUp {
case b: BoundReference =>
val param = ctx.freshName("boundInput")
Member
nit: boundInput -> ordinal?


if (!skip && !addExpr(expr)) {
childrenToRecurse.foreach(addExprTree)
if (!skip) {
Member
Can't we do it like !skip && !addStructExpr(expr, exprMap), in the same way as addExprTree?

Member Author
Well, this one also recursively adds the children even when the parent expr was added. I tested both variants while prototyping this; adding the children saves more code text.

case class StructuralExpr(e: Expression) {
def normalized(expr: Expression): Expression = {
expr.transformUp {
case b: ParameterizedBoundReference =>
Member
To avoid unnecessary plan copies, can we check this equality based on BoundReference (by just copying it like b.copy(ordinal = 0 or -1?))? IIUC it's ok to replace BoundReference with ParameterizedBoundReference only when generating code in https://github.com/apache/spark/pull/25717/files#diff-8bcc5aea39c73d4bf38aef6f6951d42cR1117?

Member Author
Isn't b.copy also copying the input expr?

Member
Ah, I see. But, if we write it like this (c1c5052), we don't need to pass CodegenContext into EquivalentExpressions?

Member Author
Ok. looks good.

Member Author
I just modified it this way, but it failed HashAggregationQuerySuite on Jenkins and locally.

Member Author
@viirya viirya Sep 14, 2019

I think I figured out why. Made another commit.

Member
The current one looks super good to me. Thanks!


override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
assert(ctx.currentVars == null && ctx.INPUT_ROW != null,
"ParameterizedBoundReference can not be used in whole-stage codegen yet.")
Member
Any barrier to supporting the whole-stage codegen case?

Member Author
I think it can be used in whole-stage codegen. But since this is applied to subexpression elimination, which currently happens only outside whole-stage codegen, I'd also like to limit the amount of code changed in a single PR.

def addStructExpr(ctx: CodegenContext, expr: Expression): Unit = {
if (expr.deterministic) {
val refs = expr.collect {
case b: BoundReference => b
Member
nit: case b: BoundReference => Literal(0)?

// We calculate the function parameter length as the number of ints plus `INPUT_ROW` plus
// an int-type result array index.
val parameterLength = CodeGenerator.calculateParamLength(refs.map(_ => Literal(0))) + 2
if (CodeGenerator.isValidParamLength(parameterLength)) {
Member
If the length goes over the limit, does the current logic give up eliminating common exprs? If so, can we fall back to the non-structural mode?

Member Author
Good idea.

@SparkQA

SparkQA commented Sep 8, 2019

Test build #110315 has finished for PR 25717 at commit 13f5ca6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ParameterizedBoundReference(parameter: String, dataType: DataType, nullable: Boolean)
  • case class StructuralExpr(e: Expression)

@viirya
Member Author

viirya commented Sep 9, 2019

retest this please

@SparkQA

SparkQA commented Sep 9, 2019

Test build #110320 has finished for PR 25717 at commit 13f5ca6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ParameterizedBoundReference(parameter: String, dataType: DataType, nullable: Boolean)
  • case class StructuralExpr(e: Expression)

@viirya
Member Author

viirya commented Sep 9, 2019

retest this please

@SparkQA

SparkQA commented Sep 9, 2019

Test build #110352 has finished for PR 25717 at commit f52cbde.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 10, 2019

Test build #110379 has finished for PR 25717 at commit cc0ee12.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ParameterizedBoundReference(ordinalParam: String, dataType: DataType, nullable: Boolean)

@viirya viirya changed the title [WIP][SPARK-29013][SQL] Structurally equivalent subexpression elimination [SPARK-29013][SQL] Structurally equivalent subexpression elimination Sep 10, 2019
@SparkQA

SparkQA commented Sep 10, 2019

Test build #110397 has finished for PR 25717 at commit f447042.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member

kiszk commented Sep 10, 2019

retest this please

@SparkQA

SparkQA commented Sep 10, 2019

Test build #110407 has finished for PR 25717 at commit f447042.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 14, 2019

Test build #110594 has finished for PR 25717 at commit 69b6ba4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 14, 2019

Test build #110598 has finished for PR 25717 at commit f447042.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Sep 14, 2019

@cloud-fan @rednaxelafx @mgaido91 @kiszk Can anyone check this?

@maropu
Member

maropu commented Sep 14, 2019

btw, this PR doesn't include end-to-end tests; do you think the queries in the existing tests are enough as end-to-end coverage for this PR? I'm not sure how many of the existing tests include structurally equivalent exprs though...

Also, are there any negative performance impacts on existing queries, e.g., TPCDS?

@viirya
Member Author

viirya commented Sep 15, 2019

btw, this PR doesn't include end-to-end tests; do you think the queries in the existing tests are enough as end-to-end coverage for this PR? I'm not sure how many of the existing tests include structurally equivalent exprs though...

Regarding end-to-end tests, I am not sure what tests we need. If there are end-to-end tests for the current subexpression elimination, I can write tests based on those.

Also, are there any negative performance impacts on existing queries, e.g., TPCDS?

I have not run a TPCDS benchmark with this. Is TPCDSQueryBenchmark alone enough?

@SparkQA

SparkQA commented Sep 15, 2019

Test build #110602 has finished for PR 25717 at commit 71e0239.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member

kiszk commented Sep 15, 2019

I will take a look on Monday or Tuesday.

/**
* Adds each expression to this data structure, grouping them with existing equivalent
* expressions. Non-recursive.
* Returns true if there was already a matching expression.
*/
def addExpr(expr: Expression): Boolean = {
def addExpr(expr: Expression, exprMap: EquivalenceMap = this.equivalenceMap): Boolean = {
Member
nit: Do we need = this.equivalenceMap? It seems that all of the callers pass two arguments.

Member Author
addExpr is also used in PhysicalAggregation:

expr.collect {
// addExpr() always returns false for non-deterministic expressions and do not add them.
case agg: AggregateExpression
if !equivalentAggregateExpressions.addExpr(agg) => agg
case udf: PythonUDF
if PythonUDF.isGroupedAggPandasUDF(udf) &&
!equivalentAggregateExpressions.addExpr(udf) => udf
}

if (!skip && !addExpr(expr)) {
childrenToRecurse.foreach(addExprTree)
if (!skip && addStructExpr(expr)) {
childrenToRecurse(expr).foreach(addStructuralExprTree)
Member
nit: do we want to add (_)?

Member Author
addExprTree doesn't add (_) either; I just followed that. If there is no special reason, I will leave it as is.

Member
I am neutral on this.
I was just curious why that line adds (_) but this one does not.

Member Author
At that line, if we don't add (_) when calling addExprTree, we get a compilation error:

[error]  found   : (org.apache.spark.sql.catalyst.expressions.Expression, equivalentExpressions.EquivalenceMap) => Unit
[error]     (which expands to)  (org.apache.spark.sql.catalyst.expressions.Expression, scala.collection.mutable.HashMap[equivalentExpressions.Expr,scala.collection.mutable.ArrayBuffer[org.apache.spark.sql.catalyst.expressions.Expression]]) => Unit
[error]  required: org.apache.spark.sql.catalyst.expressions.Expression => ?
[error]     expressions.foreach(equivalentExpressions.addExprTree)
[error]                                               ^

Because addExprTree actually takes two arguments, it doesn't match the argument type foreach expects.
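A minimal standalone illustration of this (not Spark code; the names are made up): a method with a defaulted second parameter eta-expands to a two-argument function, so it cannot be passed to foreach directly, while the (_) placeholder lets the default fill in.

import scala.collection.mutable.ArrayBuffer

def addTree(e: Int, acc: ArrayBuffer[Int] = ArrayBuffer.empty): Unit = acc += e

val xs = Seq(1, 2, 3)
// xs.foreach(addTree)   // does not compile: found (Int, ArrayBuffer[Int]) => Unit, required Int => ?
xs.foreach(addTree(_))   // compiles: expands to x => addTree(x), the default fills the second argument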

@SparkQA

SparkQA commented Sep 17, 2019

Test build #110665 has finished for PR 25717 at commit 4700b89.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@github-actions

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

@github-actions github-actions bot added the Stale label Dec 27, 2019
@viirya viirya closed this Dec 27, 2019
@viirya viirya deleted the structural-subexpr branch December 27, 2023 18:37