Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-37392][SQL] Fix the performance bug when inferring constraints for Generate #34823

Closed
wants to merge 2 commits into from

Conversation

cloud-fan
Copy link
Contributor

What changes were proposed in this pull request?

This is a performance regression since Spark 3.1, caused by https://issues.apache.org/jira/browse/SPARK-32295

If you run the query in the JIRA ticket

Seq(
  (1, "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x")
).toDF()
  .checkpoint() // or save and reload to truncate lineage
  .createOrReplaceTempView("sub")

session.sql("""
  SELECT
    *
  FROM
  (
    SELECT
      EXPLODE( ARRAY( * ) ) result
    FROM
    (
      SELECT
        _1 a, _2 b, _3 c, _4 d, _5 e, _6 f, _7 g, _8 h, _9 i, _10 j, _11 k, _12 l, _13 m, _14 n, _15 o, _16 p, _17 q, _18 r, _19 s, _20 t, _21 u
      FROM
        sub
    )
  )
  WHERE
    result != ''
  """).show() 

You will hit OOM. The reason is that:

  1. We infer additional predicates with Generate. In this case, it's size(array(cast(_1#21 as string), _2#22, _3#23, ...) > 0
  2. Because of the cast, the ConstantFolding rule can't optimize this size(array(...)).
  3. We end up with a plan containing this part
   +- Project [_1#21 AS a#106, _2#22 AS b#107, _3#23 AS c#108, _4#24 AS d#109, _5#25 AS e#110, _6#26 AS f#111, _7#27 AS g#112, _8#28 AS h#113, _9#29 AS i#114, _10#30 AS j#115, _11#31 AS k#116, _12#32 AS l#117, _13#33 AS m#118, _14#34 AS n#119, _15#35 AS o#120, _16#36 AS p#121, _17#37 AS q#122, _18#38 AS r#123, _19#39 AS s#124, _20#40 AS t#125, _21#41 AS u#126]
      +- Filter (size(array(cast(_1#21 as string), _2#22, _3#23, _4#24, _5#25, _6#26, _7#27, _8#28, _9#29, _10#30, _11#31, _12#32, _13#33, _14#34, _15#35, _16#36, _17#37, _18#38, _19#39, _20#40, _21#41), true) > 0)
         +- LogicalRDD [_1#21, _2#22, _3#23, _4#24, _5#25, _6#26, _7#27, _8#28, _9#29, _10#30, _11#31, _12#32, _13#33, _14#34, _15#35, _16#36, _17#37, _18#38, _19#39, _20#40, _21#41] 

When calculating the constraints of the Project, we generate around 2^20 expressions, due to this code

var allConstraints = child.constraints
projectList.foreach {
  case a @ Alias(l: Literal, _) =>
    allConstraints += EqualNullSafe(a.toAttribute, l)
  case a @ Alias(e, _) =>
    // For every alias in `projectList`, replace the reference in constraints by its attribute.
    allConstraints ++= allConstraints.map(_ transform {
      case expr: Expression if expr.semanticEquals(e) =>
        a.toAttribute
    })
    allConstraints += EqualNullSafe(e, a.toAttribute)
  case _ => // Don't change.
} 

There are 3 issues here:

  1. We may infer complicated predicates from Generate
  2. ConstanFolding rule is too conservative. At least Cast has no side effect with ANSI-off.
  3. When calculating constraints, we should have a upper bound to avoid generating too many expressions.

This fixes the first 2 issues, and leaves the third one for the future.

Why are the changes needed?

fix a performance issue

Does this PR introduce any user-facing change?

no

How was this patch tested?

new tests, and run the query in JIRA ticket locally.

// - The input expression may fail to be evaluated under ANSI mode. If we reorder the
// predicates and evaluate the input expression first, we may fail the query unexpectedly.
// To be safe, here we only generate extra predicates if the input is an attribute.
if (input.isInstanceOf[Attribute]) {
Copy link
Contributor Author

@cloud-fan cloud-fan Dec 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's almost useless to generate predicate with CreateArray/CreateMap. Size(CreateArray(...)) > 0 is always true unless you create an empty array.

@@ -47,6 +47,7 @@ object ConstantFolding extends Rule[LogicalPlan] {
private def hasNoSideEffect(e: Expression): Boolean = e match {
case _: Attribute => true
case _: Literal => true
case c: Cast if !conf.ansiEnabled => hasNoSideEffect(c.child)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either this change or the change in InferFiltersFromGenerate can fix the perf issue. But I keep both fixes to be super safe.

@cloud-fan
Copy link
Contributor Author

cc @JoshRosen @maropu @viirya

@SparkQA
Copy link

SparkQA commented Dec 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50435/

@SparkQA
Copy link

SparkQA commented Dec 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50436/

@SparkQA
Copy link

SparkQA commented Dec 6, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50435/

@SparkQA
Copy link

SparkQA commented Dec 6, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50436/

@SparkQA
Copy link

SparkQA commented Dec 6, 2021

Test build #145961 has finished for PR 34823 at commit 14f1241.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fix looks okay to me.

@wangyum
Copy link
Member

wangyum commented Dec 7, 2021

Calculating the constraints of the Project will also cause performance problems for other rules, such as: #26257

Copy link
Contributor

@sigmod sigmod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the catch!

@cloud-fan
Copy link
Contributor Author

Calculating the constraints of the Project will also cause performance problems for other rules

Yea, and it's the issue 3 I mentioned in the PR description. This is hard to fix, and I don't want to include it in this PR that needs to be backported.


// setup rules to test inferFilters with ConstantFolding to make sure
// the Filter rule added in inferFilters is removed again when doing
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need this now, because we don't infer the filters at the first place, instead of removing the inferred filters later.

@SparkQA
Copy link

SparkQA commented Dec 7, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50464/

@SparkQA
Copy link

SparkQA commented Dec 7, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50464/

@SparkQA
Copy link

SparkQA commented Dec 7, 2021

Test build #145988 has finished for PR 34823 at commit b79eaf3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan closed this in 1fac7a9 Dec 8, 2021
cloud-fan added a commit that referenced this pull request Dec 8, 2021
… for Generate

### What changes were proposed in this pull request?

This is a performance regression since Spark 3.1, caused by https://issues.apache.org/jira/browse/SPARK-32295

If you run the query in the JIRA ticket
```
Seq(
  (1, "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x")
).toDF()
  .checkpoint() // or save and reload to truncate lineage
  .createOrReplaceTempView("sub")

session.sql("""
  SELECT
    *
  FROM
  (
    SELECT
      EXPLODE( ARRAY( * ) ) result
    FROM
    (
      SELECT
        _1 a, _2 b, _3 c, _4 d, _5 e, _6 f, _7 g, _8 h, _9 i, _10 j, _11 k, _12 l, _13 m, _14 n, _15 o, _16 p, _17 q, _18 r, _19 s, _20 t, _21 u
      FROM
        sub
    )
  )
  WHERE
    result != ''
  """).show()
```
You will hit OOM. The reason is that:
1. We infer additional predicates with `Generate`. In this case, it's `size(array(cast(_1#21 as string), _2#22, _3#23, ...) > 0`
2. Because of the cast, the `ConstantFolding` rule can't optimize this `size(array(...))`.
3. We end up with a plan containing this part
```
   +- Project [_1#21 AS a#106, _2#22 AS b#107, _3#23 AS c#108, _4#24 AS d#109, _5#25 AS e#110, _6#26 AS f#111, _7#27 AS g#112, _8#28 AS h#113, _9#29 AS i#114, _10#30 AS j#115, _11#31 AS k#116, _12#32 AS l#117, _13#33 AS m#118, _14#34 AS n#119, _15#35 AS o#120, _16#36 AS p#121, _17#37 AS q#122, _18#38 AS r#123, _19#39 AS s#124, _20#40 AS t#125, _21#41 AS u#126]
      +- Filter (size(array(cast(_1#21 as string), _2#22, _3#23, _4#24, _5#25, _6#26, _7#27, _8#28, _9#29, _10#30, _11#31, _12#32, _13#33, _14#34, _15#35, _16#36, _17#37, _18#38, _19#39, _20#40, _21#41), true) > 0)
         +- LogicalRDD [_1#21, _2#22, _3#23, _4#24, _5#25, _6#26, _7#27, _8#28, _9#29, _10#30, _11#31, _12#32, _13#33, _14#34, _15#35, _16#36, _17#37, _18#38, _19#39, _20#40, _21#41]
```
When calculating the constraints of the `Project`, we generate around 2^20 expressions, due to this code
```
var allConstraints = child.constraints
projectList.foreach {
  case a  Alias(l: Literal, _) =>
    allConstraints += EqualNullSafe(a.toAttribute, l)
  case a  Alias(e, _) =>
    // For every alias in `projectList`, replace the reference in constraints by its attribute.
    allConstraints ++= allConstraints.map(_ transform {
      case expr: Expression if expr.semanticEquals(e) =>
        a.toAttribute
    })
    allConstraints += EqualNullSafe(e, a.toAttribute)
  case _ => // Don't change.
}
```

There are 3 issues here:
1. We may infer complicated predicates from `Generate`
2. `ConstanFolding` rule is too conservative. At least `Cast` has no side effect with ANSI-off.
3. When calculating constraints, we should have a upper bound to avoid generating too many expressions.

This fixes the first 2 issues, and leaves the third one for the future.

### Why are the changes needed?

fix a performance issue

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new tests, and run the query in JIRA ticket locally.

Closes #34823 from cloud-fan/perf.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 1fac7a9)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan added a commit that referenced this pull request Dec 8, 2021
… for Generate

This is a performance regression since Spark 3.1, caused by https://issues.apache.org/jira/browse/SPARK-32295

If you run the query in the JIRA ticket
```
Seq(
  (1, "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x")
).toDF()
  .checkpoint() // or save and reload to truncate lineage
  .createOrReplaceTempView("sub")

session.sql("""
  SELECT
    *
  FROM
  (
    SELECT
      EXPLODE( ARRAY( * ) ) result
    FROM
    (
      SELECT
        _1 a, _2 b, _3 c, _4 d, _5 e, _6 f, _7 g, _8 h, _9 i, _10 j, _11 k, _12 l, _13 m, _14 n, _15 o, _16 p, _17 q, _18 r, _19 s, _20 t, _21 u
      FROM
        sub
    )
  )
  WHERE
    result != ''
  """).show()
```
You will hit OOM. The reason is that:
1. We infer additional predicates with `Generate`. In this case, it's `size(array(cast(_1#21 as string), _2#22, _3#23, ...) > 0`
2. Because of the cast, the `ConstantFolding` rule can't optimize this `size(array(...))`.
3. We end up with a plan containing this part
```
   +- Project [_1#21 AS a#106, _2#22 AS b#107, _3#23 AS c#108, _4#24 AS d#109, _5#25 AS e#110, _6#26 AS f#111, _7#27 AS g#112, _8#28 AS h#113, _9#29 AS i#114, _10#30 AS j#115, _11#31 AS k#116, _12#32 AS l#117, _13#33 AS m#118, _14#34 AS n#119, _15#35 AS o#120, _16#36 AS p#121, _17#37 AS q#122, _18#38 AS r#123, _19#39 AS s#124, _20#40 AS t#125, _21#41 AS u#126]
      +- Filter (size(array(cast(_1#21 as string), _2#22, _3#23, _4#24, _5#25, _6#26, _7#27, _8#28, _9#29, _10#30, _11#31, _12#32, _13#33, _14#34, _15#35, _16#36, _17#37, _18#38, _19#39, _20#40, _21#41), true) > 0)
         +- LogicalRDD [_1#21, _2#22, _3#23, _4#24, _5#25, _6#26, _7#27, _8#28, _9#29, _10#30, _11#31, _12#32, _13#33, _14#34, _15#35, _16#36, _17#37, _18#38, _19#39, _20#40, _21#41]
```
When calculating the constraints of the `Project`, we generate around 2^20 expressions, due to this code
```
var allConstraints = child.constraints
projectList.foreach {
  case a  Alias(l: Literal, _) =>
    allConstraints += EqualNullSafe(a.toAttribute, l)
  case a  Alias(e, _) =>
    // For every alias in `projectList`, replace the reference in constraints by its attribute.
    allConstraints ++= allConstraints.map(_ transform {
      case expr: Expression if expr.semanticEquals(e) =>
        a.toAttribute
    })
    allConstraints += EqualNullSafe(e, a.toAttribute)
  case _ => // Don't change.
}
```

There are 3 issues here:
1. We may infer complicated predicates from `Generate`
2. `ConstanFolding` rule is too conservative. At least `Cast` has no side effect with ANSI-off.
3. When calculating constraints, we should have a upper bound to avoid generating too many expressions.

This fixes the first 2 issues, and leaves the third one for the future.

fix a performance issue

no

new tests, and run the query in JIRA ticket locally.

Closes #34823 from cloud-fan/perf.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 1fac7a9)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan
Copy link
Contributor Author

thanks for the view, merging to master/3.2/3.1!

@martinf-moodys
Copy link

thanks for the view, merging to master/3.2/3.1!

Thank you for the fix!

fishcus pushed a commit to fishcus/spark that referenced this pull request Jan 12, 2022
… for Generate

This is a performance regression since Spark 3.1, caused by https://issues.apache.org/jira/browse/SPARK-32295

If you run the query in the JIRA ticket
```
Seq(
  (1, "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x")
).toDF()
  .checkpoint() // or save and reload to truncate lineage
  .createOrReplaceTempView("sub")

session.sql("""
  SELECT
    *
  FROM
  (
    SELECT
      EXPLODE( ARRAY( * ) ) result
    FROM
    (
      SELECT
        _1 a, _2 b, _3 c, _4 d, _5 e, _6 f, _7 g, _8 h, _9 i, _10 j, _11 k, _12 l, _13 m, _14 n, _15 o, _16 p, _17 q, _18 r, _19 s, _20 t, _21 u
      FROM
        sub
    )
  )
  WHERE
    result != ''
  """).show()
```
You will hit OOM. The reason is that:
1. We infer additional predicates with `Generate`. In this case, it's `size(array(cast(_1#21 as string), _2#22, _3#23, ...) > 0`
2. Because of the cast, the `ConstantFolding` rule can't optimize this `size(array(...))`.
3. We end up with a plan containing this part
```
   +- Project [_1#21 AS a#106, _2#22 AS b#107, _3#23 AS c#108, _4#24 AS d#109, _5#25 AS e#110, _6#26 AS f#111, _7#27 AS g#112, _8#28 AS h#113, _9#29 AS i#114, _10#30 AS j#115, _11#31 AS k#116, _12#32 AS l#117, _13#33 AS m#118, _14#34 AS n#119, _15#35 AS o#120, _16#36 AS p#121, _17#37 AS q#122, _18#38 AS r#123, _19#39 AS s#124, _20#40 AS t#125, _21#41 AS u#126]
      +- Filter (size(array(cast(_1#21 as string), _2#22, _3#23, _4#24, _5#25, _6#26, _7#27, _8#28, _9#29, _10#30, _11#31, _12#32, _13#33, _14#34, _15#35, _16#36, _17#37, _18#38, _19#39, _20#40, _21#41), true) > 0)
         +- LogicalRDD [_1#21, _2#22, _3#23, _4#24, _5#25, _6#26, _7#27, _8#28, _9#29, _10#30, _11#31, _12#32, _13#33, _14#34, _15#35, _16#36, _17#37, _18#38, _19#39, _20#40, _21#41]
```
When calculating the constraints of the `Project`, we generate around 2^20 expressions, due to this code
```
var allConstraints = child.constraints
projectList.foreach {
  case a  Alias(l: Literal, _) =>
    allConstraints += EqualNullSafe(a.toAttribute, l)
  case a  Alias(e, _) =>
    // For every alias in `projectList`, replace the reference in constraints by its attribute.
    allConstraints ++= allConstraints.map(_ transform {
      case expr: Expression if expr.semanticEquals(e) =>
        a.toAttribute
    })
    allConstraints += EqualNullSafe(e, a.toAttribute)
  case _ => // Don't change.
}
```

There are 3 issues here:
1. We may infer complicated predicates from `Generate`
2. `ConstanFolding` rule is too conservative. At least `Cast` has no side effect with ANSI-off.
3. When calculating constraints, we should have a upper bound to avoid generating too many expressions.

This fixes the first 2 issues, and leaves the third one for the future.

fix a performance issue

no

new tests, and run the query in JIRA ticket locally.

Closes apache#34823 from cloud-fan/perf.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 1fac7a9)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
catalinii pushed a commit to lyft/spark that referenced this pull request Feb 22, 2022
… for Generate

### What changes were proposed in this pull request?

This is a performance regression since Spark 3.1, caused by https://issues.apache.org/jira/browse/SPARK-32295

If you run the query in the JIRA ticket
```
Seq(
  (1, "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x")
).toDF()
  .checkpoint() // or save and reload to truncate lineage
  .createOrReplaceTempView("sub")

session.sql("""
  SELECT
    *
  FROM
  (
    SELECT
      EXPLODE( ARRAY( * ) ) result
    FROM
    (
      SELECT
        _1 a, _2 b, _3 c, _4 d, _5 e, _6 f, _7 g, _8 h, _9 i, _10 j, _11 k, _12 l, _13 m, _14 n, _15 o, _16 p, _17 q, _18 r, _19 s, _20 t, _21 u
      FROM
        sub
    )
  )
  WHERE
    result != ''
  """).show()
```
You will hit OOM. The reason is that:
1. We infer additional predicates with `Generate`. In this case, it's `size(array(cast(_1#21 as string), _2#22, _3#23, ...) > 0`
2. Because of the cast, the `ConstantFolding` rule can't optimize this `size(array(...))`.
3. We end up with a plan containing this part
```
   +- Project [_1#21 AS a#106, _2#22 AS b#107, _3#23 AS c#108, _4#24 AS d#109, _5#25 AS e#110, _6#26 AS f#111, _7#27 AS g#112, _8#28 AS h#113, _9#29 AS i#114, _10#30 AS j#115, _11#31 AS k#116, _12#32 AS l#117, _13#33 AS m#118, _14#34 AS n#119, _15#35 AS o#120, _16#36 AS p#121, _17#37 AS q#122, _18#38 AS r#123, _19#39 AS s#124, _20#40 AS t#125, _21#41 AS u#126]
      +- Filter (size(array(cast(_1#21 as string), _2#22, _3#23, _4#24, _5#25, _6#26, _7#27, _8#28, _9#29, _10#30, _11#31, _12#32, _13#33, _14#34, _15#35, _16#36, _17#37, _18#38, _19#39, _20#40, _21#41), true) > 0)
         +- LogicalRDD [_1#21, _2#22, _3#23, _4#24, _5#25, _6#26, _7#27, _8#28, _9#29, _10#30, _11#31, _12#32, _13#33, _14#34, _15#35, _16#36, _17#37, _18#38, _19#39, _20#40, _21#41]
```
When calculating the constraints of the `Project`, we generate around 2^20 expressions, due to this code
```
var allConstraints = child.constraints
projectList.foreach {
  case a  Alias(l: Literal, _) =>
    allConstraints += EqualNullSafe(a.toAttribute, l)
  case a  Alias(e, _) =>
    // For every alias in `projectList`, replace the reference in constraints by its attribute.
    allConstraints ++= allConstraints.map(_ transform {
      case expr: Expression if expr.semanticEquals(e) =>
        a.toAttribute
    })
    allConstraints += EqualNullSafe(e, a.toAttribute)
  case _ => // Don't change.
}
```

There are 3 issues here:
1. We may infer complicated predicates from `Generate`
2. `ConstanFolding` rule is too conservative. At least `Cast` has no side effect with ANSI-off.
3. When calculating constraints, we should have a upper bound to avoid generating too many expressions.

This fixes the first 2 issues, and leaves the third one for the future.

### Why are the changes needed?

fix a performance issue

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new tests, and run the query in JIRA ticket locally.

Closes apache#34823 from cloud-fan/perf.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 1fac7a9)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
catalinii pushed a commit to lyft/spark that referenced this pull request Mar 4, 2022
… for Generate

### What changes were proposed in this pull request?

This is a performance regression since Spark 3.1, caused by https://issues.apache.org/jira/browse/SPARK-32295

If you run the query in the JIRA ticket
```
Seq(
  (1, "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x")
).toDF()
  .checkpoint() // or save and reload to truncate lineage
  .createOrReplaceTempView("sub")

session.sql("""
  SELECT
    *
  FROM
  (
    SELECT
      EXPLODE( ARRAY( * ) ) result
    FROM
    (
      SELECT
        _1 a, _2 b, _3 c, _4 d, _5 e, _6 f, _7 g, _8 h, _9 i, _10 j, _11 k, _12 l, _13 m, _14 n, _15 o, _16 p, _17 q, _18 r, _19 s, _20 t, _21 u
      FROM
        sub
    )
  )
  WHERE
    result != ''
  """).show()
```
You will hit OOM. The reason is that:
1. We infer additional predicates with `Generate`. In this case, it's `size(array(cast(_1#21 as string), _2#22, _3#23, ...) > 0`
2. Because of the cast, the `ConstantFolding` rule can't optimize this `size(array(...))`.
3. We end up with a plan containing this part
```
   +- Project [_1#21 AS a#106, _2#22 AS b#107, _3#23 AS c#108, _4#24 AS d#109, _5#25 AS e#110, _6#26 AS f#111, _7#27 AS g#112, _8#28 AS h#113, _9#29 AS i#114, _10#30 AS j#115, _11#31 AS k#116, _12#32 AS l#117, _13#33 AS m#118, _14#34 AS n#119, _15#35 AS o#120, _16#36 AS p#121, _17#37 AS q#122, _18#38 AS r#123, _19#39 AS s#124, _20#40 AS t#125, _21#41 AS u#126]
      +- Filter (size(array(cast(_1#21 as string), _2#22, _3#23, _4#24, _5#25, _6#26, _7#27, _8#28, _9#29, _10#30, _11#31, _12#32, _13#33, _14#34, _15#35, _16#36, _17#37, _18#38, _19#39, _20#40, _21#41), true) > 0)
         +- LogicalRDD [_1#21, _2#22, _3#23, _4#24, _5#25, _6#26, _7#27, _8#28, _9#29, _10#30, _11#31, _12#32, _13#33, _14#34, _15#35, _16#36, _17#37, _18#38, _19#39, _20#40, _21#41]
```
When calculating the constraints of the `Project`, we generate around 2^20 expressions, due to this code
```
var allConstraints = child.constraints
projectList.foreach {
  case a  Alias(l: Literal, _) =>
    allConstraints += EqualNullSafe(a.toAttribute, l)
  case a  Alias(e, _) =>
    // For every alias in `projectList`, replace the reference in constraints by its attribute.
    allConstraints ++= allConstraints.map(_ transform {
      case expr: Expression if expr.semanticEquals(e) =>
        a.toAttribute
    })
    allConstraints += EqualNullSafe(e, a.toAttribute)
  case _ => // Don't change.
}
```

There are 3 issues here:
1. We may infer complicated predicates from `Generate`
2. `ConstanFolding` rule is too conservative. At least `Cast` has no side effect with ANSI-off.
3. When calculating constraints, we should have a upper bound to avoid generating too many expressions.

This fixes the first 2 issues, and leaves the third one for the future.

### Why are the changes needed?

fix a performance issue

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new tests, and run the query in JIRA ticket locally.

Closes apache#34823 from cloud-fan/perf.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 1fac7a9)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
kazuyukitanimura pushed a commit to kazuyukitanimura/spark that referenced this pull request Aug 10, 2022
… for Generate

### What changes were proposed in this pull request?

This is a performance regression since Spark 3.1, caused by https://issues.apache.org/jira/browse/SPARK-32295

If you run the query in the JIRA ticket
```
Seq(
  (1, "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x")
).toDF()
  .checkpoint() // or save and reload to truncate lineage
  .createOrReplaceTempView("sub")

session.sql("""
  SELECT
    *
  FROM
  (
    SELECT
      EXPLODE( ARRAY( * ) ) result
    FROM
    (
      SELECT
        _1 a, _2 b, _3 c, _4 d, _5 e, _6 f, _7 g, _8 h, _9 i, _10 j, _11 k, _12 l, _13 m, _14 n, _15 o, _16 p, _17 q, _18 r, _19 s, _20 t, _21 u
      FROM
        sub
    )
  )
  WHERE
    result != ''
  """).show()
```
You will hit OOM. The reason is that:
1. We infer additional predicates with `Generate`. In this case, it's `size(array(cast(_1#21 as string), _2#22, _3#23, ...) > 0`
2. Because of the cast, the `ConstantFolding` rule can't optimize this `size(array(...))`.
3. We end up with a plan containing this part
```
   +- Project [_1#21 AS a#106, _2#22 AS b#107, _3#23 AS c#108, _4#24 AS d#109, _5#25 AS e#110, _6#26 AS f#111, _7#27 AS g#112, _8#28 AS h#113, _9#29 AS i#114, _10#30 AS j#115, _11#31 AS k#116, _12#32 AS l#117, _13#33 AS m#118, _14#34 AS n#119, _15#35 AS o#120, _16#36 AS p#121, _17#37 AS q#122, _18#38 AS r#123, _19#39 AS s#124, _20#40 AS t#125, _21#41 AS u#126]
      +- Filter (size(array(cast(_1#21 as string), _2#22, _3#23, _4#24, _5#25, _6#26, _7#27, _8#28, _9#29, _10#30, _11#31, _12#32, _13#33, _14#34, _15#35, _16#36, _17#37, _18#38, _19#39, _20#40, _21#41), true) > 0)
         +- LogicalRDD [_1#21, _2#22, _3#23, _4#24, _5#25, _6#26, _7#27, _8#28, _9#29, _10#30, _11#31, _12#32, _13#33, _14#34, _15#35, _16#36, _17#37, _18#38, _19#39, _20#40, _21#41]
```
When calculating the constraints of the `Project`, we generate around 2^20 expressions, due to this code
```
var allConstraints = child.constraints
projectList.foreach {
  case a  Alias(l: Literal, _) =>
    allConstraints += EqualNullSafe(a.toAttribute, l)
  case a  Alias(e, _) =>
    // For every alias in `projectList`, replace the reference in constraints by its attribute.
    allConstraints ++= allConstraints.map(_ transform {
      case expr: Expression if expr.semanticEquals(e) =>
        a.toAttribute
    })
    allConstraints += EqualNullSafe(e, a.toAttribute)
  case _ => // Don't change.
}
```

There are 3 issues here:
1. We may infer complicated predicates from `Generate`
2. `ConstanFolding` rule is too conservative. At least `Cast` has no side effect with ANSI-off.
3. When calculating constraints, we should have a upper bound to avoid generating too many expressions.

This fixes the first 2 issues, and leaves the third one for the future.

### Why are the changes needed?

fix a performance issue

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new tests, and run the query in JIRA ticket locally.

Closes apache#34823 from cloud-fan/perf.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 1fac7a9)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants