[SPARK-25276] OutOfMemoryError: GC overhead limit exceeded when using alias #22277
Conversation
@gatorsmile and @jiangxb1987, any inputs?
Thank you for your interest in this issue; however, I don't think the changes proposed in this PR are valid. Consider that you have another predicate like
@jiangxb1987 Thanks for the feedback. A couple of points:
So I think a > z is an invalid scenario? Please correct me if I am wrong.
You can have
I see. But the code modified in this PR only applies when an alias is part of the projection. The query you mention does not seem to hit the current alias logic in org.apache.spark.sql.catalyst.plans.logical.UnaryNode#getAliasedConstraints, since in the outer query c is not an Alias but an AttributeReference, like a. Please correct me if I am wrong: do you mean we should also cover the scenario where an alias is referenced in a filter as part of this PR?
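For context, the aliasing logic under discussion can be illustrated with a small standalone model. This is a hypothetical simplification (constraints are plain strings here, not real Catalyst Expressions, and AliasedConstraintsModel is an invented name): for each alias in the projection, the current implementation appends a rewritten copy of every known constraint plus an EqualNullSafe-style constraint.

```scala
// Hypothetical simplified model of UnaryNode#getAliasedConstraints.
// Constraints are modeled as plain strings, NOT real Catalyst Expressions.
object AliasedConstraintsModel {
  def getAliasedConstraints(childConstraints: Set[String],
                            aliases: Seq[(String, String)]): Set[String] = {
    var all = childConstraints
    for ((attr, alias) <- aliases) {
      // `++=` keeps every existing constraint AND adds a rewritten copy of it
      all = all ++ all.map(_.replace(attr, alias))
      all = all + s"$attr <=> $alias" // models EqualNullSafe(e, a.toAttribute)
    }
    all -- childConstraints
  }

  def main(args: Array[String]): Unit = {
    // 'a is aliased twice, as x and as z, mirroring the discussion in this thread
    val out = getAliasedConstraints(Set("a > 10", "isnotnull(a)"),
                                    Seq("a" -> "x", "a" -> "z"))
    out.toSeq.sorted.foreach(println)
  }
}
```

Running this yields both x- and z-rewritten copies of every child constraint, which is the kind of redundancy this PR is targeting.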
Attaching a SQL file to reproduce the issue and show the effect of the PR. Without the patch:
After applying the patch:
As you can see, when we have many aliases in the projection, computing the constraints causes significant overhead with the current code, which throws GC overhead limit exceeded after 3 minutes for table44.
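A back-of-the-envelope model of why the cost explodes (assuming, pessimistically, that every constraint mentions the aliased attribute, so each alias step duplicates the whole set via `++=` and adds one EqualNullSafe constraint; ConstraintGrowth is an invented name for illustration):

```scala
// Rough worst-case growth model for the `++=` accumulation: each alias
// doubles the constraint set and adds one EqualNullSafe constraint.
object ConstraintGrowth {
  def size(initial: Long, numAliases: Int): Long =
    (1 to numAliases).foldLeft(initial)((s, _) => 2 * s + 1)

  def main(args: Array[String]): Unit = {
    println(size(4, 10)) // prints 5119
    println(size(4, 30)) // over 5 billion entries: consistent with a GC overhead error
  }
}
```

Even if set deduplication prunes some of this in practice, the trend explains why a projection with many aliases becomes unworkable.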
Can one of the admins verify this patch?
We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it!
Test SQL:
test.txt
Output:
Attaching a test to reproduce the issue. The issue seems to be with redundant constraints; below is a test which demonstrates it.
test("redundant constraints") {
  val tr = LocalRelation('a.int, 'b.string, 'c.int)
  val aliasedRelation = tr.where('a.attr > 10).select('a.as('x), 'b, 'b.as('y), 'a.as('z))
  // Comparing aliasedRelation.analyze.constraints against the expected
  // minimal set produces the mismatch shown below.
}
== FAIL: Constraints do not match ===
Found: isnotnull(z#5),(z#5 > 10),(x#3 > 10),(z#5 <=> x#3),(b#1 <=> y#4),isnotnull(x#3)
Expected: (x#3 > 10),isnotnull(x#3),(b#1 <=> y#4),(z#5 <=> x#3)
== Result ==
Missing: N/A
Found but not expected: isnotnull(z#5),(z#5 > 10)
Here I think that, since z has an EqualNullSafe comparison with x, having isnotnull(z#5) and (z#5 > 10) is redundant. If a query has a lot of aliases, this may cause overhead leading to java.lang.OutOfMemoryError: GC overhead limit exceeded.
So I suggest that at https://github.com/apache/spark/blob/v2.3.2-rc5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L254, instead of appending with ++= (addAll), we should simply assign with =.
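To illustrate the suggested change on a toy model (strings standing in for Catalyst Expressions; this is an assumption-laden sketch with an invented name, not the real LogicalPlan code): with accumulate-and-append (`++=`) the old copies survive every alias step, whereas plain assignment (`=`) keeps only the rewritten set.

```scala
// Toy contrast of `++=` (current) vs `=` (proposed) in the alias rewrite loop.
// Constraints are plain strings here, NOT real Catalyst Expressions.
object ProposedFix {
  def aliased(child: Set[String],
              aliases: Seq[(String, String)],
              accumulate: Boolean): Set[String] = {
    var all = child
    for ((attr, alias) <- aliases) {
      val rewritten = all.map(_.replace(attr, alias))
      all = if (accumulate) all ++ rewritten // current: `++=`, keeps old copies
            else rewritten                   // proposed: `=`, drops them
      all = all + s"$attr <=> $alias"
    }
    all -- child
  }

  def main(args: Array[String]): Unit = {
    val child = Set("a > 10", "isnotnull(a)")
    val aliases = Seq("a" -> "x", "a" -> "z")
    println(aliased(child, aliases, accumulate = true).size)  // prints 7
    println(aliased(child, aliases, accumulate = false).size) // prints 4
  }
}
```

In this toy model the assignment variant produces exactly the smaller "Expected" shape from the test above, without the redundant z constraints. Whether dropping the originals is always safe for correctness is exactly the question the reviewer raises earlier in this thread.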