
[SPARK-25276] OutOfMemoryError: GC overhead limit exceeded when using alias #22277

Closed
wants to merge 2 commits

Conversation

ajithme
Contributor

@ajithme ajithme commented Aug 30, 2018

Test SQL:

test.txt

Output:

spark-2.3.1-bin-hadoop2.7/bin # ./spark-sql -f test.txt
Time taken: 3.405 seconds
Time taken: 0.373 seconds
Time taken: 0.202 seconds
Time taken: 0.024 seconds
18/09/06 11:29:49 WARN HiveMetaStore: Location: file:/user/hive/warehouse/table11 specified for non-external table:table11
Time taken: 0.541 seconds
18/09/06 11:29:49 WARN HiveMetaStore: Location: file:/user/hive/warehouse/table22 specified for non-external table:table22
Time taken: 0.115 seconds
18/09/06 11:29:50 WARN HiveMetaStore: Location: file:/user/hive/warehouse/table33 specified for non-external table:table33
Time taken: 6.075 seconds
18/09/06 11:31:38 ERROR SparkSQLDriver: Failed in [
create table table44 as
select a.*
from
(
select
(concat(
case when a1 is null then '' else cast(a1 as string) end,'|~|',
case when a2 is null then '' else cast(a2 as string) end,'|~|',
case when a3 is null then '' else cast(a3 as string) end,'|~|',
case when a4 is null then '' else cast(a4 as string) end,'|~|',
case when a5 is null then '' else cast(a5 as string) end,'|~|',
case when a6 is null then '' else cast(a6 as string) end,'|~|',
case when a7 is null then '' else cast(a7 as string) end,'|~|',
case when a8 is null then '' else cast(a8 as string) end,'|~|',
case when a9 is null then '' else cast(a9 as string) end,'|~|',
case when a10 is null then '' else cast(a10 as string) end,'|~|',
case when a11 is null then '' else cast(a11 as string) end,'|~|',
case when a12 is null then '' else cast(a12 as string) end,'|~|',
case when a13 is null then '' else cast(a13 as string) end,'|~|',
case when a14 is null then '' else cast(a14 as string) end,'|~|',
case when a15 is null then '' else cast(a15 as string) end,'|~|',
case when a16 is null then '' else cast(a16 as string) end,'|~|',
case when a17 is null then '' else cast(a17 as string) end,'|~|',
case when a18 is null then '' else cast(a18 as string) end,'|~|',
case when a19 is null then '' else cast(a19 as string) end
)) as KEY_ID ,
case when a1 is null then '' else cast(a1 as string) end as a1,
case when a2 is null then '' else cast(a2 as string) end as a2,
case when a3 is null then '' else cast(a3 as string) end as a3,
case when a4 is null then '' else cast(a4 as string) end as a4,
case when a5 is null then '' else cast(a5 as string) end as a5,
case when a6 is null then '' else cast(a6 as string) end as a6,
case when a7 is null then '' else cast(a7 as string) end as a7,
case when a8 is null then '' else cast(a8 as string) end as a8,
case when a9 is null then '' else cast(a9 as string) end as a9,
case when a10 is null then '' else cast(a10 as string) end as a10,
case when a11 is null then '' else cast(a11 as string) end as a11,
case when a12 is null then '' else cast(a12 as string) end as a12,
case when a13 is null then '' else cast(a13 as string) end as a13,
case when a14 is null then '' else cast(a14 as string) end as a14,
case when a15 is null then '' else cast(a15 as string) end as a15,
case when a16 is null then '' else cast(a16 as string) end as a16,
case when a17 is null then '' else cast(a17 as string) end as a17,
case when a18 is null then '' else cast(a18 as string) end as a18,
case when a19 is null then '' else cast(a19 as string) end as a19
from table22
) A
left join table11 B ON A.KEY_ID = B.KEY_ID
where b.KEY_ID is null]
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.lang.Class.copyConstructors(Class.java:3130)
        at java.lang.Class.getConstructors(Class.java:1651)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:387)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:385)
        at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
        at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:385)
        at org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:244)
        at org.apache.spark.sql.catalyst.expressions.Expression.canonicalized$lzycompute(Expression.scala:190)
        at org.apache.spark.sql.catalyst.expressions.Expression.canonicalized(Expression.scala:188)
        at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$1.apply(Expression.scala:189)
        at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$1.apply(Expression.scala:189)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.immutable.List.map(List.scala:285)
        at org.apache.spark.sql.catalyst.expressions.Expression.canonicalized$lzycompute(Expression.scala:189)
        at org.apache.spark.sql.catalyst.expressions.Expression.canonicalized(Expression.scala:188)
        at org.apache.spark.sql.catalyst.expressions.ExpressionSet.add(ExpressionSet.scala:63)
        at org.apache.spark.sql.catalyst.expressions.ExpressionSet$$anonfun$$plus$plus$1.apply(ExpressionSet.scala:79)
        at org.apache.spark.sql.catalyst.expressions.ExpressionSet$$anonfun$$plus$plus$1.apply(ExpressionSet.scala:79)
        at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:316)
        at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
        at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
        at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
        at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
        at org.apache.spark.sql.catalyst.expressions.ExpressionSet.$plus$plus(ExpressionSet.scala:79)
        at org.apache.spark.sql.catalyst.expressions.ExpressionSet.$plus$plus(ExpressionSet.scala:55)
        at org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$getAliasedConstraints$1.apply(LogicalPlan.scala:254)
        at org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$getAliasedConstraints$1.apply(LogicalPlan.scala:249)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at org.apache.spark.sql.catalyst.plans.logical.UnaryNode.getAliasedConstraints(LogicalPlan.scala:249)

Attaching a test to reproduce the issue. The issue seems to be with redundant constraints; below is a test which demonstrates it.

test("redundant constrains") {
val tr = LocalRelation('a.int, 'b.string, 'c.int)
val aliasedRelation = tr.where('a.attr > 10).select('a.as('x), 'b, 'b.as('y), 'a.as('z))

verifyConstraints(aliasedRelation.analyze.constraints,
  ExpressionSet(Seq(resolveColumn(aliasedRelation.analyze, "x") > 10,
    IsNotNull(resolveColumn(aliasedRelation.analyze, "x")),
    resolveColumn(aliasedRelation.analyze, "b") <=> resolveColumn(aliasedRelation.analyze, "y"),
    resolveColumn(aliasedRelation.analyze, "z") <=>
      resolveColumn(aliasedRelation.analyze, "x"))))

}

== FAIL: Constraints do not match ===
Found: isnotnull(z#5),(z#5 > 10),(x#3 > 10),(z#5 <=> x#3),(b#1 <=> y#4),isnotnull(x#3)
Expected: (x#3 > 10),isnotnull(x#3),(b#1 <=> y#4),(z#5 <=> x#3)
== Result ==
Missing: N/A
Found but not expected: isnotnull(z#5),(z#5 > 10)
Here, since z has an EqualNullSafe comparison with x, I think having isnotnull(z#5) and (z#5 > 10) is redundant. If a query has a lot of aliases, this may cause enough overhead to lead to java.lang.OutOfMemoryError: GC overhead limit exceeded.
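To see why this blows up for the reported query: KEY_ID is an alias of a concat over 19 case expressions, and each of a1..a19 aliases one of those sub-expressions. Because the aliasing loop keeps the existing constraints and appends rewritten copies for every alias, a constraint that mentions the concat fans out into every combination of aliased and unaliased sub-expressions. Below is a toy Scala model of that doubling (my own illustration of the mechanism, not Spark code):

// Toy model of the constraint blow-up; NOT Spark code, just an illustration
// of the ++= doubling. Each constraint is modelled as the set of original
// sub-expressions (1..19) it still references.
object ConstraintBlowup extends App {
  // One base constraint referencing all 19 case expressions inside the concat,
  // e.g. something like isnotnull(concat(c1, ..., c19)).
  var constraints: Set[Set[Int]] = Set((1 to 19).toSet)
  for (i <- 1 to 19) {
    // ++= semantics: keep every existing constraint AND add a copy with
    // sub-expression i replaced by its alias attribute.
    constraints ++= constraints.map(_ - i)
  }
  // Every subset of {1..19} is now present: 2^19 = 524288 constraints, each
  // of which Spark would canonicalize as a large expression tree.
  println(constraints.size)
}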

So I suggest that at https://github.com/apache/spark/blob/v2.3.2-rc5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L254, instead of appending with ++= we should just assign with =.
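
For context, a paraphrased sketch of that loop (based on the linked v2.3.2 LogicalPlan.scala; see the link for the exact code), with the proposed change noted in a comment:

// Paraphrased from UnaryNode.getAliasedConstraints in Spark v2.3.2; not a verbatim copy.
protected def getAliasedConstraints(projectList: Seq[NamedExpression]): Set[Expression] = {
  var allConstraints = child.constraints.asInstanceOf[Set[Expression]]
  projectList.foreach {
    case a @ Alias(e, _) =>
      // Line 254: rewrite every constraint seen so far to use the alias
      // attribute, and APPEND the rewritten copies to the accumulated set.
      // Proposed change: assign instead of append, i.e.
      //   allConstraints = allConstraints.map(...)
      allConstraints ++= allConstraints.map(_ transform {
        case expr: Expression if expr.semanticEquals(e) => a.toAttribute
      })
      allConstraints += EqualNullSafe(e, a.toAttribute)
    case _ => // Don't change.
  }
  allConstraints -- child.constraints
}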

@ajithme
Contributor Author

ajithme commented Aug 31, 2018

@gatorsmile and @jiangxb1987, any inputs?

@jiangxb1987
Contributor

Thank you for your interest in this issue. However, I don't think the change proposed in this PR is valid: consider that you have another predicate like a > z; it is surely desired to infer a new constraint z > z. Please correct me if I'm wrong about this.

@ajithme
Contributor Author

ajithme commented Sep 5, 2018

@jiangxb1987 Thanks for the feedback. A couple of points:

  1. If we introduce a predicate which refers to an alias (as you mentioned, a > z), it throws an error:
spark-sql> create table table1 (a int);
Time taken: 0.152 seconds

spark-sql> select a, a as c from table1 where a > 10 and a > c;
Error in query: cannot resolve '`c`' given input columns: [table1.a]; line 1 pos 50;
'Project ['a, 'a AS c#6]
+- 'Filter ((a#7 > 10) && (a#7 > 'c))
   +- SubqueryAlias table1
      +- HiveTableRelation `default`.`table1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#7]

So I think a > z is an invalid scenario? Please correct me if I am wrong.

  2. If we add a self-referring predicate like a > a instead of a > z, the PR still produces a valid constraint list:
  (x#5 > x#5),(b#1 <=> y#6),(x#5 > 10),(z#7 <=> x#5),isnotnull(x#5)

@jiangxb1987
Contributor

You can have select * from (select a, a as c from table1 where a > 10) t where a > c

@ajithme
Contributor Author

ajithme commented Sep 6, 2018

I see. But the code modified in this PR applies when the alias is part of the projection. The query you mention does not seem to hit the current alias logic in org.apache.spark.sql.catalyst.plans.logical.UnaryNode#getAliasedConstraints, as in the outer query c is not an Alias but rather an AttributeReference, like a.

Please correct me if I am wrong: do you mean we should cover the scenario where an alias is referenced in a filter as part of this PR?

@ajithme
Contributor Author

ajithme commented Sep 6, 2018

Attaching a SQL file to reproduce the issue and show the effect of the PR:
test.txt

Without patch:

spark-2.3.1-bin-hadoop2.7/bin # ./spark-sql -f test.txt
[identical output to the log in the description above; the create table table44 statement again fails with java.lang.OutOfMemoryError: GC overhead limit exceeded]

After applying patch:

spark-2.3.1-bin-hadoop2.7/bin # ./spark-sql -f test.txt
Time taken: 3.469 seconds
Time taken: 0.294 seconds
Time taken: 0.223 seconds
Time taken: 0.023 seconds
18/09/06 11:33:08 WARN HiveMetaStore: Location: file:/user/hive/warehouse/table11 specified for non-external table:table11
Time taken: 0.546 seconds
18/09/06 11:33:08 WARN HiveMetaStore: Location: file:/user/hive/warehouse/table22 specified for non-external table:table22
Time taken: 0.11 seconds
18/09/06 11:33:10 WARN HiveMetaStore: Location: file:/user/hive/warehouse/table33 specified for non-external table:table33
Time taken: 6.258 seconds
18/09/06 11:33:15 WARN HiveMetaStore: Location: file:/user/hive/warehouse/table44 specified for non-external table:table44
Time taken: 2.603 seconds

As you can see, when we have many aliases in the projection, computing the constraints causes significant overhead with the current code, which throws GC overhead limit exceeded after about 3 minutes for table44; with the patch the same statement completes in 2.603 seconds.

@ajithme ajithme changed the title [SPARK-25276] Redundant constrains when using alias [SPARK-25276] OutOfMemoryError: GC overhead limit exceeded when using alias Sep 10, 2018
@AmplabJenkins

Can one of the admins verify this patch?

@github-actions

github-actions bot commented Jan 7, 2020

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

@github-actions github-actions bot added the Stale label Jan 7, 2020
@srowen srowen closed this Jan 7, 2020