[SPARK-25276] OutOfMemoryError: GC overhead limit exceeded when using alias #22277
Conversation
@gatorsmile and @jiangxb1987, any inputs?
Thank you for your interest in this issue; however, I don't think the changes proposed in this PR are valid. Consider that you have another predicate like
@jiangxb1987 Thanks for the feedback. A couple of points:
So I think a > z is an invalid scenario? Please correct me if I am wrong.
You can have
I see. But the code modified in this PR only applies when an alias is part of the projection. The query you mention does not seem to hit the current alias logic in org.apache.spark.sql.catalyst.plans.logical.UnaryNode#getAliasedConstraints, since in the outer query c is not an Alias but an AttributeReference, like a. Please correct me if I am wrong: do you mean we should also cover the scenario where an alias is referenced in a filter as part of this PR?
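For context, the aliasing logic under discussion can be illustrated with a small standalone model. This is a hypothetical simplification (constraints are plain strings here, not real Catalyst Expressions, and AliasedConstraintsModel is an invented name): for each alias in the projection, the current implementation appends a rewritten copy of every known constraint plus an EqualNullSafe-style constraint.

```scala
// Hypothetical simplified model of UnaryNode#getAliasedConstraints.
// Constraints are modeled as plain strings, NOT real Catalyst Expressions.
object AliasedConstraintsModel {
  def getAliasedConstraints(childConstraints: Set[String],
                            aliases: Seq[(String, String)]): Set[String] = {
    var all = childConstraints
    for ((attr, alias) <- aliases) {
      // `++=` keeps every existing constraint AND adds a rewritten copy of it
      all = all ++ all.map(_.replace(attr, alias))
      all = all + s"$attr <=> $alias" // models EqualNullSafe(e, a.toAttribute)
    }
    all -- childConstraints
  }

  def main(args: Array[String]): Unit = {
    // 'a is aliased twice, as x and as z, mirroring the discussion in this thread
    val out = getAliasedConstraints(Set("a > 10", "isnotnull(a)"),
                                    Seq("a" -> "x", "a" -> "z"))
    out.toSeq.sorted.foreach(println)
  }
}
```

Running this yields both x- and z-rewritten copies of every child constraint, which is the kind of redundancy this PR is targeting.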
Attaching a SQL file to reproduce the issue and show the effect of the PR. Without the patch:
After applying the patch:
As you can see, when we have many aliases in the projection, computing the constraints causes significant overhead with the current code, which throws GC overhead limit exceeded after 3 minutes for table44.
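A back-of-the-envelope model of why the cost explodes (assuming, pessimistically, that every constraint mentions the aliased attribute, so each alias step duplicates the whole set via `++=` and adds one EqualNullSafe constraint; ConstraintGrowth is an invented name for illustration):

```scala
// Rough worst-case growth model for the `++=` accumulation: each alias
// doubles the constraint set and adds one EqualNullSafe constraint.
object ConstraintGrowth {
  def size(initial: Long, numAliases: Int): Long =
    (1 to numAliases).foldLeft(initial)((s, _) => 2 * s + 1)

  def main(args: Array[String]): Unit = {
    println(size(4, 10)) // prints 5119
    println(size(4, 30)) // over 5 billion entries: consistent with a GC overhead error
  }
}
```

Even if set deduplication prunes some of this in practice, the trend explains why a projection with many aliases becomes unworkable.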
Can one of the admins verify this patch?
We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it!
Test SQL:
test.txt
Output:
Attaching a test to reproduce the issue. The issue seems to be with redundant constraints; below is a test which demonstrates it.
test("redundant constraints") {
  val tr = LocalRelation('a.int, 'b.string, 'c.int)
  val aliasedRelation = tr.where('a.attr > 10).select('a.as('x), 'b, 'b.as('y), 'a.as('z))
  // Comparing aliasedRelation.analyze.constraints against the expected
  // minimal set produces the mismatch shown below.
}
== FAIL: Constraints do not match ===
Found: isnotnull(z#5),(z#5 > 10),(x#3 > 10),(z#5 <=> x#3),(b#1 <=> y#4),isnotnull(x#3)
Expected: (x#3 > 10),isnotnull(x#3),(b#1 <=> y#4),(z#5 <=> x#3)
== Result ==
Missing: N/A
Found but not expected: isnotnull(z#5),(z#5 > 10)
Here I think that, since z has an EqualNullSafe comparison with x, having isnotnull(z#5) and (z#5 > 10) is redundant. If a query has a lot of aliases, this may cause overhead leading to java.lang.OutOfMemoryError: GC overhead limit exceeded.
So I suggest that at https://github.com/apache/spark/blob/v2.3.2-rc5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L254, instead of appending with ++= (addAll), we should simply assign with =.
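To illustrate the suggested change on a toy model (strings standing in for Catalyst Expressions; this is an assumption-laden sketch with an invented name, not the real LogicalPlan code): with accumulate-and-append (`++=`) the old copies survive every alias step, whereas plain assignment (`=`) keeps only the rewritten set.

```scala
// Toy contrast of `++=` (current) vs `=` (proposed) in the alias rewrite loop.
// Constraints are plain strings here, NOT real Catalyst Expressions.
object ProposedFix {
  def aliased(child: Set[String],
              aliases: Seq[(String, String)],
              accumulate: Boolean): Set[String] = {
    var all = child
    for ((attr, alias) <- aliases) {
      val rewritten = all.map(_.replace(attr, alias))
      all = if (accumulate) all ++ rewritten // current: `++=`, keeps old copies
            else rewritten                   // proposed: `=`, drops them
      all = all + s"$attr <=> $alias"
    }
    all -- child
  }

  def main(args: Array[String]): Unit = {
    val child = Set("a > 10", "isnotnull(a)")
    val aliases = Seq("a" -> "x", "a" -> "z")
    println(aliased(child, aliases, accumulate = true).size)  // prints 7
    println(aliased(child, aliases, accumulate = false).size) // prints 4
  }
}
```

In this toy model the assignment variant produces exactly the smaller "Expected" shape from the test above, without the redundant z constraints. Whether dropping the originals is always safe for correctness is exactly the question the reviewer raises earlier in this thread.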