[SPARK-10474] [SQL] Aggregation fails to allocate memory for pointer array (round 2) #8888

andrewor14 · 2015-09-23T21:24:29Z

This patch reverts most of the changes in a previous fix #8827.

The real cause of the issue is that in `TungstenAggregate`'s prepare method we only reserve 1 page, but later when we switch to sort-based aggregation we try to acquire 1 page AND a pointer array. The longer-term fix should be to also reserve the pointer array, but for now _we will simply not track the pointer array_. (Note that elsewhere we already don't track the pointer array, e.g. [here](https://github.com/apache/spark/blob/a18208047f06a4244703c17023bb20cbe1f59d73/sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeKVExternalSorter.java#L88))
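To make the accounting mismatch concrete, here is a minimal toy model in Java. It is only a sketch and not Spark's memory manager: `MemoryPool`, `fallbackSucceeds`, and the sizes `PAGE` and `POINTER_ARRAY` are hypothetical names and values chosen for illustration, and the switch to sort-based aggregation is simplified to releasing the reserved page and re-acquiring what the sorter needs.

```java
// Toy model of the accounting problem described above; NOT Spark's real code.
public class PointerArrayAccounting {

    /** Hypothetical stand-in for a shared, fixed-size execution memory pool. */
    static final class MemoryPool {
        private final long capacity;
        private long used;

        MemoryPool(long capacity) {
            this.capacity = capacity;
        }

        /** Grants the request only if it fits entirely; returns true on success. */
        boolean tryAcquire(long bytes) {
            if (used + bytes > capacity) {
                return false;
            }
            used += bytes;
            return true;
        }

        void release(long bytes) {
            used -= bytes;
        }
    }

    // Illustrative sizes only (assumption); Spark's real page size is much larger.
    static final long PAGE = 64;
    static final long POINTER_ARRAY = 16;

    /**
     * Simulates one aggregate operator: the prepare step reserves one page,
     * other operators consume the rest of the pool, and then the operator
     * falls back to sorting. Returns whether the sorter gets its memory.
     */
    static boolean fallbackSucceeds(boolean trackPointerArray) {
        MemoryPool pool = new MemoryPool(4 * PAGE);

        // Prepare step: reserve exactly one page up front.
        if (!pool.tryAcquire(PAGE)) {
            return false;
        }

        // Other operators take everything that is left.
        pool.tryAcquire(3 * PAGE);

        // Fallback (simplified): hand back the reserved page, then request
        // what the sorter needs. If the pointer array is tracked, the request
        // is larger than the reservation and cannot be satisfied.
        pool.release(PAGE);
        long needed = PAGE + (trackPointerArray ? POINTER_ARRAY : 0);
        return pool.tryAcquire(needed);
    }

    public static void main(String[] args) {
        System.out.println("tracked:   " + fallbackSucceeds(true));
        System.out.println("untracked: " + fallbackSucceeds(false));
    }
}
```

In this model the tracked request fails because the reservation covers only the page. Leaving the pointer array out of the accounting makes the request match the reservation, which mirrors the short-term approach taken in this patch.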

Note: This patch reuses the unit test added in #8827 so it doesn't show up in the diff.

davies · 2015-09-23T21:39:19Z

LGTM

SparkQA · 2015-09-23T23:41:41Z

Test build #42924 has finished for PR 8888 at commit c910d0b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-09-23T23:55:39Z

Test build #42925 has finished for PR 8888 at commit ed36351.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-09-24T00:00:25Z

Test build #1803 has started for PR 8888 at commit ed36351.

SparkQA · 2015-09-24T00:01:15Z

Test build #1804 has started for PR 8888 at commit ed36351.

SparkQA · 2015-09-24T00:02:23Z

Test build #1799 has finished for PR 8888 at commit ed36351.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-09-24T00:02:28Z

Test build #1806 has started for PR 8888 at commit ed36351.

SparkQA · 2015-09-24T00:03:01Z

Test build #1805 has started for PR 8888 at commit ed36351.

SparkQA · 2015-09-24T00:03:47Z

Test build #1800 has finished for PR 8888 at commit ed36351.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-09-24T00:23:12Z

Test build #1802 has finished for PR 8888 at commit ed36351.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-09-24T00:26:09Z

Test build #1801 has finished for PR 8888 at commit ed36351.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

andrewor14 · 2015-09-24T01:36:04Z

By the way @JoshRosen this seems to run core tests like SecurityManagerSuite even though it touches only SQL stuff. Just FYI.

chenghao-intel · 2015-09-24T01:39:39Z

https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L138 This seems to reserve the data page only if existingInMemorySorter is unset. However, SPARK-10474 is not that case, right? It follows the sort merge join operator.

JoshRosen · 2015-09-24T01:45:07Z

@andrewor14, it's running core tests because you changed core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java

andrewor14 · 2015-09-24T01:50:28Z

@chenghao-intel SPARK-10474 is caused by an aggregate falling back to sort-based aggregation. In this case we don't acquire the page in the constructor, but we do acquire it when we insert into the sorter later.

andrewor14 · 2015-09-24T01:55:09Z

> it's running core tests because you changed core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java

Ah yes never mind. I didn't realize it was in core.

SparkQA · 2015-09-24T01:55:23Z

Test build #1807 has started for PR 8888 at commit 2dc34c3.

chenghao-intel · 2015-09-24T02:19:03Z

@andrewor14 that's actually what I mean: if we don't reserve the memory when creating the downstream operator, we will probably never get the chance to acquire the page when inserting records into the sorter, because the upstream operator (like SMJ) will probably eat up all of the memory. That could even lead to hash aggregation switching to sort-based aggregation when the first record arrives. Do you think that's possible?

andrewor14 · 2015-09-24T02:32:42Z

@chenghao-intel but we do reserve the page in advance. See TungstenAggregate:

`spark/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregate.scala`, line 77 in 084e4e1:

`def preparePartition(): TungstenAggregationIterator = {`

SparkQA · 2015-09-24T02:33:03Z

Test build #42942 has finished for PR 8888 at commit 2dc34c3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

andrewor14 · 2015-09-24T02:34:01Z

OK, it passed tests. I'm merging this into master and 1.5.

SparkQA · 2015-09-24T02:34:48Z

Test build #1808 has finished for PR 8888 at commit 2dc34c3.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

…array (round 2)

This patch reverts most of the changes in a previous fix #8827.

The real cause of the issue is that in `TungstenAggregate`'s prepare method we only reserve 1 page, but later when we switch to sort-based aggregation we try to acquire 1 page AND a pointer array. The longer-term fix should be to reserve also the pointer array, but for now ***we will simply not track the pointer array***. (Note that elsewhere we already don't track the pointer array, e.g. [here](https://github.com/apache/spark/blob/a18208047f06a4244703c17023bb20cbe1f59d73/sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeKVExternalSorter.java#L88))

Note: This patch reuses the unit test added in #8827 so it doesn't show up in the diff.

Author: Andrew Or <andrew@databricks.com>

Closes #8888 from andrewor14/dont-track-pointer-array.

(cherry picked from commit 83f6f54)
Signed-off-by: Andrew Or <andrew@databricks.com>

SparkQA · 2015-09-24T04:14:19Z

Test build #1811 has finished for PR 8888 at commit 2dc34c3.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-09-24T04:37:14Z

Test build #1810 has finished for PR 8888 at commit 2dc34c3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-09-24T04:38:39Z

Test build #1809 has finished for PR 8888 at commit 2dc34c3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

chenghao-intel · 2015-09-24T04:51:53Z

Thank you @andrewor14 for the explanation. I believe you're talking about reserving the data page in advance via UnsafeFixedWidthAggregationMap.map (a BytesToBytesMap), which reserves the data page in its constructor.
Everything looks reasonable to me now. My only concern is performance, since only a single data page is available for the sort-based aggregation. Anyway, that's not related to this fix.

andrewor14 · 2015-09-24T16:59:15Z

Yes, it's not related. What you mention here is a bigger problem. The current solution only ensures that we don't starve any operators. In the future we can improve this by introducing a forced-spilling mechanism, but that's too big of a change to backport to 1.5.

Andrew Or added 4 commits September 23, 2015 14:10

Revert "[SPARK-10474] [SQL] Aggregation fails to allocate memory for …

a00c737

…pointer array" This reverts commit 7ff8d68.

Add back test

7890baf

Do not track pointer array...

00f3739

Do not release pointer array memory since we don't track it

fa16b07

andrewor14 force-pushed the dont-track-pointer-array branch from 091298e to dfc73e8 Compare September 23, 2015 21:26

Clarify comment

a96b94e

andrewor14 force-pushed the dont-track-pointer-array branch from dfc73e8 to a96b94e Compare September 23, 2015 21:26

Use correct JIRA number in comments

c910d0b

andrewor14 force-pushed the dont-track-pointer-array branch from b2708ca to c910d0b Compare September 23, 2015 21:28

Correct a comment (minor)

ed36351

Fix tests

2dc34c3

Looks like we still tracked the pointer array memory when we grow it. Don't do that.

andrewor14 closed this Sep 24, 2015

andrewor14 deleted the dont-track-pointer-array branch September 24, 2015 02:35
