[SPARK-26114][CORE] ExternalSorter's readingIterator field leak #23083

Closed
wants to merge 15 commits

Conversation

szhem
Contributor

@szhem szhem commented Nov 19, 2018

What changes were proposed in this pull request?

This pull request fixes the SPARK-26114 issue, which occurs when trying to reduce the number of partitions by means of coalesce without shuffling after shuffle-based transformations.

The leak occurs because ExternalSorter's readingIterator field is not cleaned up, as is done for its map and buffer fields.
Additionally, there are changes to the CompletionIterator to prevent it from capturing its sub-iterator and holding it even after the completion iterator completes. This is necessary because in some cases, e.g. with the standard Scala flatMap iterator (which is used in CoalescedRDD's compute method), the next value of the main iterator is assigned to flatMap's cur field only once it becomes available.
For DAGs where ShuffledRDD is a parent of CoalescedRDD this means that the data has to be fetched from the map side of the shuffle, and fetching this data consumes quite a lot of memory in addition to the memory already consumed by the iterator held by flatMap's cur field (until it is reassigned).
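To make that mechanism easier to see, here is a minimal, hypothetical sketch (not Spark code; the HeavyIterator class and sizes are invented for illustration) of how flatMap's cur field keeps the previous, memory-heavy sub-iterator reachable while the next one is being produced:

```scala
// Hypothetical illustration (not Spark code) of the flatMap/cur behaviour described above.
object FlatMapHoldsPreviousIterator {
  // A toy iterator that "owns" a large buffer, standing in for a shuffle-reading iterator.
  final class HeavyIterator(id: Int) extends Iterator[Int] {
    private val payload = new Array[Byte](16 * 1024 * 1024) // pretend this is buffered shuffle data
    private var remaining = 3
    override def hasNext: Boolean = remaining > 0
    override def next(): Int = { remaining -= 1; payload.length + id }
  }

  def main(args: Array[String]): Unit = {
    // Each outer element expands into a heavy sub-iterator, similar to
    // CoalescedRDD#compute flat-mapping over the iterators of its parent partitions.
    val flattened: Iterator[Int] = Iterator(1, 2, 3).flatMap(new HeavyIterator(_))
    // While flatMap builds HeavyIterator(2), the exhausted HeavyIterator(1) is still
    // referenced by flatMap's internal `cur` field and cannot be garbage-collected.
    println(flattened.sum)
  }
}
```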

For the following data

```scala
import org.apache.hadoop.io._
import org.apache.hadoop.io.compress._
import org.apache.commons.lang._
import org.apache.spark._

// generate 100M records of sample data
sc.makeRDD(1 to 1000, 1000)
  .flatMap(item => (1 to 100000)
    .map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) -> new Text(RandomStringUtils.randomAlphanumeric(1024))))
  .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec]))
```

and the following job

```scala
import org.apache.hadoop.io._
import org.apache.spark._
import org.apache.spark.storage._

val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text])
rdd
  .map(item => item._1.toString -> item._2.toString)
  .repartitionAndSortWithinPartitions(new HashPartitioner(1000))
  .coalesce(10, false)
  .count
```

... executed like the following

```bash
spark-shell \
  --num-executors=5 \
  --executor-cores=2 \
  --master=yarn \
  --deploy-mode=client \
  --conf spark.executor.memoryOverhead=512 \
  --conf spark.executor.memory=1g \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true'
```

... executors are always failing with OutOfMemoryErrors.

The main issue is multiple leaks of ExternalSorter references.
For example, with 2 tasks per executor, 2 simultaneous instances of ExternalSorter per executor are expected, but the heap dump generated on OutOfMemoryError shows that there are more.

![run1-noparams-dominator-tree-externalsorter](https://user-images.githubusercontent.com/1523889/48703665-782ce580-ec05-11e8-95a9-d6c94e8285ab.png)

P.S. This PR does not cover cases with CoGroupedRDDs which use ExternalAppendOnlyMap internally, which itself can lead to OutOfMemoryErrors in many places.

How was this patch tested?

  • Existing unit tests
  • New unit tests
  • Job executions on the live environment

Here is the screenshot before applying this patch
![run3-noparams-failure-ui-5x2-repartition-and-sort](https://user-images.githubusercontent.com/1523889/48700395-f769eb80-ebfc-11e8-831b-e94c757d416c.png)

Here is the screenshot after applying this patch
![run3-noparams-success-ui-5x2-repartition-and-sort](https://user-images.githubusercontent.com/1523889/48700610-7a8b4180-ebfd-11e8-9761-baaf38a58e66.png)
And even when reducing the number of executors further, the job remains stable
![run3-noparams-success-ui-2x2-repartition-and-sort](https://user-images.githubusercontent.com/1523889/48700619-82e37c80-ebfd-11e8-98ed-a38e1f1f1fd9.png)

@SparkQA

SparkQA commented Nov 19, 2018

Test build #4433 has finished for PR 23083 at commit 12075ec.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@szhem
Contributor Author

szhem commented Nov 20, 2018

Hi @davies, @advancedxy, @rxin,
You seem to be the last ones who touched the corresponding parts of the files in this PR.
Could you be so kind as to take a look at it?

Contributor

@advancedxy advancedxy left a comment


Thanks for your investigation, and this is a nice catch.

However, I am not sure about adding a new interface to TaskContext.

cc @cloud-fan, @jiangxb1987 & @srowen for more comments

/**
* Removes a (Java friendly) listener that is no longer needed to be executed on task completion.
*/
def remoteTaskCompletionListener(listener: TaskCompletionListener): TaskContext
Contributor


You mean removeTaskCompletionListener, didn't you?

Contributor Author


Yep, seems that v was replaced with t on my keyboard)
Thanks a lot!

@@ -99,6 +99,13 @@ private[spark] class TaskContextImpl(
this
}

override def remoteTaskCompletionListener(listener: TaskCompletionListener)
: this.type = synchronized {
onCompleteCallbacks -= listener
Contributor


I'm not sure whether we should add removeTaskCompletionListener or not.

If we are going to add this method, then it's an O(n) operation. Maybe we need to change onCompleteCallbacks to a LinkedHashSet?

Contributor Author

@szhem szhem Nov 21, 2018


Should we do the same thing (i.e. changing ArrayBuffer to LinkedHashSet) for onFailureCallbacks too?

  /** List of callback functions to execute when the task completes. */
  @transient private val onCompleteCallbacks = new ArrayBuffer[TaskCompletionListener]

  /** List of callback functions to execute when the task fails. */
  @transient private val onFailureCallbacks = new ArrayBuffer[TaskFailureListener]

Contributor


If we are going to add the new interface, I think so.

Contributor Author


Replaced ArrayBuffer with LinkedHashSet. Thank you!
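
For reference, a simplified, self-contained sketch of the kind of change being discussed (invented names and signatures, not the actual TaskContextImpl code):

```scala
import scala.collection.mutable

// Simplified sketch only: LinkedHashSet keeps insertion order (needed for the
// deterministic, reverse-order invocation discussed below) while making both
// registration and removal of listeners cheap, unlike ArrayBuffer's O(n) removal.
object TaskContextSketch {
  trait CompletionListener { def onTaskCompletion(): Unit }
  trait FailureListener { def onTaskFailure(error: Throwable): Unit }

  class Context {
    private val onCompleteCallbacks = mutable.LinkedHashSet.empty[CompletionListener]
    private val onFailureCallbacks = mutable.LinkedHashSet.empty[FailureListener]

    def addTaskCompletionListener(l: CompletionListener): this.type =
      synchronized { onCompleteCallbacks += l; this }

    def removeTaskCompletionListener(l: CompletionListener): this.type =
      synchronized { onCompleteCallbacks -= l; this }

    def markTaskCompleted(): Unit = synchronized {
      // process callbacks in the reverse order of registration
      onCompleteCallbacks.toSeq.reverse.foreach(_.onTaskCompletion())
    }
  }
}
```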

Another interesting question is why these collections are traversed in the reverse order in invokeListeners, like the following

  private def invokeListeners[T](
      listeners: Seq[T],
      name: String,
      error: Option[Throwable])(
      callback: T => Unit): Unit = {
    val errorMsgs = new ArrayBuffer[String](2)
    // Process callbacks in the reverse order of registration
    listeners.reverse.foreach { listener =>
      ...
    }
}

I believe @hvanhovell could help us understand. @hvanhovell Could you please remind us why task completion and error listeners are traversed in the reverse order (you seem to be the one who added the corresponding line)?

// note that holding sorter references till the end of the task also holds
// references to PartitionedAppendOnlyMap and PartitionedPairBuffer too and these
// ones may consume a significant part of the available memory
context.remoteTaskCompletionListener(taskListener)
Contributor


Nice catch.

Like I said above, do we have another way to remove the reference to the sorter?

Contributor Author


Great question! Honestly speaking, I don't have a good solution right now.
The TaskCompletionListener stops the sorter in case of task failures, cancellations, etc., i.e. in case of abnormal termination. In the "happy path" case the task completion listener is not needed.

@@ -72,7 +73,8 @@ final class ShuffleBlockFetcherIterator(
maxBlocksInFlightPerAddress: Int,
maxReqSizeShuffleToMem: Long,
detectCorrupt: Boolean)
extends Iterator[(BlockId, InputStream)] with DownloadFileManager with Logging {
extends Iterator[(BlockId, InputStream)] with DownloadFileManager with TaskCompletionListener
Contributor


I don't think it's a good idea for ShuffleBlockFetcherIterator to be a subclass of TaskCompletionListener.
What's wrong with the original solution?

Contributor Author


The main reason is that the TaskCompletionListener is added in one place (in the initialize method) and needs to be removed in another one (in the cleanup method).
Will introduce a field for TaskCompletionListener instead. Thank you!

Contributor


Another field sounds reasonable.

Contributor Author


Introduced the corresponding field.

  /**
   * Task completion callback to be called in both success as well as failure cases to cleanup.
   * It may not be called at all in case the `cleanup` method has already been called before
   * task completion.
   */
  private[this] val cleanupTaskCompletionListener = (_: TaskContext) => cleanup()
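
For context, a hypothetical, self-contained sketch (invented names, not the actual ShuffleBlockFetcherIterator code) of the register-in-initialize / remove-in-cleanup pattern this field enables:

```scala
// Hypothetical sketch only: keeping the listener in a field lets cleanup()
// deregister the exact instance that initialize() registered, so the cleanup
// logic runs exactly once whether the task finishes normally or abnormally.
object ListenerFieldPattern {
  trait Listener { def onCompletion(): Unit }

  final class Context {
    private val listeners = scala.collection.mutable.LinkedHashSet.empty[Listener]
    def add(l: Listener): Unit = synchronized { listeners += l }
    def remove(l: Listener): Unit = synchronized { listeners -= l }
    def markCompleted(): Unit = synchronized { listeners.toSeq.reverse.foreach(_.onCompletion()) }
  }

  final class Fetcher(ctx: Context) {
    private var cleaned = false

    private val cleanupListener: Listener = new Listener {
      def onCompletion(): Unit = cleanup()
    }

    def initialize(): Unit = ctx.add(cleanupListener)

    def cleanup(): Unit = synchronized {
      if (!cleaned) {
        cleaned = true
        // release network buffers / temporary files here
        ctx.remove(cleanupListener) // nothing left for the task-completion hook to do
      }
    }
  }
}
```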

@advancedxy
Contributor

advancedxy commented Nov 21, 2018

And another thing:

P.S. This PR does not cover cases with CoGroupedRDDs which use ExternalAppendOnlyMap internally, which itself can lead to OutOfMemoryErrors in many places.

So do you mean CoGroupedRDDs with multiple input sources will have similar problems? If so, can you create another Jira?

@szhem
Contributor Author

szhem commented Nov 21, 2018

So do you mean CoGroupRDDs with multiple input sources will have similar problems?

Yep, but a little bit different ones

If so, can you create another Jira?

Will do it shortly.

…g TaskCompletionListener interface - using private fields instead, using LinkedHashSets as a collection of task completion listeners for faster lookups during listener removals
@cloud-fan
Contributor

cloud-fan commented Nov 23, 2018

Looking at the code, we are trying to fix 2 memory leaks: the task completion listener in ShuffleBlockFetcherIterator, and the CompletionIterator. If that's the case, can you say that in the PR description?

For the task completion listener, I think it's an overkill to introduce a new API. Do you know where exactly we leak the memory, and can we null it out when the ShuffleBlockFetcherIterator reaches its end?

@advancedxy
Contributor

For the task completion listener, I think it's an overkill to introduce a new API. Do you know where exactly we leak the memory, and can we null it out when the ShuffleBlockFetcherIterator reaches its end?

If I understand correctly, the memory is leaked because the external sorter is referenced by a TaskCompletionListener and it's only GCed when the task is completed. However, for coalesce or similar APIs, multiple BlockStoreShuffleReaders are created as there are multiple input sources; the internal sorter is not released until all shuffle readers are consumed and the task is finished.

It's indeed an overkill to introduce a new API. However, I think we can limit it to private[spark] scope.
Like @szhem, I haven't figured out another way to null out the sorter reference yet.

map = null // So that the memory can be garbage-collected
buffer = null // So that the memory can be garbage-collected
readingIterator = null // So that the memory can be garbage-collected
Contributor


Hi @szhem, I discussed this with Wenchen offline. I think this is the key point. After nulling out readingIterator, ExternalSorter should release all the memory it occupied.

Yes, ExternalSorter is leaked in the TaskCompletionListener, but it would already be stopped by the CompletionIterator in the happy path, and the stopped sorter wouldn't occupy too much memory. The readingIterator is occupying memory because it may reference map/buffer.partitionedDestructiveSortedIterator, which itself references map/buffer. So only nulling out map or buffer is not enough.
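
To make that reference chain concrete, a deliberately simplified sketch (field names only loosely follow ExternalSorter; this is not the real implementation):

```scala
// Simplified illustration: the reading iterator is built on top of an iterator
// over the in-memory buffer, so it keeps that buffer reachable even after the
// `map`/`buffer` fields themselves have been nulled out.
class SorterSketch {
  private var buffer: Array[AnyRef] = new Array[AnyRef](1 << 20) // stand-in for PartitionedPairBuffer
  private var readingIterator: Iterator[AnyRef] = _

  def startReading(): Iterator[AnyRef] = {
    readingIterator = buffer.iterator // the iterator captures `buffer`
    readingIterator
  }

  def stop(): Unit = {
    buffer = null          // not enough on its own: readingIterator still reaches the array
    readingIterator = null // with this as well, the large array becomes unreachable
  }
}
```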

Can you try with this modification only and see whether the OOM still occurs?

Contributor Author

@szhem szhem Nov 25, 2018


@advancedxy I've tried to remove all the modifications except for this one and got OutOfMemoryErrors once again. Here are the details:

  1. Now there are 4 ExternalSorters remaining:
    2 of them are not closed ...
    1_readingiterator_isnull_nonclosed_externalsorter
    ... and 2 of them are closed ...
    2_readingiterator_isnull_closed_externalsorter
    ... as expected.
  2. There are 2 SpillableIterators (which consume a significant part of memory) of already closed ExternalSorters remaining:
    4_readingiterator_isnull_spillableiterator_of_closed_externalsorter
  3. These SpillableIterators are referenced by CompletionIterators ...
    6_completioniterator_of_blockstoreshufflereader
    ... which in their turn seem to be referenced by the cur field ...
    7_coalescedrdd_compute_flatmap
    ... of the standard Iterator's flatMap that is used in the compute method of CoalescedRDD.

The standard Iterator's flatMap does not clean up its cur field before obtaining the next value for it, and producing that next value will consume quite a lot of memory too ...
... and in case of Spark that means that the previous, memory-consuming iterator stays alive while the next value is being fetched.
8_coalescedrdd_compute_flatmap_cur_isnotassigned

So I've brought back the changes to the CompletionIterator that reassign the reference to its sub-iterator to an empty iterator ...

... and that has helped (updated the PR correspondingly).

P.S. Cleaning up the standard flatMap iterator's cur field before calling nextCur will help too (here is the corresponding issue but I don't know whether it will be accepted or not)

  def flatMap[B](f: A => GenTraversableOnce[B]): Iterator[B] = new AbstractIterator[B] {
    private var cur: Iterator[B] = empty
    private def nextCur() { cur = f(self.next()).toIterator }
    def hasNext: Boolean = {
      // Equivalent to cur.hasNext || self.hasNext && { nextCur(); hasNext }
      // but slightly shorter bytecode (better JVM inlining!)
      while (!cur.hasNext) {
        cur = empty
        if (!self.hasNext) return false
        nextCur()
      }
      true
    }
    def next(): B = (if (hasNext) cur else empty).next()
  }

Contributor


Nice. Case well explained.

But I think you need to add corresponding test cases for CompletionIterator and ExternalSorter.

Contributor Author

@szhem szhem Nov 25, 2018


I've added a test case for CompletionIterator.

Regarding ExternalSorter: taking into account that only the private API has been changed and there are no similar test cases which verify that the private map and buffer fields are set to null after the sorter stops, don't you think that the already existing tests cover the situation with readingIterator too?

@szhem szhem changed the title [SPARK-26114][CORE] ExternalSorter Leak [SPARK-26114][CORE] ExternalSorter's readingIterator field leak Nov 25, 2018
@szhem
Contributor Author

szhem commented Nov 25, 2018

Hi @cloud-fan

Looking at the code, we are trying to fix 2 memory leaks: the task completion listener in ShuffleBlockFetcherIterator, and the CompletionIterator. If that's the case, can you say that in the PR description?

I've updated the description and the title of this PR correspondingly.

Contributor

@advancedxy advancedxy left a comment


Add corresponding unit test please.

@cloud-fan
Contributor

ok to test

@cloud-fan
Contributor

LGTM, thanks for your great work!

@SparkQA

SparkQA commented Nov 26, 2018

Test build #99254 has finished for PR 23083 at commit 1723819.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Nov 26, 2018

Test build #99267 has finished for PR 23083 at commit 1723819.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Nov 27, 2018

Test build #99309 has finished for PR 23083 at commit 1723819.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Nov 27, 2018

Test build #99313 has finished for PR 23083 at commit 1723819.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Nov 27, 2018

Test build #99326 has finished for PR 23083 at commit 1723819.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Nov 28, 2018

Test build #99351 has finished for PR 23083 at commit 1723819.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Nov 28, 2018

Test build #99360 has finished for PR 23083 at commit 1723819.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master/2.4!

asfgit pushed a commit that referenced this pull request Nov 28, 2018
Closes #23083 from szhem/SPARK-26114-externalsorter-leak.

Authored-by: Sergey Zhemzhitsky <szhemzhitski@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 438f8fd)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@asfgit asfgit closed this in 438f8fd Nov 28, 2018
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 23, 2019
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Aug 1, 2019
if (!r && !completed) {
  completed = true
  // reassign to release resources of highly resource consuming iterators early
  iter = Iterator.empty.asInstanceOf[I]
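
For readers following along, a simplified, self-contained sketch of the idea in the fragment above (it mirrors, but is not, the actual CompletionIterator class):

```scala
// Simplified sketch: once the wrapped iterator is exhausted, swap it for an empty
// iterator so whatever memory it holds can be garbage-collected even if this
// wrapper itself stays referenced (e.g. by flatMap's `cur` field), and run the
// completion callback exactly once.
class ReleasingIterator[A](sub: Iterator[A])(completion: () => Unit) extends Iterator[A] {
  private var completed = false
  private var iter: Iterator[A] = sub

  override def next(): A = iter.next()

  override def hasNext: Boolean = {
    val r = iter.hasNext
    if (!r && !completed) {
      completed = true
      iter = Iterator.empty // release the (possibly memory-heavy) sub-iterator
      completion()
    }
    r
  }
}
```

A unit test for this behaviour can then drain the wrapped iterator and assert that the wrapper no longer holds a reference to it.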


Contributor Author


@ChenjunZou, you can find the details in this message


@ChenjunZou ChenjunZou Aug 12, 2019


Thanks, szhem :)
Your UT explains it all.
At first I misunderstood sub as CompletionIterator(val sub).
Hidden, well done!
