
[SPARK-26713][CORE] Interrupt pipe IO threads in PipedRDD when task is finished #23638

Closed
wants to merge 3 commits

Conversation

advancedxy
Contributor

What changes were proposed in this pull request?

Manually release the stdin writer and stderr reader threads when the task is finished. This commit also marks ShuffleBlockFetcherIterator as fully consumed if isZombie is set.
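For readers skimming the thread, the core of the PipedRDD change has roughly the following shape. This is a hedged sketch, not the exact patch: proc, stdinWriterThread and stderrReaderThread stand in for the corresponding locals in PipedRDD.compute, and TaskContext.get() is assumed to be non-null because compute runs inside a task.

val context = TaskContext.get()
context.addTaskCompletionListener[Unit] { _ =>
  // Destroy the subprocess if it is still running, so its streams unblock.
  if (proc.isAlive) {
    proc.destroy()
  }
  // Interrupt the stdin writer, which may be blocked consuming the parent
  // RDD's iterator (e.g. a shuffle fetch that never completes).
  if (stdinWriterThread.isAlive) {
    stdinWriterThread.interrupt()
  }
  // Interrupt the stderr reader, which may be blocked on the subprocess.
  if (stderrReaderThread.isAlive) {
    stderrReaderThread.interrupt()
  }
}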

How was this patch tested?

Added new test

@advancedxy
Contributor Author

cc @xuanyuanking @cloud-fan @gatorsmile
and also @srowen, as you reviewed related PipedRDD PRs.

@advancedxy changed the title from "[SPARK-26713][CORE] Release pipe IO threads in PipedRDD when task is finished" to "[SPARK-26713][CORE] Interrupt pipe IO threads in PipedRDD when task is finished" Jan 24, 2019
@advancedxy
Contributor Author

@srowen I addressed your comments. Any more comments?

currentResult = result.asInstanceOf[SuccessFetchResult]
(currentResult.blockId, new BufferReleasingInputStream(input, this))
} else { // the iterator has already been closed
throw new NoSuchElementException
Member

Tiny nit: new NoSuchElementException(). You could also write:

if (result == null) {
  throw ..
}

but doesn't really matter, maybe just slightly cleaner to follow.

Contributor Author

OK.

L387 in this file is also throw new NoSuchElementException. Shall I modify that too?

core/src/main/scala/org/apache/spark/rdd/PipedRDD.scala (outdated; resolved)
// val abnormalRDD = pipeRDD.mapPartitions(_ => Iterator.empty)
// the iterator generated by PipedRDD is never involved. If the parent RDD's iterator is time
// consuming to generate(ShuffledRDD's shuffle operation for example), the outlived stdin writer
// thread will consume significant memory and cpu time. Also, there's race condition for
Member

Does the second part of the comment, beginning at "Also,", belong below in the change to ShuffleBlockFetcherIterator?

Contributor Author

This is a tricky one. After the fix in this PR to ShuffleBlockFetcherIterator, the race condition should be really rare, as only one potential next call may hang. However, I am unable to find a good place to put the above comment in ShuffleBlockFetcherIterator. So it's inside this task completion listener.

Do you have any suggestions?

Member

Why not put the comment next to the change in ShuffleBlockFetcherIterator?

core/src/main/scala/org/apache/spark/rdd/PipedRDD.scala (outdated; resolved)
@@ -372,7 +372,7 @@ final class ShuffleBlockFetcherIterator(
logDebug("Got local blocks in " + Utils.getUsedTimeMs(startTime))
}

override def hasNext: Boolean = numBlocksProcessed < numBlocksToFetch
override def hasNext: Boolean = !isZombie && (numBlocksProcessed < numBlocksToFetch)
Member

Is the ShuffleBlockFetcherIterator change related to PipedRDD?

Contributor Author

Kind of. The root cause of the OOM I described in SPARK-26713 is ShuffledRDD + PipedRDD.

I found PipedRDD's stderr writer hangs at ShuffleBlockFetcherIterator.next() and leaks memory. I believe this change to ShuffleBlockFetcherIterator could reduce the possibility of the race condition, and it seems right to mark the iterator as fully consumed if it is already cleaned up.

Member

After the change to PipedRDD in this PR, won't the stdin writer thread (you wrote stderr writer, but I think it is a typo) be interrupted? If so, it stops consuming the ShuffleBlockFetcherIterator. Isn't that enough to solve the problem?

Contributor Author

(you wrote stderr writer, but I think it is a typo)

Sorry for the typo.

If so, it stops consuming the ShuffleBlockFetcherIterator. Isn't that enough to solve the problem?

Yes, for the ShuffledRDD + PipedRDD case, the cleanup logic in PipedRDD is enough to solve the potential leak.
However, since a ShuffledRDD can be transformed by arbitrary operations, there may be other cases where a ShuffleBlockFetcherIterator is cleaned up but still being consumed. So this change makes ShuffleBlockFetcherIterator defensive.

Member

OK, it sounds good to me. You can make the related comment general and move it to ShuffleBlockFetcherIterator.

@SparkQA

SparkQA commented Jan 25, 2019

Test build #4529 has finished for PR 23638 at commit ef7643e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@advancedxy
Contributor Author

Gently ping @srowen, do you have any concerns or suggestions?

Member

@srowen left a comment

@advancedxy I wouldn't ping so often. We have tens of PRs to review each day in our spare time.

// val abnormalRDD = pipeRDD.mapPartitions(_ => Iterator.empty)
// the iterator generated by PipedRDD is never involved. If the parent RDD's iterator is time
// consuming to generate(ShuffledRDD's shuffle operation for example), the outlived stdin writer
// thread will consume significant memory and cpu time. Also, there's race condition for
Member

Why not put the comment next to the change in ShuffleBlockFetcherIterator?

// consuming to generate(ShuffledRDD's shuffle operation for example), the outlived stdin writer
// thread will consume significant memory and cpu time. Also, there's race condition for
// ShuffledRDD + PipedRDD if the subprocess command is failed. The task will be marked as failed
// , ShuffleBlockFetcherIterator will be cleaned up at task completion, which may hang
Member

Nit: this comma seems misplaced

Member

@srowen left a comment

Just comments on comments now

* When the iterator is inactive, [[hasNext]] and [[next()]] calls will honor that as there are
* cases the iterator is still being consumed. For example, there was a race condition for
* ShuffledRDD + PipedRDD if the subprocess command is failed. The task will be marked as failed,
* then the iterator will be cleaned up at task completion, the [[next()]] call(called in the
Member

nit: space in call(called
I also don't know that [[ will do anything here as scaladoc won't be generated for this anyway
"there was a race condition" -> don't know if it's useful, just explain what the code is doing now, rather than the previous issue

Contributor Author

I also don't know that [[ will do anything here as scaladoc won't be generated for this anyway

I noticed [[ is used here and there in this file and other source files. I think it's clear to others (who are reading the code) that this points to the actual method/variable. Also, there's a bonus that we can jump straight to the code in IntelliJ IDEA, as it's treated as a code block.

Will fix the other issues.

stdinWriterThread.start()

// interrupts stdin writer and stderr reader threads when the corresponding task is finished as
// a safe belt. Otherwise, these threads could outlive the task's lifetime. For example:
Member

safe belt -> remove this? not sure what it's getting at.

Contributor Author

Will fix.

// a safe belt. Otherwise, these threads could outlive the task's lifetime. For example:
// val pipeRDD = sc.range(1, 100).pipe(Seq("cat"))
// val abnormalRDD = pipeRDD.mapPartitions(_ => Iterator.empty)
// the iterator generated by PipedRDD is never involved. If the parent RDD's iterator is time
Member

is time consuming -> takes a long time?
outlived stdin writer -> stdin writer?
cpu -> CPU

Contributor Author

Thanks. Will fix.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Jan 27, 2019

Test build #101726 has finished for PR 23638 at commit be18ba7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 27, 2019

Test build #4533 has finished for PR 23638 at commit be18ba7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Jan 28, 2019

Merged to master

@srowen closed this in 1280bfd Jan 28, 2019
@@ -395,7 +402,7 @@ final class ShuffleBlockFetcherIterator(
// then fetch it one more time if it's corrupt, throw FailureFetchResult if the second fetch
// is also corrupt, so the previous stage could be retried.
// For local shuffle block, throw FailureFetchResult for the first IOException.
while (result == null) {
while (!isZombie && result == null) {
Contributor

is it possible that hasNext returns true and next throws NoSuchElementException? isZombie may get changed by other threads?

Member

Yeah that can happen. Right now I think it's 'worse' in that the iterator might be cleaned up and yet next() will keep querying the iterator that's being drained by cleanup().

To really tighten it up I think more or all of next() and cleanup() would have to be synchronized (?) and I'm not sure what the implications are of that.

We could follow this up with small things like making hasNext() synchronized at least, as isZombie is marked GuardedBy("this"). That still doesn't prevent this from happening but is a little tighter.

@advancedxy what do you think? I think the argument is merely that this fixes the potential issue in 99% of cases, not 100%.
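To illustrate the "slightly tighter" variant: since isZombie is marked @GuardedBy("this"), hasNext could read it under the same lock that cleanup() holds when setting it. A sketch, which narrows the race window but, as noted above, does not eliminate it:

override def hasNext: Boolean = synchronized {
  // isZombie is @GuardedBy("this"); read it under the lock so a concurrent
  // cleanup() is observed, though next() can still race after we return.
  !isZombie && (numBlocksProcessed < numBlocksToFetch)
}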

Contributor Author

is it possible that hasNext returns true and next throws NoSuchElementException? isZombie may get changed by other threads?

@cloud-fan Yeah, it can happen. But I agree with @srowen. The isZombie flag indicates the whole task is finished; there's no point in the consumer of the iterator still being active. This changes the semantics of Iterator in rare cases, but I think it is acceptable.

We could follow this up with small things like making hasNext() synchronized at least, as isZombie is marked GuardedBy("this"). That still doesn't prevent this from happening but is a little tighter.

Maybe. But I would leave it as it is, if it's up to me. Like you said, this doesn't prevent the semantics from changing, but it is a little tighter.

* subprocess command is failed. The task will be marked as failed, then the iterator will be
* cleaned up at task completion, the [[next]] call (called in the stdin writer thread of
* PipedRDD if not exited yet) may hang at [[results.take]]. The defensive check in [[hasNext]]
* and [[next]] reduces the possibility of such race conditions.
Contributor

When a task finishes, do we really need to guarantee that all the iterators stop producing data? I agree it's better, but I'm afraid it's too much effort to guarantee it. It's not only the shuffle reader; we would also need to fix the sort iterator, the aggregate iterator, and so on.

And why can't PipedRDD stop consuming input? I think it's better to fix the sole consumer side, instead of fixing different kinds of producer sides.
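One way to picture "fix the sole consumer side" is a guard in PipedRDD's stdin writer loop, so it stops pulling from the upstream iterator once the task completes. This is a hypothetical sketch only (the PR interrupts the writer thread instead); context, parentIterator and out are assumed names:

// Stop consuming the parent iterator once the task has completed, instead
// of making every producer iterator zombie-aware.
while (!context.isCompleted() && parentIterator.hasNext) {
  out.println(parentIterator.next())
}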

Contributor Author

And why can't PipedRDD stop consuming input? I think it's better to fix the sole consumer side, instead of fixing different kinds of producer sides.

The PipedRDD stops consuming input in this PR. For the ShuffledRDD + PipedRDD case alone, the fixes in PipedRDD are sufficient. But I noticed that the iterator still producing data is also a cause, therefore I made the corresponding changes.

When a task finishes, do we really need to guarantee that all the iterators stop producing data? I agree it's better, but I'm afraid it's too much effort to guarantee it. It's not only the shuffle reader; we would also need to fix the sort iterator, the aggregate iterator, and so on.

I think we can try our best to guarantee that. If it's too much effort, we could stop trying or try different approaches.

Contributor

"try best" is not a "guarantee".

If we don't need to do this, I suggest we not do it at all. The new changes in ShuffleBlockFetcherIterator make it harder for people to understand the code (at least for me), and they also break the semantics of Iterator. And I don't see much benefit in doing it, as PipedRDD has been fixed. Can we revert the changes in ShuffleBlockFetcherIterator?

Contributor Author

Can we revert the changes in ShuffleBlockFetcherIterator?

If you insist, I can revert those changes. But let's wait and see if others have other opinions.
cc @srowen, @HyukjinKwon and @viirya.

also break the semantics of Iterator.

Rarely, and it shouldn't matter that much if the task is already finished.

Member

I agree that this is not the expected behavior of Iterator. If there are elements in the Iterator, it should return true when hasNext is called. It sounds more reasonable for the consumer side to stop consuming the Iterator.

Member

Sure, how about a follow-up that tries a different approach? The current change isn't harmful per se, and it is a small improvement.

You're suggesting reading and storing the next element that's available, if not already read, in hasNext? And then next must call hasNext to ensure this is filled if it isn't already, which it does already? Yeah, that seems reasonable. That pattern is used in other iterators sometimes.
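For concreteness, the look-ahead pattern described here might look like the following generic sketch (hypothetical names, not the actual follow-up): hasNext pre-fetches and caches the next element, and next() consumes the cache, so a true hasNext can no longer be followed by a NoSuchElementException within one thread.

abstract class LookAheadIterator[A] extends Iterator[A] {
  // Caches the element fetched by hasNext until next() consumes it.
  private var buffered: Option[A] = None

  // Subclasses implement the actual (possibly blocking) fetch;
  // None signals exhaustion.
  protected def fetchNext(): Option[A]

  override def hasNext: Boolean = {
    if (buffered.isEmpty) {
      buffered = fetchNext()
    }
    buffered.isDefined
  }

  override def next(): A = {
    if (!hasNext) {
      throw new NoSuchElementException("end of stream")
    }
    val elem = buffered.get
    buffered = None
    elem
  }
}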

Contributor

how about a follow-up that tries a different approach

Sure, if we can do it soon. Hopefully we don't leave this partial fix in the code base for years...

Contributor Author

Sorry for the late reply. We are on the same page now, and I think @cloud-fan's proposal seems reasonable. I may create a new JIRA and come up with a different fix. However, I am leaving for the holidays (lunar new year) soon. I cannot guarantee it will be finished in a couple of days, but I will try my best to resolve it before the end of February. Others are welcome to take it over if there is too much delay.

P.S. I just looked through potentially similar issues to PipedRDD. I believe RRunner may have the same issue, as it doesn't clean up its threads. On the other hand, PythonRunner gracefully stops its threads.

Contributor

@advancedxy thanks for working on it! I'm leaving for the lunar new year soon too; the end of February sounds good.

Contributor

@advancedxy I am facing the same issue discussed by @cloud-fan: there are two threads consuming the results queue at the same time, causing the Spark application to hang. Is the fix for this issue being worked on right now? Is there a JIRA to track the fix?

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…s finished

## What changes were proposed in this pull request?
Manually release the stdin writer and stderr reader threads when the task is finished. This commit also marks ShuffleBlockFetcherIterator as fully consumed if isZombie is set.

## How was this patch tested?
Added a new test

Closes apache#23638 from advancedxy/SPARK-26713.

Authored-by: Xianjin YE <advancedxy@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
@cloud-fan
Contributor

@advancedxy any updates?

@advancedxy
Contributor Author

@cloud-fan sorry, I lost track of this. I will submit a new PR this weekend.

@advancedxy
Contributor Author

However, I haven't gotten hanging reports from our internal users since this patch, which is one of the reasons I lost track of this issue.

@cloud-fan
Contributor

I haven't gotten hanging reports from our internal users since this patch

I think it's because the problem has been fixed in PipedRDD. According to the previous discussion, we should either revert the changes in ShuffleBlockFetcherIterator, or fix it in the right way for future safety.

@dongjoon-hyun
Member

Hi, all.
Since branch-2.4 is our LTS for the 2.x release line, can we have this fix in branch-2.4?

@cloud-fan
Contributor

I'm fine with it; let's also include the follow-up #25049.

@advancedxy
Contributor Author

I can submit a PR to branch-2.4.

@dongjoon-hyun
Member

Thank you, @cloud-fan and @advancedxy!
Yes. Please submit a backport PR, @advancedxy.

dongjoon-hyun pushed a commit that referenced this pull request Sep 18, 2019
…ask is finished

### What changes were proposed in this pull request?
Manually release the stdin writer and stderr reader threads when the task is finished. This is a backport of #23638, including #25049.

### Why are the changes needed?
This is a bug fix. PipedRDD's IO threads may hang even when the corresponding task is already finished. Without this fix, it would leak resources (memory especially).

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Added a new test

Closes #25825 from advancedxy/SPARK-26713_for_2.4.

Authored-by: Xianjin YE <advancedxy@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
rshkv pushed a commit to palantir/spark that referenced this pull request Mar 12, 2021
…ask is finished

### What changes were proposed in this pull request?
Manually release the stdin writer and stderr reader threads when the task is finished. This is a backport of apache#23638, including apache#25049.

### Why are the changes needed?
This is a bug fix. PipedRDD's IO threads may hang even when the corresponding task is already finished. Without this fix, it would leak resources (memory especially).

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Added a new test

Closes apache#25825 from advancedxy/SPARK-26713_for_2.4.

Authored-by: Xianjin YE <advancedxy@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
rshkv pushed a commit to palantir/spark that referenced this pull request Mar 15, 2021