
[SPARK-27568][CORE] Fix readLock leak while calling take()/first() on a cached rdd #24467

Closed
wants to merge 2 commits

Conversation

Ngone51
Member

@Ngone51 Ngone51 commented Apr 26, 2019

What changes were proposed in this pull request?

Currently, if we run the code below in Spark:

sc.parallelize(Range(0, 10), 1).cache().take(1)

we'll see the following line in the log:

19/04/25 23:48:54 INFO Executor: 1 block locks were not released by TID = 0:
[rdd_0_0]

and, if we set "spark.storage.exceptionOnPinLeak"=true, the job will fail.

Normally, the read lock for a block is released once all elements of its CompletionIterator have been consumed. However, operations like take()/first() do not need to consume all elements, so the release is never triggered.

This PR proposes manually calling completion() on the CompletionIterator if the iterator still has remaining elements after the task finishes, so that the read lock can be released within completion().
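
To make the idea concrete, here is a minimal, self-contained sketch; SimpleCompletionIterator is a simplified stand-in for Spark's (private) CompletionIterator, not the actual patch:

class SimpleCompletionIterator[A](sub: Iterator[A], onComplete: () => Unit) extends Iterator[A] {
  private var completed = false
  def hasNext: Boolean = {
    val r = sub.hasNext
    if (!r) completion()  // normal path: fires only once the iterator is fully drained
    r
  }
  def next(): A = sub.next()
  // Idempotent, so calling it manually after hasNext already fired is safe.
  def completion(): Unit = if (!completed) { completed = true; onComplete() }
}

// take(1) never drains the iterator, so onComplete (the lock release) never fires:
val iter = new SimpleCompletionIterator(Iterator(1, 2, 3), () => println("readLock released"))
val first = iter.next()
// The proposed fix, conceptually: force completion once the task has finished.
if (iter.hasNext) iter.completion()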

How was this patch tested?

Added tests.

@SparkQA

SparkQA commented Apr 26, 2019

Test build #104933 has finished for PR 24467 at commit 21ba1dc.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51
Member Author

Ngone51 commented Apr 27, 2019

Jenkins, retest this please.

@SparkQA

SparkQA commented Apr 27, 2019

Test build #104965 has finished for PR 24467 at commit 21ba1dc.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Apr 28, 2019

Is there a better way to do this than interrogating the class of the delegate? It's a little hacky.

@Ngone51
Member Author

Ngone51 commented Apr 29, 2019

@srowen Yeah, I agree. I'm thinking about it.

@jiangxb1987
Contributor

AFAIK this should not lead to any job failure, because the config "spark.storage.exceptionOnPinLeak" is normally turned off. However, it is a real issue when people submit jobs from the Python side, and I submitted #24542 to catch the AssertionError.

To me, the fix proposed in this PR is acceptable, but I'm not sure whether we should still fix this, since it no longer causes critical issues and the fix itself is kind of hacky.

@cloud-fan
Contributor

This reminds me of the memory leak issue in sort-merge-join.

We use Scala Iterators to exchange data between SQL operators/RDDs, and we can only do the cleanup work when the Iterator is fully consumed, or in a task completion listener.

But what we really need is a traditional database iterator, with open, next, get and close methods. When a downstream operator finishes its work without fully consuming the input data, it can call close to free the resources.
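
A rough sketch of that interface shape (the trait and its names are illustrative, not an existing Spark API):

trait RowIterator[A] {
  def open(): Unit     // acquire resources, e.g. a block read lock
  def next(): Boolean  // advance to the next element; false when exhausted
  def get(): A         // the current element
  def close(): Unit    // release resources, even if not exhausted
}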

For this particular case, I think using a task completion listener is good enough?
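
For illustration, a minimal sketch of that alternative; the wrapper helper is hypothetical, though TaskContext.addTaskCompletionListener is a real Spark API:

import org.apache.spark.TaskContext

// Hypothetical helper: register cleanup that runs when the task finishes,
// whether or not the iterator was fully consumed.
def withTaskCleanup[A](iter: Iterator[A], releaseLock: () => Unit): Iterator[A] = {
  TaskContext.get().addTaskCompletionListener[Unit] { _ => releaseLock() }
  iter
}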

@Ngone51
Member Author

Ngone51 commented May 7, 2019

For this particular case, I think using a task completion listener is good enough?

The block-level read/write lock mechanism rests on the basic assumption that all block locks are released when a task finishes, which is why we check for leaked locks after the task finishes. Since a task completion listener would also be triggered after the task finishes, I don't think using one would make a big difference.

Actually, the process of checking for leaked locks (by calling releaseAllLocksForTask(taskId)) is already like the close operation of a database iterator: it releases all remaining locks for the task. So I agree with @jiangxb1987 and would prefer not to fix this unless we find a better approach than the current hacky one.
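
To spell out the analogy, a toy sketch with hypothetical names (not Spark's internal API):

import scala.collection.mutable

// Toy model: force-releasing all of a task's outstanding locks at task end
// plays the role of a database iterator's close().
class LockRegistry {
  private val locksByTask = mutable.Map.empty[Long, List[String]].withDefaultValue(Nil)
  def acquire(taskId: Long, blockId: String): Unit =
    locksByTask(taskId) = blockId :: locksByTask(taskId)
  // The "close": free everything the task still holds and report what leaked.
  def releaseAllLocksForTask(taskId: Long): List[String] =
    locksByTask.remove(taskId).getOrElse(Nil)
}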

@srowen srowen closed this Jun 1, 2019
@SparkQA

SparkQA commented Jun 1, 2019

Test build #106053 has finished for PR 24467 at commit 21ba1dc.

  • This patch fails to build.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@Ngone51 Ngone51 deleted the dev-pinleak branch June 1, 2019 14:25