
[SPARK-27568][CORE] Fix readLock leak while calling take()/first() on a cached rdd #24467

Closed
wants to merge 2 commits

Conversation

Ngone51
Member

@Ngone51 Ngone51 commented Apr 26, 2019

What changes were proposed in this pull request?

Currently, if we run the code below in Spark:

sc.parallelize(Range(0, 10), 1).cache().take(1)

we'll see the following line in the log:

19/04/25 23:48:54 INFO Executor: 1 block locks were not released by TID = 0:
[rdd_0_0]

and, if we set "spark.storage.exceptionOnPinLeak"=true, the job will fail.

Normally, the read lock for a block is released once all elements of its CompletionIterator have been consumed. However, operations like take()/first() do not need to consume all elements, so the release is never triggered.

This PR proposes manually calling completion() on the CompletionIterator if the iterator still has remaining elements after the task finishes, so that the read lock can be released within completion().
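
To make the idea concrete, here is a minimal, self-contained sketch; SimpleCompletionIterator is a simplified stand-in for Spark's (private) CompletionIterator, not the actual patch:

class SimpleCompletionIterator[A](sub: Iterator[A], onComplete: () => Unit) extends Iterator[A] {
  private var completed = false
  def hasNext: Boolean = {
    val r = sub.hasNext
    if (!r) completion()  // normal path: fires only once the iterator is fully drained
    r
  }
  def next(): A = sub.next()
  // Idempotent, so calling it manually after hasNext already fired is safe.
  def completion(): Unit = if (!completed) { completed = true; onComplete() }
}

// take(1) never drains the iterator, so onComplete (the lock release) never fires:
val iter = new SimpleCompletionIterator(Iterator(1, 2, 3), () => println("readLock released"))
val first = iter.next()
// The proposed fix, conceptually: force completion once the task has finished.
if (iter.hasNext) iter.completion()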

How was this patch tested?

Added tests.

@SparkQA

SparkQA commented Apr 26, 2019

Test build #104933 has finished for PR 24467 at commit 21ba1dc.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51
Member Author

Ngone51 commented Apr 27, 2019

Jenkins, retest this please.

@SparkQA

SparkQA commented Apr 27, 2019

Test build #104965 has finished for PR 24467 at commit 21ba1dc.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Apr 28, 2019

Is there a better way to do this than interrogating the class of the delegate? It's a little hacky.

@Ngone51
Member Author

Ngone51 commented Apr 29, 2019

@srowen Yeah, I agree. I'm thinking about it.

@jiangxb1987
Contributor

AFAIK this should not lead to any job failure, because the config "spark.storage.exceptionOnPinLeak" is normally turned off. However, it is a real issue when people submit jobs from the Python side, and I submitted #24542 to catch the AssertionError.

To me, the fix proposed in this PR is acceptable, but I'm not sure whether we should still fix this, since it no longer causes critical issues and the fix itself is kind of hacky.

@cloud-fan
Contributor

This reminds me of the memory leak issue in sort-merge-join.

We use Scala Iterators to exchange data between SQL operators/RDDs, and we can only do the cleanup work when the Iterator is fully consumed, or in a task completion listener.

But what we really need is a traditional database iterator, with open, next, get and close methods. When a downstream operator finishes its work without fully consuming the input data, it can call close to free the resources.
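
A rough sketch of that interface shape (the trait and its names are illustrative, not an existing Spark API):

trait RowIterator[A] {
  def open(): Unit     // acquire resources, e.g. a block read lock
  def next(): Boolean  // advance to the next element; false when exhausted
  def get(): A         // the current element
  def close(): Unit    // release resources, even if not exhausted
}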

For this particular case, I think using a task completion listener is good enough?
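
For illustration, a minimal sketch of that alternative; the wrapper helper is hypothetical, though TaskContext.addTaskCompletionListener is a real Spark API:

import org.apache.spark.TaskContext

// Hypothetical helper: register cleanup that runs when the task finishes,
// whether or not the iterator was fully consumed.
def withTaskCleanup[A](iter: Iterator[A], releaseLock: () => Unit): Iterator[A] = {
  TaskContext.get().addTaskCompletionListener[Unit] { _ => releaseLock() }
  iter
}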

@Ngone51
Member Author

Ngone51 commented May 7, 2019

For this particular case, I think using a task completion listener is good enough?

The block-level read/write lock mechanism rests on the basic assumption that all block locks are released when a task finishes, which is why we check for leaked locks after the task finishes. Since a task completion listener would also be triggered after the task finishes, I don't think using one would make a big difference.

Actually, the process of checking for leaked locks (by calling releaseAllLocksForTask(taskId)) is already like the close operation of a database iterator: it releases all remaining locks for the task. So I agree with @jiangxb1987 and would prefer not to fix this unless we find a better approach than the current hacky one.
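
To spell out the analogy, a toy sketch with hypothetical names (not Spark's internal API):

import scala.collection.mutable

// Toy model: force-releasing all of a task's outstanding locks at task end
// plays the role of a database iterator's close().
class LockRegistry {
  private val locksByTask = mutable.Map.empty[Long, List[String]].withDefaultValue(Nil)
  def acquire(taskId: Long, blockId: String): Unit =
    locksByTask(taskId) = blockId :: locksByTask(taskId)
  // The "close": free everything the task still holds and report what leaked.
  def releaseAllLocksForTask(taskId: Long): List[String] =
    locksByTask.remove(taskId).getOrElse(Nil)
}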

@srowen srowen closed this Jun 1, 2019
@SparkQA

SparkQA commented Jun 1, 2019

Test build #106053 has finished for PR 24467 at commit 21ba1dc.

  • This patch fails to build.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@Ngone51 Ngone51 deleted the dev-pinleak branch June 1, 2019 14:25