Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-30069][CORE][YARN] Clean up non-shuffle disk block manager files following executor exists on YARN #26711

Closed
wants to merge 1 commit into from

Conversation

LantaoJin
Copy link
Contributor

@LantaoJin LantaoJin commented Nov 29, 2019

What changes were proposed in this pull request?

Currently we only clean up the local directories on application removed. However, when executors die and restart repeatedly, many temp files are left untouched in the local directories, which is undesired behavior and could cause disk space used up gradually. Especially, in long running service like Spark thrift-server with dynamic resource allocation disabled, it's very easy causes local disk full.
#21390 fixed the same problem on Standalone mode. On YARN, this issue still exists.

From #21390 (comment), YARN only cleans container local dirs when container (executor) is exited. But these files are not in container local dirs.
Screen Shot 2019-11-29 at 4 52 56 PM

So this patch is very straightforward:
We create these "temp_xxx " files under the container dirs when the executor is running in YARN container.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Add an UT and manually test.

@LantaoJin
Copy link
Contributor Author

@SparkQA
Copy link

SparkQA commented Nov 29, 2019

Test build #114626 has finished for PR 26711 at commit c226dd6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LantaoJin
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Dec 2, 2019

Test build #114690 has finished for PR 26711 at commit c226dd6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Copy link
Contributor

jiangxb1987 commented Dec 3, 2019

This is actually writing temp shuffle files into YARN container local dirs instead of local dirs? Could there be any performance difference after this PR? (Sorry I'm not familiar with YARN)

@LantaoJin
Copy link
Contributor Author

LantaoJin commented Dec 3, 2019

This is actually writing temp shuffle files into YARN container local dirs instead of local dirs? Could there be any performance difference after this PR? (Sorry I'm not familiar with YARN)

Yes. I think only this part could have performance difference in YARN mode:
https://github.com/apache/spark/pull/26711/files#diff-b25a153e05a044779df448f024c3a3afR125
But this part already had potentially performance problem. See

* is a potentially expensive operation and should only be used for testing.

@LantaoJin
Copy link
Contributor Author

LantaoJin commented Dec 3, 2019

Ah, DiskBlockManager.stop() also needs to traverse localDirs and containerDirs to delete them when Executor.stop() in YARN mode.
https://github.com/apache/spark/pull/26711/files#diff-b25a153e05a044779df448f024c3a3afR248

But I think it's not a problem.

@LantaoJin
Copy link
Contributor Author

gentle ping @cloud-fan @jiangxb1987 @dongjoon-hyun @jerryshao

@LantaoJin
Copy link
Contributor Author

@vanzin @squito

@LantaoJin
Copy link
Contributor Author

@LantaoJin
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Mar 12, 2020

Test build #119692 has finished for PR 26711 at commit c226dd6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LantaoJin
Copy link
Contributor Author

@LantaoJin LantaoJin closed this Mar 14, 2020
@LantaoJin LantaoJin reopened this Mar 14, 2020
@tgravescs
Copy link
Contributor

perhaps you can expand on the description as to the approach you are taking? How do you know the temp_ files aren't being used?

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jun 25, 2020
@github-actions github-actions bot closed this Jun 26, 2020
@tianshuang
Copy link

@LantaoJin Has this problem been solved? We are facing the same problem now

@LantaoJin
Copy link
Contributor Author

@tianshuang This problem should be resolved by this PR. But it was closed by githut bot. So I think it still exists in 3.0.

@tianshuang
Copy link

After thrift-server runs for a long time, you will encounter an error of 'no space left on device'. I hope to reopen this issue to completely solve the problem.

@LantaoJin
Copy link
Contributor Author

After thrift-server runs for a long time, you will encounter an error of 'no space left on device'. I hope to reopen this issue to completely solve the problem.

It cannot reopen now. Seems I have to re-create a new PR.

@LantaoJin
Copy link
Contributor Author

@tianshuang recreated as #29378

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants