Adding known issue for MESOS-1688 #1860
Conversation
Can one of the admins verify this patch?
Hey Martin, I'm having a bit of trouble seeing how this works around the issue. From what I can tell the issue is that if someone creates Executors that consume all memory, Mesos will refuse to make offers for the tasks. However, this fix just adds 32MB of memory as a requirement for the task... but it seems like if the offer is never made in the first place, this will make no difference. Can you describe a sequence of offers where this change alters the execution? Thanks for looking into this!
Hey Patrick, first of all let me emphasize again that this is only a work-around. […] I can only argue from an experimental point of view, that I have not […] BTW, I have also played with changing the executor memory so that there […] So I'm not sure if this patch should be integrated into the Spark source […] If I can help in any way, just tell me. Best regards,
From my knowledge of Mesos, this seems like a good fix. I think we should do this until MESOS-1688 is fixed.
Jenkins, test this please
BTW @MartinWeindel one small request -- can you update the docs/running-on-mesos.md page to explain that each task will consume 32 MB? Otherwise people might set Spark's executor memory to be all of the memory on the Mesos worker, which is going to mean no tasks get launched.
QA tests have started for PR 1860 at commit
QA tests have finished for PR 1860 at commit
BTW this failure is due to a style check -- you can run sbt scalastyle locally to find all style issues (the Jenkins log also lists the problem).
@MartinWeindel I think you should check if there's enough memory in the offer first.
That's true, now that we take 32 MB extra you need to change the logic about how many tasks we can allocate. That will make it trickier.
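The accounting change being discussed can be sketched as follows. This is a hypothetical Python model, not the Spark scheduler code: once each task reserves an extra 32 MB, the number of tasks that fit in an offer must be bounded by the offer's memory, not just its CPUs.

```python
# Hypothetical sketch of the per-offer accounting: with a 32 MB reservation
# per task, memory can become the binding constraint on task count.
TASK_MEM_OVERHEAD_MB = 32  # = Mesos MIN_MEM

def max_launchable_tasks(offer_cpus, offer_mem_mb, cpus_per_task=1):
    by_cpu = offer_cpus // cpus_per_task
    by_mem = offer_mem_mb // TASK_MEM_OVERHEAD_MB
    return min(by_cpu, by_mem)
```

For example, an offer with 8 CPUs but only 96 MB of memory would fit just 3 tasks under this scheme, even though there are CPUs for 8.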
Hey @MartinWeindel - I'm curious, which of the following cases are you in:
Case 1. You have individual executors that attempt to acquire all the memory on the node.
Case 2. You have multiple executors per node, but their total memory adds up to the total amount of memory on the node.
I could see how this would help with Case 2 because it could prevent a second executor from being launched in a way that acquires all of the host memory. But I'm still wondering whether it affects Case 1.
Yes, this becomes tricky. And I don't see a satisfying solution, as I would […] This patch solves one problem, but will introduce new ones. Because it's […] I've already created a pull request to get the cause fixed in Mesos: […]
After thinking about this more, it seems that another workaround is to make sure your executors always leave 32 MB free on each node (even if you launch multiple executors, make sure their sizes don't add up to quite the full memory). Would that work? If so, we can just add that to the docs.
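The sizing rule in the workaround above amounts to simple arithmetic. A worked example in Python, with made-up node and executor counts (the 32 MB floor is the only value taken from the discussion):

```python
# Worked example of the documented workaround: size executors so that,
# per node, at least 32 MB (Mesos MIN_MEM) is always left unallocated.
MIN_MEM_MB = 32
node_mb = 8192        # illustrative node memory
num_executors = 2     # illustrative executors per node

# Largest per-executor memory that still leaves MIN_MEM free on the node.
executor_mb = (node_mb - MIN_MEM_MB) // num_executors
assert executor_mb * num_executors + MIN_MEM_MB <= node_mb
```

With these numbers each executor gets at most 4080 MB; setting them to 4096 MB each would consume the full node and, per MESOS-1688, stop new offers.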
OK, so I have reverted the work-around patch and added a known-issue paragraph to the running-on-mesos documentation.
Cool, thanks, that looks great.
When using Mesos in fine-grained mode, a Spark job can run into a deadlock when allocatable memory on the Mesos slaves is low. As a work-around, 32 MB (= Mesos MIN_MEM) is allocated for each task to ensure Mesos makes new offers after task completion. From my perspective, it would be better to fix this problem in Mesos by dropping the memory constraint on offers, but as a temporary solution this patch helps avoid the deadlock on current Mesos versions. See [[MESOS-1688] No offers if no memory is allocatable](https://issues.apache.org/jira/browse/MESOS-1688) for details on this problem.

Author: Martin Weindel <martin.weindel@gmail.com>

Closes apache#1860 from MartinWeindel/master and squashes the following commits:
5762030 [Martin Weindel] reverting work-around
a6bf837 [Martin Weindel] added known issue for issue MESOS-1688
d9d2ca6 [Martin Weindel] work around for problem with Mesos offering semantic (see [https://issues.apache.org/jira/browse/MESOS-1688])
Just for cross-reference: MESOS-1688 has been committed and will be part of the 0.21.0 release cycle.
Great! I'll create a JIRA to update Spark to it when that comes out.