Avoid warning message about invalid refuse_seconds value in Mesos >=0.21... #5597

Closed


MartinWeindel
Contributor

Starting with version 0.21.0, Apache Mesos is very noisy if the filter parameter refuse_seconds is set to an invalid value like -1.
I have seen systems with millions of log lines like

W0420 18:00:48.773059 32352 hierarchical_allocator_process.hpp:589] Using the default value of 'refuse_seconds' to create the refused resources filter because the input value is negative

in the Mesos master INFO and WARNING log files.
Therefore the CoarseMesosSchedulerBackend should set the default value for refuse seconds (i.e. 5 seconds) directly.
This is no problem for the fine-grained MesosSchedulerBackend, as it uses the value 1 second for this parameter.
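
As a rough illustration of the behavior behind the warning, here is a minimal Python sketch. It is not Mesos or Spark code: `effective_refuse_seconds` and `MESOS_DEFAULT_REFUSE_SECONDS` are hypothetical names, and the only facts taken from this thread are that a negative value is replaced by the default of 5 seconds and that a warning is logged when that happens.

```python
# Default filter timeout as stated in this pull request (5 seconds).
MESOS_DEFAULT_REFUSE_SECONDS = 5.0


def effective_refuse_seconds(requested):
    """Return (timeout, warned) for a requested refuse_seconds value.

    Hypothetical helper: models only the fallback described by the
    warning message, not the real allocator implementation.
    """
    if requested < 0:
        # Negative input: the default is used and a warning would be logged.
        return MESOS_DEFAULT_REFUSE_SECONDS, True
    return float(requested), False


# Old coarse-grained value (-1) versus the value proposed here (5).
old_timeout, old_warned = effective_refuse_seconds(-1)
new_timeout, new_warned = effective_refuse_seconds(5)
print(old_timeout, old_warned)
print(new_timeout, new_warned)
```

Under these assumptions both values lead to the same effective timeout; only the explicit 5 avoids the warning path.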

@AmplabJenkins

Can one of the admins verify this patch?

@srowen
Member

srowen commented Apr 20, 2015

Sounds reasonable, since the value is reported to be invalid. The intent seemed to be to set this to "unset" or something. 5 seems to do something different, as it sets it to a concrete value. Knowing nothing about this, is there maybe a closer equivalent value like 0? Or is it really best to set this to a fixed value?

@MartinWeindel
Contributor Author

The value of 5 seconds is the Mesos default, which is used if the parameter is not set or an invalid value is given. So, at least with current versions of Mesos, the behavior does not change.

The parameter refuse_seconds configures how long Mesos waits before it offers resources again after the framework (here, the Spark scheduler backend) has refused them. If you set it to 0, Mesos will offer these resources again immediately with the next allocation (by default after 1 second). This will cause slightly higher traffic between the scheduler backend and the Mesos master.

Alternatively, this parameter could be made configurable by Spark, but I am not sure it is really worth the effort. In coarse-grained mode, resources are allocated at the start. Are there any circumstances, other than a lost executor, in which refused resources will be used?
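
To make the traffic difference between 0 and 5 seconds concrete, here is a small Python simulation. It is a simplified model under stated assumptions, not Mesos code: the allocator runs once per `allocation_interval` (1 second, the default mentioned above), the framework declines every offer, and a declined resource is held back until `refuse_seconds` have passed. The function name `count_offers` is made up for this sketch.

```python
def count_offers(window_seconds, refuse_seconds, allocation_interval=1.0):
    """Count how often one declined resource is offered within a window.

    Assumed model: one allocation cycle every `allocation_interval`
    seconds; every offer is declined with the given `refuse_seconds`.
    """
    offers = 0
    eligible_at = 0.0  # earliest time the resource may be offered again
    cycles = int(window_seconds / allocation_interval)
    for i in range(cycles):
        now = i * allocation_interval
        if now >= eligible_at:
            offers += 1
            # The framework declines; the filter blocks re-offers for a while.
            eligible_at = now + refuse_seconds
    return offers


print(count_offers(60, refuse_seconds=0))  # 60 offers in one minute
print(count_offers(60, refuse_seconds=5))  # 12 offers in one minute
```

In this model a value of 0 produces an offer on every allocation cycle, while 5 seconds cuts the number of offers (and the matching decline calls) to a fifth.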

@srowen
Member

srowen commented Apr 20, 2015

I suspect this is OK, since this was set up eons ago by @mateiz and so the right thing to do has probably changed. Your explanation makes sense. CC @jongyoul just in case.

@ash211
Contributor

ash211 commented Apr 21, 2015

Should a bug be filed with Mesos to log the error once, rather than repeatedly and filling up the disk?
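
One possible shape for such a "log once" behavior, sketched in Python purely for illustration. Mesos itself is not written in Python, and `WarnOnce` is a hypothetical helper, not an existing Mesos or Spark API; the sketch only shows the idea of emitting a given warning the first time and counting later repeats instead of writing them.

```python
import logging


class WarnOnce:
    """Emit each distinct warning once and count the suppressed repeats."""

    def __init__(self, logger):
        self._logger = logger
        self._seen = set()
        self.suppressed = 0

    def warning(self, key, message):
        """Log `message` the first time `key` is seen; return True if logged."""
        if key in self._seen:
            self.suppressed += 1
            return False
        self._seen.add(key)
        self._logger.warning(message)
        return True


logging.basicConfig(level=logging.WARNING)
once = WarnOnce(logging.getLogger("allocator_sketch"))
results = [
    once.warning("negative_refuse_seconds",
                 "Using the default value of 'refuse_seconds' "
                 "because the input value is negative")
    for _ in range(3)
]
print(results, once.suppressed)
```

With this approach the first occurrence still reaches the log, so the invalid value stays visible without filling the disk.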

@srowen
Member

srowen commented Apr 21, 2015

@ash211 That would be nice too. It seems like the -1 Spark sends is considered an invalid value, so it sounds like that much should still change.

@tnachen
Contributor

tnachen commented Apr 21, 2015

It's logged each time we recover resources in Mesos, since that's when we evaluate whether the filter should be applied. And yes, Mesos needs a positive value for refuse seconds on the filter.

I think refused resources could be used later, when more resources become available on the same slave because someone else's task is gone. Also, we're adding the capability for the coarse-grained scheduler to launch multiple executors, and in addition we're putting dynamic allocation into the coarse-grained scheduler too.

So we will be using refused resources more often, but 5 as the default sounds reasonable to me.

@asfgit asfgit closed this in b063a61 Apr 22, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
….21...

Starting with version 0.21.0, Apache Mesos is very noisy if the filter parameter refuse_seconds is set to an invalid value like `-1`.
I have seen systems with millions of log lines like
```
W0420 18:00:48.773059 32352 hierarchical_allocator_process.hpp:589] Using the default value of 'refuse_seconds' to create the refused resources filter because the input value is negative
```
in the Mesos master INFO and WARNING log files.
Therefore the CoarseMesosSchedulerBackend should set the default value for refuse seconds (i.e. 5 seconds) directly.
This is no problem for the fine-grained MesosSchedulerBackend, as it uses the value 1 second for this parameter.

Author: mweindel <m.weindel@usu-software.de>

Closes apache#5597 from MartinWeindel/master and squashes the following commits:

2f99ffd [mweindel] Avoid warning message about invalid refuse_seconds value in Mesos >=0.21.