mesos: no resources available to schedule container #1183

ahmetb · 2015-08-31T19:59:32Z

Hi, I have a Mesos cluster consisting of 2 agent nodes:

I started 2 containers yesterday like the following:

docker run -d -c 1 -m 100 busybox sleep 999
docker run -d -c 1 -m 100 busybox sleep 999

and I can see they are completed now:

However, when I try to schedule another container, I get this “no resources available” error:

➜  ~  docker run -d -c 1 -m 100 busybox sleep 999999
FATA[0005] Error response from daemon: no resources available to schedule container

I tried cleaning up some stopped containers but

docker rm -f `docker ps -aq`

is taking forever (it has been a few minutes and it has not returned yet; docker rm is getting stuck). I'm setting -c 1 -m 100, which seemed okay, and the cluster has plenty of resources offered. Any ideas what's going wrong here?

tnachen · 2015-09-01T17:50:45Z

Hi there, I think there might be some problems around resource accounting in the swarm mesos scheduler that we need to figure out. Another thing is that we're refactoring the scheduler so we don't use a full offer per task, but try to run multiple tasks on an offer when we can. So the short answer is that once we have the refactoring merged, hopefully this will be fixed too.
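To illustrate the difference between one offer per task and several tasks per offer, here is a minimal sketch. It is not the Swarm scheduler code: the `Offer` class, the `schedule` function, the first-fit policy, and the offer sizes are all assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Offer:
    """Hypothetical stand-in for a Mesos resource offer from one agent."""
    agent: str
    cpus: float
    mem_mb: float
    tasks: list = field(default_factory=list)

    def try_place(self, name, cpus, mem_mb):
        """Carve the task out of what is left of this offer, if it fits."""
        if cpus > self.cpus or mem_mb > self.mem_mb:
            return False
        self.cpus -= cpus
        self.mem_mb -= mem_mb
        self.tasks.append(name)
        return True


def schedule(offers, tasks):
    """First-fit: put each (name, cpus, mem_mb) task on the first offer with room."""
    unplaced = []
    for name, cpus, mem_mb in tasks:
        # any() stops at the first offer that accepts the task.
        if not any(offer.try_place(name, cpus, mem_mb) for offer in offers):
            unplaced.append(name)
    return unplaced


# Two offers (sizes are invented) and three small tasks.
offers = [Offer("agent-1", cpus=2.0, mem_mb=1024),
          Offer("agent-2", cpus=2.0, mem_mb=1024)]
tasks = [("sleep-1", 1, 100), ("sleep-2", 1, 100), ("sleep-3", 1, 100)]
unplaced = schedule(offers, tasks)
```

If each task consumed a whole offer, the third task here would have nowhere to go with only two offers; sharing an offer lets all three be placed.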

h0tbird · 2015-09-10T09:55:39Z

+1

SorraTheOrc · 2015-09-12T23:05:22Z

This makes Swarm+Mesos completely unusable. What is even worse is that Swarm is sucking up all the resources from other frameworks too. @tnachen can you point me at the refactoring work you mention? This is a blocker for me right now and I'd like to follow along and at least provide some testing, if not patches.

vieux · 2015-09-23T23:28:16Z

@ahmetalpbalkan hey, sorry for the delay, we are aware of the issue. Can you try with #1212 to see if it helps? Thanks.

abronan · 2015-09-25T23:49:06Z

This will not work with the current PR for now. I will update here as soon as the PR is updated with a reasonable solution. It's trickier than expected because:

- One might remove a container not going through swarm or the Mesos cluster driver, so the resource accounting must listen to the events cluster-wide and deal with adding/removing available resource chunks.
- The resource accounting and refactoring on the swarm side might alter the existing scheduler for Mesos; this will require some changes here as well.
- I want to make sure that this will work for additional cluster drivers.
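The first point above, accounting that follows cluster-wide events instead of only the requests that pass through swarm, could be sketched roughly as follows. The `Accounting` class and the event dictionaries are hypothetical and do not match the Docker events API or the code in #1212.

```python
class Accounting:
    """Hypothetical tracker of available resources on one node, fed by events."""

    def __init__(self, cpus, mem_mb):
        self.cpus = cpus
        self.mem_mb = mem_mb
        self.containers = {}  # container id -> (cpus, mem_mb) it reserved

    def handle(self, event):
        """Apply one event; the event shape here is an assumption."""
        action, cid = event["action"], event["id"]
        if action == "create":
            usage = (event["cpus"], event["mem_mb"])
            self.containers[cid] = usage
            self.cpus -= usage[0]
            self.mem_mb -= usage[1]
        elif action == "destroy" and cid in self.containers:
            # Give the chunk back, whoever removed the container.
            cpus, mem_mb = self.containers.pop(cid)
            self.cpus += cpus
            self.mem_mb += mem_mb


acct = Accounting(cpus=2, mem_mb=200)
acct.handle({"action": "create", "id": "c1", "cpus": 1, "mem_mb": 100})
# Removal done directly against the engine still reaches the accounting as an event.
acct.handle({"action": "destroy", "id": "c1"})
```

Because the release is driven by the destroy event rather than by the code path that issued the removal, a container removed outside swarm still returns its resources.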

TL;DR swarm only
Also the resource accounting happens in Swarm now, but we really should rely on docker/runC to do the resource reservation in an optimistic and atomic way instead of relying on our weird mapping of vcpu -> CpuShares. This works fine for Swarm at a higher level but we should have a more robust resource reservation mechanism at a lower level.

Also it feels wrong to lock a Node (or any) object because two requests are trying to reserve resources at the same time. docker should return an error if one request cannot reserve a given amount of resource, and swarm should handle a retry mechanism at a higher level.
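A rough sketch of that idea, with the engine refusing a reservation it cannot satisfy and the caller retrying at a higher level. Everything here is hypothetical: `Engine`, `InsufficientResources`, and `run_with_retry` are illustrative names, and docker does not expose such a reservation call.

```python
import time


class InsufficientResources(Exception):
    """Hypothetical error an engine would return when it cannot reserve resources."""


class Engine:
    """Hypothetical stand-in for an engine that does its own resource reservation."""

    def __init__(self, name, cpus, mem_mb):
        self.name, self.cpus, self.mem_mb = name, cpus, mem_mb

    def reserve(self, cpus, mem_mb):
        # A real engine would have to make this check-and-subtract atomic.
        if cpus > self.cpus or mem_mb > self.mem_mb:
            raise InsufficientResources(self.name)
        self.cpus -= cpus
        self.mem_mb -= mem_mb


def run_with_retry(engines, cpus, mem_mb, attempts=3, backoff_s=0.0):
    """Try each engine in turn; repeat the whole pass a few times before giving up."""
    for attempt in range(attempts):
        for engine in engines:
            try:
                engine.reserve(cpus, mem_mb)
                return engine.name
            except InsufficientResources:
                continue  # optimistic: just move on to the next engine
        time.sleep(backoff_s * (attempt + 1))
    raise InsufficientResources("no resources available to schedule container")
```

The caller holds no lock on a node; a losing request simply sees an error and tries elsewhere or later.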

tnachen · 2015-09-26T06:33:26Z

Not sure what you mean by relying on docker/runc for the resource reservation, but I think there is a bit of an impedance mismatch here: if you rely only on docker for resources, your available resources are still given by Mesos rather than by docker, so the two could get out of sync.

And about locking: if you don't lock, then you effectively use compareAndInc/Dec for accurate accounting, but since you have multiple resources, I'm not really sure this makes your logic any clearer.
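To make the multiple-resources point concrete, here is a minimal sketch of the lock-based variant, where one lock covers the check and the update of both CPU and memory. `NodeResources` and `reserve_concurrently` are hypothetical names, not Swarm types. With a separate atomic counter per resource, a request that gets CPU but then fails on memory would have to roll the CPU back, which is the extra logic the lock avoids.

```python
import threading


class NodeResources:
    """Hypothetical per-node accounting guarded by a single lock."""

    def __init__(self, cpus, mem_mb):
        self._lock = threading.Lock()
        self.cpus = cpus
        self.mem_mb = mem_mb

    def reserve(self, cpus, mem_mb):
        # Check and subtract both resources as one step: all or nothing.
        with self._lock:
            if cpus > self.cpus or mem_mb > self.mem_mb:
                return False
            self.cpus -= cpus
            self.mem_mb -= mem_mb
            return True


def reserve_concurrently(node, requests):
    """Issue every (cpus, mem_mb) request from its own thread; collect the results."""
    results = []

    def worker(cpus, mem_mb):
        results.append(node.reserve(cpus, mem_mb))

    threads = [threading.Thread(target=worker, args=r) for r in requests]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Five concurrent requests for 1 CPU and 100 MB against a node with 2 CPUs and 200 MB would see exactly two succeed, whatever order the threads run in.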

abronan · 2015-09-26T08:09:17Z

Sorry for the confusion, Mesos was a bad example for the whole second part, because we rely on Mesos for resource offers and to inform us of available resources. But as the refactoring (in #1212) does not concern only Mesos, I got lost in my thoughts...

The second part mostly concerns swarm and hypothetical cluster drivers relying on docker to do the resource accounting, so it was off topic for this issue.

ghost · 2015-09-28T01:32:07Z

I hit a similar issue in my cluster (Swarm 0.5.0-dev + Mesos 0.25.0): when I run sleep 10000000, the docker CLI returns "no resources available", but the container is running on the slave host and showing in the Mesos GUI.

SorraTheOrc · 2015-10-28T07:36:04Z

I confirm that this appears to work. At least in the tests I've done so far, I'm not seeing the issue. Thank you @vieux

vieux mentioned this issue Oct 23, 2015

fix issue with timeouts in mesos #1330

Merged

abronan closed this as completed in #1330 Oct 23, 2015
