This repository has been archived by the owner on Feb 1, 2021. It is now read-only.

mesos: no resources available to schedule container #1183

Closed
ahmetb opened this issue Aug 31, 2015 · 9 comments · Fixed by #1330

Comments

@ahmetb
Contributor

ahmetb commented Aug 31, 2015

Hi, I have a Mesos cluster consisting of 2 agent nodes:

[images omitted]

I started 2 containers yesterday like the following:

docker run -d -c 1 -m 100 busybox sleep 999
docker run -d -c 1 -m 100 busybox sleep 999

and I can see they are completed now:

[image omitted]

However when I try to schedule another container I'm getting this “no resources available” error:

➜  ~  docker run -d -c 1 -m 100 busybox sleep 999999
FATA[0005] Error response from daemon: no resources available to schedule container

I tried cleaning up some stopped containers but

docker rm -f `docker ps -aq`

is taking forever (it has been a few minutes and docker rm still has not returned). I'm setting -c 1 -m 100, which seemed fine, and the cluster has plenty of resources on offer. Any ideas what's going wrong here?

@tnachen
Contributor

tnachen commented Sep 1, 2015

Hi there, I think there might be some problems around resource accounting in the swarm mesos scheduler that we need to figure out. Another thing is that we're refactoring the scheduler so that we don't use a full offer per task, but try to run multiple tasks on an offer when we can. So the short answer is that once the refactoring is merged, hopefully this will be fixed too.
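To illustrate the idea of running several tasks on one offer, here is a minimal sketch. It is not the Swarm scheduler code: the `Resources` type, `packOffer`, and the greedy first-fit strategy are all assumptions made for the example.

```go
package main

import "fmt"

// Resources is a hypothetical, simplified view of what an offer carries
// or a task needs; real Mesos offers have more dimensions.
type Resources struct {
	CPUs  float64
	MemMB float64
}

// fits reports whether need can be carved out of avail.
func (avail Resources) fits(need Resources) bool {
	return need.CPUs <= avail.CPUs && need.MemMB <= avail.MemMB
}

// packOffer greedily places pending tasks on a single offer. It returns
// the indices of the tasks that fit and whatever is left of the offer.
func packOffer(offer Resources, pending []Resources) (placed []int, remaining Resources) {
	remaining = offer
	for i, task := range pending {
		if remaining.fits(task) {
			remaining.CPUs -= task.CPUs
			remaining.MemMB -= task.MemMB
			placed = append(placed, i)
		}
	}
	return placed, remaining
}

func main() {
	offer := Resources{CPUs: 4, MemMB: 1024}
	pending := []Resources{{1, 100}, {1, 100}, {8, 100}, {1, 100}}
	placed, remaining := packOffer(offer, pending)
	fmt.Println(placed, remaining)
}
```

With one offer per task, the leftover of each offer would stay tied up until the task ends; returning `remaining` makes the unused part explicit so it can be reused or declined.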

@h0tbird

h0tbird commented Sep 10, 2015

+1

@SorraTheOrc

This makes Swarm+Mesos completely unusable. What is even worse is that Swarm is sucking up all the resources from other frameworks too. @tnachen can you point me at the refactoring work you mention? This is a blocker for me right now and I'd like to follow along and at least provide some testing, if not patches.

@vieux
Contributor

vieux commented Sep 23, 2015

@ahmetalpbalkan hey, sorry for the delay, we are aware of the issue. Can you try with #1212 to see if it helps? Thanks.

@abronan
Contributor

abronan commented Sep 25, 2015

This will not work with the current PR for now. I will update here as soon as the PR has a reasonable solution; it's trickier than expected because:

  • One might remove a container without going through swarm or the Mesos cluster driver, so the resource accounting must listen to events cluster-wide and handle adding/removing available resource chunks
  • The resource accounting and refactoring on the swarm side might alter the existing scheduler for Mesos, which will require some changes here as well
  • I want to make sure that this will work for additional cluster drivers
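
The first point above (accounting that follows cluster-wide events rather than only Swarm's own calls) could look roughly like the sketch below. The `event` and `ledger` types are hypothetical, and the "create"/"destroy" statuses are only assumed to be the events such a listener would watch.

```go
package main

import "fmt"

// event is a hypothetical, trimmed-down container event.
type event struct {
	Status      string // "create" or "destroy" in this sketch
	ContainerID string
	CPUs        float64
	MemMB       float64
}

// ledger tracks free resources and what each known container holds.
type ledger struct {
	freeCPUs  float64
	freeMemMB float64
	held      map[string]event
}

// apply updates the ledger from an event, no matter who triggered it.
// Duplicate creates and destroys of unknown containers are ignored so
// that a chunk is never reserved or released twice.
func (l *ledger) apply(ev event) {
	switch ev.Status {
	case "create":
		if _, seen := l.held[ev.ContainerID]; seen {
			return
		}
		l.held[ev.ContainerID] = ev
		l.freeCPUs -= ev.CPUs
		l.freeMemMB -= ev.MemMB
	case "destroy":
		prev, ok := l.held[ev.ContainerID]
		if !ok {
			return
		}
		delete(l.held, ev.ContainerID)
		l.freeCPUs += prev.CPUs
		l.freeMemMB += prev.MemMB
	}
}

func main() {
	l := &ledger{freeCPUs: 2, freeMemMB: 200, held: map[string]event{}}
	l.apply(event{Status: "create", ContainerID: "c1", CPUs: 1, MemMB: 100})
	// The container is removed directly on the node, bypassing the scheduler.
	l.apply(event{Status: "destroy", ContainerID: "c1"})
	fmt.Println(l.freeCPUs, l.freeMemMB)
}
```

Releasing the amount recorded at creation time, rather than trusting the destroy event, keeps the ledger consistent even when the removal event carries no resource information.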

TL;DR swarm only
Also the resource accounting happens in Swarm now, but we really should rely on docker/runC to do the resource reservation in an optimistic and atomic way instead of relying on our weird mapping of vcpu -> CpuShares. This works fine for Swarm at a higher level but we should have a more robust resource reservation mechanism at a lower level.

Also it feels wrong to lock a Node (or any) object because two requests are trying to reserve resources at the same time. docker should return an error if one request cannot reserve a given amount of resource, and swarm should handle a retry mechanism at a higher level.
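
A minimal sketch of that "fail fast, retry higher up" shape, assuming a hypothetical `reserveFunc` hook and `errInsufficient` error; neither exists in docker or swarm, they only stand in for an engine call that rejects a reservation it cannot satisfy.

```go
package main

import (
	"errors"
	"fmt"
)

// errInsufficient stands in for the error an engine would return when it
// cannot reserve the requested amount (hypothetical, not a real API).
var errInsufficient = errors.New("insufficient resources")

// reserveFunc is a hypothetical hook that tries to reserve resources on one node.
type reserveFunc func(node string) error

// scheduleWithRetry tries each candidate node in order and moves on to the
// next one whenever a node refuses the reservation. No node is locked here;
// a concurrent request that loses the race simply gets an error and retries.
func scheduleWithRetry(nodes []string, reserve reserveFunc) (string, error) {
	for _, node := range nodes {
		err := reserve(node)
		if err == nil {
			return node, nil
		}
		if !errors.Is(err, errInsufficient) {
			return "", err // unrelated failure: do not retry blindly
		}
	}
	return "", errors.New("no resources available to schedule container")
}

func main() {
	// Fake reservation hook: only node-2 has room.
	reserve := func(node string) error {
		if node == "node-2" {
			return nil
		}
		return errInsufficient
	}
	node, err := scheduleWithRetry([]string{"node-1", "node-2"}, reserve)
	fmt.Println(node, err)
}
```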

@tnachen
Contributor

tnachen commented Sep 26, 2015

Not sure what you mean by relying on docker/runc for the resource reservation, but I think there is a bit of an impedance mismatch here: if you rely only on docker for resource accounting, your available resources are still given by Mesos, not by docker, so the two could get out of sync.

And about locking: if you don't lock, then effectively you use compareAndInc/Dec for accurate accounting, but since you have multiple resources I'm not really sure this makes your logic any clearer.
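
To make the multiple-resources concern concrete, here is a minimal lock-free sketch using Go's `sync/atomic`. The `pool` type, its units, and the `tryDec` helper are assumptions made for the example, not code from swarm.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// pool tracks free capacity with two independent atomic counters
// (hypothetical units: milli-CPUs and MiB).
type pool struct {
	cpuMilli int64
	memMiB   int64
}

// tryDec atomically subtracts n from *v unless that would go negative.
func tryDec(v *int64, n int64) bool {
	for {
		cur := atomic.LoadInt64(v)
		if cur < n {
			return false
		}
		if atomic.CompareAndSwapInt64(v, cur, cur-n) {
			return true
		}
	}
}

// reserve takes CPU first, then memory, and must undo the CPU step if the
// memory step fails. That rollback is the extra logic a single lock avoids.
func (p *pool) reserve(cpuMilli, memMiB int64) bool {
	if !tryDec(&p.cpuMilli, cpuMilli) {
		return false
	}
	if !tryDec(&p.memMiB, memMiB) {
		atomic.AddInt64(&p.cpuMilli, cpuMilli) // roll back the CPU reservation
		return false
	}
	return true
}

func main() {
	p := &pool{cpuMilli: 2000, memMiB: 150}
	fmt.Println(p.reserve(1000, 100)) // enough CPU and memory
	fmt.Println(p.reserve(1000, 100)) // CPU is available, memory is not
}
```

Each counter is accurate on its own, but between the CPU decrement and the rollback other requests can briefly see less CPU than is really free, which is one reason the lock-free version is not obviously simpler.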

@abronan
Contributor

abronan commented Sep 26, 2015

Sorry for the confusion, Mesos was a bad example for the whole second part, because we rely on Mesos for resource offers and to inform us of available resources. But as the refactoring (in #1212) does not concern only Mesos, I got lost in my thoughts...

The second part mostly concerns swarm and hypothetical cluster drivers relying on docker to do the resource accounting, so it was off topic for this issue.

@ghost

ghost commented Sep 28, 2015

I hit a similar issue in my cluster (Swarm 0.5.0-dev + Mesos 0.25.0): when I run sleep 10000000, the docker CLI returns "no resources available", but the container is running on the slave host and shows up in the Mesos GUI.

@SorraTheOrc

I confirm that this appears to work. At least on the tests I've done so far I'm not seeing the issue. Thank you @vieux

6 participants