This repository has been archived by the owner on Feb 1, 2021. It is now read-only.

Proposal: Scheduler/Cluster Driver #393

Merged
merged 12 commits into from
Feb 27, 2015

Conversation

vieux
Contributor

@vieux vieux commented Feb 12, 2015

Hi all,

This is a very simple version of what a Scheduler/Cluster API driver might look like. It's a PoC and comments are very welcome.

Here are the key changes in this PR:

  • Cluster is now an interface.
  • Depending on the Cluster, it might embed a Scheduler or not. Right now, SwarmCluster does and MesosCluster doesn't.
  • Added a simple mesos.go file with a lot of TODOs.

Of course I'd love to split this into multiple PRs but here you can see the full picture.


Here is the simple Interface:

type Cluster interface {
        CreateContainer(config *dockerclient.ContainerConfig, name string) (*cluster.Container, error)
        RemoveContainer(container *cluster.Container, force bool) error

        Images() []*cluster.Image
        Image(IdOrName string) *cluster.Image
        Containers() []*cluster.Container
        Container(IdOrName string) *cluster.Container

        Info() [][2]string
}

The Scheduler is nowhere to be found here, because some Clusters might have one and others might not.


Regarding Mesos, here is how I see things:

It will be a "hybrid" framework: we will use Mesos for creating and removing containers, and direct access to the nodes for everything else.

Swarm + Mesos

Start with swarm manage --cluster mesos zk://<ip:port1>,<ip:port2>,<ip:port3>/mesos
The MesosCluster will receive Mesos masters from zookeeper in NewEntries, connect to them to get the list of actual docker nodes, and fill its internal cluster.
In CreateContainer and RemoveContainer it'll use the Mesos Master to interact with Mesos.
In Nodes, Containers, Container and Events it will simply use the cluster of docker nodes.


@aluzzardi: WDYT?
@tnachen & @benh: can you take a look at the comments in mesos.go and tell me if I'm missing something?
@francisbouvier: thanks for your initial work in #178
@chanwit: we will keep it compiled in for now, but this API shouldn't be an issue with your PR.

Related to #213 and #214

@aluzzardi
Contributor

Early review comment:

I wouldn't touch the Scheduler but instead make Cluster an interface. SwarmCluster would use the Scheduler whereas the MesosCluster would rely on Mesos.

The reasoning is that, looking at the scheduling interface, it goes far beyond what a scheduler is supposed to do (for instance, Events() or Container()). If we look at the implementation of those functions, they basically relay those calls to the Cluster, since that's where it makes more sense to hold the code - otherwise, Scheduler would end up being a mix of Scheduler and Cluster.

Rough idea:

  • Cluster has an API to interact with the cluster (find a container, get a list of all containers, etc) as we have today, plus the ability to deploy/destroy containers.
  • Scheduler is nowhere to be seen in the API code etc - every interaction goes through the Cluster somehow.
  • Swarm ends up being a manager of Clusters - one being the SwarmCluster, the other one the MesosCluster (and later FooCluster).

I believe it resonates better with what those types are doing: MesosScheduler, for instance, is not a scheduler as it doesn't actually schedule containers; it simply relays calls to Mesos. On the other hand, MesosCluster is what it's supposed to be: a representation of a Mesos cluster where you can schedule containers and look them up.

Thoughts?

@vieux
Contributor Author

vieux commented Feb 12, 2015

I agree with the naming (to me, most of the comment is about renaming Scheduler to Cluster, and it makes sense); the API presented in the body of the PR will likely stay the same.

Although, I'm in favor of keeping the cluster as it is today: the Cluster we have currently in the code is a DockerCluster, not a SwarmCluster, and both SwarmCluster and MesosCluster will use a DockerCluster (let's name it DockerNodes; it's a bad name but it'll make the rest of the comment simpler).

So I would agree on something like:

Cluster is an interface.

SwarmCluster is an implementation of Cluster
SwarmCluster has an internal Scheduler (not an interface) and a DockerNodes (not an interface) with the list of actual nodes.
(here, would you move the Scheduler code into cluster/swarm/scheduler? as it is only for the SwarmCluster)

MesosCluster is an implementation of Cluster
MesosCluster has only a DockerNodes, no Scheduler as the Scheduler is mesos itself in this case.

FooCluster is an implementation of Cluster
FooCluster might or might not have a DockerNodes and its own scheduler.

What do you think of this?

One question though: where would you see the filters and strategies? Only in the Scheduler inside SwarmCluster? Because I believe MesosCluster should somehow have access to those.

@aluzzardi
Contributor

Although, I'm in favor of keeping the cluster as it is today: the Cluster we have currently in the code is a DockerCluster, not a SwarmCluster, and both SwarmCluster and MesosCluster will use a DockerCluster (let's name it DockerNodes; it's a bad name but it'll make the rest of the comment simpler).

I don't think MesosCluster will use DockerCluster as it doesn't have any useful information for it (I guess that if swarm is set up to only use MesosCluster, there won't be any nodes (as in, docker engines), therefore DockerCluster would be empty).

Since all SwarmCluster will ever do is relay calls to DockerCluster (and the scheduler for things such as DeployContainer), what would be the point of having a separate DockerCluster and SwarmCluster?

(here, would you move the Scheduler code into cluster/swarm/scheduler? as it is only for the SwarmCluster)

I'm not sure, I initially would say I don't think so. The goal of the project remains (by default) to be a simple container scheduler, so it kinda makes sense to keep it as a high level item.

Also, one point I didn't raise earlier: Scheduler eventually might be an interface itself, but in the pure sense of scheduling. One could use the SwarmCluster with a custom "native" Scheduler (for instance, one that does rebalancing within a SwarmCluster in a different way, based on actual resource usage rather than requested, as an example).

One question though: where would you see the filters and strategies? Only in the Scheduler inside SwarmCluster? Because I believe MesosCluster should somehow have access to those.

Good point. I don't know if filters and strategies will be re-used by the MesosCluster. If we keep scheduler/ as a top level item, we could leave it there. Basically Scheduler is a simple wrapper doing the glue between filters and strategies, so if MesosCluster needs those, it could as well just import the entire scheduler.

@chanwit
Contributor

chanwit commented Feb 12, 2015

@vieux Thanks for the ping. This PR looks great and I'll be trying to catch up with these changes!


// Entries are Mesos masters
func (s *MesosScheduler) NewEntries(entries []*discovery.Entry) {
Contributor

I'm not sure I understand what this function is for.
The comment above says entries are Mesos masters, but in the code it's adding to the cluster, which I assume is slave nodes with swarm launched.
For Mesos there isn't any need to look for new nodes: when new nodes are added to the data center, the framework will get offers from them via the resourceOffer callback, and since you launched the task yourself you also know which slave that is; on the framework side you just need to listen for TASK_RUNNING in statusUpdate to get the information.

Contributor Author

This will be the generic way to receive Mesos masters.

I do think swarm will need access to the list of nodes, to be able to do the logs, attach, pause, unpause, ps etc.

Contributor

Unless we have a way to look up containers / list containers and get the address of the node they are running on.

In that case, we can build cluster.Containers on the fly when calling Containers() or Container().

Contributor

When you say receive Mesos masters, are you referring to when the master changes?
Swarm definitely needs the list of nodes, and you can query the master for all the slaves. It looks like this is being called back from the discovery service. One concern I have is that since this is a separate discovery process that is proxying information, if for some reason the mesos slave is down it will still return information as long as the node is still alive.
If all the docker commands it needs to proxy are about a running container, it is much better to get the information from Mesos, as that's more definitive.

Contributor Author

Indeed, but we still have 2 issues with this:

  • events: we need direct access to the nodes to get the events
  • images: we do not want, each time we do a docker images, to query mesos for the list of all the nodes and then query all the nodes to get their images; it would take way too long.

@vieux vieux changed the title Proposal: Scheduler API Proposal: ~~Scheduler~~ Cluster API Feb 13, 2015
@vieux vieux changed the title Proposal: ~~Scheduler~~ Cluster API Proposal: Scheduler/Cluster API Feb 13, 2015
@vieux
Contributor Author

vieux commented Feb 13, 2015

I updated the PR with @aluzzardi and @tnachen suggestions.

Still lots of blurry things, but let's do baby steps and iterate :)

  • Cluster is now an interface, and it is the implementation's choice to use the Scheduler or not.
    • Way fewer files changed: good news, and proof it's probably cleaner this way.
    • SwarmCluster uses the scheduler to apply filters and strategies on its docker nodes.
    • I'm not sure how it will work for MesosCluster, but we want MesosCluster to handle filters and strategies as well.
  • As of right now, both SwarmCluster and MesosCluster have a list of docker nodes internally; this might change soon as Mesos could manage only offers.
    • Maybe we should rename Node and make it an interface?
    • It could be a Docker Node for SwarmCluster and an Offer for MesosCluster.

@bfirsh
Contributor

bfirsh commented Feb 13, 2015

I wonder if these should be called "drivers" rather than "APIs" to avoid confusion with the Remote API?

@vieux vieux changed the title Proposal: Scheduler/Cluster API Proposal: Scheduler/Cluster Driver Feb 13, 2015
@vieux
Contributor Author

vieux commented Feb 13, 2015

@bfirsh good idea, I updated the title.

@abronan
Contributor

abronan commented Feb 14, 2015

Super excited by this PR!

If I understand this well:

  • Mesos Slaves are running on Nodes next to a docker daemon as well as the Swarm agent for the discovery service.
  • Swarm Manager will register a custom Mesos framework to communicate with the Mesos-Master and request as well as accept offers.
  • Swarm Manager will register the framework by also giving a custom Executor binary, whose task will be to communicate with the Mesos-Slave and:
    • Run tasks through the docker daemon
    • Control the overall resource usage on the Node(s).

A few questions though:

  • I guess that we assume that the docker daemon as well as the Swarm agent are running on a Node before registering the framework?
  • Do we allow a secure Swarm setup to communicate with an insecure Mesos setup and conversely?

Also thanks @vieux for taking those suggestions into account! I'm strongly leaning toward @aluzzardi's comments on the use of a Cluster interface. To give a concrete example, I'm designing a Slot scheduler that could be pluggable to swarm and it had no use of most of the functions declared in the Scheduler interface from the initial PR. But now it seems cleaner and removes the blurry boundaries between Cluster and Scheduler. All good!

@vieux
Contributor Author

vieux commented Feb 19, 2015

In addition to the Cluster interface, there is now a Node interface to be implemented by each driver.

type Node interface {
    ID() string
    Name() string

    IP() string   //to inject the actual IP of the machine in docker ps (hostname:port or ip:port)
    Addr() string //to know where to connect with the proxy

    Images() []*Image                     //used by the API
    Image(IdOrName string) *Image         //used by the filters
    Containers() []*Container             //used by the filters
    Container(IdOrName string) *Container //used by the filters

    TotalCpus() int64   //used by the strategy
    UsedCpus() int64    //used by the strategy
    TotalMemory() int64 //used by the strategy
    UsedMemory() int64  //used by the strategy

    Labels() map[string]string //used by the filters

    IsHealthy() bool
}

@dpetersen

Last week @alexwelch and I wrote a prototype Kubernetes cluster based on this PR. We got docker run working, including port mapping between containers. A few things we ran into:

  • Compatible cluster setup was a hassle, due to the daemon on the nodes needing to be accessible by the manager. I understand why that is, but it means an easy prebuilt cluster like Google Container Engine won't work without tweaks.
  • The docker CLI experience is quirky. docker run results in two containers, the one you wanted and the k8s Pod manager. Running docker rm on one of them would result in both needing to be removed. This is usually hidden because k8s users aren't interacting directly with Docker. I could maybe omit those "management" containers from your interactions (ps, events, etc.), but that feels weird.
  • We ignored Swarm's scheduling and filtering. We started early last week, before your change where the Cluster asks the Scheduler to select a node for new containers. In k8s there is control over where the container goes, but AFAIK you can't say "put the container on this node". You can say "put it on a node with container X" and things like that, but it's not node-centric. This is the part of k8s I'm least familiar with, so there might be a solution to that problem.

By the end of the week, I felt that a k8s driver is technically possible but wouldn't be compelling to use. Maybe more work (or more k8s experience than I have!) is required, but Swarm wouldn't take advantage of many features of k8s, and wouldn't behave quite like docker, either. This might change in the future if there is a docker run that operates at an application level (like running Compose's YAML file). Working at a level above individual containers would let you employ more of the features of k8s.

I'm very curious to see a real implementation of the Mesos driver. Maybe that work will give us some ideas for the problems we ran into. Sorry for the huge wall of text!

… a simple scheduler interface.

Signed-off-by: Victor Vieux <vieux@docker.com>

Usable -> Total & Reserved -> Used
@vieux
Contributor Author

vieux commented Feb 27, 2015

Thanks @dpetersen for your valuable feedback. This PR is becoming very big, so we might merge it soon and refine the API in another PR; I'll definitely ping you on it.

@aluzzardi
Contributor

LGTM

aluzzardi added a commit that referenced this pull request Feb 27, 2015
Proposal: Scheduler/Cluster Driver
@aluzzardi aluzzardi merged commit db97473 into docker-archive:master Feb 27, 2015
@aluzzardi aluzzardi deleted the mesos_poc branch February 27, 2015 23:17
@aluzzardi
Contributor

I just noticed that the Node implementation is named swarm.Node and lives in node.go while the Cluster implementation is named swarm.SwarmCluster and lives in swarm.go.

Can you make the two consistent, e.g., rename swarm.SwarmCluster to swarm.Cluster and name the file cluster.go?

@vieux
Contributor Author

vieux commented Feb 28, 2015

Sure, will do.

@duncanjw

@aluzzardi Hi. I pitched up here having followed the PR link from the Scheduler section of your Swarm post https://blog.docker.com/2015/02/scaling-docker-with-swarm/ - if this is now the Cluster Driver, it is probably worth updating the post?
