This repository has been archived by the owner on Feb 1, 2021. It is now read-only.

Proposal: Scheduler/Cluster Driver #393

Merged
merged 12 commits into from
Feb 27, 2015

Conversation

vieux
Contributor

@vieux vieux commented Feb 12, 2015

Hi all,

This is a very simple version of what a Scheduler/Cluster API driver might look like. It's a PoC and comments are very welcome.

Here are the key changes in this PR:

  • Cluster is now an interface.
  • Depending on the Cluster, it might embed a Scheduler or not. Right now, SwarmCluster does and MesosCluster doesn't.
  • Added a simple mesos.go file with a lot of TODOs.

Of course I'd love to split this into multiple PRs but here you can see the full picture.


Here is the simple Interface:

type Cluster interface {
        CreateContainer(config *dockerclient.ContainerConfig, name string) (*cluster.Container, error)
        RemoveContainer(container *cluster.Container, force bool) error

        Images() []*cluster.Image
        Image(IdOrName string) *cluster.Image
        Containers() []*cluster.Container
        Container(IdOrName string) *cluster.Container

        Info() [][2]string
}

The Scheduler is nowhere to be found here, because some Clusters might have one and others might not.


Regarding Mesos, here is how I see things:

It will be a "hybrid" framework: we will use Mesos for creating and removing containers, and direct access to the nodes for everything else.

Swarm + Mesos

Start with swarm manage --cluster mesos zk://<ip:port1>,<ip:port2>,<ip:port3>/mesos
The MesosCluster will receive Mesos masters from zookeeper in NewEntries, connect to them to get the list of actual docker nodes, and fill its internal cluster.
In CreateContainer and RemoveContainer it'll use the Mesos Master to interact with Mesos.
In Nodes, Containers, Container and Events it will simply use the cluster of docker nodes.


@aluzzardi: WDYT?
@tnachen & @benh: can you take a look at the comments in mesos.go and tell me if I'm missing something?
@francisbouvier: thanks for your initial work in #178
@chanwit: we will keep it compiled in for now, but this API shouldn't be an issue with your PR.

Related to #213 and #214

@aluzzardi
Contributor

Early review comment:

I wouldn't touch the Scheduler but instead make Cluster an interface. SwarmCluster would use the Scheduler whereas the MesosCluster would rely on Mesos.

The reasoning is that, looking at the scheduling interface, it goes far beyond what a scheduler is supposed to do (for instance, Events() or Container()). If we look at the implementation of those functions, they basically relay those calls to the Cluster, since that's where it makes more sense to hold the code - otherwise, Scheduler would end up being a mix of Scheduler and Cluster.

Rough idea:

  • Cluster has an API to interact with the cluster (find a container, get a list of all containers, etc) as we have today, plus the ability to deploy/destroy containers.
  • Scheduler is nowhere to be seen in the API code etc - every interaction goes through the Cluster somehow.
  • Swarm ends up being a manager of Clusters - one being the SwarmCluster, the other one the MesosCluster (and later FooCluster).

I believe it resonates better with what those types are doing: MesosScheduler, for instance, is not a scheduler as it doesn't actually schedule containers; it simply relays calls to Mesos. On the other hand, MesosCluster is what it's supposed to be: a representation of a Mesos cluster where you can schedule containers and look them up.

Thoughts?

@vieux
Contributor Author

vieux commented Feb 12, 2015

I agree with the naming (to me, most of the comment is about renaming Scheduler to Cluster, and it makes sense); the API presented in the body of the PR will likely stay the same.

Although, I'm in favor of keeping the cluster as it is today: the Cluster we have currently in the code is a DockerCluster, not a SwarmCluster, and both SwarmCluster and MesosCluster will use a DockerCluster (let's name it DockerNodes; it's a bad name but it'll make the rest of the comment simpler).

So I would agree on something like:

Cluster is an interface.

SwarmCluster is an implementation of Cluster
SwarmCluster has an internal Scheduler (not an interface) and a DockerNodes (not an interface) with the list of actual nodes.
(here, would you move the Scheduler code into cluster/swarm/scheduler? as it is only for the SwarmCluster)

MesosCluster is an implementation of Cluster
MesosCluster has only a DockerNodes, no Scheduler as the Scheduler is mesos itself in this case.

FooCluster is an implementation of Cluster
FooCluster might or might not have a DockerNodes and its own scheduler.

What do you think of this?

One question though: where would you see the filters and strategies? Only in the Scheduler inside SwarmCluster? Because I believe MesosCluster should somehow have access to those.

@aluzzardi
Contributor

Although, I'm in favor of keeping the cluster as it is today: the Cluster we have currently in the code is a DockerCluster, not a SwarmCluster, and both SwarmCluster and MesosCluster will use a DockerCluster (let's name it DockerNodes; it's a bad name but it'll make the rest of the comment simpler).

I don't think MesosCluster will use DockerCluster as it doesn't have any useful information for it (I guess that if swarm is set up to only use MesosCluster, there won't be any nodes (as in, docker engines), therefore DockerCluster would be empty).

Since all SwarmCluster will ever do is relay calls to DockerCluster (and the scheduler for things such as DeployContainer), what would be the point of having a separate DockerCluster and SwarmCluster?

(here, would you move the Scheduler code into cluster/swarm/scheduler? as it is only for the SwarmCluster)

I'm not sure, I initially would say I don't think so. The goal of the project remains (by default) to be a simple container scheduler, so it kinda makes sense to keep it as a high level item.

Also, one point I didn't raise earlier: Scheduler eventually might be an interface itself, but in the pure sense of scheduling. One could use the SwarmCluster with a custom "native" Scheduler (for instance, one that does rebalancing within a SwarmCluster in a different way, based on actual resource usage rather than requested, as an example).

One question though: where would you see the filters and strategies? Only in the Scheduler inside SwarmCluster? Because I believe MesosCluster should somehow have access to those.

Good point. I don't know if filters and strategies will be re-used by the MesosCluster. If we keep scheduler/ as a top level item, we could leave it there. Basically Scheduler is a simple wrapper doing the glue between filters and strategies, so if MesosCluster needs those, it could as well just import the entire scheduler.

@chanwit
Contributor

chanwit commented Feb 12, 2015

@vieux Thanks for the ping. This PR looks great and I'll be trying to catch up with these changes!


// Entries are Mesos masters
func (s *MesosScheduler) NewEntries(entries []*discovery.Entry) {
Contributor

I'm not sure I understand what this function is for.
The comment above says entries are Mesos masters, but in the code it's adding to the cluster, which I assume is slave nodes with swarm launched.
For Mesos there isn't any need to look for new nodes: when new nodes are added to the data center, the framework will get offers from them via the resourceOffer callback, and since you launched the task yourself you also know which slave that is; on the framework side you just need to listen for TASK_RUNNING in statusUpdate to get the information.

Contributor Author

This will be the generic way to receive Mesos masters.

I do think swarm will need access to the list of nodes, to be able to do the logs, attach, pause, unpause, ps etc.

Contributor

Unless we have a way to look up containers / list containers and get the address of the node they are running on.

In that case, we can build cluster.Containers on the fly when calling Containers() or Container().

Contributor

When you say receive Mesos masters, are you referring to when the master changes?
Swarm definitely needs the list of nodes, and you can query the master for all the slaves. It looks like this is being called back from the discovery service. One concern I have is that since this is a separate discovery process that is proxying information, if for some reason the mesos slave is down it will still return information as long as the node is still alive.
If all the docker commands it needs to proxy are about a running container, it is much better to get the information from Mesos, as that's more definitive.

Contributor Author

Indeed, but we still have 2 issues with this:

  • events: we need direct access to the nodes to get the events
  • images: we do not want, each time we do a docker images, to query mesos for the list of all the nodes and then query all the nodes to get their images; it would take way too long.

@vieux vieux changed the title Proposal: Scheduler API Proposal: ~~Scheduler~~ Cluster API Feb 13, 2015
@vieux vieux changed the title Proposal: ~~Scheduler~~ Cluster API Proposal: Scheduler/Cluster API Feb 13, 2015
@vieux
Contributor Author

vieux commented Feb 13, 2015

I updated the PR with @aluzzardi and @tnachen suggestions.

Still lots of blurry things, but let's do baby steps and iterate :)

  • Cluster is now an interface, and it is the implementation's choice to use the Scheduler or not.
    • Way fewer files changed: good news, and proof it's probably cleaner this way.
    • SwarmCluster uses the scheduler to apply filters and strategies on its docker nodes.
    • I'm not sure how it will work for MesosCluster, but we want MesosCluster to handle filters and strategies as well.
  • As of right now, both SwarmCluster and MesosCluster have a list of docker nodes internally; this might change soon as Mesos could manage only offers.
    • Maybe we should rename Node and make it an interface?
    • It could be a Docker Node for SwarmCluster and an Offer for MesosCluster.

@bfirsh
Contributor

bfirsh commented Feb 13, 2015

I wonder if these should be called "drivers" rather than "APIs" to avoid confusion with the Remote API?

@vieux vieux changed the title Proposal: Scheduler/Cluster API Proposal: Scheduler/Cluster Driver Feb 13, 2015
@vieux
Contributor Author

vieux commented Feb 13, 2015

@bfirsh good idea, I updated the title.

@abronan
Contributor

abronan commented Feb 14, 2015

Super excited by this PR!

If I understand this well:

  • Mesos Slaves are running on Nodes next to a docker daemon as well as the Swarm agent for the discovery service.
  • Swarm Manager will register a custom Mesos framework to communicate with the Mesos-Master and request as well as accept offers.
  • Swarm Manager will register the framework by also giving a custom Executor binary, whose task will be to communicate with the Mesos-Slave and:
    • Run tasks through the docker daemon
    • Control the overall resource usage on the Node(s).

A few questions though:

  • I guess that we assume that the docker daemon as well as the Swarm agent are running on a Node before registering the framework?
  • Do we allow a secure Swarm setup to communicate with an insecure Mesos setup and conversely?

Also thanks @vieux for taking those suggestions into account! I'm strongly leaning toward @aluzzardi's comments on the use of a Cluster interface. To give a concrete example, I'm designing a Slot scheduler that could be pluggable to swarm and it had no use of most of the functions declared in the Scheduler interface from the initial PR. But now it seems cleaner and removes the blurry boundaries between Cluster and Scheduler. All good!

@vieux
Contributor Author

vieux commented Feb 19, 2015

In addition to the Cluster interface, there is now a Node interface to be implemented by each driver.

type Node interface {
    ID() string
    Name() string

    IP() string   //to inject the actual IP of the machine in docker ps (hostname:port or ip:port)
    Addr() string //to know where to connect with the proxy

    Images() []*Image                     //used by the API
    Image(IdOrName string) *Image         //used by the filters
    Containers() []*Container             //used by the filters
    Container(IdOrName string) *Container //used by the filters

    TotalCpus() int64   //used by the strategy
    UsedCpus() int64    //used by the strategy
    TotalMemory() int64 //used by the strategy
    UsedMemory() int64  //used by the strategy

    Labels() map[string]string //used by the filters

    IsHealthy() bool
}

@dpetersen

Last week @alexwelch and I wrote a prototype Kubernetes cluster based on this PR. We got docker run working, including port mapping between containers. A few things we ran into:

  • Compatible cluster setup was a hassle, due to the daemon on the nodes needing to be accessible by the manager. I understand why that is, but it means an easy prebuilt cluster like Google Container Engine won't work without tweaks.
  • The docker CLI experience is quirky. docker run results in two containers, the one you wanted and the k8s Pod manager. Running docker rm on one of them would result in both needing to be removed. This is usually hidden because k8s users aren't interacting directly with Docker. I could maybe omit those "management" containers from your interactions (ps, events, etc.), but that feels weird.
  • We ignored Swarm's scheduling and filtering. We started early last week, before your change where the Cluster asks the Scheduler to select a node for new containers. In k8s there is control over where the container goes, but AFAIK you can't say "put the container on this node". You can say "put it on a node with container X" and things like that, but it's not node-centric. This is the part of k8s I'm least familiar with, so there might be a solution to that problem.

By the end of the week, I felt that a k8s driver is technically possible but wouldn't be compelling to use. Maybe more work (or more k8s experience than I have!) is required, but Swarm wouldn't take advantage of many features of k8s, and wouldn't behave quite like docker, either. This might change in the future if there is a docker run that operates at an application level (like running Compose's YAML file). Working at a level above individual containers would let you employ more of the features of k8s.

I'm very curious to see a real implementation of the Mesos driver. Maybe that work will give us some ideas for the problems we ran into. Sorry for the huge wall of text!

… a simple scheduler interface.

Signed-off-by: Victor Vieux <vieux@docker.com>

Usable -> Total & Reserved -> Used
@vieux
Contributor Author

vieux commented Feb 27, 2015

Thanks @dpetersen for your valuable feedback. This PR is becoming very big, so we might merge it soon and refine the API in another PR; I'll definitely ping you on it.

@aluzzardi
Contributor

LGTM

aluzzardi added a commit that referenced this pull request Feb 27, 2015
Proposal: Scheduler/Cluster Driver
@aluzzardi aluzzardi merged commit db97473 into docker-archive:master Feb 27, 2015
@aluzzardi aluzzardi deleted the mesos_poc branch February 27, 2015 23:17
@aluzzardi
Contributor

I just noticed that the Node implementation is named swarm.Node and lives in node.go while the Cluster implementation is named swarm.SwarmCluster and lives in swarm.go.

Can you make the two consistent, e.g., rename swarm.SwarmCluster to swarm.Cluster and name the file cluster.go?

@vieux
Contributor Author

vieux commented Feb 28, 2015

Sure, will do.

@duncanjw

@aluzzardi Hi. I pitched up here having followed the PR link from the Scheduler section of your Swarm post https://blog.docker.com/2015/02/scaling-docker-with-swarm/ - if this is now the Cluster Driver, it is probably worth updating the post?
