
[Feature Request] "Simple" Stateful Services #31865

Open
jlhawn opened this issue Mar 15, 2017 · 11 comments
Labels
area/swarm kind/feature Functionality or other elements that the project doesn't currently have. Features are new and shiny

Comments

@jlhawn
Contributor

jlhawn commented Mar 15, 2017

Docker Services are great for stateless applications (e.g., web servers, workers) but most real application stacks revolve around one or more distributed stateful services (e.g., databases, message queues, etc). Some more modern examples include key-value stores like etcd, Consul, and Zookeeper, document-oriented databases like RethinkDB and MongoDB, and even some newer/experimental SQL databases like CockroachDB. Some of these are a building block for other distributed stateful services - as is the case with Apache Zookeeper and Kafka. These were designed from the beginning as distributed systems which manage replication of data and failover automatically.

This sets them apart from other widely used yet monolithic stateful services such as PostgreSQL and MySQL which typically provide a mechanism for replication of data (usually primary->secondary configurations) but failover still requires manual operator intervention or some other system to be in place which detects failure of the primary and handles promotion of a secondary. So, for the purposes of this feature request, do not consider these types of stateful services. These systems simply do not fit well into the model of Docker Services.

The kinds of stateful services which do fit into this feature request typically have the following properties:

  • Each peer in the system has a name or address that does not change, even if it restarts. This name or address is advertised to other peers.
  • Any peer in the system is capable of handling a client's request just as well as any other task in the system or is capable of routing it to the appropriate peer.
  • Each peer, once assigned a location, stays in that location (physical or virtual machine, i.e. "node" in Docker).
  • These systems are capable of distributing and replicating their own data across peers (i.e., you don't need any fancy volume plugins; The local volume driver will do) and gracefully handle failover in the case that one or more peers becomes unavailable.
  • As a consequence of the previous 2 points, no two peer tasks should ever be scheduled to the same Docker node in order to not compromise the availability of that system.
  • An operator must manually intervene to scale the system up or down (add or remove peers). While in some systems (like etcd) simply adding a new peer automatically replicates data to that peer, other systems also require a configuration change to replicate any data to the new peer. This could be made easier with Docker if Swarmkit provided custom hooks which can be performed when adding or removing a peer task.
  • When a new peer is added, it is provided with a method for discovering and connecting to all other peers. Once a peer task is running, it is not necessary for this information to be updated when another peer joins as that new peer will be responsible for connecting to the existing peers.

If Docker could provide a set of features which makes the above items possible then these distributed/replicated/auto-failover systems could easily be made into Docker services.

NOTE: I am not requesting a special distributed/replicated Docker volume feature since these systems can distribute/replicate their own data. I explicitly want to use the local storage on a specific node. This feature request is more about having consistent peer discovery and node pinning for these kinds of services.

@docker/core-swarmkit-maintainers

So far, I have been able to sort of make this work with RethinkDB as an example stateful service, but it's far from perfect. To get more control over placement of the peers, I have to run the service in "global" mode (to limit it to at most 1 task per node) with a placement constraint like node.labels.db.replica == true. So in order to scale the service up or down I cannot use docker service scale rethinkdb=N but instead must manage node labels with docker node update .... I actually like this method, as it gives a lot of control over placement, and I know that the swarm scheduler will never put a database task anywhere else - even if the node it's on becomes unavailable.

The real difficulty is in connecting the peers together. My service spec includes an overlay network so each of the peers can communicate over it, but it is difficult to know how to connect them. I can't use service_name.network_name because that resolves to a virtual IP, and from within the service it usually only resolves to your own task endpoint. It turns out that resolving tasks.service_name.network_name returns a DNS record for each of the peer tasks, but those DNS results are eventually consistent, so I cannot rely on this to make sure that the peers are fully connected. This also requires that I build into the image an entrypoint script which does this DNS lookup and prepares --join arguments before exec'ing into the server process.

Another problem is that these peers need to advertise their address to each other, and if a task restarts it's given a different name. That's not a big deal with RethinkDB, but it may be with other systems.
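The entrypoint hack described above might look roughly like this sketch (this is illustrative, not the actual script; the service name rethinkdb and the cluster port 29015 are assumptions):

```shell
#!/bin/sh
# Sketch of a peer-discovery entrypoint for a RethinkDB swarm service.
# Assumptions: the service is named "rethinkdb" and peers talk on the
# default RethinkDB cluster port 29015.

# Turn "PORT ip1 ip2 ..." into "--join ip1:PORT --join ip2:PORT".
join_args() {
    port=$1; shift
    out=""
    for addr in "$@"; do
        out="$out --join $addr:$port"
    done
    printf '%s' "${out# }"
}

# Guarded so the helper above can be sourced/tested outside a swarm.
if [ "${DISCOVER_PEERS:-}" = "1" ]; then
    # Resolve one A record per peer task via swarm's embedded DNS.
    # NOTE: these records are eventually consistent, which is exactly
    # the problem described above.
    peers=$(getent hosts tasks.rethinkdb | awk '{print $1}')
    exec rethinkdb --bind all $(join_args 29015 $peers)
fi
```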

@jlhawn jlhawn added the kind/feature label Mar 15, 2017
@jlhawn jlhawn changed the title from [Feature Request] "Simple" Stafeful Services to [Feature Request] "Simple" Stateful Services Mar 15, 2017
@dperny
Contributor

dperny commented Mar 20, 2017

This is a frequently requested feature, and while I'm not sure what the work for it would look like, I think building it would be really worthwhile, TBQHIMHOIRL

@duglin
Contributor

duglin commented Mar 20, 2017

@jlhawn this sounds similar to what CF does by ordering each instance of a scaled app: if an instance is restarted, it gets that same ordinal position again. I think it's also similar to Kubernetes's StatefulSets. Are these fair comparisons? Just want to make sure I understand the idea behind the feature.

@aluzzardi
Member

/cc @docker/core-swarm-maintainers

@jlhawn
Contributor Author

jlhawn commented Mar 20, 2017

I don't know enough of the details around Cloud Foundry or k8s StatefulSets to say whether those are directly comparable. The idea behind the feature is to be able to support stateful distributed systems with the following characteristics/requirements:

  • The application can handle its own replication of data and failover between peers (so you don't need detachable, distributed volume plugins).
  • Since the containers would use local volumes, a peer in the stateful service must be permanently assigned to a worker node on which it was scheduled. This also makes the data persistent for the lifetime of that peer (i.e., until it is scaled out) and can provide much better performance which some users demand but cannot get from distributed file systems.
  • All peers have a static network identity such as a stable hostname and/or static IP (This quality makes it similar to k8s stateful sets).
  • A peer can reliably and consistently discover who its peers are at start up time.
  • It is a new peer's responsibility to connect to those peers which exist when it starts up (by induction, this means that all peers will eventually be fully connected).
  • Scaling up the number of peers can be customized with a hook (e.g., a command that is exec'd on a new peer).
  • Scaling down can also be customized with a hook (e.g., a command that is exec'd on an existing peer (not the one being removed) when another permanently leaves).
  • Scaling up or down is an explicit operator action in which the peer to remove is specified (as a consequence, a node cannot be removed until all stateful services have been scaled down off of that node).
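For a concrete sense of what such hooks would automate, here is roughly what an operator does by hand today when scaling an etcd cluster whose peers are pinned with node labels (the member names, IDs, label, and node names below are illustrative, not from the issue):

```shell
# Scale-up, by hand: register the new member with the cluster, then
# label the new node so the global-mode service schedules a task there.
etcdctl member add etcd-4 --peer-urls=http://etcd-4:2380
docker node update --label-add db.replica=true worker-4

# Scale-down, by hand: deregister the member first so the cluster's
# quorum math stays correct, then drop the placement label so the
# global-mode service stops scheduling a task on that node.
etcdctl member remove 8e9e05c52164694d
docker node update --label-rm db.replica worker-3
```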

@chanwit
Copy link

chanwit commented Mar 22, 2017

I've done a hack-ish version of this kind of orchestration on top of SwarmKit back in 1.12, and I would love to see this happen in future versions of Docker :-) :-)

@jlhawn this sounds similar to what CF does by ordering each instance of a scaled app: if an instance is restarted, it gets that same ordinal position again.

+1 to @duglin's comment.
Ordering each stateful task and remembering its position is really important, at least for MySQL + Galera and its variants, MariaDB, Percona, etc.

@jlhawn

A peer can reliably and consistently discover who its peers are at start up time.

For each task, this could easily be done by querying the embedded DNS.
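For reference, the embedded-DNS lookups in question look like this from inside any task attached to the service's overlay network (the service name rethinkdb is an example):

```shell
# From inside a task on the service's overlay network:
getent hosts rethinkdb        # resolves to the service's virtual IP (VIP)
getent hosts tasks.rethinkdb  # resolves to one A record per running task
```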

a node cannot be removed until all stateful services have been scaled down off of that node

This is a really important point too, from my experiments. Not only the peer: the stateful service itself should not be allowed to be removed while its replica count is not 0/0.

@jlhawn
Contributor Author

jlhawn commented Mar 22, 2017

A peer can reliably and consistently discover who its peers are at start up time.

For each task, this could be easily done by querying the embedded DNS.

I've got my own hack which does a lookup of tasks.<service_name> to get the addresses of peers, but the problem with this is that, today, the DNS lookup is not consistent. The records for tasks take a while to propagate to all nodes in the cluster, so I cannot currently rely on this method to discover peers. I understand the reasons why overlay network DNS is eventually consistent, but for a stateful service consistent discovery is crucial. I think the easiest workaround would be for the swarm manager to push down a list of peer names and network addresses to a worker when it schedules a task for a stateful service. This could then be exposed via a special peers DNS entry for that task.
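To make the proposal concrete, a task might then resolve the pushed-down list with a lookup like the following (entirely hypothetical; no peers.<service_name> record exists in Docker today):

```shell
# HYPOTHETICAL: illustrates the proposal above; nothing here exists today.
# The manager would push the authoritative peer list down with the task,
# and the worker would expose it under a special DNS name, e.g.:
getent hosts peers.rethinkdb
# This would return one record per peer, consistent with the manager's
# view, instead of the eventually-consistent tasks.<service_name> records.
```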

@chanwit

chanwit commented Mar 22, 2017

I think the easiest workaround would be for the swarm manager to push down a list of peer names and network addresses to a worker when it schedules a task for a stateful service. This could then be exposed via a special peers DNS entry for that task.

I see. I forgot to consider that the gossip protocol is eventually consistent. Pushing only network-scoped addresses would also be a good way to keep the size of the list down.

@jlhawn
Contributor Author

jlhawn commented Mar 22, 2017

Pushing only network-scoped addresses would also be a good way to keep the size of the list down.

I was thinking that such stateful services could have their own implicit overlay network just for use by the peers in this service. Each peer would have a canonical hostname and static IP on this network.
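Today, the closest approximation is to create the dedicated overlay network by hand and attach the service to it (the names, subnet, and image tag below are examples); the implicit per-service network with a stable canonical hostname and static IP per peer is the part that would be new:

```shell
# Create a dedicated overlay network just for the peers (names and
# subnet are examples). What's missing today is a stable hostname and
# static IP per peer on this network that survives task restarts.
docker network create --driver overlay --subnet 10.10.0.0/24 rethinkdb-peers

# One task per labeled node, attached to the dedicated peer network.
docker service create --name rethinkdb --mode global \
  --constraint 'node.labels.db.replica == true' \
  --network rethinkdb-peers rethinkdb:2.3
```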

@mcasimir

mcasimir commented Apr 7, 2017

@jlhawn +1

Maybe the hook could be externalized to a different service/container, or even a plugin, instead of being a command. That way the service images can stay the same, while a controller intercepts the scale up/down commands and interacts with the services to update their configuration if necessary.

That way, e.g., the maintainers of the mongo image would just need to ship a new "controller image" instead of editing hundreds of tags to make it work.

@mcasimir

mcasimir commented Apr 7, 2017

As for 'there is no need for fancy volume drivers', I only partially agree. I also think that Docker-centric distributed storage is not needed, but I still believe that volume binding should be under the control of such a feature, just like the stable networking identity.

Say we want to use EBS volumes; in that case we may want a unique volume to always be bound to the same instance of the service, e.g. (vol-1234 --> instance 1, vol-2345 --> instance 2), and it needs to stay that way.

That's quite different from the "one instance per node + host volume" case, since here we don't know in advance which volume we need to bind for each new instance. We can't really rely on a driver/hook either, because the naming scheme of the volumes may not be predictable in advance.

The only solution I can think of is to keep some sort of binding table, and maybe let the driver request a new volume for new instances.

E.g., say we have mongo and for each instance we want a new EBS volume permanently bound to it.

Each time a new 'peer' is created, we need a way to create a volume (let's say that's magically done by the volume driver).

After the volumes are bound, we need to store the binding in such a way that the same volume will always be used for the same instance in case the container goes down and comes back up somewhere else.

More questions arise at this point... what should happen in the case of a scale down? Should the data be reused on a later scale up? Should the volume be destroyed, or kept but not used anymore?

@guillemsola

guillemsola commented May 15, 2017

I really miss this in Docker too. The way to go here would probably be small baby steps, seeing how things work along the way.

For instance, the Azure ARM JSON templates use a loop (copy) property when creating many resources of the same kind. Using something like this would allow our custom entrypoint scripts to act depending on which instance number the container is.

Adding the capability to name containers with this loop attribute would make it possible to refer to each of these many instances by its container name on an overlay network.

After that, another entrypoint script could be executed once a container is added/removed, so that the rest of the components can act on it. Maybe the configuration could be separated from the init part of the containers, so that a reconfigure flag could be applied...

To share a bit of my pain: right now I find myself trying to perform a stress test with a cluster of JMeter servers and a client. In the end I decided not to use swarm at all, and instead wrote a script that sshes into a number of machines and spins up identical server containers on each of them. This way it's easy for me to create N machines and then deploy N containers. If I had to do it with swarm or compose I would find myself maintaining a huge compose file, or alternatively creating it programmatically from a base template. FYI, I took this approach for stateful apps.
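The ssh fan-out approach described above can be sketched like this (the host list, image name, and JMeter flags are assumptions for illustration):

```shell
#!/bin/sh
# Sketch of the "skip swarm, just ssh" approach described above.
# The host list and the JMeter server image are assumptions.
HOSTS="bench-1 bench-2 bench-3"

for h in $HOSTS; do
    # Start one identical JMeter server container per machine, in
    # parallel, using the host's network so peers can reach it directly.
    ssh "$h" docker run -d --name jmeter-server --network host \
        justb4/jmeter -s &
done
wait   # wait for all the parallel ssh invocations to finish
```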

Projects
None yet
Development

No branches or pull requests

8 participants