[Feature Request] "Simple" Stateful Services #31865
Comments
This is a frequently requested feature and while I'm not sure what the work for it would be like, I think building it would be really worthwhile, TBQHIMHOIRL
@jlhawn this sounds similar to what CF does by ordering each instance of a scaled app: if an instance is restarted, it gets that same ordinal position again. And I think it's similar to Kube's StatefulSets. Are these fair comparisons? Just want to make sure I understand the idea behind the feature.
/cc @docker/core-swarm-maintainers
I don't know much of the details around cloud foundry or k8s stateful sets to know whether those are directly comparable. The idea behind the feature is to be able to support stateful distributed systems with the following characteristics/requirements:
I've hacked this kind of orchestration into SwarmKit going back to 1.12 and I would love to see this happen in future versions of Docker :-)
+1 to @duglin's comment.
For each task, this could be easily done by querying the embedded DNS.
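A task can indeed discover its peers through the embedded DNS server, since resolving `tasks.<service_name>` returns one A record per running task. A minimal sketch; the service name `mydb` and the canned resolver output below are hypothetical, and in a real task the lookup would go through `getent`:

```shell
#!/bin/sh
# Sketch: enumerate peer task IPs through Swarm's embedded DNS.
# Resolving "tasks.<service>" yields one A record per running task.

# Extract unique IPs from `getent hosts`-style output on stdin.
parse_ips() {
  awk '{print $1}' | sort -u
}

# Inside a real task this would be:
#   getent hosts tasks.mydb | parse_ips
# Offline demonstration with canned resolver output:
printf '10.0.1.3 tasks.mydb\n10.0.1.2 tasks.mydb\n' | parse_ips
```

Note this inherits the eventual-consistency caveat discussed below: the record list may lag behind the actual set of tasks.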
This is a really important point from my experience too. Not only the peer: the stateful service should also not be allowed to be removed if its replica is not …
I've got my own hack which does a lookup of …
I see. I forgot that the gossip protocol is eventually consistent. Pushing only network-scoped addresses would also be a good way to optimize the size of the list.
I was thinking that such stateful services could have their own implicit overlay network just for use by the peers in this service. Each peer would have a canonical hostname and static IP on this network.
@jlhawn +1 Maybe the hook could be externalized to a different service/container, or even a plugin, instead of being a command. That way the service images can stay the same while a controller intercepts the scale up/down commands and interacts with services to update their configuration if necessary. That way, e.g., maintainers of …
As for "there is no need for fancy volume drivers", I only partially agree. I also think that Docker-centric distributed storage is not needed, but I still believe that volume binding should be under the control of such a feature, just like the stable networking identity. Say we want to use EBS volumes: we may want a unique volume always bound to the same instance of the service, e.g. (vol-1234 --> instance 1, vol-2345 --> instance 2), and it needs to stay like that.

That's quite different from the "one instance per node + host volume" case, since there we don't know in advance which volume to bind for each new instance, we can't really rely on a driver/hook either, and the naming scheme of the volumes may not be predictable in advance. The only solution I can think of is to keep some sort of binding table, and maybe let the driver request a new volume for new instances. Say we have mongo and for each instance we want a new EBS volume permanently bound to it. Each time a new 'peer' is created we need a way to create a volume (let's say that's magically done by the volume driver). Once volumes are bound, we need to store the binding in such a way that the same volume is always used for the same instance in case the container goes down and comes up somewhere else.

More questions arise at this point: what should happen on scale-down? Should the data be reused on a later scale-up? Should the volume be destroyed, or kept but no longer used?
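The binding-table idea above can be sketched with a flat file. Everything here (the table path, volume IDs, the slot numbering) is illustrative; this is not an existing Docker mechanism:

```shell
#!/bin/sh
# Sketch of a slot -> volume binding table, so instance N always gets
# the same (hypothetical) EBS volume back. File format: "<slot> <vol-id>".
TABLE=${TABLE:-/tmp/volume-bindings}
: > "$TABLE"

# Record a binding only if the slot is not already bound.
bind_volume() { # bind_volume <slot> <volume-id>
  grep -q "^$1 " "$TABLE" || echo "$1 $2" >> "$TABLE"
}

# Look up the volume bound to a slot.
volume_for() { # volume_for <slot>
  awk -v s="$1" '$1 == s { print $2 }' "$TABLE"
}

bind_volume 1 vol-1234
bind_volume 2 vol-2345
bind_volume 1 vol-9999   # no-op: slot 1 is already bound
echo "slot 1 -> $(volume_for 1)"
```

A real implementation would keep this table in the swarm's raft store rather than a local file, but the stickiness property is the same: rebinding for an existing slot is a no-op.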
I really miss this in Docker too. Probably the way to get there is with small baby steps, seeing how things work. For instance, the Azure ARM JSON templates use a loop property when creating many resources of the same kind. Using something like this would allow our custom entry scripts to act depending on which instance number the container is. Adding capabilities like naming the container with this loop attribute would make it possible to refer to each of these services by its container name on an overlay network. After that, another entry point for the scripts could be executed once a container is added/removed so that the rest of the components can act on it. Maybe separating the configuration from the init part of containers, so that a reconfigure flag could be applied...

To share a bit of my pain: right now I find myself trying to perform a stress test with a cluster of JMeter servers and a client. In the end I decided not to use swarm at all and instead wrote a script that SSHes into a set of machines and spins up identical server containers on each. This way it's easy for me to create N machines and then deploy N containers. If I had to do it with swarm or compose I would find myself maintaining a huge compose file, or alternatively creating it programmatically from a base template. FYI I took this approach for stateful apps.
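Something close to the per-instance number already exists: `docker service create` accepts Go-template placeholders such as `{{.Task.Slot}}` in `--hostname` and `--env`, so an entrypoint can branch on its ordinal. A sketch; the variable name `SLOT` and the bootstrap/join roles are our choice, not a Docker convention:

```shell
#!/bin/sh
# Sketch: act on the replica's ordinal. Swarm can inject it with e.g.
#   docker service create --env 'SLOT={{.Task.Slot}}' ...
# The variable name SLOT and the roles below are illustrative.
role_for_slot() {
  if [ "$1" = "1" ]; then
    echo bootstrap   # first replica initializes the cluster
  else
    echo join        # later replicas join the first one
  fi
}

SLOT=${SLOT:-1}
echo "task $SLOT acts as: $(role_for_slot "$SLOT")"
```

This covers the "loop property" half of the idea; the add/remove notification hook described above would still need something extra.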
Docker Services are great for stateless applications (e.g., web servers, workers) but most real application stacks revolve around one or more distributed stateful services (e.g., databases, message queues, etc). Some more modern examples include key-value stores like etcd, Consul, and Zookeeper, document-oriented databases like RethinkDB and MongoDB, and even some newer/experimental SQL databases like CockroachDB. Some of these are a building block for other distributed stateful services - as is the case with Apache Zookeeper and Kafka. These were designed from the beginning as distributed systems which manage replication of data and failover automatically.
This sets them apart from other widely used yet monolithic stateful services such as PostgreSQL and MySQL which typically provide a mechanism for replication of data (usually primary->secondary configurations) but failover still requires manual operator intervention or some other system to be in place which detects failure of the primary and handles promotion of a secondary. So, for the purposes of this feature request, do not consider these types of stateful services. These systems simply do not fit well into the model of Docker Services.
The kinds of stateful services which do fit into this feature request typically have the following properties:
If Docker could provide a set of features which makes the above items possible then these distributed/replicated/auto-failover systems could easily be made into Docker services.
NOTE: I am not requesting a special distributed/replicated Docker volume feature since these systems can distribute/replicate their own data. I explicitly want to use the local storage on a specific node. This feature request is more about having consistent peer discovery and node pinning for these kinds of services.
@docker/core-swarmkit-maintainers
So far, I have been able to kind-of make this work with RethinkDB as an example stateful service, but it's far from perfect. To get more control over placement of the peers, I have to run the service in "global" mode to limit it to at most 1 task per node, with a placement constraint like `node.labels.db.replica == true`. So in order to scale the service up or down I cannot use `docker service scale rethinkdb=N` but instead must manage node labels with `docker node update ...`. I actually like this method as it gives a lot of control over placement, and I know that the swarm scheduler will never put a database task anywhere else, even if the node it's on becomes unavailable.

The real difficulty is in connecting the peers together. My service spec includes an overlay network so each of the peers can communicate over that, but it is difficult to know how to connect them. I can't use `service_name.network_name` because that resolves to a virtual IP, and from within the service it usually only resolves to your own task endpoint. It turns out that resolving `tasks.service_name.network_name` returns a DNS record for each of the peer tasks, but the DNS results are eventually consistent, so I cannot rely on this to make sure that the peers are fully connected. This also requires that I build into the image an entrypoint script which does this DNS lookup and prepares `--join` arguments before exec'ing into the server process.

Another problem is that these peers need to advertise their address to each other, and if a task restarts it's given a different name. That's not a big deal with RethinkDB but it may be with other systems.