If running with replicas > 1, not all routers will pick up new routable services at the same time #152
Possible solution to this problem: have the routers elect a leader to track which router replicas have acknowledged the new routable service. The leader can return the current status of the router cluster in a manner similar to the below:

```json
{
  "versions": {
    "replica1": ["service1"],
    "replica2": ["service1", "service2"],
    "replica3": ["service1"]
  }
}
```

Such a response indicates that 2 replicas are still waiting to pick up the new service.
I'm personally not in favor of this, but I'll hold off on why... Entertaining the notion, I think these are some questions that would need addressing:

Admittedly, not having answered all of these questions yet, my own personal sense is that this is too much complexity to add in response to the relatively minor concern that a brand-new routable service might not be fully viable for 10 seconds. I am, of course, willing to have my mind changed if anyone has thoughts on how some of the above can be addressed with a minimal degree of additional complexity.
There's no Go library that I know of that does exactly this. The k8s blog, however, explains the pattern, and it can be done with a sidecar or the k8s client library. No additional etcd is needed, just the k8s API.
By having the non-leaders talk to it? I'm not really sure I understand the question, but the idea is that the non-leaders talk back to the leader, and the leader keeps an append-only log of their status. This implementation is similar to the k8s event stream, and it's fairly simple.
The very last sentence you wrote will likely be the most straightforward implementation.
This proposal is for building a distributed state management system, so yes, there'll likely be new edge cases. I'm open to other proposals!
👍 That's very helpful @arschles. Thanks! After reading through it, I have an idea for a somewhat simplified architecture that would also alleviate some of the onslaught of requests from multiple router replicas to the k8s apiserver... What would you think about the possibility that, rather than have the leader query followers for state, or have followers report their state to the leader, we have the leader be the only router replica that speaks to the apiserver? Each time there is a change to the config model it computes every ten seconds, it can POST that change to the followers, each of which would synchronously apply it. I need to think through that a little bit more, but it would keep all replicas almost perfectly in sync. Thoughts?

Edit: Responses to queries to the leader about the state of service x could block until the leader gets a 2xx response from every follower for any outstanding config pushes.
After some offline discussion with @arschles, we converged on this as a basic design: https://gist.github.com/arschles/b8ad290f8a872a1d9232 While I'm still averse to the additional complexity, I do see a benefit to a router mesh where all replicas are (almost) immediately consistent when configuration changes. I'll commit to at least experimenting with this for RC1.
Copying the contents of the aforementioned gist into here:

Internal update protocol

External update protocol

The current implementation of the router polls the Kubernetes API for changes in the routable services. Let's say the controller adds a new routable service. After it does so, it queries the leader (see below for how it does this) to ensure that all replicas have received this new routable service. The controller doesn't finish the deploy until the leader says that all replicas have the new routable service.

Querying the leader

The k8s leader election protocol provides for all replicas to know who the leader is. If a request goes to a non-leader, it should just forward it to the leader. This behavior allows the controller (or any other entity) to just query the router service, and wherever the request lands, it will make it to the leader.
@arschles a thought I've just had... how long do you suppose the leader should wait for each follower to confirm that config has been received and applied? If it's less than 10 seconds, that seems a bit impatient... but if 10 seconds or more, then we're better off just allowing each replica to run its own config loop like they would today. I know what we want is immediate consistency among all router replicas, but there's really no such thing as immediate... what's the acceptable threshold here? |
@krancour 10 seconds seems like a lot, but in any case, what I mean is a timeout within which a healthy node will respond and an unhealthy one won't. We may consider introducing a backoff retry mechanism here, but we can discuss that further when we get there. Also, what I mean by immediate consistency is that the leader will not continue in the polling loop until all replicas have applied the same change that it just observed. It would be immediate in terms of CAP.
I agree that in the general case, the configuration can be pushed to all replicas and applied much faster than 10 seconds. I just felt like 10 seconds or less feels like an overly aggressive timeout... but maybe not.
Forgive my inquisitiveness on this... would it really solve the initial complaint, then? There's always going to be a window of time in which n > 1 routers might yield two different responses to the same request. It seems this window could be made smaller simply by tightening the config loop. Having said that, I think I'm still, overall, in favor of this, just perhaps for different reasons. If the config loop were tightened on n replicas, that's a lot of chatter with the apiserver. The scheme we're discussing has only the elected leader talking to the apiserver. (Well... I guess that's not entirely true, because the leader election algorithm involves plenty of chatter with the apiserver.)
Yes, it would, because although the routers won't all pick up the new configuration at the same time, there would be a time after which we can be sure that all the routers have picked up the new config.
That's ok, because we could, using this protocol, be certain that all of the router replicas have picked up the new configuration. If you prefer, we have a distributed barrier that opens after all replicas have the new configuration. After the barrier opens, the controller (or any other interested party) can be sure that the changes are completely rolled out (for example, the application is completely deployed).
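The distributed barrier described above can be modeled in memory with a few lines of Go. In the real proposal this state would live with the elected leader; the `versionBarrier` type and its method names are illustrative assumptions:

```go
package main

import (
	"fmt"
	"sync"
)

// versionBarrier tracks the last config version each replica reports
// having applied. The barrier for version v is "open" only once every
// known replica has applied at least v.
type versionBarrier struct {
	mu       sync.Mutex
	replicas map[string]int
}

func newVersionBarrier(names []string) *versionBarrier {
	b := &versionBarrier{replicas: make(map[string]int)}
	for _, n := range names {
		b.replicas[n] = 0
	}
	return b
}

// report records that a replica has applied the given config version.
func (b *versionBarrier) report(replica string, version int) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.replicas[replica] = version
}

// open reports whether every replica has applied at least version v.
func (b *versionBarrier) open(v int) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	for _, applied := range b.replicas {
		if applied < v {
			return false
		}
	}
	return true
}

func main() {
	b := newVersionBarrier([]string{"replica1", "replica2", "replica3"})
	b.report("replica1", 2)
	b.report("replica2", 2)
	fmt.Println("barrier open for v2:", b.open(2)) // replica3 still lags
	b.report("replica3", 2)
	fmt.Println("barrier open for v2:", b.open(2))
}
```

A controller polling `open(v)` for the version it just deployed gets exactly the guarantee described: once it returns true, every replica has the new config.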
As a related sidenote, there are plans to add a cache to the watch API implementation. I haven't read or heard of any such plans, but I hope that a similar cache exists or is planned for annotations or a superset of that data. Such a cache would alleviate some of the chatter problems, because not all calls would go to the backing etcd cluster.
Can you describe the barrier? I believe you refer to the version annotation... but how do "interested parties" know what the version number should be? On a related note, should the leader be the last to record the update in the version annotation? I think that's vital to properly implementing the "barrier."
The barrier is the term I gave to what the protocol implements. There is [a standard concurrency construct](https://en.wikipedia.org/wiki/Barrier_(computer_science)) called this, and it can be implemented in distributed systems. Specifically, in this case, it's the mechanism that reports that not all routers are done applying the new configuration (the barrier is closed) until they all report that they have applied the new config and version number (the barrier is open).
The leader should set the updated version in the annotation before beginning to push the new config and version number to replicas. If it or any replica (including the leader) fails for any reason during the update operation and then comes back later, it should recognize and record the new version, not the old one.
More offline discussion with @arschles. I just want to preserve a few bits of what we discussed here for transparency.
There's an obvious problem with this. Ideally, controller and router were each intended to be compatible with alternate implementations of one another. If our router implementation implements all these bells and whistles and the controller comes to rely on them, this creates a significant barrier to creating alternative implementations of the router. Tightly coupling the controller to the router seems unwise.

At the same time, there would be some great benefits to what @arschles has proposed. One possibility is that the controller could have special-case integration with our routers and ignore / not use consistency checks when alternative / unknown implementations of the router are in use.

@arschles and I have agreed for the time being that if we were to undertake anything such as that, it would require buy-in from @slack @helgi and others and would almost certainly happen post-stable. I'm removing the rc1 milestone and suggesting that we make this a topic of discussion at the next offsite.
I followed the "production usage" docs, which actually recommends scaling the router:
Should this be discouraged until this issue is solved?
@JeanMertz, no. I would not discourage it. If you can live with the possibility that your routers don't all agree on their configuration for a max of 10 seconds every time a change is applied, then there is nothing to worry about here. |
related: #274 should resolve this if we go to a more event-based templating control loop. |
@bacongobbler even with the event-based control loop, the possibility will still remain that for small fractions of time (seconds or less), n routers might have minor disagreements on current state. That said, I don't believe a 100% atomic update across n routers is possible. The best we can ever do here is make that window smaller and smaller, to the point that it's inconsequential. I'd argue that 10 seconds or less already qualifies as such. Should we consider closing this, perhaps?
This issue is a continuation of deis/controller#551.
We're not planning to support more than 1 router replica for the beta release, so the controller can simply check the router service until the router picks up the newly routable service.
However, when we begin supporting multiple router replicas, it's no longer sufficient to just check the router service, because not all replicas will pick up the newly routable service at the same time (and, it's possible that some may never pick it up).