
If running with replicas > 1, not all routers will pick up new routable services at the same time #152

Closed
arschles opened this issue Mar 23, 2016 · 17 comments

Comments

@arschles
Member

This issue is a continuation of deis/controller#551.

We're not planning to support more than 1 router replica for the beta release, so the controller can simply check the router service until the router picks up the newly routable service.

However, when we begin supporting multiple router replicas, it's no longer sufficient to just check the router service, because not all replicas will pick up the newly routable service at the same time (and it's possible that some may never pick it up).

@arschles arschles added this to the v2.0-rc1 milestone Mar 23, 2016
@arschles
Member Author

Possible solution to this problem: have the router elect a leader to track which router replicas have acknowledged the new routable service. The leader can return the current status of the router cluster in a form similar to the following:

{"versions":
  {
    "replica1": ["service1"],
    "replica2": ["service1", "service2"],
    "replica3": ["service1"]
  }
}

Such a response indicates that two replicas are still waiting to pick up the newly routable service2. The controller (or any other interested party) can use this information to determine when an initial deploy is complete, by ensuring all replicas have the applicable service in their lists.
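For illustration, here's a minimal Go sketch of how an interested party could parse such a response and decide whether a rollout is complete. The type and function names are hypothetical, not part of any existing router API:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RouterStatus mirrors the JSON shape above: replica name -> services it knows about.
type RouterStatus struct {
	Versions map[string][]string `json:"versions"`
}

// allReplicasHave reports whether every replica lists the given service.
func allReplicasHave(s RouterStatus, service string) bool {
	for _, services := range s.Versions {
		found := false
		for _, svc := range services {
			if svc == service {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

func main() {
	raw := `{"versions": {"replica1": ["service1"], "replica2": ["service1", "service2"], "replica3": ["service1"]}}`
	var status RouterStatus
	if err := json.Unmarshal([]byte(raw), &status); err != nil {
		panic(err)
	}
	// false: replica1 and replica3 haven't picked up service2 yet
	fmt.Println(allReplicasHave(status, "service2"))
}
```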

@krancour
Contributor

I'm personally not in favor of this, but I'll hold off on why...

Entertaining the notion, I think these are some questions that would need addressing:

  1. Is there a reasonable go library for managing leader election? Or maybe there's an established pattern I am not aware of for doing this natively within k8s? I certainly don't want to re-introduce etcd to facilitate this.
  2. How does the leader get information from the non-leaders? I imagine each router exposes information about what services it knows about RESTfully... because, again, I wouldn't want to re-introduce etcd, for instance, so that routers can publish their state to someplace the master can fetch it from.
  3. How does the controller find the leader? Maybe a separate service that has only the leader as an endpoint? (This would require running pods to add and remove labels from themselves. idk yet if that is plausible.) Or maybe the controller doesn't need to find the leader? Perhaps all non-leaders proxy these sort of informational requests to the leader?
  4. New edge cases introduced by this?

Admittedly not having answered all of these questions yet, my own personal sense is that this is too much complexity to add in response to the relatively minor concern that a brand new routable service might not be fully viable for 10 seconds. I am, of course, willing to have my mind changed if anyone has thoughts on how some of the above can be addressed with minimal additional complexity.

@arschles
Member Author

> Is there a reasonable go library for managing leader election? Or maybe there's an established pattern I am not aware of for doing this natively within k8s? I certainly don't want to re-introduce etcd to facilitate this.

There's no Go library that I know of that does exactly this. The k8s blog, however, explains the pattern, and it can be done with a sidecar or the k8s client library. No additional etcd needed, just the k8s API.
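Purely for reference, a minimal sketch of what k8s-native leader election can look like using the client-go leaderelection package (a library that postdates this discussion); the lock name, namespace, and timings below are illustrative assumptions:

```go
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	id, _ := os.Hostname() // each router replica identifies itself by pod name

	// The election is coordinated through a shared Lease object in the k8s API.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "router-leader", Namespace: "deis"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// leader-only work: watch the k8s API and push config to followers
			},
			OnStoppedLeading: func() {
				// fall back to follower behavior
			},
			OnNewLeader: func(identity string) {
				// remember who the leader is so requests can be proxied to it
			},
		},
	})
}
```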

> How does the leader get information from the non-leaders? I imagine each router exposes information about what services it knows about RESTfully... because, again, I wouldn't want to re-introduce etcd, for instance, so that routers can publish their state to someplace the master can fetch it from.

By having the non-leaders talk to it? I'm not really sure I understand the question, but the idea is that the non-leaders talk back to the master and the master keeps an append-only log of their status. This implementation is similar to the k8s event stream, and it's fairly simple.

> How does the controller find the leader? Maybe a separate service that has only the leader as an endpoint? (This would require running pods to add and remove labels from themselves. idk yet if that is plausible.) Or maybe the controller doesn't need to find the leader? Perhaps all non-leaders proxy these sort of informational requests to the leader?

The very last sentence you wrote will likely be the most straightforward implementation.

> New edge cases introduced by this?

This proposal is for building a distributed state management system, so yes, there'll likely be new edge cases. I'm open to other proposals!

@krancour
Contributor

> The k8s blog, however, explains the pattern, and it can be done with a sidecar or the k8s client library. No additional etcd needed, just the k8s API.

👍 That's very helpful @arschles. Thanks!

After reading through it, I have an idea for a somewhat simplified architecture that would also alleviate some of the onslaught of requests from multiple router replicas to the k8s apiserver...

What would you think about the possibility that, rather than having the master query followers for state or having followers report their state to the master, we have the master be the only router replica that speaks to the apiserver? Each time there is a change to the config model it computes every ten seconds, it can POST that change to the followers, each of which would apply it synchronously.

I need to think through that a little bit more, but it would keep all replicas almost perfectly in sync. Thoughts?

Edit: Responses to queries to the master about the state of service x could block until the master gets a 2xx response from every follower for any outstanding config pushes.
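To make the proposal concrete, here's a rough Go sketch of the push step being described: the leader POSTs the newly computed config to every follower and treats anything other than a 2xx from each of them as a failed push. The /config endpoint, payload shape, and how followers are discovered are all assumptions for illustration:

```go
package router

import (
	"bytes"
	"fmt"
	"net/http"
	"sync"
	"time"
)

// pushConfig sends the new configuration to every follower concurrently and
// returns an error if any follower fails to confirm with a 2xx response.
func pushConfig(followers []string, config []byte) error {
	client := &http.Client{Timeout: 10 * time.Second}
	errs := make(chan error, len(followers))
	var wg sync.WaitGroup

	for _, addr := range followers {
		wg.Add(1)
		go func(addr string) {
			defer wg.Done()
			resp, err := client.Post("http://"+addr+"/config", "application/json", bytes.NewReader(config))
			if err != nil {
				errs <- fmt.Errorf("push to %s failed: %v", addr, err)
				return
			}
			defer resp.Body.Close()
			if resp.StatusCode < 200 || resp.StatusCode >= 300 {
				errs <- fmt.Errorf("follower %s returned %d", addr, resp.StatusCode)
			}
		}(addr)
	}
	wg.Wait()
	close(errs)

	// Report the first failure, if any; the caller can retry or terminate the replica.
	for err := range errs {
		return err
	}
	return nil
}
```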

@krancour
Contributor

After some offline discussion with @arschles we converged on this as a basic design:

https://gist.github.com/arschles/b8ad290f8a872a1d9232

While I'm still averse to the additional complexity, I do see a benefit to a router mesh where all replicas are (almost) immediately consistent when configuration changes. I'll commit to at least experimenting with this for RC1.

@krancour krancour self-assigned this Mar 23, 2016
@arschles
Member Author

Copying the contents of the aforementioned gist into here:

Internal update protocol

  1. When the leader gets a new configuration, it assigns a monotonically increasing version number to it and stores that number in a k8s annotation (similar to how the leader election algorithm uses annotations). The version number is for router-internal use only.
  2. Having incremented the version and stored it in the annotation, the leader sends the version along with the new configuration to all other router replicas (see the sketch after this list). Each replica responds only when it has finished updating both its config and its version number. If any replica doesn't respond within a reasonable timeout, the leader terminates it.
  3. When a non-leader comes online with a version number, it compares its locally known value with the one in the annotation. If they don't match, it requests the most up-to-date version number and config from the leader.
  4. When a non-leader comes online with no version, it requests the most up-to-date version and config from the leader.
  5. If a non-leader gets one or more configurations pushed to it while it is receiving the version number and config from the leader, it applies the most up-to-date version of all the configs it receives.
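A rough sketch of the version bump from steps 1 and 2, assuming purely for illustration that the version is kept as an annotation on a ConfigMap, and using the modern client-go API; the object name and annotation key are hypothetical:

```go
package router

import (
	"context"
	"strconv"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const versionAnnotation = "router.deis.io/config-version"

// bumpVersion reads the current version from the annotation, increments it,
// writes it back, and returns the new value to send to followers along with
// the new configuration.
func bumpVersion(ctx context.Context, client kubernetes.Interface, namespace string) (int64, error) {
	cm, err := client.CoreV1().ConfigMaps(namespace).Get(ctx, "router-config-version", metav1.GetOptions{})
	if err != nil {
		return 0, err
	}
	current, _ := strconv.ParseInt(cm.Annotations[versionAnnotation], 10, 64) // missing annotation -> 0
	next := current + 1
	if cm.Annotations == nil {
		cm.Annotations = map[string]string{}
	}
	cm.Annotations[versionAnnotation] = strconv.FormatInt(next, 10)
	if _, err := client.CoreV1().ConfigMaps(namespace).Update(ctx, cm, metav1.UpdateOptions{}); err != nil {
		return 0, err
	}
	return next, nil
}
```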

External update protocol

The current implementation of the router polls the Kubernetes API for changes in the routable services. Let's say the controller adds a new routable service. After it does so, it queries the leader (see below for how it does this) to ensure that all replicas have received this new routable service. The controller doesn't finish the deploy until the leader says that all replicas have the new routable service.

Querying the leader

The k8s leader election protocol provides for all replicas to know who the leader is. If a request goes to a non-leader, it should just forward the request to the leader. This behavior allows the controller (or any other entity) to simply query the router service, and wherever the request lands, it will make its way to the leader.
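A minimal sketch of that forwarding behavior using Go's standard reverse proxy; the leader URL would come from whatever the leader-election mechanism reports, and the handler wiring is illustrative:

```go
package router

import (
	"net/http"
	"net/http/httputil"
	"net/url"
)

// stateHandler answers state queries directly on the leader and transparently
// forwards them to the leader from any non-leader replica.
func stateHandler(leaderURL *url.URL, isLeader func() bool, localState http.Handler) http.Handler {
	proxy := httputil.NewSingleHostReverseProxy(leaderURL)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if isLeader() {
			localState.ServeHTTP(w, r) // the leader answers directly
			return
		}
		proxy.ServeHTTP(w, r) // followers forward to the leader
	})
}
```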

@krancour
Contributor

@arschles a thought I've just had... how long do you suppose the leader should wait for each follower to confirm that config has been received and applied? If it's less than 10 seconds, that seems a bit impatient... but if it's 10 seconds or more, then we're better off just allowing each replica to run its own config loop as it does today. I know what we want is immediate consistency among all router replicas, but there's really no such thing as immediate... what's the acceptable threshold here?

@arschles
Member Author

@krancour 10 seconds seems like a lot, but in any case what I mean is a timeout within which a healthy node will respond and an unhealthy one won't. We may consider introducing a backoff/retry mechanism here, but we can discuss that further when we get there.

Also, what I mean by immediate consistency is that the leader will not continue in the polling loop until all replicas have applied the same change that it just observed. It would be immediate consistency in CAP terms.

@krancour
Contributor

> 10 seconds seems like a lot

I agree that in the general case the configuration can be pushed to all replicas and applied much faster than 10 seconds. I just felt that 10 seconds or less is an overly aggressive timeout... but maybe not.

> what I mean by immediate consistency is that the leader will not continue in the polling loop until all replicas have applied the same change that it just observed

Forgive my inquisitiveness on this... would it really solve for the initial complaint then? There's always going to be a window of time in which n > 1 routers might yield one of two different responses to the same request. It seems this window could be made smaller simply by tightening the config loop.

Having said that, I think I'm still, overall, in favor of this, just perhaps for different reasons. If the config loop were tightened on n replicas, that's a lot of chatter with the apiserver. The scheme we're discussing has only the elected leader talking to the apiserver. (Well... I guess that's not entirely true, because the leader election algorithm involves plenty of chatter with the apiserver.)

@arschles
Member Author

> would it really solve for the initial complaint then?

Yes, it would, because although the routers won't all pick up the new configuration at the same time, there would be a time after which we can be sure that all the routers have picked up the new config.

> There's always going to be a window of time in which n > 1 routers might yield one of two different responses to the same request

That's ok, because we could, using this protocol, be certain that all of the router replicas have picked up the new configuration. If you prefer, we have a distributed barrier that opens after all replicas have the new configuration. After the barrier opens, the controller (or any other interested party) can be sure that the changes are completely rolled out (for example, the application is completely deployed).

> ...the leader election algorithm involves plenty of chatter with the apiserver.

As a related sidenote, there are plans to add a cache to the watch API implementation. I haven't read or heard of any such plans for annotations, but I hope a similar cache exists or is planned for them or a superset of that data. Such a cache would alleviate some of the chatter problems, because not all calls would go to the backing etcd cluster.

@krancour
Contributor

> If you prefer, we have a distributed barrier that opens after all replicas have the new configuration.

Can you describe the barrier? I believe you refer to the version annotation... but how do "interested parties" know what the version number should be?

On a related note, should the leader be the last to record the update in the version annotation? I think that's vital to properly implementing the "barrier."

@arschles
Member Author

> Can you describe the barrier? I believe you refer to the version annotation... but how do "interested parties" know what the version number should be?

"Barrier" is the term I gave to what the protocol implements. There is [a standard concurrency construct](https://en.wikipedia.org/wiki/Barrier_(computer_science)) by this name, and it can be implemented in distributed systems. Specifically, in this case, it's the mechanism that reports that the routers are not all done applying the new configuration (the barrier is closed) until they all report that they have applied the new config and version number (the barrier is open).

> On a related note, should the leader be the last to record the update in the version annotation?

The leader should set the updated version in the annotation before beginning to push the new config and version number to the replicas. If the leader or any replica fails (for any reason) during the update operation and then comes back later, it should recognize and record the new version, not the old one.

@krancour
Contributor

More offline discussion with @arschles. I just want to preserve a few bits of what we discussed here for transparency.

  • Clarification of "barrier": the term should not imply that access to the router(s) is blocked while configuration is inconsistent. It's only a "barrier" to access insofar as interested parties can choose to defer their requests until the router mesh reports consistent configuration.
  • The principal focus of ensuring consistency / making consistency queryable has been for the controller to know when a newly deployed app has become routable. However, this is far from the only practical use. Consider an entirely different scenario: an app operator uses the deis CLI to associate a certificate with one of an application's "custom" domains. In an ideal world, that operation blocks until all routers have received the cert and are listening on 443 for the given domain. So, in the general case, any mechanism for making consistency arrive faster than the eventual consistency we get today is only useful if any hypothetical configuration change can be queried to determine whether it is complete. What that looks like:
    • An interested party asks the router mesh for its current configuration (all non-leaders are proxies for the leader, who responds to such inquiries). The leader blocks until the mesh's configuration is in a consistent state and then returns the full configuration model (as JSON, presumably). Interested parties (the controller being one such party) can inspect it to determine whether the change they care about is present. A sketch of this flow follows the list.
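A hypothetical sketch of that flow in Go (the handler path and helper functions are placeholders for the leader's internal bookkeeping): the leader blocks the request until all outstanding pushes are acknowledged, then returns its full configuration model as JSON.

```go
package router

import (
	"context"
	"encoding/json"
	"net/http"
)

// configHandler blocks until the mesh reports a consistent configuration, then
// returns the full configuration model so callers can check for their change.
func configHandler(waitForConsistent func(ctx context.Context) error, currentConfig func() interface{}) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Wait for all outstanding config pushes to be acknowledged by every replica.
		if err := waitForConsistent(r.Context()); err != nil {
			http.Error(w, "configuration not yet consistent", http.StatusServiceUnavailable)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(currentConfig())
	})
}
```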

There's an obvious problem with this. The controller and router were each intended to be compatible with alternative implementations of one another. If our router implementation adds all these bells and whistles and the controller comes to rely on them, this creates a significant barrier to building alternative router implementations.

Tightly coupling the controller to the router seems unwise. At the same time, there would be some great benefits to what @arschles has proposed. One possibility is that the controller could have special-case integration with our routers and skip the consistency checks when alternative / unknown implementations of the router are in use. @arschles and I have agreed for the time being that if we were to undertake anything like that, it would require buy-in from @slack, @helgi, and others and would almost certainly happen post-stable.

I'm removing the rc1 milestone and suggesting that we make this a topic of discussion at the next offsite.

@JeanMertz

I followed the "production usage" docs, which actually recommend scaling the router:

> Do you need to scale the router? For greater availability, it's desirable to run more than one instance of the router. How many can only be informed by stress/performance testing the applications in your cluster. To increase the number of router instances from the default of one, increase the number of replicas specified by the deis-router replication controller. Do not specify a number of replicas greater than the number of worker nodes in your Kubernetes cluster.

Should this be discouraged until this issue is solved?

@krancour
Contributor

@JeanMertz, no. I would not discourage it. If you can live with the possibility that your routers don't all agree on their configuration for a max of 10 seconds every time a change is applied, then there is nothing to worry about here.

@bacongobbler
Member

bacongobbler commented Oct 31, 2016

related: #274 should resolve this if we go to a more event-based templating control loop.

@krancour
Contributor

krancour commented Nov 1, 2016

@bacongobbler even with the event-based control loop, the possibility will still remain that for small fractions of time (seconds or less), n routers might have minor disagreements on current state. That said, I don't believe a 100% atomic update to n routers is possible. The best we can ever do here is make that window smaller and smaller and smaller to the point it's inconsequential. I'd argue that 10 seconds or less already qualifies as such. Should we consider closing this perhaps?
