Use gRPC to communicate the engine and agents #1426
@hectorj2f could you provide more info on the use cases of this PR? Are you trying to implement a new protocol for communication between fleet daemons and to avoid etcd because it is slow? Is that correct?
@kayrus Yes, we'll add more information to this PR description such as motivation, use cases, etc...
Generally, this implementation provides two options, as mentioned above. You can use etcd if that better fits your architecture or requirements, or you can use the in-memory registry to reduce the dependency on etcd (but not to avoid etcd entirely). In that direction, we found in our infrastructure that a high workload on etcd leads to poor or incorrect behavior in fleet.
@hectorj2f perhaps it would be worth your time to investigate using fleet with the etcd v3 api, which uses gRPC. |
@mischief yes, that wouldn't be a bad idea, although we believe that using etcd for inter-process communication between the agents could lead to bottlenecks and also affects the fault tolerance of fleet. Perhaps @htr could shed more light on this to clarify our intention with this PR. On the other hand, and related to your suggestion, do you have benchmark results comparing etcd v2 and v3? We also use etcd in other projects :).
@hectorj2f there are only these benchmarks: https://github.com/coreos/etcd/tree/master/Documentation/benchmarks |
This PR addresses issues related to our fleet usage pattern. Usually our fleet cluster isn't composed of many nodes (<50), but it is expected to handle a non-trivial (at this time) number of units (>3000). Currently, fleet's only registry implementation uses etcd to store both transient and persistent data. More units means more unitStates being updated per second, which after some time causes instability (an update can be very quick (<50ms) but also quite slow (~1s)). This PR implements a multiplexing layer on top of the former EtcdRegistry. This means:
Another issue that we encountered was the lack of cleanup (or GC) in fleet's etcd directory. This PR doesn't address that issue, but having a single writer makes the development of a solution much easier. This implementation can coexist with the traditional etcd-centric implementation. We have a micro benchmark (note the emphasis on micro benchmark) that we use to compare tweaks, configuration parameters, etc:
The time between the |
@mischief we might, but I still think separating the storage of transient and persistent data is important. |
	"github.com/coreos/fleet/registry"
)

func (e *Engine) rpcLeadership(leaseTTL time.Duration, machID string) lease.Lease {
have you looked at https://github.com/coreos/etcd/blob/master/clientv3/concurrency/election.go?
@xiang90 Yes, this fleet version doesn't use the etcd clientv3 yet. But I always keep one eye on your changes.
@kayrus @tixxdz @jonboulle I will update the PR to the latest version. But, do you want me to include |
@hectorj2f Just create additional PR for godeps. |
@kayrus Alright, thanks 👍 |
hmm, additional commit should be fine..? On Mon, Apr 4, 2016 at 5:02 PM, kayrus notifications@github.com wrote:
@jonboulle it will be hard to review whole PR |
Eh, having it in a separate commit seems sufficient to me, but whatever On Mon, Apr 4, 2016 at 5:24 PM kayrus notifications@github.com wrote:
@jonboulle If we do so, there would be approx. 230 files changed, and GitHub doesn't work well with that kind of PR :/.
Recently I have been looking into this PR. On the other hand, it was not easy for me to follow each commit in this PR, from older to newer ones. Is there any chance to reorganize the PR, to reduce number of commits? (Though I can already see that it would not be trivial...) And today I rebased this PR on top of master, fixing a minor thing to compile. (see the branch endocode/dongsu/grpc_engine_communication) And I ran functional tests. |
@dongsupark I will update this branch against the latest version and fix the tests asap. I'll let you know if I find any blocking error. The goal is to have all the tests passing, or to adapt them if needed.
I feel your pain. I can squash our commits, but we can do that once the tests are happy and the code is LGTMed.
@hectorj2f I think the point is that having a messy history actually makes the review harder... While github favours reviewing all commits as one big diff, that's not the traditional way to deal with pull requests; and it can be quite daunting for complex changes. Thus it's better to clean up history before submitting upstream, to help the review. Note that "clean up" doesn't typically mean to squash everything into a single commit -- that wouldn't really help. Such complex changes can usually be made as a logical series of individual commits, each representing some meaningful change that can be explained on its own, and doesn't result in anything breaking... |
@antrik Alright!. It makes sense, I'll group the commits and add helpful messages for them. |
Ah, sorry. This was due to my mistake. For some reason I set As for the logical structure of commits, I think it was not that bad in the beginning. The first 10 commits were relatively good to understand, well-structured. It's just that at some point it started to grow like that. :-) |
@dongsupark I rebased against master but I am getting some errors in the tests of Does it ring any bell ? Where could I look at ? |
@dongsupark Should I add the LICENSE to the header of the new files ? or you'd do it once it is merged. |
@hectorj2f That looks like an error from unit tests. I haven't seen such an error. |
Alright @dongsupark ! I'd see what I can do. |
74e147a to b39e193
Great @dongsupark |
This PR aims to provide a new communication mechanism to improve the performance, data transmission and unit state sharing between the fleet engine and agents in a fleet cluster.
Motivation: In our infrastructure, we have experienced some issues with fleet in terms of scalability, performance and fault-tolerance. Therefore we'd like to present our ideas to help improve those areas.
We use gRPC/HTTP2 as the framework to expose all the required operations (schedule unit, destroy unit, save state, ...) needed to coordinate the engine (the fleet node elected as leader) with the agents. In this implementation, we provide a new registry that stores all the information in memory. Nevertheless, it's also possible to use the etcd registry.
Generally, this implementation provides two options, as mentioned above. You can use etcd if that better fits your architecture or requirements, or you can use the in-memory registry to reduce the dependency on etcd (but not to avoid etcd entirely). In that direction, we found in our infrastructure that a high workload on etcd leads to poor or incorrect behavior in fleet. Besides, we believe that using etcd for inter-process communication between the agents could lead to bottlenecks and also affects the fault tolerance of fleet.
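To make the in-memory registry idea a bit more concrete, here is a minimal sketch of how transient unit states could be kept in memory on the engine side instead of being written to etcd on every update. This is only an illustration, not the code in this PR: the names `UnitState`, `UnitStateRegistry`, `InMemoryRegistry` and `stateKey` are hypothetical, the fields are simplified, and the gRPC transport that would carry these calls from the agents is left out.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// UnitState is a hypothetical, simplified stand-in for the state an
// agent reports about a unit. It is not fleet's actual type.
type UnitState struct {
	UnitName    string
	MachineID   string
	ActiveState string
}

// UnitStateRegistry is an assumed interface for illustration only.
// An etcd-backed and an in-memory implementation could both satisfy it.
type UnitStateRegistry interface {
	SaveUnitState(s UnitState)
	RemoveUnitState(unitName, machineID string)
	UnitStates() []UnitState
}

// InMemoryRegistry keeps transient unit states in a map guarded by a
// mutex, so frequent state updates never touch etcd.
type InMemoryRegistry struct {
	mu     sync.RWMutex
	states map[string]UnitState
}

func NewInMemoryRegistry() *InMemoryRegistry {
	return &InMemoryRegistry{states: make(map[string]UnitState)}
}

// stateKey identifies one unit on one machine.
func stateKey(unitName, machineID string) string {
	return machineID + "/" + unitName
}

// SaveUnitState inserts or overwrites the state for a unit on a machine.
func (r *InMemoryRegistry) SaveUnitState(s UnitState) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.states[stateKey(s.UnitName, s.MachineID)] = s
}

// RemoveUnitState drops the state, e.g. when a unit is destroyed.
func (r *InMemoryRegistry) RemoveUnitState(unitName, machineID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.states, stateKey(unitName, machineID))
}

// UnitStates returns a copy of all known states, sorted by key so the
// result is deterministic.
func (r *InMemoryRegistry) UnitStates() []UnitState {
	r.mu.RLock()
	defer r.mu.RUnlock()
	out := make([]UnitState, 0, len(r.states))
	for _, s := range r.states {
		out = append(out, s)
	}
	sort.Slice(out, func(i, j int) bool {
		return stateKey(out[i].UnitName, out[i].MachineID) <
			stateKey(out[j].UnitName, out[j].MachineID)
	})
	return out
}

func main() {
	var reg UnitStateRegistry = NewInMemoryRegistry()
	reg.SaveUnitState(UnitState{UnitName: "web.service", MachineID: "machine-a", ActiveState: "active"})
	reg.SaveUnitState(UnitState{UnitName: "db.service", MachineID: "machine-b", ActiveState: "activating"})
	for _, s := range reg.UnitStates() {
		fmt.Println(s.MachineID, s.UnitName, s.ActiveState)
	}
}
```

In a sketch like this, the engine would be the only writer of the registry, and persistent data (for example the unit definitions) could still live in etcd behind the same kind of interface.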
Additional information and plots about the motivation of this PR can be found below: #1426 (comment)
This PR has been labeled as WIP; we are still working on improvements, fault tolerance, bug fixes, etc. :)
NOTE: If you want to try this PR, you need to rebuild the Go dependencies, preferably using Go v1.5.1. This PR was so big that we were forced to exclude our new Go dependencies.