Support for label selection in watt/kubewatch #1292

Closed
esmet opened this issue Mar 6, 2019 · 4 comments

esmet commented Mar 6, 2019

Please describe your use case / problem.

I run multiple deployments of ambassador on a multi-tenant Kubernetes cluster, using ambassador_id to separate them into non-overlapping "environments". There can be hundreds of different environments running at the same time, and each environment can define dozens of Service objects.

In this scenario, Ambassador (kubewatch) uses a fair amount of memory (2-4 GB) and substantial CPU to process watcher updates. I've seen Ambassador take 60-70 seconds to process a single 15 MB YAML snapshot. Worse, when any single Service object changes, every Ambassador performs that 60-70 second update again. Ideally, Ambassador would use memory and CPU proportional to the set of Services relevant to its particular ambassador_id, and could then scale well even in a massively multi-tenant cluster.

Describe the solution you'd like

I propose adding an environment variable, KUBEWATCH_LABEL_SELECTOR, to kubewatch (https://github.com/datawire/teleproxy), which would be passed as a raw label selector string to the List/Watch implementation.

For example, if my architecture guarantees that all Service objects in an environment carry a consistent environment label, then I could pass KUBEWATCH_LABEL_SELECTOR="environment=qa123" to limit the set of objects kubewatch must operate on (i.e., only the ones in the qa123 environment). This would limit the amount of memory and CPU Ambassador requires overall. A sketch of the idea follows.
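
Purely as an illustration of server-side label selection (a minimal sketch using the official Kubernetes Python client, not kubewatch's actual implementation; the KUBEWATCH_LABEL_SELECTOR name is the one proposed above), only objects matching the selector ever reach the watcher:

    import os

    from kubernetes import client, config, watch

    # Read the proposed selector from the environment and pass it straight
    # through to the Kubernetes list/watch calls so the API server filters.
    label_selector = os.environ.get("KUBEWATCH_LABEL_SELECTOR", "")

    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()

    # Only Services matching the selector are returned in the initial snapshot.
    services = v1.list_service_for_all_namespaces(label_selector=label_selector)
    print("initial snapshot: %d services" % len(services.items))

    # The same selector applies to the watch stream, so unrelated environments
    # never generate events for this instance.
    w = watch.Watch()
    for event in w.stream(v1.list_service_for_all_namespaces,
                          label_selector=label_selector,
                          timeout_seconds=30):
        print(event["type"], event["object"].metadata.name)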

I have a patch that implements this behavior for kubewatch (https://github.com/datawire/teleproxy).

I chose to open the issue here, at least for starters, since this feels mostly like an Ambassador scalability use case.

Describe alternatives you've considered

I considered investigating the hot path for YAML parsing in diagd to see if we could make it faster, but I think this problem is best solved by letting an Ambassador operator tell the system which objects it should look at, rather than by making the "everything" case faster. Even better, this approach would let an operator add new guard rails to prevent user mistakes (e.g., have "ambassador-staging" only consider Services labeled "staging", for even better isolation from "production").

Additional context

I observed a few crash stack traces in diagd when it was under performance pressure.

Unfortunately I seem to have misplaced my notes on this, but I remember it was within load_from_filesystem, in this loop:

    for filepath, filename in inputs:
        self.logger.info("reading %s (%s)" % (filename, filepath))

        try:
            serialization = open(filepath, "r").read()
            self.parse_yaml(serialization, k8s=k8s, filename=filename)
        except IOError as e:
            self.aconf.post_error("could not read YAML from %s: %s" % (filepath, e))

where open() returned None, and the subsequent read() failed.
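
As a hedged sketch of how that loop could be hardened (the logger and helper below are placeholders, not diagd's real API), a context manager plus a broader OSError catch keeps a failed open()/read() from taking down the worker:

    import logging

    logger = logging.getLogger("yaml-loader-sketch")

    def read_yaml_inputs(inputs):
        """inputs: iterable of (filepath, filename) pairs, as in load_from_filesystem."""
        serializations = []
        for filepath, filename in inputs:
            logger.info("reading %s (%s)", filename, filepath)
            try:
                # The context manager closes the handle even if read() raises.
                with open(filepath, "r") as f:
                    serializations.append((filename, f.read()))
            except OSError as e:  # OSError covers IOError on Python 3
                logger.error("could not read YAML from %s: %s", filepath, e)
        return serializations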


esmet commented Mar 7, 2019

I decided to investigate optimizing the YAML parsing path anyway, and it turns out we can get a big speedup by using the C loader instead of the standard pure-Python implementation.

The original issue I ran into was that ~20k Service objects serialized as a YAML snapshot took around 70 seconds to parse back into diagd's memory. With the C loader, that time is down to 6.5 seconds.
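
A rough sketch of that change (assuming PyYAML built against libyaml; the fallback import keeps it working when the C extension isn't available):

    import yaml

    try:
        # libyaml-backed loader: roughly an order of magnitude faster here
        from yaml import CSafeLoader as SafeLoader
    except ImportError:
        # pure-Python fallback when PyYAML was built without libyaml
        from yaml import SafeLoader

    def parse_snapshot(serialization):
        """Parse a multi-document YAML snapshot string into Python objects."""
        return list(yaml.load_all(serialization, Loader=SafeLoader))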

Combining these two optimizations, my multi-tenant workload now allows a single Ambassador instance to process relevant updates in around 100 ms (cutting the 20k Services down to around 200-300, and getting a 10x speedup from the C loader). I'm happy with these results, and I think each optimization has value on its own. I'll open a separate issue for using the C loader: #1294


draeron commented Mar 14, 2019

I've been searching the issues and it seems my problem is related. In our case, I would postulate that it's the Secret count that is problematic, since all our Helm/Tiller history is stored in Secrets.

#1297


esmet commented Mar 17, 2019

Bump: thoughts? This optimization is critical for my use case, and I think others may eventually run into this, too.


kflynn commented Mar 21, 2019

@esmet, so sorry for the delay here! I did in fact switch us to the C YAML parser, and I'd be very interested in seeing your patch to Kubewatch. Want to open a PR in the Teleproxy repo?

Also, are you on the Datawire OSS Slack? There's an #ambassador-dev channel there which is a great place for discussions like this.
