
mgr: TTL cache implementation for mgr module. #41571

Merged
11 commits merged into feature-48388-cache on Oct 6, 2021

Conversation

pereman2
Contributor

@pereman2 pereman2 commented May 27, 2021

Hey guys, what if we talk about caching today?

For a while now there have been some preparatory PRs introducing useful tools for stress-testing the manager module, because there were performance problems when requesting the osdmap on clusters with a lot of OSDs. In #41260 we used the injector to change the size of the osdmap, and since we also needed something to call get_python("osd_map"), we added a CLI module to expose the get method (#41150).

We discussed adding a cache at different layers as per https://tracker.ceph.com/issues/48388, but we agreed on implementing it on the C++ side, since that is closer to where we create the PyObjects and the cache would be shared across all the different Python modules. A shared cache implemented in Python would also have been trickier.

About the TTLCache: I've added a header file TTLCache.h in the mgr folder with a simple implementation, so we have a PoC showing that this can in fact improve overall performance.

There are still some problems that need to be addressed. One of them is how to retrieve the memory size used by a PyObject, which I marked as a TODO because it needs further testing.
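The real implementation lives in the C++ header TTLCache.h; as a rough illustration of the idea, here is a minimal Python sketch (class and method names are mine, not the ones in the PR):

```python
import time


class TTLCache:
    """Minimal sketch of the TTL-cache idea: entries expire after a
    fixed number of seconds and are erased on the next lookup."""

    def __init__(self, ttl_seconds=10):
        self.ttl = ttl_seconds
        self.data = {}  # key -> (value, insert_time)

    def insert(self, key, value):
        self.data[key] = (value, time.monotonic())

    def get(self, key):
        entry = self.data.get(key)
        if entry is None:
            return None
        value, inserted = entry
        if time.monotonic() - inserted > self.ttl:
            # Expired: erase the entry and report a miss
            # (the C++ version returns a nullptr here).
            del self.data[key]
            return None
        return value
```

The C++ version additionally has to manage PyObject reference counts around insert and get, which the GC section below touches on.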

Benchmarks

Let's talk about benchmarks now. The following chart was generated with 10 threads (the maximum) to simulate concurrent requests, so there could be some contention in the manager module, and with 1000 iterations.

[benchmark chart]

As you can see, there is a noticeable improvement: with 3000 OSDs the cached path is 56% faster with a 10-second TTL. (Updated as of the latest, more specific benchmarks, which include JSONFormatter.)

Moreover, several Python modules modify the osdmap we are retrieving, which means we need either a way to copy the object or a way to make it immutable. For the copy approach I benchmarked different options: deepcopy (pretty slow, as expected), JSON round-tripping, shallow copying (which can cause problems, since a nested object would not be copied and could still be modified), and finally pickle, which is the clear winner.

We used a 1000 osd osdmap for 1000 iterations.
[bar plot with error bars: copy-approach benchmark]
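The copy-approach comparison above can be sketched roughly like this; the dict below is a hypothetical stand-in for a real 1000-OSD osdmap, and the timings are only illustrative:

```python
import copy
import json
import pickle
import timeit

# Hypothetical stand-in for a large osdmap dict.
osdmap = {"osds": [{"osd": i, "up": 1, "in": 1} for i in range(1000)]}

approaches = {
    "deepcopy": lambda: copy.deepcopy(osdmap),
    "json":     lambda: json.loads(json.dumps(osdmap)),
    "shallow":  lambda: dict(osdmap),  # nested objects still shared!
    "pickle":   lambda: pickle.loads(pickle.dumps(osdmap)),
}

for name, fn in approaches.items():
    print(f"{name:9s} {timeit.timeit(fn, number=100):.3f}s")
```

Note that the shallow copy only duplicates the top-level dict, so mutating a nested entry of the "copy" would corrupt the cached original, which is exactly the problem described above.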

Python gc

As you might know, Python uses a generational GC plus reference counting. When we create a new PyObject we have to own a reference, so before inserting into the TTL cache, and again when returning a value from it, we increase the object's refcount. If we are only injecting and not going through the TTL cache, there is no need to increase it.

I'm not entirely sure this is the correct way, but it is the one that made sense, since every time an object expired its refcount was always back at 1.


The injector PR is merged into this PR because I needed it to test things; for now the injector's output is what gets cached, but after we discuss this PR I'll enable caching for get_python. There are still some problems with that, which is why it isn't in yet.

Finally, this PR introduces two new options: one to enable caching and one to change the TTL (the latter is not added yet).

ceph config set mgr mgr_enable_ttl_cache true 

Fixes: https://tracker.ceph.com/issues/52834

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@markhpc
Member

markhpc commented May 27, 2021

@pereman2 I confess I don't look at the mgr much, but I like the work you are doing here! Keep up the good work!

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

Contributor

@sebastian-philipp sebastian-philipp left a comment


Sorry for my ignorance here:

  1. What is your cache invalidation strategy? Did you make sure that Python modules are not getting outdated osdmaps? (I think I simply missed the code line)
  2. Do we need a TTL for things like the osdmap at all? What do you think of indefinitely caching things that are critical for mgr/prometheus?
  3. What is your opinion on https://lists.ceph.io/hyperkitty/list/dev@ceph.io/thread/P3VS3FYBAP6G5YWYMYNGDYIEFHVOWNBM/ ? This is super important, as this PR here has the potential to make the resolution for this sub-interpreter issue more complicated.

Comment on lines 40 to 42
PyObject *obj = f.get();
Py_INCREF(obj);
return obj;
Contributor


Did you just fix a bug here? If yes, which one? Do you want to amend the commit message?

Contributor Author


Nope, I'm removing this in the next commit, which I'm preparing for the injector. f.get() already returns an object with a refcount of 1, so this Py_INCREF is completely wrong.

@pereman2
Contributor Author

@sebastian-philipp

Sorry for my ignorance here:

  1. What is your cache invalidation strategy? Did you make sure that Python modules are not getting outdated osdmaps? (I think I simply missed the code line)

A 10-second TTL. If you try to get an expired PyObject, it gets erased and a nullptr is returned.

I validated it at a superficial level: I printed whenever a value was erased, and it seemed OK.

  2. Do we need a TTL for things like the osdmap at all? What do you think of indefinitely caching things that are critical for mgr/prometheus?

Well, the osdmap contains information such as whether an OSD is up or down, so a TTL helps keep it up to date.

Furthermore, I thought of storing a per-value TTL in the cache, because some things could be kept for a long time, but I haven't checked in depth all the possible outputs of get_python.

  3. What is your opinion on https://lists.ceph.io/hyperkitty/list/dev@ceph.io/thread/P3VS3FYBAP6G5YWYMYNGDYIEFHVOWNBM/ ? This is super important, as this PR here has the potential to make the resolution for this sub-interpreter issue more complicated.

I'm not that experienced with this, but I always thought that running the Python modules in different processes made more sense; however, that could mean a lot of work, as John said. Maintaining a lot of daemons is probably not fun, and a central process makes sense, but I'm not familiar with the performance issues it could introduce.

I just learned about CPython, so I don't know much about other options. I've seen that PyPy is supposed to be faster, but I have no firsthand experience.

As per #14971 (comment), cherrypy and also numpy make use of global state. I don't like the idea of having to audit every dependency for global-state usage, so I would prefer an approach that sidesteps this.

@pereman2 pereman2 added this to In progress in Dashboard via automation Jun 1, 2021
@liewegas
Member

liewegas commented Jun 1, 2021

I wonder if we can address the problem without a TTL. In the case of getting an osdmap, there is an associated epoch. What if we just keep a copy of the PyObject for the osdmap for that epoch and discard/invalidate if/when the epoch changes? That way we always get a consistent/correct result.
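The epoch-based invalidation suggested above could look roughly like this sketch (class and method names are hypothetical, not from the PR):

```python
class EpochCache:
    """Keep the cached osdmap only while its epoch matches the
    cluster's current epoch; rebuild and re-cache on any change."""

    def __init__(self):
        self.cached_epoch = None
        self.cached_map = None

    def get_osdmap(self, current_epoch, build_map):
        # build_map stands in for whatever expensive call produces
        # the osdmap dict today.
        if self.cached_epoch != current_epoch:
            self.cached_map = build_map()
            self.cached_epoch = current_epoch
        return self.cached_map
```

Unlike a TTL, this never serves a stale map: the cached copy is exactly as fresh as the epoch it was built from.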

I'm guessing that the user that modifies the returned structure is the balancer? Maybe we can make the balancer explicitly make a copy to modify (and pay the price) instead of paying the price of creating a mutable object for all of the other read-only users too. (This might mean the possibility of hard-to-find bugs if someone does modify the structure when they shouldn't... it would be nice if there were some way to make things const.)

@liewegas
Member

liewegas commented Jun 1, 2021

Is this worth discussing tomorrow during CDM? https://tracker.ceph.com/projects/ceph/wiki/CDM_02-JUN-2021

@pereman2
Contributor Author

pereman2 commented Jun 2, 2021

I wonder if we can address the problem without a TTL. In the case of getting an osdmap, there is an associated epoch. What if we just keep a copy of the PyObject for the osdmap for that epoch and discard/invalidate if/when the epoch changes? That way we always get a consistent/correct result.

That makes a lot of sense, but what is the rate of change of the osdmap in production? It looks like it might vary significantly between clusters.

I'm guessing that the user that modifies the returned structure is the balancer? Maybe we can make the balancer explicitly make a copy to modify (and pay the price) instead of paying the price of creating a mutable object for all of the other read-only users too. (This might mean the possibility of hard-to-find bugs if someone does modify the structure when they shouldn't... it would be nice if there were some way to make things const.)

I agree that immutable objects would be a better option. The problem is that the available libraries don't make nested objects immutable, so those can still be modified. We would also have to remove every line of code that modifies the structure.
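For example, the standard library's read-only dict view only freezes the top level, which is exactly the nested-object problem mentioned above (the dict contents here are illustrative):

```python
from types import MappingProxyType

osdmap = {"epoch": 42, "osds": [{"osd": 0, "up": 1}]}
frozen = MappingProxyType(osdmap)

# Top-level writes are rejected...
try:
    frozen["epoch"] = 43
    raise AssertionError("should not be reachable")
except TypeError:
    pass

# ...but nested objects are still the same mutable ones,
# so the "frozen" view silently mutates the original:
frozen["osds"][0]["up"] = 0
assert osdmap["osds"][0]["up"] == 0
```

Making the whole structure deeply immutable would require traversing and wrapping every nested container, which is the performance cost discussed later in the thread.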

Anyways, we can talk about this in the CDM.

@pereman2
Contributor Author

pereman2 commented Jun 7, 2021

@liewegas @sebastian-philipp @tchaikov Following last week's CDM, I've tested storing serialized json and then only having to deserialize it in the python side. We think that there are several pros of this approach:

First of all, it's faster than copying the PyDict object, be it with pickle, deepcopy, etc., and it also makes more sense this way. Furthermore, we've checked that there are different serialization libraries that could be much faster, for example msgpack, which we already have RPMs for (it also looks like the fastest, as per https://medium.com/@shmulikamar/python-serialization-benchmarks-8e5bb700530b).
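The core idea is to pay the serialization cost once when caching and only the deserialization cost per request. A rough Python-side sketch of that trade-off (the dict is a hypothetical stand-in for an osdmap; timings are illustrative):

```python
import json
import pickle
import timeit

# Hypothetical stand-in for a large osdmap dict.
osdmap = {"osds": [{"osd": i, "up": 1, "in": 1} for i in range(1000)]}

# Serialize once (this is what the cache would hold)...
json_bytes = json.dumps(osdmap).encode()
pickle_bytes = pickle.dumps(osdmap)

# ...then each request only deserializes, and every caller gets
# an independent object, so mutation can't corrupt the cache.
print("json.loads:  ", timeit.timeit(lambda: json.loads(json_bytes), number=100))
print("pickle.loads:", timeit.timeit(lambda: pickle.loads(pickle_bytes), number=100))
# msgpack.unpackb would slot in the same way where python3-msgpack
# is available.
```

Because each deserialization produces a fresh object, this also sidesteps the copy/immutability problem discussed earlier.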

If we followed this approach, in the longer term we could decouple the mgr from the python modules and use REST+json/msgpack.

Finally, I've benchmarked it, and we saw noticeable improvements with larger osdmaps. If we were to use msgpack, for which I'm already looking into implementing a MsgpackFormatter, it would be faster overall.

[benchmark chart]

The green line denotes the approach where we store the serialized json and deserialize in the mgr_module.py.

In addition, I haven't fully explored immutable objects, but it still looks like they would hurt performance, since we would have to traverse the whole object.

Please let me know what you think :P.
cc: @epuertat

@avanthakkar
Contributor

jenkins test make check

@avanthakkar
Contributor

jenkins test api

Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
@pereman2
Contributor Author

Hey @neha-ojha. We were thinking of adding some sort of async cleanup to the cache, or something similar, in a follow-up PR. How should we approach this? Also, we added an allow list to control which values get cached, since some of them (like health) were problematic. Do you think that is good enough, or should it be changed?

@neha-ojha
Member

Hey @neha-ojha. We were thinking of adding some sort of async cleanup to the cache, or something similar, in a follow-up PR. How should we approach this? Also, we added an allow list to control which values get cached, since some of them (like health) were problematic. Do you think that is good enough, or should it be changed?

Let's discuss this at a core standup sometime next week

@pereman2
Contributor Author

Let's discuss this at a core standup sometime next week

Okay. I'll join today's meeting.

I ran some more teuthology:
https://pulpito.ceph.com/pdiazbou-2021-09-27_08:45:30-rados:mgr-wip-pereman2-testing-2021-08-20-100-distro-basic-smithi/
One failure, probably unrelated; the rest is OK.

Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
@epuertat epuertat moved this from Reviewer approved to needs-qa in Dashboard Oct 4, 2021
@pereman2
Contributor Author

pereman2 commented Oct 4, 2021

@neha-ojha I updated the PR with the configuration files you asked for in the past meeting. Also I ran teuthology with those and got 0 fails:
https://pulpito.ceph.com/pdiazbou-2021-10-04_07:43:14-rados:mgr-wip-pereman2-testing-2021-08-20-100-distro-basic-smithi/
I hope we can move this forward.

@pereman2
Contributor Author

pereman2 commented Oct 5, 2021

jenkins test dashboard

@pereman2
Contributor Author

pereman2 commented Oct 5, 2021

jenkins test signed

@epuertat epuertat moved this from needs-qa to Ready-to-merge in Dashboard Oct 5, 2021
@pereman2
Contributor Author

pereman2 commented Oct 5, 2021

jenkins test dashboard

@epuertat
Member

epuertat commented Oct 5, 2021

jenkins test dashboard

@epuertat epuertat merged commit 26382ca into ceph:feature-48388-cache Oct 6, 2021
Dashboard automation moved this from Ready-to-merge to Done Oct 6, 2021
@epuertat epuertat deleted the ttlcache branch October 6, 2021 19:25