
mgr: TTL cache implementation for mgr module. #41571

Merged
11 commits merged into feature-48388-cache on Oct 6, 2021

Conversation

pereman2
Contributor

@pereman2 pereman2 commented May 27, 2021

Hey guys, what if we talk about caching today?

For a while now there have been some preparatory PRs introducing useful tools for stress-testing the manager module, because there were performance problems when requesting the osdmap on clusters with a lot of OSDs. In #41260 we used the injector to change the size of the osdmap, and since we also needed something to call get_python("osd_map"), we added a CLI module to expose the get method (#41150).

We discussed adding a cache at different layers as per https://tracker.ceph.com/issues/48388, but we agreed on implementing it on the C++ side, since that is closer to where we create the PyObjects and the cache would be shared across all the different Python modules. A shared cache implemented in Python would also have been trickier.

About the TTLCache: I've added a header file TTLCache.h in the mgr folder with a simple implementation, so we have a PoC showing that this can in fact improve overall performance.

There are still some problems that need to be addressed. One of them is how to retrieve the memory size used by a PyObject, which I marked as a TODO because it needs further testing.
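The real implementation lives in the C++ header TTLCache.h; as a rough illustration of the idea, here is a minimal Python sketch (class and method names are mine, not the ones in the PR):

```python
import time


class TTLCache:
    """Minimal sketch of the TTL-cache idea: entries expire after a
    fixed number of seconds and are erased on the next lookup."""

    def __init__(self, ttl_seconds=10):
        self.ttl = ttl_seconds
        self.data = {}  # key -> (value, insert_time)

    def insert(self, key, value):
        self.data[key] = (value, time.monotonic())

    def get(self, key):
        entry = self.data.get(key)
        if entry is None:
            return None
        value, inserted = entry
        if time.monotonic() - inserted > self.ttl:
            # Expired: erase the entry and report a miss
            # (the C++ version returns a nullptr here).
            del self.data[key]
            return None
        return value
```

The C++ version additionally has to manage PyObject reference counts around insert and get, which the GC section below touches on.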

Benchmarks

Let's talk about benchmarks now. The following chart was generated with 10 threads (the maximum) to simulate concurrent requests, so there could be some contention in the manager module, and with 1000 iterations.

[benchmark chart]

As you can see, there is a noticeable improvement: with 3000 OSDs the cached path is 56% faster with a 10-second TTL. (Updated as of the latest, more specific benchmarks, which include JSONFormatter.)

Moreover, several Python modules modify the osdmap we are retrieving, which means we need either a way to copy the object or a way to make it immutable. For the copy approach I benchmarked different options: deepcopy (pretty slow, as expected), JSON round-tripping, shallow copying (which can cause problems, since a nested object would not be copied and could still be modified), and finally pickle, which is the clear winner.

We used a 1000 osd osdmap for 1000 iterations.
[bar plot with error bars: copy-approach benchmark]
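The copy-approach comparison above can be sketched roughly like this; the dict below is a hypothetical stand-in for a real 1000-OSD osdmap, and the timings are only illustrative:

```python
import copy
import json
import pickle
import timeit

# Hypothetical stand-in for a large osdmap dict.
osdmap = {"osds": [{"osd": i, "up": 1, "in": 1} for i in range(1000)]}

approaches = {
    "deepcopy": lambda: copy.deepcopy(osdmap),
    "json":     lambda: json.loads(json.dumps(osdmap)),
    "shallow":  lambda: dict(osdmap),  # nested objects still shared!
    "pickle":   lambda: pickle.loads(pickle.dumps(osdmap)),
}

for name, fn in approaches.items():
    print(f"{name:9s} {timeit.timeit(fn, number=100):.3f}s")
```

Note that the shallow copy only duplicates the top-level dict, so mutating a nested entry of the "copy" would corrupt the cached original, which is exactly the problem described above.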

Python gc

As you might know, Python uses a generational GC plus reference counting. When we create a new PyObject we have to own a reference, so before inserting into the TTL cache, and again when returning a value from it, we increase the object's refcount. If we are only injecting and not going through the TTL cache, there is no need to increase it.

I'm not entirely sure this is the correct way, but it is the one that made sense, since every time an object expired its refcount was always back at 1.


The injector PR is merged into this PR because I needed it to test things; for now the injector's output is what gets cached, but after we discuss this PR I'll enable caching for get_python. There are still some problems with that, which is why it isn't in yet.

Finally, this PR introduces two new options: one to enable caching and one to change the TTL (the latter is not added yet).

ceph config set mgr mgr_enable_ttl_cache true 

Fixes: https://tracker.ceph.com/issues/52834

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@markhpc
Member

markhpc commented May 27, 2021

@pereman2 I confess I don't look at the mgr much, but I like the work you are doing here! Keep up the good work!

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

Contributor

@sebastian-philipp sebastian-philipp left a comment


Sorry for my ignorance here:

  1. What is your cache invalidation strategy? Did you make sure that Python modules are not getting outdated osdmaps? (I think I simply missed the code line)
  2. Do we need a TTL for things like the osdmap at all? What do you think of indefinitely caching things that are critical for mgr/prometheus?
  3. What is your opinion on https://lists.ceph.io/hyperkitty/list/dev@ceph.io/thread/P3VS3FYBAP6G5YWYMYNGDYIEFHVOWNBM/ ? This is super important, as this PR here has the potential to make the resolution for this sub-interpreter issue more complicated.

Comment on lines 40 to 42
PyObject *obj = f.get();
Py_INCREF(obj);
return obj;
Contributor


Did you just fix a bug here? If yes, which one? Do you want to amend the commit message?

Contributor Author


Nope, I'm removing this in the next commit, which I'm preparing for the injector. f.get() already returns an object with a refcount of 1, so this Py_INCREF is completely wrong.

@pereman2
Contributor Author

@sebastian-philipp

Sorry for my ignorance here:

  1. What is your cache invalidation strategy? Did you make sure that Python modules are not getting outdated osdmaps? (I think I simply missed the code line)

A 10-second TTL. If you try to get an expired PyObject, it gets erased and a nullptr is returned.

I validated it at a superficial level: I printed whenever a value was erased, and it seemed OK.

  2. Do we need a TTL for things like the osdmap at all? What do you think of indefinitely caching things that are critical for mgr/prometheus?

Well, the osdmap contains information such as whether an OSD is up or down, so a TTL helps keep it up to date.

Furthermore, I thought of storing a per-value TTL in the cache, because some things could be kept for a long time, but I haven't checked in depth all the possible outputs of get_python.

  3. What is your opinion on https://lists.ceph.io/hyperkitty/list/dev@ceph.io/thread/P3VS3FYBAP6G5YWYMYNGDYIEFHVOWNBM/ ? This is super important, as this PR here has the potential to make the resolution for this sub-interpreter issue more complicated.

I'm not that experienced with this, but I always thought that running the Python modules in different processes made more sense; however, that could mean a lot of work, as John said. Maintaining a lot of daemons is probably not fun, and a central process makes sense, but I'm not familiar with the performance issues it could introduce.

I just learned about CPython, so I don't know much about other options. I've seen that PyPy is supposed to be faster, but I have no firsthand experience.

As per #14971 (comment), cherrypy and also numpy make use of global state. I don't like the idea of having to audit every dependency for global-state usage, so I would prefer an approach that sidesteps this.

@pereman2 pereman2 added this to In progress in Dashboard via automation Jun 1, 2021
@liewegas
Member

liewegas commented Jun 1, 2021

I wonder if we can address the problem without a TTL. In the case of getting an osdmap, there is an associated epoch. What if we just keep a copy of the PyObject for the osdmap for that epoch and discard/invalidate if/when the epoch changes? That way we always get a consistent/correct result.
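The epoch-based invalidation suggested above could look roughly like this sketch (class and method names are hypothetical, not from the PR):

```python
class EpochCache:
    """Keep the cached osdmap only while its epoch matches the
    cluster's current epoch; rebuild and re-cache on any change."""

    def __init__(self):
        self.cached_epoch = None
        self.cached_map = None

    def get_osdmap(self, current_epoch, build_map):
        # build_map stands in for whatever expensive call produces
        # the osdmap dict today.
        if self.cached_epoch != current_epoch:
            self.cached_map = build_map()
            self.cached_epoch = current_epoch
        return self.cached_map
```

Unlike a TTL, this never serves a stale map: the cached copy is exactly as fresh as the epoch it was built from.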

I'm guessing that the user that modifies the returned structure is the balancer? Maybe we can make the balancer explicitly make a copy to modify (and pay the price) instead of paying the price of creating a mutable object for all of the other read-only users too. (This might mean the possibility of hard-to-find bugs if someone does modify the structure when they shouldn't... it would be nice if there were some way to make things const.)

@liewegas
Member

liewegas commented Jun 1, 2021

Is this worth discussing tomorrow during CDM? https://tracker.ceph.com/projects/ceph/wiki/CDM_02-JUN-2021

@pereman2
Contributor Author

pereman2 commented Jun 2, 2021

I wonder if we can address the problem without a TTL. In the case of getting an osdmap, there is an associated epoch. What if we just keep a copy of the PyObject for the osdmap for that epoch and discard/invalidate if/when the epoch changes? That way we always get a consistent/correct result.

That makes a lot of sense, but what is the rate of change of the osdmap in production? It looks like it might vary significantly between clusters.

I'm guessing that the user that modifies the returned structure is the balancer? Maybe we can make the balancer explicitly make a copy to modify (and pay the price) instead of paying the price of creating a mutable object for all of the other read-only users too. (This might mean the possibility of hard-to-find bugs if someone does modify the structure when they shouldn't... it would be nice if there were some way to make things const.)

I agree that immutable objects would be a better option. The problem is that the available libraries don't make nested objects immutable, so those can still be modified. We would also have to remove every line of code that modifies the structure.
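For example, the standard library's read-only dict view only freezes the top level, which is exactly the nested-object problem mentioned above (the dict contents here are illustrative):

```python
from types import MappingProxyType

osdmap = {"epoch": 42, "osds": [{"osd": 0, "up": 1}]}
frozen = MappingProxyType(osdmap)

# Top-level writes are rejected...
try:
    frozen["epoch"] = 43
    raise AssertionError("should not be reachable")
except TypeError:
    pass

# ...but nested objects are still the same mutable ones,
# so the "frozen" view silently mutates the original:
frozen["osds"][0]["up"] = 0
assert osdmap["osds"][0]["up"] == 0
```

Making the whole structure deeply immutable would require traversing and wrapping every nested container, which is the performance cost discussed later in the thread.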

Anyways, we can talk about this in the CDM.

@pereman2
Contributor Author

pereman2 commented Jun 7, 2021

@liewegas @sebastian-philipp @tchaikov Following last week's CDM, I've tested storing serialized json and then only having to deserialize it in the python side. We think that there are several pros of this approach:

First of all, it's faster than copying the PyDict object, be it with pickle, deepcopy, etc., and it also makes more sense this way. Furthermore, we've checked that there are different serialization libraries that could be much faster, for example msgpack, which we already have RPMs for (it also looks like the fastest, as per https://medium.com/@shmulikamar/python-serialization-benchmarks-8e5bb700530b).
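The core idea is to pay the serialization cost once when caching and only the deserialization cost per request. A rough Python-side sketch of that trade-off (the dict is a hypothetical stand-in for an osdmap; timings are illustrative):

```python
import json
import pickle
import timeit

# Hypothetical stand-in for a large osdmap dict.
osdmap = {"osds": [{"osd": i, "up": 1, "in": 1} for i in range(1000)]}

# Serialize once (this is what the cache would hold)...
json_bytes = json.dumps(osdmap).encode()
pickle_bytes = pickle.dumps(osdmap)

# ...then each request only deserializes, and every caller gets
# an independent object, so mutation can't corrupt the cache.
print("json.loads:  ", timeit.timeit(lambda: json.loads(json_bytes), number=100))
print("pickle.loads:", timeit.timeit(lambda: pickle.loads(pickle_bytes), number=100))
# msgpack.unpackb would slot in the same way where python3-msgpack
# is available.
```

Because each deserialization produces a fresh object, this also sidesteps the copy/immutability problem discussed earlier.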

If we followed this approach, in the longer term we could decouple the mgr from the python modules and use REST+json/msgpack.

Finally, I've benchmarked it, and we saw noticeable improvements with larger osdmaps. If we were to use msgpack, for which I'm already looking into implementing a MsgpackFormatter, it would be faster overall.

[benchmark chart]

The green line denotes the approach where we store the serialized json and deserialize in the mgr_module.py.

In addition, I haven't fully explored immutable objects, but it still looks like they would hurt performance, since we would have to traverse the whole object.

Please let me know what you think :P.
cc: @epuertat

@avanthakkar
Contributor

jenkins test make check

@avanthakkar
Contributor

jenkins test api

Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
@pereman2
Contributor Author

Hey @neha-ojha. We were thinking of adding some sort of async cleanup to the cache, or something similar, in a follow-up PR. How should we approach this? Also, we added an allow list to control which values get cached, since some of them (like health) were problematic. Do you think that is good enough, or should it be changed?

@neha-ojha
Member

Hey @neha-ojha. We were thinking of adding some sort of async cleanup to the cache, or something similar, in a follow-up PR. How should we approach this? Also, we added an allow list to control which values get cached, since some of them (like health) were problematic. Do you think that is good enough, or should it be changed?

Let's discuss this at a core standup sometime next week

@pereman2
Contributor Author

Let's discuss this at a core standup sometime next week

Okay. I'll join today's meeting.

I ran some more teuthology:
https://pulpito.ceph.com/pdiazbou-2021-09-27_08:45:30-rados:mgr-wip-pereman2-testing-2021-08-20-100-distro-basic-smithi/
One failure, probably unrelated; the rest is OK.

Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
@epuertat epuertat moved this from Reviewer approved to needs-qa in Dashboard Oct 4, 2021
@pereman2
Contributor Author

pereman2 commented Oct 4, 2021

@neha-ojha I updated the PR with the configuration files you asked for in the past meeting. Also I ran teuthology with those and got 0 fails:
https://pulpito.ceph.com/pdiazbou-2021-10-04_07:43:14-rados:mgr-wip-pereman2-testing-2021-08-20-100-distro-basic-smithi/
I hope we can move this forward.

@pereman2
Contributor Author

pereman2 commented Oct 5, 2021

jenkins test dashboard

@pereman2
Contributor Author

pereman2 commented Oct 5, 2021

jenkins test signed

@epuertat epuertat moved this from needs-qa to Ready-to-merge in Dashboard Oct 5, 2021
@pereman2
Contributor Author

pereman2 commented Oct 5, 2021

jenkins test dashboard

@epuertat
Member

epuertat commented Oct 5, 2021

jenkins test dashboard

@epuertat epuertat merged commit 26382ca into ceph:feature-48388-cache Oct 6, 2021
Dashboard automation moved this from Ready-to-merge to Done Oct 6, 2021
@epuertat epuertat deleted the ttlcache branch October 6, 2021 19:25