crimson: Enable tcmalloc when using seastar #46062
Previously this has been caused by libtcmalloc not supporting aligned_alloc: https://bugzilla.redhat.com/show_bug.cgi?id=1569391. I'm now checking whether I can reproduce this on my CentOS Stream test setup.
These are the tests that failed make check in jenkins due to attempting to free invalid pointers:
Running ninja check locally with libtcmalloc.so.4.5.3, these all appear to pass:
There were a couple of other failures, but I do not know if they are specifically related to this PR:
jenkins test make check
jenkins retest this please
jenkins retest this please
jenkins retest this please
src/perfglue/CMakeLists.txt
@@ -1,4 +1,4 @@
-if(ALLOCATOR STREQUAL "tcmalloc" AND NOT WITH_SEASTAR)
+if(ALLOCATOR STREQUAL "tcmalloc")
@liu-chunmei does the reason we disabled the heap profiler still hold?
See 028159e.
5fbcadd to 9f31ad7
That's scary. It deserves a tracker ticket.
We'd need to address this failure first.
This should be addressed by #46103.
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
We need to port the Valgrind suppression rules to ASan:
rzarzynski@teuthology:/a/rzarzynski-2023-03-12_13:57:50-crimson-rados-main-distro-crimson-smithi$ less 7204125/teuthology.log
...
2023-03-12T14:22:11.448 DEBUG:teuthology.orchestra.run.smithi027:> sudo MALLOC_CHECK_=3 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph-osd --no-mon-config --cluster ceph --mkfs --mkkey -i 0 --monmap /home/ubuntu/cephtest/ceph.monmap
...
2023-03-12T14:22:17.159 INFO:teuthology.orchestra.run.smithi027.stderr:=================================================================
2023-03-12T14:22:17.159 INFO:teuthology.orchestra.run.smithi027.stderr:==104216==ERROR: LeakSanitizer: detected memory leaks
2023-03-12T14:22:17.159 INFO:teuthology.orchestra.run.smithi027.stderr:
2023-03-12T14:22:17.159 INFO:teuthology.orchestra.run.smithi027.stderr:Direct leak of 8 byte(s) in 1 object(s) allocated from:
2023-03-12T14:22:17.159 INFO:teuthology.orchestra.run.smithi027.stderr: #0 0x7fa2ec22d307 in operator new(unsigned long) (/lib64/libasan.so.6+0xb6307)
2023-03-12T14:22:17.160 INFO:teuthology.orchestra.run.smithi027.stderr: #1 0x7fa2eb769ddd in InitModule() [clone .part.4] (/lib64/libtcmalloc.so.4+0x2eddd)
2023-03-12T14:22:17.160 INFO:teuthology.orchestra.run.smithi027.stderr:
2023-03-12T14:22:17.160 INFO:teuthology.orchestra.run.smithi027.stderr:SUMMARY: AddressSanitizer: 8 byte(s) leaked in 1 allocation(s).
2023-03-12T14:22:17.226 DEBUG:teuthology.orchestra.run:got remote process result: 1
2023-03-12T14:22:17.227 DEBUG:teuthology.orchestra.run.smithi027:> sudo MALLOC_CHECK_=3 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph-osd --cluster ceph --mkfs --mkkey -i 0 --monmap /home/ubuntu/cephtest/ceph.monmap
...
2023-03-12T14:22:18.136 INFO:teuthology.orchestra.run.smithi027.stderr:INFO 2023-03-12 14:22:18,136 [shard 0] ms - [0x61100002c3c0 client.?(temp_mon_client) - >> mon.? v2:172.21.15.27:3300/0] protocol CONNECTING execute_connecting fault, going to WAIT io_stat(io_state=delay, in_seq=0, out_seq=0, out_pending_msgs_size=1, out_sent_msgs_size=0, need_ack=0, need_keepalive=0, need_keepalive_ack=0) -- Connection refused
2023-03-12T14:22:18.136 INFO:teuthology.orchestra.run.smithi027.stderr:WARN 2023-03-12 14:22:18,136 [shard 0] ms - [0x61100002c3c0 client.?(temp_mon_client) - >> mon.? v2:172.21.15.27:3300/0] waiting 0.2 seconds ...
2023-03-12T14:22:18.336 INFO:teuthology.orchestra.run.smithi027.stderr:INFO 2023-03-12 14:22:18,336 [shard 0] ms - [0x61100002c3c0 client.?(temp_mon_client) - >> mon.? v2:172.21.15.27:3300/0] execute_wait(): going to CONNECTING
2023-03-12T14:22:18.337 INFO:teuthology.orchestra.run.smithi027.stderr:INFO 2023-03-12 14:22:18,337 [shard 0] ms - [0x61100002c3c0 client.?(temp_mon_client) - >> mon.? v2:172.21.15.27:3300/0] protocol CONNECTING execute_connecting fault, going to WAIT io_stat(io_state=delay, in_seq=0, out_seq=0, out_pending_msgs_size=1, out_sent_msgs_size=0, need_ack=0, need_keepalive=0, need_keepalive_ack=0) -- Connection refused
2023-03-12T14:22:18.337 INFO:teuthology.orchestra.run.smithi027.stderr:WARN 2023-03-12 14:22:18,337 [shard 0] ms - [0x61100002c3c0 client.?(temp_mon_client) - >> mon.? v2:172.21.15.27:3300/0] waiting 0.4 seconds ...
...
2023-03-13T02:06:06.399 DEBUG:teuthology.exit:Got signal 15; running 1 handler...
2023-03-13T02:06:06.401 DEBUG:teuthology.task.console_log:Killing console logger for smithi027
2023-03-13T02:06:06.402 DEBUG:teuthology.task.console_log:Killing console logger for smithi139
Otherwise --mkfs won't create the object store data, which turns entire crimson-rados runs red: http://pulpito.front.sepia.ceph.com:80/rzarzynski-2023-03-12_13:57:50-crimson-rados-main-distro-crimson-smithi/.
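For illustration, one way such a rule can be expressed for LeakSanitizer is the `__lsan_default_suppressions()` hook, which the sanitizer runtime calls if the program defines it. This is only a sketch: the two patterns below are assumptions derived from the stack frame in the log above (`InitModule()` in `libtcmalloc.so.4`), not the rules that were eventually adopted.

```cpp
// Sketch: compiled-in LeakSanitizer suppressions for a tcmalloc-linked binary.
// The patterns are assumptions taken from the log above, not Ceph's actual rules.
#include <cstring>

// LeakSanitizer looks for this symbol and, when present, parses the returned
// text the same way it parses a suppressions file.
extern "C" const char* __lsan_default_suppressions() {
  // One "leak:<pattern>" rule per line. A pattern is matched against the
  // function, source file, and module names of the frames in the leak stack.
  return "leak:InitModule\n"
         "leak:libtcmalloc\n";
}
```

The same rules can instead live in a text file passed at run time through `LSAN_OPTIONS=suppressions=<path>`, which avoids rebuilding the binary when the list changes.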
fyi, i recently added a |
To avoid touching |
Perhaps we could unify (thinking about generating |
This PR needs #50598 as a dependency.
@rzarzynski yeah, let's get #50598 merged. We can combine this if you want or just approve that one first. I'm ok with whatever.
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
Huzzah! @rzarzynski is there anything we need for this PR to merge now that #50598 is in? (This PR is blocked on your review.)
Approving as the (not so explicit) dependency got merged.
CC: @Matan-B.
jenkins test make check
LGTM
New unrelated asan failure - https://tracker.ceph.com/issues/61504.
jenkins retest this please
Hrm, we're seeing address sanitizer errors in make check:
Presumably similar to the false positive from above?
See the ASan make check issue above.
Since the implicit change to perfglue's CMakeLists.txt results in a different allocator being selected, I pushed #51851 to be able to identify which one is being used.
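As a rough illustration of how a process can tell whether tcmalloc was linked in, the sketch below probes for gperftools' `tc_version()` through a weak declaration. This is not necessarily what #51851 does; `describe_allocator()` is a hypothetical helper, and the weak attribute is a GCC/Clang extension.

```cpp
// Sketch: report whether gperftools' tcmalloc is present in this process.
// describe_allocator() is a hypothetical helper, not code from the PR.
#include <string>

// Declared weak so the program still links when gperftools is absent;
// in that case the function's address evaluates to null.
extern "C" const char* tc_version(int* major, int* minor, const char** patch)
    __attribute__((weak));

std::string describe_allocator() {
  if (tc_version != nullptr) {
    int major = 0;
    int minor = 0;
    const char* patch = nullptr;
    // tc_version() returns a descriptive version string and fills the out-params.
    return std::string("tcmalloc (") + tc_version(&major, &minor, &patch) + ")";
  }
  return "not tcmalloc (libc malloc or another allocator)";
}
```

Note that this only shows whether tcmalloc symbols are present in the process; it does not prove which allocator serves a particular thread or allocation.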
Similar suppression from
I think we should add
Edit: Looks like this is gperftools' allocation: https://github.com/gperftools/gperftools/blob/master/src/tcmalloc.cc#L1146
…n using seastar" This reverts commit 380bc6d. Signed-off-by: Matan Breizman <mbreizma@redhat.com>
9f31ad7 to d884a45
Rebased and reverted the commit from #51875. (no changes, solely for accurate commit history)
classic-osds have always caused significant memory fragmentation
when using the libc memory allocator due to the way that Ceph
tends to utilize memory. In recent testing, crimson-osd was found
to use 25-27GB of RAM with the stock 3GB bluestore cache settings
(osd_memory_target is only used when tcmalloc is available). Upon
further testing, it was found that the classic OSD is even worse,
using between 32-33GB of RAM after a 5 minute 4K sequential
write test when using libc malloc.
The good news is that it appears that crimson-osd is able to use
tcmalloc for alienstore without significant modification. Better
still, it drastically reduces memory usage. In the same test that
resulted in 25GB RSS memory usage for crimson-osd with libc malloc,
a tcmalloc linked version took around 9GB (with an 8GB
osd_memory_target). Since we do not yet (afaik) expose classic OSD
debugging in crimson it is tough to tell why we are still a little
over, but it's clear that for alienstore we are going to need to
use tcmalloc as we do in classic.
Signed-off-by: Mark Nelson <mnelson@redhat.com>
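The remark that osd_memory_target is only used when tcmalloc is available plausibly comes down to the allocator being able to report its own heap usage. The sketch below, assuming gperftools' C API `MallocExtension_GetNumericProperty`, shows how such figures could be read and compared against a target. `read_heap_stats()` and `overshoot_ratio()` are hypothetical helpers, not Ceph code.

```cpp
// Sketch: read tcmalloc heap figures and compare usage against a memory target.
// read_heap_stats() and overshoot_ratio() are hypothetical, not Ceph functions.
#include <cstddef>

// Weak declaration (GCC/Clang extension): null when gperftools is not linked.
extern "C" int MallocExtension_GetNumericProperty(const char* property,
                                                  size_t* value)
    __attribute__((weak));

struct HeapStats {
  size_t heap_bytes = 0;       // bytes the allocator has obtained from the OS
  size_t allocated_bytes = 0;  // bytes currently handed out to the application
};

// Returns true and fills `out` only when tcmalloc is present and both
// properties could be read.
bool read_heap_stats(HeapStats& out) {
  if (MallocExtension_GetNumericProperty == nullptr) {
    return false;
  }
  return MallocExtension_GetNumericProperty("generic.heap_size",
                                            &out.heap_bytes) != 0 &&
         MallocExtension_GetNumericProperty("generic.current_allocated_bytes",
                                            &out.allocated_bytes) != 0;
}

// Ratio of observed usage to the configured target; values above 1.0 mean
// the process is over its target.
double overshoot_ratio(size_t used_bytes, size_t target_bytes) {
  return static_cast<double>(used_bytes) / static_cast<double>(target_bytes);
}
```

With the figures quoted above, roughly 9GB used against an 8GB target gives a ratio of 1.125, that is, about 12.5% over the target.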