
Extra allocations make short range queries 5% slower when linked with glibc malloc #10340

Open
mdcallag opened this issue Jul 11, 2022 · 0 comments
Labels: performance, regression

I encountered this by accident while running benchmarks with glibc malloc. Normally I use jemalloc, but it wasn't installed on the host on which I compiled db_bench. The regression is introduced by 8b74cea and the new allocation might be here.

I know we prefer jemalloc over glibc malloc, but is it possible to reduce the number of allocations?
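To quantify the extra allocations, one option (a sketch of my own, not part of the measurements above; ltrace availability and the exact filter syntax are assumptions) is to count malloc/free calls for a short run of each build and compare. The smaller workload is only to keep tracing overhead manageable:

# Sketch: count libc malloc/free calls for a small single-threaded seek run on each build.
# ltrace slows execution considerably, so use a short duration and one thread.
ltrace -c -f -e 'malloc+free' \
  ./db_bench --benchmarks=seekrandom --use_existing_db=1 --num=1000000 \
  --threads=1 --seek_nexts=10 --duration=60 2>&1 | tail -n 20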

Example output from the fwdrangewhilewriting benchmark step shows the impact: QPS drops from 512323 to 485274. The first line is from b82edff and the second from 8b74cea; the two commits are adjacent in the repo (b82edff immediately precedes 8b74cea).

ops_sec mb_sec  lsm_sz  blob_sz c_wgb   w_amp   c_mbps  c_wsecs c_csecs b_rgb   b_wgb   usec_op p50     p99     p99.9   p99.99  pmax    uptime  stall%  Nstall  u_cpu   s_cpu   rss     test    date    version job_id  githash
512323  2052.1  18GB    0.0GB,  33.3    14.9    28.8    107     75      0       0       42.9    41.7    76      168     479     22597   1183    0.0     0       21.2    3.1     0.0     fwdrangewhilewriting.t22        2022-07-11T18:36:10     7.3.0           b82edffc7b
485274  1943.7  18GB    0.0GB,  33.2    14.8    28.7    106     74      0       0       45.3    43.7    84      174     489     22534   1183    0.0     0       21.8    2.8     0.0     fwdrangewhilewriting.t22        2022-07-11T18:57:20     7.3.0           8b74cea7fe

From the throughput result and vmstat output (not shared here), I see that 8b74cea uses ~5% more CPU per query. I confirmed that the regression does not reproduce when db_bench is linked with jemalloc.
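For reference, a quick way to confirm which allocator a db_bench binary will use, and to rebuild with jemalloc once it is installed (the exact build switch depends on the RocksDB version and build system, so treat these as assumptions):

# Check whether the binary is dynamically linked against jemalloc.
ldd ./db_bench | grep -i jemalloc || echo "not linked with jemalloc, glibc malloc will be used"

# Rebuild with jemalloc. Makefile build (JEMALLOC=1 links -ljemalloc):
JEMALLOC=1 make db_bench -j"$(nproc)"
# Or with CMake:
cmake -DWITH_JEMALLOC=ON .. && make db_bench -j"$(nproc)"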

A reproduction script:

numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/m/rx --wal_dir=/data/m/rx --num=40000000 --key_size=20 --value_size=400 --block_size=8192 --cache_size=193273528320 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=none --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --report_interval_seconds=5 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --num_levels=8 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --seed=1657564500 --report_file=benchmark_fillseq.wal_disabled.v400.log.r.csv 2>&1 

numactl --interleave=all timeout 1800 ./db_bench --benchmarks=seekrandomwhilewriting --use_existing_db=1 --sync=0 --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/m/rx --wal_dir=/data/m/rx --num=40000000 --key_size=20 --value_size=400 --block_size=8192 --cache_size=193273528320 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=none --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=2097152 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --report_interval_seconds=5 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --num_levels=8 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --duration=1200 --threads=22 --merge_operator="put" --seek_nexts=10 --reverse_iterator=false --seed=1657564570 --report_file=benchmark_fwdrangewhilewriting.t22.log.r.csv 2>&1

A flamegraph for b82edff (no regression here):
[flamegraph image: benchmark_fwdrangewhilewriting.t22, perf, Jul 11 18:54]

A flamegraph for 8b74cea that shows the problem; on the left side of the flamegraph the call stacks for __default_morecore, __libc_free and __libc_malloc are much wider:
[flamegraph image: benchmark_fwdrangewhilewriting.t22, perf, Jul 11 19:15]
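For anyone who wants to reproduce these graphs, the usual perf + FlameGraph recipe works; the sampling rate, duration, and script paths below are assumptions, not necessarily what I used:

# Sample the running db_bench process during the seek phase, then fold the stacks
# with Brendan Gregg's FlameGraph scripts (https://github.com/brendangregg/FlameGraph).
perf record -F 99 -g -p "$(pidof db_bench)" -- sleep 60
perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > fwdrangewhilewriting.svg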
