Intra-L0 compaction reduces usage of trivial move during loads in key order #10075

Open · mdcallag opened this issue May 30, 2022 · 5 comments
Labels: performance (Issues related to performance that may or may not be bugs)

mdcallag (Contributor) commented May 30, 2022

Prior to intra-L0 compaction, when db_bench --benchmarks=fillseq was run with leveled compaction there was no compaction, only trivial moves. I can reproduce that using RocksDB versions 4.1 or 5.1.4. But in v6, once there are a few stalls, some intra-L0 compaction occurs, and at that point compaction is used in place of some trivial moves. This is related to issue #9423 and possibly to issue #10082.
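
For readers unfamiliar with the distinction: a trivial move reassigns a file to the next level with a metadata-only edit, while a compaction reads and rewrites the data. Below is a minimal sketch of the core non-overlap condition, not RocksDB's actual code; the real check is Compaction::IsTrivialMove() and considers additional constraints (compression settings, grandparent overlap limits, etc.).

```cpp
#include <string>
#include <vector>

// Hedged sketch of the trivial-move idea. FileMeta and CanTrivialMove are
// illustrative names, not RocksDB types.
struct FileMeta {
  std::string smallest_key;
  std::string largest_key;
};

// Input files from level N can be trivially "moved" to level N+1 when their
// key ranges do not overlap any file already in level N+1: the manifest is
// updated, but no data is read or rewritten.
bool CanTrivialMove(const std::vector<FileMeta>& inputs,
                    const std::vector<FileMeta>& output_level_files) {
  for (const FileMeta& in : inputs) {
    for (const FileMeta& out : output_level_files) {
      bool disjoint = in.largest_key < out.smallest_key ||
                      out.largest_key < in.smallest_key;
      if (!disjoint) {
        return false;  // overlap: the data must be merged and rewritten
      }
    }
  }
  return true;  // metadata-only move
}
```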

The analysis above is based on correlation rather than causation, so I also ran tests using version 6.28.2 as-is and a hacked version that disables intra-L0 compaction and disables dynamic resizing of per-level targets. I then ran fillseq: the as-is binary gets some trivial moves and some compaction, while the hacked version has only trivial moves.

Other things that I see:

  • write-amp is ~2X larger for the as-is binary because trivial move isn't always used. See the Cumulative compaction line below, where it is 1585.01 GB for as-is versus 878.32 GB for hacked.
  • compaction wall clock seconds are larger for the as-is binary: from the Comp(sec) column, 15517.21 vs 11640.79.
  • there are more write stalls for the hacked binary; I am curious whether that can be fixed. The cause was always too many L0 files (147247 level0_slowdown, 107452 level0_slowdown_with_compaction), and 2/3 of those occur while L1->L2 is in progress (see the stall-trigger sketch after this list). Given that Ln->Ln+1 should be fast with trivial move, I don't understand why this happens.
  • throughput was worse for the hacked binary, but efficiency (see the bullets above) was better. I wonder whether the additional stalls from too many L0 files (32.4% vs 22.3% of time stalled) were the root cause.
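
To make the stall counts above concrete, here is a minimal sketch of how the L0 file count maps to the slowdown/stop states, assuming only the two triggers set on the command line below; the name L0StallState is hypothetical, and RocksDB's actual condition (ColumnFamilyData::RecalculateWriteStallConditions()) also weighs pending compaction bytes and memtable counts.

```cpp
#include <cstdio>

enum class WriteStall { kNone, kSlowdown, kStop };

// Sketch only: maps the number of L0 files to a stall state using the
// triggers from this issue's command line
// (--level0_slowdown_writes_trigger=20, --level0_stop_writes_trigger=30).
WriteStall L0StallState(int num_l0_files, int slowdown_trigger,
                        int stop_trigger) {
  if (num_l0_files >= stop_trigger) {
    return WriteStall::kStop;      // writes blocked until L0 drains
  }
  if (num_l0_files >= slowdown_trigger) {
    return WriteStall::kSlowdown;  // counted as level0_slowdown above
  }
  return WriteStall::kNone;
}

int main() {
  // With these triggers, 25 L0 files puts writes in the slowdown state.
  WriteStall s = L0StallState(/*num_l0_files=*/25, /*slowdown_trigger=*/20,
                              /*stop_trigger=*/30);
  std::printf("slowdown? %d\n", s == WriteStall::kSlowdown);
  return 0;
}
```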

A command line is:

./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=16 --max_write_buffer_number=8 --db=/data/m/rx --wal_dir=/data/m/rx --num=4000000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=225485783040 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --report_interval_seconds=5 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --seed=1653095086 --report_file=bm.lc.nt32.cm1.d0/v6.28.2.base/benchmark_fillseq.wal_disabled.v400.log.r.csv

Compaction IO stats from test end for the as-is binary show that some compaction is used:

Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0     19/4   396.26 MB   4.2    490.6     0.0    490.6    1368.9    878.3       0.0   1.6     37.0    103.4  13561.13          12779.04    115741    0.117   2233M      0       0.0       0.0
  L2     58/0    1.20 GB   3.1    216.5   114.0    102.5     214.2    111.7     461.2   1.9    114.4    113.1   1938.53           1928.58      1847    1.050    986M      0       0.0       0.0
  L3     65/0    1.41 GB   0.8      1.0     0.7      0.3       1.0      0.7     837.1   1.5    114.6    114.7      9.36              9.32        14    0.669   4773K      0       0.0       0.0
  L4    131/0    2.34 GB   0.3      0.6     0.6      0.0       0.6      0.6     867.8   1.0    114.2    114.2      5.47              5.43        26    0.211   2785K      0       0.0       0.0
  L5    785/0   12.46 GB   0.3      0.2     0.2      0.0       0.2      0.2     869.8   1.0    112.9    112.9      2.20              2.18        16    0.137   1107K      0       0.0       0.0
  L6   6117/0   99.76 GB   0.6      0.1     0.1      0.0       0.1      0.1     858.0   1.0    114.8    114.8      0.52              0.51         4    0.129    263K      0       0.0       0.0
  L7  61356/0   758.40 GB   0.0      0.0     0.0      0.0       0.0      0.0     758.4   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0       0.0       0.0
 Sum  68531/4   875.96 GB   0.0    709.0   115.6    593.4    1585.0    991.6    4652.3   1.8     46.8    104.6  15517.21          14725.05    117648    0.132   3228M      0       0.0       0.0

Uptime(secs): 12858.4 total, 20.0 interval
Flush(GB): cumulative 878.339, interval 1.203
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 1585.01 GB write, 126.22 MB/s write, 709.03 GB read, 56.46 MB/s read, 15517.2 seconds
Stalls(count): 26163 level0_slowdown, 26055 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 55628 stop for pending_compaction_bytes, 9343 slowdown for pending_compaction_bytes, 0 memtable_compaction,
 0 memtable_slowdown, interval 62 total count

** DB Stats **
Uptime(secs): 12858.4 total, 20.0 interval
Cumulative writes: 0 writes, 3999M keys, 0 commit groups, 0.0 writes per commit group, ingest: 1624.01 GB, 129.33 MB/s
Cumulative WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s
Cumulative stall: 00:47:45.976 H:M:S, 22.3 percent

Compaction IO stats from test end for the hacked binary show that only trivial move is used:

Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0     18/1   155.06 MB   4.2      0.0     0.0      0.0     878.3    878.3       0.0   1.0      0.0     77.3  11640.79           8381.25    104400    0.112       0      0       0.0       0.0
  L2     34/3   292.91 MB   4.2      0.0     0.0      0.0       0.0      0.0     585.5   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0       0.0       0.0
  L3     25/3   215.37 MB   1.0      0.0     0.0      0.0       0.0      0.0     841.3   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0       0.0       0.0
  L4    179/1    1.51 GB   1.0      0.0     0.0      0.0       0.0      0.0     873.1   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0       0.0       0.0
  L5   1426/0   12.00 GB   1.0      0.0     0.0      0.0       0.0      0.0     875.6   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0       0.0       0.0
  L6  11412/0   96.01 GB   1.0      0.0     0.0      0.0       0.0      0.0     864.1   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0       0.0       0.0
  L7  91303/0   768.14 GB   0.0      0.0     0.0      0.0       0.0      0.0     768.1   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0       0.0       0.0
 Sum 104397/8   878.30 GB   0.0      0.0     0.0      0.0     878.3    878.3    4807.7   1.0      0.0     77.3  11640.79           8381.25    104400    0.112       0      0       0.0       0.0

Uptime(secs): 19871.4 total, 20.0 interval
Flush(GB): cumulative 878.322, interval 0.412
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 878.32 GB write, 45.26 MB/s write, 0.00 GB read, 0.00 MB/s read, 11640.8 seconds
Stalls(count): 147247 level0_slowdown, 107452 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 115
memtable_slowdown, interval 205 total count

** DB Stats **
Uptime(secs): 19871.4 total, 20.0 interval
Cumulative writes: 0 writes, 3999M keys, 0 commit groups, 0.0 writes per commit group, ingest: 1623.98 GB, 83.69 MB/s
Cumulative WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s
Cumulative stall: 01:47:12.946 H:M:S, 32.4 percent

Perf results from test end:

# for the as-is binary
fillseq      :       3.215 micros/op 311034 ops/sec;  124.6 MB/s
Microseconds per write:
Count: 4000000000 Average: 3.2151  StdDev: 0.20
Min: 0  Median: 0.5291  Max: 209052317
Percentiles: P50: 0.53 P75: 0.79 P99: 3.55 P99.9: 9.51 P99.99: 2464.22

# for the hacked binary
fillseq      :       4.969 micros/op 201240 ops/sec;   80.6 MB/s
Microseconds per write:
Count: 4000000000 Average: 4.9692  StdDev: 0.17
Min: 0  Median: 0.5314  Max: 369655
Percentiles: P50: 0.53 P75: 0.80 P99: 3.59 P99.9: 8.88 P99.99: 3929.80
mdcallag added the performance label on May 30, 2022
siying (Contributor) commented Jun 9, 2022

@ajkr has an idea to group L0 files into logical sorted runs and do stalling control, compaction picking, etc. based on that. But I think it is not trivial to implement, and I can't think of an easier solution. Not sure whether Andrew has an idea.

siying (Contributor) commented Jun 9, 2022

@ajkr I thought about it more. If we only do it for L0 stalling, it might not be that hard: we just group L0 files based on smallest key, largest key and seqno, and update the stall triggering condition based on the number of groups. L0->L0 compaction might be a little harder, but the code can still be limited to the compaction component. This approach would have an impact not only on fillseq, but also on bulk-loading files that for some reason are placed in L0. Of course, we don't have the bandwidth to do it now, but @ajkr would this approach make sense?
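
As one way to picture the proposal, here is a minimal sketch that counts logical sorted runs by chaining consecutive (seqno-ordered) L0 files whose key ranges do not overlap; L0File and CountLogicalSortedRuns are hypothetical names, not RocksDB APIs, and a real implementation would track seqnos explicitly.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Sketch only: an L0 file reduced to its key range. A real implementation
// would also carry seqnos, per the comment above.
struct L0File {
  std::string smallest_key;
  std::string largest_key;
};

// Greedily chain consecutive files (assumed ordered newest-to-oldest, as in
// L0) while the next file's key range does not overlap the current run's
// combined range. This is conservative: a file that falls in a gap between
// run members also starts a new run here.
int CountLogicalSortedRuns(const std::vector<L0File>& l0_newest_first) {
  int runs = 0;
  size_t i = 0;
  while (i < l0_newest_first.size()) {
    ++runs;
    std::string lo = l0_newest_first[i].smallest_key;
    std::string hi = l0_newest_first[i].largest_key;
    for (++i; i < l0_newest_first.size(); ++i) {
      const L0File& f = l0_newest_first[i];
      if (!(f.largest_key < lo || hi < f.smallest_key)) {
        break;  // overlaps the run: this file begins the next logical run
      }
      lo = std::min(lo, f.smallest_key);
      hi = std::max(hi, f.largest_key);
    }
  }
  return runs;
}
```

Stall triggering would then compare the run count against level0_slowdown_writes_trigger instead of the raw L0 file count, so a key-ordered load that stacks many disjoint L0 files counts as a single sorted run.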

ajkr (Contributor) commented Jun 21, 2022

Yes it makes sense.

siying (Contributor) commented Jun 22, 2022

Btw, I hope that with #10161, #10188, #10190 and #10169, intra-L0 compaction almost never happens when data is loaded in key order.

mdcallag (Contributor, Author) commented, quoting siying:

> @ajkr has an idea to group L0 files into logical sorted runs and do stalling control, compaction picking, etc. based on that. But I think it is not trivial to implement, and I can't think of an easier solution. Not sure whether Andrew has an idea.

Great. I think there is an opportunity to do interesting work here on the path towards mullet compaction -- tiered for the smaller levels and leveled for the rest.
