Intra-L0 compaction reduces usage of trivial move during loads in key order #10075

Open · mdcallag opened this issue May 30, 2022 · 5 comments
Labels: performance (Issues related to performance that may or may not be bugs)

mdcallag (Contributor) commented May 30, 2022

Prior to intra-L0 compaction, when db_bench --benchmarks=fillseq was run with leveled compaction there was no compaction, only trivial moves. I can reproduce that using RocksDB versions 4.1 or 5.1.4. But in v6, once there are a few stalls, some intra-L0 compaction occurs, and at that point compaction is used in place of some trivial moves. This is related to issue #9423 and possibly to issue #10082.
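
For readers unfamiliar with the distinction: a trivial move reassigns a file to the next level with a metadata-only edit, while a compaction reads and rewrites the data. Below is a minimal sketch of the core non-overlap condition, not RocksDB's actual code; the real check is Compaction::IsTrivialMove() and considers additional constraints (compression settings, grandparent overlap limits, etc.).

```cpp
#include <string>
#include <vector>

// Hedged sketch of the trivial-move idea. FileMeta and CanTrivialMove are
// illustrative names, not RocksDB types.
struct FileMeta {
  std::string smallest_key;
  std::string largest_key;
};

// Input files from level N can be trivially "moved" to level N+1 when their
// key ranges do not overlap any file already in level N+1: the manifest is
// updated, but no data is read or rewritten.
bool CanTrivialMove(const std::vector<FileMeta>& inputs,
                    const std::vector<FileMeta>& output_level_files) {
  for (const FileMeta& in : inputs) {
    for (const FileMeta& out : output_level_files) {
      bool disjoint = in.largest_key < out.smallest_key ||
                      out.largest_key < in.smallest_key;
      if (!disjoint) {
        return false;  // overlap: the data must be merged and rewritten
      }
    }
  }
  return true;  // metadata-only move
}
```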

The analysis above is based on correlation rather than causation, so I also ran tests using version 6.28.2 as-is and a hacked version that disables intra-L0 compaction and disables dynamic resizing of per-level targets. I then ran fillseq: the as-is binary gets some trivial moves and some compaction, while the hacked version has only trivial moves.

Other things that I see:

  • write-amp is ~2X larger for the as-is binary because trivial move isn't always used. See the Cumulative compaction line below, where it is 1585.01 GB for as-is versus 878.32 GB for hacked.
  • compaction wall clock seconds are larger for the as-is binary: from the Comp(sec) column, 15517.21 vs 11640.79.
  • there are more write stalls for the hacked binary; I am curious whether that can be fixed. The cause was always too many L0 files (147247 level0_slowdown, 107452 level0_slowdown_with_compaction), and 2/3 of those occur while L1->L2 is in progress (see the stall-trigger sketch after this list). Given that Ln->Ln+1 should be fast with trivial move, I don't understand why this happens.
  • throughput was worse for the hacked binary, but efficiency (see the bullets above) was better. I wonder whether the additional stalls from too many L0 files (32.4% vs 22.3% of time stalled) were the root cause.
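
To make the stall counts above concrete, here is a minimal sketch of how the L0 file count maps to the slowdown/stop states, assuming only the two triggers set on the command line below; the name L0StallState is hypothetical, and RocksDB's actual condition (ColumnFamilyData::RecalculateWriteStallConditions()) also weighs pending compaction bytes and memtable counts.

```cpp
#include <cstdio>

enum class WriteStall { kNone, kSlowdown, kStop };

// Sketch only: maps the number of L0 files to a stall state using the
// triggers from this issue's command line
// (--level0_slowdown_writes_trigger=20, --level0_stop_writes_trigger=30).
WriteStall L0StallState(int num_l0_files, int slowdown_trigger,
                        int stop_trigger) {
  if (num_l0_files >= stop_trigger) {
    return WriteStall::kStop;      // writes blocked until L0 drains
  }
  if (num_l0_files >= slowdown_trigger) {
    return WriteStall::kSlowdown;  // counted as level0_slowdown above
  }
  return WriteStall::kNone;
}

int main() {
  // With these triggers, 25 L0 files puts writes in the slowdown state.
  WriteStall s = L0StallState(/*num_l0_files=*/25, /*slowdown_trigger=*/20,
                              /*stop_trigger=*/30);
  std::printf("slowdown? %d\n", s == WriteStall::kSlowdown);
  return 0;
}
```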

A command line is:

./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=16 --max_write_buffer_number=8 --db=/data/m/rx --wal_dir=/data/m/rx --num=4000000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=225485783040 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --report_interval_seconds=5 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --seed=1653095086 --report_file=bm.lc.nt32.cm1.d0/v6.28.2.base/benchmark_fillseq.wal_disabled.v400.log.r.csv

Compaction IO stats from test end for the as-is binary show that some compaction is used:

Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0     19/4   396.26 MB   4.2    490.6     0.0    490.6    1368.9    878.3       0.0   1.6     37.0    103.4  13561.13          12779.04    115741    0.117   2233M      0       0.0       0.0
  L2     58/0    1.20 GB   3.1    216.5   114.0    102.5     214.2    111.7     461.2   1.9    114.4    113.1   1938.53           1928.58      1847    1.050    986M      0       0.0       0.0
  L3     65/0    1.41 GB   0.8      1.0     0.7      0.3       1.0      0.7     837.1   1.5    114.6    114.7      9.36              9.32        14    0.669   4773K      0       0.0       0.0
  L4    131/0    2.34 GB   0.3      0.6     0.6      0.0       0.6      0.6     867.8   1.0    114.2    114.2      5.47              5.43        26    0.211   2785K      0       0.0       0.0
  L5    785/0   12.46 GB   0.3      0.2     0.2      0.0       0.2      0.2     869.8   1.0    112.9    112.9      2.20              2.18        16    0.137   1107K      0       0.0       0.0
  L6   6117/0   99.76 GB   0.6      0.1     0.1      0.0       0.1      0.1     858.0   1.0    114.8    114.8      0.52              0.51         4    0.129    263K      0       0.0       0.0
  L7  61356/0   758.40 GB   0.0      0.0     0.0      0.0       0.0      0.0     758.4   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0       0.0       0.0
 Sum  68531/4   875.96 GB   0.0    709.0   115.6    593.4    1585.0    991.6    4652.3   1.8     46.8    104.6  15517.21          14725.05    117648    0.132   3228M      0       0.0       0.0

Uptime(secs): 12858.4 total, 20.0 interval
Flush(GB): cumulative 878.339, interval 1.203
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 1585.01 GB write, 126.22 MB/s write, 709.03 GB read, 56.46 MB/s read, 15517.2 seconds
Stalls(count): 26163 level0_slowdown, 26055 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 55628 stop for pending_compaction_bytes, 9343 slowdown for pending_compaction_bytes, 0 memtable_compaction,
 0 memtable_slowdown, interval 62 total count

** DB Stats **
Uptime(secs): 12858.4 total, 20.0 interval
Cumulative writes: 0 writes, 3999M keys, 0 commit groups, 0.0 writes per commit group, ingest: 1624.01 GB, 129.33 MB/s
Cumulative WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s
Cumulative stall: 00:47:45.976 H:M:S, 22.3 percent

Compaction IO stats from test end for the hacked binary show that only trivial move is used:

Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0     18/1   155.06 MB   4.2      0.0     0.0      0.0     878.3    878.3       0.0   1.0      0.0     77.3  11640.79           8381.25    104400    0.112       0      0       0.0       0.0
  L2     34/3   292.91 MB   4.2      0.0     0.0      0.0       0.0      0.0     585.5   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0       0.0       0.0
  L3     25/3   215.37 MB   1.0      0.0     0.0      0.0       0.0      0.0     841.3   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0       0.0       0.0
  L4    179/1    1.51 GB   1.0      0.0     0.0      0.0       0.0      0.0     873.1   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0       0.0       0.0
  L5   1426/0   12.00 GB   1.0      0.0     0.0      0.0       0.0      0.0     875.6   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0       0.0       0.0
  L6  11412/0   96.01 GB   1.0      0.0     0.0      0.0       0.0      0.0     864.1   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0       0.0       0.0
  L7  91303/0   768.14 GB   0.0      0.0     0.0      0.0       0.0      0.0     768.1   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0       0.0       0.0
 Sum 104397/8   878.30 GB   0.0      0.0     0.0      0.0     878.3    878.3    4807.7   1.0      0.0     77.3  11640.79           8381.25    104400    0.112       0      0       0.0       0.0

Uptime(secs): 19871.4 total, 20.0 interval
Flush(GB): cumulative 878.322, interval 0.412
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 878.32 GB write, 45.26 MB/s write, 0.00 GB read, 0.00 MB/s read, 11640.8 seconds
Stalls(count): 147247 level0_slowdown, 107452 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 115
memtable_slowdown, interval 205 total count

** DB Stats **
Uptime(secs): 19871.4 total, 20.0 interval
Cumulative writes: 0 writes, 3999M keys, 0 commit groups, 0.0 writes per commit group, ingest: 1623.98 GB, 83.69 MB/s
Cumulative WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s
Cumulative stall: 01:47:12.946 H:M:S, 32.4 percent

Perf results from test end:

# for the as-is binary
fillseq      :       3.215 micros/op 311034 ops/sec;  124.6 MB/s
Microseconds per write:
Count: 4000000000 Average: 3.2151  StdDev: 0.20
Min: 0  Median: 0.5291  Max: 209052317
Percentiles: P50: 0.53 P75: 0.79 P99: 3.55 P99.9: 9.51 P99.99: 2464.22

# for the hacked binary
fillseq      :       4.969 micros/op 201240 ops/sec;   80.6 MB/s
Microseconds per write:
Count: 4000000000 Average: 4.9692  StdDev: 0.17
Min: 0  Median: 0.5314  Max: 369655
Percentiles: P50: 0.53 P75: 0.80 P99: 3.59 P99.9: 8.88 P99.99: 3929.80
mdcallag added the performance label on May 30, 2022
siying (Contributor) commented Jun 9, 2022

@ajkr has an idea to group L0 files into logical sorted runs and do stalling control, compaction picking, etc. based on that. But I think it is not trivial to implement, and I can't think of an easier solution. Not sure whether Andrew has an idea.

siying (Contributor) commented Jun 9, 2022

@ajkr I thought about it more. If we only do it for L0 stalling, it might not be that hard: we just group L0 files based on smallest key, largest key and seqno, and update the stall triggering condition based on the number of groups. L0->L0 compaction might be a little harder, but the code can still be limited to the compaction component. This approach would have an impact not only on fillseq, but also on bulk-loading files that for some reason are placed in L0. Of course, we don't have the bandwidth to do it now, but @ajkr would this approach make sense?
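
As one way to picture the proposal, here is a minimal sketch that counts logical sorted runs by chaining consecutive (seqno-ordered) L0 files whose key ranges do not overlap; L0File and CountLogicalSortedRuns are hypothetical names, not RocksDB APIs, and a real implementation would track seqnos explicitly.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Sketch only: an L0 file reduced to its key range. A real implementation
// would also carry seqnos, per the comment above.
struct L0File {
  std::string smallest_key;
  std::string largest_key;
};

// Greedily chain consecutive files (assumed ordered newest-to-oldest, as in
// L0) while the next file's key range does not overlap the current run's
// combined range. This is conservative: a file that falls in a gap between
// run members also starts a new run here.
int CountLogicalSortedRuns(const std::vector<L0File>& l0_newest_first) {
  int runs = 0;
  size_t i = 0;
  while (i < l0_newest_first.size()) {
    ++runs;
    std::string lo = l0_newest_first[i].smallest_key;
    std::string hi = l0_newest_first[i].largest_key;
    for (++i; i < l0_newest_first.size(); ++i) {
      const L0File& f = l0_newest_first[i];
      if (!(f.largest_key < lo || hi < f.smallest_key)) {
        break;  // overlaps the run: this file begins the next logical run
      }
      lo = std::min(lo, f.smallest_key);
      hi = std::max(hi, f.largest_key);
    }
  }
  return runs;
}
```

Stall triggering would then compare the run count against level0_slowdown_writes_trigger instead of the raw L0 file count, so a key-ordered load that stacks many disjoint L0 files counts as a single sorted run.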

ajkr (Contributor) commented Jun 21, 2022

Yes it makes sense.

siying (Contributor) commented Jun 22, 2022

Btw, I hope that with #10161, #10188, #10190 and #10169, intra-L0 compaction almost never happens when data is loaded in key order.

mdcallag (Contributor, Author) commented, quoting siying:

> @ajkr has an idea to group L0 files into logical sorted runs and do stalling control, compaction picking, etc. based on that. But I think it is not trivial to implement, and I can't think of an easier solution. Not sure whether Andrew has an idea.

Great. I think there is an opportunity to do interesting work here on the path towards mullet compaction -- tiered for the smaller levels and leveled for the rest.
