os/bluestore: enable SSE-assisted CRC32 calculations in RocksDB #13741

Merged
merged 1 commit into from Mar 7, 2017

Contributor

rzarzynski commented Mar 2, 2017

By default RocksDB extensively employs CRC32. It has two paths for the checksum calculation:

  • rocksdb::crc32c::Slow_CRC32,
  • rocksdb::crc32c::Fast_CRC32.

The fast path depends on a run-time discovery of CPU capabilities AND a compile-time define __SSE4_2__. Although my systems really do offer SSE4.2 support, the macro was undefined, resulting in poor RocksDB performance that is especially visible during WAL transactions.

The patch (awkwardly) adds -msse4.2 to CXX_FLAGS for RocksDB. However, I'm still not sure whether the problem is generic and affects other environments. I believe this needs verification.

CC: @liewegas, @markhpc, @ifed01.
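
Both paths named in the description above compute the same checksum, CRC-32C (Castagnoli polynomial), and differ only in speed. As a rough illustration of what a software path has to do per byte, here is a minimal Python sketch. The function names `crc32c_bitwise` and `crc32c_table` are hypothetical and this is not RocksDB's code; the SSE4.2 `crc32` instruction performs the same update in hardware.

```python
# Reflected form of the Castagnoli polynomial used by CRC-32C.
CASTAGNOLI_REFLECTED = 0x82F63B78


def crc32c_bitwise(data: bytes, crc: int = 0) -> int:
    """Bit-at-a-time CRC-32C: eight shift/xor steps per input byte."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (CASTAGNOLI_REFLECTED if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def _make_table() -> list:
    """Precompute the CRC update for every possible byte value."""
    table = []
    for value in range(256):
        crc = value
        for _ in range(8):
            crc = (crc >> 1) ^ (CASTAGNOLI_REFLECTED if crc & 1 else 0)
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c_table(data: bytes, crc: int = 0) -> int:
    """Table-driven CRC-32C: one lookup per input byte."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


# Standard CRC-32C check input; both variants must agree on it.
print(hex(crc32c_bitwise(b"123456789")))
print(hex(crc32c_table(b"123456789")))
```

Passing a previous result as the `crc` argument extends a checksum over further data, which is how a checksum can be accumulated across several buffers.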

Member

branch-predictor commented Mar 2, 2017

For sure, a better solution would be to use hardware-assisted CRC32 calculation, which is a part of SSE 4.2; Ceph already has code for this (https://github.com/ceph/ceph/blob/master/src/common/crc32c_intel_fast_asm.S). Also, I don't see the point of double checking for SSE 4.2. We should check it either during the build phase or at run-time; having both only leads to confusion.

Contributor

rzarzynski commented Mar 2, 2017

> Also, I don't see the point of double checking for SSE 4.2. We should check it either during the build phase or at run-time; having both only leads to confusion.

I'm afraid the double check is necessary for all generic solutions:

  • at compile time, one needs a compiler that is able to understand e.g. _mm_crc32_u64 and emit the appropriate machine code,
  • at run time, one needs a processor that is able to understand the emitted instruction and execute it.
Member

branch-predictor commented Mar 2, 2017

@rzarzynski SSE4 is from 2006, and SSE 4.2 support in GCC precedes support for C++11, so if someone is using a compiler so ancient that it doesn't support C++11, they wouldn't be able to build RocksDB anyway - so why bother?

Member

markhpc commented Mar 2, 2017

Definitely we need to make sure we are consistently using FAST_CRC32 when possible. I'll have to double check and make sure, but I thought the last time I looked that was the case on our test cluster here.

Contributor

rzarzynski commented Mar 2, 2017

@markhpc: just dissected ceph-osd from the official Kraken AMD64 package for Xenial:

$ objdump -D -C ./usr/bin/ceph-osd

...

0000000000a39960 <rocksdb::crc32c::IsFastCrc32Supported()>:
  a39960:       31 c0                   xor    %eax,%eax
  a39962:       c3                      retq   
  a39963:       0f 1f 00                nopl   (%rax)
  a39966:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  a3996d:       00 00 00 
Contributor

rzarzynski commented Mar 3, 2017

@markhpc: the same with a very different package: 454e15a (wip-rzarzynski-testing branch) packaged for CentOS 7 AMD64 by shaman.ceph.com 10 days ago.

$ objdump -D -C ./usr/bin/ceph-osd

...

0000000000c54360 <rocksdb::crc32c::IsFastCrc32Supported()>:
  c54360:       31 c0                   xor    %eax,%eax
  c54362:       c3                      retq   
  c54363:       0f 1f 00                nopl   (%rax)
  c54366:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  c5436d:       00 00 00 

rocksdb::crc32c::IsFastCrc32Supported() always returns false. This contrasts with its counterpart from a local build where -msse4.2 was supplied:

$ objdump -D -C ./bin/ceph-osd

...

0000000000caf1f0 <rocksdb::crc32c::IsFastCrc32Supported()>:
  caf1f0:       53                      push   %rbx
  caf1f1:       b8 01 00 00 00          mov    $0x1,%eax
  caf1f6:       0f a2                   cpuid  
  caf1f8:       89 c8                   mov    %ecx,%eax
  caf1fa:       c1 e8 14                shr    $0x14,%eax
  caf1fd:       83 e0 01                and    $0x1,%eax
  caf200:       5b                      pop    %rbx
  caf201:       c3                      retq   
  caf202:       0f 1f 40 00             nopl   0x0(%rax)
  caf206:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
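
Read side by side, the two listings differ only in whether the CPUID probe was compiled in. Below is a minimal Python model of that behaviour. The function name `is_fast_crc32_supported` and its two parameters are hypothetical stand-ins for illustration, not RocksDB's API. CPUID leaf 1 reports SSE4.2 in bit 20 of ECX, which is what `shr $0x14` followed by `and $0x1` extracts.

```python
SSE42_ECX_BIT = 0x14  # CPUID leaf 1: ECX bit 20 signals SSE4.2


def is_fast_crc32_supported(built_with_sse42: bool, cpuid1_ecx: int) -> bool:
    """Model of the two disassembled variants (illustrative, not RocksDB code)."""
    if not built_with_sse42:
        # Packaged builds: `xor %eax,%eax; retq`, i.e. a constant false.
        return False
    # Local -msse4.2 build: `mov %ecx,%eax; shr $0x14,%eax; and $0x1,%eax`.
    return bool((cpuid1_ecx >> SSE42_ECX_BIT) & 1)


# Hypothetical ECX values, chosen only to exercise the bit test.
ecx_with_sse42 = 1 << SSE42_ECX_BIT
ecx_without_sse42 = 0

print(is_fast_crc32_supported(False, ecx_with_sse42))    # build lacks the macro
print(is_fast_crc32_supported(True, ecx_with_sse42))     # build and CPU both ready
print(is_fast_crc32_supported(True, ecx_without_sse42))  # CPU lacks SSE4.2
```

The first case is the one observed in the packaged binaries: the CPU capability never gets consulted, so the fast path is unreachable regardless of the hardware.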
os/bluestore: enable SSE-assisted CRC32 calculations in RocksDB.
By default RocksDB extensively employs CRC32. It has two paths
for the checksum calculation:
 * rocksdb::crc32c::Slow_CRC32,
 * rocksdb::crc32c::Fast_CRC32.

The fast path depends on a run-time discovery of CPU capabilities
AND a compile-time define __SSE4_2__. Although my systems really
offer SSE4.2 support, the macro was undefined resulting in poor
RocksDB performance visible especially during WAL transactions.

The patch (awkwardly) adds the -msse4.2 to CXXFLAGS for RocksDB.

Signed-off-by: Radoslaw Zarzynski <rzarzynski@mirantis.com>
Contributor

rzarzynski commented Mar 5, 2017

It looks like the performance degradation caused by rocksdb::crc32c::Slow_CRC32 can be significant when there are many WAL operations. Below are results from FIO (nr_files=64, size=256m, bs=4k, numjobs=4). BlueStore has been tuned to use the WAL each time (bluestore min alloc size = 65536).

$ cat results/20170305_wip-bs-fastcrc32-in-rocks_REF_wal65536_jobs4.txt 
bluestore: (g=0): rw=randwrite, bs=4096B-4096B,4096B-4096B,4096B-4096B, ioengine=ceph-os, iodepth=128
...
fio-2.18-1-g4ed2
Starting 4 processes

bluestore: (groupid=0, jobs=1): err= 0: pid=49148: Sun Mar  5 21:53:19 2017
  write: IOPS=4884, BW=19.8MiB/s (20.5MB/s)(573MiB/30033msec)
    clat (msec): min=5, max=101, avg=24.04, stdev= 9.77
     lat (msec): min=5, max=101, avg=24.16, stdev= 9.80
    clat percentiles (msec):
     |  1.00th=[   11],  5.00th=[   12], 10.00th=[   14], 20.00th=[   16],
     | 30.00th=[   18], 40.00th=[   20], 50.00th=[   22], 60.00th=[   25],
     | 70.00th=[   29], 80.00th=[   33], 90.00th=[   38], 95.00th=[   43],
     | 99.00th=[   52], 99.50th=[   55], 99.90th=[   69], 99.95th=[   80],
     | 99.99th=[  102]
    lat (msec) : 10=0.97%, 20=40.73%, 50=56.86%, 100=1.43%, 250=0.01%
  cpu          : usr=31.61%, sys=8.10%, ctx=91235, majf=0, minf=11027
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.3%, 16=1.8%, 32=11.9%, >=64=86.0%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=67.5%, 8=10.8%, 16=8.3%, 32=5.5%, 64=2.9%, >=64=5.0%
     issued rwt: total=0,146688,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
bluestore: (groupid=0, jobs=1): err= 0: pid=49149: Sun Mar  5 21:53:19 2017
  write: IOPS=4927, BW=19.3MiB/s (20.2MB/s)(578MiB/30033msec)
    clat (msec): min=5, max=117, avg=23.87, stdev= 9.71
     lat (msec): min=5, max=117, avg=23.99, stdev= 9.75
    clat percentiles (msec):
     |  1.00th=[   10],  5.00th=[   12], 10.00th=[   13], 20.00th=[   16],
     | 30.00th=[   18], 40.00th=[   20], 50.00th=[   22], 60.00th=[   25],
     | 70.00th=[   29], 80.00th=[   32], 90.00th=[   38], 95.00th=[   42],
     | 99.00th=[   52], 99.50th=[   55], 99.90th=[   67], 99.95th=[   80],
     | 99.99th=[  102]
    lat (msec) : 10=1.10%, 20=41.30%, 50=56.25%, 100=1.33%, 250=0.02%
  cpu          : usr=30.69%, sys=8.22%, ctx=92491, majf=0, minf=12280
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.3%, 16=2.0%, 32=12.0%, >=64=85.6%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=66.9%, 8=10.7%, 16=8.9%, 32=6.2%, 64=2.5%, >=64=4.8%
     issued rwt: total=0,147974,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
bluestore: (groupid=0, jobs=1): err= 0: pid=49150: Sun Mar  5 21:53:19 2017
  write: IOPS=4899, BW=19.2MiB/s (20.7MB/s)(575MiB/30033msec)
    clat (msec): min=4, max=101, avg=23.98, stdev= 9.70
     lat (msec): min=4, max=101, avg=24.10, stdev= 9.74
    clat percentiles (usec):
     |  1.00th=[10048],  5.00th=[11584], 10.00th=[13120], 20.00th=[15296],
     | 30.00th=[17536], 40.00th=[19584], 50.00th=[21888], 60.00th=[24960],
     | 70.00th=[28288], 80.00th=[31872], 90.00th=[37120], 95.00th=[41728],
     | 99.00th=[51968], 99.50th=[55040], 99.90th=[66048], 99.95th=[79360],
     | 99.99th=[97792]
    lat (msec) : 10=0.94%, 20=40.67%, 50=56.91%, 100=1.48%, 250=0.01%
  cpu          : usr=31.87%, sys=7.82%, ctx=90284, majf=0, minf=12194
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.3%, 16=1.9%, 32=12.1%, >=64=85.7%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=67.6%, 8=10.2%, 16=8.1%, 32=6.5%, 64=2.6%, >=64=5.0%
     issued rwt: total=0,147160,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
bluestore: (groupid=0, jobs=1): err= 0: pid=49151: Sun Mar  5 21:53:19 2017
  write: IOPS=4922, BW=19.3MiB/s (20.2MB/s)(578MiB/30033msec)
    clat (msec): min=4, max=100, avg=23.91, stdev= 9.86
     lat (msec): min=4, max=100, avg=24.03, stdev= 9.90
    clat percentiles (usec):
     |  1.00th=[ 9920],  5.00th=[11328], 10.00th=[12864], 20.00th=[15168],
     | 30.00th=[17280], 40.00th=[19328], 50.00th=[21888], 60.00th=[24704],
     | 70.00th=[28288], 80.00th=[32128], 90.00th=[37120], 95.00th=[42240],
     | 99.00th=[51968], 99.50th=[55552], 99.90th=[70144], 99.95th=[80384],
     | 99.99th=[98816]
    lat (msec) : 10=1.10%, 20=41.63%, 50=55.81%, 100=1.46%, 250=0.01%
  cpu          : usr=31.63%, sys=7.31%, ctx=92420, majf=0, minf=10628
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.3%, 16=1.7%, 32=11.7%, >=64=86.3%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=66.6%, 8=10.9%, 16=8.9%, 32=6.5%, 64=2.3%, >=64=4.8%
     issued rwt: total=0,147845,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
  WRITE: bw=76.8MiB/s (80.5MB/s), 19.8MiB/s-19.3MiB/s (20.5MB/s-20.2MB/s), io=2303MiB (2415MB), run=30033-30033msec

Total: 19632 = 4884 + 4927 + 4899 + 4922


After the change:

$ cat results/20170305_wip-bs-fastcrc32-in-rocks_wal65536_jobs4.txt 
bluestore: (g=0): rw=randwrite, bs=4096B-4096B,4096B-4096B,4096B-4096B, ioengine=ceph-os, iodepth=128
...
fio-2.18-1-g4ed2
Starting 4 processes

bluestore: (groupid=0, jobs=1): err= 0: pid=12900: Sun Mar  5 21:40:26 2017
  write: IOPS=6600, BW=25.9MiB/s (27.4MB/s)(774MiB/30017msec)
    clat (msec): min=4, max=52, avg=17.63, stdev= 6.00
     lat (msec): min=4, max=53, avg=17.72, stdev= 6.04
    clat percentiles (usec):
     |  1.00th=[ 8768],  5.00th=[ 9664], 10.00th=[10944], 20.00th=[13120],
     | 30.00th=[14272], 40.00th=[15296], 50.00th=[16192], 60.00th=[17280],
     | 70.00th=[19072], 80.00th=[22656], 90.00th=[25984], 95.00th=[28544],
     | 99.00th=[37120], 99.50th=[40192], 99.90th=[46336], 99.95th=[47360],
     | 99.99th=[49408]
    lat (msec) : 10=6.76%, 20=66.58%, 50=26.65%, 100=0.01%
  cpu          : usr=32.05%, sys=5.32%, ctx=97822, majf=0, minf=10201
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.2%, 16=2.0%, 32=15.2%, >=64=82.6%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=65.6%, 8=13.0%, 16=9.0%, 32=4.8%, 64=1.3%, >=64=6.3%
     issued rwt: total=0,198136,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
bluestore: (groupid=0, jobs=1): err= 0: pid=12901: Sun Mar  5 21:40:26 2017
  write: IOPS=6554, BW=25.7MiB/s (26.9MB/s)(769MiB/30017msec)
    clat (msec): min=4, max=60, avg=17.76, stdev= 6.15
     lat (msec): min=4, max=60, avg=17.85, stdev= 6.18
    clat percentiles (usec):
     |  1.00th=[ 8768],  5.00th=[ 9536], 10.00th=[10816], 20.00th=[12992],
     | 30.00th=[14272], 40.00th=[15296], 50.00th=[16320], 60.00th=[17536],
     | 70.00th=[19584], 80.00th=[23168], 90.00th=[25984], 95.00th=[28800],
     | 99.00th=[37632], 99.50th=[40704], 99.90th=[46336], 99.95th=[48384],
     | 99.99th=[51968]
    lat (msec) : 10=7.03%, 20=64.49%, 50=28.46%, 100=0.02%
  cpu          : usr=32.44%, sys=5.58%, ctx=95608, majf=0, minf=10637
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=1.6%, 32=14.2%, >=64=84.1%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=65.5%, 8=14.3%, 16=8.9%, 32=4.1%, 64=1.2%, >=64=5.9%
     issued rwt: total=0,196740,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
bluestore: (groupid=0, jobs=1): err= 0: pid=12902: Sun Mar  5 21:40:26 2017
  write: IOPS=6586, BW=25.8MiB/s (26.1MB/s)(772MiB/30017msec)
    clat (msec): min=4, max=62, avg=17.67, stdev= 6.01
     lat (msec): min=4, max=62, avg=17.75, stdev= 6.04
    clat percentiles (usec):
     |  1.00th=[ 8768],  5.00th=[ 9792], 10.00th=[11072], 20.00th=[13120],
     | 30.00th=[14272], 40.00th=[15296], 50.00th=[16192], 60.00th=[17280],
     | 70.00th=[19072], 80.00th=[22656], 90.00th=[25984], 95.00th=[28544],
     | 99.00th=[37120], 99.50th=[40192], 99.90th=[46336], 99.95th=[48384],
     | 99.99th=[51968]
    lat (msec) : 10=5.93%, 20=67.08%, 50=26.97%, 100=0.02%
  cpu          : usr=32.09%, sys=5.50%, ctx=96959, majf=0, minf=10695
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.2%, 16=2.0%, 32=14.9%, >=64=82.9%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=66.7%, 8=13.6%, 16=8.2%, 32=4.3%, 64=1.2%, >=64=6.0%
     issued rwt: total=0,197696,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
bluestore: (groupid=0, jobs=1): err= 0: pid=12903: Sun Mar  5 21:40:26 2017
  write: IOPS=6586, BW=25.8MiB/s (26.1MB/s)(772MiB/30017msec)
    clat (msec): min=4, max=52, avg=17.67, stdev= 6.01
     lat (msec): min=4, max=53, avg=17.76, stdev= 6.04
    clat percentiles (usec):
     |  1.00th=[ 8768],  5.00th=[ 9664], 10.00th=[10944], 20.00th=[13120],
     | 30.00th=[14272], 40.00th=[15296], 50.00th=[16320], 60.00th=[17280],
     | 70.00th=[19072], 80.00th=[22656], 90.00th=[25984], 95.00th=[28544],
     | 99.00th=[37632], 99.50th=[40192], 99.90th=[46336], 99.95th=[47360],
     | 99.99th=[49920]
    lat (msec) : 10=6.18%, 20=66.73%, 50=27.08%, 100=0.01%
  cpu          : usr=32.20%, sys=5.61%, ctx=96568, majf=0, minf=10676
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.2%, 16=1.9%, 32=14.9%, >=64=83.0%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=67.8%, 8=13.2%, 16=8.1%, 32=3.9%, 64=1.1%, >=64=5.9%
     issued rwt: total=0,197696,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
  WRITE: bw=103MiB/s (108MB/s), 25.7MiB/s-25.9MiB/s (26.9MB/s-27.4MB/s), io=3087MiB (3237MB), run=30017-30017msec

Total: 26326 = 6600 + 6554 + 6586 + 6586


perf report shows that rocksdb::crc32c::Slow_CRC32 was the top bottleneck:

     4.95%  bstore_kv_sync  libfio_ceph_objectstore.so  [.] rocksdb::crc32c::ExtendImpl<&rocksdb::crc32c::Slow_CRC32>

In contrast, rocksdb::crc32c::Fast_CRC32 ranks far from the top, in 17th position:

     0.90%  bstore_kv_sync  libfio_ceph_objectstore.so  [.] rocksdb::crc32c::ExtendImpl<&rocksdb::crc32c::Fast_CRC32>
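
To put the two runs in proportion, the short Python sketch below redoes the arithmetic on the figures quoted above. The variable names are made up for illustration; the numbers are copied from the fio totals and the perf report lines, and the result applies only to this one workload.

```python
# IOPS per fio job, copied from the two runs above.
iops_slow = [4884, 4927, 4899, 4922]  # reference build (Slow_CRC32)
iops_fast = [6600, 6554, 6586, 6586]  # build with -msse4.2 (Fast_CRC32)

total_slow = sum(iops_slow)
total_fast = sum(iops_fast)
iops_gain = total_fast / total_slow - 1.0

# Share of samples attributed to ExtendImpl<...> in perf report.
perf_slow_pct = 4.95
perf_fast_pct = 0.90
perf_ratio = perf_slow_pct / perf_fast_pct

print(f"total IOPS: {total_slow} -> {total_fast} ({iops_gain:.1%} higher)")
print(f"CRC32 share of samples: {perf_ratio:.1f}x smaller")
```

The totals match the sums given in the comment, and the relative gain comes out at roughly a third more IOPS for this WAL-heavy 4k random-write test.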
Contributor

tchaikov commented Mar 6, 2017

> SSE4 is from 2006, and SSE 4.2 support in GCC precedes support for C++11, so if someone is using a compiler so ancient that it doesn't support C++11, they wouldn't be able to build RocksDB anyway - so why bother?

@branch-predictor there is a chance that users are building with a GCC which is new enough to support C++11, but which does not necessarily understand -msse4.2, for example if we are (cross) building for ARM chips.

@tchaikov

I searched for nmmintrin.h in the rocksdb tree: the only place that uses the SSE intrinsics is crc32.cc, and it guards them with the result of the run-time check, so we should be safe even if the target machine does not support the SSE4.2 instructions.

@tchaikov tchaikov merged commit 413efbc into ceph:master Mar 7, 2017

3 checks passed

  • Signed-off-by: all commits in this PR are signed
  • Unmodified Submodules: submodules for project are unmodified
  • default: Build finished.