New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os/bluestore: avoid the VTABLE-related burden in BitMapAllocator's hotspot #14348

Merged
merged 1 commit into from Apr 7, 2017

Conversation

Projects
None yet
2 participants
@rzarzynski
Contributor

rzarzynski commented Apr 5, 2017

perf was reporting many L1 cache misses on handling the dynamic polymorphism in performance critical parts of the BitMapAllocator. Although get_used_blocks, is_exhausted, is_exhausted of BitMapZone were truly inlined, compiler had to verify the appropriate VTABLE as these method were declared as non-final:

       │       /* We're sure the only element type we aggregate is BitMapZone,                                                                       ▒
       │        * so there is no business to go through vptr and thus prohibit                                                                       ▒
       │        * compiler to inline the stuff. Consult BitMapAreaLeaf::init. */                                                                     ▒
       │       while ((child = static_cast<BitMapZone*>(iter.next()))) {                                                                             ▒
  1.70 │       test   %r15,%r15                                                                                                                      ▒
       │       je     250                                                                                                                            ▒
       │     _ZN14BitMapAreaLeaf18child_check_n_lockEP10BitMapZonelb():                                                                              ▒
       │                                                    const bool lock)                                                                         ▒
       │     {                                                                                                                                       ▒
       │       /* The exhausted check can be performed without acquiring the lock. This                                                              ▒
       │        * is because 1) BitMapZone::is_exhausted() actually operates atomically                                                              ▒
       │        * and 2) it's followed by the exclusive, required-aware re-verification. */                                                          ▒
       │       if (child->is_exhausted()) {                                                                                                          ▒
  1.32 │       mov    (%r15),%rdx                                                                                                                    ▒
 19.75 │       mov    0x8(%rdx),%rax                                                                                                                 ▒
 15.09 │       cmp    _DYNAMIC+0x5c0,%rax                                                                                                            ▒
  7.13 │       jne    200                                                                                                                            ▒
       │     _ZN10BitMapZone12is_exhaustedEv():                                                                                                      ▒
       │      * showed that the compiler really needs that hint.                                                                                     ▒
       │      */                                                                                                                                     ▒
       │     inline bool BitMapZone::is_exhausted()                                                                                                  ▒
       │     {                                                                                                                                       ▒
       │       /* BitMapZone::get_used_blocks operates atomically. No need for lock. */                                                              ▒
       │       return get_used_blocks() == size();                                                                                                   ▒
  3.98 │       mov    0x70(%rdx),%rax                                                                                                                ▒
  2.76 │       cmp    _DYNAMIC+0xd98,%rax                                                                                                            ▒
  2.41 │       jne    220                                                                                                                            ▒
       │     _ZNKSt13__atomic_baseIiE4loadESt12memory_order():                                                                                       ▒
       │           {                                                                                                                                 ▒
       │            memory_order __b = __m & __memory_order_mask;                                                                                    ▒
       │             __glibcxx_assert(__b != memory_order_release);                                                                                  ▒
       │             __glibcxx_assert(__b != memory_order_acq_rel);                                                                                  ▒
       │                                                                                                                                             ▒
       │             return __atomic_load_n(&_M_i, __m);                                                                                             ▒
  4.18 │       movslq 0xc(%r15),%rbp                                                                                                                 ▒
       │     _ZN10BitMapZone12is_exhaustedEv():                                                                                                      ▒
  0.96 │ 9a:   mov    0x98(%rdx),%rax                                                                                                                ▒
  2.19 │       cmp    _DYNAMIC+0x1108,%rax                                                                                                           ▒
  2.12 │       jne    210                                                                                                                            ▒
       │     _ZN10BitMapZone16get_total_blocksEv():                                                                                                  ▒
       │     public:                                                                                                                                 ▒
       │       MEMPOOL_CLASS_HELPERS();                                                                                                              ▒
       │       static int64_t count;                                                                                                                 ▒
       │       static int64_t total_blocks;                                                                                                          ▒
       │       static void incr_count() { count++;}                                                                                                  ▒
       │       static int64_t get_total_blocks() {return total_blocks;}                                                                              ▒
  3.98 │       mov    _DYNAMIC+0xee8,%rax                                                                                                            ▒
  0.57 │       mov    (%rax),%rax                                                                                                                    ▒
       │     _ZN10BitMapZone12is_exhaustedEv():                                                                                                      ▒
  1.54 │ b8:   cmp    %rax,%rbp                                                                                                                      ▒
  2.15 │       sete   %al                 

Quick & rough performance tests on ramdisk (BlueStore: bluestore_debug_omit_block_device_write = true, FIO: nr_files=64, size=256k, bs=4k, numjobs=1):

bluestore: (groupid=0, jobs=1): err= 0: pid=5598: Wed Apr  5 17:27:30 2017
  write: IOPS=68.1k, BW=269MiB/s (283MB/s)(5389MiB/20001msec)
    clat (usec): min=263, max=3343, avg=1272.33, stdev=358.11
     lat (usec): min=298, max=3354, avg=1286.15, stdev=357.93
    clat percentiles (usec):
     |  1.00th=[  572],  5.00th=[  724], 10.00th=[  812], 20.00th=[  940],
     | 30.00th=[ 1048], 40.00th=[ 1160], 50.00th=[ 1272], 60.00th=[ 1384],
     | 70.00th=[ 1480], 80.00th=[ 1576], 90.00th=[ 1688], 95.00th=[ 1832],
     | 99.00th=[ 2224], 99.50th=[ 2256], 99.90th=[ 2448], 99.95th=[ 2672],
     | 99.99th=[ 2896]
    lat (usec) : 500=0.40%, 750=5.92%, 1000=19.08%
    lat (msec) : 2=71.23%, 4=3.36%
  cpu          : usr=94.54%, sys=4.64%, ctx=60127, majf=0, minf=63
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.7%, 32=16.0%, >=64=83.3%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.2%, 64=19.3%, >=64=80.4%
     issued rwt: total=0,1379594,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
  WRITE: bw=269MiB/s (283MB/s), 269MiB/s-269MiB/s (283MB/s-283MB/s), io=5389MiB (5651MB), run=20001-20001msec

 Performance counter stats for '/home/radek/fio/fio ceph-bluestore.fio':

      38048.272480      task-clock (msec)         #    1.742 CPUs utilized          
           1429524      context-switches          #    0.038 M/sec                  
             42546      cpu-migrations            #    0.001 M/sec                  
            118080      page-faults               #    0.003 M/sec                  
      126162730773      cycles                    #    3.316 GHz                      (31.09%)
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
      128781237616      instructions              #    1.02  insns per cycle          (38.79%)
       27970164751      branches                  #  735.123 M/sec                    (38.88%)
         224572213      branch-misses             #    0.80% of all branches          (38.87%)
       42264951814      L1-dcache-loads           # 1110.824 M/sec                    (29.00%)
        3472723280      L1-dcache-load-misses     #    8.22% of all L1-dcache hits    (18.58%)
        1148130497      LLC-loads                 #   30.176 M/sec                    (15.80%)
          10094608      LLC-load-misses           #    1.76% of all LL-cache hits     (23.20%)
   <not supported>      L1-icache-loads          
        2361962854      L1-icache-load-misses     #   62.078 M/sec                    (30.66%)
       40944363221      dTLB-loads                # 1076.116 M/sec                    (27.16%)
          22155473      dTLB-load-misses          #    0.05% of all dTLB cache hits   (20.96%)
         413217331      iTLB-loads                #   10.860 M/sec                    (15.49%)
          16616049      iTLB-load-misses          #    4.02% of all iTLB cache hits   (23.19%)
   <not supported>      L1-dcache-prefetches     
   <not supported>      L1-dcache-prefetch-misses

      21.843812565 seconds time elapsed
  • after the changes:
bluestore: (groupid=0, jobs=1): err= 0: pid=5676: Wed Apr  5 17:28:04 2017
  write: IOPS=71.9k, BW=281MiB/s (294MB/s)(5613MiB/20001msec)
    clat (usec): min=380, max=2691, avg=1313.16, stdev=284.59
     lat (usec): min=403, max=2709, avg=1326.41, stdev=283.69
    clat percentiles (usec):
     |  1.00th=[  676],  5.00th=[  812], 10.00th=[  908], 20.00th=[ 1048],
     | 30.00th=[ 1144], 40.00th=[ 1256], 50.00th=[ 1352], 60.00th=[ 1432],
     | 70.00th=[ 1512], 80.00th=[ 1592], 90.00th=[ 1672], 95.00th=[ 1720],
     | 99.00th=[ 1800], 99.50th=[ 1832], 99.90th=[ 1912], 99.95th=[ 2008],
     | 99.99th=[ 2224]
    lat (usec) : 500=0.02%, 750=2.63%, 1000=13.64%
    lat (msec) : 2=83.66%, 4=0.05%
  cpu          : usr=93.96%, sys=5.20%, ctx=62400, majf=0, minf=63
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=8.6%, >=64=91.4%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=0.1%, 8=0.1%, 16=0.3%, 32=1.1%, 64=52.9%, >=64=45.7%
     issued rwt: total=0,1436938,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
  WRITE: bw=281MiB/s (294MB/s), 281MiB/s-281MiB/s (294MB/s-294MB/s), io=5613MiB (5886MB), run=20001-20001msec

 Performance counter stats for '/home/radek/fio/fio ceph-bluestore.fio':

      38714.212532      task-clock (msec)         #    1.718 CPUs utilized          
           1359565      context-switches          #    0.035 M/sec                  
             59492      cpu-migrations            #    0.002 M/sec                  
            114900      page-faults               #    0.003 M/sec                  
      127981053074      cycles                    #    3.306 GHz                      (30.71%)
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
      139000759426      instructions              #    1.09  insns per cycle          (38.39%)
       30298479354      branches                  #  782.619 M/sec                    (38.40%)
         233080704      branch-misses             #    0.77% of all branches          (38.60%)
       44212271444      L1-dcache-loads           # 1142.017 M/sec                    (31.61%)
        3527524000      L1-dcache-load-misses     #    7.98% of all L1-dcache hits    (16.89%)
        1175247546      LLC-loads                 #   30.357 M/sec                    (15.86%)
          11784219      LLC-load-misses           #    2.01% of all LL-cache hits     (23.11%)
   <not supported>      L1-icache-loads          
        2293363339      L1-icache-load-misses     #   59.238 M/sec                    (30.59%)
       43967663994      dTLB-loads                # 1135.698 M/sec                    (28.23%)
          25554268      dTLB-load-misses          #    0.06% of all dTLB cache hits   (19.62%)
         378933156      iTLB-loads                #    9.788 M/sec                    (15.87%)
          16835187      iTLB-load-misses          #    4.44% of all iTLB cache hits   (23.07%)
   <not supported>      L1-dcache-prefetches     
   <not supported>      L1-dcache-prefetch-misses

      22.535340023 seconds time elapsed
@@ -989,7 +983,7 @@ inline bool BitMapAreaLeaf::child_check_n_lock(BitMapZone* const child,
/* The exhausted check can be performed without acquiring the lock. This
* is because 1) BitMapZone::is_exhausted() actually operates atomically
* and 2) it's followed by the exclusive, required-aware re-verification. */
if (child->is_exhausted()) {
if (child->BitMapZone::is_exhausted()) {

This comment has been minimized.

@rzarzynski

rzarzynski Apr 5, 2017

Contributor

As the method is declared now as final in BitAllocator.h, compiler already gets a chance to skip the vtable check. Enforcing this behaviour via ->BitMapZone:: is just for those ones that don't apply this optimization (if any).

@rzarzynski

rzarzynski Apr 5, 2017

Contributor

As the method is declared now as final in BitAllocator.h, compiler already gets a chance to skip the vtable check. Enforcing this behaviour via ->BitMapZone:: is just for those ones that don't apply this optimization (if any).

@liewegas liewegas self-assigned this Apr 5, 2017

os/bluestore: don't use virtual for accessing BitMapZone::is_exhausted.
Signed-off-by: Radoslaw Zarzynski <rzarzynski@mirantis.com>

@liewegas liewegas added the needs-qa label Apr 6, 2017

@liewegas

This comment has been minimized.

Show comment
Hide comment
@liewegas

liewegas Apr 6, 2017

Member

retest this please

Member

liewegas commented Apr 6, 2017

retest this please

@liewegas liewegas merged commit c43faf1 into ceph:master Apr 7, 2017

3 checks passed

Signed-off-by all commits in this PR are signed
Details
Unmodifed Submodules submodules for project are unmodified
Details
default Build finished.
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment