
broker/content-cache: throughput drop after conversion to zhashx #3591

Closed · chu11 opened this issue Apr 8, 2021 · 28 comments

chu11 commented Apr 8, 2021

While continuing to work on #3583, I got some strange performance results. After a lot of trial and error, it appears that PR #3589 commit a608e51 introduced something that at times (but not all the time) slows down throughput.

Here's an example; I ran the following 10 times:

src/cmd/flux start ./throughput_test_simulated.sh 4096 10

The script simply runs a number of jobs (4096) for a number of iterations (10) and greps for just the throughput result.

#!/bin/sh

count=$1
n=$2
echo "Running throughput.py ${count} jobs per iteration, ${n} iterations"
for i in `seq 1 ${n}`
do
    ./throughput.py -n ${count} | grep throughput
done

Here are the results of the runs:

Running throughput.py 4096 jobs per iteration, 10 iterations
throughput:     95.5 job/s (script:  94.8 job/s)
throughput:     87.7 job/s (script:  87.0 job/s)
throughput:     83.1 job/s (script:  82.5 job/s)
throughput:     75.9 job/s (script:  75.4 job/s)
throughput:     73.7 job/s (script:  73.2 job/s)
throughput:     73.0 job/s (script:  72.5 job/s)
throughput:     70.1 job/s (script:  69.6 job/s)
throughput:     61.9 job/s (script:  61.5 job/s)
throughput:     34.9 job/s (script:  34.7 job/s)
throughput:     12.2 job/s (script:  12.2 job/s)
Running throughput.py 4096 jobs per iteration, 10 iterations
throughput:     99.4 job/s (script:  98.6 job/s)
throughput:     90.1 job/s (script:  89.4 job/s)
throughput:     83.1 job/s (script:  82.5 job/s)
throughput:     67.8 job/s (script:  67.2 job/s)
throughput:     66.3 job/s (script:  65.9 job/s)
throughput:     64.2 job/s (script:  63.8 job/s)
throughput:     62.6 job/s (script:  62.2 job/s)
throughput:     56.5 job/s (script:  56.1 job/s)
throughput:     44.8 job/s (script:  44.6 job/s)
throughput:     31.5 job/s (script:  31.4 job/s)
Running throughput.py 4096 jobs per iteration, 10 iterations
throughput:     96.3 job/s (script:  95.6 job/s)
throughput:     91.2 job/s (script:  90.5 job/s)
throughput:     81.7 job/s (script:  81.1 job/s)
throughput:     79.5 job/s (script:  78.9 job/s)
throughput:     60.8 job/s (script:  60.4 job/s)
throughput:     35.6 job/s (script:  35.2 job/s)
throughput:     33.5 job/s (script:  33.3 job/s)
throughput:     34.3 job/s (script:  34.1 job/s)
throughput:     31.7 job/s (script:  31.6 job/s)
throughput:     31.4 job/s (script:  31.1 job/s)
Running throughput.py 4096 jobs per iteration, 10 iterations
throughput:     98.3 job/s (script:  97.5 job/s)
throughput:     91.9 job/s (script:  91.1 job/s)
throughput:     83.4 job/s (script:  82.8 job/s)
throughput:     76.2 job/s (script:  75.7 job/s)
throughput:     67.4 job/s (script:  66.8 job/s)
throughput:     66.5 job/s (script:  66.1 job/s)
throughput:     62.4 job/s (script:  62.1 job/s)
throughput:     56.0 job/s (script:  55.8 job/s)
throughput:     49.0 job/s (script:  48.8 job/s)
throughput:     44.6 job/s (script:  44.4 job/s)
Running throughput.py 4096 jobs per iteration, 10 iterations
throughput:     99.4 job/s (script:  98.6 job/s)
throughput:     92.1 job/s (script:  91.4 job/s)
throughput:     84.6 job/s (script:  84.0 job/s)
throughput:     74.0 job/s (script:  73.5 job/s)
throughput:     64.1 job/s (script:  63.5 job/s)
throughput:     52.3 job/s (script:  52.0 job/s)
throughput:     49.5 job/s (script:  49.3 job/s)
throughput:     45.6 job/s (script:  45.4 job/s)
throughput:     46.9 job/s (script:  46.7 job/s)
throughput:     46.1 job/s (script:  45.9 job/s)
Running throughput.py 4096 jobs per iteration, 10 iterations
throughput:     98.5 job/s (script:  97.7 job/s)
throughput:     95.2 job/s (script:  94.4 job/s)
throughput:     93.6 job/s (script:  92.8 job/s)
throughput:     88.5 job/s (script:  87.7 job/s)
throughput:     87.0 job/s (script:  86.3 job/s)
throughput:     83.5 job/s (script:  82.9 job/s)
throughput:     77.5 job/s (script:  76.9 job/s)
throughput:     73.3 job/s (script:  72.6 job/s)
throughput:     64.6 job/s (script:  64.2 job/s)
throughput:     65.5 job/s (script:  65.1 job/s)
Running throughput.py 4096 jobs per iteration, 10 iterations
throughput:     96.3 job/s (script:  95.6 job/s)
throughput:     92.2 job/s (script:  91.4 job/s)
throughput:     83.0 job/s (script:  82.4 job/s)
throughput:     75.8 job/s (script:  75.3 job/s)
throughput:     66.6 job/s (script:  66.2 job/s)
throughput:     65.8 job/s (script:  65.5 job/s)
throughput:     60.0 job/s (script:  59.5 job/s)
throughput:     49.8 job/s (script:  49.5 job/s)
throughput:     44.9 job/s (script:  44.4 job/s)
throughput:     44.5 job/s (script:  44.1 job/s)
Running throughput.py 4096 jobs per iteration, 10 iterations
throughput:     98.0 job/s (script:  97.2 job/s)
throughput:     91.5 job/s (script:  90.7 job/s)
throughput:     83.6 job/s (script:  82.9 job/s)
throughput:     80.7 job/s (script:  80.1 job/s)
throughput:     78.4 job/s (script:  77.9 job/s)
throughput:     63.3 job/s (script:  62.8 job/s)
throughput:     60.4 job/s (script:  60.1 job/s)
throughput:     47.1 job/s (script:  46.9 job/s)
throughput:     42.8 job/s (script:  42.6 job/s)
throughput:     30.1 job/s (script:  30.0 job/s)
Running throughput.py 4096 jobs per iteration, 10 iterations
throughput:     97.6 job/s (script:  96.8 job/s)
throughput:     89.1 job/s (script:  88.4 job/s)
throughput:     84.0 job/s (script:  83.4 job/s)
throughput:     77.1 job/s (script:  76.6 job/s)
throughput:     77.1 job/s (script:  76.5 job/s)
throughput:     68.3 job/s (script:  67.8 job/s)
throughput:     55.6 job/s (script:  55.4 job/s)
throughput:     49.7 job/s (script:  49.4 job/s)
throughput:     47.9 job/s (script:  47.6 job/s)
throughput:     28.5 job/s (script:  28.2 job/s)
Running throughput.py 4096 jobs per iteration, 10 iterations
throughput:     97.4 job/s (script:  96.6 job/s)
throughput:     91.2 job/s (script:  90.5 job/s)
throughput:     83.4 job/s (script:  82.8 job/s)
throughput:     78.1 job/s (script:  77.6 job/s)
throughput:     69.1 job/s (script:  68.7 job/s)
throughput:     55.9 job/s (script:  55.6 job/s)
throughput:     44.5 job/s (script:  44.3 job/s)
throughput:     32.1 job/s (script:  32.0 job/s)
throughput:     32.4 job/s (script:  32.2 job/s)
throughput:     29.7 job/s (script:  29.6 job/s)

As you can see, the final throughput varies from as low as ~12 job/s to as high as ~65 job/s. I have gotten "good" runs as well, but there didn't seem to be one among the 10 above.

When running against commit 3d068c7 (just a few commits earlier, before the conversion to zhashx), everything ran as expected (post PR #3585), with the final throughput always > 80 job/s.

Running throughput.py 4096 jobs per iteration, 10 iterations
throughput:     98.5 job/s (script:  97.6 job/s)
throughput:     94.0 job/s (script:  93.3 job/s)
throughput:     92.5 job/s (script:  91.7 job/s)
throughput:     90.3 job/s (script:  89.6 job/s)
throughput:     89.7 job/s (script:  89.0 job/s)
throughput:     89.2 job/s (script:  88.4 job/s)
throughput:     89.5 job/s (script:  88.8 job/s)
throughput:     88.3 job/s (script:  87.6 job/s)
throughput:     87.1 job/s (script:  86.4 job/s)
throughput:     84.9 job/s (script:  84.2 job/s)

grondo commented Apr 8, 2021

What are the details of the test environment?

I also wonder if it is reproducible with #3590 applied, as I didn't see anything like this in my testing. (However, given that the issue went away when you reverted #3589, you have a strong case that something is going on there.)

To get to the bottom of this you may have to devise a way to capture perf tracing on the 1st and 10th runs and look at the difference in the case where there is a performance hit.

I'll see if this is reproducible on my system.


chu11 commented Apr 8, 2021

I ran on catalyst, so not the most pristine environment, but over many runs a pattern emerged.

I was going to try to profile later today, basically using the same techniques I used in #3583.


grondo commented Apr 8, 2021

Yeah, the pattern is concerning. Are you running on a login node or allocating a node for testing?

If you have a node to yourself for testing, it is also handy to use perf record -F 99 -a --call-graph dwarf -- sleep 5 and profile the entire system for 5s (or use whatever time you'd like), though synchronizing that with the first and 10th run in your automated test may be tricky.


chu11 commented Apr 8, 2021

I've just been doing it on the login node, accepting the occasional performance blip. But when I saw the throughput drop into the teens, I knew something was wrong.


chu11 commented Apr 9, 2021

I think there is probably something racy, perhaps with an initialization or something like that. I do not get the horrible throughput drop every time I run. And when I run with perf record, I get the throughput drop less frequently. It took me a while to finally get a "good" perf record of the poor throughput case.

Anyway, here's the perf diff output:

     0.05%    +63.96%  libczmq.so.3.0.0                               [.] zhashx_next                                                       
    44.24%    -27.74%  [kernel.kallsyms]                              [k] 0xffffffff8a0030a0                                                
     7.93%     -6.13%  libjansson.so.4.10.0                           [.] 0x0000000000002470                                                
     0.04%     +5.40%  lt-flux-broker                                 [.] sync_cb                                                           
     3.85%     -3.00%  libc-2.17.so                                   [.] _int_malloc                                                       
     3.52%     -2.73%  libzmq.so.5.0.1                                [.] 0x000000000000ebc0                                                
     2.92%     -2.30%  libc-2.17.so                                   [.] __libc_calloc                                                     
     2.82%     -2.14%  libc-2.17.so                                   [.] _int_free                                                         
     0.01%     +1.55%  lt-flux-broker                                 [.] cache_flush.part.10                                               
     1.72%     -1.48%  libpython3.6m.so.1.0                           [.] 0x000000000008e696                                                
     1.96%     -1.39%  libc-2.17.so                                   [.] malloc                                                            
     0.92%     -0.87%  libczmq.so.3.0.0                               [.] zlist_first                                                       
     0.91%     -0.80%  libpython3.6m.so.1.0                           [.] _PyEval_EvalFrameDefault                                          
     1.05%     -0.79%  libc-2.17.so                                   [.] malloc_consolidate                                                
     0.97%     -0.77%  libc-2.17.so                                   [.] __strcmp_sse42                                                    
     0.95%     -0.69%  libsqlite3.so.0.8.6                            [.] 0x000000000000f390                                                
     0.77%     -0.55%  libc-2.17.so                                   [.] free                                                              
     0.67%     -0.52%  libc-2.17.so                                   [.] __memcpy_ssse3_back              

Looking into the profiling that was recorded:

-   70.59%    64.01%  flux-broker-0  libczmq.so.3.0.0  [.] zhashx_next
   - 38.63% _start
        __libc_start_main
        main
        flux_reactor_run
        ev_run
        ev_run
      - ev_invoke_pending
         - 29.95% sync_cb
              cache_purge (inlined)
              zhashx_next
         - 8.68% cache_store_continuation
              cache_flush (inlined)
              zhashx_next
   - 25.33% main
        flux_reactor_run
        ev_run
        ev_run
      - ev_invoke_pending
         - 18.39% sync_cb
              cache_purge (inlined)
              zhashx_next
         - 6.93% cache_store_continuation
              cache_flush (inlined)
              zhashx_next

Not clear what's going on. Perhaps cache entries are not getting cleared, so things stick around longer and zhashx_next() spins a lot? Not sure, TBD.
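For reference, here is the shape of the scan that sync_cb/cache_purge is doing (a minimal sketch, not the actual content-cache code). czmq's zhashx_first()/zhashx_next() advance a cursor through the bucket array itself, so one full pass costs O(buckets + entries); if the bucket count were somehow inflated, even a do-nothing scan would get expensive:

#include <czmq.h>

/* Walk every entry the way a heartbeat-driven purge would.  The cursor
 * starts at bucket 0 and skips empty buckets, so the walk touches the
 * whole bucket array regardless of how many entries are present. */
static size_t scan_all (zhashx_t *cache)
{
    size_t n = 0;
    void *entry = zhashx_first (cache);
    while (entry) {
        n++;    /* real code would apply purge heuristics to 'entry' here */
        entry = zhashx_next (cache);
    }
    return n;
}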


chu11 commented Apr 9, 2021

Adding some content-cache stats to my runs. Short story: nothing looks extraordinarily different.

slow throughput case

Running throughput.py 4096 jobs per iteration, 13 iterations
throughput:     96.3 job/s (script:  95.5 job/s)
1: 43 seconds
content.acct-entries: 76916
content.acct-size: 22263126
content.acct-dirty: 0
content.acct-valid: 76916
throughput:     91.6 job/s (script:  90.9 job/s)
2: 88 seconds
content.acct-entries: 142634
content.acct-size: 28003396
content.acct-dirty: 0
content.acct-valid: 142634
throughput:     84.8 job/s (script:  84.2 job/s)
3: 137 seconds
content.acct-entries: 209748
content.acct-size: 35132035
content.acct-dirty: 0
content.acct-valid: 209748
throughput:     80.0 job/s (script:  79.4 job/s)
4: 189 seconds
content.acct-entries: 276390
content.acct-size: 42380386
content.acct-dirty: 0
content.acct-valid: 276390
throughput:     73.2 job/s (script:  72.7 job/s)
5: 246 seconds
content.acct-entries: 340087
content.acct-size: 44087035
content.acct-dirty: 0
content.acct-valid: 340087
throughput:     51.6 job/s (script:  51.3 job/s)
6: 326 seconds
content.acct-entries: 406778
content.acct-size: 53705771
content.acct-dirty: 0
content.acct-valid: 406778
throughput:     52.8 job/s (script:  52.6 job/s)
7: 404 seconds
content.acct-entries: 472411
content.acct-size: 57817831
content.acct-dirty: 0
content.acct-valid: 472411
throughput:     48.6 job/s (script:  48.4 job/s)
8: 489 seconds
content.acct-entries: 539500
content.acct-size: 65475653
content.acct-dirty: 0
content.acct-valid: 539500
throughput:     47.0 job/s (script:  46.6 job/s)
9: 577 seconds
content.acct-entries: 606803
content.acct-size: 72283635
content.acct-dirty: 0
content.acct-valid: 606803
throughput:     30.9 job/s (script:  30.8 job/s)
10: 711 seconds
content.acct-entries: 673599
content.acct-size: 77814574
content.acct-dirty: 0
content.acct-valid: 673599
throughput:     29.0 job/s (script:  28.9 job/s)
11: 853 seconds
content.acct-entries: 740254
content.acct-size: 81832853
content.acct-dirty: 0
content.acct-valid: 740254
throughput:     9.7 job/s (script:   9.7 job/s)
12: 1277 seconds
content.acct-entries: 803699
content.acct-size: 84770975
content.acct-dirty: 0
content.acct-valid: 803699
throughput:     10.7 job/s (script:  10.7 job/s)
13: 1662 seconds
content.acct-entries: 869795
content.acct-size: 93254650
content.acct-dirty: 0
content.acct-valid: 869044
Total of 1663 seconds for everything

normal throughput case (before the zhashx implementation change)

Running throughput.py 4096 jobs per iteration, 13 iterations
throughput:     98.0 job/s (script:  97.2 job/s)
1: 43 seconds
content.acct-entries: 75886
content.acct-size: 20468931
content.acct-dirty: 0
content.acct-valid: 75886
throughput:     94.2 job/s (script:  93.4 job/s)
2: 87 seconds
content.acct-entries: 142606
content.acct-size: 26942802
content.acct-dirty: 0
content.acct-valid: 142606
throughput:     93.1 job/s (script:  92.3 job/s)
3: 131 seconds
content.acct-entries: 209888
content.acct-size: 34076293
content.acct-dirty: 0
content.acct-valid: 209888
throughput:     88.9 job/s (script:  88.2 job/s)
4: 178 seconds
content.acct-entries: 274463
content.acct-size: 40210837
content.acct-dirty: 0
content.acct-valid: 274463
throughput:     91.6 job/s (script:  90.8 job/s)
5: 224 seconds
content.acct-entries: 342337
content.acct-size: 48314265
content.acct-dirty: 0
content.acct-valid: 342337
throughput:     91.8 job/s (script:  91.0 job/s)
6: 269 seconds
content.acct-entries: 408866
content.acct-size: 55396556
content.acct-dirty: 0
content.acct-valid: 408866
throughput:     89.7 job/s (script:  89.0 job/s)
7: 315 seconds
content.acct-entries: 475137
content.acct-size: 62692985
content.acct-dirty: 0
content.acct-valid: 475137
throughput:     86.1 job/s (script:  85.4 job/s)
8: 363 seconds
content.acct-entries: 541897
content.acct-size: 71250854
content.acct-dirty: 0
content.acct-valid: 541897
throughput:     87.1 job/s (script:  86.4 job/s)
9: 411 seconds
content.acct-entries: 607689
content.acct-size: 74891434
content.acct-dirty: 0
content.acct-valid: 607689
throughput:     84.9 job/s (script:  84.2 job/s)
10: 460 seconds
content.acct-entries: 674136
content.acct-size: 82251888
content.acct-dirty: 0
content.acct-valid: 674136
throughput:     84.1 job/s (script:  83.4 job/s)
11: 509 seconds
content.acct-entries: 741575
content.acct-size: 88844375
content.acct-dirty: 0
content.acct-valid: 741575
throughput:     81.9 job/s (script:  81.3 job/s)
12: 560 seconds
content.acct-entries: 808196
content.acct-size: 96199030
content.acct-dirty: 0
content.acct-valid: 808196
13: 613 seconds
content.acct-entries: 876382
content.acct-size: 103160695
content.acct-dirty: 0
content.acct-valid: 876382
Total of 613 seconds for everything


chu11 commented Apr 9, 2021

Potential theory: from the comment at the top of the zhashx implementation:

    The hash table always has a size that is prime and roughly doubles its
    size when 75% full. In case of hash collisions items are chained in a
    linked list. The hash table size is increased slightly (up to 5 times
    before roughly doubling the size) when an overly long chain (between 1
    and 63 items depending on table size) is detected.

Most "traditional" hash tables I know of only increase in size when they are X% full. Skimming the code, zhash appears to do this.

zhashx also appears to grow when there are many hash collisions in a single hash bucket.

Could the performance randomness be related to this? Depending on hashing luck, the zhashx simply ends up way bigger than it would with zhash?

It's hard to test because we can't inspect the internals of the zhashx data structure.

Just a guess at the moment.
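For rough intuition on why collision-triggered resizing could run away (a back-of-envelope bound, not anything from the czmq docs): with a uniform hash, the chance that at least two of n keys land in the same one of m buckets is the birthday bound

P(\text{some chain} \ge 2) \approx 1 - e^{-n(n-1)/2m}

With m = 131,023 buckets, n = 500 keys already gives n(n-1)/2m ≈ 0.95, i.e. a collision is more likely than not at under 0.4% load. If the effective chain limit is small (the doc comment above says it can be as low as 1), each such collision can trigger a resize.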


garlick commented Apr 9, 2021

zhashx also appears to grow when there are many hash collisions in a single hash bucket.

We should have the ideal case for the hash function, since SHA-1 keys should be very evenly distributed.

Since the resize occurs inside zhashx_insert(), I would sort of expect that function to pop up as a hot spot if the resize were a problem. It doesn't ever shrink, so it seems like it should stabilize (though the specific test might be quitting before that happens).

From your previous perf trace, I thought maybe it would be worth seeing if we could employ an LRU instead of the scanning done by the heartbeat-driven cache purge. That repeated scanning (even though it is limited with some heuristics) seems like a probable source of overhead. I have a PR started for this, though I have to pause it for today while I travel.


chu11 commented Apr 10, 2021

Eek! Just to make sure about the hash sizes, I modified czmq with a printf in the right locations and LD_PRELOADed it.

using zhashx

Running throughput.py 4096 jobs per iteration, 10 iterations
new limit: 131023
new limit: 262111
new limit: 524243
new limit: 1048517
new limit: 2097091
new limit: 4194247
new limit: 8388547
new limit: 16777141
throughput:     95.9 job/s (script:  95.1 job/s)
new limit: 33554341
throughput:     88.6 job/s (script:  87.9 job/s)
new limit: 67108763
new limit: 134217593
throughput:     79.1 job/s (script:  78.4 job/s)
throughput:     70.0 job/s (script:  69.5 job/s)
throughput:     68.6 job/s (script:  68.2 job/s)
throughput:     61.5 job/s (script:  61.1 job/s)
new limit: 268435331
new limit: 536870839
throughput:     53.2 job/s (script:  52.9 job/s)
throughput:     33.7 job/s (script:  33.6 job/s)
throughput:     31.3 job/s (script:  31.0 job/s)
throughput:     26.3 job/s (script:  26.1 job/s)

using zhash

Running throughput.py 4096 jobs per iteration, 10 iterations
new limit: 131023
new limit zhash: 130560
throughput:     96.6 job/s (script:  95.7 job/s)
new limit zhash: 261120
throughput:     96.2 job/s (script:  95.3 job/s)
new limit zhash: 522240
throughput:     92.1 job/s (script:  91.3 job/s)
throughput:     90.4 job/s (script:  89.6 job/s)
throughput:     89.6 job/s (script:  88.8 job/s)
new limit zhash: 1044480
throughput:     89.7 job/s (script:  88.8 job/s)
throughput:     88.5 job/s (script:  87.7 job/s)
throughput:     88.5 job/s (script:  87.6 job/s)
throughput:     87.5 job/s (script:  86.7 job/s)
throughput:     85.2 job/s (script:  84.4 job/s)

Even though the number of entries is always < 1M, we end up with a zhashx of > 500M buckets. The zhash case ends at ~1M buckets.

Interestingly, even before the first 4096 jobs are done, the zhashx size is > 16M, while in the zhash case it's still only ~130K.

Why this happens is still a mystery to me; it may just be dumb luck. zhash and zhashx appear to use the same default hashing algorithm, so I don't know if the algorithm is simply a poor one or what.

Edit: my czmq hack:

diff --git a/src/zhash.c b/src/zhash.c
index 625ae20..86752b8 100644
--- a/src/zhash.c
+++ b/src/zhash.c
@@ -167,6 +167,9 @@ zhash_insert (zhash_t *self, const char *key, void *value)
         if (!new_items)
             return -1;
 
+        if (new_limit > 100000) {
+          fprintf (stderr, "new limit zhash: %ju\n", (uintmax_t)new_limit);
+        }
         //  Move all items to the new hash table, rehashing to
         //  take into account new hash table limit
         uint index;
diff --git a/src/zhashx.c b/src/zhashx.c
index 74485a4..37a9ddf 100644
--- a/src/zhashx.c
+++ b/src/zhashx.c
@@ -215,6 +215,10 @@ s_zhashx_rehash (zhashx_t *self, uint new_prime_index)
     if (!new_items)
         return -1;
 
+    if (new_limit > 100000) {
+      fprintf (stderr, "new limit: %ju\n", (uintmax_t)new_limit);
+    }
+
     //  Move all items to the new hash table, rehashing to
     //  take into account new hash table limit
     size_t index;


garlick commented Apr 10, 2021

Awesome job zeroing in on that!

I guess my next question is: is the growth caused by

  • the chain limit, indicating a problem with the hash algorithm
  • the hash size, indicating a problem with cache management, perhaps introduced in my PR

Maybe it would be worth adding a printf at the two call sites of s_zhashx_rehash() to see which it is?

Edit: out of curiosity, how did you isolate the content-cache hash from the other places where zhashx is used in broker modules?


chu11 commented Apr 10, 2021

I'm zeroing in, but I actually think this is a zhashx bug in how chain limits are implemented. I think the chain limit always stays at 1 instead of increasing in size, so it only takes an occasional hash collision to increase the size of the hash. With the thousands of hashes going on, it just takes some bad luck once in a while. When I manually increase the default chain limit from 1 to 5, the problem goes away.

Edit: out of curiosity, how did you isolate the content-cache hash from the other places where zhashx is used in broker modules?

Good point, I guess I didn't. I just assumed that if the zhashx was getting large, it was in the content-cache :P

I think your LRU idea will solve this issue regardless of whether we stick with zhash or zhashx, although sticking with zhashx will increase the heap size until the bug is fixed in czmq.

Edit: oops, didn't answer your other questions

Maybe it would be worth adding a printf at the two call sites of s_zhashx_rehash() to see which it is?

Yup, already did that :-) The rehash was being called from the "chain limit" path, not the "load" one.


garlick commented Apr 10, 2021

It does seem like they only increase the chain limit on the "load factor" path, not the "chain limit" one. Hmm.
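To make the asymmetry concrete, here is a simplified paraphrase of the two resize paths (the structure follows the diffs quoted in this thread; the conditions are paraphrased, not verbatim upstream code):

/* Path 1: zhashx_insert() -- load-factor growth.  Past ~75% full the
 * table is rehashed to a larger prime, and this is the only place
 * upstream relaxes the chain limit. */
if (self->size >= limit * 3 / 4) {
    s_zhashx_rehash (self, self->prime_index + GROWTH_FACTOR);
    self->chain_limit += CHAIN_GROWS;
}

/* Path 2: s_item_lookup() -- chain-length growth.  A chain longer than
 * chain_limit also rehashes, but chain_limit is never increased, so a
 * table that only ever grows via collisions keeps its initial limit
 * forever (the fix below adds the missing increment). */
if (len > self->chain_limit) {
    s_zhashx_rehash (self, self->prime_index + GROWTH_FACTOR);
}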


chu11 commented Apr 10, 2021

Yup, this seems to fix things as well:

@@ -330,10 +336,13 @@ s_item_lookup (zhashx_t *self, const void *key)
     if (len > self->chain_limit) {
         //  Create new hash table
         uint new_prime_index = self->prime_index + GROWTH_FACTOR;
+        if (primes [new_prime_index] > 100000)
+          fprintf (stderr, "resize based on chain, limit = %u\n", self->chain_limit);
         if (s_zhashx_rehash (self, new_prime_index))
             return NULL;
         limit = primes [self->prime_index];
         self->cached_index = self->hasher (key) % limit;
+        self->chain_limit += CHAIN_GROWS;
     }
     return item;
 }

I'll open up an issue in czmq to discuss possible solutions. I think the above, or simply making the default chain limit > 1, would be sufficient. Or maybe there's some other way too.


garlick commented Apr 10, 2021

Excellent!!! Seems like this may help our memory footprint in other areas as well!


grondo commented Apr 10, 2021 via email


chu11 commented Apr 10, 2021

Posted issue zeromq/czmq#2173 and PR zeromq/czmq#2174.

But of course this fix isn't in any czmq that's in the wild or on our systems, so we would have to think about how to approach this issue in the content-cache. Revert back to zhash? Fix czmq with a local build?

Using the example above, that's ~500M vs ~1M hash buckets on the heap. The zhashx item_t struct appears to be 40 bytes if I'm counting correctly. Assuming 2^N-byte alignment, let's say that struct rounds up to 64 bytes; that's 64 B * 500M ~ 30G on the heap vs ~64 MB.
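Spelling that out with the bucket counts from the instrumented runs above (under the same 64-byte-per-bucket assumption):

64\,\mathrm{B} \times 536{,}870{,}839 \approx 32\,\mathrm{GiB}
\qquad \text{vs.} \qquad
64\,\mathrm{B} \times 1{,}044{,}480 \approx 64\,\mathrm{MiB}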

Edit:

Perhaps worth mentioning: I had this run a bit ago. I think I hit enough unlucky collisions that things basically stopped working.

Running throughput.py 4096 jobs per iteration, 13 iterations                                                                                
throughput:     101.7 job/s (script: 100.8 job/s)                                                                                           
1: 41 seconds                                                                                                                               
throughput:     90.4 job/s (script:  89.7 job/s)                                                                                            
2: 87 seconds                                                                                                                               
throughput:     75.9 job/s (script:  75.4 job/s)                                                                                            
3: 141 seconds                                                                                                                              
throughput:     71.5 job/s (script:  71.0 job/s)                                                                                            
4: 199 seconds                                                                                                                              
throughput:     50.6 job/s (script:  50.4 job/s)                                                                                            
5: 280 seconds                                                                                                                              
throughput:     34.8 job/s (script:  34.7 job/s)                                                                                            
6: 399 seconds                                                                                                                              
throughput:     31.2 job/s (script:  31.2 job/s)                                                                                            
7: 530 seconds                                                                                                                              
throughput:     19.1 job/s (script:  19.1 job/s)                                                                                            
8: 745 seconds                                                                                                                              
throughput:     12.3 job/s (script:  12.1 job/s)                                                                                            
9: 1083 seconds                                                                                                                             
throughput:     11.1 job/s (script:  11.0 job/s)                                                                                            
10: 1455 seconds                                                                                                                            
^C[ perf record: Woken up 41517 times to write data ]                                                                                       
Traceback (most recent call last):                                                                                                          
  File "./throughput.py", line 123, in <module>                                                                                             
    bulk = BulkRun(flux.Flux(), args.njobs, jobspec).run(args)                                                                              
  File "./throughput.py", line 106, in run                                                                                                  
    self.handle.reactor_run()                                                                                                               
  File "/tmp/achu/flux-core/src/bindings/python/flux/core/handle.py", line 301, in reactor_run                                              
    raise KeyboardInterrupt                                                                                                                 
KeyboardInterrupt                                                                                                                           
11: 3029 seconds                                                                                                                            

The 11th iteration (of 13) ran for 26 minutes before I ctrl-C'd it. So there is a "worst case with bad luck" scenario lurking in this issue.


chu11 commented Apr 10, 2021

This leads to an obvious question: is this zhashx issue hitting us elsewhere in flux-core? We use it a lot. The KVS worries me because we use it in the KVS cache and we do tons of inserts/lookups there.

Other places don't worry me quite as much, but I don't know the code everywhere that well. Or perhaps it adds up? Having a zhashx with 1 million buckets instead of 100K buckets isn't a big deal in one module, but it is in something like 20 modules.

Going back to the zhash implementation in the content-cache, this is about the average throughput I get on my test after 40 iterations:

Running throughput.py 4096 jobs per iteration, 40 iterations
throughput:     100.2 job/s (script:  99.3 job/s)
throughput:     95.8 job/s (script:  94.9 job/s)
throughput:     94.1 job/s (script:  93.3 job/s)
throughput:     91.2 job/s (script:  90.4 job/s)
throughput:     88.7 job/s (script:  87.9 job/s)
throughput:     87.9 job/s (script:  87.2 job/s)
throughput:     89.0 job/s (script:  88.3 job/s)
throughput:     89.4 job/s (script:  88.6 job/s)
throughput:     86.4 job/s (script:  85.6 job/s)
throughput:     84.6 job/s (script:  83.9 job/s)
throughput:     81.6 job/s (script:  81.0 job/s)
throughput:     78.7 job/s (script:  77.9 job/s)
throughput:     79.8 job/s (script:  79.2 job/s)
throughput:     80.3 job/s (script:  79.7 job/s)
throughput:     79.0 job/s (script:  78.2 job/s)
throughput:     77.6 job/s (script:  77.0 job/s)
throughput:     79.1 job/s (script:  78.4 job/s)
throughput:     74.6 job/s (script:  74.1 job/s)
throughput:     78.0 job/s (script:  77.4 job/s)
throughput:     77.7 job/s (script:  77.1 job/s)
throughput:     80.7 job/s (script:  80.0 job/s)
throughput:     81.6 job/s (script:  80.9 job/s)
throughput:     82.9 job/s (script:  82.1 job/s)
throughput:     80.6 job/s (script:  79.6 job/s)
throughput:     81.5 job/s (script:  80.8 job/s)
throughput:     80.5 job/s (script:  79.8 job/s)
throughput:     79.5 job/s (script:  78.8 job/s)
throughput:     78.4 job/s (script:  77.8 job/s)
throughput:     73.0 job/s (script:  72.4 job/s)
throughput:     75.7 job/s (script:  75.1 job/s)
throughput:     79.2 job/s (script:  78.6 job/s)
throughput:     78.8 job/s (script:  78.1 job/s)
throughput:     80.7 job/s (script:  80.0 job/s)
throughput:     81.7 job/s (script:  80.9 job/s)
throughput:     76.9 job/s (script:  76.2 job/s)
throughput:     80.3 job/s (script:  79.7 job/s)
throughput:     79.3 job/s (script:  78.6 job/s)
throughput:     80.7 job/s (script:  80.0 job/s)
throughput:     74.1 job/s (script:  73.5 job/s)
throughput:     78.9 job/s (script:  78.2 job/s)

LD_PRELOADing my fixed-up zhashx:

Running throughput.py 4096 jobs per iteration, 40 iterations
throughput:     99.3 job/s (script:  98.5 job/s)
throughput:     93.7 job/s (script:  92.8 job/s)
throughput:     93.2 job/s (script:  92.4 job/s)
throughput:     93.7 job/s (script:  92.9 job/s)
throughput:     92.5 job/s (script:  91.7 job/s)
throughput:     91.7 job/s (script:  90.8 job/s)
throughput:     89.0 job/s (script:  88.2 job/s)
throughput:     88.9 job/s (script:  88.1 job/s)
throughput:     86.9 job/s (script:  86.1 job/s)
throughput:     83.6 job/s (script:  82.9 job/s)
throughput:     86.6 job/s (script:  85.8 job/s)
throughput:     84.7 job/s (script:  84.0 job/s)
throughput:     82.8 job/s (script:  82.1 job/s)
throughput:     82.2 job/s (script:  81.5 job/s)
throughput:     83.1 job/s (script:  82.4 job/s)
throughput:     81.2 job/s (script:  80.5 job/s)
throughput:     78.5 job/s (script:  77.8 job/s)
throughput:     79.7 job/s (script:  79.0 job/s)
throughput:     81.1 job/s (script:  80.4 job/s)
throughput:     80.1 job/s (script:  79.4 job/s)
throughput:     81.6 job/s (script:  80.8 job/s)
throughput:     79.2 job/s (script:  78.6 job/s)
throughput:     78.9 job/s (script:  78.2 job/s)
throughput:     80.1 job/s (script:  79.4 job/s)
throughput:     81.4 job/s (script:  80.7 job/s)
throughput:     81.1 job/s (script:  80.4 job/s)
throughput:     76.8 job/s (script:  76.1 job/s)
throughput:     78.1 job/s (script:  77.4 job/s)
throughput:     78.0 job/s (script:  77.4 job/s)
throughput:     77.7 job/s (script:  76.8 job/s)
throughput:     78.8 job/s (script:  78.1 job/s)
throughput:     79.6 job/s (script:  79.0 job/s)
throughput:     79.8 job/s (script:  79.1 job/s)
throughput:     74.3 job/s (script:  73.7 job/s)
throughput:     77.1 job/s (script:  76.4 job/s)
throughput:     79.7 job/s (script:  79.0 job/s)
throughput:     77.9 job/s (script:  77.2 job/s)
throughput:     76.9 job/s (script:  76.1 job/s)
throughput:     81.2 job/s (script:  80.4 job/s)
throughput:     75.9 job/s (script:  75.2 job/s)

Granted, it's only one run, and on a node I don't exclusively own, but it does seem the fixes help performance outside of the content-cache a smidge? Either way, the difference isn't huge, suggesting the issue doesn't affect other parts of flux-core too much.


garlick commented Apr 10, 2021

Thoughts on pulling in just zhashx_t to, say, src/common/libzhash? I find it a bit scary to pull in all of czmq when we don't use much of it, but zhashx doesn't change much upstream (hah, just noticed you were the last person to change it, in January, but before that, nothing since 2018). Not sure what other czmq internals it requires, but maybe we can lop off a few things we don't really need, like zhashx_load() et al. and the selftest, to avoid pulling on the string too much?

This bug feels serious to me. Not sure we can wait for the fix to filter down to the distros.

Edit: maybe we could grab just zlistx_t and zhashx_t and call it libczmqcontainers? It might be nice to have control over things like their handling of malloc failures going forward.


chu11 commented Apr 10, 2021

Pulling in zhashx for a libzhashx doesn't seem like a terrible idea, except czmq is distributed under MPL-2.0. Reading https://www.mozilla.org/en-US/MPL/2.0/FAQ/ and https://www.mozilla.org/en-US/MPL/2.0/combining-mpl-and-gpl/ hurt my head :-)

If we copy and paste in the files, I think we're good? But we probably have to rename functions to ensure no symbol conflicts? So we have to go down the modified-file path?

Edit: OK, if my reading is correct, I think if we copy in the files and only edit within those files, we're fine. We just can't copy any LGPL code into the files. Is that you guys' reading?

chu11 self-assigned this Apr 10, 2021

garlick commented Apr 10, 2021

Edit: OK, if my reading is correct, I think if we copy in the files and only edit within those files, we're fine.

Yes, that's how I read it - the file needs to remain MPL2, but it can be included (and the source can be modified) in an (L)GPL project.


garlick commented Apr 10, 2021

On renaming - I don't think we'll have symbol conflicts since symbols in a statically linked internal library would take precedence over ones in an external dynamically linked library. We might run into trouble with include files though - or would we if we don't change any function prototypes?

Edit: I may have forgotten yet again how that stuff works so please don't take this comment too seriously.


chu11 commented Apr 11, 2021

On renaming - I don't think we'll have symbol conflicts since symbols in a statically linked internal library would take precedence over ones in an external dynamically linked library. We might run into trouble with include files though - or would we if we don't change any function prototypes?

I dunno, I feel like we're just asking for trouble :-) e.g. portability on systems we're not so familiar with.

My preference would be to slightly adjust the function names. Perhaps just prefix them all with an f? So, like, fzhashx_create()?


chu11 commented Apr 11, 2021

Started my fzhashx lib just to give it a shot.

/* This is a copy of the czmq zhashx library.
 *
 * Due to czmq issue #2173
 * (https://github.com/zeromq/czmq/issues/2173) performance in
 * flux-core was extremely poor at times and flux-core could not
 * wait for the OS distros to pick up the bug fix.
 *
 * This library is a verbatim copy with the following changes:
 *
 * - rename all "zhashx" to "fzhashx" to avoid symbol collisions.
 * - replace all calls to zmalloc() with calloc()
 * - add fzhashx_t typedef
 * - add #define for freen()
 * - remove zhashx_test() and all associated testing code
 * - remove functions using zframe_t (zhashx_unpack(), zhashx_pack(),
 *   zhashx_unpack_own(), zhashx_pack_own())
 * - remove all declarations of CZMQ_EXPORT
 * - adjust headers
 */

More changes than I would have liked, but not soooo bad.

https://github.com/chu11/flux-core/tree/issue3591_internal_zhashx


grondo commented Apr 11, 2021 via email


chu11 commented Apr 11, 2021

You might be able to do symbol renaming transparently in the copied header with some clever preprocessor macros, instead of making changes by hand.

I had considered this, but thought it bug-prone. Like, the next time we hit an issue in zhashx_insert() or something, I'll end up looking in the wrong place for the bug?

But ... perhaps I was being overly conservative and it's probably fine. Let me give it a shot and see if it turns out OK.


chu11 commented Apr 12, 2021

So I tried the technique where my copied-in library names everything fzhashx, but uses a header like:

#ifndef _ZHASHX_INT_H
#define _ZHASHX_INT_H
#define zhashx_t fzhashx_t
#define zhashx_new fzhashx_new
...
#endif

included in files that use zhashx, to do the mapping and limit the code changes. When czmq is updated, we can just remove this header without causing tons of code changes. It seems to work. Updating all the files that use zhashx is a little painfully manual, but I suppose not as painful as search-and-replacing zhashx with fzhashx.
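For illustration, a caller under this scheme looks like stock zhashx code plus one include (a hypothetical snippet, not the actual branch contents):

/* The extra include remaps zhashx_* names to the internal fzhashx_*
 * copies at preprocess time; dropping it later reverts the file to
 * stock czmq usage with no other changes. */
#include "zhashx_int.h"

void example (void)
{
    zhashx_t *h = zhashx_new ();        /* really fzhashx_new () */
    zhashx_insert (h, "key", "value");  /* really fzhashx_insert () */
    zhashx_destroy (&h);                /* really fzhashx_destroy () */
}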


chu11 commented Apr 12, 2021

Oh crud, I just realized something. Do we need to export fzhashx? flux-sched uses zhashx.


garlick commented Apr 12, 2021

Good point - I opened flux-framework/flux-sched#821.

I think it would probably be more appropriate for them to also pull in zhashx_t with the fix than for flux-core to export it. Or maybe std::hash or boost::hash would work over there, given C++? (I have no idea.)

chu11 added a commit to chu11/flux-core that referenced this issue Apr 15, 2021
Problem: To avoid bugs in most OS distros' versions of czmq,
we want internal flux callers to use the internal "fzhashx" library
instead of the czmq "zhashx" library whenever a "zhashx" is used.

Solution: Include the file "zhashx_int.h" wherever zhashx data
structures are used.  Update Makefile.am files to add
libczmqcontainers.la when appropriate.

Fixes flux-framework#3591
chu11 added a commit to chu11/flux-core that referenced this issue Apr 16, 2021
Problem: To avoid bugs in most OS distros' versions of czmq,
we want internal flux callers to use the internal "fzhashx" library
instead of the czmq "zhashx" library whenever a "zhashx" is used.

Solution: Include the file "czmq_containers.h" wherever zhashx data
structures are used.  Update Makefile.am files to add
libczmqcontainers.la when appropriate.

Fixes flux-framework#3591
mergify bot closed this as completed in 6771b05 on Apr 17, 2021