Use real number of cores for default -par, ignore virtual cores #6361
To determine the default for `-par`, the number of script verification threads, use [boost::thread::physical_concurrency()](http://www.boost.org/doc/libs/1_58_0/doc/html/thread/thread_management.html#thread.thread_management.thread.physical_concurrency), which counts only physical cores, not virtual cores. Virtual cores are roughly a set of cached registers that avoid context switches while threading; they cannot actually perform work, so spawning a verification thread for each of them could even reduce efficiency and will put undue load on the system. This should fix issue bitcoin#6358, as well as some other reported system overload issues, especially on Intel processors. The function was only introduced in Boost 1.56, so provide a utility function `GetNumCores` that falls back on older Boost versions.
We'll also have to bump the boost version in depends for this to work (currently 1.55), but I remember that was already the plan for 0.12.
@laanwj yes, we need to go ahead with the boost bump. I'm getting ready to leave for a few days for a conference, but I'll make the boost/qt bumps first priority when I get back. Ideally those should be done very early in the cycle so we have some testing. Taking that one step further, I'll check the other deps while I'm at it and see if they need bumps too.
utACK
ACK
4716267 Use real number of cores for default -par, ignore virtual cores (Wladimir J. van der Laan)
Posthumous ACK.
To determine the default for `-par`, the number of script verification threads, use [boost::thread::physical_concurrency()](http://www.boost.org/doc/libs/1_58_0/doc/html/thread/thread_management.html#thread.thread_management.thread.physical_concurrency), which counts only physical cores, not virtual cores. Virtual cores are roughly a set of cached registers that avoid context switches while threading; they cannot actually perform work, so spawning a verification thread for each of them could even reduce efficiency and will put undue load on the system. This should fix issue bitcoin#6358, as well as some other reported system overload issues, especially on Intel processors. The function was only introduced in Boost 1.56, so provide a utility function `GetNumCores` that falls back on older Boost versions.
Github-Pull: bitcoin#6361
Rebased-From: 4716267
Use real number of cores, ignore virtual cores
Cherry-picked from the following upstream PRs:
- bitcoin/bitcoin#6361
- bitcoin/bitcoin#6370
Part of #2074.
…, to determine available cores
937bf43 Use std::thread::hardware_concurrency, instead of Boost, to determine available cores (fanquake)
Pull request description:
Following discussion on IRC about replacing Boost usage for detecting available system cores, I've opened this to collect some benchmarks + further discussion. The current method for detecting available cores was introduced in #6361. Recap of the IRC chat:
```
21:14:08 fanquake: Since we seem to be giving Boost removal a good shot for 0.15, does anyone have suggestions for replacing GetNumCores?
21:14:26 fanquake: There is std::thread::hardware_concurrency(), but that seems to count virtual cores, which I don't think we want.
21:14:51 BlueMatt: fanquake: I doubt we'll do boost removal for 0.15
21:14:58 BlueMatt: shit like BOOST_FOREACH, sure
21:15:07 BlueMatt: but all of boost? doubtful, there are still things we need
21:16:36 fanquake: Yea sorry, not the whole lot, but we can remove a decent chunk. Just looking into what else needs to be done to replace some of the less involved Boost usage.
21:16:43 BlueMatt: fair
21:17:14 wumpus: yes, it makes sense to plan ahead a bit, without immediately doing it
21:18:12 wumpus: right, don't count virtual cores, that used to be the case but it makes no sense for our usage
21:19:15 wumpus: it'd create a swarm of threads overwhelming any machine with hyperthreading (+accompanying thread stack overhead), for script validation, and there was no gain at all for that
21:20:03 sipa: BlueMatt: don't worry, there is no hurry
21:59:10 morcos: wumpus: i don't think that is correct
21:59:24 morcos: suppose you have 4 cores (8 virtual cores)
21:59:24 wumpus: fanquake: indeed seems that std has no equivalent to physical_concurrency, on any standard. That's annoying as it is non-trivial to implement
21:59:35 morcos: i think running par=8 (if it let you) would be notably faster
21:59:59 morcos: jeremyrubin and i discussed this at length a while back... i think i commented about it on irc at the time
22:00:21 wumpus: morcos: I think the conclusion at the time was that it made no difference, but sure would make sense to benchmark
22:00:39 morcos: perhaps historical testing on the virtual vs actual cores was polluted by concurrency issues that have now improved
22:00:47 wumpus: I think there are not more ALUs, so there is not really a point in having more threads
22:01:40 wumpus: hyperthreads are basically just a stored register state right?
22:02:23 sipa: wumpus: yes but it helps the scheduler
22:02:27 wumpus: in which case the only speedup using "number of cores" threads would give you is, possibly, excluding other software from running on the cores on the same time
22:02:37 morcos: well this is where i get out of my depth
22:02:50 sipa: if one of the threads is waiting on a read from ram, the other can use the arithmetic unit for example
22:02:54 morcos: wumpus: i'm pretty sure though that the speed up is considerably more than what you might expect from that
22:02:59 wumpus: sipa: ok, I back down, I didn't want to argue this at all
22:03:35 morcos: the reason i haven't tested it myself, is the machine i usually use has 16 cores... so not easy due to remaining concurrency issues to get much more speedup
22:03:36 wumpus: I'm fine with restoring it to number of virtual threads if that's faster
22:03:54 morcos: we should have somene with 4 cores (and  actually test it though, i agree
22:03:58 sipa: i would expect (but we should benchmark...) that if 8 scriot validation threads instead of 4 on a quadcore hyperthreading is not faster, it's due to lock contention
22:04:20 morcos: sipa: yeah thats my point, i think lock contention isn't that bad with 8 now
22:04:22 wumpus: on 64-bit systems the additional thread overhead wouldn't be important at least
22:04:23 gmaxwell: I previously benchmarked, a long time ago, it was faster.
22:04:33 gmaxwell: (to use the HT core count)
22:04:44 wumpus: why was this changed at all then?
22:04:47 wumpus: I'm confused
22:05:04 sipa: good question!
22:05:06 gmaxwell: I had no idea we changed it.
22:05:25 wumpus: sigh
22:05:54 gmaxwell: What PR changed it?
22:06:51 gmaxwell: In any case, on 32-bit it's probably a good tradeoff... the extra ram overhead is worth avoiding.
22:07:22 wumpus: #6361
22:07:28 gmaxwell: PR 6461 btw.
22:07:37 gmaxwell: er lol at least you got it right.
22:07:45 wumpus: the complaint was that systems became unsuably slow when using that many thread
22:07:51 wumpus: so at least I got one thing right, woohoo
22:07:55 sipa: seems i even acked it!
22:07:57 BlueMatt: wumpus: there are more alus
22:08:38 BlueMatt: but we need to improve lock contention first
22:08:40 morcos: anywya, i think in the past the lock contention made 8 threads regardless of cores a bit dicey.. now that is much better (although more still to be done)
22:09:01 BlueMatt: or we can just merge #10192, thats fee
22:09:04 gribble: #10192 | Cache full script execution results in addition to signatures by TheBlueMatt · Pull Request #10192 · bitcoin/bitcoin · GitHub
22:09:11 BlueMatt: s/fee/free/
22:09:21 morcos: no, we do not need to improve lock contention first. but we should probably do that before we increase the max beyond 16
22:09:26 BlueMatt: then we can toss concurrency issues out the window and get more speedup anyway
22:09:35 gmaxwell: wumpus: yea, well in QT I thought we also diminished the count by 1 or something? but yes, if the motivation was to reduce how heavily the machine was used, thats fair.
22:09:56 sipa: the benefit of using HT cores is certainly not a factor 2
22:09:58 wumpus: gmaxwell: for the default I think this makes a lot of sense, yes
22:10:10 gmaxwell: morcos: right now on my 24/28 physical core hosts going beyond 16 still reduces performance.
22:10:11 wumpus: gmaxwell: do we also restrict the maximum par using this? that'd make less sense
22:10:51 wumpus: if someone *wants* to use the virtual cores they should be able to by setting -par=
22:10:51 sipa: *flies to US*
22:10:52 BlueMatt: sipa: sure, but the shared cache helps us get more out of it than some others, as morcos points out
22:11:30 BlueMatt: (because it means our thread contention issues are less)
22:12:05 morcos: gmaxwell: yeah i've been bogged down in fee estimation as well (and the rest of life) for a while now.. otherwise i would have put more effort into jeremy's checkqueue
22:12:36 BlueMatt: morcos: heh, well now you can do other stuff while the rest of us get bogged down in understanding fee estimation enough to review it
22:12:37 wumpus: [to answer my own question: no, the limit for par is MAX_SCRIPTCHECK_THREADS, or 16]
22:12:54 morcos: but to me optimizing for more than 16 cores is pretty valuable as miners could use beefy machines and be less concerned by block validation time
22:14:38 BlueMatt: morcos: i think you may be surprised by the number of mining pools that are on VPSes that do not have 16 cores
22:15:34 gmaxwell: I assume right now most of the time block validation is bogged in the parts that are not as concurrent. simple because caching makes the concurrent parts so fast. (and soon to hopefully increase with bluematt's patch)
22:17:55 gmaxwell: improving sha2 speed, or transaction malloc overhead are probably bigger wins now for connection at the tip than parallelism beyond 16 (though I'd like that too).
22:18:21 BlueMatt: sha2 speed is big
22:18:27 morcos: yeah lots of things to do actually...
22:18:57 gmaxwell: BlueMatt: might be a tiny bit less big if we didn't hash the block header 8 times for every block.
22:21:27 BlueMatt: ehh, probably, but I'm less rushed there
22:21:43 BlueMatt: my new cache thing is about to add a bunch of hashing
22:21:50 BlueMatt: 1 sha round per tx
22:22:25 BlueMatt: and sigcache is obviously a ton
```
Tree-SHA512: a594430e2a77d8cc741ea8c664a2867b1e1693e5050a4bbc8511e8d66a2bffe241a9965f6dff1e7fbb99f21dd1fdeb95b826365da8bd8f9fab2d0ffd80d5059c
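The interplay discussed in the log (a detected core count used only for the default, an explicit `-par=` that may exceed it, and the `MAX_SCRIPTCHECK_THREADS` cap of 16) can be sketched roughly as below. Only `std::thread::hardware_concurrency()` and the value 16 come from the discussion; `GetNumCoresSketch` and `ResolveScriptCheckThreads` are hypothetical names, and the treatment of zero and negative values (0 = auto, negative = leave that many cores free) is an assumption made for illustration, not something stated in this thread.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Cap quoted in the IRC log above.
static const int MAX_SCRIPTCHECK_THREADS = 16;

// Logical core count. std::thread::hardware_concurrency() counts virtual
// cores and may return 0 when the value is not computable, so treat 0 as 1.
int GetNumCoresSketch()
{
    const unsigned int n = std::thread::hardware_concurrency();
    return n == 0 ? 1 : static_cast<int>(n);
}

// Hypothetical resolution of a -par style setting (assumed semantics):
//   par > 0  : use that many threads, even if above the detected count
//   par == 0 : auto, use the detected core count
//   par < 0  : leave that many cores free
// The result is always clamped to [1, MAX_SCRIPTCHECK_THREADS].
int ResolveScriptCheckThreads(int par, int detected_cores)
{
    assert(detected_cores >= 1);
    int threads = par;
    if (threads <= 0) threads += detected_cores;
    return std::max(1, std::min(threads, MAX_SCRIPTCHECK_THREADS));
}
```

With this shape, `ResolveScriptCheckThreads(0, GetNumCoresSketch())` gives the default, while an explicit value such as 8 on a 4-core machine is honoured, which matches the point that the detected count should shape the default rather than the maximum.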
3f3edde [Bench] Use PIVX address in Base58Decode test (random-zebra)
5a1be90 [Travis] Disable benchmark framework for trusty test (random-zebra)
1bd89ac Initialize recently introduced non-static class member lastCycles to zero in constructor (random-zebra)
ec60671 Require a steady clock for bench with at least micro precision (random-zebra)
84069ce bench: prefer a steady clock if the resolution is no worse (random-zebra)
38367b1 bench: switch to std::chrono for time measurements (random-zebra)
a24633a Remove countMaskInv caching in bench framework (random-zebra)
9e9bc22 Restore default format state of cout after printing with std::fixed/setprecision (random-zebra)
3dd559d Avoid static analyzer warnings regarding uninitialized arguments (random-zebra)
e85f224 Replace boost::function with std::function (C++11) (random-zebra)
98c0857 Prevent warning: variable 'x' is uninitialized (random-zebra)
7f0d4b3 FastRandom benchmark (random-zebra)
d9fa0c6 Add prevector destructor benchmark (random-zebra)
e1527ba Assert that what might look like a possible division by zero is actually unreachable (random-zebra)
e94cf15 bench: Fix initialization order in registration (random-zebra)
151c25f Basic CCheckQueue Benchmarks (random-zebra)
51aedbc Use std:thread:hardware_concurrency, instead of Boost, to determine available cores (random-zebra)
d447613 Use real number of cores for default -par, ignore virtual cores (random-zebra)
9162a56 [Refactoring] Removed using namespace <xxx> from bench/ sources (random-zebra)
5c07f67 bench: Add support for measuring CPU cycles (random-zebra)
41ce1ed bench: Fix subtle counting issue when rescaling iteration count (random-zebra)
68ea794 Avoid integer division in the benchmark inner-most loop. (random-zebra)
3fa4f27 bench: Added base58 encoding/decoding benchmarks (random-zebra)
4442118 bench: Add crypto hash benchmarks (random-zebra)
a5179b6 [Trivial] ensure minimal header conventions (random-zebra)
8607d6b Support very-fast-running benchmarks (random-zebra)
4aebb60 Simple benchmarking framework (random-zebra)
Pull request description:
Introduces the benchmarking framework, loosely based on google's micro-benchmarking library (https://github.com/google/benchmark), ported from Bitcoin, up to 0.16. The benchmark framework is hard-coded to run each benchmark for one wall-clock second, and then spits out .csv-format timing information to stdout.
Backported PR:
- bitcoin#6733
- bitcoin#6770
- bitcoin#6892
- bitcoin#8039
- bitcoin#8107
- bitcoin#8115
- bitcoin#9200
- bitcoin#9202
- bitcoin#9281
- bitcoin#6361
- bitcoin#10271
- bitcoin#9498
- bitcoin#9712
- bitcoin#9547
- bitcoin#9505 (benchmark only. Rest was in #1557)
- bitcoin#9792 (benchmark only. Rest was in #643)
- bitcoin#10272
- bitcoin#10395 (base58 only)
- bitcoin#10963
- bitcoin#11303 (first commit)
- bitcoin#11562
- bitcoin#11646
- bitcoin#11654
Current output of `src/bench/bench_pivx`:
```
#Benchmark,count,min(ns),max(ns),average(ns),min_cycles,max_cycles,average_cycles
Base58CheckEncode,131072,7697,8065,7785,20015,20971,20242
Base58Decode,294912,3305,3537,3454,8595,9198,8981
Base58Encode,180224,5498,6020,5767,14297,15652,14994
CCheckQueueSpeed,320,3159960,3535173,3352787,8216030,9191602,8717388
CCheckQueueSpeedPrevectorJob,96,9184484,11410840,10823070,23880046,29668680,28140445
FastRandom_1bit,320,3143690,4838162,3199156,8173726,12579373,8317941
FastRandom_32bit,60,17097612,17923669,17367440,44454504,46602306,45156079
PrevectorClear,3072,334741,366618,346731,870340,953224,901516
PrevectorDestructor,2816,344233,368912,357281,895022,959187,928948
RIPEMD160,288,3404503,3693917,3577774,8851850,9604334,9302363
SHA1,384,2718128,2891558,2802513,7067238,7518184,7286652
SHA256,176,6133760,6580005,6239866,15948035,17108376,16223916
SHA512,240,4251468,4358706,4313463,11054006,11332826,11215186
Sleep100ms,10,100221470,100302411,100239073,260580075,260790726,260625870
```
NOTE: Not all the tests have been pulled yet (as we might not have the code being tested, or it would require rewrites to work with our different code base), but the framework is updated to December 2017.
ACKs for top commit:
Fuzzbawls: ACK 3f3edde
Tree-SHA512: c283311a9accf6d2feeb93b185afa08589ebef3f18b6e86980dbc3647b9845f75ac9ecce2f1b08738d25ceac36596a2c89d41e4dbf3b463502aa695611aa1f8e
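As a rough illustration of the "run for a fixed wall-clock budget, then emit CSV" idea described above, here is a minimal sketch using `std::chrono::steady_clock`. `RunFor` and `FormatRow` are hypothetical names and are not the framework's API. The sketch times every call individually and omits the cycle-count columns, whereas the framework's large `count` values suggest it batches iterations.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <functional>
#include <numeric>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: call fn repeatedly until `budget` of wall-clock time
// has elapsed, recording the duration of each call in nanoseconds.
// fn always runs at least once.
std::vector<int64_t> RunFor(std::chrono::nanoseconds budget, const std::function<void()>& fn)
{
    using Clock = std::chrono::steady_clock;
    std::vector<int64_t> samples;
    const Clock::time_point start = Clock::now();
    do {
        const Clock::time_point t0 = Clock::now();
        fn();
        const Clock::time_point t1 = Clock::now();
        samples.push_back(std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count());
    } while (Clock::now() - start < budget);
    return samples;
}

// Hypothetical helper: format one CSV row as name,count,min(ns),max(ns),average(ns).
std::string FormatRow(const std::string& name, const std::vector<int64_t>& samples)
{
    if (samples.empty()) return name + ",0,0,0,0";
    const int64_t min = *std::min_element(samples.begin(), samples.end());
    const int64_t max = *std::max_element(samples.begin(), samples.end());
    const int64_t sum = std::accumulate(samples.begin(), samples.end(), int64_t{0});
    std::ostringstream out;
    out << name << ',' << samples.size() << ',' << min << ',' << max << ','
        << sum / static_cast<int64_t>(samples.size()); // integer average
    return out.str();
}
```

A one-second run of some workload would then be `FormatRow("Example", RunFor(std::chrono::seconds(1), workload))`, printed to stdout after a header line.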
…f Boost, to determine available cores 937bf43 Use std::thread::hardware_concurrency, instead of Boost, to determine available cores (fanquake) Pull request description: Following discussion on IRC about replacing Boost usage for detecting available system cores, I've opened this to collect some benchmarks + further discussion. The current method for detecting available cores was introduced in bitcoin#6361. Recap of the IRC chat: ``` 21:14:08 fanquake: Since we seem to be giving Boost removal a good shot for 0.15, does anyone have suggestions for replacing GetNumCores? 21:14:26 fanquake: There is std::thread::hardware_concurrency(), but that seems to count virtual cores, which I don't think we want. 21:14:51 BlueMatt: fanquake: I doubt we'll do boost removal for 0.15 21:14:58 BlueMatt: shit like BOOST_FOREACH, sure 21:15:07 BlueMatt: but all of boost? doubtful, there are still things we need 21:16:36 fanquake: Yea sorry, not the whole lot, but we can remove a decent chunk. Just looking into what else needs to be done to replace some of the less involved Boost usage. 21:16:43 BlueMatt: fair 21:17:14 wumpus: yes, it makes sense to plan ahead a bit, without immediately doing it 21:18:12 wumpus: right, don't count virtual cores, that used to be the case but it makes no sense for our usage 21:19:15 wumpus: it'd create a swarm of threads overwhelming any machine with hyperthreading (+accompanying thread stack overhead), for script validation, and there was no gain at all for that 21:20:03 sipa: BlueMatt: don't worry, there is no hurry 21:59:10 morcos: wumpus: i don't think that is correct 21:59:24 morcos: suppose you have 4 cores (8 virtual cores) 21:59:24 wumpus: fanquake: indeed seems that std has no equivalent to physical_concurrency, on any standard. That's annoying as it is non-trivial to implement 21:59:35 morcos: i think running par=8 (if it let you) would be notably faster 21:59:59 morcos: jeremyrubin and i discussed this at length a while back... 
i think i commented about it on irc at the time 22:00:21 wumpus: morcos: I think the conclusion at the time was that it made no difference, but sure would make sense to benchmark 22:00:39 morcos: perhaps historical testing on the virtual vs actual cores was polluted by concurrency issues that have now improved 22:00:47 wumpus: I think there are not more ALUs, so there is not really a point in having more threads 22:01:40 wumpus: hyperthreads are basically just a stored register state right? 22:02:23 sipa: wumpus: yes but it helps the scheduler 22:02:27 wumpus: in which case the only speedup using "number of cores" threads would give you is, possibly, excluding other software from running on the cores on the same time 22:02:37 morcos: well this is where i get out of my depth 22:02:50 sipa: if one of the threads is waiting on a read from ram, the other can use the arithmetic unit for example 22:02:54 morcos: wumpus: i'm pretty sure though that the speed up is considerably more than what you might expect from that 22:02:59 wumpus: sipa: ok, I back down, I didn't want to argue this at all 22:03:35 morcos: the reason i haven't tested it myself, is the machine i usually use has 16 cores... so not easy due to remaining concurrency issues to get much more speedup 22:03:36 wumpus: I'm fine with restoring it to number of virtual threads if that's faster 22:03:54 morcos: we should have somene with 4 cores (and  actually test it though, i agree 22:03:58 sipa: i would expect (but we should benchmark...) that if 8 scriot validation threads instead of 4 on a quadcore hyperthreading is not faster, it's due to lock contention 22:04:20 morcos: sipa: yeah thats my point, i think lock contention isn't that bad with 8 now 22:04:22 wumpus: on 64-bit systems the additional thread overhead wouldn't be important at least 22:04:23 gmaxwell: I previously benchmarked, a long time ago, it was faster. 
22:04:33 gmaxwell: (to use the HT core count) 22:04:44 wumpus: why was this changed at all then? 22:04:47 wumpus: I'm confused 22:05:04 sipa: good question! 22:05:06 gmaxwell: I had no idea we changed it. 22:05:25 wumpus: sigh  22:05:54 gmaxwell: What PR changed it? 22:06:51 gmaxwell: In any case, on 32-bit it's probably a good tradeoff... the extra ram overhead is worth avoiding. 22:07:22 wumpus: bitcoin#6361 22:07:28 gmaxwell: PR 6461 btw. 22:07:37 gmaxwell: er lol at least you got it right. 22:07:45 wumpus: the complaint was that systems became unsuably slow when using that many thread 22:07:51 wumpus: so at least I got one thing right, woohoo 22:07:55 sipa: seems i even acked it! 22:07:57 BlueMatt: wumpus: there are more alus 22:08:38 BlueMatt: but we need to improve lock contention first 22:08:40 morcos: anywya, i think in the past the lock contention made 8 threads regardless of cores a bit dicey.. now that is much better (although more still to be done) 22:09:01 BlueMatt: or we can just merge bitcoin#10192, thats fee 22:09:04 gribble: bitcoin#10192 | Cache full script execution results in addition to signatures by TheBlueMatt · Pull Request bitcoin#10192 · bitcoin/bitcoin · GitHub 22:09:11 BlueMatt: s/fee/free/ 22:09:21 morcos: no, we do not need to improve lock contention first. but we should probably do that before we increase the max beyond 16 22:09:26 BlueMatt: then we can toss concurrency issues out the window and get more speedup anyway 22:09:35 gmaxwell: wumpus: yea, well in QT I thought we also diminished the count by 1 or something? but yes, if the motivation was to reduce how heavily the machine was used, thats fair. 22:09:56 sipa: the benefit of using HT cores is certainly not a factor 2 22:09:58 wumpus: gmaxwell: for the default I think this makes a lot of sense, yes 22:10:10 gmaxwell: morcos: right now on my 24/28 physical core hosts going beyond 16 still reduces performance. 
22:10:11 wumpus: gmaxwell: do we also restrict the maximum par using this? that'd make less sense 22:10:51 wumpus: if someone *wants* to use the virtual cores they should be able to by setting -par= 22:10:51 sipa: *flies to US* 22:10:52 BlueMatt: sipa: sure, but the shared cache helps us get more out of it than some others, as morcos points out 22:11:30 BlueMatt: (because it means our thread contention issues are less) 22:12:05 morcos: gmaxwell: yeah i've been bogged down in fee estimation as well (and the rest of life) for a while now.. otherwise i would have put more effort into jeremy's checkqueue 22:12:36 BlueMatt: morcos: heh, well now you can do other stuff while the rest of us get bogged down in understanding fee estimation enough to review it  22:12:37 wumpus: [to answer my own question: no, the limit for par is MAX_SCRIPTCHECK_THREADS, or 16] 22:12:54 morcos: but to me optimizing for more than 16 cores is pretty valuable as miners could use beefy machines and be less concerned by block validation time 22:14:38 BlueMatt: morcos: i think you may be surprised by the number of mining pools that are on VPSes that do not have 16 cores  22:15:34 gmaxwell: I assume right now most of the time block validation is bogged in the parts that are not as concurrent. simple because caching makes the concurrent parts so fast. (and soon to hopefully increase with bluematt's patch) 22:17:55 gmaxwell: improving sha2 speed, or transaction malloc overhead are probably bigger wins now for connection at the tip than parallelism beyond 16 (though I'd like that too). 22:18:21 BlueMatt: sha2 speed is big 22:18:27 morcos: yeah lots of things to do actually... 22:18:57 gmaxwell: BlueMatt: might be a tiny bit less big if we didn't hash the block header 8 times for every block.  
22:21:27 BlueMatt: ehh, probably, but I'm less rushed there 22:21:43 BlueMatt: my new cache thing is about to add a bunch of hashing 22:21:50 BlueMatt: 1 sha round per tx 22:22:25 BlueMatt: and sigcache is obviously a ton ``` Tree-SHA512: a594430e2a77d8cc741ea8c664a2867b1e1693e5050a4bbc8511e8d66a2bffe241a9965f6dff1e7fbb99f21dd1fdeb95b826365da8bd8f9fab2d0ffd80d5059c
…f Boost, to determine available cores 937bf43 Use std::thread::hardware_concurrency, instead of Boost, to determine available cores (fanquake) Pull request description: Following discussion on IRC about replacing Boost usage for detecting available system cores, I've opened this to collect some benchmarks + further discussion. The current method for detecting available cores was introduced in bitcoin#6361. Recap of the IRC chat: ``` 21:14:08 fanquake: Since we seem to be giving Boost removal a good shot for 0.15, does anyone have suggestions for replacing GetNumCores? 21:14:26 fanquake: There is std::thread::hardware_concurrency(), but that seems to count virtual cores, which I don't think we want. 21:14:51 BlueMatt: fanquake: I doubt we'll do boost removal for 0.15 21:14:58 BlueMatt: shit like BOOST_FOREACH, sure 21:15:07 BlueMatt: but all of boost? doubtful, there are still things we need 21:16:36 fanquake: Yea sorry, not the whole lot, but we can remove a decent chunk. Just looking into what else needs to be done to replace some of the less involved Boost usage. 21:16:43 BlueMatt: fair 21:17:14 wumpus: yes, it makes sense to plan ahead a bit, without immediately doing it 21:18:12 wumpus: right, don't count virtual cores, that used to be the case but it makes no sense for our usage 21:19:15 wumpus: it'd create a swarm of threads overwhelming any machine with hyperthreading (+accompanying thread stack overhead), for script validation, and there was no gain at all for that 21:20:03 sipa: BlueMatt: don't worry, there is no hurry 21:59:10 morcos: wumpus: i don't think that is correct 21:59:24 morcos: suppose you have 4 cores (8 virtual cores) 21:59:24 wumpus: fanquake: indeed seems that std has no equivalent to physical_concurrency, on any standard. That's annoying as it is non-trivial to implement 21:59:35 morcos: i think running par=8 (if it let you) would be notably faster 21:59:59 morcos: jeremyrubin and i discussed this at length a while back... 
i think i commented about it on irc at the time 22:00:21 wumpus: morcos: I think the conclusion at the time was that it made no difference, but sure would make sense to benchmark 22:00:39 morcos: perhaps historical testing on the virtual vs actual cores was polluted by concurrency issues that have now improved 22:00:47 wumpus: I think there are not more ALUs, so there is not really a point in having more threads 22:01:40 wumpus: hyperthreads are basically just a stored register state right? 22:02:23 sipa: wumpus: yes but it helps the scheduler 22:02:27 wumpus: in which case the only speedup using "number of cores" threads would give you is, possibly, excluding other software from running on the cores on the same time 22:02:37 morcos: well this is where i get out of my depth 22:02:50 sipa: if one of the threads is waiting on a read from ram, the other can use the arithmetic unit for example 22:02:54 morcos: wumpus: i'm pretty sure though that the speed up is considerably more than what you might expect from that 22:02:59 wumpus: sipa: ok, I back down, I didn't want to argue this at all 22:03:35 morcos: the reason i haven't tested it myself, is the machine i usually use has 16 cores... so not easy due to remaining concurrency issues to get much more speedup 22:03:36 wumpus: I'm fine with restoring it to number of virtual threads if that's faster 22:03:54 morcos: we should have somene with 4 cores (and  actually test it though, i agree 22:03:58 sipa: i would expect (but we should benchmark...) that if 8 scriot validation threads instead of 4 on a quadcore hyperthreading is not faster, it's due to lock contention 22:04:20 morcos: sipa: yeah thats my point, i think lock contention isn't that bad with 8 now 22:04:22 wumpus: on 64-bit systems the additional thread overhead wouldn't be important at least 22:04:23 gmaxwell: I previously benchmarked, a long time ago, it was faster. 
22:04:33 gmaxwell: (to use the HT core count) 22:04:44 wumpus: why was this changed at all then? 22:04:47 wumpus: I'm confused 22:05:04 sipa: good question! 22:05:06 gmaxwell: I had no idea we changed it. 22:05:25 wumpus: sigh  22:05:54 gmaxwell: What PR changed it? 22:06:51 gmaxwell: In any case, on 32-bit it's probably a good tradeoff... the extra ram overhead is worth avoiding. 22:07:22 wumpus: bitcoin#6361 22:07:28 gmaxwell: PR 6461 btw. 22:07:37 gmaxwell: er lol at least you got it right. 22:07:45 wumpus: the complaint was that systems became unsuably slow when using that many thread 22:07:51 wumpus: so at least I got one thing right, woohoo 22:07:55 sipa: seems i even acked it! 22:07:57 BlueMatt: wumpus: there are more alus 22:08:38 BlueMatt: but we need to improve lock contention first 22:08:40 morcos: anywya, i think in the past the lock contention made 8 threads regardless of cores a bit dicey.. now that is much better (although more still to be done) 22:09:01 BlueMatt: or we can just merge bitcoin#10192, thats fee 22:09:04 gribble: bitcoin#10192 | Cache full script execution results in addition to signatures by TheBlueMatt · Pull Request bitcoin#10192 · bitcoin/bitcoin · GitHub 22:09:11 BlueMatt: s/fee/free/ 22:09:21 morcos: no, we do not need to improve lock contention first. but we should probably do that before we increase the max beyond 16 22:09:26 BlueMatt: then we can toss concurrency issues out the window and get more speedup anyway 22:09:35 gmaxwell: wumpus: yea, well in QT I thought we also diminished the count by 1 or something? but yes, if the motivation was to reduce how heavily the machine was used, thats fair. 22:09:56 sipa: the benefit of using HT cores is certainly not a factor 2 22:09:58 wumpus: gmaxwell: for the default I think this makes a lot of sense, yes 22:10:10 gmaxwell: morcos: right now on my 24/28 physical core hosts going beyond 16 still reduces performance. 
22:10:11 wumpus: gmaxwell: do we also restrict the maximum par using this? that'd make less sense 22:10:51 wumpus: if someone *wants* to use the virtual cores they should be able to by setting -par= 22:10:51 sipa: *flies to US* 22:10:52 BlueMatt: sipa: sure, but the shared cache helps us get more out of it than some others, as morcos points out 22:11:30 BlueMatt: (because it means our thread contention issues are less) 22:12:05 morcos: gmaxwell: yeah i've been bogged down in fee estimation as well (and the rest of life) for a while now.. otherwise i would have put more effort into jeremy's checkqueue 22:12:36 BlueMatt: morcos: heh, well now you can do other stuff while the rest of us get bogged down in understanding fee estimation enough to review it  22:12:37 wumpus: [to answer my own question: no, the limit for par is MAX_SCRIPTCHECK_THREADS, or 16] 22:12:54 morcos: but to me optimizing for more than 16 cores is pretty valuable as miners could use beefy machines and be less concerned by block validation time 22:14:38 BlueMatt: morcos: i think you may be surprised by the number of mining pools that are on VPSes that do not have 16 cores  22:15:34 gmaxwell: I assume right now most of the time block validation is bogged in the parts that are not as concurrent. simple because caching makes the concurrent parts so fast. (and soon to hopefully increase with bluematt's patch) 22:17:55 gmaxwell: improving sha2 speed, or transaction malloc overhead are probably bigger wins now for connection at the tip than parallelism beyond 16 (though I'd like that too). 22:18:21 BlueMatt: sha2 speed is big 22:18:27 morcos: yeah lots of things to do actually... 22:18:57 gmaxwell: BlueMatt: might be a tiny bit less big if we didn't hash the block header 8 times for every block.  
22:21:27 BlueMatt: ehh, probably, but I'm less rushed there 22:21:43 BlueMatt: my new cache thing is about to add a bunch of hashing 22:21:50 BlueMatt: 1 sha round per tx 22:22:25 BlueMatt: and sigcache is obviously a ton ``` Tree-SHA512: a594430e2a77d8cc741ea8c664a2867b1e1693e5050a4bbc8511e8d66a2bffe241a9965f6dff1e7fbb99f21dd1fdeb95b826365da8bd8f9fab2d0ffd80d5059c
…f Boost, to determine available cores
937bf43 Use std::thread::hardware_concurrency, instead of Boost, to determine available cores (fanquake)
Pull request description:
Following discussion on IRC about replacing Boost usage for detecting available system cores, I've opened this to collect some benchmarks + further discussion. The current method for detecting available cores was introduced in bitcoin#6361.
Recap of the IRC chat:
```
21:14:08 fanquake: Since we seem to be giving Boost removal a good shot for 0.15, does anyone have suggestions for replacing GetNumCores?
21:14:26 fanquake: There is std::thread::hardware_concurrency(), but that seems to count virtual cores, which I don't think we want.
21:14:51 BlueMatt: fanquake: I doubt we'll do boost removal for 0.15
21:14:58 BlueMatt: shit like BOOST_FOREACH, sure
21:15:07 BlueMatt: but all of boost? doubtful, there are still things we need
21:16:36 fanquake: Yea sorry, not the whole lot, but we can remove a decent chunk. Just looking into what else needs to be done to replace some of the less involved Boost usage.
21:16:43 BlueMatt: fair
21:17:14 wumpus: yes, it makes sense to plan ahead a bit, without immediately doing it
21:18:12 wumpus: right, don't count virtual cores, that used to be the case but it makes no sense for our usage
21:19:15 wumpus: it'd create a swarm of threads overwhelming any machine with hyperthreading (+accompanying thread stack overhead), for script validation, and there was no gain at all for that
21:20:03 sipa: BlueMatt: don't worry, there is no hurry
21:59:10 morcos: wumpus: i don't think that is correct
21:59:24 morcos: suppose you have 4 cores (8 virtual cores)
21:59:24 wumpus: fanquake: indeed seems that std has no equivalent to physical_concurrency, on any standard. That's annoying as it is non-trivial to implement
21:59:35 morcos: i think running par=8 (if it let you) would be notably faster
21:59:59 morcos: jeremyrubin and i discussed this at length a while back... i think i commented about it on irc at the time
22:00:21 wumpus: morcos: I think the conclusion at the time was that it made no difference, but sure would make sense to benchmark
22:00:39 morcos: perhaps historical testing on the virtual vs actual cores was polluted by concurrency issues that have now improved
22:00:47 wumpus: I think there are not more ALUs, so there is not really a point in having more threads
22:01:40 wumpus: hyperthreads are basically just a stored register state right?
22:02:23 sipa: wumpus: yes but it helps the scheduler
22:02:27 wumpus: in which case the only speedup using "number of cores" threads would give you is, possibly, excluding other software from running on the cores on the same time
22:02:37 morcos: well this is where i get out of my depth
22:02:50 sipa: if one of the threads is waiting on a read from ram, the other can use the arithmetic unit for example
22:02:54 morcos: wumpus: i'm pretty sure though that the speed up is considerably more than what you might expect from that
22:02:59 wumpus: sipa: ok, I back down, I didn't want to argue this at all
22:03:35 morcos: the reason i haven't tested it myself, is the machine i usually use has 16 cores... so not easy due to remaining concurrency issues to get much more speedup
22:03:36 wumpus: I'm fine with restoring it to number of virtual threads if that's faster
22:03:54 morcos: we should have somene with 4 cores (and actually test it though, i agree
22:03:58 sipa: i would expect (but we should benchmark...) that if 8 scriot validation threads instead of 4 on a quadcore hyperthreading is not faster, it's due to lock contention
22:04:20 morcos: sipa: yeah thats my point, i think lock contention isn't that bad with 8 now
22:04:22 wumpus: on 64-bit systems the additional thread overhead wouldn't be important at least
22:04:23 gmaxwell: I previously benchmarked, a long time ago, it was faster.
22:04:33 gmaxwell: (to use the HT core count)
22:04:44 wumpus: why was this changed at all then?
22:04:47 wumpus: I'm confused
22:05:04 sipa: good question!
22:05:06 gmaxwell: I had no idea we changed it.
22:05:25 wumpus: sigh
22:05:54 gmaxwell: What PR changed it?
22:06:51 gmaxwell: In any case, on 32-bit it's probably a good tradeoff... the extra ram overhead is worth avoiding.
22:07:22 wumpus: bitcoin#6361
22:07:28 gmaxwell: PR 6461 btw.
22:07:37 gmaxwell: er lol at least you got it right.
22:07:45 wumpus: the complaint was that systems became unsuably slow when using that many thread
22:07:51 wumpus: so at least I got one thing right, woohoo
22:07:55 sipa: seems i even acked it!
22:07:57 BlueMatt: wumpus: there are more alus
22:08:38 BlueMatt: but we need to improve lock contention first
22:08:40 morcos: anywya, i think in the past the lock contention made 8 threads regardless of cores a bit dicey.. now that is much better (although more still to be done)
22:09:01 BlueMatt: or we can just merge bitcoin#10192, thats fee
22:09:04 gribble: bitcoin#10192 | Cache full script execution results in addition to signatures by TheBlueMatt · Pull Request bitcoin#10192 · bitcoin/bitcoin · GitHub
22:09:11 BlueMatt: s/fee/free/
22:09:21 morcos: no, we do not need to improve lock contention first. but we should probably do that before we increase the max beyond 16
22:09:26 BlueMatt: then we can toss concurrency issues out the window and get more speedup anyway
22:09:35 gmaxwell: wumpus: yea, well in QT I thought we also diminished the count by 1 or something? but yes, if the motivation was to reduce how heavily the machine was used, thats fair.
22:09:56 sipa: the benefit of using HT cores is certainly not a factor 2
22:09:58 wumpus: gmaxwell: for the default I think this makes a lot of sense, yes
22:10:10 gmaxwell: morcos: right now on my 24/28 physical core hosts going beyond 16 still reduces performance.
22:10:11 wumpus: gmaxwell: do we also restrict the maximum par using this? that'd make less sense
22:10:51 wumpus: if someone *wants* to use the virtual cores they should be able to by setting -par=
22:10:51 sipa: *flies to US*
22:10:52 BlueMatt: sipa: sure, but the shared cache helps us get more out of it than some others, as morcos points out
22:11:30 BlueMatt: (because it means our thread contention issues are less)
22:12:05 morcos: gmaxwell: yeah i've been bogged down in fee estimation as well (and the rest of life) for a while now.. otherwise i would have put more effort into jeremy's checkqueue
22:12:36 BlueMatt: morcos: heh, well now you can do other stuff while the rest of us get bogged down in understanding fee estimation enough to review it
22:12:37 wumpus: [to answer my own question: no, the limit for par is MAX_SCRIPTCHECK_THREADS, or 16]
22:12:54 morcos: but to me optimizing for more than 16 cores is pretty valuable as miners could use beefy machines and be less concerned by block validation time
22:14:38 BlueMatt: morcos: i think you may be surprised by the number of mining pools that are on VPSes that do not have 16 cores
22:15:34 gmaxwell: I assume right now most of the time block validation is bogged in the parts that are not as concurrent. simple because caching makes the concurrent parts so fast. (and soon to hopefully increase with bluematt's patch)
22:17:55 gmaxwell: improving sha2 speed, or transaction malloc overhead are probably bigger wins now for connection at the tip than parallelism beyond 16 (though I'd like that too).
22:18:21 BlueMatt: sha2 speed is big
22:18:27 morcos: yeah lots of things to do actually...
22:18:57 gmaxwell: BlueMatt: might be a tiny bit less big if we didn't hash the block header 8 times for every block.
22:21:27 BlueMatt: ehh, probably, but I'm less rushed there
22:21:43 BlueMatt: my new cache thing is about to add a bunch of hashing
22:21:50 BlueMatt: 1 sha round per tx
22:22:25 BlueMatt: and sigcache is obviously a ton
```
Tree-SHA512: a594430e2a77d8cc741ea8c664a2867b1e1693e5050a4bbc8511e8d66a2bffe241a9965f6dff1e7fbb99f21dd1fdeb95b826365da8bd8f9fab2d0ffd80d5059c
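The recap above pins down two points: an explicit `-par=` value should be honoured even when it exceeds the number of physical cores, and the upper bound is `MAX_SCRIPTCHECK_THREADS`, or 16. A minimal sketch of how a requested value and a detected core count might be combined is given below. The helper name `ResolveScriptCheckThreads`, the floor of one thread, and the treatment of zero or negative values as "auto-detect, leaving that many cores free" are assumptions made for illustration and are not taken from the discussion.

```cpp
#include <algorithm>
#include <cassert>

// Upper bound quoted in the chat ("MAX_SCRIPTCHECK_THREADS, or 16").
static const int MAX_SCRIPTCHECK_THREADS = 16;

// Hypothetical helper (name and exact semantics are assumptions of this sketch).
//   par_arg > 0  : the user asked for exactly that many threads.
//   par_arg <= 0 : auto-detect from detected_cores, leaving -par_arg cores free.
// The result is clamped to the range [1, MAX_SCRIPTCHECK_THREADS].
int ResolveScriptCheckThreads(int par_arg, int detected_cores)
{
    int threads = par_arg;
    if (threads <= 0) {
        threads += detected_cores;
    }
    return std::min(std::max(threads, 1), MAX_SCRIPTCHECK_THREADS);
}
```

With this shape, whether `detected_cores` counts physical or virtual cores only changes the default; a user on a 4-core/8-thread machine can still pass `-par=8` explicitly, and a 28-core host is capped at 16 either way.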
To determine the default for `-par`, the number of script verification threads, use `boost::thread::physical_concurrency()` which counts only physical cores, not virtual cores. Virtual cores are roughly a set of cached registers to avoid context switches while threading, they cannot actually perform work, so spawning a verification thread for them could even reduce efficiency and will put undue load on the system.
Should fix issue #6358, as well as some other reported system overload issues, especially on Intel processors.
The function was only introduced in boost 1.56, so provide a utility function `GetNumCores` to fall back for older Boost versions.
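The version fallback described above could look roughly like the sketch below. It is an illustration under stated assumptions rather than the patch itself: the `USE_BOOST_THREAD` macro is a hypothetical opt-in added so the snippet also builds without Boost installed (in which case it uses `std::thread::hardware_concurrency()`, which counts virtual cores), and mapping a result of 0 to 1 is a choice of this sketch.

```cpp
#include <cassert>
#include <thread>

#ifdef USE_BOOST_THREAD // hypothetical opt-in macro, only for this sketch
#include <boost/thread.hpp>
#include <boost/version.hpp>
#endif

// Returns the number of cores to base the default -par on.
int GetNumCores()
{
#if defined(USE_BOOST_THREAD) && BOOST_VERSION >= 105600
    // Boost >= 1.56: count physical cores only.
    const unsigned int n = boost::thread::physical_concurrency();
#elif defined(USE_BOOST_THREAD)
    // Older Boost: fall back to hardware_concurrency, which counts virtual cores.
    const unsigned int n = boost::thread::hardware_concurrency();
#else
    // No Boost at all: the standard library only exposes the virtual-core count.
    const unsigned int n = std::thread::hardware_concurrency();
#endif
    // These calls may return 0 when the count cannot be determined;
    // treating that as a single core is an assumption of this sketch.
    return n > 0 ? static_cast<int>(n) : 1;
}
```

Building with `-DUSE_BOOST_THREAD` and linking against Boost.Thread selects the Boost paths; without the macro only the standard library is needed.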