Use real number of cores for default -par, ignore virtual cores #6361

Merged
merged 1 commit into from Jul 2, 2015

Conversation

Projects
None yet
5 participants
@laanwj
Member

laanwj commented Jul 1, 2015

To determine the default for -par, the number of script verification threads, use boost::thread::physical_concurrency() which counts only physical cores, not virtual cores.

Virtual cores are roughly a set of cached registers to avoid context switches while threading, they cannot actually perform work, so spawning a verification thread for them could even reduce efficiency and will put undue load on the system.

Should fix issue #6358, as well as some other reported system overload issues, especially on Intel processors.

The function was only introduced in boost 1.56, so provide a utility function GetNumCores to fall back for older Boost versions.

Use real number of cores for default -par, ignore virtual cores
To determine the default for `-par`, the number of script verification
threads, use [boost::thread::physical_concurrency()](http://www.boost.org/doc/libs/1_58_0/doc/html/thread/thread_management.html#thread.thread_management.thread.physical_concurrency)
which counts only physical cores, not virtual cores.

Virtual cores are roughly a set of cached registers to avoid context
switches while threading, they cannot actually perform work, so spawning
a verification thread for them could even reduce efficiency and will put
undue load on the system.

Should fix issue #6358, as well as some other reported system overload
issues, especially on Intel processors.

The function was only introduced in boost 1.56, so provide a utility
function `GetNumCores` to fall back for older Boost versions.

@laanwj laanwj added the Bug label Jul 1, 2015

@laanwj

This comment has been minimized.

Show comment
Hide comment
@laanwj

laanwj Jul 1, 2015

Member

We'll also have to bump the boost version in depends for this to work (currently 1.55), but I remember that was already the plan for 0.12.

Member

laanwj commented Jul 1, 2015

We'll also have to bump the boost version in depends for this to work (currently 1.55), but I remember that was already the plan for 0.12.

@theuni

This comment has been minimized.

Show comment
Hide comment
@theuni

theuni Jul 2, 2015

Member

@laanwj yes, we need to go ahead with the boost bump. I'm getting ready to leave for a few days for a conference, but I'll make the boost/qt bumps first priority when I get back. Ideally those should be done very early in the cycle so we have some testing.

Taking that one step further, I'll check the other deps while I'm at it and see if they need bumps too.

Member

theuni commented Jul 2, 2015

@laanwj yes, we need to go ahead with the boost bump. I'm getting ready to leave for a few days for a conference, but I'll make the boost/qt bumps first priority when I get back. Ideally those should be done very early in the cycle so we have some testing.

Taking that one step further, I'll check the other deps while I'm at it and see if they need bumps too.

@luke-jr

This comment has been minimized.

Show comment
Hide comment
@luke-jr

luke-jr Jul 2, 2015

Member

utACK

Member

luke-jr commented Jul 2, 2015

utACK

@paveljanik

This comment has been minimized.

Show comment
Hide comment
@paveljanik

paveljanik Jul 2, 2015

Contributor

ACK

Contributor

paveljanik commented Jul 2, 2015

ACK

@laanwj laanwj merged commit 4716267 into bitcoin:master Jul 2, 2015

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

laanwj added a commit that referenced this pull request Jul 2, 2015

Merge pull request #6361
4716267 Use real number of cores for default -par, ignore virtual cores (Wladimir J. van der Laan)
@sipa

This comment has been minimized.

Show comment
Hide comment
@sipa

sipa Jul 9, 2015

Member

Posthumous ACK.

Member

sipa commented Jul 9, 2015

Posthumous ACK.

luke-jr added a commit to luke-jr/bitcoin that referenced this pull request Jan 10, 2016

Use real number of cores for default -par, ignore virtual cores
To determine the default for `-par`, the number of script verification
threads, use [boost::thread::physical_concurrency()](http://www.boost.org/doc/libs/1_58_0/doc/html/thread/thread_management.html#thread.thread_management.thread.physical_concurrency)
which counts only physical cores, not virtual cores.

Virtual cores are roughly a set of cached registers to avoid context
switches while threading, they cannot actually perform work, so spawning
a verification thread for them could even reduce efficiency and will put
undue load on the system.

Should fix issue #6358, as well as some other reported system overload
issues, especially on Intel processors.

The function was only introduced in boost 1.56, so provide a utility
function `GetNumCores` to fall back for older Boost versions.

Github-Pull: #6361
Rebased-From: 4716267

@str4d str4d referenced this pull request in zcash/zcash Feb 14, 2017

Merged

Bitcoin 0.12 misc PRs 1 #2099

@str4d str4d referenced this pull request in zcash/zcash Mar 29, 2017

Merged

Use real number of cores, ignore virtual cores #2227

@dagurval dagurval referenced this pull request in bitcoinxt/bitcoinxt Apr 23, 2017

Merged

Util ports #197

zkbot added a commit to zcash/zcash that referenced this pull request Jun 16, 2017

Auto merge of #2227 - str4d:2074-ignore-virtual-cores, r=str4d
Use real number of cores, ignore virtual cores

Cherry-picked from the following upstream PRs:

- bitcoin/bitcoin#6361
- bitcoin/bitcoin#6370

Part of #2074.

laanwj added a commit that referenced this pull request Mar 6, 2018

Merge #10271: Use std::thread::hardware_concurrency, instead of Boost…
…, to determine available cores


937bf43 Use std::thread::hardware_concurrency, instead of Boost, to determine available cores (fanquake)

Pull request description:

  Following discussion on IRC about replacing Boost usage for detecting available system cores, I've opened this to collect some benchmarks + further discussion.

  The current method for detecting available cores was introduced in #6361.

  Recap of the IRC chat:
  ```
  21:14:08 fanquake: Since we seem to be giving Boost removal a good shot for 0.15, does anyone have suggestions for replacing GetNumCores?
  21:14:26 fanquake: There is std::thread::hardware_concurrency(), but that seems to count virtual cores, which I don't think we want.
  21:14:51 BlueMatt: fanquake: I doubt we'll do boost removal for 0.15
  21:14:58 BlueMatt: shit like BOOST_FOREACH, sure
  21:15:07 BlueMatt: but all of boost? doubtful, there are still things we need
  21:16:36 fanquake: Yea sorry, not the whole lot, but we can remove a decent chunk. Just looking into what else needs to be done to replace some of the less involved Boost usage.
  21:16:43 BlueMatt: fair
  21:17:14 wumpus: yes, it makes sense to plan ahead a bit, without immediately doing it
  21:18:12 wumpus: right, don't count virtual cores, that used to be the case but it makes no sense for our usage
  21:19:15 wumpus: it'd create a swarm of threads overwhelming any machine with hyperthreading (+accompanying thread stack overhead), for script validation, and there was no gain at all for that
  21:20:03 sipa: BlueMatt: don't worry, there is no hurry
  21:59:10 morcos: wumpus: i don't think that is correct
  21:59:24 morcos: suppose you have 4 cores (8 virtual cores)
  21:59:24 wumpus: fanquake: indeed seems that std has no equivalent to physical_concurrency, on any standard. That's annoying as it is non-trivial to implement
  21:59:35 morcos: i think running par=8 (if it let you) would be notably faster
  21:59:59 morcos: jeremyrubin and i discussed this at length a while back... i think i commented about it on irc at the time
  22:00:21 wumpus: morcos: I think the conclusion at the time was that it made no difference, but sure would make sense to benchmark
  22:00:39 morcos: perhaps historical testing on the virtual vs actual cores was polluted by concurrency issues that have now improved
  22:00:47 wumpus: I think there are not more ALUs, so there is not really a point in having more threads
  22:01:40 wumpus: hyperthreads are basically just a stored register state right?
  22:02:23 sipa: wumpus: yes but it helps the scheduler
  22:02:27 wumpus: in which case the only speedup using "number of cores" threads would give you is, possibly, excluding other software from running on the cores on the same time
  22:02:37 morcos: well this is where i get out of my depth
  22:02:50 sipa: if one of the threads is waiting on a read from ram, the other can use the arithmetic unit for example
  22:02:54 morcos: wumpus: i'm pretty sure though that the speed up is considerably more than what you might expect from that
  22:02:59 wumpus: sipa: ok, I back down, I didn't want to argue this at all
  22:03:35 morcos: the reason i haven't tested it myself, is the machine i usually use has 16 cores... so not easy due to remaining concurrency issues to get much more speedup
  22:03:36 wumpus: I'm fine with restoring it to number of virtual threads if that's faster
  22:03:54 morcos: we should have somene with 4 cores (and  actually test it though, i agree
  22:03:58 sipa: i would expect (but we should benchmark...) that if 8 scriot validation threads instead of 4 on a quadcore hyperthreading is not faster, it's due to lock contention
  22:04:20 morcos: sipa: yeah thats my point, i think lock contention isn't that bad with 8 now
  22:04:22 wumpus: on 64-bit systems the additional thread overhead wouldn't be important at least
  22:04:23 gmaxwell: I previously benchmarked, a long time ago, it was faster.
  22:04:33 gmaxwell: (to use the HT core count)
  22:04:44 wumpus: why was this changed at all then?
  22:04:47 wumpus: I'm confused
  22:05:04 sipa: good question!
  22:05:06 gmaxwell: I had no idea we changed it.
  22:05:25 wumpus: sigh 
  22:05:54 gmaxwell: What PR changed it?
  22:06:51 gmaxwell: In any case, on 32-bit it's probably a good tradeoff... the extra ram overhead is worth avoiding.
  22:07:22 wumpus: #6361
  22:07:28 gmaxwell: PR 6461 btw.
  22:07:37 gmaxwell: er lol at least you got it right.
  22:07:45 wumpus: the complaint was that systems became unsuably slow when using that many thread
  22:07:51 wumpus: so at least I got one thing right, woohoo
  22:07:55 sipa: seems i even acked it!
  22:07:57 BlueMatt: wumpus: there are more alus
  22:08:38 BlueMatt: but we need to improve lock contention first
  22:08:40 morcos: anywya, i think in the past the lock contention made 8 threads regardless of cores a bit dicey.. now that is much better (although more still to be done)
  22:09:01 BlueMatt: or we can just merge #10192, thats fee
  22:09:04 gribble: #10192 | Cache full script execution results in addition to signatures by TheBlueMatt · Pull Request #10192 · bitcoin/bitcoin · GitHub
  22:09:11 BlueMatt: s/fee/free/
  22:09:21 morcos: no, we do not need to improve lock contention first.   but we should probably do that before we increase the max beyond 16
  22:09:26 BlueMatt: then we can toss concurrency issues out the window and get more speedup anyway
  22:09:35 gmaxwell: wumpus: yea, well in QT I thought we also diminished the count by 1 or something?  but yes, if the motivation was to reduce how heavily the machine was used, thats fair.
  22:09:56 sipa: the benefit of using HT cores is certainly not a factor 2
  22:09:58 wumpus: gmaxwell: for the default I think this makes a lot of sense, yes
  22:10:10 gmaxwell: morcos: right now on my 24/28 physical core hosts going beyond 16 still reduces performance.
  22:10:11 wumpus: gmaxwell: do we also restrict the maximum par using this? that'd make less sense
  22:10:51 wumpus: if someone *wants* to use the virtual cores they should be able to by setting -par=
  22:10:51 sipa: *flies to US*
  22:10:52 BlueMatt: sipa: sure, but the shared cache helps us get more out of it than some others, as morcos points out
  22:11:30 BlueMatt: (because it means our thread contention issues are less)
  22:12:05 morcos: gmaxwell: yeah i've been bogged down in fee estimation as well (and the rest of life) for a while now.. otherwise i would have put more effort into jeremy's checkqueue
  22:12:36 BlueMatt: morcos: heh, well now you can do other stuff while the rest of us get bogged down in understanding fee estimation enough to review it 
  22:12:37 wumpus: [to answer my own question: no, the limit for par is MAX_SCRIPTCHECK_THREADS, or 16]
  22:12:54 morcos: but to me optimizing for more than 16 cores is pretty valuable as miners could use beefy machines and be less concerned by block validation time
  22:14:38 BlueMatt: morcos: i think you may be surprised by the number of mining pools that are on VPSes that do not have 16 cores 
  22:15:34 gmaxwell: I assume right now most of the time block validation is bogged in the parts that are not as concurrent. simple because caching makes the concurrent parts so fast. (and soon to hopefully increase with bluematt's patch)
  22:17:55 gmaxwell: improving sha2 speed, or transaction malloc overhead are probably bigger wins now for connection at the tip than parallelism beyond 16 (though I'd like that too).
  22:18:21 BlueMatt: sha2 speed is big
  22:18:27 morcos: yeah lots of things to do actually...
  22:18:57 gmaxwell: BlueMatt: might be a tiny bit less big if we didn't hash the block header 8 times for every block. 
  22:21:27 BlueMatt: ehh, probably, but I'm less rushed there
  22:21:43 BlueMatt: my new cache thing is about to add a bunch of hashing
  22:21:50 BlueMatt: 1 sha round per tx
  22:22:25 BlueMatt: and sigcache is obviously a ton
  ```

Tree-SHA512: a594430e2a77d8cc741ea8c664a2867b1e1693e5050a4bbc8511e8d66a2bffe241a9965f6dff1e7fbb99f21dd1fdeb95b826365da8bd8f9fab2d0ffd80d5059c

@sickpig sickpig referenced this pull request in BitcoinUnlimited/BitcoinUnlimited Jun 19, 2018

Open

[PORT] Default to virtual cores and set appropriate thread scheduler #1151

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment