-
Notifications
You must be signed in to change notification settings - Fork 36.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cache full script execution results in addition to signatures #10192
Cache full script execution results in addition to signatures #10192
Conversation
void InitScriptExecutionCache() { | ||
// nMaxCacheSize is unsigned. If -maxsigcachesize is set to zero, | ||
// setup_bytes creates the minimum possible cache (2 elements). | ||
size_t nMaxCacheSize = std::min(std::max((int64_t)0, GetArg("-maxsigcachesize", DEFAULT_MAX_SIG_CACHE_SIZE) / 2), MAX_MAX_SIG_CACHE_SIZE) * ((size_t) 1 << 20); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the division should be outside of GetArg. Otherwise, if you specify -maxsigcachesize=32, you end up with a total of 64MiB worth of caches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good, because the division is outside of the GetArg. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems I am blind.
7a0fb5a
to
7164f97
Compare
Fixed test_bitcoin segfaulting as it didnt init the script cache as it does the sigcache. |
7164f97
to
b917cfd
Compare
Concept ACK. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concept Ack!
src/validation.cpp
Outdated
@@ -12,6 +12,7 @@ | |||
#include "consensus/consensus.h" | |||
#include "consensus/merkle.h" | |||
#include "consensus/validation.h" | |||
#include "cuckoocache.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe better to add a wrapper class around cuckoocache.h in a separate file so that you don't depend on CuckooCache internals and can replace it with something more efficient more easily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is using the public interface of CuckooCache::cache really using its "internals"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess not.
src/validation.cpp
Outdated
@@ -1356,7 +1382,32 @@ bool CheckTxInputs(const CTransaction& tx, CValidationState& state, const CCoins | |||
} | |||
}// namespace Consensus | |||
|
|||
bool CheckInputs(const CTransaction& tx, CValidationState &state, const CCoinsViewCache &inputs, bool fScriptChecks, unsigned int flags, bool cacheStore, PrecomputedTransactionData& txdata, std::vector<CScriptCheck> *pvChecks) | |||
namespace { | |||
class ShaHashSplitter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#9480 exposes this class from the sigcache, we should probably just use that rather than adding this code again in a another location.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rebased on that, removed this class.
src/validation.cpp
Outdated
// correct (ie that the transaction hash which is in tx's prevouts | ||
// properly commits to the scriptPubKey in the inputs view of that | ||
// transaction). | ||
static uint256 nonce(GetRandHash()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static in the middle of a function is messy -- add a wrapper class around cuckoocache.h
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd much much rather have a static in the function than add yet another class to compile. We need to have fewer two-line wrapper classes, not more...our memory usage is already insane.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Memory usage during compile time? Is it really that bad for an added class?
Can you at least move the static decl to the top of the function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Moved the static to the top of the function with the static CuckooCache as well.
src/validation.cpp
Outdated
static uint256 nonce(GetRandHash()); | ||
uint256 hashCacheEntry; | ||
CSHA256().Write(nonce.begin(), 32).Write(tx.GetWitnessHash().begin(), 32).Write((unsigned char*)&flags, sizeof(flags)).Finalize(hashCacheEntry.begin()); | ||
AssertLockHeld(cs_main); //TODO: Remove this requirement by making CuckooCache not require external locks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes; adding a wrapper class would do this for you. Can be a read lock.
Also if you're using this class single threaded only ever, cuckoocache could be extended to offer a version without atomics...
src/validation.cpp
Outdated
// transaction). | ||
static uint256 nonce(GetRandHash()); | ||
uint256 hashCacheEntry; | ||
CSHA256().Write(nonce.begin(), 32).Write(tx.GetWitnessHash().begin(), 32).Write((unsigned char*)&flags, sizeof(flags)).Finalize(hashCacheEntry.begin()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI You can cut the hashing overhead in half by either:
- making the nonce 64 bytes & static caching then copyinh the midstate.
- using only 19 bytes of nonce.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using 19 bytes of nonce, thats way more than enough.
src/validation.cpp
Outdated
// correct (ie that the transaction hash which is in tx's prevouts | ||
// properly commits to the scriptPubKey in the inputs view of that | ||
// transaction). | ||
static uint256 nonce(GetRandHash()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Struggling to understand how a unique nonce per cache entry vs per cache works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's static so it is per cache.
concept ACK |
b917cfd
to
9eeac54
Compare
Addressed Jeremy's comments aside from the request for a wrapper class, I think we need fewer dummy classes, not more :/. Also rebased on #9480. |
e2c9146
to
8ef8ce5
Compare
src/validation.cpp
Outdated
static uint256 nonce(GetRandHash()); | ||
uint256 hashCacheEntry; | ||
// We only use the first 19 bytes of nonce to avoid a second SHA | ||
// round - giving us 19 + 32 + 4 = 55 bytes (+ 8 + 1 = 64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Static assert on the size of the flags so the nonce gets reduced if the flags are made 64-bits in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
b2b5eba
to
bb136ee
Compare
So if I understand correctly, this makes CCoinsViewMempool, and therefore the mempool itself, consensus critical, which I would prefer to avoid. Would we still get a performance benefit if we were to include the coins being spent in the hash for the index lookup? Edited: as per offline discussion, perhaps just move CCoinsViewMemPool into validation.cpp to make it clear it's part of consensus, and sanity check the results from the mempool. |
7f9ba1c
to
1f75777
Compare
@sdaftuar I'm not convinced its a massive concern, but I went ahead and added a wrapper which checks each scriptPubKey returned by the CCoinsViewCache is the one committed to by the input's prevout hash, which I believe removes that dependancy entirely. |
src/script/sigcache.cpp
Outdated
@@ -74,7 +74,7 @@ void InitSignatureCache() | |||
{ | |||
// nMaxCacheSize is unsigned. If -maxsigcachesize is set to zero, | |||
// setup_bytes creates the minimum possible cache (2 elements). | |||
size_t nMaxCacheSize = std::min(std::max((int64_t)0, GetArg("-maxsigcachesize", DEFAULT_MAX_SIG_CACHE_SIZE)), MAX_MAX_SIG_CACHE_SIZE) * ((size_t) 1 << 20); | |||
size_t nMaxCacheSize = std::min(std::max((int64_t)0, GetArg("-maxsigcachesize", DEFAULT_MAX_SIG_CACHE_SIZE) / 2), MAX_MAX_SIG_CACHE_SIZE) * ((size_t) 1 << 20); | |||
size_t nElems = signatureCache.setup_bytes(nMaxCacheSize); | |||
LogPrintf("Using %zu MiB out of %zu requested for signature cache, able to store %zu elements\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice to update the LogPrintf here (and in the new Init) to say something about how the space is divided among multiple caches now, so users aren't confused why they aren't getting the full allocation here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, done, also updated the doc for -maxsigcachesize.
src/validation.cpp
Outdated
@@ -12,6 +12,7 @@ | |||
#include "consensus/consensus.h" | |||
#include "consensus/merkle.h" | |||
#include "consensus/validation.h" | |||
#include "cuckoocache.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess not.
Concept ACK |
src/validation.cpp
Outdated
@@ -1669,6 +1748,41 @@ class WarningBitsConditionChecker : public AbstractThresholdConditionChecker | |||
// Protected by cs_main | |||
static ThresholdConditionCache warningcache[VERSIONBITS_NUM_BITS]; | |||
|
|||
static unsigned int GetBlockScriptFlags(const CBlockIndex* pindex, const CChainParams& chainparams) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function can take Consensus::Params directly instead of CChainParams
Can we pass ThresholdConditionCache& cache
to this and move it to src/versionbits ?
I know not all flags are activated with bip9, but it seems the best place to me.
Doing so would prevent you from calling IsWitnessEnabled() and AssertLockHeld(cs_main), but do you need to?
You can do the latter from the caller like we do with VersionBitsState
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd much prefer to encapsulate the version bits stuff in GetBlockScriptFlags, and keep the diff on the smaller side. You're welcome to move things around afterwards, validation.cpp need to continue its trek towards better internal interfaces.
I changed it to take Consensus::Params.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moving it to src/versionbits.cpp would still encapsulate version bits stuff in GetBlockScriptFlags, you would just not be using the versionbitscache and cs_main globals from validation.cpp.
You will also avoid using the unnecessary wrapper IsWitnessEnabled()
(specially unnecessary in this function since you are already using VersionBitsState()
directly for csv anyway).
It is "free" to do it now vs moving it in the future (if that ever gets the review). It may not be important to your approach to libconsensus, but it is to mine. I've been trying to do this for a long time, @CodeShark tried something similar as preparation to bip9 too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont think it makes any sense to move this entire function into src/versionbits.cpp? This function has stuff for BIP 65, 66 and 16, none of which are versionbits-activated? Moving all soft forks into a separate thing is definitely a separate issue from this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noted that at the beginning "I know not all flags are activated with bip9", but I still think it's the best place for this new function (even if some lines of the function don't rely on versionbits, they also don't rely on anything that versionbits didn't already rely on).Doing a separate module only for this single function seems stupid to me.
Anyway, as disappointing as your rejection to move the new function for free may be to me, let's move on.
Can you at least avoid the use of globals you don't need to use, ie versionbitscache and cs_main (both used directly and also indirectly through IsWitnessEnabled())?
That way, if we ever move this function out of validation (like all previous attempts at creating it did) it can be a clean moveonly commit (whether it goes to versionbits.o or somewhere else) instead of needing to change the function signature.
src/validation.cpp
Outdated
@@ -451,6 +451,8 @@ int64_t GetTransactionSigOpCost(const CTransaction& tx, const CCoinsViewCache& i | |||
return nSigOps; | |||
} | |||
|
|||
// Returns the script flags which should be checked for a given block | |||
static unsigned int GetBlockScriptFlags(const CBlockIndex* pindex, const CChainParams& chainparams); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why a separated declaration if it's going to be static?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a forward declaration. As to why the definition isn't just there, I'm not sure, maybe GetBlockScriptFlags also has some dependency that would need a forward decl.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason its down further is that I didnt want to move it up with the mempool stuff in validation.cpp, but its called from ATMP, so needed the forward declaration. Hopefully this stuff gets ironed out as we move towards better internal interfaces (see #10279).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are creating the function as new, it won't be any more disruptive to avoid the forward declaration. Anyway, my preference would be to move it to versionbits.o which is also ignored as "something that can be ironed out later", so whatever.
src/validation.cpp
Outdated
@@ -922,10 +922,15 @@ bool AcceptToMemoryPoolWorker(CTxMemPool& pool, CValidationState& state, const C | |||
// There is a similar check in CreateNewBlock() to prevent creating | |||
// invalid blocks (using TestBlockValidity), however allowing such | |||
// transactions into the mempool can be exploited as a DoS attack. | |||
if (!CheckInputs(tx, state, view, true, GetBlockScriptFlags(chainActive.Tip(), Params()), true, true, txdata)) | |||
unsigned int currentBlockScriptVerifyFlags = GetBlockScriptFlags(chainActive.Tip(), Params()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
history bike-sheeding currentBlockScriptVerifyFlags should be part of the previous commit, not this one (and for my taste this commit could be an independent PR).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its needed in this PR because its much more likely to trigger after the changes in this PR, though still unsupported. I'm happy to make it a PR on its own if you think its really worth it, but it seems to me this PR is still of very-reviewable-size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You misunderstood. Creating a currentBlockScriptVerifyFlags local variable just makes the next line simpler, which is fine, but could have been done in "Cache full script execution results in addition to signatures" directly which already touches those lines instead of "Do not print soft-fork-script warning with -promiscuousmempool", which would be clearer without that extra bikeshedding +2-1.
Anyway, as said, history bikeshedding, not a big deal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
src/validation.cpp
Outdated
@@ -923,7 +955,7 @@ bool AcceptToMemoryPoolWorker(CTxMemPool& pool, CValidationState& state, const C | |||
// invalid blocks (using TestBlockValidity), however allowing such | |||
// transactions into the mempool can be exploited as a DoS attack. | |||
unsigned int currentBlockScriptVerifyFlags = GetBlockScriptFlags(chainActive.Tip(), Params()); | |||
if (!CheckInputs(tx, state, view, true, currentBlockScriptVerifyFlags, true, true, txdata)) | |||
if (!CheckInputsFromMempoolAndCache(tx, state, view, pool, currentBlockScriptVerifyFlags, true, txdata)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps everything that is below this case could be moved to CheckInputsFromMempoolAndCache too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont think CheckInputsFromMempoolAndCache should make assumptions about having called CheckInputs already prior to this call? Maybe I'm confused as to what you're suggesting here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean moving the following inside the function too:
{
// If we're using promiscuousmempoolflags, we may hit this normally
// Check if current block has some flags that scriptVerifyFlags
// does not before printing an ominous warning
if (!(~scriptVerifyFlags & currentBlockScriptVerifyFlags))
return error("%s: BUG! PLEASE REPORT THIS! ConnectInputs failed against MANDATORY but not STANDARD flags %s, %s",
__func__, hash.ToString(), FormatStateMessage(state));
}
Then here you would just call:
if (!CheckInputsFromMempoolAndCache(tx, state, view, pool, scriptVerifyFlags, currentBlockScriptVerifyFlags, true, txdata)) {
return false;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps even put the 2 calls to checkinputs in it, from 061d06f#diff-24efdb00bfbe56b140fb006b562cc70bR889
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But that logging implicitly makes an assumption about the fact that you've already called CheckInputs prior to the CheckInputsFromMempoolAndCache call, which I'd prefer not to do. CheckInputsFromMempoolAndCache only does some additional mempool-shouldnt-pollute-cache checks, and then calls CheckInputs.
f8ceb94
to
29561bb
Compare
@morcos I added an additional commit to only pass the scriptPubKey and nValue from the prevout into CScriptCheck, so hopefully any such future changes would be super clear to reviewers as consensus bugs. Sadly I dont really want to just include the height in the hash, as there are many heights, but I think this is a sufficient change. |
316d328
to
e3f9c05
Compare
Replaced test with one by @sdaftuar which is much better. Should be good to go now. |
ACK e3f9c05 |
// CheckInputs should succeed iff test_flags doesn't intersect with | ||
// failing_flags | ||
bool expected_return_value = !(test_flags & failing_flags); | ||
if (expected_return_value && upgraded_nop) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic may become unnecessary with #10699.
…atures e3f9c05 Add CheckInputs() unit tests (Suhas Daftuar) a3543af Better document CheckInputs parameter meanings (Matt Corallo) 309ee1a Update -maxsigcachesize doc clarify init logprints for it (Matt Corallo) b014668 Add CheckInputs wrapper CCoinsViewMemPool -> non-consensus-critical (Matt Corallo) eada04e Do not print soft-fork-script warning with -promiscuousmempool (Matt Corallo) b5fea8d Cache full script execution results in addition to signatures (Matt Corallo) 6d22b2b Pull script verify flags calculation out of ConnectBlock (Matt Corallo) Tree-SHA512: 0c6c3c79c64fcb21e17ab60290c5c96d4fac11624c49f841a4201eec21cb480314c52a07d1e3abd4f9c764785cc57bfd178511f495aa0469addb204e96214fe4
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reACK w/ new tests
posthumous utACK Thanks for doing this. |
probably wrong place to have this conversation, but for the sake of continuity...
Yeah I've (asynchronously) been thinking about this one a little bit in spare cycles. There are a couple tricks on could do to easily increase frequency, e.g. add a insert_n_times to the cuckoocache where you can specify that it should be put onto 1-8 of it's hash locations. This makes the hash "stickier" in memory compared to other entries and is pretty easy to compute. Generally, this also provides a really interesting mechanism for further tuning the hit rate based on other priors. E.g., if a txn comes in and we estimate that it has a 0.5 chance of entering a block, we could set it to have, say, half of it's hash values inserted onto. If a txn comes in and we estimate that it has 0.25 we could fill a quarter of them. This can also be simulated by inserting the hash N times with different salts or something without modifying the cuckoocache internally. |
5618b7d Do not shadow upper local variable `state`. (Pavel Janík) Pull request description: Tests added in #10192 emit few shadowing warnings: ``` test/txvalidationcache_tests.cpp:268:26: warning: declaration shadows a local variable [-Wshadow] test/txvalidationcache_tests.cpp:296:26: warning: declaration shadows a local variable [-Wshadow] test/txvalidationcache_tests.cpp:357:26: warning: declaration shadows a local variable [-Wshadow] ``` Remove shadowing declarations and reuse the upper local declaration as in other already present test cases. Tree-SHA512: 1e3c52cf963f8f33e729900c8ecdcd5cc6fe28caa441ba53c4636df9cc3d1a351ca231966d36384589f1340ae8ddd447424c2ee3e8527d334d0412f0d1a10c8f
…, to determine available cores 937bf43 Use std::thread::hardware_concurrency, instead of Boost, to determine available cores (fanquake) Pull request description: Following discussion on IRC about replacing Boost usage for detecting available system cores, I've opened this to collect some benchmarks + further discussion. The current method for detecting available cores was introduced in #6361. Recap of the IRC chat: ``` 21:14:08 fanquake: Since we seem to be giving Boost removal a good shot for 0.15, does anyone have suggestions for replacing GetNumCores? 21:14:26 fanquake: There is std::thread::hardware_concurrency(), but that seems to count virtual cores, which I don't think we want. 21:14:51 BlueMatt: fanquake: I doubt we'll do boost removal for 0.15 21:14:58 BlueMatt: shit like BOOST_FOREACH, sure 21:15:07 BlueMatt: but all of boost? doubtful, there are still things we need 21:16:36 fanquake: Yea sorry, not the whole lot, but we can remove a decent chunk. Just looking into what else needs to be done to replace some of the less involved Boost usage. 21:16:43 BlueMatt: fair 21:17:14 wumpus: yes, it makes sense to plan ahead a bit, without immediately doing it 21:18:12 wumpus: right, don't count virtual cores, that used to be the case but it makes no sense for our usage 21:19:15 wumpus: it'd create a swarm of threads overwhelming any machine with hyperthreading (+accompanying thread stack overhead), for script validation, and there was no gain at all for that 21:20:03 sipa: BlueMatt: don't worry, there is no hurry 21:59:10 morcos: wumpus: i don't think that is correct 21:59:24 morcos: suppose you have 4 cores (8 virtual cores) 21:59:24 wumpus: fanquake: indeed seems that std has no equivalent to physical_concurrency, on any standard. That's annoying as it is non-trivial to implement 21:59:35 morcos: i think running par=8 (if it let you) would be notably faster 21:59:59 morcos: jeremyrubin and i discussed this at length a while back... i think i commented about it on irc at the time 22:00:21 wumpus: morcos: I think the conclusion at the time was that it made no difference, but sure would make sense to benchmark 22:00:39 morcos: perhaps historical testing on the virtual vs actual cores was polluted by concurrency issues that have now improved 22:00:47 wumpus: I think there are not more ALUs, so there is not really a point in having more threads 22:01:40 wumpus: hyperthreads are basically just a stored register state right? 22:02:23 sipa: wumpus: yes but it helps the scheduler 22:02:27 wumpus: in which case the only speedup using "number of cores" threads would give you is, possibly, excluding other software from running on the cores on the same time 22:02:37 morcos: well this is where i get out of my depth 22:02:50 sipa: if one of the threads is waiting on a read from ram, the other can use the arithmetic unit for example 22:02:54 morcos: wumpus: i'm pretty sure though that the speed up is considerably more than what you might expect from that 22:02:59 wumpus: sipa: ok, I back down, I didn't want to argue this at all 22:03:35 morcos: the reason i haven't tested it myself, is the machine i usually use has 16 cores... so not easy due to remaining concurrency issues to get much more speedup 22:03:36 wumpus: I'm fine with restoring it to number of virtual threads if that's faster 22:03:54 morcos: we should have somene with 4 cores (and  actually test it though, i agree 22:03:58 sipa: i would expect (but we should benchmark...) that if 8 scriot validation threads instead of 4 on a quadcore hyperthreading is not faster, it's due to lock contention 22:04:20 morcos: sipa: yeah thats my point, i think lock contention isn't that bad with 8 now 22:04:22 wumpus: on 64-bit systems the additional thread overhead wouldn't be important at least 22:04:23 gmaxwell: I previously benchmarked, a long time ago, it was faster. 22:04:33 gmaxwell: (to use the HT core count) 22:04:44 wumpus: why was this changed at all then? 22:04:47 wumpus: I'm confused 22:05:04 sipa: good question! 22:05:06 gmaxwell: I had no idea we changed it. 22:05:25 wumpus: sigh  22:05:54 gmaxwell: What PR changed it? 22:06:51 gmaxwell: In any case, on 32-bit it's probably a good tradeoff... the extra ram overhead is worth avoiding. 22:07:22 wumpus: #6361 22:07:28 gmaxwell: PR 6461 btw. 22:07:37 gmaxwell: er lol at least you got it right. 22:07:45 wumpus: the complaint was that systems became unsuably slow when using that many thread 22:07:51 wumpus: so at least I got one thing right, woohoo 22:07:55 sipa: seems i even acked it! 22:07:57 BlueMatt: wumpus: there are more alus 22:08:38 BlueMatt: but we need to improve lock contention first 22:08:40 morcos: anywya, i think in the past the lock contention made 8 threads regardless of cores a bit dicey.. now that is much better (although more still to be done) 22:09:01 BlueMatt: or we can just merge #10192, thats fee 22:09:04 gribble: #10192 | Cache full script execution results in addition to signatures by TheBlueMatt · Pull Request #10192 · bitcoin/bitcoin · GitHub 22:09:11 BlueMatt: s/fee/free/ 22:09:21 morcos: no, we do not need to improve lock contention first. but we should probably do that before we increase the max beyond 16 22:09:26 BlueMatt: then we can toss concurrency issues out the window and get more speedup anyway 22:09:35 gmaxwell: wumpus: yea, well in QT I thought we also diminished the count by 1 or something? but yes, if the motivation was to reduce how heavily the machine was used, thats fair. 22:09:56 sipa: the benefit of using HT cores is certainly not a factor 2 22:09:58 wumpus: gmaxwell: for the default I think this makes a lot of sense, yes 22:10:10 gmaxwell: morcos: right now on my 24/28 physical core hosts going beyond 16 still reduces performance. 22:10:11 wumpus: gmaxwell: do we also restrict the maximum par using this? that'd make less sense 22:10:51 wumpus: if someone *wants* to use the virtual cores they should be able to by setting -par= 22:10:51 sipa: *flies to US* 22:10:52 BlueMatt: sipa: sure, but the shared cache helps us get more out of it than some others, as morcos points out 22:11:30 BlueMatt: (because it means our thread contention issues are less) 22:12:05 morcos: gmaxwell: yeah i've been bogged down in fee estimation as well (and the rest of life) for a while now.. otherwise i would have put more effort into jeremy's checkqueue 22:12:36 BlueMatt: morcos: heh, well now you can do other stuff while the rest of us get bogged down in understanding fee estimation enough to review it  22:12:37 wumpus: [to answer my own question: no, the limit for par is MAX_SCRIPTCHECK_THREADS, or 16] 22:12:54 morcos: but to me optimizing for more than 16 cores is pretty valuable as miners could use beefy machines and be less concerned by block validation time 22:14:38 BlueMatt: morcos: i think you may be surprised by the number of mining pools that are on VPSes that do not have 16 cores  22:15:34 gmaxwell: I assume right now most of the time block validation is bogged in the parts that are not as concurrent. simple because caching makes the concurrent parts so fast. (and soon to hopefully increase with bluematt's patch) 22:17:55 gmaxwell: improving sha2 speed, or transaction malloc overhead are probably bigger wins now for connection at the tip than parallelism beyond 16 (though I'd like that too). 22:18:21 BlueMatt: sha2 speed is big 22:18:27 morcos: yeah lots of things to do actually... 22:18:57 gmaxwell: BlueMatt: might be a tiny bit less big if we didn't hash the block header 8 times for every block.  22:21:27 BlueMatt: ehh, probably, but I'm less rushed there 22:21:43 BlueMatt: my new cache thing is about to add a bunch of hashing 22:21:50 BlueMatt: 1 sha round per tx 22:22:25 BlueMatt: and sigcache is obviously a ton ``` Tree-SHA512: a594430e2a77d8cc741ea8c664a2867b1e1693e5050a4bbc8511e8d66a2bffe241a9965f6dff1e7fbb99f21dd1fdeb95b826365da8bd8f9fab2d0ffd80d5059c
…to signatures e3f9c05 Add CheckInputs() unit tests (Suhas Daftuar) a3543af Better document CheckInputs parameter meanings (Matt Corallo) 309ee1a Update -maxsigcachesize doc clarify init logprints for it (Matt Corallo) b014668 Add CheckInputs wrapper CCoinsViewMemPool -> non-consensus-critical (Matt Corallo) eada04e Do not print soft-fork-script warning with -promiscuousmempool (Matt Corallo) b5fea8d Cache full script execution results in addition to signatures (Matt Corallo) 6d22b2b Pull script verify flags calculation out of ConnectBlock (Matt Corallo) Tree-SHA512: 0c6c3c79c64fcb21e17ab60290c5c96d4fac11624c49f841a4201eec21cb480314c52a07d1e3abd4f9c764785cc57bfd178511f495aa0469addb204e96214fe4
…to signatures e3f9c05 Add CheckInputs() unit tests (Suhas Daftuar) a3543af Better document CheckInputs parameter meanings (Matt Corallo) 309ee1a Update -maxsigcachesize doc clarify init logprints for it (Matt Corallo) b014668 Add CheckInputs wrapper CCoinsViewMemPool -> non-consensus-critical (Matt Corallo) eada04e Do not print soft-fork-script warning with -promiscuousmempool (Matt Corallo) b5fea8d Cache full script execution results in addition to signatures (Matt Corallo) 6d22b2b Pull script verify flags calculation out of ConnectBlock (Matt Corallo) Tree-SHA512: 0c6c3c79c64fcb21e17ab60290c5c96d4fac11624c49f841a4201eec21cb480314c52a07d1e3abd4f9c764785cc57bfd178511f495aa0469addb204e96214fe4
…f Boost, to determine available cores 937bf43 Use std::thread::hardware_concurrency, instead of Boost, to determine available cores (fanquake) Pull request description: Following discussion on IRC about replacing Boost usage for detecting available system cores, I've opened this to collect some benchmarks + further discussion. The current method for detecting available cores was introduced in dashpay#6361. Recap of the IRC chat: ``` 21:14:08 fanquake: Since we seem to be giving Boost removal a good shot for 0.15, does anyone have suggestions for replacing GetNumCores? 21:14:26 fanquake: There is std::thread::hardware_concurrency(), but that seems to count virtual cores, which I don't think we want. 21:14:51 BlueMatt: fanquake: I doubt we'll do boost removal for 0.15 21:14:58 BlueMatt: shit like BOOST_FOREACH, sure 21:15:07 BlueMatt: but all of boost? doubtful, there are still things we need 21:16:36 fanquake: Yea sorry, not the whole lot, but we can remove a decent chunk. Just looking into what else needs to be done to replace some of the less involved Boost usage. 21:16:43 BlueMatt: fair 21:17:14 wumpus: yes, it makes sense to plan ahead a bit, without immediately doing it 21:18:12 wumpus: right, don't count virtual cores, that used to be the case but it makes no sense for our usage 21:19:15 wumpus: it'd create a swarm of threads overwhelming any machine with hyperthreading (+accompanying thread stack overhead), for script validation, and there was no gain at all for that 21:20:03 sipa: BlueMatt: don't worry, there is no hurry 21:59:10 morcos: wumpus: i don't think that is correct 21:59:24 morcos: suppose you have 4 cores (8 virtual cores) 21:59:24 wumpus: fanquake: indeed seems that std has no equivalent to physical_concurrency, on any standard. That's annoying as it is non-trivial to implement 21:59:35 morcos: i think running par=8 (if it let you) would be notably faster 21:59:59 morcos: jeremyrubin and i discussed this at length a while back... i think i commented about it on irc at the time 22:00:21 wumpus: morcos: I think the conclusion at the time was that it made no difference, but sure would make sense to benchmark 22:00:39 morcos: perhaps historical testing on the virtual vs actual cores was polluted by concurrency issues that have now improved 22:00:47 wumpus: I think there are not more ALUs, so there is not really a point in having more threads 22:01:40 wumpus: hyperthreads are basically just a stored register state right? 22:02:23 sipa: wumpus: yes but it helps the scheduler 22:02:27 wumpus: in which case the only speedup using "number of cores" threads would give you is, possibly, excluding other software from running on the cores on the same time 22:02:37 morcos: well this is where i get out of my depth 22:02:50 sipa: if one of the threads is waiting on a read from ram, the other can use the arithmetic unit for example 22:02:54 morcos: wumpus: i'm pretty sure though that the speed up is considerably more than what you might expect from that 22:02:59 wumpus: sipa: ok, I back down, I didn't want to argue this at all 22:03:35 morcos: the reason i haven't tested it myself, is the machine i usually use has 16 cores... so not easy due to remaining concurrency issues to get much more speedup 22:03:36 wumpus: I'm fine with restoring it to number of virtual threads if that's faster 22:03:54 morcos: we should have somene with 4 cores (and  actually test it though, i agree 22:03:58 sipa: i would expect (but we should benchmark...) that if 8 scriot validation threads instead of 4 on a quadcore hyperthreading is not faster, it's due to lock contention 22:04:20 morcos: sipa: yeah thats my point, i think lock contention isn't that bad with 8 now 22:04:22 wumpus: on 64-bit systems the additional thread overhead wouldn't be important at least 22:04:23 gmaxwell: I previously benchmarked, a long time ago, it was faster. 22:04:33 gmaxwell: (to use the HT core count) 22:04:44 wumpus: why was this changed at all then? 22:04:47 wumpus: I'm confused 22:05:04 sipa: good question! 22:05:06 gmaxwell: I had no idea we changed it. 22:05:25 wumpus: sigh  22:05:54 gmaxwell: What PR changed it? 22:06:51 gmaxwell: In any case, on 32-bit it's probably a good tradeoff... the extra ram overhead is worth avoiding. 22:07:22 wumpus: bitcoin#6361 22:07:28 gmaxwell: PR 6461 btw. 22:07:37 gmaxwell: er lol at least you got it right. 22:07:45 wumpus: the complaint was that systems became unsuably slow when using that many thread 22:07:51 wumpus: so at least I got one thing right, woohoo 22:07:55 sipa: seems i even acked it! 22:07:57 BlueMatt: wumpus: there are more alus 22:08:38 BlueMatt: but we need to improve lock contention first 22:08:40 morcos: anywya, i think in the past the lock contention made 8 threads regardless of cores a bit dicey.. now that is much better (although more still to be done) 22:09:01 BlueMatt: or we can just merge bitcoin#10192, thats fee 22:09:04 gribble: bitcoin#10192 | Cache full script execution results in addition to signatures by TheBlueMatt · Pull Request bitcoin#10192 · bitcoin/bitcoin · GitHub 22:09:11 BlueMatt: s/fee/free/ 22:09:21 morcos: no, we do not need to improve lock contention first. but we should probably do that before we increase the max beyond 16 22:09:26 BlueMatt: then we can toss concurrency issues out the window and get more speedup anyway 22:09:35 gmaxwell: wumpus: yea, well in QT I thought we also diminished the count by 1 or something? but yes, if the motivation was to reduce how heavily the machine was used, thats fair. 22:09:56 sipa: the benefit of using HT cores is certainly not a factor 2 22:09:58 wumpus: gmaxwell: for the default I think this makes a lot of sense, yes 22:10:10 gmaxwell: morcos: right now on my 24/28 physical core hosts going beyond 16 still reduces performance. 22:10:11 wumpus: gmaxwell: do we also restrict the maximum par using this? that'd make less sense 22:10:51 wumpus: if someone *wants* to use the virtual cores they should be able to by setting -par= 22:10:51 sipa: *flies to US* 22:10:52 BlueMatt: sipa: sure, but the shared cache helps us get more out of it than some others, as morcos points out 22:11:30 BlueMatt: (because it means our thread contention issues are less) 22:12:05 morcos: gmaxwell: yeah i've been bogged down in fee estimation as well (and the rest of life) for a while now.. otherwise i would have put more effort into jeremy's checkqueue 22:12:36 BlueMatt: morcos: heh, well now you can do other stuff while the rest of us get bogged down in understanding fee estimation enough to review it  22:12:37 wumpus: [to answer my own question: no, the limit for par is MAX_SCRIPTCHECK_THREADS, or 16] 22:12:54 morcos: but to me optimizing for more than 16 cores is pretty valuable as miners could use beefy machines and be less concerned by block validation time 22:14:38 BlueMatt: morcos: i think you may be surprised by the number of mining pools that are on VPSes that do not have 16 cores  22:15:34 gmaxwell: I assume right now most of the time block validation is bogged in the parts that are not as concurrent. simple because caching makes the concurrent parts so fast. (and soon to hopefully increase with bluematt's patch) 22:17:55 gmaxwell: improving sha2 speed, or transaction malloc overhead are probably bigger wins now for connection at the tip than parallelism beyond 16 (though I'd like that too). 22:18:21 BlueMatt: sha2 speed is big 22:18:27 morcos: yeah lots of things to do actually... 22:18:57 gmaxwell: BlueMatt: might be a tiny bit less big if we didn't hash the block header 8 times for every block.  22:21:27 BlueMatt: ehh, probably, but I'm less rushed there 22:21:43 BlueMatt: my new cache thing is about to add a bunch of hashing 22:21:50 BlueMatt: 1 sha round per tx 22:22:25 BlueMatt: and sigcache is obviously a ton ``` Tree-SHA512: a594430e2a77d8cc741ea8c664a2867b1e1693e5050a4bbc8511e8d66a2bffe241a9965f6dff1e7fbb99f21dd1fdeb95b826365da8bd8f9fab2d0ffd80d5059c
…f Boost, to determine available cores 937bf43 Use std::thread::hardware_concurrency, instead of Boost, to determine available cores (fanquake) Pull request description: Following discussion on IRC about replacing Boost usage for detecting available system cores, I've opened this to collect some benchmarks + further discussion. The current method for detecting available cores was introduced in dashpay#6361. Recap of the IRC chat: ``` 21:14:08 fanquake: Since we seem to be giving Boost removal a good shot for 0.15, does anyone have suggestions for replacing GetNumCores? 21:14:26 fanquake: There is std::thread::hardware_concurrency(), but that seems to count virtual cores, which I don't think we want. 21:14:51 BlueMatt: fanquake: I doubt we'll do boost removal for 0.15 21:14:58 BlueMatt: shit like BOOST_FOREACH, sure 21:15:07 BlueMatt: but all of boost? doubtful, there are still things we need 21:16:36 fanquake: Yea sorry, not the whole lot, but we can remove a decent chunk. Just looking into what else needs to be done to replace some of the less involved Boost usage. 21:16:43 BlueMatt: fair 21:17:14 wumpus: yes, it makes sense to plan ahead a bit, without immediately doing it 21:18:12 wumpus: right, don't count virtual cores, that used to be the case but it makes no sense for our usage 21:19:15 wumpus: it'd create a swarm of threads overwhelming any machine with hyperthreading (+accompanying thread stack overhead), for script validation, and there was no gain at all for that 21:20:03 sipa: BlueMatt: don't worry, there is no hurry 21:59:10 morcos: wumpus: i don't think that is correct 21:59:24 morcos: suppose you have 4 cores (8 virtual cores) 21:59:24 wumpus: fanquake: indeed seems that std has no equivalent to physical_concurrency, on any standard. That's annoying as it is non-trivial to implement 21:59:35 morcos: i think running par=8 (if it let you) would be notably faster 21:59:59 morcos: jeremyrubin and i discussed this at length a while back... i think i commented about it on irc at the time 22:00:21 wumpus: morcos: I think the conclusion at the time was that it made no difference, but sure would make sense to benchmark 22:00:39 morcos: perhaps historical testing on the virtual vs actual cores was polluted by concurrency issues that have now improved 22:00:47 wumpus: I think there are not more ALUs, so there is not really a point in having more threads 22:01:40 wumpus: hyperthreads are basically just a stored register state right? 22:02:23 sipa: wumpus: yes but it helps the scheduler 22:02:27 wumpus: in which case the only speedup using "number of cores" threads would give you is, possibly, excluding other software from running on the cores on the same time 22:02:37 morcos: well this is where i get out of my depth 22:02:50 sipa: if one of the threads is waiting on a read from ram, the other can use the arithmetic unit for example 22:02:54 morcos: wumpus: i'm pretty sure though that the speed up is considerably more than what you might expect from that 22:02:59 wumpus: sipa: ok, I back down, I didn't want to argue this at all 22:03:35 morcos: the reason i haven't tested it myself, is the machine i usually use has 16 cores... so not easy due to remaining concurrency issues to get much more speedup 22:03:36 wumpus: I'm fine with restoring it to number of virtual threads if that's faster 22:03:54 morcos: we should have somene with 4 cores (and  actually test it though, i agree 22:03:58 sipa: i would expect (but we should benchmark...) that if 8 scriot validation threads instead of 4 on a quadcore hyperthreading is not faster, it's due to lock contention 22:04:20 morcos: sipa: yeah thats my point, i think lock contention isn't that bad with 8 now 22:04:22 wumpus: on 64-bit systems the additional thread overhead wouldn't be important at least 22:04:23 gmaxwell: I previously benchmarked, a long time ago, it was faster. 22:04:33 gmaxwell: (to use the HT core count) 22:04:44 wumpus: why was this changed at all then? 22:04:47 wumpus: I'm confused 22:05:04 sipa: good question! 22:05:06 gmaxwell: I had no idea we changed it. 22:05:25 wumpus: sigh  22:05:54 gmaxwell: What PR changed it? 22:06:51 gmaxwell: In any case, on 32-bit it's probably a good tradeoff... the extra ram overhead is worth avoiding. 22:07:22 wumpus: bitcoin#6361 22:07:28 gmaxwell: PR 6461 btw. 22:07:37 gmaxwell: er lol at least you got it right. 22:07:45 wumpus: the complaint was that systems became unsuably slow when using that many thread 22:07:51 wumpus: so at least I got one thing right, woohoo 22:07:55 sipa: seems i even acked it! 22:07:57 BlueMatt: wumpus: there are more alus 22:08:38 BlueMatt: but we need to improve lock contention first 22:08:40 morcos: anywya, i think in the past the lock contention made 8 threads regardless of cores a bit dicey.. now that is much better (although more still to be done) 22:09:01 BlueMatt: or we can just merge bitcoin#10192, thats fee 22:09:04 gribble: bitcoin#10192 | Cache full script execution results in addition to signatures by TheBlueMatt · Pull Request bitcoin#10192 · bitcoin/bitcoin · GitHub 22:09:11 BlueMatt: s/fee/free/ 22:09:21 morcos: no, we do not need to improve lock contention first. but we should probably do that before we increase the max beyond 16 22:09:26 BlueMatt: then we can toss concurrency issues out the window and get more speedup anyway 22:09:35 gmaxwell: wumpus: yea, well in QT I thought we also diminished the count by 1 or something? but yes, if the motivation was to reduce how heavily the machine was used, thats fair. 22:09:56 sipa: the benefit of using HT cores is certainly not a factor 2 22:09:58 wumpus: gmaxwell: for the default I think this makes a lot of sense, yes 22:10:10 gmaxwell: morcos: right now on my 24/28 physical core hosts going beyond 16 still reduces performance. 22:10:11 wumpus: gmaxwell: do we also restrict the maximum par using this? that'd make less sense 22:10:51 wumpus: if someone *wants* to use the virtual cores they should be able to by setting -par= 22:10:51 sipa: *flies to US* 22:10:52 BlueMatt: sipa: sure, but the shared cache helps us get more out of it than some others, as morcos points out 22:11:30 BlueMatt: (because it means our thread contention issues are less) 22:12:05 morcos: gmaxwell: yeah i've been bogged down in fee estimation as well (and the rest of life) for a while now.. otherwise i would have put more effort into jeremy's checkqueue 22:12:36 BlueMatt: morcos: heh, well now you can do other stuff while the rest of us get bogged down in understanding fee estimation enough to review it  22:12:37 wumpus: [to answer my own question: no, the limit for par is MAX_SCRIPTCHECK_THREADS, or 16] 22:12:54 morcos: but to me optimizing for more than 16 cores is pretty valuable as miners could use beefy machines and be less concerned by block validation time 22:14:38 BlueMatt: morcos: i think you may be surprised by the number of mining pools that are on VPSes that do not have 16 cores  22:15:34 gmaxwell: I assume right now most of the time block validation is bogged in the parts that are not as concurrent. simple because caching makes the concurrent parts so fast. (and soon to hopefully increase with bluematt's patch) 22:17:55 gmaxwell: improving sha2 speed, or transaction malloc overhead are probably bigger wins now for connection at the tip than parallelism beyond 16 (though I'd like that too). 22:18:21 BlueMatt: sha2 speed is big 22:18:27 morcos: yeah lots of things to do actually... 22:18:57 gmaxwell: BlueMatt: might be a tiny bit less big if we didn't hash the block header 8 times for every block.  22:21:27 BlueMatt: ehh, probably, but I'm less rushed there 22:21:43 BlueMatt: my new cache thing is about to add a bunch of hashing 22:21:50 BlueMatt: 1 sha round per tx 22:22:25 BlueMatt: and sigcache is obviously a ton ``` Tree-SHA512: a594430e2a77d8cc741ea8c664a2867b1e1693e5050a4bbc8511e8d66a2bffe241a9965f6dff1e7fbb99f21dd1fdeb95b826365da8bd8f9fab2d0ffd80d5059c
…f Boost, to determine available cores 937bf43 Use std::thread::hardware_concurrency, instead of Boost, to determine available cores (fanquake) Pull request description: Following discussion on IRC about replacing Boost usage for detecting available system cores, I've opened this to collect some benchmarks + further discussion. The current method for detecting available cores was introduced in dashpay#6361. Recap of the IRC chat: ``` 21:14:08 fanquake: Since we seem to be giving Boost removal a good shot for 0.15, does anyone have suggestions for replacing GetNumCores? 21:14:26 fanquake: There is std::thread::hardware_concurrency(), but that seems to count virtual cores, which I don't think we want. 21:14:51 BlueMatt: fanquake: I doubt we'll do boost removal for 0.15 21:14:58 BlueMatt: shit like BOOST_FOREACH, sure 21:15:07 BlueMatt: but all of boost? doubtful, there are still things we need 21:16:36 fanquake: Yea sorry, not the whole lot, but we can remove a decent chunk. Just looking into what else needs to be done to replace some of the less involved Boost usage. 21:16:43 BlueMatt: fair 21:17:14 wumpus: yes, it makes sense to plan ahead a bit, without immediately doing it 21:18:12 wumpus: right, don't count virtual cores, that used to be the case but it makes no sense for our usage 21:19:15 wumpus: it'd create a swarm of threads overwhelming any machine with hyperthreading (+accompanying thread stack overhead), for script validation, and there was no gain at all for that 21:20:03 sipa: BlueMatt: don't worry, there is no hurry 21:59:10 morcos: wumpus: i don't think that is correct 21:59:24 morcos: suppose you have 4 cores (8 virtual cores) 21:59:24 wumpus: fanquake: indeed seems that std has no equivalent to physical_concurrency, on any standard. That's annoying as it is non-trivial to implement 21:59:35 morcos: i think running par=8 (if it let you) would be notably faster 21:59:59 morcos: jeremyrubin and i discussed this at length a while back... i think i commented about it on irc at the time 22:00:21 wumpus: morcos: I think the conclusion at the time was that it made no difference, but sure would make sense to benchmark 22:00:39 morcos: perhaps historical testing on the virtual vs actual cores was polluted by concurrency issues that have now improved 22:00:47 wumpus: I think there are not more ALUs, so there is not really a point in having more threads 22:01:40 wumpus: hyperthreads are basically just a stored register state right? 22:02:23 sipa: wumpus: yes but it helps the scheduler 22:02:27 wumpus: in which case the only speedup using "number of cores" threads would give you is, possibly, excluding other software from running on the cores on the same time 22:02:37 morcos: well this is where i get out of my depth 22:02:50 sipa: if one of the threads is waiting on a read from ram, the other can use the arithmetic unit for example 22:02:54 morcos: wumpus: i'm pretty sure though that the speed up is considerably more than what you might expect from that 22:02:59 wumpus: sipa: ok, I back down, I didn't want to argue this at all 22:03:35 morcos: the reason i haven't tested it myself, is the machine i usually use has 16 cores... so not easy due to remaining concurrency issues to get much more speedup 22:03:36 wumpus: I'm fine with restoring it to number of virtual threads if that's faster 22:03:54 morcos: we should have somene with 4 cores (and  actually test it though, i agree 22:03:58 sipa: i would expect (but we should benchmark...) that if 8 scriot validation threads instead of 4 on a quadcore hyperthreading is not faster, it's due to lock contention 22:04:20 morcos: sipa: yeah thats my point, i think lock contention isn't that bad with 8 now 22:04:22 wumpus: on 64-bit systems the additional thread overhead wouldn't be important at least 22:04:23 gmaxwell: I previously benchmarked, a long time ago, it was faster. 22:04:33 gmaxwell: (to use the HT core count) 22:04:44 wumpus: why was this changed at all then? 22:04:47 wumpus: I'm confused 22:05:04 sipa: good question! 22:05:06 gmaxwell: I had no idea we changed it. 22:05:25 wumpus: sigh  22:05:54 gmaxwell: What PR changed it? 22:06:51 gmaxwell: In any case, on 32-bit it's probably a good tradeoff... the extra ram overhead is worth avoiding. 22:07:22 wumpus: bitcoin#6361 22:07:28 gmaxwell: PR 6461 btw. 22:07:37 gmaxwell: er lol at least you got it right. 22:07:45 wumpus: the complaint was that systems became unsuably slow when using that many thread 22:07:51 wumpus: so at least I got one thing right, woohoo 22:07:55 sipa: seems i even acked it! 22:07:57 BlueMatt: wumpus: there are more alus 22:08:38 BlueMatt: but we need to improve lock contention first 22:08:40 morcos: anywya, i think in the past the lock contention made 8 threads regardless of cores a bit dicey.. now that is much better (although more still to be done) 22:09:01 BlueMatt: or we can just merge bitcoin#10192, thats fee 22:09:04 gribble: bitcoin#10192 | Cache full script execution results in addition to signatures by TheBlueMatt · Pull Request bitcoin#10192 · bitcoin/bitcoin · GitHub 22:09:11 BlueMatt: s/fee/free/ 22:09:21 morcos: no, we do not need to improve lock contention first. but we should probably do that before we increase the max beyond 16 22:09:26 BlueMatt: then we can toss concurrency issues out the window and get more speedup anyway 22:09:35 gmaxwell: wumpus: yea, well in QT I thought we also diminished the count by 1 or something? but yes, if the motivation was to reduce how heavily the machine was used, thats fair. 22:09:56 sipa: the benefit of using HT cores is certainly not a factor 2 22:09:58 wumpus: gmaxwell: for the default I think this makes a lot of sense, yes 22:10:10 gmaxwell: morcos: right now on my 24/28 physical core hosts going beyond 16 still reduces performance. 22:10:11 wumpus: gmaxwell: do we also restrict the maximum par using this? that'd make less sense 22:10:51 wumpus: if someone *wants* to use the virtual cores they should be able to by setting -par= 22:10:51 sipa: *flies to US* 22:10:52 BlueMatt: sipa: sure, but the shared cache helps us get more out of it than some others, as morcos points out 22:11:30 BlueMatt: (because it means our thread contention issues are less) 22:12:05 morcos: gmaxwell: yeah i've been bogged down in fee estimation as well (and the rest of life) for a while now.. otherwise i would have put more effort into jeremy's checkqueue 22:12:36 BlueMatt: morcos: heh, well now you can do other stuff while the rest of us get bogged down in understanding fee estimation enough to review it  22:12:37 wumpus: [to answer my own question: no, the limit for par is MAX_SCRIPTCHECK_THREADS, or 16] 22:12:54 morcos: but to me optimizing for more than 16 cores is pretty valuable as miners could use beefy machines and be less concerned by block validation time 22:14:38 BlueMatt: morcos: i think you may be surprised by the number of mining pools that are on VPSes that do not have 16 cores  22:15:34 gmaxwell: I assume right now most of the time block validation is bogged in the parts that are not as concurrent. simple because caching makes the concurrent parts so fast. (and soon to hopefully increase with bluematt's patch) 22:17:55 gmaxwell: improving sha2 speed, or transaction malloc overhead are probably bigger wins now for connection at the tip than parallelism beyond 16 (though I'd like that too). 22:18:21 BlueMatt: sha2 speed is big 22:18:27 morcos: yeah lots of things to do actually... 22:18:57 gmaxwell: BlueMatt: might be a tiny bit less big if we didn't hash the block header 8 times for every block.  22:21:27 BlueMatt: ehh, probably, but I'm less rushed there 22:21:43 BlueMatt: my new cache thing is about to add a bunch of hashing 22:21:50 BlueMatt: 1 sha round per tx 22:22:25 BlueMatt: and sigcache is obviously a ton ``` Tree-SHA512: a594430e2a77d8cc741ea8c664a2867b1e1693e5050a4bbc8511e8d66a2bffe241a9965f6dff1e7fbb99f21dd1fdeb95b826365da8bd8f9fab2d0ffd80d5059c
…f Boost, to determine available cores 937bf43 Use std::thread::hardware_concurrency, instead of Boost, to determine available cores (fanquake) Pull request description: Following discussion on IRC about replacing Boost usage for detecting available system cores, I've opened this to collect some benchmarks + further discussion. The current method for detecting available cores was introduced in dashpay#6361. Recap of the IRC chat: ``` 21:14:08 fanquake: Since we seem to be giving Boost removal a good shot for 0.15, does anyone have suggestions for replacing GetNumCores? 21:14:26 fanquake: There is std::thread::hardware_concurrency(), but that seems to count virtual cores, which I don't think we want. 21:14:51 BlueMatt: fanquake: I doubt we'll do boost removal for 0.15 21:14:58 BlueMatt: shit like BOOST_FOREACH, sure 21:15:07 BlueMatt: but all of boost? doubtful, there are still things we need 21:16:36 fanquake: Yea sorry, not the whole lot, but we can remove a decent chunk. Just looking into what else needs to be done to replace some of the less involved Boost usage. 21:16:43 BlueMatt: fair 21:17:14 wumpus: yes, it makes sense to plan ahead a bit, without immediately doing it 21:18:12 wumpus: right, don't count virtual cores, that used to be the case but it makes no sense for our usage 21:19:15 wumpus: it'd create a swarm of threads overwhelming any machine with hyperthreading (+accompanying thread stack overhead), for script validation, and there was no gain at all for that 21:20:03 sipa: BlueMatt: don't worry, there is no hurry 21:59:10 morcos: wumpus: i don't think that is correct 21:59:24 morcos: suppose you have 4 cores (8 virtual cores) 21:59:24 wumpus: fanquake: indeed seems that std has no equivalent to physical_concurrency, on any standard. That's annoying as it is non-trivial to implement 21:59:35 morcos: i think running par=8 (if it let you) would be notably faster 21:59:59 morcos: jeremyrubin and i discussed this at length a while back... i think i commented about it on irc at the time 22:00:21 wumpus: morcos: I think the conclusion at the time was that it made no difference, but sure would make sense to benchmark 22:00:39 morcos: perhaps historical testing on the virtual vs actual cores was polluted by concurrency issues that have now improved 22:00:47 wumpus: I think there are not more ALUs, so there is not really a point in having more threads 22:01:40 wumpus: hyperthreads are basically just a stored register state right? 22:02:23 sipa: wumpus: yes but it helps the scheduler 22:02:27 wumpus: in which case the only speedup using "number of cores" threads would give you is, possibly, excluding other software from running on the cores on the same time 22:02:37 morcos: well this is where i get out of my depth 22:02:50 sipa: if one of the threads is waiting on a read from ram, the other can use the arithmetic unit for example 22:02:54 morcos: wumpus: i'm pretty sure though that the speed up is considerably more than what you might expect from that 22:02:59 wumpus: sipa: ok, I back down, I didn't want to argue this at all 22:03:35 morcos: the reason i haven't tested it myself, is the machine i usually use has 16 cores... so not easy due to remaining concurrency issues to get much more speedup 22:03:36 wumpus: I'm fine with restoring it to number of virtual threads if that's faster 22:03:54 morcos: we should have somene with 4 cores (and  actually test it though, i agree 22:03:58 sipa: i would expect (but we should benchmark...) that if 8 scriot validation threads instead of 4 on a quadcore hyperthreading is not faster, it's due to lock contention 22:04:20 morcos: sipa: yeah thats my point, i think lock contention isn't that bad with 8 now 22:04:22 wumpus: on 64-bit systems the additional thread overhead wouldn't be important at least 22:04:23 gmaxwell: I previously benchmarked, a long time ago, it was faster. 22:04:33 gmaxwell: (to use the HT core count) 22:04:44 wumpus: why was this changed at all then? 22:04:47 wumpus: I'm confused 22:05:04 sipa: good question! 22:05:06 gmaxwell: I had no idea we changed it. 22:05:25 wumpus: sigh  22:05:54 gmaxwell: What PR changed it? 22:06:51 gmaxwell: In any case, on 32-bit it's probably a good tradeoff... the extra ram overhead is worth avoiding. 22:07:22 wumpus: bitcoin#6361 22:07:28 gmaxwell: PR 6461 btw. 22:07:37 gmaxwell: er lol at least you got it right. 22:07:45 wumpus: the complaint was that systems became unsuably slow when using that many thread 22:07:51 wumpus: so at least I got one thing right, woohoo 22:07:55 sipa: seems i even acked it! 22:07:57 BlueMatt: wumpus: there are more alus 22:08:38 BlueMatt: but we need to improve lock contention first 22:08:40 morcos: anywya, i think in the past the lock contention made 8 threads regardless of cores a bit dicey.. now that is much better (although more still to be done) 22:09:01 BlueMatt: or we can just merge bitcoin#10192, thats fee 22:09:04 gribble: bitcoin#10192 | Cache full script execution results in addition to signatures by TheBlueMatt · Pull Request bitcoin#10192 · bitcoin/bitcoin · GitHub 22:09:11 BlueMatt: s/fee/free/ 22:09:21 morcos: no, we do not need to improve lock contention first. but we should probably do that before we increase the max beyond 16 22:09:26 BlueMatt: then we can toss concurrency issues out the window and get more speedup anyway 22:09:35 gmaxwell: wumpus: yea, well in QT I thought we also diminished the count by 1 or something? but yes, if the motivation was to reduce how heavily the machine was used, thats fair. 22:09:56 sipa: the benefit of using HT cores is certainly not a factor 2 22:09:58 wumpus: gmaxwell: for the default I think this makes a lot of sense, yes 22:10:10 gmaxwell: morcos: right now on my 24/28 physical core hosts going beyond 16 still reduces performance. 22:10:11 wumpus: gmaxwell: do we also restrict the maximum par using this? that'd make less sense 22:10:51 wumpus: if someone *wants* to use the virtual cores they should be able to by setting -par= 22:10:51 sipa: *flies to US* 22:10:52 BlueMatt: sipa: sure, but the shared cache helps us get more out of it than some others, as morcos points out 22:11:30 BlueMatt: (because it means our thread contention issues are less) 22:12:05 morcos: gmaxwell: yeah i've been bogged down in fee estimation as well (and the rest of life) for a while now.. otherwise i would have put more effort into jeremy's checkqueue 22:12:36 BlueMatt: morcos: heh, well now you can do other stuff while the rest of us get bogged down in understanding fee estimation enough to review it  22:12:37 wumpus: [to answer my own question: no, the limit for par is MAX_SCRIPTCHECK_THREADS, or 16] 22:12:54 morcos: but to me optimizing for more than 16 cores is pretty valuable as miners could use beefy machines and be less concerned by block validation time 22:14:38 BlueMatt: morcos: i think you may be surprised by the number of mining pools that are on VPSes that do not have 16 cores  22:15:34 gmaxwell: I assume right now most of the time block validation is bogged in the parts that are not as concurrent. simple because caching makes the concurrent parts so fast. (and soon to hopefully increase with bluematt's patch) 22:17:55 gmaxwell: improving sha2 speed, or transaction malloc overhead are probably bigger wins now for connection at the tip than parallelism beyond 16 (though I'd like that too). 22:18:21 BlueMatt: sha2 speed is big 22:18:27 morcos: yeah lots of things to do actually... 22:18:57 gmaxwell: BlueMatt: might be a tiny bit less big if we didn't hash the block header 8 times for every block.  22:21:27 BlueM
This adds a new CuckooCache in validation, caching whether all of a
transaction's scripts were valid with a given set of script flags.
Unlike previous attempts at caching an entire transaction's
validity, which have nearly universally introduced consensus
failures, this only caches the validity of a transaction's
scriptSigs. As these are pure functions of the transaction and
data it commits to, this should be much safer.
This is somewhat duplicative with the sigcache, as entries in the
new cache will also have several entries in the sigcache. However,
the sigcache is kept both as ATMP relies on it and because it
prevents malleability-based DoS attacks on the new higher-level
cache. Instead, the -sigcachesize option is re-used - cutting the
sigcache size in half and using the newly freed memory for the
script execution cache.
Transactions which match the script execution cache never even have
entries in the script check thread's workqueue created.
Note that the cache is indexed only on the script execution flags
and the transaction's witness hash. While this is sufficient to
make the CScriptCheck() calls pure functions, this introduces
dependancies on the mempool calculating things such as the
PrecomputedTransactionData object, filling the CCoinsViewCache, etc
in the exact same way as ConnectBlock. I belive this is a reasonable
assumption, but should be noted carefully.
In a rather naive benchmark (reindex-chainstate up to block 284k
with cuckoocache always returning true for contains(),
-assumevalid=0 and a very large dbcache), this connected blocks
~1.7x faster.