
Call ProcessNewBlock() asynchronously #16175

Closed

Conversation

@TheBlueMatt
Contributor

TheBlueMatt commented Jun 9, 2019

This is essentially a rebase of #12934, though it's actually written from scratch ("greenfield") to reduce the diff a bit. It is based on #16174, as without it the parallelism would be almost entirely useless. Note that this is a draft as I haven't (yet) bothered doing the backgrounding of the block validation; it's currently just a return-value change plus the net_processing changes to handle it.

@DrahtBot

Contributor

DrahtBot commented Jun 9, 2019

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #16202 (Refactor network message deserialization by jonasschnelli)
  • #16194 (refactor: share blockmetadata with BlockManager by jamesob)
  • #15921 (Tidy up ValidationState interface by jnewbery)
  • #15606 ([experimental] UTXO snapshots by jamesob)
  • #15545 ([doc] explain why CheckBlock() is called before AcceptBlock by Sjors)
  • #15505 ([p2p] Request NOTFOUND transactions immediately from other outbound peers, when possible by sdaftuar)
  • #15206 (Immediately disconnect on invalid net message checksum by jonasschnelli)
  • #14046 (net: Refactor message parsing (CNetMessage), adds flexibility by jonasschnelli)
  • #14032 (Add p2p layer encryption with ECDH/ChaCha20Poly1305 by jonasschnelli)
  • #9849 (Qt: Network Watch tool by luke-jr)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@TheBlueMatt TheBlueMatt force-pushed the TheBlueMatt:2019-06-add-second-thread branch 2 times, most recently from 3e8ad42 to 5779633 Jun 9, 2019
@TheBlueMatt

Contributor Author

TheBlueMatt commented Jun 9, 2019

Oops, it turns out it's actually not possible to do #16174 first separately (and without it, this PR is entirely useless). By itself, #16174 ends up hitting the deadly cross-validationinterface-queue deadlock: it takes cs_peerstate first, then (eventually) calls ProcessNewBlock, which may call LimitValidationInterfaceQueue(). Meanwhile, in the validation queue, we may end up waiting on cs_peerstate to update some peer's state. Of course, once we actually move block validation into its own thread, this is no longer (materially) a concern - we trivially guarantee there are no locks held going into LimitValidationInterfaceQueue().
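For readers unfamiliar with this deadlock shape, here is a minimal, self-contained sketch (illustrative names only, not Bitcoin Core code) of the two-thread pattern described above; depending on scheduling it hangs, which is exactly the hazard:

```cpp
// Minimal sketch of the deadlock: the message handler holds a peer-state
// mutex and then waits for a callback queue to drain, while the queue's
// worker needs that same mutex to run its callback.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

std::mutex cs_peerstate;                           // stand-in for the new net_processing lock
std::mutex queue_mutex;
std::condition_variable queue_cv;
std::queue<std::function<void()>> callback_queue;  // stand-in for the validationinterface queue

// Stand-in for LimitValidationInterfaceQueue(): blocks until the queue drains.
void WaitForQueueToDrain() {
    std::unique_lock<std::mutex> lock(queue_mutex);
    queue_cv.wait(lock, [] { return callback_queue.empty(); });
}

void MessageHandlerThread() {
    std::lock_guard<std::mutex> peer_lock(cs_peerstate);  // cs_peerstate taken first...
    // ...then (via ProcessNewBlock in the real code) we wait on the queue
    // while still holding cs_peerstate.
    WaitForQueueToDrain();
}

void ValidationQueueWorker() {
    while (true) {
        std::function<void()> cb;
        {
            std::unique_lock<std::mutex> lock(queue_mutex);
            if (callback_queue.empty()) { queue_cv.notify_all(); return; }
            cb = std::move(callback_queue.front());
            callback_queue.pop();
        }
        cb();  // e.g. a BlockChecked-style callback that takes cs_peerstate -> deadlock
    }
}

int main() {
    callback_queue.push([] { std::lock_guard<std::mutex> l(cs_peerstate); /* update peer state */ });
    std::thread worker(ValidationQueueWorker);
    std::thread handler(MessageHandlerThread);
    handler.join();
    worker.join();
}
```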

@TheBlueMatt TheBlueMatt force-pushed the TheBlueMatt:2019-06-add-second-thread branch 4 times, most recently from a36f301 to 4eb8ebf Jun 9, 2019
@TheBlueMatt TheBlueMatt marked this pull request as ready for review Jun 9, 2019
@TheBlueMatt TheBlueMatt force-pushed the TheBlueMatt:2019-06-add-second-thread branch from 4eb8ebf to d31e3d0 Jun 9, 2019
@TheBlueMatt

Contributor Author

TheBlueMatt commented Jun 9, 2019

Note that there is still a regression causing some functional tests to time out, as we may end up waiting 100ms to discover that a block has been processed and that we can process the next message from that peer (this shouldn't be an issue when downloading from multiple peers, since we'll always wake on a new tip block, and it seems not to be an issue for a single peer either, even though it's possible it would be due to a wake race). Not 100% sure what the best solution to that is.

@jamesob

Member

jamesob commented Jun 11, 2019

Oops, it turns out it's actually not possible to do #16174 first separately [...] Meanwhile, in the validation queue, we may end up waiting on cs_peerstate to update some peer's state.

For other reviewers: the particular validationinterface callback at fault here looks to be PeerLogicValidation::BlockChecked.

Big Concept ACK on the lock-splitting at the very least, though I do wish it could be done separately somehow. I'm a Concept-Concept ACK on this (in concept), but want to make sure I fully understand the implications.

TheBlueMatt added 10 commits Jun 8, 2019
CNodeState was added for validation-state-tracking, and thus,
logically, was protected by cs_main. However, as it has grown to
include non-validation state (taking state from CNode), and as
we've reduced cs_main usage for other unrelated things, CNodeState
is left with lots of cs_main locking in net_processing.

In order to ease transition to something new, this adds only a
dummy CPeerState which is held as a reference for the duration of
message processing.

Note that moving things is somewhat tricky pre validation-thread as
a consistent lockorder must be kept - we can't take a lock on the
new cs_peerstate in anything that's called directly from
validation.
This prepares for making ProcessNewBlock processing actually happen
in a separate thread from the caller, even though all callsites
currently block on the return value immediately.

Note that this makes some of the unit tests less restrictive.
9fdf05d resolved some lock
inversion warnings in denialofservice_tests, but left in a number
of cs_main locks that are unnecessary (introducing lock inversion
warnings in future changes).
Essentially, our goal is to not process anything for the given peer
until the block finishes processing (emulating the previous behavior)
without actually blocking the ProcessMessages loops. Obviously, in
most cases, we'll just go on to the next peer and immediately hit a
cs_main lock, blocking us anyway, but we can slowly improve
that state over time by moving things from CNodeState to CPeerState.
Spawn a background thread at startup which validates each block as
it comes in from ProcessNewBlock, taking advantage of the new
std::future return value to keep tests simple (and the fact that the
new net_processing code already handles such values asynchronously).

This makes introducing subtle validationinterface deadlocks much
harder as any locks held going into ProcessNewBlock do not interact
with (in the form of lockorder restrictions) locks taken in
validationinterface callbacks.

Note that after this commit, feature_block and feature_assumevalid
tests time out due to increased latency between block processing
when those blocks do not represent a new best block. This will be
resolved in the next commit.
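
As a sanity check on the described design, here is a hedged, self-contained sketch (names and structure are illustrative assumptions, not the PR's actual code) of a background thread that validates queued blocks and fulfils a std::promise so callers get a std::future back:

```cpp
// Sketch of the pattern: callers enqueue blocks and immediately receive a
// future, while a dedicated thread performs validation and sets the result.
#include <condition_variable>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>

struct Block {};  // placeholder for CBlock

struct QueuedBlock {
    std::shared_ptr<const Block> block;
    std::promise<bool> result;  // fulfilled with an fNewBlock-style result
};

class BlockValidator {
public:
    BlockValidator() : m_thread([this] { Run(); }) {}
    ~BlockValidator() {
        { std::lock_guard<std::mutex> l(m_mutex); m_stop = true; }
        m_cv.notify_one();
        m_thread.join();
    }

    // Analogue of the new ProcessNewBlock(): returns immediately with a future.
    std::future<bool> ProcessNewBlockAsync(std::shared_ptr<const Block> block) {
        QueuedBlock item{std::move(block), {}};
        std::future<bool> fut = item.result.get_future();
        { std::lock_guard<std::mutex> l(m_mutex); m_queue.push(std::move(item)); }
        m_cv.notify_one();
        return fut;
    }

private:
    void Run() {
        for (;;) {
            QueuedBlock item;
            {
                std::unique_lock<std::mutex> l(m_mutex);
                m_cv.wait(l, [this] { return m_stop || !m_queue.empty(); });
                if (m_stop && m_queue.empty()) return;
                item = std::move(m_queue.front());
                m_queue.pop();
            }
            bool new_block = Validate(*item.block);  // CheckBlock/AcceptBlock would go here
            item.result.set_value(new_block);
        }
    }
    bool Validate(const Block&) { return true; }  // stand-in for real validation

    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::queue<QueuedBlock> m_queue;
    bool m_stop{false};
    std::thread m_thread;
};
```

A caller would hold the returned future and, as the later net_processing commits do, check it without blocking rather than calling get() right away.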
This resolves the performance regression introduced in the previous
commit by always waking the message processing thread after each
block future resolves.

Sadly, this is somewhat awkward - all other validationinterface
callbacks represent an actual change to the global validation state,
whereas this callback indicates only that a call which one
validation "client" made has completed. After going back and forth
for some time I didn't see a materially better way to resolve this
issue, and luckily it's a rather simple change, but it's far from
ideal. Note that because we absolutely do not want to ever block on
a ProcessNewBlock-returned-future, the callback approach is
critical.
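
To illustrate why the wake-up callback matters, here is a small hedged sketch (hypothetical names, not the PR's code) of a message handler that polls the pending block future with a zero timeout and relies on a BlockProcessed-style notification to be woken promptly instead of waiting out its periodic timeout:

```cpp
#include <chrono>
#include <future>

struct PeerDownloadState {
    std::future<bool> pending_block;  // returned by the async ProcessNewBlock
};

// Called from the ProcessMessages loop: never blocks on the future.
bool ReadyToProcessNextMessage(PeerDownloadState& peer)
{
    if (!peer.pending_block.valid()) return true;  // nothing in flight
    // Zero-timeout poll: returns immediately whether or not the block is done.
    if (peer.pending_block.wait_for(std::chrono::seconds(0)) != std::future_status::ready) {
        return false;  // still validating; skip this peer for now
    }
    bool was_new_block = peer.pending_block.get();
    (void)was_new_block;  // net_processing would update its bookkeeping here
    return true;
}

// Hypothetical analogue of the new callback: the background validation thread
// calls this when a block finishes, and the implementation wakes the message
// handler so the check above re-runs promptly instead of after a timeout.
void BlockProcessed() { /* WakeMessageHandler(); */ }
```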
This removes the cs_main lock which is taken on every
ProcessMessages call.
@TheBlueMatt TheBlueMatt force-pushed the TheBlueMatt:2019-06-add-second-thread branch from 9727e5e to 2d1e8f6 Jun 12, 2019
*
* @param[in] pblock The block we want to process.
* @param[in] fForceProcessing Process this block even if unrequested; used for non-network block sources and whitelisted peers.
* @param[out] fNewBlock A boolean which is set to indicate if the block was first received via this call
* @return True if state.IsValid()
* @return A future which complets with a boolean which is set to indicate if the block was first received via this call

@practicalswift

practicalswift Jun 13, 2019

Member

Adding an "e" completes the spelling here :-)

Contributor

ryanofsky left a comment

Started review (will update list below with progress).

  • 9bc8ca6 Remove unnecessary cs_mains in denialofservice_tests (1/10)
  • 79e6e44 Make ProcessNewBlock return a future instead of an immediate bool (2/10)
  • e898229 Add a new peer state tracking class to reduce cs_main contention. (3/10)
  • 8ddbb12 Move net_processing's ProcessNewBlock calls to resolve async. (4/10)
  • 64e74ce Run ProcessNewBlock in a background thread (XXX: p2p_sendheaders broken) (5/10)
  • ba810b9 Add a callback to indicate a block has been processed (6/10)
  • 3499d98 Move BlockChecked to a background thread (7/10)
  • 445bb84 Move nDoS counters to CPeerState (and, thus, out of cs_main) (8/10)
  • b6a0552 Move rejects into cs_peerstate. (9/10)
  • 2d1e8f6 Actually parallelize validation somewhat (10/10)
@@ -388,7 +399,20 @@ struct CNodeState {
// Keeps track of the time (in microseconds) when transactions were requested last time
limitedmap<uint256, int64_t> g_already_asked_for GUARDED_BY(cs_main)(MAX_INV_SZ);

/** Note that this must be locked BEFORE cs_main! */
CCriticalSection cs_peerstate;

@ryanofsky

ryanofsky Jun 13, 2019

Contributor

In commit "Add a new peer state tracking class to reduce cs_main contention." (e898229)

Do you want this to be a recursive mutex for some reason? If so, it'd be helpful to say in a comment why recursive locking is helpful here. Otherwise I'd suggest changing this line to Mutex g_peerstate_mutex (also not using the old "critical section" terminology).

@@ -756,12 +780,22 @@ void PeerLogicValidation::InitializeNode(CNode *pnode) {
LOCK(cs_main);
mapNodeState.emplace_hint(mapNodeState.end(), std::piecewise_construct, std::forward_as_tuple(nodeid), std::forward_as_tuple(addr, std::move(addrName), pnode->fInbound, pnode->m_manual_connection));
}
{
LOCK(cs_peerstate);
mapPeerState.emplace_hint(mapPeerState.end(), nodeid, CPeerState{});

@ryanofsky

ryanofsky Jun 13, 2019

Contributor

In commit "Add a new peer state tracking class to reduce cs_main contention." (e898229)

Not necessarily suggesting you should change this here, but is the only reason for adding a new mapPeerState map to make the diff smaller? If entries are always added to both the mapNodeState and mapPeerState maps, and removed from them at the same times with both locks held, it would seem more ideal to have just a single map where each entry contains separate node state and peer state structs (to support separate lock annotations).
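
A minimal sketch of that suggestion (hypothetical names, assuming the surrounding net_processing types and thread-safety annotation macros are in scope) might look like:

```cpp
// One map per peer, with each piece of state guarded by its own lock, so
// insertion/removal happens once (with both locks held) instead of being
// mirrored across two maps.
struct CombinedNodeState {
    CNodeState validation_state GUARDED_BY(cs_main);
    CPeerState peer_state GUARDED_BY(cs_peerstate);
};

static std::map<NodeId, CombinedNodeState> g_node_states;  // add/erase entries with both locks held
```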

* contention. Thus, new state tracking should go here, and we should eventually
* move most (non-validation-specific) state here.
*/
struct CPeerState {

@ryanofsky

ryanofsky Jun 13, 2019

Contributor

In commit "Add a new peer state tracking class to reduce cs_main contention." (e898229)

Probably better to follow current coding convention and call it PeerState instead of CPeerState

/** Map maintaining per-node state. */
static std::map<NodeId, CPeerState> mapPeerState GUARDED_BY(cs_peerstate);

static CPeerState *PeerState(NodeId pnode) EXCLUSIVE_LOCKS_REQUIRED(cs_peerstate) LOCKS_EXCLUDED(cs_main) {

@ryanofsky

ryanofsky Jun 13, 2019

Contributor

In commit "Add a new peer state tracking class to reduce cs_main contention." (e898229)

Would suggest calling this LookupPeerState, to be consistent with LookupBlockIndex, and because otherwise calls to this function look like they are constructing objects, not looking up existing objects

/** Map maintaining per-node state. */
static std::map<NodeId, CPeerState> mapPeerState GUARDED_BY(cs_peerstate);

@ryanofsky

ryanofsky Jun 13, 2019

Contributor

In commit "Add a new peer state tracking class to reduce cs_main contention." (e898229)

g_peer_states instead of mapPeerState would be more in line with current coding convention

}
}

result_promise.set_value(fNewBlock);

@ryanofsky

ryanofsky Jun 13, 2019

Contributor

In commit "Make ProcessNewBlock return a future instead of an immediate bool" (79e6e44)

This line is added here and then moved below NotifyHeaderTip in a later commit (64e74ce). It would be preferable to just add it in the right place below now so it doesn't need to be moved later.

return result;
}

std::future<bool> ProcessNewBlock(const CChainParams& chainparams, const std::shared_ptr<const CBlock> pblock, bool fForceProcessing)

@ryanofsky

ryanofsky Jun 13, 2019

Contributor

In commit "Run ProcessNewBlock in a background thread (XXX: p2p_sendheaders broken)" (64e74ce)

Documentation for this function in validation.h becomes out of date with this commit and should be updated. It still says "This only returns after the best known valid block is made active."

bool ret = CheckBlock(*pblock, state, chainparams.GetConsensus());
if (ret) {
// Store to disk
ret = ::ChainstateActive().AcceptBlock(pblock, state, chainparams, &pindex, fForceProcessing, nullptr, &fNewBlock);

@ryanofsky

ryanofsky Jun 13, 2019

Contributor

In commit "Run ProcessNewBlock in a background thread (XXX: p2p_sendheaders broken)" (64e74ce)

Throughout this function shouldn't ::ChainstateActive() be replaced by *this?

if (!new_block && accepted) {
LOCK(cs_main);
const CBlockIndex* pindex = LookupBlockIndex(blockptr->GetHash());
if (!new_block && pindex && pindex->IsValid()) {

@ryanofsky

ryanofsky Jun 13, 2019

Contributor

In commit "Make ProcessNewBlock return a future instead of an immediate bool" (79e6e44)

This seems fine, but just to make sure I understand correctly, there is a change in behavior here? Previously accepted would only be true if CheckBlock and AcceptBlock and ActivateBestChain all succeeded, but now it can be true even if they fail?

@@ -175,7 +175,7 @@ BOOST_FIXTURE_TEST_CASE(blockfilter_index_initial_sync, TestChain100Setup)
uint256 chainA_last_header = last_header;
for (size_t i = 0; i < 2; i++) {
const auto& block = chainA[i];
BOOST_REQUIRE(ProcessNewBlock(Params(), block, true, nullptr));

@ryanofsky

ryanofsky Jun 13, 2019

Contributor

In commit "Make ProcessNewBlock return a future instead of an immediate bool" (79e6e44)

In the commit message about these tests you wrote "Note that this makes some of the unit tests less restrictive." But I don't think there's a good justification for doing this. If you just declared struct ProcessNewBlockResult { bool error = false; bool already_have = false; }; and returned std::future<ProcessNewBlockResult>, the tests could behave the same and the code would be more readable and less fragile, without the need for people to guess from context whether the bool indicates success or newness.
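
For concreteness, the suggested shape (the struct is taken from the comment above; the surrounding signature is an assumption about how it would slot into validation.h) would be roughly:

```cpp
struct ProcessNewBlockResult {
    bool error = false;         // CheckBlock/AcceptBlock/ActivateBestChain failed
    bool already_have = false;  // the block was not first received via this call
};

std::future<ProcessNewBlockResult> ProcessNewBlock(const CChainParams& chainparams,
                                                   const std::shared_ptr<const CBlock> pblock,
                                                   bool fForceProcessing);
```

A test like the blockfilter one quoted above could then assert on .error explicitly, e.g. BOOST_REQUIRE(!ProcessNewBlock(Params(), block, true).get().error), keeping its original strictness.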

@ajtowns

ajtowns Jun 18, 2019

Contributor

Think I agree with this suggestion; dropping the fNewBlock result at the same time as turning the return into a future seems unnecessarily complicated.

@jamesob

Member

jamesob commented Jun 13, 2019

When IBDing from a single peer, this branch appears to be no slower (or negligibly slower) than master. I'm going to modify the benchmark framework to test against multiple peers, which is where I'm assuming we'd expect to see some speedup.

ibd local range 500000 505000

@ryanofsky

Contributor

ryanofsky commented Jun 14, 2019

Just want to note that all my comments above are just suggested cleanups. Feel free to ignore them if they don't make sense or aren't worth the effort to implement. The only thing I'd really like to see is more comments about locking. When a lock is held in a small scope for a specific purpose, I don't think there's a need for a comment, but when a lock is held over a wide scope for no obvious reason, it's good to have a comment that suggests why it needs to be acquired there and when it is safe to release.

@TheBlueMatt

Contributor Author

TheBlueMatt commented Jun 18, 2019

Working on rewriting this to make it simpler, but it looks like it may end up growing, so I'm not sure about it.
