Peer connections hang temporarily or fail permanently #1435

oxarbitrage · 2020-12-02T13:43:34Z

Analysis

Peer connection failures can result in a slow sync, temporary sync hangs, or permanent sync hangs.

These failures can happen after network interruptions, or due to normal connection churn. They might also happen because Zebra's protocol state machine gets in an invalid or unrecoverable state.

Next Steps

Here are some things we could try:

Run Zebra with a single local peer, and make it panic as soon as that peer disappears. It's ok for the peer to go from ready to unready and back, and maybe disconnect, but it should never disappear.

Version

zebrad 3.0.0-alpha.0

Current main branch
Edit: as of 2020-12-02

Platform
Linux oxarbitrage 4.15.0-20-generic #21-Ubuntu SMP Tue Apr 24 06:16:15 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Description

During sync on the mainnet the sync component stopped responding. The program didn't crashed or hanged totally as the inbound requests where still being responded however no more block downloads were made. I have a log of this at https://gist.github.com/oxarbitrage/2f067fed9588c3d942e499d1252fc777 where the last msg from the sync component is at https://gist.github.com/oxarbitrage/2f067fed9588c3d942e499d1252fc777#file-gistfile1-txt-L2843 and no further msg after 30 minutes.

I tried stooping the program and start again. The sync resumed.

--
When trying to sync the zcash blockchain using zebra i will expect to download all the blocks from start to end without intervention. Instead, i had to stop(ctrl-c) the program and restart to keep going.

The text was updated successfully, but these errors were encountered:

teor2345 · 2020-12-02T21:24:00Z

Your logs say that you're using commit bfbc737.

This looks like issue #1409, which is fixed by PR #1425 in a later commit b449fe9.

Please pull the latest main branch, recompile, and relaunch Zebra. If the issue happens again, please re-open this ticket.

oxarbitrage · 2020-12-02T21:31:35Z

Yea, the PR was not merged when i started that test earlier today. Thanks, i will reopen if i see it again.

oxarbitrage · 2020-12-02T23:23:18Z

I am seeing the same issue with an updated branch that includes #1425

No response from the sync component after 30 minutes.

A restart fixes it.

Click triangle to reveal logs

Dec 02 19:46:32.007  INFO {zebrad="7d1dc174"}:peer{addr=39.97.178.10:8233}:msg_as_req{msg=getheaders}:inbound:state: zebra_state::service: responding to peer GetBlocks or GetHeaders final_height=Height(14491) response_len=158 chain_tip_height=Height(419181) stop_height=None intersection_height=Some(Height(14333))
Dec 02 19:46:35.684  INFO {zebrad="7d1dc174"}:sync: zebrad::components::sync: waiting for pending blocks tips.len=1 in_flight=2372 lookahead_limit=2000
Dec 02 19:46:35.685  INFO {zebrad="7d1dc174"}:sync: zebrad::components::sync: extending tips tips.len=1 in_flight=2000 lookahead_limit=2000
Dec 02 19:46:36.668  INFO {zebrad="7d1dc174"}:sync:extend_tips:checkpoint: zebra_consensus::checkpoint: verified checkpoint range block_count=400 current_range=(Excluded(Height(419181)), Included(Height(419581)))
Dec 02 19:46:45.185  INFO {zebrad="7d1dc174"}:sync: zebrad::components::sync: waiting for pending blocks tips.len=1 in_flight=2372 lookahead_limit=2000
Dec 02 19:46:45.185  INFO {zebrad="7d1dc174"}:sync: zebrad::components::sync: extending tips tips.len=1 in_flight=2000 lookahead_limit=2000
Dec 02 19:46:49.016  INFO {zebrad="7d1dc174"}:sync: zebrad::components::sync: waiting for pending blocks tips.len=1 in_flight=2372 lookahead_limit=2000
Dec 02 19:46:49.017  INFO {zebrad="7d1dc174"}:sync: zebrad::components::sync: extending tips tips.len=1 in_flight=2000 lookahead_limit=2000
Dec 02 19:46:52.900  INFO {zebrad="7d1dc174"}:sync: zebrad::components::sync: waiting for pending blocks tips.len=1 in_flight=2374 lookahead_limit=2000
Dec 02 19:46:52.901  INFO {zebrad="7d1dc174"}:sync: zebrad::components::sync: extending tips tips.len=1 in_flight=2000 lookahead_limit=2000
Dec 02 19:46:57.378  INFO {zebrad="7d1dc174"}:sync: zebrad::components::sync: waiting for pending blocks tips.len=1 in_flight=2489 lookahead_limit=2000
Dec 02 19:47:52.769  INFO {zebrad="7d1dc174"}:peer{addr=39.97.178.10:8233}:msg_as_req{msg=inv}:inbound:download_and_verify{hash=000000000036349c3579797cff490c80a6d337c83ab20839b7b1c831caf5cedf}: zebra_state::service: pruned utxo requests old_len=5 new_len=0 prune_count=5 tip=Some((Height(419581), block::Hash("000000000003de545eedfd47b6ae10327fa3b317a6d1eb66a29c3bd6029a243d")))
Dec 02 19:51:31.471  INFO {zebrad="7d1dc174"}:peer{addr=47.254.69.198:8233}:msg_as_req{msg=getheaders}:inbound: zebra_state::service: pruned utxo requests old_len=104 new_len=103 prune_count=1 tip=Some((Height(419581), block::Hash("000000000003de545eedfd47b6ae10327fa3b317a6d1eb66a29c3bd6029a243d")))
Dec 02 19:51:31.521  INFO {zebrad="7d1dc174"}:peer{addr=47.254.69.198:8233}:msg_as_req{msg=getheaders}:inbound:state: zebra_state::service: responding to peer GetBlocks or GetHeaders final_height=Height(14496) response_len=158 chain_tip_height=Height(419581) stop_height=None intersection_height=Some(Height(14338))
Dec 02 19:56:13.322  INFO {zebrad="7d1dc174"}:peer{addr=59.110.243.62:8233}:msg_as_req{msg=inv}:inbound:download_and_verify{hash=00000000010eaac9ff05c2db4c17a64ea5ec024efb300f517c6fec60236817e8}: zebra_state::service: pruned utxo requests old_len=203 new_len=0 prune_count=203 tip=Some((Height(419581), block::Hash("000000000003de545eedfd47b6ae10327fa3b317a6d1eb66a29c3bd6029a243d")))
Dec 02 20:01:13.848  INFO {zebrad="7d1dc174"}:peer{addr=47.254.69.198:8233}:msg_as_req{msg=getheaders}:inbound: zebra_state::service: pruned utxo requests old_len=96 new_len=91 prune_count=5 tip=Some((Height(419581), block::Hash("000000000003de545eedfd47b6ae10327fa3b317a6d1eb66a29c3bd6029a243d")))
Dec 02 20:01:13.867  INFO {zebrad="7d1dc174"}:peer{addr=47.254.69.198:8233}:msg_as_req{msg=getheaders}:inbound:state: zebra_state::service: responding to peer GetBlocks or GetHeaders final_height=Height(14507) response_len=158 chain_tip_height=Height(419581) stop_height=None intersection_height=Some(Height(14349))
Dec 02 20:04:41.853  INFO {zebrad="7d1dc174"}:peer{addr=120.79.50.67:8233}:msg_as_req{msg=inv}:inbound:download_and_verify{hash=000000000010e5d01e28c86a99c18906fd6cf6c975e565908fc471c5fbe42e83}: zebra_state::service: pruned utxo requests old_len=109 new_len=0 prune_count=109 tip=Some((Height(419581), block::Hash("000000000003de545eedfd47b6ae10327fa3b317a6d1eb66a29c3bd6029a243d")))
Dec 02 20:04:57.586  INFO {zebrad="7d1dc174"}:peer{addr=66.70.180.144:8233}:msg_as_req{msg=getheaders}:inbound:state: zebra_state::service: responding to peer GetBlocks or GetHeaders final_height=Height(14510) response_len=158 chain_tip_height=Height(419581) stop_height=None intersection_height=Some(Height(14352))
Dec 02 20:08:51.840  INFO {zebrad="7d1dc174"}:peer{addr=139.99.123.157:8233}:msg_as_req{msg=getheaders}:inbound: zebra_state::service: pruned utxo requests old_len=275 new_len=9 prune_count=266 tip=Some((Height(419581), block::Hash("000000000003de545eedfd47b6ae10327fa3b317a6d1eb66a29c3bd6029a243d")))
Dec 02 20:08:51.856  INFO {zebrad="7d1dc174"}:peer{addr=139.99.123.157:8233}:msg_as_req{msg=getheaders}:inbound:state: zebra_state::service: responding to peer GetBlocks or GetHeaders final_height=Height(14513) response_len=158 chain_tip_height=Height(419581) stop_height=None intersection_height=Some(Height(14355))
Dec 02 20:12:51.891  INFO {zebrad="7d1dc174"}:peer{addr=120.79.50.67:8233}:msg_as_req{msg=inv}:inbound:download_and_verify{hash=000000000060fd9d327ae2680d946af863c191a96b1b8981574a6310f1830e20}: zebra_state::service: pruned utxo requests old_len=21 new_len=0 prune_count=21 tip=Some((Height(419581), block::Hash("000000000003de545eedfd47b6ae10327fa3b317a6d1eb66a29c3bd6029a243d")))
Dec 02 20:14:14.204  INFO {zebrad="7d1dc174"}:peer{addr=39.99.44.175:8233}:msg_as_req{msg=getheaders}:inbound:state: zebra_state::service: responding to peer GetBlocks or GetHeaders final_height=Height(14516) response_len=158 chain_tip_height=Height(419581) stop_height=None intersection_height=Some(Height(14358))
Dec 02 20:14:18.081  INFO {zebrad="7d1dc174"}:peer{addr=176.74.154.45:8233}:msg_as_req{msg=getheaders}:inbound:state: zebra_state::service: responding to peer GetBlocks or GetHeaders final_height=Height(14516) response_len=158 chain_tip_height=Height(419581) stop_height=None intersection_height=Some(Height(14358))
Dec 02 20:16:27.824  INFO {zebrad="7d1dc174"}:peer{addr=47.254.69.198:8233}:msg_as_req{msg=inv}:inbound:download_and_verify{hash=000000000152e1077b370d87f446e04c10c7c586c0645266b0ba8c3c94d24451}: zebra_state::service: pruned utxo requests old_len=74 new_len=2 prune_count=72 tip=Some((Height(419581), block::Hash("000000000003de545eedfd47b6ae10327fa3b317a6d1eb66a29c3bd6029a243d")))
Dec 02 20:18:41.734  INFO {zebrad="7d1dc174"}:peer{addr=39.97.178.10:8233}:msg_as_req{msg=inv}:inbound:download_and_verify{hash=0000000000b8bf272dcc55cfc71b478c59116ccdf713a73dab85882c975fe3da}: zebra_state::service: pruned utxo requests old_len=2 new_len=0 prune_count=2 tip=Some((Height(419581), block::Hash("000000000003de545eedfd47b6ae10327fa3b317a6d1eb66a29c3bd6029a243d")))

teor2345 · 2020-12-03T03:22:35Z

@oxarbitrage if you have a 10s of gigabytes of spare disk space, you can help us diagnose the issue using a trace log:

would it be possible to run a sync with filter = 'trace' and save a complete execution trace? It will likely be quite large, but we'd have the ability to trace exactly what data is causing the timeout.
#1406 (comment)

If you don't have lots of disk space, using filter = 'debug' might help us diagnose the issue - the logs would be a lot smaller than trace.

oxarbitrage · 2020-12-03T21:30:53Z

I was able to find the problem in debug mode so i have some logs https://gist.github.com/oxarbitrage/d2369fd4e5abb24e683e5a6301fcc6ff#file-sync_debug_logs_mainnet-txt

I didn't researched much yet but one possibility is i am losing connection locally(my adsl change IP every 12 hours or so) and by the amount of marking candidate as failed candidate.addr= msgs in that log all candidates are blacklisted when the connection returns back.

I will confirm this, can post bigger logs if needed.

oxarbitrage · 2020-12-03T21:51:29Z

Yea, it seems that is what is happening. I resumed and blocks are downloaded fine, then i closed my connection in purpose, while zebra is still running. Waited like 30 seconds and connected back, zebra was never able to download more blocks long after my connection was active again.

I have some logs for this too:

https://gist.github.com/oxarbitrage/916ed4d49db39116917ad6e938555e8e#file-sync_disconnect_mainnet-txt

oxarbitrage · 2020-12-03T22:00:28Z

Is the cut is short enough(tested a cut of 3-5 seconds) and there is no time to mark all the peers as failing then the sync will continue

oxarbitrage · 2020-12-03T22:59:10Z

The cut haves to be long(of at least 1 minute here) to actually be totally unrecoverable in the same session.

In a 1>minute cut sometimes i get this warnings long after the connection is restored:

Dec 03 19:49:46.245  WARN crawl_and_dial: zebra_network::peer_set::set: network request with no peer connections. Hint: check your network connection

They at least indicate something is going on with the network with default info level. In some other runs i am not seeing this warns, i suppose this can depend on the amount of time the connection was off.

hdevalence · 2020-12-03T23:15:33Z

Hmm, I don't think that the failure of an existing connection causes a peer address to be put in the "failed candidates" list, that's a list of peers where the connection attempt failed. (It's not a permanent rejection, it's just the lowest-preference for reconnection attempts). The peer set will attempt to reconnect to peers it disconnected from, by waiting until they could no longer be connected (ie. waiting beyond a timeout interval), then treating them as high-priority candidates for new connections.

The peer set is supposed to send a demand signal when it does not have enough peers, causing attempts to connect to new peers. It's unclear why that mechanism isn't working.

teor2345 · 2020-12-04T00:42:54Z

I'm also seeing this issue, it looks like my machine or router stopped allowing connections for a few hours this morning. When it came back up, Zebra reconnected to 1 peer, but has failed to reconnect to any other peers.

Edit: this is on testnet, so I should be seeing 5-20 active peers

teor2345 · 2020-12-04T03:03:23Z

I've just seen a Testnet Zebra rapidly lose all its peers, and never reconnect:

Click triangle to reveal logs

Dec 04 12:52:33.915  INFO {zebrad="af01a78e"}:sync:extend_tips: zebra_network::peer_set::set: network request with no ready peers: finding more peers, waiting for 15 peers to answer requests
Dec 04 12:52:49.767  INFO {zebrad="af01a78e"}:sync: zebrad::components::sync: extending tips tips.len=1 in_flight=1606 lookahead_limit=2000
Dec 04 12:53:06.779  INFO {zebrad="af01a78e"}:sync: zebrad::components::sync: extending tips tips.len=1 in_flight=1977 lookahead_limit=2000
Dec 04 12:53:32.834  INFO {zebrad="af01a78e"}:sync:extend_tips: zebra_state::service: pruned utxo requests old_len=1365 new_len=40 prune_count=1325 tip=Some((Height(282491), block::Hash("0013da35d7196632f70cfa9fcc6e22906db20d21430ec08cfd2ba231bfd594b4")))
Dec 04 12:53:34.275  INFO {zebrad="af01a78e"}:sync:extend_tips: zebra_network::peer_set::set: network request with no ready peers: finding more peers, waiting for 15 peers to answer requests
Dec 04 12:54:17.672  INFO {zebrad="af01a78e"}:sync:extend_tips: zebra_state::service: pruned utxo requests old_len=729 new_len=45 prune_count=684 tip=Some((Height(282491), block::Hash("0013da35d7196632f70cfa9fcc6e22906db20d21430ec08cfd2ba231bfd594b4")))
Dec 04 12:54:35.569  INFO {zebrad="af01a78e"}:sync:extend_tips: zebra_network::peer_set::set: network request with no ready peers: finding more peers, waiting for 5 peers to answer requests
Dec 04 12:55:19.382  INFO {zebrad="af01a78e"}:peer{addr=[2600:3c00::f03c:92ff:fe48:bd98]:18233}:msg_as_req{msg=getheaders}:inbound: zebra_state::service: pruned utxo requests old_len=423 new_len=49 prune_count=374 tip=Some((Height(282491), block::Hash("0013da35d7196632f70cfa9fcc6e22906db20d21430ec08cfd2ba231bfd594b4")))
Dec 04 12:55:19.384  INFO {zebrad="af01a78e"}:peer{addr=[2600:3c00::f03c:92ff:fe48:bd98]:18233}:msg_as_req{msg=getheaders}:inbound:state: zebra_state::service: responding to peer GetBlocks or GetHeaders final_height=Height(141750) response_len=158 chain_tip_height=Height(282491) stop_height=None intersection_height=Some(Height(141592))
Dec 04 12:55:40.353  INFO {zebrad="af01a78e"}:sync:extend_tips: zebra_network::peer_set::set: network request with no ready peers: finding more peers, waiting for 1 peers to answer requests
Dec 04 12:55:50.508  INFO {zebrad="af01a78e"}:sync:extend_tips: zebra_state::service: pruned utxo requests old_len=4801 new_len=62 prune_count=4739 tip=Some((Height(282491), block::Hash("0013da35d7196632f70cfa9fcc6e22906db20d21430ec08cfd2ba231bfd594b4")))
Dec 04 12:56:45.185  WARN {zebrad="af01a78e"}:sync:extend_tips: zebra_network::peer_set::set: network request with no peer connections. Hint: check your network connection
Dec 04 12:57:47.290  WARN {zebrad="af01a78e"}:sync:extend_tips: zebra_network::peer_set::set: network request with no peer connections. Hint: check your network connection
Dec 04 12:59:02.290  WARN {zebrad="af01a78e"}:sync:extend_tips: zebra_network::peer_set::set: network request with no peer connections. Hint: check your network connection
Dec 04 13:00:17.291  WARN {zebrad="af01a78e"}:sync:extend_tips: zebra_network::peer_set::set: network request with no peer connections. Hint: check your network connection
Dec 04 13:01:32.291  WARN {zebrad="af01a78e"}:sync:extend_tips: zebra_network::peer_set::set: network request with no peer connections. Hint: check your network connection
Dec 04 13:02:47.291  WARN crawl_and_dial: zebra_network::peer_set::set: network request with no peer connections. Hint: check your network connection

teor2345 · 2020-12-04T05:21:08Z

I've taken the triage off this bug, because we already triaged some of its duplicates into the first alpha

oxarbitrage · 2020-12-06T21:51:17Z

Between each warn here is what i can get in a trace level:

Click triangle to reveal logs

...
Dec 06 18:42:32.966 TRACE {zebrad="5b687f87"}:sync:extend_tips:download_and_verify{hash=0000000000176a3344d93598c018fddef80f2edd59b194a7c0a0b7a59aa52c4a}: zebra_network::peer_set::set: ready_peers=0 unready_peers=0
Dec 06 18:42:32.966  WARN {zebrad="5b687f87"}:sync:extend_tips:download_and_verify{hash=0000000000176a3344d93598c018fddef80f2edd59b194a7c0a0b7a59aa52c4a}: zebra_network::peer_set::set: network request with no peer connections. Hint: check your network connection
Dec 06 18:42:32.967 TRACE {zebrad="5b687f87"}:sync:extend_tips:download_and_verify{hash=0000000000176a3344d93598c018fddef80f2edd59b194a7c0a0b7a59aa52c4a}: zebra_network::peer_set::set: preselected service was not ready, reselecting
Dec 06 18:42:32.967 TRACE {zebrad="5b687f87"}:sync:extend_tips:download_and_verify{hash=0000000000176a3344d93598c018fddef80f2edd59b194a7c0a0b7a59aa52c4a}: zebra_network::peer_set::set: no ready services, sending demand signal
Dec 06 18:42:32.967 TRACE {zebrad="5b687f87"}:sync:extend_tips:download_and_verify{hash=0000000000176a3344d93598c018fddef80f2edd59b194a7c0a0b7a59aa52c4a}: tower::buffer::worker: service.ready=false delay
Dec 06 18:42:52.968 DEBUG {zebrad="5b687f87"}:sync:extend_tips:download_and_verify{hash=0000000000176a3344d93598c018fddef80f2edd59b194a7c0a0b7a59aa52c4a}: zebra_network::policies: retrying req=BlocksByHash({block::Hash("0000000000176a3344d93598c018fddef80f2edd59b194a7c0a0b7a59aa52c4a")}) e=Elapsed(()) remaining_tries=1
Dec 06 18:42:52.968 DEBUG {zebrad="5b687f87"}:sync:extend_tips:download_and_verify{hash=0000000000176a3344d93598c018fddef80f2edd59b194a7c0a0b7a59aa52c4a}: zebra_network::policies: retrying req=BlocksByHash({block::Hash("0000000000176a3344d93598c018fddef80f2edd59b194a7c0a0b7a59aa52c4a")}) e=Elapsed(()) remaining_tries=2
Dec 06 18:43:47.966 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/.cargo/git/checkouts/tower-b098c32cf5a1bcca/d4d1c67/tower/src/buffer/service.rs:67:9}: tower::buffer::worker: worker polling for next message
Dec 06 18:43:47.966 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/.cargo/git/checkouts/tower-b098c32cf5a1bcca/d4d1c67/tower/src/buffer/service.rs:67:9}: tower::buffer::worker: dropping cancelled buffered request
Dec 06 18:43:47.966 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/.cargo/git/checkouts/tower-b098c32cf5a1bcca/d4d1c67/tower/src/buffer/service.rs:67:9}: tower::buffer::worker: dropping cancelled request
Dec 06 18:43:47.966 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/zebra/zebra_sync/zebra/zebra-network/src/peer_set/initialize.rs:162:23}:crawl_and_dial: tower::buffer::service: sending request to buffer worker
Dec 06 18:43:47.966 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/.cargo/git/checkouts/tower-b098c32cf5a1bcca/d4d1c67/tower/src/buffer/service.rs:67:9}: tower::buffer::worker: worker polling for next message
Dec 06 18:43:47.966 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/.cargo/git/checkouts/tower-b098c32cf5a1bcca/d4d1c67/tower/src/buffer/service.rs:67:9}: tower::buffer::worker: processing new request
Dec 06 18:43:47.966 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/zebra/zebra_sync/zebra/zebra-network/src/peer_set/initialize.rs:162:23}:crawl_and_dial: tower::buffer::worker: resumed=false worker received request; waiting for service readiness
Dec 06 18:43:47.966 TRACE {zebrad="5b687f87"}:sync:extend_tips:download_and_verify{hash=000000000100c07024462b6221c73daca37afc18047cc0a5fcf2e4520573c821}: tower::hedge::rotating_histogram: Time since last rotation is 189.693851337s.  clearing!
Dec 06 18:43:47.967 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/zebra/zebra_sync/zebra/zebra-network/src/peer_set/initialize.rs:162:23}:crawl_and_dial: zebra_network::peer_set::set: ready_peers=0 unready_peers=0
Dec 06 18:43:47.967 TRACE {zebrad="5b687f87"}:sync:extend_tips:download_and_verify{hash=000000000100c07024462b6221c73daca37afc18047cc0a5fcf2e4520573c821}: tower::buffer::service: sending request to buffer worker
Dec 06 18:43:47.967  WARN {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/zebra/zebra_sync/zebra/zebra-network/src/peer_set/initialize.rs:162:23}:crawl_and_dial: zebra_network::peer_set::set: network request with no peer connections. Hint: check your network connection
Dec 06 18:43:47.967 DEBUG {zebrad="5b687f87"}:sync:extend_tips:download_and_verify{hash=000000000100c07024462b6221c73daca37afc18047cc0a5fcf2e4520573c821}: zebrad::components::sync::downloads: requested block
Dec 06 18:43:47.967 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/zebra/zebra_sync/zebra/zebra-network/src/peer_set/initialize.rs:162:23}:crawl_and_dial: zebra_network::peer_set::set: preselected service was not ready, reselecting
Dec 06 18:43:47.967 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/zebra/zebra_sync/zebra/zebra-network/src/peer_set/initialize.rs:162:23}:crawl_and_dial: zebra_network::peer_set::set: no ready services, sending demand signal
Dec 06 18:43:47.967 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/zebra/zebra_sync/zebra/zebra-network/src/peer_set/initialize.rs:162:23}:crawl_and_dial: tower::buffer::worker: service.ready=false delay
Dec 06 18:43:47.967 DEBUG {zebrad="5b687f87"}:sync:extend_tips:download_and_verify{hash=000000000254e1d8cf90681e84f877d56fb4f7d4965b3eb09c175ca56820b635}: zebrad::components::sync::downloads: waiting to request block
Dec 06 18:44:07.968 DEBUG {zebrad="5b687f87"}:sync:extend_tips:download_and_verify{hash=000000000100c07024462b6221c73daca37afc18047cc0a5fcf2e4520573c821}: zebra_network::policies: retrying req=BlocksByHash({block::Hash("000000000100c07024462b6221c73daca37afc18047cc0a5fcf2e4520573c821")}) e=Elapsed(()) remaining_tries=2
Dec 06 18:45:03.017 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/.cargo/git/checkouts/tower-b098c32cf5a1bcca/d4d1c67/tower/src/buffer/service.rs:67:9}: tower::buffer::worker: worker polling for next message
Dec 06 18:45:03.017 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/.cargo/git/checkouts/tower-b098c32cf5a1bcca/d4d1c67/tower/src/buffer/service.rs:67:9}: tower::buffer::worker: resuming buffered request
Dec 06 18:45:03.017 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/zebra/zebra_sync/zebra/zebra-network/src/peer_set/initialize.rs:162:23}:crawl_and_dial: tower::buffer::worker: resumed=true worker received request; waiting for service readiness
Dec 06 18:45:03.017 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/zebra/zebra_sync/zebra/zebra-network/src/peer_set/initialize.rs:162:23}:crawl_and_dial: zebra_network::peer_set::set: ready_peers=0 unready_peers=0
Dec 06 18:45:03.017  WARN {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/zebra/zebra_sync/zebra/zebra-network/src/peer_set/initialize.rs:162:23}:crawl_and_dial: zebra_network::peer_set::set: network request with no peer connections. Hint: check your network connection
Dec 06 18:45:03.017 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/zebra/zebra_sync/zebra/zebra-network/src/peer_set/initialize.rs:162:23}:crawl_and_dial: zebra_network::peer_set::set: preselected service was not ready, reselecting
Dec 06 18:45:03.017 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/zebra/zebra_sync/zebra/zebra-network/src/peer_set/initialize.rs:162:23}:crawl_and_dial: zebra_network::peer_set::set: no ready services, sending demand signal
Dec 06 18:45:03.017 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/zebra/zebra_sync/zebra/zebra-network/src/peer_set/initialize.rs:162:23}:crawl_and_dial: tower::buffer::worker: service.ready=false delay
...

At around every 1 minute(connected or not connected) we get this one: https://github.com/ZcashFoundation/zebra/blob/main/zebra-network/src/peer_set/set.rs#L452 so we have a working loop but it seems that the demand signal at the following line is not working.

teor2345 · 2020-12-14T01:54:51Z

We just merged #1468, which should provide better diagnostics for this issue, and panics if our assumptions about the peer state machine don't hold.

teor2345 · 2021-01-05T03:16:34Z

We think #1531 will fix most of these issues - once it's merged we'll close this ticket.

Any new hangs should have separate tickets - ideally based on the part of Zebra that's hanging.

teor2345 · 2021-01-06T01:13:07Z

Testing shows that #1531 doesn't fix this issue

teor2345 · 2021-01-16T00:13:07Z

This issue persists after #1586, but it seems less frequent, and has a lower impact (until there are no peers at all, which still results in permanent failure.)

teor2345 · 2021-01-25T03:02:44Z

When zebrad is stuck in a "network request with no peer connections" loop, starting a peer that connects to zebrad's listener port gets the syncer running again.

So there appear to be two issues here:

the syncer hangs when there are no peers, even though it has a timeout on peer requests - zebrad syncer hangs when there are no peers, despite network timeouts #1633
the peer set is dropping peers, hanging them, or otherwise leaving them in a bad state
- for example, it uses 100% CPU, a sign of a busy-waiting loop - Extreme CPU usage when sync has hung #1634

teor2345 · 2021-01-29T02:39:45Z

@mpguerra #1620 only fixes one cause of this issue, there could me more instances of #1593, and other causes of peer hangs, like #1633.

oxarbitrage added C-bug Category: This is a bug S-needs-triage Status: A bug report needs triage labels Dec 2, 2020

teor2345 closed this as completed Dec 2, 2020

oxarbitrage reopened this Dec 2, 2020

teor2345 changed the title ~~sync component dead during sync~~ Peer connections fail permanently after network interruptions Dec 4, 2020

teor2345 added this to the First Alpha Release milestone Dec 4, 2020

teor2345 mentioned this issue Dec 4, 2020

UTXO hang on first finalized testnet block #1406

Closed

teor2345 added A-rust Area: Updates to Rust code and removed S-needs-triage Status: A bug report needs triage labels Dec 4, 2020

This was referenced Dec 7, 2020

Add peer set tracing and unreachable panics #1468

Merged

Unable to restart sync after sleep #1467

Closed

mpguerra modified the milestones: First Alpha Release, v1.0.0-alpha.1 Dec 9, 2020

teor2345 mentioned this issue Dec 14, 2020

Panic: called fail_with on already-failed connection state #1510

Closed

teor2345 changed the title ~~Peer connections fail permanently after network interruptions~~ Peer connections are slow or fail permanently Dec 14, 2020

teor2345 mentioned this issue Dec 14, 2020

Very slow sync #1516

Closed

This was referenced Dec 16, 2020

Stop panicking when ClientRequests return an error #1531

Merged

Improve the random port function docs #1536

Merged

teor2345 mentioned this issue Jan 6, 2021

Sync service hangs, but inbound service continues to answer requests #1559

Closed

3 tasks

teor2345 modified the milestones: v1.0.0-alpha.1, 2021 Sprint 1 Jan 11, 2021

This was referenced Jan 13, 2021

Add sync and inbound timeouts #1586

Merged

Make sure Zebra uses poll_ready and Buffer reservations correctly #1593

Closed

mpguerra assigned teor2345 Jan 14, 2021

teor2345 mentioned this issue Jan 22, 2021

Fix poll_ready usage for the Inbound service #1620

Merged

2 tasks

This was referenced Jan 25, 2021

zebrad syncer hangs when there are no peers, despite network timeouts #1633

Closed

Extreme CPU usage when sync has hung #1634

Closed

Limit inbound download and verify queue #1622

Merged

mpguerra linked a pull request Jan 25, 2021 that will close this issue

Fix poll_ready usage for the Inbound service #1620

Merged

2 tasks

mpguerra modified the milestones: 2021 Sprint 1, 2021 Sprint 4, 2021 Sprint 3 Jan 25, 2021

teor2345 removed a link to a pull request Jan 29, 2021

Fix poll_ready usage for the Inbound service #1620

Merged

2 tasks

This was referenced Feb 3, 2021

Stop ignoring semaphore errors in tower-batch #1669

Closed

Use the correct bound for the semaphore in tower-batch #1670

Closed

Fix candidate set address state handling #1709

Merged

Log when the syncer awaits peer readiness #1714

Merged

teor2345 added I-hang A Zebra component stops responding to requests P-High labels Feb 16, 2021

teor2345 closed this as completed in #1709 Feb 18, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Peer connections hang temporarily or fail permanently #1435

Peer connections hang temporarily or fail permanently #1435

oxarbitrage commented Dec 2, 2020 •

edited by teor2345

Loading

teor2345 commented Dec 2, 2020

oxarbitrage commented Dec 2, 2020

oxarbitrage commented Dec 2, 2020 •

edited

Loading

teor2345 commented Dec 3, 2020 •

edited

Loading

oxarbitrage commented Dec 3, 2020

oxarbitrage commented Dec 3, 2020 •

edited

Loading

oxarbitrage commented Dec 3, 2020

oxarbitrage commented Dec 3, 2020

hdevalence commented Dec 3, 2020

teor2345 commented Dec 4, 2020 •

edited

Loading

teor2345 commented Dec 4, 2020 •

edited

Loading

teor2345 commented Dec 4, 2020

oxarbitrage commented Dec 6, 2020 •

edited by teor2345

Loading

teor2345 commented Dec 14, 2020

teor2345 commented Jan 5, 2021

teor2345 commented Jan 6, 2021

teor2345 commented Jan 16, 2021 •

edited

Loading

teor2345 commented Jan 25, 2021 •

edited

Loading

teor2345 commented Jan 29, 2021

Peer connections hang temporarily or fail permanently #1435

Peer connections hang temporarily or fail permanently #1435

Comments

oxarbitrage commented Dec 2, 2020 • edited by teor2345 Loading

teor2345 commented Dec 2, 2020

oxarbitrage commented Dec 2, 2020

oxarbitrage commented Dec 2, 2020 • edited Loading

teor2345 commented Dec 3, 2020 • edited Loading

oxarbitrage commented Dec 3, 2020

oxarbitrage commented Dec 3, 2020 • edited Loading

oxarbitrage commented Dec 3, 2020

oxarbitrage commented Dec 3, 2020

hdevalence commented Dec 3, 2020

teor2345 commented Dec 4, 2020 • edited Loading

teor2345 commented Dec 4, 2020 • edited Loading

teor2345 commented Dec 4, 2020

oxarbitrage commented Dec 6, 2020 • edited by teor2345 Loading

teor2345 commented Dec 14, 2020

teor2345 commented Jan 5, 2021

teor2345 commented Jan 6, 2021

teor2345 commented Jan 16, 2021 • edited Loading

teor2345 commented Jan 25, 2021 • edited Loading

teor2345 commented Jan 29, 2021

oxarbitrage commented Dec 2, 2020 •

edited by teor2345

Loading

oxarbitrage commented Dec 2, 2020 •

edited

Loading

teor2345 commented Dec 3, 2020 •

edited

Loading

oxarbitrage commented Dec 3, 2020 •

edited

Loading

teor2345 commented Dec 4, 2020 •

edited

Loading

teor2345 commented Dec 4, 2020 •

edited

Loading

oxarbitrage commented Dec 6, 2020 •

edited by teor2345

Loading

teor2345 commented Jan 16, 2021 •

edited

Loading

teor2345 commented Jan 25, 2021 •

edited

Loading