
Catch more sync errors #1545

Merged: carver merged 10 commits into ethereum:master from carver:catch-more-sync-errors on Feb 12, 2020

Conversation

carver (Contributor) commented Feb 12, 2020

What was wrong?

A variety of errors pop up sporadically. Occasionally, they even stop sync dead.

There is also an issue that causes the BeamSyncer to hang on shutdown. This was preventing pivot from working correctly, so it is a blocker for #908.

How was it fixed?

Mostly, just catch exceptions that were bubbling up too far.

To-Do

  • Clean up commit history

Cute Animal Picture

put a cute animal picture link inside the parentheses

carver added 5 commits Feb 11, 2020
Make sure that tasks that hang on the Beam Imports are cancelled when the
beam syncer shuts down. This is important for a clean shutdown, and also
for a clean beam pivot.
            await self._handle_commands(msg_stream, stop)
        except asyncio.TimeoutError as exc:
            self.logger.warning(
                "Timed out waiting for command from %s, Stop: %r, exiting...",

carver (Author) commented Feb 12, 2020

I don't understand the internals of connection/transport/etc. well enough here to confidently squash this. So for now I'm logging a warning. Anyone have a better idea what should be happening? Maybe @pipermerriam?
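
For context, here is the pattern above in a minimal, self-contained form. This is a sketch only: _handle_commands is stubbed out, and the timeout value and surrounding names are illustrative rather than trinity's actual internals.

    import asyncio
    import logging

    logger = logging.getLogger("p2p.connection")

    async def _handle_commands(msg_stream, stop):
        # Stand-in for the real command-processing loop.
        await asyncio.sleep(3600)

    async def run_connection(msg_stream, stop, timeout=300.0):
        # Catch the TimeoutError at the loop boundary: log a warning and
        # exit cleanly, instead of letting it bubble up and stop sync dead.
        try:
            await asyncio.wait_for(_handle_commands(msg_stream, stop), timeout)
        except asyncio.TimeoutError:
            logger.warning(
                "Timed out waiting for command from %s, Stop: %r, exiting...",
                msg_stream, stop,
            )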

@@ -272,7 +273,7 @@ def write(self, data: bytes) -> None:
                self,
                self.read_state.name,
            )
            raise Exception(f"Corrupted transport: {self} - state={self.read_state.name}")

carver (Author) commented Feb 12, 2020

I didn't want to blindly catch an Exception.

@@ -734,15 +734,15 @@ def __init__(

    async def _provide_missing_account_tries(self) -> None:
        async for event in self.wait_iter(self._event_bus.stream(CollectMissingAccount)):
-           self.run_task(self._serve_account(event))
+           self.run_task(self.wait(self._serve_account(event)))

carver (Author) commented Feb 12, 2020

This was causing the BeamSyncer to hang on shutdown, which made it impossible to pivot.

cburgdorf (Contributor) commented Feb 12, 2020

I'm just noting that the slightly better fix here is probably to migrate it to the new async-service, which doesn't need the wait constructs anymore. But that shouldn't hold you back from fixing it this way for now.

pipermerriam (Member) commented Feb 12, 2020

I'll handle this separately.
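
For context on why wrapping the task in self.wait(...) fixes the hang: the wrapper races the served coroutine against the service's cancel token, so shutdown can win. A minimal sketch of that pattern, using a plain asyncio.Event as a stand-in for trinity's cancel token (illustrative only, not the real CancellableMixin internals):

    import asyncio

    async def cancellable_wait(coro, shutdown: asyncio.Event):
        # Race the wrapped coroutine against the shutdown signal, so a
        # task that never finishes can no longer block shutdown (or pivot).
        task = asyncio.ensure_future(coro)
        stop = asyncio.ensure_future(shutdown.wait())
        done, pending = await asyncio.wait(
            {task, stop}, return_when=asyncio.FIRST_COMPLETED
        )
        for fut in pending:
            fut.cancel()  # cancel whichever side lost the race
        if task in done:
            return task.result()
        raise asyncio.CancelledError("shutdown requested before the task finished")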

vm.logger.info("Got OSError while Beam Sync previewed execution of %s", header)
except BrokenPipeError:
vm.logger.info("Got BrokenPipeError while Beam Sync previewed execution of %s", header)
vm.logger.debug("BrokenPipeError trace for %s", header, exc_info=True)

carver (Author) commented Feb 12, 2020

This seems to be the simplest way to handle it when beam importing is cancelled externally. Someday, we can either handle it more explicitly/proactively, or we can just reduce it to a debug or debug2 log, if we decide this catch-all approach isn't problematic.
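
The general shape of this "log and swallow at the boundary" approach, as a hedged sketch (no such helper exists in trinity; the real handling is inlined in the beam import path, per the diff above):

    import logging
    from contextlib import contextmanager

    logger = logging.getLogger("beam.import")

    @contextmanager
    def tolerate_external_cancel(what: str):
        # Treat errors caused by an external shutdown as expected:
        # surface a short info line, keep the full traceback at debug.
        try:
            yield
        except OSError as exc:  # BrokenPipeError is a subclass of OSError
            logger.info("Got %r while %s", exc, what)
            logger.debug("trace for %s", what, exc_info=True)

Usage would wrap the preview execution, e.g. with tolerate_external_cancel("previewing execution of " + str(header)):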

@@ -139,10 +143,28 @@ def _update_queen(self, peer: ETHPeer) -> ETHPeer:
        elif peer == self._queen_peer:
            # nothing to do, peer is already the queen
            return None
        elif _peer_sort_key(peer) < _peer_sort_key(self._queen_peer):

carver (Author) commented Feb 12, 2020

This change isn't nearly as big as it looks. Basically, we just have to catch it when a peer disappears at the moment we ask for its quality. We want to do different things depending on which peer went away, so we have to make these lookups ahead of the if block (roughly as sketched below).
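
A rough, self-contained sketch of that restructure. The stubs below (PeerConnectionLost, _peer_sort_key, the tracker object) are illustrative stand-ins, not trinity's real types:

    import logging

    logger = logging.getLogger("trinity.queen_tracker")

    class PeerConnectionLost(Exception):
        """Stub for p2p.exceptions.PeerConnectionLost, for this sketch only."""

    def _peer_sort_key(peer):
        # Stand-in: the real version reads speed stats and raises
        # (UnknownAPI, PeerConnectionLost) if the peer has gone away.
        return peer.quality

    def update_queen(tracker, peer):
        try:
            # Do both lookups before the if/elif chain, so a disappearing
            # peer surfaces before any comparison happens.
            new_quality = _peer_sort_key(peer)
            queen_quality = _peer_sort_key(tracker.queen_peer)
        except PeerConnectionLost as exc:
            # The real code does different things depending on which peer
            # vanished; here we just skip the update.
            logger.debug("Ignoring %s, because we can't get speed stats: %r", peer, exc)
            return
        if new_quality < queen_quality:
            tracker.queen_peer = peer  # the lower sort key takes the throne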

carver requested a review from cburgdorf Feb 12, 2020
                for line in peer.get_extra_stats():
                    self.logger.debug("    %s", line)
            except (UnknownAPI, PeerConnectionLost) as exc:
                self.logger.debug("    Failure during stats lookup: %r", exc)

cburgdorf (Contributor) commented Feb 12, 2020

Is there a reason for the whitespace at the beginning of this logging message? It's been present in the previous log, so maybe it was just carried along.

carver (Author) commented Feb 12, 2020

Yeah, it's an indentation to indicate that it's a sub-detail of the line that came before it. The line before it identifies which peer we are logging about.

For example:

   DEBUG  2020-02-11 15:51:17,318           ETHPeerPool  == Peer details == 
   DEBUG  2020-02-11 15:51:17,319           ETHPeerPool  ETHPeer <Session <Node(0x945b15@1.1.1.1)> 8bb73250-fecb-41c9-9e06-16b12f79cdcd>: uptime=58s, received_msgs=17
   DEBUG  2020-02-11 15:51:17,319           ETHPeerPool  client_version_string='Geth/v1.9.3-stable-cfbb969d/linux-amd64/go1.11.5'
   DEBUG  2020-02-11 15:51:17,319           ETHPeerPool      trinity.protocol.eth.commands.BlockBodies: None
   DEBUG  2020-02-11 15:51:17,319           ETHPeerPool      trinity.protocol.eth.commands.BlockHeaders: msgs=1  items=2  rtt=19.01/0.16/0.00  ips=0.62755  timeouts=0  quality=5
   DEBUG  2020-02-11 15:51:17,319           ETHPeerPool      trinity.protocol.eth.commands.NodeData: None
   DEBUG  2020-02-11 15:51:17,319           ETHPeerPool      trinity.protocol.eth.commands.Receipts: None
   DEBUG  2020-02-11 15:51:17,319           ETHPeerPool  ETHPeer <Session <Node(0xe61905@2.2.2.2)> 81a2d9fa-3dce-4562-80b7-82f5f9a1c0a4>: uptime=58s, received_msgs=6
   DEBUG  2020-02-11 15:51:17,319           ETHPeerPool  client_version_string='Geth/v1.9.2-stable-e76047e9/linux-amd64/go1.11.5'
   DEBUG  2020-02-11 15:51:17,319           ETHPeerPool      Failure during stats lookup: <exception details here>

cburgdorf (Contributor) commented Feb 12, 2020

Ah, of course! Nevermind :)

cburgdorf (Contributor) left a review

LGTM. Left some comments inline, but nothing that should hold this back.

            new_peer_quality = _peer_sort_key(peer)
        except (UnknownAPI, PeerConnectionLost) as exc:
            self.logger.debug("Ignoring %s, because we can't get speed stats: %r", peer, exc)
            return None

cburgdorf (Contributor) commented Feb 12, 2020

I know this has always returned None, so no need to change it now, but let me just impersonate @pipermerriam for a second to point out that it would be more aligned with the rest of our code base if this raised exceptions (e.g. QueenUnchanged, NoPreviousQueen) instead of returning None here.

carver (Author) commented Feb 12, 2020

It didn't feel right at first glance, so I poked around and decided it wasn't really important to return anything ever. So I just removed all returns :)
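
For reference, the exception-raising style cburgdorf describes would look roughly like this. QueenUnchanged and NoPreviousQueen are hypothetical names lifted from his comment, not classes that exist in the codebase:

    class NoPreviousQueen(Exception):
        """Hypothetical: no queen peer has been crowned yet."""

    class QueenUnchanged(Exception):
        """Hypothetical: the candidate did not displace the current queen."""

    def get_queen(tracker):
        # Raise instead of returning None, so callers can't silently
        # mistake "no queen" for a peer object.
        if tracker.queen_peer is None:
            raise NoPreviousQueen("no queen peer has been crowned yet")
        return tracker.queen_peer

    def update_queen_strict(tracker, peer):
        if peer == tracker.queen_peer:
            raise QueenUnchanged(f"{peer} is already the queen")
        tracker.queen_peer = peer

The merged version went the other way and dropped the return values entirely, which avoids the mixed return types just as well.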

@@ -33,6 +33,7 @@
)
from p2p.cancellable import CancellableMixin
from p2p.exceptions import (
    TransportOverlap,

pipermerriam (Member) commented Feb 12, 2020

I think that CorruptTransport would be more appropriate. I think the definition of what happens here is that we lose our place in the stream due to a read operation being interrupted at some point. At this stage I don't think there is a way to recover, meaning that we should close the transport and terminate the peer connection when we detect this occurring.
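
A sketch of that suggestion; CorruptTransport is the hypothetical name from the comment, while the diff above only imports TransportOverlap and raises a bare Exception:

    class CorruptTransport(Exception):
        """Raised when an interrupted read loses our place in the stream.

        Unrecoverable by design: on catching this, close the transport
        and terminate the peer connection.
        """

    # At the detection site, instead of a bare Exception:
    #     raise CorruptTransport(
    #         f"Corrupted transport: {self} - state={self.read_state.name}"
    #     )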

carver added 5 commits Feb 12, 2020
It is a code smell to sometimes return an object and other times return
None. It turns out the returned value wasn't really that helpful or
critical, so the cleanest resolution was to remove it altogether.
carver force-pushed the carver:catch-more-sync-errors branch from 27c921d to 5be4576 Feb 12, 2020
carver merged commit 9638379 into ethereum:master Feb 12, 2020
44 checks passed

ci/circleci (all passed): docker-trinity-beacon-image-build-test, docker-trinity-image-build-test,
py36-docs, py36-eth1-components, py36-eth1-core, py36-eth1-monitor-trio, py36-eth2-core,
py36-eth2-fixtures, py36-eth2-integration, py36-eth2-utils, py36-integration,
py36-lightchain_integration, py36-lint, py36-lint-eth2, py36-long_run_integration, py36-p2p,
py36-p2p-trio, py36-rpc-blockchain, py36-rpc-state-byzantium, py36-rpc-state-constantinople,
py36-rpc-state-frontier, py36-rpc-state-homestead, py36-rpc-state-petersburg,
py36-rpc-state-spurious_dragon, py36-rpc-state-tangerine_whistle, py36-wheel-cli,
py37-eth1-components, py37-eth1-core, py37-eth1-monitor-trio, py37-eth2-components,
py37-eth2-core, py37-eth2-fixtures, py37-eth2-integration, py37-eth2-utils, py37-libp2p,
py37-lint, py37-lint-eth2, py37-p2p, py37-p2p-trio, py37-rpc-state-quadratic,
py37-rpc-state-sstore, py37-rpc-state-zero_knowledge, py37-wheel-cli
continuous-documentation/read-the-docs: build succeeded
carver deleted the carver:catch-more-sync-errors branch Feb 12, 2020
carver mentioned this pull request Feb 13, 2020
3 of 3 tasks complete
carver added a commit to carver/trinity that referenced this pull request Feb 13, 2020