This repository has been archived by the owner on Jul 1, 2021. It is now read-only.

Persist Peer Information between runs #270

Merged
lithp merged 2 commits into master from lithp/persist-node-info on Feb 14, 2019

Conversation

@lithp (Contributor) commented Feb 12, 2019

A cleaned-up version of #261

What was wrong?

Trinity wastes time trying to connect to nodes which it already knows won't work (they're on the wrong network or using the wrong genesis).

How was it fixed?

Trinity now keeps an SQLite database with information on peers. It records failures and doesn't attempt to reconnect to useless peers until a timeout has passed.

This PR doesn't do things like remember good nodes or exponential backoff, but it hopefully lays a foundation which will make those changes easy in the future.
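
For illustration, a minimal sketch of the approach (the bad_nodes table and its columns match the queries quoted later in this thread; everything else here is illustrative rather than the PR's actual code):

import sqlite3
from datetime import datetime

# Hypothetical sketch: record failures, then skip nodes until a timeout passes.
db = sqlite3.connect('nodedb')
db.execute('''
    CREATE TABLE IF NOT EXISTS bad_nodes (
        enode TEXT PRIMARY KEY,
        until TEXT,            -- retry is allowed after this timestamp
        reason TEXT,
        error_count INTEGER
    )
''')

def should_connect_to(enode: str) -> bool:
    row = db.execute('SELECT until FROM bad_nodes WHERE enode = ?', (enode,)).fetchone()
    if row is None:
        return True  # no recorded failure for this node
    return datetime.utcnow() >= datetime.fromisoformat(row[0])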

Todo:

  • aiosqlite? -- decided not to do this because the library seems rather immature, and I don't expect sqlite will block the thread for very long, even though it's ugly that it blocks the thread
  • When logging, use localtime? Either that or explicitly mention that the logs use UTC
  • Add an entry to release notes

Cute Animal Picture

[image: cute baby elephants]

@lithp lithp force-pushed the lithp/persist-node-info branch 7 times, most recently from 954fdc7 to ebf132c Compare February 12, 2019 07:17
@lithp lithp changed the title [DNM] Persist Peer Information between runs [WIP] Persist Peer Information between runs Feb 12, 2019
@lithp lithp changed the title [WIP] Persist Peer Information between runs Persist Peer Information between runs Feb 12, 2019
@lithp lithp requested a review from cburgdorf February 12, 2019 22:25
p2p/peer_pool.py Outdated
@@ -79,11 +83,16 @@ def __init__(self,
privkey: datatypes.PrivateKey,
context: BasePeerContext,
max_peers: int = DEFAULT_MAX_PEERS,
peer_info: BasePeerInfoPersistance = None,
@cburgdorf (Contributor) commented Feb 12, 2019:

Wouldn't it make sense to put NoopPeerInfoPersistance() as a default instead of None, and then remove lines 92-93?

@lithp (Contributor, Author):

Yeah, good point. My instinct is to avoid passing objects as parameter defaults, but NoopPeerInfoPersistance has no state, so it's safe here.

Member:

How about just not having a default at this level and taking care of it at the level above this where it is instantiated.

@lithp (Contributor, Author):

How about just not having a default at this level and taking care of it at the level above this where it is instantiated.

My idea was that BasePeerPool would always have a peer_info without needing to put null checks everywhere. This also makes the interface a little simpler: if the caller doesn't care about persistence it can implicitly not care, rather than explicitly passing in a Noop object.
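
For illustration, a sketch of the null-object idea under discussion (record_failure and should_connect_to appear elsewhere in this PR; the exact signatures here are assumptions):

class NoopPeerInfoPersistance(BasePeerInfoPersistance):
    """Stateless do-nothing implementation, so callers that don't care
    about persistence never force null checks into BasePeerPool."""

    def record_failure(self, remote: Node, timeout: int, reason: str) -> None:
        pass  # deliberately forget the failure

    def should_connect_to(self, remote: Node) -> bool:
        return True  # never veto a connection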

p2p/persistance.py (outdated, resolved)
    return True


class ClosedException(Exception):
Contributor:

I think this should derive from BaseP2PError and probably even move into p2p/exceptions.py

@lithp (Contributor, Author):

Huh, I'm willing to do it, but why do you think that error is more appropriate? This exception doesn't have anything to do with peers; it signals an invalid state which should never be reached. I thought about making it an assert False but I think this stack trace is a little easier to read.

Member:

If seeing this is truly something a user should never encounter then downgrading to Exception is acceptable. If it's helpful to have a custom exception but we don't want to expose it to users, then maybe add that explanation to the docstring. If a user might encounter this in some situation (which it sounds like they shouldn't) then it should be in p2p.exceptions and inherit from the base exception.

Do as you see fit 😸

@ralexstokes (Member) commented:
looks nice :) this is minor but shall i suggest s/persistance/persistence/?

T = TypeVar('T', bound=Callable[..., Any])


def must_be_open(func: T) -> T:
Contributor:

This looks like we are throwing away valuable type information here (T has a bound of Callable[..., Any]).

Did you check whether mypy still thinks can_connect_to will return bool after this decorator is applied? You can test it by putting reveal_type(can_connect_to) in and running mypy.

Member:

https://mypy.readthedocs.io/en/latest/generics.html#declaring-decorators

According to the docs preserving function signatures is done as follows:

FuncType = Callable[..., Any]
F = TypeVar('F', bound=FuncType)

# A decorator that preserves the signature.
def my_decorator(func: F) -> F:
    def wrapper(*args, **kwds):
        ...
    return cast(F, wrapper)

And the current code looks roughly equivalent to this.

@lithp (Contributor, Author) commented Feb 13, 2019:

Did you check whether mypy still thinks can_connect_to will return bool after this decorator is applied?

I didn't, but this decorator is called on more methods than can_connect_to, and I wanted it to work no matter what the signature of the wrapped method is. You're right that it's not very pretty though 😢 I found a ticket which tracks mypy support for this: python/mypy#3157
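
For illustration, a sketch of the decorator with this typing pattern (the closed attribute and the guard logic are assumptions; only the names must_be_open and ClosedException come from this PR):

from typing import Any, Callable, TypeVar, cast

T = TypeVar('T', bound=Callable[..., Any])

class ClosedException(Exception):
    pass  # stands in for the exception defined in p2p/persistance.py

def must_be_open(func: T) -> T:
    # Refuse to run the wrapped method once the instance is closed.
    def wrapper(self: Any, *args: Any, **kwargs: Any) -> Any:
        if self.closed:
            raise ClosedException(f'{func.__name__} called after close()')
        return func(self, *args, **kwargs)
    return cast(T, wrapper)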


def _fetch_node(self, remote: Node) -> sqlite3.Row:
    enode = remote.uri()
    cursor = self.db.execute('SELECT * from bad_nodes WHERE enode = ?', (enode,))
Contributor:

I haven't worked with SQL for ages and have never done so in Python. I think @pipermerriam already commented on SQL injection in the previous PR, so I assume it has been double-checked that this is the right way to do it and that we are on the safe side?

Member:

Yes, using the ? in the query string and passing the values as positional arguments does keep us safe from SQL injection.
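
For contrast, a hypothetical unsafe version next to the safe placeholder style used here:

# UNSAFE (hypothetical): string interpolation lets a crafted enode value
# rewrite the query.
db.execute(f"SELECT * from bad_nodes WHERE enode = '{enode}'")

# SAFE: sqlite3 binds the value separately from the SQL text, so it can
# never be parsed as SQL.
db.execute('SELECT * from bad_nodes WHERE enode = ?', (enode,))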

@lithp (Contributor, Author):

It makes me sad that this is the correct way, but it is indeed the correct way. Named params are nicer, but this sqlite3 library doesn't support them.

@lithp (Contributor, Author) commented Feb 13, 2019:

Actually, this is undocumented, but after looking at the sqlite3 library, params are passed directly to SQLite, which means named parameters are possible.

That means the code could look like this:

self.db.execute(
    '''
    INSERT INTO bad_nodes (enode, until, reason, error_count)
    VALUES (:enode, :until, :reason, :error_count)
    ''',
    {
        'enode': enode,
        'until': time_to_str(until),
        'reason': reason,
        'error_count': error_count,
    }
)

But that might be a level of boilerplate which the queries aren't yet big enough to justify.

p2p/persistance.py (outdated, resolved)

class MemoryPeerInfoPersistance(SQLPeerInfoPersistance):
    def __init__(self, logger: ExtendedDebugLogger) -> None:
        super().__init__(Path(":memory:"), logger)
Contributor:

Sweet ❤️

Member:

Yeah, in-memory SQLite is pretty sexy. I recall loving this for testing Django apps.
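
For illustration, a sketch of the kind of test fixture this enables (the fixture name and the plain logging.getLogger are assumptions):

import logging

import pytest

@pytest.fixture
def peer_info():
    # ':memory:' gives each test a fresh, isolated database and leaves
    # nothing on disk to clean up afterwards.
    return MemoryPeerInfoPersistance(logging.getLogger('testing'))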



class BasePeerInfoPersistance(abc.ABC):
    def __init__(self, logger: ExtendedDebugLogger) -> None:
Contributor:

Rather than monkeypatching current_time in the tests, I would personally prefer __init__ to take a parameter like time_provider: Callable[[], datetime.datetime] = datetime.datetime.utcnow. That gives a sensible default value yet is overridable for testing without reaching for monkeypatch.

Member:

I'm mildly 👎 on this. Time is one of the few things I'm OK with mocking, as I haven't come across a better mechanism. I would advocate removing the current_time function in favor of just using datetime.datetime.utcnow(), and then, in our tests, using the monkeypatch fixture to swap that function out for one that returns the desired value.
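
For illustration, a sketch of that monkeypatch approach (the test body and fixtures are assumptions; record_failure and should_connect_to are the methods discussed elsewhere in this PR):

import datetime

def test_blacklist_expires(monkeypatch, peer_info, remote):
    peer_info.record_failure(remote, timeout=60, reason='SomeFailure')
    assert not peer_info.should_connect_to(remote)

    # Swap the module's notion of "now" for a point past the timeout.
    later = datetime.datetime.utcnow() + datetime.timedelta(days=2)
    monkeypatch.setattr('p2p.persistance.current_time', lambda: later)
    assert peer_info.should_connect_to(remote)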

Contributor:

Fine either way 👍

@lithp (Contributor, Author):

My idea on this is that there are two big categories of references: the "immutable" world of static references (immutable globals such as + or, in this file, sqlite3), and the mutable world of objects, which programs typically concern themselves with.

Dependency injection works well for the latter: there are legitimately different implementations of BasePeerInfoPersistance floating around, and almost all classes won't care exactly which one they use, so by taking responsibility for constructing BasePeerInfoPersistance away from those classes we can make the code more flexible and much easier to test.

I think DI makes less sense for the former. Imports are fixed at the beginning of runtime; there aren't going to be multiple implementations of datetime.datetime.utcnow() floating around, so you lose no flexibility by explicitly referencing it. And if you squint, Python is already doing DI! This module is asking for an implementation of datetime.datetime and Python is providing one. test_persistence isn't monkeypatching so much as changing which implementation of utcnow() this module receives.

Contributor:

test_persistence isn't monkeypatching so much as changing which implementation of utcnow() this module receives.

Yes, which is why I lean toward: hey, if we do encounter a case where we want to swap the implementation of utcnow() for something else, e.g. a controllable fake utcnow(), then DI is my tool of choice.

And for a class where time plays an important role I wouldn't see it as uncommon to just depend on a time_provider: Callable[[], int]. But anyway, this can become quite philosophical, and I just happen to have built DI-heavy systems for many years, so I'm biased in that direction.

It's actually refreshing to look at other solutions, so I'm not opposed to the existing code 👍

@@ -23,6 +24,9 @@ def __init__(self, event_bus: Endpoint, trinity_config: TrinityConfig) -> None:
        self._node_port = trinity_config.port
        self._max_peers = trinity_config.max_peers

        app_config = trinity_config.get_app_config(Eth1AppConfig)
Contributor:

Just noting that, without further checking, this would crash if the Node class gets reused for the beacon chain, which I don't think will happen (as described in #240).

@pipermerriam (Member) left a comment:

Some idle comments, not a full review.

p2p/persistance.py (outdated, resolved)
p2p/persistance.py (outdated, resolved)


class WrongNetworkFailure(HandshakeFailure):
    timeout = 60 * 60 * 24  # one day
Member:

Maybe instead of storing these timeouts on the exceptions we should keep this information separate, in a mapping from Exception -> timeout.

@lithp (Contributor, Author) commented Feb 13, 2019:

I was unsure about this. On the one hand it doesn't seem to be BasePeerInfoPersistence's responsibility, but on the other hand you're right, this isn't really a property of the failure itself.

I could make a map in p2p/persistence.py; I didn't know if that was adding too much boilerplate to this PR.

@lithp (Contributor, Author) commented Feb 14, 2019:

Okay, I've moved them into a map; that's probably a big enough change to be worth looking at again: 129b8d6
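
For illustration, a sketch of such a map (the WrongNetworkFailure duration comes from the snippet above; WrongGenesisFailure and the helper are assumptions):

import datetime

ONE_DAY = datetime.timedelta(days=1)

# Blacklist durations live next to the persistence code instead of on
# the exception classes themselves.
FAILURE_TIMEOUTS = {
    WrongNetworkFailure: ONE_DAY,
    WrongGenesisFailure: ONE_DAY,
}

def timeout_for_failure(failure: HandshakeFailure) -> datetime.timedelta:
    return FAILURE_TIMEOUTS[type(failure)]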

@pipermerriam (Member) left a comment:

It doesn't have to be addressed in this PR, but I want to give consideration to how this interacts with preferred peers and the concept of reserved connection slots. We'll eventually need to ensure this mechanism doesn't blacklist a peer in a vipnode-type setup where peers have paid for reserved slots.

Note that this is probably best solved via a multi-peer-pool setup, since it's probably not sane to try to support a single peer pool having multiple sets of rules for peer connections; those rules would very likely end up conflicting with each other.

p2p/persistance.py (outdated, resolved)
p2p/peer_pool.py Outdated
@@ -79,11 +83,16 @@ def __init__(self,
privkey: datatypes.PrivateKey,
context: BasePeerContext,
max_peers: int = DEFAULT_MAX_PEERS,
peer_info: BasePeerInfoPersistance = None,
Member:

How about just not having a default at this level and taking care of it at the level above this where it is instantiated.

p2p/peer_pool.py (outdated, resolved)
p2p/peer_pool.py Outdated
except HandshakeFailure as e:
    self.logger.debug("Could not complete handshake with %r: %s", remote, repr(e))
    self.peer_info.record_failure(
        remote, timeout=e.timeout, reason=type(e).__name__
Member:

It'd be nice to have a fixed set of values for the reason string. Maybe instead of using __name__ we could maintain a mapping from exception classes to a canonical "reason" string/value, or potentially something like an Enum with error codes?
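
For illustration, a sketch of the Enum variant (every name here is an assumption):

import enum

class FailureReason(enum.Enum):
    WRONG_NETWORK = 'wrong-network'
    WRONG_GENESIS = 'wrong-genesis'

# Canonical reasons survive exception renames, which type(e).__name__
# would not.
EXCEPTION_REASONS = {
    WrongNetworkFailure: FailureReason.WRONG_NETWORK,
    WrongGenesisFailure: FailureReason.WRONG_GENESIS,
}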

p2p/persistance.py (outdated, resolved)
enode = remote.uri()
row = self._fetch_node(remote)
now = current_time()
if row:
Member:

If you want, it looks like it's possible to emulate an UPSERT, according to this Stack Overflow answer.

https://stackoverflow.com/questions/418898/sqlite-upsert-not-insert-or-replace
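
For illustration, a sketch using SQLite's native UPSERT, which is available since SQLite 3.24 (older versions need the emulation from the linked answer; this assumes enode is the table's primary key):

self.db.execute(
    '''
    INSERT INTO bad_nodes (enode, until, reason, error_count)
    VALUES (?, ?, ?, 1)
    ON CONFLICT (enode) DO UPDATE SET
        until = excluded.until,
        reason = excluded.reason,
        error_count = error_count + 1
    ''',
    (enode, time_to_str(until), reason),
)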

p2p/persistance.py (outdated, resolved)
p2p/persistance.py (outdated, resolved)
tests/p2p/test_persistance.py (outdated, resolved)

def random_node():
    address = kademlia.Address('127.0.0.1', 30303)
    return kademlia.Node(generate_privkey(), address)
Member:

Just noting this inspired me to create #279

@lithp (Contributor, Author) commented Feb 13, 2019

looks nice :) this is minor but shall i suggest s/persistance/persistence/?

heh, oops, good point @ralexstokes

@lithp lithp force-pushed the lithp/persist-node-info branch 2 times, most recently from 05b18d3 to b6783a3 Compare February 14, 2019 06:27
Temporarily blacklist peers when specific errors occur. e.g. when a peer
has the wrong genesis don't attempt to reconnect to it for at least a
day. This improves the speed at which peers are found because less time
is wasted reconnecting to peers we already know don't work.

- Create SQLite-backed SQLitePeerInfo
- Create memory-backed MemoryPeerInfo
- FullNode stores peer info in a file called 'nodedb' in the data dir
@lithp (Contributor, Author) commented Feb 14, 2019

It doesn't have to be addressed in this PR but I want to give consideration to how this interacts with preferred peers and the concept of reserved connection slots. We'll eventually need to ensure this mechanism doesn't blacklist a peer in a vipnode type setup where peers have paid for reserved slots.

I believe that they could be orthogonal notions. Even if you don't go all the way and have separate PeerPools this seems like something which the PeerPool will have knowledge about. In that case, it's not a stretch to imagine the PeerPool skipping the PeerInfo check and connecting even if we've previously seen an error from the node.

The alternative might be telling PeerInfo about which nodes are preferred/reserved? It could then choose to return should_connect_to() -> True for those nodes, but that feels like mixing the concern of remembering node quality with the concern of picking which nodes to connect to. So I think my final answer is that reserved peers sit higher up in the stack than what PeerInfo cares about (historical data on peers), making this independent of this PR.

Note that this is probably best solved via a multi-peer pool setup since it's probably not sane to try to support a single peer pool having multiple sets of rules for peer connections since those rules would very likely end of conflicting with each other.

Yeah, multiple peer pools with different rules seems like the simplest change. I'm trying to imagine alternatives: maybe peers each have a PeerManager? Currently other parts of the code-base liberally call peer.disconnect(...). A PeerPool for reserved instances might deal with that by noticing disconnects and immediately reconnecting. If a PeerManager were injected into each Peer and called inside of peer.disconnect, it might contain logic like cancelling the disconnect (ReservedPeerManager) or waiting until 5 timeouts have passed before finally giving up on a peer (ETHPeerManager).

@lithp (Contributor, Author) commented Feb 14, 2019

Okay, I believe that I've addressed all the feedback. I finished my TODOs and squashed; if one of you could please give it a final look-over, I'll address anything you notice and merge!

@cburgdorf (Contributor) left a comment:

Looks good, :shipit:

@lithp lithp merged commit 31691c0 into master Feb 14, 2019
@lithp lithp deleted the lithp/persist-node-info branch February 14, 2019 21:24