This repository has been archived by the owner on Jul 1, 2021. It is now read-only.

Persist Peer Information between runs #270

Merged
lithp merged 2 commits into master from lithp/persist-node-info on Feb 14, 2019

Conversation

@lithp (Contributor) commented Feb 12, 2019

A cleaned-up version of #261

What was wrong?

Trinity wastes time trying to connect to nodes which it already knows won't work (they're on the wrong network or using the wrong genesis).

How was it fixed?

Trinity now keeps an SQLite database with information on peers. It records failures and doesn't attempt to reconnect to useless peers until a timeout has passed.

This PR doesn't do things like remember good nodes or exponential backoff, but it hopefully lays a foundation which will make those changes easy in the future.
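
For illustration, a minimal sketch of the approach (the bad_nodes table and its columns match the queries quoted later in this thread; everything else here is illustrative rather than the PR's actual code):

import sqlite3
from datetime import datetime

# Hypothetical sketch: record failures, then skip nodes until a timeout passes.
db = sqlite3.connect('nodedb')
db.execute('''
    CREATE TABLE IF NOT EXISTS bad_nodes (
        enode TEXT PRIMARY KEY,
        until TEXT,            -- retry is allowed after this timestamp
        reason TEXT,
        error_count INTEGER
    )
''')

def should_connect_to(enode: str) -> bool:
    row = db.execute('SELECT until FROM bad_nodes WHERE enode = ?', (enode,)).fetchone()
    if row is None:
        return True  # no recorded failure for this node
    return datetime.utcnow() >= datetime.fromisoformat(row[0])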

Todo:

  • aiosqlite? -- decided not to do this because the library seems rather immature, and I don't expect sqlite will block the thread for very long, even though it's ugly that it blocks the thread
  • When logging, use localtime? Either that or explicitly mention that the logs use UTC
  • Add an entry to release notes

Cute Animal Picture

[image: cute baby elephants]

@lithp lithp force-pushed the lithp/persist-node-info branch 7 times, most recently from 954fdc7 to ebf132c Compare February 12, 2019 07:17
@lithp lithp changed the title [DNM] Persist Peer Information between runs [WIP] Persist Peer Information between runs Feb 12, 2019
@lithp lithp changed the title [WIP] Persist Peer Information between runs Persist Peer Information between runs Feb 12, 2019
@lithp lithp requested a review from cburgdorf February 12, 2019 22:25
p2p/peer_pool.py Outdated
@@ -79,11 +83,16 @@ def __init__(self,
privkey: datatypes.PrivateKey,
context: BasePeerContext,
max_peers: int = DEFAULT_MAX_PEERS,
peer_info: BasePeerInfoPersistance = None,
@cburgdorf (Contributor) commented Feb 12, 2019:

Wouldn't it make sense to put NoopPeerInfoPersistance() as a default instead of None, and then remove lines 92-93?

@lithp (Contributor, Author):

Yeah, good point. My instinct is to avoid passing objects as parameter defaults, but NoopPeerInfoPersistance has no state, so it's safe here.

Member:

How about just not having a default at this level and taking care of it at the level above this where it is instantiated.

@lithp (Contributor, Author):

How about just not having a default at this level and taking care of it at the level above this where it is instantiated.

My idea was that BasePeerPool would always have a peer_info without needing to put null checks everywhere. This also makes the interface a little simpler: if the caller doesn't care about persistence it can implicitly not care, rather than explicitly passing in a Noop object.
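
For illustration, a sketch of the null-object idea under discussion (record_failure and should_connect_to appear elsewhere in this PR; the exact signatures here are assumptions):

class NoopPeerInfoPersistance(BasePeerInfoPersistance):
    """Stateless do-nothing implementation, so callers that don't care
    about persistence never force null checks into BasePeerPool."""

    def record_failure(self, remote: Node, timeout: int, reason: str) -> None:
        pass  # deliberately forget the failure

    def should_connect_to(self, remote: Node) -> bool:
        return True  # never veto a connection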

p2p/persistance.py (outdated, resolved)
    return True


class ClosedException(Exception):
Contributor:

I think this should derive from BaseP2PError and probably even move into p2p/exceptions.py

@lithp (Contributor, Author):

Huh, I'm willing to do it, but why do you think that error is more appropriate? This exception doesn't have anything to do with peers; it signals an invalid state which should never be reached. I thought about making it an assert False but I think this stack trace is a little easier to read.

Member:

If seeing this is truly something a user should never encounter then downgrading to Exception is acceptable. If it's helpful to have a custom exception but we don't want to expose it to users, then maybe add that explanation to the docstring. If a user might encounter this in some situation (which it sounds like they shouldn't) then it should be in p2p.exceptions and inherit from the base exception.

Do as you see fit 😸

@ralexstokes (Member) commented:
looks nice :) this is minor but shall i suggest s/persistance/persistence/?

T = TypeVar('T', bound=Callable[..., Any])


def must_be_open(func: T) -> T:
Contributor:

This looks like we are throwing away valuable type information here (T has a bound of Callable[..., Any]).

Did you check whether mypy still thinks can_connect_to will return bool after this decorator is applied? You can test it by putting reveal_type(can_connect_to) in and running mypy.

Member:

https://mypy.readthedocs.io/en/latest/generics.html#declaring-decorators

According to the docs preserving function signatures is done as follows:

FuncType = Callable[..., Any]
F = TypeVar('F', bound=FuncType)

# A decorator that preserves the signature.
def my_decorator(func: F) -> F:
    def wrapper(*args, **kwds):
        ...
    return cast(F, wrapper)

And the current code looks roughly equivalent to this.

@lithp (Contributor, Author) commented Feb 13, 2019:

Did you check whether mypy still thinks can_connect_to will return bool after this decorator is applied?

I didn't, but this decorator is called on more methods than can_connect_to, and I wanted it to work no matter what the signature of the wrapped method is. You're right that it's not very pretty though 😢 I found a ticket which tracks mypy support for this: python/mypy#3157
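
For illustration, a sketch of the decorator with this typing pattern (the closed attribute and the guard logic are assumptions; only the names must_be_open and ClosedException come from this PR):

from typing import Any, Callable, TypeVar, cast

T = TypeVar('T', bound=Callable[..., Any])

class ClosedException(Exception):
    pass  # stands in for the exception defined in p2p/persistance.py

def must_be_open(func: T) -> T:
    # Refuse to run the wrapped method once the instance is closed.
    def wrapper(self: Any, *args: Any, **kwargs: Any) -> Any:
        if self.closed:
            raise ClosedException(f'{func.__name__} called after close()')
        return func(self, *args, **kwargs)
    return cast(T, wrapper)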


def _fetch_node(self, remote: Node) -> sqlite3.Row:
    enode = remote.uri()
    cursor = self.db.execute('SELECT * from bad_nodes WHERE enode = ?', (enode,))
Contributor:

I haven't worked with SQL for ages and have never done so in Python. I think @pipermerriam already commented on SQL injection in the previous PR, so I assume it has been double-checked that this is the right way to do it and that we are on the safe side?

Member:

Yes, using the ? in the query string and passing the values as positional arguments does keep us safe from SQL injection.
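
For contrast, a hypothetical unsafe version next to the safe placeholder style used here:

# UNSAFE (hypothetical): string interpolation lets a crafted enode value
# rewrite the query.
db.execute(f"SELECT * from bad_nodes WHERE enode = '{enode}'")

# SAFE: sqlite3 binds the value separately from the SQL text, so it can
# never be parsed as SQL.
db.execute('SELECT * from bad_nodes WHERE enode = ?', (enode,))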

@lithp (Contributor, Author):

It makes me sad that this is the correct way, but it is indeed the correct way. Named params are nicer, but this sqlite3 library doesn't support them.

@lithp (Contributor, Author) commented Feb 13, 2019:

Actually, this is undocumented, but after looking at the sqlite3 library, params are passed directly to SQLite, which means named parameters are possible.

That means the code could look like this:

self.db.execute(
    '''
    INSERT INTO bad_nodes (enode, until, reason, error_count)
    VALUES (:enode, :until, :reason, :error_count)
    ''',
    {
        'enode': enode,
        'until': time_to_str(until),
        'reason': reason,
        'error_count': error_count,
    }
)

But that might be a level of boilerplate which the queries aren't yet big enough to justify.

p2p/persistance.py (outdated, resolved)

class MemoryPeerInfoPersistance(SQLPeerInfoPersistance):
    def __init__(self, logger: ExtendedDebugLogger) -> None:
        super().__init__(Path(":memory:"), logger)
Contributor:

Sweet ❤️

Member:

Yeah, in-memory SQLite is pretty sexy. I recall loving this for testing Django apps.
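
For illustration, a sketch of the kind of test fixture this enables (the fixture name and the plain logging.getLogger are assumptions):

import logging

import pytest

@pytest.fixture
def peer_info():
    # ':memory:' gives each test a fresh, isolated database and leaves
    # nothing on disk to clean up afterwards.
    return MemoryPeerInfoPersistance(logging.getLogger('testing'))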



class BasePeerInfoPersistance(abc.ABC):
    def __init__(self, logger: ExtendedDebugLogger) -> None:
Contributor:

Rather than monkeypatching current_time in the tests, I would personally prefer __init__ to take a parameter like time_provider: Callable[[], datetime.datetime] = datetime.datetime.utcnow. That gives a sensible default value yet is overridable for testing without reaching for monkeypatch.

Member:

I'm mildly 👎 on this. Time is one of the few things I'm OK with mocking, as I haven't come across a better mechanism. I would advocate removing the current_time function in favor of just using datetime.datetime.utcnow(), and then, in our tests, using the monkeypatch fixture to swap that function out for one that returns the desired value.
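
For illustration, a sketch of that monkeypatch approach (the test body and fixtures are assumptions; record_failure and should_connect_to are the methods discussed elsewhere in this PR):

import datetime

def test_blacklist_expires(monkeypatch, peer_info, remote):
    peer_info.record_failure(remote, timeout=60, reason='SomeFailure')
    assert not peer_info.should_connect_to(remote)

    # Swap the module's notion of "now" for a point past the timeout.
    later = datetime.datetime.utcnow() + datetime.timedelta(days=2)
    monkeypatch.setattr('p2p.persistance.current_time', lambda: later)
    assert peer_info.should_connect_to(remote)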

Contributor:

Fine either way 👍

@lithp (Contributor, Author):

My idea on this is that there are two big categories of references: the "immutable" world of static references (immutable globals such as + or, in this file, sqlite3), and the mutable world of objects, which programs typically concern themselves with.

Dependency injection works well for the latter: there are legitimately different implementations of BasePeerInfoPersistance floating around, and almost all classes won't care exactly which one they use, so by taking responsibility for constructing BasePeerInfoPersistance away from those classes we can make the code more flexible and much easier to test.

I think DI makes less sense for the former. Imports are fixed at the beginning of runtime; there aren't going to be multiple implementations of datetime.datetime.utcnow() floating around, so you lose no flexibility by explicitly referencing it. And if you squint, Python is already doing DI! This module is asking for an implementation of datetime.datetime and Python is providing one. test_persistence isn't monkeypatching so much as changing which implementation of utcnow() this module receives.

Contributor:

test_persistence isn't monkeypatching so much as changing which implementation of utcnow() this module receives.

Yes, which is why I lean toward: hey, if we do encounter a case where we want to swap the implementation of utcnow() for something else, e.g. a controllable fake utcnow(), then DI is my tool of choice.

And for a class where time plays an important role I wouldn't see it as uncommon to just depend on a time_provider: Callable[[], int]. But anyway, this can become quite philosophical, and I just happen to have built DI-heavy systems for many years, so I'm biased in that direction.

It's actually refreshing to look at other solutions, so I'm not opposed to the existing code 👍

@@ -23,6 +24,9 @@ def __init__(self, event_bus: Endpoint, trinity_config: TrinityConfig) -> None:
        self._node_port = trinity_config.port
        self._max_peers = trinity_config.max_peers

        app_config = trinity_config.get_app_config(Eth1AppConfig)
Contributor:

Just noting that, without further checking, this would crash if the Node class gets reused for the beacon chain, which I don't think will happen (as described in #240).

@pipermerriam (Member) left a comment:

Some idle comments, not a full review.

p2p/persistance.py (outdated, resolved)
p2p/persistance.py (outdated, resolved)


class WrongNetworkFailure(HandshakeFailure):
    timeout = 60 * 60 * 24  # one day
Member:

Maybe instead of storing these timeouts on the exceptions we should keep this information separate, in a mapping from Exception -> timeout.

@lithp (Contributor, Author) commented Feb 13, 2019:

I was unsure about this. On the one hand it doesn't seem to be BasePeerInfoPersistence's responsibility, but on the other hand you're right, this isn't really a property of the failure itself.

I could make a map in p2p/persistence.py; I didn't know if that was adding too much boilerplate to this PR.

@lithp (Contributor, Author) commented Feb 14, 2019:

Okay, I've moved them into a map; that's probably a big enough change to be worth looking at again: 129b8d6
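
For illustration, a sketch of such a map (the WrongNetworkFailure duration comes from the snippet above; WrongGenesisFailure and the helper are assumptions):

import datetime

ONE_DAY = datetime.timedelta(days=1)

# Blacklist durations live next to the persistence code instead of on
# the exception classes themselves.
FAILURE_TIMEOUTS = {
    WrongNetworkFailure: ONE_DAY,
    WrongGenesisFailure: ONE_DAY,
}

def timeout_for_failure(failure: HandshakeFailure) -> datetime.timedelta:
    return FAILURE_TIMEOUTS[type(failure)]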

@pipermerriam (Member) left a comment:

It doesn't have to be addressed in this PR, but I want to give consideration to how this interacts with preferred peers and the concept of reserved connection slots. We'll eventually need to ensure this mechanism doesn't blacklist a peer in a vipnode-type setup where peers have paid for reserved slots.

Note that this is probably best solved via a multi-peer-pool setup, since it's probably not sane to try to support a single peer pool having multiple sets of rules for peer connections; those rules would very likely end up conflicting with each other.

p2p/persistance.py (outdated, resolved)
p2p/peer_pool.py Outdated
@@ -79,11 +83,16 @@ def __init__(self,
privkey: datatypes.PrivateKey,
context: BasePeerContext,
max_peers: int = DEFAULT_MAX_PEERS,
peer_info: BasePeerInfoPersistance = None,
Member:

How about just not having a default at this level and taking care of it at the level above this where it is instantiated.

p2p/peer_pool.py (outdated, resolved)
p2p/peer_pool.py Outdated
except HandshakeFailure as e:
    self.logger.debug("Could not complete handshake with %r: %s", remote, repr(e))
    self.peer_info.record_failure(
        remote, timeout=e.timeout, reason=type(e).__name__
Member:

It'd be nice to have a fixed set of values for the reason string. Maybe instead of using __name__ we could maintain a mapping from exception classes to a canonical "reason" string/value, or potentially something like an Enum with error codes?
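
For illustration, a sketch of the Enum variant (every name here is an assumption):

import enum

class FailureReason(enum.Enum):
    WRONG_NETWORK = 'wrong-network'
    WRONG_GENESIS = 'wrong-genesis'

# Canonical reasons survive exception renames, which type(e).__name__
# would not.
EXCEPTION_REASONS = {
    WrongNetworkFailure: FailureReason.WRONG_NETWORK,
    WrongGenesisFailure: FailureReason.WRONG_GENESIS,
}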

p2p/persistance.py (outdated, resolved)
enode = remote.uri()
row = self._fetch_node(remote)
now = current_time()
if row:
Member:

If you want, it looks like it's possible to emulate an UPSERT, according to this Stack Overflow answer.

https://stackoverflow.com/questions/418898/sqlite-upsert-not-insert-or-replace
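
For illustration, a sketch using SQLite's native UPSERT, which is available since SQLite 3.24 (older versions need the emulation from the linked answer; this assumes enode is the table's primary key):

self.db.execute(
    '''
    INSERT INTO bad_nodes (enode, until, reason, error_count)
    VALUES (?, ?, ?, 1)
    ON CONFLICT (enode) DO UPDATE SET
        until = excluded.until,
        reason = excluded.reason,
        error_count = error_count + 1
    ''',
    (enode, time_to_str(until), reason),
)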

p2p/persistance.py (outdated, resolved)
p2p/persistance.py (outdated, resolved)
tests/p2p/test_persistance.py (outdated, resolved)

def random_node():
    address = kademlia.Address('127.0.0.1', 30303)
    return kademlia.Node(generate_privkey(), address)
Member:

Just noting this inspired me to create #279

@lithp (Contributor, Author) commented Feb 13, 2019

looks nice :) this is minor but shall i suggest s/persistance/persistence/?

heh, oops, good point @ralexstokes

@lithp lithp force-pushed the lithp/persist-node-info branch 2 times, most recently from 05b18d3 to b6783a3 Compare February 14, 2019 06:27
Temporarily blacklist peers when specific errors occur. e.g. when a peer
has the wrong genesis don't attempt to reconnect to it for at least a
day. This improves the speed at which peers are found because less time
is wasted reconnecting to peers we already know don't work.

- Create SQLite-backed SQLitePeerInfo
- Create memory-backed MemoryPeerInfo
- FullNode stores peer info in a file called 'nodedb' in the data dir
@lithp (Contributor, Author) commented Feb 14, 2019

It doesn't have to be addressed in this PR but I want to give consideration to how this interacts with preferred peers and the concept of reserved connection slots. We'll eventually need to ensure this mechanism doesn't blacklist a peer in a vipnode type setup where peers have paid for reserved slots.

I believe that they could be orthogonal notions. Even if you don't go all the way and have separate PeerPools this seems like something which the PeerPool will have knowledge about. In that case, it's not a stretch to imagine the PeerPool skipping the PeerInfo check and connecting even if we've previously seen an error from the node.

The alternative might be telling PeerInfo about which nodes are preferred/reserved? It could then choose to return should_connect_to() -> True for those nodes, but that feels like mixing the concern of remembering node quality with the concern of picking which nodes to connect to. So I think my final answer is that reserved peers sit higher up in the stack than what PeerInfo cares about (historical data on peers), making this independent of this PR.

Note that this is probably best solved via a multi-peer pool setup since it's probably not sane to try to support a single peer pool having multiple sets of rules for peer connections since those rules would very likely end of conflicting with each other.

Yeah, multiple peer pools with different rules seems like the simplest change. I'm trying to imagine alternatives: maybe peers each have a PeerManager? Currently other parts of the code-base liberally call peer.disconnect(...). A PeerPool for reserved instances might deal with that by noticing disconnects and immediately reconnecting. If a PeerManager were injected into each Peer and called inside of peer.disconnect, it might contain logic like cancelling the disconnect (ReservedPeerManager) or waiting until 5 timeouts have passed before finally giving up on a peer (ETHPeerManager).

@lithp (Contributor, Author) commented Feb 14, 2019

Okay, I believe that I've addressed all the feedback. I finished my TODOs and squashed; if one of you could please give it a final look-over, I'll address anything you notice and merge!

@cburgdorf (Contributor) left a comment:

Looks good, :shipit:

@lithp lithp merged commit 31691c0 into master Feb 14, 2019
@lithp lithp deleted the lithp/persist-node-info branch February 14, 2019 21:24