
Tracking issue for peer blacklisting #520

Open · pipermerriam opened this issue Apr 19, 2019 · 6 comments

pipermerriam (Member) commented Apr 19, 2019

What is wrong?

In working on #24 I have found that we first need to implement better blacklisting. Without this, we end up with a pool filled with lots of mediocre or bad peers. A loose evaluation suggests the following kinds of peers end up being problems (see the sketch after this list).

  • Malicious peers: send provably invalid or wrong responses.
  • Useless peers:
    • Predominantly incomplete responses.
    • Announced HEAD is too far behind our local HEAD.
  • Poorly connected peers: always time out.
  • Lazy peers: do not respond to queries they should be able to answer.
    • Announced HEAD block number N, doesn't respond to a query for a block/header at height <= N.
    • Announced HEAD block number N, doesn't respond to GetNodeData for a state root from a block within range(N - X, N), i.e. roughly the last 100 blocks, where X is somewhere in the range 64 <-> 128.
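
For illustration, here is a rough sketch of how these categories might be represented when tracking peer behavior; the `PeerProblem` enum and the `is_lazy_header_response` helper are hypothetical, not existing code.

```python
# Hypothetical sketch (not existing Trinity code) of the failure categories and
# the "lazy" check described above.
import enum


class PeerProblem(enum.Enum):
    MALICIOUS = "malicious"                # provably invalid or wrong responses
    USELESS = "useless"                    # incomplete responses / stale HEAD
    POORLY_CONNECTED = "poorly_connected"  # consistently times out
    LAZY = "lazy"                          # ignores queries it should be able to answer


def is_lazy_header_response(announced_head: int,
                            requested_height: int,
                            responded: bool) -> bool:
    # A peer that announced HEAD at block N but won't serve a block/header at
    # height <= N falls into the "lazy" category.
    return requested_height <= announced_head and not responded
```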

How can it be fixed?

I'm looking at doing the following.

First: a new internal plugin that runs the blacklist tracker as a service over the event bus. This is needed because we need to communicate with this service from many different parts of the codebase, and it isn't really viable to pass a reference to the tracker around to all of those code locations.
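
As a rough sketch of the shape this could take (the event types and the `BlacklistTracker` interface below are hypothetical, not the actual Trinity/lahja API):

```python
# Hypothetical sketch: the tracker plus the event types other components would
# broadcast/request over the event bus instead of holding a direct reference.
import time
from dataclasses import dataclass
from typing import Dict


@dataclass
class BlacklistPeerCommand:
    # Broadcast by any component that decides a peer misbehaved.
    remote_id: str
    duration_seconds: float
    reason: str


@dataclass
class IsPeerBlacklistedRequest:
    # Request/response used by the connection logic before dialing a peer.
    remote_id: str


class BlacklistTracker:
    def __init__(self) -> None:
        self._expirations: Dict[str, float] = {}

    def blacklist(self, remote_id: str, duration_seconds: float) -> None:
        self._expirations[remote_id] = time.monotonic() + duration_seconds

    def is_blacklisted(self, remote_id: str) -> bool:
        return self._expirations.get(remote_id, 0.0) > time.monotonic()
```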

Then, I plan to implement some version of the following rules.

Malicious peers

Immediate disconnect, and put on the blacklist for some appropriate amount of time.

Useless peers

Mark them as useless and blacklist for an appropriate amount of time. Disconnect the oldest/a random useless peer if the pool is full.

Poorly connected, Lazy

Use a token-bucket-based approach (sketched below). If the token bucket is empty, disconnect and blacklist for an appropriate amount of time.
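
A minimal token-bucket sketch; the refill rate and capacity are placeholders, not values decided in this issue.

```python
import time


class TokenBucket:
    """Each timeout or unanswered query consumes a token; an empty bucket
    means the peer should be disconnected and blacklisted."""

    def __init__(self, refill_rate: float, capacity: float) -> None:
        self.refill_rate = refill_rate   # tokens regained per second
        self.capacity = capacity         # maximum tokens (burst allowance)
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        self.tokens = min(
            self.capacity,
            self.tokens + (now - self.last_refill) * self.refill_rate,
        )
        self.last_refill = now

    def take(self, amount: float = 1.0) -> bool:
        # Returns False once the bucket is empty, i.e. the peer has exhausted
        # its allowance of failures.
        self._refill()
        if self.tokens >= amount:
            self.tokens -= amount
            return True
        return False
```

For example, TokenBucket(refill_rate=1 / 60, capacity=5) would tolerate a burst of five timeouts but require the long-run average to stay below roughly one timeout per minute.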

carver (Contributor) commented Apr 20, 2019

> Poorly connected, Lazy
>
> Use a token-bucket-based approach. If the token bucket is empty, disconnect and blacklist for an appropriate amount of time.

Here, you're measuring something like timeouts per second? So they get some leeway to have some timeouts all in a row, but their average has to stay below some watermark. Interesting. The appropriate timeout rate might be something worth adapting to the pool average: maybe you have a terrible network connection and it's your fault the timeouts are happening, so we don't want to churn our peers in that case? Just a thought...
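
One rough way the "adapt to the pool average" idea could look; the 2.0 watermark multiplier is an arbitrary placeholder.

```python
from typing import Mapping


def should_penalize(peer_timeout_rate: float,
                    pool_timeout_rates: Mapping[str, float],
                    watermark_multiplier: float = 2.0) -> bool:
    # Only penalize a peer whose timeout rate is well above the pool-wide
    # average, so a locally bad network connection (where every peer times
    # out) doesn't make us churn through the whole pool.
    if not pool_timeout_rates:
        return False
    pool_average = sum(pool_timeout_rates.values()) / len(pool_timeout_rates)
    return peer_timeout_rate > pool_average * watermark_multiplier
```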

pipermerriam (Member, Author) commented Apr 22, 2019

Yes, I think we need aggregate pool statistics for this to be meaningful, for exactly that reason. This also extends in the other direction, towards determining what a good peer is, since throughput in isolation is meaningless.

pipermerriam (Member, Author) commented Apr 26, 2019

(Note: this has less to do with the existing blacklist records and more to do with the more detailed records about peers that are coming.)

Thinking about database size, I went reaching for a mechanism that gives the following properties.

  1. Records become stale when we have not seen that node for some period (2 weeks?).
  2. Once a record is stale it may expire, resulting in all information about it being removed from the database.
  3. The likelihood of a given record being removed increases from 0% -> 100% across some time period (2 weeks -> 6 months?).
  4. The distribution of this probability is such that:
    • a value very close to the lower bound has close to zero chance of expiration
    • the probability grows super-linearly towards 100% as time goes by

I'm looking to achieve this by adding an expires_at to each of the Remote table entries, updated every time we touch the record in some way. The value will be randomly picked according to the above distribution. We can then occasionally sweep the database for expired records and remove them.
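
A sketch of picking the randomized expires_at described above; the constants are the tentative 2-week/6-month values from this comment, and the fourth-power curve is just one distribution with the stated shape.

```python
import random
import time
from typing import Optional

TWO_WEEKS = 14 * 24 * 3600
SIX_MONTHS = 6 * 30 * 24 * 3600


def pick_expires_at(now: Optional[float] = None) -> float:
    # (1 - u**4) with u uniform in [0, 1) is heavily skewed towards 1, so most
    # records are assigned an expiration near the upper bound: almost nothing
    # expires right after going stale, and the expired fraction ramps up
    # super-linearly towards the end of the window.
    if now is None:
        now = time.time()
    return now + TWO_WEEKS + SIX_MONTHS * (1 - random.random() ** 4)
```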

carver (Contributor) commented Apr 26, 2019

What's the rationale for the probabilistic expiration? No objection, just didn't follow.

It seems like a first draft that just expires everything after a month is okay; it gets something out the door that doesn't leak disk space too badly. (Probably by tracking the last-modified date instead of an expiration date in the DB, so that changing the expiration algorithm is trivial.)

pipermerriam (Member, Author) commented Apr 27, 2019

There are assumptions here so I'll measure before I run with anything.

The motivation is disk footprint: keeping the sqlite database at a reasonable size. A 6-month expiration window on its own seems like too long; a node that is online for a continuous 6 months could build up a very large database (need to measure). My thought is to handle the following two cases gracefully.

  • A long-running node that is online consistently doesn't build up too much data.
  • A node that only comes online occasionally, maybe even with some long-ish offline periods, still maintains some network data to facilitate a quick startup.

So I want to prune data from this database, but do it in a way that thins the data out over time rather than applying a hard cutoff, allowing for a long tail of historical data that can still help the node re-connect quickly.

The mechanism may sound complex, but the implementation should be as simple as a single small function like expire_seconds = TWO_WEEKS + SIX_MONTHS * (1 - random.random()**4) and an occasional task that sweeps the database for any records whose expiration has passed.
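
The sweep task could then be little more than a delete-by-timestamp; the remotes table name and expires_at column below are assumptions about the schema, for illustration only.

```python
import sqlite3
import time


def sweep_expired(conn: sqlite3.Connection) -> int:
    # Remove every record whose randomly assigned expiration has passed and
    # report how many rows were pruned.
    cursor = conn.execute(
        "DELETE FROM remotes WHERE expires_at < ?", (time.time(),))
    conn.commit()
    return cursor.rowcount
```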

I mostly wrote this down to remember it. I'm not convinced it is necessary but the mechanism seems simple enough and should behave in a manner that satisfies the two edge cases above nicely.

carver (Contributor) commented Apr 29, 2019

> So I want to prune data from this database, but do it in a way that thins the data out over time rather than applying a hard cutoff, allowing for a long tail of historical data that can still help the node re-connect quickly.

Ah, so maybe the goal is to thin out the old nodes, but only thin them out if we have enough nodes in the DB to connect to. Maybe as simple as a rule to only thin out if there are more than 1000 entries, or something like it.

> The mechanism may sound complex, but the implementation should be as simple as a single small function like expire_seconds = TWO_WEEKS + SIX_MONTHS * (1 - random.random()**4) and an occasional task that sweeps the database for any records whose expiration has passed.

My concern isn't that the code is difficult; it's about making it easier to answer the question "why am I not trying to reconnect to this node anymore?"

👍
