Implement DHT Census #117

Merged


pipermerriam
Member

Replaces #109

This implements full DHT census taking. The current algorithm is simplistic, effective, and crude.

  • perform an RFN (recursive find nodes) on a random location in the network to gather a small set of "known nodes"
  • for each node, check liveness via ping and record its data_radius value
  • for each node, enumerate its routing table with many FIND_NODES requests on buckets 245-256
  • repeat until all nodes have either been verified or timed out.
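The steps above can be sketched as a simple worklist loop. This is a minimal, synchronous illustration and not the code in this PR: the `Network` trait, the `u32` node IDs, `StaticNetwork`, and `demo_network` are all hypothetical stand-ins for the real portal client and ENR types, and the real implementation runs these requests concurrently.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Hypothetical node identifier; the real code works with ENRs and 256-bit IDs.
type NodeId = u32;

/// Hypothetical stand-in for the portal client's RPC surface.
trait Network {
    /// Seed lookup (the RFN step).
    fn recursive_find_nodes(&self, target: NodeId) -> Vec<NodeId>;
    /// Liveness check; returns the node's data radius, or None on timeout.
    fn ping(&self, node: NodeId) -> Option<u64>;
    /// Peers in `node`'s routing table at the given log-distance bucket.
    fn find_nodes(&self, node: NodeId, distance: u16) -> Vec<NodeId>;
}

#[derive(Debug, Default)]
struct CensusResult {
    /// Responsive nodes and the data radius each one reported.
    alive: HashMap<NodeId, u64>,
    /// Nodes that did not answer the ping.
    errored: HashSet<NodeId>,
}

fn take_census<N: Network>(net: &N, target: NodeId) -> CensusResult {
    let mut result = CensusResult::default();
    let mut known: HashSet<NodeId> = HashSet::new();
    let mut pending: VecDeque<NodeId> = VecDeque::new();

    // Step 1: seed the set of known nodes.
    for node in net.recursive_find_nodes(target) {
        if known.insert(node) {
            pending.push_back(node);
        }
    }

    // Steps 2-4: ping, enumerate buckets 245-256, repeat until nothing is pending.
    while let Some(node) = pending.pop_front() {
        match net.ping(node) {
            None => {
                result.errored.insert(node);
            }
            Some(radius) => {
                result.alive.insert(node, radius);
                for distance in 245..=256u16 {
                    for peer in net.find_nodes(node, distance) {
                        if known.insert(peer) {
                            pending.push_back(peer);
                        }
                    }
                }
            }
        }
    }
    result
}

/// Toy in-memory network used only to exercise the sketch.
struct StaticNetwork {
    seeds: Vec<NodeId>,
    /// node -> (data radius if alive, peers reported in bucket 256)
    nodes: HashMap<NodeId, (Option<u64>, Vec<NodeId>)>,
}

impl Network for StaticNetwork {
    fn recursive_find_nodes(&self, _target: NodeId) -> Vec<NodeId> {
        self.seeds.clone()
    }
    fn ping(&self, node: NodeId) -> Option<u64> {
        self.nodes.get(&node).and_then(|entry| entry.0)
    }
    fn find_nodes(&self, node: NodeId, distance: u16) -> Vec<NodeId> {
        if distance != 256 {
            return Vec::new();
        }
        self.nodes.get(&node).map(|entry| entry.1.clone()).unwrap_or_default()
    }
}

/// Nodes 1 and 2 respond; node 3 is known but silent; node 4 is unknown to the map.
fn demo_network() -> StaticNetwork {
    let mut nodes = HashMap::new();
    nodes.insert(1, (Some(10), vec![2, 3]));
    nodes.insert(2, (Some(20), vec![1, 4]));
    nodes.insert(3, (None, vec![]));
    StaticNetwork { seeds: vec![1], nodes }
}
```

Calling `take_census(&demo_network(), 0)` walks outward from the single seed and classifies every node it hears about as either alive or errored.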

The census result is then saved into the database. The census itself has simple fields for its time bounds (started_at and duration), and each node record stores (enr_fk, timestamp, data_radius).
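As a rough picture of that shape, the two records might look like the plain structs below. Only the field names `started_at`, `duration`, `enr_fk`, `timestamp`, and `data_radius` come from the description; the struct names, the integer types, the `census_id` link, and the helper methods are assumptions made for illustration, not the actual database entities.

```rust
/// One census run. Times are assumed to be Unix seconds.
#[derive(Debug, Clone, PartialEq)]
struct Census {
    id: i32,
    started_at: u64,
    /// Length of the run in seconds.
    duration: u32,
}

/// One node observed during a census.
#[derive(Debug, Clone, PartialEq)]
struct CensusNode {
    /// Assumed link back to the owning census.
    census_id: i32,
    /// Foreign key to the stored ENR.
    enr_fk: i32,
    /// When the node was surveyed (Unix seconds).
    timestamp: u64,
    /// 256-bit radius, kept here as raw big-endian bytes.
    data_radius: [u8; 32],
}

impl Census {
    /// Upper time bound, derived from the two stored fields.
    fn ended_at(&self) -> u64 {
        self.started_at + u64::from(self.duration)
    }

    /// True if the node record belongs to this census and falls inside its time bounds.
    fn covers(&self, node: &CensusNode) -> bool {
        node.census_id == self.id
            && node.timestamp >= self.started_at
            && node.timestamp <= self.ended_at()
    }
}
```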

No web views have been added for this model yet.

@pipermerriam
Member Author

This is ready for review

@perama-v
Contributor

Looks great. I've given it a run and found one issue.

The second round (15 mins/900 s later) does not begin if the first round is not .done().

The first round is done if

errored + finished == known

After 20 mins, the following is observed:

[INFO  glados_cartographer] Census progress known=335 alive=47 finished=46 errored=30 pending=259 elapsed=1079 rps=0

The census is not done because:

30 + 46 = 76 != 335 
# (err + fin != known)

Thus, many known nodes are neither finished nor errored (that is, they are pending).
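The arithmetic can be checked with a small sketch. `CensusProgress` and its methods are hypothetical names that mirror the counters in the log line; this is not the struct used in glados_cartographer.

```rust
/// Counters mirroring the fields of the "Census progress" log line.
#[derive(Debug, Clone, Copy)]
struct CensusProgress {
    known: usize,
    finished: usize,
    errored: usize,
}

impl CensusProgress {
    /// Nodes that were discovered but are neither finished nor errored.
    fn pending(&self) -> usize {
        self.known.saturating_sub(self.finished + self.errored)
    }

    /// The round is done once every known node is finished or errored.
    fn done(&self) -> bool {
        self.finished + self.errored == self.known
    }
}
```

With the values from the log above (`known=335`, `finished=46`, `errored=30`), `pending()` gives 259, which matches the logged `pending=259`, and `done()` is false.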

I note a few dozen:

Error fetching routing table info enr.node_id=<id> msg=Call(Custom(ErrorObject { code: ServerError(-32000), message: "FindNodes request timeout: Timeout", data: None }))

I tried adding these as an errored node

debug!(enr.node_id=?H256::from(enr.node_id().raw()), msg=?msg, "Error fetching routing table info");

census.add_errored(NodeId(enr.node_id().raw())).await; // new

Though this was insufficient to resolve the issue.

One thought was that the second round commencement is gated by the semaphore (default n=1), which is held by the first round. But increasing concurrency did not result in additional rounds being started while the first was in progress.

Will think more on it.

);

let found_enrs = client
// Initialize our search with a random-ish set of ENRs
let initial_enrs = client
.recursive_find_nodes(EthPortalNodeId(target.raw()))
.await
.unwrap();
Contributor

Encountered this unwrap in the wild at 1241 seconds (20 mins). Logs:

[INFO  glados_cartographer] Census complete known=420 alive=172 finished=163 errored=257 pending=0 duration=1241 rps=0
...
[DEBUG glados_cartographer] Waiting for channels to exit
...
[INFO  glados_cartographer] Census finished

Err:

panicked at 'called `Result::unwrap()` on an `Err` value: RequestTimeout'
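One way to avoid this panic is to propagate the error from the seed lookup and let the caller skip the round. The sketch below is illustrative only: `RpcError`, `run_round`, and `run_rounds` are hypothetical, and the lookup is passed in as a closure so the example does not depend on the real client API.

```rust
/// Hypothetical error type standing in for the client's RPC errors.
#[derive(Debug, PartialEq)]
enum RpcError {
    RequestTimeout,
}

/// One census round: a failed seed lookup is propagated with `?` instead of unwrapped.
fn run_round<F>(lookup: F) -> Result<usize, RpcError>
where
    F: FnOnce() -> Result<Vec<String>, RpcError>,
{
    let initial_enrs = lookup()?;
    // The real round would go on to ping and enumerate these nodes.
    Ok(initial_enrs.len())
}

/// Runs several rounds; a failed round is logged and skipped rather than panicking.
/// Returns (completed, skipped).
fn run_rounds(lookups: Vec<Result<Vec<String>, RpcError>>) -> (usize, usize) {
    let (mut completed, mut skipped) = (0, 0);
    for outcome in lookups {
        match run_round(|| outcome) {
            Ok(_) => completed += 1,
            Err(err) => {
                eprintln!("census round skipped: {err:?}");
                skipped += 1;
            }
        }
    }
    (completed, skipped)
}
```

A timeout then costs one round rather than the whole process, and the next scheduled census can still start.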

@pipermerriam pipermerriam force-pushed the piper/implement-census-database-models branch from edc93c4 to 6291c42 Compare May 10, 2023 17:49
@pipermerriam
Member Author

@perama-v thanks for the evaluation. It highlighted that I need to start applying the error handling idioms I've been picking up more consistently. I've gone through that code with another pass to put in explicit handling of error cases. Would you be willing to give glados-cartographer/src/lib.rs one more pass and let me know if you see any patterns that I should change or avoid? I'm currently running it for a bit here to see if it handles errors more reliably without fully failing or stalling.

@pipermerriam
Member Author

ick, still hanging sometimes and I don't know why, no urgency in reviewing this at the moment.

@perama-v
Contributor

Nice, looks like you fixed it. The problem was an outdated ethportal-api, fixed with this:

pipermerriam#2

@perama-v
Contributor

Some numbers: trin observed with ~30 concurrent RPC API connections. Audit finishes after ~1min with alive/errored as follows:

Census complete known=433 alive=173 finished=173 errored=260 pending=0 duration=64 rps=9

Collaborator

@mrferris mrferris left a comment

Code looks great 👍

Functionality-wise, some issues that you may already be aware of:

  1. Sometimes when enumerating a node's routing table I see a bunch of either of these errors:
    Error fetching routing table info enr.node_id=0xcbb4f4743bb0ee289ed6689f4d9046a8e7b49524c1a903c3703b800ed99d7bca distance=245 msg=ParseError(Error("invalid length 0, expected struct FindNodesInfo with 2 elements", line: 1, column: 2))
    Error fetching routing table info enr.node_id=0xcbb4f4743bb0ee289ed6689f4d9046a8e7b49524c1a903c3703b800ed99d7bca distance=249 msg=ParseError(Error("invalid type: string \"enr:-JS4QN3B1sBIvKL_9TGv_N8rch3bDQm437dXIQYLXvn1tvrhXTjv8wZaGgIhI9SKAebZ7-HEWizPAnjA-23ELhOeyoUBY450IDAuMS4wLTdlNmNlYYJpZIJ2NIJpcISygPbLiXNlY3AyNTZrMaECZREuqcOT9g2wl96nmOZcqMhMn-XthZyEFHsUpS2yTR-DdWRwgiMo\", expected u8", line: 1, column: 207))

  2. I'm seeing much smaller census numbers than you & perama:
    Census complete known=13 alive=10 finished=10 errored=3 pending=0 duration=53 rps=0
    This is running against a trin node whose portal_historyRoutingTableInfo endpoint reports as having 71 nodes in its routing table. I know that cartographer isn't just getting the entire routing table to start with (should it?) but I'm not sure why my numbers are so much lower on a node that knows about so many other nodes.

const DEFAULT_CENSUS_INTERVAL: &str = "900";

// Number of concurrent requests that can be in progress towards the connected portal client.
const DEFAULT_CONCURRENCY: &str = "1";
Collaborator

Any reason that this isn't at 4 like in glados-audit?

@perama-v
Contributor

Number 1 I encountered, which was the impetus for the bump from ethportal-api 1.6 to 2.0-alpha

@perama-v
Contributor

It seems sensitive to concurrency:

  • concurrency=1: Hangs, with pending=1, never starts second census
  • concurrency=4: Hangs, with pending=266, never starts second census
  • concurrency=30: Finishes, starts next census

Which feels like:

  • 1: Too low to add new nodes, too low to finish list of pending nodes.
  • 4: Enough to investigate new nodes, too low to finish list of pending nodes.
  • 30: Enough to investigate new nodes, enough to finish list of pending nodes.

@pipermerriam
Member Author

I'm pretty sure that this will deadlock at any concurrency level if you let it run long enough....

@perama-v
Contributor

Fair. I just killed it after about 15 successes and haven't looked too closely for a cause

pipermerriam and others added 2 commits August 31, 2023 15:13
Implement database models for census

make permit dropping more reliable

improved error handling for census taking

d

log available permits

d

amidoingitright?
@mrferris mrferris force-pushed the piper/implement-census-database-models branch from beb7caf to 47695dc Compare September 11, 2023 19:04
@mrferris
Collaborator

It looks like the deadlock here was happening when the to_ping buffer gets full: the enumeration tasks would block while waiting for entries to be taken off the to_ping buffer, never freeing up their permits to allow the liveliness tasks to take anything off of the to_ping buffer.

I fixed this by creating two semaphores, one for the enumeration tasks and one for the liveliness tasks. That way they can never block each other. Other solution proposals welcome.
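The idea can be illustrated with standard-library threads. This is a sketch under stated assumptions, not the async code in the PR: `Semaphore` is a hand-rolled counting semaphore, `run_census` and its parameters are hypothetical, and the bounded `sync_channel` plays the role of the to_ping buffer.

```rust
use std::sync::mpsc::sync_channel;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Minimal counting semaphore built on std primitives.
struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Arc<Self> {
        Arc::new(Self { permits: Mutex::new(permits), available: Condvar::new() })
    }

    /// Blocks until a permit is free, runs `work`, then returns the permit.
    fn with_permit<T>(&self, work: impl FnOnce() -> T) -> T {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
        drop(permits);
        let out = work();
        *self.permits.lock().unwrap() += 1;
        self.available.notify_one();
        out
    }
}

/// Runs enumeration and liveness work with separate permits; returns nodes pinged.
fn run_census(enumerators: usize, nodes_per_enumerator: usize, buffer: usize) -> usize {
    // Two independent semaphores, so neither side can starve the other of permits.
    let enumerate_permits = Semaphore::new(1);
    let liveness_permits = Semaphore::new(1);
    let (to_ping_tx, to_ping_rx) = sync_channel::<usize>(buffer);

    let producers: Vec<_> = (0..enumerators)
        .map(|id| {
            let permits = Arc::clone(&enumerate_permits);
            let tx = to_ping_tx.clone();
            thread::spawn(move || {
                permits.with_permit(|| {
                    for n in 0..nodes_per_enumerator {
                        // Blocks while the buffer is full, still holding the
                        // enumeration permit. That is safe only because the
                        // liveness side never needs this permit.
                        tx.send(id * nodes_per_enumerator + n).unwrap();
                    }
                })
            })
        })
        .collect();
    drop(to_ping_tx); // the channel closes once every enumerator has finished

    let consumer = thread::spawn(move || {
        let mut pinged = 0;
        while let Ok(_node) = to_ping_rx.recv() {
            liveness_permits.with_permit(|| pinged += 1);
        }
        pinged
    });

    for producer in producers {
        producer.join().unwrap();
    }
    consumer.join().unwrap()
}
```

If both sides shared one single-permit semaphore here, an enumerator could block on a full buffer while holding the only permit, and the liveness thread would wait for that permit forever. With separate semaphores, `run_census(4, 50, 2)` drains all 200 entries even though the buffer holds only two at a time.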

The RPC client was also defaulting to 60 second timeouts, which makes sense for an RFN call, but not for PINGs and FINDNODES. The timeouts for those were lowered to 2 seconds, which speeds things up.
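A per-request timeout policy along these lines could be written as a small lookup. The `RequestKind` enum and `request_timeout` function are hypothetical names for illustration; only the 60 second and 2 second values come from the comment above.

```rust
use std::time::Duration;

/// Hypothetical classification of the requests the census makes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RequestKind {
    RecursiveFindNodes,
    Ping,
    FindNodes,
}

/// Long timeout for the multi-hop RFN lookup, short timeout for single-hop requests.
fn request_timeout(kind: RequestKind) -> Duration {
    match kind {
        RequestKind::RecursiveFindNodes => Duration::from_secs(60),
        RequestKind::Ping | RequestKind::FindNodes => Duration::from_secs(2),
    }
}
```

Keeping the policy in one function makes it harder for a new request type to silently inherit the 60 second default.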

Member Author

@pipermerriam pipermerriam left a comment

I didn't do a full "review" but I browsed what you pushed yesterday. 🚀 nice job finding that deadlock. Makes perfect sense now that I see it. Feel free to :shipit: if you're inclined to just move it forward, otherwise, I can get you a real review later today.

@mrferris mrferris force-pushed the piper/implement-census-database-models branch from 2981723 to 74d88d0 Compare September 14, 2023 20:11
@mrferris mrferris merged commit d586e22 into ethereum:master Sep 15, 2023
2 checks passed