-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement DHT Census #117
Implement DHT Census #117
Conversation
This is ready for review |
Looks great. I've given it a run and found one issue. The second round (15mins/900s later) does not begin if the first round is not The first round is done if glados/glados-cartographer/src/lib.rs Line 147 in edc93c4
After 20 mins, the following is observed:
The census is not done because: 30 + 46 = 76 != 335
# (err + fin != known) Thus, many known nodes that are neither finished nor pending (that is, they are I note a few dozen:
I tried adding these as an errored node glados/glados-cartographer/src/lib.rs Line 445 in edc93c4
census.add_errored(NodeId(enr.node_id().raw())).await; # new Thought this was insufficient to resolve the issue. One thought was that the second round commencement is gated by the semaphore (default n=1), which Will think more on it. |
glados-cartographer/src/lib.rs
Outdated
); | ||
|
||
let found_enrs = client | ||
// Initialize our search with a random-ish set of ENRs | ||
let initial_enrs = client | ||
.recursive_find_nodes(EthPortalNodeId(target.raw())) | ||
.await | ||
.unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Encountered this unwrap in the wild at 1241 seconds (20 mins). Logs:
[INFO glados_cartographer] Census complete known=420 alive=172 finished=163 errored=257 pending=0 duration=1241 rps=0
...
[DEBUG glados_cartographer] Waiting for channels to exit
...
[INFO glados_cartographer] Census finished
Err:
panicked at 'called `Result::unwrap()` on an `Err` value: RequestTimeout'
edc93c4
to
6291c42
Compare
@perama-v thanks for the evaluation. Highlighted that I need to start applying the error handling idioms I've been picking up more consistently. I've gone through that code with another pass to put in explicit handling of error cases. Wondering if you would be willing to give the |
ick, still hanging sometimes and I don't know why, no urgency in reviewing this at the moment. |
Nice, looks like you fixed it. The problem was an outdated |
Some numbers:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks great 👍
Functionality wise, some issues that you may already be aware of:
-
Sometimes when enumerating a node's routing table I see a bunch of either of these errors:
Error fetching routing table info enr.node_id=0xcbb4f4743bb0ee289ed6689f4d9046a8e7b49524c1a903c3703b800ed99d7bca distance=245 msg=ParseError(Error("invalid length 0, expected struct FindNodesInfo with 2 elements", line: 1, column: 2))
Error fetching routing table info enr.node_id=0xcbb4f4743bb0ee289ed6689f4d9046a8e7b49524c1a903c3703b800ed99d7bca distance=249 msg=ParseError(Error("invalid type: string \"enr:-JS4QN3B1sBIvKL_9TGv_N8rch3bDQm437dXIQYLXvn1tvrhXTjv8wZaGgIhI9SKAebZ7-HEWizPAnjA-23ELhOeyoUBY450IDAuMS4wLTdlNmNlYYJpZIJ2NIJpcISygPbLiXNlY3AyNTZrMaECZREuqcOT9g2wl96nmOZcqMhMn-XthZyEFHsUpS2yTR-DdWRwgiMo\", expected u8", line: 1, column: 207))
-
I'm seeing much smaller census numbers than you & perama:
Census complete known=13 alive=10 finished=10 errored=3 pending=0 duration=53 rps=0
This is running against a trin node whoseportal_historyRoutingTableInfo
endpoint reports as having 71 nodes in its routing table. I know thatcartographer
isn't just getting the entire routing table to start with (should it?) but I'm not sure why my numbers are so much lower on a node that knows about so many other nodes.
glados-cartographer/src/cli.rs
Outdated
const DEFAULT_CENSUS_INTERVAL: &str = "900"; | ||
|
||
// Number of concurrent requests that can be in progress towards the connected portal client. | ||
const DEFAULT_CONCURRENCY: &str = "1"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason that this isn't at 4 like in glados-audit
?
Number 1 I encountered, which was the impetus for the bump for from ethportal-api 1.6 to 2.0-alpha |
It seems sensitive to concurrency:
Which feels like:
|
I'm pretty sure that this will deadlock at any concurrency level if you let it run long enough.... |
Fair. I just killed it after about 15 successes and haven't looked too closely for a cause |
5c37ade
to
25c64ad
Compare
Implement database models for census make permit dropping more reliable improved error handling for census taking d log available permits d amidoingitright?
beb7caf
to
47695dc
Compare
It looks like the deadlock here was happening when the I fixed this by creating two semaphores, one for the The RPC client was also defaulting to 60 second timeouts, which makes sense for a RFN call, but not for PINGs and FINDNODES. The timeouts for those was lowered to 2 seconds, which speeds things up. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't do a full "review" but I browsed what you pushed yesterday. 🚀 nice job finding that deadlock. Makes perfect sense now that I see it. Feel free to if you're inclined to just move it forward, otherwise, I can get you a real review later today.
2981723
to
74d88d0
Compare
Replaces #109
This implements full DHT census taking. The current algorithm is simplistic, effective, and crude.
data_radius
valueThis is then saved into the database, with the census itself having simple fields for time bounds,
started_at
andduration
and each node records (enr_fk
,timestamp
,data_radius
).No web views have been added for this model yet.