Add service checking direct reachability from peers #195
Just in case, the testing script was:

# initial peer
import hivemind
import reachability2 as reachability
TMP_IDENTITY_PATH = '/tmp/petals_probe.id'
dht = hivemind.DHT(identity_path=TMP_IDENTITY_PATH, host_maddrs=['/ip4/0.0.0.0/tcp/41337', '/ip6/::/tcp/41337'], start=True)
# optionally add some of conn_manager=False, use_relay=False, use_auto_relay=False, auto_nat=False, nat_port_map=False
reachability.ReachabilityProtocol.attach_to_dht(dht)
# ... depending on the eval, add extra nodes with or without client_mode=True

# client (notebook)
await reachability.check_reachability(
    initial_peers=INITIAL_PEERS, host_maddrs=X, announce_maddrs=Y)
# optionally add some of conn_manager=False, use_relay=False, use_auto_relay=False, auto_nat=False, nat_port_map=False
src/petals/server/reachability2.py
@classmethod
def attach_to_dht(cls, dht: hivemind.DHT, **kwargs):
    return dht.run_coroutine(partial(_attach_to_dht, cls=cls, **kwargs))
Note: run_coroutine should be replaced by the time this gets merged; see details in the PR header. The run_coroutine version was implemented as a hack so I could begin iterating on the main code.
[current plan] Instead, attach_to_dht should create a simple process that self-terminates if not dht.is_alive, or if terminated by the user. Not sure this is the best solution; comments welcome.
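The self-terminating helper from the current plan could look roughly like the sketch below. To stay short it uses a daemon thread instead of a separate process, and `start_watchdog`, `is_alive`, and `on_exit` are hypothetical names that are not part of hivemind or this PR; in real code `is_alive` would presumably wrap `dht.is_alive`.

```python
import threading
from typing import Callable


def start_watchdog(
    is_alive: Callable[[], bool],  # hypothetical: e.g. a wrapper around dht.is_alive
    on_exit: Callable[[], None],   # hypothetical cleanup hook
    stop_event: threading.Event,   # set by the user to terminate explicitly
    poll_interval: float = 0.05,
) -> threading.Thread:
    """Call on_exit once the parent is no longer alive or the user asks to stop."""

    def _watch() -> None:
        # Event.wait returns True as soon as stop_event is set, False on timeout.
        while is_alive() and not stop_event.wait(poll_interval):
            pass
        on_exit()

    thread = threading.Thread(target=_watch, daemon=True)
    thread.start()
    return thread


# Usage: simulate a parent that stops being alive on the fourth poll.
polls = iter([True, True, True, False])
exited = []
watchdog = start_watchdog(lambda: next(polls), lambda: exited.append(True), threading.Event())
watchdog.join(timeout=2.0)
```

A real process-based version would also need to decide how the helper learns about the parent's death if the parent is killed abruptly; the polling loop here only covers the cooperative case.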
3 hours test:
@@ -4,3 +4,6 @@
    "/dns/bootstrap2.petals.ml/tcp/31338/p2p/QmQGTqmM7NKjV6ggU1ZCap8zWiyKR89RViDXiqehSiCpY5",
    "/dns6/bootstrap2.petals.ml/tcp/31338/p2p/QmQGTqmM7NKjV6ggU1ZCap8zWiyKR89RViDXiqehSiCpY5",
]

# The reachability API is currently used only when connecting to the public swarm
REACHABILITY_API_URL = "http://health.petals.ml"
I've removed the env variable here since this check currently works only for the public swarm, and the env var gave the impression that you could enable it for a private swarm too.
We can update this logic in a future PR: e.g., make the server run the check if the swarm is public or a custom API URL is provided.
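The rule proposed for a future PR (run the check if the swarm is public or a custom API URL is provided) could be sketched as follows. This is only an illustration: `pick_reachability_api_url`, `public_initial_peers`, and `custom_api_url` are hypothetical names, and comparing peer lists is an assumed way of detecting the public swarm.

```python
from typing import Optional, Sequence

# From the diff above; currently used only for the public swarm.
REACHABILITY_API_URL = "http://health.petals.ml"


def pick_reachability_api_url(
    initial_peers: Sequence[str],
    public_initial_peers: Sequence[str],   # hypothetical: bootstrap peers of the public swarm
    custom_api_url: Optional[str] = None,  # hypothetical user-supplied option
) -> Optional[str]:
    """Return the API URL to query, or None if the check should be skipped."""
    if custom_api_url is not None:
        return custom_api_url  # an explicit URL always enables the check
    if set(initial_peers) == set(public_initial_peers):
        return REACHABILITY_API_URL  # public swarm: use the default API
    return None  # private swarm without a custom URL: skip the check


# Usage with placeholder multiaddrs (not real peers).
public = ["/dns/example-bootstrap/tcp/1/p2p/A", "/dns6/example-bootstrap/tcp/1/p2p/A"]
url_public = pick_reachability_api_url(public, public)
url_private = pick_reachability_api_url(["/ip4/10.0.0.1/tcp/1/p2p/B"], public)
url_custom = pick_reachability_api_url(["/ip4/10.0.0.1/tcp/1/p2p/B"], public, "http://localhost:5000")
```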
Hi @borzunov - I am running Ubuntu 2022 and tried to run this locally on a workstation with a 3090 card the other week. It didn't connect successfully, even though I have Transmission on the machine and it downloads torrents fine. I'm happy to run any tests to get to the bottom of this if there's interest. I'll give the updated code a whirl later on and report back if the connection problem persists.
Motivation
Servers joining from behind NATs/firewalls usually take several minutes to join a libp2p relay before they become accessible from the outside Internet. Moreover, requests to such servers are slower and more likely to fail (e.g., if the server is switching relays at that moment). If such servers host certain DHT keys, the swarm may occasionally lose read/write access to these keys, which results in:
This PR modifies servers so that DHT keys are only hosted on directly reachable servers (those that aren't behind a NAT/firewall). This way, the DHT becomes more stable and works faster. Of course, the servers behind NATs/firewalls still accept requests for running inference/forward/backward for the blocks they hold (it's more acceptable for this kind of request to be slower or to fail).
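The policy described above can be pictured with the small sketch below. `ServerPlan` and `plan_for` are hypothetical names, and the mapping onto `client_mode` assumes that a DHT node started with `client_mode=True` (the option mentioned in the testing script) does not host keys for other peers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerPlan:
    dht_client_mode: bool  # assumed to be passed as client_mode=... to the DHT
    serves_blocks: bool    # whether inference/forward/backward requests are accepted


def plan_for(directly_reachable: bool) -> ServerPlan:
    """Hypothetical helper: decide a server's role from its reachability."""
    return ServerPlan(
        # Only directly reachable servers host DHT keys.
        dht_client_mode=not directly_reachable,
        # Every server keeps serving its blocks, reachable or not.
        serves_blocks=True,
    )


behind_nat = plan_for(directly_reachable=False)
public_ip = plan_for(directly_reachable=True)
```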
Intended usage:
server.py L121
What didn't work
libp2p contexts - after more careful examination, the option that attempts a direct connection does not cover all our use cases. It will attempt to connect the way we want, but if it fails, it will still reuse existing connections.
anything that reuses a P2P node - P2P nodes have background activity that causes them to .connect to each other outside of our direct control. Based on logs, these are: libp2p's internal DHT, autonat checks, talking to relays for other peers.
creating a temporary P2P(no_listen=True) on the fly inside rpc_check: worked nicely in tests, but caused timeouts on the mainnet. Connecting to peers over the internet takes way longer than connecting on localhost (duh)
What exactly happens when you connect? doConnect->d.Host.connect->[interface]Connect->implementation.
When connecting to an already connected peer, this is a no-op
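Why the no-op matters for a reachability probe can be shown with a toy model. `ToyHost` below is a hypothetical stand-in written for this illustration and is not the libp2p or hivemind API; it only mimics the "connect is a no-op for already connected peers" behaviour described above.

```python
from typing import Callable, List, Set


class ToyHost:
    """Toy stand-in for a P2P node that caches open connections."""

    def __init__(self) -> None:
        self.connected: Set[str] = set()
        self.dial_log: List[str] = []

    def connect(self, peer_id: str, dial: Callable[[str], bool]) -> bool:
        if peer_id in self.connected:
            return True  # already connected: no-op, no new dial is attempted
        self.dial_log.append(peer_id)
        if dial(peer_id):
            self.connected.add(peer_id)
            return True
        return False


def direct_dial_fails(peer_id: str) -> bool:
    return False  # models a peer behind a NAT: direct dials never succeed


# A long-lived node that already has a connection to the peer (say, via a relay).
reused = ToyHost()
reused.connected.add("peer-behind-nat")
reused_result = reused.connect("peer-behind-nat", direct_dial_fails)

# A node without prior connections has to dial for real.
fresh = ToyHost()
fresh_result = fresh.connect("peer-behind-nat", direct_dial_fails)
```

In this model the reused node reports success without dialing at all, while the fresh node reports the failed direct dial. That is the false positive that makes reusing an existing P2P node unsuitable for the check, and it says nothing about the timeout problem seen with temporary nodes on the mainnet.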
Sanity checks:
ps aux | grep p2pd | wc -l