
Add service checking direct reachability from peers #195

Merged · 23 commits into main on Jan 12, 2023

Conversation

@justheuristic (Collaborator) commented on Jan 11, 2023

Motivation

Servers joining from behind NATs/firewalls usually take several minutes to join a libp2p relay before they become accessible from the outside Internet. Moreover, requests to such servers are slower and more likely to fail (e.g., if the server is switching relays at that moment). If such servers host certain DHT keys, the swarm may occasionally lose read/write access to these keys, which results in:

  • Clients being unable to find any servers hosting a certain block.
  • All servers starting rebalancing to the same place to close the alleged "gap" in the swarm.

This PR modifies servers so that DHT keys are only hosted on directly reachable servers (the ones that aren't behind a NAT/firewall). This way, the DHT becomes more stable and works faster. Of course, the servers behind NATs/firewalls still accept requests for running inference/forward/backward on the blocks they hold (it's more acceptable for these kinds of requests to be slower or to fail).

Intended usage:

server.py L121

if client_mode is None:
    reachable = asyncio.run(check_reachability(
        initial_peers=initial_peers, use_relay=False, use_auto_relay=False, **host_and_announce_maddrs
    ))
    client_mode = reachable is False  # fall back to client mode only if we are definitely unreachable
dht = DHT(**everything_as_usual, client_mode=client_mode)
if not client_mode:
    ReachabilityProtocol.attach_to_dht(dht)
  • make check_reachability into a regular non-asyncio function (see the sketch right after this list)
  • switch attach_to_dht from run_coroutine to a background process. Why:
    • the current run_coroutine creates a background asyncio task; there is no guarantee it will keep running throughout the DHT's lifespan
    • injecting background tasks can overload the DHT and make it less responsive
  • integrate into petals/server/server.py (L121)
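
For reference, here is a minimal sketch of what the non-asyncio wrapper could look like; the import path and the keyword handling are assumptions based on the intended-usage snippet above, not the final implementation:

import asyncio
from typing import Optional, Sequence

from petals.server.reachability import check_reachability  # assumed module path

def check_reachability_sync(initial_peers: Sequence[str], **kwargs) -> Optional[bool]:
    """Blocking wrapper: True/False if reachability could be determined, None otherwise."""
    if not initial_peers:
        return None  # nothing to probe against; mirrors the "initial_peers=[] returns None" sanity check below
    return asyncio.run(check_reachability(initial_peers=initial_peers, **kwargs))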

What didn't work

  • libp2p contexts - after more careful examination, the option that attempts a direct connection does not cover all our use cases. It will attempt to connect the way we want, but if that fails, it will still reuse existing connections.

  • anything that reuses a P2P node - P2P nodes have background activity that causes them to .connect to each other outside of our direct control. Based on logs, this includes libp2p's internal DHT, AutoNAT checks, and talking to relays for other peers.

    • I was unable to convince a normal P2P node not to spontaneously connect after creation
    • this means that both my early code and the health.petals code would sometimes produce incorrect results
    • specifically, if my unreachable laptop connects to this dude, there is about a 50% chance that the next request to api/v1/is_reachable will say that my laptop is reachable, because, as a matter of fact, it is reachable by the test node specifically
    • if a connection already exists, it does nothing
  • creating a temporary P2P(no_listen=True) on the fly inside rpc_check: worked nicely in tests but caused timeouts on the mainnet. Connecting to peers over the internet takes much longer than connecting on localhost (duh)

What exactly happens when you connect? doConnect->d.Host.connect->[interface]Connect->implementation.
When connecting to an already connected peer, this is a no-op.

Sanity checks:

  • check_reachability with initial_peers=[] returns None
  • a local node on my PC is not reachable from nora
  • a local node on my PC is reachable from another local node
  • a local node on my PC is not reachable if there is one local node and 4 nodes on nora and beleriand
  • a local node on nora is reachable if there are 4 nodes on nora and one misconfigured local laptop node that believes itself to be reachable
  • a local node on nora is reachable on another nora-like computer
  • terranova (ipv6-only) is not reachable from nora
  • a node on nora with no_listen=True is not reachable
  • after check_reachability, the number of p2pd processes does not change (ps aux | grep p2pd | wc -l; see the sketch right after this list)
  • a node with misconfigured announce_maddrs would succeed or fail depending on the specific misconfigurations
    • to the best of my knowledge, if a misconfigured node registers as reachable, it is indeed reachable
    • BUT there are some cases where a misconfigured node is still reachable, yet registers as not reachable
    • one example: ipv6-only node with ipv4-only announce maddrs
  • test petals.run_server startup
  • public swarm with 2-3 servers running new code
  • public swarm, but you are the only server with new code
  • keep it running for 3h
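
To automate the p2pd count check from the list above, a small helper along these lines could be used (check_reachability_sync refers to the earlier sketch; pgrep -c is a stand-in for ps aux | grep p2pd | wc -l, and INITIAL_PEERS is as in the testing script below):

import subprocess

def count_p2pd_processes() -> int:
    # pgrep -c prints the number of matching processes (0 if none) and does not count itself
    result = subprocess.run(["pgrep", "-c", "p2pd"], capture_output=True, text=True)
    return int(result.stdout.strip() or 0)

before = count_p2pd_processes()
check_reachability_sync(initial_peers=INITIAL_PEERS)
after = count_p2pd_processes()
assert after == before, "check_reachability must not leave stray p2pd daemons behind"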

@justheuristic changed the title from "A script to test reachability" to "A script to test reachability from peers" on Jan 11, 2023
@justheuristic (Collaborator, Author) commented on Jan 11, 2023

Just in case, the testing script was

# initial peer
import hivemind
import reachability2 as reachability
TMP_IDENTITY_PATH = '/tmp/petals_probe.id'
dht = hivemind.DHT(identity_path=TMP_IDENTITY_PATH, host_maddrs=['/ip4/0.0.0.0/tcp/41337', '/ip6/::/tcp/41337'], start=True)
# optionally add some of conn_manager=False, use_relay=False, use_auto_relay=False, auto_nat=False, nat_port_map=False
reachability.ReachabilityProtocol.attach_to_dht(dht)
# ... depending on the eval, add extra nodes with or without client_mode=True

# client (notebook)
# X and Y below are placeholder maddrs for the configuration under test
await reachability.check_reachability(
    initial_peers=INITIAL_PEERS, host_maddrs=X, announce_maddrs=Y)
# optionally add some of conn_manager=False, use_relay=False, use_auto_relay=False, auto_nat=False, nat_port_map=False


@classmethod
def attach_to_dht(cls, dht: hivemind.DHT, **kwargs):
return dht.run_coroutine(partial(_attach_to_dht, cls=cls, **kwargs))
@justheuristic (Collaborator, Author) commented:

Note: run_coroutine should be changed by the time this gets merged. See details in the PR header. The run_coroutine version was implemented as a hack so I could begin iterating on the main code.

[Current plan] Instead, attach_to_dht should create a simple process that self-terminates if the DHT is no longer alive, or if terminated by the user. Not sure this is the best solution; comments welcome. A rough sketch is shown below.
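
A purely illustrative sketch of that plan: the serving logic is abstracted as a picklable serve_fn callable, and the child watches its parent process instead of calling dht.is_alive() directly (multiprocessing only lets the parent poll a child's liveness):

import multiprocessing as mp
import os
import time

def _reachability_worker(parent_pid: int, serve_fn, poll_interval: float = 5.0):
    serve_fn()  # hypothetical: starts answering rpc_check requests in a background thread/event loop
    while os.getppid() == parent_pid:
        time.sleep(poll_interval)
    # the parent process (and its DHT) is gone: return, letting the process terminate itself

def attach_as_background_process(serve_fn) -> mp.Process:
    process = mp.Process(target=_reachability_worker, args=(os.getpid(), serve_fn), daemon=True)
    process.start()  # daemon=True also kills the child if the parent exits abruptly
    return process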

@justheuristic (Collaborator, Author) commented:

3-hour test:

  • the .serve with run_coroutine would sometimes fail after an hour or so, killed with GeneratorExit
  • the client code still works if there are some alive nodes

@borzunov changed the title from "A script to test reachability from peers" to "Add service checking direct reachability from peers" on Jan 12, 2023
@@ -4,3 +4,6 @@
"/dns/bootstrap2.petals.ml/tcp/31338/p2p/QmQGTqmM7NKjV6ggU1ZCap8zWiyKR89RViDXiqehSiCpY5",
"/dns6/bootstrap2.petals.ml/tcp/31338/p2p/QmQGTqmM7NKjV6ggU1ZCap8zWiyKR89RViDXiqehSiCpY5",
]

# The reachability API is currently used only when connecting to the public swarm
REACHABILITY_API_URL = "http://health.petals.ml"
Collaborator commented:

I've removed the env variable here since this check currently works only for the public swarm, and the env var was giving the impression that you could enable it for the private swarm too.

We can update the logic for this in a future PR: e.g., make the server run the check if the swarm is public or a custom API URL is provided.

@borzunov force-pushed the check_reachability branch 2 times, most recently from b58c900 to b789a6b on January 12, 2023 13:20
@borzunov merged commit 771ca59 into main on Jan 12, 2023
@borzunov deleted the check_reachability branch on January 12, 2023 23:05
@johndpope commented:

Hi @borzunov - I am running Ubuntu 2022 and attempted to run this locally on a workstation with a 3090 card the other week, but it didn't connect successfully, even though I have Transmission on the machine and it works fine downloading torrents. I'm happy to run any tests to get to the bottom of this if there's interest. I'll give this updated code a whirl later on and report back if the connection problem persists.
