
Add service checking direct reachability from peers #195

Merged · 23 commits into main on Jan 12, 2023

Conversation

@justheuristic (Collaborator) commented on Jan 11, 2023

Motivation

Servers joining from behind NATs/firewalls usually take several minutes to join a libp2p relay before they become accessible from the outside Internet. Moreover, requests to such servers are slower and more likely to fail (e.g., if the server is switching relays at that moment). If such servers host certain DHT keys, the swarm may occasionally lose read/write access to these keys, which results in:

  • Clients being unable to find any servers hosting a certain block.
  • All servers starting rebalancing to the same place to close the alleged "gap" in the swarm.

This PR modifies servers so that DHT keys are only hosted on directly reachable servers (the ones that aren't behind a NAT/firewall). This way, the DHT becomes more stable and works faster. Of course, the servers behind NATs/firewalls still accept requests for running inference/forward/backward on the blocks they hold (it's more acceptable for these kinds of requests to be slower or to fail).

Intended usage:

server.py L121

if client_mode is None:
    reachable = asyncio.run(check_reachability(
        initial_peers=initial_peers, use_relay=False, use_auto_relay=False, **host_and_announce_maddrs
    ))
    client_mode = reachable is False  # fall back to client mode only if we are definitely unreachable
dht = DHT(**everything_as_usual, client_mode=client_mode)
if not client_mode:
    ReachabilityProtocol.attach_to_dht(dht)
  • make check_reachability into a regular non-asyncio function (see the sketch right after this list)
  • switch attach_to_dht from run_coroutine to a background process. Why:
    • the current run_coroutine creates a background asyncio task; there is no guarantee it will keep running throughout the DHT's lifespan
    • injecting background tasks can overload the DHT and make it less responsive
  • integrate into petals/server/server.py (L121)
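
For reference, here is a minimal sketch of what the non-asyncio wrapper could look like; the import path and the keyword handling are assumptions based on the intended-usage snippet above, not the final implementation:

import asyncio
from typing import Optional, Sequence

from petals.server.reachability import check_reachability  # assumed module path

def check_reachability_sync(initial_peers: Sequence[str], **kwargs) -> Optional[bool]:
    """Blocking wrapper: True/False if reachability could be determined, None otherwise."""
    if not initial_peers:
        return None  # nothing to probe against; mirrors the "initial_peers=[] returns None" sanity check below
    return asyncio.run(check_reachability(initial_peers=initial_peers, **kwargs))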

What didn't work

  • libp2p contexts - after more careful examination, the option that attempts a direct connection does not cover all our use cases. It will attempt to connect the way we want, but if that fails, it will still reuse existing connections.

  • anything that reuses a P2P node - P2P nodes have background activity that causes them to .connect to each other outside of our direct control. Based on logs, this includes libp2p's internal DHT, AutoNAT checks, and talking to relays for other peers.

    • I was unable to convince a normal P2P node not to spontaneously connect after creation
    • this means that both my early code and the health.petals code would sometimes produce incorrect results
    • specifically, if my unreachable laptop connects to this dude, there is about a 50% chance that the next request to api/v1/is_reachable will say that my laptop is reachable, because, as a matter of fact, it is reachable by the test node specifically
    • if a connection already exists, it does nothing
  • creating a temporary P2P(no_listen=True) on the fly inside rpc_check: worked nicely in tests but caused timeouts on the mainnet. Connecting to peers over the internet takes much longer than connecting on localhost (duh)

What exactly happens when you connect? doConnect->d.Host.connect->[interface]Connect->implementation.
When connecting to an already connected peer, this is a no-op.

Sanity checks:

  • check_reachability with initial_peers=[] returns None
  • a local node on my PC is not reachable from nora
  • a local node on my PC is reachable from another local node
  • a local node on my PC is not reachable if there is one local node and 4 nodes on nora and beleriand
  • a local node on nora is reachable if there are 4 nodes on nora and one misconfigured local laptop node that believes itself to be reachable
  • a local node on nora is reachable on another nora-like computer
  • terranova (ipv6-only) is not reachable from nora
  • a node on nora with no_listen=True is not reachable
  • after check_reachability, the number of p2pd processes does not change (ps aux | grep p2pd | wc -l; see the sketch right after this list)
  • a node with misconfigured announce_maddrs would succeed or fail depending on the specific misconfigurations
    • to the best of my knowledge, if a misconfigured node registers as reachable, it is indeed reachable
    • BUT there are some cases where a misconfigured node is still reachable, yet registers as not reachable
    • one example: ipv6-only node with ipv4-only announce maddrs
  • test petals.run_server startup
  • public swarm with 2-3 servers running new code
  • public swarm, but you are the only server with new code
  • keep it running for 3h
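
To automate the p2pd count check from the list above, a small helper along these lines could be used (check_reachability_sync refers to the earlier sketch; pgrep -c is a stand-in for ps aux | grep p2pd | wc -l, and INITIAL_PEERS is as in the testing script below):

import subprocess

def count_p2pd_processes() -> int:
    # pgrep -c prints the number of matching processes (0 if none) and does not count itself
    result = subprocess.run(["pgrep", "-c", "p2pd"], capture_output=True, text=True)
    return int(result.stdout.strip() or 0)

before = count_p2pd_processes()
check_reachability_sync(initial_peers=INITIAL_PEERS)
after = count_p2pd_processes()
assert after == before, "check_reachability must not leave stray p2pd daemons behind"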

@justheuristic changed the title from "A script to test reachability" to "A script to test reachability from peers" on Jan 11, 2023
@justheuristic (Collaborator, Author) commented on Jan 11, 2023

Just in case, the testing script was

# initial peer
import hivemind
import reachability2 as reachability
TMP_IDENTITY_PATH = '/tmp/petals_probe.id'
dht = hivemind.DHT(identity_path=TMP_IDENTITY_PATH, host_maddrs=['/ip4/0.0.0.0/tcp/41337', '/ip6/::/tcp/41337'], start=True)
# optionally add some of conn_manager=False, use_relay=False, use_auto_relay=False, auto_nat=False, nat_port_map=False
reachability.ReachabilityProtocol.attach_to_dht(dht)
# ... depending on the eval, add extra nodes with or without client_mode=True

# client (notebook)
# X and Y below are placeholder maddrs for the configuration under test
await reachability.check_reachability(
    initial_peers=INITIAL_PEERS, host_maddrs=X, announce_maddrs=Y)
# optionally add some of conn_manager=False, use_relay=False, use_auto_relay=False, auto_nat=False, nat_port_map=False


@classmethod
def attach_to_dht(cls, dht: hivemind.DHT, **kwargs):
return dht.run_coroutine(partial(_attach_to_dht, cls=cls, **kwargs))
@justheuristic (Collaborator, Author) commented:

Note: run_coroutine should be changed by the time this gets merged. See details in the PR header. The run_coroutine version was implemented as a hack so I could begin iterating on the main code.

[Current plan] Instead, attach_to_dht should create a simple process that self-terminates if the DHT is no longer alive, or if terminated by the user. Not sure this is the best solution; comments welcome. A rough sketch is shown below.
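
A purely illustrative sketch of that plan: the serving logic is abstracted as a picklable serve_fn callable, and the child watches its parent process instead of calling dht.is_alive() directly (multiprocessing only lets the parent poll a child's liveness):

import multiprocessing as mp
import os
import time

def _reachability_worker(parent_pid: int, serve_fn, poll_interval: float = 5.0):
    serve_fn()  # hypothetical: starts answering rpc_check requests in a background thread/event loop
    while os.getppid() == parent_pid:
        time.sleep(poll_interval)
    # the parent process (and its DHT) is gone: return, letting the process terminate itself

def attach_as_background_process(serve_fn) -> mp.Process:
    process = mp.Process(target=_reachability_worker, args=(os.getpid(), serve_fn), daemon=True)
    process.start()  # daemon=True also kills the child if the parent exits abruptly
    return process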

@justheuristic (Collaborator, Author) commented:

3-hour test:

  • the .serve with run_coroutine would sometimes fail after an hour or so, killed with GeneratorExit
  • the client code still works if there are some alive nodes

@borzunov changed the title from "A script to test reachability from peers" to "Add service checking direct reachability from peers" on Jan 12, 2023
@@ -4,3 +4,6 @@
"/dns/bootstrap2.petals.ml/tcp/31338/p2p/QmQGTqmM7NKjV6ggU1ZCap8zWiyKR89RViDXiqehSiCpY5",
"/dns6/bootstrap2.petals.ml/tcp/31338/p2p/QmQGTqmM7NKjV6ggU1ZCap8zWiyKR89RViDXiqehSiCpY5",
]

# The reachability API is currently used only when connecting to the public swarm
REACHABILITY_API_URL = "http://health.petals.ml"
Collaborator commented:

I've removed the env variable here since this check currently works only for the public swarm, and the env var was giving the impression that you could enable it for the private swarm too.

We can update the logic for this in a future PR: e.g., make the server run the check if the swarm is public or a custom API URL is provided.

@borzunov force-pushed the check_reachability branch 2 times, most recently from b58c900 to b789a6b on January 12, 2023 13:20
@borzunov merged commit 771ca59 into main on Jan 12, 2023
@borzunov deleted the check_reachability branch on January 12, 2023 23:05
@johndpope commented:

Hi @borzunov - I am running Ubuntu 2022 and attempted to run this locally on a workstation with a 3090 card the other week, but it didn't connect successfully, even though I have Transmission on the machine and it works fine downloading torrents. I'm happy to run any tests to get to the bottom of this if there's interest. I'll give this updated code a whirl later on and report back if the connection problem persists.
