
Optimize RemoteSequenceManager #106

Merged
merged 28 commits into from
Dec 1, 2022

Conversation

@justheuristic (Collaborator) commented Nov 30, 2022

By the time this text is copied into the commit message, this PR will have:

  • made RemoteSequenceManager into a background thread that pre-fetches information instead of running just in time
  • moved routing-related code to petals.client.routing
  • extracted remote peer routing information into RemoteSequenceInfo
  • made sure that the code survives continued use (e.g. one hour)
  • updated every spot where update_ is called manually
  • modified get_sequence to check that the thread is alive, and warn if not
  • removed max_retries and switched rpc_info to exponential backoff
  • fixed a bug that caused RemoteSeq* to lose user-defined hyperparameters (e.g. timeout) upon subsequencing (sequential[3:5])
  • moved the client-side points strategy to client.routing, or consciously decided not to
  • ensured that the RemoteSequenceManager thread created in get_remote_module shuts down properly when the module is destroyed
  • resolved minor affected todos (sanity check: the diff should NOT contain a todos diff)
  • modified tests to no longer use PYTHONPATH
  • worked around this error:
traceback
WARN     /home/jheuristic/Documents/exp/bloom-demo/src/petals/client/routing/sequence_manager.py:sequence_manager.py:194 Tried to call rpc_info, but caught P2PDaemonError('protocol not supported')
Traceback (most recent call last):
  File "/home/jheuristic/Documents/exp/bloom-demo/src/petals/client/routing/sequence_manager.py", line 184, in rpc_info
    outputs = RemoteExpertWorker.run_coroutine(
  File "/home/jheuristic/Documents/exp/hivemind/hivemind/moe/client/remote_expert_worker.py", line 47, in run_coroutine
    result = future.result()
  File "/home/jheuristic/anaconda3/lib/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/home/jheuristic/anaconda3/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
  File "/home/jheuristic/Documents/exp/hivemind/hivemind/moe/client/remote_expert_worker.py", line 25, in receive_tasks
    result = await cor
  File "/home/jheuristic/Documents/exp/hivemind/hivemind/p2p/servicer.py", line 95, in caller
    return await asyncio.wait_for(
  File "/home/jheuristic/anaconda3/lib/python3.8/asyncio/tasks.py", line 455, in wait_for
    return await fut
  File "/home/jheuristic/Documents/exp/hivemind/hivemind/p2p/p2p_daemon.py", line 579, in call_protobuf_handler
    return await self._call_unary_protobuf_handler(peer_id, name, input, output_protobuf_type)
  File "/home/jheuristic/Documents/exp/hivemind/hivemind/p2p/p2p_daemon.py", line 592, in _call_unary_protobuf_handler
    response = await self._client.call_unary_handler(peer_id, handle_name, serialized_input)
  File "/home/jheuristic/Documents/exp/hivemind/hivemind/p2p/p2p_daemon_bindings/p2pclient.py", line 71, in call_unary_handler
    return await self.control.call_unary_handler(peer_id, proto, data)
  File "/home/jheuristic/Documents/exp/hivemind/hivemind/p2p/p2p_daemon_bindings/control.py", line 308, in call_unary_handler
    return await self._pending_calls[call_id]
hivemind.p2p.p2p_daemon_bindings.utils.P2PDaemonError: protocol not supported

For review:

  • on expiration time in RemoteModuleInfo
  • on reaction to server faults
  • ...

@justheuristic justheuristic changed the title Optimize client-side routing (unironically) Optimize RemoteSequenceManager Dec 1, 2022
@@ -0,0 +1 @@
"""Client-side functions responsible for choosing the best server, """

Suggested change
"""Client-side functions responsible for choosing the best server, """
"""Client-side functions responsible for choosing the best server."""

start=True,
)

def trigger_update(self):
@borzunov (Collaborator) commented Dec 1, 2022

Let's rename:

  • (optional) trigger_update() -> update() (maybe with wait=True/False)
  • update_() -> _update() (it's a private method, a user can't call it because it's not thread-safe)
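The rename suggested above can be sketched as follows. This is a hypothetical, single-threaded illustration of the proposed API shape (a public, thread-safe `update(wait=...)` delegating to a private `_update()`), not the actual Petals code:

```python
import threading


class SequenceManagerSketch:
    """Illustrates the suggested naming: a public update() that users call,
    and a private _update() that is not thread-safe on its own."""

    def __init__(self):
        self._updated = threading.Event()  # set once a refresh completes
        self._lock = threading.Lock()
        self.version = 0

    def update(self, wait: bool = True):
        """Request a refresh; optionally block until it completes."""
        self._updated.clear()
        # In the real manager a background thread would run _update();
        # here we call it inline to keep the sketch single-threaded.
        self._update()
        if wait:
            self._updated.wait()

    def _update(self):
        """Synchronous refresh; callers must not invoke this concurrently."""
        with self._lock:
            self.version += 1
        self._updated.set()
```

The leading underscore signals to users that `_update()` is internal, while `wait=False` gives callers a fire-and-forget option.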

logger.debug(f"{self.__class__.__name__} is shutting down")
break

if not self.trigger.is_set() and time.perf_counter() - self.last_update_time >= self.update_period:

Suggested change
if not self.trigger.is_set() and time.perf_counter() - self.last_update_time >= self.update_period:
if not self.trigger.is_set():

Simplify

finally:
del update_manager

logger.info(f"{self.__class__.__name__} thread exited")

Suggested change
logger.info(f"{self.__class__.__name__} thread exited")
logger.debug(f"{self.__class__.__name__} thread exited")

(not useful to the end user)

def update_(self):
"""Perform an immediate and synchronous refresh, may take time"""
for attempt_no in itertools.count():
new_block_infos = petals.dht_utils.get_remote_module_infos(

Please wrap this in try-except, so retry happens in except

E.g., if the client has a network outage, get_remote_module_infos will fail but we still want to retry

self.dht, self.block_uids, expiration_time=float("inf"), frozen=False
)
with self.lock_changes:
self.sequence_info.update_(new_block_infos)

I'm worried that it is not updated with OFFLINEs
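The try-except-with-retry pattern requested above can be sketched like this. `fetch_with_retries` and its parameters are hypothetical names for illustration; the real call would wrap petals.dht_utils.get_remote_module_infos:

```python
import time


def fetch_with_retries(fetch, base_delay: float = 0.1, max_delay: float = 60.0):
    """Retry `fetch` with exponential backoff so a transient failure
    (e.g. a client-side network outage) does not kill the refresh loop.
    `fetch` is a hypothetical zero-argument callable."""
    delay = base_delay
    while True:
        try:
            return fetch()
        except Exception as e:
            print(f"fetch failed ({e!r}), retrying in {delay:.1f}s")
            time.sleep(delay)
            delay = min(delay * 2, max_delay)  # double the delay, capped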

justheuristic and others added 4 commits December 1, 2022 09:35
# Conflicts:
#	src/petals/client/remote_sequential.py
#	src/petals/client/sequence_manager.py
#	src/petals/client/sequential_autograd.py
@justheuristic justheuristic merged commit a2066a4 into main Dec 1, 2022
@justheuristic justheuristic deleted the routing_unironically branch December 1, 2022 07:25
justheuristic added a commit that referenced this pull request Dec 1, 2022
Fix an issue in span selection that was introduced in #106
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants