[RPC] Report RPC Session Timeout to Client Instead of "kShutdown" #15187

Merged 1 commit on Jul 2, 2023

Conversation

Johnson9009
Contributor

When an RPC server runs on an NPU board, a compiled model will occasionally hang the NPU because of buggy operator libraries in the NPU toolchain, so we must use session_timeout to ensure that hung jobs release the board's resources.

Currently the RPC server handles session timeouts poorly: it just kills the server loop subprocess, and the destructor of class RPCEndpoint then sends the code kShutdown to the RPC client. The RPC client, however, expects to receive kReturn or kException, so users see an error message like the one reported in #15151. That error is confusing and gives no hint about what actually happened.

When tuning to search for a good operator schedule, we only want to ignore RPC session timeout errors, which indicate that the generated schedule is illegal. Other errors reported by the RPC server may help us find potential bugs in our toolchain built on top of TVM, so the RPC session timeout error should be split out into a standalone TVM error class.

This PR implements these requirements by sending the RPC session timeout error message to the RPC client as an RPC server exception before killing the server loop subprocess.
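
To make the before/after behavior concrete, here is a toy simulation of the reply handling described above. It is only a sketch and does not use TVM: the code names kShutdown, kReturn, and kException come from this description, but the enum values, the helper functions, and the way the error type is encoded in the message prefix are all assumptions made for illustration.

```python
from enum import Enum


class RPCCode(Enum):
    # Names mirror the codes mentioned above; the values are arbitrary here.
    kShutdown = 0
    kReturn = 1
    kException = 2


class RPCSessionTimeoutError(RuntimeError):
    """Stand-in for a dedicated timeout error class (not the TVM class)."""


def client_handle_reply(code, payload):
    """Toy client: only kReturn and kException are valid replies to a call."""
    if code is RPCCode.kReturn:
        return payload
    if code is RPCCode.kException:
        # Assumption: the error type is recoverable from the message prefix.
        if payload.startswith("RPCSessionTimeoutError"):
            raise RPCSessionTimeoutError(payload)
        raise RuntimeError(payload)
    # Any other code is a protocol surprise and yields a confusing error.
    raise RuntimeError(f"unexpected reply code {code.name}")


def server_reply_on_timeout(report_timeout):
    """Toy server: what the client receives when the session times out."""
    if report_timeout:
        # Behavior after this PR: report an exception, then kill the loop.
        return RPCCode.kException, "RPCSessionTimeoutError: session timed out"
    # Behavior before this PR: the loop is killed and teardown sends kShutdown.
    return RPCCode.kShutdown, ""
```

In this sketch, `client_handle_reply(*server_reply_on_timeout(True))` raises the dedicated timeout error, while passing `False` raises a generic error that only mentions the unexpected kShutdown code.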

@tvm-bot
Collaborator

tvm-bot commented Jun 30, 2023

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

  • No users to tag found in teams: rpc See #10317 for details

Generated by tvm-bot

@junrushao
Member

This is a nice addition! I'm curious what its implication is to the existing auto tuning system though - for example, will it affect AutoTVM's time out mechanism? CC @zxybazh

@Johnson9009
Contributor Author

> This is a nice addition! I'm curious what its implication is to the existing auto tuning system though - for example, will it affect AutoTVM's time out mechanism? CC @zxybazh

@junrushao In my opinion, this has no effect on the existing auto-tuning systems, because they all currently catch very general error types, e.g., Exception and TVMError. RPCSessionTimeoutError is a subclass of TVMError, so it will be caught too.

for future in futures:
    try:
        res = future.result()
        results.append(res)
    except Exception as ex:  # pylint: disable=broad-except
        tb = traceback.format_exc()
        results.append(

    costs.sort()
    costs = tuple(costs[1:-1])
except TVMError as exc:
    msg = str(exc)
    if "Stack trace returned" in msg:
        msg = msg[: msg.index("Stack trace returned")]
    if "CUDA Source" in msg:
        msg = msg[: msg.index("CUDA Source")]
    costs = (traceback.format_exc(), RuntimeError(msg[:1024]))
    errno = MeasureErrorNo.RUNTIME_DEVICE
tstamp = time.time()

    remote.remove("")
    dev.free_raw_stream(stream)
# pylint: disable=broad-except
except Exception:
    dev.free_raw_stream(stream)
    costs = (MAX_FLOAT,)
    error_no = MeasureErrorNo.RUNTIME_DEVICE
    error_msg = make_traceback_info()
shutil.rmtree(os.path.dirname(build_res.filename))
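
The point about subclassing can be checked with a small self-contained sketch. The two classes below are stand-ins defined locally so the example runs without TVM, and `classify` is a hypothetical helper, not part of any tuner. It shows that a handler for TVMError also catches the timeout subclass, and that code wanting to treat timeouts differently must list the narrower handler first.

```python
class TVMError(RuntimeError):
    """Local stand-in for TVM's base error so the sketch runs without TVM."""


class RPCSessionTimeoutError(TVMError):
    """Local stand-in for the timeout error class discussed in this PR."""


def classify(run):
    """Hypothetical tuning helper: skip timeouts, surface everything else."""
    try:
        return ("ok", run())
    except RPCSessionTimeoutError:
        # Must precede the broader handler, or it would never be reached.
        return ("timeout", None)
    except TVMError as exc:
        return ("error", str(exc))


def raise_timeout():
    raise RPCSessionTimeoutError("session timed out")


def raise_other():
    raise TVMError("operator library bug")


def broad_handler_catches(run):
    """Mimics the existing tuners: one broad handler for every TVMError."""
    try:
        run()
    except TVMError:
        return True
    return False
```

Here `classify(raise_timeout)` yields a `"timeout"` result while `classify(raise_other)` yields an `"error"` result, and `broad_handler_catches` returns `True` for both, which matches the claim that existing broad handlers are unaffected.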

@junrushao
Member

Yep, and that's why I am curious. Thanks for the explanation!

@junrushao junrushao merged commit 683dfb0 into apache:main Jul 2, 2023
18 checks passed
@Johnson9009 Johnson9009 deleted the rpc_timeout branch July 2, 2023 07:52
gmeeker added a commit to gmeeker/tvm that referenced this pull request Jan 6, 2024
Fix regression in (apache#15187) when multiprocessing start method is not 'fork',
which prevented tuning from working. This affects macOS and Windows.
Also in python 3.14 the default start method will be 'spawn'.
Johnson9009 pushed a commit that referenced this pull request Jan 12, 2024
* [RPC] Fix tuning on macOS and Windows (#15771)

Fix regression in (#15187) when multiprocessing start method is not 'fork',
which prevented tuning from working. This affects macOS and Windows.
Also in python 3.14 the default start method will be 'spawn'.

* [RPC] clean up _serve_loop function
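
The commit message above does not say what exactly broke under non-'fork' start methods. As general background only, one well-known difference is that 'spawn' starts a fresh interpreter and must pickle the process target and its arguments, whereas 'fork' inherits them from the parent's memory. The sketch below uses only the standard library; the helper names are hypothetical and it is not the fix from #15771.

```python
import multiprocessing
import pickle


def can_be_sent_to_spawned_child(obj):
    """Return True if obj survives pickling, as 'spawn' needs for targets and args."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


def make_local_target():
    """Return a function defined inside another function."""

    def local_target():
        return "ok"

    # Nested functions cannot be pickled by qualified name, so they work as
    # a Process target under 'fork' but not under 'spawn'.
    return local_target


# Start methods supported on this platform; 'spawn' is available everywhere.
available_methods = multiprocessing.get_all_start_methods()
```

Checking `can_be_sent_to_spawned_child(make_local_target())` returns `False`, while a builtin such as `len` passes, which is a quick way to audit what a server loop hands to its subprocess.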