[RPC] Report RPC Session Timeout to Client Instead of "kShutdown" #15187
Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.
Generated by tvm-bot
This is a nice addition! I'm curious what its implications are for the existing auto-tuning system, though. For example, will it affect AutoTVM's timeout mechanism? CC @zxybazh
@junrushao In my opinion, this has no effect on the existing auto-tuning system, because the measurement code currently catches very general error types, e.g., Exception and TVMError. RPCSessionTimeoutError is a subclass of TVMError, so it will be caught too.
tvm/python/tvm/autotvm/measure/measure_methods.py Lines 369 to 375 in e178375
tvm/python/tvm/autotvm/measure/measure_methods.py Lines 683 to 693 in e178375
tvm/python/tvm/auto_scheduler/measure.py Lines 1150 to 1159 in e178375
Yep, and that's why I am curious. Thanks for the explanation!
Fix regression in (apache#15187) when the multiprocessing start method is not 'fork', which prevented tuning from working. This affects macOS and Windows. Also, in Python 3.14 the default start method on Linux changes from 'fork' to 'forkserver'.
When running an RPC server on an NPU board, a compiled model will sometimes hang the NPU because of buggy operator libraries in the NPU toolchain, so we must use session_timeout to ensure that board resources held by hung jobs are released.
Currently, the RPC server handles a session timeout poorly: it simply kills the server loop subprocess, and the destructor of class RPCEndpoint then sends the code kShutdown to the RPC client. The RPC client, however, expects to receive either kReturn or kException, so users see an error message like the one reported in #15151. That report is confusing and gives users no indication of what actually happened.
When tuning to search for a good schedule for operators, we only want to ignore the RPC session timeout error, which indicates that the generated schedule is an illegal one. Other errors reported by the RPC server may help us find potential bugs in our toolchain built on top of TVM, so the RPC session timeout error should be split out into a standalone TVM error class.
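To illustrate why a dedicated error class helps here, the following is a minimal sketch and not code from this PR. The two exception classes are local stand-ins for TVMError and RPCSessionTimeoutError so the snippet runs without TVM installed, and `measure`, `hangs`, and `crashes` are hypothetical names.

```python
# Stand-in classes (assumption: they mirror the hierarchy described above,
# where RPCSessionTimeoutError is a subclass of TVMError).
class TVMError(RuntimeError):
    """Stand-in for TVM's base error class."""


class RPCSessionTimeoutError(TVMError):
    """Stand-in for the dedicated session timeout error."""


def measure(run_candidate):
    """Hypothetical tuning step: skip timeouts, surface everything else."""
    try:
        return ("ok", run_candidate())
    except RPCSessionTimeoutError as err:
        # The candidate hung the device; treat it as an illegal schedule.
        return ("timeout", str(err))
    except TVMError as err:
        # Any other server-side error may point at a toolchain bug.
        return ("error", str(err))


def hangs():
    raise RPCSessionTimeoutError("RPC session timed out")


def crashes():
    raise TVMError("unexpected server failure")


results = [measure(lambda: 1.5), measure(hangs), measure(crashes)]
```

Because the timeout class derives from the general one, a handler that catches only the general error type still catches timeouts, while a handler ordered as above can tell the two cases apart.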
This PR implements these requirements by sending the RPC session timeout error message to the RPC client as an RPC server exception before killing the server loop subprocess.
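The ordering described above can be sketched roughly as follows. This is an illustrative model only, not the actual server implementation: the enum borrows the code names from the description with placeholder values, the helper names are hypothetical, and whether a shutdown code still follows the exception is an assumption.

```python
import enum


class RPCCode(enum.Enum):
    # Names follow the codes mentioned in the description; values are placeholders.
    SHUTDOWN = enum.auto()
    RETURN = enum.auto()
    EXCEPTION = enum.auto()


def on_session_timeout(send, timeout_sec):
    """Hypothetical server-side handler: report the timeout, then shut down."""
    message = f"RPCSessionTimeoutError: session exceeded {timeout_sec} seconds"
    send((RPCCode.EXCEPTION, message))
    # Assumption: the shutdown code is still sent afterwards.
    send((RPCCode.SHUTDOWN, None))


def client_receive(messages):
    """Hypothetical client: accepts only RETURN or EXCEPTION as a call result."""
    code, payload = messages[0]
    if code is RPCCode.RETURN:
        return payload
    if code is RPCCode.EXCEPTION:
        raise RuntimeError(payload)
    raise RuntimeError(f"unexpected code {code.name}")


sent = []
on_session_timeout(sent.append, timeout_sec=10)
```

With this ordering the client's first message is an exception it knows how to handle, so the user sees the timeout reason instead of an unexpected shutdown code.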