[RPC] Report RPC Session Timeout to Client Instead of "kShutdown" #15187

Johnson9009 · 2023-06-30T08:12:00Z

When running an RPC server on an NPU board, a compiled model will occasionally hang the NPU because of buggy operator libraries in the NPU toolchain, so we must use session_timeout to ensure that hung jobs release the board's resources.

Currently, the RPC server handles a session timeout poorly: it just kills the server loop subprocess, and the destructor of class RPCEndpoint then sends the kShutdown code to the RPC client. The RPC client, however, expects to receive kReturn or kException, so users see an error message like the one reported in #15151. That message is confusing and gives users no indication of what happened.

When tuning to search for a good operator schedule, we only want to ignore the RPC session timeout error, which indicates that the generated schedule is illegal. Other errors reported by the RPC server may help us find potential bugs in our toolchain built on top of TVM, so the RPC session timeout error should be split out into a standalone TVM error class.
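The filtering described above can be sketched as follows. This is a minimal illustration, not TVM code: the two error classes are local stand-ins that only mirror the names used in this PR, and `measure_candidates` and `run_remote` are hypothetical names introduced for the example.

```python
class TVMError(RuntimeError):
    """Local stand-in for TVM's base error class (assumption for this sketch)."""


class RPCSessionTimeoutError(TVMError):
    """Local stand-in for the standalone timeout error class proposed here."""


def measure_candidates(candidates, run_remote):
    """Run each candidate schedule, skipping only those that time out.

    `run_remote` is a hypothetical callable that runs one candidate on the
    board and returns its cost, or raises on failure.
    """
    results, skipped = {}, []
    for name in candidates:
        try:
            results[name] = run_remote(name)
        except RPCSessionTimeoutError:
            # A timeout marks the schedule as illegal: drop it and move on.
            skipped.append(name)
        # Any other TVMError propagates, so toolchain bugs stay visible.
    return results, skipped
```

Because the timeout has its own class, the `except` clause can be narrow; every other server-side error still reaches the caller.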

This PR implements these requirements by sending the RPC session timeout error message to the RPC client as an RPC server exception before killing the server loop subprocess.
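The ordering matters: report first, then stop the worker. The sketch below simulates that ordering with a thread and a callback. It is not the PR's implementation: `serve_session`, `job`, and `send` are hypothetical names, a daemon thread stands in for the server loop subprocess, and the string constants only echo the message codes named in this description.

```python
import threading

# Message codes named in the description, modelled as plain strings here.
K_RETURN = "kReturn"
K_EXCEPTION = "kException"


def serve_session(job, session_timeout, send):
    """Run `job` with a time limit and report a timeout as an exception.

    `job(stop)` is a hypothetical workload that receives a stop event;
    `send(code, payload)` is a hypothetical channel back to the client.
    """
    stop = threading.Event()
    result = {}

    def worker():
        result["value"] = job(stop)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(session_timeout)

    if thread.is_alive():
        # Tell the client what happened *before* tearing the worker down.
        send(K_EXCEPTION, f"RPCSessionTimeoutError: exceeded {session_timeout} s")
        stop.set()  # stands in for killing the server loop subprocess
        thread.join()
        return
    send(K_RETURN, result.get("value"))
```

With this ordering the client receives one of the codes it already expects, so it can raise a meaningful error instead of failing on an unexpected shutdown message.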

tvm-bot · 2023-06-30T08:12:03Z

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

No users to tag found in teams: rpc (See #10317 for details)

Generated by tvm-bot

junrushao · 2023-06-30T23:01:25Z

This is a nice addition! I'm curious what its implication is to the existing auto tuning system though - for example, will it affect AutoTVM's time out mechanism? CC @zxybazh

Johnson9009 · 2023-07-02T00:42:11Z

> This is a nice addition! I'm curious what its implication is to the existing auto tuning system though - for example, will it affect AutoTVM's time out mechanism? CC @zxybazh

@junrushao In my opinion, this has no effect on the existing auto-tuning systems, because they all currently catch very general error types such as Exception and TVMError. RPCSessionTimeoutError is a subclass of TVMError, so it will be caught too.

tvm/python/tvm/autotvm/measure/measure_methods.py

Lines 369 to 375 in e178375

    
           for future in futures: 
        
               try: 
        
                   res = future.result() 
        
                   results.append(res) 
        
               except Exception as ex:  # pylint: disable=broad-except 
        
                   tb = traceback.format_exc() 
        
                   results.append(

tvm/python/tvm/autotvm/measure/measure_methods.py

Lines 683 to 693 in e178375

    
                   costs.sort() 
        
                   costs = tuple(costs[1:-1]) 
        
           except TVMError as exc: 
        
               msg = str(exc) 
        
               if "Stack trace returned" in msg: 
        
                   msg = msg[: msg.index("Stack trace returned")] 
        
               if "CUDA Source" in msg: 
        
                   msg = msg[: msg.index("CUDA Source")] 
        
               costs = (traceback.format_exc(), RuntimeError(msg[:1024])) 
        
               errno = MeasureErrorNo.RUNTIME_DEVICE 
        
           tstamp = time.time()

tvm/python/tvm/auto_scheduler/measure.py

Lines 1150 to 1159 in e178375

    
                   remote.remove("") 
        
                   dev.free_raw_stream(stream) 
        
               # pylint: disable=broad-except 
        
               except Exception: 
        
                   dev.free_raw_stream(stream) 
        
                   costs = (MAX_FLOAT,) 
        
                   error_no = MeasureErrorNo.RUNTIME_DEVICE 
        
                   error_msg = make_traceback_info() 
        
           shutil.rmtree(os.path.dirname(build_res.filename))
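The point of the excerpts above is ordinary Python exception matching: a handler for a base class also catches its subclasses. A small self-contained check, using local stand-in classes rather than TVM's own, and a hypothetical `run_like_autotvm` helper loosely modelled on the second excerpt:

```python
class TVMError(RuntimeError):
    """Local stand-in for TVM's base error class (assumption for this sketch)."""


class RPCSessionTimeoutError(TVMError):
    """Local stand-in for the new subclass discussed in this PR."""


def run_like_autotvm(task):
    """Hypothetical helper: map any TVMError to a RUNTIME_DEVICE result."""
    try:
        return ("ok", task())
    except TVMError as exc:
        # The existing broad handler needs no change to see the new subclass.
        return ("RUNTIME_DEVICE", str(exc)[:1024])


def timed_out_task():
    raise RPCSessionTimeoutError("session timed out")
```

Calling `run_like_autotvm(timed_out_task)` takes the `except TVMError` branch, which is why the existing tuners should behave as before.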

junrushao · 2023-07-02T00:56:45Z

Yep, and that's why I am curious. Thanks for the explanation!

* [RPC] Fix tuning on macOS and Windows (#15771)

  Fix regression in (#15187) when multiprocessing start method is not 'fork', which prevented tuning from working. This affects macOS and Windows. Also in python 3.14 the default start method will be 'spawn'.

* [RPC] clean up _serve_loop function
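The commit message does not say what exactly broke under non-'fork' start methods. One common cause of such regressions, offered here only as an assumption, is that 'spawn' must pickle the process target and its arguments, while 'fork' does not. A rough way to probe that with the standard library (`is_spawn_safe` is a hypothetical helper name):

```python
import multiprocessing
import pickle


def current_start_method():
    """Return the start method multiprocessing will use by default."""
    return multiprocessing.get_start_method(allow_none=True)


def is_spawn_safe(target):
    """Rough check: can `target` be pickled, as 'spawn' requires?"""
    try:
        pickle.dumps(target)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        # Lambdas, nested functions, and many handles fail here.
        return False
```

Under 'fork' the child inherits the parent's memory, so a lambda or a nested function works as a target; under 'spawn' the same target fails before the child even starts.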

[RPC] Report RPC Session Timeout to Client Instead of "kShutdown"

5f04cfc

Johnson9009 force-pushed the rpc_timeout branch from dc7ca00 to 5f04cfc Compare June 30, 2023 11:12

junrushao approved these changes Jul 2, 2023


junrushao merged commit 683dfb0 into apache:main Jul 2, 2023
18 checks passed

Johnson9009 deleted the rpc_timeout branch July 2, 2023 07:52

ysh329 mentioned this pull request Jul 12, 2023

[Release] v0.13.0 Release Candidate Notes #15295

Closed

gmeeker mentioned this pull request Jan 6, 2024

[RPC] Fix tuning on macOS and Windows (#15771) #16357

Merged
