This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Exception in threads kills entire process #7335

Closed
mccollum-amzn opened this issue Aug 4, 2017 · 4 comments · Fixed by #9681

Comments

@mccollum-amzn


Environment info

Operating System: MacOS

Compiler: Clang

Package used (Python/R/Scala/Julia): Python

MXNet commit hash (git rev-parse HEAD): 3a48185

Error Message:

I am looking at some code that adds new operators to MXNet. In a few edge cases, this code uses the CHECK macros to assert certain properties. When a CHECK fails, it throws an exception using the LOG_FATAL macro. This exception makes its way up to ExecuteOprBlock() in the ThreadedEngine class. From here, it is logged using LOG_ERROR. This causes the exception to be printed out to the console (twice actually, due to a simple MXNet bug) and then another exception is thrown out of the thread’s run handler. Following the C++ spec, this second exception causes terminate() to be called on the entire process, exiting MXNet. This has a few side-effects I’d like some feedback on.

First, the caught exception is only ever logged to the console. Anyone using Jupyter will never see any errors unless they have access to the console that launched the kernel. If you are using a hosted notebook solution, where you don’t see the console, the process will exit and no information at all will be passed back to the user. This is a pretty awful user experience.

Second, the environment itself exits. If you were a few days into training a model when the problem occurred, all your work will be lost. You’ll see a stack trace, but you’ll be forced to start everything over again.

Third, this means that MXNet behaves very differently between the NaiveEngine and the regular threaded engine. In Naïve mode, the exception is printed inside the interpreter and your environment is retained. In Threaded mode, the exception is only logged to the console, the interpreter exits, and you lose all your work.

Are these behaviours we want to keep? It seems to me that the proper thing would be for the ThreadedEngine to catch the exception and pass it back to the main thread where it could be treated the same as exceptions in the main thread.

We could use C++11 exception_ptr (http://en.cppreference.com/w/cpp/error/exception_ptr) to pass the exceptions back to the main thread. One problem is that we need to ensure related operations in other threads are also terminated. Something like:

  1. Print out related operands to a log file.
  2. Pass the exception to the main thread for processing and display.
  3. Kill the operators depending on the operator that threw the exception.
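The hand-off in step 2 can be sketched with std::exception_ptr. This is only a minimal illustration under stated assumptions, not MXNet code: FailingOperator, RunOnWorker and RethrowOnMain are hypothetical names, and a plain std::thread stands in for an engine worker.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical stand-in for an operator whose CHECK fails.
inline void FailingOperator() {
  throw std::runtime_error("Exception thrown here");
}

// Runs `op` on a worker thread and returns whatever exception it threw,
// or a null exception_ptr on success. Nothing escapes the thread function,
// so std::terminate() is not triggered.
template <typename Fn>
std::exception_ptr RunOnWorker(Fn op) {
  std::exception_ptr captured;
  std::thread worker([&captured, op]() mutable {
    try {
      op();
    } catch (...) {
      captured = std::current_exception();  // keep the exception alive
    }
  });
  worker.join();  // join() synchronizes, so reading `captured` is safe here
  return captured;
}

// Called on the main thread: rethrows the captured exception and returns
// its message. A real frontend would translate it into a Python error.
inline std::string RethrowOnMain(const std::exception_ptr& e) {
  if (!e) return "";
  try {
    std::rethrow_exception(e);
  } catch (const std::exception& ex) {
    return ex.what();
  }
}
```

The worker catches everything, so the process survives, and the main thread decides how to report the failure. Step 3 (dealing with dependent operators) is not covered by this sketch.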

Minimum reproducible example

Modify any operator that executes inside a thread to include something like:

CHECK(0) << "Exception thrown here";
@bhavinthaker
Contributor

Comments from Mu Li:

Hi Cliff,

Thank you for your summary. We have thought about fixing this before. One solution is to use C++11 exception_ptr (http://en.cppreference.com/w/cpp/error/exception_ptr) to pass all exceptions to the main thread, so that we can catch all of them at the Python frontend.

This feature is not on our team's roadmap right now; it would be great if you could work on it.

Thanks
Mu

Comments from Junyuan Xie:

One problem with this is that once an operator fails inside the engine, it will never complete, and all subsequent operations that depend on it will hang.

It’s unclear how one should recover from this.
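One possible way around the hang, sketched here purely as an assumption and not as the engine's actual design, is to record the failure on the output variable and let dependent operations skip their work and forward the error until a synchronization point rethrows it. Var, Op, Execute and WaitToRead below are hypothetical, single-threaded stand-ins.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for an engine variable.
struct Var {
  std::exception_ptr error;  // set if the op producing this var failed
};

// Hypothetical stand-in for a scheduled operation.
struct Op {
  std::function<void()> fn;
  std::vector<std::shared_ptr<Var>> inputs;
  std::shared_ptr<Var> output;
};

// Executes one op and always "completes" it, so nothing downstream hangs.
// Returns true if fn was actually run.
inline bool Execute(const Op& op) {
  for (const auto& in : op.inputs) {
    if (in->error) {
      op.output->error = in->error;  // forward the upstream failure
      return false;                  // skip the work
    }
  }
  try {
    op.fn();
  } catch (...) {
    op.output->error = std::current_exception();
  }
  return true;
}

// Synchronization point on the main thread: surfaces any stored failure.
inline void WaitToRead(const Var& v) {
  if (v.error) std::rethrow_exception(v.error);
}
```

With this shape, a failed operator still marks its output as finished, and the original exception reaches the caller when a dependent result is read.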

Comments from Junru Shao:

Would it be a good idea to:

  1. Print out related operands to some log file.
  2. Pass the exception to the main thread.
  3. Kill the operators depending on the operator that throws the exception?

Thanks,
Junru

@szha
Member

szha commented Nov 4, 2017

This issue is closed due to lack of activity in the last 90 days. Feel free to ping me to reopen if this is still an active issue. Thanks!
Also, please do check out our forum (and Chinese version) for general "how-to" questions.

@eric-haibin-lin
Member

@anirudh2290

@anirudh2290
Member

anirudh2290 commented Jan 23, 2018

Please see: Exception Handling Wiki
