New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable aborting an Operation result, fix dead lock #173
Enable aborting an Operation result, fix dead lock #173
Conversation
When an exception is thrown during Operation::computeResult() the computing thread would leave the ResultTable in its unfinished state in the subtree cache. Another thread waiting on it to be finished would wait indefinitely. We fix this by introducing the concept of aborting an Operation result. When an exception is thrown in Operation::computeResult() we call the newly added ResultTable::abort() to mark the result as aborted unlocking waiting threads. Then we propagate the exception up the stack of recursive Operation::getResult() calls. To improve error output we propagate the exception with the original details but a new QUERY_ABORTED error code which allows us to only print the inner most Operation which caused the original error while all Operations up the stack are silently aborted. The aborted Operation result is left in the cache intentionally. It is cleared during ResultTable::abort() and thus takes little memory and will be pruned from the cache like any other result until which time it may serve as a zero cost warning that a particular subtree fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall, but I think we should catch all exceptions in the try block for computeResult
to ensure that no unexpected exception type can introduce deadlocks.
src/engine/Operation.h
Outdated
computeResult(newResult.get()); | ||
try { | ||
computeResult(newResult.get()); | ||
} catch (const ad_semsearch::Exception& e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about non ad_semsearch exceptions (e.g. a standard library call that fails, or std::bad_alloc
)? A second catch block that uses catch (...)
to catch everything and then throw;
to rethrow whatever it caught would help prevent deadlocks when an unexpected exception is thrown.
src/util/Exception.h
Outdated
@@ -23,8 +23,9 @@ using std::string; | |||
__PRETTY_FUNCTION__); \ | |||
} // NOLINT | |||
// Rethrow an exception | |||
#define AD_RETHROW(e) \ | |||
throw semsearch::Exception(e.getErrorCode(), e.getErrorDetails()) // NOLINT | |||
#define AD_RETHROW(e) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any point to an AD_RETRHOW
given that throw;
will rethrow the current exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right it doesn't have a point and in fact it isn't even used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requested type changes as discussed in person
src/engine/ResultTable.h
Outdated
@@ -20,7 +20,7 @@ using std::vector; | |||
|
|||
class ResultTable { | |||
public: | |||
enum Status { FINISHED = 0, OTHER = 1 }; | |||
enum Status { IN_PROGESS = 0, FINISHED = 1, ABORTED = 2 }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think currently ResultTable::awaitFinished
never returns in the aborted case.
I think the lambda in the condition_variable wait should check for any status != IN_PROGRESS
. Then the outer calls can decide what to do on Aborted cases or possibly timeouts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Argh you're absolutely right, totally missed that.
src/engine/Operation.h
Outdated
computeResult(newResult.get()); | ||
try { | ||
computeResult(newResult.get()); | ||
} catch (const ad_semsearch::Exception& e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ad:semsearch::Exception should inherit from std::exception and implement what().
ad_semsearch::AbortException should be a separate type.
Then you can write one catch block for AbortException (simply rethrow) and one for all other exceptions (print the what()
message and throw and AbortException. In particular we also want to handle std::bad_alloc
etc.
Also make ad_semsearch::Exception inherit from std::exception and use a separate ad_semsearch::AbortException to propagae excetpions during query abortion
@joka921, @floriankramer I've addressed your review comments. The change became a bit more involved than the last version but we should also now be catching stuff like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from a non critical catch all block this looks good to me.
newResult->abort(); | ||
// Rethrow as QUERY_ABORTED allowing us to print the Operation | ||
// only at innermost failure of a recursive call | ||
throw ad_semsearch::Exception(ad_semsearch::Exception::QUERY_ABORTED, | ||
e.getErrorDetails()); | ||
throw ad_semsearch::AbortException(e); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think a catch all block in the form of catch(...) {
would be a good idea here, as we might otherwise run into deadlocks when weird exceptions are being thrown (as pretty much anything could be used as an exception object, and exceptions do not have to inherit from std::exception). With our current code all exceptions should inherit from std::exception, so this is not a critical problem right now, but might help with otherwise unexpected behavior in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, then I wouldn't have an object to pass to the AbortException
constructor so I would need that to be an extra block. Let's see how that looks and then we can decide if we rather want to crash on these weird exceptions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to cppreference all standard library exceptions inherit from std::exception
so I'm not sure if this warrants uglier code. If there's an exception that isn't from the standard library or ad_semsearch::Exception
something would be pretty fishy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am fine with both solutions. We generally do not use third-party libraries in this part, so the catch(...)
is not necessary for me. But it does not bother me too much, since it is very explicit and readable what happens here.
This would catch even those exceptions that don't inherit from std::exception. Since those are pretty weird, I'm not 100% sure if this is much better than just crashing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is one point I am not sure about (aborting queries recursively). The rest is just cosmetic.
src/ServerMain.cpp
Outdated
LOG(ERROR) << e.getFullErrorMessage() << '\n'; | ||
return 1; | ||
} catch (const std::exception& e) { | ||
LOG(ERROR) << e.what() << std::endl; | ||
} | ||
return 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any way that we legally reach this return 0
statement? We could remove it to suggest that it is never reached. In addition provably server.run()
could get a [[noreturn]]
attribute if it is indeed an infinite loop that can be only left by throwing. (Unrelated to your changes, I just see this now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried marking it [[noreturn]]
but the compiler thinks it does return because it uses thread::join()
and the compiler can't see that the threads will only terminate with exceptions.
computeResult(newResult.get()); | ||
} catch (const ad_semsearch::AbortException& e) { | ||
// AbortExceptions have already been printed simply rethrow to | ||
// unwind the callstack until the whole query is aborted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we have to recursively set the newResult->abort()
also in this recursive call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes we absolutely do need to abort()
here as well, missed that with the last change.
newResult->abort(); | ||
// Rethrow as QUERY_ABORTED allowing us to print the Operation | ||
// only at innermost failure of a recursive call | ||
throw ad_semsearch::Exception(ad_semsearch::Exception::QUERY_ABORTED, | ||
e.getErrorDetails()); | ||
throw ad_semsearch::AbortException(e); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am fine with both solutions. We generally do not use third-party libraries in this part, so the catch(...)
is not necessary for me. But it does not bother me too much, since it is very explicit and readable what happens here.
|
||
//! Get error message pertaining to code | ||
string getErrorMessage() const { return errorCodeAsString(_errorCode); } | ||
string getErrorMessage() const noexcept { | ||
return errorCodeAsString(_errorCode); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory this might throw (involves a std::string
constructor). But this still might be intended, because in those rare cases we probably cannot save our software at all anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I think this would be a very weird error for which crashing might be the best thing to do.
@joka921 I've pushed a new version |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, Fell free to merge (possibly rebase with my PR that you merged earlier today, if they share changed files.
When an exception is thrown during Operation::computeResult() the
computing thread would leave the ResultTable in its unfinished state in
the subtree cache. Another thread waiting on it to be finished would
wait indefinitely.
We fix this by introducing the concept of aborting an Operation result.
When an exception is thrown in Operation::computeResult() we call the
newly added ResultTable::abort() to mark the result as aborted unlocking
waiting threads. Then we propagate the exception up the stack of
recursive Operation::getResult() calls.
To improve error output we propagate the exception with the original
details but a new QUERY_ABORTED error code which allows us to only print
the inner most Operation which caused the original error while all
Operations up the stack are silently aborted.
The aborted Operation result is left in the cache intentionally. It is
cleared during ResultTable::abort() and thus takes little memory and
will be pruned from the cache like any other result until which time it
may serve as a zero cost warning that a particular subtree fails.