New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
First draft of a timeout for operations #346
Conversation
- Every Operation that is executed and that may possibly take a long time regularly checks whether it still has time to continue and else throw a TimeoutException. - This commit implements such checks for almost all operations. - For sorting, there is no efficient way to implement a timeout without changing the implementation of the library function. We thus estimate the execution time before starting the operation, and only start it if the estimate is not much higher than the time budget. - The Timeout currently can be set for each query the HTTP parameter "timeout", the value is a floating point value that specifies the number of seconds.
33dc9e0
to
4c1b6a4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also need to make sure, that the other operations get a timeout support at some point.
- In the transitive path the HashMaps take quite a long time to destroy, this leads to strange timeout behavior.
- The Pattern trick should get a timeout functionality as part of Florian's extension.
- With parallelization (not only of pattern trick) in mind, we should research the best pattern to cancel parallel computations in OpenMP etc.
src/util/Timer.h
Outdated
/// Factory function for a timer that never times out | ||
static TimeoutTimer unlimited() { return TimeoutTimer(UnlimitedTag{}); } | ||
/// Factory function for a timer that times out after a number of microseconds | ||
static TimeoutTimer usecLimited(off_t usecs) { return TimeoutTimer(usecs); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fromMicroseconds
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Partial code review in 1-1 with Johannes, will do the rest later today.
Thanks a lot for this valuable PR!
src/util/Timer.h
Outdated
if (_isUnlimited) { | ||
return false; | ||
} | ||
auto prevRunning = isRunning(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isRunningSnapshot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// NOTE: we cannot get the current timer value without stopping the timer.
src/util/Timer.h
Outdated
} | ||
auto prevRunning = isRunning(); | ||
stop(); | ||
auto res = usecs() > _timeLimitInMicroseconds; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hasTimedOut
src/util/Timer.h
Outdated
|
||
/// Did this timer already timeout | ||
/// Can't be const because of the internals of the Timer class. | ||
bool isTimeout() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hasTimedOut
src/util/Timer.h
Outdated
static_cast<double>(_timeLimitInMicroseconds) / (1000 * 1000); | ||
throw TimeoutException(additionalMessage + | ||
"A Timeout occured. The time limit was "s + | ||
std::to_string(seconds) + "seconds"s); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed precision or integer milliseconds with thousand separator.
src/engine/Operation.h
Outdated
// set a global timeout timer for all child operations. | ||
// As soon as this runs out, the complete tree will fail. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Use the same timeout timer for all children of an operation (= query plan rooted at that operation). As soon as one child times out, the whole operation times out.
src/engine/Operation.cpp
Outdated
@@ -80,7 +92,14 @@ shared_ptr<const ResultTable> Operation::getResult(bool isRoot) { | |||
// Passing the raw pointer here is ok as the result shared_ptr remains | |||
// in scope | |||
try { | |||
if (_timeoutTimer->wlock()->isTimeout()) { | |||
throw ad_utility::TimeoutException("uncaught Timeout before " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Timeout in operation without timeout functionality, before "
src/engine/Join.cpp
Outdated
@@ -18,7 +18,7 @@ Join::Join(QueryExecutionContext* qec, std::shared_ptr<QueryExecutionTree> t1, | |||
size_t t2JoinCol, bool keepJoinColumn) | |||
: Operation(qec) { | |||
// Make sure subtrees are ordered so that identical queries can be identified. | |||
if (t1.get()->asString() < t2.get()->asString()) { | |||
if (t1 && t2 && t1.get()->asString() < t2.get()->asString()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Throw exception if t1 or t2 is null
if (j >= b.size()) { | ||
// The next i might still match | ||
break; | ||
} | ||
} | ||
++i; | ||
checkTimeoutAfterNCalls(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this one is needed, but it doesn't harm either
src/engine/TransitivePath.cpp
Outdated
if (left[i][leftSideCol] == last_elem) { | ||
// We can repeat the last output | ||
size_t num_new = last_result_end - last_result_begin; | ||
size_t res_row = res.size(); | ||
res.resize(res.size() + num_new); | ||
checkTimeoutAfterNCalls(num_new); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not num_new * res_width ?
I presume that res_width is never very large. But what about num_new?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Num new can in theory be very large (but this means, that the transitive path is easy to compute and mostly only copies.
src/engine/TransitivePath.cpp
Outdated
@@ -522,6 +533,7 @@ void TransitivePath::computeTransitivePathRightBound( | |||
res(res_row + j, c) = res(last_result_begin + j, c); | |||
} | |||
} | |||
checkTimeoutAfterNCalls(num_new); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finished review after 1-1 with Johannes earlier that day.
One general comment: There are various constants in the various operations concerning the maxCount that determines how often a (relatively expensive) timeout check is made. Here is a suggestion to make this a bit less ad hoc:
- Have a global constant NUM_OPERATIONS_BETWEEN_TIMEOUT_CHECKS with a reasonable value. No need to have this configurable from the command line for now, but it would be good to have one place in the code where this is defined.
- Have a constant NUM_OPERATIONS_HASHSET_LOOKUP that estimates the number of operations for a lookup in a hash set (used several times in the code in timeout critical sections)
- Any other such constants if needed. But maybe the two above are already enough.
src/engine/GroupBy.cpp
Outdated
|
||
void GroupBy::processGroup(const GroupBy::Aggregate& a, size_t blockStart, | ||
size_t blockEnd, const IdTableView<IN_WIDTH>& input, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's with the two empty lines here?
src/engine/GroupBy.cpp
Outdated
switch (a._type) { | ||
case ParsedQuery::AggregateType::AVG: { | ||
float res = 0; | ||
if (inputTypes[a._inCol] == ResultTable::ResultType::VERBATIM) { | ||
if (a._distinct) { | ||
for (size_t i = blockStart; i <= blockEnd; i++) { | ||
checkTimeoutAfterNCalls(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a loop body with a HashSet.find. In another operation you used a lower value for numCalls then 32000 for that. There are a few more occurrences of this below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, this would be much cleaner, I will adapt those constants.
src/engine/GroupBy.cpp
Outdated
for (size_t pos = 1; pos < input.size(); pos++) { | ||
checkTimeoutAfterNCalls(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not call this with argument currentGroupBlock.size() here, given the loop below? Or do we usually break early from that loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch, you are right.
src/engine/GroupBy.cpp
Outdated
if (a._type == ParsedQuery::AggregateType::GROUP_CONCAT) { | ||
delete static_cast<std::string*>(a._userData); | ||
|
||
auto cleanup = [&]() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Specify variables in capture explicitly when it'a a short list and/or a short lambda?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general you are right. In this case I would argue that the default capture by reference is not problematic, because we do not pass the lambda outside, and only use it as a "copied code block", but with NEVER using default captures you agree with one of my favourite C++ authors, so I am totally convinced.
src/engine/GroupBy.cpp
Outdated
} catch (...) { | ||
cleanup(); | ||
throw; | ||
} | ||
|
||
cleanup(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you briefly explain which case is handled by this which was not properly handled before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As soon as we had a groupConcat operation (which requires manual cleanup of the _userData
) and the CALL_FIXED_SIZE_2 (basically a call to doGroupBy
) would throw an exception, then the _userData
memory would leak. I don't know if this was previously safe, but in the meantime we at least have Timeout
and OutOfMemory
exceptions, which might very well occur here and can lead to the memory leak.
As stated before: This type erasure - style _userData
should become a type with a proper constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(which I would do in a separate, small PR)
@@ -180,11 +188,20 @@ void Union::computeUnion( | |||
// This if clause is only here to avoid creating the call to insert when | |||
// it would not be possible to call the function due to not matching | |||
// columns. | |||
AD_CHECK(LEFT_WIDTH == OUT_WIDTH); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this of the kind "This should never happen?"
If yes, do we still need the if in the next line now that we have the AD_CHECK?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
YES, we do, because the next line is if constexpr
(the direct copy/insert operations would not compile if the columns do not match).
src/engine/Union.cpp
Outdated
// always copy chunkSize results at once and then check for a timeout | ||
size_t numChunks = right.size() / chunkSize; | ||
for (size_t i = 0; i < numChunks; ++i) { | ||
res.insert(res.end(), right.begin() + i * chunkSize, | ||
right.begin() + (i + 1) * chunkSize); | ||
checkTimeout(); | ||
} | ||
res.insert(res.end(), right.begin() + numChunks * chunkSize, | ||
right.end()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code duplication, in other places you have used a lambda to avoid this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed
src/global/Pattern.h
Outdated
using ref = value_type&; | ||
using const_ref = const value_type&; | ||
|
||
value_type& operator[](const size_t pos) { return _data[pos]; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do you need this and how can the compiler tell whether to use this operator or the one from the next line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
&* These are overloads that only differ in the constness of the result AND of the method. The compiler chooses this method if this
is not const, and else the second one. This is an absolute standard pattern for containers, and the magic behind the following behavior:
std::vector<int> a{3};
a[0] = 42; // perfectly fine, the non-const operator[] returns a non-const reference.
const std::vector<int> b{3};
b[0] = 43; // compiler error, b is const, the const overload of operator[] returns a const ref to which we may not write
I am not sure if this is needed in one of the other PRs, but in general it is good to have a complete interface (other methods of the Pattern
class also have the const and non-const overloads, e.g. back
.
if (p._meta.relationExists(key)) { | ||
const FullRelationMetaData& rmd = p._meta.getRmd(key)._rmdPairs; | ||
result->reserve(rmd.getNofElements() + 2); | ||
result->resize(rmd.getNofElements()); | ||
p._file.read(result->data(), rmd.getNofElements() * 2 * sizeof(Id), | ||
rmd._startFullIndex); | ||
rmd._startFullIndex, std::move(timer)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, that gives the answer to my earlier question in IndexScan.cpp . I wonder though, why this function is here and not in IndexScan.cpp . Especially since it seems to be the only timeout-relevant function in class Index
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I came to the same conclusion above, but I think it is not relevant for the timeout PR
src/util/File.h
Outdated
@@ -191,14 +192,22 @@ class File { | |||
//! Read nofBytesToRead bytes from file starting at the given offset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove extra space in comment
Still TODO: - Global Constants for Operations until check. - Verify impact of Timeout - Proper Estimates for sorting.
It is tricky to test. TODO: exclude the test from the make test command b.c. it takes too long and integrate the SortPerformanceEstimator into the ExecutionContext.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot!
|
||
for (size_t numColumns = 7; numColumns < 15; numColumns++) { | ||
bool isFirst = true; | ||
for (size_t i = 1000000; i < 100000000; i = static_cast<size_t>(i * 1.5)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1'000'000
100'000'000
I understood that a single one of these sorts takes seconds, which is fine. Minutes for all of them together is maybe a bit too long. How about testing fewer of them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added a random generator to only execute every 6th test case on average.
On my average laptop the tests then run for about 23 seconds which I consider acceptable
src/engine/Engine.h
Outdated
template <int WIDTH, typename C, | ||
typename = std::enable_if_t< | ||
std::is_invocable_v<C, typename IdTableStatic<WIDTH>::row_type, | ||
typename IdTableStatic<WIDTH>::row_type>>> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// The effect of the third template argument is that if C does not have operator() with the specified arguments, then this template does not match (so that there is no confusion with the templated function above). This is an instance of SFINAE.
Johannes: maybe add return type bool as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this, it's great to finally have a working timeout!
src/WriteIndexListsMain.cpp
Outdated
SortPerformanceEstimator estimator(allocator); | ||
QueryExecutionContext qec(index, engine, &cache, &pinnedSizes, allocator, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sortPerformanceEstimator
src/engine/OrderBy.cpp
Outdated
SORT_ESTIMATE_CANCELLATION_FACTOR > | ||
remainingSecs) { | ||
->getSortPerformanceEstimator() | ||
.estimateSortTimeInSeconds(subRes->size(), subRes->width()) > |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the function returns the time, it is better called estimatedSortTimeInSeconds
That also reads better in the code where it is used, like here
@@ -64,7 +64,7 @@ class QueryExecutionContext { | |||
_pinnedSizes(pinnedSizes), | |||
_allocator(std::move(allocator)), | |||
_costFactors(), | |||
_sortPerformanceEstimator(_allocator) {} | |||
_sortPerformanceEstimator(sortPerformanceEstimator) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops :-)
src/engine/Server.h
Outdated
@@ -35,6 +36,7 @@ class Server { | |||
cacheMaxSizeGBSingleEntry * (1ull << 30u) / sizeof(Id)), | |||
_allocator(ad_utility::makeAllocationMemoryLeftThreadsafeObject( | |||
maxMemGB * (1ull << 30u))), | |||
_sortPerformanceEstimator(_allocator), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bit confusing that we _sortPerformanceEstimator(sortPerformanceEstimator)
above and _sortPerformanceEstimator(_allocator)
here. Not sure what to do about that though. If you think it's fine, then it's fine with me, too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Constructing a SortPerformanceEstimator
is expensive (it involves sorting large inputs for a time measurement). So we want to construct the Estimator (at most) once in a program.
There is only one Server
which stores an Estimator, and passes it via (cheap) copy to the QueryExecutionContext
class. It is however crucial, that the QueryExecutionContexts don't always recreate the estimators, as the Server creates on QueryExecutionContext per query.
This is what I got wrong initially.
src/engine/Sort.cpp
Outdated
SORT_ESTIMATE_CANCELLATION_FACTOR > | ||
remainingSecs) { | ||
->getSortPerformanceEstimator() | ||
.estimateSortTimeInSeconds(subRes->size(), subRes->width()) > |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see above: estimated...
@@ -18,25 +18,26 @@ class SortPerformanceEstimator { | |||
explicit SortPerformanceEstimator( | |||
const ad_utility::AllocatorWithLimit<Id>& allocator); | |||
|
|||
// Create a random IdTable with specified number of rows and columns. Sort | |||
// Create a random IdTable with the specified number of rows and columns. Sort | |||
// this table and return the time in seconds that this sorting took. | |||
static double measureSortingTimeInSeconds( | |||
size_t numRows, size_t numColumns, | |||
const ad_utility::AllocatorWithLimit<Id>& allocator); | |||
|
|||
// Return an Estimate for how long sorting an IdTable with the specified |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Compute and return an estimate ...
src/global/Constants.h
Outdated
@@ -10,7 +10,7 @@ static const size_t STXXL_MEMORY_TO_USE = 1024L * 1024L * 1024L * 2L; | |||
static const size_t STXXL_DISK_SIZE_INDEX_BUILDER = 1000 * 1000; | |||
static const size_t STXXL_DISK_SIZE_INDEX_TEST = 10; | |||
|
|||
static constexpr size_t DEFAULT_MEM_FOR_QUERIES_IN_GB = 80; | |||
static constexpr size_t DEFAULT_MEM_FOR_QUERIES_IN_GB = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why so little?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can go for 4
here.
The problem is now, if this is too high, then the initial construction of the SortPerformanceEstimator will use that amount of money. So with a too high value here, bad things happen. Most notably, the end2end tests don't work properly. But I'll change it to 4
here (most people should have at least 8 Gigabytes on their machine) and explicitly scale down the memory for the end2end tests.
* A simple and fast Pseudo-Random-Number-Generator called Xoroshiro128+, for | ||
* details see | ||
* https://thompsonsed.co.uk/random-number-generators-for-c-performance-tested | ||
* Limiting the range of the generated numbers is currently not supported | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing full stop + add:
Assumes that Int
fits into 64 bits. If used with a type that does not fulfill this property, the template will not match (because of the std::enable_if) and there will be a compile-time error.
src/util/Random.h
Outdated
size_t _futureIndex = 0; | ||
size_t _elementIndex = 0; | ||
std::array<uint64_t, 4> _shuffleTable; | ||
// The actual algorithm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this comment doing here, is it a leftover?
src/util/Random.h
Outdated
* more flexibility and stronger probabilistic guarantees | ||
*/ | ||
template <typename Int, typename = std::enable_if_t<std::is_integral_v<Int>>> | ||
class RandomIntGenerator { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I vote for SlowRandomIntGenerator
in order to pique interest for anyone seeing this used in the code.
Timeout can be specified via timeout=3 (3 seconds) in the http-get request.
Currently only checked between operations and inside the expensive dummy join and transitive path operations