Fix infinite parallel join probe loop #7925
Conversation
@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Force-pushed from 39c2aad to 05479d7.
Force-pushed from ef95744 to 7c76b4f.
Thanks.
velox/exec/HashTable.h
Outdated
@@ -864,8 +896,8 @@ class HashTable : public BaseHashTable {
// many threads can set this.
std::atomic<bool> hasDuplicates_{false};

// Offset of next row link for join build side, 0 if none. Copied
// from 'rows_'.
// Offset of next row link for join build side, 0 if none. Copied from
What does "Copied from 'rows_'" mean here? Does this need to be int64_t?
I changed it to "set from 'rows_'". This is a (relative) offset from a pointer, so int32_t should be sufficient.
std::vector<char*>* overflow) {
// The insertable rows are in the table, all get put in the hash
// table or array.
TableInsertPartitionInfo* partitionInfo) {
This is nice. Thanks.
An infinite loop was found in the parallel table build by a Meta internal test. If bucketOffset_ (int64_t) in ProbeState exceeds the int32_t max limit, then HashTable::nextBucketOffset(int32_t) might truncate it to a smaller value, which causes fullProbe to restart the probe from a lower bucket offset and repeat this process forever. This PR fixes the problem by (1) changing nextBucketOffset to take int64_t instead of int32_t to prevent offset integer overflow, and (2) falling back to the overflow insert path if the next bucket offset is below the starting bucket offset. (3) To prevent a similar bug from happening again, we count the number of probed buckets and throw if the count exceeds the total number of buckets in the table; this fails the query but does not hang the driver thread. Also refactor the code by putting all the parallel-join-build related parameters into a TableInsertPartitionInfo data structure to simplify the implementation.
@xiaoxmeng merged this pull request in bcd6652.
Conbench analyzed the 1 benchmark run on this commit. There were no benchmark performance regressions. 🎉 The full Conbench report has more details.