New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-6013: [Java] Support range searcher #4925
Conversation
int high = targetVector.getValueCount() - 1; | ||
|
||
while (low <= high) { | ||
int mid = (low + high) / 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overflow? Didn't we already have binary search someplace else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion for overflow.
This algorithm is similar to binary search. However, they solve different problems, and I think this one is more useful in practice.
In particular, given a (sorted) vector with duplicated elements. This algorithm finds the bounds: the first/last element equal to the given element.
By different problems, I mean it is possible to solve the problem for binary search by this algorithm. But it is not efficient to solve the problem for this algorithm by binary search.
To see this, note that binary search finds any match element, and after that, we may either go left-ward or right-ward to find the first/last match. However, the time complexity for the latter can be O(n), which is worse than the time complexity of this algorithm O(logn).
int low = 0; | ||
int high = targetVector.getValueCount() - 1; | ||
|
||
while (low <= high) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible to factor out some duplicate code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is possible, however, it may not be good for readability and code structure.
This is because, the only difference is in the case that the target element is equal to the source element. In this case, we either go left-ward or right-ward by binary search.
So to extract the common code, we need to extract the logic for going left-ward or right-ward. This involves passing and returning multiple states: low, mid, high, plus a flag indicating is we have found the bound. This makes the code less clear.
fd15f9b
to
d7438d7
Compare
Codecov Report
@@ Coverage Diff @@
## master #4925 +/- ##
==========================================
+ Coverage 87.49% 89.63% +2.13%
==========================================
Files 998 664 -334
Lines 141784 97936 -43848
Branches 1418 0 -1418
==========================================
- Hits 124058 87786 -36272
+ Misses 17364 10150 -7214
+ Partials 362 0 -362 Continue to review full report at Codecov.
|
* @param <V> the vector type. | ||
* @return the index of the first matched element if any, and -1 otherwise. | ||
*/ | ||
public static <V extends ValueVector> int getLowerBound( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: the name is confusing. maybe, getFirstMatch() or getLowestMatch() ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pravindra Thanks for your good suggestion.
d7438d7
to
f9ba6d0
Compare
java/algorithm/src/main/java/org/apache/arrow/algorithm/search/VectorRangeSearcher.java
Outdated
Show resolved
Hide resolved
java/algorithm/src/main/java/org/apache/arrow/algorithm/search/VectorRangeSearcher.java
Outdated
Show resolved
Hide resolved
f9ba6d0
to
4690f69
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
low = mid + 1; | ||
} else { | ||
// the key equals the mid value, find the lower bound by going left-ward. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The following requires less branching. I haven't proved if it properly works (only toyed with ipython), so be cautious (or add more test!).
int low = 0;
int high = targetVector.getValueCount() - 1;
int ret = SEARCH_FAIL_RESULT;
while (low <= high) {
int mid = low + (high - low) / 2;
int result = comparator.compare(keyIndex, mid);
if (result < 0) {
// the key is smaller
high = mid - 1;
} else if (result > 0) {
// the key is larger
low = mid + 1;
} else {
// A match is found, stash it.
ret = mid;
// Binary search will continue by clamping on the left
high = mid - 1;
}
}
return ret;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fsaintjacques Thanks a lot for your comments, and the improved algorithm.
I think to prove the correctness of the algorithm, we need to maintain some loop invariants. Suppose we call the boundary point (lower or upper bound) the bp, the following invariants must hold throughout:
- low <= bp
- high >= bp
It seems in some cases, this statement may break the second invariant (Maybe I am wrong)?
high = mid - 1
Anyway, it is interesting question. Would you please give more details about the reasoning behind your algorithm?
@liyafan82 do you want to follow up on the potentially improved algo suggested by @fsaintjacques as part of this CR ? Or, do it as part of a different follow up jira/CR ? I'm fine either way. |
@pravindra I prefer refining the algorithm in a following CR, because we may need some effort to revise/verify/test the refined algorithm. |
I've created ARROW-6093 to track the algo improvement suggested by @fsaintjacques Thanks @liyafan82 for the change. |
@pravindra I see. Thank you for your effort. |
…angeSearcher This is a follow up Jira for the improvement suggested by Francois Saint-Jacques in the PR for #4925 After careful analysis and proof, I found @fsaintjacques 's proposal really works. The idea behind the algorithm is tricky. This PR is the implementation based on @fsaintjacques 's proposal. The benefit of this implementation: 1. the code is much clearer 2. there is fewer branches, which is beneficial for performance. The drawback of this implementation: 1. there can be dumb iterations of the main loop. That is, the while loop may run a few more times after the "solution" is found, just to wait for the loop termination condition is satisfied. However, the time complexity of the algorithm is still O(logn), and the number of dumb iterations can not be large. Closes #5011 from liyafan82/fly_0805_range and squashes the following commits: f47a2e7 <liyafan82> reduce branches in algo for first match in VectorRangeSearcher Authored-by: liyafan82 <fan_li_ya@foxmail.com> Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
For a sorted vector, the range searcher finds the first/last occurrence of a particular element. The search is based on binary search, which takes O(logn) time. Closes apache#4925 from liyafan82/fly_0723_range and squashes the following commits: 4690f69 <liyafan82> Support range searcher Authored-by: liyafan82 <fan_li_ya@foxmail.com> Signed-off-by: Pindikura Ravindra <ravindra@dremio.com>
…angeSearcher This is a follow up Jira for the improvement suggested by Francois Saint-Jacques in the PR for apache#4925 After careful analysis and proof, I found @fsaintjacques 's proposal really works. The idea behind the algorithm is tricky. This PR is the implementation based on @fsaintjacques 's proposal. The benefit of this implementation: 1. the code is much clearer 2. there is fewer branches, which is beneficial for performance. The drawback of this implementation: 1. there can be dumb iterations of the main loop. That is, the while loop may run a few more times after the "solution" is found, just to wait for the loop termination condition is satisfied. However, the time complexity of the algorithm is still O(logn), and the number of dumb iterations can not be large. Closes apache#5011 from liyafan82/fly_0805_range and squashes the following commits: f47a2e7 <liyafan82> reduce branches in algo for first match in VectorRangeSearcher Authored-by: liyafan82 <fan_li_ya@foxmail.com> Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
For a sorted vector, the range searcher finds the first/last occurrence of a particular element.
The search is based on binary search, which takes O(logn) time.