ARROW-6013: [Java] Support range searcher #4925

liyafan82 · 2019-07-23T10:59:27Z

For a sorted vector, the range searcher finds the first/last occurrence of a particular element.

The search is based on binary search, which takes O(logn) time.

emkornfield · 2019-07-26T06:36:49Z

java/algorithm/src/main/java/org/apache/arrow/algorithm/search/VectorRangeSearcher.java

+    int high = targetVector.getValueCount() - 1;
+
+    while (low <= high) {
+      int mid = (low + high) / 2;


overflow? Didn't we already have binary search someplace else?

Good suggestion for overflow.

This algorithm is similar to binary search. However, they solve different problems, and I think this one is more useful in practice.

In particular, given a (sorted) vector with duplicated elements. This algorithm finds the bounds: the first/last element equal to the given element.

By different problems, I mean it is possible to solve the problem for binary search by this algorithm. But it is not efficient to solve the problem for this algorithm by binary search.

To see this, note that binary search finds any match element, and after that, we may either go left-ward or right-ward to find the first/last match. However, the time complexity for the latter can be O(n), which is worse than the time complexity of this algorithm O(logn).

emkornfield · 2019-07-26T06:38:24Z

java/algorithm/src/main/java/org/apache/arrow/algorithm/search/VectorRangeSearcher.java

+    int low = 0;
+    int high = targetVector.getValueCount() - 1;
+
+    while (low <= high) {


is it possible to factor out some duplicate code?

It is possible, however, it may not be good for readability and code structure.

This is because, the only difference is in the case that the target element is equal to the source element. In this case, we either go left-ward or right-ward by binary search.

So to extract the common code, we need to extract the logic for going left-ward or right-ward. This involves passing and returning multiple states: low, mid, high, plus a flag indicating is we have found the bound. This makes the code less clear.

codecov-io · 2019-07-27T08:32:09Z

Codecov Report

Merging #4925 into master will increase coverage by 2.13%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #4925      +/-   ##
==========================================
+ Coverage   87.49%   89.63%   +2.13%     
==========================================
  Files         998      664     -334     
  Lines      141784    97936   -43848     
  Branches     1418        0    -1418     
==========================================
- Hits       124058    87786   -36272     
+ Misses      17364    10150    -7214     
+ Partials      362        0     -362

Impacted Files	Coverage Δ
r/src/recordbatch.cpp
r/R/Table.R
js/src/util/fn.ts
go/arrow/array/bufferbuilder.go
r/src/symbols.cpp
rust/datafusion/src/execution/projection.rs
rust/datafusion/src/execution/filter.rs
rust/arrow/src/csv/writer.rs
rust/datafusion/src/bin/main.rs
go/arrow/ipc/file_reader.go
... and 324 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 38b0176...d7438d7. Read the comment docs.

pravindra · 2019-07-29T07:24:13Z

java/algorithm/src/main/java/org/apache/arrow/algorithm/search/VectorRangeSearcher.java

+   * @param <V> the vector type.
+   * @return the index of the first matched element if any, and -1 otherwise.
+   */
+  public static <V extends ValueVector> int getLowerBound(


nit: the name is confusing. maybe, getFirstMatch() or getLowestMatch() ?

@pravindra Thanks for your good suggestion.

java/algorithm/src/main/java/org/apache/arrow/algorithm/search/VectorRangeSearcher.java

pravindra

lgtm

fsaintjacques · 2019-07-31T14:21:34Z

java/algorithm/src/main/java/org/apache/arrow/algorithm/search/VectorRangeSearcher.java

+        low = mid + 1;
+      } else {
+        // the key equals the mid value, find the lower bound by going left-ward.
+


The following requires less branching. I haven't proved if it properly works (only toyed with ipython), so be cautious (or add more test!).

int low = 0; int high = targetVector.getValueCount() - 1; int ret = SEARCH_FAIL_RESULT; while (low <= high) { int mid = low + (high - low) / 2; int result = comparator.compare(keyIndex, mid); if (result < 0) { // the key is smaller high = mid - 1; } else if (result > 0) { // the key is larger low = mid + 1; } else { // A match is found, stash it. ret = mid; // Binary search will continue by clamping on the left high = mid - 1; } } return ret;

@fsaintjacques Thanks a lot for your comments, and the improved algorithm.

I think to prove the correctness of the algorithm, we need to maintain some loop invariants. Suppose we call the boundary point (lower or upper bound) the bp, the following invariants must hold throughout:

low <= bp

high >= bp

It seems in some cases, this statement may break the second invariant (Maybe I am wrong)?

high = mid - 1

Anyway, it is interesting question. Would you please give more details about the reasoning behind your algorithm?

pravindra · 2019-08-01T06:07:38Z

@liyafan82 do you want to follow up on the potentially improved algo suggested by @fsaintjacques as part of this CR ? Or, do it as part of a different follow up jira/CR ?

I'm fine either way.

liyafan82 · 2019-08-01T07:59:44Z

@liyafan82 do you want to follow up on the potentially improved algo suggested by @fsaintjacques as part of this CR ? Or, do it as part of a different follow up jira/CR ?

I'm fine either way.

@pravindra I prefer refining the algorithm in a following CR, because we may need some effort to revise/verify/test the refined algorithm.

pravindra · 2019-08-01T09:21:47Z

I've created ARROW-6093 to track the algo improvement suggested by @fsaintjacques

Thanks @liyafan82 for the change.

liyafan82 · 2019-08-01T09:49:21Z

I've created ARROW-6093 to track the algo improvement suggested by @fsaintjacques

Thanks @liyafan82 for the change.

@pravindra I see. Thank you for your effort.

@fsaintjacques

…angeSearcher This is a follow up Jira for the improvement suggested by Francois Saint-Jacques in the PR for #4925 After careful analysis and proof, I found @fsaintjacques 's proposal really works. The idea behind the algorithm is tricky. This PR is the implementation based on @fsaintjacques 's proposal. The benefit of this implementation: 1. the code is much clearer 2. there is fewer branches, which is beneficial for performance. The drawback of this implementation: 1. there can be dumb iterations of the main loop. That is, the while loop may run a few more times after the "solution" is found, just to wait for the loop termination condition is satisfied. However, the time complexity of the algorithm is still O(logn), and the number of dumb iterations can not be large. Closes #5011 from liyafan82/fly_0805_range and squashes the following commits: f47a2e7 <liyafan82> reduce branches in algo for first match in VectorRangeSearcher Authored-by: liyafan82 <fan_li_ya@foxmail.com> Signed-off-by: Micah Kornfield <emkornfield@gmail.com>

For a sorted vector, the range searcher finds the first/last occurrence of a particular element. The search is based on binary search, which takes O(logn) time. Closes apache#4925 from liyafan82/fly_0723_range and squashes the following commits: 4690f69 <liyafan82> Support range searcher Authored-by: liyafan82 <fan_li_ya@foxmail.com> Signed-off-by: Pindikura Ravindra <ravindra@dremio.com>

@fsaintjacques

…angeSearcher This is a follow up Jira for the improvement suggested by Francois Saint-Jacques in the PR for apache#4925 After careful analysis and proof, I found @fsaintjacques 's proposal really works. The idea behind the algorithm is tricky. This PR is the implementation based on @fsaintjacques 's proposal. The benefit of this implementation: 1. the code is much clearer 2. there is fewer branches, which is beneficial for performance. The drawback of this implementation: 1. there can be dumb iterations of the main loop. That is, the while loop may run a few more times after the "solution" is found, just to wait for the loop termination condition is satisfied. However, the time complexity of the algorithm is still O(logn), and the number of dumb iterations can not be large. Closes apache#5011 from liyafan82/fly_0805_range and squashes the following commits: f47a2e7 <liyafan82> reduce branches in algo for first match in VectorRangeSearcher Authored-by: liyafan82 <fan_li_ya@foxmail.com> Signed-off-by: Micah Kornfield <emkornfield@gmail.com>

emkornfield reviewed Jul 26, 2019

View reviewed changes

liyafan82 force-pushed the fly_0723_range branch from fd15f9b to d7438d7 Compare July 27, 2019 07:47

pravindra reviewed Jul 29, 2019

View reviewed changes

liyafan82 force-pushed the fly_0723_range branch from d7438d7 to f9ba6d0 Compare July 30, 2019 07:43

emkornfield reviewed Jul 31, 2019

View reviewed changes

java/algorithm/src/main/java/org/apache/arrow/algorithm/search/VectorRangeSearcher.java Outdated Show resolved Hide resolved

emkornfield reviewed Jul 31, 2019

View reviewed changes

java/algorithm/src/main/java/org/apache/arrow/algorithm/search/VectorRangeSearcher.java Outdated Show resolved Hide resolved

[ARROW-6013][Java] Support range searcher

4690f69

liyafan82 force-pushed the fly_0723_range branch from f9ba6d0 to 4690f69 Compare July 31, 2019 03:41

pravindra approved these changes Jul 31, 2019

View reviewed changes

fsaintjacques reviewed Jul 31, 2019

View reviewed changes

fsaintjacques added the Component: Java label Jul 31, 2019

pravindra closed this in e3ba3de Aug 1, 2019

liyafan82 mentioned this pull request Aug 5, 2019

ARROW-6093: [Java] reduce branches in algo for first match in VectorRangeSearcher #5011

Closed

This was referenced Aug 1, 2019

[Java] Support range searcher #22419

Closed

[Java] reduce branches in algo for first match in VectorRangeSearcher #22492

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ARROW-6013: [Java] Support range searcher #4925

ARROW-6013: [Java] Support range searcher #4925

liyafan82 commented Jul 23, 2019

emkornfield Jul 26, 2019

liyafan82 Jul 27, 2019

emkornfield Jul 26, 2019

liyafan82 Jul 27, 2019

codecov-io commented Jul 27, 2019

pravindra Jul 29, 2019

liyafan82 Jul 30, 2019

pravindra left a comment

fsaintjacques Jul 31, 2019 •

edited

liyafan82 Aug 1, 2019

pravindra commented Aug 1, 2019

liyafan82 commented Aug 1, 2019

pravindra commented Aug 1, 2019

liyafan82 commented Aug 1, 2019

ARROW-6013: [Java] Support range searcher #4925

ARROW-6013: [Java] Support range searcher #4925

Conversation

liyafan82 commented Jul 23, 2019

emkornfield Jul 26, 2019

Choose a reason for hiding this comment

liyafan82 Jul 27, 2019

Choose a reason for hiding this comment

emkornfield Jul 26, 2019

Choose a reason for hiding this comment

liyafan82 Jul 27, 2019

Choose a reason for hiding this comment

codecov-io commented Jul 27, 2019

Codecov Report

pravindra Jul 29, 2019

Choose a reason for hiding this comment

liyafan82 Jul 30, 2019

Choose a reason for hiding this comment

pravindra left a comment

Choose a reason for hiding this comment

fsaintjacques Jul 31, 2019 • edited

Choose a reason for hiding this comment

liyafan82 Aug 1, 2019

Choose a reason for hiding this comment

pravindra commented Aug 1, 2019

liyafan82 commented Aug 1, 2019

pravindra commented Aug 1, 2019

liyafan82 commented Aug 1, 2019

fsaintjacques Jul 31, 2019 •

edited