Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-6013: [Java] Support range searcher #4925

Closed
wants to merge 1 commit into from

Conversation

liyafan82
Copy link
Contributor

For a sorted vector, the range searcher finds the first/last occurrence of a particular element.

The search is based on binary search, which takes O(logn) time.

int high = targetVector.getValueCount() - 1;

while (low <= high) {
int mid = (low + high) / 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overflow? Didn't we already have binary search someplace else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion for overflow.

This algorithm is similar to binary search. However, they solve different problems, and I think this one is more useful in practice.

In particular, given a (sorted) vector with duplicated elements. This algorithm finds the bounds: the first/last element equal to the given element.

By different problems, I mean it is possible to solve the problem for binary search by this algorithm. But it is not efficient to solve the problem for this algorithm by binary search.

To see this, note that binary search finds any match element, and after that, we may either go left-ward or right-ward to find the first/last match. However, the time complexity for the latter can be O(n), which is worse than the time complexity of this algorithm O(logn).

int low = 0;
int high = targetVector.getValueCount() - 1;

while (low <= high) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to factor out some duplicate code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible, however, it may not be good for readability and code structure.

This is because, the only difference is in the case that the target element is equal to the source element. In this case, we either go left-ward or right-ward by binary search.

So to extract the common code, we need to extract the logic for going left-ward or right-ward. This involves passing and returning multiple states: low, mid, high, plus a flag indicating is we have found the bound. This makes the code less clear.

@codecov-io
Copy link

Codecov Report

Merging #4925 into master will increase coverage by 2.13%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4925      +/-   ##
==========================================
+ Coverage   87.49%   89.63%   +2.13%     
==========================================
  Files         998      664     -334     
  Lines      141784    97936   -43848     
  Branches     1418        0    -1418     
==========================================
- Hits       124058    87786   -36272     
+ Misses      17364    10150    -7214     
+ Partials      362        0     -362
Impacted Files Coverage Δ
r/src/recordbatch.cpp
r/R/Table.R
js/src/util/fn.ts
go/arrow/array/bufferbuilder.go
r/src/symbols.cpp
rust/datafusion/src/execution/projection.rs
rust/datafusion/src/execution/filter.rs
rust/arrow/src/csv/writer.rs
rust/datafusion/src/bin/main.rs
go/arrow/ipc/file_reader.go
... and 324 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 38b0176...d7438d7. Read the comment docs.

* @param <V> the vector type.
* @return the index of the first matched element if any, and -1 otherwise.
*/
public static <V extends ValueVector> int getLowerBound(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the name is confusing. maybe, getFirstMatch() or getLowestMatch() ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pravindra Thanks for your good suggestion.

Copy link
Contributor

@pravindra pravindra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

low = mid + 1;
} else {
// the key equals the mid value, find the lower bound by going left-ward.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following requires less branching. I haven't proved if it properly works (only toyed with ipython), so be cautious (or add more test!).

    int low = 0;
    int high = targetVector.getValueCount() - 1;
    int ret = SEARCH_FAIL_RESULT;

     while (low <= high) {
      int mid = low + (high - low) / 2;
      int result = comparator.compare(keyIndex, mid);
      if (result < 0) {
        // the key is smaller
        high = mid - 1;
      } else if (result > 0) {
        // the key is larger
        low = mid + 1;
      } else {
        // A match is found, stash it.
        ret = mid;
        // Binary search will continue by clamping on the left
        high = mid - 1; 
      }
     }
   return ret;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fsaintjacques Thanks a lot for your comments, and the improved algorithm.

I think to prove the correctness of the algorithm, we need to maintain some loop invariants. Suppose we call the boundary point (lower or upper bound) the bp, the following invariants must hold throughout:

  1. low <= bp
  2. high >= bp

It seems in some cases, this statement may break the second invariant (Maybe I am wrong)?

high = mid - 1

Anyway, it is interesting question. Would you please give more details about the reasoning behind your algorithm?

@pravindra
Copy link
Contributor

@liyafan82 do you want to follow up on the potentially improved algo suggested by @fsaintjacques as part of this CR ? Or, do it as part of a different follow up jira/CR ?

I'm fine either way.

@liyafan82
Copy link
Contributor Author

@liyafan82 do you want to follow up on the potentially improved algo suggested by @fsaintjacques as part of this CR ? Or, do it as part of a different follow up jira/CR ?

I'm fine either way.

@pravindra I prefer refining the algorithm in a following CR, because we may need some effort to revise/verify/test the refined algorithm.

@pravindra pravindra closed this in e3ba3de Aug 1, 2019
@pravindra
Copy link
Contributor

I've created ARROW-6093 to track the algo improvement suggested by @fsaintjacques

Thanks @liyafan82 for the change.

@liyafan82
Copy link
Contributor Author

I've created ARROW-6093 to track the algo improvement suggested by @fsaintjacques

Thanks @liyafan82 for the change.

@pravindra I see. Thank you for your effort.

emkornfield pushed a commit that referenced this pull request Aug 10, 2019
…angeSearcher

This is a follow up Jira for the improvement suggested by Francois Saint-Jacques in the PR for
#4925

After careful analysis and proof, I found @fsaintjacques 's proposal really works. The idea behind the algorithm is tricky. This PR is the implementation based on @fsaintjacques 's proposal.

The benefit of this implementation:
1. the code is much clearer
2. there is fewer branches, which is beneficial for performance.

The drawback of this implementation:
1. there can be dumb iterations of the main loop. That is, the while loop may run a few more times after the "solution" is found, just to wait for the loop termination condition is satisfied.

However, the time complexity of the algorithm is still O(logn), and the number of dumb iterations can not be large.

Closes #5011 from liyafan82/fly_0805_range and squashes the following commits:

f47a2e7 <liyafan82>  reduce branches in algo for first match in VectorRangeSearcher

Authored-by: liyafan82 <fan_li_ya@foxmail.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
pprudhvi pushed a commit to pprudhvi/arrow that referenced this pull request Aug 11, 2019
For a sorted vector, the range searcher finds the first/last occurrence of a particular element.

The search is based on binary search, which takes O(logn) time.

Closes apache#4925 from liyafan82/fly_0723_range and squashes the following commits:

4690f69 <liyafan82>  Support range searcher

Authored-by: liyafan82 <fan_li_ya@foxmail.com>
Signed-off-by: Pindikura Ravindra <ravindra@dremio.com>
pprudhvi pushed a commit to pprudhvi/arrow that referenced this pull request Aug 11, 2019
…angeSearcher

This is a follow up Jira for the improvement suggested by Francois Saint-Jacques in the PR for
apache#4925

After careful analysis and proof, I found @fsaintjacques 's proposal really works. The idea behind the algorithm is tricky. This PR is the implementation based on @fsaintjacques 's proposal.

The benefit of this implementation:
1. the code is much clearer
2. there is fewer branches, which is beneficial for performance.

The drawback of this implementation:
1. there can be dumb iterations of the main loop. That is, the while loop may run a few more times after the "solution" is found, just to wait for the loop termination condition is satisfied.

However, the time complexity of the algorithm is still O(logn), and the number of dumb iterations can not be large.

Closes apache#5011 from liyafan82/fly_0805_range and squashes the following commits:

f47a2e7 <liyafan82>  reduce branches in algo for first match in VectorRangeSearcher

Authored-by: liyafan82 <fan_li_ya@foxmail.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants