ARROW-6355: [Java] Make range equal visitor reusable #5195

Closed
wants to merge 5 commits into from

Conversation

liyafan82
Contributor

According to the discussion in #4993 (comment), we often encounter this scenario: we compare values repeatedly, and the comparisons differ only in their parameters (vector to compare, start index, etc.).

With the current API, we have to create a new RangeEqualsVisitor object each time a comparison is performed. This leads to non-trivial performance overhead.

To address this problem, we make the RangeEqualsVisitor reusable, and allow the client to change the parameters of an existing visitor.

Contributor

@tianchen92 tianchen92 left a comment

Thanks for doing this, mostly looks good to me, some minor comments.

@@ -121,9 +126,9 @@ protected boolean compareStructVectors(NonNullableStructVector left) {
return false;
}

ApproxEqualsVisitor visitor = new ApproxEqualsVisitor(floatEpsilon, doubleEpsilon);
Contributor

Should we also do this for compareListVectors/compareFixedSizeListVectors?

Contributor Author

Sure. Thanks.

Contributor

It seems compareListVectors/compareFixedSizeListVectors in this class have not changed?

Contributor Author

Sorry. Fixed now.

for (String name : left.getChildFieldNames()) {
RangeEqualsVisitor visitor = new RangeEqualsVisitor(rightVector.getChild(name),
Contributor

also for compareListVectors/compareFixedSizeListVectors?

Contributor Author

@liyafan82 liyafan82 Aug 26, 2019

Sure. Thanks for the good point.

for (int k = 0; k < leftChildren.size(); k++) {
ApproxEqualsVisitor visitor = new ApproxEqualsVisitor(rightChildren.get(k),
floatEpsilon, doubleEpsilon);
visitor.set(rightChildren.get(k), 0, 0, rightChildren.get(k).getValueCount(), true);
if (!leftChildren.get(k).accept(visitor, null)) {
Contributor

Should we give a message if visitor is directly used without setting params?

Contributor Author

Sounds good. Thanks.

Contributor Author

I have updated the code so that leftStart, rightStart and length default to -1. Since each visit method calls the validate method first, an exception is thrown if the set method has not been called.
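
The -1 sentinel idea can be sketched in plain Java as follows. This is only an illustration under stated assumptions: `ReusableRangeChecker` is a hypothetical stand-in that compares `int[]` arrays, not the Arrow visitor, and the exception types are chosen for the example.

```java
// Hypothetical, simplified stand-in for the reusable visitor; not an Arrow class.
final class ReusableRangeChecker {
  private int[] right;         // array to compare against; null until set() is called
  private int leftStart = -1;  // -1 marks "not configured yet"
  private int rightStart = -1;
  private int length = -1;

  /** Reconfigures the checker so one instance can serve many comparisons. */
  void set(int[] right, int leftStart, int rightStart, int length) {
    this.right = right;
    this.leftStart = leftStart;
    this.rightStart = rightStart;
    this.length = length;
  }

  /** Fails fast when set() was skipped or the range does not fit. */
  private void validate(int[] left) {
    if (right == null || leftStart < 0 || rightStart < 0 || length < 0) {
      throw new IllegalStateException("set() must be called before comparing");
    }
    if (leftStart + length > left.length || rightStart + length > right.length) {
      throw new IllegalArgumentException("range exceeds array bounds");
    }
  }

  /** Every comparison validates first, mirroring "each visit method calls validate". */
  boolean rangeEquals(int[] left) {
    validate(left);
    for (int i = 0; i < length; i++) {
      if (left[leftStart + i] != right[rightStart + i]) {
        return false;
      }
    }
    return true;
  }
}
```

The misuse is still detected only at runtime, which is the drawback raised later in the review.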

@codecov-io

codecov-io commented Aug 27, 2019

Codecov Report

Merging #5195 into master will increase coverage by 17.44%.
The diff coverage is n/a.

@@             Coverage Diff             @@
##           master    #5195       +/-   ##
===========================================
+ Coverage   72.25%    89.7%   +17.44%     
===========================================
  Files         837      693      -144     
  Lines      112023   104518     -7505     
  Branches     1437        0     -1437     
===========================================
+ Hits        80946    93761    +12815     
+ Misses      30715    10757    -19958     
+ Partials      362        0      -362
Impacted Files Coverage Δ
cpp/src/arrow/util/iterator.h 84.84% <0%> (-15.16%) ⬇️
cpp/src/gandiva/function_holder_registry.h 75% <0%> (-5%) ⬇️
cpp/src/arrow/python/extension_type.cc 72.82% <0%> (-3.85%) ⬇️
cpp/src/arrow/compute/kernel.h 61.16% <0%> (-1.53%) ⬇️
python/pyarrow/types.pxi 67.42% <0%> (-1.23%) ⬇️
cpp/src/arrow/vendored/xxhash/xxhash.c 72.39% <0%> (-1.21%) ⬇️
cpp/src/plasma/thirdparty/ae/ae.c 70.75% <0%> (-0.95%) ⬇️
python/pyarrow/table.pxi 87.44% <0%> (-0.69%) ⬇️
python/pyarrow/tests/test_extension_type.py 99.56% <0%> (-0.44%) ⬇️
cpp/src/arrow/result.h 91.3% <0%> (-0.37%) ⬇️
... and 805 more


@emkornfield
Contributor

@pravindra I think you've been reviewing the Visitor changes, mind taking a look at this?

for (String name : left.getChildFieldNames()) {
ApproxEqualsVisitor visitor = new ApproxEqualsVisitor(rightVector.getChild(name),
floatEpsilon, doubleEpsilon);
visitor.set(rightVector.getChild(name), 0, 0, rightVector.getChild(name).getValueCount(), true);
Contributor

@pravindra pravindra Aug 28, 2019

There are two other ways to pass parameters along:
a. Parameters that do not change between invocations can go in the constructor of the visitor.
b. Parameters that differ for each invocation can be passed as a POJO/object in the accept() call (instead of null).

Did you consider these options? The drawback of the current approach is that it mandates a strict order between visitor.set and the accept call, i.e. the accept call must always be preceded by the visitor.set call (a violation is caught only at runtime).
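
Options (a) and (b) can be combined, as in the following minimal sketch. All names here (`SpanVisitor`, `IntColumn`, `Span`, `SpanEqualsVisitor`) are hypothetical and only mimic the shape of a visitor API; they are not Arrow types.

```java
// Hypothetical miniature visitor API for illustration; not the Arrow interfaces.
interface SpanVisitor<OUT, IN> {
  OUT visit(IntColumn left, IN value);
}

final class IntColumn {
  final int[] data;

  IntColumn(int... data) {
    this.data = data;
  }

  <OUT, IN> OUT accept(SpanVisitor<OUT, IN> visitor, IN value) {
    return visitor.visit(this, value);
  }
}

/** Option (b): the per-invocation parameters travel through accept(). */
final class Span {
  final int leftStart;
  final int rightStart;
  final int length;

  Span(int leftStart, int rightStart, int length) {
    this.leftStart = leftStart;
    this.rightStart = rightStart;
    this.length = length;
  }
}

final class SpanEqualsVisitor implements SpanVisitor<Boolean, Span> {
  /** Option (a): state that does not change between invocations is fixed here. */
  private final IntColumn right;

  SpanEqualsVisitor(IntColumn right) {
    this.right = right;
  }

  @Override
  public Boolean visit(IntColumn left, Span span) {
    for (int i = 0; i < span.length; i++) {
      if (left.data[span.leftStart + i] != right.data[span.rightStart + i]) {
        return false;
      }
    }
    return true;
  }
}
```

With this shape there is no separate set() step, so there is no ordering between set() and accept() that a caller could get wrong.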

Contributor Author

@pravindra Sounds good.
Wrapping changing parameters into a pojo is a good idea. Too many parameters may confuse users and easily produce bugs.

@liyafan82
Contributor Author

I have revised the code in response to @pravindra's comments.
Since the related parameters are now wrapped in a POJO, the API changes, and so the code changes significantly.

Brief change list:

  1. all visiting related parameters (left vector, right vector, left start index, right start index, length, need type check) are wrapped in a pojo named RangeEqualsParameter.
  2. parameter checking related logic is moved to RangeEqualsParameter.
  3. much code of ApproxEqualsVisitor is removed, making use of the logic in the super class.
  4. the names of the primary APIs have changed. The API for RangeEqualsVisitor is changed from equals to rangeEquals, and the API for VectorEqualsVisitor is changed from equals to vectorEquals. Note that the original equals method may conflict with Object.equals in some scenarios.

Despite the above changes, I still think there are some problems with the current code. For example,

  1. there are if statements/duplicated members in ApproxEqualsVisitor. I think this is an example scenario that can benefit from ARROW-6247: [Java] Provide a common interface for float4 and float8 vectors #5132
  2. the comparison of float4 and float8 values is based on the wrapper objects Float and Double, which may incur a performance penalty.
  3. since the left/right vectors have been moved to the pojo, I am not sure if the current VectorVisitor API (visit(leftVector, value);) is still appropriate.
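
The boxing concern in item 2 can be illustrated with a comparison that stays on primitives. This is a minimal sketch, assuming a hypothetical helper class `PrimitiveApprox` and an absolute-difference tolerance; it is not the code used in Arrow.

```java
// Hypothetical helper, not an Arrow class: values are compared as primitives,
// so no Float or Double wrapper objects are created per comparison.
final class PrimitiveApprox {
  private PrimitiveApprox() {}

  static boolean floatEquals(float a, float b, float epsilon) {
    if (Float.isNaN(a)) {
      return Float.isNaN(b); // NaN matches only NaN
    }
    if (Float.isInfinite(a)) {
      return a == b; // infinities must have the same sign
    }
    return Math.abs(a - b) <= epsilon; // also false when b is NaN
  }

  static boolean doubleEquals(double a, double b, double epsilon) {
    if (Double.isNaN(a)) {
      return Double.isNaN(b);
    }
    if (Double.isInfinite(a)) {
      return a == b;
    }
    return Math.abs(a - b) <= epsilon;
  }
}
```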

Please give some comments.

@pravindra
Contributor

@liyafan82 I'll review this over the weekend.

@liyafan82
Contributor Author

@pravindra thank you so much for your effort.

@pravindra
Contributor

@liyafan82 the POJO is efficient to use, but it makes the code harder to follow.

From the use case you mentioned, we want to use the same visitor instance to compare different ranges. Given that, I think the POJO should hold only the range (leftStart, rightStart, index), and the rest should go in the constructor.

If we separate the Range from the actual visitor, the type/range validation also becomes efficient:

  • for a new range, only the range validation should happen
  • the type checking should be done only when the left/right vectors change
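
The split between one-time type checking and per-call range validation can be sketched like this. The classes `SketchRange` and `SketchRangeComparer` are hypothetical and operate on object arrays; the `typeChecks` counter exists only to make the behavior observable.

```java
// Hypothetical stand-ins for illustration; these are not the Arrow classes.
final class SketchRange {
  final int leftStart;
  final int rightStart;
  final int length;

  SketchRange(int leftStart, int rightStart, int length) {
    this.leftStart = leftStart;
    this.rightStart = rightStart;
    this.length = length;
  }
}

final class SketchRangeComparer {
  private final Object[] left;
  private final Object[] right;
  int typeChecks = 0; // instrumentation only: how often the type check ran

  SketchRangeComparer(Object[] left, Object[] right) {
    // The type check runs once, when the two sides are bound to the comparer.
    typeChecks++;
    if (left.getClass() != right.getClass()) {
      throw new IllegalArgumentException("vector types differ");
    }
    this.left = left;
    this.right = right;
  }

  boolean rangeEquals(SketchRange range) {
    // Per-call work: only the range itself is validated.
    if (range.leftStart < 0 || range.rightStart < 0 || range.length < 0
        || range.leftStart + range.length > left.length
        || range.rightStart + range.length > right.length) {
      throw new IllegalArgumentException("range out of bounds");
    }
    for (int i = 0; i < range.length; i++) {
      Object l = left[range.leftStart + i];
      Object r = right[range.rightStart + i];
      if (!java.util.Objects.equals(l, r)) {
        return false;
      }
    }
    return true;
  }
}
```

However many ranges are compared, the type check count stays at one per comparer instance.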

@liyafan82 I tried this out on top of the change in your PR. Please see if this makes sense.

liyafan82#1

@liyafan82
Contributor Author

liyafan82 commented Sep 2, 2019

@pravindra Thanks a lot for your effort.

Your code looks good to me in general. By moving the vectors out of the range parameter, the code readability improves a lot. However, there are some small problems: the visitors for comparing UnionVector and StructVector cannot be reused.

Weighing the benefits and costs, I generally prefer your changes.
So if no one else has any other concerns, I will accept your changes.

@emkornfield What do you think?

@pravindra
Contributor

> However, there are some small problems: the visitors for comparing UnionVector and StructVector cannot be reused.

Maybe allocate these on the first call and reuse them?

for (int k = 0; k < leftChildren.size(); k++) {
  if (childVisitors[k] == null) {
    childVisitors[k] = createInnerVisitor(leftChildren.get(k), rightChildren.get(k));
  }
  if (!childVisitors[k].rangeEquals(range)) {
    return false;
  }
}
return true;
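
A self-contained version of this lazy allocation idea might look as follows. It is a sketch under assumptions: the children are plain `int[]` arrays, and `ChildRangeComparer` and `StructRangeComparer` are hypothetical names, with `comparersCreated` added only to show that allocation happens once per child.

```java
// Hypothetical stand-ins; "children" are plain int arrays rather than Arrow vectors.
final class ChildRangeComparer {
  private final int[] left;
  private final int[] right;

  ChildRangeComparer(int[] left, int[] right) {
    this.left = left;
    this.right = right;
  }

  boolean rangeEquals(int start, int length) {
    for (int i = start; i < start + length; i++) {
      if (left[i] != right[i]) {
        return false;
      }
    }
    return true;
  }
}

final class StructRangeComparer {
  private final int[][] leftChildren;
  private final int[][] rightChildren;
  private final ChildRangeComparer[] childComparers; // entries stay null until first use
  int comparersCreated = 0; // instrumentation only

  StructRangeComparer(int[][] leftChildren, int[][] rightChildren) {
    if (leftChildren.length != rightChildren.length) {
      throw new IllegalArgumentException("child counts differ");
    }
    this.leftChildren = leftChildren;
    this.rightChildren = rightChildren;
    this.childComparers = new ChildRangeComparer[leftChildren.length];
  }

  boolean rangeEquals(int start, int length) {
    for (int k = 0; k < leftChildren.length; k++) {
      if (childComparers[k] == null) {
        // Allocate on first use, then reuse for every later range.
        childComparers[k] = new ChildRangeComparer(leftChildren[k], rightChildren[k]);
        comparersCreated++;
      }
      if (!childComparers[k].rangeEquals(start, length)) {
        return false;
      }
    }
    return true;
  }
}
```

The number of inner comparers is bounded by the number of children, and repeated range comparisons on the same pair of structs allocate nothing further.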

@liyafan82
Contributor Author

Sounds good for short vectors.
For long vectors, the number of inner visitors equals the vector length, which is non-trivial overhead IMO.

To have one inner visitor and make it reusable, I think we should have an API for the visitor to override the left and right vectors. However, that would make the visitor mutable, which is not a good design.

What do you think?

@pravindra
Contributor

pravindra commented Sep 4, 2019

> For long vectors, the number of inner visitors equals the vector length, which is non-trivial overhead IMO.

What's the overhead? Heap memory or allocation calls? I don't believe heap memory is much of an overhead in this case. The number of allocation calls with this approach remains the same in the single-use case (VectorEqualsVisitor) but drops drastically in the multi-reuse case (e.g. dedup/dictionary).

> To have one inner visitor and make it reusable, I think we should have an API for the visitor to override the left and right vectors. However, that would make the visitor mutable, which is not a good design.

Agree with the design part. I'm also not convinced that making the visitor mutable brings an efficiency benefit: mutability requires updating every field in the POJO for each child call and repeating the type checks. That can cost significantly more than the allocation.

@liyafan82
Contributor Author

@pravindra I agree with you that the overhead of heap memory and allocation calls is insignificant. I also agree with you about the high cost of type checks.

Your method works well for StructVector. However, it may not work well for UnionVector. This is because leftChildren.get(k) and rightChildren.get(k) may get different vectors for different values of k. So the visitor cannot be reused.

That being said, I think your method is already much better than the original implementation. Maybe we can solve the problems for UnionVector or StructVector in a later issue.

What do you think?

@pravindra
Contributor

> Maybe we can solve the problems for UnionVector or StructVector in a later issue.

lgtm. thanks !

Move out Range from the other visitor params
@pravindra pravindra closed this in 6330b2f Sep 4, 2019
@tianchen92
Contributor

The current implementation seems a little confusing to me:
i. In the original code, the visitor held the right vector and did the comparison through the ValueVector#accept API. Now the accept API and the VectorVisitor interface seem totally useless, since the leftVector is never used:

@Override
public Boolean visit(BaseFixedWidthVector left, Range range) {
  return compareBaseFixedWidthVectors(range);
}

ii. Since we now need to pass both the left vector and the right vector, it is more like a comparator than a visitor. If so, I think all visitor APIs should be removed?

cc @pravindra @liyafan82

@pravindra
Contributor

pravindra commented Sep 5, 2019

Not sure what you mean: we are still doing an accept with the visitor API.

https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/compare/RangeEqualsVisitor.java#L99

Yes, it's essentially a comparator (it's used to compute VectorEquals or RangeEquals). The fact that it internally uses a visitor pattern (rather than, say, instanceof checks) doesn't need to be exposed to the users of the API.

@tianchen92
Contributor

For example, in BaseFixedWidthVector we pass 'this' as leftVector:

@Override
public <OUT, IN> OUT accept(VectorVisitor<OUT, IN> visitor, IN value) {
  return visitor.visit(this, value);
}

However, in RangeEqualsVisitor this parameter is not used:

@Override
public Boolean visit(BaseFixedWidthVector left, Range range) {
  return compareBaseFixedWidthVectors(range);
}

@liyafan82
Contributor Author

Hi @tianchen92 , thanks for the good point.
As you have observed, the left parameter is not actually used. I think this is an optimization by design: otherwise we would have to set the left vector each time the method is called, which may involve repeated type validation, and that is expensive, as @pravindra indicated above.

@tianchen92
Contributor

@liyafan82 Thanks for your feedback; I think I understand what you mean.
My main concern is that the current implementation seems inconsistent with the ValueVector#accept API.
As we can see, the leftVector passed via ValueVector#accept is not used, so we can achieve the comparison simply like this:

RangeEqualsVisitor visitor = new RangeEqualsVisitor(leftVector, rightVector);
visitor.rangeEquals(range);

If we use the accept API as below, the leftVector is passed twice:

RangeEqualsVisitor visitor = new RangeEqualsVisitor(leftVector, rightVector);
leftVector.accept(visitor, range);

More importantly, it cannot prevent users from writing incorrect code like this:

RangeEqualsVisitor visitor = new RangeEqualsVisitor(vector1, vector2);
vector3.accept(visitor, range);

In this respect, with the current implementation the accept API seems useless, and I am not sure whether we should remove it.

@pravindra
Contributor

> More important, it could not prevent users using like this way which is not correct:
> RangeEqualsVisitor visitor = new RangeEqualsVisitor(vector1, vector2);
> vector3.accept(visitor, range)

The end result would be that it still gives the result of RangeEquals on vector1/vector2, but it's harmless.

The new semantics of RangeEqualsVisitor allow it to be more efficient by reducing the number of type checks.

> In this aspect, with the current implementation, the 'accept' api seems useless and not sure if we should remove it.

The accept/visit APIs provide the standard benefits of the visitor pattern:

  • no instanceof conditions/switch statements for each vector type
  • the comparison logic (or similar additional functionality) is decoupled from the actual ValueVector (and moved to the Visitor)
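
Both benefits can be seen in a small double-dispatch sketch. The types below (`TypedVisitor`, `TypedColumn`, `IntCol`, `TextCol`, `DescribeVisitor`) are hypothetical and only imitate the accept/visit shape; they are not the Arrow API.

```java
// Hypothetical miniature of the accept/visit shape; not the Arrow interfaces.
interface TypedVisitor<OUT, IN> {
  OUT visit(IntCol column, IN value);

  OUT visit(TextCol column, IN value);
}

interface TypedColumn {
  <OUT, IN> OUT accept(TypedVisitor<OUT, IN> visitor, IN value);
}

final class IntCol implements TypedColumn {
  final int[] values;

  IntCol(int... values) {
    this.values = values;
  }

  @Override
  public <OUT, IN> OUT accept(TypedVisitor<OUT, IN> visitor, IN value) {
    return visitor.visit(this, value); // static type of 'this' selects visit(IntCol, ...)
  }
}

final class TextCol implements TypedColumn {
  final String[] values;

  TextCol(String... values) {
    this.values = values;
  }

  @Override
  public <OUT, IN> OUT accept(TypedVisitor<OUT, IN> visitor, IN value) {
    return visitor.visit(this, value); // selects visit(TextCol, ...)
  }
}

/** The per-type logic lives here, outside the column classes, with no instanceof. */
final class DescribeVisitor implements TypedVisitor<String, Integer> {
  @Override
  public String visit(IntCol column, Integer index) {
    return "int:" + column.values[index];
  }

  @Override
  public String visit(TextCol column, Integer index) {
    return "text:" + column.values[index];
  }
}
```

Adding a new operation means writing another visitor class; the column classes stay unchanged.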

@tianchen92
Contributor

I see, thanks for clarifying.

@pravindra
Contributor

@tianchen92 sorry, I thought some more on this and I think the case you pointed out is not actually harmless.

RangeEqualsVisitor visitor = new RangeEqualsVisitor(vector1, vector2);
vector3.accept(visitor, range)

If vector1/vector2 are, say, StructVectors and vector3 is an IntVector, things can go bad: we'll use compareBaseFixedWidthVectors() and do wrong type casts for vector1/vector2. Can you please open a JIRA for this?
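
One possible guard against this misuse is to reject any accept() call on a vector other than the one bound in the constructor. The sketch below uses hypothetical types (`GuardVec`, `GuardInts`, `GuardTexts`, `GuardedComparer`) and is only an assumption about how such a check could look; it is not the fix that was later applied in Arrow.

```java
// Hypothetical stand-ins for illustration; not the Arrow classes or the actual fix.
abstract class GuardVec {
  abstract boolean accept(GuardedComparer comparer, int index);
}

final class GuardInts extends GuardVec {
  final int[] values;

  GuardInts(int... values) {
    this.values = values;
  }

  @Override
  boolean accept(GuardedComparer comparer, int index) {
    return comparer.visit(this, index);
  }
}

final class GuardTexts extends GuardVec {
  final String[] values;

  GuardTexts(String... values) {
    this.values = values;
  }

  @Override
  boolean accept(GuardedComparer comparer, int index) {
    return comparer.visit(this, index);
  }
}

final class GuardedComparer {
  private final GuardVec left;
  private final GuardVec right;

  GuardedComparer(GuardVec left, GuardVec right) {
    if (left.getClass() != right.getClass()) {
      throw new IllegalArgumentException("vector types differ");
    }
    this.left = left;
    this.right = right;
  }

  /** Rejects dispatch from any vector other than the one bound at construction. */
  private void requireBoundLeft(GuardVec visited) {
    if (visited != left) {
      throw new IllegalArgumentException("accept() called on a vector that is not the bound left vector");
    }
  }

  boolean visit(GuardInts visited, int index) {
    requireBoundLeft(visited);
    // Safe cast: visited is the bound left vector, and right has the same class as left.
    return visited.values[index] == ((GuardInts) right).values[index];
  }

  boolean visit(GuardTexts visited, int index) {
    requireBoundLeft(visited);
    return visited.values[index].equals(((GuardTexts) right).values[index]);
  }
}
```

Without the guard, calling accept() on an integer vector with a comparer bound to text vectors would reach the integer branch and fail on the cast; with it, the caller gets a descriptive error instead.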

@tianchen92
Contributor

Sure, tracked in https://issues.apache.org/jira/browse/ARROW-6472

@liyafan82
Contributor Author

@tianchen92 Thank you for opening a JIRA to track this.
I think we need some discussion about the acceptable behaviors concerning the vectors passed to the constructor and to the visit method.

IMO, the visitor is a good idea, which provides much flexibility.

emkornfield pushed a commit that referenced this pull request Sep 17, 2019
As discussed in #5195 (comment), there are some problems with the current ways of comparing floating point vectors. We solve them in this PR:

1. there are if statements/duplicated members in ApproxEqualsVisitor, making the code redundant and less clear.
2. the comparison of float4 and float8 values is based on the wrapper objects Float and Double, which may incur a performance penalty.

Closes #5304 from liyafan82/fly_0905_float and squashes the following commits:

907c17d <liyafan82>  Remove value boxing/unboxing for ApproxEqualsVisitor

Authored-by: liyafan82 <fan_li_ya@foxmail.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
pravindra pushed a commit that referenced this pull request Sep 25, 2019
Related to [ARROW-6472](https://issues.apache.org/jira/browse/ARROW-6472).

If we use visitor API this way:
>RangeEqualsVisitor visitor = new RangeEqualsVisitor(vector1, vector2);
vector3.accept(visitor, range)

if vector1/vector2 are say, StructVectors and vector3 is an IntVector - things can go bad. we'll use the compareBaseFixedWidthVectors() and do wrong type-casts for vector1/vector2.

Discussions see:
#5195 (comment)
https://issues.apache.org/jira/browse/ARROW-6472

Closes #5483 from tianchen92/ARROW-6472 and squashes the following commits:

3d3d295 <tianchen> add test
12e4aa2 <tianchen> ARROW-6472:  ValueVector#accept may has potential cast exception

Authored-by: tianchen <niki.lj@alibaba-inc.com>
Signed-off-by: Pindikura Ravindra <ravindra@dremio.com>
alippai pushed a commit to alippai/arrow that referenced this pull request Sep 26, 2019
projjal pushed a commit to projjal/arrow that referenced this pull request Mar 8, 2020
According to the discussion in apache#4993 (comment), we often encounter this scenario: we compare values repeatedly, and the comparisons differ only in their parameters (vector to compare, start index, etc.).

With the current API, we have to create a new RangeEqualsVisitor object each time a comparison is performed. This leads to non-trivial performance overhead.

To address this problem, we make the RangeEqualsVisitor reusable, and allow the client to change the parameters of an existing visitor.

Closes apache#5195 from liyafan82/fly_0826_reuse and squashes the following commits:

ffe0e6a <liyafan82> Merge pull request #1 from pravindra/pull-5195
073bc78 <Pindikura Ravindra> Test: Move out Range from the visitor params
7482414 <liyafan82>  Wrapper visit parameters into a pojo
53c1e0b <liyafan82> Merge branch 'master' into fly_0826_reuse
a1f7046 <liyafan82>  Make range equal visitor reusable

Lead-authored-by: liyafan82 <fan_li_ya@foxmail.com>
Co-authored-by: Pindikura Ravindra <ravindra@dremio.com>
Co-authored-by: liyafan82 <42827532+liyafan82@users.noreply.github.com>
Signed-off-by: Pindikura Ravindra <ravindra@dremio.com>