Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-15187: [Java] Made reader initialization lazy and added new getTransferPair() function that takes in a Field type #34424

Merged
merged 13 commits into from
Jun 18, 2023

Conversation

rtadepalli
Copy link
Contributor

@rtadepalli rtadepalli commented Mar 3, 2023

Rationale for this change

This PR closes #15187. FieldReader is being allocated directly in the constructor today, and this PR changes it such that the initialization becomes lazy. Additionally, a new function getTransferPair(Field, Allocator) is introduced so that a new Field method is not constructed each time getTransferPair is called on the Vector.

What changes are included in this PR?

  1. Introduce a new getTransferPair method.
  2. Make initializing FieldReader lazy.

Are these changes tested?

Yes, some tests have been added to verify these changes.

Are there any user-facing changes?

I am not 100% sure if there are any user facing changes.

There should not be any breaking changes.

@github-actions
Copy link

github-actions bot commented Mar 3, 2023

⚠️ GitHub issue #15187 has been automatically assigned in GitHub to PR creator.

@@ -71,7 +73,11 @@ public BigIntVector(String name, FieldType fieldType, BufferAllocator allocator)
*/
public BigIntVector(Field field, BufferAllocator allocator) {
super(field, allocator, TYPE_WIDTH);
reader = new BigIntReaderImpl(BigIntVector.this);
reader = () -> {
Copy link
Contributor Author

@rtadepalli rtadepalli Mar 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not thread safe, but I am not sure if adding the Google core libraries dependency to get Suppliers::memoize is allowed. I don't know of other ways to make this thread safe without adding any throws clause.

I tried adding apache.commons.lang3#LazyInitializer but that I think requires that I add throws ConcurrentException to the getReader function, doesn't seem that great.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to liberate the necessary code from either of these libraries? I'm not sure how much baggage they would drag with them

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this rtadepalli. Instead of the above logic, should we add a new constructor with a new boolean parameter createReader. We create the reader based on the boolean flag. And in the getReader if the reader is null, we throw an exception. Callers who dont need the reader can call the new constructor, and if they try accessing the reader then they get an exception.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prashanthbdremio I think adding a new constructor would not be backwards compatible, unless we just default the boolean to true in the BigIntVector(field, allocator); constructor (and constructors in other classes). If we do this, I guess we're not really changing much, since the reader will be eagerly created anyways. If you think that is fine, I can make the change.

@lwhite1 I have made some changes to the way the reader is being initialized -- please take another look. Thanks.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Mar 3, 2023
@rtadepalli
Copy link
Contributor Author

rtadepalli commented Mar 3, 2023

Right now, this only covers derived classes for BaseFixedWidthVector, since that's what the ticket referenced. Should I make changes to derived classes of BaseVariableWidthVector too?

@rtadepalli rtadepalli changed the title GH-15187: [Java] Made reader initialization lazy and added new getTransferPair() function that takes in a Field type GH-15187: [Java] Made reader initialization lazy and added new getTransferPair() function that takes in a Field type Mar 3, 2023
@github-actions
Copy link

github-actions bot commented Mar 3, 2023

⚠️ GitHub issue #15187 has been automatically assigned in GitHub to PR creator.

@lidavidm
Copy link
Member

lidavidm commented Mar 3, 2023

@lwhite1 or @davisusanibar are you able to take a first look at this?

@lidavidm
Copy link
Member

lidavidm commented Mar 6, 2023

Right now, this only covers derived classes for BaseFixedWidthVector, since that's what the ticket referenced. Should I make changes to derived classes of BaseVariableWidthVector too?

Ah, I forgot to reply. I think this is OK. We can make a follow up for the rest of the vector classes.

@amol-
Copy link
Contributor

amol- commented Mar 30, 2023

Closing because it has been untouched for a while, in case it's still relevant feel free to reopen and move it forward 👍

@amol- amol- closed this Mar 30, 2023
@rtadepalli
Copy link
Contributor Author

I am not sure I understand -- there are other PRs open against main that have been untouched far longer than this has been. Is there some sort of a cleanup going on? It's still relevant, I just haven't had the chance to work on this because something else came up. I'd appreciate it if this was reopened.

@lidavidm lidavidm reopened this Mar 30, 2023
@lidavidm
Copy link
Member

@rtadepalli sorry about that, some contributors decided to perform some spring cleaning and a few PRs got caught up in that. I reopened the PR.

@amol- amol- marked this pull request as draft April 13, 2023 15:25
@amol-
Copy link
Contributor

amol- commented Apr 13, 2023

@rtadepalli As you are planning to continue working on this one I converted it to a Draft until you consider it complete. Feel free to tag people when you think it's ready for further review.

Copy link
Contributor Author

@rtadepalli rtadepalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Covered both fixed and variable length vectors here. cc @lidavidm.

@@ -71,7 +73,11 @@ public BigIntVector(String name, FieldType fieldType, BufferAllocator allocator)
*/
public BigIntVector(Field field, BufferAllocator allocator) {
super(field, allocator, TYPE_WIDTH);
reader = new BigIntReaderImpl(BigIntVector.this);
reader = () -> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prashanthbdremio I think adding a new constructor would not be backwards compatible, unless we just default the boolean to true in the BigIntVector(field, allocator); constructor (and constructors in other classes). If we do this, I guess we're not really changing much, since the reader will be eagerly created anyways. If you think that is fine, I can make the change.

@lwhite1 I have made some changes to the way the reader is being initialized -- please take another look. Thanks.

@@ -130,6 +130,12 @@ public TransferPair makeTransferPair(ValueVector target) {
return underlyingVector.makeTransferPair(target);
}

@Override
protected Class<? extends FieldReader> getReaderImplClass() {
throw new UnsupportedOperationException("Readers for extension types depend on the underlying vector, " +
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Throwing an exception here because getReader is delegating to the underlying vector, which to my understanding would always be a subclass of BaseValueVector. This means that there'd be a concrete implementation of getReaderImplClass that'd be available, so delegating to the getReader method of the underlying vector is fine and there's no need of an implementation here.

Please let me know if this is not the case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Subclasses of ExtensionTypeVector should override this if needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should be fine I think -- should I add a comment explicitly saying that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, on second thought: maybe remove the override for getReader here, and put the delegation here instead? That way, users can override this and get the expected behavior. Otherwise they have to override both methods which is a bit of a mess.

if (fieldReader == null) {
try {
fieldReader =
(FieldReader) Class.forName(getReaderImplClass().getName()).getConstructor(getClass())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getReaderImplClass already returns a Class, why do we have to look it up again?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also: why can't getReaderImplClass return a factory function instead of a class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please clarify what you mean by return a factory function? I can change the usage to getReaderImplClass().getInstance if that is preferable. I just used Class.forName since I am more familiar with that option is all.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't these all be Function<FieldVector, FieldReader>?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well returning a FieldReader instance would mean that I'd need to construct it. I want to do that in the synchronized block so I am returning the class type to construct an instance out of.

Out of curiosity, what is the advantage of returning a Function type over the current implementation?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getReaderImplClass can return a Function which you then call.

Explicit reflection/metaprogramming isn't necessary here over just a factory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh I guess the construction would only happen when I call .apply which would in the synchronized block. That does seem nicer...

Any chance I could clean that up in a follow up ticket? Or would you prefer I make the changes here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha, looks like we commented the same thing at the same time, :jinx:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not include reflection here that isn't strictly necessary

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realized we don't need really need the ValueVector as an argument to the Function, so I decided to return a Supplier instead.


/**
* Default implementation to create a reader for the vector. Depends on the individual vector
* class' implementation of `getReaderImpl()` to initialize the reader appropriately.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be {@link #getReaderImplClass}? JavaDoc does not use Markdown.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Jun 13, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jun 14, 2023
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Jun 15, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jun 16, 2023
Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, though it turns out the supplier is not necessary either (sorry)

}
synchronized (this) {
if (fieldReader == null) {
fieldReader = getReaderImpl().get();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll just point out...if you don't actually need to pass anything, this can just be getReaderInstance() and you can directly return the constructed reader. The whole thing with the class or even the function/supplier isn't necessary to make it lazy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sigh... don't know why I didn't realize this last night. Will change.

@@ -130,6 +130,12 @@ public TransferPair makeTransferPair(ValueVector target) {
return underlyingVector.makeTransferPair(target);
}

@Override
protected Class<? extends FieldReader> getReaderImplClass() {
throw new UnsupportedOperationException("Readers for extension types depend on the underlying vector, " +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Subclasses of ExtensionTypeVector should override this if needed.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Jun 16, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jun 17, 2023
@rtadepalli
Copy link
Contributor Author

@lidavidm ready for another look, thanks.

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! And sorry for all the back-and-forth.

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Jun 17, 2023
@lidavidm
Copy link
Member

FYI, if it's ready, can you take this out of draft mode?

@rtadepalli
Copy link
Contributor Author

Oh I thought I didn't have permissions to take it out of draft mode but then I realized that I was looking at this PR while logged out on my phone. Done 👍

@rtadepalli rtadepalli marked this pull request as ready for review June 18, 2023 16:08
@lidavidm lidavidm merged commit c3eaf4c into apache:main Jun 18, 2023
16 checks passed
@conbench-apache-arrow
Copy link

Conbench analyzed the 6 benchmark runs on commit c3eaf4cb.

There were 4 benchmark results indicating a performance regression:

The full Conbench report has more details.

lriggs pushed a commit to dremio/arrow that referenced this pull request Sep 7, 2023
…w `getTransferPair()` function that takes in a `Field` type (apache#34424)

This PR closes apache#15187. `FieldReader` is being allocated directly in the constructor today, and this PR changes it such that the initialization becomes lazy. Additionally, a new function `getTransferPair(Field, Allocator)` is introduced so that a new `Field` method is not constructed each time `getTransferPair` is called on the Vector.

1. Introduce a new `getTransferPair` method.
2. Make initializing `FieldReader` lazy.

Yes, some tests have been added to verify these changes.

I am not 100% sure if there are any user facing changes.

There should not be any breaking changes.
* Closes: apache#15187

Authored-by: Ramasai <ramasai.tadepalli+3108@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
lriggs added a commit to dremio/arrow that referenced this pull request Sep 8, 2023
* apacheGH-15187: [Java] Made `reader` initialization lazy and added new `getTransferPair()` function that takes in a `Field` type (apache#34424)

This PR closes apache#15187. `FieldReader` is being allocated directly in the constructor today, and this PR changes it such that the initialization becomes lazy. Additionally, a new function `getTransferPair(Field, Allocator)` is introduced so that a new `Field` method is not constructed each time `getTransferPair` is called on the Vector.

1. Introduce a new `getTransferPair` method.
2. Make initializing `FieldReader` lazy.

Yes, some tests have been added to verify these changes.

I am not 100% sure if there are any user facing changes.

There should not be any breaking changes.
* Closes: apache#15187

Authored-by: Ramasai <ramasai.tadepalli+3108@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>

* Fix import

* Fix imports

---------

Signed-off-by: David Li <li.davidm96@gmail.com>
Co-authored-by: rtadepalli <105760760+rtadepalli@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants