Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-1533: [JAVA] realloc should consider the existing buffer capacity for computing target memory requirement #1112

Closed
wants to merge 1 commit into from

Conversation

siddharthteotia
Copy link
Contributor

cc @jacques-n , This is same as #1097

The latter one was closed as I had to rename the branch correctly and use the correct JIRA number.

@wesm
Copy link
Member

wesm commented Sep 18, 2017

@siddharthteotia it's not necessary to use a particular branch name, as long as the PR title is correct

@siddharthteotia
Copy link
Contributor Author

@wesm , For easy self-tracking, I typically have the same local branch name as the one in my fork that I used to create the PR.

For this JIRA (ARROW-1533), I incorrectly used JIRA# 1553 almost everywhere -- branch name, commit message etc.

I didn't realize this until I had to actually work on ARROW-1553. So renamed the local branch and did a force update on my fork to get everything proper for ARROW-1533.

if (newAllocationSize > MAX_ALLOCATION_SIZE) {
long baseSize = allocationSizeInBytes;
final int currentBufferCapacity = data.capacity();
if (baseSize < (long)currentBufferCapacity) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic here is not very straight forward to me:

  • In what case allocationSizeInBytes is less than data.capacity()?
  • Why do we want to set the new allocation to data.capacity() * 2 instead of allocationSizeInBytes * 2?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take a vector A, write data to it to an extent that you trigger 2 reallocs at least.
Now transfer the vector to vector B.
Now do something that triggers reAlloc() for vector B -- the reAlloc() will segfault because allocateSizeInBytes is still at the initial default value whereas this vector's buffer is probably 4x the size of that.
reAlloc will try to copy data from 128K sized buffer (at least) onto a 64K buffer and segfault

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm...Thanks for the explanation.

What is the semantics of transfer and why doesn't it set allocateSizeInBytes for vector B in this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We saw the problem while dealing with complex JSON schema -- detailed problem description is here #1097

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

transfer() function is aimed at transferring the data buffer (along with ownership) of one vector to target vector of same type.

It is not clear to me if we would want to transfer more state in the function. Secondly, there could be other cases where allocationSizeInBytes < buffer capacity. So in any case we are probably better off safeguarding reAlloc() regardless.

One case could be vector reset() which resets the value of allocationSizeInBytes as well. If we reAlloc() a vector after doing reset(), I think we will run into the same problem.

Copy link
Contributor

@icexelloss icexelloss Sep 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, didn't realize there is another PR, thanks for the context.

Edit: Didn't see the second comment which answers my question.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. But doesn't this change cause weird reAlloc() effect after reset() though? i.e. if we reset a 128Kb double vector to 32Kb and then reAlloc(), it would be 256Kb instead of 64Kb, and it would also have the old value (incorrect) from 32-128Kb?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reset() already has a problem that I think we should fix across all vector types. reset() should actually not re-initialize allocationSizeInBytes at all because reset() is typically aimed at zeroing out the buffer, resetting mutator/accessor etc -- the underlying buffer capacity still remains the same after reset.

So for your example, 128KB buffer remains a 128Kb buffer zeroed out upon reset(). On a subsequent reAlloc(), we will go to 256KB. There won't be any incorrect or garbage bits lying around on the data buffer because the entire buffer is zeroed out

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I missing something?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Yeah you are right. This is correct.

However, in general, I find the usage between allocationSizeInBytes and data.capacity() confusing. They seem to be the same thing but is inconsistent in various places.

For instance, it's not clear to me why don't we just double data.capacity() in reAlloc() instead of checking both allocationSizeInBytes and data.capacity().

Maybe we should have a follow up Jira to:

  • Document the difference between the two
  • Check if allocationSizeInBytes and data.capacity() are used correctly in all places

final long newAllocationSize = allocationSizeInBytes*2L;
long baseSize = allocationSizeInBytes;
final int currentBufferCapacity = data.capacity();
if (baseSize < (long)currentBufferCapacity) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to be same as FixedValueVector. Maybe some refactor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't understand the comment. The same problem is fixed across reAlloc() of all vectors -- bit vector, fixed width and variable width.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I meant, since these logic looks similar, I am wondering if we can refactor the shared logic into the base class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had thought about it but since with ARROW-1463 the inheritance hierarchy and templates might look different, I thought may there is not much gain in refactoring now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Yeah I don't love it but if we are going to fix this in ARROW-1463 then I am ok.

@jacques-n
Copy link
Contributor

Good additional questions that we should address in ARROW-1463. +1 on getting this merged.

@icexelloss
Copy link
Contributor

LGTM too.

@wesm
Copy link
Member

wesm commented Sep 19, 2017

+1

@asfgit asfgit closed this in 2706b7f Sep 19, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants