ARROW-1533: [JAVA] realloc should consider the existing buffer capacity for computing target memory requirement #1112

siddharthteotia · 2017-09-18T22:28:53Z

The latter one was closed as I had to rename the branch correctly and use the correct JIRA number.

wesm · 2017-09-18T23:25:12Z

@siddharthteotia it's not necessary to use a particular branch name, as long as the PR title is correct

siddharthteotia · 2017-09-19T00:57:17Z

@wesm, for easy self-tracking, I typically give my local branch the same name as the branch in my fork that I used to create the PR.

For this JIRA (ARROW-1533), I incorrectly used JIRA number 1553 almost everywhere: branch name, commit message, etc.

I didn't realize this until I actually had to work on ARROW-1553. So I renamed the local branch and force-pushed to my fork to get everything right for ARROW-1533.

icexelloss · 2017-09-19T03:48:26Z

java/vector/src/main/codegen/templates/FixedValueVectors.java

-    if (newAllocationSize > MAX_ALLOCATION_SIZE)  {
+    long baseSize  = allocationSizeInBytes;
+    final int currentBufferCapacity = data.capacity();
+    if (baseSize < (long)currentBufferCapacity) {


The logic here is not very straightforward to me:

In what case is allocationSizeInBytes less than data.capacity()?

Why do we want to set the new allocation to data.capacity() * 2 instead of allocationSizeInBytes * 2?

Take a vector A and write enough data to it to trigger at least 2 reallocs.
Now transfer the vector to vector B.
Now do something that triggers reAlloc() for vector B. The reAlloc() will segfault because allocationSizeInBytes is still at the initial default value, whereas this vector's buffer is probably 4x that size.
reAlloc() will try to copy data from a buffer of at least 128K onto a 64K buffer and segfault.
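
The scenario above can be reproduced with a toy model. This is a minimal sketch, not Arrow's actual classes: `ToyVector`, its methods, and the 64K initial size are assumptions made for illustration, and a plain `byte[]` stands in for the data buffer. It only shows how the pre-fix sizing (`allocationSizeInBytes * 2`) can produce a target smaller than the buffer the vector already holds after a transfer.

```java
import java.util.Arrays;

/** Toy stand-in for a vector; hypothetical, not the Arrow implementation. */
final class ToyVector {
    static final int INITIAL_SIZE = 64 * 1024; // assumed default allocation size

    long allocationSizeInBytes = INITIAL_SIZE;
    byte[] data = new byte[INITIAL_SIZE];

    /** Pre-fix sizing: doubles the bookkeeping field and ignores the real buffer capacity. */
    long oldTargetSize() {
        return allocationSizeInBytes * 2L;
    }

    /** Grows the buffer the way a successful reAlloc() on the writing vector would. */
    void grow() {
        long newSize = oldTargetSize();
        data = Arrays.copyOf(data, (int) newSize);
        allocationSizeInBytes = newSize;
    }

    /** Hands the buffer to the target; the target's allocationSizeInBytes is left untouched. */
    void transferTo(ToyVector target) {
        target.data = data;
        data = new byte[0];
    }

    public static void main(String[] args) {
        ToyVector a = new ToyVector();
        a.grow(); // 64K -> 128K
        a.grow(); // 128K -> 256K
        ToyVector b = new ToyVector();
        a.transferTo(b);
        // b owns a 256K buffer but would size its next reAlloc() from the 64K default.
        System.out.println("capacity=" + b.data.length + " oldTarget=" + b.oldTargetSize());
    }
}
```

In this model, vector B's computed target is smaller than its current buffer, which is the condition that makes the copy in reAlloc() unsafe.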

Hmm... Thanks for the explanation.

What are the semantics of transfer, and why doesn't it set allocationSizeInBytes for vector B in this case?

We saw the problem while dealing with a complex JSON schema; a detailed problem description is in #1097.

The transfer() function is aimed at transferring the data buffer (along with ownership) of one vector to a target vector of the same type.

It is not clear to me whether we would want to transfer more state in the function. Secondly, there could be other cases where allocationSizeInBytes < buffer capacity, so in any case we are probably better off safeguarding reAlloc() regardless.

One such case could be vector reset(), which resets the value of allocationSizeInBytes as well. If we reAlloc() a vector after doing reset(), I think we will run into the same problem.
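
The safeguard discussed here can be sketched as a standalone sizing function. This is a hedged illustration that follows the diff lines quoted in this thread. The class name `ReallocSizing`, the method `targetSize`, the choice of exception, and the value of `MAX_ALLOCATION_SIZE` are assumptions, and any further rounding the real code may apply is left out.

```java
/** Hypothetical helper; mirrors the guarded sizing shown in the diff. */
final class ReallocSizing {
    /** Assumed upper bound for a single allocation; the real limit may differ. */
    static final long MAX_ALLOCATION_SIZE = Integer.MAX_VALUE;

    /**
     * Picks the larger of the bookkeeping field and the actual buffer capacity
     * as the base, then doubles it.
     */
    static long targetSize(long allocationSizeInBytes, int currentBufferCapacity) {
        long baseSize = allocationSizeInBytes;
        if (baseSize < (long) currentBufferCapacity) {
            baseSize = (long) currentBufferCapacity;
        }
        long newAllocationSize = baseSize * 2L;
        if (newAllocationSize > MAX_ALLOCATION_SIZE) {
            // Assumed failure mode for this sketch only.
            throw new IllegalStateException("Unable to expand the buffer");
        }
        return newAllocationSize;
    }
}
```

With this guard, the target is never smaller than twice the buffer the vector currently holds, whichever of the two values happens to be stale.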

Oh, I didn't realize there is another PR; thanks for the context.

Edit: I didn't see the second comment, which answers my question.

I see. But doesn't this change cause a weird reAlloc() effect after reset(), though? That is, if we reset a 128 KB double vector to 32 KB and then reAlloc(), it would be 256 KB instead of 64 KB, and it would also have the old (incorrect) values in the 32-128 KB range?

reset() already has a problem that I think we should fix across all vector types. reset() should not re-initialize allocationSizeInBytes at all, because reset() is typically aimed at zeroing out the buffer, resetting the mutator/accessor, etc. The underlying buffer capacity remains the same after reset().

So for your example, a 128 KB buffer remains a 128 KB buffer, zeroed out, upon reset(). On a subsequent reAlloc(), we will go to 256 KB. There won't be any incorrect or garbage bits lying around in the data buffer because the entire buffer is zeroed out.

Am I missing something?
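
The reset() case can be illustrated with a small toy class. This is only a sketch under stated assumptions: `ResettableToyVector` is hypothetical, a `byte[]` stands in for the data buffer, and the 32 KB value that reset() restores is taken from the example in this thread rather than from the real code.

```java
import java.util.Arrays;

/** Toy model of reset() followed by reAlloc(); hypothetical, not Arrow code. */
final class ResettableToyVector {
    static final int INITIAL_SIZE = 32 * 1024; // assumed value restored by reset()

    long allocationSizeInBytes;
    byte[] data;

    ResettableToyVector(int capacity) {
        data = new byte[capacity];
        allocationSizeInBytes = capacity;
    }

    /** Zeroes the buffer and resets the size field; the buffer capacity is unchanged. */
    void reset() {
        Arrays.fill(data, (byte) 0);
        allocationSizeInBytes = INITIAL_SIZE;
    }

    /** Guarded sizing: the base is the larger of the size field and the real capacity. */
    void reAlloc() {
        long baseSize = Math.max(allocationSizeInBytes, (long) data.length);
        long newSize = baseSize * 2L;
        // copyOf keeps the existing (zeroed) bytes and zero-fills the new tail.
        data = Arrays.copyOf(data, (int) newSize);
        allocationSizeInBytes = newSize;
    }
}
```

Starting from a 128 KB buffer, reset() followed by reAlloc() yields a 256 KB buffer containing only zeros in this model, which matches the reasoning above.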

I see. Yeah, you are right. This is correct.

However, in general, I find the usage of allocationSizeInBytes and data.capacity() confusing. They seem to be the same thing but are used inconsistently in various places.

For instance, it's not clear to me why we don't just double data.capacity() in reAlloc() instead of checking both allocationSizeInBytes and data.capacity().

Maybe we should have a follow-up JIRA to:

Document the difference between the two

Check if allocationSizeInBytes and data.capacity() are used correctly in all places

icexelloss · 2017-09-19T03:50:43Z

java/vector/src/main/codegen/templates/VariableLengthVectors.java

-    final long newAllocationSize = allocationSizeInBytes*2L;
+    long baseSize = allocationSizeInBytes;
+    final int currentBufferCapacity = data.capacity();
+    if (baseSize < (long)currentBufferCapacity) {


This seems to be the same as in FixedValueVectors. Maybe some refactoring?

Sorry, I don't understand the comment. The same problem is fixed across reAlloc() of all vectors -- bit vector, fixed width and variable width.

Oh, I meant that since the logic looks similar, I am wondering if we can refactor the shared logic into the base class.

I had thought about it, but since the inheritance hierarchy and templates might look different with ARROW-1463, I thought maybe there is not much gain in refactoring now?

I see. Yeah I don't love it but if we are going to fix this in ARROW-1463 then I am ok.

jacques-n · 2017-09-19T14:49:16Z

Good additional questions that we should address in ARROW-1463. +1 on getting this merged.

icexelloss · 2017-09-19T15:16:47Z

LGTM too.

wesm · 2017-09-19T16:09:54Z

+1

ARROW-1533: realloc should consider the existing buffer capacity for …

4c97be4

…computing target memory requirement

siddharthteotia force-pushed the ARROW-1533 branch from 7524325 to 4c97be4 Compare September 18, 2017 22:37

icexelloss reviewed Sep 19, 2017

asfgit closed this in 2706b7f Sep 19, 2017
