Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-4532: [Java] fix bug causing very large varchar value buffers #3613

Closed
wants to merge 2 commits into from

Conversation

pravindra
Copy link
Contributor

@pravindra pravindra commented Feb 11, 2019

The varchar/varbinary vectors have a setSafe variant that takes an arrow buf and a start/end offsets. The method needs to expand the buffer to make space for 'end - start' bytes. but, due to a bug, it was expanding it to accommodate 'end' bytes, thereby making the value buffer much larger than required.

@siddharthteotia
Copy link
Contributor

may be edit the commit message slightly to describe the problem?

@wesm
Copy link
Member

wesm commented Feb 11, 2019

Yes. It helps for the changelog when other developers can see when a bug they run into was fixed

@pravindra pravindra changed the title ARROW-4532: [Java] fix a size compute bug ARROW-4532: [Java] fix bug causing very large varchar value buffers Feb 12, 2019
Copy link
Contributor

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These comments are style that might help somebody new to the code base ramp up quicker (i.e. me) so take them with a grain of salt.

for (int i = 0; i < numValues; i++) {
int start = fromVector.getstartOffset(i);
int end = fromVector.getstartOffset(i + 1);
toVector.setSafe(i, 1, start, end, fromDataBuffer);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

making this a named constant or putting in comments what the value represents might make the test easier to read.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}
fromVector.setValueCount(numValues);
ArrowBuf fromDataBuffer = fromVector.getDataBuffer();
assertEquals(BaseAllocator.nextPowerOfTwo(numValues * 11), fromDataBuffer.capacity());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test seems like it is potentially brittle if the allocator expansion algorithm was changed. Is this an intrinsic property of this method? Maybe make this an inequality?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

fromVector.setInitialCapacity(numValues);
fromVector.allocateNew();
for (int i = 0; i < numValues; ++i) {
fromVector.setSafe(i, "hello world".getBytes(), 0, 11);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might make the test easier to read if 0 and 11 where constants or had comments next to them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

int end = fromVector.getstartOffset(i + 1);
toVector.setSafe(i, 1, start, end, fromDataBuffer);
}
assertEquals(fromDataBuffer.capacity(), toVector.getDataBuffer().capacity());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might pay to add a comment why the capacity is expected to be equal

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added.

MinorType.VARCHAR, allocator);
final VarCharVector toVector = newVector(VarCharVector.class, EMPTY_SCHEMA_PATH,
MinorType.VARCHAR, allocator)) {
fromVector.setInitialCapacity(numValues);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might make the unit test easier to read if you separated each section by a comment.
i.e.
//Setup

// Execute

// Verify

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i've added comments.

@@ -1394,6 +1394,35 @@ public void testFillEmptiesNotOverfill() {
}
}

@Test
public void testSetSafeWithArrowBuf() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add to the title doesntAllocateExcessiveMemory

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@emkornfield
Copy link
Contributor

@pravindra thanks!

@codecov-io
Copy link

Codecov Report

Merging #3613 into master will decrease coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3613      +/-   ##
==========================================
- Coverage   87.78%   87.77%   -0.01%     
==========================================
  Files         673      673              
  Lines       82776    82839      +63     
  Branches     1069     1069              
==========================================
+ Hits        72661    72713      +52     
- Misses      10004    10011       +7     
- Partials      111      115       +4
Impacted Files Coverage Δ
go/arrow/math/int64_avx2_amd64.go 0% <0%> (-100%) ⬇️
go/arrow/memory/memory_avx2_amd64.go 0% <0%> (-100%) ⬇️
go/arrow/math/float64_avx2_amd64.go 0% <0%> (-100%) ⬇️
go/arrow/math/uint64_avx2_amd64.go 0% <0%> (-100%) ⬇️
go/arrow/memory/memory_amd64.go 28.57% <0%> (-14.29%) ⬇️
go/arrow/math/math_amd64.go 31.57% <0%> (-5.27%) ⬇️
js/src/visitor/indexof.ts 97.26% <0%> (-1.37%) ⬇️
cpp/src/arrow/python/decimal.cc 98.03% <0%> (-0.85%) ⬇️
cpp/src/plasma/store.cc 91.85% <0%> (-0.34%) ⬇️
go/arrow/math/float64_amd64.go 33.33% <0%> (ø) ⬆️
... and 15 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d7b6ca7...1b4f224. Read the comment docs.

Copy link
Member

@xhochy xhochy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM

Thanks @pravindra for the comments, they are really good for understanding what is going on!

@xhochy xhochy closed this in 61b2926 Feb 12, 2019
tanyaschlusser pushed a commit to tanyaschlusser/arrow that referenced this pull request Feb 21, 2019
The varchar/varbinary vectors have a setSafe variant that takes an arrow buf and a start/end offsets. The method needs to expand the buffer to make space for 'end - start' bytes. but, due to a bug, it was expanding it to accommodate 'end' bytes, thereby making the value buffer much larger than required.

Author: Pindikura Ravindra <ravindra@dremio.com>

Closes apache#3613 from pravindra/varchar and squashes the following commits:

1b4f224 <Pindikura Ravindra> ARROW-4532: fix review comments
8f88e3a <Pindikura Ravindra> ARROW-4532:  fix a size compute bug
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants