Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-5357: [Rust] Change Buffer::len to represent total bytes instead of used bytes #4331

Closed
wants to merge 4 commits into from

Conversation

sunchao
Copy link
Member

@sunchao sunchao commented May 16, 2019

No description provided.

@sunchao sunchao changed the title Arrow 5357 ARROW-5357: [Rust] Change Buffer::len to represent total bytes instead of used bytes May 16, 2019
@sunchao
Copy link
Member Author

sunchao commented May 17, 2019

After discussing with @paddyhoran on the JIRA, I'm now inclined to keep both len and capacity in Buffer, which allows us to still maintain most of the existing functionalities. To fix the flatbuffer conversion issue, we'll need ARROW-5358 which allows us to compare ArrayData properly.

@nevi-me let me know if this sounds good to you.

@nevi-me
Copy link
Contributor

nevi-me commented May 17, 2019

I saw the conversation, would we be able to use capacity when writing to IPC from the current Buffer?

@sunchao
Copy link
Member Author

sunchao commented May 17, 2019

Yes, both reading and writing to IPC buffer will use capacity instead of len.

@nevi-me
Copy link
Contributor

nevi-me commented May 22, 2019

Apologies for not responding sooner @sunchao

It's only mutable buffer that has capacity, and when we freeze it to create a buffer, we 'lose' the capacity.

#[derive(PartialEq, Debug)]
pub struct Buffer {
    /// Reference-counted pointer to the internal byte buffer.
    data: Arc<BufferData>,

    /// The offset into the buffer.
    offset: usize,
}

struct BufferData {
    ptr: *const u8,
    len: usize,
    // no capacity
}

@sunchao
Copy link
Member Author

sunchao commented May 22, 2019

@nevi-me yes that's right. As I mentioned above, I think it's better to keep both len and capacity in Buffer (same as MutableBuffer).

I'm currently working on https://issues.apache.org/jira/browse/ARROW-5358 which hopefully will allow us to compare Array and ArrayData. It should be able to help resolve the issue you are seeing. I plan to come back to this after that is done.

@nevi-me
Copy link
Contributor

nevi-me commented May 22, 2019

Okay, great! I'll keep #4167 open, and continue #4330 without ArrayData comparisons for now

@codecov-io
Copy link

Codecov Report

Merging #4331 into master will decrease coverage by 4.92%.
The diff coverage is 88.88%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master    #4331       +/-   ##
===========================================
- Coverage    87.5%   82.57%    -4.93%     
===========================================
  Files         998       86      -912     
  Lines      141784    24935   -116849     
  Branches     1418        0     -1418     
===========================================
- Hits       124065    20591   -103474     
+ Misses      17357     4344    -13013     
+ Partials      362        0      -362
Impacted Files Coverage Δ
rust/arrow/src/memory.rs 100% <100%> (ø) ⬆️
rust/arrow/src/array/array.rs 92.57% <100%> (ø) ⬆️
rust/arrow/src/tensor.rs 92.3% <80%> (-0.48%) ⬇️
rust/arrow/src/buffer.rs 91.3% <88.88%> (+1.52%) ⬆️
python/pyarrow/ipc.pxi
cpp/src/arrow/csv/chunker-test.cc
cpp/src/parquet/column_page.h
cpp/src/parquet/bloom_filter-test.cc
cpp/src/arrow/array/builder_decimal.cc
r/src/symbols.cpp
... and 907 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 38b0176...7eac173. Read the comment docs.

@paddyhoran
Copy link
Contributor

@nevi-me @sunchao I know there was some related discussion here but are we ok to move forward with this PR (i.e. adding capacity to Buffer). Is this ready for review?

@sunchao
Copy link
Member Author

sunchao commented Sep 9, 2019

I think it is ready for review. It'd be great if you can take a look.

@paddyhoran
Copy link
Contributor

Yep, sounds good. I'll take a look when I get a chance.

Copy link
Contributor

@paddyhoran paddyhoran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a few nits.

}

impl PartialEq for BufferData {
fn eq(&self, other: &BufferData) -> bool {
if self.len != other.len {
return false;
}
if self.capacity != other.capacity {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not against the current implementation, but I wonder if we should only compare the "meaningful" data?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. It seems wrong that on one hand the memcmp is comparing up to len and we bypass that if there is a mismatch on capacity

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks guys! I'm slightly confused - here we are comparing meaningful data, no? and if capacity mismatch then it is considered not equal and we skip comparing the data content.

The equality is defined as: 1) both length should be equal, 2) both capacity should be equal, and 3) data content up to length should be equal. Let me know if this definition sounds good to you.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me if two arrays only differ by the amount of padding that they have then I would consider them equal. When I perform operations using these two arrays I will get the same answer (because the padding, or rather amount of padding, does not impact the result). However:

  • I am focused on computation, maybe there are other implication in IPC, etc.
  • In practice, this probably won't come up as we round up padding to a multiple of 64 bytes, but it could happen.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Array equality is defined separately in array/equal.rs and yes, it does take account on what you said (it compares buffer content with data()). In the current context we are discussing equality of buffers though, which IMO, when looking at in isolation, should consider both len and capacity.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, yes I agree.

@@ -92,12 +102,12 @@ impl Debug for BufferData {

impl Buffer {
/// Creates a buffer from an existing memory region (must already be byte-aligned)
pub fn from_raw_parts(ptr: *const u8, len: usize) -> Self {
pub fn from_raw_parts(ptr: *const u8, len: usize, capacity: usize) -> Self {
assert!(
memory::is_aligned(ptr, memory::ALIGNMENT),
"memory not aligned"
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also check that we are padded to 64 bytes. We assume this in the SIMD implementations and it's recommended here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably test for this also.

}

impl PartialEq for BufferData {
fn eq(&self, other: &BufferData) -> bool {
if self.len != other.len {
return false;
}
if self.capacity != other.capacity {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. It seems wrong that on one hand the memcmp is comparing up to len and we bypass that if there is a mismatch on capacity

@@ -471,6 +487,9 @@ impl PartialEq for MutableBuffer {
if self.len != other.len {
return false;
}
if self.capacity != other.capacity {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue .. memcmp compares up to len so it doesn't matter if the capacities match or not

@nevi-me
Copy link
Contributor

nevi-me commented Dec 29, 2019

I initially thought this would be a blocker for IPC, but I've managed to implement the reader and writer without this. I'd vote for closing this PR @sunchao.

@paddyhoran
Copy link
Contributor

I'd vote for closing this PR @sunchao.

I would actually like to see this PR merged (maybe after further review). Length and capacity are two distinct concepts in Arrow and having capacity be explicit is easier to understand especially for newcomers.

I also think it will be useful when converting from other libraries to Arrow. In these cases, we will have to check the alignment and padding (capacity) of the data that is passed to us.

@sunchao
Copy link
Member Author

sunchao commented Dec 31, 2019

Yes I think it is still useful to have size and capacity as two separate properties of buffer. @andygrove could you take another look on this? I think your concern has been addressed via the comments.

@emkornfield
Copy link
Contributor

@andygrove want to take look?

@paddyhoran
Copy link
Contributor

@sunchao can you rebase this if you get a chance (or let me know if you are stuck for time and I will)? It turns out that this PR fixes UB as pointed out by @jturner314 on #6397.

I'll re-review early next week and hopefully we can get this merged.

@sunchao
Copy link
Member Author

sunchao commented Feb 15, 2020

Sure @paddyhoran . Updated.

Copy link
Contributor

@paddyhoran paddyhoran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - @andygrove you requested changes previously, do you want to re-review or can I merge this one?

@andygrove
Copy link
Member

Apologies for missing the earlier mentions. I really need to figure out a better way of managing all the emails I get so that these mentions stand out.

@wesm
Copy link
Member

wesm commented Feb 20, 2020

@andygrove the easiest thing is to set up a gmail filter that sends any GitHub e-mail with "you were mentioned" directly to Inbox (vice versa, to auto-archive any GitHub e-mail that does not contain this). A separate filter rule should be used to apply a label unconditionally to Arrow-related GitHub e-mails

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants