-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-12052: [Rust] Add Child Data to Arrow's C FFI implementation. … #9778
Conversation
151be97
to
b6fbf30
Compare
I am not an expert in this level of code -- perhaps @jorgecarleitao has time to take a look at this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for this, @ritchie46 . it has the right ideas and looks great so far. 💯
I left some comments throughout the code.
My last general comment would be to add this type to the pyarrow-integration-tests
crate, which contains real tests against pyarrow
, which allow us to validate the behavior against the C++ implementation.
Cool! I think I've tackled all your comments @jorgecarleitao . I also added a test to the It turns out that we need to provide a name in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor comments, I'm happy with the implementation though. Thanks @ritchie46 , and I apologise for taking long to review this.
No worries! 😄 |
Codecov Report
@@ Coverage Diff @@
## master #9778 +/- ##
==========================================
- Coverage 82.59% 82.43% -0.17%
==========================================
Files 248 252 +4
Lines 58294 59024 +730
==========================================
+ Hits 48149 48655 +506
- Misses 10145 10369 +224
Continue to review full report at Codecov.
|
@nevi-me there are some new clippy warnings due to new Rust version not related to this PR. Do they need to be fixed? |
@ritchie46 I fixed them last night (depending on where in the world one is lol). CI's fine now. The integration failures are known issues at ARROW-12112 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM from looking at what you're doing to construct the lists.
@jorgecarleitao can do the honours of the PR blessing approval :)
It doesn't look like anything needs to be updated in |
@ritchie46 / @jorgecarleitao / @nevi-me is this one ready to go? There is one smally clippy lint left which I can fixup but I didn't want to ram this PR through to keep the queue down if it wasn't actually ready |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks a lot @ritchie46 for taking the extra mile to have the equality done right :) 💯
Also, thanks a lot @pitrou for your help here. 👍
…children and implements it for List and LargeList datatypes.
@alamb I fixed the clippy issue, so I think we're good to go. :) |
Integration test failure seems due to https://github.com/apache/arrow/pull/9778/checks?check_run_id=2253660088 (not related to this PR):
|
Thanks again @ritchie46 👍 |
Hmm.. This is sadly a bit too late. But the current implementation does an invalid read/write. I get a SIGILL if I run this test a 1000 times.
I hope I can find the location of this invalid read/write. If anybody has an idea of where did could occur that would be very much appreciated. Some additional info/ thoughts:
|
If I prevent the drop in release array, this issue is resolved but we leak data. TBH, I am stuck. @pitrou @jorgecarleitao have you got any idea how this can be resolved? could this be related? |
I am really sorry, this was sloppiness on my part: I should have checkout the code and go through it more carefully as FFI is always risky stuff. If you think it would take some pressure off, we can revert this PR until we find and fix this. Regardless, could you run the memory-check to see if we find the problem in our internal roundtrips? Something like
on the rust crate. This counts every alloc/realloc/dealloc over all buffers over all tests and verifies that the sum is zero. The test-threads must be 1 so that tests run sequentially and the last test is the memory check. I would try to run this before this PR's commit just to make sure that things work as expected since we do not run this as part of our CI. If it passes, then I would try again after this PR's commit. |
@ritchie46 The guidelines for implementers of a release callback here. I would suggest following the example: |
Yes, thank you. I will be going through that. |
Yes, in that case we panic instead of UB, which cleary is better.
Will do that. Some extra info: I realize that we don't have any owned child data in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that this seems a bit delicate to get right, I think you should first add tests for roundtripping schemas:
- roundtrip a schema as is, Rust->Python->Rust
- roundtrip a schema as is, Python->Rust->Python
- create a primitive type in Rust, return
pa.list(primitive_type)
from Python - create a list type in Rust, return
list_type.value_type
from Python
For each case, verify the expected result, also check for allocation/deallocation/leaks.
Once you got that right, you can tackle the array roundtrip issue.
.child_data() | ||
.iter() | ||
.map(|arr| { | ||
let len = arr.len(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand: why isn't try_from
called recursively?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because I otherwise cannot pass nullable: bool
information from the parent. If should split this up in a function separate from try_from
to make this more explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, you'll need to handle recursive types more generally anyway. Think list(list(int8))
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that would not work like this. I will fix that.
a = pyarrow.array([[], None, [1, 2], [4, 5, 6]], pyarrow.list_(pyarrow.int64())) | ||
b = arrow_pyarrow_integration_testing.round_trip(a) | ||
|
||
b.validate(full=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be interesting to add del a
above this, to make sure that b
keeps the data alive.
// <https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowSchema> | ||
FFI_ArrowSchema { | ||
format: CString::new(format).unwrap().into_raw(), | ||
name: std::ptr::null_mut(), | ||
// For child data a non null string is expected and is called item |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only for lists, though. For more general nested types you'll have to take the actual field name.
// at that point the child data is not yet known, but it is also not required to determine | ||
// the buffer length of the list arrays. | ||
"+l" => { | ||
let nullable = schema.flags == 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be (schema.flags & 2) != 0
. Also I'm surprised you're sprinkling magic numbers in the code instead of defining a constant.
// Safety | ||
// Should be set as this is expected from the C FFI definition | ||
debug_assert!(!schema.name.is_null()); | ||
let name = unsafe { CString::from_raw(schema.name as *mut c_char) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem right. The doc for from_raw
says:
This should only ever be called with a pointer that was earlier obtained by calling CString::into_raw. Other usage (e.g., trying to take ownership of a string that was allocated by foreign code) is likely to lead to undefined behavior or allocator corruption.
But we are exactly in the case where schema.name
can have been allocated by C++ or Python or anything else. It seems instead you should use CStr instead: "Representation of a borrowed C string".
(DataType::Utf8, 2) | (DataType::Binary, 2) => size_of::<u8>() * 8, | ||
(DataType::Utf8, _) | (DataType::Binary, _) => { | ||
(DataType::Utf8, 1) | (DataType::Binary, 1) | (DataType::List(_), 1) => size_of::<i32>() * 8, | ||
(DataType::Utf8, 2) | (DataType::Binary, 2) | (DataType::List(_), 2) => size_of::<u8>() * 8, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lists don't have a buffer number 2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm.. I thought it was a validity bitmap and an offset buffer and that the child data was counted differently.
What should be the correct number?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The validity bitmap is buffer 0 and the offsets are buffer 1. You are defining a buffer 2 (of u8
) which doesn't exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The validity is the buffer 0, the offsets the buffer 1. The List has no buffer number two. If someone requests buffer 2 from a List Array, we should error instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, Thanks!
This PR adds child data to Arrow's C FFI implementation and implements it for `List` and `LargeList` datatypes. Closes apache#9778 from ritchie46/ffi_types Authored-by: Ritchie Vink <ritchie46@gmail.com> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
This PR adds child data to Arrow's C FFI implementation and implements it for `List` and `LargeList` datatypes. Closes apache#9778 from ritchie46/ffi_types Authored-by: Ritchie Vink <ritchie46@gmail.com> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
This PR adds child data to Arrow's C FFI implementation and implements it for `List` and `LargeList` datatypes. Closes apache#9778 from ritchie46/ffi_types Authored-by: Ritchie Vink <ritchie46@gmail.com> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
This PR adds child data to Arrow's C FFI implementation and implements it for
List
andLargeList
datatypes.