
Zero-copy Python inputs using arrow #224

Closed
phil-opp opened this issue Mar 16, 2023 · 10 comments · Fixed by #228

Comments

@phil-opp (Collaborator)

Motivation

While we already have support for zero-copy inputs in Rust nodes, we still require copying the data for Python nodes. This can lead to decreased performance when using messages with large data. To avoid these slowdowns, we want to support zero-copy inputs for Python nodes too.

Challenges

The fundamental challenge is that Python normally operates on owned data. So we have to use a special type such as the memoryview object to make the data accessible to Python without copying. What makes things more complex is that we need a custom freeing logic because we need to unmap the shared memory and report a drop event back to the sender node.

Using a non-standard data type makes it more complex to interact with the data. For example, special functions might be needed for reading, cloning, and slicing the data. Also, it is common to convert the data to other types (e.g. numpy arrays), which requires special conversion functions.

Apache Arrow

To keep the Python API simple and easy to use, it's a good idea to use some mature existing data format rather than creating our own custom data format. The Apache Arrow project provides such a data format that supports zero-copy data transfer across language boundaries. It's already quite popular and used in many projects and tools, so it looks like a good candidate for dora.

Arrow vs PyO3

  • Arrow provides two useful features for us:
    • a stable ABI for sharing data across programming languages without copying
    • a library with high-level functions to work with the data in Python without copying it (e.g. convert it to a numpy array view)
  • We could also create our own stable ABI types using PyO3
    • This approach is more flexible, e.g. it allows implementing custom methods.
    • However, we would need to implement all desired Python functionality and conversions manually.

The Arrow Array datatype

  • Most suitable data type for our use case since dora inputs are always byte arrays right now
    • (We might want to support additional arrow types in the future.)
  • Provides direct access to the data
    • Invalidating the data from the outside is not possible. This would be useful for ensuring that old inputs are properly freed.
    • We need to rely on the user code to not keep the inputs alive for too long, otherwise the shared memory remains blocked. (We could print some warnings if an input is still not dropped after some timeout.)
  • Memory management is done using a release callback.

Implementation

  • Accessing the data from Python is possible using pyarrow:
    // `py` is a pyo3 `Python` token; `array_ptr` and `schema_ptr` point to
    // the FFI `ArrowArray`/`ArrowSchema` structs exported from Rust.
    let pa = py.import("pyarrow")?;
    
    let array = pa.getattr("Array")?.call_method1(
        "_import_from_c",
        (array_ptr as Py_uintptr_t, schema_ptr as Py_uintptr_t),
    )?;
  • The question is how we can create the ArrowArray from Rust with custom drop semantics.

The arrow2 crate

  • Provides a set of different array types that can be shared via arrow
  • Unfortunately, most of these types are unsuitable for our use-case, since they assume that the data is stored on the heap
    • our data is backed by shared memory instead, which needs special actions on drop
  • If leaking the data were acceptable, we could use a PrimitiveArray and the arrow2::ffi::mmap::slice function.
    • Unfortunately there is no way to figure out whether the data is no longer used.
    • Freeing the data too early would result in a segfault or in corrupted data, which is not acceptable.
  • Idea: Implement the arrow2::array::Array trait for a custom type
    • Then we could use the arrow2::ffi::export_array_to_c function to create the Arrow array
      • The arrow2 library automatically fills in a release implementation that calls the drop handler of our custom type
      • In this drop handler, we could unmap the shared memory and report the drop tokens to the dora daemon
    • Unfortunately, this approach is not possible because the export_array_to_c function only works with the predefined array types (via downcasting)
  • Conclusion: There appears to be no way to create an Arrow array with custom drop logic using the arrow2 crate

Manual Creation

  • The ABI layout of Arrow arrays is specified, so we could do the creation manually
  • Drawbacks: very unsafe, difficult to get right, a lot of work
  • Advantage: We can implement exactly the drop behavior that we like

The official arrow crate

@haixuanTao (Collaborator) commented Mar 17, 2023

Thanks for this detailed analysis!

Curious about when you say:

Unfortunately there is no way to figure out whether the data is no longer used.

This is because we can't listen on the Python arrow array itself, right? This is why you mentioned that we could eventually listen for a drop of the event object itself (with a oneshot Tokio channel, for example).

In the case we listen for a drop of the event object, wouldn't we be able to know when the data is no longer used?

@haixuanTao (Collaborator) commented Mar 17, 2023

Unfortunately, most of these types are unsuitable for our use-case, since they assume that the data is stored on the heap

I have to check again, but from what I remember, the unsafe slice gives you an Array<u8> that you can transmute for free to other types like Array<f64> and the like.

@phil-opp (Collaborator, Author)

In the case we listen for a drop of the event object, wouldn't we be able to know when the data is no longer used?

Dropping the event is not enough since the user can take the data out of input events and store them somewhere else (e.g. in some list). So we really have to wait until the data is dropped too. With Arrow, this is signaled through the release callback, which is part of the data format.

The issue with the arrow2 crate is that it does not support plugging your own logic into the release callback.

Unfortunately, most of these types are unsuitable for our use-case, since they assume that the data is stored on the heap

I have to check again, but from what I remember, the unsafe slice gives you an Array<u8> that you can transmute for free to other types like Array<f64> and the like.

The issue with the ffi::mmap::slice method is that it does not take any ownership of the data. So it will set the release callback to a no-op, which effectively requires us to keep the data allocated forever (since there is no way to find out whether it's safe to be freed).

@haixuanTao (Collaborator)

I see your point. Thanks for the clarification! I'll try to run some tests as well next week.

@phil-opp (Collaborator, Author) commented Mar 17, 2023

To give some more details:

@phil-opp (Collaborator, Author)

Could you clarify why you chose the arrow2 crate instead of arrow in the first place? Is there any functionality missing from the arrow crate?

Also, it looks like there are some proposals to merge arrow2 and arrow: jorgecarleitao/arrow2#1429 and apache/arrow-rs#1176.

@haixuanTao (Collaborator) commented Mar 17, 2023

Yeah, I thought that arrow's Buffer required ownership of the data, but it seems that, as you mentioned, you could do it with [from_custom_allocation](https://docs.rs/arrow/35.0.0/arrow/buffer/struct.Buffer.html#method.from_custom_allocation).

In many ways, the arrow2 crate seemed easier to work with for Vec and slice types. But if this deallocation can only be done with arrow, there is no issue with using it.

@haixuanTao (Collaborator)

In any case, it should be simple to change from arrow2 to arrow, as they both can read C pointers as input to make an array. I can add some comments on going from one to the other if you need.

Also, won't the deallocation method of an array that we have built be called at the end of its lifetime, which in our case is when we export it to a Python arrow array, and not when it's no longer used within Python?

@phil-opp (Collaborator, Author)

In any case, it should be simple to change from arrow2 to arrow, as they both can read C pointers as input to make an array. I can add some comments on going from one to the other if you need.

Yeah, I think it shouldn't be too difficult to switch from arrow2 to arrow. The challenge will probably be to set up the Deallocation field correctly. I'll try to look into it today.

Also, won't the deallocation method of an array that we have built be called at the end of its lifetime, which in our case is when we export it to a Python arrow array, and not when it's no longer used within Python?

This depends on whether the created arrow array only borrows the data or whether it takes ownership. The arrow2::ffi::mmap::slice function only borrows the data so the original array is dropped at the end of the scope as usual (which caused a segfault in the Python code). The arrow::buffer::Buffer::from_custom_allocation function will take ownership of the array instead, so the drop will happen once the last copy of the data is dropped (reference-counted using Arc).

When exporting the array to C through the FFI_ArrowArray::new function, the reference count will be increased by one. The idea is that the Python code will invoke the release callback once it's done with the data, which decreases the reference count and drops the data once the reference count reaches 0. (We might need an additional mem::forget when doing the conversion as it looks like the FFI_ArrowArray has a Drop implementation that calls release itself.)

@phil-opp (Collaborator, Author)

Implemented in #228.
