Add FFI for Arrow C Stream Interface #1384

Merged
merged 24 commits into from
Mar 31, 2022

viirya (Member) commented Mar 2, 2022

Which issue does this PR close?

Closes #1348.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@viirya viirya marked this pull request as draft March 2, 2022 09:25
@github-actions github-actions bot added the arrow Changes to the arrow crate label Mar 2, 2022
@viirya viirya marked this pull request as ready for review March 2, 2022 23:03
viirya (Member Author) commented Mar 3, 2022

Hmm, not sure why

error[E0433]: failed to resolve: could not find `ipc` in the crate root
   --> arrow/src/ffi_stream.rs:388:16
    |
388 |     use crate::ipc::reader::FileReader;
    |                ^^^ could not find `ipc` in the crate root

only in Test Workspace on AMD64 Rust stable?

alamb (Contributor) left a comment

I started looking through this, but I don't have time at this moment to complete a detailed review. It looks very cool though and I will try and review it more carefully over the coming days

use std::fs::File;

use crate::datatypes::Schema;
use crate::ipc::reader::FileReader;
Contributor

Since these tests use ipc they probably need to be feature flagged like

#[cfg(feature = "ipc")]
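A minimal sketch of the gating suggested here, assuming the crate declares an `ipc` feature in its Cargo.toml. The helper `row_count`, the module name, and both test names are hypothetical; only the `#[cfg(feature = "ipc")]` attribute comes from the comment above.

```rust
/// Hypothetical helper that is compiled regardless of enabled features.
pub fn row_count(batch_sizes: &[usize]) -> usize {
    batch_sizes.iter().sum()
}

#[cfg(test)]
mod gated_tests {
    use super::row_count;

    // Built on every `cargo test` run, with or without the `ipc` feature.
    #[test]
    fn counts_rows_without_ipc() {
        assert_eq!(row_count(&[2, 3]), 5);
    }

    // Built only when the (assumed) `ipc` feature is enabled, so a build
    // without that feature never has to resolve `crate::ipc`.
    #[cfg(feature = "ipc")]
    #[test]
    fn reads_ipc_file() {
        // A real test would `use crate::ipc::reader::FileReader;` here.
        assert_eq!(row_count(&[1]), 1);
    }
}
```

With this layout, a build without the feature simply skips the gated test instead of failing to resolve the import.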


assert_eq!(batch, expected_batch);
}
}
Contributor

@jorgecarleitao 's PRs in arrow2 had a nice roundtrip test:

https://github.com/jorgecarleitao/arrow2/pull/857/files#diff-dedf58162923efd1b8f544c8be899fe250c43eff0d7b6c394b5dcb020f87d3a3R7

It might be good to add one here as well (which would also mean the test for ffi didn't depend on the ipc feature)

Member Author

Rewrote the tests to roundtrip tests.

Member Author

And it doesn't depend on ipc now.

@@ -40,6 +40,8 @@ pub enum ArrowError {
ParquetError(String),
/// Error during import or export to/from the C Data Interface
CDataInterface(String),
/// Error during import or export to/from the C Stream Interface
CStreamInterface(String),
Contributor

Since the Stream interface is built on the CDataInterface, what do you think about just reusing the same CDataInterface error variant instead of adding a new one?

Member Author

Ok.
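A sketch of what reusing the existing variant could look like. The enum below is a trimmed-down stand-in for `ArrowError`, and `check_stream_code` is a hypothetical helper, not part of the crate.

```rust
/// Trimmed-down stand-in for the crate's error enum (assumption: only the
/// variant relevant to this discussion is shown).
#[derive(Debug, PartialEq)]
pub enum ArrowError {
    /// Error during import or export to/from the C Data Interface
    CDataInterface(String),
}

/// Hypothetical helper: turns a non-zero return code from a stream callback
/// into the shared `CDataInterface` variant instead of a dedicated one.
pub fn check_stream_code(ret_code: i32, callback: &str) -> Result<(), ArrowError> {
    if ret_code == 0 {
        Ok(())
    } else {
        Err(ArrowError::CDataInterface(format!(
            "{} failed with code {}",
            callback, ret_code
        )))
    }
}
```

Callers that already match on `CDataInterface` would then handle stream failures without any change.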

// under the License.

//! Contains declarations to bind to the [C Stream Interface](https://arrow.apache.org/docs/format/CStreamInterface.html).
//!
Contributor

how did you create this file (was it bindgen?)

Member Author

I followed the C Stream interface definition. Let me regenerate it using bindgen.

Member Author

Regenerated the binding with bindgen and verified it looks the same. I added "This was created by bindgen" to the doc.

alamb (Contributor) commented Mar 6, 2022

I plan to review this early next week (BTW the integration test failure has been fixed on master, if you want to merge / rebase to get a clean CI)

viirya (Member Author) commented Mar 6, 2022

Thanks @alamb . I'm writing roundtrip test as you suggested. I will do rebase after that.

codecov-commenter commented Mar 7, 2022

Codecov Report

Merging #1384 (3716c16) into master (d384836) will increase coverage by 0.02%.
The diff coverage is 79.14%.

@@            Coverage Diff             @@
##           master    #1384      +/-   ##
==========================================
+ Coverage   82.70%   82.72%   +0.02%     
==========================================
  Files         187      189       +2     
  Lines       54208    54561     +353     
==========================================
+ Hits        44832    45138     +306     
- Misses       9376     9423      +47     
Impacted Files Coverage Δ
arrow/src/array/array_binary.rs 92.93% <ø> (+0.31%) ⬆️
arrow/src/ffi_stream.rs 79.03% <79.03%> (ø)
arrow/src/ffi.rs 87.50% <100.00%> (ø)
arrow/src/array/equal/list.rs 79.38% <0.00%> (-9.85%) ⬇️
parquet_derive/src/parquet_field.rs 65.98% <0.00%> (-0.23%) ⬇️
arrow/src/array/transform/mod.rs 86.35% <0.00%> (-0.23%) ⬇️
parquet/src/encodings/encoding.rs 93.37% <0.00%> (-0.19%) ⬇️
arrow/src/compute/kernels/filter.rs 88.00% <0.00%> (ø)
arrow/src/compute/kernels/length.rs 100.00% <0.00%> (ø)
...ion-testing/src/bin/arrow-json-integration-test.rs 0.00% <0.00%> (ø)
... and 12 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Arc::into_raw(this)
}

/// Get `FFI_ArrowArrayStream` from raw pointer
Member

nit: maybe we should mention that the input ptr is consumed after the call and the ownership of the input FFI_ArrowArrayStream has been transferred to the returned value.

Member Author

Hmm, do we change the ownership of the input FFI_ArrowArrayStream? As it is behind a raw pointer, we cannot move it. That's why I need to clone here.
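For context, one way to take ownership of a value behind a raw pointer without cloning is `std::ptr::replace`, which a later revision quoted in this thread also uses. The sketch below is illustrative only: `DemoStream`, `demo_release`, and `take_from_raw` are hypothetical stand-ins, not the crate's types.

```rust
use std::ptr;

/// Hypothetical stand-in for an FFI struct that owns resources via `release`.
#[repr(C)]
pub struct DemoStream {
    pub release: Option<unsafe extern "C" fn(*mut DemoStream)>,
    pub payload: i64,
}

impl DemoStream {
    /// A released (empty) value: `release` is `None`, so nothing is owned.
    pub fn empty() -> Self {
        Self { release: None, payload: 0 }
    }

    /// A live value whose `release` callback is set.
    pub fn new(payload: i64) -> Self {
        Self { release: Some(demo_release), payload }
    }
}

/// Marks the struct as released; a real callback would also free buffers.
unsafe extern "C" fn demo_release(stream: *mut DemoStream) {
    if let Some(stream) = unsafe { stream.as_mut() } {
        stream.release = None;
    }
}

/// Moves the value out from behind `raw`, leaving an empty struct in place.
///
/// # Safety
/// `raw` must be non-null, aligned, and point to an initialized `DemoStream`.
pub unsafe fn take_from_raw(raw: *mut DemoStream) -> DemoStream {
    unsafe { ptr::replace(raw, DemoStream::empty()) }
}
```

After the call, the caller's struct reads as released, so a producer that later invokes `release` on its copy would find nothing left to free.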

}

impl ExportedArrayStream {
fn get_private_data(&self) -> Box<StreamPrivateData> {
Member

nit: can we just return &mut StreamPrivateData

Member Author

private_data is kept in the stream as a raw pointer. We cannot create and return a reference to a temporary object here.

Member

I was thinking of something like this:

    fn get_private_data(&mut self) -> &mut StreamPrivateData {
        unsafe { &mut *((*self.stream).private_data as *mut StreamPrivateData) }
    }

Member Author

Oh, this works. If we just take a reference, we don't need to move it. I will change it.
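A self-contained sketch of the pattern suggested above, reborrowing the data behind a `private_data` pointer instead of reconstructing a `Box`. All type names here are hypothetical stand-ins, and the safety contract in the comments is an assumption of the sketch.

```rust
use std::os::raw::c_void;

/// Hypothetical private state stored behind the C struct's `private_data`.
pub struct PrivateData {
    pub batches_read: usize,
}

/// Hypothetical stand-in for the C stream struct.
#[repr(C)]
pub struct DemoStream {
    pub private_data: *mut c_void,
}

/// Hypothetical wrapper holding a raw pointer to the exported stream.
pub struct ExportedDemoStream {
    pub stream: *mut DemoStream,
}

impl ExportedDemoStream {
    /// Reborrows the private data without taking ownership of it.
    ///
    /// Assumed contract: `stream` and its `private_data` are valid and not
    /// aliased for as long as the returned reference is in use.
    pub fn private_data(&mut self) -> &mut PrivateData {
        unsafe { &mut *((*self.stream).private_data as *mut PrivateData) }
    }
}
```

Because nothing is moved into a `Box`, there is no risk of the private data being freed when the accessor's return value goes out of scope.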

}

/// Get the last error from `ArrowArrayStreamReader`
fn get_stream_last_error(stream_reader: &ArrowArrayStreamReader) -> Option<String> {
Member

can this be a method of ArrowArrayStreamReader?

Member Author

moved

let empty_schema = Arc::new(FFI_ArrowSchema::empty());
let schema_ptr = Arc::into_raw(empty_schema) as *mut FFI_ArrowSchema;

let ret_code = unsafe { self.stream.get_schema.unwrap()(stream_ptr, schema_ptr) };
Member

maybe we can have a try_new ctor for ArrowArrayStreamReader and initialize the cached schema via get_schema there.


impl RecordBatchReader for ArrowArrayStreamReader {
fn schema(&self) -> SchemaRef {
if self.stream.get_schema.is_none() {
Member

I think we can just unwrap here instead of returning an empty schema. It's an implementation error if the callback is not set.

Member Author

Now schema is cached when constructing ArrowArrayStreamReader. Here we simply return it.
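A rough sketch of the "fetch the schema once in a fallible constructor, then return the cached copy" shape described here. `DemoSchema`, `DemoStream`, and `DemoReader` are hypothetical stand-ins; the real reader talks to C callbacks rather than a plain Rust function pointer.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a schema.
#[derive(Debug, PartialEq)]
pub struct DemoSchema {
    pub fields: Vec<String>,
}

/// Hypothetical stand-in for a stream whose callback is `None` once released.
pub struct DemoStream {
    pub get_schema: Option<fn() -> Result<DemoSchema, String>>,
}

pub struct DemoReader {
    #[allow(dead_code)] // kept alive for later `get_next` calls in a real reader
    stream: DemoStream,
    schema: Arc<DemoSchema>,
}

impl DemoReader {
    /// Fails early if the stream is released or the schema cannot be read.
    pub fn try_new(stream: DemoStream) -> Result<Self, String> {
        let get_schema = stream
            .get_schema
            .ok_or_else(|| "input stream is already released".to_string())?;
        let schema = Arc::new(get_schema()?);
        Ok(Self { stream, schema })
    }

    /// Infallible: simply hands out the schema cached by `try_new`.
    pub fn schema(&self) -> Arc<DemoSchema> {
        Arc::clone(&self.schema)
    }
}

/// Hypothetical callback used to exercise the sketch.
pub fn demo_get_schema() -> Result<DemoSchema, String> {
    Ok(DemoSchema { fields: vec!["a".to_string(), "b".to_string()] })
}
```

Doing the fallible work in the constructor is what lets `schema()` keep an infallible signature.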

}

#[cfg(test)]
mod tests {
Member

maybe we should have some integration tests between Python and Rust too in arrow-pyarrow-integration-testing.

Member Author

Do we have some for ffi too? If so, I can follow up. If not, maybe we can add both in a later PR.

Member

Not quite sure. I'm assuming PyArrow supports it since Arrow C++ implemented the stream interface.

Member Author

I just checked PyArrow. It has the stream interface. But on the Rust side, the arrow-pyarrow-integration-testing crate doesn't have the basic code for testing it yet. I may need some time to try it and write some integration tests between Python and Rust.

I'm fine holding this until I add the integration test, or I can work on it in a later PR.

sunchao (Member) commented Mar 8, 2022

cc @jorgecarleitao too

viirya (Member Author) commented Mar 13, 2022

@sunchao Any more comments? Thanks.

sunchao (Member) commented Mar 14, 2022

Will take another look soon. Sorry for the delay @viirya !

viirya (Member Author) commented Mar 14, 2022

No problem! Thank you @sunchao !

alamb (Contributor) commented Mar 16, 2022

FWIW I plan to make a 11.0 release candidate Thursday or Friday so if this is merged by then it will be included. Thanks a lot for this @viirya and @sunchao

/// # Safety
/// Assumes that the pointer represents valid C Stream Interfaces, both in memory
/// representation and lifetime via the `release` mechanism.
/// This function copies the content from the raw pointer and cleans it up to prevent
Member

Why do we need this when we already have ArrowArrayStreamReader::from_raw?

Member Author

I'm not sure whether users would ever need FFI_ArrowArrayStream instead of ArrowArrayStreamReader. I can remove it for now; we can add it back if it is needed in the future.


match schema {
Ok(mut schema) => {
unsafe {
Member

we can perhaps just do:

            Ok(mut schema) => unsafe {
                std::ptr::copy(&schema as *const FFI_ArrowSchema, out, 1);
                schema.release = None;
                0
            },


let stream_data = std::ptr::replace(raw_stream, FFI_ArrowArrayStream::empty());

let stream = Arc::new(stream_data);
Member

can we call Self::try_new(stream_data) here?

fn get_stream_last_error(&self) -> Option<String> {
self.stream.get_last_error?;

let stream_ptr = Arc::into_raw(self.stream.clone()) as *mut FFI_ArrowArrayStream;
Member

hmm can we use Arc::as_ptr(&self.stream) as *mut FFI_ArrowArrayStream here?
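To illustrate the difference behind this suggestion: `Arc::into_raw(arc.clone())` bumps the strong count and hands that extra count to the raw pointer, so the allocation is never freed unless something later calls `Arc::from_raw`. `Arc::as_ptr` only borrows. The helper names below are hypothetical, and an `i32` stands in for the stream struct.

```rust
use std::sync::Arc;

/// Borrows the pointer; the strong count is unchanged.
pub fn borrow_ptr(shared: &Arc<i32>) -> *const i32 {
    Arc::as_ptr(shared)
}

/// Clones and converts the clone into a raw pointer; the extra strong count
/// stays alive until `reclaim` (or another `Arc::from_raw`) is called.
pub fn leak_ptr(shared: &Arc<i32>) -> *const i32 {
    Arc::into_raw(Arc::clone(shared))
}

/// Gives the count taken by `leak_ptr` back.
///
/// # Safety
/// `ptr` must come from `Arc::into_raw` and must not be reclaimed twice.
pub unsafe fn reclaim(ptr: *const i32) {
    drop(unsafe { Arc::from_raw(ptr) });
}
```

Both helpers return the same address; the only difference is whether a matching `Arc::from_raw` is owed afterwards.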

type Item = Result<RecordBatch>;

fn next(&mut self) -> Option<Self::Item> {
self.stream.get_next?;
Member

this seems unnecessary

Member Author

If the stream is released, get_next is None.

Member

But I think we checked in the constructors to make sure it is not released?

Member Author

Hmm, okay, I thought it is safer to make sure it is not released here too. I will use unwrap directly then.

ffi_array.release?;

let schema_ref = self.schema();
let schema = FFI_ArrowSchema::try_from(schema_ref.as_ref());
Member

we can just do:

        let schema = FFI_ArrowSchema::try_from(schema_ref.as_ref()).ok()?;
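As a small illustration of what `.ok()?` does inside a function returning `Option<Result<_, _>>`: an `Err` becomes `None`, which an iterator consumer sees as end of stream rather than as an error. The two functions below are hypothetical and use string parsing in place of the schema conversion; whether discarding the error is acceptable depends on the caller.

```rust
/// `.ok()?` variant: a failed conversion returns `None` (iteration stops).
pub fn parse_or_stop(input: &str) -> Option<Result<i32, String>> {
    let value: i32 = input.parse().ok()?;
    Some(Ok(value))
}

/// Explicit variant: a failed conversion is surfaced as `Some(Err(..))`.
pub fn parse_or_report(input: &str) -> Option<Result<i32, String>> {
    match input.parse::<i32>() {
        Ok(value) => Some(Ok(value)),
        Err(e) => Some(Err(e.to_string())),
    }
}
```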

}
.to_data();

if data.is_err() {
Member

ditto

return Some(Err(schema.err().unwrap()));
}

if ret_code == 0 {
Member

can we skip the above part of creating schema if ret_code is not zero?

alamb (Contributor) commented Mar 23, 2022

I wonder if this one is ready to go? It seems like it has stalled (no thanks to me, of course, who hasn't reviewed it 😢 )

viirya (Member Author) commented Mar 23, 2022

Ah, I think I missed @sunchao's latest comments. I will address them soon. Thanks @alamb for reminding me. 😄

sunchao (Member) commented Mar 23, 2022

I think it's almost ready. 😀 Will take another look after the comments are addressed.

viirya (Member Author) commented Mar 23, 2022

Thanks @sunchao. Addressed your latest comments.

alamb (Contributor) commented Mar 28, 2022

Is this one ready to merge? I am preparing to create an arrow release candidate later this week

viirya (Member Author) commented Mar 28, 2022

@sunchao Do you need to look at this again?

sunchao (Member) commented Mar 29, 2022

Sorry I'm on vacation. Will take another look tomorrow.

//!
//! // export it
//! let stream = Arc::new(FFI_ArrowArrayStream::new(reader));
//! let stream_ptr = FFI_ArrowArrayStream::to_raw(stream) as *mut FFI_ArrowArrayStream;
Member

This needs to be updated. Also, similar to export_array_into_raw in FFI Array, do we need to have a function to allow an importer to allocate struct memory for the exported stream?

viirya (Member Author), Mar 30, 2022

I added export_reader_into_raw and used it in test and doc now.
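A minimal sketch of the "importer allocates, exporter fills in" pattern raised in the question above. It does not reproduce the signature of `export_reader_into_raw`; `DemoStream`, `demo_release`, `empty_slot`, and `export_into_raw` are hypothetical names used only to show the shape.

```rust
use std::ptr;

/// Hypothetical stand-in for the C stream struct.
#[repr(C)]
pub struct DemoStream {
    pub release: Option<unsafe extern "C" fn(*mut DemoStream)>,
    pub remaining: i32,
}

/// Marks the struct as released; a real callback would also free resources.
unsafe extern "C" fn demo_release(stream: *mut DemoStream) {
    if let Some(stream) = unsafe { stream.as_mut() } {
        stream.release = None;
    }
}

/// What an importer might allocate before asking for an export.
pub fn empty_slot() -> DemoStream {
    DemoStream { release: None, remaining: 0 }
}

/// Writes a freshly built stream into memory owned by the importer.
///
/// # Safety
/// `out` must be valid for writes of one `DemoStream`. Whatever was stored
/// there is overwritten without being released.
pub unsafe fn export_into_raw(remaining: i32, out: *mut DemoStream) {
    let stream = DemoStream { release: Some(demo_release), remaining };
    unsafe { ptr::write_unaligned(out, stream) };
}
```

The importer then owns the filled-in struct and is responsible for calling its `release` callback exactly once.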

/// that requires [FFI_ArrowArrayStream].
#[derive(Debug)]
pub struct ArrowArrayStreamReader {
stream: Arc<FFI_ArrowArrayStream>,
Member

nit: we can use Box here to indicate that stream is a unique reference, not shared.

Member Author

Making it a Box causes trouble when we want to get the raw pointer back: Box::into_raw(self.stream) fails to compile because self.stream cannot be moved out, and we cannot clone the Box (FFI_ArrowArrayStream is not Clone).

Member

Ah I see.

sunchao (Member) left a comment

Overall looks pretty good! just a few last comments.

type Item = Result<RecordBatch>;

fn next(&mut self) -> Option<Self::Item> {
self.stream.get_next.unwrap();
Member

this seems unnecessary since by contract get_next should be defined. Plus we also call unwrap at line 368.

Member Author

yea, this can be removed.

fn next(&mut self) -> Option<Self::Item> {
self.stream.get_next.unwrap();

let stream_ptr = Arc::into_raw(self.stream.clone()) as *mut FFI_ArrowArrayStream;
Member

I think we can just use Arc::as_ptr instead of clone here.

let record_batch = RecordBatch::from(&StructArray::from(data));

Some(Ok(record_batch))
} else {
Member

we need to drop array_ptr in the else branch too?

sunchao (Member) left a comment

LGTM

viirya (Member Author) commented Mar 31, 2022

Thanks @sunchao

@sunchao sunchao merged commit 15c87ae into apache:master Mar 31, 2022
sunchao (Member) commented Mar 31, 2022

Merged, thanks @viirya !

alamb (Contributor) commented Mar 31, 2022

This is epic work @viirya and @sunchao -- thank you so much.
