
[C#] Cannot round-trip record batch with PyArrow #27924

Closed · asfimport opened this issue Mar 26, 2021 · 14 comments

Has anyone ever tried to round-trip a record batch between Arrow C# and PyArrow? I can't get PyArrow to read the data correctly.

For context, I'm trying to do inter-process communication of Arrow data frames between C# and Python using shared memory (local TCP/IP is an alternative). Ideally, I wouldn't even have to serialise the data and could just share the Arrow in-memory representation directly, but I'm not sure that's even possible with Apache Arrow. Full source code is attached.

C#

using (var stream = sharedMemory.CreateStream(0, 0, MemoryMappedFileAccess.ReadWrite))
{
    var recordBatch = /* ... */;

    using (var writer = new ArrowFileWriter(stream, recordBatch.Schema, leaveOpen: true))
    {
        writer.WriteRecordBatch(recordBatch);
        writer.WriteEnd();
    }
}

Python

import pyarrow as pa

shmem = open_shared_memory(args)
address = get_shared_memory_address(shmem)
buf = pa.foreign_buffer(address, args.sharedMemorySize)
stream = pa.input_stream(buf)
reader = pa.ipc.open_stream(stream)

Unfortunately, it fails with the following error: pyarrow.lib.ArrowInvalid: Expected to read 1330795073 metadata bytes, but only read 1230.

I can see that the memory content starts with ARROW1\x00\x00\xff\xff\xff\xff\x08\x01\x00\x00\x10\x00\x00\x00. It seems that using the API calls above, PyArrow reads "ARRO" as the length of the metadata.
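
Interpreting those first four bytes as a little-endian 32-bit integer is consistent with that: a quick check (illustrative only, using BitConverter and assuming a little-endian platform):

C#

using System;

// "ARRO" read as a little-endian 32-bit integer gives exactly the length from the error.
int value = BitConverter.ToInt32(new byte[] { (byte)'A', (byte)'R', (byte)'R', (byte)'O' }, 0);
Console.WriteLine(value);  // 1330795073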

I assume I'm using the API incorrectly. Has anyone got a working example?

Reporter: Tanguy Fautre
Assignee: Antoine Pitrou / @pitrou

Note: This issue was originally created as ARROW-12100. Please see the migration documentation for further details.


Antoine Pitrou / @pitrou:
There are two slightly different IPC formats: the file format and the stream format. It seems you're writing the file format from C# ("new ArrowFileWriter") and trying to read the result as an IPC stream ("pa.ipc.open_stream").

You should probably use the ArrowStreamWriter class on the C# side, rather than ArrowFileWriter.
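
For illustration, a minimal sketch of that change on the C# side, keeping the rest of the snippet from the question as-is (the shared-memory setup and record batch construction are still elided):

C#

using Apache.Arrow;
using Apache.Arrow.Ipc;
using System.IO.MemoryMappedFiles;

using (var stream = sharedMemory.CreateStream(0, 0, MemoryMappedFileAccess.ReadWrite))
{
    var recordBatch = /* ... */;

    // ArrowStreamWriter emits the IPC stream format, which is what pa.ipc.open_stream
    // expects. ArrowFileWriter emits the file format (with the "ARROW1" magic bytes),
    // which pairs with pa.ipc.open_file instead.
    using (var writer = new ArrowStreamWriter(stream, recordBatch.Schema, leaveOpen: true))
    {
        writer.WriteRecordBatch(recordBatch);
        writer.WriteEnd();
    }
}

Alternatively, the file format written by ArrowFileWriter can be kept and read on the Python side with pa.ipc.open_file; the key point is that writer and reader must agree on which of the two formats is being used.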


Tanguy Fautre:
@pitrou Thanks for the explanation; I didn't realise there was a difference between the two.

I've replaced ArrowFileWriter with ArrowStreamWriter, but I now get the following error in Python: OSError: Unexpected null field Field.children in flatbuffer-encoded metadata.


Antoine Pitrou / @pitrou:
Hmm, interesting. This may be an incompatibility in the C# writer. Can you share your schema or even the code generating the data?


Antoine Pitrou / @pitrou:
cc @eerhardt


Tanguy Fautre:
Latest version: ArrowSharedMemory_20210326_2.zip


Antoine Pitrou / @pitrou:
In the medium term, the C# implementation should really be part of our integration testing routine, to avoid issues such as this one.

In the short term, though, I think we can relax the following check so that a null children member is treated as synonymous with an empty children array:

https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata_internal.cc#L757

@wesm What do you think?


Antoine Pitrou / @pitrou:
[~GPSnoopy] Is there a way to compile and test your example on Linux?


Tanguy Fautre:
Not at the moment; I was just toying with a shared memory approach using memory maps. The C# and Python APIs are quite different on Linux. I might give it a go, though.


Tanguy Fautre:
Mind you, for specifically reproducing this bug, it's easier to just write to a file and read it from Python.

Here is a version that does exactly that. I've tested it on Ubuntu 20.04 as well: ArrowSharedMemory_20210329.zip.

You can run it by typing dotnet run (assuming you've got dotnet-sdk-5 installed). Don't forget to change the hard-coded path to your Python virtual environment in Program.cs.
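
For readers without the attachment, here is a self-contained sketch of what such a file-based repro could look like on the C# side; the column name, values and file name below are illustrative, not taken from the attachment.

C#

using System.IO;
using System.Linq;
using Apache.Arrow;
using Apache.Arrow.Ipc;

class Program
{
    static void Main()
    {
        // Build a small record batch with one non-nullable Int32 column.
        var recordBatch = new RecordBatch.Builder()
            .Append("x", false, col => col.Int32(array => array.AppendRange(Enumerable.Range(0, 10))))
            .Build();

        // Write it in the IPC stream format to an ordinary file.
        using (var fileStream = File.Create("data.arrows"))
        using (var writer = new ArrowStreamWriter(fileStream, recordBatch.Schema))
        {
            writer.WriteRecordBatch(recordBatch);
            writer.WriteEnd();
        }

        // The Python side can then read the file back with, for example,
        // pa.ipc.open_stream(pa.OSFile("data.arrows")).read_all().
    }
}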


Antoine Pitrou / @pitrou:
Thank you. I can now confirm that we can easily work around it on the C++ side.


Wes McKinney / @wesm:
It seems okay to treat a null field as a length-0 list of children. That said, I think for the C# implementation to be suitable for any real world production use, it really needs to participate in the integration tests more formally.


Antoine Pitrou / @pitrou:
Issue resolved by pull request #9837.


Eric Erhardt / @eerhardt:
Sorry for the delay - my day job has kept me super busy.

I found the change that broke this: 3e71ea0. The change in C# unintentionally went from an "empty" vector of tables to a null vector of tables for the children of fields. I agree the "correct" fix here is for C++ to check for nullptr; it already does so for the other vectors of tables, like metadata.

I think it is time to get the C# implementation into the integration tests. Can someone give me a pointer on how to enable that?


Antoine Pitrou / @pitrou:
There is a JSON representation format that the C# implementation needs to understand. It is described in https://arrow.apache.org/docs/format/Integration.html, but you may get more insight by running the existing integration tests and looking at the generated JSON files.

Integration testing uses an internal tool written in Python named Archery (see here for install instructions: https://arrow.apache.org/docs/developers/archery.html). You'll find the Archery bits related to integration testing in the dev/archery/archery/integration directory: https://github.com/apache/arrow/tree/master/dev/archery/archery/integration.

The C# implementation needs to expose endpoints (command line APIs) for four functionalities:

asfimport added this to the 4.0.0 milestone on Jan 11, 2023