Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-13816: [Go][C] Implement Consumer APIs for C Data Interface in Go #11037

Closed
wants to merge 21 commits into from

Conversation

zeroshade
Copy link
Member

This supercedes some portions of #10995 as I think it's better to centralize the cdata interface interactions in a specific module that is part of the arrow module rather than it being a part of the dataset module. So this should get reviewed and merged first.

CC @pitrou @emkornfield

@github-actions
Copy link

@zeroshade
Copy link
Member Author

The failure in the integration test is an issue with the rust implementation that is unrelated to this change it seems.

@pitrou
Copy link
Member

pitrou commented Aug 31, 2021

Yes, the integration test failure is unrelated.

@pitrou
Copy link
Member

pitrou commented Sep 1, 2021

Hmm, I feel incompetent to review Go code.
What would be interesting would be to validate this using PyArrow, for example:

  • have a Go test module that exposes some C APIs that import a Schema or Array and test something with it (like you did with panic calls)
  • have a Python test module that loads the Go test module using ctypes, export a Schema or Array, and gives the exported pointer to the Go test APIs

@zeroshade
Copy link
Member Author

@pitrou I'm rather happy with myself at the moment since I've managed to actually pull your suggestion off, though i'm not sure the best way to actually incorporate such a test into the current build system. For posterity, here's some snippets showing what I did:

Created a go file with the following:

package main

import (
    "fmt"
    "unsafe"

    "github.com/apache/arrow/go/arrow/cdata"
)

// #include <stdint.h>
import "C"

//export importSchema
func importSchema(ptr C.uintptr_t) {
    sc := (*cdata.CArrowSchema)(unsafe.Pointer(uintptr(ptr)))
    fmt.Println(cdata.ImportCArrowSchema(sc))
}

func main() {}

In order to build a c-shared library it apparently needs to be a main package. Anyway, the above file will export a function "importSchema" which takes a uintptr_t which is a pointer to an struct ArrowSchema. Running the command go build -buildmode=c-shared -o gotest.dll . if on windows or using gotest.so for linux. Then in python I did the following:

>>> import pyarrow as pa
>>> from pyarrow.cffi import ffi
>>> c_schema = ffi.new("struct ArrowSchema*")
>>> ptr_schema = int(ffi.cast("uintptr_t", c_schema))
>>> pa.schema([('ints', pa.list_(pa.int32()))], metadata={b'key1': b'value1'})._export_to_c(ptr_schema)
>>> from ctypes import *
>>> gotest = cdll.LoadLibrary('gotest.dll') # or gotest.so
>>> gotest.importSchema(c_ulonglong(ptr_schema))
schema:
   fields: 1
     - ints:  type=list<item: int32>, nullable
   metadata:  ["key1": "value1"]
0

So all in all, really cool. But I don't know where in the tests would make sense for how I should incorporate such a test where I build the go dynamic lib and then run the a python script with pyarrow to call the function. Any ideas?

@pitrou
Copy link
Member

pitrou commented Sep 2, 2021

Well, I don't know how Go runs tests, so I'm not able to answer that question. Perhaps @sbinet or @fsaintjacques have an idea.

@pitrou
Copy link
Member

pitrou commented Sep 2, 2021

Note that the CI runs the scripts https://github.com/apache/arrow/blob/master/ci/scripts/go_build.sh and https://github.com/apache/arrow/blob/master/ci/scripts/go_test.sh as part of its Docker routine: https://github.com/apache/arrow/blob/master/docker-compose.yml#L1196-L1199

(which you can also reproduce using archery docker run debian-go : see here for installing Archery)

@zeroshade
Copy link
Member Author

so, the go tests just run by running go test which executes the tests. It's more that I'm not sure the best way to use pyarrow in the tests here. Should I modify the docker so it actually builds the entire c++ -> python toolchain to build pyarrow? Or should I just add python to the docker image and then just pip install pyarrow in the go_test.sh script?

@pitrou
Copy link
Member

pitrou commented Sep 2, 2021

Ah, you can just pip install pyarrow, fortunately.

@zeroshade
Copy link
Member Author

Ok, i'll muck around with the tests and see if i can easily put something together to do that as an integration test then.

@zeroshade
Copy link
Member Author

@pitrou I've updated the test cases to add a test which installs pyarrow and runs a python script against a built golang shared object module to test and confirm exporting a schema and record batch from python to Go and Go importing it correctly. 😄

@zeroshade
Copy link
Member Author

@pitrou @emkornfield @sbinet @fsaintjacques

Any chance I might be able to get an approval on this? I have POC for implementing the producer APIs but it relies a lot on this same consumer code so I'd rather file it as a subsequent PR rather than make this one bigger.

@pitrou
Copy link
Member

pitrou commented Sep 7, 2021

I don't really read Go. I can try to give general advice but some Go developer will have to review the implementation.

go/arrow/cdata/utils.h Outdated Show resolved Hide resolved
go/arrow/cdata/cdata_testframework.c Outdated Show resolved Hide resolved
@pitrou
Copy link
Member

pitrou commented Sep 8, 2021

@zeroshade I pushed some changes to improve the Python tests and try to fix some allocation issues. Can you double-check them?

Comment on lines +63 to +67
def run_gc(self):
# Several Go GC runs can be required to run all finalizers
for i in range(5):
cgotest.runGC()
gc.collect()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fsaintjacques Does this look reasonable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou rather than doing this, would it make more sense for me to simply include calls to runtime.GC() in the test module's import functions? that way the python doesn't need to explicitly call the Go garbage collector?

Copy link
Member

@pitrou pitrou Sep 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now this would be sufficient, but when we test the other way (export from Go to Python) we may need to invoke the Go GC from Python after the Python Schema or RecordBatch is garbage-collected.

(sorry: may, not will)

@zeroshade
Copy link
Member Author

zeroshade commented Sep 8, 2021

@pitrou I'm updating the comments on the exported Import functions in interface.go so that they explicitly state that ArrowSchemaRelease will be called on the passed in schema regardless of error response or not.

Otherwise your changes look fine from my end. Thanks for the assistance.

@zeroshade
Copy link
Member Author

@emkornfield @sbinet @fsaintjacques would one of you be willing to just give this a once over and approve? While I am a committer now, I'm not sure I wanna go merging this without at least one approval on it haha. 😄

@zeroshade
Copy link
Member Author

@pitrou I know you're not completely comfortable reviewing the Go code, but given that i've been unable to get someone else to come comment on it, do you have any particular qualms with me merging this as it's holding up some other development that I want to get PRs up for with the Go library.

@pitrou
Copy link
Member

pitrou commented Sep 17, 2021

@zeroshade I'm fine with you merging this. Go ahead, and let's move forward :-)

@asfgit asfgit closed this in 6cff6a3 Sep 17, 2021
@zeroshade zeroshade deleted the go-cdata branch October 5, 2021 15:11
ViniciusSouzaRoque pushed a commit to s1mbi0se/arrow that referenced this pull request Oct 20, 2021
This supercedes some portions of apache#10995 as I think it's better to centralize the cdata interface interactions in a specific module that is part of the arrow module rather than it being a part of the dataset module. So this should get reviewed and merged first.

CC @pitrou @emkornfield

Closes apache#11037 from zeroshade/go-cdata

Lead-authored-by: Matthew Topol <mtopol@factset.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: Matt Topol <mtopol@factset.com>
Signed-off-by: Matthew Topol <mtopol@factset.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants