-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-14061: [Go][C++] Add Cgo Arrow Memory Pool Allocator #11206
Conversation
|
CC @emkornfield We really need to get someone else who can review Go stuff so I don't keep overwhelming you with so many PRs haha. Also, @pitrou may want to look at this or know who else I can tag given the connection with the C++ memory pool stuff here. |
This PR isn't based on git master, is it? |
// the context of a single function call. If the memory in use isn't allocated | ||
// on the C side, then it is not safe for any pointers to data to be held outside | ||
// of Go beyond the context of a single Cgo function call as it will be invisible | ||
// to the Go garbage collector and could potentially get moved without being updated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this comment. In #11220 your releaseData
function seems to take care of this correctly (by defining a global container of exported arrays).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When making a CGO call, during the execution of it the Go garbage collector will pin memory being used so that any pointers to go memory will remain valid during the length of that cgo call. But the documentation is very clear that it is not safe to let C/C++ maintain pointers to Go memory beyond the context of that single call because after the cgo call returns, the Go garbage collector is free to move memory around as necessary. Normally the garbage collector can update any references in Go when it does this so that everything continues to work just fine, but it cannot update any pointers in C/C++ when it does this.
So the global container of exported arrays used by releaseData
serves two purposes:
- Ensuring that during the context of a single CGO call that there is a maintained reference to the Go objects so that it keeps the garbage collector from cleaning it up
- Allowing C/C++ to call
releaseData
to free the memory if we're handing it off, or allowing the ability to hand off memory to let C/C++ own it and control when it is released. this is only safe if the underlying buffers were allocated by C and not Go allocated memory.
For more information about this specifically in terms of passing pointers around: https://pkg.go.dev/cmd/cgo#hdr-Passing_pointers is the documentation on CGO that i'm referencing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so what it seems to mean is that Go should always use its own allocator when allocating array data, rather than the Go GC. Would there be a problem in doing that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So that's also a possibility, rather than using the memorypool exposed by the arrow lib we could instead create our own allocator (or utilize a different one) that allocates memory using C always rather than the DefaultGoAllocator using Go-allocated memory. But that then changes some semantics in terms of memory usage and potential performance characteristics because CGO calls have a higher overhead.
If you aren't making calls to C or passing data around then using the Go allocator is perfectly fine and performant (though if you're doing a lot of allocations, there are a few libraries which have shown significant benefits to using things like jemalloc and manually managing the memory via C calls but this has to be done carefully to avoid paying large overheads in CGO calls and avoid introducing extra dependencies that may not be needed).
I can definitely see a future enhancement to make the default allocator do this, but I didn't want to change the default allocation behavior yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't want the vanilla go get github.com/apache/arrow/go/arrow
library to require having the C++ libarrow available to link against
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the C++ libarrow allocator is necessary, just any C-based allocator (or perhaps there's already a public Go package for this?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair, and i was considering something, though the drawback there is that we don't want every allocation to go through C because CGO has a higher overhead, so if you're making a lot of smaller allocations the overhead of the CGO calls could pose a potential problem in order to be the default from a performance standpoint. So I'd want to at a minimum hide it behind a build tag just in case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, cgo performance does seem horrible: https://about.sourcegraph.com/go/gophercon-2018-adventures-in-cgo-performance/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I figured it was simplest to use the memory pool that already exists in the libarrow library (as I use the libarrow library in my future changes to connect to the compute API), but I could potentially use something like https://github.com/spinlock/jemalloc-go to just include jemalloc as a go-getable dependency. Which i'll play around with and see how it does. I liked the libarrrow memory pool because of the consistency it provided and feature handling such as the logging proxy and the tracking of how many bytes had been allocated in it, but it is likely pretty easy to do similar functionality and use jemalloc directly. As long as consumers are using Reserve and otherwise pre-allocating memory, it would cut down the number of cgo calls to mitigate performance hits.
Alternately I could pursue looking into a slab style allocator that manages much larger chunks of memory that it hands out as a way to amortize the number of calls, but that's a separate thing to look into later.
@kou Do you want to take a look at the MinGW CI additions? |
@pitrou this is based on master though it wasn't up to date until just now when i re-fetched and merged upstream |
No, feel free to move forward. |
Continuing with the idea of exposing the Compute APIs within the Go implementation via CGO, in order to ensure safer memory handling there should be an allocator implementation which uses CGO in order to allocate memory via the C++ memory pool along with utilities for tracking memory leaks. Closes apache#11206 from zeroshade/arrow-14061 Lead-authored-by: Matthew Topol <mtopol@factset.com> Co-authored-by: Matt Topol <zotthewizard@gmail.com> Signed-off-by: Matthew Topol <mtopol@factset.com>
Continuing with the idea of exposing the Compute APIs within the Go implementation via CGO, in order to ensure safer memory handling there should be an allocator implementation which uses CGO in order to allocate memory via the C++ memory pool along with utilities for tracking memory leaks.