Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-14061: [Go][C++] Add Cgo Arrow Memory Pool Allocator #11206

Closed
wants to merge 24 commits into from

Conversation

zeroshade
Copy link
Member

Continuing with the idea of exposing the Compute APIs within the Go implementation via CGO, in order to ensure safer memory handling there should be an allocator implementation which uses CGO in order to allocate memory via the C++ memory pool along with utilities for tracking memory leaks.

@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@zeroshade
Copy link
Member Author

CC @emkornfield

We really need to get someone else who can review Go stuff so I don't keep overwhelming you with so many PRs haha.

Also, @pitrou may want to look at this or know who else I can tag given the connection with the C++ memory pool stuff here.

@pitrou
Copy link
Member

pitrou commented Sep 29, 2021

This PR isn't based on git master, is it?

ci/docker/debian-10-go-cgo.dockerfile Outdated Show resolved Hide resolved
// the context of a single function call. If the memory in use isn't allocated
// on the C side, then it is not safe for any pointers to data to be held outside
// of Go beyond the context of a single Cgo function call as it will be invisible
// to the Go garbage collector and could potentially get moved without being updated.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this comment. In #11220 your releaseData function seems to take care of this correctly (by defining a global container of exported arrays).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When making a CGO call, during the execution of it the Go garbage collector will pin memory being used so that any pointers to go memory will remain valid during the length of that cgo call. But the documentation is very clear that it is not safe to let C/C++ maintain pointers to Go memory beyond the context of that single call because after the cgo call returns, the Go garbage collector is free to move memory around as necessary. Normally the garbage collector can update any references in Go when it does this so that everything continues to work just fine, but it cannot update any pointers in C/C++ when it does this.

So the global container of exported arrays used by releaseData serves two purposes:

  1. Ensuring that during the context of a single CGO call that there is a maintained reference to the Go objects so that it keeps the garbage collector from cleaning it up
  2. Allowing C/C++ to call releaseData to free the memory if we're handing it off, or allowing the ability to hand off memory to let C/C++ own it and control when it is released. this is only safe if the underlying buffers were allocated by C and not Go allocated memory.

For more information about this specifically in terms of passing pointers around: https://pkg.go.dev/cmd/cgo#hdr-Passing_pointers is the documentation on CGO that i'm referencing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so what it seems to mean is that Go should always use its own allocator when allocating array data, rather than the Go GC. Would there be a problem in doing that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that's also a possibility, rather than using the memorypool exposed by the arrow lib we could instead create our own allocator (or utilize a different one) that allocates memory using C always rather than the DefaultGoAllocator using Go-allocated memory. But that then changes some semantics in terms of memory usage and potential performance characteristics because CGO calls have a higher overhead.

If you aren't making calls to C or passing data around then using the Go allocator is perfectly fine and performant (though if you're doing a lot of allocations, there are a few libraries which have shown significant benefits to using things like jemalloc and manually managing the memory via C calls but this has to be done carefully to avoid paying large overheads in CGO calls and avoid introducing extra dependencies that may not be needed).

I can definitely see a future enhancement to make the default allocator do this, but I didn't want to change the default allocation behavior yet.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't want the vanilla go get github.com/apache/arrow/go/arrow library to require having the C++ libarrow available to link against

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the C++ libarrow allocator is necessary, just any C-based allocator (or perhaps there's already a public Go package for this?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair, and i was considering something, though the drawback there is that we don't want every allocation to go through C because CGO has a higher overhead, so if you're making a lot of smaller allocations the overhead of the CGO calls could pose a potential problem in order to be the default from a performance standpoint. So I'd want to at a minimum hide it behind a build tag just in case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I figured it was simplest to use the memory pool that already exists in the libarrow library (as I use the libarrow library in my future changes to connect to the compute API), but I could potentially use something like https://github.com/spinlock/jemalloc-go to just include jemalloc as a go-getable dependency. Which i'll play around with and see how it does. I liked the libarrrow memory pool because of the consistency it provided and feature handling such as the logging proxy and the tracking of how many bytes had been allocated in it, but it is likely pretty easy to do similar functionality and use jemalloc directly. As long as consumers are using Reserve and otherwise pre-allocating memory, it would cut down the number of cgo calls to mitigate performance hits.

Alternately I could pursue looking into a slab style allocator that manages much larger chunks of memory that it hands out as a way to amortize the number of calls, but that's a separate thing to look into later.

go/arrow/memory/internal/cgoalloc/helpers.h Outdated Show resolved Hide resolved
go/arrow/memory/internal/cgoalloc/allocator.cc Outdated Show resolved Hide resolved
@pitrou
Copy link
Member

pitrou commented Sep 29, 2021

@kou Do you want to take a look at the MinGW CI additions?

@zeroshade
Copy link
Member Author

@pitrou this is based on master though it wasn't up to date until just now when i re-fetched and merged upstream

.github/workflows/go.yml Outdated Show resolved Hide resolved
ci/docker/debian-go-cgo.dockerfile Show resolved Hide resolved
@zeroshade
Copy link
Member Author

@pitrou @kou Do either of you have any further comments / issues with this as it stands?

@pitrou
Copy link
Member

pitrou commented Sep 30, 2021

No, feel free to move forward.

@asfgit asfgit closed this in 8568942 Sep 30, 2021
@zeroshade zeroshade deleted the arrow-14061 branch September 30, 2021 16:50
ViniciusSouzaRoque pushed a commit to s1mbi0se/arrow that referenced this pull request Oct 20, 2021
Continuing with the idea of exposing the Compute APIs within the Go implementation via CGO, in order to ensure safer memory handling there should be an allocator implementation which uses CGO in order to allocate memory via the C++ memory pool along with utilities for tracking memory leaks.

Closes apache#11206 from zeroshade/arrow-14061

Lead-authored-by: Matthew Topol <mtopol@factset.com>
Co-authored-by: Matt Topol <zotthewizard@gmail.com>
Signed-off-by: Matthew Topol <mtopol@factset.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants