EXPERIMENT: [C++] Access mimalloc through dynamically-resolved symbols #41128

Draft

pitrou
Member

pitrou commented Apr 10, 2024

Rationale for this change

The memray memory profiler works by interposing certain dynamic symbols in the profiled process, replacing them with its own functions that collect memory allocation data. To the best of my knowledge, it currently only recognizes system C calls such as malloc, mmap...

When a third-party allocator like mimalloc or jemalloc is being used, as Arrow does by default, memray does not see the logical allocation calls made through these allocators' APIs (because they are not interposed), but only the raw memory reservations that they issue using system routines.

This can lead people using memray to think that a given Arrow workload (or any workload using such allocators, really) uses an inordinate amount of memory, while the reported memory mostly represents non-committed virtual memory that the allocator keeps for performance reasons. Concrete example in GH-40301: we allocate a number of 1 KiB buffers from mimalloc, but memray sees a similar number of 64 MiB calls to mmap.

We discussed how to enhance memray so that it accounts for the corresponding logical allocations, and we came to the conclusion that this requires Arrow to expose API calls that can be dynamically interposed. Since we typically build against a static libmimalloc.a, the mimalloc symbols cannot be exposed (at least, I cannot seem to get this to work on Ubuntu). This means we need to define our own symbols wrapping the mimalloc APIs.

What changes are included in this PR?

Define public, interposable symbols that redirect into the mimalloc APIs that we use.
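
To illustrate the idea, here is a minimal sketch of what such wrapper symbols could look like. It is not the code from this PR: std::malloc and std::free stand in for the mimalloc calls so that the snippet builds without mimalloc, the SKETCH_EXPORT macro is made up for this example, and arrow_mi_malloc is a hypothetical name (only arrow_mi_free appears in the disassembly in this thread).

```cpp
// Sketch only, not the actual Arrow implementation.
// std::malloc / std::free stand in for mi_malloc / mi_free.
#include <cstddef>
#include <cstdlib>

// Hypothetical export macro: give the wrappers default visibility so that
// they end up in the shared library's dynamic symbol table.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_EXPORT __attribute__((visibility("default")))
#else
#define SKETCH_EXPORT
#endif

extern "C" {

// Hypothetical wrapper name; the real build would forward to mi_malloc(size).
SKETCH_EXPORT void* arrow_mi_malloc(std::size_t size) {
  return std::malloc(size);
}

// Name taken from the disassembly in this PR; the real build would forward
// to mi_free(p).
SKETCH_EXPORT void arrow_mi_free(void* p) {
  std::free(p);
}

}  // extern "C"
```

The wrappers use extern "C" so that the exported names are unmangled and easy for a profiler to look up. Whether calls from inside the library actually go through the dynamic symbol (and can therefore be interposed) also depends on compiler and linker options such as -fPIC, -Bsymbolic or -fno-semantic-interposition, which this sketch does not cover.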

Are these changes tested?

Not for now. We could probably test them, at least on Linux, by compiling an almost trivial shared library and interposing it using LD_PRELOAD.
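
As a rough sketch of what such an interposing library might look like (assuming Linux with glibc, and assuming the wrapper is exported as arrow_mi_free with a void(void*) signature), the following counts calls and forwards them to the next definition found by the dynamic linker. The interposer_free_calls helper and the report function are hypothetical names invented for this example.

```cpp
// Sketch of an LD_PRELOAD interposer for arrow_mi_free (Linux/glibc assumed).
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for RTLD_NEXT; g++ usually defines it already
#endif
#include <dlfcn.h>

#include <atomic>
#include <cstdio>

namespace {
std::atomic<long> g_free_calls{0};
}  // namespace

// Hypothetical helper so that a test can read the counter.
extern "C" long interposer_free_calls() { return g_free_calls.load(); }

// Same name as the symbol to interpose: when this library is preloaded, the
// dynamic linker resolves calls to arrow_mi_free here first.
extern "C" void arrow_mi_free(void* p) {
  using FreeFn = void (*)(void*);
  // Look up the next definition in load order (the one in the Arrow library).
  static FreeFn real_free =
      reinterpret_cast<FreeFn>(dlsym(RTLD_NEXT, "arrow_mi_free"));
  g_free_calls.fetch_add(1, std::memory_order_relaxed);
  if (real_free != nullptr) {
    real_free(p);  // forward to the original implementation
  }
}

// Print the number of intercepted calls when the library is unloaded.
__attribute__((destructor)) static void report_free_calls() {
  std::fprintf(stderr, "arrow_mi_free called %ld times\n", g_free_calls.load());
}
```

Assuming the file is saved as interposer.cpp (a made-up name), it could be built with something like `g++ -shared -fPIC interposer.cpp -o libinterposer.so -ldl` and activated with `LD_PRELOAD=./libinterposer.so` in front of the test program. A non-zero count printed at exit would indicate that the symbol was successfully interposed.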

Are there any user-facing changes?

No. There should not be any noticeable performance regression, except perhaps on memory pool micro-benchmarks.


Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@pitrou
Member Author

pitrou commented Apr 10, 2024

@github-actions crossbow submit wheel-manylinux-2-28*


Revision: b40f113

Submitted crossbow builds: ursacomputing/crossbow @ actions-468528d58b

Task Status
wheel-manylinux-2-28-cp310-amd64 GitHub Actions
wheel-manylinux-2-28-cp310-arm64 GitHub Actions
wheel-manylinux-2-28-cp311-amd64 GitHub Actions
wheel-manylinux-2-28-cp311-arm64 GitHub Actions
wheel-manylinux-2-28-cp312-amd64 GitHub Actions
wheel-manylinux-2-28-cp312-arm64 GitHub Actions
wheel-manylinux-2-28-cp38-amd64 GitHub Actions
wheel-manylinux-2-28-cp38-arm64 GitHub Actions
wheel-manylinux-2-28-cp39-amd64 GitHub Actions
wheel-manylinux-2-28-cp39-arm64 GitHub Actions

@pitrou
Member Author

pitrou commented Apr 10, 2024

We can see the difference in the code generated for freeing mimalloc memory:

  • before, there is a direct call to mi_free:
00000000012e5900 <_ZN5arrow18BaseMemoryPoolImplINS_12_GLOBAL__N_117MimallocAllocatorEE4FreeEPhll>:
 12e5900:	55                   	push   %rbp
 12e5901:	48 3b 35 00 fb 47 00 	cmp    0x47fb00(%rip),%rsi        # 1765408 <_ZN5arrow11memory_pool8internal14zero_size_areaE@@Base-0x91b8>
 12e5908:	48 89 e5             	mov    %rsp,%rbp
 12e590b:	41 54                	push   %r12
 12e590d:	49 89 d4             	mov    %rdx,%r12
 12e5910:	53                   	push   %rbx
 12e5911:	48 89 fb             	mov    %rdi,%rbx
 12e5914:	74 09                	je     12e591f <_ZN5arrow18BaseMemoryPoolImplINS_12_GLOBAL__N_117MimallocAllocatorEE4FreeEPhll+0x1f>
 12e5916:	48 89 f7             	mov    %rsi,%rdi
 12e5919:	67 e8 21 af 16 00    	addr32 call 1450840 <mi_free>
 12e591f:	f0 4c 29 63 48       	lock sub %r12,0x48(%rbx)
 12e5924:	5b                   	pop    %rbx
 12e5925:	41 5c                	pop    %r12
 12e5927:	5d                   	pop    %rbp
 12e5928:	c3                   	ret    
  • after, there is an indirect call (the call*) through the jump table to arrow_mi_free:
00000000012e5b00 <_ZN5arrow18BaseMemoryPoolImplINS_12_GLOBAL__N_117MimallocAllocatorEE4FreeEPhll>:
 12e5b00:	55                   	push   %rbp
 12e5b01:	48 3b 35 c0 f8 47 00 	cmp    0x47f8c0(%rip),%rsi        # 17653c8 <_ZN5arrow11memory_pool8internal14zero_size_areaE@@Base-0x91f8>
 12e5b08:	48 89 e5             	mov    %rsp,%rbp
 12e5b0b:	41 54                	push   %r12
 12e5b0d:	49 89 d4             	mov    %rdx,%r12
 12e5b10:	53                   	push   %rbx
 12e5b11:	48 89 fb             	mov    %rdi,%rbx
 12e5b14:	74 09                	je     12e5b1f <_ZN5arrow18BaseMemoryPoolImplINS_12_GLOBAL__N_117MimallocAllocatorEE4FreeEPhll+0x1f>
 12e5b16:	48 89 f7             	mov    %rsi,%rdi
 12e5b19:	ff 15 51 6e 48 00    	call   *0x486e51(%rip)        # 176c970 <arrow_mi_free@@Base+0x47bad0>
 12e5b1f:	f0 4c 29 63 48       	lock sub %r12,0x48(%rbx)
 12e5b24:	5b                   	pop    %rbx
 12e5b25:	41 5c                	pop    %r12
 12e5b27:	5d                   	pop    %rbp
 12e5b28:	c3                   	ret    

@pitrou
Member Author

pitrou commented Apr 10, 2024

Also, I do not see any significant and robust regressions in our nano-benchmarks when running them locally.

@pitrou
Member Author

pitrou commented Apr 10, 2024

Interestingly, the binary wheels built in the Crossbow builds above use a different form of indirection (a direct call to arrow_mi_free@plt instead of an indirect call*):

0000000001476850 <_ZN5arrow18BaseMemoryPoolImplINS_12_GLOBAL__N_117MimallocAllocatorEE4FreeEPhll>:
 1476850:	55                   	push   %rbp
 1476851:	48 89 d5             	mov    %rdx,%rbp
 1476854:	53                   	push   %rbx
 1476855:	48 89 fb             	mov    %rdi,%rbx
 1476858:	48 83 ec 08          	sub    $0x8,%rsp
 147685c:	48 3b 35 05 5c 9d 01 	cmp    0x19d5c05(%rip),%rsi        # 2e4c468 <_ZN5arrow11memory_pool8internal14zero_size_areaE@@Base-0x10898>
 1476863:	74 08                	je     147686d <_ZN5arrow18BaseMemoryPoolImplINS_12_GLOBAL__N_117MimallocAllocatorEE4FreeEPhll+0x1d>
 1476865:	48 89 f7             	mov    %rsi,%rdi
 1476868:	e8 13 56 02 ff       	call   49be80 <arrow_mi_free@plt>
 147686d:	f0 48 29 6b 48       	lock sub %rbp,0x48(%rbx)
 1476872:	48 83 c4 08          	add    $0x8,%rsp
 1476876:	5b                   	pop    %rbx
 1476877:	5d                   	pop    %rbp
 1476878:	c3                   	ret    

I've checked that LD_PRELOAD still works to interpose these functions (arrow_mi_free etc.).
