Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize memory tracking with a memThreshold #18465

Merged
merged 1 commit into from
Sep 28, 2021

Conversation

ronawho
Copy link
Contributor

@ronawho ronawho commented Sep 27, 2021

Memory tracking uses a global mutex to serialize access to a hash table,
this makes concurrent allocations very slow. Previously, even if a
memory threshold was used we would still grab the table lock when
free'ing because we didn't know the pointer size. Here add a
chpl_mem_real_alloc_size that will use the jemalloc API to ask for the
real size of the allocation before acquiring the lock. This allows us
to avoid taking the lock when free'ing allocations below the threshold,
which saves a lot of time.

Note that chpl_mem_real_alloc_size returns the real allocation size,
not requested size. e.g. (chpl_real_alloc_size(chpl_malloc(7)) returns
8. This means that we'll still do some unnecessary locking if the
allocation size is between the requested size and the real size of an
allocation. If we wanted to avoid that we could silently adjust
memThreshold up to the next allocation size class.

Here's a concurrent allocation micro-benchmark that demonstrates the
overhead. Results are on 128-core Rome CPU:

use Time;
config const trials = 1_000_000;

var t: Timer; t.start();
coforall 1..here.maxTaskPar do
  for i in 1..trials do
    var s = i:string;
writeln(t.elapsed());
config Time
w/o memTrack 0.19s
w/ memTrack 144.50s
w/ threshold before 33.06s
w/ threshold now 0.22s

This is motivated by Arkouda, which uses memTrack as a means to detect
if an operation will exceed memory. We recently noticed concurrent
allocations were slower than expected and tracked it down to this.

Related to #10415
Resolves https://github.com/Cray/chapel-private/issues/1330

Memory tracking uses a global mutex to serialize access to a hash table,
this makes concurrent allocations very slow. Previously, even if a
memory threshold was used we would still grab the table lock when
free'ing because we didn't know the pointer size. Here add a
`chpl_mem_real_alloc_size` that will use the jemalloc API to ask for the
real size of the allocation before acquiring the lock. This allows us
to avoid taking the lock when free'ing allocations below the threshold,
which saves a lot of time.

Note that `chpl_mem_real_alloc_size` returns the real allocation size,
not requested size. e.g. (`chpl_real_alloc_size(chpl_malloc(7))` returns
`8`. This means that we'll still do some unnecessary locking if the
allocation size is between the requested size and the real size of an
allocation. If we wanted to avoid that we could silently adjust
memThreshold up to the next allocation size class.

Here's a concurrent allocation micro-benchmark that demonstrates the
overhead. Results are on 128-core Rome CPU:

```chpl
use Time;
config const trials = 1_000_000;

var t: Timer; t.start();
coforall 1..here.maxTaskPar do
  for i in 1..trials do
    var s = i:string;
writeln(t.elapsed());
```

| config              | Time    |
| ------------------- | ------: |
| w/o memTrack        |   0.19s |
| w/ memTrack         | 144.50s |
| w/ threshold before |  33.06s |
| w/ threshold now    |   0.22s |

This is motivated by Arkouda, which uses `memTrack` as a means to detect
if an operation will exceed memory. We recently noticed concurrent
allocations were slower than expected and tracked it down to this.

Related to 10415
Resolves 1330

Signed-off-by: Elliot Ronaghan <ronawho@gmail.com>
@ronawho ronawho marked this pull request as ready for review September 27, 2021 22:05
Copy link
Member

@gbtitus gbtitus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

@ronawho ronawho merged commit a115645 into chapel-lang:main Sep 28, 2021
@ronawho ronawho deleted the opt-mem-track-with-threshold branch September 28, 2021 01:36
ronawho added a commit that referenced this pull request Sep 29, 2021
Fix `memThreshold` memory tracking optimization

[discussed with @gbtitus, full review post-commit]

#18465 optimized memory leak tracking to avoid taking a lock when the real
allocation size is below `memThreshold`, but there were a few bugs in
the initial implementation.

For cases where the memory layer doesn't support `chpl_real_alloc_size`
(`CHPL_MEM=cstdlib`) we weren't handling the unknown size sentinel so we
weren't ever running the free hook, which made it look like we were
leaking all memory. Fix that to run the hook if the "real size" is 0
(the unknown sentinel.)

This was also broken under configurations that separately allocate
arrays (like `CHPL_COMM=ugni`). For those configs we were taking a comm
layer allocation and trying to ask jemalloc for the size of it, but this
results in mixed allocator calls, which is undefined behavior. Fix that
by making the higher level interface compute the size and passing it in
instead of computing it below the level where we know what allocator
memory came from.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants