
Improve memory instrumentation #1790

Merged Apr 24, 2018 (6 commits)

Conversation

jhogberg
Contributor

This PR replaces the old memory instrumentation with a new implementation that scans carriers instead of wrapping erts_alloc/erts_free. The old implementation could not extract information without halting the emulator, had considerable runtime overhead, and the memory maps it produced were noisy and lacked critical information.

Since the new implementation walks through existing data structures there's no longer a need to start the emulator with special flags to get information about carrier utilization/fragmentation. Memory fragmentation is also easier to diagnose as it's presented on a per-carrier basis which eliminates the need to account for "holes" between mmap segments.

To help track allocations, each allocation can now be tagged with what it is and who allocated it at the cost of one extra word per allocation. This is controlled on a per-allocator basis with the +M<S>atags option, and is enabled by default for binary_alloc and driver_alloc (which is also used by NIFs).

The old interface has been removed since it can't be adapted to the new implementation, and the module has always been experimental. It could coexist with the new one if we kept the old implementation around, but it will stay removed unless someone makes a strong case for keeping it.

There's still some work left on the documentation, but with rc1 coming soon I figured it's best to get feedback as soon as possible.


Examples:

instrument:carriers/0-1 return a list with information about each carrier's total size, combined allocation size, allocation count, whether it's in the migration pool, and a histogram of free block sizes (log2, the first interval ends at 512 by default).

In the example below, the ll_alloc carrier has no free blocks at all, the binary_alloc and eheap_alloc ones look healthy with a few very large free blocks, and the fix_alloc carrier is somewhat fragmented with 22 free blocks smaller than 512 bytes (although this is not a problem for that allocator type).

1> instrument:carriers().
{ok,{512,
     [{ll_alloc,1048576,0,1048344,71,false,{0,0,0,0,0,0,0,0,0,0,0,0,0,0}},
      {binary_alloc,1048576,0,324640,13,false,{3,0,0,1,0,0,0,2,0,0,0,0,0,0}},
      {eheap_alloc,2097152,0,1037200,45,false,{2,1,1,3,4,3,2,2,0,0,0,0,0,0}},
      {fix_alloc,32768,0,29544,82,false,{22,0,0,0,0,0,0,0,0,0,0,0,0,0}},
      {...}|...]}}
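
The histogram intervals can be reconstructed from the first-interval bound returned in the result (512 above). Below is a minimal sketch of that decoding in Python, assuming each slot doubles the previous upper bound as the log2 description suggests; the helper name is illustrative and not part of the instrument API:

```python
# Sketch: reconstruct log2 histogram intervals from the first bound.
# The bucket layout (slot N covers up to first_bound * 2**N bytes) is
# inferred from the description above, not taken from the instrument API.

def histogram_intervals(first_bound, buckets):
    """Return (lo, hi] byte ranges for each histogram slot."""
    intervals = []
    lo, hi = 0, first_bound
    for _ in range(buckets):
        intervals.append((lo, hi))
        lo, hi = hi, hi * 2
    return intervals

# The fix_alloc carrier above: 22 free blocks in the first slot.
fix_alloc_hist = (22, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
for (lo, hi), count in zip(histogram_intervals(512, 14), fix_alloc_hist):
    if count:
        print(f"{count} free blocks of {lo + 1}..{hi} bytes")
        # prints: 22 free blocks of 1..512 bytes
```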

instrument:allocations/0-1 return an allocation summary with histograms of allocated block sizes grouped by their origin and allocation type. The example below is taken with +Muatags true to track allocations on all allocator types.

2> instrument:allocations().
{ok,{128,0,
     #{udp_inet =>
           #{driver_event_state => {0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0}},
       tty_sl =>
           #{io_queue => {0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             drv_internal => {0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0}},
       system =>
           #{db_segment => {0,0,0,0,0,18,0,0,1,0,0,0,0,0,0,0,0,0},
             heap => {0,0,0,0,20,4,2,2,2,3,0,1,0,0,1,0,0,0},
             thr_prgr_data => {38,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             db_term => {271,3,1,52,80,1,0,0,0,0,0,0,0,0,0,0,0,0},
             code => {0,0,0,5,3,6,11,22,19,20,10,2,1,0,0,0,0,0},
             binary => {18,0,0,0,7,0,0,1,0,0,0,0,0,0,0,0,0,0},
             atom_entry => {8681,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             message => {0,40,78,2,2,0,0,0,0,0,0,0,0,0,0,0,0,0},
             ... },
       spawn_forker =>
           #{driver_select_data_state =>
                 {1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}},
       ram_file_drv => #{drv_binary => {0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0}},
       prim_file =>
           #{process_specific_data => {2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             nif_trap_export_entry => {0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             monitor_extended => {0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             drv_binary => {0,0,0,0,0,0,1,0,3,5,0,0,0,1,0,0,0,0},
             binary => {0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}},
       prim_buffer =>
           #{nif_internal => {0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             binary => {0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}}}}}
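
One way to make such a map digestible is to total the histogram counts per origin and tag. A minimal sketch over a hand-copied fragment of the output above, with plain Python dicts and tuples standing in for the Erlang terms:

```python
# Sketch: total live-block counts per (origin, type) from an allocations
# result. The nested data mirrors a subset of the Erlang map shown above.

allocations = {
    "system": {
        "atom_entry": (8681, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
        "db_term":    (271, 3, 1, 52, 80, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
    },
    "prim_file": {
        "drv_binary": (0, 0, 0, 0, 0, 0, 1, 0, 3, 5, 0, 0, 0, 1, 0, 0, 0, 0),
    },
}

# Sort the heaviest allocators first; sum(hist) is the live block count.
totals = sorted(
    ((origin, tag, sum(hist))
     for origin, tags in allocations.items()
     for tag, hist in tags.items()),
    key=lambda t: t[2], reverse=True)

for origin, tag, count in totals:
    print(f"{origin}/{tag}: {count} live blocks")
```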

instrument:allocations/1 and instrument:carriers/1 let you tweak the histogram granularity and choose which allocators to inspect.

This increases the accuracy of crash dumps and the upcoming allocation tagging feature.

This may be of interest in crash dumps and allows the upcoming allocation tagging feature to track allocations on a per-NIF basis. Note that this is only updated when user code calls a NIF; it's not altered when the emulator calls NIFs during code upgrades or tracing.
@jhogberg jhogberg self-assigned this Apr 19, 2018
@jhogberg jhogberg added the team:VM Assigned to OTP team VM label Apr 19, 2018
@jhogberg jhogberg force-pushed the john/erts/more-alloc-info/OTP-14961 branch from 352e106 to 12c31ee Compare April 20, 2018 13:40
@jhogberg jhogberg force-pushed the john/erts/more-alloc-info/OTP-14961 branch from 12c31ee to 573a5ab Compare April 24, 2018 09:34
@jhogberg jhogberg merged commit e2a760e into erlang:master Apr 24, 2018
@vans163
Contributor

vans163 commented Feb 20, 2020

I just ended up reaching for this, and it turns out it's gone :(

What this new rewrite lacks is the most important feature: the ability to search for which terms are in memory and how many references they have. This helps quickly identify what garbage is lying around; for example, this module helped me identify the binary heap leak inside jiffy. Now it just gives you summary information, nothing detailed. (You got a pointer to where the term was in memory, and you could then just read it from the process's memory.)

This new version completely removes this feature.

Any plans to perhaps add it back?

@jhogberg
Contributor Author

jhogberg commented Feb 20, 2020

This new version completely removes this feature.

No, you've always needed a debugger to do what you've described; the rewrite merely changes how you use it.

The gdb commands below will get you more or less the same information as before, and it's easily extended to include a lot more than the old instrument module did.

# dprintf breakpoints at the start of the block scan loop in
# gather_ahist_scan. Your line numbers may differ.

## *After* `block = SBC2BLK(allocator, carrier);`
dprintf erl_alloc_util.c:7638,"SB: %p, %p\n", (((char*)(block)) + sizeof(Block_t)), block->bhdr
## At `UWord block_size = MBC_BLK_SZ(block);`
dprintf erl_alloc_util.c:7653,"MB: %p, %p\n", (((char*)(block)) + sizeof(Block_t)), block->bhdr

# Log the above to a file
set logging overwrite on
set logging file allocations.txt
set logging redirect on
set logging on
set pagination off

# Break out of the debugger, and then run instrument:allocations().
# Don't forget to turn off logging and/or redirect when you're done.
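
The allocations.txt written by the logging setup above is just lines of the form "SB: <address>, <header>" and "MB: <address>, <header>", following the dprintf format strings. A small sketch for tallying it afterwards (the sample addresses are made up; anything beyond the two printed fields is an assumption):

```python
# Sketch: count single-block (SB) and multi-block (MB) carrier entries in
# the allocations.txt log produced by the gdb dprintf commands above.
import re
from collections import Counter

LINE = re.compile(r"^(SB|MB): (0x[0-9a-f]+), (0x[0-9a-f]+)$")

def tally(lines):
    counts = Counter()
    for line in lines:
        m = LINE.match(line.strip())
        if m:
            counts[m.group(1)] += 1
    return counts

# Hypothetical log excerpt in the dprintf format shown above.
sample = [
    "SB: 0x7f3a10000040, 0x100001",
    "MB: 0x7f3a20000040, 0x8d1",
    "MB: 0x7f3a200008c0, 0x1a1",
]
print(tally(sample))  # Counter({'MB': 2, 'SB': 1})
```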

@vans163
Contributor

vans163 commented Feb 20, 2020

Thanks, I will check that out, but that's not correct. You did not need a debugger because you can just read back from /proc/[pid]/mem and use os:getpid/0 to get the runtime pid. And the memory_data provided the ref counts as well as the addresses. So I do not see the need to use a debugger to extract this information, unless I am misunderstanding something.

@jhogberg
Contributor Author

jhogberg commented Feb 21, 2020

Thanks, I will check that out, but that's not correct. You did not need a debugger because you can just read back from /proc/[pid]/mem and use os:getpid/0 to get the runtime pid.

I'd say that's a debugger by a different name; you're reading the raw memory of another process without its consent.

And the memory_data provided the ref counts as well as the addresses.

It didn't, as far as I can tell; maybe I'm missing something:

Eterm erts_instr_get_memory_map(Process *proc)
{
    MapStatBlock_t *org_mem_anchor;
    Eterm hdr_tuple, md_list, res;
    Eterm *hp;
    Uint hsz;
    MapStatBlock_t *bp;
#ifdef DEBUG
    Eterm *end_hp;
#endif

    if (!erts_instr_memory_map)
        return am_false;

    if (!atoms_initialized)
        init_atoms();
    if (!am_n)
        init_am_n();
    if (!am_c)
        init_am_c();
    if (!am_a)
        init_am_a();

    erts_mtx_lock(&instr_x_mutex);
    erts_mtx_lock(&instr_mutex);

    /* Header size */
    hsz = 5 + 1 + (ERTS_ALC_N_MAX+1-ERTS_ALC_N_MIN)*(1 + 4);

    /* Memory data list */
    for (bp = mem_anchor; bp; bp = bp->next) {
        if (is_internal_pid(bp->pid)) {
#if (_PID_NUM_SIZE - 1 > MAX_SMALL)
            if (internal_pid_number(bp->pid) > MAX_SMALL)
                hsz += BIG_UINT_HEAP_SIZE;
#endif
#if (_PID_SER_SIZE - 1 > MAX_SMALL)
            if (internal_pid_serial(bp->pid) > MAX_SMALL)
                hsz += BIG_UINT_HEAP_SIZE;
#endif
            hsz += 4;
        }
        if ((UWord) bp->mem > MAX_SMALL)
            hsz += BIG_UINT_HEAP_SIZE;
        if (bp->size > MAX_SMALL)
            hsz += BIG_UINT_HEAP_SIZE;
        hsz += 5 + 2;
    }
    hsz += 3; /* Root tuple */

    org_mem_anchor = mem_anchor;
    mem_anchor = NULL;

    erts_mtx_unlock(&instr_mutex);

    hp = HAlloc(proc, hsz); /* May end up calling map_stat_alloc() */

    erts_mtx_lock(&instr_mutex);

#ifdef DEBUG
    end_hp = hp + hsz;
#endif

    { /* Build header */
        ErtsAlcType_t n;
        Eterm type_map;
        Uint *hp2 = hp;
#ifdef DEBUG
        Uint *hp2_end;
#endif

        hp += (ERTS_ALC_N_MAX + 1 - ERTS_ALC_N_MIN)*4;

#ifdef DEBUG
        hp2_end = hp;
#endif

        type_map = make_tuple(hp);
        *(hp++) = make_arityval(ERTS_ALC_N_MAX + 1 - ERTS_ALC_N_MIN);

        for (n = ERTS_ALC_N_MIN; n <= ERTS_ALC_N_MAX; n++) {
            ErtsAlcType_t t = ERTS_ALC_N2T(n);
            ErtsAlcType_t a = ERTS_ALC_T2A(t);
            ErtsAlcType_t c = ERTS_ALC_T2C(t);

            if (!erts_allctrs_info[a].enabled)
                a = ERTS_ALC_A_SYSTEM;

            *(hp++) = TUPLE3(hp2, am_n[n], am_a[a], am_c[c]);
            hp2 += 4;
        }

        ASSERT(hp2 == hp2_end);

        hdr_tuple = TUPLE4(hp,
                           am.instr_hdr,
                           make_small(ERTS_INSTR_VSN),
                           make_small(MAP_STAT_BLOCK_HEADER_SIZE),
                           type_map);

        hp += 5;
    }

    /* Build memory data list */
    for (md_list = NIL, bp = org_mem_anchor; bp; bp = bp->next) {
        Eterm tuple;
        Eterm type;
        Eterm ptr;
        Eterm size;
        Eterm pid;

        if (is_not_internal_pid(bp->pid))
            pid = am_undefined;
        else {
            Eterm c;
            Eterm n;
            Eterm s;

#if (ERST_INTERNAL_CHANNEL_NO > MAX_SMALL)
#error Oversized internal channel number
#endif
            c = make_small(ERST_INTERNAL_CHANNEL_NO);

#if (_PID_NUM_SIZE - 1 > MAX_SMALL)
            if (internal_pid_number(bp->pid) > MAX_SMALL) {
                n = uint_to_big(internal_pid_number(bp->pid), hp);
                hp += BIG_UINT_HEAP_SIZE;
            }
            else
#endif
                n = make_small(internal_pid_number(bp->pid));

#if (_PID_SER_SIZE - 1 > MAX_SMALL)
            if (internal_pid_serial(bp->pid) > MAX_SMALL) {
                s = uint_to_big(internal_pid_serial(bp->pid), hp);
                hp += BIG_UINT_HEAP_SIZE;
            }
            else
#endif
                s = make_small(internal_pid_serial(bp->pid));

            pid = TUPLE3(hp, c, n, s);
            hp += 4;
        }

#if ERTS_ALC_N_MAX > MAX_SMALL
#error Oversized memory type number
#endif
        type = make_small(bp->type_no);

        if ((UWord) bp->mem > MAX_SMALL) {
            ptr = uint_to_big((UWord) bp->mem, hp);
            hp += BIG_UINT_HEAP_SIZE;
        }
        else
            ptr = make_small((UWord) bp->mem);

        if (bp->size > MAX_SMALL) {
            size = uint_to_big(bp->size, hp);
            hp += BIG_UINT_HEAP_SIZE;
        }
        else
            size = make_small(bp->size);

        tuple = TUPLE4(hp, type, ptr, size, pid);
        hp += 5;

        md_list = CONS(hp, tuple, md_list);
        hp += 2;
    }

    res = TUPLE2(hp, hdr_tuple, md_list);

    ASSERT(hp + 3 == end_hp);

    if (mem_anchor) {
        for (bp = mem_anchor; bp->next; bp = bp->next)
            ;
        ASSERT(org_mem_anchor);
        org_mem_anchor->prev = bp;
        bp->next = org_mem_anchor;
    }
    else {
        mem_anchor = org_mem_anchor;
    }

    erts_mtx_unlock(&instr_mutex);
    erts_mtx_unlock(&instr_x_mutex);

    return res;
}

In either case it's rather trivial to modify the above example to provide the reference count for binaries etc. The allocation tag tells you what type the object is and where it came from, so you can just set a breakpoint after the tag is read and print the reference count if it's a binary.

@vans163
Contributor

vans163 commented Feb 21, 2020

I'd say that's a debugger by a different name; you're reading the raw memory of another process without its consent.

Staying inside managed code makes things much easier, and you also get the actual pid that did the allocation. Versus using GDB: not only does that raise the skill level needed to debug memory issues much higher, it also creates the requirement to write a bunch of custom tooling around GDB if you want to get more information out of the running system (vs just dumping clear-text logs).

Before this was all accessible from inside the emulator, without leaving it.

It didn't as far as I can tell, maybe I'm missing something:

I have not used it in a while; maybe it gave you the address of the allocated block the term was in (for shared binary heap terms at least)?

Yeah, but it was a very useful tool as it was, IMO, for figuring out the hard stuff. Things like Recon (and this new instrument module) do not really cut it: they tell you there is a problem, but you cannot really pinpoint what exact code is causing it.

So you need to drop down to GDB, but maybe that makes sense; pinpointing issues like this is a task for GDB, and otherwise it gets hard to maintain. It is arguably out of scope for managed code to be able to do stuff like this. GDB also seems to be the go-to tool for other issues, like scheduler locking around ets tables.

@jhogberg
Contributor Author

Staying inside managed code makes things much easier

Wrangling /proc/[pid]/mem sounds far more painful than using gdb to me, regardless of whether you're doing it from Erlang code or not.

I don't get it. It's like using a fingernail to drive in a screw; why not use the right tool for the job?

and you also get the actual pid that did the allocation.

In return you had no idea where it originated and had to resort to reading block contents with a debugger to figure that out, which is pretty much only feasible with binaries that contain recognizable data.

The pids weren't particularly useful either since they were only present when a process was directly responsible for the allocation (ports, system tasks, async jobs, etc were hopeless), and they could be long dead by the time you inspected them.

Versus using GDB: not only does that raise the skill level needed to debug memory issues much higher,

It doesn't take long to pick up the basics and there are countless tutorials on the net to help you. I don't think it's too much to ask.

it also creates the requirement to write a bunch of custom tooling around GDB now if you want to get more information out the running system (vs just dumping clear text logs).

erts/etc/unix/etp-commands covers the worst of it.

I have not used it in a while; maybe it gave you the address of the allocated block the term was in (for shared binary heap terms at least)?

No, it just prints the block address, type, and size. It doesn't say anything about the contents nor do they have anything to do with Erlang terms.

For example, the binary allocation type usually holds off-heap data structures for binaries together with their contents, but there's no way to tell which parts of the binaries are referenced by which (Erlang) terms, which processes or ETS tables hold on to it, and so on.

It's just a pile of bytes and you can't even say what the reference count is without knowing the struct layout. With gdb you can say p *(Binary*)(address) to get it nicely pretty-printed, but you need to figure it out yourself when using /proc/[pid]/mem.

Yeah, but it was a very useful tool as it was, IMO, for figuring out the hard stuff. Things like Recon (and this new instrument module) do not really cut it: they tell you there is a problem, but you cannot really pinpoint what exact code is causing it.

  1. The new module tells you which driver/NIF allocated the blocks (on a per-block basis at that). This was not possible with the old module regardless of whether you used a debugger or not.
  2. You had to use a debugger to pinpoint what exact code was causing the problem, it's not fair to expect the same from the new module alone.

So you need to drop down to GDB, but maybe that makes sense; pinpointing issues like this is a task for GDB, and otherwise it gets hard to maintain. It is arguably out of scope for managed code to be able to do stuff like this. GDB also seems to be the go-to tool for other issues, like scheduler locking around ets tables.

This basically sums it up. For what it's worth, I'm sorry this PR derailed your workflow.

@vans163
Contributor

vans163 commented Feb 24, 2020

Would it be possible to extend this module so it can show all the terms in a process's eheap?

For example, I know I have a memory issue: eheap grows insanely fast if some unhappy path is taken. Up to 10MB is allocated for eheap and then it stops growing (maybe this is some kind of internal limit after which GC is forced?). If the unhappy path, with its roughly 1% chance, is not taken, everything is fine. Great, so how do I now find the unhappy path that is triggering the allocations?

One way would be if I could dump all the terms in the eheap, or at least get a pointer to them. That way I could easily and quickly see what is there. Of course the best way would be to tie every term in the eheap to the line number in the source code that allocated it, but that might be complex due to all the optimizations with memory copying?

@jhogberg
Contributor Author

No, things like that are out of scope, and you can already do this with gdb using etp-processes / etp-heapdump etc.
