Improve memory instrumentation #1790

jhogberg · 2018-04-19T08:42:22Z

This PR replaces the old memory instrumentation with a new implementation that scans carriers instead of wrapping erts_alloc/erts_free. The old implementation could not extract information without halting the emulator, had considerable runtime overhead, and the memory maps it produced were noisy and lacked critical information.

Since the new implementation walks through existing data structures there's no longer a need to start the emulator with special flags to get information about carrier utilization/fragmentation. Memory fragmentation is also easier to diagnose as it's presented on a per-carrier basis which eliminates the need to account for "holes" between mmap segments.

To help track allocations, each allocation can now be tagged with what it is and who allocated it at the cost of one extra word per allocation. This is controlled on a per-allocator basis with the +M<S>atags option, and is enabled by default for binary_alloc and driver_alloc (which is also used by NIFs).

The old interface has been removed since it can't be adapted to the new implementation and the module has always been experimental. It could coexist with the new one if we kept the old implementation around, but it will be removed unless someone makes a strong case for keeping it.

There's still some work left on the documentation, but with rc1 coming soon I figured it's best to get feedback as soon as possible.

Examples:

instrument:carriers/0-1 return a list with information about each carrier's total size, combined allocation size, allocation count, whether it's in the migration pool, and a histogram of free block sizes (log2, the first interval ends at 512 by default).

In the example below, ll_alloc carrier has no free blocks at all, the binary_alloc and eheap_alloc ones look healthy with a few very large blocks, and the fix_alloc carrier is somewhat fragmented with 22 free blocks smaller than 512 bytes (although this is not a problem for that allocator type).

1> instrument:carriers().
{ok,{512,
     [{ll_alloc,1048576,0,1048344,71,false,{0,0,0,0,0,0,0,0,0,0,0,0,0,0}},
      {binary_alloc,1048576,0,324640,13,false,{3,0,0,1,0,0,0,2,0,0,0,0,0,0}},
      {eheap_alloc,2097152,0,1037200,45,false,{2,1,1,3,4,3,2,2,0,0,0,0,0,0}},
      {fix_alloc,32768,0,29544,82,false,{22,0,0,0,0,0,0,0,0,0,0,0,0,0}},
      {...}|...]}}

instrument:allocations/0-1 return an allocation summary with histograms of allocated block sizes grouped by their origin and allocation type. The example below is taken with +Muatags true to track allocations on all allocator types.

2> instrument:allocations()
{ok,{128,0,
     #{udp_inet =>
           #{driver_event_state => {0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0}},
       tty_sl =>
           #{io_queue => {0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             drv_internal => {0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0}},
       system =>
           #{db_segment => {0,0,0,0,0,18,0,0,1,0,0,0,0,0,0,0,0,0},
             heap => {0,0,0,0,20,4,2,2,2,3,0,1,0,0,1,0,0,0},
             thr_prgr_data => {38,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             db_term => {271,3,1,52,80,1,0,0,0,0,0,0,0,0,0,0,0,0},
             code => {0,0,0,5,3,6,11,22,19,20,10,2,1,0,0,0,0,0},
             binary => {18,0,0,0,7,0,0,1,0,0,0,0,0,0,0,0,0,0},
             atom_entry => {8681,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             message => {0,40,78,2,2,0,0,0,0,0,0,0,0,0,0,0,0,0},
             ... }
       spawn_forker =>
           #{driver_select_data_state =>
                 {1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}},
       ram_file_drv => #{drv_binary => {0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0}},
       prim_file =>
           #{process_specific_data => {2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             nif_trap_export_entry => {0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             monitor_extended => {0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             drv_binary => {0,0,0,0,0,0,1,0,3,5,0,0,0,1,0,0,0,0},
             binary => {0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}},
       prim_buffer =>
           #{nif_internal => {0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             binary => {0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}}}}}

instrument:allocations/1 and instrument:carriers/1 let you tweak the histograms and which allocators to look in.

This increases the accuracy of crash dumps and the upcoming allocation tagging feature.

This may be of interest in crash dumps and allows the upcoming allocation tagging feature to track allocations on a per-NIF basis. Note that this is only updated when user code calls a NIF; it's not altered when the emulator calls NIFs during code upgrades or tracing.

OTP-15024

This commit replaces the old memory instrumentation with a new implementation that scans carriers instead of wrapping erts_alloc/erts_free. The old implementation could not extract information without halting the emulator, had considerable runtime overhead, and the memory maps it produced were noisy and lacked critical information. Since the new implementation walks through existing data structures there's no longer a need to start the emulator with special flags to get information about carrier utilization/fragmentation. Memory fragmentation is also easier to diagnose as it's presented on a per-carrier basis which eliminates the need to account for "holes" between mmap segments. To help track allocations, each allocation can now be tagged with what it is and who allocated it at the cost of one extra word per allocation. This is controlled on a per-allocator basis with the +M<S>atags option, and is enabled by default for binary_alloc and driver_alloc (which is also used by NIFs).

vans163 · 2020-02-20T00:31:39Z

I just ended up reaching for this and turns out its gone :(

What this new rewrite lacks is the most important feature, the fact you can search which terms are in memory and how many references they have. This helps quickly identify what garbage is lying about, for example this module helped me identify the binary heap leak inside jiffy. Now it just tells you information but nothing detailed. (you got a pointer to where the term was inmemory, then you can just read it from the processes memory).

This new version completely removes this feature.

Any plans to perhaps add it back?

jhogberg · 2020-02-20T10:48:38Z

This new version completely removes this feature.

No, you've always needed a debugger to do what you've described and the rewrite merely changes how you use it.

The gdb commands below will get you more or less the same information as before, and it's easily extended to include a lot more than the old instrument module did.

# printf at the start of the block scan loop in gather_ahist_scan. Your
# line numbers may differ.

## *After* `block = SBC2BLK(allocator, carrier);`
dprintf erl_alloc_util.c:7638,"SB: %p, %p\n", (((char*)(block)) + sizeof(Block_t)), block->bhdr
## At `UWord block_size = MBC_BLK_SZ(block);`
dprintf erl_alloc_util.c:7653,"MB: %p, %p\n", (((char*)(block)) + sizeof(Block_t)), block->bhdr

# Log the above to a file
set logging overwrite on
set logging file allocations.txt
set logging redirect on
set logging on
set pagination off

# Break out of the debugger, and then run instrument:allocations().
# Don't forget to turn off logging and/or redirect when you're done.

vans163 · 2020-02-20T17:08:18Z

Thanks I will check that out, but its not correct. You did not need a debugger because you can just read back from /proc/mem and use os:getpid/0 to get the runtime pid. And the memory_data provided the ref counts as well as addresses. So I do not see the need to use a debugger to extract this information unless I am misunderstanding something..

jhogberg · 2020-02-21T09:00:29Z

Thanks I will check that out, but its not correct. You did not need a debugger because you can just read back from /proc/mem and use os:getpid/0 to get the runtime pid.

I'd say that's a debugger by a different name, you're reading the raw memory of another process without its consent.

And the memory_data provided the ref counts as well as addresses.

It didn't as far as I can tell, maybe I'm missing something:

otp/erts/emulator/beam/erl_instrument.c

Lines 637 to 831 in 6a99831

    
           Eterm erts_instr_get_memory_map(Process *proc) 
        
           { 
        
               MapStatBlock_t *org_mem_anchor; 
        
               Eterm hdr_tuple, md_list, res; 
        
               Eterm *hp; 
        
               Uint hsz; 
        
               MapStatBlock_t *bp; 
        
           #ifdef DEBUG 
        
               Eterm *end_hp; 
        
           #endif 
        
               if (!erts_instr_memory_map) 
        
           	return am_false; 
        
               if (!atoms_initialized) 
        
           	init_atoms(); 
        
               if (!am_n) 
        
           	init_am_n(); 
        
               if (!am_c) 
        
           	init_am_c(); 
        
               if (!am_a) 
        
           	init_am_a(); 
        
               erts_mtx_lock(&instr_x_mutex); 
        
               erts_mtx_lock(&instr_mutex); 
        
               /* Header size */ 
        
               hsz = 5 + 1 + (ERTS_ALC_N_MAX+1-ERTS_ALC_N_MIN)*(1 + 4); 
        
               /* Memory data list */ 
        
               for (bp = mem_anchor; bp; bp = bp->next) { 
        
           	if (is_internal_pid(bp->pid)) { 
        
           #if (_PID_NUM_SIZE - 1 > MAX_SMALL) 
        
           	    if (internal_pid_number(bp->pid) > MAX_SMALL) 
        
           		hsz += BIG_UINT_HEAP_SIZE; 
        
           #endif 
        
           #if (_PID_SER_SIZE - 1 > MAX_SMALL) 
        
           	    if (internal_pid_serial(bp->pid) > MAX_SMALL) 
        
           		hsz += BIG_UINT_HEAP_SIZE; 
        
           #endif 
        
           	    hsz += 4; 
        
           	} 
        
           	if ((UWord) bp->mem > MAX_SMALL) 
        
           	    hsz += BIG_UINT_HEAP_SIZE; 
        
           	if (bp->size > MAX_SMALL) 
        
           	    hsz += BIG_UINT_HEAP_SIZE; 
        
           	hsz += 5 + 2; 
        
               } 
        
               hsz += 3; /* Root tuple */ 
        
               org_mem_anchor = mem_anchor; 
        
               mem_anchor = NULL; 
        
               erts_mtx_unlock(&instr_mutex); 
        
               hp = HAlloc(proc, hsz); /* May end up calling map_stat_alloc() */ 
        
               erts_mtx_lock(&instr_mutex); 
        
           #ifdef DEBUG 
        
               end_hp = hp + hsz; 
        
           #endif 
        
               {	/* Build header */ 
        
           	ErtsAlcType_t n; 
        
           	Eterm type_map; 
        
           	Uint *hp2 = hp; 
        
           #ifdef DEBUG 
        
           	Uint *hp2_end; 
        
           #endif 
        
           	hp += (ERTS_ALC_N_MAX + 1 - ERTS_ALC_N_MIN)*4; 
        
           #ifdef DEBUG 
        
           	hp2_end = hp; 
        
           #endif 
        
           	type_map = make_tuple(hp); 
        
           	*(hp++) = make_arityval(ERTS_ALC_N_MAX + 1 - ERTS_ALC_N_MIN); 
        
           	for (n = ERTS_ALC_N_MIN; n <= ERTS_ALC_N_MAX; n++) { 
        
           	    ErtsAlcType_t t = ERTS_ALC_N2T(n); 
        
           	    ErtsAlcType_t a = ERTS_ALC_T2A(t); 
        
           	    ErtsAlcType_t c = ERTS_ALC_T2C(t); 
        
           	    if (!erts_allctrs_info[a].enabled) 
        
           		a = ERTS_ALC_A_SYSTEM; 
        
           	    *(hp++) = TUPLE3(hp2, am_n[n], am_a[a], am_c[c]); 
        
           	    hp2 += 4; 
        
           	} 
        
           	ASSERT(hp2 == hp2_end); 
        
           	hdr_tuple = TUPLE4(hp, 
        
           			   am.instr_hdr, 
        
           			   make_small(ERTS_INSTR_VSN), 
        
           			   make_small(MAP_STAT_BLOCK_HEADER_SIZE), 
        
           			   type_map); 
        
           	hp += 5; 
        
               } 
        
               /* Build memory data list */ 
        
               for (md_list = NIL, bp = org_mem_anchor; bp; bp = bp->next) { 
        
           	Eterm tuple; 
        
           	Eterm type; 
        
           	Eterm ptr; 
        
           	Eterm size; 
        
           	Eterm pid; 
        
           	if (is_not_internal_pid(bp->pid)) 
        
           	    pid = am_undefined; 
        
           	else { 
        
           	    Eterm c; 
        
           	    Eterm n; 
        
           	    Eterm s; 
        
           #if (ERST_INTERNAL_CHANNEL_NO > MAX_SMALL) 
        
           #error Oversized internal channel number 
        
           #endif 
        
           	    c = make_small(ERST_INTERNAL_CHANNEL_NO); 
        
           #if (_PID_NUM_SIZE - 1 > MAX_SMALL) 
        
           	    if (internal_pid_number(bp->pid) > MAX_SMALL) { 
        
           		n = uint_to_big(internal_pid_number(bp->pid), hp); 
        
           		hp += BIG_UINT_HEAP_SIZE; 
        
           	    } 
        
           	    else 
        
           #endif 
        
           		n = make_small(internal_pid_number(bp->pid)); 
        
           #if (_PID_SER_SIZE - 1 > MAX_SMALL) 
        
           	    if (internal_pid_serial(bp->pid) > MAX_SMALL) { 
        
           		s = uint_to_big(internal_pid_serial(bp->pid), hp); 
        
           		hp += BIG_UINT_HEAP_SIZE; 
        
           	    } 
        
           	    else 
        
           #endif 
        
           		s = make_small(internal_pid_serial(bp->pid)); 
        
           	    pid = TUPLE3(hp, c, n, s); 
        
           	    hp += 4; 
        
           	} 
        
           #if ERTS_ALC_N_MAX > MAX_SMALL 
        
           #error Oversized memory type number 
        
           #endif 
        
           	type = make_small(bp->type_no); 
        
           	if ((UWord) bp->mem > MAX_SMALL) { 
        
           	    ptr = uint_to_big((UWord) bp->mem, hp); 
        
           	    hp += BIG_UINT_HEAP_SIZE; 
        
           	} 
        
           	else 
        
           	    ptr = make_small((UWord) bp->mem); 
        
           	if (bp->size > MAX_SMALL) { 
        
           	    size = uint_to_big(bp->size, hp); 
        
           	    hp += BIG_UINT_HEAP_SIZE; 
        
           	} 
        
           	else 
        
           	    size = make_small(bp->size); 
        
           	tuple = TUPLE4(hp, type, ptr, size, pid); 
        
           	hp += 5; 
        
           	md_list = CONS(hp, tuple, md_list); 
        
           	hp += 2; 
        
               } 
        
               res = TUPLE2(hp, hdr_tuple, md_list); 
        
               ASSERT(hp + 3 == end_hp); 
        
               if (mem_anchor) { 
        
           	for (bp = mem_anchor; bp->next; bp = bp->next) 
        
           	    ; 
        
           	ASSERT(org_mem_anchor); 
        
           	org_mem_anchor->prev = bp;  
        
           	bp->next = org_mem_anchor; 
        
               } 
        
               else { 
        
           	mem_anchor = org_mem_anchor; 
        
               } 
        
               erts_mtx_unlock(&instr_mutex); 
        
               erts_mtx_unlock(&instr_x_mutex); 
        
               return res; 
        
           }

In either case it's rather trivial to modify the above example to provide reference count for binaries etc. The allocation tag tells you what type the object is and where it came from, so you can just set a breakpoint after the tag is read and print the reference count if it's a binary.

vans163 · 2020-02-21T15:06:15Z

I'd say that's a debugger by a different name, you're reading the raw memory of another process without its consent.

Still inside managed code it makes things much easier, as well as you get the actual pid that did the allocation. VS using GDB not only does that raise the skill level to debug memory issues much higher, it also creates the requirement to write a bunch of custom tooling around GDB now if you want to get more information out the running system (vs just dumping clear text logs).

Before this was all accessible from inside the emulator, without leaving it.

It didn't as far as I can tell, maybe I'm missing something:

I have not used it in a while, maybe it gave you the address to the allocated block the term was in (for shared binary heap terms atleast)?

Yea but its a very useful tool as it was before IMO for figuring out the hard stuff. Things like Recon (and this new instrument module) do not really cut it as they tell you there is a problem, but you cannot really pinpoint what exact code is causing the problem.

So you need to cut to GDB, but maybe that makes sense, pinpointing issues like this are a task for GDB otherwise it gets hard to maintain. Like its out of scope from inside managed code to be able to do stuff like this. Also GDB seems the goto tool as well for other issues like scheduler locking around ets tables.

jhogberg · 2020-02-21T18:31:01Z

Still inside managed code it makes things much easier

Wrangling /proc/[pid]/mem sounds far more painful than using gdb to me, regardless of whether you're doing it from Erlang code or not.

I don't get it. It's like using a fingernail to drive in a screw, why not use the right tool for the job?

as well as you get the actual pid that did the allocation.

In return you had no idea where it originated and had to resort to reading block contents with a debugger to figure that out, which is pretty much only feasible with binaries that contain recognizable data.

The pids weren't particularly useful either since they were only present when a process was directly responsible for the allocation (ports, system tasks, async jobs, etc were hopeless), and they could be long dead by the time you inspected them.

VS using GDB not only does that raise the skill level to debug memory issues much higher,

It doesn't take long to pick up the basics and there are countless tutorials on the net to help you. I don't think it's too much to ask.

it also creates the requirement to write a bunch of custom tooling around GDB now if you want to get more information out the running system (vs just dumping clear text logs).

erts/etc/unix/etp-commands covers the worst of it.

I have not used it in a while, maybe it gave you the address to the allocated block the term was in (for shared binary heap terms atleast)?

No, it just prints the block address, type, and size. It doesn't say anything about the contents nor do they have anything to do with Erlang terms.

For example, the binary allocation type usually holds off-heap data structures for binaries together with their contents, but there's no way to tell which parts of the binaries are referenced by which (Erlang) terms, which processes or ETS tables hold on to it, and so on.

It's just a pile of bytes and you can't even say what the reference count is without knowing the struct layout. With gdb you can say p *(Binary*)(address) to get it nicely pretty-printed, but you need to figure it out yourself when using /proc/[pid]/mem.

Yea but its a very useful tool as it was before IMO for figuring out the hard stuff. Things like Recon (and this new instrument module) do not really cut it as they tell you there is a problem, but you cannot really pinpoint what exact code is causing the problem.

The new module tells you which driver/NIF allocated the blocks (on a per-block basis at that). This was not possible with the old module regardless of whether you used a debugger or not.
You had to use a debugger to pinpoint what exact code was causing the problem, it's not fair to expect the same from the new module alone.

So you need to cut to GDB, but maybe that makes sense, pinpointing issues like this are a task for GDB otherwise it gets hard to maintain. Like its out of scope from inside managed code to be able to do stuff like this. Also GDB seems the goto tool as well for other issues like scheduler locking around ets tables.

This basically sums it up. For what it's worth, I'm sorry this PR derailed your workflow.

vans163 · 2020-02-24T15:03:29Z

Would it be possible to extend this module so it can show all the terms in a processes eheap?

Like for example I know I have a memory issue, eheap is growing insanely fast if some unhappy path is taken. Up to 10MB is allocated for eheap then it stops growing (maybe this is some kind of internal limit after which GC is forced?). If the unhappy path is not taken in that 1% or something chance, everything is fine. Great, so how do I find now the unhappy path that is triggering the allocations?

One way would be if I could dump all the terms in the eheap or at least get a pointer to it. This way I can easily and quickly see. Of course the best way would be tie every term in the eheap to the line number in source code that allocated it. But this might be complex due to all the optimizations with memory copying?

jhogberg · 2020-02-25T06:46:40Z

No, things like that are out of scope and you can already do this with gdb using etp-processes / etp-heapdump etc.

jhogberg added 5 commits April 16, 2018 15:44

erts: Update esdp->current_port on immediate port calls

c71d914

This increases the accuracy of crash dumps and the upcoming allocation tagging feature.

erts: Always keep a copy of driver names as an atom

9128efb

tools: Generate typespecs in doc build

d969e5d

erts: Fix rare deadlock in realloc when +ramv is enabled

26d72d0

OTP-15024

jhogberg self-assigned this Apr 19, 2018

jhogberg added the team:VM Assigned to OTP team VM label Apr 19, 2018

jhogberg force-pushed the john/erts/more-alloc-info/OTP-14961 branch from 352e106 to 12c31ee Compare April 20, 2018 13:40

jhogberg force-pushed the john/erts/more-alloc-info/OTP-14961 branch from 12c31ee to 573a5ab Compare April 24, 2018 09:34

jhogberg merged commit e2a760e into erlang:master Apr 24, 2018

rickard-green mentioned this pull request Nov 19, 2021

Allocator instances for dirty schedulers #5187

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve memory instrumentation #1790

Improve memory instrumentation #1790

jhogberg commented Apr 19, 2018

vans163 commented Feb 20, 2020 •

edited

Loading

jhogberg commented Feb 20, 2020 •

edited

Loading

vans163 commented Feb 20, 2020 •

edited

Loading

jhogberg commented Feb 21, 2020 •

edited

Loading

vans163 commented Feb 21, 2020

jhogberg commented Feb 21, 2020

vans163 commented Feb 24, 2020

jhogberg commented Feb 25, 2020

Improve memory instrumentation #1790

Improve memory instrumentation #1790

Conversation

jhogberg commented Apr 19, 2018

Examples:

vans163 commented Feb 20, 2020 • edited Loading

jhogberg commented Feb 20, 2020 • edited Loading

vans163 commented Feb 20, 2020 • edited Loading

jhogberg commented Feb 21, 2020 • edited Loading

vans163 commented Feb 21, 2020

jhogberg commented Feb 21, 2020

vans163 commented Feb 24, 2020

jhogberg commented Feb 25, 2020

vans163 commented Feb 20, 2020 •

edited

Loading

jhogberg commented Feb 20, 2020 •

edited

Loading

vans163 commented Feb 20, 2020 •

edited

Loading

jhogberg commented Feb 21, 2020 •

edited

Loading