
erts: Implement max_heap_size process flag #1032

Merged
merged 70 commits into from
May 12, 2016

Conversation

garazdawi
Contributor

@garazdawi garazdawi commented Apr 26, 2016

The max_heap_size process flag can be used to limit the growth of a process heap by killing it before it becomes too large to handle. It is possible to set the maximum using the erl +hmax option, system_flag(max_heap_size, ...), spawn_opt(Fun, [{max_heap_size, ...}]) and process_flag(max_heap_size, ...).

It is possible to configure the behaviour of the process when the maximum heap size is reached. The process may be sent an untrappable exit signal with reason kill, and/or an error_logger message with details on the process state may be emitted. A new trace event called gc_max_heap_size is also triggered for the garbage_collection trace flag when the heap grows larger than the configured size.

If kill and error_logger are disabled, it is still possible to see that the maximum has been reached by enabling garbage collection tracing on the process.

The heap size is defined as the sum of the heap memory that the process is currently using. This includes all generational heaps, the stack, any messages that are considered to be part of the heap and any extra memory the garbage collector may need during collection.

In the current implementation this means that when a process is set to use the on_heap message queue data mode, the messages in the internal message queue are counted towards this value. For off_heap, only matched messages count towards the size of the heap. For mixed, it depends on race conditions within the VM whether a message is part of the heap or not.
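The flag value can be either a bare size or, to control the actions described above, a map. A minimal sketch in Erlang (the size/kill/error_logger keys mirror the options discussed in this description; the limit of 2000 words is an arbitrary illustration):

```erlang
%% Spawn a process whose heap may not grow beyond 2000 words. When the
%% limit is hit, the process is sent an untrappable kill signal and an
%% error_logger report is emitted (both actions are enabled here).
Pid = spawn_opt(fun() -> receive stop -> ok end end,
                [{max_heap_size, #{size => 2000,
                                   kill => true,
                                   error_logger => true}}]).

%% The limit can also be adjusted from inside a running process:
%% process_flag(max_heap_size, #{size => 4000,
%%                               kill => false,
%%                               error_logger => true}).
```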

Below is an example run of the new behaviour:

Eshell V8.0  (abort with ^G)
1> f(P),P = spawn_opt(fun() -> receive ok -> ok end end, [{max_heap_size, 512}]).
<0.60.0>
2> erlang:trace(P, true, [garbage_collection, procs]).
1
3> [P ! lists:duplicate(M,M) || M <- lists:seq(1,15)],ok.
ok
4>
=ERROR REPORT==== 26-Apr-2016::16:25:10 ===
     Process:          <0.60.0>
     Context:          maximum heap size reached
     Max heap size:    512
     Total heap size:  723
     Kill:             true
     Error Logger:     true
     GC Info:          [{old_heap_block_size,0},
                        {heap_block_size,609},
                        {mbuf_size,145},
                        {recent_size,0},
                        {stack_size,9},
                        {old_heap_size,0},
                        {heap_size,211},
                        {bin_vheap_size,0},
                        {bin_vheap_block_size,46422},
                        {bin_old_vheap_size,0},
                        {bin_old_vheap_block_size,46422}]
flush().
Shell got {trace,<0.60.0>,gc_start,
                 [{old_heap_block_size,0},
                  {heap_block_size,233},
                  {mbuf_size,145},
                  {recent_size,0},
                  {stack_size,9},
                  {old_heap_size,0},
                  {heap_size,211},
                  {bin_vheap_size,0},
                  {bin_vheap_block_size,46422},
                  {bin_old_vheap_size,0},
                  {bin_old_vheap_block_size,46422}]}
Shell got {trace,<0.60.0>,gc_max_heap_size,
                 [{old_heap_block_size,0},
                  {heap_block_size,609},
                  {mbuf_size,145},
                  {recent_size,0},
                  {stack_size,9},
                  {old_heap_size,0},
                  {heap_size,211},
                  {bin_vheap_size,0},
                  {bin_vheap_block_size,46422},
                  {bin_old_vheap_size,0},
                  {bin_old_vheap_block_size,46422}]}
Shell got {trace,<0.60.0>,exit,killed}

Anders Svensson and others added 11 commits March 13, 2016 07:10
To let a callback module decide whether or not to receive another message
from the peer, so that backpressure can be applied when it's
inappropriate. This lets a callback protect against reading more
than can be processed, which is otherwise possible since diameter_tcp
always asks for more.

A callback is made after each message, and can answer to continue
reading or to ask again after a timeout. It's per message instead of
per packet partly for simplicity, but also because this should be
sufficiently fine-grained. Per packet would require some interaction
with the fragment timer that flushes partial messages that haven't been
completely received.
The callback is now applied to the atom 'false' when asking if another
message should be received on the socket, and to a received binary
message after reception. Throttling on received messages makes it
possible to distinguish between requests and answers.

There is no callback on outgoing messages since these don't have to go
through the transport process, even if they currently do.
In addition to returning ok or {timeout, Tmo}, let a throttling callback
for message reception return a pid(), which is then notified if the
message in question is either discarded or results in a request process.
Notification is by way of messages of the form

  {diameter, discard | {request, pid()}}

where the pid is that of a request process resulting from the received
message. This allows the notification process to keep track of the
maximum number of request processes a peer connection can have given
rise to.
This can be used as a simple form of overload protection, discarding the
message before it's passed into diameter to become one more request
process in a flood. However, replying with 3004 (DIAMETER_TOO_BUSY) would
be more appropriate when the request has been directed at a specific
server (the RFC's requirement), and possibly it should be possible for a
callback to do this as well.
As discussed in the parent commit. This is easier said than done in
practice, but there's no harm in allowing it.
TCP packets can contain more than one message, so only ask to receive
another message if it hasn't already been received.
In particular, let a callback decide when to receive the initial
message.
By sending {diameter, {answer, pid()}} when an incoming answer is sent
to the specified pid, instead of a discard message as previously. The
latter now literally means that the message has been discarded.
That is, don't assume that it's only diameter_tcp doing so: allow it to
be received when not throttling. This lets a callback module trigger a
new throttling callback itself, but it's not clear if this will be
useful in practice.
called 'literal_mmap' and 'exec_mmap'.

Also moved existing erts_mmap info from 'mseg_alloc'
to its own system_info({allocator, erts_mmap})

with "allocators" default_mmap, literal_mmap and exec_mmap.
@garazdawi garazdawi added team:VM Assigned to OTP team VM feature labels Apr 26, 2016
@garazdawi garazdawi self-assigned this Apr 26, 2016
@ghost

ghost commented Apr 26, 2016

Sounds useful. Any plans to use this in error_logger, where it would restart in a controlled fashion and then maybe be used in tandem with a watermark, or a more practical strategy?

@garazdawi
Contributor Author

garazdawi commented Apr 26, 2016

@Tuncer Not at the moment. Depending on how useful we find this feature to be, we may expand the configuration to allow a message to be sent to a user-defined process instead of just the error_logger.

<marker id="+hmax"/>
<tag><c><![CDATA[+hmax Size]]></c></tag>
<item>
<p>Sets the default maximum heap size of processes to the size
Contributor


You probably want to have something here saying that +hmax defaults to being disabled (being set to 0). That isn't clear from the other documentation.

Contributor Author


fixed

@okeuday
Contributor

okeuday commented Apr 26, 2016

@garazdawi Thank you for adding this!

@OTP-Maintainer

Patch has passed first testings and has been assigned to be reviewed


I am a script, I am not human


@DeadZen
Contributor

DeadZen commented Apr 27, 2016

@garazdawi Seconded, kudos for adding this!

@cmullaparthi

This is very useful, thank you! All we now need is an option for max_msg_queue_len and it will give us processes with bounded mailbox queues. If queue length is exceeded, the process should get killed and generate a crash report.

@garazdawi
Contributor Author

@cmullaparthi Bounded message queues will most likely never be introduced. It is too expensive to keep track of even in small SMP systems, and it gets much worse as the number of NUMA nodes grows. Some other mechanism should be used to limit the flow of incoming messages, for instance a windowing scheme similar to how {active, N} works for gen_tcp.
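A windowing scheme of the kind mentioned can be built in plain Erlang today. A hypothetical sketch (none of this is an OTP API; the credit/msg message formats are invented for illustration):

```erlang
%% Credit-based flow control between two processes, loosely analogous
%% to gen_tcp's {active, N}: the consumer grants the producer a window
%% of N message credits, and the producer only sends while credits remain.
consumer(Producer, Window) ->
    Producer ! {credit, self(), Window},
    consumer_loop(Producer, Window, Window).

consumer_loop(Producer, 0, Window) ->
    Producer ! {credit, self(), Window},   % window drained: renew it
    consumer_loop(Producer, Window, Window);
consumer_loop(Producer, Left, Window) ->
    receive
        {msg, _Payload} ->
            consumer_loop(Producer, Left - 1, Window)
    end.

producer(Consumer, Items) ->
    receive
        {credit, Consumer, N} ->
            {Now, Later} = split_at_most(N, Items),
            [Consumer ! {msg, I} || I <- Now],
            producer(Consumer, Later)
    end.

%% Take at most N elements from a list.
split_at_most(N, L) when length(L) =< N -> {L, []};
split_at_most(N, L) -> lists:split(N, L).
```

The producer's mailbox stays bounded by the window size regardless of how fast the consumer drains its queue, which is the property a bounded mailbox would otherwise have to enforce globally.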

@cmullaparthi

@garazdawi Isn't the message queue length already tracked? I can do process_info(pid(), message_queue_len) and get back the length of the message queue. I'm assuming BEAM already keeps track of the length here rather than counting it each time the above call is made?

@psyeugenic
Contributor

@cmullaparthi The inner message queue is tracked in the current implementation, and it is only the inner queue you get from process_info(pid(), message_queue_len), not the "total" message queue length. The tracking might change, though. If we find it necessary, or find a more scalable solution for the queues, we might change the tracking and instead calculate the queue length at process_info call time.

I say "total" in quotes because it is a bit vague when we say a message actually arrives at the receiving process. The outer queue(s) might just not be a part of the process. Think about how two processes, on different nodes, communicate with each other.

The thing to realize here is that we have one single queue only conceptually; that is not necessarily the case in the implementation. It could be a whole tree of isolated queues. We don't want to expose even more of those contention points where we believe we have only one queue. It's just bad.

The absolute worst case would be to let the sender examine the total queue length of the receiver and kill it if the queue length exceeds some limit. That would just be insane.

One thing that might be needed is for OTP to help developers with flow control, since that seems to be an issue: some standard way to do a sliding window protocol for processes.

@cmullaparthi

cmullaparthi commented Apr 27, 2016

@psyeugenic Thanks for the explanation. Perhaps there is some confusion here. When I referred to 'max_msg_queue_len', I had in mind a setting specific to a process, not the total number of message queues in the system. So the same way this max_heap_size option has been implemented, one could spawn a new process as:

spawn(Module, Function, Args, [{max_msg_queue_len, 1000}, ...])

And if the number of messages in the queue for this specific process exceeds 1000, it gets killed. As far as I understand, the scheduler obtains a lock on a process' message queue before depositing a message. Presumably, it is at this point that message_queue_len is incremented? Surely any messages which are en route are irrelevant. The check should only be performed at the point of insertion into the queue.

By introducing the max_heap_size option, haven't you indirectly supported this feature? The amount of memory occupied by the message queue is considered to be part of the process heap size? Which means if the process builds up a message queue, its heap size will increase and cause it to combust?

@cmullaparthi

Yes, some flow control for message passing would be great. gen_tcp already has it: I almost always write TCP handling code with {active, once}. Golang provides flow control in channels, which I think is quite powerful. Though it is a simpler use case, because channels are only valid within a single OS process, whereas Erlang's message passing spans nodes, so it is impossible to provide some of those features.

That said, perhaps some of the flow control features can be limited to message-passing between processes on the same node...

Apologies for hijacking this thread. Happy to move this conversation elsewhere; this max_heap_size option just triggered a long-term itch :-)

@psyeugenic
Contributor

@cmullaparthi Nope, I was not talking about a global queue. I was talking about per process message queue(s). Reread my comments above with that mindset. =)

An Erlang process has multiple message queues. In the current implementation it has two queues, an inner and an outer queue. A sending process may touch the outer queue but never the inner queue of the receiver. The receiving process never touches the outer queue except when the inner queue is empty. Scalability: don't touch it. But the internals are very much beside the point. The internals may change. We don't want to expose internals or give guarantees that would kill performance.

This desire for message queue monitors or load shedders keeps coming back. I'm not indifferent to the issue or the desire for a simple solution. I'm telling you there isn't one. 😞

I totally agree, this conversation is not in the scope of this Pull Request.


@garazdawi
Contributor Author

By introducing the max_heap_size option, haven't you indirectly supported this feature? The amount of memory occupied by the message queue is considered to be part of the process heap size? Which means if the process builds up a message queue, its heap size will increase and cause it to combust?

@cmullaparthi No, not all messages are part of the heap. Pre-19, or when using the new on_heap message queue data option in 19.0, messages that are known to be in the queue are part of the heap. So if you don't inspect the queue somehow, either by doing a selective receive that is known to have skipped the message or by calling process_info(Pid, messages|message_queue_len), there is no way to know whether the message is counted as part of the heap.

My example in the PR description was very poorly chosen: it is only because I've written a fun in the shell that the limit gets triggered. If the same code were written in a module, the process would never be killed.
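Since message queue placement affects what counts towards the limit, a process that expects a large queue can combine the limit with off_heap message queue data. A minimal sketch (flag names as discussed in this thread; the limit of 512 words is arbitrary):

```erlang
%% With off_heap message queue data (OTP 19+), messages sitting
%% unreceived in the queue are not part of the process heap, so they
%% do not count towards max_heap_size until matched by a receive.
Pid = spawn_opt(fun() -> receive stop -> ok end end,
                [{message_queue_data, off_heap},
                 {max_heap_size, 512}]).
```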

@priestjim

Great feature @garazdawi 👍 ! I think we'd all love some kind of knob for non-heap binaries as well :)

Ulf Wiger and others added 25 commits May 9, 2016 14:51
new protocol version to handle new schema fields
Should maybe be moved to mnesia.erl and inlined??
Or is it used elsewhere?
Add ext to table/system information
Add add_backend_type
Make ram_copies index always use ordered_set

And use index type as preferred type, not an implementation requirement;
the standard implementation will currently ignore the preferred type.
…orary processes

Tables or data containers should be owned and monitored by mnesia_monitor and
should thus be created by that process.

Always create_table before loading it

We need to create tables for ram_copies at least before loading
them as they are intermittent. It is also needed to get mnesia
monitor as the parent and supervisor of the data storage.
Minimal impact when talking to older nodes.
* dgud/mnesia/ext-backend/PR-858/OTP-13058:
  mnesia_ext: Add basic backend extension tests
  mnesia_ext: reuse snmp field for ext updates
  mnesia_ext: Create table/data containers from mnesia monitor not temporary processes
  mnesia_ext: Implement ext copies index
  mnesia_ext: Load table ext
  mnesia_ext: Dumper and schema changes
  mnesia_ext: Refactor mnesia_schema.erl
  mnesia_ext: Ext support in fragmented tables
  mnesia_ext: Backup handling
  mnesia_ext: Create schema functionality
  mnesia_ext: Add ext copies and db_fold to low level api
  mnesia_ext: Refactor record_validation code
  mnesia_ext: Add create_external and increase protocol version to monitor
  mnesia_ext: Add ext copies to records
  mnesia_ext: Add supervisor and behaviour modules
* anders/diameter/test/OTP-13438:
  Don't assume list comprehension evaluation order
* anders/diameter/overload/OTP-13330:
  Suppress dialyzer warning
  Remove dead case clause
  Let throttling callback send a throttle message
  Acknowledge answers to notification pids when throttling
  Throttle properly with TLS
  Don't ask throttling callback to receive more unless needed
  Let a throttling callback answer a received message
  Let a throttling callback discard a received message
  Let throttling callback return a notification pid
  Make throttling callbacks on message reception
  Add diameter_tcp option throttle_cb
* anders/diameter/info/OTP-13508:
  Add diameter:peer_find/1
  Add diameter:peer_info/1
@garazdawi garazdawi force-pushed the lukas/erts/max_heap_size/OTP-13174 branch from 90d2278 to dc30187 Compare May 10, 2016 08:33

@proxyles proxyles merged commit dc30187 into erlang:master May 12, 2016
@garazdawi garazdawi deleted the lukas/erts/max_heap_size/OTP-13174 branch February 25, 2017 10:35
@isaacsanders

Would a report style error be more appropriate here? Using a format style error seems a little limiting from the perspective of the upstream user.

I don't know much, and would like to understand a bit more.
