Do not allocate a new tuple when the value is the same #1890

Closed

Conversation

josevalim
Contributor

NOTE: This patch is wrong. Do not merge, see the first comment.

This patch optimizes tuple operations to not allocate a new tuple
when an element is being replaced by the exact same value in memory.

Imagine this very common idiom:

    Record#my_record{key = compute_new_value(Value, Condition)}

where:

    compute_new_value(X, true) -> X + 1;
    compute_new_value(X, false) -> X.

In many cases the value in `key` does not change, yet the code
prior to this patch would still allocate a new tuple. This patch
changes that.

The cost of the optimization is minimal, as it only adds a C array
access plus a pointer comparison. The major benefit is reduced GC
pressure, since we avoid allocating data.
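
The intended behaviour can be modelled at the Erlang level roughly as
follows. This is illustrative only: the real check is a single pointer
comparison inside the C implementation of setelement/3, which Erlang
code cannot observe, so `=:=` merely stands in for it, and
`maybe_setelement/3` is a made-up name.

    %% Erlang-level sketch of the optimization: if the element already
    %% holds the value being written, reuse the tuple and allocate nothing.
    maybe_setelement(Index, Tuple, Value) ->
        case element(Index, Tuple) =:= Value of
            true  -> Tuple;                                %% no allocation
            false -> erlang:setelement(Index, Tuple, Value)
        end.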

We have only changed erlang:setelement/3. The benchmarks basically
create a tuple and perform the same operation roughly 20000 times,
once replacing an element with the same value and once with a
different value.

For a tuple with 4 elements, replacing the fourth element 20000 times
went from 557us to 388us.

For a tuple with 8 elements, replacing the fourth element 20000 times
went from 641us to 421us.
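
The benchmark source is not included here, so the following is only a
sketch of the kind of micro-benchmark described; the module name, the
loop/4 helper and the element values are assumptions.

    -module(setelement_bench).
    -export([run/0]).

    run() ->
        Tuple = {a, b, c, d},
        Same = element(4, Tuple),
        {SameUs, _} = timer:tc(fun() -> loop(20000, Tuple, Same, Tuple) end),
        {DiffUs, _} = timer:tc(fun() -> loop(20000, Tuple, other, Tuple) end),
        io:format("same value: ~pus, different value: ~pus~n", [SameUs, DiffUs]).

    %% Replace the fourth element N times, always starting from the original
    %% tuple; the result is threaded through so the calls are not dead code.
    loop(0, _Tuple, _Value, Last) ->
        Last;
    loop(N, Tuple, Value, _Last) ->
        loop(N - 1, Tuple, Value, erlang:setelement(4, Tuple, Value)).
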
@josevalim
Contributor Author

josevalim commented Jul 25, 2018

Unfortunately this patch does not work as is. That's because the compiler may use the dsetelement operation, which assumes that setelement returns a new tuple and therefore performs operations in place. In order to solve this, we need to change how the compiler works.

One option is to break the current setelement followed by a dsetelement into two operations: one that copies the tuple and checks its size, followed by multiple dsetelement operations. A setelement + dsetelement would therefore become a copy_and_check + dsetelement + dsetelement. The downside of this approach is that the optimization will only apply when updating a single key, limiting its use in records.

The other option is to replace the dsetelement operation with a new operation that is similar to how update_map_exact works. With this new operation, we will receive a tuple and a list of positions and values to update in the tuple. This makes it much easier to apply this particular optimization, but it may have downsides.
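
For concreteness, the semantics of such an operation could be modelled in plain Erlang roughly as below. The name update_tuple/2 and the list-of-pairs shape are assumptions, not a settled design, and the real version would be a single BEAM instruction that copies the tuple at most once rather than once per pair as this model does.

    %% Erlang-level model of a multi-element tuple update: apply each
    %% {Index, Value} pair to the tuple in turn.
    update_tuple(Tuple, IndexValuePairs) when is_tuple(Tuple) ->
        lists:foldl(fun({Index, Value}, Acc) ->
                            erlang:setelement(Index, Acc, Value)
                    end, Tuple, IndexValuePairs).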

I have sent a PR to get the discussion started and figure out if there is a way to move this optimization forward. Thank you!

@jhogberg added the team:VM and enhancement labels Jul 26, 2018
@josevalim
Contributor Author

I will ping @bjorng because, even though the final impact of this change is the VM, we will need to change the compiler to achieve it (if we want to do this change in the first place). :)

@bjorng self-assigned this Aug 7, 2018
@bjorng
Contributor

bjorng commented Aug 7, 2018

The other option is to replace the dsetelement operation with a new operation that is similar to how update_map_exact works. With this new operation, we will receive a tuple and a list of positions and values to update in the tuple.

That seems to be the best way to do it. That would speed up updating of records. We have considered adding such new instructions anyway as an optimization.

When loading old code that uses dsetelement, some kind of loader jiggery pokery could be used to ensure that the code continues to work (replacing calls to setelement/3 with calls to legacy_setelement/3 or similar). I can think of several ways to do that.

@josevalim
Contributor Author

That seems to be the best way to do it. That would speed up updating of records. We have considered adding such new instructions anyway as an optimization.

Do you think it would speed up even when updating a single element? Or do you think we should still keep setelement for single updates and updates where the position is not known at compile time?

@bjorng
Contributor

bjorng commented Aug 7, 2018

Do you think it would speed up even when updating a single element?

The update in itself would not be faster. But since the instruction would not clobber the X registers, there could be an indirect speedup because of less register shuffling and possibly a smaller stack frame. So it would probably be worth using the new instruction even when updating only a single element at a known position.

Or do you think we should still keep setelement for single updates and updates where the position is not known at compile time?

We must definitely keep it for updates where the position is not known at compile time. It must also be kept in case it is called via apply. Users could also expect calls to setelement/3 to be traceable, so maybe we should not replace explicit calls to setelement/3 with the instruction.

@josevalim
Contributor Author

Users could also expect that calls to setelement/3 should be possible to trace, so maybe we should not replace explicit calls to setelement/3 with the instruction.

I always forget about traceability. I think this introduces a dilemma. We don't have dedicated syntax for updating tuples. The record syntax compiles to setelement calls, and we can trigger the dsetelement optimization by hand, which means a loss of traceability:

bar(Tuple) ->
  erlang:setelement(1, erlang:setelement(2, Tuple, bar), baz).

emits:

0000000016F0D788: i_func_info_IaaI 0 foo bar 1
0000000016F0D7B0: allocate_tt 0 1
0000000016F0D7B8: move_rx r(0) x(1)
0000000016F0D7C0: move_x2_c bar
0000000016F0D7D0: move_cr 2 r(0)
0000000016F0D7E0: call_bif_e erlang:setelement/3
0000000016F0D7F0: set_tuple_element_sSP baz x(0) 0
0000000016F0D808: deallocate_return_Q 0

In fact, we leverage the fact that the optimization can be triggered by hand in Elixir's implementation of records.

However, the optimization above only applies when the values being set are safe and when the updates happen in a certain order. I believe we can easily lift the order restriction. We can lift the safety restriction too, but it means a slight change of error semantics in cases like this:

erlang:setelement(1, erlang:setelement(20, Tuple, bar), error(oops)).

Note that the inner setelement operation is meant to be out of bounds. If we lift the safety restriction, we have to reorder the operations above so that we get:

Var1 = bar,
Var2 = error(oops),
erlang:setelement(1, erlang:setelement(20, Tuple, Var1), Var2).

Now we get a different error than the original one. This can be problematic in the face of side effects.

On one hand, we want to optimize code as much as possible. On the other hand, we want to keep the semantics clear, including tracing semantics. Assuming we are going to introduce a new operation, we need to choose when this new operation will be invoked. I believe we have to answer some questions:

  1. Do we want to extract the arguments and their expressions to variables so we always achieve safety? I would say no as we are playing with the semantics a bit too dangerously.

  2. Assuming the arguments are always safe: do we want to apply the optimization regardless of whether setting the higher position comes first or not? My vote is yes, because there is no effective change of semantics. In case of errors, the error is the same: badarg. Implementation-wise, this implies we either need to pass the higher position as a separate argument to the underlying VM operation or we reorder the operations so that the higher position always comes first.

  3. Do we want to use this new operation even if we are updating a single element? I would say yes, since according to @bjorng this can bring benefits too.

However, applying rules 1 and 3 means we lose traceability in even more cases. One option to circumvent this is to introduce a new node in the Erlang Abstract Format for updating multiple tuple elements in a row. With this we can keep traceability, but it adds an AST node that doesn't have a syntactical representation. It also means that all code that leverages dsetel today will no longer do so unless it is explicitly updated to the new node.

I have a simpler proposal though: to always apply the optimization whenever the position is known at compile time and the arguments are safe. As far as I know, this already happens in other cases, such as invoking erlang:+/2. If we want to be extra clear, we can add a note to the erlang:setelement/3 docs that says: "erlang:setelement/3 may not be traceable (via erlang:trace/3) in cases where the position is a literal number, as the compiler may optimize the call away."
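
For context on the tracing in question (illustrative only, not part of the proposal): enabling call tracing of setelement/3 for a process looks roughly like this, and such a trace only fires as long as the compiler emits an actual call to the BIF.

    %% Trace every call to erlang:setelement/3 made by Pid; trace messages
    %% are delivered to the tracing process.
    trace_setelement(Pid) ->
        erlang:trace(Pid, true, [call]),
        erlang:trace_pattern({erlang, setelement, 3}, true, [global]).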

@michalmuskala
Contributor

I think it's fine for the setelement operation not to be traceable - it's already not traceable in some cases, so I'd say it's fine to go all the way.

In a function like:

foo(X, Y, Z) ->
    setelement(1, setelement(2, X, Z), Y).

We're only able to trace the setelement(2, X, Z) operation, but not the setelement(1, _, Y) one.

@bjorng
Contributor

bjorng commented Aug 7, 2018

I think it's fine for the setelement operation not to be traceable - it's already not traceable in some cases, so I'd say it's fine to go all the way.

I think I agree. Explicit use of setelement/3 is probably uncommon anyway. But we will have to discuss this in the OTP team next week when more people are back from their vacations.

Assuming the arguments are always safe: do we want to apply the optimization regardless if setting the higher position comes first or not? My vote is yes, because there is no effective change of semantics.

I agree.

Do we want to extract the arguments and their expressions to variables so we always achieve safety? I would say no as we are playing with the semantics a bit too dangerously.

I am not sure that I understand you completely. Do you mean that we should not change the evaluation order to make the optimization possible in more cases?

For example:

foo(Tuple) ->
    setelement(1, setelement(20, Tuple, bar), bar()).

Which is equivalent to:

foo(Tuple) ->
    T1 = setelement(20, Tuple, bar),
    T2 = bar(),
    setelement(1, T1, T2).

In this case, the optimization should not be applied. Is that what you meant?

@josevalim
Contributor Author

I am not sure that I understand you completely. Do you mean that we should not change the evaluation order to make the optimization possible in more cases?

Yes. I mean that we should not change the evaluation order to make the optimization possible in more cases.

In this case:

foo(Tuple) ->
    T1 = setelement(20, Tuple, bar),
    T2 = bar(),
    setelement(1, T1, T2).

We will end up invoking the new operation twice, copying the tuple twice: once to change the element at position 20 and again to change it at position 1.

While in this case:

foo(Tuple) ->
    T2 = bar(),
    T1 = setelement(20, Tuple, bar),
    setelement(1, T1, T2).

The operation is called only once and it will perform both changes at once, effectively copying the tuple a single time.

Here is my proposal:

  • Introduce a new operation similar to update_map_exact but for tuples. The operation will be invoked whenever there is a setelement call whose arguments are safe and whose index is known at compile time
  • If there are multiple safe setelement calls in a row, they will be merged and reordered, allowing the tuple to be copied only once (see the example below). When reordering the positions, the highest position should come first, so the implementation only needs to do a bounds check on the first position
  • The old dsetelement operations will be rewritten to the new one on load
  • The existing setelement and the new operation will only copy the tuple if any of the elements effectively change (which is the original goal of this PR)

We will need a compiler pass that will convert the setelement calls into the new tuple operation. I assume a core pass is still the best way to go?
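
As an illustration of the merging bullet above (example only; the merged operation does not exist yet): both indices below are compile-time literals and both values are already-bound variables, so the two calls could be fused into a single update that copies the tuple once, bounds-checked against index 5 only.

    %% Under the proposal, these two safe setelement/3 calls with literal
    %% indices would become one tuple copy updating positions 5 and 2.
    set_two(Tuple, A, B) ->
        erlang:setelement(2, erlang:setelement(5, Tuple, A), B).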

@michalmuskala
Contributor

A potential way to get around the issue of not changing the error semantics would be to emit a sort of assert_tuple_size operation in place of each setelement and then fuse them into the new instruction at the end. A later pass could optimise away the needless assertions.

@bjorng
Contributor

bjorng commented Aug 7, 2018

Yes. I mean that we should not change the evaluation order to make the optimization possible in more cases.

Good. I agree.

I have a suggestion for a slightly different implementation. I suggest that it should be more like update_map_assoc, that is, it can't fail. It must be preceded by either is_tuple + test_arity or is_tagged_tuple, and the compiler must guarantee that all indices are within bounds. (beam_validator will of course complain if those rules are not followed.)

The reason for that is that before updating an Erlang record, there will always be code that checks the size and record tag of the tuple. Having the check in the tuple update operation too would be unnecessary. How would that work for Elixir? Do you call setelement/3 on terms of unknown type, or do you also always test the type before updating?
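
To illustrate the record case (example only, not from the PR): the record pattern in the function head compiles to an is_tagged_tuple test (or is_tuple + test_arity), so the tag and arity are already verified before any update instruction runs.

    -record(state, {count = 0, name}).

    %% The #state{} match in the head checks the tag and arity, so a
    %% subsequent tuple-update instruction would not need its own check.
    bump(#state{count = N} = State) ->
        State#state{count = N + 1}.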

@josevalim
Contributor Author

To expand on what @michalmuskala said. Imagine this operation:

erlang:setelement(
  1,
  erlang:setelement(
    10,
    erlang:setelement(
      5,
      Tuple,
      some_fun()
    ),
    another_fun()
  ),
  final_fun()
)

In this case it becomes this:

A1 = some_fun(),
B1 = erlang:setelement(5, Tuple, A1),
B2 = another_fun(),
C1 = erlang:setelement(10, B1, B2),
C2 = final_fun(),
erlang:setelement(1, C1, C2).

As we can see, we can't merge those operations because we have function calls in between the setelement calls. However, @michalmuskala suggested that we split the data validation from the data updating, in order to be able to merge them. So we could rewrite the code above to:

A1 = some_fun(),
assert_tuple_size(Tuple, 5),
B2 = another_fun(),
assert_tuple_size(Tuple, 10),
C2 = final_fun(),
%% assert_tuple_size(Tuple, 1) - no need to assert as we already checked for 10
update_tuple_elems(Tuple, {10, B2}, {5, A1}, {1, C2}).

It feels like this could work. However, I don't want to bite off more than I can chew :) so I suggest going ahead with the earlier proposal; we can consider this once that is in place.

@josevalim
Contributor Author

I have a suggestion for a slightly different implementation. I suggest that it should be more like update_map_assoc, that is, it can't fail.

Sorry @bjorng, we were writing at the same time. I think we reached similar conclusions through different means? I think we can make it so update_tuple_elems does not fail, but I propose we automatically add the relevant assertions in case they are not there yet.

We could enable those optimizations only if the assertions are already there but I would instinctively prefer if we can apply the optimization in as many cases as possible. Do you see any downsides?

PS: In Elixir we don't check the tuple but we could make it so. Thanks for asking!

@bjorng
Contributor

bjorng commented Aug 7, 2018

I think we can make it so update_tuple_elems does not fail but I propose we automatically add the relevant assertions in case they are not there yet.

I don't have a strong opinion, but I think I would slightly prefer to keep a setelement/3 call when the type is unknown. The reason is that the call to setelement/3 will do the type check and generate a badarg exception if something is wrong, all in a single instruction. Adding instructions to check the type would make the code slower, and there would have to be more than one instruction unless we introduce a new assert_tuple_size instruction. Also, explicit setelement/3 calls by the user would probably not be preceded by type tests, so the call would not be optimized and would still be traceable.

Do you see any downsides?

Only really the (lack of) traceability of explicit setelement/3 calls. I don't use tracing much myself, but I know that there are many users who see tracing in production systems as a huge benefit. On the other hand, explicit calls to setelement/3 are probably uncommon, so it's probably not much of an issue anyway.

In Elixir we don't check the tuple but we could make it so.

OK. So automatically adding assert_tuple_size instructions would simplify the Elixir compiler and probably reduce the code size of the generated code?

@josevalim
Contributor Author

The reason is that the call to setelement/3 will do the type check and generate a badarg exception if something is wrong in a single instruction.

I see. For the single setelement call, converting a setelement/3 into a size assertion + update means we are doing more work. Question: couldn't we keep them as two separate operations in the compiler and merge them into a single one in the loader?

Also, explicit setelement/3 calls by the user would probably be without type tests preceding it, so the call would not be optimized and still be traceable.

To me this is a downside, especially because there is code that would be optimized today but would no longer be optimized if we require an explicit size assertion.

OK. So automatically adding assert_tuple_size instructions would simplify the Elixir compiler and probably reduce the code size of the generated code?

I am not worried about the Elixir compiler in this case. :) I will be happy to change it to do whatever the Erlang compiler requires.

I took a quick look at the OTP and Elixir codebases and there are very few setelement calls. So we are mostly talking about records here and, since we control the code emitted for records, any of the options is likely fine.

To summarize, we have three different approaches:

  1. Introduce a new operation that performs both checking and updating. I believe this is the least complex approach since it is similar to how dsetel works today. It will, however, apply to more cases than dsetel, which may mean increased performance but less traceability. Finally, since the checking is embedded in the operation, there are some cases where we would check more than once (as we do in Erlang record updates today), but the performance cost here is probably insignificant.

  2. Introduce a new operation that performs only the updating. This means we can only apply the optimization if there is an assertion on the tuple size (or similar) beforehand. As a consequence, the optimization is enabled in fewer cases than today, but it gives traceability back.

  3. Introduce two new operations, one for asserting and another for updating. This is likely the most complex option, but it increases the number of cases where we can apply the instruction.

Anyway, I have enough information to move forward and we can resume this discussion once I have made a bit more progress. I will go with 1 or 2, depending on the implementation complexity and, once a PR is sent, we can discuss accordingly. Thanks!

@bjorng
Contributor

bjorng commented Aug 8, 2018

I would prefer 2 or 3. I am not sure that 3 is actually that much more complex or harder to implement, but who knows what turns up when one actually starts implementing it...

Anyway, I have enough information to move forward and we can resume this discussion once I have made a bit more progress. I will go with 1 or 2, depending on the implementation complexity and, once a PR is sent, we can discuss accordingly.

Yes, an actual implementation will be useful to continue the discussion.

I assume that you are aware of our ongoing work with the SSA intermediate format in https://github.com/bjorng/otp/tree/bjorn/compiler/ssa? Part of your implementation will have to change when we merge that branch (probably before the end of August or early September).

@josevalim
Contributor Author

Beautiful. Since we have about 1 year to get this in, I will wait for the SSA pass to be merged and then resume work. I will close this for now. Thanks!
