jit: Optimized code generation for select_val #3043

Merged: 2 commits merged into erlang:master on Feb 17, 2021

Conversation

@frej (Contributor) commented Feb 8, 2021

Two patches optimizing the code generation for select_val:

  • Optimize the case when a select_val has exactly two choices which branch to the same label or a fail destination and where the values only differ by one bit.

  • The i_select_val_bins instruction has previously been implemented by a call to a global subroutine which does a binary search in a table. This patch changes the code generator to emit the binary search as code.

Optimize the case when a select_val has exactly two choices which
branch to the same label or a fail destination and where the values
only differ by one bit.

The optimization makes use of the observation that (V == X || V == Y)
is equivalent to (V | (X ^ Y)) == (X | Y) when (X ^ Y) has only one
bit set, which allows the select_val to be implemented with only one
compare. This optimization is unconditionally performed by both GCC
and LLVM on x86_64.
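The rewrite described above can be sketched in C as follows. This is an illustrative sketch, not code from the JIT itself; the function name is made up for this example. The precondition is that x and y differ in exactly one bit, i.e. popcount(x ^ y) == 1.

```c
#include <stdint.h>

/* Single-compare test for (v == x || v == y), valid only when
 * x and y differ in exactly one bit. Setting that differing bit
 * in v maps both x and y to the same value, x | y, and maps no
 * other value there. */
static int matches_either(uint64_t v, uint64_t x, uint64_t y)
{
    /* (v == x || v == y)  <=>  (v | (x ^ y)) == (x | y)
       when x ^ y has exactly one bit set. */
    return (v | (x ^ y)) == (x | y);
}
```

The one-bit precondition matters: with x = 1 and y = 4, x ^ y = 5 has two bits set, and v = 5 would then satisfy (v | 5) == 5 even though it matches neither value.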

When this optimization is implemented in the code generator, the
change in performance compared to the unmodified system varies
depending on the probability distribution of the value the select_val
operates on. For illustration, consider a {select_val, X, Fail,
[{1,Dest}, {2,Dest}]} instruction where the input consists only of
ones. In this case, a slowdown of 8% is observed. If the same
instruction is fed only twos, the slowdown decreases to around 3%. If
the input is changed to an even mix of 1, 2, and 3, the performance
change depends strongly on how well the branch predictor functions in
the unmodified system. If the input is a repeated sequence, i.e.
(123)+ or 1+2+3+, the slowdown is between 3% and 4%. If the input is
completely random, so that the branch predictor cannot help, the
optimization provides a speedup of ~25%.

For the larger benchmarks in ebench's `small` class, the only
statistically significant performance changes are that `decode` is ~4%
faster and `fib` is ~2.5% faster.

The i_select_val_bins instruction has previously been implemented by a
call to a global subroutine which does a binary search in a
table. This patch changes the code generator to emit the binary search
as code.
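For reference, the search the legacy global subroutine performs can be sketched as a binary search over a sorted table of tagged values. This is an illustrative sketch, not the emulator's actual code; this patch makes the JIT emit an equivalent search as inline compare-and-branch code instead of calling such a routine.

```c
#include <stddef.h>
#include <stdint.h>

/* Binary search for key in a sorted table of n tagged values.
 * Returns the index of the matching entry (whose destination label
 * would be taken), or -1 to signal the fail destination. */
static ptrdiff_t select_val_bins(uint64_t key,
                                 const uint64_t *table, size_t n)
{
    size_t lo = 0, hi = n;          /* search the half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (table[mid] < key)
            lo = mid + 1;
        else if (table[mid] > key)
            hi = mid;
        else
            return (ptrdiff_t)mid;  /* matching destination found */
    }
    return -1;                      /* no match: take the fail destination */
}
```

Emitting this search inline lets the generated code use immediate operands for the keys where they fit, which is why the speedups below depend on whether the tagged keys fit in 16- or 32-bit immediates.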

The changed code generation strategy provides varying speedups
compared to the table search, depending on the number of choices and
size of the values searched for, but is in no case slower than the
legacy implementation.

For i_select_val_bins instructions with more than 11 and fewer than 33
choices, a speedup of more than 40% can be expected. For a larger
number of choices (up to 4096) where the largest key's tagged form
fits in a 16 bit immediate, a speedup of between 40% and 65% can be
expected: peak performance, at a speedup of more than 60%, occurs at
a table size of 512 elements and falls off to 40% when the table
contains 4096 keys. If the key values are too large to fit in a 16
bit immediate, performance falls off as the table grows: for large
keys (the tagged value does not fit in 32 bits), peak performance
occurs at a table size of 256 choices with a speedup of 70%, but
falls off to 25% when the number of choices is increased to 4096.

For small keys (the tagged small integer fits in a 16 bit word),
memory use is reduced by around 5%. With large keys (the tagged value
does not fit in 32 bits), the size required for an i_select_val_bins
instruction can increase by up to 10%.
@bjorng bjorng added enhancement team:VM Assigned to OTP team VM labels Feb 9, 2021
@bjorng bjorng self-assigned this Feb 9, 2021
@bjorng bjorng added the testing currently being tested, tag is used by OTP internal CI label Feb 9, 2021
@bjorng bjorng changed the title jit: Optimized code generaton for select_val jit: Optimized code generation for select_val Feb 9, 2021
@garazdawi (Contributor) left a comment


lgtm

@bjorng bjorng merged commit e22ec8b into erlang:master Feb 17, 2021
@bjorng (Contributor) commented Feb 17, 2021

Thanks for your pull request.

@frej frej deleted the frej/select branch February 17, 2021 13:42