
Add hex encoding functions to the binary module #3014

Merged: 1 commit into erlang:master, Mar 22, 2021

Conversation

@gearnode (Contributor) commented Jan 29, 2021

I propose adding the encode_hex/1, encode_hex/2 and decode_hex/1
functions to the binary module.

Dealing with the hex format is frequent in many codebases. At this time,
we all copy-paste similar helpers into our codebases to encode and decode
hex-encoded binaries.

That is why I suggest enriching the binary module with these functions.

I added these functions to the binary module because it makes sense
to me, and it avoids polluting the global namespace with a new
module. Maybe creating a new module called hex, for example, would be a
better solution.

I tried as much as possible to follow the contribution guide. Don't
hesitate to tell me if I made a mistake somewhere.
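For illustration, this is how the proposed functions behave in a shell (as eventually merged; available from OTP 24):

```erlang
1> binary:encode_hex(<<"f">>).   %% $f is 102, i.e. 16#66
<<"66">>
2> binary:decode_hex(<<"66">>).
<<"f">>
```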

@mikpe (Contributor) commented Jan 30, 2021

LGTM. My only objection is the naming: I think to_hex/1, to_hex/2, and from_hex/1 would be more in line with other data conversions in Erlang.

@essen (Contributor) commented Jan 30, 2021

An option to add a space between pairs or quads of hex values in to_hex would also be good:

ab ab ab ab
abcd abcd abcd abcd

Once you have that you can replace the hex implementation in ssl used for debugging.
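A minimal sketch of the grouping essen describes, built on top of the function this PR adds. The name to_hex_grouped/2 is illustrative and not part of the PR; GroupBytes is the number of input bytes per group (1 for pairs, 2 for quads):

```erlang
%% Hypothetical sketch, not from the PR. Assumes the input size is a
%% multiple of GroupBytes; a non-matching tail is silently dropped by
%% the bit-string generator.
to_hex_grouped(Bin, GroupBytes) ->
    Hex = binary:encode_hex(Bin),   %% binary:encode_hex/1, OTP 24+
    Width = GroupBytes * 2,         %% hex characters per group
    lists:join($\s, [G || <<G:Width/binary>> <= Hex]).
```

For example, iolist_to_binary(to_hex_grouped(<<171, 171, 171, 171>>, 1)) gives <<"AB AB AB AB">>.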

@rlipscombe (Contributor) commented Jan 30, 2021

A couple of points:

  1. I'm concerned about the performance hit of invoking the upper/lower function indirectly. Admittedly, I haven't measured it yet. I've often used this conversion very frequently (e.g. for logging), and a slow implementation can murder your performance.
  2. +1 to @essen's point about separators, though I might also add the option to use a colon or a hyphen (think MAC addresses).

-spec enc_hex_digit_upper(0..15) -> byte().
enc_hex_digit_upper(Char) when Char =< 9 ->
    Char + $0;
enc_hex_digit_upper(Char) when Char =< 15 ->
    Char - 10 + $A.
A reviewer (Contributor) commented on this code:

The when Char =< 15 guard is redundant and should IMO be dropped for performance reasons.

@gearnode (author) replied Jan 30, 2021:

After benchmarking, it appears that deleting the clause does not significantly improve performance. Without a real performance improvement, I prefer to keep the code readable. Don't hesitate to share benchmarks if you have any that support your opinion.

@gearnode (author) commented:

> I'm concerned about the performance hit of invoking the upper/lower function indirectly. Obviously, I've not measured it yet. I've often used this conversion really frequently (logging, e.g.) and a slow implementation can murder your performance.

I benchmarked an implementation without indirect function calls, using timer:tc/3 and fprof: no performance improvement at all. But I'm not an expert, and maybe my benchmark tests are wrong. Don't hesitate to run your own benchmarks to verify that the implementation I propose performs well.
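As a reference point, a single call can be timed with timer:tc/3 like this (a sketch; the 64 kB payload is an arbitrary choice, and the function name is the one from the final version of the patch):

```erlang
%% Minimal timing sketch; payload size is an arbitrary choice.
Payload = crypto:strong_rand_bytes(64000),
{Micros, _Hex} = timer:tc(binary, encode_hex, [Payload]),
io:format("binary:encode_hex/1 on 64 kB: ~p us~n", [Micros]).
```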

@michalmuskala (Contributor) commented Jan 31, 2021

So right now Erlang is going to have a base64 module for base-64 encoding, and functions for base-16 encoding in the binary module. This seems less than ideal.

What about having a separate base module with functions for both base-64 and base-16 encoding, and deprecating the base64 module?

@galdor commented Jan 31, 2021

> So right now Erlang is going to have a base64 module for base-64 encoding, and functions for base-16 encoding in the binary module. This seems less than ideal.

While base16 and hex encoding are indeed the same, most developers know what hex encoding is but have no idea what base16 is. Having simple hex encoding functions available would avoid multiple re-implementations everywhere.

> What about having a separate base module with functions for both base-64 and base-16 encoding, and deprecating the base64 module?

That would indeed be nice to have, but the scope of such a contribution would be much larger, especially since it would involve a new module (and therefore a new name in the global module namespace) and the deprecation of an existing one. If someone were to come forward to write, test and document a base module (and it would probably require an EEP first), the hex functions in binary could then be updated to use base, keeping an interface under the term "hex encoding" that so many people know. Therefore, having these functions right now would be useful and would not block any further improvement.

@mikpe (Contributor) commented Jan 31, 2021

My opinion is that conversion to/from hex is essentially like any other number conversion (binary, octal, decimal) while base64 conversion is substantially different, so I would not want to conflate them.
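The integer conversions mikpe is comparing against already support base 16 today:

```erlang
1> integer_to_binary(1234, 16).
<<"4D2">>
2> binary_to_integer(<<"4D2">>, 16).
1234
3> binary_to_integer(<<"4d2">>, 16).   %% parsing accepts both cases
1234
```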

@rickard-green rickard-green added the team:VM Assigned to OTP team VM label Feb 1, 2021
@garazdawi (Contributor) commented:

I tend to use io_lib:format("~.16B", [1234]) and then io_lib:fread("~16u", "4D2") to do the reverse. From the discussions in this PR, I assume that this is a no-go because of performance issues?

Giving this a good name will be hard; the best/silliest I came up with was erlang:binary_to_binary(Binary, 16). :)

Maybe we should focus less on the base that we convert to and more on the transformation we are doing? I.e. binary:to_string/2, binary:to_text/2 or binary:to_base/2? Or binary:transcode(Binary, [{base,16}])?

@odiferousmint commented Feb 2, 2021

> While base16 and hex encodings are indeed the same, most developers know what hex encoding is, but have no idea about base16. Having simple hex encoding functions available would avoid multiple re-implementations everywhere.

I find this line of reasoning detrimental in general. Perhaps improve the documentation to make those developers aware of what base16 is, but I do not believe we should introduce inconsistencies because some developers lack particular knowledge. We should not change the libraries to appeal to them; instead, they should learn, and we can help them with that through documentation, tutorials and/or examples.

> Maybe we should focus less on the base that we convert to and more on the transformation we are doing? i.e. binary:to_string/2 or binary:to_text/2 or binary:to_base/2? Or binary:transcode(Binary,[{base,16}])?

The last one looks good enough. I am not entirely sure about the naming (although it looks OK to me), but being able to specify the base seems good, in which case I am not sure what to do with base64. If we only add hex, then adding base16 might make more sense for consistency.

Personally, I consider these two options good. Either:

  1. make base16, as there already is a base64; or
  2. get rid of base64 and have only one base module where you can specify the base yourself, as in your last comment.

The second probably will not happen overnight: base64 has to be flagged as deprecated first and phased out slowly. We can create base in the meantime and recommend that people use it.

@gearnode (author) commented Feb 4, 2021

@essen Interesting idea; feel free to send a patch.

@max-au (Contributor) commented Feb 22, 2021

> Or binary:transcode(Binary,[{base,16}])?

This one sounds great! We also use base32 and base62, which would be trivial to add given a common API.

@ranjanified (Contributor) commented:

> > Or binary:transcode(Binary,[{base,16}])?
>
> This one sounds great! We are also using base32 and base62, which would be trivial to add having a common API.

base62??

@garazdawi (Contributor) commented:

The OTP Technical Board spent some time today talking about this PR and has come to the conclusion that we like the inclusion of this functionality as a simple conversion function for binaries in the binary module. The naming is consistent with the naming in the binary module (i.e. binary:encode_unsigned/1), so we agree with your choice there.

We do, however, want to keep this function simple, which means no configuration options, and the default should be uppercase (the same as integer_to_binary(..., 16)). If users want lowercase, they can do string:lowercase(binary:encode_hex(<<123, 123>>)). The decode function should still be able to handle both upper- and lowercase (the same as list_to_integer(..., 16)).

On a more general note: we would not mind the addition of a base module that includes more complex conversions of iodata() to hexadecimal and other bases. It could deal with most of the different cases for printing and parsing hexadecimal mentioned in this PR, and be inspired by the module of the same name in Elixir. Maybe it would even make sense to add a new ~ format character to io_lib for printing iodata() in hexadecimal format?

As with most things, a decision is never final, and we may very well change our opinion about whether there should be an option to binary:encode_hex/2 or whether the base module is a good idea. The above is our decision at this moment.
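Concretely, the decision above plays out like this in a shell (on OTP 24 or later, where these functions exist):

```erlang
1> binary:encode_hex(<<202, 254>>).                    %% uppercase by default
<<"CAFE">>
2> string:lowercase(binary:encode_hex(<<202, 254>>)).  %% lowercase via string module
<<"cafe">>
3> binary:decode_hex(<<"cAfE">>).                      %% decoding accepts both cases
<<202,254>>
```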

@garazdawi (Contributor) commented:

> I tend to use io_lib:format("~.16B",[1234]). and then io_lib:fread("~16u","4D2"). to do the reverse.

Just realized that this part of my previous comment makes no sense. Not sure what happened there, as the rest of the comment talks about the correct things.

@gearnode gearnode force-pushed the add-binary-hex-encode-decode branch from 7506549 to 51a1f8e Compare March 12, 2021 13:43
@garazdawi garazdawi self-assigned this Mar 12, 2021
@garazdawi garazdawi added the enhancement and testing labels ("testing": currently being tested, tag used by OTP internal CI) and removed the testing label Mar 12, 2021
@garazdawi (Contributor) commented:

Thanks for the updates! The stdlib binary_module_SUITE:error_info test case fails.

You can run the tests like this: (cd lib/stdlib && make test ARGS="-suite binary_module_SUITE -case error_info")

@garazdawi garazdawi removed the testing currently being tested, tag is used by OTP internal CI label Mar 16, 2021
@gearnode (author) commented:

@garazdawi Fixed :)

Commit message: Dealing with hex format is frequent in many codebases. At this time, we all copy-paste similar helpers into our codebases to encode and decode a hex-encoded binary.
@gearnode gearnode force-pushed the add-binary-hex-encode-decode branch from 746c67d to 46abba5 Compare March 18, 2021 16:03
@garazdawi garazdawi added the testing currently being tested, tag is used by OTP internal CI label Mar 18, 2021
@garazdawi garazdawi merged commit 46abba5 into erlang:master Mar 22, 2021
@NelsonVides (Contributor) commented Mar 26, 2021

Oh, I missed this PR, but I have some input when it comes to algorithms. Currently I'm using this implementation, https://github.com/esl/base16/blob/master/src/base16.erl, which uses lookup tables, and on OTP 23 (non-JIT) it benchmarks like:

Benchee.run(
  %{
    "encode optimised" => fn input -> :base16.encode(input) end,
    "encode otp" => fn input -> :bench_erl.encode_hex(input) end
  },
  inputs: %{
    "1. Tiny" => :crypto.strong_rand_bytes(8),
    "2. Small" => :crypto.strong_rand_bytes(64),
    "3. Medium" => :crypto.strong_rand_bytes(256),
    "4. Medium Big" => :crypto.strong_rand_bytes(512),
    "5. Big" => :crypto.strong_rand_bytes(1024),
    "6. Large" => :crypto.strong_rand_bytes(64000),
    "7. Random" => :crypto.strong_rand_bytes(:rand.uniform(10000))
  },
  parallel: 12,
  memory_time: 5
)

Benchee.run(
  %{
    "decode optimised" => fn input -> :base16.decode(input) end,
    "decode otp" => fn input -> :bench_erl.my_decode(input) end
  },
  inputs: %{
    "1. Tiny" => :bench_erl.encode(:crypto.strong_rand_bytes(8)),
    "2. Small" => :bench_erl.encode(:crypto.strong_rand_bytes(64)),
    "3. Medium" => :bench_erl.encode(:crypto.strong_rand_bytes(256)),
    "4. Medium Big" => :bench_erl.encode(:crypto.strong_rand_bytes(512)),
    "5. Big" => :bench_erl.encode(:crypto.strong_rand_bytes(1024)),
    "6. Large" => :bench_erl.encode(:crypto.strong_rand_bytes(64000)),
    "7. Random" => :bench_erl.encode(:crypto.strong_rand_bytes(:rand.uniform(10000)))
  },
  parallel: 12,
  memory_time: 5
)

With results:

Operating System: Linux
CPU Information: Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz
Number of Available Cores: 12
Available memory: 30.99 GB
Elixir 1.11.0
Erlang 23.3

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 5 s
parallel: 12
inputs: 1. Tiny, 2. Small, 3. Medium, 4. Medium Big, 5. Big, 6. Large, 7. Random
Estimated total run time: 2.80 min

Benchmarking encode optimised with input 1. Tiny...
Benchmarking encode optimised with input 2. Small...
Benchmarking encode optimised with input 3. Medium...
Benchmarking encode optimised with input 4. Medium Big...
Benchmarking encode optimised with input 5. Big...
Benchmarking encode optimised with input 6. Large...
Benchmarking encode optimised with input 7. Random...
Benchmarking encode otp with input 1. Tiny...
Benchmarking encode otp with input 2. Small...
Benchmarking encode otp with input 3. Medium...
Benchmarking encode otp with input 4. Medium Big...
Benchmarking encode otp with input 5. Big...
Benchmarking encode otp with input 6. Large...
Benchmarking encode otp with input 7. Random...

##### With input 1. Tiny #####
Name                       ips        average  deviation         median         99th %
encode optimised      908.89 K        1.10 μs  ±3724.79%        0.77 μs        2.30 μs
encode otp            448.41 K        2.23 μs  ±1979.01%        1.62 μs        5.41 μs

Comparison:
encode optimised      908.89 K
encode otp            448.41 K - 2.03x slower +1.13 μs

Memory usage statistics:

Name                Memory usage
encode optimised           128 B
encode otp                 408 B - 3.19x memory usage +280 B

**All measurements for memory usage were the same**

##### With input 2. Small #####
Name                       ips        average  deviation         median         99th %
encode optimised      256.43 K        3.90 μs   ±867.06%        3.36 μs        7.78 μs
encode otp             92.94 K       10.76 μs   ±281.73%        9.96 μs       18.00 μs

Comparison:
encode optimised      256.43 K
encode otp             92.94 K - 2.76x slower +6.86 μs

Memory usage statistics:

Name                Memory usage
encode optimised        0.125 KB
encode otp               2.59 KB - 20.69x memory usage +2.46 KB

**All measurements for memory usage were the same**

##### With input 3. Medium #####
Name                       ips        average  deviation         median         99th %
encode optimised       70.45 K       14.19 μs   ±209.09%       13.17 μs       22.23 μs
encode otp             23.67 K       42.24 μs    ±80.39%       41.49 μs       49.08 μs

Comparison:
encode optimised       70.45 K
encode otp             23.67 K - 2.98x slower +28.05 μs

Memory usage statistics:

Name                Memory usage
encode optimised        0.125 KB
encode otp              10.09 KB - 80.69x memory usage +9.96 KB

**All measurements for memory usage were the same**

##### With input 4. Medium Big #####
Name                       ips        average  deviation         median         99th %
encode optimised       35.54 K       28.13 μs   ±235.21%       26.53 μs       44.21 μs
encode otp             12.07 K       82.86 μs   ±155.40%       81.86 μs       88.05 μs

Comparison:
encode optimised       35.54 K
encode otp             12.07 K - 2.95x slower +54.72 μs

Memory usage statistics:

Name                Memory usage
encode optimised        0.125 KB
encode otp              20.09 KB - 160.69x memory usage +19.96 KB

**All measurements for memory usage were the same**

##### With input 5. Big #####
Name                       ips        average  deviation         median         99th %
encode optimised       18.15 K       55.10 μs    ±75.57%       51.81 μs       88.18 μs
encode otp              6.02 K      166.05 μs    ±37.60%      164.49 μs      174.95 μs

Comparison:
encode optimised       18.15 K
encode otp              6.02 K - 3.01x slower +110.95 μs

Memory usage statistics:

Name                Memory usage
encode optimised        0.125 KB
encode otp              40.09 KB - 320.69x memory usage +39.96 KB

**All measurements for memory usage were the same**

##### With input 6. Large #####
Name                       ips        average  deviation         median         99th %
encode optimised        311.45        3.21 ms    ±12.57%        3.19 ms        3.45 ms
encode otp               97.38       10.27 ms    ±35.90%       10.02 ms       10.73 ms

Comparison:
encode optimised        311.45
encode otp               97.38 - 3.20x slower +7.06 ms

Memory usage statistics:

Name                Memory usage
encode optimised      0.00012 MB
encode otp               2.44 MB - 20000.69x memory usage +2.44 MB

**All measurements for memory usage were the same**

##### With input 7. Random #####
Name                       ips        average  deviation         median         99th %
encode optimised        4.09 K      244.30 μs    ±41.26%      241.09 μs      316.31 μs
encode otp              1.29 K      776.50 μs    ±12.24%      767.62 μs      871.66 μs

Comparison:
encode optimised        4.09 K
encode otp              1.29 K - 3.18x slower +532.19 μs

Memory usage statistics:

Name                Memory usage
encode optimised        0.125 KB
encode otp             185.71 KB - 1485.69x memory usage +185.59 KB

**All measurements for memory usage were the same**
Operating System: Linux
CPU Information: Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz
Number of Available Cores: 12
Available memory: 30.99 GB
Elixir 1.11.0
Erlang 23.3

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 5 s
parallel: 12
inputs: 1. Tiny, 2. Small, 3. Medium, 4. Medium Big, 5. Big, 6. Large, 7. Random
Estimated total run time: 2.80 min

Benchmarking decode optimised with input 1. Tiny...
Benchmarking decode optimised with input 2. Small...
Benchmarking decode optimised with input 3. Medium...
Benchmarking decode optimised with input 4. Medium Big...
Benchmarking decode optimised with input 5. Big...
Benchmarking decode optimised with input 6. Large...
Benchmarking decode optimised with input 7. Random...
Benchmarking decode otp with input 1. Tiny...
Benchmarking decode otp with input 2. Small...
Benchmarking decode otp with input 3. Medium...
Benchmarking decode otp with input 4. Medium Big...
Benchmarking decode otp with input 5. Big...
Benchmarking decode otp with input 6. Large...
Benchmarking decode otp with input 7. Random...

##### With input 1. Tiny #####
Name                       ips        average  deviation         median         99th %
decode optimised      790.44 K        1.27 μs  ±3368.82%        1.01 μs        2.69 μs
decode otp            729.25 K        1.37 μs  ±2298.80%        1.13 μs        2.96 μs

Comparison:
decode optimised      790.44 K
decode otp            729.25 K - 1.08x slower +0.106 μs

Memory usage statistics:

Name                Memory usage
decode optimised           128 B
decode otp                 128 B - 1.00x memory usage +0 B

**All measurements for memory usage were the same**

##### With input 2. Small #####
Name                       ips        average  deviation         median         99th %
decode optimised      173.43 K        5.77 μs   ±491.13%        5.41 μs        9.51 μs
decode otp            172.33 K        5.80 μs   ±520.81%        5.47 μs        9.59 μs

Comparison:
decode optimised      173.43 K
decode otp            172.33 K - 1.01x slower +0.0366 μs

Memory usage statistics:

Name                Memory usage
decode optimised           128 B
decode otp                 128 B - 1.00x memory usage +0 B

**All measurements for memory usage were the same**

##### With input 3. Medium #####
Name                       ips        average  deviation         median         99th %
decode otp             40.85 K       24.48 μs   ±352.31%       23.89 μs       33.99 μs
decode optimised       40.37 K       24.77 μs   ±202.45%       24.13 μs       33.94 μs

Comparison:
decode otp             40.85 K
decode optimised       40.37 K - 1.01x slower +0.29 μs

Memory usage statistics:

Name                Memory usage
decode otp                 128 B
decode optimised           128 B - 1.00x memory usage +0 B

**All measurements for memory usage were the same**

##### With input 4. Medium Big #####
Name                       ips        average  deviation         median         99th %
decode otp             18.45 K       54.21 μs    ±52.78%       53.24 μs       76.57 μs
decode optimised       18.12 K       55.19 μs   ±210.14%       53.89 μs       77.39 μs

Comparison:
decode otp             18.45 K
decode optimised       18.12 K - 1.02x slower +0.98 μs

Memory usage statistics:

Name                Memory usage
decode otp                 128 B
decode optimised           128 B - 1.00x memory usage +0 B

**All measurements for memory usage were the same**

##### With input 5. Big #####
Name                       ips        average  deviation         median         99th %
decode otp              9.02 K      110.90 μs    ±34.58%      110.12 μs      145.32 μs
decode optimised        8.99 K      111.30 μs    ±54.76%      110.12 μs      147.16 μs

Comparison:
decode otp              9.02 K
decode optimised        8.99 K - 1.00x slower +0.39 μs

Memory usage statistics:

Name                Memory usage
decode otp                 128 B
decode optimised           128 B - 1.00x memory usage +0 B

**All measurements for memory usage were the same**

##### With input 6. Large #####
Name                       ips        average  deviation         median         99th %
decode optimised        145.34        6.88 ms     ±4.26%        6.83 ms        7.26 ms
decode otp              144.77        6.91 ms     ±7.36%        6.85 ms        7.52 ms

Comparison:
decode optimised        145.34
decode otp              144.77 - 1.00x slower +0.0268 ms

Memory usage statistics:

Name                Memory usage
decode optimised           128 B
decode otp                 128 B - 1.00x memory usage +0 B

**All measurements for memory usage were the same**

##### With input 7. Random #####
Name                       ips        average  deviation         median         99th %
decode optimised        4.61 K      217.10 μs    ±29.38%      216.02 μs      244.12 μs
decode otp              4.59 K      218.05 μs    ±41.66%      216.60 μs      242.93 μs

Comparison:
decode optimised        4.61 K
decode otp              4.59 K - 1.00x slower +0.94 μs

Memory usage statistics:

Name                Memory usage
decode optimised           128 B
decode otp                 128 B - 1.00x memory usage +0 B

**All measurements for memory usage were the same**

That is, always constant memory consumption vs. linear, plus usually 2-3x faster, regardless of the input size.

I haven't built benchmarks using a BEAM with a JIT, but what do you think of such an algorithm? I could adapt it to always return uppercase and raise errors as OTP does, and make a PR 🙂
@garazdawi @gearnode

Edit: the .beam file gets bigger of course, and actually the difference is in encoding; decoding is pretty much exactly the same.
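A minimal sketch of the lookup-table technique behind esl/base16 (module and function names here are illustrative, not taken from that library or from the PR): precompute the two hex characters for every possible byte once, then map each input byte through the table.

```erlang
%% Hypothetical sketch of a lookup-table hex encoder. For clarity the
%% table is rebuilt on every call; real implementations build it once
%% (e.g. at compile time), which is where the speedup comes from.
-module(hex_lut).
-export([encode/1]).

encode(Bin) when is_binary(Bin) ->
    T = table(),
    << <<(element(B + 1, T))/binary>> || <<B>> <= Bin >>.

table() ->
    list_to_tuple([<<(digit(B bsr 4)), (digit(B band 15))>>
                   || B <- lists:seq(0, 255)]).

digit(D) when D < 10 -> D + $0;
digit(D)             -> D - 10 + $A.
```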

@NelsonVides (Contributor) commented Mar 27, 2021

I got all too curious about this: adding Elixir's Base.encode16/1 to the benchmarks also yields constant memory consumption (208 B for Elixir vs 128 B for my algorithm), while being on average 15-20% faster than mine 🤔 😄
Base.decode16/1 uses a constant 1136 B of memory and is around 10x faster!
