Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add selectable alphabets (standard vs urlsafe) to base64 #6280

Merged
merged 1 commit into from
Sep 15, 2022

Conversation

juhlig
Copy link
Contributor

@juhlig juhlig commented Sep 5, 2022

In #5639 an addition to the base64 module to support an alternative, URL-safe encoding alphabet (base64url) was suggested. In the alternative encoding, the characters + and / (RFC 4648 Section 4), which may be problematic in URLs and file names, are replaced with - and _ (RFC 4648 Section 5), respectively.

This PR adds the new functions encode/2, encode_to_string/2, decode/2, decode_to_string/2, mime_decode/2 and mime_decode_to_string/2 as supplements of the respective existing 1-ary functions of the same names. The second parameters may be one of the atoms standard (meaning the section 4 alphabet; default for the 1-ary functions) and urlsafe (the section 5 alphabet), which denote the alphabet to be used for encoding and decoding. The decoding functions also accept undefined, via which they accept characters from both alphabets.

I'm still not clear on how to best spec this, and documentation is another matter, so this is a WIP.
As for tests, I added two new unit tests for now. If this PR is to be accepted, I'll extend the property test suite accordingly.

@github-actions
Copy link
Contributor

github-actions bot commented Sep 5, 2022

CT Test Results

       2 files       86 suites   33m 37s ⏱️
1 802 tests 1 754 ✔️ 48 💤 0
2 084 runs  2 034 ✔️ 50 💤 0

Results for commit 05e61dc.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally.

Artifacts

// Erlang/OTP Github Action Bot

@bjorng bjorng self-assigned this Sep 6, 2022
@bjorng bjorng added team:VM Assigned to OTP team VM feature labels Sep 6, 2022
@bjorng
Copy link
Contributor

bjorng commented Sep 6, 2022

Having worked on improving the performance of base64 (mainly in #6031, #6259, and #6274), I would prefer that additions to the base64 module are done in a way that doesn't lower performance more than necessary.

Running base64_bench on my M1 MacBook Pro on the master branch, the results are the following:

== Testing with 1 MB ==
fun base64:encode/1: 1000 iterations in 3386 ms: 295 it/sec
fun base64:decode/1: 1000 iterations in 5167 ms: 193 it/sec

The results with this pull request are:

== Testing with 1 MB ==
fun base64:encode/1: 1000 iterations in 4142 ms: 241 it/sec
fun base64:decode/1: 1000 iterations in 6232 ms: 160 it/sec

One way to improve the performance for encoding without having completely separate code for the standard and urlsafe modes is to rewrite b64e/2 like this:

b64e(X, Offset) ->
    element(X+Offset,
	    {$A, $B, $C, $D, $E, $F, $G, $H, $I, $J, $K, $L, $M, $N,
	     $O, $P, $Q, $R, $S, $T, $U, $V, $W, $X, $Y, $Z,
	     $a, $b, $c, $d, $e, $f, $g, $h, $i, $j, $k, $l, $m, $n,
	     $o, $p, $q, $r, $s, $t, $u, $v, $w, $x, $y, $z,
	     $0, $1, $2, $3, $4, $5, $6, $7, $8, $9, $+, $/,
             $A, $B, $C, $D, $E, $F, $G, $H, $I, $J, $K, $L, $M, $N,
	     $O, $P, $Q, $R, $S, $T, $U, $V, $W, $X, $Y, $Z,
	     $a, $b, $c, $d, $e, $f, $g, $h, $i, $j, $k, $l, $m, $n,
	     $o, $p, $q, $r, $s, $t, $u, $v, $w, $x, $y, $z,
	     $0, $1, $2, $3, $4, $5, $6, $7, $8, $9, $-, $_}).

where Offset is to be calculated in the following way:

    Offset = case Mode of
                 standard -> 1;
                 urlsafe -> 65
             end

The benchmark results are:

== Testing with 1 MB ==
fun base64:encode/1: 1000 iterations in 3523 ms: 283 it/sec
fun base64:decode/1: 1000 iterations in 6240 ms: 160 it/sec

which is closer to the original performance and probably acceptable.

It should be possible to optimize decoding in a similar way.

@juhlig
Copy link
Contributor Author

juhlig commented Sep 6, 2022

@bjorng I refactored as you suggested. Running the base64_bench on my machine, it looks like it is roughly on par with current master, but it would be good if you could double-check ;)

lib/stdlib/src/base64.erl Outdated Show resolved Hide resolved
lib/stdlib/src/base64.erl Outdated Show resolved Hide resolved
@bjorng bjorng self-requested a review September 7, 2022 03:14
Copy link
Contributor

@bjorng bjorng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

Benchmark results on my Intel Mac are almost exactly the same as for the master branch.

Regarding the specs, I think that the easiest solution is to extend base64_alphabet() to include the characters from both variants of Base64.

It seems that you are not testing the new mime_decode functions. I suggest that you add tests for them in the funs in the mime_decode/1 and mime_decode_to_string/1 test cases.

Since this is an extension of an API, the OTP Technical Board will have to approve it.

@juhlig
Copy link
Contributor Author

juhlig commented Sep 8, 2022

Looks good to me.

Benchmark results on my Intel Mac are almost exactly the same as for the master branch.

Great =)

Regarding the specs, I think that the easiest solution is to extend base64_alphabet() to include the characters from both variants of Base64.

Ok, then I'll do it that way.

It seems that you are not testing the new mime_decode functions. I suggest that you add tests for them in the funs in the mime_decode/1 and mime_decode_to_string/1 test cases.

Yeah, and there is also the property test suite that needs to be extended to cover the new functions. I'm currently working on that.

Since this is an extension of an API, the OTP Technical Board will have to approve it.

Sure.

@Maria-12648430
Copy link
Contributor

Maria-12648430 commented Sep 8, 2022

Something that just occurred to me when reading this PR... the undefined decoding mode (accepting both + / - and / / _ to refer to 62 and 63 respectively), while convenient if you don't know how the data was encoded, may be a bit dangerous, at least material for surprises, especially when used with mime_decode/2 (vs decode/2). @bjorng, you may want to bring that up with the OTB when discussing this extension.


Given the encoded string <<"AB+C+D/EF/G-H-I_J_KL">>, this mime-decodes differently in the different modes:

Mode      | ignored chars | cleaned string equivalent  | decodes to
=================================================================================================================
standard  | "-" and "_"   | <<"AB+C+D/E/FGHIJKL">>     | <<0,31,130,248,63,196,252,81,135,32,146,139>>
----------+---------------+----------------------------+---------------------------------------------------------
urlsafe   | "+" and "/"   | <<"ABCDEFG-H-I_J_KL">>     | <<0,16,131,16,81,190,31,226,63,39,242,139>>
----------+---------------+----------------------------+---------------------------------------------------------
undefined | none          | <<"AB+C+D/E/FG-H-I_J_KL">> | <<0,31,130,248,63,196,252,81,190,31,226,63,39,242,139>>
----------+---------------+----------------------------+---------------------------------------------------------

(It is similar with decode/2: the first two cases would result in an exception, the last case would decode.)

IMO, it is ok and expected that the same string may decode differently in standard and urlsafe modes, given that there the user has explicitly specified exactly how he wants the data to be interpreted. However, that undefined decodes yet differently than any one of the specific modes looks like a problem to me.


The possible ways I can think of to address the problem:

  • remove undefined, so the user has to specify how he wants the data to be interpreted and decoded
  • keep undefined but put a warning in the docs (but who ever reads docs before it is too late?)
  • rename undefined to something like mixed, to remove the impression of auto-detection (undefined --> "dunno, figure it out")

@juhlig
Copy link
Contributor Author

juhlig commented Sep 8, 2022

Good point @Maria-12648430, you do have a knack for things like that 😬

I'll leave things as they are for now, any of your suggestions is easy to do, whatever the decision may be.

@bjorng
Copy link
Contributor

bjorng commented Sep 8, 2022

@Maria-12648430 Yes, I will bring that up with the OTB.

@Maria-12648430
Copy link
Contributor

you do have a knack for things like that

More like a dark spot in my soul that urges me to poke holes in shiny things 😇

@juhlig
Copy link
Contributor Author

juhlig commented Sep 8, 2022

More like a dark spot in my soul that urges me to poke holes in shiny things 😇

That is why no company that values its pride should ever hire you 😁 You would find three different ways to capsize their Unsinkable Flagship®️ on your first day.

@Maria-12648430
Copy link
Contributor

You would find three different ways to capsize their Unsinkable Flagship®️ on your first day.

And enjoy it 😺 But make that 5 ways 😁

@juhlig
Copy link
Contributor Author

juhlig commented Sep 9, 2022

The last push adds specs, tests for mime_decode/2 (in separate cases, I think they don't fit well into the existing ones), and documentation. (Don't worry about the sloppy commit messages, I'll squash and tidy them up in the end 😉)

(On a side note, some of the links to RFC 4648 were broken in the docs, and I fixed them while I was at it.)

The property test suite will be extended later by @Maria-12648430 (thanks 🤗), since she implemented the current test suite and has more experience with property testing than I do.

Last but not least, I have been discussing the undefined decoding mode some more with @Maria-12648430, and we came to the conclusion that it would be best to remove/not include it. It is brittle, prone to be mistaken for auto-detection, and even if understood correctly (as mixed) there seems little real use for it. I have spec'ed and documented it all the same, in case the OTB sees any value in it after all, but by and large we would opt for removal and force users to be explicit.

@RaimoNiskanen
Copy link
Contributor

undefined could make a decision at the first +, /, -, _ character and then require all to be from the same alphabet.
I do not know if that would improve usefulness (enough)...

@Maria-12648430
Copy link
Contributor

Maria-12648430 commented Sep 9, 2022

Hi @RaimoNiskanen 🙂

undefined could make a decision at the first +, /, -, _ character and then require all to be from the same alphabet.

We have been discussing this also, but, well...

First, it would make the code more complex. There would either have to be extra clauses and/or guards on practically all functions involved in the decoding (and quite many of them, like undefined and +, undefined and -, etc; and as many functions work on several encoded characters at once, even undefined and + in the first character, undefined and + in the second character, etcetc), or on the helper function b64d with repercussions into the decoding functions, primarily meaning more case clauses or entire new case constructs where there were none before.
Both would degrade performance, something that @bjorng will understandably not like to see, given that he has worked hard in the past to improve base64 performance.

Second, it would still be brittle, just in a different way. It would be guesswork. I think it is better to not give raise to the illusion that the right alphabet will always be detected automatically.

Third and last, I don't think it is very useful in itself, possibly dangerous even. Users should know where the data to decode originates from, and what kind of alphabet to (not) expect/accept.
If we had undefined, users would just use it out of convenience and for good measure (because if you can have it both ways, why limit yourself to one?), under the illusion that the right alphabet will be reliably detected, as I said above.
Moreover, that may even work in development and pass all test cases, and even work in production for a long time, but every now and then strange things (incorrectly decoded and therefore corrupt or even mailicious data) may happen, for no obvious reason.

@Maria-12648430
Copy link
Contributor

The last commit adds the property tests. I didn't bother with undefined mode, though. That can be added later if necessary 😉

@bjorng
Copy link
Contributor

bjorng commented Sep 14, 2022

The OTB meeting agreed with the conclusion reached in the previous comments on this PR that the undefined option is harmful and that it should be removed. When you have removed it and cleaned up the commits, I will add this PR to our daily builds.

@Maria-12648430
Copy link
Contributor

The last commit adds the property tests.

Btw, I forgot to mention that I also fixed something that was pointed out by @seriyps after I did the initial property test suite, here.

@juhlig juhlig changed the title WIP: Add selectable alphabets (standard vs urlsafe) to base64 Add selectable alphabets (standard vs urlsafe) to base64 Sep 14, 2022
@juhlig
Copy link
Contributor Author

juhlig commented Sep 14, 2022

@bjorng all done, I removed undefined mode from code, tests and docs, squashed the commits and provided a nicer commit message.

@bjorng bjorng added the testing currently being tested, tag is used by OTP internal CI label Sep 14, 2022
@bjorng
Copy link
Contributor

bjorng commented Sep 14, 2022

Thanks! Added to our daily builds.

@bjorng
Copy link
Contributor

bjorng commented Sep 15, 2022

I forgot about the since tags. Please add "OTP @OTP-18247@" to the since tags for the new functions.

RFC 4648 defines two possible alphabets that may be used for
encoding and decoding, the standard alphabet in Section 4 and
an alternative URL and Filename safe alphabet in Section 5.

This commit adds the ability to specify one of the alphabets
for encoding and decoding.

Co-authored-by: Maria Scott <maria-12648430@hnc-agency.org>
@juhlig
Copy link
Contributor Author

juhlig commented Sep 15, 2022

I forgot about the since tags.

Dto. 😆

Please add "OTP @OTP-18247@" to the since tags for the new functions.

Done

@bjorng bjorng merged commit efbb529 into erlang:master Sep 15, 2022
@bjorng
Copy link
Contributor

bjorng commented Sep 15, 2022

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature team:VM Assigned to OTP team VM testing currently being tested, tag is used by OTP internal CI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants