Refactor base64 encoding #14611
Conversation
I keep forgetting this exists
Co-authored-by: David Keller <davidkeller@tuta.io>
```crystal
  # Clear out each buffer before each spec
  encode_buffer = IO::Memory.new
  decode_buffer = IO::Memory.new
end
```
These 6 lines can be removed, and the initialization put into each `it` block. Otherwise this PR looks good.
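A minimal sketch of that suggestion (the spec name is hypothetical, not taken from the PR):

```crystal
it "encodes into a fresh buffer" do
  # Each spec creates its own buffers, so no shared setup needs clearing.
  encode_buffer = IO::Memory.new
  decode_buffer = IO::Memory.new
  # ... exercise the encoding methods against these buffers ...
end
```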
I thought memmove was the function which disallowed overlapped memory regions for performance reasons
As a note to anyone reviewing this: the bulk encoding part (line 297 in b662f34) was adapted from the previous encoding method (line 236 in f9ffa35). The old method used an unaligned U32 access. This can also be seen in the generated IR:

```llvm
body:                                             ; preds = %while
  %122 = load ptr, ptr %cstr, align 8, !dbg !17334
  %123 = load i32, ptr %122, align 4, !dbg !17334 ; <- Notice the align 4
  %124 = call i32 @"*UInt32#byte_swap:UInt32"(i32 %123), !dbg !17335
```
The new variant should now always work correctly, but may be less performant than only reading 3 bytes on weak-memory architectures. Hopefully we'll find an even more efficient method for the bulk encoding work once Crystal gets SIMD support.
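For context, a hedged sketch of the byte-wise pattern this describes (variable names are assumed here, not taken from the PR): four input bytes are copied without an aligned load, swapped into big-endian order, and the top 24 bits are split into four 6-bit alphabet indices.

```crystal
# Byte-wise copy: no alignment assumption, unlike the IR's `load i32 ... align 4`.
value = 0_u32
pointerof(value).as(UInt8*).copy_from(input, count: 4)
value = value.byte_swap # little-endian target assumed, as discussed below

output[0] = chars[(value >> 26) & 0x3F]
output[1] = chars[(value >> 20) & 0x3F]
output[2] = chars[(value >> 14) & 0x3F]
output[3] = chars[(value >> 8) & 0x3F]
```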
Co-authored-by: Jason Frey <fryguy9@gmail.com>
This looks great! Thanks for improving on my PR!
I finally took the time to review this PR. Great work!
The main issue is passing raw pointers around, especially for buffers (pointer + size), even to private internal methods. In Crystal we always pass slices, and only reach for raw pointers to circumvent some safety checks when we can prove it's safe (see the sketch after this list).
Then the PR could be split into a couple of PRs, to distinguish:
- the refactor & support for IO input/output;
- the unaligned fix/performance improvement for encoding pairs (with before/after benchmarks for x86_64 & aarch64).
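A minimal sketch of the slice-passing convention (the helper name `shift_tail` is hypothetical, not from the PR):

```crystal
# Bytes (= Slice(UInt8)) carries its own length, so no separate size
# parameter is needed, and out-of-bounds access raises instead of
# silently corrupting memory.
private def shift_tail(buffer : Bytes, unprocessed : Int32) : Nil
  tail = buffer[buffer.size - unprocessed, unprocessed]
  tail.move_to(buffer) # overlap-safe; compiles down to a memmove
end
```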
```crystal
      break if read_bytes == 0

      # Move unprocessed bytes to the beginning of input_buffer
      Intrinsics.memmove(input_buffer.to_unsafe, input_buffer.to_unsafe + unprocessable_bytes, unprocessable_bytes, is_volatile: false) unless unprocessable_bytes == 0
```
suggestion: we shouldn't have to call for intrinsics directly in usual code. The `#move_to` and `#move_from` methods on `Pointer` and `Slice` are available for this purpose.
The abstraction will be optimized away, except for the safety checks. We could consider unsafe variants, but so far we never noticed significant bottlenecks (though we'd love to be proved wrong).

issue: the move looks wrong; it's copying the first N bytes in place, when it should copy the last N bytes to the first N bytes?

```crystal
last_bytes = Bytes.new(input_slice.to_unsafe - unprocessable_bytes, unprocessable_bytes)
last_bytes.move_to(input_buffer.to_slice)
```

suggestion: it could be useful to have a slice that represents the current part of the input buffer, instead of always pointing to the end. For example (sadly, creating the slice is still a bit awkward because of #14775):

```crystal
slice = input_buffer.to_slice[0, available_bytes]
last_bytes = slice[slice.size - unprocessable_bytes, unprocessable_bytes]
```
> issue: the move looks wrong

The move does indeed look wrong; `input_slice.to_unsafe - unprocessable_bytes` (as you suggested) is the correct location.
Also, the two regions should never be able to overlap, so the `memmove` could be replaced by `memcpy`.
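For reference, a small illustration (my example, not from the PR) of the distinction being discussed: Crystal's `copy_*` methods assume non-overlapping regions (memcpy-like), while `move_*` tolerates overlap (memmove-like).

```crystal
src = Bytes[1, 2, 3, 4]
dst = Bytes.new(4)
src.copy_to(dst) # fine: the two regions are distinct

buf = Bytes[1, 2, 3, 4, 5]
buf[2, 3].move_to(buf[0, 3]) # overlapping shift-left: move_to is required
buf # => Bytes[3, 4, 5, 4, 5]
```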
```crystal
  # Internal method for encoding bytes from one buffer as base64, using chunks allocated by this method.
  # Returns the amount of bytes written into output.
  private def encode_base64_chunked_internal(
    input : UInt8*, input_size : Int32, chars : UInt8*, *, newlines : Bool = false, pad : Bool = false, & : Bytes -> Nil
```
issue: we don't pass a pointer then a size in Crystal, we pass a `Slice`. We also avoid passing buffers as raw pointers (what's their size?), but again pass a `Slice`. The method signature should be:

```diff
-input : UInt8*, input_size : Int32, chars : UInt8*, *, newlines : Bool = false, pad : Bool = false, & : Bytes -> Nil
+input : Bytes, chars : Bytes, *, newlines : Bool = false, pad : Bool = false, & : Bytes -> Nil
```
Since `chars` is required to be an array of exactly 64 ASCII chars, it doesn't really make sense to pass two extra parameters with it. I'd rather use something like `UInt8[64]*`, document the required behaviour, or just pass `String`.
Also, the method is internal, so I don't really understand why this is an issue.
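A minimal sketch of the `UInt8[64]*` idea (hypothetical signature and body, not from the PR): the length-64 requirement lives in the type itself, so neither a size parameter nor a runtime check is needed.

```crystal
private def encode_with(input : Bytes, chars : StaticArray(UInt8, 64)*) : Nil
  alphabet = chars.as(UInt8*) # points at the 64 alphabet bytes
  input.each do |byte|
    # A masked index is always in bounds for a 64-element array.
    char = alphabet[byte & 0x3F]
    # ... append char to the output ...
  end
end
```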
> it doesn't really make sense to pass two extra parameters with it.

Which two extra parameters? Isn't it only one for the size?
`Slice` also has a `read_only` parameter.
I understand the reasons, but these copies should be optimized away when LLVM inlines everything.
With `chars` as `Bytes` instead of a raw pointer, the method could have a single `raise unless chars.size == 64` assertion, which is likely to be optimized away unless we try to pass something else that will then safely fail, so we can safely access `chars.to_unsafe[i]` for the rest of the method.
We can remove the safety net, but we must prove that it significantly impacts performance, then try to balance risk vs performance accordingly, ideally keeping the safety knowledge local to the method itself (avoiding assumptions about callers) and otherwise having an explicit note on how the caller MUST call the method.
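A minimal sketch of that assert-once pattern (method and variable names are mine, not from the PR):

```crystal
private def encode_slices(input : Bytes, chars : Bytes) : Nil
  # A single up-front check; afterwards unchecked access is provably in bounds.
  raise ArgumentError.new("alphabet must be 64 bytes") unless chars.size == 64
  alphabet = chars.to_unsafe
  input.each do |byte|
    char = alphabet[byte & 0x3F] # masked index < 64, proven safe above
    # ...
  end
end
```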
```crystal
  # Internal method for encoding bytes from one buffer as base64 into another one (backend of *every* encoding method).
  # Returns the amount of bytes written into output.
  private def encode_base64_buffer_internal(input : UInt8*, input_size : Int32, output : UInt8*, chars : UInt8*, *, newlines : Bool = false, pad : Bool = false) : Int32
```
issue: same here:

```diff
-private def encode_base64_buffer_internal(input : UInt8*, input_size : Int32, output : UInt8*, chars : UInt8*, *, newlines : Bool = false, pad : Bool = false) : Int32
+private def encode_base64_buffer_internal(input : Bytes, output : Bytes, chars : Bytes, *, newlines : Bool = false, pad : Bool = false) : Int32
```

Then the method body won't have to meddle with pointer addresses, and can just take advantage of `output.size`, or of `input += LINE_BYTES` moving the pointer forward and reducing the size at the same time, ...
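A quick illustration of that `Slice#+` behaviour (my example; `LINE_BYTES` is just a stand-in constant here):

```crystal
LINE_BYTES = 3
input = Bytes[1, 2, 3, 4, 5]
input += LINE_BYTES # advances the pointer AND shrinks the size
input.size          # => 2
input               # => Bytes[4, 5]
```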
```crystal
        Intrinsics.memcpy(pointerof(value), input, 4, is_volatile: false)
        input += 3

        value = value.byte_swap
```
issue: this assumes that the system is little-endian; what if it's big-endian?
Yes, the code doesn't work on big-endian systems.
Originally, the code did check the system endianness (`byte_swap if IO::ByteFormat...`), but I removed that since Crystal doesn't support any big-endian system anyway (afaik).
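For illustration, a hedged sketch of an endianness-agnostic variant (my suggestion built on `IO::ByteFormat`'s documented decode API, not the PR's original expression):

```crystal
# Decoding through BigEndian yields the same value on any target;
# the byte swap happens internally only when the platform needs it.
bytes = Bytes[0x01, 0x02, 0x03, 0x04]
value = IO::ByteFormat::BigEndian.decode(UInt32, bytes) # => 0x01020304_u32
```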
```crystal
      i = 8
      while i != 0
        value = 0_u32
        Intrinsics.memcpy(pointerof(value), input, 4, is_volatile: false)
```
suggestion: again, no need for intrinsics:

```diff
-Intrinsics.memcpy(pointerof(value), input, 4, is_volatile: false)
+pointerof(value).as(UInt8*).copy_from(input, count: 4)
```
This PR rewrites the entire base64-encoding logic. While doing so, it adds methods used to encode (normal, urlsafe and strict) base64 from one `IO` into another. Also, `urlsafe_encode(data, io)` now receives an optional `padding = true` parameter to reach feature parity between all ways to encode base64 (buffer to buffer, buffer to IO, IO to IO).
Encoding from buffer to buffer only saw small performance improvements, but everything related to `IO`s saw very significant performance improvements.
The specs from #14604 have been copied over to this PR, although more should probably be added.
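As a hypothetical usage example of the IO-to-IO direction described above (assuming the new overloads follow the existing `Base64.encode(data, io)` naming; not verified against the PR's actual API):

```crystal
require "base64"

File.open("input.bin") do |input|
  File.open("output.b64", "w") do |output|
    Base64.strict_encode(input, output) # stream-encode from one IO into another
  end
end
```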
Benchmark Code
For this benchmark, I copied the `base64.cr` file from this repo into `base64blob.cr` and renamed the module inside to `Base64Blob`. This way, both variants can be used in parallel.

Direct Comparison
Throughput:

My Benchmark Results (Fedora 40, Ryzen 3600)
Direct comparisons
Throughput
Reading from and writing to `/dev/zero`, the code needed 490.62 ms per 1 GiB (-> 2.04 GiB / 2.19 GB per second) in strict mode, and 565.33 ms using the default `#encode` (-> 1.77 GiB / 1.90 GB per second).

Closes #14604