Refactor base64 encoding #14611

Open · wants to merge 7 commits into master
Conversation

@BlobCodes (Contributor) commented May 23, 2024

This PR rewrites the entire base64-encoding logic.
While doing so, it adds methods to encode (normal, urlsafe, and strict) base64 from one IO into another.
Also, urlsafe_encode(data, io) now receives an optional padding = true parameter to reach feature parity between all the ways to encode base64 (buffer to buffer, buffer to IO, IO to IO).

Encoding from buffer to buffer only saw small performance improvements, but everything involving IOs saw significant improvements.

The specs from #14604 have been copied over to this PR, although more should probably be added.
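For illustration, a rough usage sketch of the API described above (file names are placeholders and the exact method signatures in the PR may differ):

require "base64"

# Hedged sketch: encode one IO into another (strict, i.e. without newlines).
File.open("input.bin") do |input|
  File.open("output.b64", "w") do |output|
    Base64.strict_encode(input, output)
  end
end

# Buffer-to-IO urlsafe encoding with the new optional padding parameter:
Base64.urlsafe_encode("some data", STDOUT, padding: false)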

Benchmark Code

For this benchmark, I copied the base64.cr file from this repo into base64blob.cr and renamed the module inside to Base64Blob. This way, both variants can be used in parallel.

Direct Comparison

require "benchmark"
require "base64"

require "./base64blob"

asma = "a" * 10
amed = "a" * 1000
abig = "a" * 1_000_000
dev_zero = File.new("/dev/zero", "w+")

macro bench(name, input, *output)
  puts "\nBenchmark: #{ {{name}} }"
  Benchmark.ips do |x|
    x.report("old") { Base64.encode({{input}}, {{output.splat}}) }
    x.report("new") { Base64Blob.encode({{input}}, {{output.splat}}) }
  end
end

bench("10 byte string to string", asma)
bench("1000 byte string to string", amed)
bench("1_000_000 byte string to string", abig)

bench("10 byte string to IO", asma, dev_zero)
bench("1000 byte string to IO", amed, dev_zero)
bench("1_000_000 byte string to IO", abig, dev_zero)

# For comparison with #14604
#bench("10 byte IO to IO", IO::Sized.new(dev_zero, 10), dev_zero)
#bench("1000 bytes IO to IO", IO::Sized.new(dev_zero, 1000), dev_zero)
#bench("1_000_000 bytes IO to IO", IO::Sized.new(dev_zero, 1_000_000), dev_zero)

Throughput:

require "benchmark"
require "./base64blob.cr"

input = File.new("/dev/zero", "r")
output = File.new("/dev/zero", "w")
input_size = 2**30 # 1GiB

Benchmark.ips do |x|
  x.report("throughput (1GiB)") { Base64Blob.strict_encode(IO::Sized.new(input, input_size), output) }
end
My Benchmark Results (Fedora 40, Ryzen 3600)

Direct comparisons

Benchmark: 10 byte string to string
old  34.96M ( 28.61ns) (± 0.34%)  32.0B/op   1.11× slower
new  38.97M ( 25.66ns) (± 0.33%)  32.0B/op        fastest

Benchmark: 1000 byte string to string
old   1.14M (874.90ns) (± 1.20%)  2.0kB/op   1.38× slower
new   1.58M (632.77ns) (± 0.48%)  2.0kB/op        fastest

Benchmark: 1_000_000 byte string to string
old   1.43k (697.02µs) (± 0.59%)  1.3MB/op   1.52× slower
new   2.19k (457.18µs) (± 1.19%)  1.3MB/op        fastest

Benchmark: 10 byte string to IO
old   2.92M (342.82ns) (± 1.04%)  0.0B/op   1.14× slower
new   3.32M (301.40ns) (± 0.51%)  0.0B/op        fastest

Benchmark: 1000 byte string to IO
old 215.74k (  4.64µs) (± 0.73%)  0.0B/op   6.21× slower
new   1.34M (746.07ns) (± 0.64%)  0.0B/op        fastest

Benchmark: 1_000_000 byte string to IO
old 229.40  (  4.36ms) (± 0.98%)  0.0B/op   9.29× slower
new   2.13k (469.31µs) (± 0.97%)  0.0B/op        fastest

Benchmark: 10 byte IO to IO
old   2.30M (435.64ns) (± 0.23%)  96.0B/op   8.59× slower
new  19.72M ( 50.70ns) (± 0.99%)  96.0B/op        fastest

Benchmark: 1000 bytes IO to IO
old 211.53k (  4.73µs) (± 1.49%)  96.0B/op   8.10× slower
new   1.71M (583.84ns) (± 0.49%)  96.0B/op        fastest

Benchmark: 1_000_000 bytes IO to IO
old 231.63  (  4.32ms) (± 0.59%)  93.0B/op   8.17× slower
new   1.89k (528.53µs) (± 0.57%)  96.0B/op        fastest

NOTE: The buffer-to-IO encoding methods call #flush on the output IO, which explains why encoding 10 bytes to IO takes > 300ns. Without the explicit flush, this number goes down to 18.63ns.

NOTE: The IO-to-IO encoding methods were compared against #14604

Throughput

Reading from and writing to /dev/zero, the code needed 490.62ms per 1GiB (≈ 2.04GiB/s or 2.19GB/s) in strict mode and 565.33ms using the default #encode (≈ 1.77GiB/s or 1.90GB/s).

Closes #14604

jgaskins and others added 3 commits May 21, 2024 18:06
# Clear out each buffer before each spec
encode_buffer = IO::Memory.new
decode_buffer = IO::Memory.new
end
@kostya (Contributor) commented May 23, 2024

These 6 lines can be removed, and the initialization put into it.

@straight-shoota (Member)

So just to clarify, this replaces #14604 and supplements #13574 which covers the decoding part. Right?

@kostya (Contributor) commented May 23, 2024

Looks good.
Checked on https://github.com/kostya/crystal-metric:

1.12.1

Base64Encode: ok in 0.930s, award 1.0
Base64Decode: ok in 0.942s, award 1.0
1.8721s, 2.00/2, 1.00

this PR

Base64Encode: ok in 0.918s, award 1.0
Base64Decode: ok in 0.937s, award 1.0
1.8552s, 2.00/2, 1.00

@BlobCodes (Author)

So just to clarify, this replaces #14604 and supplements #13574 which covers the decoding part. Right?

Yes, although it is completely independent of #13574.

I thought memmove was the function which disallowed overlapped memory regions for performance reasons
@BlobCodes (Author)

As a note to anyone reviewing this:

In the bulk encoding part (-> #encode_base64_full_pairs_internal), memcpy is used to copy the actual data triples over to the variable:

Intrinsics.memcpy(pointerof(value), input, 4, is_volatile: false)

This was adapted from the previous encoding method:

n = cstr.as(UInt32*).value.byte_swap

The old method used an unaligned UInt32 access through Pointer (which tells LLVM the access is 4-byte aligned).
This could have caused a crash on strict-alignment architectures like RISC-V and older ARM versions, where unaligned accesses are forbidden.

This can also be seen in the generated IR:

body:                                             ; preds = %while
  %122 = load ptr, ptr %cstr, align 8, !dbg !17334
  %123 = load i32, ptr %122, align 4, !dbg !17334 ; <- Notice the align 4
  %124 = call i32 @"*UInt32#byte_swap:UInt32"(i32 %123), !dbg !17335

The new variant should now always work correctly, but may be less performant than reading only 3 bytes on strict-alignment architectures. Hopefully we'll find an even more efficient method for the bulk encoding work once Crystal gets SIMD support.
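For reference, a sketch of that 3-byte alternative (assumed shape, not code from this PR), using Pointer#copy_from instead of the raw intrinsic:

value = 0_u32
pointerof(value).as(UInt8*).copy_from(input, count: 3) # copy only the consumed triple, byte by byte
value = value.byte_swap # the first input byte ends up in the most significant bits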

BlobCodes and others added 2 commits May 24, 2024 16:42
@jgaskins (Contributor) commented Jun 1, 2024

This looks great! Thanks for improving on my PR!

@ysbaddaden (Contributor) left a comment

I finally took the time to review this PR. Great work!

The main issue is passing raw pointers around, especially for buffers (pointer + size), even to private internal methods. In Crystal we always pass slices and only reach for raw pointers to circumvent some safety checks when we can prove it's safe.

Then the PR could be split into a couple of PRs, to distinguish:

  1. the refactor & support for IO input/output;
  2. the unaligned fix/performance improvement for encoding pairs (with before/after benchmarks for x86_64 & aarch64).

break if read_bytes == 0

# Move unprocessed bytes to the beginning of input_buffer
Intrinsics.memmove(input_buffer.to_unsafe, input_buffer.to_unsafe + unprocessable_bytes, unprocessable_bytes, is_volatile: false) unless unprocessable_bytes == 0
Contributor:

suggestion: we shouldn't have to call intrinsics directly in usual code. The #move_to and #move_from methods on Pointer and Slice are available for this purpose.

The abstraction will be optimized away, with the exception of safety checks. We could consider unsafe variants, but so far we have never noticed significant bottlenecks (but we'd love to be proved wrong).

issue: the move looks wrong; it's copying the first N bytes in place, when it should copy the last N bytes to the beginning?

last_bytes = Bytes.new(input_slice.to_unsafe - unprocessable_bytes, unprocessable_bytes)
last_bytes.move_to(input_buffer.to_slice)

suggestion: it could be useful to have a slice that represents the current part of the input buffer, instead of always pointing to the end. For example (sadly, creating the slice is still a bit awkward because of #14775):

slice = input_buffer.to_slice[0, available_bytes]
last_bytes = slice[slice.size - unprocessable_bytes, unprocessable_bytes]

Contributor (Author):

issue: the move looks wrong

The move does indeed look wrong, input_slice.to_unsafe - unprocessable_bytes (as you suggested) is the correct location.


Also, the two regions should never be able to overlap, so the memmove could be replaced by memcpy.
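Putting both observations together, the corrected tail copy could look roughly like this (a sketch reusing the variable names from the discussion; a plain copy suffices since the regions never overlap):

last_bytes = Slice.new(input_slice.to_unsafe - unprocessable_bytes, unprocessable_bytes)
last_bytes.copy_to(input_buffer.to_slice) # non-overlapping regions, so no move semantics needed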

# Internal method for encoding bytes from one buffer as base64, using chunks allocated by this method.
# Returns the amount of bytes written into output.
private def encode_base64_chunked_internal(
input : UInt8*, input_size : Int32, chars : UInt8*, *, newlines : Bool = false, pad : Bool = false, & : Bytes -> Nil
Contributor:

issue: we don't pass a pointer and then a size in Crystal, we pass a Slice. We also avoid passing buffers as raw pointers (what's their size?), but again pass a Slice.

The method signature should be:

Suggested change
input : UInt8*, input_size : Int32, chars : UInt8*, *, newlines : Bool = false, pad : Bool = false, & : Bytes -> Nil
input : Bytes, chars : Bytes, *, newlines : Bool = false, pad : Bool = false, & : Bytes -> Nil

Contributor (Author):

Since chars is required to be an array of exactly 64 ASCII chars, it doesn't really make sense to pass two extra parameters with it. I'd rather use something like UInt8[64]*, document the required behaviour, or just pass a String.

Also, the method is internal, so I don't really understand why this is an issue.

Member:

it doesn't really make sense to pass two extra parameters with it.

Which two extra parameters? Isn't it only one for the size?

Contributor (Author):

Slice also has a read_only parameter

Contributor:

I understand the reasons, but these copies should be optimized away when LLVM inlines everything.

With chars as Bytes instead of a raw pointer, the method could have a single raise unless chars.size == 64 assertion, which is likely to be optimized away (and will fail safely if we ever try to pass something else), so we can safely access chars.to_unsafe[i] for the rest of the method.

We can remove the safety net, but we must prove that it significantly impacts performance, then try to balance risk vs. performance accordingly, ideally keeping the safety knowledge within the method itself (avoiding assumptions about callers) and otherwise having an explicit note on how the caller MUST call the method.
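A minimal sketch of that pattern (a hypothetical helper, not the PR's actual method): validate the alphabet size once, then index the raw pointer without further bounds checks.

private def encode_first_char(input : Bytes, chars : Bytes) : UInt8
  raise ArgumentError.new("chars must be exactly 64 bytes") unless chars.size == 64
  chars.to_unsafe[input[0] >> 2] # safe: the index is always in 0..63
end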


# Internal method for encoding bytes from one buffer as base64 into another one (backend of *every* encoding method).
# Returns the amount of bytes written into output.
private def encode_base64_buffer_internal(input : UInt8*, input_size : Int32, output : UInt8*, chars : UInt8*, *, newlines : Bool = false, pad : Bool = false) : Int32
Contributor:

issue: same here:

Suggested change
private def encode_base64_buffer_internal(input : UInt8*, input_size : Int32, output : UInt8*, chars : UInt8*, *, newlines : Bool = false, pad : Bool = false) : Int32
private def encode_base64_buffer_internal(input : Bytes, output : Bytes, chars : Bytes, *, newlines : Bool = false, pad : Bool = false) : Int32

Then the method body won't have to meddle with pointer addresses, and can just take advantage of output.size, or input += LINE_BYTES, which moves the pointer and reduces the size, ...
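For illustration, the Slice-advancing idiom being described might look like this (a sketch; input is assumed to be Bytes and LINE_BYTES a hypothetical per-line constant):

while input.size >= LINE_BYTES
  line = input[0, LINE_BYTES] # view of the next line's worth of input
  # ... encode `line` ...
  input += LINE_BYTES # advances the pointer and shrinks the size in one step
end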

Intrinsics.memcpy(pointerof(value), input, 4, is_volatile: false)
input += 3

value = value.byte_swap
Contributor:

issue: this assumes that the system is little-endian; what if it's big-endian?

Contributor (Author):

Yes, the code doesn't work on big-endian systems.
Originally, the code did check the system endianness (byte_swap if IO::ByteFormat...), but I removed that since Crystal doesn't support any big-endian systems anyway (AFAIK).
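One endian-independent alternative (just a sketch, not code from this PR) would be assembling the value with explicit shifts, so neither byte_swap nor an endianness check is needed:

value = (input[0].to_u32 << 24) | (input[1].to_u32 << 16) | (input[2].to_u32 << 8)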

i = 8
while i != 0
  value = 0_u32
  Intrinsics.memcpy(pointerof(value), input, 4, is_volatile: false)
Contributor:

suggestion: again, no need for intrinsics:

Suggested change
Intrinsics.memcpy(pointerof(value), input, 4, is_volatile: false)
pointerof(value).as(UInt8*).copy_from(input, count: 4)
