Refactor base64 encoding #14611

Open · wants to merge 7 commits into master
Conversation

@BlobCodes (Contributor) commented May 23, 2024

This PR rewrites the entire base64-encoding logic.
While doing so, it adds methods to encode (normal, urlsafe, and strict) base64 from one IO into another.
Also, urlsafe_encode(data, io) now receives an optional padding = true parameter to reach feature parity between all the ways to encode base64 (buffer to buffer, buffer to IO, IO to IO).

Encoding from buffer to buffer only saw small performance improvements, but everything involving IOs saw significant improvements.

The specs from #14604 have been copied over to this PR, although more should probably be added.
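For illustration, a rough usage sketch of the API described above (file names are placeholders and the exact method signatures in the PR may differ):

require "base64"

# Hedged sketch: encode one IO into another (strict, i.e. without newlines).
File.open("input.bin") do |input|
  File.open("output.b64", "w") do |output|
    Base64.strict_encode(input, output)
  end
end

# Buffer-to-IO urlsafe encoding with the new optional padding parameter:
Base64.urlsafe_encode("some data", STDOUT, padding: false)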

Benchmark Code

For this benchmark, I copied the base64.cr file from this repo into base64blob.cr and renamed the module inside to Base64Blob. This way, both variants can be used in parallel.

Direct Comparison

require "benchmark"
require "base64"

require "./base64blob"

asma = "a" * 10
amed = "a" * 1000
abig = "a" * 1_000_000
dev_zero = File.new("/dev/zero", "w+")

macro bench(name, input, *output)
  puts "\nBenchmark: #{ {{name}} }"
  Benchmark.ips do |x|
    x.report("old") { Base64.encode({{input}}, {{output.splat}}) }
    x.report("new") { Base64Blob.encode({{input}}, {{output.splat}}) }
  end
end

bench("10 byte string to string", asma)
bench("1000 byte string to string", amed)
bench("1_000_000 byte string to string", abig)

bench("10 byte string to IO", asma, dev_zero)
bench("1000 byte string to IO", amed, dev_zero)
bench("1_000_000 byte string to IO", abig, dev_zero)

# For comparison with #14604
#bench("10 byte IO to IO", IO::Sized.new(dev_zero, 10), dev_zero)
#bench("1000 bytes IO to IO", IO::Sized.new(dev_zero, 1000), dev_zero)
#bench("1_000_000 bytes IO to IO", IO::Sized.new(dev_zero, 1_000_000), dev_zero)

Throughput:

require "benchmark"
require "./base64blob.cr"

input = File.new("/dev/zero", "r")
output = File.new("/dev/zero", "w")
input_size = 2**30 # 1GiB

Benchmark.ips do |x|
  x.report("throughput (1GiB)") { Base64Blob.strict_encode(IO::Sized.new(input, input_size), output) }
end
My Benchmark Results (Fedora 40, Ryzen 3600)

Direct comparisons

Benchmark: 10 byte string to string
old  34.96M ( 28.61ns) (± 0.34%)  32.0B/op   1.11× slower
new  38.97M ( 25.66ns) (± 0.33%)  32.0B/op        fastest

Benchmark: 1000 byte string to string
old   1.14M (874.90ns) (± 1.20%)  2.0kB/op   1.38× slower
new   1.58M (632.77ns) (± 0.48%)  2.0kB/op        fastest

Benchmark: 1_000_000 byte string to string
old   1.43k (697.02µs) (± 0.59%)  1.3MB/op   1.52× slower
new   2.19k (457.18µs) (± 1.19%)  1.3MB/op        fastest

Benchmark: 10 byte string to IO
old   2.92M (342.82ns) (± 1.04%)  0.0B/op   1.14× slower
new   3.32M (301.40ns) (± 0.51%)  0.0B/op        fastest

Benchmark: 1000 byte string to IO
old 215.74k (  4.64µs) (± 0.73%)  0.0B/op   6.21× slower
new   1.34M (746.07ns) (± 0.64%)  0.0B/op        fastest

Benchmark: 1_000_000 byte string to IO
old 229.40  (  4.36ms) (± 0.98%)  0.0B/op   9.29× slower
new   2.13k (469.31µs) (± 0.97%)  0.0B/op        fastest

Benchmark: 10 byte IO to IO
old   2.30M (435.64ns) (± 0.23%)  96.0B/op   8.59× slower
new  19.72M ( 50.70ns) (± 0.99%)  96.0B/op        fastest

Benchmark: 1000 bytes IO to IO
old 211.53k (  4.73µs) (± 1.49%)  96.0B/op   8.10× slower
new   1.71M (583.84ns) (± 0.49%)  96.0B/op        fastest

Benchmark: 1_000_000 bytes IO to IO
old 231.63  (  4.32ms) (± 0.59%)  93.0B/op   8.17× slower
new   1.89k (528.53µs) (± 0.57%)  96.0B/op        fastest

NOTE: The buffer-to-IO encoding methods call #flush on the output IO, which explains why encoding 10 bytes to IO takes > 300ns. Without the explicit flush, this number goes down to 18.63ns.

NOTE: The IO-to-IO encoding methods were compared against #14604

Throughput

Reading from and writing to /dev/zero, the code needed 490.62ms per 1GiB (≈ 2.04GiB/s or 2.19GB/s) in strict mode and 565.33ms using the default #encode (≈ 1.77GiB/s or 1.90GB/s).

Closes #14604

jgaskins and others added 3 commits May 21, 2024 18:06
# Clear out each buffer before each spec
encode_buffer = IO::Memory.new
decode_buffer = IO::Memory.new
end
@kostya (Contributor) commented May 23, 2024

These 6 lines can be removed, and the initialization put into it.

@straight-shoota (Member)

So just to clarify, this replaces #14604 and supplements #13574 which covers the decoding part. Right?

@kostya (Contributor) commented May 23, 2024

Looks good.
Checked on https://github.com/kostya/crystal-metric:

1.12.1

Base64Encode: ok in 0.930s, award 1.0
Base64Decode: ok in 0.942s, award 1.0
1.8721s, 2.00/2, 1.00

this PR

Base64Encode: ok in 0.918s, award 1.0
Base64Decode: ok in 0.937s, award 1.0
1.8552s, 2.00/2, 1.00

@BlobCodes (Author)

So just to clarify, this replaces #14604 and supplements #13574 which covers the decoding part. Right?

Yes, although it is completely independent of #13574.

I thought memmove was the function which disallowed overlapped memory regions for performance reasons
@BlobCodes (Author)

As a note to anyone reviewing this:

In the bulk encoding part (-> #encode_base64_full_pairs_internal), memcpy is used to copy the actual data triples over to the variable:

Intrinsics.memcpy(pointerof(value), input, 4, is_volatile: false)

This was adapted from the previous encoding method:

n = cstr.as(UInt32*).value.byte_swap

The old method used an unaligned UInt32 access through Pointer (which tells LLVM the access is 4-byte aligned).
This could have caused a crash on strict-alignment architectures like RISC-V and older ARM versions, where unaligned accesses are forbidden.

This can also be seen in the generated IR:

body:                                             ; preds = %while
  %122 = load ptr, ptr %cstr, align 8, !dbg !17334
  %123 = load i32, ptr %122, align 4, !dbg !17334 ; <- Notice the align 4
  %124 = call i32 @"*UInt32#byte_swap:UInt32"(i32 %123), !dbg !17335

The new variant should now always work correctly, but may be less performant than reading only 3 bytes on strict-alignment architectures. Hopefully we'll find an even more efficient method for the bulk encoding work once Crystal gets SIMD support.
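For reference, a sketch of that 3-byte alternative (assumed shape, not code from this PR), using Pointer#copy_from instead of the raw intrinsic:

value = 0_u32
pointerof(value).as(UInt8*).copy_from(input, count: 3) # copy only the consumed triple, byte by byte
value = value.byte_swap # the first input byte ends up in the most significant bits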

BlobCodes and others added 2 commits May 24, 2024 16:42
@jgaskins (Contributor) commented Jun 1, 2024

This looks great! Thanks for improving on my PR!

@ysbaddaden (Contributor) left a comment

I finally took the time to review this PR. Great work!

The main issue is passing raw pointers around, especially for buffers (pointer + size), even to private internal methods. In Crystal we always pass slices and only reach for raw pointers to circumvent some safety checks when we can prove it's safe.

Then the PR could be split into a couple of PRs, to distinguish:

  1. the refactor & support for IO input/output;
  2. the unaligned fix/performance improvement for encoding pairs (with before/after benchmarks for x86_64 & aarch64).

break if read_bytes == 0

# Move unprocessed bytes to the beginning of input_buffer
Intrinsics.memmove(input_buffer.to_unsafe, input_buffer.to_unsafe + unprocessable_bytes, unprocessable_bytes, is_volatile: false) unless unprocessable_bytes == 0
Contributor:

suggestion: we shouldn't have to call intrinsics directly in usual code. The #move_to and #move_from methods on Pointer and Slice are available for this purpose.

The abstraction will be optimized away, with the exception of safety checks. We could consider unsafe variants, but so far we have never noticed significant bottlenecks (but we'd love to be proved wrong).

issue: the move looks wrong; it's copying the first N bytes in place, when it should copy the last N bytes to the beginning?

last_bytes = Bytes.new(input_slice.to_unsafe - unprocessable_bytes, unprocessable_bytes)
last_bytes.move_to(input_buffer.to_slice)

suggestion: it could be useful to have a slice that represents the current part of the input buffer, instead of always pointing to the end. For example (sadly, creating the slice is still a bit awkward because of #14775):

slice = input_buffer.to_slice[0, available_bytes]
last_bytes = slice[slice.size - unprocessable_bytes, unprocessable_bytes]

Contributor (Author):

issue: the move looks wrong

The move does indeed look wrong, input_slice.to_unsafe - unprocessable_bytes (as you suggested) is the correct location.


Also, the two regions should never be able to overlap, so the memmove could be replaced by memcpy.
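Putting both observations together, the corrected tail copy could look roughly like this (a sketch reusing the variable names from the discussion; a plain copy suffices since the regions never overlap):

last_bytes = Slice.new(input_slice.to_unsafe - unprocessable_bytes, unprocessable_bytes)
last_bytes.copy_to(input_buffer.to_slice) # non-overlapping regions, so no move semantics needed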

# Internal method for encoding bytes from one buffer as base64, using chunks allocated by this method.
# Returns the amount of bytes written into output.
private def encode_base64_chunked_internal(
input : UInt8*, input_size : Int32, chars : UInt8*, *, newlines : Bool = false, pad : Bool = false, & : Bytes -> Nil
Contributor:

issue: we don't pass a pointer and then a size in Crystal, we pass a Slice. We also avoid passing buffers as raw pointers (what's their size?), but again pass a Slice.

The method signature should be:

Suggested change
input : UInt8*, input_size : Int32, chars : UInt8*, *, newlines : Bool = false, pad : Bool = false, & : Bytes -> Nil
input : Bytes, chars : Bytes, *, newlines : Bool = false, pad : Bool = false, & : Bytes -> Nil

Contributor (Author):

Since chars is required to be an array of exactly 64 ASCII chars, it doesn't really make sense to pass two extra parameters with it. I'd rather use something like UInt8[64]*, document the required behaviour, or just pass a String.

Also, the method is internal, so I don't really understand why this is an issue.

Member:

it doesn't really make sense to pass two extra parameters with it.

Which two extra parameters? Isn't it only one for the size?

Contributor (Author):

Slice also has a read_only parameter

Contributor:

I understand the reasons, but these copies should be optimized away when LLVM inlines everything.

With chars as Bytes instead of a raw pointer, the method could have a single raise unless chars.size == 64 assertion, which is likely to be optimized away (and will fail safely if we ever try to pass something else), so we can safely access chars.to_unsafe[i] for the rest of the method.

We can remove the safety net, but we must prove that it significantly impacts performance, then try to balance risk vs. performance accordingly, ideally keeping the safety knowledge within the method itself (avoiding assumptions about callers) and otherwise having an explicit note on how the caller MUST call the method.
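A minimal sketch of that pattern (a hypothetical helper, not the PR's actual method): validate the alphabet size once, then index the raw pointer without further bounds checks.

private def encode_first_char(input : Bytes, chars : Bytes) : UInt8
  raise ArgumentError.new("chars must be exactly 64 bytes") unless chars.size == 64
  chars.to_unsafe[input[0] >> 2] # safe: the index is always in 0..63
end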


# Internal method for encoding bytes from one buffer as base64 into another one (backend of *every* encoding method).
# Returns the amount of bytes written into output.
private def encode_base64_buffer_internal(input : UInt8*, input_size : Int32, output : UInt8*, chars : UInt8*, *, newlines : Bool = false, pad : Bool = false) : Int32
Contributor:

issue: same here:

Suggested change
private def encode_base64_buffer_internal(input : UInt8*, input_size : Int32, output : UInt8*, chars : UInt8*, *, newlines : Bool = false, pad : Bool = false) : Int32
private def encode_base64_buffer_internal(input : Bytes, output : Bytes, chars : Bytes, *, newlines : Bool = false, pad : Bool = false) : Int32

Then the method body won't have to meddle with pointer addresses, and can just take advantage of output.size, or input += LINE_BYTES, which moves the pointer and reduces the size, ...
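For illustration, the Slice-advancing idiom being described might look like this (a sketch; input is assumed to be Bytes and LINE_BYTES a hypothetical per-line constant):

while input.size >= LINE_BYTES
  line = input[0, LINE_BYTES] # view of the next line's worth of input
  # ... encode `line` ...
  input += LINE_BYTES # advances the pointer and shrinks the size in one step
end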

Intrinsics.memcpy(pointerof(value), input, 4, is_volatile: false)
input += 3

value = value.byte_swap
Contributor:

issue: this assumes that the system is little-endian; what if it's big-endian?

Contributor (Author):

Yes, the code doesn't work on big-endian systems.
Originally, the code did check the system endianness (byte_swap if IO::ByteFormat...), but I removed that since Crystal doesn't support any big-endian systems anyway (AFAIK).
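One endian-independent alternative (just a sketch, not code from this PR) would be assembling the value with explicit shifts, so neither byte_swap nor an endianness check is needed:

value = (input[0].to_u32 << 24) | (input[1].to_u32 << 16) | (input[2].to_u32 << 8)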

i = 8
while i != 0
  value = 0_u32
  Intrinsics.memcpy(pointerof(value), input, 4, is_volatile: false)
Contributor:

suggestion: again, no need for intrinsics:

Suggested change
Intrinsics.memcpy(pointerof(value), input, 4, is_volatile: false)
pointerof(value).as(UInt8*).copy_from(input, count: 4)
