
Refactor base64 encoding #14611

Open · wants to merge 7 commits into master
Conversation

@BlobCodes (Contributor) commented May 23, 2024:

This PR rewrites the entire base64-encoding logic.
While doing so, it adds methods to encode (normal, urlsafe and strict) base64 from one IO into another.
Also, urlsafe_encode(data, io) now receives an optional padding = true parameter, reaching feature parity between all ways to encode base64 (buffer to buffer, buffer to IO, IO to IO).

Encoding from buffer to buffer saw only small performance improvements, but everything involving IOs improved very significantly.

The specs from #14604 have been copied over to this PR, although more should probably be added.
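To make the API surface concrete, here is a small usage sketch. The buffer-to-buffer and buffer-to-IO calls below already exist in the stdlib; the IO-to-IO overloads and the extra `padding` parameter for `urlsafe_encode(data, io)` are what this PR adds, so those are only shown in comments (signatures taken from the PR description, so treat them as assumptions):

```crystal
require "base64"

data = "Hello, world!"

# Buffer to buffer (existing API, slightly faster with this PR):
Base64.strict_encode(data)                  # => "SGVsbG8sIHdvcmxkIQ=="
Base64.urlsafe_encode(data, padding: false) # => "SGVsbG8sIHdvcmxkIQ"

# Buffer to IO (existing API, significantly faster with this PR):
io = IO::Memory.new
Base64.strict_encode(data, io)
io.to_s # => "SGVsbG8sIHdvcmxkIQ=="

# New in this PR (per the description above, not the released stdlib):
#   Base64.encode(input_io, output_io)               # IO to IO
#   Base64.urlsafe_encode(data, io, padding: false)  # padding now optional
```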

Benchmark Code

For this benchmark, I copied the base64.cr file from this repo into base64blob.cr and renamed the module inside to Base64Blob. This way, both variants can be used in parallel.

Direct Comparison

require "benchmark"
require "base64"

require "./base64blob"

asma = "a" * 10
amed = "a" * 1000
abig = "a" * 1_000_000
dev_zero = File.new("/dev/zero", "w+")

macro bench(name, input, *output)
  puts "\nBenchmark: #{ {{name}} }"
  Benchmark.ips do |x|
    x.report("old") { Base64.encode({{input}}, {{output.splat}}) }
    x.report("new") { Base64Blob.encode({{input}}, {{output.splat}}) }
  end
end

bench("10 byte string to string", asma)
bench("1000 byte string to string", amed)
bench("1_000_000 byte string to string", abig)

bench("10 byte string to IO", asma, dev_zero)
bench("1000 byte string to IO", amed, dev_zero)
bench("1_000_000 byte string to IO", abig, dev_zero)

# For comparison with #14604
#bench("10 byte IO to IO", IO::Sized.new(dev_zero, 10), dev_zero)
#bench("1000 bytes IO to IO", IO::Sized.new(dev_zero, 1000), dev_zero)
#bench("1_000_000 bytes IO to IO", IO::Sized.new(dev_zero, 1_000_000), dev_zero)

Throughput:

require "benchmark"
require "./base64blob.cr"

input = File.new("/dev/zero", "r")
output = File.new("/dev/zero", "w")
input_size = 2**30 # 1GiB

Benchmark.ips do |x|
  x.report("throughput (1GiB)") { Base64Blob.strict_encode(IO::Sized.new(input, input_size), output) }
end
My Benchmark Results (Fedora 40, Ryzen 3600)

Direct comparisons

Benchmark: 10 byte string to string
old  34.96M ( 28.61ns) (± 0.34%)  32.0B/op   1.11× slower
new  38.97M ( 25.66ns) (± 0.33%)  32.0B/op        fastest

Benchmark: 1000 byte string to string
old   1.14M (874.90ns) (± 1.20%)  2.0kB/op   1.38× slower
new   1.58M (632.77ns) (± 0.48%)  2.0kB/op        fastest

Benchmark: 1_000_000 byte string to string
old   1.43k (697.02µs) (± 0.59%)  1.3MB/op   1.52× slower
new   2.19k (457.18µs) (± 1.19%)  1.3MB/op        fastest

Benchmark: 10 byte string to IO
old   2.92M (342.82ns) (± 1.04%)  0.0B/op   1.14× slower
new   3.32M (301.40ns) (± 0.51%)  0.0B/op        fastest

Benchmark: 1000 byte string to IO
old 215.74k (  4.64µs) (± 0.73%)  0.0B/op   6.21× slower
new   1.34M (746.07ns) (± 0.64%)  0.0B/op        fastest

Benchmark: 1_000_000 byte string to IO
old 229.40  (  4.36ms) (± 0.98%)  0.0B/op   9.29× slower
new   2.13k (469.31µs) (± 0.97%)  0.0B/op        fastest

Benchmark: 10 byte IO to IO
old   2.30M (435.64ns) (± 0.23%)  96.0B/op   8.59× slower
new  19.72M ( 50.70ns) (± 0.99%)  96.0B/op        fastest

Benchmark: 1000 bytes IO to IO
old 211.53k (  4.73µs) (± 1.49%)  96.0B/op   8.10× slower
new   1.71M (583.84ns) (± 0.49%)  96.0B/op        fastest

Benchmark: 1_000_000 bytes IO to IO
old 231.63  (  4.32ms) (± 0.59%)  93.0B/op   8.17× slower
new   1.89k (528.53µs) (± 0.57%)  96.0B/op        fastest

NOTE: The buffer-to-IO encoding methods call #flush on the output IO, which explains why encoding 10 bytes to an IO takes over 300ns. Without the explicit flush, this number goes down to 18.63ns.

NOTE: The IO-to-IO encoding methods were compared against #14604

Throughput

Reading from and writing to /dev/zero, the code needed 490.62ms per 1 GiB (i.e. 2.04 GiB/s, or 2.19 GB/s) in strict mode and 565.33ms using the default #encode (i.e. 1.77 GiB/s, or 1.90 GB/s).
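A quick back-of-the-envelope check of those throughput figures (my own arithmetic, not part of the PR):

```crystal
gib = 2.0 ** 30 # bytes per GiB
gb  = 1e9       # bytes per GB

strict_time  = 0.49062 # seconds per GiB, strict mode
default_time = 0.56533 # seconds per GiB, default #encode

puts (gib / strict_time / gib).round(2)  # 2.04 GiB/s
puts (gib / strict_time / gb).round(2)   # 2.19 GB/s
puts (gib / default_time / gib).round(2) # 1.77 GiB/s
puts (gib / default_time / gb).round(2)  # 1.9  GB/s
```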

Closes #14604

jgaskins and others added 3 commits May 21, 2024 18:06
# Clear out each buffer before each spec
encode_buffer = IO::Memory.new
decode_buffer = IO::Memory.new
end
@kostya (Contributor) commented May 23, 2024:

These six lines can be removed, and the initialization put directly into `it`.

@straight-shoota (Member):

So just to clarify, this replaces #14604 and supplements #13574, which covers the decoding part. Right?

@kostya (Contributor) commented May 23, 2024:

Looks good. Checked with https://github.com/kostya/crystal-metric:
1.12.1

Base64Encode: ok in 0.930s, award 1.0
Base64Decode: ok in 0.942s, award 1.0
1.8721s, 2.00/2, 1.00

this PR

Base64Encode: ok in 0.918s, award 1.0
Base64Decode: ok in 0.937s, award 1.0
1.8552s, 2.00/2, 1.00

private PAD = '='.ord.to_u8
private NL = '\n'.ord.to_u8
private NR = '\r'.ord.to_u8

{% begin %}
private STREAM_MAX_INPUT_BUFFER_SIZE = {{ IO::DEFAULT_BUFFER_SIZE // (LINE_SIZE + 1) * (LINE_SIZE // 4 * 3) }}
Contributor: I don't understand why this is a macro and not just a constant.

@BlobCodes (Contributor Author): It doesn't compile without the macro because the compiler cannot find the referenced constants (probably because they are private).

Member: This should certainly work. They're in the same namespace. Might be a compiler bug.

@BlobCodes (Contributor Author): Simplified:

module A
  FOO = 1
  BAR = FOO
end
buffer = uninitialized UInt8[A::BAR]
# => Error: undefined constant FOO

-> #13018

@BlobCodes (Contributor Author): I also can't use the fix described in #13018, since I cannot refer to the constant by its full path (it's private):

private STREAM_MAX_INPUT_BUFFER_SIZE = IO::DEFAULT_BUFFER_SIZE // (Base64::LINE_SIZE + 1) * Base64::LINE_BYTES
# Error: private constant Base64::LINE_SIZE referenced

@BlobCodes (Contributor Author):

So just to clarify, this replaces #14604 and supplements #13574 which covers the decoding part. Right?

Yes, although it is completely independent of #13574.

I thought memmove was the function which disallowed overlapped memory regions for performance reasons
@BlobCodes (Contributor Author):

As a note to anyone reviewing this:

In the bulk encoding part (-> #encode_base64_full_pairs_internal), memcpy is used to copy each input data triple (plus one lookahead byte) into the variable:

Intrinsics.memcpy(pointerof(value), input, 4, is_volatile: false)

This was adapted from the previous encoding method:

n = cstr.as(UInt32*).value.byte_swap

The old method performed an unaligned UInt32 access through Pointer (which reports natural alignment to LLVM regardless).
This could have caused a crash on strict-alignment architectures like RISC-V and older ARM versions, where unaligned accesses are forbidden.

This can also be seen in the generated IR:

body:                                             ; preds = %while
  %122 = load ptr, ptr %cstr, align 8, !dbg !17334
  %123 = load i32, ptr %122, align 4, !dbg !17334 ; <- Notice the align 4
  %124 = call i32 @"*UInt32#byte_swap:UInt32"(i32 %123), !dbg !17335

The new variant should now always work correctly, but may be slower than reading only 3 bytes on strict-alignment architectures. Hopefully we'll find an even more efficient method for the bulk encoding work once Crystal gets SIMD support.
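To illustrate what that bulk step computes, here is a standalone toy reconstruction (my own sketch, not the PR's exact code) of encoding one 3-byte group via a memcpy'd, byte-swapped UInt32. Note that the byte_swap assumes a little-endian host, just like the snippet above:

```crystal
CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

input = Bytes[0x4D, 0x61, 0x6E, 0x00] # "Man" plus one lookahead byte
value = 0_u32
Intrinsics.memcpy(pointerof(value), input.to_unsafe, 4, is_volatile: false)
value = value.byte_swap # now 0x4D616E00 on a little-endian machine

encoded = String.build do |str|
  # Peel off four 6-bit groups, most significant first.
  4.times do |i|
    str << CHARS[(value >> (26 - 6 * i)) & 0x3F]
  end
end
encoded # => "TWFu"
```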

BlobCodes and others added 2 commits May 24, 2024 16:42
@jgaskins (Contributor) commented Jun 1, 2024:

This looks great! Thanks for improving on my PR!
