Replace "Turbo-Base64" to "lemire/fastbase64" #31643

alexey-milovidov · 2021-11-22T20:52:35Z

bkuschel · 2021-11-23T13:11:15Z

The fastbase64 library seems to be a good option for avx2 (and avx2-512) SIMD support. It does not seem to have native support for Neon (AMD) or AltiVec (PowerPC), so if nothing else, there would be a fallback to the slower chromium base64 for those.

I saw that the aklomp library was attempted previously and unfortunately seems to have some issues as well.

alexey-milovidov · 2022-05-23T23:43:57Z

Looks abandoned.

mkmkme · 2023-07-18T15:29:22Z

Do you still want it to be done? I'd like to work on this.

alexey-milovidov · 2023-07-18T15:57:59Z

@mkmkme, Yes, it makes sense (although with low priority). It will be very appreciated!

mkmkme · 2023-07-19T10:34:22Z

It seems that so far this library is x86-only. I pinged the author to see if they have any plans on supporting ARM as well.

mkmkme · 2023-07-24T13:00:36Z

I've got the reply from them to use https://github.com/aklomp/base64
I integrated it, but so far the performance is about twice as bad as it was before my changes. I'll check if I can speed it up.

mkmkme · 2023-07-26T11:58:25Z

So, here's the summary of my research.

lemire/fastbase64 is not a proper library that was pointed by the author, but rather a PoC (for instance, it is x86-only). The author recommended using aklomp/base64, as that one inherited many of his algorithms as well.

WojciechMula/base64-avx512 mentioned in issue #41957 is also x86-only. It has a reference to a different repository by the same author, WojciechMula/base64simd, that is mentioned by the author to be problematic, and they recommend using aklomp/base64 as well.

While trying to integrate aklomp/base64 to CH I noticed that before Turbo-Base64 CH actually did use exactly that library (here's the PR that replaces it to TB64).

And the result is quite underwhelming. On ARM (M1 Max with macOS), aklomp/base64 is about 2 times slower.

Here's version with Turbo-Base64:

And here's with aklomp/base64:

I tried to poke to compile flags, but the result didn't get any better. I thought the reason might be ARM or macOS specific.
Double-checked on i7 with Fedora Linux installed, and result there is also about 1.5-2 times worse than with TB64: ~500MB/s with TB64 vs ~280MB/s with aklomp/base64.

My changes can be found here. Sorry I didn't make it into a pull request. I think it's not worth even a WIP pull request, because the results are not satisfactory so far.

I'd like to ping @alexey-milovidov and @rschu1ze. Maybe I'm missing something obvious? Or maybe it's not yet time to replace TB64 with something different?

den-crane · 2023-07-26T18:07:38Z

BTW, the behaviour of ARM and AMD builds is different .
It looks like they use different code.

mkmkme · 2023-07-26T18:21:57Z

Hey @den-crane!

Are you still able to reproduce the issue? There was a merged PR (#33779) since the version in question was released.

I just tested on my Macbook, and the command seems to work just fine:

It's not inside the docker, but I doubt that would change the behaviour.

den-crane · 2023-07-26T19:02:06Z

@mkmkme

Ampere A1

ClickHouse client version 23.6.2.18 (official build).
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 23.6.2 revision 54464.

clickhouse :) select  base64Decode('AGy1cKzasOC/yOPct7tO3w');

SELECT base64Decode('AGy1cKzasOC/yOPct7tO3w')

Query id: cacb7607-1a58-4066-94b1-434d72ef9209


0 rows in set. Elapsed: 0.021 sec.

Received exception from server (version 23.6.2):
Code: 117. DB::Exception: Received from localhost:9000. 
DB::Exception: Failed to base64Decode input 'AGy1cKzasOC/yOPct7tO3w':
 While processing base64Decode('AGy1cKzasOC/yOPct7tO3w'). (INCORRECT_DATA)



Linux clickhouse 5.15.0-1021-oracle #27~20.04.1-Ubuntu SMP Mon Oct 17 01:56:24 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux

Mac M1

curl https://clickhouse.com/ | sh

./clickhouse local
ClickHouse local version 23.7.1.2417 (official build).

m1.local :) select  base64Decode('AGy1cKzasOC/yOPct7tO3w');

SELECT base64Decode('AGy1cKzasOC/yOPct7tO3w')

Query id: 813728fb-c65d-4cb6-a3b7-5a9f2d6bced7


0 rows in set. Elapsed: 0.399 sec.

Received exception:
Code: 117. 
DB::Exception: Failed to base64Decode input 'AGy1cKzasOC/yOPct7tO3w':
 While processing base64Decode('AGy1cKzasOC/yOPct7tO3w'). (INCORRECT_DATA)

arm

SELECT base64Decode('SGVsbG8gd29ybGQ');
DB::Exception: Failed to base64Decode input 'SGVsbG8gd29ybGQ': While processing base64Decode('SGVsbG8gd29ybGQ'). (INCORRECT_DATA)

amd

SELECT base64Decode('SGVsbG8gd29ybGQ');
┌─base64Decode('SGVsbG8gd29ybGQ')─┐
│ Hello world                     │
└─────────────────────────────────┘

mkmkme · 2023-07-26T19:40:29Z

Huh, that's interesting! I've downloaded CH via curl and got the same result as you.

However, the binary built on my machine works just fine! So there's a chance something is going on with official build.

Sorry, I'm also quite new to this project, so I'd like to raise two newbie questions:

Version downloaded from clickhouse.com is 23.7.1.2417. However, there's no such tag (and 23.7 release webinar is only tomorrow IIRC). What's that build? Is it nightly? How can I match it with the git revision?
Are the scripts for building official binaries stored in this repo as well? I'd like to try the build flags used during compilation of that binary to see whether I'll be able to reproduce the issue then

den-crane · 2023-07-26T19:43:53Z

1 Is it nightly

yes.
select * from system.build_options;

2 Are the scripts for building official binaries stored in this repo as well?

yes.

build flags

I think they should be available in system.build_options

mkmkme · 2023-07-27T07:52:19Z

Hmm, I stand corrected. Yesterday I think I mistakenly ran the version with aklomp/base64 thinking it was the version without any changes.

I double-checked now, and yes, the version with Turbo-Base64 does fail with your example.

rschu1ze · 2023-07-27T11:55:27Z

@mkmkme SYSTEM.BUILD_OPTIONS can be a bit overwhelming. For (performance) testing, it is sufficient to pass these flags to CMake: -DENABLE_LIBRARIES=0 -DCMAKE_BUILD_TYPE=RelWithDebInfo.

About #31643 (comment): Yes, different code gets called for x86 and ARM but I am sure that @mkmkme noticed that already:

ClickHouse/src/Functions/FunctionBase64Conversion.h

Line 30 in 14141c9

# if defined(__aarch64__)

(btw., this discussion really belongs into #45790)

I'd like to ping @alexey-milovidov and @rschu1ze. Maybe I'm missing something obvious? Or maybe it's not yet time to replace TB64 with something different?

Looking at your draft, I notice that it calls set_source_files_properties based on HAVE_* flags, e.g. HAVE_AVX. The default ClickHouse build includes only up to SSE4.2 not AVX1/2/512, see cmake/cpu_features.cmake. I assume that aklomp/base64 ran only with SSE4.2 (unless you passed ENABLE_AVX, ENABLE_AVX2 etc. to CMake). This is different from Turbo64 (the current base64 lib) which configures the encode/decode routine using a runtime dispatch (cpuid()) to the fastest available SIMD option, see tb64ini() in turbob64sse.c. I assume the comparison was unfair because of this.

mkmkme · 2023-07-27T17:27:26Z

Hey @rschu1ze !
Thanks a lot for your input. You're right, my draft was unfair to aklomp/base64. I have now added the commit to my branch that enables all the flags for all the files in aklomp/base64 (so that all codecs will be compiled into the library). The link is the same. I checked that for x86 it also uses cpuid. Also, I added a call for a dummy base64_encode during initialisation in order to initialise the codec. For arm64, the library doesn't use any runtime check as it says there is none. So the fact that on M1 Mac the result didn't change is not surprising. But it didn't change for x86, either. Turbo-Base64 version got 529MB/s on my Fedora laptop whilst aklomp/base64 loops at 293MB/s.

rschu1ze · 2023-07-27T18:28:12Z

Okay, that is unfortunate from a performance POV (assuming your measurements compared AVX2 vs. AVX2, AVX512 vs. AVX512, i.e. in a fair way). On the other hand, the motivation to go for aklomp/base64 given in #41957 (comment) is still true (and with more GitHub stars, aklomp/base64 also seems more future-proof). I personally doubt that there exists a ClickHouse use case which bottlenecks on base64 encoding/decoding speed + note that most functions in ClickHouse are not SIMD-accelerated at all; so I would support the move to aklomp/base64 regardless. (Maybe we can build a small repro (e.g. a standalone binary) and provide it to the maintainers of aklomp/base64 to check why performance is worse than with Turbo64?)

mkmkme · 2023-07-27T18:39:47Z

I'll try to build some small reproduction for that case.

alexey-milovidov · 2023-07-28T08:31:25Z

Accepting performance degradation is possible but not good.
Maybe a good way is to address all the concerns that still remain in the Turbo-Base64.

mkmkme · 2023-07-28T08:34:47Z

Right now I'm working on a minimal benchmark to properly compare these two libraries.
I think all my checks indeed were a bit unfair for aklomp/base64, because the internal benchmarks of both libraries show that aklomp/base64 is about 10% faster than Turbo-Base64:

Turbo-Base64

$ ./tb64app
detected simd (id=38->'arm_neon')

  E MB/s    size     ratio%   D MB/s   function
size=1Kb (Kb=1.000)
  4324.11       1336 133.60%  5464.65 tb64s        1000
  7480.79       1336 133.60%  6859.91 tb64x        1000
 15298.66       1336 133.60% 13961.35 tb64sse        1000
 15228.53       1336 133.60% 13931.00 tb64auto        1000
 68730.45       1000 100.00% 70625.61 memcpy        1000
size=10Kb (Kb=1.000)
  4320.59      13336 133.36%  5477.52 tb64s       10000
  7525.90      13336 133.36%  6798.98 tb64x       10000
 16652.86      13336 133.36% 15139.39 tb64sse       10000
 16496.37      13336 133.36% 15003.66 tb64auto       10000
 75722.63      10000 100.00% 75631.53 memcpy       10000
size=100Kb (Kb=1.000)
  4151.13     133336 133.34%  5367.85 tb64s      100000
  6856.95     133336 133.34%  6453.72 tb64x      100000
 16791.77     133336 133.34% 15163.17 tb64sse      100000
 16988.01     133336 133.34% 15283.64 tb64auto      100000
 33875.84     100000 100.00% 30186.64 memcpy      100000
size=1Mb (Mb=1.000.000)
  4174.09    1333336 133.33%  5463.83 tb64s     1000000
  6970.32    1333336 133.33%  6577.28 tb64x     1000000
 16950.05    1333336 133.33% 15245.75 tb64sse     1000000
 16961.66    1333336 133.33% 15332.50 tb64auto     1000000
 59295.98    1000000 100.00% 54144.61 memcpy     1000000
size=10Mb (Mb=1.000.000)
  4146.94   13333336 133.33%  5400.67 tb64s    10000000
  6949.33   13333336 133.33%  6509.04 tb64x    10000000
 16546.09   13333336 133.33% 14872.77 tb64sse    10000000
 16513.51   13333336 133.33% 14936.76 tb64auto    10000000
 50999.64   10000000 100.00% 51289.61 memcpy    10000000
size=20Mb (Mb=1.000.000)
  4137.58   26666668 133.33%  5356.94 tb64s    20000000
  6927.29   26666668 133.33%  6461.46 tb64x    20000000
 16698.04   26666668 133.33% 15101.11 tb64sse    20000000
 16767.57   26666668 133.33% 15073.72 tb64auto    20000000
 48018.36   20000000 100.00% 48046.36 memcpy    20000000

aklomp/base64 (note that it's upside down in comparison to TB64)

$ ./build-macos/bin/benchmark
Filling buffer with 10.0 MB of random data...
Testing with buffer size 10 MB, fastest of 10 * 1
NEON64	encode	12673.61 MB/sec
NEON64	decode	8569.90 MB/sec
plain	encode	4694.65 MB/sec
plain	decode	4008.02 MB/sec
Testing with buffer size 1 MB, fastest of 10 * 10
NEON64	encode	18598.91 MB/sec
NEON64	decode	12437.81 MB/sec
plain	encode	5655.98 MB/sec
plain	decode	4230.86 MB/sec
Testing with buffer size 100 KB, fastest of 10 * 100
NEON64	encode	18235.06 MB/sec
NEON64	decode	12525.40 MB/sec
plain	encode	5660.82 MB/sec
plain	decode	4148.30 MB/sec
Testing with buffer size 10 KB, fastest of 100 * 100
NEON64	encode	18571.83 MB/sec
NEON64	decode	12587.33 MB/sec
plain	encode	6133.88 MB/sec
plain	decode	4428.03 MB/sec
Testing with buffer size 1 KB, fastest of 100 * 1000
NEON64	encode	16344.14 MB/sec
NEON64	decode	12087.52 MB/sec
plain	encode	5966.80 MB/sec
plain	decode	4366.96 MB/sec

mkmkme · 2023-07-31T11:52:49Z

@rschu1ze @alexey-milovidov I've pushed a somewhat minimalistic benchmark to compare those libraries.
The code and the results can be found here: https://github.com/mkmkme/base64-benchmark

In that repo, I tried to build Turbo-Base64 exactly as it is built in CH. For aklomp/base64, I tried to reuse the building scripts from the upstream repo.

Surprisingly, this time I got aklomp/base64 significantly faster on M1 Max, but sometimes even slower on Core i7.

Please take a look. Test on your machine, if needed. And please tell me if this benchmark makes sense.

alexey-milovidov · 2023-07-31T12:29:52Z

https://github.com/mkmkme/base64-benchmark/blob/main/main.cpp#L55

This isn't the best benchmark - unrealistic data.

If the number 10000 means bytes, it is too much. Typical strings in a database are in order of 100 bytes.
String lengths should be randomly distributed.

The best benchmark will be on URL, Title, SearchPhrase columns from ClickBench.

mkmkme · 2023-08-01T11:54:28Z

@alexey-milovidov noted. I updated the benchmark to use the data taken from ClickBench.

I used URL, Title and SearchPhrase from the dataset, where lengthUTF = 50, 100, 200, 500. I also included the longest URL, Title and SearchPhrase from the table. The queries are included in the README.

The new results are also there. And so far those are exactly the same as before: aklomp/base64 is faster on M1 Max but slower on i7-8650U. I do hope I got all the compilation flags correctly.

Please take a look at https://github.com/mkmkme/base64-benchmark/blob/main/README.md

powturbo · 2023-08-01T18:00:57Z

Hi,
According to my benchmarks (see README) on several intel/amd cpu's Turbo-Base64 avx2 is nearly 2 times faster than aklomp/base64. There is also the avx512 in the latest GPL version which sets new records in base64 speed.
The short strings developed initially for Clickhouse are 3,5-4 times faster ( call tb64init(0,1) to enable this on avx2 or call directly _tb64v256enc/_tb64v256dec).
The results are slightly varying depending on the cpu used, but there are no doubts about the difference in speed.

I've actually no arm64 benchmarks on M1. In your benchmark you're wrongly calling the scalar functions tb64senc/tb64sdec while the correct functions for all architectures are tb64enc/tb64dec (or the direct SIMD/NEON calls tb64v128enc/tb64v128dec)

In general you can always call the auto functions: tb64enc/tb64dec (no need to initialize anything).

mkmkme · 2023-08-02T06:22:42Z

Hey @powturbo,

Thanks for your input! You're right I was using scalar functions in case of Turbo-Base64 functions on __aarch64__. I did this because it is like that currently in ClickHouse codebase. There seems to have been some bug on that architecture:

ClickHouse/src/Functions/FunctionBase64Conversion.h

Lines 49 to 57 in 357fee9

    
                   /* 
        
                   * Some bug in sse arm64 implementation? 
        
                   * `base64Encode(repeat('a', 46))` returns wrong padding character 
        
                   */ 
        
           #    if defined(__aarch64__) 
        
                   return tb64senc(reinterpret_cast<const uint8_t *>(src.data()), src.size(), reinterpret_cast<uint8_t *>(dst)); 
        
           #    else 
        
                   return _tb64e(reinterpret_cast<const uint8_t *>(src.data()), src.size(), reinterpret_cast<uint8_t *>(dst)); 
        
           #    endif

powturbo · 2023-08-03T15:20:08Z

I've now the results for apple M1. The benchmark was done by a third party. All libs with the latest versions.
Turbo-base64 encodes/decodes ~25% faster than base64 on ARM Neon
while the scalar functions are 64% faster than the base64 plain.

Note that base64 is using inline assembly.

E MB/s	size	ratio	D MB/s	50,000 bytes (2023.08)
24012.43	66668	133.34%	15352.09	tb64v128 (turbo-base64)
19087.55	66668	133.34%	12515.17	b64neon64 (aklomp/base64)
5611.48	66668	133.34%	5092.64	tb64s (scalar)
9782.45	66668	133.34%	3919.52	tb64x (scalar)
6181.37	66668	133.34%	3108.54	b64plain
45566.16	50000	100.00%	45484.13	memcpy

mkmkme · 2023-08-08T16:56:29Z

@alexey-milovidov @rschu1ze what'd you say? Given the new information, do you still want to move to the other library?

rschu1ze · 2023-08-14T11:43:34Z

(sorry for the delay, just returned from vacation)

The Turbo-Base64 version used by ClickHouse and in the benchmark is the last non-GPL version and ca. 3 years old. While the performance improvements since then (#31643 (comment)) are appreciated, they are not usable for ClickHouse due to licensing issues. The performance trade-off (slower x86, faster ARM) seems reasonable given that 1. base64-decoding is IMHO not exactly a bottleneck and 2. the version of aklomp/base64 used in the benchmark is also slightly outdated and there were more upstream optimizations for x86 since then. Overall, I think it's worth to move to aklomp/base64.

alexey-milovidov · 2023-08-15T05:28:30Z

The bug on AArch64

base64Encode(repeat('a', 46))` returns wrong padding character

is a blocker for using this library.

mkmkme · 2023-08-15T08:43:45Z

I'll double-check if I can still reproduce it

mkmkme · 2023-08-31T12:08:00Z

The bug on AArch64

base64Encode(repeat('a', 46))` returns wrong padding character

is a blocker for using this library.

Couldn't reproduce this on my AArch64 VM.
Will publish the PR for replacing the library today.

Closes ClickHouse#31643

alexey-milovidov added the build label Nov 22, 2021

alexey-milovidov mentioned this issue Nov 22, 2021

base64Encode padding error #30854

Closed

den-crane mentioned this issue Nov 25, 2021

Fix base64Encode adding trailing bytes on small strings #31797

Merged

alexey-milovidov assigned vdimir Jan 19, 2022

vdimir mentioned this issue Jan 20, 2022

Use tb64senc for base64Decode on aarch64 #33779

Merged

alexey-milovidov unassigned vdimir May 23, 2022

alexey-milovidov mentioned this issue Oct 1, 2022

AVX512VBMI implementation of base64Encode / base64Decode #41957

Closed

mkmkme mentioned this issue Jul 19, 2023

Any plans on supporting aarch64? lemire/fastbase64#15

Closed

rschu1ze mentioned this issue Aug 18, 2023

Fix base64 encoding and decoding for s390x ClickHouse/Turbo-Base64#2

Merged

mkmkme mentioned this issue Aug 31, 2023

Replace Turbo-Base64 with aklomp/base64 #54119

Merged

mkmkme added a commit to mkmkme/ClickHouse that referenced this issue Sep 6, 2023

Replace Turbo-Base64 with aklomp/base64

f98007a

Closes ClickHouse#31643

rschu1ze closed this as completed in #54119 Sep 9, 2023

mgorshkov mentioned this issue Jun 5, 2024

Dedicated support for Base64URL encoding #38832

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace "Turbo-Base64" to "lemire/fastbase64" #31643

Replace "Turbo-Base64" to "lemire/fastbase64" #31643

alexey-milovidov commented Nov 22, 2021

bkuschel commented Nov 23, 2021

alexey-milovidov commented May 23, 2022

mkmkme commented Jul 18, 2023

alexey-milovidov commented Jul 18, 2023

mkmkme commented Jul 19, 2023

mkmkme commented Jul 24, 2023

mkmkme commented Jul 26, 2023

den-crane commented Jul 26, 2023

mkmkme commented Jul 26, 2023

den-crane commented Jul 26, 2023 •

edited

mkmkme commented Jul 26, 2023

den-crane commented Jul 26, 2023

mkmkme commented Jul 27, 2023

rschu1ze commented Jul 27, 2023

mkmkme commented Jul 27, 2023

rschu1ze commented Jul 27, 2023

mkmkme commented Jul 27, 2023

alexey-milovidov commented Jul 28, 2023

mkmkme commented Jul 28, 2023

mkmkme commented Jul 31, 2023

alexey-milovidov commented Jul 31, 2023 •

edited

mkmkme commented Aug 1, 2023

powturbo commented Aug 1, 2023 •

edited

mkmkme commented Aug 2, 2023

powturbo commented Aug 3, 2023

mkmkme commented Aug 8, 2023

rschu1ze commented Aug 14, 2023

alexey-milovidov commented Aug 15, 2023

mkmkme commented Aug 15, 2023

mkmkme commented Aug 31, 2023

Replace "Turbo-Base64" to "lemire/fastbase64" #31643

Replace "Turbo-Base64" to "lemire/fastbase64" #31643

Comments

alexey-milovidov commented Nov 22, 2021

bkuschel commented Nov 23, 2021

alexey-milovidov commented May 23, 2022

mkmkme commented Jul 18, 2023

alexey-milovidov commented Jul 18, 2023

mkmkme commented Jul 19, 2023

mkmkme commented Jul 24, 2023

mkmkme commented Jul 26, 2023

den-crane commented Jul 26, 2023

mkmkme commented Jul 26, 2023

den-crane commented Jul 26, 2023 • edited

mkmkme commented Jul 26, 2023

den-crane commented Jul 26, 2023

mkmkme commented Jul 27, 2023

rschu1ze commented Jul 27, 2023

mkmkme commented Jul 27, 2023

rschu1ze commented Jul 27, 2023

mkmkme commented Jul 27, 2023

alexey-milovidov commented Jul 28, 2023

mkmkme commented Jul 28, 2023

mkmkme commented Jul 31, 2023

alexey-milovidov commented Jul 31, 2023 • edited

mkmkme commented Aug 1, 2023

powturbo commented Aug 1, 2023 • edited

mkmkme commented Aug 2, 2023

powturbo commented Aug 3, 2023

mkmkme commented Aug 8, 2023

rschu1ze commented Aug 14, 2023

alexey-milovidov commented Aug 15, 2023

mkmkme commented Aug 15, 2023

mkmkme commented Aug 31, 2023

den-crane commented Jul 26, 2023 •

edited

alexey-milovidov commented Jul 31, 2023 •

edited

powturbo commented Aug 1, 2023 •

edited