Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace "Turbo-Base64" to "lemire/fastbase64" #31643

Closed
alexey-milovidov opened this issue Nov 22, 2021 · 30 comments · Fixed by #54119
Closed

Replace "Turbo-Base64" to "lemire/fastbase64" #31643

alexey-milovidov opened this issue Nov 22, 2021 · 30 comments · Fixed by #54119
Labels

Comments

@alexey-milovidov
Copy link
Member

https://github.com/lemire/fastbase64

@bkuschel
Copy link
Contributor

The fastbase64 library seems to be a good option for avx2 (and avx2-512) SIMD support. It does not seem to have native support for Neon (AMD) or AltiVec (PowerPC), so if nothing else, there would be a fallback to the slower chromium base64 for those.

I saw that the aklomp library was attempted previously and unfortunately seems to have some issues as well.

@alexey-milovidov
Copy link
Member Author

Looks abandoned.

@mkmkme
Copy link
Contributor

mkmkme commented Jul 18, 2023

Do you still want it to be done? I'd like to work on this.

@alexey-milovidov
Copy link
Member Author

@mkmkme, Yes, it makes sense (although with low priority). It will be very appreciated!

@mkmkme
Copy link
Contributor

mkmkme commented Jul 19, 2023

It seems that so far this library is x86-only. I pinged the author to see if they have any plans on supporting ARM as well.

@mkmkme
Copy link
Contributor

mkmkme commented Jul 24, 2023

I've got the reply from them to use https://github.com/aklomp/base64
I integrated it, but so far the performance is about twice as bad as it was before my changes. I'll check if I can speed it up.

@mkmkme
Copy link
Contributor

mkmkme commented Jul 26, 2023

So, here's the summary of my research.

lemire/fastbase64 is not a proper library that was pointed by the author, but rather a PoC (for instance, it is x86-only). The author recommended using aklomp/base64, as that one inherited many of his algorithms as well.

WojciechMula/base64-avx512 mentioned in issue #41957 is also x86-only. It has a reference to a different repository by the same author, WojciechMula/base64simd, that is mentioned by the author to be problematic, and they recommend using aklomp/base64 as well.

While trying to integrate aklomp/base64 to CH I noticed that before Turbo-Base64 CH actually did use exactly that library (here's the PR that replaces it to TB64).

And the result is quite underwhelming. On ARM (M1 Max with macOS), aklomp/base64 is about 2 times slower.

Here's version with Turbo-Base64:
image

And here's with aklomp/base64:
image

I tried to poke to compile flags, but the result didn't get any better. I thought the reason might be ARM or macOS specific.
Double-checked on i7 with Fedora Linux installed, and result there is also about 1.5-2 times worse than with TB64: ~500MB/s with TB64 vs ~280MB/s with aklomp/base64.

My changes can be found here. Sorry I didn't make it into a pull request. I think it's not worth even a WIP pull request, because the results are not satisfactory so far.

I'd like to ping @alexey-milovidov and @rschu1ze. Maybe I'm missing something obvious? Or maybe it's not yet time to replace TB64 with something different?

@den-crane
Copy link
Contributor

BTW, the behaviour of ARM and AMD builds is different .
It looks like they use different code.

@mkmkme
Copy link
Contributor

mkmkme commented Jul 26, 2023

Hey @den-crane!

Are you still able to reproduce the issue? There was a merged PR (#33779) since the version in question was released.

I just tested on my Macbook, and the command seems to work just fine:
image

It's not inside the docker, but I doubt that would change the behaviour.

@den-crane
Copy link
Contributor

den-crane commented Jul 26, 2023

@mkmkme

Ampere A1

ClickHouse client version 23.6.2.18 (official build).
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 23.6.2 revision 54464.

clickhouse :) select  base64Decode('AGy1cKzasOC/yOPct7tO3w');

SELECT base64Decode('AGy1cKzasOC/yOPct7tO3w')

Query id: cacb7607-1a58-4066-94b1-434d72ef9209


0 rows in set. Elapsed: 0.021 sec.

Received exception from server (version 23.6.2):
Code: 117. DB::Exception: Received from localhost:9000. 
DB::Exception: Failed to base64Decode input 'AGy1cKzasOC/yOPct7tO3w':
 While processing base64Decode('AGy1cKzasOC/yOPct7tO3w'). (INCORRECT_DATA)



Linux clickhouse 5.15.0-1021-oracle #27~20.04.1-Ubuntu SMP Mon Oct 17 01:56:24 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux

Mac M1

curl https://clickhouse.com/ | sh

./clickhouse local
ClickHouse local version 23.7.1.2417 (official build).

m1.local :) select  base64Decode('AGy1cKzasOC/yOPct7tO3w');

SELECT base64Decode('AGy1cKzasOC/yOPct7tO3w')

Query id: 813728fb-c65d-4cb6-a3b7-5a9f2d6bced7


0 rows in set. Elapsed: 0.399 sec.

Received exception:
Code: 117. 
DB::Exception: Failed to base64Decode input 'AGy1cKzasOC/yOPct7tO3w':
 While processing base64Decode('AGy1cKzasOC/yOPct7tO3w'). (INCORRECT_DATA)

arm

SELECT base64Decode('SGVsbG8gd29ybGQ');
DB::Exception: Failed to base64Decode input 'SGVsbG8gd29ybGQ': While processing base64Decode('SGVsbG8gd29ybGQ'). (INCORRECT_DATA)

amd

SELECT base64Decode('SGVsbG8gd29ybGQ');
┌─base64Decode('SGVsbG8gd29ybGQ')─┐
│ Hello world                     │
└─────────────────────────────────┘

@mkmkme
Copy link
Contributor

mkmkme commented Jul 26, 2023

Huh, that's interesting! I've downloaded CH via curl and got the same result as you.

However, the binary built on my machine works just fine! So there's a chance something is going on with official build.

Sorry, I'm also quite new to this project, so I'd like to raise two newbie questions:

  1. Version downloaded from clickhouse.com is 23.7.1.2417. However, there's no such tag (and 23.7 release webinar is only tomorrow IIRC). What's that build? Is it nightly? How can I match it with the git revision?
  2. Are the scripts for building official binaries stored in this repo as well? I'd like to try the build flags used during compilation of that binary to see whether I'll be able to reproduce the issue then

@den-crane
Copy link
Contributor

1 Is it nightly

yes.
select * from system.build_options;

2 Are the scripts for building official binaries stored in this repo as well?

yes.

build flags

I think they should be available in system.build_options

@mkmkme
Copy link
Contributor

mkmkme commented Jul 27, 2023

Hmm, I stand corrected. Yesterday I think I mistakenly ran the version with aklomp/base64 thinking it was the version without any changes.

I double-checked now, and yes, the version with Turbo-Base64 does fail with your example.

@rschu1ze
Copy link
Member

@mkmkme SYSTEM.BUILD_OPTIONS can be a bit overwhelming. For (performance) testing, it is sufficient to pass these flags to CMake: -DENABLE_LIBRARIES=0 -DCMAKE_BUILD_TYPE=RelWithDebInfo.

About #31643 (comment): Yes, different code gets called for x86 and ARM but I am sure that @mkmkme noticed that already:

# if defined(__aarch64__)

(btw., this discussion really belongs into #45790)

I'd like to ping @alexey-milovidov and @rschu1ze. Maybe I'm missing something obvious? Or maybe it's not yet time to replace TB64 with something different?

Looking at your draft, I notice that it calls set_source_files_properties based on HAVE_* flags, e.g. HAVE_AVX. The default ClickHouse build includes only up to SSE4.2 not AVX1/2/512, see cmake/cpu_features.cmake. I assume that aklomp/base64 ran only with SSE4.2 (unless you passed ENABLE_AVX, ENABLE_AVX2 etc. to CMake). This is different from Turbo64 (the current base64 lib) which configures the encode/decode routine using a runtime dispatch (cpuid()) to the fastest available SIMD option, see tb64ini() in turbob64sse.c. I assume the comparison was unfair because of this.

@mkmkme
Copy link
Contributor

mkmkme commented Jul 27, 2023

Hey @rschu1ze !
Thanks a lot for your input. You're right, my draft was unfair to aklomp/base64. I have now added the commit to my branch that enables all the flags for all the files in aklomp/base64 (so that all codecs will be compiled into the library). The link is the same. I checked that for x86 it also uses cpuid. Also, I added a call for a dummy base64_encode during initialisation in order to initialise the codec. For arm64, the library doesn't use any runtime check as it says there is none. So the fact that on M1 Mac the result didn't change is not surprising. But it didn't change for x86, either. Turbo-Base64 version got 529MB/s on my Fedora laptop whilst aklomp/base64 loops at 293MB/s.

@rschu1ze
Copy link
Member

Okay, that is unfortunate from a performance POV (assuming your measurements compared AVX2 vs. AVX2, AVX512 vs. AVX512, i.e. in a fair way). On the other hand, the motivation to go for aklomp/base64 given in #41957 (comment) is still true (and with more GitHub stars, aklomp/base64 also seems more future-proof). I personally doubt that there exists a ClickHouse use case which bottlenecks on base64 encoding/decoding speed + note that most functions in ClickHouse are not SIMD-accelerated at all; so I would support the move to aklomp/base64 regardless. (Maybe we can build a small repro (e.g. a standalone binary) and provide it to the maintainers of aklomp/base64 to check why performance is worse than with Turbo64?)

@mkmkme
Copy link
Contributor

mkmkme commented Jul 27, 2023

I'll try to build some small reproduction for that case.

@alexey-milovidov
Copy link
Member Author

Accepting performance degradation is possible but not good.
Maybe a good way is to address all the concerns that still remain in the Turbo-Base64.

@mkmkme
Copy link
Contributor

mkmkme commented Jul 28, 2023

Right now I'm working on a minimal benchmark to properly compare these two libraries.
I think all my checks indeed were a bit unfair for aklomp/base64, because the internal benchmarks of both libraries show that aklomp/base64 is about 10% faster than Turbo-Base64:

Turbo-Base64

$ ./tb64app
detected simd (id=38->'arm_neon')

  E MB/s    size     ratio%   D MB/s   function
size=1Kb (Kb=1.000)
  4324.11       1336 133.60%  5464.65 tb64s        1000
  7480.79       1336 133.60%  6859.91 tb64x        1000
 15298.66       1336 133.60% 13961.35 tb64sse        1000
 15228.53       1336 133.60% 13931.00 tb64auto        1000
 68730.45       1000 100.00% 70625.61 memcpy        1000
size=10Kb (Kb=1.000)
  4320.59      13336 133.36%  5477.52 tb64s       10000
  7525.90      13336 133.36%  6798.98 tb64x       10000
 16652.86      13336 133.36% 15139.39 tb64sse       10000
 16496.37      13336 133.36% 15003.66 tb64auto       10000
 75722.63      10000 100.00% 75631.53 memcpy       10000
size=100Kb (Kb=1.000)
  4151.13     133336 133.34%  5367.85 tb64s      100000
  6856.95     133336 133.34%  6453.72 tb64x      100000
 16791.77     133336 133.34% 15163.17 tb64sse      100000
 16988.01     133336 133.34% 15283.64 tb64auto      100000
 33875.84     100000 100.00% 30186.64 memcpy      100000
size=1Mb (Mb=1.000.000)
  4174.09    1333336 133.33%  5463.83 tb64s     1000000
  6970.32    1333336 133.33%  6577.28 tb64x     1000000
 16950.05    1333336 133.33% 15245.75 tb64sse     1000000
 16961.66    1333336 133.33% 15332.50 tb64auto     1000000
 59295.98    1000000 100.00% 54144.61 memcpy     1000000
size=10Mb (Mb=1.000.000)
  4146.94   13333336 133.33%  5400.67 tb64s    10000000
  6949.33   13333336 133.33%  6509.04 tb64x    10000000
 16546.09   13333336 133.33% 14872.77 tb64sse    10000000
 16513.51   13333336 133.33% 14936.76 tb64auto    10000000
 50999.64   10000000 100.00% 51289.61 memcpy    10000000
size=20Mb (Mb=1.000.000)
  4137.58   26666668 133.33%  5356.94 tb64s    20000000
  6927.29   26666668 133.33%  6461.46 tb64x    20000000
 16698.04   26666668 133.33% 15101.11 tb64sse    20000000
 16767.57   26666668 133.33% 15073.72 tb64auto    20000000
 48018.36   20000000 100.00% 48046.36 memcpy    20000000

aklomp/base64 (note that it's upside down in comparison to TB64)

$ ./build-macos/bin/benchmark
Filling buffer with 10.0 MB of random data...
Testing with buffer size 10 MB, fastest of 10 * 1
NEON64	encode	12673.61 MB/sec
NEON64	decode	8569.90 MB/sec
plain	encode	4694.65 MB/sec
plain	decode	4008.02 MB/sec
Testing with buffer size 1 MB, fastest of 10 * 10
NEON64	encode	18598.91 MB/sec
NEON64	decode	12437.81 MB/sec
plain	encode	5655.98 MB/sec
plain	decode	4230.86 MB/sec
Testing with buffer size 100 KB, fastest of 10 * 100
NEON64	encode	18235.06 MB/sec
NEON64	decode	12525.40 MB/sec
plain	encode	5660.82 MB/sec
plain	decode	4148.30 MB/sec
Testing with buffer size 10 KB, fastest of 100 * 100
NEON64	encode	18571.83 MB/sec
NEON64	decode	12587.33 MB/sec
plain	encode	6133.88 MB/sec
plain	decode	4428.03 MB/sec
Testing with buffer size 1 KB, fastest of 100 * 1000
NEON64	encode	16344.14 MB/sec
NEON64	decode	12087.52 MB/sec
plain	encode	5966.80 MB/sec
plain	decode	4366.96 MB/sec

@mkmkme
Copy link
Contributor

mkmkme commented Jul 31, 2023

@rschu1ze @alexey-milovidov I've pushed a somewhat minimalistic benchmark to compare those libraries.
The code and the results can be found here: https://github.com/mkmkme/base64-benchmark

In that repo, I tried to build Turbo-Base64 exactly as it is built in CH. For aklomp/base64, I tried to reuse the building scripts from the upstream repo.

Surprisingly, this time I got aklomp/base64 significantly faster on M1 Max, but sometimes even slower on Core i7.

Please take a look. Test on your machine, if needed. And please tell me if this benchmark makes sense.

@alexey-milovidov
Copy link
Member Author

alexey-milovidov commented Jul 31, 2023

https://github.com/mkmkme/base64-benchmark/blob/main/main.cpp#L55

This isn't the best benchmark - unrealistic data.

If the number 10000 means bytes, it is too much. Typical strings in a database are in order of 100 bytes.
String lengths should be randomly distributed.

The best benchmark will be on URL, Title, SearchPhrase columns from ClickBench.

@mkmkme
Copy link
Contributor

mkmkme commented Aug 1, 2023

@alexey-milovidov noted. I updated the benchmark to use the data taken from ClickBench.

I used URL, Title and SearchPhrase from the dataset, where lengthUTF = 50, 100, 200, 500. I also included the longest URL, Title and SearchPhrase from the table. The queries are included in the README.

The new results are also there. And so far those are exactly the same as before: aklomp/base64 is faster on M1 Max but slower on i7-8650U. I do hope I got all the compilation flags correctly.

Please take a look at https://github.com/mkmkme/base64-benchmark/blob/main/README.md

@powturbo
Copy link

powturbo commented Aug 1, 2023

Hi,
According to my benchmarks (see README) on several intel/amd cpu's Turbo-Base64 avx2 is nearly 2 times faster than aklomp/base64. There is also the avx512 in the latest GPL version which sets new records in base64 speed.
The short strings developed initially for Clickhouse are 3,5-4 times faster ( call tb64init(0,1) to enable this on avx2 or call directly _tb64v256enc/_tb64v256dec).
The results are slightly varying depending on the cpu used, but there are no doubts about the difference in speed.

I've actually no arm64 benchmarks on M1. In your benchmark you're wrongly calling the scalar functions tb64senc/tb64sdec while the correct functions for all architectures are tb64enc/tb64dec (or the direct SIMD/NEON calls tb64v128enc/tb64v128dec)

In general you can always call the auto functions: tb64enc/tb64dec (no need to initialize anything).

@mkmkme
Copy link
Contributor

mkmkme commented Aug 2, 2023

Hey @powturbo,

Thanks for your input! You're right I was using scalar functions in case of Turbo-Base64 functions on __aarch64__. I did this because it is like that currently in ClickHouse codebase. There seems to have been some bug on that architecture:

/*
* Some bug in sse arm64 implementation?
* `base64Encode(repeat('a', 46))` returns wrong padding character
*/
# if defined(__aarch64__)
return tb64senc(reinterpret_cast<const uint8_t *>(src.data()), src.size(), reinterpret_cast<uint8_t *>(dst));
# else
return _tb64e(reinterpret_cast<const uint8_t *>(src.data()), src.size(), reinterpret_cast<uint8_t *>(dst));
# endif

@powturbo
Copy link

powturbo commented Aug 3, 2023

I've now the results for apple M1. The benchmark was done by a third party. All libs with the latest versions.
Turbo-base64 encodes/decodes ~25% faster than base64 on ARM Neon
while the scalar functions are 64% faster than the base64 plain.

Note that base64 is using inline assembly.

E MB/s size ratio D MB/s 50,000 bytes (2023.08)
24012.43 66668 133.34% 15352.09 tb64v128 (turbo-base64)
19087.55 66668 133.34% 12515.17 b64neon64 (aklomp/base64)
5611.48 66668 133.34% 5092.64 tb64s (scalar)
9782.45 66668 133.34% 3919.52 tb64x (scalar)
6181.37 66668 133.34% 3108.54 b64plain
45566.16 50000 100.00% 45484.13 memcpy

@mkmkme
Copy link
Contributor

mkmkme commented Aug 8, 2023

@alexey-milovidov @rschu1ze what'd you say? Given the new information, do you still want to move to the other library?

@rschu1ze
Copy link
Member

(sorry for the delay, just returned from vacation)

The Turbo-Base64 version used by ClickHouse and in the benchmark is the last non-GPL version and ca. 3 years old. While the performance improvements since then (#31643 (comment)) are appreciated, they are not usable for ClickHouse due to licensing issues. The performance trade-off (slower x86, faster ARM) seems reasonable given that 1. base64-decoding is IMHO not exactly a bottleneck and 2. the version of aklomp/base64 used in the benchmark is also slightly outdated and there were more upstream optimizations for x86 since then. Overall, I think it's worth to move to aklomp/base64.

@alexey-milovidov
Copy link
Member Author

The bug on AArch64

base64Encode(repeat('a', 46))` returns wrong padding character

is a blocker for using this library.

@mkmkme
Copy link
Contributor

mkmkme commented Aug 15, 2023

I'll double-check if I can still reproduce it

@mkmkme
Copy link
Contributor

mkmkme commented Aug 31, 2023

The bug on AArch64

base64Encode(repeat('a', 46))` returns wrong padding character

is a blocker for using this library.

Couldn't reproduce this on my AArch64 VM.
Will publish the PR for replacing the library today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants