Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

base64 decode faster #10032

Closed
wants to merge 4 commits into from
Closed

base64 decode faster #10032

wants to merge 4 commits into from

Conversation

bagder
Copy link
Member

@bagder bagder commented Dec 5, 2022

  • by using a lookup table instead of strchr()
  • by doing full quantums first, then padding

Test

The test decodes "short" strings, up to 2948 bytes in length and it loops a thousand times.

test code

Numbers

Best out of three runs on my local machine.

Old code: 1m47.593s
New code: 24.098s

Speedup: 4.46 times.

- by using a lookup table instead of strchr()
- by doing full quantums first, then padding
@maage
Copy link

maage commented Dec 5, 2022

I see tiny speedup on your test program, like 10s -> 9s with this patch.
maage@492d018

@bagder
Copy link
Member Author

bagder commented Dec 5, 2022 via email

@bagder
Copy link
Member Author

bagder commented Dec 5, 2022

maybe I should allow the user to select which way to go with an ifdef...

@bagder bagder closed this in c6f602c Dec 6, 2022
@bagder bagder deleted the bagder/base64decode-faster branch December 6, 2022 07:57
@lucasexe
Copy link

lucasexe commented Dec 27, 2022

Thanks @bagder for this great improvement.
I would like to discuss again about the @maage's sugestion.

In a quick analysis, your original solution uses less memory but there are some caveats:

  • In the original solution we have: 80 bytes in the data/rodata segment + 256 bytes in stack + operations to copy the data to stack).
  • In the second version we have only 256 bytes in the rodata segment.

The original solution has a memory peak greater than the second one during the function execution.

Please also note that "operations to copy the data to stack" will also require some space in the code segment. Any compiler with a decent optimizer will replace both 2 memcpy to several other instructions. (Please check this example: https://godbolt.org/z/o8995W7qv):

GCC (with -O2 & x86-64 arch) optimizes the memcpys to other instructions that requires 112 bytes in code segment.
Clang (with -O2 & x86-64 arch) optimizes the memcpys to other instructions that requires 192 bytes in code segment.

So, compiling with GCC, the original solution spends 80 + 112 = 192 bytes + 256 (if the function is called). Compiling with Clang, spends 80 + 192 = 272 bytes (more than the second solution) + 256 bytes in the stack.

With all those numbers, I strongly recommend to use the second version and maybe add an ifdef to allow user change it if required, as you commented.

What do you think?

@bagder
Copy link
Member Author

bagder commented Dec 27, 2022

What do you think?

I've moved on. I don't believe in adding ifdefs for this because nobody will care enough to alter those. If someone feels strongly for further changes, propose those in a PR and explain why we want that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants