Implement hw accelerated AES #10902
// zen3/VAES(VEX.128).
// It seems like VAES(VEX.256) should be faster?
// TODO Choose value at runtime based on some criteria?
constexpr size_t BLOCK_DEPTH = 10;
btw this value may have a better default, as I've only tested on Zen 2 and Zen 3. Maybe someone can run some speed tests while tweaking it on Zen 1 or some Intel archs?
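To make the tuning question concrete, here is a minimal sketch of what a block-depth constant typically controls. It assumes (the diff excerpt does not confirm this) that `BLOCK_DEPTH` is the number of independent blocks processed per loop iteration, as in CBC decryption where each block can be decrypted without waiting on its neighbours. `ToyDecryptBlock` and `DecryptCbc` are hypothetical names, and the toy block function is a stand-in, not real AES.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

using u8 = std::uint8_t;
constexpr std::size_t BLOCK_SIZE = 16;
using Block = std::array<u8, BLOCK_SIZE>;

// Toy stand-in for a single block decryption (NOT real AES).
inline Block ToyDecryptBlock(const Block& in, const Block& key)
{
  Block out;
  for (std::size_t i = 0; i < BLOCK_SIZE; ++i)
    out[i] = static_cast<u8>(in[(i + 1) % BLOCK_SIZE] ^ key[i]);
  return out;
}

// CBC decryption that handles up to Depth blocks per iteration.
// Depth plays the role assumed for BLOCK_DEPTH in the text above.
template <std::size_t Depth>
void DecryptCbc(const Block& key, Block iv, const u8* in, u8* out, std::size_t num_blocks)
{
  static_assert(Depth > 0, "Depth must be at least 1");
  std::size_t done = 0;
  while (done < num_blocks)
  {
    const std::size_t n = std::min<std::size_t>(Depth, num_blocks - done);
    Block cipher[Depth];
    Block plain[Depth];

    // Load all ciphertext first so in-place use (in == out) stays correct.
    for (std::size_t j = 0; j < n; ++j)
      std::memcpy(cipher[j].data(), in + (done + j) * BLOCK_SIZE, BLOCK_SIZE);

    // These decryptions do not depend on each other, so an accelerated
    // implementation could keep all of them in flight at once.
    for (std::size_t j = 0; j < n; ++j)
      plain[j] = ToyDecryptBlock(cipher[j], key);

    // CBC chaining: XOR each result with the previous ciphertext block.
    for (std::size_t j = 0; j < n; ++j)
    {
      const Block& prev = (j == 0) ? iv : cipher[j - 1];
      for (std::size_t i = 0; i < BLOCK_SIZE; ++i)
        plain[j][i] ^= prev[i];
      std::memcpy(out + (done + j) * BLOCK_SIZE, plain[j].data(), BLOCK_SIZE);
    }

    iv = cipher[n - 1];  // Last ciphertext block chains into the next batch.
    done += n;
  }
}
```

The output is the same for every `Depth`; only the amount of independent work exposed per iteration changes, which is why the best value would be expected to vary between microarchitectures and needs measuring rather than guessing.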
Please rebase.
@shuffle2 this breaks compilation on arm64 Linux (testing on bionic with gcc-11), with multiple errors of this type:
Lmao this time I really pasted the entire code into godbolt and checked arm64 gcc, so I’m not sure why it differs for you. Maybe they changed intrinsic prototypes between versions or something.
I think it is a pretty simple error. You could reference some of these posts: https://stackoverflow.com/questions/43521206/arm-neon-how-to-convert-from-uint8x16-t-to-uint8x8x2-t
I didn't say I did not understand the error or the code.
Similar to the SHA1 PR, this creates a wrapper around the "generic" path, with overrides for arch-specific accelerated implementations.
Unlike the SHA1 PR, this changes the API so callers can manage IV and data buffers better, resulting in less data movement and more clarity about when the IV is being updated or not.
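As a rough illustration of that API shape, the sketch below shows a context interface where the caller passes the IV in and optionally receives the updated IV back, plus a factory that could return an accelerated subclass. Every name here (`Context`, `Crypt`, `GenericContext`, `CreateContextEncrypt`) is hypothetical and not taken from the PR's diff, and a toy XOR operation stands in for the AES block function.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>

using u8 = std::uint8_t;
constexpr std::size_t BLOCK_SIZE = 16;

class Context
{
public:
  virtual ~Context() = default;
  // CBC-encrypts len bytes (must be a multiple of BLOCK_SIZE).
  // iv is never modified; if iv_out is non-null it receives the IV to use
  // for the next call, so the caller decides when the IV advances.
  virtual bool Crypt(const u8* iv, u8* iv_out, const u8* in, u8* out, std::size_t len) const = 0;
};

// "Generic" path. The block operation is a toy XOR, NOT real AES.
class GenericContext final : public Context
{
public:
  explicit GenericContext(const u8* key) { std::memcpy(m_key, key, BLOCK_SIZE); }

  bool Crypt(const u8* iv, u8* iv_out, const u8* in, u8* out, std::size_t len) const override
  {
    if (len % BLOCK_SIZE != 0)
      return false;
    u8 chain[BLOCK_SIZE];
    std::memcpy(chain, iv, BLOCK_SIZE);
    for (std::size_t off = 0; off < len; off += BLOCK_SIZE)
    {
      // chain = ToyEncrypt(plaintext ^ chain); works in place (in == out).
      for (std::size_t i = 0; i < BLOCK_SIZE; ++i)
        chain[i] = static_cast<u8>(in[off + i] ^ chain[i] ^ m_key[i]);
      std::memcpy(out + off, chain, BLOCK_SIZE);
    }
    if (iv_out)
      std::memcpy(iv_out, chain, BLOCK_SIZE);
    return true;
  }

private:
  u8 m_key[BLOCK_SIZE];
};

// Factory: an arch-specific subclass would be selected here when the CPU
// supports it. This sketch only has the generic path.
std::unique_ptr<Context> CreateContextEncrypt(const u8* key)
{
  return std::make_unique<GenericContext>(key);
}
```

With this shape, a caller that streams data can pass the same buffer as both `iv` and `iv_out` to chain successive calls, or pass `nullptr` for `iv_out` when the IV should be left alone, without copying data into temporary buffers.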
In testing, VAES.256 instructions did offer a decent further speedup on x64, but at the moment it doesn't seem worth increasing the instruction set support matrix. Maybe in the future.
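Time for `DolphinTool verify -a sha1` on a Wii game stored as gcz, in seconds:
It's restricted to `-a sha1` to cut out md5, which is not optimized and would therefore cause all timings to be equivalent to the md5 time. Since `-a sha1` performs sha1 both while reading the disc and over the entire file, the runtime is still dominated by sha1. Before this PR, it would have been dominated by aes.
Since (Wii) disc reads involve aes + sha1, and the verify operation performs some of the hashing in a somewhat parallel way, the perf differences are more meaningful when considering the aes and sha1 improvements together.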
performs sha1 while reading the disc and over the entire file, such runtime is still dominated by sha1. Before this PR, it would have been dominated by aes.Since (wii) disc reads involve aes + sha1, and verify operation performs some of the hashing in a somewhat parallel way, the perf differences are more meaningful when considering both aes+sha1 improvements together.