auth/Crypto: refactor to use OpenSSL's EVP API #23260
Conversation
Thanks for bringing this up. Cephx still has potential for improvement, and moving it forward is definitely something many people can benefit from. Just from the reviewer's POV it's also a fascinating puzzle, considering the number of contributing factors. Let's start. :-)
It's entirely valid that AES-NI can make a huge difference. The question is about the relationship between the benefit and the price we need to pay for it. Not surprisingly, "it depends" can be said about both factors.

Data size: Cephx is the critical and most prominent client of this code. At the time of the NSS → OpenSSL transition the payload was just 32 bytes (including PKCS#7 padding). To bring yet another piece to the puzzle, CVE-2018-1129 & co. bumped it up to 48. Unfortunately, our unittests are using 256 + 16.

Although I exhibit a strong bias towards in-vivo testing, let me start from those new in-vitro tools.

Before:

After, with AES-NI:

After, without AES-NI:

The results appear, well, inconclusive. What worries me most is the degrees of freedom we're omitting that way. This includes machine-, system- (I wouldn't be surprised if the version of tcmalloc and its env. variables are involved ;-) and workload-specific items. I very much like the unification between the slice- and bl-taking variants.
```cpp
memcpy(iv, CEPH_AES_IV, AES_BLOCK_LEN);
```

```cpp
// we aren't using EVP because of performance concerns. Profiling
// shows the cost is quite high. Endianness might be an issue.
```
This might not be valid anymore, as Cephx's signature size has grown. More investigation and testing needed. WIP.
BTW: I recall that ISAL Crypto was also considered during the transition. It doesn't have the required certifications, but maybe offering it as an optional implementation (with configurables and abstractions) would be possible. The technical pros: it's already in the tree, offers very good performance, and doesn't need malloc. Just thinking out loud.
@rzarzynski As stated in the commit comment and PR description, the EVP API provides encapsulated access to specialised hardware (again: be it AES-NI, Padlock, or whatever the world came up with). You can verify it yourself:

The EVP (AES-NI) implementation was over 6x faster on 16-byte blocks -- and this is on my low-end VPS that happens to have AES-NI enabled. I've seen similar results on bare-metal Xeon boxes.
I ran your tests on a more reasonable machine; here are the results: EVP:
EVP (AES-NI disabled):
OLD CODE
That's more than conclusive for me.
Piotr, I have no doubts that employing an AES-NI-augmented implementation *can* provide a huge benefit. Whether it actually does depends on several factors. To ensure we're on the same page, let me use a real-world analogy: *if you need to move somewhere close, it can be faster to go on foot than to wait for a cab*. I worry that the numbers you provided from the `openssl speed` benchmark tell only how fast the cab travels *after you got into it*. They say nothing about *the waiting time*:

```c
EVP_CIPHER_CTX_init(&ctx);
if (decrypt)
    EVP_DecryptInit_ex(&ctx, evp_cipher, NULL, key16, iv);
else
    EVP_EncryptInit_ex(&ctx, evp_cipher, NULL, key16, iv);
EVP_CIPHER_CTX_set_padding(&ctx, 0);
Time_F(START);
if (decrypt)
    for (count = 0, run = 1; COND(save_count * 4 * lengths[0] / lengths[j]); count++)
        EVP_DecryptUpdate(&ctx, buf, &outl, buf, lengths[j]);
else
    for (count = 0, run = 1; COND(save_count * 4 * lengths[0] / lengths[j]); count++)
        EVP_EncryptUpdate(&ctx, buf, &outl, buf, lengths[j]);
if (decrypt)
    EVP_DecryptFinal_ex(&ctx, buf, &outl);
else
    EVP_EncryptFinal_ex(&ctx, buf, &outl);
d = Time_F(STOP);
EVP_CIPHER_CTX_cleanup(&ctx);
```

A more detailed gist is available as well: <https://gist.github.com/rzarzynski/988106ae3b1985c64f50f05e5e9b3a58>. As you can see, both creation and deletion of the EVP context are performed once and *outside* the time measurement. When I see such an API, I expect a lot of quirkiness coming from life-cycle management, as it's often assumed to be infrequent. This stands in opposition to the use case we have in Ceph. In the version of OpenSSL I have, EVP internally uses yet another Perl-crafted ;-) assembler piece named `aesni_cbc_encrypt`. With some terrible hackery, it should be possible to use it directly and derive the costs of context management.

On my environment the cost is huge and definitely dominates the workload. At the moment I'm working to figure out the reason. There are a lot of possibilities: memory allocator performance, OpenSSL version, configured engines, etc. However, what bugs me is that the *cab waiting time* may depend on street traffic and the number of passengers calling the taxi company at the same time (e.g. `malloc`). The `unittest_crypto` in that matter looks like a city with a single passenger and a single cab. Would it be possible to provide measurements from a live cluster running e.g. an RBD workload?

Completely BTW: AFAIK there is an effort on a SeaStar-based messenger. Hopefully it'll resolve the issue with the context's thread safety. I bet your commit, enriched by context reuse, would undoubtedly be a big win. @tchaikov: do you know the current status?

UPDATE: In the discussion I focus on AES-NI. Other acceleration techniques with similar setup costs would fit as well. Bulky crypto accelerators / far-placed coprocessors are IMHO out of the question for 48-byte encryption.
On 18-07-31 06:10 AM, Radoslaw Zarzynski wrote:
EVP (AES-NI) implementation was over 6x faster on 16-byte blocks -- and
this is on my low-end VPS that happens to have AES-NI enabled. I've seen
similar results on the bare metal Xeon boxes.
Piotr, I have no doubts that employing an AES-NI-augmented implementation
*can* provide huge benefit. Whether it actually does that depends on several
factors. To ensure we're on the same page, let me use a real-world analogy:
*if you need to move somewhere close, it can be faster to go on foot instead
of waiting for a cab*. I worry the numbers you provided from |openssl speed|
benchmark tell only about /how fast the cab travels *after you got into
it*/. They say nothing about /the waiting time/.
[..] > More detailed gist is available as well
<https://gist.github.com/rzarzynski/988106ae3b1985c64f50f05e5e9b3a58>. As
you notice, both creation and deletion of EVP context are performed once and
*outside* the time measurement. When I see such API, I expect a lot of
quirkiness coming from life cycle management as it's often supposed to be
infrequent. This stays in opposition to the use case we have in Ceph.
That's a Ceph design decision problem that needs to be addressed anyway. I
already envisioned a few solutions to this "problem", but I don't want to
stack them together with this PR.
In the version of OpenSSL I have, EVP uses internally yet another
Perl-crafted ;-) assembler piece named |aesni_cbc_encrypt|. With some
terrible hackery, it should be possible to use it directly
<https://github.com/rzarzynski/ceph/commits/wip-23260> and derive the costs
of context management.
Although it's unlikely, if the function definition changes, we're in
trouble. Also, it consumes just AES-NI version, ignoring Padlock and others.
Less terrible hackery would involve keeping an EVP structure on stack, just
like it was possible with pre-1.1.x OpenSSL, but that would be a security
risk because now we don't know if and how EVP internals change.
But anyway, let's keep away from any kind of hackery, especially in
security-related code.
```
# bin/unittest_crypto --gtest_filter=AES.LoopCephx\*
...
Note: Google Test filter = AES.LoopCephx*
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from AES
[ RUN      ] AES.LoopCephx
[       OK ] AES.LoopCephx (79 ms)
[ RUN      ] AES.LoopCephxV2
[       OK ] AES.LoopCephxV2 (117 ms)
[----------] 2 tests from AES (196 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (196 ms total)
[  PASSED  ] 2 tests.
```
On my environment the cost is huge and definitely dominates the workload. At
the moment I'm working to figure out the reason.
Both context creation and destruction involves a call to memset() to
sanitize EVP structure, that's a good security practice. Depending on libc
version and processor, memset may cause you to run short on memory bandwidth.
Also, here:
https://github.com/ovh/ceph/blob/595537e1820703955184cd7cff58a5b4e6eeff5a/src/auth/Crypto.cc#L226
I have code that copies the IV from a constant into a variable, which is in turn
copied into the EVP context. I removed this code and saw no performance
difference... I believe the compiler here is smart enough to fuse array
construction with memset, so eventually there's just one memcpy involved, not two.
I'll get rid of it anyway; not every compiler may be that smart.
However, what bugs me is that the /cab waiting time/ may depend on street
traffic and the number of passengers calling taxi company the same time
(e.g. |malloc|). The |unittest_crypto| in that matter looks like a city of
single passenger and single cab. Would it be possible to provide
measurements from a live cluster running e.g. RBD workload?
That was my initial intention, but it turns out it's impossible to build
master on Xenial in a way that doesn't require upgrading a bunch of packages
on *installation* machine to ppa/testing versions and I can't do this on my
testing, bare-metal cluster. The packages provided on
http://download.ceph.com don't have this limitation because package builders
have a right mix of dependencies - @tchaikov was working on that, I don't
remember the PR number.
Completely BTW: AFAIK there is an effort on SeaStar-based messenger.
Hopefully it'll resolve the issue with context's thread safety. I bet your
commit, enriched by context reusage, would be undoubtedly a big win.
@tchaikov <https://github.com/tchaikov>: do you know what's the current status?
As far as I'm aware, msgr2 is supposed to provide secure communication. "Big
win" would be there in particular.
…--
Piotr Dałek
piotr.dalek@corp.ovh.com
https://www.ovhcloud.com
Force-pushed 595537e to eb608d9.
@rzarzynski sorry, I missed your comments. See https://github.com/ceph/ceph/tree/master/src/crimson/net . And I think it'd be a good chance for reusing the contexts without worrying about the racing issue we'd otherwise be suffering.
@branch-predictor: is there any progress on the in-OSD performance testing? Are the blockers gone now? The results would be really helpful. Our benchmarks -- in contrast to the OSD -- don't link with
Right now I'm working on a new test/benchmark hardware setup based on a bunch of current-gen CPUs, NVMes, and plenty of sufficiently powerful client hosts to flood the cluster. Hang on.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Force-pushed eb608d9 to ec05dce.
Force-pushed ec05dce to 376a647.
OpenSSL's EVP API encapsulates different encryption mechanisms and engines, including AES-NI, ARM NEON, VIA Padlock and possibly other hardware crypto accelerators. Considering that AES optimized with AES-NI alone is around 6x faster than the directly-callable implementation, there's no reason not to use it. Signed-off-by: Piotr Dałek <piotr.dalek@corp.ovh.com>
Force-pushed 376a647 to 377be65.
Closing: with Messenger2 and the new signing scheme, this has lost its relevance.
All, I'd like to bring this one back up. Re-basing this PR onto the latest stable release in conjunction with PR#32675, I was able to get Ceph working reliably in a FIPS-validated environment without any special workarounds. As far as I can tell, both PR#23260 and PR#32675 are being treated solely as performance enhancements, but they also resolve an issue with low-level crypto functions being forbidden in FIPS mode. Thoughts?
I would also be interested in this PR for FIPS purposes. Is there a reason for not including it?
OpenSSL's EVP API encapsulates different encryption mechanisms and engines, including AES-NI, ARM NEON, VIA Padlock and possibly other hardware crypto accelerators. Considering that AES encryption optimized with AES-NI alone is around 6x faster than the directly-callable implementation, there's no reason not to use it.

Even with the need to create (malloc) and free EVP contexts for each encryption, everything shows performance improvements, for example the unittests.

Before:

After:

After with AES-NI disabled (`OPENSSL_ia32cap="~0x200000200000000" bin/unittest_crypto`):

Signed-off-by: Piotr Dałek <piotr.dalek@corp.ovh.com>