
auth/Crypto: refactor to use OpenSSL's EVP API #23260

Closed
wants to merge 1 commit into from

Conversation

branch-predictor
Contributor

OpenSSL's EVP API encapsulates different encryption mechanisms and engines, including AES-NI, ARM NEON, VIA Padlock and possibly other hardware crypto accelerators. Considering that AES encryption optimized with AES-NI alone is around 6x faster than the directly-callable implementation, there's no reason not to use it.

Even with the need to create (malloc) and free an EVP context for each encryption, everything points to a performance improvement, for example the unittests:
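For reference, here is a minimal sketch of the per-call EVP flow described above (assuming OpenSSL 1.1+, where contexts are heap-allocated; the helper name and error handling are illustrative, not the actual PR code):

#include <openssl/evp.h>

// Encrypt in_len bytes with AES-128-CBC; key and iv are 16 bytes each.
// Padding is disabled, so in_len must be a multiple of the 16-byte block size.
int aes128_cbc_encrypt_evp(const unsigned char key[16],
                           const unsigned char iv[16],
                           const unsigned char *in, int in_len,
                           unsigned char *out)
{
  EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();        // allocated per call
  if (!ctx)
    return -1;
  int outl = 0, finl = 0, ret = -1;
  if (EVP_EncryptInit_ex(ctx, EVP_aes_128_cbc(), nullptr, key, iv) == 1 &&
      EVP_CIPHER_CTX_set_padding(ctx, 0) == 1 &&     // caller handles padding
      EVP_EncryptUpdate(ctx, out, &outl, in, in_len) == 1 &&
      EVP_EncryptFinal_ex(ctx, out + outl, &finl) == 1)
    ret = outl + finl;
  EVP_CIPHER_CTX_free(ctx);                          // freed per call
  return ret;
}

EVP_EncryptInit_ex dispatches to whatever implementation the library considers fastest on the host (AES-NI, VPAES, plain C), which is where the speedup comes from.
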
Before:

[==========] Running 8 tests from 1 test case.           
[----------] Global test environment set-up.             
[----------] 8 tests from AES                            
[ RUN      ] AES.ValidateLegacy                          
[       OK ] AES.ValidateLegacy (0 ms)                   
[ RUN      ] AES.ValidateSecret                          
[       OK ] AES.ValidateSecret (0 ms)                   
[ RUN      ] AES.Encrypt                                 
[       OK ] AES.Encrypt (0 ms)                          
[ RUN      ] AES.EncryptNoBl                             
[       OK ] AES.EncryptNoBl (0 ms)                      
[ RUN      ] AES.Decrypt                                 
[       OK ] AES.Decrypt (0 ms)                          
[ RUN      ] AES.DecryptNoBl                             
[       OK ] AES.DecryptNoBl (0 ms)                      
[ RUN      ] AES.Loop                                    
[       OK ] AES.Loop (94 ms)                            
[ RUN      ] AES.LoopKey                                 
100000 encoded in 0.156125                               
[       OK ] AES.LoopKey (156 ms)                        
[----------] 8 tests from AES (250 ms total)             
                                                         
[----------] Global test environment tear-down           
[==========] 8 tests from 1 test case ran. (250 ms total)
[  PASSED  ] 8 tests.

After:

[==========] Running 8 tests from 1 test case.           
[----------] Global test environment set-up.             
[----------] 8 tests from AES                            
[ RUN      ] AES.ValidateLegacy                          
[       OK ] AES.ValidateLegacy (0 ms)                   
[ RUN      ] AES.ValidateSecret                          
[       OK ] AES.ValidateSecret (0 ms)                   
[ RUN      ] AES.Encrypt                                 
[       OK ] AES.Encrypt (0 ms)                          
[ RUN      ] AES.EncryptNoBl                             
[       OK ] AES.EncryptNoBl (0 ms)                      
[ RUN      ] AES.Decrypt                                 
[       OK ] AES.Decrypt (1 ms)                          
[ RUN      ] AES.DecryptNoBl                             
[       OK ] AES.DecryptNoBl (0 ms)                      
[ RUN      ] AES.Loop                                    
[       OK ] AES.Loop (32 ms)                            
[ RUN      ] AES.LoopKey                                 
100000 encoded in 0.072338                               
[       OK ] AES.LoopKey (72 ms)                         
[----------] 8 tests from AES (105 ms total)             
                                                         
[----------] Global test environment tear-down           
[==========] 8 tests from 1 test case ran. (105 ms total)
[  PASSED  ] 8 tests.                                    

After, with AES-NI disabled (OPENSSL_ia32cap="~0x200000200000000" bin/unittest_crypto):

[==========] Running 8 tests from 1 test case.           
[----------] Global test environment set-up.             
[----------] 8 tests from AES                            
[ RUN      ] AES.ValidateLegacy                          
[       OK ] AES.ValidateLegacy (1 ms)                   
[ RUN      ] AES.ValidateSecret                          
[       OK ] AES.ValidateSecret (0 ms)                   
[ RUN      ] AES.Encrypt                                 
[       OK ] AES.Encrypt (0 ms)                          
[ RUN      ] AES.EncryptNoBl                             
[       OK ] AES.EncryptNoBl (0 ms)                      
[ RUN      ] AES.Decrypt                                 
[       OK ] AES.Decrypt (0 ms)                          
[ RUN      ] AES.DecryptNoBl                             
[       OK ] AES.DecryptNoBl (0 ms)                      
[ RUN      ] AES.Loop                                    
[       OK ] AES.Loop (50 ms)                            
[ RUN      ] AES.LoopKey                                 
100000 encoded in 0.099907                               
[       OK ] AES.LoopKey (100 ms)                        
[----------] 8 tests from AES (151 ms total)             
                                                         
[----------] Global test environment tear-down           
[==========] 8 tests from 1 test case ran. (151 ms total)
[  PASSED  ] 8 tests.      

Signed-off-by: Piotr Dałek <piotr.dalek@corp.ovh.com>

@rzarzynski
Contributor

Thanks for bringing this up. Cephx still has potential for improvement, and moving it forward is definitely something many people can benefit from. From the reviewer's POV it's also a fascinating puzzle, considering the number of contributing factors. Let's start. :-)

Considering that AES encryption optimized with AES-NI alone is around 6x faster than the directly-callable implementation, there's no reason not to use it.

It's entirely valid that AES-NI can make a huge difference. The question is about the relationship between the benefit and the price we need to pay for it. Not surprisingly, "it depends" applies to both factors.

Data size. As Cephx is critical and the most prominent client of CryptoKey I'm aware of, I focus entirely on it. If anybody knows of another one, please don't hesitate to speak up and start optimizing.

At the time of the NSS-to-OpenSSL transition it was just 32 bytes (including PKCS#7 padding). To bring yet another piece to the puzzle, CVE-2018-1129 & co. bumped it up to 48. Unfortunately, our unittests are using 256+16 (AES.Loop) and 128+16 (AES.LoopKey). I tried to address that in PR #23291 by introducing more test cases.
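Just to spell out where those numbers come from: PKCS#7 always adds between 1 and 16 padding bytes, so e.g. a 24-byte payload pads to 32 bytes and a 40-byte one to 48. A tiny illustrative sketch (the payload sizes in the asserts are examples only, not cephx's exact figures):

constexpr size_t AES_BLOCK = 16;

// PKCS#7 adds at least one byte, so block-aligned input grows by a full block.
constexpr size_t pkcs7_padded_size(size_t n) {
  return (n / AES_BLOCK + 1) * AES_BLOCK;
}

static_assert(pkcs7_padded_size(24) == 32, "example payload size only");
static_assert(pkcs7_padded_size(40) == 48, "example payload size only");
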

Although I have a strong bias towards in-vivo testing, let me start with those new in-vitro tools.

Before

# bin/unittest_crypto --gtest_filter=AES.LoopCephx\*
...
Note: Google Test filter = AES.LoopCephx*
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from AES
[ RUN      ] AES.LoopCephx
[       OK ] AES.LoopCephx (426 ms)
[ RUN      ] AES.LoopCephxV2
[       OK ] AES.LoopCephxV2 (603 ms)
[----------] 2 tests from AES (1029 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (1029 ms total)
[  PASSED  ] 2 tests.

After

with AES-NI

# bin/unittest_crypto --gtest_filter=AES.LoopCephx\*
...
Note: Google Test filter = AES.LoopCephx*
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from AES
[ RUN      ] AES.LoopCephx
[       OK ] AES.LoopCephx (472 ms)
[ RUN      ] AES.LoopCephxV2
[       OK ] AES.LoopCephxV2 (508 ms)
[----------] 2 tests from AES (980 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (980 ms total)
[  PASSED  ] 2 tests.

without AES-NI

# OPENSSL_ia32cap="~0x200000200000000" bin/unittest_crypto --gtest_filter=AES.LoopCephx\*
...
Note: Google Test filter = AES.LoopCephx*
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from AES
[ RUN      ] AES.LoopCephx
[       OK ] AES.LoopCephx (632 ms)
[ RUN      ] AES.LoopCephxV2
[       OK ] AES.LoopCephxV2 (704 ms)
[----------] 2 tests from AES (1336 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (1336 ms total)
[  PASSED  ] 2 tests.

The results appear, well, inconclusive. What worries me most is the number of degrees of freedom we're omitting this way. This includes machine-, system- (I wouldn't be surprised if the version of tcmalloc and its env. variables are involved ;-) and workload-specific items. For instance, the loop { malloc(); free(); } pattern we have in the tests presents nice opportunities for optimization on the memory allocator's side. In-vivo profiling is not easy as AES_cbc_encrypt is highly optimized, Perl-crafted assembly. I will keep digging.

I very much like the unification between the slice- and bl-taking variants.

memcpy(iv, CEPH_AES_IV, AES_BLOCK_LEN);

// we aren't using EVP because of performance concerns. Profiling
// shows the cost is quite high. Endianness might be an issue.

This might not be valid anymore as Cephx's signature size has grown. More investigation and testing needed. WIP.

BTW: I recall that ISAL Crypto was also considered during the transition. It doesn't have the required certifications, but maybe offering it as an optional implementation (with configurables and abstractions) would be possible. The technical pros: it's already in the tree, offers very good performance and doesn't need malloc. Just thinking out loud.

@branch-predictor
Contributor Author

@rzarzynski As stated in the commit message and PR description, the EVP API provides encapsulated access to specialised hardware (again - be it AES-NI, Padlock, or whatever the world came up with). You can verify it with openssl speed:

[branch@predictor ~]$ openssl speed aes-128-cbc
Doing aes-128 cbc for 3s on 16 size blocks: 14378122 aes-128 cbc's in 2.97s
Doing aes-128 cbc for 3s on 64 size blocks: 3973446 aes-128 cbc's in 2.97s
Doing aes-128 cbc for 3s on 256 size blocks: 1025526 aes-128 cbc's in 2.98s
Doing aes-128 cbc for 3s on 1024 size blocks: 261927 aes-128 cbc's in 2.98s
Doing aes-128 cbc for 3s on 8192 size blocks: 32869 aes-128 cbc's in 2.97s
OpenSSL 1.0.1e-fips 11 Feb 2013
built on: Wed Mar 22 21:37:40 UTC 2017
options:bn(64,32) md2(int) rc4(8x,mmx) des(ptr,risc1,16,long) aes(partial) idea(int) blowfish(idx)
compiler: gcc -fPIC -DOPENSSL_PIC -DZLIB -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DKRB5_MIT -DL_ENDIAN -DTERMIO -Wall -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m32 -march=i686 -mtune=atom -fasynchronous-unwind-tables -Wa,--noexecstack -DPURIFY -DOPENSSL_BN_ASM_PART_WORDS -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DRMD160_ASM -DAES_ASM -DVPAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128 cbc      77457.90k    85623.08k    88098.88k    90004.45k    90660.89k
[branch@predictor ~]$ openssl speed -evp aes-128-cbc
Doing aes-128-cbc for 3s on 16 size blocks: 89705436 aes-128-cbc's in 2.96s
Doing aes-128-cbc for 3s on 64 size blocks: 21338855 aes-128-cbc's in 2.97s
Doing aes-128-cbc for 3s on 256 size blocks: 5967082 aes-128-cbc's in 2.97s
Doing aes-128-cbc for 3s on 1024 size blocks: 1540793 aes-128-cbc's in 2.97s
Doing aes-128-cbc for 3s on 8192 size blocks: 193836 aes-128-cbc's in 2.96s
OpenSSL 1.0.1e-fips 11 Feb 2013
built on: Wed Mar 22 21:37:40 UTC 2017
options:bn(64,32) md2(int) rc4(8x,mmx) des(ptr,risc1,16,long) aes(partial) idea(int) blowfish(idx)
compiler: gcc -fPIC -DOPENSSL_PIC -DZLIB -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DKRB5_MIT -DL_ENDIAN -DTERMIO -Wall -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m32 -march=i686 -mtune=atom -fasynchronous-unwind-tables -Wa,--noexecstack -DPURIFY -DOPENSSL_BN_ASM_PART_WORDS -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DRMD160_ASM -DAES_ASM -DVPAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     484894.25k   459827.18k   514334.34k   531236.37k   536454.23k

The EVP (AES-NI) implementation was over 6x faster on 16-byte blocks -- and this is on my low-end VPS that happens to have AES-NI enabled. I've seen similar results on bare-metal Xeon boxes.

By using AES_cbc_encrypt directly, you're forcing the use of the "highly optimized, Perl-crafted assembly" version that can't make use of any available crypto accelerators, including AES-NI. Depending on the package maintainer, AES_cbc_encrypt may not even support SSE2, degrading performance even further.
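For comparison, this is roughly what the direct, non-EVP path looks like (a sketch of the legacy OpenSSL API, not Ceph's exact code); no engine or accelerator dispatch happens here:

#include <openssl/aes.h>

// Legacy directly-callable API: always the generic table/assembly implementation.
void legacy_aes128_cbc_encrypt(const unsigned char key16[16],
                               unsigned char iv[16],        // updated in place
                               const unsigned char *in, size_t len,
                               unsigned char *out)
{
  AES_KEY aes_key;
  AES_set_encrypt_key(key16, 128, &aes_key);                // key schedule
  AES_cbc_encrypt(in, out, len, &aes_key, iv, AES_ENCRYPT); // software-only path
}
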

@branch-predictor
Contributor Author

I ran your tests on a more reasonable machine, and here are the results:

EVP:

$ bin/unittest_crypto --gtest_filter=AES.LoopCephx\*                          
2018-07-30 08:08:30.309 7f0987d83900 -1 WARNING: all dangerous and experimental features are enabled.
2018-07-30 08:08:30.309 7f0987d83900 -1 WARNING: all dangerous and experimental features are enabled.
2018-07-30 08:08:30.352 7f0987d83900 -1 WARNING: all dangerous and experimental features are enabled.
Note: Google Test filter = AES.LoopCephx*                                                            
[==========] Running 2 tests from 1 test case.                                                       
[----------] Global test environment set-up.                                                         
[----------] 2 tests from AES                                                                        
[ RUN      ] AES.LoopCephx                                                                           
[       OK ] AES.LoopCephx (238 ms)                                                                  
[ RUN      ] AES.LoopCephxV2                                                                         
[       OK ] AES.LoopCephxV2 (231 ms)                                                                
[----------] 2 tests from AES (469 ms total)                                                         
                                                                                                     
[----------] Global test environment tear-down                                                       
[==========] 2 tests from 1 test case ran. (469 ms total)                                            
[  PASSED  ] 2 tests.  

EVP (AES-NI disabled):

$ OPENSSL_ia32cap="~0x200000200000000" bin/unittest_crypto --gtest_filter=AES.LoopCephx\* 
2018-07-30 08:10:30.573 7f4f23452900 -1 WARNING: all dangerous and experimental features are enabled.            
2018-07-30 08:10:30.574 7f4f23452900 -1 WARNING: all dangerous and experimental features are enabled.            
2018-07-30 08:10:30.615 7f4f23452900 -1 WARNING: all dangerous and experimental features are enabled.            
Note: Google Test filter = AES.LoopCephx*                                                                        
[==========] Running 2 tests from 1 test case.                                                                   
[----------] Global test environment set-up.                                                                     
[----------] 2 tests from AES                                                                                    
[ RUN      ] AES.LoopCephx                                                                                       
[       OK ] AES.LoopCephx (352 ms)                                                                              
[ RUN      ] AES.LoopCephxV2                                                                                     
[       OK ] AES.LoopCephxV2 (376 ms)                                                                            
[----------] 2 tests from AES (728 ms total)                                                                     
                                                                                                                 
[----------] Global test environment tear-down                                                                   
[==========] 2 tests from 1 test case ran. (728 ms total)                                                        
[  PASSED  ] 2 tests.  

OLD CODE

$ bin/unittest_crypto --gtest_filter=AES.LoopCephx\*                          
2018-07-30 08:19:00.916 7f6a0c839900 -1 WARNING: all dangerous and experimental features are enabled.
2018-07-30 08:19:00.916 7f6a0c839900 -1 WARNING: all dangerous and experimental features are enabled.
2018-07-30 08:19:00.959 7f6a0c839900 -1 WARNING: all dangerous and experimental features are enabled.
Note: Google Test filter = AES.LoopCephx*                                                            
[==========] Running 2 tests from 1 test case.                                                       
[----------] Global test environment set-up.                                                         
[----------] 2 tests from AES                                                                        
[ RUN      ] AES.LoopCephx                                                                           
[       OK ] AES.LoopCephx (314 ms)                                                                  
[ RUN      ] AES.LoopCephxV2                                                                         
[       OK ] AES.LoopCephxV2 (398 ms)                                                                
[----------] 2 tests from AES (712 ms total)                                                         
                                                                                                     
[----------] Global test environment tear-down                                                       
[==========] 2 tests from 1 test case ran. (712 ms total)                                            
[  PASSED  ] 2 tests.

That's more than conclusive for me.

@rzarzynski
Contributor

rzarzynski commented Jul 31, 2018

EVP (AES-NI) implementation was over 6x faster on 16-byte blocks -- and this is on my low-end VPS that happens to have AES-NI enabled. I've seen similar results on the bare metal Xeon boxes.

Piotr, I have no doubt that employing an AES-NI-augmented implementation can provide a huge benefit. Whether it actually does depends on several factors. To ensure we're on the same page, let me use a real-world analogy: if you need to go somewhere close, it can be faster to go on foot than to wait for a cab. I worry that the numbers you provided from the openssl speed benchmark only tell us how fast the cab travels once you're inside. They say nothing about the waiting time.

                                EVP_CIPHER_CTX_init(&ctx);
                                if(decrypt)
                                        EVP_DecryptInit_ex(&ctx,evp_cipher,NULL,key16,iv);
                                else
                                        EVP_EncryptInit_ex(&ctx,evp_cipher,NULL,key16,iv);
                                EVP_CIPHER_CTX_set_padding(&ctx, 0);

                                Time_F(START);
                                if(decrypt)
                                        for (count=0,run=1; COND(save_count*4*lengths[0]/lengths[j]); count++)
                                                EVP_DecryptUpdate(&ctx,buf,&outl,buf,lengths[j]);
                                else
                                        for (count=0,run=1; COND(save_count*4*lengths[0]/lengths[j]); count++)
                                                EVP_EncryptUpdate(&ctx,buf,&outl,buf,lengths[j]);
                                if(decrypt)
                                        EVP_DecryptFinal_ex(&ctx,buf,&outl);
                                else
                                        EVP_EncryptFinal_ex(&ctx,buf,&outl);
                                d=Time_F(STOP);
                                EVP_CIPHER_CTX_cleanup(&ctx);
                                }

A more detailed gist is available as well. As you can see, both creation and deletion of the EVP context are performed once, outside the time measurement. When I see such an API, I expect a lot of the cost to sit in life-cycle management, as it's often assumed to be infrequent. This stands in opposition to the use case we have in Ceph.

In the version of OpenSSL I have, EVP internally uses yet another Perl-crafted ;-) assembly piece named aesni_cbc_encrypt. With some terrible hackery, it's possible to call it directly and thus derive the cost of context management.

# bin/unittest_crypto --gtest_filter=AES.LoopCephx\*
...
Note: Google Test filter = AES.LoopCephx*
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from AES
[ RUN      ] AES.LoopCephx
[       OK ] AES.LoopCephx (79 ms)
[ RUN      ] AES.LoopCephxV2
[       OK ] AES.LoopCephxV2 (117 ms)
[----------] 2 tests from AES (196 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (196 ms total)
[  PASSED  ] 2 tests.

In my environment the cost is huge and definitely dominates the workload. At the moment I'm working on figuring out the reason. There are a lot of possibilities: memory allocator performance, OpenSSL version, configured engines, etc.

However, what bugs me is that the cab waiting time may depend on street traffic and on the number of passengers calling the taxi company at the same time (i.e. malloc usage patterns). In that respect, unittest_crypto looks like a city with a single passenger and a single cab. Would it be possible to provide measurements from a live cluster running e.g. an RBD workload?

Completely BTW: AFAIK there is an effort towards a SeaStar-based messenger. Hopefully it'll resolve the issue with the context's thread safety. I bet your commit, enriched with context reuse, would undoubtedly be a big win. @tchaikov: do you know what the current status is?
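To illustrate what "enriched with context reuse" could mean in practice, here is a rough sketch assuming one long-lived context per thread or connection (illustrative names, not this PR's code):

#include <openssl/evp.h>

struct ReusableAESContext {
  EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();        // allocated once, reused
  ~ReusableAESContext() { EVP_CIPHER_CTX_free(ctx); }

  int encrypt(const unsigned char key[16], const unsigned char iv[16],
              const unsigned char *in, int in_len, unsigned char *out) {
    int outl = 0, finl = 0;
    // Reusing the context avoids the per-call allocation and free;
    // re-initializing with the key/IV is all that remains.
    if (EVP_EncryptInit_ex(ctx, EVP_aes_128_cbc(), nullptr, key, iv) != 1 ||
        EVP_CIPHER_CTX_set_padding(ctx, 0) != 1 ||
        EVP_EncryptUpdate(ctx, out, &outl, in, in_len) != 1 ||
        EVP_EncryptFinal_ex(ctx, out + outl, &finl) != 1)
      return -1;
    return outl + finl;
  }
};
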

UPDATE: in the discussion I focus on AES-NI. Other acceleration techniques with similar setup costs would fit as well. Bulky crypto accelerators/far-placed coprocessors are IMHO out of the question for 48-byte encryption.

@branch-predictor
Contributor Author

branch-predictor commented Jul 31, 2018 via email

@tchaikov
Contributor

tchaikov commented Aug 8, 2018

@tchaikov: do you know what's the current status?

@rzarzynski sorry, I missed your comments. See https://github.com/ceph/ceph/tree/master/src/crimson/net. And I think it'd be a good chance for reusing the contexts without worrying about the racing issue we'd otherwise be suffering from.

@rzarzynski
Contributor

That was my initial intention, but it turns out it's impossible to build master on Xenial in a way that doesn't require upgrading a bunch of packages on the installation machine to ppa/testing versions, and I can't do that on my testing, bare-metal cluster. The packages provided on http://download.ceph.com don't have this limitation because the package builders have the right mix of dependencies - @tchaikov was working on that, I don't remember the PR number.

@branch-predictor: is there any progress on the in-OSD performance testing? Are the blockers gone now?

The results would be really helpful. Our benchmarks - in contrast to the OSD - don't link with libtcmalloc. As the cost of EVP context setup is high, the difference in memory allocator might be an important factor.

@branch-predictor
Contributor Author

Right now I'm working on a new test/benchmark hardware setup based on a bunch of current-gen CPUs, NVMes and plenty of sufficiently powerful client hosts to flood the cluster. Hang on.

@stale

stale bot commented Dec 17, 2018

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
If you are a maintainer or core committer, please follow-up on this issue to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

OpenSSL's EVP API encapsulates different encryption mechanisms and
engines, including AES-NI, ARM NEON, VIA Padlock and possibly other
hardware crypto accelerators. Considering that AES optimized with
AES-NI alone is around 6x faster than the directly-callable
implementation, there's no reason not to use it.

Signed-off-by: Piotr Dałek <piotr.dalek@corp.ovh.com>
@branch-predictor
Contributor Author

Closing, as with Messenger2 and the new signing scheme this has lost its relevance.

@branch-predictor branch-predictor deleted the bp-openssl-evp branch May 27, 2019 09:05
@kvanals

kvanals commented Dec 7, 2020

All, I'd like to bring this one back up -- by rebasing this PR onto the latest stable release in conjunction with PR#32675, I was able to get Ceph working reliably in a FIPS-validated environment without any special workarounds. As far as I can tell, both PR#23260 and PR#32675 are being treated solely as performance enhancements, but they also resolve an issue with low-level crypto functions being forbidden in FIPS mode.

Thoughts?

@edevil

edevil commented Jun 22, 2021

I would also be interested in this PR for FIPS purposes. Is there a reason for not including it?
