New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-43286][SQL] Updates aes_encrypt CBC mode to generate random IVs #40969
Conversation
d224b9b
to
9075e57
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you fix the examples:
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
Lines 336 to 337 in dabd771
> SELECT base64(_FUNC_('Apache Spark', '1234567890abcdef', 'CBC', 'DEFAULT')); | |
U2FsdGVkX18JQ84pfRUwonUrFzpWQ46vKu4+MkJVFGM= |
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
Lines 402 to 403 in dabd771
> SELECT _FUNC_(unbase64('U2FsdGVkX18JQ84pfRUwonUrFzpWQ46vKu4+MkJVFGM='), '1234567890abcdef', 'CBC'); | |
Apache Spark |
sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionImplUtils.java
Show resolved
Hide resolved
sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionImplUtils.java
Outdated
Show resolved
Hide resolved
...core/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionImplUtilsSuite.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sweisdb Could you remove the error class AES_SALTED_MAGIC
from error-classes.json
, and related function aesInvalidSalt()
+ the test "INVALID_PARAMETER_VALUE.AES_SALTED_MAGIC: AES decrypt failure - invalid salt"
Updated to address this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Waiting for CI.
@sweisdb Could you re-trigger GitHub actions, please. Highly likely the failure of the test org.apache.spark.storage.BlockManagerBasicStrategyReplicationSuite is not related to your changes but just in case let's re-run all tests. |
@MaxGekk Sure, that worked and it's all green now. |
+1, LGTM. Merging to master. |
@sweisdb Congratulations with your first contribution to Apache Spark! |
### What changes were proposed in this pull request? The current implementation of AES-CBC mode called via `aes_encrypt` and `aes_decrypt` uses a key derivation function (KDF) based on OpenSSL's [EVP_BytesToKey](https://www.openssl.org/docs/man3.0/man3/EVP_BytesToKey.html). This is intended for generating keys based on passwords and OpenSSL's documents discourage its use: "Newer applications should use a more modern algorithm". `aes_encrypt` and `aes_decrypt` should use the key directly in CBC mode, as it does for both GCM and ECB mode. The output should then be the initialization vector (IV) prepended to the ciphertext – as is done with GCM mode: `[16-byte randomly generated IV | AES-CBC encrypted ciphertext]` ### Why are the changes needed? We want to have the ciphertext output similar across different modes. OpenSSL's EVP_BytesToKey is effectively deprecated and their own documentation says not to use it. Instead, CBC mode will generate a random vector. ### Does this PR introduce _any_ user-facing change? AES-CBC output generated by the previous format will be incompatible with this change. That change was recently landed and we want to land this before CBC mode is used in practice. ### How was this patch tested? A new unit test in `DataFrameFunctionsSuite` was added to test both GCM and CBC modes. Also, a new standalone unit test suite was added in `ExpressionImplUtilsSuite` to test all the modes and various key lengths. ``` build/sbt "sql/test:testOnly org.apache.spark.sql.DataFrameFunctionsSuite" build/sbt "sql/test:testOnly org.apache.spark.sql.catalyst.expressions.ExpressionImplUtilsSuite" ``` CBC values can be verified with `openssl enc` using the following command: ``` echo -n "[INPUT]" | openssl enc -a -e -aes-256-cbc -iv [HEX IV] -K [HEX KEY] echo -n "Spark" | openssl enc -a -e -aes-256-cbc -iv f8c832cc9c61bac6151960a58e4edf86 -K 6162636465666768696a6b6c6d6e6f7031323334353637384142434445464748 ``` Closes apache#40969 from sweisdb/SPARK-43286. Authored-by: Steve Weis <steve.weis@databricks.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>
What changes were proposed in this pull request?
The current implementation of AES-CBC mode called via
aes_encrypt
andaes_decrypt
uses a key derivation function (KDF) based on OpenSSL's EVP_BytesToKey. This is intended for generating keys based on passwords and OpenSSL's documents discourage its use: "Newer applications should use a more modern algorithm".aes_encrypt
andaes_decrypt
should use the key directly in CBC mode, as it does for both GCM and ECB mode. The output should then be the initialization vector (IV) prepended to the ciphertext – as is done with GCM mode:[16-byte randomly generated IV | AES-CBC encrypted ciphertext]
Why are the changes needed?
We want to have the ciphertext output similar across different modes. OpenSSL's EVP_BytesToKey is effectively deprecated and their own documentation says not to use it. Instead, CBC mode will generate a random vector.
Does this PR introduce any user-facing change?
AES-CBC output generated by the previous format will be incompatible with this change. That change was recently landed and we want to land this before CBC mode is used in practice.
How was this patch tested?
A new unit test in
DataFrameFunctionsSuite
was added to test both GCM and CBC modes. Also, a new standalone unit test suite was added inExpressionImplUtilsSuite
to test all the modes and various key lengths.CBC values can be verified with
openssl enc
using the following command: