New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-37591][SQL] Support the GCM mode by aes_encrypt()
/aes_decrypt()
#34852
Conversation
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #146044 has finished for PR 34852 at commit
|
aes_encrypt()
/aes_decrypt()
aes_encrypt()
/aes_decrypt()
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #146063 has finished for PR 34852 at commit
|
retest this please. |
""", | ||
arguments = """ | ||
Arguments: | ||
* expr - The binary value to encrypt. | ||
* key - The passphrase to use to encrypt the data. | ||
* mode - Specifies which block cipher mode should be used to encrypt messages. | ||
Supported modes: ECB. | ||
Supported modes: ECB, GCM. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Should we consistently use Supported
or Valid
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
assert(encrypted.filter($"enc" === $"input").isEmpty) | ||
val result = encrypted.selectExpr( | ||
"CAST(aes_decrypt(enc, key, 'GCM', 'NONE') AS STRING) AS res", "input") | ||
assert(!result.filter($"res" === $"input").isEmpty) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To test all the records are decrypted correctly, should we do assert(result.filter($"res" !== $"input").isEmpty)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other AES tests are in DataFrameFunctionsSuite
, so should we move this test to the same place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather move other tests for misc functions from DataFrameFunctionsSuite
to the dedicated MiscFunctionsSuite
. I was thinking of moving all AES related tests here but decided to don't do unrelated changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DataFrameFunctionsSuite
has already had > 3800 lines. Don't think it makes sense to place new misc test there if there is a special test suite for such functions.
Kubernetes integration test starting |
Kubernetes integration test status failure |
* padding - Specifies how to pad messages whose length is not a multiple of the block size. | ||
Valid values: PKCS. | ||
Valid values: PKCS, NONE. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we have a smarter way to decide the default padding? e.g. if the mode is GCM
, the default padding is NONE
.
We can also require the mode and padding parameter to be constant, to simplify the implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can also require the mode and padding parameter to be constant, to simplify the implementation.
I wouldn't do that because it is unnecessary restriction from my point of view. I could image an use case when data gathered from different sources and encrypted slightly differently (using different padding/modes), and process in one places. What you propose requires to somehow split input data by dataframes (or selects + unions) and process them separately. Don't see any reasons to bring such pains to users.
Shall we have a smarter way to decide the default padding? e.g. if the mode is GCM, the default padding is NONE.
I think of introducing the DEFAULT
value for padding which the AES implementation substitutes by concrete value (NONE
or PKCS
) depending on the mode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think of introducing the DEFAULT value for padding
Yea this also works, and we need to document the default padding for each mode.
Test build #146112 has finished for PR 34852 at commit
|
@@ -386,7 +386,7 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSparkSession { | |||
} | |||
|
|||
// Unsupported AES mode and padding in decrypt | |||
checkUnsupportedMode(df2.selectExpr(s"aes_decrypt(value16, '$key16', 'GSM')")) | |||
checkUnsupportedMode(df2.selectExpr(s"aes_decrypt(value16, '$key16', 'GSM', 'PKCS')")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we test GCM
?
Kubernetes integration test unable to build dist. exiting with code: 1 |
Test build #146128 has finished for PR 34852 at commit
|
Kubernetes integration test starting |
Kubernetes integration test status failure |
Seems
I am going to ignore the issue in R, and merge this PR since other GAs passed successfully. |
Merging to master. Thank you, @sarutak @gengliangwang and @cloud-fan for review. |
Test build #146131 has finished for PR 34852 at commit
|
What changes were proposed in this pull request?
In the PR, I propose new AES mode for the
aes_encrypt()
/aes_decrypt()
functions -GCM
(Galois/Counter Mode) w/o padding. Theaes_encrypt()
function returns a binary value which consists of the following fields:aes_encrypt()
call usingjava.security.SecureRandom
. The length of IV is 12 bytes.The
aes_decrypt()
functions assumes that its input has the fields as showed above.For example:
Why are the changes needed?
To achieve feature parity with other systems/frameworks, and make the migration process from them to Spark SQL easier. For example, the
GCM
mode is supported by:Does this PR introduce any user-facing change?
No. The AES functions haven't been released yet.
How was this patch tested?
By running new checks: