-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-12567][SQL] Add aes_{encrypt,decrypt} UDFs #10527
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #48484 has finished for PR 10527 at commit
|
|
Jenkins, retest this please. |
|
Test build #48511 has finished for PR 10527 at commit
|
27693b4 to
0558bf8
Compare
|
Test build #48748 has finished for PR 10527 at commit
|
|
Test build #48744 has finished for PR 10527 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to with Serializable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
|
Test build #48890 has finished for PR 10527 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So when exception happens, codegen version will throw it, and the interpreted version will return null?
|
Test build #48943 has finished for PR 10527 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why does only this test need extra installation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because Oracle JRE/JDK supports AES-128 out of the box. AES-192 and AES-256 are supported if Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy Files are installed. In this test case, key size is 256 bits. We should install JCE not only with this test also any test with 192/256 bits key. I commented out this test by default in case this test fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ping @cloud-fan?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it might be okay, as long as the failure is clear and it doesn't break other things when its not installed. What happens when its not installed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@marmbrus When Java Cryptography Extension (JCE) not installed under oracle jdk, it will return null and will not break other things. Plus, we don't need to worry about installing this extension under openjdk, it works fine with openjdk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan @marmbrus ping?
|
retest this please. |
|
Test build #50212 has finished for PR 10527 at commit
|
|
retest this please |
9476822 to
bc7aa2c
Compare
|
Test build #50243 has finished for PR 10527 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should follow the style of other classes in this file, i.e. use double quotation marks and \n
|
@cloud-fan Okay, addressed comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not quite sure how we should deal with errors, return null or throw exception? cc @marmbrus
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked some similar expressions, Encode/Decode will throw exception if its charset string parameter is invalid, Sha2 will return null if its bit length parameter is invalid.
For AesEncrypt, what we should do if its secret key parameter is invalid? @vectorijk are there any other errors will be thrown during execution?
cc @rxin , what's the general rule for this kind of stuff?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we wrap everything in an analysis exception during expression construction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not during construction, but execution. The parameters may be non-foldable and we can only know it's valid or not during execution. BTW the encrypting process may fail and throw exception, should we silently ignore it and return null, or rethrow the exception and fail the execution?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say in this case we should rethrow the exception to fail everything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan What I have implemented was trying to follow the way Sha2 did.
I think if secret key is invalid. It will return null just like the way Sha2 did. There is no other errors should be thrown during execution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, you followed Sha2, but Sha2 may also be wrong. In this case we think rethrow exception makes more sense, and we can also fix Sha2 later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan okay, i will try to rethrow exception as Encode/Decode does.
|
Test build #50456 has finished for PR 10527 at commit
|
|
Test build #50457 has finished for PR 10527 at commit
|
|
Test build #51235 has finished for PR 10527 at commit
|
|
|
||
| // key length (80 bits) is not one of the permitted values (128, 192 or 256 bits) | ||
| intercept[java.security.InvalidKeyException] { | ||
| evaluate(AesDecrypt(UnBase64(Literal("y6Ss+zCYObpCbgfWfyNWTw==")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This only tests interpreted version, we also need to test codegen version, i.e. create an unsafe projection using AesDecrypt and execute it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan I added codegen version tests with creating unsafe projection. But I'm not very sure whether I am doing right or not. Could you take a look again?
Meanwhile, I took look at code ExpressionEvalHelper.scala checkEvalutionWithUnsafeProjection(). Is that doing the same thing as you said to test codegen version?
|
Test build #51476 has finished for PR 10527 at commit
|
to avoid python test errors
|
Test build #51508 has finished for PR 10527 at commit
|
| assert(instance1.apply(null).getString(0) === "ABC") | ||
|
|
||
| val instance2 = UnsafeProjection.create(expr2 :: Nil) | ||
| assert(instance2.apply(null).getString(0) === "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The above 2 are not needed. Normal cases are already covered by checkEvaluation, we only need to test error case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, Done that.
| evaluate(expr3) | ||
| } | ||
| intercept[java.security.InvalidKeyException] { | ||
| UnsafeProjection.create(expr3::Nil).apply(null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: expr3 :: Nil, we should add space
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
LGTM overall |
|
Test build #51549 has finished for PR 10527 at commit
|
|
Test build #51551 has finished for PR 10527 at commit
|
|
LGTM, merging this into master, thanks! |
| * @group misc_funcs | ||
| * @since 2.0.0 | ||
| */ | ||
| def aes_encrypt(input: Column, key: Column): Column = withExpr { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
key should just be an int
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Key is not integer, it's binary, see examples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think so. The examples is just one of those cases. Key also could be abcdef1234567890.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then we should probably just take a string.
the thing is we want to facilitate the most common cases, e.g. if 99% of the time people are passing in keys directly rather than relying on some other columns' values, we should just let them pass the literals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of the functions take literals, some not., In 2.0, should we clean up all the mess?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the other examples?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually we do have a mess of it. Generally we have 3 kinds of parameters: Column, column name string, literal. For example, pow takes 2 parameters, and we have 8 overloaded versions of it, which are all combinations of these 3 kinds of parameters except both literals. However, some functions like pmod, only have both Column version. Some functions like sha2 only have Column, literal combination.
Personly I think for something like sha2 and this one, where one parameter will be literal 99% of the time, we should just give a Column, literal combination. i.e. def aes_encrypt(input: Column, key: String). For something like pow, we only need to provide the both Column version, as column name string and literal are easy to parse to Column, by col() and lit().
|
@davies does this implementation even make sense? decrypt always return a string, while encrypt always take in a binary? The two are not symmetric. |
|
Sorry I'm not convinced this is correct (decrypt returning string type), and there are other issues with it. I'm just going to revert the patch, since it is unlikely other things will conflict with this. |
|
@vectorijk I've reverted the patch. Can you reopen the pull request and fix the return types? |
|
Ok, Sure. I will do that. |
|
I did not look into this closely, since @cloud-fan already reviewed this many rounds. sorry for the rush. |
| @ExpressionDescription( | ||
| usage = "_FUNC_(input, key) - Decrypts input using AES.", | ||
| extended = "> SELECT _FUNC_(UnBase64('y6Ss+zCYObpCbgfWfyNWTw=='),'1234567890123456');\n 'ABC'") | ||
| case class AesDecrypt(left: Expression, right: Expression) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i find it confusing to have left/right here (e.g. input, key)
Let's give them a proper name, and then just override def left: Expression = input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also if the key is literal, i'd just do some input data type checking in analysis (override checkInputTypes) to make sure the key is in acceptable range.
|
I would also just remove support for 192/256, so we don't have to explain the JCE stuff. |
|
Sorry missed the symmetry part, but it is symmetric underneath, except |
|
Thanks so much for suggestion! I will open a new PR for update. |
|
@vectorijk Is this PR dead ? |
No description provided.