
[SPARK-37591][SQL] Support the GCM mode by aes_encrypt()/aes_decrypt() #34852

Closed
wants to merge 7 commits

Conversation

MaxGekk
Member

@MaxGekk MaxGekk commented Dec 9, 2021

What changes were proposed in this pull request?

In the PR, I propose a new AES mode for the aes_encrypt()/aes_decrypt() functions - GCM (Galois/Counter Mode) without padding. The aes_encrypt() function returns a binary value which consists of the following fields:

  1. An initialization vector (IV) generated for every aes_encrypt() call using java.security.SecureRandom. The length of the IV is 12 bytes.
  2. The encrypted input.
  3. An authentication tag that can be used to verify the integrity of the data. The length of the tag is 16 bytes.

The aes_decrypt() function assumes that its input contains the fields in the order shown above.

For example:

spark-sql> SELECT base64(aes_encrypt('Apache Spark', '0000111122223333', 'GCM', 'NONE'));
ak3k4wV71YJAjg6d5sUTm15Ag9fyQP0UL/ComBkBxzVvft2NBOCF+g==
spark-sql> SELECT aes_decrypt(unbase64('ak3k4wV71YJAjg6d5sUTm15Ag9fyQP0UL/ComBkBxzVvft2NBOCF+g=='), '0000111122223333', 'GCM', 'NONE');
Apache Spark
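
For reference, the payload layout described above (12-byte IV, then the ciphertext, then the 16-byte tag) can be reproduced with plain javax.crypto. The following is a minimal Scala sketch for illustration only, not the Spark implementation; the object and method names are made up for the example:

import java.security.SecureRandom
import javax.crypto.Cipher
import javax.crypto.spec.{GCMParameterSpec, SecretKeySpec}

object GcmPayloadSketch {
  private val IvLengthBytes = 12   // IV length used in the layout above
  private val TagLengthBits = 128  // 16-byte authentication tag

  // Returns IV || ciphertext || tag; JCE appends the tag to the ciphertext.
  def encrypt(input: Array[Byte], key: Array[Byte]): Array[Byte] = {
    val iv = new Array[Byte](IvLengthBytes)
    new SecureRandom().nextBytes(iv)
    val cipher = Cipher.getInstance("AES/GCM/NoPadding")
    cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"),
      new GCMParameterSpec(TagLengthBits, iv))
    iv ++ cipher.doFinal(input)
  }

  // Assumes the same layout: the first 12 bytes are the IV, the rest is ciphertext + tag.
  def decrypt(payload: Array[Byte], key: Array[Byte]): Array[Byte] = {
    val iv = payload.take(IvLengthBytes)
    val cipher = Cipher.getInstance("AES/GCM/NoPadding")
    cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"),
      new GCMParameterSpec(TagLengthBits, iv))
    cipher.doFinal(payload, IvLengthBytes, payload.length - IvLengthBytes)
  }

  def main(args: Array[String]): Unit = {
    val key = "0000111122223333".getBytes("UTF-8") // 16 bytes -> AES-128
    val encrypted = encrypt("Apache Spark".getBytes("UTF-8"), key)
    println(new String(decrypt(encrypted, key), "UTF-8")) // prints: Apache Spark
  }
}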

Why are the changes needed?

To achieve feature parity with other systems/frameworks, and make the migration process from them to Spark SQL easier. For example, the GCM mode is supported by:

Does this PR introduce any user-facing change?

No. The AES functions haven't been released yet.

How was this patch tested?

By running new checks:

$ build/sbt "test:testOnly org.apache.spark.sql.DataFrameFunctionsSuite"
$ build/sbt "sql/test:testOnly org.apache.spark.sql.expressions.ExpressionInfoSuite"
$ build/sbt "test:testOnly org.apache.spark.sql.MiscFunctionsSuite"

@github-actions github-actions bot added the SQL label Dec 9, 2021
@SparkQA

SparkQA commented Dec 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50519/

@SparkQA

SparkQA commented Dec 9, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50519/

@SparkQA

SparkQA commented Dec 9, 2021

Test build #146044 has finished for PR 34852 at commit 1c3e5b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk MaxGekk marked this pull request as ready for review December 10, 2021 06:27
@MaxGekk MaxGekk changed the title [WIP][SPARK-37591][SQL] Support the GCM mode by aes_encrypt()/aes_decrypt() [SPARK-37591][SQL] Support the GCM mode by aes_encrypt()/aes_decrypt() Dec 10, 2021
@SparkQA

SparkQA commented Dec 10, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50539/

@SparkQA

SparkQA commented Dec 10, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50539/

@SparkQA

SparkQA commented Dec 10, 2021

Test build #146063 has finished for PR 34852 at commit 98885d0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sarutak
Member

sarutak commented Dec 13, 2021

retest this please.

""",
arguments = """
Arguments:
* expr - The binary value to encrypt.
* key - The passphrase to use to encrypt the data.
* mode - Specifies which block cipher mode should be used to encrypt messages.
- Supported modes: ECB.
+ Supported modes: ECB, GCM.
Member

nit: Should we consistently use Supported or Valid?

Contributor

+1

assert(encrypted.filter($"enc" === $"input").isEmpty)
val result = encrypted.selectExpr(
"CAST(aes_decrypt(enc, key, 'GCM', 'NONE') AS STRING) AS res", "input")
assert(!result.filter($"res" === $"input").isEmpty)
Member

To test that all the records are decrypted correctly, should we do assert(result.filter($"res" !== $"input").isEmpty)?
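
For context, the full round-trip check with that assertion would look roughly like the sketch below (=!= is the non-deprecated spelling of !== on Column):

val result = encrypted.selectExpr(
  "CAST(aes_decrypt(enc, key, 'GCM', 'NONE') AS STRING) AS res", "input")
// An empty mismatch set means every row decrypted back to its original input.
assert(result.filter($"res" =!= $"input").isEmpty)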

Member

Other AES tests are in DataFrameFunctionsSuite, so should we move this test to the same place?

Member Author

I would rather move the other tests for misc functions from DataFrameFunctionsSuite to the dedicated MiscFunctionsSuite. I was thinking of moving all AES-related tests here but decided not to make unrelated changes.

Member Author

DataFrameFunctionsSuite already has > 3800 lines. I don't think it makes sense to place a new misc test there when there is a dedicated test suite for such functions.

@SparkQA

SparkQA commented Dec 13, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50587/

@SparkQA

SparkQA commented Dec 13, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50587/

* padding - Specifies how to pad messages whose length is not a multiple of the block size.
- Valid values: PKCS.
+ Valid values: PKCS, NONE.
Contributor

Shall we have a smarter way to decide the default padding? e.g. if the mode is GCM, the default padding is NONE.

We can also require the mode and padding parameters to be constant, to simplify the implementation.

Member Author

We can also require the mode and padding parameters to be constant, to simplify the implementation.

I wouldn't do that because it is an unnecessary restriction from my point of view. I can imagine a use case where data is gathered from different sources, encrypted slightly differently (using different paddings/modes), and processed in one place. What you propose would require somehow splitting the input data into separate dataframes (or selects + unions) and processing them separately. I don't see any reason to bring such pain to users.

Shall we have a smarter way to decide the default padding? e.g. if the mode is GCM, the default padding is NONE.

I'm thinking of introducing a DEFAULT value for padding which the AES implementation substitutes with a concrete value (NONE or PKCS) depending on the mode.
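
For illustration, such a substitution could look roughly like the sketch below (hypothetical; the function name and the exact per-mode defaults are assumptions, not the committed implementation):

def resolvePadding(mode: String, padding: String): String =
  (mode.toUpperCase, padding.toUpperCase) match {
    // An explicitly requested padding always wins.
    case (_, p) if p != "DEFAULT" => p
    // GCM does not need padding, so DEFAULT resolves to NONE.
    case ("GCM", _) => "NONE"
    // ECB is a block mode, so DEFAULT resolves to PKCS padding.
    case ("ECB", _) => "PKCS"
    case (m, _) => throw new IllegalArgumentException(s"Unsupported AES mode: $m")
  }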

Contributor

I'm thinking of introducing a DEFAULT value for padding

Yea this also works, and we need to document the default padding for each mode.

@SparkQA

SparkQA commented Dec 13, 2021

Test build #146112 has finished for PR 34852 at commit 98885d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -386,7 +386,7 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSparkSession {
}

// Unsupported AES mode and padding in decrypt
checkUnsupportedMode(df2.selectExpr(s"aes_decrypt(value16, '$key16', 'GSM')"))
checkUnsupportedMode(df2.selectExpr(s"aes_decrypt(value16, '$key16', 'GSM', 'PKCS')"))
Contributor

shall we test GCM?

@SparkQA

SparkQA commented Dec 13, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50604/

@SparkQA

SparkQA commented Dec 13, 2021

Test build #146128 has finished for PR 34852 at commit dffeec7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 13, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50606/

@SparkQA

SparkQA commented Dec 13, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50606/

@MaxGekk
Member Author

MaxGekk commented Dec 13, 2021

It seems the "Linters, licenses, dependencies and documentation generation" job is flaky (this PR https://github.com/monkeyboy123/spark/actions/runs/1570703624 has the same issue):

Error in loadNamespace(x) : there is no package called ‘pkgdown’
Calls: loadNamespace -> withRestarts -> withOneRestart -> doWithOneRestart
Execution halted
                    ------------------------------------------------
      Jekyll 4.2.1   Please append `--trace` to the `build` command 
                     for any additional information or backtrace. 
                    ------------------------------------------------

I am going to ignore the R issue and merge this PR since the other GitHub Actions jobs passed successfully.

@MaxGekk
Member Author

MaxGekk commented Dec 13, 2021

Merging to master. Thank you, @sarutak @gengliangwang and @cloud-fan for review.

@MaxGekk MaxGekk closed this in 5e4d664 Dec 13, 2021
@SparkQA

SparkQA commented Dec 13, 2021

Test build #146131 has finished for PR 34852 at commit fc29610.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
