
[SPARK-34990][SQL][TESTS] Add ParquetEncryptionSuite #32146

Closed
wants to merge 14 commits

Conversation

@andersonm-ibm (Contributor) commented Apr 13, 2021

What changes were proposed in this pull request?

A simple test that writes and reads an encrypted parquet and verifies that it's encrypted by checking its magic string (in encrypted footer mode).
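
For reference, the check is possible because a plaintext Parquet file begins and ends with the 4-byte magic string `PAR1`, while a file written in encrypted-footer mode carries the magic `PARE` instead. A minimal sketch of that verification (Python here for illustration; the suite itself does this in Scala):

```python
# Parquet magic strings: "PAR1" for plaintext files,
# "PARE" for files written in encrypted-footer mode.
PLAINTEXT_MAGIC = b"PAR1"
ENCRYPTED_MAGIC = b"PARE"

def has_encrypted_footer(path):
    """Return True if the file's leading magic indicates an encrypted footer."""
    with open(path, "rb") as f:
        return f.read(len(ENCRYPTED_MAGIC)) == ENCRYPTED_MAGIC
```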

Why are the changes needed?

To provide a test coverage for Parquet encryption.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

@ggershinsky (Contributor)

thanks Maya!
cc @dbtsai

@dbtsai (Member) commented Apr 13, 2021

Jenkins, okay to test.

}
keyMap
}
}
Member

add new line.

sql/core/pom.xml Outdated
<groupId>org.codehaus.jackson</groupId>
<artifactId>jackson-mapper-asl</artifactId>
</dependency>

Member

remove unneeded change.

Contributor Author

@dbtsai , Unfortunately, without the change I get the following exception:
java.lang.NoClassDefFoundError: org/codehaus/jackson/type/TypeReference
    at org.apache.parquet.crypto.keytools.FileKeyWrapper.getEncryptionKeyMetadata(FileKeyWrapper.java:140)
    at org.apache.parquet.crypto.keytools.FileKeyWrapper.getEncryptionKeyMetadata(FileKeyWrapper.java:113)
    at org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory.getFileEncryptionProperties(PropertiesDrivenCryptoFactory.java:127)
    at org.apache.parquet.hadoop.ParquetOutputFormat.createEncryptionProperties(ParquetOutputFormat.java:554)
    at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:478)
    at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:420)
    at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:409)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:36)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:150)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:126)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:111)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:271)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:211)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.type.TypeReference
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)

Member

I mean the extra new line change.

cc @dongjoon-hyun to check if it's okay to add jackson-mapper-asl as new dep in core module.

if (null == masterKey) {
throw new ParquetCryptoRuntimeException("Key not found: " + masterKeyIdentifier)
}
val AAD: Array[Byte] = masterKeyIdentifier.getBytes(StandardCharsets.UTF_8)
Contributor

no need to use AADs here, they don't enhance integrity protection

if (null == masterKey) {
throw new ParquetCryptoRuntimeException("Key not found: " + masterKeyIdentifier)
}
val AAD: Array[Byte] = masterKeyIdentifier.getBytes(StandardCharsets.UTF_8)
Contributor

same here

/**
* This is a mock class, built just for parquet encryption testing in Spark
* and based on InMemoryKMS in parquet-hadoop tests.
* Don't use it as an example of a KmsClient implementation.
@dongjoon-hyun (Member) left a comment

Thank you for pinging me, @dbtsai .

Thank you for your contribution, @andersonm-ibm . In terms of the dependency, adding jackson-mapper-asl to more important modules doesn't look good to me in many ways, although it's already used in Apache Spark.

  • jackson-mapper-asl is ancient (its last release was in 2013). We cannot expect more bug patches.
  • The package itself is officially declared to be replaced with com.fasterxml.jackson.core » jackson-databind.

In sql/core module, can we use jackson-databind instead?

@dongjoon-hyun (Member) commented Apr 14, 2021

cc @viirya , @sunchao , @attilapiros , too.

sql/core/pom.xml Outdated
@@ -113,6 +113,10 @@
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
</dependency>
<dependency>
<groupId>org.codehaus.jackson</groupId>
<artifactId>jackson-mapper-asl</artifactId>
Member

In the hive module it was inevitable due to Apache Hive, but this new addition will affect Spark's mllib module and kafka-0-10-sql module, too. Where does this come from?

  • If this is used only for testing, this should be a test dependency, @andersonm-ibm .
  • Another option is moving this test case to the hive module.
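
The first option, a test-scoped dependency, would look roughly like this in sql/core/pom.xml (a sketch only; the version is assumed to be managed by the parent pom):

```xml
<dependency>
  <groupId>org.codehaus.jackson</groupId>
  <artifactId>jackson-mapper-asl</artifactId>
  <!-- visible to tests only; not added to the compile/runtime classpath -->
  <scope>test</scope>
</dependency>
```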

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-34990] Add a test for Parquet Modular Encryption in Spark. [SPARK-34990][TESTS] Add a test for Parquet Modular Encryption Apr 14, 2021
/**
* A test suite that tests parquet modular encryption usage in Spark.
*/
class ParquetEncryptionTest extends QueryTest with SharedSparkSession {
Member

Let's rename this: ParquetEncryptionTest -> ParquetEncryptionSuite.

import scala.sys.process._

/**
* A test suite that tests parquet modular encryption usage in Spark.
Member

Please remove "in Spark".


import testImplicits._

test("Write and read an encrypted parquet") {
Member

If you don't mind, please add a JIRA prefix like the following.

- test("Write and read an encrypted parquet") {
+ test("SPARK-34990: Write and read an encrypted parquet") {

@viirya (Member) commented Apr 14, 2021

Thank you for pinging me, @dongjoon-hyun.

Can we restore the PR description template? Spark pull requests follow the format and it will be in the commit log. Thanks.

Comment on lines 20 to 30
import java.io.File
import java.nio.charset.StandardCharsets
import java.util.{Base64, HashMap, Map}

import org.apache.hadoop.conf.Configuration
import org.apache.parquet.crypto.{KeyAccessDeniedException, ParquetCryptoRuntimeException}
import org.apache.parquet.crypto.keytools.{KeyToolkit, KmsClient}
import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.test.SharedSparkSession

import scala.sys.process._
Member

Please separate the import groups with newlines, in the order: 1. Java imports, 2. Scala imports, 3. third-party imports, 4. Spark imports. E.g.,

import java.io.File
import java.nio.charset.StandardCharsets
import java.util.{Base64, HashMap, Map}

import scala.sys.process._

import org.apache.hadoop.conf.Configuration
import org.apache.parquet.crypto.{KeyAccessDeniedException, ParquetCryptoRuntimeException}
import org.apache.parquet.crypto.keytools.{KeyToolkit, KmsClient}

import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.test.SharedSparkSession

val parquetDF = spark.read.parquet(parquetDir)
assert(parquetDF.inputFiles.nonEmpty)
val ds = parquetDF.select("a", "b", "c")
ds.show()
Member

Instead of show, should we verify the content?
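
In the suite this kind of verification is usually done with `checkAnswer`, which compares rows order-insensitively instead of printing them. The idea, as a small Python sketch (the helper name and rows below are illustrative, not from the PR):

```python
def assert_same_rows(actual, expected):
    """Order-insensitive row comparison -- the idea behind Spark's checkAnswer."""
    assert sorted(actual) == sorted(expected), f"rows differ: {actual} vs {expected}"

# e.g. verify the decrypted read-back against the rows that were written
written = [(1, "a", "x"), (2, "b", "y")]
read_back = [(2, "b", "y"), (1, "a", "x")]
assert_same_rows(read_back, written)
```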

@throws[KeyAccessDeniedException]
@throws[UnsupportedOperationException]
override def wrapKey(keyBytes: Array[Byte], masterKeyIdentifier: String): String = {
println(s"Wrap Key ${masterKeyIdentifier}")
Member

Is println necessary?

@throws[KeyAccessDeniedException]
@throws[UnsupportedOperationException]
override def unwrapKey(wrappedKey: String, masterKeyIdentifier: String): Array[Byte] = {
println(s"Unwrap Key ${masterKeyIdentifier}")
Member

ditto.
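
For context, wrapKey/unwrapKey are essentially the whole surface of the mock KMS client under review. A toy Python sketch of the same shape (purely illustrative and insecure; the real suite's Scala mock is based on parquet-hadoop's InMemoryKMS, and all names below are hypothetical):

```python
import base64
from itertools import cycle

class MockKmsClient:
    """In-memory stand-in mirroring the wrapKey/unwrapKey contract. NOT secure."""

    def __init__(self, master_keys):
        self._master_keys = master_keys  # dict: master key id -> key bytes

    def _master(self, master_key_id):
        key = self._master_keys.get(master_key_id)
        if key is None:
            raise KeyError(f"Key not found: {master_key_id}")
        return key

    def wrap_key(self, key_bytes, master_key_id):
        # toy "wrap": XOR with the master key, then base64-encode
        master = self._master(master_key_id)
        xored = bytes(b ^ m for b, m in zip(key_bytes, cycle(master)))
        return base64.b64encode(xored).decode("ascii")

    def unwrap_key(self, wrapped_key, master_key_id):
        # inverse of wrap_key: base64-decode, then XOR again
        master = self._master(master_key_id)
        xored = base64.b64decode(wrapped_key)
        return bytes(b ^ m for b, m in zip(xored, cycle(master)))
```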

@ggershinsky (Contributor)

@dongjoon-hyun @dbtsai I agree replacing codehaus jackson with the fasterxml one is the right thing to do in the next parquet version. Regarding the current situation: parquet 1.12.0 has been released with the codehaus runtime dependency. This jackson was leveraged by PME a few years back and tested with Spark 2.4 and 3.0. All these Spark distros, and the latest 3.1.1, have the codehaus jar. The current master drops this dependency in the core, but maybe it can be kept for one more release, so PME is enabled in Spark 3.2.0? We will work on replacing the jackson in parquet and making sure it's properly tested (incl. backwards compatibility with 1.12.0), etc.; that can take some time, so the next parquet version would go into the next Spark version after 3.2.0?

@dongjoon-hyun (Member) commented Apr 14, 2021

@ggershinsky . You should split (1) the sql/core module dependency from (2) the Spark distribution dependency. I didn't ask for an Apache Parquet dependency change or an Apache Spark dependency change. I already provided @andersonm-ibm the solution in #32146 (comment). Please simply move the test case to the hive module.


@dongjoon-hyun (Member) left a comment

Please don't forget this comment, @andersonm-ibm .

Another option is moving this test case to the hive module.

@ggershinsky (Contributor)

@dongjoon-hyun Got you, sounds good! Having the codehaus jackson jar in the Spark 3.2.0 distribution will enable users to work with Parquet encryption out of the box in this Spark version.
As for the test case - deferring this to @andersonm-ibm

@dongjoon-hyun dongjoon-hyun self-assigned this Apr 16, 2021
@andersonm-ibm (Contributor Author)


Thank you, @dongjoon-hyun , for all your comments and suggestions. I'll move the test case to the hive module and update the PR.

@dongjoon-hyun (Member)

ok to test

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-34990][TESTS] Add a test for Parquet Modular Encryption [SPARK-34990][SQL][TESTS] Add ParquetEncryptionSuite Apr 17, 2021
@dongjoon-hyun (Member)

Thank you for the update, @andersonm-ibm .
In addition, I revised the PR title and description according to @viirya 's advice (#32146 (comment) )


import org.apache.hadoop.conf.Configuration
import org.apache.parquet.crypto.keytools.{KeyToolkit, KmsClient}
import org.apache.parquet.crypto.{KeyAccessDeniedException, ParquetCryptoRuntimeException}
Member

Could you run dev/lint-scala and fix the Scala style error?

[error] /Users/dongjoon/PRS/SPARK-PR-32146/sql/hive/src/test/scala/org/apache/spark/sql/hive/ParquetEncryptionSuite.scala:28:0: 
org.apache.parquet.crypto. is in wrong order relative to org.apache.parquet.crypto.keytools..

@SparkQA commented Apr 23, 2021

Test build #137836 has finished for PR 32146 at commit aa507a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42371/

@SparkQA commented Apr 23, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42371/

@SparkQA commented Apr 23, 2021

Test build #137838 has finished for PR 32146 at commit aa507a4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

Retest this please

@SparkQA commented Apr 23, 2021

Test build #137856 has started for PR 32146 at commit aa507a4.

@SparkQA commented Apr 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42386/

@SparkQA commented Apr 23, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42386/

@SparkQA commented Apr 23, 2021

Test build #137841 has finished for PR 32146 at commit aa507a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros (Contributor) left a comment

@andersonm-ibm We are very close. I just found two minor things.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-34990][SQL][TESTS][test-maven] Add ParquetEncryptionSuite [SPARK-34990][SQL][TESTS][test-maven][test-hadoop2.7] Add ParquetEncryptionSuite Apr 23, 2021
@dongjoon-hyun (Member)

Please address @attilapiros 's comment. That doesn't make a difference for the test result.

… verify all parquet parts in parquet folder.
@SparkQA commented Apr 24, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42409/

@SparkQA commented Apr 24, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42409/

@SparkQA commented Apr 24, 2021

Test build #137880 has finished for PR 32146 at commit 599c71d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros (Contributor)

jenkins retest this please

@SparkQA commented Apr 24, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42413/

@SparkQA commented Apr 24, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42413/

@dongjoon-hyun (Member)

Retest this please

@SparkQA commented Apr 24, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42419/

@SparkQA commented Apr 24, 2021

Test build #137887 has finished for PR 32146 at commit 599c71d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-34990][SQL][TESTS][test-maven][test-hadoop2.7] Add ParquetEncryptionSuite [SPARK-34990][SQL][TESTS] Add ParquetEncryptionSuite Apr 24, 2021
@dongjoon-hyun (Member) left a comment

+1, LGTM. Thank you, @andersonm-ibm , @ggershinsky , @dbtsai , @viirya , @attilapiros .
Since all comments are addressed, I'll merge to master for Apache Spark 3.2.

@dongjoon-hyun (Member)

@andersonm-ibm . I added you to the Apache Spark contributor group and SPARK-34990 is assigned to you.
Thank you for your first contribution, @andersonm-ibm ! Welcome.

@SparkQA commented Apr 25, 2021

Test build #137894 has finished for PR 32146 at commit 599c71d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andersonm-ibm (Contributor Author)


Thank you, @dongjoon-hyun , and thank you for your patient help!

@dongjoon-hyun dongjoon-hyun removed their assignment Apr 14, 2024
9 participants