
[SPARK-34990][SQL][TESTS] Add ParquetEncryptionSuite #32146

Closed
wants to merge 14 commits

Conversation

@andersonm-ibm (Contributor) commented Apr 13, 2021

What changes were proposed in this pull request?

A simple test that writes and reads an encrypted parquet and verifies that it's encrypted by checking its magic string (in encrypted footer mode).
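
For reference, the check is possible because a plaintext Parquet file begins and ends with the 4-byte magic string `PAR1`, while a file written in encrypted-footer mode carries the magic `PARE` instead. A minimal sketch of that verification (Python here for illustration; the suite itself does this in Scala):

```python
# Parquet magic strings: "PAR1" for plaintext files,
# "PARE" for files written in encrypted-footer mode.
PLAINTEXT_MAGIC = b"PAR1"
ENCRYPTED_MAGIC = b"PARE"

def has_encrypted_footer(path):
    """Return True if the file's leading magic indicates an encrypted footer."""
    with open(path, "rb") as f:
        return f.read(len(ENCRYPTED_MAGIC)) == ENCRYPTED_MAGIC
```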

Why are the changes needed?

To provide a test coverage for Parquet encryption.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

@ggershinsky (Contributor)

thanks Maya!
cc @dbtsai

@dbtsai (Member) commented Apr 13, 2021

Jenkins, okay to test.

}
keyMap
}
}
Member

add new line.

sql/core/pom.xml Outdated
<groupId>org.codehaus.jackson</groupId>
<artifactId>jackson-mapper-asl</artifactId>
</dependency>

Member

remove unneeded change.

Contributor Author

@dbtsai , Unfortunately, without the change I get the following exception:
java.lang.NoClassDefFoundError: org/codehaus/jackson/type/TypeReference
    at org.apache.parquet.crypto.keytools.FileKeyWrapper.getEncryptionKeyMetadata(FileKeyWrapper.java:140)
    at org.apache.parquet.crypto.keytools.FileKeyWrapper.getEncryptionKeyMetadata(FileKeyWrapper.java:113)
    at org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory.getFileEncryptionProperties(PropertiesDrivenCryptoFactory.java:127)
    at org.apache.parquet.hadoop.ParquetOutputFormat.createEncryptionProperties(ParquetOutputFormat.java:554)
    at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:478)
    at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:420)
    at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:409)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:36)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:150)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:126)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:111)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:271)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:211)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.type.TypeReference
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)

Member

I mean the extra new line change.

cc @dongjoon-hyun to check if it's okay to add jackson-mapper-asl as new dep in core module.

if (null == masterKey) {
throw new ParquetCryptoRuntimeException("Key not found: " + masterKeyIdentifier)
}
val AAD: Array[Byte] = masterKeyIdentifier.getBytes(StandardCharsets.UTF_8)
Contributor

no need to use AADs here, they don't enhance integrity protection

if (null == masterKey) {
throw new ParquetCryptoRuntimeException("Key not found: " + masterKeyIdentifier)
}
val AAD: Array[Byte] = masterKeyIdentifier.getBytes(StandardCharsets.UTF_8)
Contributor

same here

/**
* This is a mock class, built just for parquet encryption testing in Spark
* and based on InMemoryKMS in parquet-hadoop tests.
* Don't use it as an example of a KmsClient implementation.
@dongjoon-hyun (Member) left a comment

Thank you for pinging me, @dbtsai .

Thank you for your contribution, @andersonm-ibm . In terms of the dependency, adding jackson-mapper-asl to more important modules doesn't look good to me in many ways, although it's already used in Apache Spark.

  • jackson-mapper-asl is ancient (its last release was in 2013). We cannot expect more bug patches.
  • The package itself is officially declared to be replaced with com.fasterxml.jackson.core » jackson-databind.

In sql/core module, can we use jackson-databind instead?

@dongjoon-hyun (Member) commented Apr 14, 2021

cc @viirya , @sunchao , @attilapiros , too.

sql/core/pom.xml Outdated
@@ -113,6 +113,10 @@
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
</dependency>
<dependency>
<groupId>org.codehaus.jackson</groupId>
<artifactId>jackson-mapper-asl</artifactId>
Member

In the hive module it was inevitable due to Apache Hive, but this new addition will affect Spark's mllib module and kafka-0-10-sql module, too. Where does this come from?

  • If this is used only for testing, this should be a test dependency, @andersonm-ibm .
  • Another option is moving this test case to the hive module.
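
The first option, a test-scoped dependency, would look roughly like this in sql/core/pom.xml (a sketch only; the version is assumed to be managed by the parent pom):

```xml
<dependency>
  <groupId>org.codehaus.jackson</groupId>
  <artifactId>jackson-mapper-asl</artifactId>
  <!-- visible to tests only; not added to the compile/runtime classpath -->
  <scope>test</scope>
</dependency>
```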

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-34990] Add a test for Parquet Modular Encryption in Spark. [SPARK-34990][TESTS] Add a test for Parquet Modular Encryption Apr 14, 2021
/**
* A test suite that tests parquet modular encryption usage in Spark.
*/
class ParquetEncryptionTest extends QueryTest with SharedSparkSession {
Member

Let's rename this: ParquetEncryptionTest -> ParquetEncryptionSuite.

import scala.sys.process._

/**
* A test suite that tests parquet modular encryption usage in Spark.
Member

Please remove "in Spark".


import testImplicits._

test("Write and read an encrypted parquet") {
Member

If you don't mind, please add a JIRA prefix like the following.

- test("Write and read an encrypted parquet") {
+ test("SPARK-34990: Write and read an encrypted parquet") {

@viirya (Member) commented Apr 14, 2021

Thank you for pinging me, @dongjoon-hyun.

Can we restore the PR description template? Spark pull requests follow the format and it will be in the commit log. Thanks.

Comment on lines 20 to 30
import java.io.File
import java.nio.charset.StandardCharsets
import java.util.{Base64, HashMap, Map}

import org.apache.hadoop.conf.Configuration
import org.apache.parquet.crypto.{KeyAccessDeniedException, ParquetCryptoRuntimeException}
import org.apache.parquet.crypto.keytools.{KeyToolkit, KmsClient}
import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.test.SharedSparkSession

import scala.sys.process._
Member

Please separate the import groups with newlines, in the order: 1. Java imports, 2. Scala imports, 3. third-party imports, 4. Spark imports. E.g.,

import java.io.File
import java.nio.charset.StandardCharsets
import java.util.{Base64, HashMap, Map}

import scala.sys.process._

import org.apache.hadoop.conf.Configuration
import org.apache.parquet.crypto.{KeyAccessDeniedException, ParquetCryptoRuntimeException}
import org.apache.parquet.crypto.keytools.{KeyToolkit, KmsClient}

import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.test.SharedSparkSession

val parquetDF = spark.read.parquet(parquetDir)
assert(parquetDF.inputFiles.nonEmpty)
val ds = parquetDF.select("a", "b", "c")
ds.show()
Member

Instead of show, should we verify the content?
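
In the suite this kind of verification is usually done with `checkAnswer`, which compares rows order-insensitively instead of printing them. The idea, as a small Python sketch (the helper name and rows below are illustrative, not from the PR):

```python
def assert_same_rows(actual, expected):
    """Order-insensitive row comparison -- the idea behind Spark's checkAnswer."""
    assert sorted(actual) == sorted(expected), f"rows differ: {actual} vs {expected}"

# e.g. verify the decrypted read-back against the rows that were written
written = [(1, "a", "x"), (2, "b", "y")]
read_back = [(2, "b", "y"), (1, "a", "x")]
assert_same_rows(read_back, written)
```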

@throws[KeyAccessDeniedException]
@throws[UnsupportedOperationException]
override def wrapKey(keyBytes: Array[Byte], masterKeyIdentifier: String): String = {
println(s"Wrap Key ${masterKeyIdentifier}")
Member

Is println necessary?

@throws[KeyAccessDeniedException]
@throws[UnsupportedOperationException]
override def unwrapKey(wrappedKey: String, masterKeyIdentifier: String): Array[Byte] = {
println(s"Unwrap Key ${masterKeyIdentifier}")
Member

ditto.
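
For context, wrapKey/unwrapKey are essentially the whole surface of the mock KMS client under review. A toy Python sketch of the same shape (purely illustrative and insecure; the real suite's Scala mock is based on parquet-hadoop's InMemoryKMS, and all names below are hypothetical):

```python
import base64
from itertools import cycle

class MockKmsClient:
    """In-memory stand-in mirroring the wrapKey/unwrapKey contract. NOT secure."""

    def __init__(self, master_keys):
        self._master_keys = master_keys  # dict: master key id -> key bytes

    def _master(self, master_key_id):
        key = self._master_keys.get(master_key_id)
        if key is None:
            raise KeyError(f"Key not found: {master_key_id}")
        return key

    def wrap_key(self, key_bytes, master_key_id):
        # toy "wrap": XOR with the master key, then base64-encode
        master = self._master(master_key_id)
        xored = bytes(b ^ m for b, m in zip(key_bytes, cycle(master)))
        return base64.b64encode(xored).decode("ascii")

    def unwrap_key(self, wrapped_key, master_key_id):
        # inverse of wrap_key: base64-decode, then XOR again
        master = self._master(master_key_id)
        xored = base64.b64decode(wrapped_key)
        return bytes(b ^ m for b, m in zip(xored, cycle(master)))
```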

@ggershinsky (Contributor)

@dongjoon-hyun @dbtsai I agree replacing codehaus jackson with the fasterxml one is the right thing to do in the next parquet version. Regarding the current situation: parquet 1.12.0 has been released with the codehaus runtime dependency. This jackson was leveraged by PME a few years back and tested with Spark 2.4 and 3.0. All these Spark distros, and the latest 3.1.1, have the codehaus jar. The current master drops this dependency in the core, but maybe it can be kept for one more release, so PME is enabled in Spark 3.2.0? We will work on replacing the jackson in parquet and making sure it's properly tested (incl. backwards compatibility with 1.12.0), etc.; that can take some time, so the next parquet version would go into the next Spark version after 3.2.0?

@dongjoon-hyun (Member) commented Apr 14, 2021

@ggershinsky . You should split (1) the sql/core module dependency from (2) the Spark distribution dependency. I didn't ask for an Apache Parquet dependency change or an Apache Spark dependency change. I already provided @andersonm-ibm the solution in #32146 (comment). Please simply move the test case to the hive module.


@dongjoon-hyun (Member) left a comment

Please don't forget this comment, @andersonm-ibm .

Another option is moving this test case to the hive module.

@ggershinsky (Contributor)

@dongjoon-hyun Got you, sounds good! Having the codehaus jackson jar in the Spark 3.2.0 distribution will enable users to work with Parquet encryption out of the box in this Spark version.
As for the test case - deferring this to @andersonm-ibm

@dongjoon-hyun dongjoon-hyun self-assigned this Apr 16, 2021
@andersonm-ibm (Contributor Author)


Thank you, @dongjoon-hyun , for all your comments and suggestions. I'll move the test case to the hive module and update the PR.

@dongjoon-hyun (Member)

ok to test

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-34990][TESTS] Add a test for Parquet Modular Encryption [SPARK-34990][SQL][TESTS] Add ParquetEncryptionSuite Apr 17, 2021
@dongjoon-hyun (Member)

Thank you for the update, @andersonm-ibm .
In addition, I revised the PR title and description according to @viirya 's advice (#32146 (comment) )


import org.apache.hadoop.conf.Configuration
import org.apache.parquet.crypto.keytools.{KeyToolkit, KmsClient}
import org.apache.parquet.crypto.{KeyAccessDeniedException, ParquetCryptoRuntimeException}
Member

Could you run dev/lint-scala and fix the Scala style error?

[error] /Users/dongjoon/PRS/SPARK-PR-32146/sql/hive/src/test/scala/org/apache/spark/sql/hive/ParquetEncryptionSuite.scala:28:0: 
org.apache.parquet.crypto. is in wrong order relative to org.apache.parquet.crypto.keytools..

@SparkQA commented Apr 23, 2021

Test build #137836 has finished for PR 32146 at commit aa507a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42371/

@SparkQA commented Apr 23, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42371/

@SparkQA commented Apr 23, 2021

Test build #137838 has finished for PR 32146 at commit aa507a4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

Retest this please

@SparkQA commented Apr 23, 2021

Test build #137856 has started for PR 32146 at commit aa507a4.

@SparkQA commented Apr 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42386/

@SparkQA commented Apr 23, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42386/

@SparkQA commented Apr 23, 2021

Test build #137841 has finished for PR 32146 at commit aa507a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros (Contributor) left a comment

@andersonm-ibm We are very close. I just found two minor things.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-34990][SQL][TESTS][test-maven] Add ParquetEncryptionSuite [SPARK-34990][SQL][TESTS][test-maven][test-hadoop2.7] Add ParquetEncryptionSuite Apr 23, 2021
@dongjoon-hyun (Member)

Please address @attilapiros 's comment. That doesn't make a difference for the test result.

… verify all parquet parts in parquet folder.
@SparkQA commented Apr 24, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42409/

@SparkQA commented Apr 24, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42409/

@SparkQA commented Apr 24, 2021

Test build #137880 has finished for PR 32146 at commit 599c71d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros (Contributor)

jenkins retest this please

@SparkQA commented Apr 24, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42413/

@SparkQA commented Apr 24, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42413/

@dongjoon-hyun (Member)

Retest this please

@SparkQA commented Apr 24, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42419/

@SparkQA commented Apr 24, 2021

Test build #137887 has finished for PR 32146 at commit 599c71d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-34990][SQL][TESTS][test-maven][test-hadoop2.7] Add ParquetEncryptionSuite [SPARK-34990][SQL][TESTS] Add ParquetEncryptionSuite Apr 24, 2021
@dongjoon-hyun (Member) left a comment

+1, LGTM. Thank you, @andersonm-ibm , @ggershinsky , @dbtsai , @viirya , @attilapiros .
Since all comments are addressed, I'll merge to master for Apache Spark 3.2.

@dongjoon-hyun (Member)

@andersonm-ibm . I added you to the Apache Spark contributor group and SPARK-34990 is assigned to you.
Thank you for your first contribution, @andersonm-ibm ! Welcome.

@SparkQA commented Apr 25, 2021

Test build #137894 has finished for PR 32146 at commit 599c71d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andersonm-ibm (Contributor Author)


Thank you, @dongjoon-hyun , and thank you for your patient help!

@dongjoon-hyun dongjoon-hyun removed their assignment Apr 14, 2024
9 participants