[SPARK-34990][SQL][TESTS] Add ParquetEncryptionSuite #32146
Conversation
thanks Maya!
Jenkins, okay to test.
  }
  keyMap
}
}
add new line.
sql/core/pom.xml
  <groupId>org.codehaus.jackson</groupId>
  <artifactId>jackson-mapper-asl</artifactId>
</dependency>
remove unneeded change.
@dbtsai , unfortunately, without the change I get the following exception:
java.lang.NoClassDefFoundError: org/codehaus/jackson/type/TypeReference
  at org.apache.parquet.crypto.keytools.FileKeyWrapper.getEncryptionKeyMetadata(FileKeyWrapper.java:140)
  at org.apache.parquet.crypto.keytools.FileKeyWrapper.getEncryptionKeyMetadata(FileKeyWrapper.java:113)
  at org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory.getFileEncryptionProperties(PropertiesDrivenCryptoFactory.java:127)
  at org.apache.parquet.hadoop.ParquetOutputFormat.createEncryptionProperties(ParquetOutputFormat.java:554)
  at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:478)
  at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:420)
  at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:409)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:36)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:150)
  at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:126)
  at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:111)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:271)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:211)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:131)
  at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.type.TypeReference
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
I mean the extra new line change.
cc @dongjoon-hyun to check if it's okay to add jackson-mapper-asl as a new dep in the core module.
if (null == masterKey) {
  throw new ParquetCryptoRuntimeException("Key not found: " + masterKeyIdentifier)
}
val AAD: Array[Byte] = masterKeyIdentifier.getBytes(StandardCharsets.UTF_8)
No need to use AADs here; they don't enhance integrity protection.
if (null == masterKey) {
  throw new ParquetCryptoRuntimeException("Key not found: " + masterKeyIdentifier)
}
val AAD: Array[Byte] = masterKeyIdentifier.getBytes(StandardCharsets.UTF_8)
same here
/**
 * This is a mock class, built just for parquet encryption testing in Spark
 * and based on InMemoryKMS in parquet-hadoop tests.
 * Don't use it as an example of a KmsClient implementation.
It might be helpful to add a link to a sample KmsClient in this comment,
https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/parquet-hadoop/src/test/java/org/apache/parquet/crypto/keytools/samples/VaultClient.java
Thank you for pinging me, @dbtsai .
Thank you for your contribution, @andersonm-ibm . In terms of the dependency, adding jackson-mapper-asl into more important modules doesn't look good to me in many ways, although it's used in Apache Spark:
- jackson-mapper-asl is ancient (the last release was in 2013). We cannot expect more bug patches.
- The package itself is officially declared to be replaced with com.fasterxml.jackson.core » jackson-databind.

In the sql/core module, can we use jackson-databind instead?
cc @viirya , @sunchao , @attilapiros , too.
sql/core/pom.xml
@@ -113,6 +113,10 @@
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
  </dependency>
  <dependency>
    <groupId>org.codehaus.jackson</groupId>
    <artifactId>jackson-mapper-asl</artifactId>
In the hive module, it was inevitable due to Apache Hive, but this new addition will affect Spark's mllib module and the kafka-0-10-sql module, too. Where does this come from?
- If this is used only for testing, this should be a test dependency, @andersonm-ibm .
- Another option is moving this test case to the hive module.
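If the jar is only needed by the new suite, the test-dependency option might look roughly like this in sql/core/pom.xml (a sketch; the version is assumed to be managed by the parent pom):

```xml
<!-- Hypothetical sketch: restrict jackson-mapper-asl to the test classpath
     so it does not leak into the compile/runtime classpath of modules that
     depend on sql/core, such as mllib and kafka-0-10-sql. -->
<dependency>
  <groupId>org.codehaus.jackson</groupId>
  <artifactId>jackson-mapper-asl</artifactId>
  <scope>test</scope>
</dependency>
```

With test scope, Maven keeps the jar out of transitive compile and runtime dependencies of downstream modules.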
/**
 * A test suite that tests parquet modular encryption usage in Spark.
 */
class ParquetEncryptionTest extends QueryTest with SharedSparkSession {
Let's rename this: ParquetEncryptionTest -> ParquetEncryptionSuite.
import scala.sys.process._

/**
 * A test suite that tests parquet modular encryption usage in Spark.
Please remove "in Spark".
import testImplicits._

test("Write and read an encrypted parquet") {
If you don't mind, please add a JIRA prefix like the following.
- test("Write and read an encrypted parquet") {
+ test("SPARK-34990: Write and read an encrypted parquet") {
Thank you for pinging me, @dongjoon-hyun. Can we restore the PR description template? Spark pull requests follow the format and it will be in the commit log. Thanks.
import java.io.File
import java.nio.charset.StandardCharsets
import java.util.{Base64, HashMap, Map}

import org.apache.hadoop.conf.Configuration
import org.apache.parquet.crypto.{KeyAccessDeniedException, ParquetCryptoRuntimeException}
import org.apache.parquet.crypto.keytools.{KeyToolkit, KmsClient}
import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.test.SharedSparkSession

import scala.sys.process._
Please separate imports with newline by the order of 1. java import, 2. scala import, 3. third-party import, 4. Spark import. E.g.,
import java.io.File
import java.nio.charset.StandardCharsets
import java.util.{Base64, HashMap, Map}

import scala.sys.process._

import org.apache.hadoop.conf.Configuration
import org.apache.parquet.crypto.{KeyAccessDeniedException, ParquetCryptoRuntimeException}
import org.apache.parquet.crypto.keytools.{KeyToolkit, KmsClient}

import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.test.SharedSparkSession
val parquetDF = spark.read.parquet(parquetDir)
assert(parquetDF.inputFiles.nonEmpty)
val ds = parquetDF.select("a", "b", "c")
ds.show()
Instead of show, should we verify the content?
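A content check could be sketched as follows, assuming the suite extends QueryTest (which provides checkAnswer); `inputDF` is a hypothetical name for the DataFrame the test originally wrote:

```scala
// Sketch: instead of ds.show(), compare the decrypted rows against the
// rows that were written. `inputDF` stands for the DataFrame the test
// wrote to `parquetDir` (name is illustrative).
val parquetDF = spark.read.parquet(parquetDir)
assert(parquetDF.inputFiles.nonEmpty)
checkAnswer(parquetDF.select("a", "b", "c"), inputDF.select("a", "b", "c"))
```

checkAnswer fails the test with a row-level diff when the contents differ, which gives a stronger signal than printing the rows.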
@throws[KeyAccessDeniedException]
@throws[UnsupportedOperationException]
override def wrapKey(keyBytes: Array[Byte], masterKeyIdentifier: String): String = {
  println(s"Wrap Key ${masterKeyIdentifier}")
Is println necessary?
@throws[KeyAccessDeniedException]
@throws[UnsupportedOperationException]
override def unwrapKey(wrappedKey: String, masterKeyIdentifier: String): Array[Byte] = {
  println(s"Unwrap Key ${masterKeyIdentifier}")
ditto.
@dongjoon-hyun @dbtsai I agree replacing codehaus jackson with the fasterxml one is the right thing to do in the next parquet version. Regarding the current situation: parquet 1.12.0 has been released with the codehaus runtime dependency. This jackson was leveraged by PME a few years back, and tested with Spark 2.4 and 3.0. All these Spark distros, and the latest 3.1.1, have the codehaus jar. The current master drops this dependency in core, but maybe it can be kept for one more release, so PME is enabled in Spark 3.2.0? We will work on replacing the jackson in parquet, making sure it's properly tested (incl. backwards compatibility with 1.12.0) etc., which can take some time; the next parquet version would go into the next Spark version after 3.2.0?
@ggershinsky . You should split (1)
Please don't forget this comment, @andersonm-ibm .
Another option is moving this test case to the hive module.
@dongjoon-hyun Got you, sounds good! Having the codehaus jackson jar in the Spark 3.2.0 distribution will enable users to work with Parquet encryption out of the box in this Spark version.
Thank you, @dongjoon-hyun , for all your comments and suggestions. I'll move the test case to the hive module and update the PR.
ok to test
Thank you for the update, @andersonm-ibm .
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.crypto.keytools.{KeyToolkit, KmsClient}
import org.apache.parquet.crypto.{KeyAccessDeniedException, ParquetCryptoRuntimeException}
Could you run dev/lint-scala and fix the Scala style error?
[error] /Users/dongjoon/PRS/SPARK-PR-32146/sql/hive/src/test/scala/org/apache/spark/sql/hive/ParquetEncryptionSuite.scala:28:0:
org.apache.parquet.crypto. is in wrong order relative to org.apache.parquet.crypto.keytools..
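Presumably the fix is just to swap the two parquet imports so the shorter package prefix sorts first, e.g.:

```scala
// Scalastyle import ordering: org.apache.parquet.crypto._ sorts before
// org.apache.parquet.crypto.keytools._
import org.apache.parquet.crypto.{KeyAccessDeniedException, ParquetCryptoRuntimeException}
import org.apache.parquet.crypto.keytools.{KeyToolkit, KmsClient}
```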
Test build #137836 has finished for PR 32146 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #137838 has finished for PR 32146 at commit
Retest this please
Test build #137856 has started for PR 32146 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #137841 has finished for PR 32146 at commit
@andersonm-ibm We are very close. I just found two minor things.
Please address @attilapiros 's comment. That doesn't make a difference for the test result.
… verify all parquet parts in parquet folder.
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #137880 has finished for PR 32146 at commit
jenkins retest this please
Kubernetes integration test starting
Kubernetes integration test status failure
Retest this please
Kubernetes integration test unable to build dist. exiting with code: 1
Test build #137887 has finished for PR 32146 at commit
+1, LGTM. Thank you, @andersonm-ibm , @ggershinsky , @dbtsai , @viirya , @attilapiros .
Since all comments are addressed, I'll merge to master for Apache Spark 3.2.
@andersonm-ibm . I added you to the Apache Spark contributor group and SPARK-34990 is assigned to you.
Test build #137894 has finished for PR 32146 at commit
Thank you, @dongjoon-hyun , and thank you for your patient help!
What changes were proposed in this pull request?
A simple test that writes and reads an encrypted parquet and verifies that it's encrypted by checking its magic string (in encrypted footer mode).
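Such a magic-string check could be sketched as follows (variable names are illustrative; in the Parquet format, plaintext-footer files start with the ASCII magic PAR1, while encrypted-footer files start with PARE):

```scala
import java.io.RandomAccessFile
import java.nio.charset.StandardCharsets

// Read the first four bytes of a parquet part file; for a file written in
// encrypted footer mode they should be "PARE" rather than the usual "PAR1".
// `partFilePath` is a hypothetical path to one part-*.parquet file.
def readMagic(partFilePath: String): String = {
  val in = new RandomAccessFile(partFilePath, "r")
  try {
    val magic = new Array[Byte](4)
    in.readFully(magic)
    new String(magic, StandardCharsets.US_ASCII)
  } finally {
    in.close()
  }
}
```

In the test, this would be used as `assert(readMagic(partFilePath) == "PARE")` for each part file in the output directory.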
Why are the changes needed?
To provide a test coverage for Parquet encryption.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
SBT / Hadoop 3.2 / Java 11, by adding [test-java11] to the PR title. (The Jenkins Java 11 build is broken due to a missing JDK 11 installation.)