[SPARK-35658][DOCS] Document Parquet encryption feature in Spark SQL #32895
Conversation
    <div class="codetabs">
    <div data-lang="scala" markdown="1">
    {% highlight scala %}
To be complete, could you add more examples for the other languages like Python, SQL, R?
Spark/PME* functionality has been mass tested with Scala, so I'm mostly comfortable with this example. Also, ORC column encryption has one example too, I've followed the same approach. But if needed, I can add a Python sample, this is known to work with PME. In the future, as the community gathers experience with PME in e.g. Java or R applications, this doc could be augmented with additional examples.
(* PME - "parquet modular encryption")
I've removed the tabs, and stressed that this is an illustration-only sample that can be run without a KMS server - just spark-shell is sufficient. This should make a Scala example a sort of an intuitive choice in this place.
ORC column encryption has SQL example because it can work in all environments (Spark Shell/PySpark/SparkR/SQL Shell/STS). I'm curious if Apache Parquet doesn't work with SQL environment, for example Spark Thrift Server?
Yep, I also wonder about this; but the bulk of my personal experience is with direct activation of this function - so I'm comfortable with adding these samples and parameters to the Spark documentation (since they are well tested etc). If other community members get to test and verify PME via SQL interface, I think it would be a good future addition to this section.
docs/sql-data-sources-parquet.md
Outdated
    Since Spark 3.2, columnar encryption is supported for Parquet tables with Apache Parquet 1.12.

    Parquet uses the envelope encryption practice, where the file parts are encrypted with “data encryption keys” (DEKs), and the DEKs are encrypted with “master encryption keys” (MEKs). The DEKs are randomly generated by Parquet for each encrypted file/column. The MEKs are generated, stored and managed in a Key Management Service (KMS) of user’s choice. Parquet-test [package](https://repo1.maven.org/maven2/org/apache/parquet/parquet-hadoop/1.12.0/parquet-hadoop-1.12.0-tests.jar) has a mock KMS implementation that allows to run column encryption and decryption without a KMS server:
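As an aside for readers new to the pattern, envelope encryption itself is easy to sketch with plain JDK crypto. This is a conceptual illustration only - it is not how parquet-mr implements the feature internally, and the key sizes and cipher choices here are just for demonstration:

```scala
import java.security.SecureRandom
import javax.crypto.{Cipher, KeyGenerator}
import javax.crypto.spec.GCMParameterSpec

// Envelope encryption in a nutshell: data is encrypted with a random DEK,
// and the DEK itself is encrypted ("wrapped") with the MEK.
val keyGen = KeyGenerator.getInstance("AES")
keyGen.init(128)
val dek = keyGen.generateKey() // random per-file/per-column data key
val mek = keyGen.generateKey() // master key; in reality held by a KMS, never by the writer

// Wrap the DEK with the MEK (AES-GCM with a fresh random nonce)
val iv = new Array[Byte](12)
new SecureRandom().nextBytes(iv)
val cipher = Cipher.getInstance("AES/GCM/NoPadding")
cipher.init(Cipher.ENCRYPT_MODE, mek, new GCMParameterSpec(128, iv))
val wrappedDek = cipher.doFinal(dek.getEncoded) // stored alongside the encrypted data
```

Only the wrapped DEK is persisted with the file; unwrapping it later requires access to the MEK via the KMS.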
The audience of this document includes general end users. We should give Hadoop KMS example instead of this test mock class. Let's remove this and use the following instead.
The following example is using Hadoop KMS as a key provider with the given location. Please visit Apache Hadoop KMS for the detail.
There are lots of open source and public cloud KMS systems. There is even a larger number of proprietary KMSs, deployed in platforms of different companies. Also, the KMS APIs and protocols are known to change rather frequently (surely faster than the Parquet release cycle).
Given these reasons, the Parquet community has decided not to release and support a client for any particular KMS. Instead, we've defined a plug-in KmsClient interface that can be used for implementing a client for any public or private KMS system.
> the audience of this document includes not only a Spark developer but also a general end user.

That's a good point. I don't think that general end users, without basic data security expertise, should use Parquet encryption directly. I'll add a text that explains that this document (and the API) is for developers only, who build a production-grade data security platform in their organization, and understand the key / user identity / access management.
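For context, the plug-in point mentioned in this thread is the parquet-mr `KmsClient` interface. A minimal custom client could look roughly like the sketch below - the package and method names are assumed from the parquet-mr 1.12 key tools, and the bodies are placeholders to be filled in against a real KMS:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.crypto.keytools.KmsClient

// Illustrative skeleton only: a production client must handle auth,
// retries and error mapping for the organization's actual KMS.
class MyKmsClient extends KmsClient {

  // Called once; open a session to the KMS server here
  override def initialize(configuration: Configuration, kmsInstanceID: String,
                          kmsInstanceURL: String, accessToken: String): Unit = {
    ??? // connect to kmsInstanceURL, authenticate with accessToken
  }

  // Ask the KMS to encrypt keyBytes with the given master key;
  // return the wrapped key (typically base64-encoded)
  override def wrapKey(keyBytes: Array[Byte], masterKeyIdentifier: String): String = {
    ??? // call the KMS "encrypt" endpoint
  }

  // Ask the KMS to decrypt wrappedKey with the given master key
  override def unwrapKey(wrappedKey: String, masterKeyIdentifier: String): Array[Byte] = {
    ??? // call the KMS "decrypt" endpoint
  }
}
```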
    {% highlight scala %}

    sc.hadoopConfiguration.set("parquet.encryption.kms.client.class" ,
      "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
ditto. Please use the Hadoop KMS example.
As I mentioned above, we don't support any particular KMS. The advantage of the mock InMemoryKMS class is that it provides easy-to-understand demo code, and can be tried without any KMS server. But I'd agree we must stress this is not a real KMS and should never be used in production.
    // Explicit master keys (base64 encoded) - required only for mock InMemoryKMS
    sc.hadoopConfiguration.set("parquet.encryption.key.list" ,
      "keyA:AAECAwQFBgcICQoLDA0ODw== , keyB:AAECAAECAAECAAECAAECAA==")
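For context, the reviewed example continues with a write/read round trip roughly like the following sketch, using the parquet-mr encryption write options discussed in this thread (the dataframe, column and key names are illustrative):

```scala
// Write an encrypted dataframe: column "square" is protected with master key "keyA",
// and the Parquet file footers with master key "keyB"
val squaresDF = spark.range(1, 100).map(i => (i, i * i)).toDF("value", "square")
squaresDF.write.
  option("parquet.encryption.column.keys", "keyA:square").
  option("parquet.encryption.footer.key", "keyB").
  parquet("/tmp/table.parquet.encrypted")

// Read it back - no encryption options needed on the read path,
// since the key metadata is stored in the Parquet files themselves
val df2 = spark.read.parquet("/tmp/table.parquet.encrypted")
```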
Ditto. We should remove this.
as above
Thank you for making a PR, @ggershinsky . In general, looks nice. I added a few comments because the audience of this document includes not only a Spark developer but also a general end user.
cc @dbtsai , @viirya , @sunchao
BTW, do we have a corresponding Iceberg document for Parquet encryption, @ggershinsky ? (cc @RussellSpitzer )
docs/sql-data-sources-parquet.md
Outdated
    ## Columnar Encryption

    Since Spark 3.2, columnar encryption is supported for Parquet tables with Apache Parquet 1.12.
Apache Parquet 1.12+?
docs/sql-data-sources-parquet.md
Outdated
    Since Spark 3.2, columnar encryption is supported for Parquet tables with Apache Parquet 1.12.

    Parquet uses the envelope encryption practice, where the file parts are encrypted with “data encryption keys” (DEKs), and the DEKs are encrypted with “master encryption keys” (MEKs). The DEKs are randomly generated by Parquet for each encrypted file/column. The MEKs are generated, stored and managed in a Key Management Service (KMS) of user’s choice. Parquet-test [package](https://repo1.maven.org/maven2/org/apache/parquet/parquet-hadoop/1.12.0/parquet-hadoop-1.12.0-tests.jar) has a mock KMS implementation that allows to run column encryption and decryption without a KMS server:
Are we used to use “ for quoting terms, e.g. “data encryption keys”? I don't find it here.
Oh, I see. Sure, will change; what would you suggest for term quotes?
The normal ""?
Got you :) I thought the problem was this looks like a string.
docs/sql-data-sources-parquet.md
Outdated
    // Column "square" will be encrypted with master key "keyA".
    // Parquet file footers will be encrypted with master key "keyB"
Hmm, technically, are we encrypting the column using DEKs encrypted with the master key?
A good point! I'll change "encrypted" to something like "protected".
    // Read encrypted dataframe files
    val df2 = spark.read.parquet("/path/to/table.parquet.encrypted")
Don't we need any options for reading encrypted file?
Nope. The options are passed on write only, and stored in the Parquet metadata, so readers don't need them.
sunchao
left a comment
Left a few minor comments. Thanks for the work @ggershinsky !
docs/sql-data-sources-parquet.md
Outdated
    <td>3.1.0</td>
    </tr>
    <tr>
    <td>parquet.encryption.column.keys</td>
should we add non-Spark configs here? seems normally we don't do that.
Yep, I've also noticed this. Not sure - I guess this is up to the community. Technically, it looks like an overhead to duplicate these parameters.
docs/sql-data-sources-parquet.md
Outdated
    Since Spark 3.2, columnar encryption is supported for Parquet tables with Apache Parquet 1.12.

    Parquet uses the envelope encryption practice, where the file parts are encrypted with “data encryption keys” (DEKs), and the DEKs are encrypted with “master encryption keys” (MEKs). The DEKs are randomly generated by Parquet for each encrypted file/column. The MEKs are generated, stored and managed in a Key Management Service (KMS) of user’s choice. Parquet-test [package](https://repo1.maven.org/maven2/org/apache/parquet/parquet-hadoop/1.12.0/parquet-hadoop-1.12.0-tests.jar) has a mock KMS implementation that allows to run column encryption and decryption without a KMS server:
nit: maybe change https://repo1.maven.org/maven2/org/apache/parquet/parquet-hadoop/1.12.0/parquet-hadoop-1.12.0-tests.jar to point to a web page instead of a download link.
Iceberg is not integrated with Parquet encryption yet.

Thanks all for the review and the comments! I've pushed a commit with the updates. I'd appreciate it if you could have another look.

Sure, will do. Thanks for updating, @ggershinsky .
docs/sql-data-sources-parquet.md
Outdated
    <td>Length of data encryption keys (DEKs), randomly generated by Parquet key management tools. Can be 128, 192 or 256 bits.
    </td>
    <td>3.2.0</td>
    </tr>
Could you remove these conf additions? As @sunchao and you discussed, we don't duplicate like this. Instead, you can mention them in your examples and give a pointer to Apache Parquet page.
That's a good idea, we have a nice page with these parameters. Will do.
docs/sql-data-sources-parquet.md
Outdated
    An [example](https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/parquet-hadoop/src/test/java/org/apache/parquet/crypto/keytools/samples/VaultClient.java) of such class for an open source [KMS](https://www.vaultproject.io/api/secret/transit) can be found in parquet-mr repository. The production KMS client should be designed in cooperation with organization's security administrators, and built by developers with an experience in access control management. Once such class is created, it can be passed to applications via the `parquet.encryption.kms.client.class` Hadoop parameter and leveraged by general Spark users as shown in the dataframe write/read sample above.

    Note: By default, Parquet implements "double envelope encryption" mode, that minimizes the interaction of Spark executors with a KMS server. In this mode, the DEKs are encrypted with "key encryption keys" (KEKs, randomly generated by Parquet). The KEKs are encrypted with MEKs; the result and the KEK itself are cached in Spark executor memory. Users interested in a regular envelope encryption, can switch it on via configuration.
Please mention the exact configuration name here: can switch it on via configuration. -> can switch it on via xxx.
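For reference, the switch in question is - assuming the parquet-mr 1.12 key-tools property naming - the `parquet.encryption.double.wrapping` Hadoop property, which defaults to true:

```scala
// Switch from the default double envelope encryption to regular envelope
// encryption: DEKs are then wrapped directly with MEKs in the KMS.
// Property name assumed from the parquet-mr key tools; verify against your version.
sc.hadoopConfiguration.set("parquet.encryption.double.wrapping", "false")
```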
From my side, I added three more comments.

@dongjoon-hyun I've pushed a new commit that addresses these comments.

@dongjoon-hyun A gentle reminder about this pull request.
docs/sql-data-sources-parquet.md
Outdated
    Since Spark 3.2, columnar encryption is supported for Parquet tables with Apache Parquet 1.12+.

    Parquet uses the envelope encryption practice, where file parts are encrypted with "data encryption keys" (DEKs), and the DEKs are encrypted with "master encryption keys" (MEKs). The DEKs are randomly generated by Parquet for each encrypted file/column. The MEKs are generated, stored and managed in a Key Management Service (KMS) of user’s choice. Parquet maven [repository]( https://repo1.maven.org/maven2/org/apache/parquet/parquet-hadoop/1.12.0/) has a jar with a mock KMS implementation that allows to run column encryption and decryption using a spark-shell only, without deploying a KMS server (download the `parquet-hadoop-tests.jar` file and place it in the Spark `jars` folder):
Parquet maven -> The Parquet Maven
(Fix the hyperlink - space in parens)
But do we need to talk about JARs here? the idea is just that you need to include this as a dependency one way or the other
> Parquet maven -> The Parquet Maven
> (Fix the hyperlink - space in parens)

Will do.

> But do we need to talk about JARs here? the idea is just that you need to include this as a dependency one way or the other

This is a special KMS client class, used only for sample illustration and initial demonstrations of Parquet encryption feature (not in production deployment), so the Spark distribution doesn't include it. Still, this class makes it easier to get started with this feature, and having a direct link to its jar would save some overhead for the developers.
I know it's not in Spark, but, why repeat 'how to include a JAR' here?
Having a link would make it easier to find the file. But we can indeed skip the "find the jar" instructions, and just mention a need to include the parquet-hadoop-tests v1.12.0 dependency.
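One way to include that dependency for a local trial is simply putting the tests jar on the spark-shell classpath - a sketch only, with an illustrative local path (the jar file name follows the Maven coordinates mentioned in this thread):

```shell
# Add the parquet-hadoop tests jar (contains the mock InMemoryKMS) to spark-shell;
# for demos only - never ship this jar to a production cluster
spark-shell --jars /path/to/parquet-hadoop-1.12.0-tests.jar
```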
docs/sql-data-sources-parquet.md
Outdated
    </div>

    An [example](https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/parquet-hadoop/src/test/java/org/apache/parquet/crypto/keytools/samples/VaultClient.java) of such class for an open source [KMS](https://www.vaultproject.io/api/secret/transit) can be found in the parquet-mr repository. The production KMS client should be designed in cooperation with organization's security administrators, and built by developers with an experience in access control management. Once such class is created, it can be passed to applications via the `parquet.encryption.kms.client.class` parameter and leveraged by general Spark users as shown in the encrypted dataframe write/read sample above.
For the hyperlinks, how about linking to 'latest' docs in Parquet?
This doesn't quite belong in the Parquet documentation. Hashicorp Vault is a separate technology, used here as an example of an open source KMS server. The Parquet repo keeps a sample client class for this server, but this class is not a part of the Parquet API. Moreover, Parquet unfortunately doesn't maintain documentation of its APIs and usage; the info in https://parquet.apache.org/documentation/latest/ is very high level, and grossly outdated.
I just mean using a URL with /latest/ in it. What do you mean this doesn't belong in the docs, if you add it here?
Looks like I've misunderstood. I thought you were referring to these hyperlinks:

- https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/parquet-hadoop/src/test/java/org/apache/parquet/crypto/keytools/samples/VaultClient.java
  For the latest version, this can be replaced with a link to the master branch:
  https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/crypto/keytools/samples/VaultClient.java
- https://www.vaultproject.io/api/secret/transit
  I think this is their latest version
As for adding this information in https://parquet.apache.org/documentation/latest/ - there are a number of problems with it: this page doesn't keep this kind of technical details, on encryption or other Parquet features; this page is a few years old and doesn't mention the recent Parquet features; I don't have edit rights to it - updating this page (if possible at all) will be a long project in the community, without a chance to be on time for Spark 3.2.0 release.
This is about hyperlinks to a sample KMS client/server. Another option would be simply to remove the URLs, and leave only the names (Hashicorp Vault; Parquet VaultClient); this should be enough for developers to find them.
@srowen will one of these two options be ok for you? (1. removing the URLs, keeping the names of the server/client; or 2. keeping the URLs to the latest versions of the server/client)
I see, there isn't any published API doc. OK, linking to source in master is probably the closest that's possible.
SGTM
    </div>

    #### KMS Client
This section looks more appropriate for the Apache Parquet website, and I believe that a link to the Apache Parquet website will be enough. Isn't it a little weird to have Apache Parquet's document about a class which should not be used in a real deployment? Please remove this section.
As I mentioned in the comment above, Parquet doesn't have a proper documentation website, unfortunately.
To expand more on this point - removing this section would mean removal of the previous section too, since it also uses the "Hello World" example. Therefore, it removes the full content of this pull request, comprised of the two sections.
I agree it would be reasonable to move some of the content to a Parquet documentation site, but such a site doesn't exist; AFAIK Parquet doesn't have API documentation (it keeps only a page on its Hadoop parameters).
I realize this section/PR seems to be somewhat unusual compared to other Spark/Parquet doc sections, but there is a simple reason. Encryption is somewhat unusual compared to other Parquet features. To be really useful, it requires more than just Hadoop parameters. It has an API, or more specifically, an interface for custom KMS Client classes tailored for user-specific KMS/IAM systems, deployed in their organizations. Providing any such classes in Parquet or Spark packages will be totally pointless, as detailed in other comments.
Therefore, the approach taken here, is to provide a simple to understand "Hello World" KmsClient class, which is also easy to experiment with (since it runs alone and doesn't require a real KMS Server). Followed by an explanation about how to take the next step and develop a real-life KmsClient. This should provide sufficient documentation for new adopters of the Spark/ParquetEncryption capability, which is already used in numerous deployments.
I think the question is, is this specific to Spark, or generally applicable to Parquet? if the latter, it is better in the Parquet docs. I understand it's not there, but a similar PR could go into Parquet docs. It has API docs, of course.
Mm, a good point. Thinking of this, I'm not aware of any other analytic framework where Parquet encryption is (or can be) activated this way. While this approach is supposed to be general, it was designed and tested within Spark. In other frameworks, updating parquet to 1.12.0 is not sufficient, they need to call low-level Parquet APIs to leverage the encryption feature.
Still, I agree it would be good to document Parquet APIs (general; not only encryption). However, there is a high chance such documentation simply doesn't exist.. At least I couldn't find anything, besides a page with the parquet-hadoop parameters..
Given these two points, I believe it is reasonable to add this section in the Spark documentation.
I believe you are the best person to do that in this PR.

I should have mentioned that besides testing and documenting, this will require substantial coding work on developing a KMS client class (for a selected KMS) and integrating it in Spark SQL. As I mentioned, Parquet doesn't contain client classes for any particular KMS, because there are too many of them, and because it's expensive to maintain these classes (basically impossible, given the Parquet release cycle). So having a SQL API sample has a high price, and a limited benefit - Spark runs with Parquet encryption in many deployments today, with the Scala API and various KMS systems.
srowen
left a comment
OK I think that's reasonable given the Parquet docs state and the nature of this client.
Jenkins test this please

Test build #141263 has finished for PR 32895 at commit

Kubernetes integration test starting

Kubernetes integration test status success

Jenkins test this please

Test build #141338 has finished for PR 32895 at commit

Kubernetes integration test starting

Kubernetes integration test status success

Merged to master
### What changes were proposed in this pull request?
Spark 3.2.0 will use the parquet-mr 1.12.0 version (or higher), which contains the column encryption feature that can be called from Spark SQL. The aim of this PR is to document the use of Parquet encryption in Spark.

### Why are the changes needed?
- To provide information on how to use Parquet column encryption

### Does this PR introduce _any_ user-facing change?
Yes, documents a new feature.

### How was this patch tested?
bundle exec jekyll build

Closes #32895 from ggershinsky/parquet-encryption-doc.

Authored-by: Gidon Gershinsky <ggershinsky@apple.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
Merged to 3.2 as well