ARROW-9318: [C++] Parquet encryption key management #8023
Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? Then could you also rename pull request title in the following format?
See also: |
I made some style comments but didn't flag every place that they were an issue. Please self review to fix (it also looks like you might need to run clang-format as the lint is failing).
I would expect additional tests around individual files and not just end-to-end tests (any place with a non-trivial implementation should probably have at least one or two tests). Exercising error conditions in JSON parsing would also be useful, I think.
Another issue is thread safety. It would be good to clarify which classes are expected to be thread-safe and which are not. For instance, I would at least expect the client caches to be thread-safe, but it doesn't look like they were implemented to be.
Also, this PR is quite large. Is there any way that it could be broken up into smaller units?
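To make the thread-safety point concrete, here is a minimal sketch of a client cache guarded by a mutex. It is only an illustration: `DemoKmsClient`, `ThreadSafeClientCache`, and `GetOrCreate` are hypothetical names and are not the classes in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a KMS client; NOT the class added in this PR.
struct DemoKmsClient {
  explicit DemoKmsClient(std::string instance_id)
      : kms_instance_id(std::move(instance_id)) {}
  std::string kms_instance_id;
};

// Hypothetical cache keyed by KMS instance id. One mutex protects the map,
// so concurrent callers asking for the same id share a single client.
class ThreadSafeClientCache {
 public:
  using Factory =
      std::function<std::shared_ptr<DemoKmsClient>(const std::string&)>;

  std::shared_ptr<DemoKmsClient> GetOrCreate(const std::string& kms_instance_id,
                                             const Factory& factory) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = clients_.find(kms_instance_id);
    if (it != clients_.end()) {
      return it->second;  // reuse the cached client
    }
    // The factory runs under the lock: simple, but it serializes creation.
    std::shared_ptr<DemoKmsClient> client = factory(kms_instance_id);
    clients_.emplace(kms_instance_id, client);
    return client;
  }

  std::size_t size() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return clients_.size();
  }

 private:
  mutable std::mutex mutex_;
  std::unordered_map<std::string, std::shared_ptr<DemoKmsClient>> clients_;
};

// Small usage demo: the second lookup must not create a second client.
inline void DemoCacheReuse() {
  ThreadSafeClientCache cache;
  int created = 0;
  ThreadSafeClientCache::Factory factory = [&created](const std::string& id) {
    ++created;
    return std::make_shared<DemoKmsClient>(id);
  };
  auto first = cache.GetOrCreate("kms-1", factory);
  auto second = cache.GetOrCreate("kms-1", factory);
  assert(first == second);
  assert(created == 1);
}
```

Holding the lock while the factory runs keeps the sketch short; a real cache might prefer finer-grained locking if client construction is slow.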
a96c7e4 to 7206be0
Just a drive-by comment for now, but given the number of files added I think we should create a |
794ece0 to c8c5aa2
I addressed all the comments in this pull request. Can you please take a look? @emkornfield @pitrou @bkietz cc @ggershinsky |
Ok, I've pushed some updates to try and conform with coding conventions a bit more. Some high-level comments:
I also don't have a personal interest in this currently and I recognize that the PR implements a useful functionality. |
@pitrou My personal preference is the latter: merging this PR and letting the users of this code be mindful of potential issues. This is safer than proceeding with the current low-level interface work, which has many more security and other issues. When this PR is merged, it will unblock the users who'd create the high-level interface in Python and address the API point (I'll add a link to this thread in the Python design doc; the Python API is to be reviewed; the C++ API could be synced accordingly). Regarding the second point, we'll be testing the interop with the Java counterpart to make sure the spec is implemented properly. Regarding the third point, I think the code is safe, as per our discussion in that JIRA. |
As for testing, I'd still welcome unit tests with actual pieces of JSON as per the spec, but I won't insist on it here :-) Last question: do we want to mark those APIs experimental so that we feel free to change them in the future? |
Thanks! Yep, I think it's a good idea to mark them as experimental, there is a chance they'll be updated in the course of the Python API work. |
I'll try to look tonight, but this makes it sound like there are general concerns around quality here, and that doesn't seem like something we want to merge (especially if it is security-related). If I get distracted and don't give feedback by Monday, it is OK not to block on me. |
Thanks. To clarify: the security concerns I've mentioned above relate to low-level encryption, not to this pull request. |
@emkornfield Feel free to take a last look. |
It looks like there are JSON changes in here? Does this need a rebase? |
@emkornfield The JSON changes are expected; the encryption layer needs to parse and generate some JSON. |
Ah right, I forgot, will start looking again. |
Sorry, I didn't get a chance to look. @pitrou, if you are comfortable with the changes, please go ahead and merge. |
I'll merge then. Thank you @thamht4190 and @ggershinsky for contributing this! |
This PR is a C++ implementation of the Parquet key tool, based on [the Java implementation](apache/parquet-mr#615) and [the design doc](https://docs.google.com/document/d/1boH6HPkG0ZhgxcaRkGk3QpZ8X_J91uXZwVGwYN45St4/edit?usp=sharing).

The major parts of this PR are:
* Higher-level encryption/decryption configuration, including key management configuration.
* KMS connection configuration.
* Abstract class KmsClient, KmsClientFactory.
* PropertiesDrivenCryptoFactory class to convert the above configurations to the lower-level FileEncryptionProperties and FileDecryptionProperties.
* Unit tests using InMemoryKms (a sample KmsClient).

Compared to the Java version, this C++ pull request does not include external storage of key material using the Hadoop file system (only storing key material internally in the Parquet file is supported for now). The reason is a lack of understanding of the Hadoop file system; it can be implemented later in another pull request. Thanks!

Closes apache#8023 from thamht4190/arrow-9318-encryption-key-management

Lead-authored-by: Ha Thi Tham <thamht01188@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
…files.

Exposes in PyArrow the high-level C++ API for Parquet encryption that was added in #8023. Design document: https://docs.google.com/document/d/1i1M5f5azLEmASj9XQZ_aQLl5Fr5F0CvnyPPVu1xaD9U

A test is added for writing and reading encrypted Parquet files using a simple in-memory KMS for testing, which is not to be used as an example of a KMS client. In addition, there is an example KMS client using Vault KMS.

This PR handles only file-level encryption and decryption. Dataset is handled in separate PRs. The investigation of the multithreading model of PME is split out into a separate issue, independent of this one.

Closes #10450 from andersonm-ibm/encryption

Lead-authored-by: Maya Anderson <mayaa@il.ibm.com>
Co-authored-by: andersonm-ibm <63074550+andersonm-ibm@users.noreply.github.com>
Co-authored-by: roee88 <roee88@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
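This PR is a C++ implementation of the Parquet key tool, based on the Java implementation and the design doc.
The major parts of this PR are:
* Higher-level encryption/decryption configuration, including key management configuration.
* KMS connection configuration.
* Abstract class KmsClient, KmsClientFactory.
* PropertiesDrivenCryptoFactory class to convert the above configurations to the lower-level FileEncryptionProperties and FileDecryptionProperties.
* Unit tests using InMemoryKms (a sample KmsClient).
Compared to the Java version, this C++ pull request does not include external storage of key material using the Hadoop file system (only storing key material internally in the Parquet file is supported for now). The reason is a lack of understanding of the Hadoop file system; it can be implemented later in another pull request.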
Thanks!
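For readers unfamiliar with the key-wrapping idea behind KmsClient and InMemoryKms, the following is a rough, self-contained sketch of how an abstract client and an in-memory test double could fit together. All names here (`SketchKmsClient`, `ToyInMemoryKms`, `WrapKey`, `UnwrapKey`) and their signatures are assumptions made for illustration, not the interface merged in this PR, and the XOR "wrapping" is a placeholder, not encryption.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical interface: a KMS client wraps (encrypts) a data key with a
// master key held by the KMS, and unwraps it again on the read path.
class SketchKmsClient {
 public:
  virtual ~SketchKmsClient() = default;
  virtual std::string WrapKey(const std::string& key_bytes,
                              const std::string& master_key_id) = 0;
  virtual std::string UnwrapKey(const std::string& wrapped_key,
                                const std::string& master_key_id) = 0;
};

// Toy in-memory implementation for tests only. XOR is NOT encryption; it
// merely makes the wrap/unwrap round trip observable.
class ToyInMemoryKms : public SketchKmsClient {
 public:
  explicit ToyInMemoryKms(std::map<std::string, std::string> master_keys)
      : master_keys_(std::move(master_keys)) {}

  std::string WrapKey(const std::string& key_bytes,
                      const std::string& master_key_id) override {
    return XorWith(key_bytes, Lookup(master_key_id));
  }

  std::string UnwrapKey(const std::string& wrapped_key,
                        const std::string& master_key_id) override {
    return XorWith(wrapped_key, Lookup(master_key_id));
  }

 private:
  // Returns the master key, rejecting unknown ids and empty keys.
  const std::string& Lookup(const std::string& master_key_id) const {
    auto it = master_keys_.find(master_key_id);
    if (it == master_keys_.end() || it->second.empty()) {
      throw std::invalid_argument("unknown master key: " + master_key_id);
    }
    return it->second;
  }

  static std::string XorWith(const std::string& data, const std::string& key) {
    std::string out(data);
    for (std::size_t i = 0; i < out.size(); ++i) {
      out[i] = static_cast<char>(out[i] ^ key[i % key.size()]);
    }
    return out;
  }

  std::map<std::string, std::string> master_keys_;
};

// Usage demo: a wrapped key must unwrap back to the original bytes.
inline void DemoRoundTrip() {
  std::map<std::string, std::string> keys;
  keys["footer_key"] = "0123456789012345";
  ToyInMemoryKms kms(keys);
  const std::string data_key = "data-encryption!";
  const std::string wrapped = kms.WrapKey(data_key, "footer_key");
  assert(kms.UnwrapKey(wrapped, "footer_key") == data_key);
}
```

Keeping the interface abstract is what lets a test double like this stand in for a real KMS in unit tests, while production code plugs in a client for an actual key service.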