Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-9318: [C++] Parquet encryption key management #8023

Closed

Conversation

thamht4190
Copy link
Contributor

This PR is C++ implementation for parquet key tool, based on the Java implementation and the design doc.

The major parts of this PR are:

  • higher level of encryption/decryption configuration, including key management configuration.
  • KMS connection configuration
  • Abstract class KmsClient, KmsClientFactory
  • PropertiesDrivenCryptoFactory class to convert these above configurations to the lower level FileEncryptionProperties, FileDecryptionProperties
  • unit test using InMemoryKms (an sample of KmsClient).

Comparing to Java version, this C++ pull doesn't contain externally storing key material using hadoop file system (only storing key material internally in parquet file is supported for now). The reason is lack of understanding about Hadoop file system, can be implemented it later in another pull.

Thanks!

@github-actions
Copy link

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@thamht4190 thamht4190 changed the title Arrow 9318 encryption key management ARROW-9318: [C++] Parquet encryption key management Aug 22, 2020
@github-actions
Copy link

cpp/src/parquet/encryption.h Show resolved Hide resolved
cpp/src/parquet/encryption_key_management_test.cc Outdated Show resolved Hide resolved
cpp/src/parquet/file_key_material_store.h Outdated Show resolved Hide resolved
cpp/src/parquet/file_key_material_store.h Outdated Show resolved Hide resolved
cpp/src/parquet/file_key_material_store.h Outdated Show resolved Hide resolved
cpp/src/parquet/key_toolkit.h Outdated Show resolved Hide resolved
cpp/src/parquet/kms_client.h Outdated Show resolved Hide resolved
cpp/src/parquet/kms_client.h Outdated Show resolved Hide resolved
cpp/src/parquet/kms_client_factory.h Outdated Show resolved Hide resolved
cpp/src/parquet/string_util.h Outdated Show resolved Hide resolved
Copy link
Contributor

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made some style comments but didn't flag every place that they were an issue. Please self review to fix (it also looks like you might need to run clang-format as the lint is failing).

I would expect additional tests around individual files and not just end-to-end tests (any place with non trivial implementations should probably have at least one or two tests. excercising of error conditions on json parsing I think would also be useful.

Another issue is thread-safey. It would be good to clarify which classes are expected/not-expected to be thread safe. For instance, I would at lease expect client caches to be thread safe but it doesn't look like they were implemented to be.

Also this PR is quite large is there any way that it could possibly be broken up into smaller units?

cpp/src/parquet/test_encryption_util.h Outdated Show resolved Hide resolved
cpp/src/parquet/CMakeLists.txt Show resolved Hide resolved
cpp/src/parquet/encryption.h Outdated Show resolved Hide resolved
cpp/src/parquet/encryption_internal.h Show resolved Hide resolved
@thamht4190 thamht4190 force-pushed the arrow-9318-encryption-key-management branch from a96c7e4 to 7206be0 Compare September 3, 2020 14:54
@pitrou
Copy link
Member

pitrou commented Sep 3, 2020

Just a drive-by comment for now, but given the number of files added I think we should create a src/parquet/encryption subdirectory.

cpp/src/parquet/file_key_wrapper.h Outdated Show resolved Hide resolved
cpp/src/parquet/file_key_wrapper.h Outdated Show resolved Hide resolved
cpp/src/parquet/file_key_wrapper.h Outdated Show resolved Hide resolved
cpp/src/parquet/key_toolkit.h Outdated Show resolved Hide resolved
@thamht4190
Copy link
Contributor Author

thamht4190 commented Feb 28, 2021

I fixed all the comments in this pull, can you please take a look? @emkornfield @pitrou @bkietz cc @ggershinsky

@pitrou
Copy link
Member

pitrou commented Mar 10, 2021

Ok, I've pushed some updates to try and conform with coding conventions a bit more.

Some high-level comments:

  • In general, I'm not particularly happy with the API (for example, why is the configuration mutated while in use?)
  • There are some roundtrip tests but no tests with reference data; so there's no guarantee that this actually implements the spec faithfully
  • I'm concerned that the crypto code isn't very mindful of safety (see https://issues.apache.org/jira/browse/PARQUET-1997); this PR builds on that and also seems to keep sensitive data around (decryption keys) in various places

I also don't have a personal interest in this currently and I recognize that the PR implements a useful functionality.
@ggershinsky @thamht4190 Do you plan to address some of these concerns in this PR? Otherwise, I may merge this and users of this code will have to be mindful of potential issues.

@ggershinsky
Copy link
Contributor

@pitrou My personal preference is the latter - merging this pr and letting the users of this code to be mindful of potential issues. This is safer than proceeding with the current low-level interface work, which has many more security and other issues. When this pr is merged, it will unblock the users who'd create the high-level interface in Python and address the API point (I'll add a link to this thread in the Python design doc; the Python API is to be reviewed; the C++ API could be synced accordingly).

Regarding the second point - we'll be testing the interop with the Java counterpart, to make sure the spec is implemented properly. Regarding the third point, I think the code is safe, as per our discussion in that jira.

@pitrou
Copy link
Member

pitrou commented Mar 11, 2021

As for testing, I'd still welcome unit tests with actual pieces of JSON as per the spec, but I won't insist on it here :-)

Last question: do we want to mark those APIs experimental so that we feel free to change them in the future?

@ggershinsky
Copy link
Contributor

Thanks! Yep, I think it's a good idea to mark them as experimental, there is a chance they'll be updated in the course of the Python API work.

@emkornfield
Copy link
Contributor

I also don't have a personal interest in this currently and I recognize that the PR implements a useful functionality.
@ggershinsky @thamht4190 Do you plan to address some of these concerns in this PR? Otherwise, I may merge this and users of this code will have to be mindful of potential issues.

I'll try to look tonight but this makes it sound like there are general concerns around quality here, and that doesn't seem like something we want to merge (especially if it is security related). If I get distracted and don't give feedback by monday, it is OK not to block on me.

@ggershinsky
Copy link
Contributor

Thanks. To clarify - the security concerns I've mentioned above, relate to low level encryption, not to this pull request.

@pitrou
Copy link
Member

pitrou commented Mar 15, 2021

@emkornfield Feel free to take a last look.

@emkornfield
Copy link
Contributor

It looks like there are JSON changes in here? does this need a rebase?

@pitrou
Copy link
Member

pitrou commented Mar 16, 2021

@emkornfield The JSON changes are expected, the encryption layer needs to parse and generate some JSON.

@emkornfield
Copy link
Contributor

@emkornfield The JSON changes are expected, the encryption layer needs to parse and generate some JSON.

Ah right, I forgot, will start looking again.

@emkornfield
Copy link
Contributor

Sorry didn't get a chance to look. @pitrou if you are comfortable with changes please go ahead and merge

@pitrou
Copy link
Member

pitrou commented Mar 17, 2021

I'll merge then. Thank you @thamht4190 and @ggershinsky for contributing this!

@pitrou pitrou closed this in 5b14d53 Mar 17, 2021
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
This PR is C++ implementation for parquet key tool, based on [the Java implementation](apache/parquet-mr#615) and [the design doc](https://docs.google.com/document/d/1boH6HPkG0ZhgxcaRkGk3QpZ8X_J91uXZwVGwYN45St4/edit?usp=sharing).

The major parts of this PR are:
* higher level of encryption/decryption configuration, including key management configuration.
* KMS connection configuration
* Abstract class KmsClient, KmsClientFactory
* PropertiesDrivenCryptoFactory class to convert these above configurations to the lower level FileEncryptionProperties, FileDecryptionProperties
* unit test using InMemoryKms (an sample of KmsClient).

Comparing to Java version, this C++ pull doesn't contain externally storing key material using hadoop file system (only storing key material internally in parquet file is supported for now). The reason is lack of understanding about Hadoop file system, can be implemented it later in another pull.

Thanks!

Closes apache#8023 from thamht4190/arrow-9318-encryption-key-management

Lead-authored-by: Ha Thi Tham <thamht01188@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
This PR is C++ implementation for parquet key tool, based on [the Java implementation](apache/parquet-mr#615) and [the design doc](https://docs.google.com/document/d/1boH6HPkG0ZhgxcaRkGk3QpZ8X_J91uXZwVGwYN45St4/edit?usp=sharing).

The major parts of this PR are:
* higher level of encryption/decryption configuration, including key management configuration.
* KMS connection configuration
* Abstract class KmsClient, KmsClientFactory
* PropertiesDrivenCryptoFactory class to convert these above configurations to the lower level FileEncryptionProperties, FileDecryptionProperties
* unit test using InMemoryKms (an sample of KmsClient).

Comparing to Java version, this C++ pull doesn't contain externally storing key material using hadoop file system (only storing key material internally in parquet file is supported for now). The reason is lack of understanding about Hadoop file system, can be implemented it later in another pull.

Thanks!

Closes apache#8023 from thamht4190/arrow-9318-encryption-key-management

Lead-authored-by: Ha Thi Tham <thamht01188@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
pitrou added a commit that referenced this pull request Mar 2, 2022
…files.

Exposes in PyArrow the high-level C++ API for Parquet encryption that was added in #8023.
Design document:  https://docs.google.com/document/d/1i1M5f5azLEmASj9XQZ_aQLl5Fr5F0CvnyPPVu1xaD9U
A test is added for writing and reading encrypted parquet files using a simple in-memory KMS for testing, that is not to be used as an example of a KMS client. In addition, there is an example KMS client using Vault KMS.

This PR handles only the file-level encryption and decryption.
Dataset is handled in separate PRs.
The investigation of the multithreading model of PME is separated into a separate issue, independent of this one.

Closes #10450 from andersonm-ibm/encryption

Lead-authored-by: Maya Anderson <mayaa@il.ibm.com>
Co-authored-by: andersonm-ibm <63074550+andersonm-ibm@users.noreply.github.com>
Co-authored-by: roee88 <roee88@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: C++ Component: Parquet needs-rebase A PR that needs to be rebased by the author
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants