Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-9960: [C++] Enable external material and rotation for encryption keys #10491

Closed
wants to merge 3 commits into from

Conversation

revit13
Copy link
Contributor

@revit13 revit13 commented Jun 9, 2021

This PR adds the support for external key material and key rotation for Parquet Modular encryption.
More information on external key material and key rotation for PME can be found in https://docs.google.com/document/d/1bEu903840yb95k9q2X-BlsYKuXoygE4VnMDl9xz_zhk/edit?usp=sharing.

Thanks to @roee88 for suggesting the usage of arrow::fs::FileSystem in the implementation.

@github-actions
Copy link

github-actions bot commented Jun 9, 2021

@pitrou
Copy link
Member

pitrou commented May 4, 2022

@revit13 @andersonm-ibm @ggershinsky What is the status of this PR? Should it be rebased?

@revit13
Copy link
Contributor Author

revit13 commented May 4, 2022

@pitrou yes, will rebase it. Thanks

revit13 and others added 2 commits May 5, 2022 18:23
…n keys

Signed-off-by: Revital Sur <eres@il.ibm.com>
Signed-off-by: Revital Sur <eres@il.ibm.com>
Co-authored-by: Maya Anderson <mayaa@il.ibm.com>
Signed-off-by: Revital Sur <eres@il.ibm.com>
@ggershinsky
Copy link
Contributor

ggershinsky commented May 6, 2022

thanks @revit13 ! @andersonm-ibm , could you verify that the files are readable by Spark 3.2 (and vice versa)?
Also, do we need PyArrow additions to write in this mode?

@pitrou
Copy link
Member

pitrou commented May 9, 2022

@ggershinsky @revit13 Could any validation files be added to the parquet-testing, if not already done?

@andersonm-ibm
Copy link
Contributor

thanks @revit13 ! @andersonm-ibm , could you verify that the files are readable by Spark 3.2 (and vice versa)? Also, do we need PyArrow additions to write in this mode?

Thank you, @ggershinsky , I'll check the interop with Spark 3.2.
In addition, I've created a Jira to track the PyArrow additions needed to use this mode: ARROW-16505 .

@ggershinsky
Copy link
Contributor

@ggershinsky @revit13 Could any validation files be added to the parquet-testing, if not already done?

cc @andersonm-ibm

@wjones127
Copy link
Member

It's been a while. @revit13 are you still interested in moving this forward? @andersonm-ibm are you still planning on helping test these changes?

If so, I can help review.

@wjones127 wjones127 self-requested a review January 24, 2023 17:40
Copy link
Member

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this is nearly complete, but could use some updates since its last been touched.

  • Rebase the branch on latest master
  • Upgrade to use C++14 features (e.g. std::make_unique)
  • Drop unnecessary filesystem API wrappers
  • Improve some of the internal documentation.
  • Test encrypted files written by Spark 3.2

FilePath(std::string path, std::shared_ptr<::arrow::fs::FileSystem> filesystem)
: file_info_(std::move(path)), filesystem_(filesystem) {}

/// \brief The directory base name (component before the file base name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// \brief The directory base name (component before the file base name
/// \brief The directory base name (component before the file base name)

namespace parquet {

/// \brief The path and filesystem where an actual file is located.
class PARQUET_EXPORT FilePath {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need this class?

std::string key_material_file_name =
full_prefix + parquet_file_path->base_name() +
std::string(FileSystemKeyMaterialStore::kKeyMaterialFileSuffix);
key_material_file_ = ::arrow::internal::make_unique<FilePath>(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably switch this to std::make_unique

::arrow::Result<std::shared_ptr<Buffer>> buff =
input->ReadAt(0, input->GetSize().ValueOrDie());
std::string buff_str = buff.ValueOrDie()->ToString();
::arrow::util::string_view sv(buff_str);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also should upgrade string_view stuff.

Comment on lines +72 to +73
key.SetString(it->first.c_str(), allocator);
value.SetString(it->second.c_str(), allocator);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if you set #define RAPIDJSON_HAS_STDSTRING 1 you can just pass in the strings as is.

std::vector<::arrow::fs::FileInfo> parquet_files_in_folder;
::arrow::fs::FileSelector s;
s.base_dir = folder_path->dir_name();
parquet_files_in_folder = folder_path->filesystem()->GetFileInfo(s).ValueOrDie();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be recursive? Or be allowing the user to supply the vector of fileinfos?

Comment on lines +82 to +83
if (parquet_file.type() != ::arrow::fs::FileType::File)
throw ParquetException("Expecting file type in " + folder_path->dir_name());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not ignore directories?


if (key_material_store != nullptr) {
key_material_store_ = key_material_store;
checked_key_material_internal_storage_ = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if instead of having this bool we just checked key_material_store != nullptr?

std::string key_id_in_file = key_metadata.key_reference();
std::string key_material_string = key_material_store_->GetKeyMaterial(key_id_in_file);
if (key_material_string.empty()) {
throw ParquetException("Null key material for keyIDinFile: " + key_id_in_file);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
throw ParquetException("Null key material for keyIDinFile: " + key_id_in_file);
throw ParquetException("Null key material for key with ID: " + key_id_in_file);

Comment on lines +98 to +100
key_id_in_file =
KeyMaterial::kColumnKeyIdInFilePrefix + std::to_string(key_counter_);
key_counter_++;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we appending this integer? Who else needs to know about this ID? It seems like this detail deserves more comments.

throw ParquetException("External key material store is not supported yet.");
if (parquet_file_path == nullptr) {
std::stringstream ss;
ss << "Output file path cannot be null"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to clarify this a bit:

Suggested change
ss << "Output file path cannot be null"
ss << "Output file path cannot be null when using external key material"

namespace parquet {
namespace encryption {

constexpr const char FileSystemKeyMaterialStore::kKetMaterialFilePrefix[];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo, should be kKeyMaterialFilePrefix

if (parquet_file.type() != ::arrow::fs::FileType::File)
throw ParquetException("Expecting file type in " + folder_path->dir_name());
std::string child = parquet_file.base_name();
if (filter_out_file(child)) continue;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make more sense to only look for files that begin with the _KEY_MATERIAL_FOR_ prefix, rather than assuming that any file that doesn't begin with "." or "_" is a Parquet file with an associated external key material file? Otherwise this will fail for any directories that contain any other type of file.

@adamreeve
Copy link
Contributor

It doesn't look like this pull request is going to be revived, so I've opened a new PR that addresses the review comments at #34181

wjones127 pushed a commit that referenced this pull request Feb 27, 2023
…keys (#34181)

This PR is a replacement for #10491 which appears to be abandoned. I've fixed the issues pointed out in review comments as well as a few more things I noticed. I've also made one significant change to the API; rather than `CryptoFactory::RotateMasterKeys` operating on a whole directory, it rotates keys for a single file, as this is much more flexible and allows users to decide what files to rotate keys for, and there didn't appear to be a good reason for this to work at the whole directory level. This does mean the API diverges slightly from the parquet-mr API though.

### Rationale for this change

Use of external key material allows rotating master encryption keys without having to rewrite Parquet file data. See https://docs.google.com/document/d/1bEu903840yb95k9q2X-BlsYKuXoygE4VnMDl9xz_zhk/edit?usp=sharing for more details.

### What changes are included in this PR?

Adds support for writing and reading external key material for Parquet files from C++, as well as a new `CryptoFactory::RotateMasterKeys` function that allows re-encrypting key encryption keys or data encryption keys with latest versions of master keys.

### Are these changes tested?

Yes, unit tests are included. I've added an additional test that reads a file generated with parquet-mr (used by Spark) from the parquet-testing repository. This requires merging the PR at apache/parquet-testing#36 and updating the parquet-testing submodule before the new test will pass.

### Are there any user-facing changes?

Yes, the existing `internal_key_material` option in `parquet::encryption::EncryptionConfiguration` will now work and use external key material. This requires using two new parameters (`file_path` and `file_system`) in `CryptoFactory::GetFileEncryptionProperties` and `CryptoFactory::GetFileDecryptionProperties`, which are needed so that we know where to write/read the external key material. Note that this means external key material won't work from Python until the new parameters are exposed in Python too.

This changes the `CryptoFactory` ABI but the API is still source compatible. Is this okay or should new overloads be added that take the new parameters?

The `CryptoFactory::RotateMasterKeys` function is also a new public facing API.
* Closes: #25986

Lead-authored-by: Adam Reeve <adreeve@gmail.com>
Co-authored-by: Revital Sur <eres@il.ibm.com>
Co-authored-by: Maya Anderson <mayaa@il.ibm.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
@andersonm-ibm
Copy link
Contributor

It doesn't look like this pull request is going to be revived, so I've opened a new PR that addresses the review comments at #34181

Thank you, @adamreeve and @wjones127 , for taking this up!

@wjones127 wjones127 closed this Feb 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants