New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-9960: [C++] Enable external material and rotation for encryption keys #10491
Conversation
cee4bec
to
0b7209b
Compare
35edaea
to
9964bb5
Compare
5fdb6bb
to
eff6d6b
Compare
@revit13 @andersonm-ibm @ggershinsky What is the status of this PR? Should it be rebased? |
@pitrou yes, will rebase it. Thanks |
…n keys Signed-off-by: Revital Sur <eres@il.ibm.com>
Signed-off-by: Revital Sur <eres@il.ibm.com> Co-authored-by: Maya Anderson <mayaa@il.ibm.com>
Signed-off-by: Revital Sur <eres@il.ibm.com>
thanks @revit13 ! @andersonm-ibm , could you verify that the files are readable by Spark 3.2 (and vice versa)? |
@ggershinsky @revit13 Could any validation files be added to the parquet-testing, if not already done? |
Thank you, @ggershinsky , I'll check the interop with Spark 3.2. |
|
It's been a while. @revit13 are you still interested in moving this forward? @andersonm-ibm are you still planning on helping test these changes? If so, I can help review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this is nearly complete, but could use some updates since its last been touched.
- Rebase the branch on latest master
- Upgrade to use C++14 features (e.g.
std::make_unique
) - Drop unnecessary filesystem API wrappers
- Improve some of the internal documentation.
- Test encrypted files written by Spark 3.2
FilePath(std::string path, std::shared_ptr<::arrow::fs::FileSystem> filesystem) | ||
: file_info_(std::move(path)), filesystem_(filesystem) {} | ||
|
||
/// \brief The directory base name (component before the file base name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// \brief The directory base name (component before the file base name | |
/// \brief The directory base name (component before the file base name) |
namespace parquet { | ||
|
||
/// \brief The path and filesystem where an actual file is located. | ||
class PARQUET_EXPORT FilePath { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need this class?
std::string key_material_file_name = | ||
full_prefix + parquet_file_path->base_name() + | ||
std::string(FileSystemKeyMaterialStore::kKeyMaterialFileSuffix); | ||
key_material_file_ = ::arrow::internal::make_unique<FilePath>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably switch this to std::make_unique
::arrow::Result<std::shared_ptr<Buffer>> buff = | ||
input->ReadAt(0, input->GetSize().ValueOrDie()); | ||
std::string buff_str = buff.ValueOrDie()->ToString(); | ||
::arrow::util::string_view sv(buff_str); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also should upgrade string_view
stuff.
key.SetString(it->first.c_str(), allocator); | ||
value.SetString(it->second.c_str(), allocator); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if you set #define RAPIDJSON_HAS_STDSTRING 1
you can just pass in the strings as is.
std::vector<::arrow::fs::FileInfo> parquet_files_in_folder; | ||
::arrow::fs::FileSelector s; | ||
s.base_dir = folder_path->dir_name(); | ||
parquet_files_in_folder = folder_path->filesystem()->GetFileInfo(s).ValueOrDie(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be recursive? Or be allowing the user to supply the vector of fileinfos?
if (parquet_file.type() != ::arrow::fs::FileType::File) | ||
throw ParquetException("Expecting file type in " + folder_path->dir_name()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not ignore directories?
|
||
if (key_material_store != nullptr) { | ||
key_material_store_ = key_material_store; | ||
checked_key_material_internal_storage_ = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if instead of having this bool we just checked key_material_store != nullptr
?
std::string key_id_in_file = key_metadata.key_reference(); | ||
std::string key_material_string = key_material_store_->GetKeyMaterial(key_id_in_file); | ||
if (key_material_string.empty()) { | ||
throw ParquetException("Null key material for keyIDinFile: " + key_id_in_file); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
throw ParquetException("Null key material for keyIDinFile: " + key_id_in_file); | |
throw ParquetException("Null key material for key with ID: " + key_id_in_file); |
key_id_in_file = | ||
KeyMaterial::kColumnKeyIdInFilePrefix + std::to_string(key_counter_); | ||
key_counter_++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we appending this integer? Who else needs to know about this ID? It seems like this detail deserves more comments.
throw ParquetException("External key material store is not supported yet."); | ||
if (parquet_file_path == nullptr) { | ||
std::stringstream ss; | ||
ss << "Output file path cannot be null" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to clarify this a bit:
ss << "Output file path cannot be null" | |
ss << "Output file path cannot be null when using external key material" |
namespace parquet { | ||
namespace encryption { | ||
|
||
constexpr const char FileSystemKeyMaterialStore::kKetMaterialFilePrefix[]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo, should be kKeyMaterialFilePrefix
if (parquet_file.type() != ::arrow::fs::FileType::File) | ||
throw ParquetException("Expecting file type in " + folder_path->dir_name()); | ||
std::string child = parquet_file.base_name(); | ||
if (filter_out_file(child)) continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make more sense to only look for files that begin with the _KEY_MATERIAL_FOR_
prefix, rather than assuming that any file that doesn't begin with "." or "_" is a Parquet file with an associated external key material file? Otherwise this will fail for any directories that contain any other type of file.
It doesn't look like this pull request is going to be revived, so I've opened a new PR that addresses the review comments at #34181 |
…keys (#34181) This PR is a replacement for #10491 which appears to be abandoned. I've fixed the issues pointed out in review comments as well as a few more things I noticed. I've also made one significant change to the API; rather than `CryptoFactory::RotateMasterKeys` operating on a whole directory, it rotates keys for a single file, as this is much more flexible and allows users to decide what files to rotate keys for, and there didn't appear to be a good reason for this to work at the whole directory level. This does mean the API diverges slightly from the parquet-mr API though. ### Rationale for this change Use of external key material allows rotating master encryption keys without having to rewrite Parquet file data. See https://docs.google.com/document/d/1bEu903840yb95k9q2X-BlsYKuXoygE4VnMDl9xz_zhk/edit?usp=sharing for more details. ### What changes are included in this PR? Adds support for writing and reading external key material for Parquet files from C++, as well as a new `CryptoFactory::RotateMasterKeys` function that allows re-encrypting key encryption keys or data encryption keys with latest versions of master keys. ### Are these changes tested? Yes, unit tests are included. I've added an additional test that reads a file generated with parquet-mr (used by Spark) from the parquet-testing repository. This requires merging the PR at apache/parquet-testing#36 and updating the parquet-testing submodule before the new test will pass. ### Are there any user-facing changes? Yes, the existing `internal_key_material` option in `parquet::encryption::EncryptionConfiguration` will now work and use external key material. This requires using two new parameters (`file_path` and `file_system`) in `CryptoFactory::GetFileEncryptionProperties` and `CryptoFactory::GetFileDecryptionProperties`, which are needed so that we know where to write/read the external key material. Note that this means external key material won't work from Python until the new parameters are exposed in Python too. This changes the `CryptoFactory` ABI but the API is still source compatible. Is this okay or should new overloads be added that take the new parameters? The `CryptoFactory::RotateMasterKeys` function is also a new public facing API. * Closes: #25986 Lead-authored-by: Adam Reeve <adreeve@gmail.com> Co-authored-by: Revital Sur <eres@il.ibm.com> Co-authored-by: Maya Anderson <mayaa@il.ibm.com> Signed-off-by: Will Jones <willjones127@gmail.com>
Thank you, @adamreeve and @wjones127 , for taking this up! |
This PR adds the support for external key material and key rotation for Parquet Modular encryption.
More information on external key material and key rotation for PME can be found in https://docs.google.com/document/d/1bEu903840yb95k9q2X-BlsYKuXoygE4VnMDl9xz_zhk/edit?usp=sharing.
Thanks to @roee88 for suggesting the usage of arrow::fs::FileSystem in the implementation.