[C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API #29238
Comments
Krisztian Szucs / @kszucs:
Neal Richardson / @nealrichardson:
@ianmcook @jorisvandenbossche this doesn't seem like a blocker, right?
@raulcd this one is very close to being finished and we would like to get it into 14.0.0 if possible
…n the new Dataset API (#34616)

### Rationale for this change

The purpose of this pull request is to support modular encryption in the new Dataset API. See https://docs.google.com/document/d/13EysCNC6-Nu9wnJ8YpdzmD-aMLn4i2KXUJTNqIihy7A/edit# for the supporting document.

### What changes are included in this PR?

I made improvements to the C++ and Python code to enable the Dataset API to have per-file settings for each file saved. Previously, the Dataset API applied the same encryption properties to all saved files; the updated code allows them to differ per file. In the Python code, I've updated the ParquetFormat class to accept DatasetEncryptionConfiguration and DatasetDecryptionConfiguration structures. With these changes, you can pass the format object to the write_dataset function, giving you the ability to set unique encryption properties for each file in your Dataset.

### Are these changes tested?

Yes, unit tests are included. I have also included a Python sample project.

### Are there any user-facing changes?

Yes. As stated above, the ParquetFormat class has optional parameters for DatasetEncryptionConfiguration and DatasetDecryptionConfiguration, exposed through setters and getters. Using these, a Dataset can now set different file encryption properties per file.

* Closes: #29238

Lead-authored-by: Don <tolleybot@gmail.com>
Co-authored-by: Donald Tolley <tolleybot@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: anjakefala <anja@voltrondata.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Co-authored-by: scoder <stefan_ml@behnel.de>
Co-authored-by: Will Jones <willjones127@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@tolleybot if you could answer here on this comment, I can assign the issue to you (some strange limitation of GitHub that this is required .. ;))
For the new Dataset API to fully support Parquet modular encryption (PME), the same writer properties, which include file_encryption_properties, shouldn't be used for the whole dataset. file_encryption_properties should be set per file, for example to support key rotation (ARROW-9960).
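
To make the difference concrete, here is a minimal pure-Python sketch of shared versus per-file encryption properties. It does not use the Arrow API: `FileEncryptionProperties`, `make_properties`, `plan_shared`, `plan_per_file`, the file names, and the key ids `kf1`/`kf2` are all hypothetical stand-ins chosen for illustration, and the assumption that each properties object carries its own random data key is mine.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEncryptionProperties:
    """Hypothetical stand-in for per-file encryption settings (not the Arrow class)."""
    master_key_id: str  # id of the master key that would wrap the data key
    data_key: bytes     # random data encryption key used for one file


def make_properties(master_key_id: str) -> FileEncryptionProperties:
    """Build fresh properties with a newly generated 128-bit data key."""
    return FileEncryptionProperties(master_key_id, secrets.token_bytes(16))


def plan_shared(paths, master_key_id):
    """One properties object reused for every file in the dataset."""
    shared = make_properties(master_key_id)
    return {path: shared for path in paths}


def plan_per_file(paths, current_master_key_id):
    """Properties created separately for each file.

    `current_master_key_id` is a callable, so a master key rotation that
    happens between two files is picked up by the files written afterwards.
    """
    return {path: make_properties(current_master_key_id()) for path in paths}


paths = ["part-0.parquet", "part-1.parquet", "part-2.parquet"]

# Shared: every file ends up with the same data key and master key id.
shared = plan_shared(paths, "kf1")

# Per file: simulate a rotation from "kf1" to "kf2" before the third file.
per_file = plan_per_file(paths, iter(["kf1", "kf1", "kf2"]).__next__)

for path in paths:
    print(path, per_file[path].master_key_id, per_file[path].data_key.hex())
```

With shared properties, a rotated master key cannot take effect until the whole dataset write is restarted; with a per-file factory, only files opened after the rotation use the new key.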
Design document: https://docs.google.com/document/d/13EysCNC6-Nu9wnJ8YpdzmD-aMLn4i2KXUJTNqIihy7A/edit#
Reporter: Maya Anderson / @andersonm-ibm
Note: This issue was originally created as ARROW-13593. Please see the migration documentation for further details.