[C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API #29238

Closed
asfimport opened this issue Aug 10, 2021 · 5 comments · Fixed by #34616

Comments

@asfimport
Collaborator

asfimport commented Aug 10, 2021

For the new Dataset API to fully support Parquet modular encryption (PME), the writer properties that include file_encryption_properties should not be shared across the whole dataset. Instead, file_encryption_properties should be set per file, for example to support key rotation (ARROW-9960).
Design document: https://docs.google.com/document/d/13EysCNC6-Nu9wnJ8YpdzmD-aMLn4i2KXUJTNqIihy7A/edit#
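
To illustrate the per-file idea, here is a minimal pure-Python sketch. It does not use the Arrow or Parquet APIs; `DatasetEncryptionConfig`, `FileEncryptionProperties`, `properties_for_file`, and `plan_dataset_write` are hypothetical names invented for this example. It only shows the shape of the request: one dataset-level configuration, from which a separate properties object with a fresh data key is derived for every file.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEncryptionConfig:
    """Hypothetical dataset-level settings shared by every file."""
    footer_key_id: str
    data_key_bits: int = 128


@dataclass(frozen=True)
class FileEncryptionProperties:
    """Hypothetical per-file properties holding that file's own data key."""
    path: str
    footer_key_id: str
    data_key: bytes


def properties_for_file(config, path):
    # Generate a new random data key for each file instead of reusing one.
    data_key = secrets.token_bytes(config.data_key_bits // 8)
    return FileEncryptionProperties(path, config.footer_key_id, data_key)


def plan_dataset_write(config, paths):
    # Build one properties object per output file.
    return {path: properties_for_file(config, path) for path in paths}


config = DatasetEncryptionConfig(footer_key_id="footer_key")
plan = plan_dataset_write(config, ["part-0.parquet", "part-1.parquet"])
```

Because each file gets its own properties object, the key material for one file can be replaced without touching the others, which is what key rotation needs.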

Reporter: Maya Anderson / @andersonm-ibm

Note: This issue was originally created as ARROW-13593. Please see the migration documentation for further details.

@asfimport
Collaborator Author

Krisztian Szucs / @kszucs:
Postponing to 9.0

@asfimport
Collaborator Author

Neal Richardson / @nealrichardson:
This issue has been inactive for 3 months, so it has been unassigned and marked as unstarted. If you are still working on this, feel free to reassign yourself and resume progress.

@raulcd
Member

raulcd commented Oct 10, 2023

@ianmcook @jorisvandenbossche this doesn't seem like a blocker, right?

@ianmcook
Member

@raulcd this one is very close to being finished and we would like to get it into 14.0.0 if possible

@raulcd added the "Priority: Blocker" label (marks a blocker for the release) Oct 10, 2023
jorisvandenbossche added a commit that referenced this issue Oct 11, 2023
…n the new Dataset API (#34616)

### Rationale for this change

The purpose of this pull request is to support modular encryption in the new Dataset API. See the design document at https://docs.google.com/document/d/13EysCNC6-Nu9wnJ8YpdzmD-aMLn4i2KXUJTNqIihy7A/edit# for details.

### What changes are included in this PR?

I updated the C++ and Python code so that the Dataset API can apply encryption settings per file. Previously, the Dataset API applied the same encryption properties to every file it wrote. In Python, the ParquetFormat class now accepts DatasetEncryptionConfiguration and DatasetDecryptionConfiguration structures. Passing the format object to the write_dataset function lets you set distinct encryption properties for each file in your Dataset.

### Are these changes tested?

Yes, unit tests are included. I have also included a Python sample project.

### Are there any user-facing changes?

Yes. As stated above, the ParquetFormat class has optional DatasetEncryptionConfiguration and DatasetDecryptionConfiguration parameters, exposed through setters and getters. With these, the Dataset can set different file encryption properties per file.

* Closes: #29238

Lead-authored-by: Don <tolleybot@gmail.com>
Co-authored-by: Donald Tolley <tolleybot@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: anjakefala <anja@voltrondata.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Co-authored-by: scoder <stefan_ml@behnel.de>
Co-authored-by: Will Jones <willjones127@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche
Member

@tolleybot if you could reply to this comment, I can assign the issue to you (a strange GitHub limitation makes this necessary ;))

JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Oct 23, 2023
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024