Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python][Packaging] Add Parquet encryption/decryption / OpenSSL to Python wheels #24254

Open
1 task
asfimport opened this issue Mar 9, 2020 · 21 comments
Open
1 task

Comments

@asfimport
Copy link

asfimport commented Mar 9, 2020

Reporter: Wes McKinney / @wesm

Subtasks:

Note: This issue was originally created as ARROW-8040. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Itamar Turner-Trauring / @itamarst:
I would like to get this working (doing this on behalf of a client)—the packaging sides seem relatively simple, just adding the right flags to the build scripts, and maybe making sure OpenSSL is compiled in statically.

However, it doesn't seem like there's Python bindings for the encryption? Or at least, it's not clear to me how to use Parquet encryption from Python... So does that need to be done separately? Or is there an example I can look at?

Thanks!

@asfimport
Copy link
Author

Itamar Turner-Trauring / @itamarst:
Just checking in again—could someone answer my question above re the need for Python bindings in addition to packaging wheels? Thanks!

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
Right, it seems that encryption options are not exposed in the Python bindings. So I suggest you start with that. Do you need any guidance? Places to look at probably include pyarrow/_parquet.pxd, pyarrow/_parquet.pyx and pyarrow/parquet.py.

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
Perhaps open a separate JIRA for exposing encryption options in the Python bindings?

@asfimport
Copy link
Author

Itamar Turner-Trauring / @itamarst:
Thanks for the confirmation. Will open new issue and try to create a first pass API design.

@asfimport
Copy link
Author

Gidon Gershinsky / @ggershinsky:
@itamarst  Sorry for missing the start of this discussion. Please see my response in the list, and feel free to contact me with any questions, I'd be glad to help with this.

@asfimport
Copy link
Author

Itamar Turner-Trauring / @itamarst:
So it seems like high-level encryption is going to be a long time coming, and access to low-level encryption is still useful. So going to try to do that.

@asfimport
Copy link
Author

Gidon Gershinsky / @ggershinsky:
Hi Itamar, to sum up my previous comments on this, there are two issues with exposing the low level encryption to the users,

  • security: low-level encryption API is easy to misuse (eg giving the same keys for a number of different files; this'd break the AES GCM cipher). The high-level encryption layer handles that by applying envelope encryption and other best practices in data security. Also, this layer is maintained by the community, meaning that future improvements and security fixes can be upstreamed by anyone, and available to all.

  • compatibility: parquet-mr implements the high-level encryption layer. If we want the files produced by Spark/Presto/etc to be readable by pandas/PyArrow (and vice versa), we need to provide the Arrow users with the high-level API. 

    Actually, this layer is already implemented in C++ and sent as a PR to the Arrow repo (ARROW-9318: [C++] Parquet encryption key management #8023 I agree with you it is a long time coming, and wonder when the review can be resumed by the community.. 

@asfimport
Copy link
Author

Itamar Turner-Trauring / @itamarst:
Given people are paying me to try to get this moving, some people are using the low-level API directly already, and they want to use it from Python too, and now. So exposing it doesn't necessarily make their situation any worse. And I do understand the concerns about interop.

So my thought was to expose the low-level API in a way that didn't preclude a higher-level API, and made it clear (via documentation and API names) that there will eventually be a nicer, higher-level API. Would something like that take away your concerns?

@asfimport
Copy link
Author

Gidon Gershinsky / @ggershinsky:
I presume you are getting paid to apply this code in their usecase, not to upstream it to an open source repository. I don't have concerns with the internal usage.

@asfimport
Copy link
Author

Itamar Turner-Trauring / @itamarst:
Sure, but that suggests it might be useful to other people? The low-level encryption API is already exposed in C+, and is the only API in C+, so people can already use it right now. It's possible no one is, of course.

Basically argument I'm making is "this will be useful to some people". That may not override the concerns about security, of course.

@asfimport
Copy link
Author

Itamar Turner-Trauring / @itamarst:
The cryptography library has this issue, with primitives they don't think most people should be using, and the solution is having a "hazardous materials" part of the API, explicitly marked as such: https://cryptography.io/en/latest/hazmat/primitives/index.html

@asfimport
Copy link
Author

Gidon Gershinsky / @ggershinsky:
Yep, I understand. But with this being a data protection technology, we need to take extra care to keep it secure. 

Regarding C++ - we're building this bottom up, starting with a cipher layer, then low-level encryption layer, then the high-level layer. No way this'd be accepted by the community as a single PR. The current situation is not ideal, it'd be good to merge the high-level PR (and maybe hide the low level), but here we are; also, C++ is a kind of a low-level language; Python would expose it to a less experienced audience.

@asfimport
Copy link
Author

Itamar Turner-Trauring / @itamarst:
On the list, Micah Kornfield pointed out that decryption doesn't have the security concerns, and insofar as interop is a concern it's never going to produce incompatible files. So I'm going to start with doing that as a minimum, and we can revisit encryption as a second pass if the high-level API is still in limbo.

@asfimport
Copy link
Author

Gidon Gershinsky / @ggershinsky:
Thanks. I agree the decryption part is much safer than encryption. Regarding interop - this decryption code will not be able to read files, produced by Spark, because that would require parsing the key material. Implementation of key material parsing means implementation of the high level layer. I'll send a mail to the list to check if this can be indeed be moved out of a state of limbo.

@asfimport
Copy link
Author

Itamar Turner-Trauring / @itamarst:
High-level API would be better, yeah. So I'm just going to do decryption low-level, have my client test it and see if it works for them. And if high-level gets merged soon (and that works for them), even better, can switch to wrapping that for Python.

@asfimport
Copy link
Author

Gidon Gershinsky / @ggershinsky:
Sure, SGTM

@asfimport
Copy link
Author

Itamar Turner-Trauring / @itamarst:
Turns out I probalby need to wrap some of the encryption API just to make testing easy :( So the first pass might include that, but won't expose it publicly.

@asfimport
Copy link
Author

Itamar Turner-Trauring / @itamarst:
I have an initial working sketch PR: #9631

Would appreciate any feedback, especially on API design. As mentioned above, I was forced to wrap encryption in order to do testing, but the plan is to not make it public part of the API.

@asfimport
Copy link
Author

Itamar Turner-Trauring / @itamarst:
And also, of course, this only wraps parts of the decryption API, I plan to wrap the rest of it as well.

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
Note that Parquet modular encryption is now available in the Python wheels (there are nightly builds: https://arrow.apache.org/docs/python/install.html#installing-nightly-packages ). Should this JIRA be kept open?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant