New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-11644: [Python][Parquet] Low-level Parquet decryption in Python #9631
Conversation
Covers some but not all of the API features.
Status: I think I have a basic API working, next just need to figure out how to make Python tests builds and wheels use encryption. Even if this doesn't get merged the second half should be useful for high-level encryption. |
Should be ready for review now. |
sure, will do |
Thank you! One takeaway, not really useful in this context, is that I wouldn't ever use Cython for C++. E.g. I had to jump through hoops to have Python "override" a virtual method, had lots of problems with bindings being out of sync due to manual repetition... With |
yep, I'm too getting an impression a bulk of this pr is deep into cython and related spaces. I don't know anything about this, so will leave a few comments on the things I do understand, but other folks will be needed for a full review of this pr. Please feel free to contact additional reviewers. |
@@ -70,6 +70,26 @@ class PARQUET_EXPORT StringKeyIdRetriever : public DecryptionKeyRetriever { | |||
std::map<std::string, std::string> key_map_; | |||
}; | |||
|
|||
// Function variant of DecryptionKeyRetriever, taking a state object. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is a change in the low level C++ layer needed to build a wrapper layer on top of it? for example, we have the #8023 that also builds a layer on top, but without additions to low level layer (besides some light fixes), working with the DecryptionKeyRetriever as it is today.
Since this pull request is a temporary python wrapping for decryption, helpful until the high-level python wrapper is ready, not sure it makes sense to introduce changes in the low level layer in order to accommodate this pr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First, I should note that I'm not sure this is temporary API. The client paying for me to write this examined the high-level API and decided it didn't add anything they cared about, so from their perspective they actively want to use the low-level API.
Second, this is the place where Cython's unfortunate (lack of) C++ integration comes in. It's not possible to subclass a C++ class in Cython, which is why I had to create a new C++ class that can dispatch to a function. The new C++ class could, of course, be placed somewhere else—it doesn't have to be part of the Parquet public API. In theory I could put it somewhere Cython specific and then just need to convince CMake to load it. I put it in the public API on theory someone else might end up wrapping the API in a language that is tied to C rather than C++.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, I think I haven't seen this before. Typically, these are technical discussions, but this is the second time you mention the paying customer :) Usecases, real-life scenarios and customers are important of course; still, their requirements are best translated into technical terms.
In the jira discussion, I've listed the security and compatibility issues, associated with providing a direct access to the low level encryption API. I were ok though with providing only the decryption part as a temp measure before the high-level layer becomes available. While the former is still incompatible and won't be able to read e.g. the files written by Apache Spark - at least the decryption part does not have the security issues (that the encryption part has). But if it requires Cython-specific changes in the existing code, this adds to arguments against this approach. Another argument is the availability of the C++ version of the high-level layer, merged recently. The decision is up to the community, of course.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you working on Arrow in your spare time? Or the Ursa Labs people? The main driver for all of this is companies that want to use these tools, and funding it one way or another. That being said, I mention they're client of mine as, it's true, a caveat: I can't speak to their particular technical assessment of the high-level API, but they did make one.
If you're willing I can try to introduce you to the people who decided the high-level API wasn't worth using from their perspective, and maybe you can convince them they're wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I work on parquet encryption in a number of OSS repos/frameworks, trying to make sure these frameworks are interoperable regardless of the ecosystem they belong to. Also, trying to make the use of this security tool as safe as possible. The high-level layer addresses these goals; it is unfortunate that your client has decided it is not worth using. There are quite a few companies that made a different decision :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not strictly related to the above discussion, but usually these classes are implemented in https://github.com/apache/arrow/tree/master/cpp/src/arrow/python as far as I can tell.
In https://docs.google.com/document/d/1i1M5f5azLEmASj9XQZ_aQLl5Fr5F0CvnyPPVu1xaD9U/edit @emkornfield suggested to look at PyFileSystem as an example and I suggest to follow the same pattern in terms of implementation (vtable, ARROW_PYTHON_EXPORT, and anything else).
It's not possible to subclass a C++ class in Cython
To be fair to Cython, it's possible (see an example). However, it's not well documented and might be less efficient w.r.t. GIL so the vtable approach used in the arrow codebase is probably better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't a performance critical path, so I think I will try the subclassing path for simplicity's sake.
Thank you for the review! |
cdef c_map[c_string, shared_ptr[ | ||
ColumnEncryptionProperties]] c_column_keys | ||
|
||
builder = new FileEncryptionProperties.Builder(footer_key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a possibility of a memory leak here? Did you consider using a unique_ptr here instead?
So here is what the end-user explained about why they specifically want the low-level API; I've bolded the most important part.
I don't want to waste time if this PR will never be accepted, but AFAICT there is a real use-case for low-level API. |
I'll try to review soon. I think the rationale presented by the end-user makes sense, @ggershinsky are you still very concerned with this? |
I will attempt to address existing review comments next week. |
Well, a couple of points: |
I know Itamar has put a lot of time into developing this capability for his customer (which I appreciate), and into contributing it to the open source, so I'd be glad to repeat the reasons for the concern, and to expand on them. |
Hi @ggershinsky, I'm one of the end-users pestering @itamarst. ;-) I don't claim to be an encryption expert, so the following feedback is purely from a user/developer perspective.
IMHO the last point should be carefully considered, as it's reflected and used in highly acclaimed libraries and APIs, such as C++ STL, Boost, Zlib, OpenSSL, Vulkan, etc (personal bias in this choice of libraries, of course; interestingly DirectX12/Vulkan do prove a point though - developers want more fine grained access and level of controls in their API, not less). |
Hey @GPSnoopy , thanks for the detailed input! It is particularly interesting because you use asymmetric encryption of AES keys; we always wanted to check the high-level API against such scenario. Regarding addressing the immediate needs of your usecase - I'm sure we'll find a practical solution, with one of the APIs (more on that later in this comment).
this could mapped rather easily to the high-level API. Basically, it requires a developer to implement a method
This is precisely the intent of having a pluggable KMS interface in the high-level API; it works like that in other companies.
Yep, I understand. Some companies though use both Spark and PyArrow/pandas in their data pipelines. Or migrate from one to the other.
No choice with C++, the low-level has to be implemented in some language. With Python, we do have a choice to make the API safer. Also, the Python API will be used by a wider set of developers, including users that don't have any experience with data encryption.
In general, I'd totally agree. However, this is a somewhat unusual library, because it belongs in the field of security, and is supposed to protect sensitive data. Unfortunately, its low-level interface can be easily misused by unexperienced users, resulting in broken protection. Now, for the practical solutions for your usecase. I can think of the following options:
|
Hi @GPSnoopy , could you please check out the proposal for the high-level API ARROW-9947: Adding a Python API for Parquet encryption in Arrow that @ggershinsky is referring to, in particular where it describes the KmsClient API and examples of writing and reading encrypted files? |
Hi @andersonm-ibm, I'll have a look. I wasn't aware of this document, will be quite useful to get more context on how to use the high-level API.
@ggershinsky That could potentially work. Is this available in the C++ version already and I missed it? |
So while there's an ongoing discussion on the high-level API (and I do hope you can work out something that works!) in the short term still seems worth getting this merged? I think it meets @ggershinsky's requirements that encryption is not an approved public API, insofar as one has to use a private API. But could be made more private. |
Hi @itamarst , we are working on an implementation of the high-level API based on the document ARROW-9947: Adding a Python API for Parquet encryption in Arrow and we are going to open a PR in the upcoming days after we add some finishing touches to the code. |
Hi @andersonm-ibm, Apologies for the late reply. I finally had time to review the high-level API and it indeed sounds like it would suit our needs. I'll need to test it at some point with ParquetSharp (quite a lot of work, since there are many new classes) to check that it integrates well with the company crypto libraries. Cheers! |
@itamarst @ggershinsky Should this PR remain open and be revived? (there are merge conflicts right now) |
Let's see if @GPSnoopy comes back with positive comments on high-level API; hopefully he does and then this isn't relevant anymore. |
I think we can reasonably close this PR since the high level Python API had been merged in. |
Thanks for the answer, closing with a delay ;-) |
This does both encryption and decryption, but since encryption is a bit more controversial, as a first pass in this PR the assumption is the encryption APIs are just there for testing.
(While the high-level API is coming soon, the company that hired me to do this is not particularly interested in the high-level API: they reviewed it and would rather use the low-level API. Presumably there are others in same situation.)