
ARROW-11644: [Python][Parquet] Low-level Parquet decryption in Python #9631

Closed · wants to merge 33 commits

Conversation

itamarst
Contributor

@itamarst itamarst commented Mar 4, 2021

This does both encryption and decryption, but since encryption is more controversial, the assumption for this first pass is that the encryption APIs are there only for testing.

(While the high-level API is coming soon, the company that hired me to do this is not particularly interested in it: they reviewed it and would rather use the low-level API. Presumably there are others in the same situation.)

@itamarst
Contributor Author

Status: I think I have a basic API working; next I need to figure out how to make the Python test builds and wheels enable encryption. Even if this doesn't get merged, the second half should be useful for high-level encryption.

@itamarst itamarst marked this pull request as ready for review March 29, 2021 20:01
@itamarst changed the title from "ARROW-11644: [Python][Parquet] Low-level Parquet encryption in Python, initial sketch for feedback" to "ARROW-11644: [Python][Parquet] Low-level Parquet decryption in Python" on Mar 29, 2021
@itamarst
Contributor Author

Should be ready for review now.

@ggershinsky
Contributor

sure, will do

@itamarst
Contributor Author

itamarst commented Apr 6, 2021

Thank you! One takeaway, not really useful in this context, is that I wouldn't ever use Cython for C++. E.g. I had to jump through hoops to have Python "override" a virtual method, and had lots of problems with bindings getting out of sync due to manual repetition... I suspect this would be a lot easier with pybind11, given my long-ago experiences with boost::python.

@ggershinsky
Contributor

ggershinsky commented Apr 8, 2021

Yep, I too am getting the impression that the bulk of this PR is deep in Cython and related areas. I don't know anything about that, so I will leave a few comments on the things I do understand, but other folks will be needed for a full review of this PR. Please feel free to contact additional reviewers.

@@ -70,6 +70,26 @@ class PARQUET_EXPORT StringKeyIdRetriever : public DecryptionKeyRetriever {
std::map<std::string, std::string> key_map_;
};

// Function variant of DecryptionKeyRetriever, taking a state object.
Contributor

Is a change in the low-level C++ layer needed to build a wrapper layer on top of it? For example, we have #8023, which also builds a layer on top, but without additions to the low-level layer (besides some light fixes), working with DecryptionKeyRetriever as it is today.
Since this pull request is a temporary Python wrapper for decryption, helpful until the high-level Python wrapper is ready, I'm not sure it makes sense to introduce changes in the low-level layer to accommodate this PR.

Contributor Author

First, I should note that I'm not sure this is a temporary API. The client paying me to write this examined the high-level API and decided it didn't add anything they cared about, so from their perspective they actively want to use the low-level API.

Second, this is where Cython's unfortunate (lack of) C++ integration comes in. It's not possible to subclass a C++ class in Cython, which is why I had to create a new C++ class that can dispatch to a function. The new C++ class could, of course, be placed somewhere else; it doesn't have to be part of the Parquet public API. In theory I could put it somewhere Cython-specific and then just convince CMake to load it. I put it in the public API on the theory that someone else might end up wrapping the API from a language that is tied to C rather than C++.
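To sketch the pattern (a toy Python model with hypothetical names, not the PR's actual C++/Cython code): the new C++ class essentially holds a callback plus an opaque state object, and forwards key-metadata lookups to it, which is what lets a Python function stand in for a DecryptionKeyRetriever subclass.

```python
# Toy model of the function-dispatch retriever pattern described above.
# All names here are hypothetical; the real PR implements this in C++.

class FunctionKeyRetriever:
    """Holds a callable plus state and dispatches key lookups to them."""

    def __init__(self, get_key, state):
        self._get_key = get_key
        self._state = state

    def get_key(self, key_metadata: str) -> bytes:
        # The C++ class would forward to a function pointer here.
        return self._get_key(self._state, key_metadata)


def lookup(key_map, key_id):
    # Application-supplied callback: resolve a key id to key bytes.
    return key_map[key_id]


retriever = FunctionKeyRetriever(lookup, {"kc1": b"0123456789012345"})
print(retriever.get_key("kc1"))  # b'0123456789012345'
```

The point of the indirection is that the C++ side never needs to know about Python at all; it only sees a function and a state pointer.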

Contributor

Interesting, I don't think I've seen this before. Typically these are technical discussions, but this is the second time you've mentioned the paying customer :) Use cases, real-life scenarios and customers are important, of course; still, their requirements are best translated into technical terms.

In the JIRA discussion, I've listed the security and compatibility issues associated with providing direct access to the low-level encryption API. I was OK, though, with providing only the decryption part as a temporary measure before the high-level layer becomes available. While the former is still incompatible and won't be able to read e.g. files written by Apache Spark, at least the decryption part does not have the security issues that the encryption part has. But if it requires Cython-specific changes in the existing code, that adds to the arguments against this approach. Another argument is the availability of the C++ version of the high-level layer, merged recently. The decision is up to the community, of course.

Contributor Author

Are you working on Arrow in your spare time? Or are you one of the Ursa Labs people? The main driver for all of this is companies that want to use these tools and fund them one way or another. That being said, I mention that they're a client of mine as a caveat, it's true: I can't speak to their particular technical assessment of the high-level API, but they did make one.

If you're willing I can try to introduce you to the people who decided the high-level API wasn't worth using from their perspective, and maybe you can convince them they're wrong.

Contributor

I work on parquet encryption in a number of OSS repos/frameworks, trying to make sure these frameworks are interoperable regardless of the ecosystem they belong to. Also, trying to make the use of this security tool as safe as possible. The high-level layer addresses these goals; it is unfortunate that your client has decided it is not worth using. There are quite a few companies that made a different decision :)

Contributor

@roee88 commented Apr 11, 2021

Not strictly related to the above discussion, but usually these classes are implemented in https://github.com/apache/arrow/tree/master/cpp/src/arrow/python as far as I can tell.

In https://docs.google.com/document/d/1i1M5f5azLEmASj9XQZ_aQLl5Fr5F0CvnyPPVu1xaD9U/edit @emkornfield suggested to look at PyFileSystem as an example and I suggest to follow the same pattern in terms of implementation (vtable, ARROW_PYTHON_EXPORT, and anything else).

It's not possible to subclass a C++ class in Cython

To be fair to Cython, it is possible (see an example). However, it's not well documented and might be less efficient w.r.t. the GIL, so the vtable approach used in the Arrow codebase is probably better.

Contributor Author

This isn't a performance critical path, so I think I will try the subclassing path for simplicity's sake.

@itamarst
Contributor Author

itamarst commented Apr 8, 2021

Thank you for the review!

cdef c_map[c_string, shared_ptr[ColumnEncryptionProperties]] c_column_keys

builder = new FileEncryptionProperties.Builder(footer_key)
Contributor

Is there a possibility of a memory leak here? Did you consider using a unique_ptr here instead?

@itamarst
Contributor Author

So here is what the end-user explained about why they specifically want the low-level API; I've bolded the most important part.

The way I understand things, the low-level API only cares about encrypting/decrypting the parquet data but does not do any key management beyond simply storing the key-identifiers.

This is really useful because it means that the low-level API makes no assumption on how the key should be handled. Instead this responsibility is left to the application. Personally I think this is the right call: Apache Arrow/Parquet is not a Key Management System, there are plenty of other libraries for that.

(Again, from what I understand) the high-level API defines how the keys are stored in relation to the key identifier, handles symmetric vs asymmetric keys, and defines how Parquet file metadata should be stored in order to be compatible with a de-facto standard (vague recollection that they want to be compatible with existing ecosystems such as Hadoop or Spark).

IMHO, trying to be compatible with the key management of an existing Parquet ecosystem is a mistake if it forces everyone to adopt that system. Instead it should be a separate library, away from Arrow/Parquet, as one of the possible ways to manage keys for Parquet files.

Unlike a data-format specification like Parquet, I am not convinced that KMS requirements are so universal as to mandate a standard solution that everyone is forced to use (which would be the case if it's the only one exposed in Python). I'm convinced of the opposite: a lot of companies have their own tailor-made KMS. In fact, we already have our own KMS and KMS-related routines to securely manage keys and accesses from the key identifiers. So not only is the high-level API redundant with what we already have, but we also cannot use it, as it would be incompatible with the rest of the company.

I don't want to waste time if this PR will never be accepted, but AFAICT there is a real use-case for the low-level API.
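The division of responsibility the end-user describes above can be illustrated with a toy sketch (hypothetical names throughout; a dict stands in for the company KMS, and this is not the PR's actual API): the Parquet layer only round-trips an opaque key identifier, and the application resolves it to key bytes however it likes.

```python
# Hypothetical sketch: the low-level layer stores and returns an opaque
# key identifier; the application owns the identifier -> key mapping.

company_kms = {
    # In reality this lookup would hit an in-house KMS, not a dict.
    "orders-2021": b"an-example-16b-k",  # a 16-byte AES-128 key
}


def retrieve_key(key_metadata: str) -> bytes:
    """Resolve the identifier stored in the file's crypto metadata.

    A real deployment would authenticate the caller, check key
    permissions, and unwrap the key here.
    """
    return company_kms[key_metadata]


# The decryption path would call this back with whatever identifier was
# written into the file when it was encrypted:
key = retrieve_key("orders-2021")
```

Nothing about key rotation, wrapping, or permissions is dictated by the file format in this model; it all lives on the application side.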

@emkornfield
Contributor

I'll try to review soon. I think the rationale presented by the end-user makes sense, @ggershinsky are you still very concerned with this?

@itamarst
Contributor Author

itamarst commented May 2, 2021

I will attempt to address existing review comments next week.

@ggershinsky
Contributor

I'll try to review soon. I think the rationale presented by the end-user makes sense, @ggershinsky are you still very concerned with this?

Well, a couple of points:
  • The comment above is not very accurate: the high-level API doesn't perform KMS functions; instead, it provides a plug-in interface for any KMS. This end-user might not be familiar enough with the high-level API; there is a good chance it will fit their needs. In the unlikely event it doesn't, it would be good to get technical feedback that points to gaps in this API.
  • My position hasn't changed: I'm OK with the read/decryption part of the low-level API, as a temporary measure before the Python version of the high-level API is ready. I am concerned with exposing the write/encryption part of the low-level API, for the reasons I've mentioned in the ticket.

@ggershinsky
Contributor

I know Itamar has put a lot of time into developing this capability for his customer (which I appreciate), and into contributing it to open source, so I'd be glad to repeat the reasons for the concern, and to expand on them.

IMO, the core of the problem at hand is that the low-level API looks deceptively simple, while the high-level API seems more restrictive and less intuitive. I know who is to blame, because I've basically designed both of them :) But this is being built ground-up: from the spec, to the low level; then, with time and field experience, to the high-level layer; so there is no way to hide the low level now. Its take on encryption seems to be "just give the key, its ID, and we're done". I know that a handful of top data-encryption experts won't think like that, and will look for ways to handle the NIST limit on the number of GCM crypto operations (so the cipher is not broken), and for ways to perform key rotation and other standard data-security procedures. But I'm pretty sure most end users of PyArrow and pandas would use the low level, if exposed, in the "intuitive" way.

Another reason is compatibility with Apache Spark and other frameworks, which never exposed the low-level layer and will start offering Parquet encryption via the high-level API. It would be really good if Spark were able to read files produced by PyArrow, and vice versa.

It will take some time until a full high-level API implementation is available in Arrow. I understand and appreciate the pressure to have Parquet encryption available ASAP, but it is worth (at least in my view :) waiting a bit more for the safe and compatible high-level layer. The work on Parquet encryption started in 2017; what's a few more months..

@GPSnoopy
Contributor

GPSnoopy commented May 3, 2021

Hi @ggershinsky,

I'm one of the end-users pestering @itamarst. ;-)

I don't claim to be an encryption expert, so the following feedback is purely from a user/developer perspective.

  • We already have an in-house crypto library which handles the security choices, design and the integration with our KMS.
  • The integration of this library with Apache Arrow/Parquet (via ParquetSharp) is about 10-20 lines of code.
  • This crypto library generates the AES key, encrypts it using asymmetric keys (obtained via the KMS, driven by a company-internal, user-provided key identifier), adds some extra necessary header information, and publishes that to Parquet as the key identifier.
  • It also deals with user authentication and key permissions.
  • This means that the way we manage Parquet encryption inside the company is consistent with the rest of the company; approved by the various security teams.
  • Being compatible with other external tools and a de-facto Parquet encryption high-level standard is nice, but ultimately the company cares about its own sensitive IP. So being compatible with the company ecosystem is a higher priority than being compatible with Spark (ultimately we will never share encrypted files with other companies; that's kind of the main point).
  • The low-level API is internally used by us in both C++ and C#. So why is Python different?
  • I'm not sure I understand or appreciate the reluctance to provide both the low-level and the higher-level API. It's a really nice property of a library to expose various levels of abstraction, such that the user can integrate with the library at the required level. Having both APIs means that you provide the correct default behaviour and compatibility with the Spark ecosystem for your users, and also provide the necessary flexibility for users with use-cases you have not anticipated or foreseen.

IMHO the last point should be carefully considered, as it's reflected in highly acclaimed libraries and APIs, such as the C++ STL, Boost, Zlib, OpenSSL, Vulkan, etc. (personal bias in this choice of libraries, of course; interestingly, DirectX 12/Vulkan do prove a point: developers want more fine-grained access and levels of control in their APIs, not less).

@ggershinsky
Contributor

Hey @GPSnoopy, thanks for the detailed input! It is particularly interesting because you use asymmetric encryption of AES keys; we always wanted to check the high-level API against such a scenario. Regarding the immediate needs of your use case: I'm sure we'll find a practical solution with one of the APIs (more on that later in this comment).

This crypto library generates the AES key, encrypts it using asymmetric keys (obtained via the KMS, driven by an company-internal user provided key identifier), adds some extra necessary header information and publishes that to Parquet as the key identifier.

This could be mapped rather easily to the high-level API. Basically, it requires a developer to implement a method string wrapKey(byte[] aesKey, string masterKeyID). Here, you could take the AES key (generated by us) and the ID of the master key (specified by you for the table/column), obtain the asymmetric key with this ID from your KMS, encrypt the AES key with it, add any extra header information, and return it to us as a base64-encoded string. We keep it, and give this string back to you upon reading, via the byte[] unwrapKey(string wrappedKey, string masterKeyID) method, which you use to decrypt the AES key. Will this conceptually work for you? I know the AES key is generated by us, but we do it to make sure that one DEK is not used more times than allowed by the NIST spec for GCM, to prevent its breakdown.
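To make the wrapKey/unwrapKey contract concrete, here is a hedged Python sketch of the round trip. The class and method names are illustrative only (the real interface is the KmsClient proposed in ARROW-9947), and the base64 "wrapping" below is NOT encryption; it merely stands in for the KMS call that would encrypt the DEK with the master key.

```python
import base64


class ToyKmsClient:
    """Illustrates the wrap/unwrap contract only.

    base64 is not encryption; a real implementation would fetch the
    asymmetric master key from the company KMS and encrypt with it.
    """

    def wrap_key(self, aes_key: bytes, master_key_id: str) -> str:
        # Real code: encrypt aes_key under the master key, add any
        # header info, then base64-encode the result for storage in
        # the Parquet file's crypto metadata.
        payload = master_key_id.encode() + b":" + aes_key
        return base64.b64encode(payload).decode()

    def unwrap_key(self, wrapped_key: str, master_key_id: str) -> bytes:
        # Real code: decrypt the wrapped DEK with the master key.
        payload = base64.b64decode(wrapped_key)
        key_id, _, aes_key = payload.partition(b":")
        assert key_id == master_key_id.encode()
        return aes_key


client = ToyKmsClient()
dek = b"0123456789abcdef"  # DEK generated by the Parquet layer
wrapped = client.wrap_key(dek, "projects/acme/keys/parquet-master")
assert client.unwrap_key(wrapped, "projects/acme/keys/parquet-master") == dek
```

The Parquet layer only ever stores the wrapped string; the application never has to hand its master keys to the library.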

It also deals with user authentication and key permissions.
This means that the way we manage Parquet encryption inside the company is consistent with the rest of the company; approved by the various security teams.

This is precisely the intent of having a pluggable KMS interface in the high-level API; it works like that in other companies.

Being compatible with other external tools and a de-facto Parquet encryption high-level standard is nice, but ultimately the company cares about its own sensitive IP. So being compatible with the company ecosystem is higher priority than being compatible with Spark (ultimately we will never share encrypted files with other companies, kind of the main point).

Yep, I understand. Some companies though use both Spark and PyArrow/pandas in their data pipelines. Or migrate from one to the other.

The low-level API is internally used by us in both C++ and C#. So why is Python different?

No choice with C++; the low level has to be implemented in some language. With Python, we do have a choice to make the API safer. Also, the Python API will be used by a wider set of developers, including users who don't have any experience with data encryption.

..provide the necessary flexibility for users with use-cases you have not anticipated or foreseen... the last point should be carefully considered, as it's reflected and used in highly acclaimed libraries and APIs

In general, I'd totally agree. However, this is a somewhat unusual library, because it belongs in the field of security and is supposed to protect sensitive data. Unfortunately, its low-level interface can easily be misused by inexperienced users, resulting in broken protection.

Now, for the practical solutions for your usecase. I can think of the following options:

  • be an early adopter of the high-level API. A basic C++ version is ready, and a basic Python version should be ready soon. Obviously, this is my top preference, because it will help the library and its future users in the community to benefit from your experience / contribution.
  • use the low-level Python wrapping, developed by Itamar. You already have it working, and you have enough key management experience to make it safe in your deployment. No need to upstream it to an open source repo that is leveraged by many users without experience in data security.
  • open source / expose the low-level API, with warnings (in the hope users will see / heed them). IMHO this is the least preferable option; I'd still like to understand why nothing else would work.

@andersonm-ibm
Contributor

Hi @GPSnoopy, could you please check out the proposal for the high-level API (ARROW-9947: Adding a Python API for Parquet encryption in Arrow) that @ggershinsky is referring to, in particular where it describes the KmsClient API and the examples of writing and reading encrypted files?
It is still WIP, but your feedback on whether this addresses your requirements would be very valuable.

@GPSnoopy
Contributor

GPSnoopy commented May 7, 2021

Hi @andersonm-ibm, I'll have a look. I wasn't aware of this document; it will be quite useful for getting more context on how to use the high-level API.

this could be mapped rather easily to the high-level API. Basically, it requires a developer to implement a method string wrapKey(byte[] aesKey, string masterKeyID)

@ggershinsky That could potentially work. Is this available in the C++ version already and I missed it?

@itamarst
Contributor Author

So while there's an ongoing discussion on the high-level API (and I do hope you can work out something that works!), in the short term it still seems worth getting this merged? I think it meets @ggershinsky's requirement that encryption not be an approved public API, insofar as one has to use a private API. But it could be made more private.

@andersonm-ibm
Contributor

So while there's an ongoing discussion on the high-level API (and I do hope you can work out something that works!), in the short term it still seems worth getting this merged? I think it meets @ggershinsky's requirement that encryption not be an approved public API, insofar as one has to use a private API. But it could be made more private.

Hi @itamarst , we are working on an implementation of the high-level API based on the document ARROW-9947: Adding a Python API for Parquet encryption in Arrow and we are going to open a PR in the upcoming days after we add some finishing touches to the code.

@GPSnoopy
Contributor

Hi @andersonm-ibm,

Apologies for the late reply. I finally had time to review the high-level API and it indeed sounds like it would suit our needs.

I'll need to test it at some point with ParquetSharp (quite a lot of work, since there are many new classes) to check that it integrates well with the company crypto libraries.

Cheers!

@andersonm-ibm
Contributor

Hi @GPSnoopy and @itamarst ,
We've opened PR #10450 with file-level Parquet encryption.
Your feedback is welcome.

@pitrou
Member

pitrou commented Mar 15, 2022

@itamarst @ggershinsky Should this PR remain open and be revived? (there are merge conflicts right now)

@itamarst
Contributor Author

Let's see if @GPSnoopy comes back with positive comments on the high-level API; hopefully he does, and then this isn't relevant anymore.

@GPSnoopy
Contributor

I think we can reasonably close this PR, since the high-level Python API has been merged in.

@pitrou
Member

pitrou commented May 4, 2022

Thanks for the answer, closing with a delay ;-)


7 participants