
ARROW-11644: [Python][Parquet] Low-level Parquet decryption in Python #9631

Closed · wants to merge 33 commits

Conversation

itamarst
Contributor

@itamarst itamarst commented Mar 4, 2021

This does both encryption and decryption, but since encryption is more controversial, the assumption for this first pass is that the encryption APIs are there only for testing.

(While the high-level API is coming soon, the company that hired me to do this is not particularly interested in it: they reviewed it and would rather use the low-level API. Presumably there are others in the same situation.)

@itamarst
Contributor Author

Status: I think I have a basic API working; next I need to figure out how to make the Python test builds and wheels enable encryption. Even if this doesn't get merged, the second half should be useful for high-level encryption.

@itamarst itamarst marked this pull request as ready for review March 29, 2021 20:01
@itamarst changed the title from "ARROW-11644: [Python][Parquet] Low-level Parquet encryption in Python, initial sketch for feedback" to "ARROW-11644: [Python][Parquet] Low-level Parquet decryption in Python" on Mar 29, 2021
@itamarst
Contributor Author

Should be ready for review now.

@ggershinsky
Contributor

sure, will do

@itamarst
Contributor Author

itamarst commented Apr 6, 2021

Thank you! One takeaway, not really useful in this context, is that I wouldn't ever use Cython for C++. E.g. I had to jump through hoops to have Python "override" a virtual method, and had lots of problems with bindings getting out of sync due to manual repetition... I suspect this would be a lot easier with pybind11, given my long-ago experiences with boost::python.

@ggershinsky
Contributor

ggershinsky commented Apr 8, 2021

Yep, I too am getting the impression that the bulk of this PR is deep in Cython and related areas. I don't know anything about that, so I will leave a few comments on the things I do understand, but other folks will be needed for a full review of this PR. Please feel free to contact additional reviewers.

@@ -70,6 +70,26 @@ class PARQUET_EXPORT StringKeyIdRetriever : public DecryptionKeyRetriever {
std::map<std::string, std::string> key_map_;
};

// Function variant of DecryptionKeyRetriever, taking a state object.
Contributor

Is a change in the low-level C++ layer needed to build a wrapper layer on top of it? For example, we have #8023, which also builds a layer on top, but without additions to the low-level layer (besides some light fixes), working with DecryptionKeyRetriever as it is today.
Since this pull request is a temporary Python wrapper for decryption, helpful until the high-level Python wrapper is ready, I'm not sure it makes sense to introduce changes in the low-level layer to accommodate this PR.

Contributor Author

First, I should note that I'm not sure this is a temporary API. The client paying me to write this examined the high-level API and decided it didn't add anything they cared about, so from their perspective they actively want to use the low-level API.

Second, this is where Cython's unfortunate (lack of) C++ integration comes in. It's not possible to subclass a C++ class in Cython, which is why I had to create a new C++ class that can dispatch to a function. The new C++ class could, of course, be placed somewhere else; it doesn't have to be part of the Parquet public API. In theory I could put it somewhere Cython-specific and then just convince CMake to load it. I put it in the public API on the theory that someone else might end up wrapping the API from a language that is tied to C rather than C++.
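To sketch the pattern (a toy Python model with hypothetical names, not the PR's actual C++/Cython code): the new C++ class essentially holds a callback plus an opaque state object, and forwards key-metadata lookups to it, which is what lets a Python function stand in for a DecryptionKeyRetriever subclass.

```python
# Toy model of the function-dispatch retriever pattern described above.
# All names here are hypothetical; the real PR implements this in C++.

class FunctionKeyRetriever:
    """Holds a callable plus state and dispatches key lookups to them."""

    def __init__(self, get_key, state):
        self._get_key = get_key
        self._state = state

    def get_key(self, key_metadata: str) -> bytes:
        # The C++ class would forward to a function pointer here.
        return self._get_key(self._state, key_metadata)


def lookup(key_map, key_id):
    # Application-supplied callback: resolve a key id to key bytes.
    return key_map[key_id]


retriever = FunctionKeyRetriever(lookup, {"kc1": b"0123456789012345"})
print(retriever.get_key("kc1"))  # b'0123456789012345'
```

The point of the indirection is that the C++ side never needs to know about Python at all; it only sees a function and a state pointer.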

Contributor

Interesting, I don't think I've seen this before. Typically these are technical discussions, but this is the second time you've mentioned the paying customer :) Use cases, real-life scenarios and customers are important, of course; still, their requirements are best translated into technical terms.

In the JIRA discussion, I've listed the security and compatibility issues associated with providing direct access to the low-level encryption API. I was OK, though, with providing only the decryption part as a temporary measure before the high-level layer becomes available. While the former is still incompatible and won't be able to read e.g. files written by Apache Spark, at least the decryption part does not have the security issues that the encryption part has. But if it requires Cython-specific changes in the existing code, that adds to the arguments against this approach. Another argument is the availability of the C++ version of the high-level layer, merged recently. The decision is up to the community, of course.

Contributor Author

Are you working on Arrow in your spare time? Or are you one of the Ursa Labs people? The main driver for all of this is companies that want to use these tools and fund them one way or another. That being said, I mention that they're a client of mine as a caveat, it's true: I can't speak to their particular technical assessment of the high-level API, but they did make one.

If you're willing I can try to introduce you to the people who decided the high-level API wasn't worth using from their perspective, and maybe you can convince them they're wrong.

Contributor

I work on parquet encryption in a number of OSS repos/frameworks, trying to make sure these frameworks are interoperable regardless of the ecosystem they belong to. Also, trying to make the use of this security tool as safe as possible. The high-level layer addresses these goals; it is unfortunate that your client has decided it is not worth using. There are quite a few companies that made a different decision :)

Contributor

@roee88 commented Apr 11, 2021

Not strictly related to the above discussion, but usually these classes are implemented in https://github.com/apache/arrow/tree/master/cpp/src/arrow/python as far as I can tell.

In https://docs.google.com/document/d/1i1M5f5azLEmASj9XQZ_aQLl5Fr5F0CvnyPPVu1xaD9U/edit @emkornfield suggested to look at PyFileSystem as an example and I suggest to follow the same pattern in terms of implementation (vtable, ARROW_PYTHON_EXPORT, and anything else).

It's not possible to subclass a C++ class in Cython

To be fair to Cython, it is possible (see an example). However, it's not well documented and might be less efficient w.r.t. the GIL, so the vtable approach used in the Arrow codebase is probably better.

Contributor Author

This isn't a performance critical path, so I think I will try the subclassing path for simplicity's sake.

@itamarst
Contributor Author

itamarst commented Apr 8, 2021

Thank you for the review!

cdef c_map[c_string, shared_ptr[ColumnEncryptionProperties]] c_column_keys

builder = new FileEncryptionProperties.Builder(footer_key)
Contributor

Is there a possibility of a memory leak here? Did you consider using a unique_ptr here instead?

@itamarst
Contributor Author

So here is what the end-user explained about why they specifically want the low-level API; I've bolded the most important part.

The way I understand things, the low-level API only cares about encrypting/decrypting the parquet data but does not do any key management beyond simply storing the key-identifiers.

This is really useful because it means that the low-level API makes no assumption on how the key should be handled. Instead this responsibility is left to the application. Personally I think this is the right call: Apache Arrow/Parquet is not a Key Management System, there are plenty of other libraries for that.

(Again, from what I understand) the high-level API defines how the keys are stored in relation to the key identifier, handles symmetric vs asymmetric keys, and defines how Parquet file metadata should be stored in order to be compatible with a de-facto standard (vague recollection that they want to be compatible with existing ecosystems such as Hadoop or Spark).

IMHO, trying to be compatible with the key management of an existing Parquet ecosystem is a mistake if it forces everyone to adopt that system. Instead it should be a separate library, away from Arrow/Parquet, as one of the possible ways to manage keys for Parquet files.

Unlike a data-format specification like Parquet, I am not convinced that KMS requirements are so universal as to mandate a standard solution that everyone is forced to use (which would be the case if it's the only one exposed in Python). I'm convinced of the opposite: a lot of companies have their own tailor-made KMS. In fact, we already have our own KMS and KMS-related routines to securely manage keys and accesses from the key identifiers. So not only is the high-level API redundant with what we already have, but we also cannot use it, as it would be incompatible with the rest of the company.

I don't want to waste time if this PR will never be accepted, but AFAICT there is a real use-case for the low-level API.
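The division of responsibility the end-user describes above can be illustrated with a toy sketch (hypothetical names throughout; a dict stands in for the company KMS, and this is not the PR's actual API): the Parquet layer only round-trips an opaque key identifier, and the application resolves it to key bytes however it likes.

```python
# Hypothetical sketch: the low-level layer stores and returns an opaque
# key identifier; the application owns the identifier -> key mapping.

company_kms = {
    # In reality this lookup would hit an in-house KMS, not a dict.
    "orders-2021": b"an-example-16b-k",  # a 16-byte AES-128 key
}


def retrieve_key(key_metadata: str) -> bytes:
    """Resolve the identifier stored in the file's crypto metadata.

    A real deployment would authenticate the caller, check key
    permissions, and unwrap the key here.
    """
    return company_kms[key_metadata]


# The decryption path would call this back with whatever identifier was
# written into the file when it was encrypted:
key = retrieve_key("orders-2021")
```

Nothing about key rotation, wrapping, or permissions is dictated by the file format in this model; it all lives on the application side.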

@emkornfield
Contributor

I'll try to review soon. I think the rationale presented by the end-user makes sense, @ggershinsky are you still very concerned with this?

@itamarst
Contributor Author

itamarst commented May 2, 2021

I will attempt to address existing review comments next week.

@ggershinsky
Contributor

I'll try to review soon. I think the rationale presented by the end-user makes sense, @ggershinsky are you still very concerned with this?

Well, a couple of points:
  • The comment above is not very accurate: the high-level API doesn't perform KMS functions; instead, it provides a plug-in interface for any KMS. This end-user might not be familiar enough with the high-level API; there is a good chance it will fit their needs. In the unlikely event it doesn't, it would be good to get technical feedback that points to gaps in this API.
  • My position hasn't changed: I'm OK with the read/decryption part of the low-level API, as a temporary measure before the Python version of the high-level API is ready. I am concerned with exposing the write/encryption part of the low-level API, for the reasons I've mentioned in the ticket.

@ggershinsky
Contributor

I know Itamar has put a lot of time into developing this capability for his customer (which I appreciate), and into contributing it to open source, so I'd be glad to repeat the reasons for the concern, and to expand on them.

IMO, the core of the problem at hand is that the low-level API looks deceptively simple, while the high-level API seems more restrictive and less intuitive. I know who is to blame, because I've basically designed both of them :) But this is being built ground-up: from the spec, to the low level; then, with time and field experience, to the high-level layer; so there is no way to hide the low level now. Its take on encryption seems to be "just give the key, its ID, and we're done". I know that a handful of top data-encryption experts won't think like that, and will look for ways to handle the NIST limit on the number of GCM crypto operations (so the cipher is not broken), and for ways to perform key rotation and other standard data-security procedures. But I'm pretty sure most end users of PyArrow and pandas would use the low level, if exposed, in the "intuitive" way.

Another reason is compatibility with Apache Spark and other frameworks, which never exposed the low-level layer and will start offering Parquet encryption via the high-level API. It would be really good if Spark were able to read files produced by PyArrow, and vice versa.

It will take some time until a full high-level API implementation is available in Arrow. I understand and appreciate the pressure to have Parquet encryption available ASAP, but it is worth (at least in my view :) waiting a bit more for the safe and compatible high-level layer. The work on Parquet encryption started in 2017; what's a few more months..

@GPSnoopy
Contributor

GPSnoopy commented May 3, 2021

Hi @ggershinsky,

I'm one of the end-users pestering @itamarst. ;-)

I don't claim to be an encryption expert, so the following feedback is purely from a user/developer perspective.

  • We already have an in-house crypto library which handles the security choices, design and the integration with our KMS.
  • The integration of this library with Apache Arrow/Parquet (via ParquetSharp) is about 10-20 lines of code.
  • This crypto library generates the AES key, encrypts it using asymmetric keys (obtained via the KMS, driven by a company-internal, user-provided key identifier), adds some extra necessary header information, and publishes that to Parquet as the key identifier.
  • It also deals with user authentication and key permissions.
  • This means that the way we manage Parquet encryption inside the company is consistent with the rest of the company; approved by the various security teams.
  • Being compatible with other external tools and a de-facto Parquet encryption high-level standard is nice, but ultimately the company cares about its own sensitive IP. So being compatible with the company ecosystem is a higher priority than being compatible with Spark (ultimately we will never share encrypted files with other companies; that's kind of the main point).
  • The low-level API is internally used by us in both C++ and C#. So why is Python different?
  • I'm not sure I understand or appreciate the reluctance to provide both the low-level and the higher-level API. It's a really nice property of a library to expose various levels of abstraction, such that the user can integrate with the library at the required level. Having both APIs means that you provide the correct default behaviour and compatibility with the Spark ecosystem for your users, and also provide the necessary flexibility for users with use-cases you have not anticipated or foreseen.

IMHO the last point should be carefully considered, as it's reflected in highly acclaimed libraries and APIs, such as the C++ STL, Boost, Zlib, OpenSSL, Vulkan, etc. (personal bias in this choice of libraries, of course; interestingly, DirectX 12/Vulkan do prove a point: developers want more fine-grained access and levels of control in their APIs, not less).

@ggershinsky
Contributor

Hey @GPSnoopy, thanks for the detailed input! It is particularly interesting because you use asymmetric encryption of AES keys; we always wanted to check the high-level API against such a scenario. Regarding the immediate needs of your use case: I'm sure we'll find a practical solution with one of the APIs (more on that later in this comment).

This crypto library generates the AES key, encrypts it using asymmetric keys (obtained via the KMS, driven by an company-internal user provided key identifier), adds some extra necessary header information and publishes that to Parquet as the key identifier.

This could be mapped rather easily to the high-level API. Basically, it requires a developer to implement a method string wrapKey(byte[] aesKey, string masterKeyID). Here, you could take the AES key (generated by us) and the ID of the master key (specified by you for the table/column), obtain the asymmetric key with this ID from your KMS, encrypt the AES key with it, add any extra header information, and return it to us as a base64-encoded string. We keep it, and give this string back to you upon reading, via the byte[] unwrapKey(string wrappedKey, string masterKeyID) method, which you use to decrypt the AES key. Will this conceptually work for you? I know the AES key is generated by us, but we do it to make sure that one DEK is not used more times than allowed by the NIST spec for GCM, to prevent its breakdown.
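To make the wrapKey/unwrapKey contract concrete, here is a hedged Python sketch of the round trip. The class and method names are illustrative only (the real interface is the KmsClient proposed in ARROW-9947), and the base64 "wrapping" below is NOT encryption; it merely stands in for the KMS call that would encrypt the DEK with the master key.

```python
import base64


class ToyKmsClient:
    """Illustrates the wrap/unwrap contract only.

    base64 is not encryption; a real implementation would fetch the
    asymmetric master key from the company KMS and encrypt with it.
    """

    def wrap_key(self, aes_key: bytes, master_key_id: str) -> str:
        # Real code: encrypt aes_key under the master key, add any
        # header info, then base64-encode the result for storage in
        # the Parquet file's crypto metadata.
        payload = master_key_id.encode() + b":" + aes_key
        return base64.b64encode(payload).decode()

    def unwrap_key(self, wrapped_key: str, master_key_id: str) -> bytes:
        # Real code: decrypt the wrapped DEK with the master key.
        payload = base64.b64decode(wrapped_key)
        key_id, _, aes_key = payload.partition(b":")
        assert key_id == master_key_id.encode()
        return aes_key


client = ToyKmsClient()
dek = b"0123456789abcdef"  # DEK generated by the Parquet layer
wrapped = client.wrap_key(dek, "projects/acme/keys/parquet-master")
assert client.unwrap_key(wrapped, "projects/acme/keys/parquet-master") == dek
```

The Parquet layer only ever stores the wrapped string; the application never has to hand its master keys to the library.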

It also deals with user authentication and key permissions.
This means that the way we manage Parquet encryption inside the company is consistent with the rest of the company; approved by the various security teams.

This is precisely the intent of having a pluggable KMS interface in the high-level API; it works like that in other companies.

Being compatible with other external tools and a de-facto Parquet encryption high-level standard is nice, but ultimately the company cares about its own sensitive IP. So being compatible with the company ecosystem is higher priority than being compatible with Spark (ultimately we will never share encrypted files with other companies, kind of the main point).

Yep, I understand. Some companies though use both Spark and PyArrow/pandas in their data pipelines. Or migrate from one to the other.

The low-level API is internally used by us in both C++ and C#. So why is Python different?

No choice with C++; the low level has to be implemented in some language. With Python, we do have a choice to make the API safer. Also, the Python API will be used by a wider set of developers, including users who don't have any experience with data encryption.

..provide the necessary flexibility for users with use-cases you have not anticipated or foreseen... the last point should be carefully considered, as it's reflected and used in highly acclaimed libraries and APIs

In general, I'd totally agree. However, this is a somewhat unusual library, because it belongs in the field of security and is supposed to protect sensitive data. Unfortunately, its low-level interface can easily be misused by inexperienced users, resulting in broken protection.

Now, for the practical solutions for your usecase. I can think of the following options:

  • be an early adopter of the high-level API. A basic C++ version is ready, and a basic Python version should be ready soon. Obviously, this is my top preference, because it will help the library and its future users in the community to benefit from your experience / contribution.
  • use the low-level Python wrapping, developed by Itamar. You already have it working, and you have enough key management experience to make it safe in your deployment. No need to upstream it to an open source repo that is leveraged by many users without experience in data security.
  • open source / expose the low-level API, with warnings (in the hope users will see / heed them). IMHO this is the least preferable option; I'd still like to understand why nothing else would work.

@andersonm-ibm
Contributor

Hi @GPSnoopy, could you please check out the proposal for the high-level API (ARROW-9947: Adding a Python API for Parquet encryption in Arrow) that @ggershinsky is referring to, in particular where it describes the KmsClient API and the examples of writing and reading encrypted files?
It is still WIP, but your feedback on whether this addresses your requirements would be very valuable.

@GPSnoopy
Contributor

GPSnoopy commented May 7, 2021

Hi @andersonm-ibm, I'll have a look. I wasn't aware of this document; it will be quite useful for getting more context on how to use the high-level API.

this could be mapped rather easily to the high-level API. Basically, it requires a developer to implement a method string wrapKey(byte[] aesKey, string masterKeyID)

@ggershinsky That could potentially work. Is this available in the C++ version already and I missed it?

@itamarst
Contributor Author

So while there's an ongoing discussion on the high-level API (and I do hope you can work out something that works!), in the short term it still seems worth getting this merged? I think it meets @ggershinsky's requirement that encryption not be an approved public API, insofar as one has to use a private API. But it could be made more private.

@andersonm-ibm
Contributor

So while there's an ongoing discussion on the high-level API (and I do hope you can work out something that works!), in the short term it still seems worth getting this merged? I think it meets @ggershinsky's requirement that encryption not be an approved public API, insofar as one has to use a private API. But it could be made more private.

Hi @itamarst , we are working on an implementation of the high-level API based on the document ARROW-9947: Adding a Python API for Parquet encryption in Arrow and we are going to open a PR in the upcoming days after we add some finishing touches to the code.

@GPSnoopy
Contributor

Hi @andersonm-ibm,

Apologies for the late reply. I finally had time to review the high-level API and it indeed sounds like it would suit our needs.

I'll need to test it at some point with ParquetSharp (quite a lot of work, since there are many new classes) to check that it integrates well with the company crypto libraries.

Cheers!

@andersonm-ibm
Contributor

Hi @GPSnoopy and @itamarst ,
We've opened PR #10450 with file-level Parquet encryption.
Your feedback is welcome.

@pitrou
Member

pitrou commented Mar 15, 2022

@itamarst @ggershinsky Should this PR remain open and be revived? (there are merge conflicts right now)

@itamarst
Contributor Author

Let's see if @GPSnoopy comes back with positive comments on the high-level API; hopefully he does, and then this isn't relevant anymore.

@GPSnoopy
Contributor

I think we can reasonably close this PR, since the high-level Python API has been merged in.

@pitrou
Member

pitrou commented May 4, 2022

Thanks for the answer, closing with a delay ;-)


7 participants