Encryption in Data Files #20
You certainly want to use a KeyManager/Provider. I'd suggest that you take a look at the API that I made for ORC. You are going to want to support the cloud Key Management Services (KMS), and thus you need to be compatible with their APIs. Given that you have a KeyManager, you want to record the key name to encrypt with. On the read side, you pretty much need to put the encrypted local key and IV into the file's metadata. You absolutely can't have a fixed IV for the table, and you shouldn't have a fixed local key for the table either. For column encryption, it becomes more interesting.
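As an illustration of the shape being described, here is a minimal sketch of such a KeyManager; all names and signatures are invented for illustration, not the actual ORC or Hadoop API:

```java
/**
 * Minimal sketch of the kind of KeyManager described above. Names and
 * signatures are illustrative assumptions, not the real ORC/Hadoop API.
 */
public interface KeyManager {

  /** A per-file local key plus the metadata that must travel with the file. */
  final class LocalKey {
    final byte[] plaintextKey;  // used to encrypt the file; never persisted
    final byte[] encryptedKey;  // wrapped by the master key; stored in file metadata
    final byte[] iv;            // unique per file; also stored in file metadata

    LocalKey(byte[] plaintextKey, byte[] encryptedKey, byte[] iv) {
      this.plaintextKey = plaintextKey;
      this.encryptedKey = encryptedKey;
      this.iv = iv;
    }
  }

  /** Create a fresh random local key and IV, wrapped by the named master key. */
  LocalKey createLocalKey(String masterKeyName);

  /** Recover a plaintext local key from the wrapped copy in file metadata. */
  byte[] decryptLocalKey(String masterKeyName, byte[] encryptedLocalKey);
}
```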
I should also mention that ORC's KeyProvider API is a subset of Hadoop's, and the default implementation will use the Hadoop implementation. On the cloud providers, we'll need to create an implementation for each service.
I was considering using Palantir's hadoop-crypto library to do the actual encryption portion of things. What do you think about this package? Column encryption is interesting; on our side we haven't explored it yet, so we wouldn't really be able to handle per-column encryption and would need, in the meantime, to encrypt only at the top file layer. That is to say, our internal storage solution doesn't handle storing multiple keys to decrypt different portions of the same file; you'll notice this in the hadoop-crypto library as well. So whatever solution we come up with should be able to handle either full-file encryption or per-column encryption. I suppose, though, that a given file would only be encrypted strictly one way or the other; if we encrypt the whole file, you more or less lose all the benefits of per-column encryption. Additionally, a key part of performance is reducing the number of round trips made to the key storage backend, particularly if the backend supports batch operations.
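One hypothetical way to leave room for both modes is to record the encryption scope in each file's metadata; a minimal sketch, with all names invented for illustration:

```java
import java.util.Map;

/**
 * Hypothetical sketch, not an Iceberg API: one way metadata could record
 * whether a file was encrypted wholesale or per column.
 */
public final class EncryptionScheme {
  public enum Scope { FULL_FILE, PER_COLUMN }

  private final Scope scope;
  private final byte[] encryptedFileKey;                 // used when scope == FULL_FILE
  private final Map<String, byte[]> encryptedColumnKeys; // column name -> wrapped key, when PER_COLUMN

  public EncryptionScheme(Scope scope,
                          byte[] encryptedFileKey,
                          Map<String, byte[]> encryptedColumnKeys) {
    this.scope = scope;
    this.encryptedFileKey = encryptedFileKey;
    this.encryptedColumnKeys = encryptedColumnKeys;
  }

  public Scope scope() { return scope; }
}
```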
Tagging @vinooganesh @yifeih to follow.
I understand that column encryption takes file format support and that isn't available yet, although it will be available for ORC soon. I haven't looked at the details of Palantir's hadoop-crypto library, but the approach looks good. For per-file encryption, I would:
The relevant features:
Looking a little deeper at hadoop-crypto, they are doing the key management themselves. I think most users would be better served by using a KMS. Amazon's KMS is here. Using a KMS means that you don't send big secrets to the job, which radically lowers your potential for screw-ups.
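As a concrete reference for the envelope pattern a KMS enables, here is a minimal sketch using the AWS SDK v1 KMS client; the key alias is an invented placeholder, and credentials/region setup is assumed:

```java
import java.nio.ByteBuffer;
import com.amazonaws.services.kms.AWSKMS;
import com.amazonaws.services.kms.AWSKMSClientBuilder;
import com.amazonaws.services.kms.model.DataKeySpec;
import com.amazonaws.services.kms.model.DecryptRequest;
import com.amazonaws.services.kms.model.GenerateDataKeyResult;
import com.amazonaws.services.kms.model.GenerateDataKeyRequest;

public class KmsEnvelopeExample {
  public static void main(String[] args) {
    AWSKMS kms = AWSKMSClientBuilder.defaultClient();

    // Writer: ask the KMS for a fresh data key under a master key it never reveals.
    GenerateDataKeyResult dataKey = kms.generateDataKey(
        new GenerateDataKeyRequest()
            .withKeyId("alias/iceberg-demo")   // illustrative key alias
            .withKeySpec(DataKeySpec.AES_256));
    ByteBuffer plaintextKey = dataKey.getPlaintext();      // encrypt the file with this
    ByteBuffer encryptedKey = dataKey.getCiphertextBlob(); // store this in file metadata

    // Reader: send only the small encrypted key back to the KMS to recover the plaintext.
    ByteBuffer recovered = kms.decrypt(
        new DecryptRequest().withCiphertextBlob(encryptedKey)).getPlaintext();
  }
}
```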
I think we can build a top-level interface that, when implemented with KMS, could accomplish those goals, but could also be implemented by other means to support other key storage strategies. More precisely, I really don't want to tie us to KMS as the key storage backend for all Iceberg users. Here's a rough sketch of a design that I think could have that flexibility:
Now let's suppose we were to implement all of the above with KMS as the storage backend. I'd suppose we could provide this as the default implementation that ships with Iceberg.
Regarding hadoop-crypto, I was considering it less for the key storage layer and more for its in-memory representation of keys and its generation of ciphers from those key objects. Though the interface shouldn't expose hadoop-crypto objects, naturally. Thoughts on the above proposal?
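As a rough illustration of an interface with that flexibility (every name below is an assumption, not part of the actual proposal), the point is that a KMS-backed implementation and a path-keyed store could sit behind the same interface:

```java
/**
 * Illustrative sketch only: one interface, multiple key storage backends.
 * A KMS-backed implementation could ship as the Iceberg default, while
 * deployments like hadoop-crypto's path-keyed store implement the same
 * interface with their own storage strategy.
 */
public interface KeyStorage {

  /** Wrap or persist a freshly generated file key; returns an opaque reference. */
  byte[] storeKey(String keyName, byte[] plaintextKey);

  /** Resolve an opaque reference produced by storeKey back to the plaintext key. */
  byte[] retrieveKey(String keyName, byte[] keyReference);
}
```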
I'm actually going to go ahead and put this into a prototype - we can add further discussion there.
Ok, I'd separate out the key information from the encryption data:
The KMS doesn't need the path, IV, or the algorithm. But I think this is starting to look good.
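A minimal sketch of that separation, with illustrative names: the key information is all the key manager needs, while the encryption data stays with the file:

```java
/** Key information: what the key manager needs, and nothing more. */
public final class KeyDescription {
  public final String masterKeyName;
  public final int masterKeyVersion;

  public KeyDescription(String masterKeyName, int masterKeyVersion) {
    this.masterKeyName = masterKeyName;
    this.masterKeyVersion = masterKeyVersion;
  }
}

/** Encryption data: what only the file reader needs; never sent to the KMS. */
final class EncryptionData {
  final byte[] encryptedFileKey;
  final byte[] iv;
  final String algorithm; // e.g. "AES/CTR/NoPadding"

  EncryptionData(byte[] encryptedFileKey, byte[] iv, String algorithm) {
    this.encryptedFileKey = encryptedFileKey;
    this.iv = iv;
    this.algorithm = algorithm;
  }
}
```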
And of course, you could add
For creation it would be helpful to pass in the file path, and in fact I'm wondering if the name should always be the URI of the file, since we rely on path-based lookup for our key storage system. Thoughts?
You typically don't want a new master key per file, just a local file key. So the KMS shouldn't know or care about file or table paths. For example, you could have an entire set of tables protected with the "pii" key.
So to write, the job needs to generate random file keys and have them encrypted. The KeyManager just needs the master key name and gives back the key version, the encrypted file key, and the unencrypted file key. To read, we need to decrypt the file key, so the job passes the master key name, the master key version, and the encrypted key to the KeyManager.
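Spelled out as an interface, that write/read protocol might look like the following sketch, which extends the earlier hypothetical KeyManager with the master key version this comment adds (all names are illustrative assumptions):

```java
/**
 * Sketch only: the write/read protocol above, written as an interface.
 * All names are illustrative assumptions, not a real Iceberg or ORC API.
 */
interface VersionedKeyManager {

  /** What the write path gets back from the key manager. */
  final class WrappedKey {
    final int masterKeyVersion;     // which version of the master key wrapped this
    final byte[] encryptedFileKey;  // stored with the file
    final byte[] plaintextFileKey;  // used to encrypt the file, then discarded

    WrappedKey(int masterKeyVersion, byte[] encryptedFileKey, byte[] plaintextFileKey) {
      this.masterKeyVersion = masterKeyVersion;
      this.encryptedFileKey = encryptedFileKey;
      this.plaintextFileKey = plaintextFileKey;
    }
  }

  /** Write path: generate a random file key and wrap it with the named master key. */
  WrappedKey createFileKey(String masterKeyName);

  /** Read path: unwrap the encrypted file key recorded with the file. */
  byte[] decryptFileKey(String masterKeyName, int masterKeyVersion, byte[] encryptedFileKey);
}
```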
This is different from the way we model key storage. We roughly follow this model: https://github.com/palantir/hadoop-crypto/blob/ac8d4474ee667121873bf0abf0674d83c78d8b90/crypto-keys/src/main/java/com/palantir/crypto2/keys/KeyStorageStrategy.java#L27, where for a given FileSystem the encryption key name is derived from the file path: https://github.com/palantir/hadoop-crypto/blob/6d9e05a1e667f150f7d98435e93a0dd6f3ea5c08/hadoop-crypto/src/main/java/com/palantir/crypto2/hadoop/EncryptedFileSystem.java#L115. Because we use hadoop-crypto right now for our existing encryption, we need to match the existing model. I was wondering how we could do this with the Iceberg API while still enabling the KMS model you propose.
Let me also explain in more detail. In our system we don't have a mapping from file paths to encryption key names. We can't store that without changing the way our storage solution reasons about encryption. We are using Iceberg as an ephemeral representation of the data that will be read by Spark and written via Spark.
One other way we could solve for both use cases would be as follows:
What this allows is for implementations that want to write their own key-name derivation to do so. For example, the name could be derived from the path, while by default, for the traditional KMS use case, it is completely random. The key name generator could also be its own module/interface.
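A minimal sketch of that generator as its own interface, with the two derivations mentioned above (names are illustrative):

```java
import java.net.URI;
import java.util.UUID;

/** Hypothetical pluggable key-name derivation, as suggested above. */
public interface KeyNameGenerator {
  String keyNameFor(URI file);

  /** Default for the KMS-style case: the name carries no information. */
  static KeyNameGenerator random() {
    return file -> UUID.randomUUID().toString();
  }

  /** Path-derived names for systems that look keys up by file path. */
  static KeyNameGenerator fromPath() {
    return file -> file.toString();
  }
}
```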
Ok, thanks for explaining. I can at least see some of the disconnect now, which helps. So from the hadoop-crypto point of view, the keys are the "file keys" in the way that I was thinking about it. They are writing the file key out as a side file and encrypting it with a public/private key pair. That is relatively expensive and doubles the number of S3 objects. So their model would fit into my proposal, except that they have a single global master key (their public/private key pair). I get that your implementation hashes the path to generate the file key, but I don't see how you secure it. Obviously the hash of the path isn't a secret. :)
We would generate the name of the key using the path, but the bytes of the key given that name are generated securely. (Edit: meaning, the name of the key has an association with the path, but the bytes of the key are generated independently of the path or the key name.) Additionally, the built-in encryption solution that stores the key alongside the file is a default implementation, but one can choose to implement the key storage differently.
I'd also argue that Iceberg can't assume paths are immutable. So using the hash of the path to reconstruct the secret key won't work.
Sure, and the default implementation doesn't have to. I think the default implementation shouldn't actually assume the file path has any association with the key. The idea is that it doesn't hurt to have the API include the file URI somewhere when generating keys, because the key manager then knows what it is encrypting, and that can influence how it derives the key name and the storage metadata of that key.
We should also loop in @ggershinsky, who is working on the Parquet encryption spec.
Parquet encryption has additional metadata parameters (such as per-file aadPrefixes, column keys, etc.), but since this is basically a column-encryption format, it might be too early to factor this in. In any case, whenever needed feel free to ping me; I'd be glad to assist with the Parquet part.
@rdblue @omalley I moved the architecture to https://docs.google.com/document/d/1LptmFB7az2rLnou27QK_KKHgjcA5vKza0dWj4h8fkno/edit?usp=sharing. Please take a look and comment, and then we can continue with implementation. |
@mccheah, can you also start a thread on the dev list to point out this spec? I think other people who aren't necessarily following the GitHub issues will probably be interested.
This is done. Thanks everyone! |
We want to support encrypting and decrypting data that is recorded in Iceberg tables. There are several API extensions that we can consider to make this work:

- A `KeyReference` field, which is a byte blob in the `DataFile` object. A `KeyReference` is a pointer to a key.
- An `EncryptionKey`, which is a composition of the key bytes, the IV, and the key algorithm (see e.g. here and here).
- A `KeyManager`, which manages creating new keys and retrieving keys based on key references. The `TableOperations` API should support returning an `Optional<KeyManager>`; return `Optional.empty()` if the table operations doesn't support encryption.