
Encryption in Data Files #20

Closed
mccheah opened this issue Nov 30, 2018 · 24 comments

Comments

@mccheah
Contributor

mccheah commented Nov 30, 2018

We want to support encrypting and decrypting data that is recorded in Iceberg tables. There are several API extensions that we can consider to make this work:

  • Define a KeyReference field, which is a byte blob in the DataFile object. A KeyReference is a pointer to a key.
  • Define an EncryptionKey which is a composition of the key bytes, the iv, and the key algorithm (see e.g. here and here)
struct EncryptionKey {
    byte[] encodedKey();
    String keyAlgorithm();
    byte[] iv();
}
  • Define a KeyManager which manages creating new keys and retrieving keys based on key references. The TableOperations API should support returning an Optional<KeyManager>; return Optional.empty() if the table operations implementation doesn't support encryption.
struct CreatedKey {
    EncryptionKey key();
    byte[] keyReference();
}

interface KeyManager {
    CreatedKey createKey(String pathToEncrypt);
    EncryptionKey getKey(byte[] reference);
}
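
To make the read path concrete, here is a minimal sketch of how the proposed EncryptionKey fields could be turned into a JCE cipher. The Ciphers class, the method name, and the AES/CTR transformation are illustrative assumptions, not part of the proposal:

import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public final class Ciphers {
    private Ciphers() {}

    // Builds a decrypting cipher from the fields proposed for EncryptionKey:
    // the raw key bytes, the key algorithm (e.g. "AES"), and the IV.
    public static Cipher decryptingCipher(byte[] encodedKey, String keyAlgorithm, byte[] iv)
            throws Exception {
        SecretKeySpec key = new SecretKeySpec(encodedKey, keyAlgorithm);
        // AES/CTR/NoPadding is only an example transformation; the proposal does not fix one.
        Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));
        return cipher;
    }
}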
@omalley
Contributor

omalley commented Nov 30, 2018

You certainly want to use a KeyManager/Provider. I'd suggest that you take a look at the API that I made for ORC.

https://github.com/apache/orc/blob/280cb5a6a6122ad79d04dc5ca9c76b2281f6153a/java/shims/src/java/org/apache/orc/impl/HadoopShims.java#L129

You are going to want to support the cloud Key Management Services (KMS), and thus you need to be compatible with their APIs.

Given that you have a KeyManager, you want to record the key name to encrypt with. On the read side, you pretty much need to put the encrypted local key and IV into the file's metadata. You absolutely can't have a fixed IV for the table and shouldn't have a fixed local key for the table.

For column encryption, it becomes more interesting. There you have:

  • a key name
  • the columns to encrypt with which key
  • how to mask the unencrypted version of the data (defaulting to nullify)
  • a local key per file / encrypted column, which will be stored encrypted in the file's metadata
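
As a rough illustration only, a per-column policy along those lines might be shaped something like this; none of these names exist in Iceberg or ORC, they simply gather the pieces listed above:

import java.util.List;

// Hypothetical shape for a per-column encryption policy; the masking options
// mirror the "defaulting to nullify" point above.
public class ColumnEncryptionPolicy {
    public enum Masking { NULLIFY, REDACT, HASH }

    private final String keyName;            // master key name known to the KMS
    private final List<String> columns;      // columns encrypted with that key
    private final Masking unencryptedMask;   // what readers without the key see

    public ColumnEncryptionPolicy(String keyName, List<String> columns, Masking mask) {
        this.keyName = keyName;
        this.columns = columns;
        this.unencryptedMask = mask;
    }
}

The per-file, per-column local keys from the last point would still live encrypted in the file's own metadata rather than in a policy object like this.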

@omalley
Contributor

omalley commented Nov 30, 2018

I should also mention that ORC's KeyProvider API is a subset of Hadoop's and the default implementation will use the Hadoop implementation. On the cloud providers, we'll need to create an implementation for each service.

@mccheah
Contributor Author

mccheah commented Nov 30, 2018

I was considering using Palantir's hadoop-crypto library to do the actual encryption portion of things. What do you think about this package?

Column encryption is interesting; we haven't explored it on our side yet, so we can't handle per-column encryption and, in the meantime, can only encrypt at the whole-file level. That is to say, our internal storage solution doesn't handle storing multiple keys to decrypt different portions of the same file, and you'll notice the same limitation in the hadoop-crypto library. So whatever solution we come up with should be able to handle either full-file encryption or per-column encryption. I suppose, though, that a given file would be encrypted strictly one way or the other; if we encrypt the whole file, you more or less lose all the benefits of per-column encryption.

Additionally, a key part of performance is reducing the number of round trips to the key storage backend, particularly if the backend supports batch operations. Ideally the KeyManager would support getting and putting multiple keys at once, and the Spark data source and other Iceberg clients would be implemented to contact the backend as few times as possible.

@mccheah
Contributor Author

mccheah commented Nov 30, 2018

Tagging @vinooganesh @yifeih to follow.

@omalley
Contributor

omalley commented Nov 30, 2018

I understand that column encryption takes file format support and that isn't available yet, although it will be available for ORC soon.

I haven't looked at the details of Palantir's hadoop-crypto library, but the approach looks good. For per-file encryption, I would:

  • Define the key name in the Iceberg table metadata.
  • When writing, call the KMS to generate a random local key and the corresponding encrypted bytes. Create a random IV for each file. When you update the manifest for the file, add the key name, key version, encryption algorithm, iv, and encrypted local key.
  • When reading, use the metadata from the manifest to have the KMS decrypt the key for that file. Use the decrypted key and iv to decrypt the file as needed.
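
A minimal sketch of the write-side half of that flow, using only javax.crypto; the plaintext local key is assumed to come from a KMS generate-data-key style call (which also returns the encrypted copy stored in the manifest), and the AES/CTR transformation is just an example:

import java.io.OutputStream;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.CipherOutputStream;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public final class EnvelopeWriter {
    private static final SecureRandom RANDOM = new SecureRandom();

    // Fills iv with fresh random bytes (never reuse a key/IV pair) and wraps
    // the raw file output so everything written through it is encrypted with
    // the plaintext local key. The encrypted copy of the key, the IV, the key
    // name/version, and the algorithm are what get recorded in the manifest.
    public static OutputStream encryptingStream(OutputStream rawOut,
                                                byte[] plaintextLocalKey,
                                                byte[] iv) throws Exception {
        RANDOM.nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE,
            new SecretKeySpec(plaintextLocalKey, "AES"),
            new IvParameterSpec(iv));
        return new CipherOutputStream(rawOut, cipher);
    }
}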

The relevant features:

  • The master key stays in the KMS and is never given to the user.
  • There is only one trip per file to the KMS during reading or writing.
  • The encryption never reuses a local key/iv pair. Reuse of those pairs is very very bad.
  • If the user keeps a local key, it can only be used to decrypt that file.
  • Rolling new versions of master keys is supported.

@omalley
Contributor

omalley commented Nov 30, 2018

Looking a little deeper at hadoop-crypto, they are doing the key management themselves. I think most users would be better served by using a KMS. Amazon's KMS is here. Using a KMS means that you don't send big secrets to the job, which radically lowers your potential for screw ups.

@mccheah
Contributor Author

mccheah commented Nov 30, 2018

I think we can build a top-level interface that, when implemented with KMS, could accomplish those goals but could also be implemented by other means to support other key storage strategies. More precisely, I really don't want to tie us to KMS as the key storage backend for all Iceberg users. Here's a rough sketch of a design that I think could have that flexibility:

  • Define an EncryptionMetadata struct that is a column in the DataFile schema. It is optional. The struct has the following schema:
struct EncryptionMetadata {
    String keyName();
    long keyVersion();
    String encryptedKeyAlgorithm();
    byte[] iv();
    byte[] encryptedKeyBytes();
}
  • Define the structure EncryptionKey as follows; it's simply the real key (more or less just this KeyMaterial):
struct EncryptionKey {
    byte[] keyBytes();
    byte[] iv();
}
  • Define a KeyManager that handles both read and generation of these keys.
struct GeneratedEncryptionKey {
    EncryptionMetadata keyMetadata();
    EncryptionKey key();
}

interface KeyManager {
    EncryptionKey resolveEncryptionKey(EncryptionMetadata metadata);

    // Optional, can just default to calling the individual resolve(...) multiple times.
    List<EncryptionKey> resolveEncryptionKeyBatch(List<EncryptionMetadata> metadatas);

    GeneratedEncryptionKey generateEncryptionKey(String filePath);

    // Optional, can just default to calling the individual generate(...) multiple times.
    List<GeneratedEncryptionKey> generateEncryptionKeyBatch(List<String> filePaths);
}
  • Note that GeneratedEncryptionKey has to carry both the decrypted and the encrypted key: we need the decrypted key to actually encrypt the output stream wherever we do the encrypted write, and the encrypted key to store in the manifest file afterwards.

Now suppose we implement all of the above with KMS as the storage backend; I imagine we could provide that as the default implementation that ships with Iceberg.

  • KeyManager#resolveEncryptionKey goes to the KMS and asks it to decrypt the key with the given metadata's name and encrypted bytes. KMS uses the master key and passes back the decrypted key. KeyManager converts that into an EncryptionKey object and returns it.
  • KeyManager#generateEncryptionKey generates a local key, tells KMS to encrypt it, gets back the encrypted key, and returns the pair of decrypted and encrypted keys in a GeneratedEncryptionKey struct.
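
For example, against AWS KMS those two calls could look roughly like the following (AWS SDK for Java v1; the master key id, the 256-bit AES choice, and the byte[][] return shape are assumptions made only to keep the sketch short):

import java.nio.ByteBuffer;
import javax.crypto.KeyGenerator;
import com.amazonaws.services.kms.AWSKMS;
import com.amazonaws.services.kms.AWSKMSClientBuilder;
import com.amazonaws.services.kms.model.DecryptRequest;
import com.amazonaws.services.kms.model.EncryptRequest;

public class KmsBackedKeyManager {
    private final AWSKMS kms = AWSKMSClientBuilder.defaultClient();
    private final String masterKeyId;   // e.g. a KMS key ARN configured for the table

    public KmsBackedKeyManager(String masterKeyId) {
        this.masterKeyId = masterKeyId;
    }

    // generateEncryptionKey: make a local key, have KMS encrypt it under the
    // master key, and return {plaintextKey, encryptedKey}. The plaintext is
    // used for writing; only the encrypted copy goes into the manifest.
    public byte[][] generate() throws Exception {
        KeyGenerator gen = KeyGenerator.getInstance("AES");
        gen.init(256);
        byte[] plaintext = gen.generateKey().getEncoded();
        ByteBuffer encrypted = kms.encrypt(new EncryptRequest()
            .withKeyId(masterKeyId)
            .withPlaintext(ByteBuffer.wrap(plaintext))).getCiphertextBlob();
        return new byte[][] {plaintext, toBytes(encrypted)};
    }

    // resolveEncryptionKey: KMS identifies the master key from the ciphertext
    // and returns the decrypted local key; the master key never leaves KMS.
    public byte[] resolve(byte[] encryptedKey) {
        return toBytes(kms.decrypt(new DecryptRequest()
            .withCiphertextBlob(ByteBuffer.wrap(encryptedKey))).getPlaintext());
    }

    private static byte[] toBytes(ByteBuffer buffer) {
        byte[] bytes = new byte[buffer.remaining()];
        buffer.duplicate().get(bytes);
        return bytes;
    }
}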

Regarding hadoop-crypto, I was considering it less for the key storage layer and more so for its in-memory representation of keys and its generation of ciphers from those key objects. Though the interface shouldn't expose hadoop-crypto objects, naturally.

Finally, to cap it all off, I'd imagine KeyManager instances would be provided by implementations of TableOperations and must therefore be serializable so they can be sent to, e.g., Spark executors.

Thoughts on the above proposal?

@mccheah
Contributor Author

mccheah commented Dec 1, 2018

I'm actually going to go ahead and put this into a prototype - we can add further discussion there.

@omalley
Contributor

omalley commented Dec 1, 2018

Ok, I'd separate out the key information from the encryption data:

struct KeyDescription {
  String keyName;
  int keyVersion;
}

struct FileKey {
  KeyDescription key;
  byte[] decryptedKey;
  byte[] encryptedKey;
}

struct FileEncryptionMetadata {
  KeyDescription key;
  byte[] encryptedKey;
  byte[] iv;
  String algorithm;
}

interface KeyManager {
   // Create a key for a single file.
   FileKey createFileKey(String keyName);

   // Decrypt a single file key
   byte[] decryptKey(KeyDescription key, byte[] encryptedKey);

   // Decrypt a list of keys
   List<byte[]> decryptKeys(KeyDescription key, List<byte[]> encryptedKeys);
}

The KMS doesn't need the path, iv, or the algorithm. But I think this is starting to look good.

@omalley
Contributor

omalley commented Dec 1, 2018

And of course, you could add List<FileKey> createFileKeys(String keyName, int keys);

@mccheah
Contributor Author

mccheah commented Dec 1, 2018

For creation it would be helpful to pass in the file path, and in fact I'm wondering if the name should always be the URI of the file. Thoughts? We rely on path-based lookup for our key storage system.

@omalley
Contributor

omalley commented Dec 1, 2018

You typically don't want a new master key per file, just a local file key, so the KMS shouldn't know or care about file or table paths. For example, you could have an entire set of tables protected with the "pii" key.

  1. The "user" table has a property saying to encrypt with the "pii" key.
  2. Each file has a unique file key that the KeyManager encrypts with the "pii" master key.

So to write, the job needs to generate random file keys and have them encrypted. The KeyManager just needs the master key name and gives back the key version, the encrypted file key, and the unencrypted file key. To read, we need to decrypt the file key, so the job passes the master key name, the master key version, and the encrypted key to the KeyManager.
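
Spelled out against the KeyManager, FileKey, and FileEncryptionMetadata structs sketched a few comments up (treating them as plain classes with public fields; the cipher wiring itself is elided), the two round trips look roughly like this:

public class EnvelopeFlow {
    private final KeyManager keyManager;

    public EnvelopeFlow(KeyManager keyManager) {
        this.keyManager = keyManager;
    }

    // Write side: one KeyManager call per file. The plaintext key encrypts the
    // data; the key name/version, the encrypted key, and a fresh IV go into
    // the manifest entry.
    public FileKey keyForNewFile(String masterKeyName) {
        return keyManager.createFileKey(masterKeyName);   // e.g. "pii"
    }

    // Read side: the manifest entry carries everything except the plaintext key.
    public byte[] keyForExistingFile(FileEncryptionMetadata metadata) {
        return keyManager.decryptKey(metadata.key, metadata.encryptedKey);
    }
}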

@mccheah
Contributor Author

mccheah commented Dec 3, 2018

> So to write, the job needs to generate random file keys and have them encrypted. The KeyManager just needs the master key name and gives back the key version, the encrypted file key, and the unencrypted file key. To read, we need to decrypt the file key, so the job passes the master key name, the master key version, and the encrypted key to the KeyManager.

This is different from the way we model key storage. We roughly follow this model: https://github.com/palantir/hadoop-crypto/blob/ac8d4474ee667121873bf0abf0674d83c78d8b90/crypto-keys/src/main/java/com/palantir/crypto2/keys/KeyStorageStrategy.java#L27, where for a given FileSystem the encryption key name is derived from the file path: https://github.com/palantir/hadoop-crypto/blob/6d9e05a1e667f150f7d98435e93a0dd6f3ea5c08/hadoop-crypto/src/main/java/com/palantir/crypto2/hadoop/EncryptedFileSystem.java#L115. Because we use hadoop-crypto right now for existing encryption we need to match the existing model. I was wondering how we could do this with the Iceberg API while still enabling the KMS model you propose.

@mccheah
Contributor Author

mccheah commented Dec 3, 2018

Let me also explain in more detail. In our system we don't have a mapping from file paths to encryption key names. We can't store that without changing the way our storage solution reasons about encryption. We are using Iceberg as an ephemeral representation of the data that will be read by Spark and written via Spark. On the write side specifically, our KeyManager plugin will need the file paths to store the encryption keys, because that's the only metadata we will have available to us on the read side. In other words, because some internal key storage and metastore solutions don't have a notion of a key name per file, we should pass in the file path to the KeyManager APIs so that they can use that as an additional option for their situations.

@mccheah
Contributor Author

mccheah commented Dec 3, 2018

One other way we could solve for both use cases would be as follows:

interface KeyManager {

    default String generateNewKeyName(String filePath) {
        return "iceberg-key-" + UUID.randomUUID();
    }

    // Create a key for a single file.
    FileKey createFileKey(String keyName);

    // Decrypt a single file key
    byte[] decryptKey(KeyDescription key, byte[] encryptedKey);

    // Decrypt a list of keys
    List<byte[]> decryptKeys(KeyDescription key, List<byte[]> encryptedKeys);
}

This allows implementations that want their own key-name derivation to provide it. For example, the name could be derived from the path, while by default, for the traditional KMS use case, it is completely random. The key name generator could also be its own module or interface.
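
For instance, a path-derived implementation of that hook might look like the sketch below; the SHA-256 scheme is only illustrative and is not how hadoop-crypto actually names keys:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

public class PathDerivedKeyNames {

    // The key *name* is derived from the file path so the storage backend can
    // look it up by path later; the key *bytes* are still generated randomly
    // by createFileKey, independent of the path and the name.
    public String generateNewKeyName(String filePath) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(filePath.getBytes(StandardCharsets.UTF_8));
            return "iceberg-key-" + Base64.getUrlEncoder().withoutPadding().encodeToString(hash);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}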

@omalley
Contributor

omalley commented Dec 3, 2018

Ok, thanks for explaining. I can at least see some of the disconnect now, which helps.

So from the hadoop-crypto point of view, the keys are the "file keys" in the sense I was thinking of. They write the file key out as a side file and encrypt it with a public/private key pair. That is relatively expensive and doubles the number of S3 objects. So their model would fit into my proposal, except that they have a single global master key (their public/private key pair).

I get that your implementation hashes the path to generate the file key, but I don't see how you secure it. Obviously the hash of the path isn't a secret. :)

@mccheah
Contributor Author

mccheah commented Dec 3, 2018

We would generate the name of the key using the path, but the bytes of the key under that name are generated securely. (Edit: meaning the name of the key has an association with the path, but the bytes of the key are generated independently of the path and the key name.) Additionally, the built-in solution that stores the key alongside the file is only the default implementation; one can choose to implement KeyStorageStrategy with another backend.

@omalley
Contributor

omalley commented Dec 3, 2018

I'd also argue that Iceberg can't assume paths are immutable. So using the hash of the path to reconstruct the secret key won't work.

@mccheah
Contributor Author

mccheah commented Dec 3, 2018

Sure, and the default implementation doesn't have to. I think the default implementation shouldn't actually assume the file path has any association with the key. The idea is that it doesn't hurt to have the API include the file URI somewhere when generating keys, because the key manager can thus know what it is encrypting, and that can influence how it wants to make the key name and the storage metadata of that key.

@rdblue
Contributor

rdblue commented Dec 6, 2018

We should also loop in @ggershinsky, who is working on the Parquet encryption spec.

@ggershinsky
Contributor

Parquet encryption has additional metadata parameters (such as per-file aadPrefix values, column keys, etc.), but since this is basically a column encryption format, it might be too early to factor this in. In any case, whenever needed, feel free to ping me; I'd be glad to assist with the Parquet part.

@mccheah
Contributor Author

mccheah commented Dec 11, 2018

@rdblue @omalley I moved the architecture to https://docs.google.com/document/d/1LptmFB7az2rLnou27QK_KKHgjcA5vKza0dWj4h8fkno/edit?usp=sharing. Please take a look and comment, and then we can continue with implementation.

@rdblue
Contributor

rdblue commented Dec 12, 2018

@mccheah, can you also start a thread on the dev list to point out this spec? I think other people who aren't necessarily following the GitHub issues will probably be interested.

@mccheah
Contributor Author

mccheah commented Mar 1, 2019

This is done. Thanks everyone!

@mccheah mccheah closed this as completed Mar 1, 2019