Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-1450: Detailed crypto specification #114

Merged
merged 14 commits into from
Jan 17, 2019

Conversation

ggershinsky
Copy link
Contributor

No description provided.

Encryption.md Outdated Show resolved Hide resolved
Encryption.md Outdated

Parquet-mr/-cpp implementations use the RBG-based IV construction as defined in the NIST
SP 800-38D document for the GCM ciphers (section 8.2.2).
The initialization vectors (IVs) in AES GCM and CTR ciphers must be unique for each encryption key,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be misunderstanding the NIST SP 800-38D document, but in my interpretation IVs do not have to be unique for each encryption key, but for each set of inputs instead. One can naturally use the same encrpytion key for multiple inputs, and it would be problematic to use the same IV for them, since identical inputs would lead to identical ciphertexts in this case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so I just realized that since the IV is stored in the EncryptionAlgorithm and the EncryptionAlgorithm is stored in the FileCryptoMetaData, it means that there is only a single IV for a whole Parquet file. This seems problematic to me, because:

  • Either we chain all blocks together, in which case one IV is enough, but we can not decrypt parts of the file without decryipting every block that preceeds it;
  • Or we reuse the same IV multiple times, in which case we can selectively decrypt parts of the file, but identical data will lead to identical ciphertexts, weakening the encryption (e.g., if two pages contain the same data).

Do I misunderstand something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Zoltan, have to run now, so will address this comment only, and the others tomorrow morning. We don't keep IVs in the EncryptionAlgorithm. We used to have a so-called iv_prefix for 8.2.1 - but we don't anymore, as per our discussion. Please check the Thrift structures in the doc - iv_prefix'es dont show up there. Once the doc is signed off, I'll send a pr removing these fields from parquet.thrift.
In any case - we have a fresh IV/nonce for each ciphertext (structure or page) in a file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks for the clarification. Just noticed the sentence "The IVs are written before each ciphertext.", sorry for not noticing it earler. Maybe you could make this a bit more explicit by saying that a new, unique IV is generated for every ciphertext and maybe also mention exactly how they are generated (using random number generation).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

Copy link
Contributor

@zivanfi zivanfi Oct 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could also add some diagrams to the corresponding sections like so:

IV (12 bytes) length (4 bytes) ciphertext (length bytes) authentication tag (16 bytes)

and so:

nonce (16 bytes) length (4 bytes) ciphertext (length bytes)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Zoltan is right. This states that the IV is unique for each key: "The initialization vectors (IVs) in AES GCM and CTR ciphers must be unique for each encryption key, since a single re-use of an IV breaks the encryption."

An IV must be unique for each encrypted stream. The key stream for each IV is identical, so I can take two encrypted streams with the same IV and XOR them to get the plain-text XORed together. If I can control or predict one of the plain-text streams, then I can decrypt without the key.

Encryption.md Outdated
struct ColumnChunk {
...
/** Crypto metadata of encrypted columns **/
8: optional ColumnCryptoMetaData crypto_meta_data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) Just noticed that this field is named *_meta_data, while other structs use *_metadata. You may consider renaming this field here and in the proto itself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is named after the original "meta_data" field in the same structure. Lets indeed rename the other *_metadata fields to *_meta_data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just noticed that the current FileMetaData structure has a field named key_value_metadata. So there are already metadata and meta_data fields..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer _metadata to _meta_data. There is only one meta_data field name, for ColumnMetaData in ColumnChunk, and there are multiple _metadata fields. Let's standardize on _metadata here and from now on.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change still needs to be made.

Encryption.md Outdated
the file - if constructed with a timestamp or version number, the AAD will allow readers to make sure they
work with a correct file version. If organization works with a number of different
AAD construction methods, the `aad_metadata` can contain the method ID. In any case, the AAD string itself
must not be kept in this field - a reader reliance on this AAD will defy the cipher authentication purpose.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this last sentece, could you please explain it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, the wording is not clear enough, I'll re-write this sentence. The point is as follows: say we have a claims processor that needs to run a query on yesterday's data, making sure it had not been replaced with an older data. The query will specify the AAD based on the expected file name: abcd_yesterday_date. There is no point in keeping this AAD in the file itself - because it can be tampered with / replaced - but mostly because the reader should know apriori what data version (date) it looks for.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to do some research to understand this, it turned out I had mistunderstood AAD earlier. I would propose a decription along the following lines: First state how AAD is used conventionally (AAD is not stored alongside the ciphertext, the ciphertext can only be decrypted when the AAD is provided to the decryption algorithm, essentially behaves like a second password but is used for an entirely different purpose). Then explain that in Parquet we do store the AAD as well (emphasizing that it is not the usual practice), because it allows more flexibility:

  • Readers that don't care about checking integrity can still read the data without providing an AAD, since they can read it from the file itself.
  • Readers that want integrity checking can provide an expected AAD and encryption will fail unless the same AAD was used for encryption (thereby ensuring integrity).
  • Readers that provide multiple AADs can first check whether the stored value is amongst the provided AADs and then do the integrity check using that single AAD instead of trying to decrypt the data using all providedf AADs.

Since this does not seem to be the usual practice, please also explain why storing the AAD does not degrade its security.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, okay, so reading your description once more, it seems that I misunderstood again. :) The fields should not store the AAD itself, just some way to identify it (select which one to use).

Encryption.md Outdated

Only the `FileCryptoMetaData` is written as a plaintext, all other file parts are protected
The same magic bytes are written at the beginning of the file (offset 0).
Parquet readers start file parsing by reading and checking the magic string. Therefore, the encrypted footer mode uses a new magic string ("PARE") in order to immediately inform legacy readers (expecting ‘PAR1’ bytes) that they can’t parse this file.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say the purpose of a different magic string is not to inform legacy reader that they can't read the file (they would fail anyway), but instead to inform new readers to look for a file crypto metadata at the end of the file instead of a footer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, this is the main reason. Still, having an early and clear exception in legacy readers, informing of a wrong file type (instead of a more obscure exception later on) is helpful too. I'll update the text accordingly.

Encryption.md Outdated Show resolved Hide resolved
Encryption.md Outdated
length of `aad_metadata` is 512 bytes. Typically, readers would know the file AAD without relying
on this field, using instead a method adopted by the organization. For example, the AAD can be the name of
the file - if constructed with a timestamp or version number, the AAD will allow readers to make sure they
work with a correct file version. If organization works with a number of different
AAD construction methods, the `aad_metadata` can contain the method ID. In any case, the AAD string itself
must not be kept in this field - a reader reliance on this AAD will defy the cipher authentication purpose.
AAD construction methods, the `aad_metadata` can contain the method ID; otherwise, the `aad_metadata` can
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you remove the sentence that the AAD itself must not be kept in this field? I think that was one of the most important points. In fact, I think you should elaborate on that and describe the intended use of this field. If I understand correctly: if there are multiple AADs in use, we need a way to decide which one to use in each case. Storing some kind of ID or reference to the AAD in this field solves this purpose. Is this the intended use case? Is there any other one? Please describe them here briefly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out there is a usecase where it might make sense to store the AAD in this field. Say we have a dataset with 3 files: customers_ny_oct18, customers_pa_oct18 and customers_ct_oct18. We can store these strings as AADs in the 'aad_metadata' field - as long as the reader knows it must get Oct18 version and checks for "customer_??_oct18" AADs. The main message is in the last part - a reader must know what dataset it needs, and what AADs should be there. I'm adding a section on AADs to the doc.

Encryption.md Outdated
possible to infer from other metadata. This might be exploited for an exotic
attack on file integrity (such as swapping dictionary pages of two columns). While these
attacks succeed under rare conditions and in any case don't harm data confidentiality -
they can be prevented by using the offset as a part of the GCM AAD formation. Therefore,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The offset of the very first page of every Parquet file is the same. Does this mean that a malicious person can swap the first pages of two Parquet files without knowing the encryption key?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, this won't work, but another attack might: replacement of column pages in row group 0 by pages from the same column in row group 1.

@ggershinsky
Copy link
Contributor Author

ggershinsky commented Oct 30, 2018 via email

@zivanfi
Copy link
Contributor

zivanfi commented Oct 31, 2018

Looks good to me. One final thing to consider (at least in my opinion) is the mixing of IVs and nonces in the terminology. Currently these terms are used somewhat inconsistently. I would prefer if the docs either always used IV or if it explicitly called out that the correct terminology depends on the encryption mode in use.

Additionally, according to this discussion, a nonce seems to be a weaker term than an IV, since a nonce only requires uniqueness, while an IV also requires unpredictability, so the first can be a regular counter but the latter must be random. This sounds like one more reason not to mix their usage.

@ggershinsky
Copy link
Contributor Author

As Ryan has pointed out, GCM usually uses "IV" terminology, while CTR calls them "nonces". I'll add this detail to the doc.

The referenced discussion is related to a different AES mode (CBC), where an IV indeed must be unpredictable. There is no such requirement in GCM or CTR modes. In general, the term IV refers to a byte array used for cipher initialization, with requirements depending on the cipher mode (eg there are modes where an IV doesnt have to be unique).

@julienledem
Copy link
Member

@rdblue: Gidon has updated the doc based on your comments. Does it look good to you now?

Encryption.md Outdated
while allowing for a regular Parquet functionality (columnar projection,
predicate pushdown, encoding and compression). The mechanism also enables column access
predicate pushdown, encoding and compression). This mechanism also enables column access
Copy link
Contributor

@rdblue rdblue Nov 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this document should make the claim that this "enables column access control". Using different encryption keys can be used in a scheme that provides some form of access control, but in itself it doesn't provide what we commonly think of as "access control" because there is no way to revoke that access.

This document should be clear about what the feature provides -- the ability to encrypt columns with different keys or not encrypt a column -- and should not make claims about security guarantees that can be partially built with these features.

Encryption.md Outdated
@@ -39,32 +39,32 @@ speed and access control granularity.
2. Implement "client-side" encryption/decryption (storage client). The storage server
must not see plaintext data, metadata or encryption keys.
3. Leverage authenticated encryption that allows clients to check integrity of the
retrieved data - making sure the file (or file parts) had not been replaced with a
retrieved data - making sure the file (or file parts) have not been replaced with a
wrong version, or tampered with otherwise.
4. Support column access control - by enabling different encryption keys for different
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I noted above, this goal should be rephrased to state that columns can use different encryption keys. It should not claim to provide access control.

Encryption.md Outdated
with its version (or creation timestamp) as the AAD, to verify the file has not been
replaced with an older version.
with its version (or creation timestamp) as the AAD, to verify that the file has not been
replaced with an older version (see below).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See what below? Which section? This should be a description, ideally with a link.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the AAD section below, an AAD scheme is specified. If an AAD scheme is required, then the statement that the user can do something is false. This encryption spec requires that the AAD is used to authenticate streams such that no stream can be replaced. with one from a different file or moved.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From below:

Parquet modular encryption uses the AAD mode to protect against swapping ciphertext modules inside a file, between files - or against swapping full files (for example, replacement of a file with an old version)"

Encryption.md Outdated
columns in an encrypted file.
9. Miminize overhead of encryption: in terms of size of encrypted files, and throughput
9. Minimize overhead of encryption: in terms of size of encrypted files, and throughput
of write/read operations.


## Technical Approach

Each Parquet module (footer, page headers, pages, column indexes, column metadata) is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of "module" is confusing later because it isn't used elsewhere it Parquet documentation. Please add a definition instead of using it and defining it at the same time.

Encryption.md Outdated
written after each ciphertext. The IVs (12 bytes) are written before each ciphertext.
All modules are encrypted by an AES-GCM cipher, without padding. Unique IVs are generated and
written before each ciphertext. The IV length is 12
bytes (96 bits), which is generally considered to be the optimal size of GCM IVs. The NIST
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove statements of opinion, like "... which is generally considered to be the optimal size of ..."

Encryption.md Outdated


### AES_GCM_V1
All modules are encrypted with the AES-GCM cipher. The authentication tags (16 bytes) are
written after each ciphertext. The IVs (12 bytes) are written before each ciphertext.
All modules are encrypted by an AES-GCM cipher, without padding. Unique IVs are generated and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the padding statement clear enough? Is there a reference we could point to that shows how to encrypt the last, under-filled block?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm using the standard APIs for that, such as
javax.crypto.Cipher.getInstance("AES/GCM/NoPadding"); in Java.
As for a reference, we can point to
https://crypto.stackexchange.com/questions/42412/gcm-padding-or-not
or
https://docs.oracle.com/javase/9/docs/api/javax/crypto/Cipher.html
or some other reference

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mode used for a Java library is not sufficient to document this. A reference for exactly what that mode does would be, if not a description here of how this is done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, makes sense. I'll try to find one, or let me know if you come across something suitable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, it looks like the most authoritative references for this are
https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38a.pdf
https://tools.ietf.org/html/rfc3686

CTR is explicitly excluded from the list of padded ciphers in the former. The latter says:
"AES-CTR does not require the plaintext to be padded to a multiple of the block size"

Also, we have these from different sources:

"because CTR mode (contrary to CBC mode) does not require padding to block boundaries, and AES-CTR would also be compatible with no padding at all, or with strict PKCS#5 padding"

"Internally GCM really is CTR mode along with a polynomial hashing function applied on the ciphertext. CTR-mode doesn't need padding because you can just partly use the bits the last counter block generated and the polynomial hash does use (zero-)padding."

"GCM is a streaming mode which means that the ciphertext is only as long as the plaintext (not including authentication tag). GCM doesn't require a padding. This means that the PKCS5Padding version is actually only a synonym for NoPadding for convenience during programming. Some providers don't have this strange mode."

So, while the NIST spec text is not great on this, the picture is clear. GCM/NoPadding implementations will be compatible in any language and crypto provider. Same is true for CTR/NoPadding.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there "no padding" modes for C++, Python, etc.? The libraries you mention are all JVM-based.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OpenSSL is C++, we're using it in the parquet-cpp implementation. Tested for interop with the parquet-mr.
Anyways, adding the no-padding details to the spec.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I just want to make sure that there isn't any confusion that would cause incompatibility. If based on the available APIs you think it is perfectly clear, then that is fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Starting to look at this, I've realized I'm not comfortable with providing details that are not provided in the NIST specifications for these ciphers. The reason NIST explicitly ignores padding for the CTR and GCM modes is likely to be this: ".. CTR modes do not require any special measures to handle messages whose lengths are not multiples of the block size, since the modes work by XORing the plaintext with the output of the block cipher".
Therefore, I think we need to require from a Parquet implementer to use a cryptographic provider that implements these two ciphers according to the NIST specification. That will be sufficient to guarantee an interop. As an example, we already know the SunJCE and OpenSSL providers are interoperable, since we have tested this in the parquet-mr and parquet-cpp implementations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, then let's leave it at "no padding" then.

Encryption.md Outdated
Thrift modules are encrypted with the AES-GCM cipher, as described above.
The pages are encrypted with AES-CTR, where the IVs (16 bytes) are written before each
ciphertext.
Thrift modules are encrypted with AES-GCM, as described above. The pages are encrypted by an AES-CTR
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mixes the layout for AES-GCM or AES-CTR with the Parquet modes, AES_GCM_v1 and AES_GCM_CTR_V1.

Those should be separate to be more clear for implementers. This should have a section for AES-GCM and AES-CTR that specifies the layout (CTR: IV, length, ciphertext; GCM: IV, length, ciphertext, auth tag) and a separate section that specifies the Parquet modes: AES_GCM_V1 that uses AES-GCM mode used for all encryption, and AES_GCM_CTR_V1 that uses AES-GCM mode for metadata structures and AES-CTR for page data.

Encryption.md Outdated
The pages are encrypted with AES-CTR, where the IVs (16 bytes) are written before each
ciphertext.
Thrift modules are encrypted with AES-GCM, as described above. The pages are encrypted by an AES-CTR
cipher without padding. The nonces are generated and written before each ciphertext, with the nonce length of 16 bytes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either "nonce" or "IV" should be used throughout the doc. Terminology should not change between GCM and CTR.

Encryption.md Outdated
ciphertext.
Thrift modules are encrypted with AES-GCM, as described above. The pages are encrypted by an AES-CTR
cipher without padding. The nonces are generated and written before each ciphertext, with the nonce length of 16 bytes
(128 bits). At least 12 bytes out of 16 are constructed with an RBG.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are nonces and IVs different lengths? Why can't a 12-byte IV that is entirely generated with an RBG be used for CTR mode?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://crypto.stackexchange.com/questions/41601/aes-gcm-recommended-iv-size-why-12-bytes/41610
GCM runs CTR internally which requires a 16-byte counter. The IV provides 12 of those, the other 4 are an actual block-wise counter.
The AES block size is 128 bits, that's 16 bytes.
Out of those 16, the first 12 are the IV and the remaining 4 are the counter (starting from 1).

https://stackoverflow.com/questions/4608489/how-to-pick-an-appropriate-iv-initialization-vector-for-aes-ctr-nopadding
use a cryptographically secure random number generator, and create a new 16-byte random IV for every message. I underline "cryptographically secure" because that's important; a basic random number generator is not enough. With Java, use java.util.SecureRandom. With Win32, call CryptGenRandom(). With a random selection, the space of possible 128-bit IV is large enough that collisions are extremely improbable. Actually, that's why AES uses 128-bit blocks (thus implying 128-bit IV).

https://security.stackexchange.com/questions/90848/encrypting-using-aes-256-can-i-use-256-bits-iv
AES uses 128-bit blocks, so a 128-bit IV. Note that AES-256 uses a 256-bit key (hence the name), but still with 128-bit blocks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The links don't address my question. AES-GCM requires a 12-byte IV, to which a 4-byte counter is added to get the nonce for AES-CTR. Why doesn't the counter mode do the same and take an IV to which it adds the counter? Is this a quirk of the API for the AES-CTR library you're using, or is there a real difference between how the two are handled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works like this:

AES CTR
requires a user to provide a 16 byte IV for each encryption operation [API call to convert n-byte plaintext into n-byte ciphertext - takes the following input parameters: key (16 bytes), IV (16 bytes), plaintext offset, and length n].

Per the quotes above, the IV can be constructed by the user in one of the two following ways:
a. First 12 bytes are random bytes, the other 4 are set to a value of 1.
b. All 16 bytes are random.

Inside, CTR implementation will use the last 4 bytes as a counter field (starting with the provided value) when running over n bytes split into 16 byte chunks.

AES GCM
requires a user to provide a 12 byte IV (see NIST spec) for each encryption operation [API call to convert n-byte plaintext into n-byte ciphertext and 16-byte tag - takes the following input parameters: key (16 bytes), IV (12 bytes), plaintext offset, and length n].

Per the NIST spec section 8.2.2, the IV is constructed by the user as a 12-byte random array.

Inside, GCM runs a CTR cipher that gets a 16-byte IV, constructed from the 12-byte array provided by user, and extra 4 bytes provided by the GCM implementation itself. The CTR cipher then runs as described above ("will use the last 4 bytes as a counter field (starting with the provided value) when running over n bytes split into 16 byte chunks").

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, then it sounds like the problem is just with the API that is used. One accepts a 12-byte IV and constructs a 16-byte IV+counter, the other accepts a 16-byte nonce where 12 bytes are an IV and it overwrites some bytes with the counter.

For Parquet, we don't need two formats. The encryption library should use the same 12-byte IV to construct a nonce that is acceptable for passing into AES-CTR if that API is different.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. The CTR option a. above allows us to use 12 bytes for IV storage in Parquet for both GCM and CTR ciphers.

Encryption.md Outdated
the file.

Parquet constructs a module AAD from two components: an optional AAD prefix - a string provided
by the user for the file, and an AAD suffix, built internally for each encrypted module inside
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For each AES-GCM encrypted module.

Encryption.md Outdated
the file. The AAD prefix should reflect the target identity that helps to detect file swapping -
for example, table name and version or date / timestamp (e.g., "employees_23_05_2018"). The
AAD suffix reflects the internal identity of modules inside the file, which for example prevents
replacement of column pages in row group 0 by pages from the same column in row group 1. The module
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of "pages" use "AES-GCM ciphertext" because pages may use AES-CTR.

Encryption.md Outdated

A file writer passes an AAD prefix string upon file creation, that allows to differentate it from other
files and datasets. The reader should know this string apriori, or should be able to retrieve and verify
it, using a convention accepted in the organization.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not generate a random AAD prefix?

If Parquet relies on users to supply a prefix, then it allows the case where files are not protected by an AAD or where files use the same AAD. That allows attackers to swap ciphertext streams when the user has been lazy or doesn't understand the purpose of the AAD.

A random AAD prefix would ensure that ciphertext cannot be moved between files and the AAD suffix ensures that ciphertext cannot be moved between row groups or columns.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another alternative is to allow users to pass a prefix, but to add a random portion to the suffix that makes the file AADs unique.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Random AAD prefix is possible, but requires a trusted service where these prefixes are stored for every file. A reader must be able to retrieve an AAD prefix for every file. Since the data storage is untrusted (thats why we use the keys and AADs in the first place), the AAD prefixes must be stored in a secure location, and serviced by an authenticated process.
Unlike keys, AADs are not secret. Hence, having a "well known" method of AAD prefix creation (such as <table name>_<date> or <table name>_<version>) allows writers and readers to create/verify the file AAD prefix without a remote service.
Still, I'll add your suggestion as an example of possible AAD prefixes and their store/fetch service.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The AAD prefix is stored in the encryption metadata. I'm suggesting adding a per-file random to that. Users can add to the AAD prefix if they choose, but if they don't supply anything, then something unique to the file should be there. That way even if users are dumb, files are protected from replacing columns or pages with encrypted ones from other files. Isn't that the point?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AAD prefix is not necessarily stored in the aad_metadata. The "well known" method approach, described above, doesn't require setting the aad_metadata at all. But I agree it does require smart users, that can generate and verify unique AAD prefixes. Dumb users should work with a trusted service that keeps the random AAD prefixes created for each file. I'll add this to the text (maybe avoiding the dumb/smart terminology :). Similar to key metadata section, we'll have a number of example usages of the aad metadata field.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The example above describes the "smart user" scenario - because both writer and reader know the AAD without a help from any service. They just use a convention "well known" in the organization for table AAD creation. That is, the same code is running in both writer and reader, creating matching AADs for every table.

Here goes the "dumb user" scenario - which also doesn't require the aad_metadata field.

  1. We use a random string as the file AAD.
  2. Writer generates a random string and uses it as an AAD for the file. It also sends the table name ("employees"), month of creation, "10_<CURRENT_YEAR>", and the generated AAD, to a trusted AAD service. It might also send the file name ("employees_10_<CURRENT_YEAR>).
  3. Reader runs on the day of 9 Nov 2018 - it goes to the trusted (authenticated) AAD service, and asks for the AAD for the latest version of employees table. The service sends the AAD, and the file name. The reader opens the file, using the AAD to verify its authenticity.
  4. As before, we don't need "aad_metadata" for any of these. Putting the random AAD in this field or elsewhere in the file is not safe, can be modified by attacker.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The attack I'm addressing is slightly different. In this scenario, users don't know about supplying an AAD. User AAD metadata is empty in the 2017 version and the 2018 version.

The suffix part of the AAD is used to prevent pages from being swapped, so I can't overwrite row group 0 column 1 page 3 with row group 0 column 1 page 4.

But there isn't anything that prevents an attacker from replacing row group 0 column 1 page 4 with the same page from the 2017 version of the file. A per-file unique that is stored in the file would prevent page swapping between any 2 files, even if the unique string is in the file metadata.

This would not protect against copying 2017 over 2018 like your example, but it does at least ensure the pages a file has not been tampered with, in relation to each other.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. Protecting against page swapping between files - without being dependent on user input. Sounds good, adding aad_file_unique to the EncryptionAlgorithm structs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the described scenarios a bit oversimplied. Of course, one can use Parquet in the way desribed (separate Parquet files for each month each being read explicitly), but I think in the much more common scenario a directory of Parquet files will correspond to a table without a one-to-one mapping between files and meaningful attributes like months.

When one uses MapReduce or RDD to process big data, different parts of the dataset get processed and written by different machines of the cluster. Instead of writing the data to a single file, each machine writes to a separate file. The individual file names don't even matter.

Additionally, when more data is written to the same table, it is not appended to existing files but new files get added. Impala can't even write arbitrary amount of data into a single Parquet file, it starts a new one for each block instead.

Could the discussed integrity checking measures be used to protect against (partial or full) file-swapping/duplicating/deleting attacks in such a scenario?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, the PARQUET-1457 addresses this scenario. As for the AAD prefix section in this doc - I'm currently updating it with more details - with the same focus on integrity of individual files, since Parquet API (ParquetFileWriter and ParquetFileReader) is designed for writing/reading individual files. Data sets (tables, dataFrames, etc) are handled by frameworks above Parquet. The data set integrity can be handled by framework tools, with or without PARQUET-1457 help. I'll mention this in the doc.

Encryption.md Outdated
confidentiality - they can be prevented by using the column and row group identity as a part of GCM AAD
formation. Therefore, the `PageHeader` structures and the pages (in case of AES_GCM_V1) are enciphered using
an AAD suffix built by direct concatenation of two parts: the row group offset in the file (8 bytes, little endian)
and the column name (for nested columns, a concatenated path with "." separator) serialized with the UTF-8 charset.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An AAD suffix constructed this way could go over the size. Why not use column ordinal, which is required by Parquet to match the order in the file schema?

I think the suffix should be made from a per-file random, row group ordinal, column ordinal, and an id for the type of structure encrypted (e.g., data page, data page header, etc.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The limit on AAD size is in gigabytes, so the column name and row group offset are ok. The ordinals can also work, but they require extra code, and also can't be verified directly since they are not kept in the protected structures such as ColumnMetadata and RowGroup.
We'll add per-file random and the structure type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked the code - the column ordinal will be feasible, as both readers and writers have a simple loop on column chunks. I agree this will save the path String concatenation and serialization. We need to decide whether to use 2 or 4 bytes for the column ordinal component in AAD suffixes. Is there a limit on the number of columns? Is 32K enough?

The situation with row groups is different. In readers, the RowGroup list is filtered inside the FileMetaData structure, before being processed by a higher level code. Therefore, we can't rely on the list/loop count.
On the other hand, the cost of serialization of 8 byte longs is basically as negligible as the cost of serialization of 4 byte ints or 2 byte shorts. Therefore, we can use the row group offset (kept inside RowGroup structures) as a part of AAD suffix.
Or we can add an i32 or i16 "ordinal" field to the RowGroup structure.
I'm fine with either option, let me know your preference.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

32k sounds like enough to me.

I'm fine either adding an ordinal or using the offset. I slightly prefer ordinal since it sounds generally useful. And a 2-byte ordinal would work here, too.

Encryption.md Outdated
@@ -132,49 +203,164 @@ the encryption buffer is comprised of an IV, ciphertext and tag, as described in
Algorithms section. The length of the encryption buffer (a 4-byte little endian) is
written to the output stream, followed by the buffer itself.

|length (4 bytes) | IV (12 bytes) | ciphertext (`length` bytes) | tag (16 bytes) |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for including this diagram.

Why is IV between length and ciphertext? I think IV should come before length. Otherwise, it isn't clear that length is the ciphertext length and not the length of IV, ciphertext, and tag, or for CTR, just IV and ciphertext.

Encryption.md Outdated
compression and encryption) is kept in the page headers.

A `crypto_meta_data` field in set in each `ColumnChunk` in the encrypted columns.
| IV (12 bytes) | ciphertext (`page_size` bytes) | tag (16 bytes) |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather have length of the ciphertext stored in all cases so that these are always the same. The overhead isn't that high.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason is not the storage overhead. Having the length here, exposes the offset and size of every page - which can be exploited for integrity and other attacks. Storing the length inside encrypted and tamper-proofed PageHeaders makes the page offset/size invisible (only the offset of the first page is known; the size of the first page and offset/size of all other pages are hidden).
And saves a bit of file space.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you describe an integrity attack that uses this information?

In general, if you're worried about the length of the ciphertext giving away information, I think you should be addressing that directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E.g. page swapping attacks - but mostly the attacks we don't know about today, its impossible to foresee all possible future threats. Given two options - one that hides the position/size of every ciphertext (with its IVs, tags, etc), at no cost - and one that exposes them all, with storage overhead - we should choose the first one, unless there is a very strong reason not to.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AES-GCM and AADs are used to avoid page swapping attacks.

I would rather all encryption buffers be stored the same way, but I don't have a strong opinion here.

@julienledem or @zivanfi: do you have an opinion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. 'plaintext' threw me off, we keep the ciphertext length, but you mean the plaintext length before encryption.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with using option 1 for all IVs.
Sounds good, adding to the spec.

Because the page header must have a length, an attacker can identify where each encrypted page header starts and work backward from there.
Yep, possible, but with a certain effort / probability of success. If we replicate and explicitly expose the page length, an attack will require no effort and will be 100% successful. I guess my point here - security is not a binary thing, its often about defense in depth, making an attacker work harder to learn something about the sensitive data. Sometimes, its worth an effort on the defender part to hide things. But in this case, we don't even have to make an effort - we get this feature out of box. We just shouldn't make an effort to expose this field; especially since we won't use it; only an attacker would. More on this later.

security through obscurity
Agreed, we don't want this. The only place I relied on obscurity was for page swapping attacks within the same chunk. I'm adding a page ordinal to the spec (4 bytes; but let me know if 2 are enough). With this addition, the spec won't have any dependency on obscurity, both implicitly and explicitly (it is not mentioned as a part of the security design).

hiding the lengths ... makes this spec harder to implement
That's where we differ. Page length is kept in page headers and read from there by the existing Parquet code. Once the page header is encrypted, we don't need to make any effort to hide the length. Exposing the length will actually require an effort and Parquet code changes, as follows:

  1. Writer will need to serialize the length and write it to the stream.
  2. Reader doesn't need this field. We will have to change Parquet reader code - to either read and deserialize this field, in addition to the length in PageHeaders - or to ignore it, which will require some form of data skipping, with or without a seek.

Again, all this might be done, for a good cause. But in my view, the net result of this effort would be helping an attacker.

Copy link
Contributor

@rdblue rdblue Nov 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the length of a plaintext page is something that should be protected, then Parquet should pad before encrypting. Anything else is security through obscurity, which you agree we should not employ here. I don't buy the argument about making it harder. Page headers are reliably short and those lengths will be easy to identify.

Hiding the length does make this harder to implement. Instead of a function to go read an encrypted buffer and return its bytes, you need to have two methods: one with an explicit length and one without. Instead of having a method that encrypts bytes and writes out a buffer, you need two: one that write the length and one that only returns it. You need to make sure that the right method is used in the right places. And worse, you must make it absolutely clear for other people that are implementing the spec.

Any time you add special cases, it makes a spec harder to implement.

This comes from the same motivation for using only one IV format. The fewer cases, the easier this is to implement. That makes it more reliable.

I'm against trading clarity in the spec just to obscure something that can be easily recovered and isn't a problem anyway. If you want to hide the length of the plaintext, then hide it by padding before encryption. Don't obscure it at the cost of simplicity.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me know if 2 are enough

2 bytes for a page ordinal within a row group should be sufficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Given the technical trade-offs involved, this is not worth delaying the work.

Encryption.md Outdated
(potentially, along with file names/paths).

#### 4.4.2 AAD_Suffix
File integrity can be disrupted by module swapping inside the file - such as replacement
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These sentences are written in a misleading way. They state that "file integrity can be disrupted by module swapping inside the file" first, and only later introduce the AAD_suffix to prevent it.

Instead, this should state that the purpose of the AAD_suffix is to prevent the following integrity attacks, then have a list of some of the swapping attacks. That way, the reader knows that this specification prevents those attacks.

Consider a similar change to the AAD_prefix section above.

Encryption.md Outdated
of column pages in row group 0 by pages from the same column in row group 1. Also, if an encryption
key is re-used in multiple files and the user has not provided a unique AAD_Prefix for each file – an
attacker can swap modules between the files. While these attacks succeed under rare conditions and in
any case don't harm data confidentiality - they can be prevented by using an internally generated file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are prevented, at least in full GCM mode.

Encryption.md Outdated
and module identity as a part of GCM AAD formation in Parquet.

The AAD_Suffix is built by direct concatenation of the following parts:
1. [All modules] random byte array (file identifier): variable length
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that "(file identifier)" is sufficient. Be more clear that this is a random byte array generated once per file and stored in the footer encryption struct.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think just referring to this as the "file identifier, a random byte array generated for each file of implementation-defined length" works.

Encryption.md Outdated

| |File ID | Module type | Row group ordinal | Column ordinal | Page ordinal|
|----------------------|------------|-------------|-------------------|----------------|-------------|
| Footer | V | V (0) | X | X | X |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do V and X indicate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

V for check (yes), X for no.
Suggestions on alternatives are welcome.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably use a character to indicate it is used and blank to indicate it is not. V for a checkmark isn't clear.

Whatever characters you decide to use, the convention should be explained somewhere. Readers should never guess what something means in a spec.

Encryption.md Outdated
|-----------------|------------------|------------------------------|


A `crypto_meta_data` field is set in each ColumnChunk in the encrypted columns. ColumnCryptoMetaData
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that the decision was to use "metadata" instead of "meta_data"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parquet-cpp has released the version 1.5.0 with these crypto structures.
https://github.com/apache/parquet-cpp/blob/apache-parquet-cpp-1.5.0/src/parquet/parquet.thrift#L708

While this version cant read files with encrypted footer, or encrypted columns in files with plaintext footer - it should be able to read plaintext columns in encrypted files with plaintext footer - as any legacy reader.

However, this means we cant rename a couple of fields, including this one. The cpp thrift above has:

struct ColumnChunk {
  8: optional ColumnCryptoMetaData crypto_meta_data

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the point of field IDs was that we can rename fields. It may just work.

As for the 1.5.0 release of parquet-cpp, I don't think we intend to support files written with the encryption structures from that release. The spec was not approved.

And aren't we making a lot of other changes to the spec anyway? I would expect more than just this to break. Column metadata, for example, has moved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for the 1.5.0 release of parquet-cpp, I don't think we intend to support files written with the encryption structures from that release. The spec was not approved.
Yep, but this version can't even write files with encryption structures (they are not set, simply because it has no code that drives encryption). My concern is different - this version, like any other legacy parquet reader, should be able to read plaintext columns in encrypted files (with plaintext footer), written by new writers.

And aren't we making a lot of other changes to the spec anyway? I would expect more than just this to break. Column metadata, for example, has moved.
things should be ok for parquet-cpp-1.5.0 working as a legacy reader on plaintext columns. Unless there is a name collision on a couple of crypto fields in ColumnChunk and FileMetaData. Keeping eg the name crypto_meta_data instead of crypto_metadata seems to be a reasonable price to pay. I'll also experiment with your suggestion on fields IDs - if this works, then no problem with renaming them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think IDs should work here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They do! The names seem to be used just for API generation.
Renaming the fields.

Encryption.md Outdated
1: required list<string> path_in_schema

/** Retrieval metadata of the column-specific key **/
2: optional binary column_key_metadata
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be named "key_metadata" since it is already in a column-specific structure?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Encryption.md Outdated
readers get the public or symmetric key, verify the signature file and get the AAD prefixes for the table files
(potentially, along with file names/paths).

#### 4.4.2 AAD_Suffix
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this "AAD_Suffix" and not "aad_suffix"? I would prefer keeping it lower-case since that's used elsewhere. For example, "aad_metadata".

...

/** Column metadata for this chunk.. **/
3: optional ColumnMetaData meta_data
Copy link
Contributor

@rdblue rdblue Dec 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know how older readers act when meta_data is null?

That will be the case with an unencrypted footer and encrypted columns.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind, I see that meta_data is still required in plaintext footer mode. Glad to see you already caught it!

Encryption.md Outdated
number of rows, key-value properties, column sort order, names of the encrypted columns
and metadata of the column encryption keys.

The columns encrypted with the same key as the footer leave the column metadata at the original
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, can you update this to "must leave"? I don't want this interpreted as optional when the key is identical.

Encryption.md Outdated
/** Column metadata for this chunk.. **/
3: optional ColumnMetaData meta_data
..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to copy the crypto_metadata field in this snippet as well, to be clear that it is along side encrypted_column_metadata and meta_data.

is Thrift-serialized, encrypted with the column key and written to the `encrypted_column_metadata`
field in the `ColumnChunk` structure, as described in the section 5.1.

A Thrift-serialized `FileCryptoMetaData` structure is written before the encrypted footer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A table diagram of the layout like the ones for encrypted buffers would help clarity here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be visible in the figures I'm working on.

Encryption.md Outdated
}
```

<!--
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is the figure placeholder. I'll uncomment it when adding them back next week.

Encryption.md Outdated

/** Retrieval metadata of key used for encryption of footer,
* and (possibly) columns **/
2: optional binary footer_key_metadata
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be just key_metadata since it is in a footer-specific struct?

Encryption.md Outdated
-->


### 5.4 New fields for vectorized readers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What broke that requires this?

I think this is an addition to the Parquet format, not to encryption. Why is it needed for encryption but not for normal vectorized reads? Can't total compressed size be inferred from page locations?

Also, total_compressed_size is misleading. It is total size on disk, not just the compressed size. Page buffers are compressed, but may also be encrypted. Page headers aren't compressed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This helps normal vectorized reads too - instead of looping on all column chunks, they can just read the row group size field.

The column chunk length is not available in files with encrypted footer (unless a reader has all column keys) - therefore, these new fields are required there. It might be possible to reconstruct the size from page lengths and other available file metadata, but it wont be easy or elegant ( eg the row group offset is also unknown, so we wouldn't know where the pages start; its also not obvious where they end for each row group).
I think adding these two fields is the right thing to do for both regular and encrypted files.

Regarding the name total_compressed_size - it copies an existing field in ColumnMetaData structure, called the same:

/** total byte size of all compressed pages in this column chunk (including the headers) **/
  7: required i64 total_compressed_size

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regarding section 5.4 - we can replace it by adding a short text in section "5.2 Encrypted footer mode", explaining that these two fields must be set in RowGroup structures in files created in this mode.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding these two fields is the right thing to do for both regular and encrypted files

If that's the case, then we should add the fields to the spec as a separate change.

A reader can find the starting point of all of the columns it intends to read and work from there. Vectorized reads don't necessarily need a stopping point and that can be inferred most of the time.

I don't see a need to over-complicate this spec with this concern, let's do it separately.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

Encryption.md Outdated
## 6. Encryption Overhead
The size overhead of Parquet modular encryption is negligible, since most of the encryption
operations are performed on pages (the minimal unit of Parquet data storage and compression).
The overhead order of magnitude is adding 1 byte per each ~30,000 bytes of original compressed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine because it is a fact. But what is this based on?

Overhead is going to depend on the total size of data after encoding and compression. So overhead is dependent on the data that was written.

Encryption.md Outdated
The throughput overhead of Parquet modular encryption depends on whether AES enciphering is
done in software or hardware. In both cases, performing encryption on full pages (~1MB buffers)
instead of on much smaller individual data values causes AES to work at its maximal speed.
Preliminary tests show Parquet modular encryption throughput overhead to be up to a few
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A spec should not contain statements that will be outdated later as more data is gathered. The spec should not argue for itself to exist. It should simply and clearly state its requirements and little else.

@ggershinsky
Copy link
Contributor Author

@rdblue As before, the comments I haven't replied to, sound good. Please review the replies.

@rdblue
Copy link
Contributor

rdblue commented Dec 6, 2018

@ggershinsky, I replied, I think we're mostly in agreement.

@rdblue
Copy link
Contributor

rdblue commented Dec 14, 2018

@ggershinsky, looks good to me. Do you want to call a vote or do you want me to?

@ggershinsky
Copy link
Contributor Author

Cool, will do.

@rdblue
Copy link
Contributor

rdblue commented Jan 17, 2019

The vote passed! I'm going to merge this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants