Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-1517: [C++] Crypto package updates to match the final spec #3520

Closed

Conversation

ggershinsky
Copy link
Contributor

@ggershinsky ggershinsky commented Jan 29, 2019

An initial version of crypto package is merged. This PR updates the crypto code to

conform the signed off specification (wire protocol updates, signature tag creation, AAD support, etc)
improve performance by extending cipher lifecycle to file writing/reading - instead of creating cipher on each encrypt/decrypt operation

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like you're fixing some bugs here. Can you explain those briefly in the PR description? Also, do we have some regression tests for this that would deserve updating?

@@ -196,10 +204,16 @@ int ctr_encrypt(const uint8_t* plaintext, int plaintext_len, uint8_t* key, int k

uint8_t iv[ctrIvLen];
memset(iv, 0, ctrIvLen);
iv[ctrIvLen - 1] = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment explaining the initialization here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Parquet, an input to the CTR cipher is an encryption key, a 16-byte IV and a plaintext. IVs are comprised of a 12-byte nonce and a 4-byte initial counter field. The first 31 bits of the initial counter field are set to 0, the last bit is set to 1.

Will add a comment in the code too.

@ggershinsky
Copy link
Contributor Author

It looks like you're fixing some bugs here. Can you explain those briefly in the PR description? Also, do we have some regression tests for this that would deserve updating?

@pitrou , this PR is a part of developing a new Parquet feature - the columnar encryption. More specifically, here I update the crypto functions to match the encryption specification
https://github.com/apache/parquet-format/blob/encryption/Encryption.md ,
formally signed off by the Parquet community earlier this month. I'll add this to the PR description.

In the C++ Parquet, the encryption is comprised of #2555, this PR and potentially new ones we'll create in the future. We have regression tests for the Java Parquet encryption, will certainly add them to the C++ version.

@ggershinsky ggershinsky changed the title PARQUET-1517: [C++] Crypto package updates PARQUET-1517: [C++] Crypto package updates to match the final spec Jan 30, 2019
std::copy(bufferSizeArray, bufferSizeArray + bufferSizeLen, ciphertext);
std::copy(iv, iv + ctrIvLen, ciphertext + bufferSizeLen);

return bufferSize;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now the bufferSize will have more 4 bytes for bufferSizeLen, then I think now we should return bufferSize + bufferSizeLen? or the caller will have to add 4 bytes into bufferSize itself?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thamht4190 You are right, it should be + sizeLen (or rather, bufferSize should include this addition). I'll update the code.

@wesm
Copy link
Member

wesm commented May 22, 2019

@ggershinsky can you add a PR description that summarizes the changes you've made here, for the project changelog?

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some stylistic comments. Since we're trying to flatten out the package and remove parquet/util it might make sense to move this to parquet/encryption-internal.h/cc

constexpr int GCM_MODE = 0;
constexpr int CTR_MODE = 1;
constexpr int CTRIvLength = 16;
constexpr int BufferSizeLength = 4;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use style-conforming names for these constants in the form kCamelCase?

explicit EvpCipher(int cipher, int key_len, int type) {
ctx_ = nullptr;
AesEncryptor::AesEncryptor(ParquetCipher::type alg_id, int key_len, bool metadata,
std::shared_ptr<std::list<AesEncryptor*>> all_encryptors) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to use std::list instead of the more common std::vector?

std::stringstream ss;
ss << "Crypto algorithm " << alg_id << " is not supported";
throw ParquetException(ss.str());
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm generally not in favor of class constructors being able to fail. This maintains the status quo but you might consider doing argument validation in a factory static constructor and passing in sanitized arguments into a private ctor

private:
EVP_CIPHER_CTX* ctx_;
};
int AesEncryptor::CiphertextSizeDelta() { return ciphertext_size_delta_; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ciphertext_size_delta() (since this is just accessing an attribute)


int gcm_encrypt(const uint8_t* plaintext, int plaintext_len, uint8_t* key, int key_len,
uint8_t* aad, int aad_len, uint8_t* ciphertext) {
int AesEncryptor::gcm_encrypt(const uint8_t* plaintext, int plaintext_len, uint8_t* key,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take the opportunity to rename this as GcmEncrypt?

uint8_t* plaintext);
class AesEncryptor {
public:
// Can serve one key length only. Possible values: 16, 24, 32 bytes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use /// for comments (for the sake of doxygen)

EVP_CIPHER_CTX_free(ctx_);
ctx_ = NULLPTR;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put these implementations in the .cc file?

}

private:
EVP_CIPHER_CTX* ctx_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you hide this in a PIMPL? If this is an internal header then this is not necessary

}

private:
EVP_CIPHER_CTX* ctx_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question re using PIMPL

@@ -18,29 +18,126 @@
#ifndef PARQUET_UTIL_CRYPTO_H
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is PARQUET_EXPORT needed on any functions in here? If might be clearer to move this file to the top level of the Parquet codebase and call it encryption-internal.h if it is internal-only

@ggershinsky
Copy link
Contributor Author

We went with Revital over these comments, all of them make sense to us, we'll make the respective changes.

@wesm
Copy link
Member

wesm commented May 28, 2019

Please be sure to run clang-format-7 (using make format) and make lint on the codebase -- it is currently failing lint checks

@ggershinsky
Copy link
Contributor Author

@wesm , we've addressed your comments. If you're ok with the current version, please merge it, so the staging branch can pick up these 2 new files, upon rebase.

@wesm
Copy link
Member

wesm commented May 29, 2019

Thanks. Having a look now

@wesm
Copy link
Member

wesm commented May 29, 2019

FYI, it's a little bit odd to do code reviews on your fork as it's pseudoprivate

ggershinsky#6

In the future, can you adopt a process that permits the code review to be seen more obviously by the community? If you are collaborating on a PR, you can add each other as contributors on your fork and then have the code review on the PR itself

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thank you!

@wesm wesm closed this in f199a77 May 29, 2019
@ggershinsky ggershinsky deleted the p1517-crypto-pack-updates branch February 18, 2020 11:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants