Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Go][Parquet] writer.WriteBatch silently fail when using encryption #14940

Closed
ahmb84 opened this issue Dec 13, 2022 · 9 comments · Fixed by #14954
Closed

[Go][Parquet] writer.WriteBatch silently fail when using encryption #14940

ahmb84 opened this issue Dec 13, 2022 · 9 comments · Fixed by #14954

Comments

@ahmb84
Copy link

ahmb84 commented Dec 13, 2022

Describe the bug, including details regarding any error messages, version, and platform.

Hello, I am trying to leverage the go arrow library to encrypt/decrypt parquet files.
Unfortunately, it seems that the encrypted values are not written in the parquet file when calling the WriteBatch function.
I have written a test function that writes and then reads a parquet file. The function is working correctly without encryption, but when I add encryption, the values for the encrypted columns do not exist. This is confirmed by two reasons:

  • The size of the encrypted file is smaller than the unencrypted one
  • When I read the encrypted file, the reader.HasNext always returns false, and the number of rows in the row group is 0.

After noticing this behavior and testing different configurations (serial/buffer, encoding, compression), I've looked at the test suite and noticed what I assume is a bug.

I’ve run the test suite and all the tests pass. The decryptions work perfectly for the files coming from the repo https://github.com/apache/parquet-testing, but for the file created in the previous test (the encryption one), the test is actually silently failing.
The number of rows is zero, even if it should be 50. The test does check if the number of rows read equals the number of rows in the stats, but it does not check if there are rows at all.

I am assuming that something is wrong with the flush when using encryption.

I have forked the repo and updated the test to include a check for the number of rows. You can find the forked repo here => https://github.com/ahmb84/arrow/tree/parquet-go-add-test-for-number-of-rows-in-decryption

Let me know if my assumptions are correct or if I am missing anything. Thank you!

Component(s)

Go, Parquet

@kou kou changed the title [GO][Parquet] writer.WriteBatch silently fail when using encryption [Go][Parquet] writer.WriteBatch silently fail when using encryption Dec 14, 2022
@zeroshade
Copy link
Member

I'll take a look at this today

@ahmb84
Copy link
Author

ahmb84 commented Dec 14, 2022

Thank you! Let me know if I can contribute in anyway.

@zeroshade
Copy link
Member

zeroshade commented Dec 14, 2022

@ahmb84 I don't think you pushed your updated version of the test. At least I'm not able to easily find your updated version of the test in your fork, could you give me a direct link to the commit?

@ahmb84
Copy link
Author

ahmb84 commented Dec 14, 2022

@zeroshade my bad forgot to commit yesterday. Just pushed, you can find the changed test file here: https://github.com/ahmb84/arrow/blob/5a651992d5ca9db0a5eefc2569276cdfe3ccf038/go/parquet/encryption_read_config_test.go

@zeroshade
Copy link
Member

@ahmb84 Thanks for your great reproducer, i was able to narrow down the issue and not only fix it, but fix things to properly propagate that error through and added checks in the tests to check for any error returned from WriteBatch. See the attached PR and let me know if everything works well for you with it.

@ahmb84
Copy link
Author

ahmb84 commented Dec 14, 2022

@zeroshade thank you for the quick solution! I'll see the PR, make sure that everything works well, and get back to you asap - probably tomorrow.

zeroshade added a commit to zeroshade/arrow that referenced this issue Dec 16, 2022
@ahmb84
Copy link
Author

ahmb84 commented Dec 18, 2022

Hello @zeroshade, sorry for the delay. I've been able to test your PR and it's working well. Thank you!
I have a question related to it. Is it possible to read a parquet file and selectively choose which column to decrypt while having a usable parquet file?
It seems that the parquet spec does not mention if it's supported, but I assume that the ability to encrypt some columns selectively implies that it should be possible to decrypt some selectively.
With the current reader in the go library force to either decrypts the file entirely if a column is not specified in the decryption policy, it panics.
If my assumption is correct, would it be possible to add the support for decrypting selectively some column and keeping other encrypted? Thank you!

@zeroshade
Copy link
Member

In theory it should be entirely possible to do so, I'll have to take a look to see why it's panic'ing. Could you file another GH issue to track that? If you could provide the stack trace of the panic that would also be extremely helpful.

zeroshade added a commit that referenced this issue Dec 19, 2022
Missed a `default` causing a fall through to always return an error that wasn't being returned through and checked. 

Test updated to validate number of values/rows written. Error is now propagated through if encountered.
* Closes: #14940

Authored-by: Matt Topol <zotthewizard@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
@zeroshade zeroshade added this to the 11.0.0 milestone Dec 19, 2022
@ahmb84
Copy link
Author

ahmb84 commented Dec 19, 2022

@zeroshade thanks for your feedback! I'll open another issue with the stack trace.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants