[Python][Dataset] to_batches crash after calling 'count_rows' using dataset to read encrypted parquet #41431
Comments
From the message, cc @tolleybot to see if you have any insight.
@RyogaWan Could you double-check the suspicious line above? It seems that we need to use the COL_KEY_NAME list, as in arrow/python/examples/dataset/write_dataset_encrypted.py, lines 43 to 53 at 4cf44b4.
I think this is an identifier for column_keys, used by the KmsClient to get the key that wraps or unwraps the real key encrypting the dataset. In the example I want to use the same key for the columns and the footer, so I just used FOOTER_KEY_NAME in the dict. Is there anything I'm misunderstanding? And I apologize for the confusion I have caused.
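(For reference, a minimal sketch of the mapping under discussion, modeled on arrow/python/examples/dataset/write_dataset_encrypted.py; the key names and column list here are illustrative, not the reporter's actual values:)

```python
import pyarrow.parquet.encryption as pe

FOOTER_KEY_NAME = "footer_key"
COL_KEY_NAME = "col_key"

# column_keys maps a master-key identifier (resolved by the KmsClient)
# to the list of columns encrypted under that key. Reusing
# FOOTER_KEY_NAME as the identifier here means the footer master key
# also wraps the column keys, which is what the reporter described.
encryption_config = pe.EncryptionConfiguration(
    footer_key=FOOTER_KEY_NAME,
    column_keys={COL_KEY_NAME: ["year", "n_legs", "animal"]},
    double_wrapping=True,
)
```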
You may use …
Yes, I think it's correct. In my test, 'to_batches' before 'count_rows' processes normally, but it always fails after 'count_rows'.
@wgtmac Sure thing |
The error does not appear from …
I have found the root cause.
…set (#41550)

### Rationale for this change

When a parquet dataset is reused to create multiple scanners, `FileMetaData` objects are cached to avoid parsing them again. However, this caused issues on encrypted files, since internal file decryptors were no longer created by the cached `FileMetaData` objects.

### What changes are included in this PR?

Expose file_decryptor from `FileMetaData` and set it properly.

### Are these changes tested?

Yes, the test is modified to reproduce the issue and verify the fix.

### Are there any user-facing changes?

No.

* GitHub Issue: #41431

Authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Gang Wu <ustcwg@gmail.com>
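(Schematically, the failure mode the PR describes looks like this; `dataset` stands for any encrypted parquet dataset opened with a decryption config:)

```python
scanner_1 = dataset.scanner()
scanner_1.count_rows()      # first scan parses FileMetaData and caches it per fragment

scanner_2 = dataset.scanner()
scanner_2.to_batches()      # second scan reuses the cached FileMetaData;
                            # before the fix it carried no file decryptor,
                            # so reading the encrypted file crashed
```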
Issue resolved by pull request 41550.
Describe the bug, including details regarding any error messages, version, and platform.
I'm using pyarrow.dataset.dataset to read an encrypted parquet file, with decryption_config set in ParquetFragmentScanOptions, and using to_batches to generate a RecordBatchReader. But I find that, after calling 'count_rows', to_batches crashes.
I also tried CSV and plaintext parquet datasets; they can still generate a reader after 'count_rows'.
pyarrow version is 16.0.0
Example Code
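(The original snippet did not survive this export. Below is a minimal sketch of the reported setup, assuming an in-memory KMS client modeled on the pyarrow dataset-encryption example; the key name, dataset path, and base64 "wrapping" are illustrative, not the reporter's actual code:)

```python
import base64
from datetime import timedelta

import pyarrow.dataset as ds
import pyarrow.parquet.encryption as pe

FOOTER_KEY = "0123456789112345"   # 16-byte master key (test only)
FOOTER_KEY_NAME = "footer_key"


class InMemoryKmsClient(pe.KmsClient):
    """Toy KMS client: 'wraps' a data key by concatenating it with the
    master key and base64-encoding the result. Not secure; test only."""

    def __init__(self, config):
        pe.KmsClient.__init__(self)
        self.master_keys_map = config.custom_kms_conf

    def wrap_key(self, key_bytes, master_key_identifier):
        master = self.master_keys_map[master_key_identifier].encode()
        return base64.b64encode(master + key_bytes)

    def unwrap_key(self, wrapped_key, master_key_identifier):
        master = self.master_keys_map[master_key_identifier].encode()
        unwrapped = base64.b64decode(wrapped_key)
        assert unwrapped.startswith(master)
        return unwrapped[len(master):]


crypto_factory = pe.CryptoFactory(lambda config: InMemoryKmsClient(config))
kms_config = pe.KmsConnectionConfig(
    custom_kms_conf={FOOTER_KEY_NAME: FOOTER_KEY})
decryption_config = pe.DecryptionConfiguration(
    cache_lifetime=timedelta(minutes=5.0))
pq_decrypt = ds.ParquetDecryptionConfig(
    crypto_factory, kms_config, decryption_config)

pq_format = ds.ParquetFileFormat(
    default_fragment_scan_options=ds.ParquetFragmentScanOptions(
        decryption_config=pq_decrypt))

# "encrypted_dataset" is a placeholder path to a dataset written with
# a matching ParquetEncryptionConfig.
dataset = ds.dataset("encrypted_dataset", format=pq_format)

print(dataset.count_rows())           # succeeds
for batch in dataset.to_batches():    # crashed here on pyarrow 16.0.0
    print(batch.num_rows)
```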
Traceback
Component(s)
Python