
[Python][Dataset] to_batches crash after calling 'count_rows' using dataset to read encrypted parquet #41431

Closed
RyogaWan opened this issue Apr 29, 2024 · 10 comments


Describe the bug, including details regarding any error messages, version, and platform.

I'm using pyarrow.dataset.dataset to read an encrypted parquet file, with decryption_config set in ParquetFragmentScanOptions, and using to_batches to generate a RecordBatchReader. I've found that after calling count_rows, to_batches crashes.

I also tried CSV and plaintext parquet; both can still generate a reader after count_rows.

pyarrow version is 16.0.0

Example code

import base64

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.csv
import pyarrow.parquet as pq
import pyarrow.parquet.encryption as pe

rb = pa.record_batch([[1, 2, 3], ["a", "b", "c"], [None, 1.2, 3.2]], ["id", "name", "value"])

FOOTER_KEY = b"0123456789112345"
FOOTER_KEY_NAME = "footer_key"


class PyarrowInMemoryKmsClient(pe.KmsClient):
    def __init__(self):
        """Create an InMemoryKmsClient instance."""
        pe.KmsClient.__init__(self)
        self.master_keys_map = {
            FOOTER_KEY_NAME: FOOTER_KEY.decode("UTF-8"),
        }

    def wrap_key(self, key_bytes, master_key_identifier):
        """Not a secure cipher - the wrapped key
        is just the master key concatenated with key bytes"""
        print(f"key_bytes is {key_bytes}, {master_key_identifier=}")
        master_key_bytes = self.master_keys_map[master_key_identifier].encode(
            'utf-8')
        wrapped_key = b"".join([master_key_bytes, key_bytes])
        result = base64.b64encode(wrapped_key)
        return result

    def unwrap_key(self, wrapped_key, master_key_identifier):
        """Not a secure cipher - just extract the key from
        the wrapped key"""
        print(f"wrapped_key is {wrapped_key}, {master_key_identifier=}")
        expected_master_key = self.master_keys_map[master_key_identifier]
        decoded_wrapped_key = base64.b64decode(wrapped_key)
        master_key_bytes = decoded_wrapped_key[:16]
        decrypted_key = decoded_wrapped_key[16:]
        if (expected_master_key == master_key_bytes.decode('utf-8')):
            return decrypted_key
        raise ValueError("Incorrect master key used",
                         master_key_bytes, decrypted_key)


def kms_client_factory(kms_connection_configuration):
    return PyarrowInMemoryKmsClient()


def test_csv_dataset():
    source_csv = "./test.csv"
    with pa.csv.CSVWriter(source_csv, rb.schema) as c:
        for _ in range(6):
            c.write(rb)

    dataset = ds.dataset(source_csv, format="csv")
    reader = pa.RecordBatchReader.from_batches(dataset.schema, dataset.to_batches())

    count = 0
    for record in reader:
        count += record.num_rows

    print(dataset.count_rows())
    reader = pa.RecordBatchReader.from_batches(dataset.schema, dataset.to_batches())

    count = 0
    for record in reader:
        count += record.num_rows


def test_parquet_dataset():
    source_parquet = "./test.parquet"
    with pq.ParquetWriter(source_parquet, rb.schema) as c:
        for _ in range(6):
            c.write(rb)

    dataset = ds.dataset(source_parquet, format="parquet")
    reader = pa.RecordBatchReader.from_batches(dataset.schema, dataset.to_batches())

    count = 0
    for record in reader:
        count += record.num_rows

    print(dataset.count_rows())
    reader = pa.RecordBatchReader.from_batches(dataset.schema, dataset.to_batches())

    count = 0
    for record in reader:
        count += record.num_rows


def test_encrypted_parquet_dataset():
    source_enc_parquet = "./test.enc.parquet"
    crypt_factory = pe.CryptoFactory(kms_client_factory)
    encryption_config = pe.EncryptionConfiguration(
        footer_key=FOOTER_KEY_NAME,
        column_keys={
            FOOTER_KEY_NAME: rb.schema.names,
        },
        encryption_algorithm="AES_GCM_V1",
        data_key_length_bits=256,
    )
    kms_connection_config = pe.KmsConnectionConfig()

    with pq.ParquetWriter(source_enc_parquet, rb.schema,
                          encryption_properties=crypt_factory.file_encryption_properties(kms_connection_config,
                                                                                         encryption_config)) as c:
        for _ in range(6):
            c.write(rb)

    scan_options = ds.ParquetFragmentScanOptions(
        decryption_config=ds.ParquetDecryptionConfig(
            crypto_factory=crypt_factory, kms_connection_config=kms_connection_config,
            decryption_config=pe.DecryptionConfiguration()
        )
    )
    file_format = ds.ParquetFileFormat(default_fragment_scan_options=scan_options)

    dataset = ds.dataset(source_enc_parquet, format=file_format)
    reader = pa.RecordBatchReader.from_batches(dataset.schema, dataset.to_batches())

    count = 0
    for record in reader:
        count += record.num_rows

    print(dataset.count_rows())
    reader = pa.RecordBatchReader.from_batches(dataset.schema, dataset.to_batches())

    count = 0
    for record in reader:
        count += record.num_rows


if __name__ == '__main__':
    test_csv_dataset()
    test_parquet_dataset()
    test_encrypted_parquet_dataset()

Traceback

Traceback (most recent call last):
  File "/Users/yangyu/Library/Application Support/JetBrains/PyCharm2022.1/scratches/scratch_3.py", line 138, in <module>
    test_encrypted_parquet_dataset()
  File "/Users/yangyu/Library/Application Support/JetBrains/PyCharm2022.1/scratches/scratch_3.py", line 131, in test_encrypted_parquet_dataset
    for record in reader:
  File "pyarrow/ipc.pxi", line 666, in pyarrow.lib.RecordBatchReader.__next__
  File "pyarrow/ipc.pxi", line 700, in pyarrow.lib.RecordBatchReader.read_next_batch
  File "pyarrow/error.pxi", line 88, in pyarrow.lib.check_status
  File "pyarrow/_dataset.pyx", line 3769, in _iterator
  File "pyarrow/_dataset.pyx", line 3387, in pyarrow._dataset.TaggedRecordBatchIterator.__next__
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: RowGroup is noted as encrypted but no file decryptor

Component(s)

Python


AlenkaF commented May 6, 2024

I can reproduce the issue locally.

@wgtmac @pitrou would you have time and any idea what the issue could be? Both dataset.to_table() and dataset.to_batches() return the error after calling count_rows if the parquet file is encrypted.


wgtmac commented May 6, 2024

From the message OSError: RowGroup is noted as encrypted but no file decryptor, it seems that the decryption config is not correctly created. I'll take some time to investigate this as I'm not that familiar with python.

cc @tolleybot to see if you have any insight.


wgtmac commented May 6, 2024

def test_encrypted_parquet_dataset():
    source_enc_parquet = "./test.enc.parquet"
    crypt_factory = pe.CryptoFactory(kms_client_factory)
    encryption_config = pe.EncryptionConfiguration(
        footer_key=FOOTER_KEY_NAME,
        column_keys={
            FOOTER_KEY_NAME: rb.schema.names,  # <-- something wrong here
        },
        encryption_algorithm="AES_GCM_V1",
        data_key_length_bits=256,
    )
    kms_connection_config = pe.KmsConnectionConfig()

@RyogaWan Could you double-check the suspicious line above? It seems that we need to use a COL_KEY_NAME list, as in:

encryption_config = pe.EncryptionConfiguration(
    footer_key=FOOTER_KEY_NAME,
    plaintext_footer=False,
    # Use COL_KEY_NAME to encrypt `n_legs` and `animal` columns.
    column_keys={
        COL_KEY_NAME: ["n_legs", "animal"],
    },
    encryption_algorithm="AES_GCM_V1",
    # requires timedelta or an assertion is raised
    cache_lifetime=timedelta(minutes=5.0),
    data_key_length_bits=256)


RyogaWan commented May 6, 2024

> @RyogaWan Could you double-check the suspicious line above? It seems that we need to use a COL_KEY_NAME list, as in: […]

I think this is an identifier for column_keys, used by the KmsClient to fetch the key that wraps or unwraps the actual key encrypting the dataset. In this example I want to use the same key for the columns and the footer, so I just used FOOTER_KEY_NAME in the dict. Is there anything I'm misunderstanding? I apologize for any confusion I've caused.
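
For reference, rb.schema.names in the original example already evaluates to a plain Python list of column names, so the column_keys mapping does have the expected {key_name: [column, ...]} shape:

import pyarrow as pa

rb = pa.record_batch([[1, 2, 3], ["a", "b", "c"], [None, 1.2, 3.2]],
                     ["id", "name", "value"])
print(rb.schema.names)  # ['id', 'name', 'value'] -- already a list of str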


wgtmac commented May 6, 2024

You may use FOOTER_KEY_NAME as the key name, but the column names should be wrapped in a list. Please also check that the column names (i.e. rb.schema.names) are correct.


RyogaWan commented May 6, 2024

> You may use FOOTER_KEY_NAME as the key name, but the column names should be wrapped in a list. Please also check that the column names (i.e. rb.schema.names) are correct.

Yes, I think it's correct. In my test, to_batches before count_rows works normally, but it always fails after count_rows.

@tolleybot

> From the message OSError: RowGroup is noted as encrypted but no file decryptor, it seems that the decryption config is not correctly created. I'll take some time to investigate this as I'm not that familiar with python.
>
> cc @tolleybot to see if you have any insight.

@wgtmac Sure thing


wgtmac commented May 6, 2024

The error does not come from print(dataset.count_rows()); it successfully prints 18 on my end. The issue arises when the code snippet tries to create the reader again from the same dataset. It seems that the file decryption config cannot be reused.
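
Distilled, the failing sequence is just this (with dataset built over the encrypted file exactly as in test_encrypted_parquet_dataset above):

# dataset is created once with the decryption scan options, as above
for batch in dataset.to_batches():   # 1st scan: succeeds
    pass
print(dataset.count_rows())          # succeeds, prints 18
for batch in dataset.to_batches():   # 2nd scan over the same dataset object:
    pass                             # OSError: RowGroup is noted as encrypted
                                     # but no file decryptor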


wgtmac commented May 6, 2024

I have found the root cause.

  1. When the dataset scanner runs for the first time, it caches the FileMetaData in the ParquetFileFragment:
    return SetMetadata(reader->parquet_reader()->metadata(), std::move(manifest));
  2. When the dataset creates a new scanner, the internal parquet reader reuses the cached FileMetaData instead of parsing it from the footer:
    file->set_metadata(std::move(metadata));
  3. Because we don't parse the FileMetaData again, file_decryptor_ is no longer created. (In the first run it was created here: file_decryptor_ = std::make_shared<InternalFileDecryptor>(…).)
  4. Since file_decryptor_ is null, an error is reported when we access encrypted data:
    throw ParquetException("RowGroup is noted as encrypted but no file decryptor");
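
Given this root cause, a possible workaround until the fix is released (an untested sketch inferred from the analysis above, reusing source_enc_parquet and file_format from the example code, not something verified in this thread) is to rebuild the dataset for each scan, so no fragment carries cached FileMetaData:

import pyarrow as pa
import pyarrow.dataset as ds

def open_encrypted_dataset():
    # A fresh dataset creates fresh ParquetFileFragments with no cached
    # FileMetaData, so the footer is re-parsed and the internal file
    # decryptor is recreated for every scan.
    return ds.dataset(source_enc_parquet, format=file_format)

print(open_encrypted_dataset().count_rows())
dataset = open_encrypted_dataset()
reader = pa.RecordBatchReader.from_batches(dataset.schema, dataset.to_batches())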

wgtmac added a commit to wgtmac/arrow that referenced this issue May 6, 2024
wgtmac added a commit that referenced this issue May 8, 2024
…set (#41550)

### Rationale for this change

When a parquet dataset is reused to create multiple scanners, `FileMetaData` objects are cached to avoid parsing them again. However, this caused issues on encrypted files, since internal file decryptors were no longer created from the cached `FileMetaData` objects.

### What changes are included in this PR?

Expose file_decryptor from FileMetaData and set it properly.

### Are these changes tested?

Yes, the test is modified to reproduce the issue and ensure it is fixed.

### Are there any user-facing changes?

No.
* GitHub Issue: #41431

Authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Gang Wu <ustcwg@gmail.com>
@wgtmac wgtmac added this to the 17.0.0 milestone May 8, 2024

wgtmac commented May 8, 2024

Issue resolved by pull request #41550.

@wgtmac wgtmac closed this as completed May 8, 2024
@raulcd raulcd modified the milestones: 17.0.0, 16.1.0 May 9, 2024
raulcd pushed a commit that referenced this issue May 9, 2024
…set (#41550)
vibhatha pushed a commit to vibhatha/arrow that referenced this issue May 25, 2024
…d dataset (apache#41550)