Field::try_canonical_extension_type has unexpected behavior from external writers #9976

@limenilbuz

I created a sample Parquet file with pyarrow that contains UUID columns.

Here is a test script:

use std::fs::File;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() {
    // Open the file written by pyarrow and read only its Arrow schema.
    let file = File::open("uuids.parquet").unwrap();
    let reader = ParquetRecordBatchReaderBuilder::try_new(file).unwrap();
    let schema = reader.schema();
    // The first column is the UUID column.
    let field = schema.fields.get(0).unwrap();
    println!("{:?}", field);
    // Try to resolve the field's metadata to a canonical extension type.
    match field.try_canonical_extension_type() {
        Ok(extension_type) => println!("I am of extension type: {:?}", extension_type),
        _ => println!("I am NOT an extension type")
    }
}

and the output of cargo run:

$ cargo run
Field { name: "uuids", data_type: FixedSizeBinary(16), metadata: {"ARROW:extension:metadata": "", "ARROW:extension:name": "arrow.uuid"} }
I am NOT an extension type

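This happens because pyarrow uses the C++ definition of UuidType, whose serialize and deserialize methods both return and expect the empty string. The Rust serialize and deserialize methods for Uuid instead return and expect Option::None, so the empty "ARROW:extension:metadata" value shown in the output above is rejected.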

I found this comment in the original commit for the extension types. The simplest fix is just to accept the empty string as valid metadata.
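
To make the proposed fix concrete, here is a minimal, dependency-free sketch contrasting the two behaviors. The function names, the String error type, and the error messages are hypothetical stand-ins for illustration, not the actual arrow-schema API; the real change would presumably go in the metadata deserialization of the Rust Uuid extension type.

```rust
/// Hypothetical model of the strict behavior described above:
/// any metadata value, including the empty string, is rejected.
fn deserialize_uuid_metadata_strict(metadata: Option<&str>) -> Result<(), String> {
    match metadata {
        None => Ok(()),
        Some(_) => Err("Uuid extension type expects no metadata".to_owned()),
    }
}

/// Hypothetical lenient variant: a missing value and the empty string
/// are both treated as "no metadata"; anything else is still an error.
fn deserialize_uuid_metadata_lenient(metadata: Option<&str>) -> Result<(), String> {
    match metadata {
        None | Some("") => Ok(()),
        Some(other) => Err(format!(
            "Uuid extension type expects no metadata, got {other:?}"
        )),
    }
}

fn main() {
    // The value pyarrow wrote under "ARROW:extension:metadata" in the output above.
    let from_pyarrow = Some("");
    println!("strict:  {:?}", deserialize_uuid_metadata_strict(from_pyarrow));
    println!("lenient: {:?}", deserialize_uuid_metadata_lenient(from_pyarrow));
}
```

The lenient variant still rejects non-empty metadata, so it only widens acceptance to the one value that the C++ writer is known to produce.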
