ArrowScan to_table fails if the data is mixed between dict-encoded strings and plain strings. #3260

@lukaswelsch

Description

Apache Iceberg version

0.11.0 (latest release)

Please describe the bug 🐞

We recently updated our functions to call the pyiceberg table.append() function with dict-encoded Arrow tables. Our Iceberg tables now contain mixed data: files written before this change, where the column is stored as a plain string, and files written after it, where the column is stored as a dict-encoded string.

If we now call to_arrow() on a DataScan of this table, we get this error:

pyarrow.lib.ArrowTypeError: Unable to merge: Field col has incompatible types: string vs dictionary<values=string, indices=int32, ordered=0>

Here is a minimal example that reproduces this error:

from pyiceberg.io.pyarrow import ArrowScan
from pyiceberg.table import ALWAYS_TRUE
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField
from pyiceberg.types import StringType

import pyarrow as pa


def create_scan_with_mixed_dict_encode_not_encode() -> ArrowScan:
    schema = Schema(
        NestedField(field_id=1, name="col", field_type=StringType(), required=False)
    )

    class FakeTableMetadata:
        def schema(self) -> Schema:
            return schema

    scan = ArrowScan(table_metadata=FakeTableMetadata(),
                     io=object(),
                     projected_schema=schema,
                     row_filter=ALWAYS_TRUE)

    def _batches_for_repro(self, _tasks):
        str_values = pa.array(["a"], type=pa.string())
        yield pa.record_batch([str_values], names=["col"])
        yield pa.record_batch([str_values.dictionary_encode()], names=["col"])

    ArrowScan.to_record_batches = _batches_for_repro
    return scan


if __name__ == "__main__":
    scan = create_scan_with_mixed_dict_encode_not_encode()
    arrow_table = ArrowScan.to_table(scan, tasks=[])

I am happy to provide a bugfix PR, but I need some guidance on the best approach.
One idea is to cast each batch in to_table to the arrow_schema. A more performant option is to check, for each batch, whether its schema differs from the expected one. If it does, find the dict-encoded column and cast only that column to string.

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
