Apache Iceberg version
0.11.0 (latest release)
Please describe the bug 🐞
We recently updated our functions to call pyiceberg's table.append() with dictionary-encoded Arrow tables. Our Iceberg tables now contain mixed data: files written before this change (where the column is stored as plain string) and files written after it (where the column is stored as dictionary-encoded string).
If we now call to_arrow() on a DataScan over this table, we get this error:
pyarrow.lib.ArrowTypeError: Unable to merge: Field col has incompatible types: string vs dictionary<values=string, indices=int32, ordered=0>
Here is a minimal example that reproduces this error:
from pyiceberg.io.pyarrow import ArrowScan
from pyiceberg.table import ALWAYS_TRUE
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType
import pyarrow as pa


def create_scan_with_mixed_dict_encode_not_encode() -> ArrowScan:
    schema = Schema(
        NestedField(field_id=1, name="col", field_type=StringType(), required=False)
    )

    class FakeTableMetadata:
        def schema(self) -> Schema:
            return schema

    scan = ArrowScan(
        table_metadata=FakeTableMetadata(),
        io=object(),
        projected_schema=schema,
        row_filter=ALWAYS_TRUE,
    )

    def _batches_for_repro(self, _tasks):
        str_values = pa.array(["a"], type=pa.string())
        yield pa.record_batch([str_values], names=["col"])
        yield pa.record_batch([str_values.dictionary_encode()], names=["col"])

    ArrowScan.to_record_batches = _batches_for_repro
    return scan


if __name__ == "__main__":
    scan = create_scan_with_mixed_dict_encode_not_encode()
    arrow_table = ArrowScan.to_table(scan, tasks=[])
I am happy to provide a bugfix PR, but I need some guidance on the best approach.
One idea is to cast each batch to the target Arrow schema in to_table. A more performant approach would be to check, for each batch, whether its schema differs from the target; if it does, find the dictionary-encoded column and cast only that one back to string.
Willingness to contribute