table.scan(row_filter="x IN (0, 1)") does not include the values for which x=0 when x is a DoubleType and a partition column #1937

@ypsah

Apache Iceberg version

0.9.0 (latest release)

Please describe the bug 🐞

Hi, thanks for writing pyiceberg.

The bug is pretty much described in the title: table.scan(row_filter="x IN (0, 1)") does not include the values for which x=0 when x is a DoubleType and a partition column.

Here is a reproducer:

pip install pyiceberg[sql-sqlite,pyarrow]

from pathlib import Path
from tempfile import TemporaryDirectory

import pyarrow
from pyiceberg.catalog.sql import SqlCatalog
from pyiceberg.schema import Schema
from pyiceberg.transforms import IdentityTransform
from pyiceberg.types import DoubleType, NestedField
from pyiceberg.partitioning import PartitionSpec, PartitionField

schema = Schema(
    NestedField(field_id=1, name="x", field_type=DoubleType()),
    NestedField(field_id=2, name="y", field_type=DoubleType()),
)
partition_spec = PartitionSpec(PartitionField(source_id=1, field_id=1001, transform=IdentityTransform(), name="x"))

with TemporaryDirectory() as tmpdir:
    catalog = SqlCatalog(
        "local",
        uri=f"sqlite:///{tmpdir}/catalog.db",
        warehouse=f"file://{tmpdir}/warehouse",
    )
    catalog.create_namespace("test")
    table = catalog.create_table(
        "test.test", schema=schema, partition_spec=partition_spec
    )

    data = pyarrow.table(
        {
            "x": [0.0, 1.0, 2.0],
            "y": [0.0, 0.0, 0.0],
        }
    )
    table.overwrite(data)

    print("=== no filter ===")
    print(table.scan().to_arrow())
    print("=== x IN (0) ===")
    print(table.scan(row_filter="x IN (0)").to_arrow())
    print("=== x IN (0, 1, 2) ===")
    print(table.scan(row_filter="x IN (0, 1, 2)").to_arrow())

Output:

/tmp/tmp.l2MLQFjC7C-05duO9h5/lib/python3.13/site-packages/pyiceberg/table/__init__.py:686: UserWarning: Delete operation did not match any records
  warnings.warn("Delete operation did not match any records")
=== no filter ===
pyarrow.Table
x: double
y: double
----
x: [[0],[1],[2]]
y: [[0],[0],[0]]
=== x IN (0) ===
pyarrow.Table
x: double
y: double
----
x: [[0]]
y: [[0]]
=== x IN (0, 1, 2) ===
pyarrow.Table
x: double
y: double
----
x: [[1],[2]]
y: [[0],[0]]

I expect the output for x IN (0, 1, 2) to match that of the unfiltered scan, i.e. to also include the row where x = 0.
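
The expectation above can be stated as a small check. This is only a minimal sketch on plain Python lists: the helper name `missing_values` is hypothetical (not part of pyiceberg), and the two lists are copied by hand from the output shown above rather than read from the table.

```python
def missing_values(unfiltered, filtered):
    """Return the values present in the unfiltered scan but absent from the filtered one."""
    seen = set(filtered)
    return [v for v in unfiltered if v not in seen]


# Values of column x, copied from the output above.
unfiltered_x = [0.0, 1.0, 2.0]  # table.scan()
filtered_x = [1.0, 2.0]         # table.scan(row_filter="x IN (0, 1, 2)")

# The filter lists every value of x, so nothing should be dropped.
dropped = missing_values(unfiltered_x, filtered_x)
print(dropped)  # [0.0] with the output reported above; [] is the expected result
```

Inside the reproducer, the lists could instead be taken from the scans themselves, for example with `table.scan().to_arrow()["x"].to_pylist()`.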

Note that I could not reproduce when x is a LongType instead of a DoubleType.

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
