Inconsistent row count across versions #1132

@dev-goyal


Apache Iceberg version

0.7.1 (latest release)

Please describe the bug 🐞

We're seeing inconsistent row counts from pyiceberg: the exact same code returns different counts when run under 0.6.1 and 0.7.1. Cross-checking the table with Athena confirms that the 0.6.1 count is the correct one. Any ideas on where to look when debugging this?

I can confirm that the output of .plan_files(), including the delete_files on each task, is identical across the two versions.

import pyiceberg
print(pyiceberg.__version__)

from pyiceberg import catalog as pyi_catalog

catalog = pyi_catalog.load_catalog(name="default", type="glue")
table = catalog.load_table("ml_recommendations.users_v2")
# kwargs["row_filter"] is supplied by the calling code (not shown here)
scan = table.scan(
    row_filter=kwargs["row_filter"]
)

# Register the scan result as the DuckDB table "users" and count its rows
df_users = scan.to_duckdb("users")
df_users.sql("SELECT count(*) FROM users")
>> 0.6.1
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│      6700635 │
└──────────────┘
>> 0.7.1
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│      1973154 │
└──────────────┘
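
For anyone trying to reproduce the plan comparison mentioned above, here is a rough sketch of how the planned tasks could be summarised under each version and then diffed. The helper name summarize_plan is hypothetical and not part of pyiceberg. It assumes each task returned by scan.plan_files() exposes .file and .delete_files, and that each file object exposes .file_path and .record_count. The stand-in objects at the bottom only exist so the snippet runs without a catalog.

```python
from types import SimpleNamespace


def summarize_plan(tasks):
    """Summarise planned scan tasks so two runs can be compared.

    Hypothetical helper: assumes each task has `.file` and `.delete_files`,
    and each file object has `.file_path` and `.record_count`.
    """
    data_files = 0
    data_records = 0
    data_files_with_deletes = 0
    delete_records_by_path = {}  # de-duplicates delete files shared by tasks

    for task in tasks:
        data_files += 1
        data_records += task.file.record_count
        deletes = list(task.delete_files)
        if deletes:
            data_files_with_deletes += 1
        for delete_file in deletes:
            delete_records_by_path[delete_file.file_path] = delete_file.record_count

    return {
        "data_files": data_files,
        # Upper bound on the scan's row count, before deletes and filtering
        "data_records": data_records,
        "data_files_with_deletes": data_files_with_deletes,
        "delete_files": len(delete_records_by_path),
        "delete_records": sum(delete_records_by_path.values()),
    }


# Stand-in tasks (made-up paths and counts) so the sketch runs on its own
_delete = SimpleNamespace(file_path="s3://bucket/delete-1.parquet", record_count=10)
_tasks = [
    SimpleNamespace(
        file=SimpleNamespace(file_path="s3://bucket/a.parquet", record_count=100),
        delete_files=[_delete],
    ),
    SimpleNamespace(
        file=SimpleNamespace(file_path="s3://bucket/b.parquet", record_count=50),
        delete_files=[_delete],
    ),
    SimpleNamespace(
        file=SimpleNamespace(file_path="s3://bucket/c.parquet", record_count=25),
        delete_files=[],
    ),
]
demo_summary = summarize_plan(_tasks)
print(demo_summary)
```

Against the real table this would be called as summarize_plan(scan.plan_files()) under each version. If both summaries match while the final counts differ, the divergence is more likely in how the files are read (for example, how deletes or the row filter are applied) than in planning.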
