Skip to content

DeltaTable.delete fails with DataFusion predicate: arrow_cast should have been simplified to cast #19866

@geoHeil

Description

@geoHeil

Describe the bug

When using deltalake.DeltaTable.delete(predicate=...) with a
predicate generated for the DataFusion dialect, Delta errors
with:

DeltaError: Generic DeltaTable error: Internal error: arrow_cast
should have been simplified to cast

Traceback (most recent call last):
  File "demo.py", line 64, in <module>
    main()
  File "demo.py", line 60, in main
    table.delete(predicate=predicate)
  File ".venv/lib/python3.10/site-packages/deltalake/table.py", line 1112, in delete
    metrics = self._table.delete(
_internal.DeltaError: Generic DeltaTable error: Internal error: arrow_cast should have been simplified to cast.
This issue was likely caused by a bug in DataFusion's code. Please help us to resolve this by filing a bug report in our issue tracker: https://github.com/apache/datafusion/issues

See the raw SQL which is causing the problem

SELECT
  *
FROM "t" AS "t0"
WHERE
  "t0"."created_at" = ARROW_CAST('2024-01-01 00:00:00+00:00', 'Timestamp(Microsecond, Some("UTC"))')
Predicate: "created_at" = ARROW_CAST('2024-01-01 00:00:00+00:00', 'Timestamp(Microsecond, Some("UTC"))')

To Reproduce

Minimal repro (Python):

verified for deltalake 1.3.2

from __future__ import annotations

from datetime import datetime, timedelta, timezone
from pathlib import Path
import shutil
import re

import polars as pl
import sqlglot
from deltalake import DeltaTable, write_deltalake

import ibis


def extract_where(sql: str, *, dialect: str | None) -> str:
    expr = sqlglot.parse_one(sql, read=dialect)
    where = expr.args.get("where")
    if where is None:
        raise RuntimeError("No WHERE clause found")
    # Strip table qualifiers for a single-table predicate
    for col in where.find_all(sqlglot.exp.Column):
        if col.table:
            col.set("table", None)
    predicate = where.sql(dialect=dialect) if dialect else where.sql()
    predicate = re.sub(r"^WHERE\s+", "", predicate, flags=re.IGNORECASE)
    return predicate


def main() -> None:
    root = Path("/tmp/delta_arrow_cast_datafusion").resolve()
    table_uri = str(root)
    shutil.rmtree(root, ignore_errors=True)

    base_time = datetime(2024, 1, 1, tzinfo=timezone.utc)
    newer_time = base_time + timedelta(minutes=5)

    df = pl.DataFrame(
        {
            "sample_uid": ["a", "a"],
            "value": [1, 2],
            "created_at": [base_time, newer_time],
        }
    )

    write_deltalake(table_uri, df, mode="overwrite")

    schema = ibis.schema({
        "sample_uid": "string",
        "value": "int64",
        "created_at": "timestamp('UTC')",
    })
    t = ibis.table(schema, name="t")
    filtered = t.filter(t.created_at == base_time)
    sql = ibis.to_sql(filtered, dialect="datafusion")
    predicate = extract_where(sql, dialect="datafusion")
    print("DataFusion SQL:", sql)
    print("Predicate:", predicate)

    table = DeltaTable(table_uri)
    table.delete(predicate=predicate)


if __name__ == "__main__":
    main()

Expected behavior

perform the deletion without parsing error

Additional context

Switching the dialect to postgres by changing from

predicate = extract_where(sql, dialect="datafusion")

to

predicate = extract_where(sql, dialect="postgres")

is a viable workaround for now - but feels wrong.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions