-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Closed
Labels
bugSomething isn't workingSomething isn't working
Description
Describe the bug
When using deltalake.DeltaTable.delete(predicate=...) with a
predicate generated for the DataFusion dialect, Delta errors
with:
DeltaError: Generic DeltaTable error: Internal error: arrow_cast
should have been simplified to cast
Traceback (most recent call last):
File "demo.py", line 64, in <module>
main()
File "demo.py", line 60, in main
table.delete(predicate=predicate)
File ".venv/lib/python3.10/site-packages/deltalake/table.py", line 1112, in delete
metrics = self._table.delete(
_internal.DeltaError: Generic DeltaTable error: Internal error: arrow_cast should have been simplified to cast.
This issue was likely caused by a bug in DataFusion's code. Please help us to resolve this by filing a bug report in our issue tracker: https://github.com/apache/datafusion/issues
See the raw SQL which is causing the problem
SELECT
*
FROM "t" AS "t0"
WHERE
"t0"."created_at" = ARROW_CAST('2024-01-01 00:00:00+00:00', 'Timestamp(Microsecond, Some("UTC"))')
Predicate: "created_at" = ARROW_CAST('2024-01-01 00:00:00+00:00', 'Timestamp(Microsecond, Some("UTC"))')To Reproduce
Minimal repro (Python):
verified for deltalake 1.3.2
from __future__ import annotations
from datetime import datetime, timedelta, timezone
from pathlib import Path
import shutil
import re
import polars as pl
import sqlglot
from deltalake import DeltaTable, write_deltalake
import ibis
def extract_where(sql: str, *, dialect: str | None) -> str:
expr = sqlglot.parse_one(sql, read=dialect)
where = expr.args.get("where")
if where is None:
raise RuntimeError("No WHERE clause found")
# Strip table qualifiers for a single-table predicate
for col in where.find_all(sqlglot.exp.Column):
if col.table:
col.set("table", None)
predicate = where.sql(dialect=dialect) if dialect else where.sql()
predicate = re.sub(r"^WHERE\s+", "", predicate, flags=re.IGNORECASE)
return predicate
def main() -> None:
root = Path("/tmp/delta_arrow_cast_datafusion").resolve()
table_uri = str(root)
shutil.rmtree(root, ignore_errors=True)
base_time = datetime(2024, 1, 1, tzinfo=timezone.utc)
newer_time = base_time + timedelta(minutes=5)
df = pl.DataFrame(
{
"sample_uid": ["a", "a"],
"value": [1, 2],
"created_at": [base_time, newer_time],
}
)
write_deltalake(table_uri, df, mode="overwrite")
schema = ibis.schema({
"sample_uid": "string",
"value": "int64",
"created_at": "timestamp('UTC')",
})
t = ibis.table(schema, name="t")
filtered = t.filter(t.created_at == base_time)
sql = ibis.to_sql(filtered, dialect="datafusion")
predicate = extract_where(sql, dialect="datafusion")
print("DataFusion SQL:", sql)
print("Predicate:", predicate)
table = DeltaTable(table_uri)
table.delete(predicate=predicate)
if __name__ == "__main__":
main()Expected behavior
perform the deletion without parsing error
Additional context
Switching the dialect to postgres by changing from
predicate = extract_where(sql, dialect="datafusion")
to
predicate = extract_where(sql, dialect="postgres")
is a viable workaround for now - but feels wrong.
Metadata
Metadata
Assignees
Labels
bugSomething isn't workingSomething isn't working