Description
chdb version: v3.6.0
polars version: 1.34.0
python version: 3.12.3
Bug Description:
When querying a PyArrow table that was generated by polars.DataFrame.to_arrow(), chdb incorrectly parses the date32 column starting from the second row. The first row's date is read correctly, but subsequent identical dates (2000-01-01) are returned as 1970-01-01 (the Unix epoch date).
This issue seems specific to the Arrow table created by Polars, as tables created from Pandas do not exhibit this behavior.
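For context, Arrow's date32 type stores each date as a 32-bit integer counting days since the Unix epoch, so a result of 1970-01-01 corresponds to a stored value of 0. The sketch below uses only the Python standard library to show the day counts involved; the helper names `date_to_date32` and `date32_to_date` are made up for illustration and are not part of chdb, Polars, or PyArrow.

```python
from datetime import date, timedelta

UNIX_EPOCH = date(1970, 1, 1)


def date_to_date32(d: date) -> int:
    """Return the day count that a date32 column would store for `d`."""
    return (d - UNIX_EPOCH).days


def date32_to_date(days: int) -> date:
    """Decode a date32 day count back into a calendar date."""
    return UNIX_EPOCH + timedelta(days=days)


expected_days = date_to_date32(date(2000, 1, 1))
print(expected_days)                  # 10957
print(date32_to_date(expected_days))  # 2000-01-01
print(date32_to_date(0))              # 1970-01-01
```

In other words, the second row behaves as if chdb read the value 0 instead of 10957. Why that happens is not established here.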
Steps to Reproduce:
The following Python script consistently reproduces the bug using simulated data.
import chdb
import polars as pl
# 1. Create a Polars DataFrame with a date column from strings
data_dict = {
'id': [1, 1],
'td_date': ['2000-01-01', '2000-01-01'],
'value': [1234.1, 2345.7]
}
df_polars = pl.DataFrame(data_dict)
df_polars = df_polars.with_columns(pl.col("td_date").str.to_date())
# 2. Convert the Polars DataFrame to a PyArrow Table
arrow_table_from_polars = df_polars.to_arrow()
print("--- Initial Polars DataFrame ---")
print(df_polars)
print("\n--- Arrow Table from Polars (looks correct) ---")
print(arrow_table_from_polars)
# 3. Query with chdb, which triggers the bug
result = chdb.query('''
SELECT *
FROM Python(arrow_table_from_polars)
LIMIT 2
''', output_format='Markdown')
print("\n--- chdb Query Result (Buggy) ---")
print(result)
Actual Behavior: The query returns the incorrect date 1970-01-01 for the second row.
--- Initial Polars DataFrame ---
shape: (2, 3)
┌─────┬────────────┬───────────┐
│ id ┆ td_date ┆ value │
│ --- ┆ --- ┆ --- │
│ i64 ┆ date ┆ f64 │
╞═════╪════════════╪═══════════╡
│ 1 ┆ 2000-01-01 ┆ 1234.1000 │
│ 1 ┆ 2000-01-01 ┆ 2345.7000 │
└─────┴────────────┴───────────┘
--- Arrow Table from Polars (looks correct) ---
pyarrow.Table
id: int64
td_date: date32[day]
value: double
----
id: [[1,1]]
td_date: [[2000-01-01,2000-01-01]]
value: [[1234.1,2345.7]]
--- chdb Query Result (Buggy) ---
| id | td_date | value |
|-:|-:|-:|
| 1 | 2000-01-01 | 1234.1 |
| 1 | 1970-01-01 | 2345.7 |
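One pattern that would produce exactly this output (first row correct, second row at the epoch) is a 4-byte date32 buffer being walked with a 2-byte stride, for example if the values were treated as 16-bit day counts. This is purely a hypothesis for whoever debugs the issue: nothing in this report confirms that chdb does this, and it does not by itself explain why pandas-derived tables work. The sketch simulates the buffer with the standard library's `struct` module, assuming little-endian layout.

```python
import struct
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def to_iso_dates(day_counts):
    """Convert day counts since the epoch into ISO date strings."""
    return [(EPOCH + timedelta(days=n)).isoformat() for n in day_counts]


# Two rows of 2000-01-01 as date32: 10957 days since the epoch, int32 each.
buffer = struct.pack("<2i", 10957, 10957)

# Correct interpretation: two little-endian 32-bit integers.
as_int32 = struct.unpack("<2i", buffer)

# Hypothetical misread: the same bytes viewed as 16-bit unsigned integers.
as_uint16 = struct.unpack("<4H", buffer)

print(to_iso_dates(as_int32))       # ['2000-01-01', '2000-01-01']
print(to_iso_dates(as_uint16[:2]))  # ['2000-01-01', '1970-01-01']
```

With a 2-byte stride, the "second row" is really the zeroed upper half of the first row's value, which decodes to the epoch.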
Workaround:
A functioning workaround is to convert the date column back to a string within Polars before converting the DataFrame to an Arrow table. This suggests the issue is specific to how chdb parses the date32 array type when it originates from Polars.
# Workaround implementation
df_polars_str_date = df_polars.with_columns(
pl.col("td_date").dt.strftime("%Y-%m-%d")
)
arrow_table_workaround = df_polars_str_date.to_arrow()
# A chdb query on 'arrow_table_workaround' works correctly.