Bug: Incorrect date 1970-01-01 for the second row when querying from Polars to_arrow() table #403

@baiyus

chdb version: v3.6.0
polars version: 1.34.0
python version: 3.12.3

Bug Description:
When querying a PyArrow table that was generated by polars.DataFrame.to_arrow(), chdb incorrectly parses the date32 column starting from the second row. The first row's date is read correctly, but subsequent identical dates (2000-01-01) are returned as 1970-01-01 (the Unix epoch date).

This issue seems specific to the Arrow table created by Polars, as tables created from Pandas do not exhibit this behavior.
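
For context, Arrow's date32 type stores each date as a 32-bit count of days since 1970-01-01, so a result of 1970-01-01 corresponds to a stored value of 0. The stdlib-only sketch below illustrates that encoding. `encode_date32` and `decode_date32` are hypothetical helper names used for illustration and are not part of chdb, Polars, or PyArrow. This is consistent with the second row's value being read as zero, but it does not by itself identify the cause.

```python
from datetime import date, timedelta

# Arrow date32 counts whole days from the Unix epoch.
EPOCH = date(1970, 1, 1)


def encode_date32(d: date) -> int:
    """Return the date32 representation (days since 1970-01-01) of a date."""
    return (d - EPOCH).days


def decode_date32(days: int) -> date:
    """Return the calendar date for a date32 value."""
    return EPOCH + timedelta(days=days)


# The date used in the reproduction script is stored as a non-zero integer.
print(encode_date32(date(2000, 1, 1)))  # 10957

# A stored value of 0 decodes to the date seen in the buggy second row.
print(decode_date32(0))  # 1970-01-01
```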

Steps to Reproduce:
The following Python script consistently reproduces the bug using simulated data.

import chdb
import polars as pl

# 1. Create a Polars DataFrame with a date column from strings
data_dict = {
    'id': [1, 1],
    'td_date': ['2000-01-01', '2000-01-01'],
    'value': [1234.1, 2345.7]
}
df_polars = pl.DataFrame(data_dict)
df_polars = df_polars.with_columns(pl.col("td_date").str.to_date())

# 2. Convert the Polars DataFrame to a PyArrow Table
arrow_table_from_polars = df_polars.to_arrow()

print("--- Initial Polars DataFrame ---")
print(df_polars)
print("\n--- Arrow Table from Polars (looks correct) ---")
print(arrow_table_from_polars)

# 3. Query with chdb, which triggers the bug
result = chdb.query('''
    SELECT *
    FROM Python(arrow_table_from_polars)
    LIMIT 2
''', output_format='Markdown')

print("\n--- chdb Query Result (Buggy) ---")
print(result)

Actual Behavior: The query returns the incorrect date 1970-01-01 for the second row.

--- Initial Polars DataFrame ---
shape: (2, 3)
┌─────┬────────────┬───────────┐
│ id  ┆ td_date    ┆ value     │
│ --- ┆ ---        ┆ ---       │
│ i64 ┆ date       ┆ f64       │
╞═════╪════════════╪═══════════╡
│ 1   ┆ 2000-01-01 ┆ 1234.1000 │
│ 1   ┆ 2000-01-01 ┆ 2345.7000 │
└─────┴────────────┴───────────┘
--- Arrow Table from Polars (looks correct) ---
pyarrow.Table
id: int64
td_date: date32[day]
value: double
----
id: [[1,1]]
td_date: [[2000-01-01,2000-01-01]]
value: [[1234.1,2345.7]]
--- chdb Query Result (Buggy) ---
| id | td_date | value |
|-:|-:|-:|
| 1 | 2000-01-01 | 1234.1 |
| 1 | 1970-01-01 | 2345.7 |
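
To turn the observation above into an automated check, the Markdown output can be parsed and compared against the expected dates. The following is a minimal sketch that operates on the output text copied from this report; `parse_markdown_column` is a hypothetical helper, and the sketch assumes the pipe-delimited layout shown above (header row, separator row, then data rows).

```python
def parse_markdown_column(markdown: str, column: str) -> list[str]:
    """Return the values of one column from a pipe-delimited Markdown table."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in markdown.strip().splitlines()
        if line.strip()
    ]
    # rows[0] is the header, rows[1] is the |-:|-:| separator row.
    header, body = rows[0], rows[2:]
    idx = header.index(column)
    return [row[idx] for row in body]


# Output text copied from the "chdb Query Result (Buggy)" section of this report.
sample_output = """
| id | td_date | value |
|-:|-:|-:|
| 1 | 2000-01-01 | 1234.1 |
| 1 | 1970-01-01 | 2345.7 |
"""

expected = ["2000-01-01", "2000-01-01"]
actual = parse_markdown_column(sample_output, "td_date")

# Collect (row index, expected, actual) for every row that differs.
mismatches = [
    (i, e, a) for i, (e, a) in enumerate(zip(expected, actual)) if e != a
]
print(mismatches)  # [(1, '2000-01-01', '1970-01-01')]
```

In a regression test, the same comparison could be run on `str(result)` from the reproduction script, with an assertion that `mismatches` is empty.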

Workaround:
A functioning workaround is to convert the date column back to a string within Polars before converting the DataFrame to an Arrow table. This suggests the issue is specific to how chdb parses the date32 array type when it originates from Polars.

# Workaround implementation
df_polars_str_date = df_polars.with_columns(
    pl.col("td_date").dt.strftime("%Y-%m-%d")
)
arrow_table_workaround = df_polars_str_date.to_arrow()

# A chdb query on 'arrow_table_workaround' works correctly.
