IPC date serialization issue #902

Open
Tracked by #8282 ...
tafia opened this issue Nov 2, 2021 · 2 comments

tafia commented Nov 2, 2021

Describe the bug

When writing a RecordBatch via IPC (and then reading the stream from Python), the dates are not correct.
I have a date column ranging from 2000-02-01 to 2021-11-02 on the Rust side, but once it is converted to a pyarrow table through IPC, I see dates from 1998-01-16 to 2019-11-12. The dates are stored in a Date64Array.
Writing the same RecordBatch as CSV displays the correct dates.
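
For anyone comparing raw values on both sides: Arrow's Date64 stores milliseconds since the UNIX epoch, while Date32 stores days since the epoch. The sketch below uses only the Python standard library to compute the raw integers a date should map to, so they can be checked against what the Rust array holds and what pyarrow reads back. The helper names are hypothetical and are not part of arrow or pyarrow; this does not diagnose the bug, it only gives reference values.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)
MS_PER_DAY = 86_400_000  # 24 * 60 * 60 * 1000


def to_date32(d: date) -> int:
    """Days since the UNIX epoch (the Date32 encoding)."""
    return (d - EPOCH).days


def to_date64(d: date) -> int:
    """Milliseconds since the UNIX epoch (the Date64 encoding)."""
    return to_date32(d) * MS_PER_DAY


def from_date64(ms: int) -> date:
    """Decode a Date64 raw value back to a calendar date."""
    return EPOCH + timedelta(days=ms // MS_PER_DAY)


if __name__ == "__main__":
    # Dates taken from the report above.
    for d in (date(2000, 2, 1), date(2021, 11, 2)):
        print(d, "date32 =", to_date32(d), "date64 =", to_date64(d))
```

If the raw i64 values in the Date64Array on the Rust side do not match these, the problem may be in how the array is built rather than in the IPC round trip.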

To Reproduce

I can't really share the code, but here is the Rust-to-Python part (using pyo3).

fn df<'a>(py: Python<'a>, batch: &RecordBatch) -> Result<&'a PyAny, Error> {
    // rust ipc write
    let buf: Vec<u8> = Vec::new();
    let mut writer = StreamWriter::try_new(buf, &batch.schema())?;
    writer.write(batch)?;
    writer.finish()?;
    let buf = writer.into_inner()?;

    // python reading
    let bytes = PyBytes::new(py, &buf);
    let locals = PyDict::new(py);
    locals.set_item("buf", bytes)?;
    locals.set_item("pa", PyModule::import(py, "pyarrow")?)?;
    locals.set_item("pd", PyModule::import(py, "pandas")?)?;
    py.run(
        r#"
reader = pa.ipc.open_stream(buf)
batches = [b for b in reader]
table = pa.Table.from_batches(batches) # table.column("date") shows wrong values
df = table.to_pandas(split_blocks=True, self_destruct=True)
df["date"] = pd.to_datetime(df["date"])
df.set_index("date", inplace=True)
    "#,
        Some(&locals),
        None,
    )?;

    Ok(locals.get_item("df").unwrap())
}

Expected behavior

Same dates

Additional context

I am new to Arrow, so there is a good chance I'm doing something wrong.

@tafia tafia added the bug label Nov 2, 2021

tafia commented Nov 2, 2021

Note that when using the pyarrow feature (which I only discovered by reading through the code; is it documented?), this works fine.

alamb commented Nov 4, 2021

@tafia, could you provide an example of what is in the batch: &RecordBatch argument? Specifically, what is the exact data type of the date column? If it is a DataType::Timestamp(..), what is the unit?

Ideally this would be a small, self-contained example.
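
To illustrate why the unit matters: the same raw 64-bit integer decodes to very different instants depending on whether it is read as seconds, milliseconds, microseconds, or nanoseconds since the epoch. The following is a minimal standard-library sketch with a hypothetical `decode` helper; it does not use arrow or pyarrow, and it is not a claim about the cause of this issue.

```python
from datetime import datetime, timedelta
from typing import Optional

EPOCH = datetime(1970, 1, 1)


def decode(raw: int, unit: str) -> Optional[datetime]:
    """Interpret `raw` as a count of `unit` since the UNIX epoch.

    Returns None when the result falls outside datetime's supported range.
    """
    if unit == "s":
        delta = timedelta(seconds=raw)
    elif unit == "ms":
        delta = timedelta(milliseconds=raw)
    elif unit == "us":
        delta = timedelta(microseconds=raw)
    elif unit == "ns":
        # datetime has microsecond resolution, so sub-microsecond digits are dropped.
        delta = timedelta(microseconds=raw // 1000)
    else:
        raise ValueError(f"unknown unit: {unit}")
    try:
        return EPOCH + delta
    except OverflowError:
        return None


if __name__ == "__main__":
    raw = 949_363_200_000  # 2000-02-01 expressed in milliseconds
    for unit in ("s", "ms", "us", "ns"):
        print(unit, decode(raw, unit))
```

Only the millisecond reading recovers 2000-02-01 here, which is why reporting the exact DataType and unit alongside a few raw values makes a reproduction much easier to reason about.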
