
wrong result when operation parquet #2044

Closed
Tracked by #3148
alitrack opened this issue Mar 21, 2022 · 13 comments
Labels
bug Something isn't working

Comments

@alitrack

Describe the bug
When using register_parquet, datetime columns give wrong results, but register_csv has no problem.
Reading the file into a dataframe with pandas and using register_record_batches is also OK.

To Reproduce
Steps to reproduce the behavior:

import datafusion
import pyarrow as pa

ctx = datafusion.ExecutionContext()
ctx.register_parquet('taxi_sample','yellow_taxi_sample.parquet')
sql = "select * from taxi_sample"
pydf = ctx.sql(sql)
pa.Table.from_batches(pydf.collect()).to_pandas()

Expected behavior
Expected result:

	pickup_datetime
0	2009-01-04 02:52:00
1	2009-01-04 03:31:00
2	2009-01-03 15:43:00

but got:

pickup_datetime
0	1970-01-15 05:57:17.520
1	1970-01-15 05:57:19.860
2	1970-01-15 05:56:37.380

Additional context

The sample data is part of the "Year 2009-2015 - 1 billion rows - 107GB" dataset.

yellow_taxi_sample.parquet.zip

@alitrack alitrack added the bug Something isn't working label Mar 21, 2022
@jiangzhx
Contributor

@alitrack which version?

@alitrack
Author

@jiangzhx 0.5.1

@jiangzhx
Contributor

I tested your parquet file yellow_taxi_sample.parquet.zip with the Rust DataFusion master version.

Query response:
pickup_datetime
0 1970-01-15 05:57:17.520
1 1970-01-15 05:57:19.860
2 1970-01-15 05:56:37.380

Is the parquet file correct?

@jiangzhx
Contributor

I also tested with the Python 0.5.1 bindings; the query result is the same as with the Rust DataFusion version.

@alitrack
Author

Please try pandas, pyarrow, or vaex; they all give the same (correct) result:

import pandas as pd
#pd.read_parquet("yellow_taxi_sample.parquet", engine='pyarrow')
pd.read_parquet("yellow_taxi_sample.parquet", engine='fastparquet')

@jiangzhx
Contributor

import pandas as pd
df = pd.read_parquet('yellow_taxi_sample.parquet')
#df = pd.read_parquet('yellow_taxi_sample.parquet',engine='pyarrow')
print(df.head())

It prints the right result:

(screenshot: correct pickup_datetime values)

@alitrack
Author

Yes, but yellow_taxi_2009_2015_f32.parquet is about 28 GB, so I want to use register_parquet rather than reading it with pandas or vaex first.

@korowa
Contributor

korowa commented Mar 22, 2022

@alitrack, the issue may be caused by the "ARROW:schema" key-value pair in the .parquet metadata - it contains a schema which treats the pickup/dropoff_datetime fields as Timestamp(Nanosecond) instead of the Timestamp(Microsecond) in the actual file schema. I suppose removing this tag from the file metadata should help.

@jiangzhx
Contributor

@alitrack, the issue may be caused by the "ARROW:schema" key-value pair in the .parquet metadata - it contains a schema which treats the pickup/dropoff_datetime fields as Timestamp(Nanosecond) instead of the Timestamp(Microsecond) in the actual file schema. I suppose removing this tag from the file metadata should help.

I did more research and read the parquet metadata with parquet = { version = "9.0.0" }.
The value of the ARROW:schema key is base64-encoded.

@korowa was right: the column pickup_datetime's datatype is datetime64[ns]

{
    "name": "pickup_datetime",
    "field_name": "pickup_datetime",
    "pandas_type": "datetime",
    "numpy_type": "datetime64[ns]",
    "metadata": null
},

@jiangzhx
Contributor

jiangzhx commented Mar 31, 2022

Confused....

The print_row_with_parquet test case gets the right result:
pickup_datetime: 2009-01-04 02:52:00 +00:00

print_row_with_datafusion gets the wrong result:
pickup_datetime: 1970-01-15 05:57:17.520

use datafusion::error::Result;
use datafusion::prelude::ExecutionContext;
use std::convert::TryFrom;
use std::path::Path;

use parquet::file::reader::FileReader;
use parquet::file::serialized_reader::SerializedFileReader;

#[tokio::test]
async fn print_row_with_parquet() -> Result<()> {
	let path = Path::new("yellow_taxi_sample.parquet");
	let row_iter = SerializedFileReader::try_from(path).unwrap().into_iter();

	for row in row_iter {
		let s = row.to_string();
		println!("{}", s);
	}
	Ok(())
}

#[tokio::test]
async fn print_row_with_datafusion() -> Result<()> {
	let mut ctx = ExecutionContext::new();
	ctx.register_parquet("taxi_sample", "yellow_taxi_sample.parquet")
		.await?;
	let df = ctx.sql("SELECT * from taxi_sample").await?;
	df.show().await?;

	Ok(())
}


@tustvold
Contributor

tustvold commented Apr 1, 2022

This is likely related to apache/arrow-rs#1459

@tustvold
Contributor

I think this should have been resolved by apache/arrow-rs#1682, could you let me know if the issue still persists?

@alitrack
Author

@tustvold I tested the latest version of roapi and it is fixed; only the datafusion Python binding still has the issue. Thanks!
