
wrong result when operation parquet #2044

Closed
Tracked by #3148
alitrack opened this issue Mar 21, 2022 · 13 comments
Labels
bug Something isn't working

Comments

@alitrack

Describe the bug
When using register_parquet, datetime columns give wrong results, but register_csv has no problem.
Reading the file into a dataframe with pandas and using register_record_batches is also OK.

To Reproduce
Steps to reproduce the behavior:

import datafusion
import pyarrow as pa

ctx = datafusion.ExecutionContext()
ctx.register_parquet('taxi_sample','yellow_taxi_sample.parquet')
sql = "select * from taxi_sample"
pydf = ctx.sql(sql)
pa.Table.from_batches(pydf.collect()).to_pandas()

Expected behavior
Expected result:

	pickup_datetime
0	2009-01-04 02:52:00
1	2009-01-04 03:31:00
2	2009-01-03 15:43:00

but got:

pickup_datetime
0	1970-01-15 05:57:17.520
1	1970-01-15 05:57:19.860
2	1970-01-15 05:56:37.380

Additional context

The sample data is part of the "Year 2009-2015 - 1 billion rows - 107GB" dataset.

yellow_taxi_sample.parquet.zip

@alitrack alitrack added the bug Something isn't working label Mar 21, 2022
@jiangzhx
Contributor

@alitrack which version?

@alitrack
Author

@jiangzhx 0.5.1

@jiangzhx
Contributor

I tested your parquet file yellow_taxi_sample.parquet.zip with the Rust DataFusion master version.

Query response:
pickup_datetime
0 1970-01-15 05:57:17.520
1 1970-01-15 05:57:19.860
2 1970-01-15 05:56:37.380

Is the parquet file correct?

@jiangzhx
Contributor

I also tested with the Python 0.5.1 bindings; the query result is the same as with the Rust DataFusion version.

@alitrack
Author

Please try pandas, pyarrow, or vaex; they all give the same (correct) result:

import pandas as pd
#pd.read_parquet("yellow_taxi_sample.parquet", engine='pyarrow')
pd.read_parquet("yellow_taxi_sample.parquet", engine='fastparquet')

@jiangzhx
Contributor

import pandas as pd
df = pd.read_parquet('yellow_taxi_sample.parquet')
#df = pd.read_parquet('yellow_taxi_sample.parquet',engine='pyarrow')
print(df.head())

It prints the right result:

(screenshot: correct pickup_datetime values)

@alitrack
Author

Yes, but yellow_taxi_2009_2015_f32.parquet is about 28 GB, so I want to use register_parquet rather than reading it with pandas or vaex first.

@korowa
Contributor

korowa commented Mar 22, 2022

@alitrack, the issue may be caused by the "ARROW:schema" key-value pair in the .parquet metadata - it contains a schema which treats the pickup/dropoff_datetime fields as Timestamp(Nanosecond) instead of the Timestamp(Microsecond) in the actual file schema. I suppose removing this tag from the file metadata should help.

@jiangzhx
Contributor

@alitrack, the issue may be caused by the "ARROW:schema" key-value pair in the .parquet metadata - it contains a schema which treats the pickup/dropoff_datetime fields as Timestamp(Nanosecond) instead of the Timestamp(Microsecond) in the actual file schema. I suppose removing this tag from the file metadata should help.

I did more research and read the parquet metadata with parquet = { version = "9.0.0" }.
The value of the ARROW:schema key is base64-encoded.

@korowa was right: the column pickup_datetime's datatype is datetime64[ns]

{
    "name": "pickup_datetime",
    "field_name": "pickup_datetime",
    "pandas_type": "datetime",
    "numpy_type": "datetime64[ns]",
    "metadata": null
},

@jiangzhx
Contributor

jiangzhx commented Mar 31, 2022

Confused....

The print_row_with_parquet test case gets the right result:
pickup_datetime: 2009-01-04 02:52:00 +00:00

print_row_with_datafusion gets the wrong result:
pickup_datetime: 1970-01-15 05:57:17.520

use datafusion::error::Result;
use datafusion::prelude::ExecutionContext;
use std::convert::TryFrom;
use std::path::Path;

use parquet::file::reader::FileReader;
use parquet::file::serialized_reader::SerializedFileReader;

#[tokio::test]
async fn print_row_with_parquet() -> Result<()> {
	let path = Path::new("yellow_taxi_sample.parquet");
	let row_iter = SerializedFileReader::try_from(path).unwrap().into_iter();

	for row in row_iter {
		let s = row.to_string();
		println!("{}", s);
	}
	Ok(())
}

#[tokio::test]
async fn print_row_with_datafusion() -> Result<()> {
	let mut ctx = ExecutionContext::new();
	ctx.register_parquet("taxi_sample", "yellow_taxi_sample.parquet")
		.await?;
	let df = ctx.sql("SELECT * from taxi_sample").await?;
	df.show().await?;

	Ok(())
}


@tustvold
Contributor

tustvold commented Apr 1, 2022

This is likely related to apache/arrow-rs#1459

@tustvold
Contributor

I think this should have been resolved by apache/arrow-rs#1682, could you let me know if the issue still persists?

@alitrack
Author

@tustvold I tested the latest version of roapi and it is fixed; only the datafusion Python binding still has the issue. Thanks!
