get error value if timestamp represented by the INT96 in the parquet file #9981

Open
liukun4515 opened this issue Apr 7, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@liukun4515
Contributor

liukun4515 commented Apr 7, 2024

Describe the bug

We use the INT96 physical type to store a timestamp column in Parquet files written by our Spark ETL engine.

When we create a logical/physical plan that reads such a Parquet file with the UTC-7 timezone, the values returned for the timestamp column are adjusted, so we get incorrect values.

The field for the timestamp column in the schema of the physical/logical plan is timestamp(nanosecond, "utc-7") or timestamp(nanosecond, "-07:00").

@waitingkuo @alamb

Do you have any insight into this?

To Reproduce

No response

Expected behavior

No response

Additional context

No response

@liukun4515 liukun4515 added the bug Something isn't working label Apr 7, 2024
@alamb
Contributor

alamb commented Apr 8, 2024

Hi @liukun4515

Can you share an example parquet file?

I didn't quite follow your description -- is the issue that your data is already in UTC-7 time (and thus should not be adjusted) but that DataFusion is adjusting the timezone anyways?

@liukun4515
Contributor Author

> I didn't quite follow your description -- is the issue that your data is already in UTC-7 time (and thus should not be adjusted) but that DataFusion is adjusting the timezone anyways?

@alamb

Sorry for the bad description. I will share my parquet file next week.

Our ETL engine (Spark) writes the parquet files to HDFS, and as we all know, Spark stores timestamps relative to UTC / the Unix epoch.

But in arrow-rs, an INT96 column is mapped to the arrow data type DataType::Timestamp(TimeUnit::Nanosecond, None) by this code: https://github.com/apache/arrow-rs/blob/a999fb86764e9310bb4822c7e7c6551f247e0e0b/parquet/src/arrow/schema/primitive.rs#L99

According to the definition of the timestamp type in the arrow format (https://github.com/apache/arrow/blob/main/format/Schema.fbs#L303), a timestamp without a timezone value means that we don't know the reference of the timestamp: https://github.com/apache/arrow/blob/main/format/Schema.fbs#L318
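
For reference, here is a minimal sketch of how an INT96 timestamp is commonly decoded, assuming the layout used by Spark/Impala writers: 8 bytes of nanoseconds within the day followed by 4 bytes of Julian day number, both little-endian. The names `int96_to_unix_nanos` and `encode_int96` are hypothetical and this is not the arrow-rs implementation. It only illustrates that the stored value is a plain offset from the epoch and carries no timezone of its own, so the timezone has to come from the reader.

```rust
/// Julian day number of 1970-01-01 (the Unix epoch).
const JULIAN_DAY_OF_UNIX_EPOCH: i64 = 2_440_588;
const NANOS_PER_DAY: i64 = 86_400 * 1_000_000_000;

/// Decode a 12-byte INT96 timestamp into nanoseconds since the Unix epoch.
/// Assumed layout: bytes 0..8  = nanoseconds within the day (little-endian),
///                 bytes 8..12 = Julian day number (little-endian).
/// Note: an i64 of nanoseconds only covers roughly 1677..2262; this sketch
/// does not guard against dates outside that range.
fn int96_to_unix_nanos(raw: [u8; 12]) -> i64 {
    let mut nanos = [0u8; 8];
    nanos.copy_from_slice(&raw[0..8]);
    let mut day = [0u8; 4];
    day.copy_from_slice(&raw[8..12]);

    let nanos_of_day = i64::from_le_bytes(nanos);
    let julian_day = i64::from(u32::from_le_bytes(day));
    (julian_day - JULIAN_DAY_OF_UNIX_EPOCH) * NANOS_PER_DAY + nanos_of_day
}

/// Hypothetical helper that builds an INT96 value in the same assumed layout.
fn encode_int96(julian_day: u32, nanos_of_day: u64) -> [u8; 12] {
    let mut raw = [0u8; 12];
    raw[0..8].copy_from_slice(&nanos_of_day.to_le_bytes());
    raw[8..12].copy_from_slice(&julian_day.to_le_bytes());
    raw
}
```

Calling `int96_to_unix_nanos(encode_int96(2_440_588, 0))` yields 0, i.e. the Unix epoch itself. Nothing in the 12 bytes says whether that instant should be shown in UTC or in any other zone.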

@alamb
Contributor

alamb commented Apr 13, 2024

I see -- so the expected behavior is that the timestamp should be read as a Timestamp(TimeUnit::Nanosecond, Some("UTC")) rather than Timestamp(TimeUnit::Nanosecond, None)

@liukun4515
Contributor Author

> I see -- so the expected behavior is that the timestamp should be read as a Timestamp(TimeUnit::Nanosecond, Some("UTC")) rather than Timestamp(TimeUnit::Nanosecond, None)

@alamb thanks for your feedback.

Yes.
But it is not only this issue; there is also the question of the timezone value and how the timezone is used for compute in arrow-rs/datafusion.

When we read the timestamp column from the parquet file using the arrow-parquet crate, we get a column whose data type is timestamp(unit, None) or timestamp(unit, "UTC"), but we need to use the timestamp with the 'UTC-7' timezone, or another timezone specified by the customer, when we do compute or other operations on the timestamp column.

I have clarified the issue in this comment on the PR: apache/arrow-rs#5605 (comment)
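
To make the requirement concrete: per the arrow format definition, a timestamp with a timezone stores the same UTC-relative value whichever zone is attached, and only the wall-clock interpretation changes. Below is a minimal sketch using a fixed offset. The helper `wall_clock` is hypothetical, not an arrow-rs or DataFusion API, and it ignores named zones and daylight saving time.

```rust
const SECS_PER_DAY: i64 = 86_400;

/// Wall-clock reading of a UTC epoch value (in seconds) when viewed at a
/// fixed offset from UTC, e.g. `-7 * 3600` for "-07:00".
/// Returns (days since 1970-01-01, hour, minute, second) in that offset.
fn wall_clock(epoch_secs: i64, offset_secs: i64) -> (i64, u32, u32, u32) {
    // The stored instant is unchanged; only the local view is shifted.
    let local = epoch_secs + offset_secs;
    // Euclidean division keeps the time of day non-negative before 1970.
    let day = local.div_euclid(SECS_PER_DAY);
    let secs_of_day = local.rem_euclid(SECS_PER_DAY);
    (
        day,
        (secs_of_day / 3600) as u32,
        ((secs_of_day % 3600) / 60) as u32,
        (secs_of_day % 60) as u32,
    )
}
```

For example, the stored value 1_712_455_200 (2024-04-07 02:00:00 UTC, day 19820 since the epoch) gives `(19820, 2, 0, 0)` with offset 0 and `(19819, 19, 0, 0)` with offset `-7 * 3600`. The same instant falls on the previous calendar day at -07:00, which is why day-based or hour-based compute needs the customer's timezone rather than None or "UTC".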

@liukun4515
Contributor Author

liukun4515 commented Apr 19, 2024

If we could pass a schema into the parquet reader, the type of each column would be defined by the input schema instead of being inferred entirely from the parquet metadata.

We need to wait for the next release of arrow: apache/arrow-rs#5657
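
The idea can be modeled without any library: the reader infers a type from the file metadata, and a caller-supplied type, when present and compatible, takes precedence. The types and the function `resolve_column_type` below are hypothetical stand-ins, not the arrow-rs API; the actual mechanism is whatever apache/arrow-rs#5657 provides.

```rust
/// Hypothetical, simplified column types (not arrow's `DataType`).
#[derive(Debug, Clone, PartialEq)]
enum ColumnType {
    /// Nanosecond timestamp with an optional timezone string.
    TimestampNanos(Option<String>),
    Int64,
}

/// Pick the type a column should be read as.
/// `inferred` is what the file metadata implies; `supplied` is the caller's
/// optional override. In this sketch only the timezone of a timestamp may be
/// overridden; any other mismatch is rejected.
fn resolve_column_type(
    inferred: ColumnType,
    supplied: Option<ColumnType>,
) -> Result<ColumnType, String> {
    let supplied = match supplied {
        None => return Ok(inferred),
        Some(s) => s,
    };
    let compatible = match (&inferred, &supplied) {
        // Same physical values, different timezone annotation: allowed.
        (ColumnType::TimestampNanos(_), ColumnType::TimestampNanos(_)) => true,
        (a, b) => a == b,
    };
    if compatible {
        Ok(supplied)
    } else {
        Err(format!("cannot read {:?} as {:?}", inferred, supplied))
    }
}
```

With this model, an INT96 column inferred as `TimestampNanos(None)` can be read as `TimestampNanos(Some("-07:00".to_string()))` when the caller supplies that type, while an unrelated override such as `Int64` is refused.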

@liukun4515
Contributor Author

Once we can pass a schema to the ParquetRecordBatchStreamBuilder, we need to change the usage of

            let mut builder =
                ParquetRecordBatchStreamBuilder::new_with_options(reader, options)
                    .await?;

in the ParquetExec.
