Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Querying datetime data in DataFusion with an embedded timezone always fails #153

Closed
Tracked by #3148
alamb opened this issue Apr 26, 2021 · 2 comments
Closed
Tracked by #3148
Labels
bug Something isn't working datafusion Changes in the datafusion crate

Comments

@alamb
Copy link
Contributor

alamb commented Apr 26, 2021

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-11324

We have a number (~ hundreds of thousands) of Parquet files that have embedded Arrow schemas in them that have time-valued columns with the type DateTime(TimeUnit::Nanosecond, Some("UTC")).

 

One of the changes in the Arrow 2 -> 3 working window was to make the Parquet loader prefer the Arrow schema compared to the one generated from the columns. 

 

But because DataFusion has the timezone field of the DateTime variant hardcoded as None, we can't load any of our data after this upgrade; we get errors like:

{{SELECT * FROM parquet_table WHERE ("timestamp" >= to_timestamp('2010-03-24T13:00:00.000000Z') AND "timestamp" <= to_timestamp('2010-03-25T00:00:00.000000Z')) ORDER BY timestamp ASC NULLS LAST;}}
{{Plan("'Timestamp(Nanosecond, Some("UTC")) >= Timestamp(Nanosecond, None)' can't be evaluated because there isn't a common type to coerce the types to")}}

 

Any ideas/thoughts? 

@alamb alamb added the datafusion Changes in the datafusion crate label Apr 26, 2021
@alamb
Copy link
Contributor Author

alamb commented Apr 26, 2021

Comment from Andrew Lamb(alamb) @ 2021-01-20T14:26:14.792+0000:

[~m18e] I can try and take a look at fixing this -- do you have a reproducer (e.g. the input file) easily at hand?

Comment from Max Burke(m18e) @ 2021-01-21T19:36:59.503+0000:

Yup. I've attached a test file. For what it's worth, this is the change that we've applied locally to work around it: [https://github.com/urbanlogiq/arrow/commit/9be88cf2994fe55ae0d2f5ae137b9e73daac1ef0] 

 

 

 [^0100c909-2537-c4dc-ce1d-1b7a75d613e8.parquet]

@andygrove andygrove added the bug Something isn't working label May 1, 2021
@alamb
Copy link
Contributor Author

alamb commented Feb 2, 2023

This query appears to be working as of DataFusion 17.0.0 in datafusion-cli -- FYI @maxburke

❯ SELECT * FROM '0100c909-2537-c4dc-ce1d-1b7a75d613e8.parquet' WHERE ("timestamp" >= to_timestamp('2010-03-24T13:00:00.000000Z') AND "timestamp" <= to_timestamp('2010-03-25T00:00:00.000000Z')) ORDER BY timestamp ASC NULLS LAST;
0 rows in set. Query took 0.026 seconds.
❯ SELECT * FROM '0100c909-2537-c4dc-ce1d-1b7a75d613e8.parquet' limit 10;
+--------------------------+----------+-------+-----------------------------------------------+-----------+
| ul_node_id               | id       | count | timestamp                                     | direction |
+--------------------------+----------+-------+-----------------------------------------------+-----------+
| x5uF7gCs0N2UvC4icn2y7g== | 11715577 | 0     | 2011-09-09T07:00:00 (Unknown Time Zone 'UTC') | EB        |
| x5uF7gCs0N2UvC4icn2y7g== | 11715578 | 0     | 2011-09-09T08:00:00 (Unknown Time Zone 'UTC') | EB        |
| x5uF7gCs0N2UvC4icn2y7g== | 11715579 | 0     | 2011-09-09T09:00:00 (Unknown Time Zone 'UTC') | EB        |
| x5uF7gCs0N2UvC4icn2y7g== | 11715580 | 0     | 2011-09-09T10:00:00 (Unknown Time Zone 'UTC') | EB        |
| x5uF7gCs0N2UvC4icn2y7g== | 11715581 | 0     | 2011-09-09T11:00:00 (Unknown Time Zone 'UTC') | EB        |
| x5uF7gCs0N2UvC4icn2y7g== | 11715582 | 0     | 2011-09-09T12:00:00 (Unknown Time Zone 'UTC') | EB        |
| x5uF7gCs0N2UvC4icn2y7g== | 11715583 | 0     | 2011-09-09T13:00:00 (Unknown Time Zone 'UTC') | EB        |
| x5uF7gCs0N2UvC4icn2y7g== | 11715584 | 0     | 2011-09-09T14:00:00 (Unknown Time Zone 'UTC') | EB        |
| x5uF7gCs0N2UvC4icn2y7g== | 11715585 | 0     | 2011-09-09T15:00:00 (Unknown Time Zone 'UTC') | EB        |
| x5uF7gCs0N2UvC4icn2y7g== | 11715586 | 0     | 2011-09-09T16:00:00 (Unknown Time Zone 'UTC') | EB        |
+--------------------------+----------+-------+-----------------------------------------------+-----------+
10 rows in set. Query took 0.005 seconds.

@alamb alamb closed this as completed Feb 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working datafusion Changes in the datafusion crate
Projects
None yet
Development

No branches or pull requests

2 participants