Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Python] Cannot filter dataset with a timestamp (with timezone) column #37110

Closed
brokenjacobs opened this issue Aug 10, 2023 · 7 comments · Fixed by #37135
Closed

[C++][Python] Cannot filter dataset with a timestamp (with timezone) column #37110

brokenjacobs opened this issue Aug 10, 2023 · 7 comments · Fixed by #37135

Comments

@brokenjacobs
Copy link

Describe the bug, including details regarding any error messages, version, and platform.

Pyarrow: 12.0.1

This is similar to #32366 but not exactly the same. I have a dataset where I am trying to filter by a column in my parquet files that is a timestamp with timezone type. Every way that I try to do the filter, pyarrow coerces the type to a non-timestamp type and fails to convert to a table. For example:

(ds is my dataset)
>>> dt
datetime.datetime(2023, 6, 29, 4, 0, tzinfo=<UTC>)
>>> dcast = pa.scalar(dt, type=pa.timestamp('ms', tz='UTC')
>>> filter = (ds.field('readout_time')>dcast)
>>> df = dataset.to_table(filter=filter,columns=['readout_time', 'acceleration_x', 'acceleration_y', 'acceleration_z']).to_pandas()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/_dataset.pyx", line 546, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3449, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Cannot compare timestamp with timezone to timestamp without timezone, got: timestamp[ms, tz=UTC] and timestamp[ms]

This was just my latest attempt to use an actual arrow scaler with timezone, and it still changes the type to a non timestamp type in the comparison. It also fails using a datetime in the filter directly.

Component(s)

Python

@mapleFU
Copy link
Member

mapleFU commented Aug 10, 2023

ds.schema.to_string would you mind print this?

@brokenjacobs
Copy link
Author

brokenjacobs commented Aug 11, 2023

>>> dataset.schema.to_string
<built-in method to_string of pyarrow.lib.Schema object at 0x12f357570>

But I think you wanted this?:

source_id: string
site_id: string
readout_time: timestamp[ms, tz=UTC]
packet_counter: uint16
roll: float
pitch: float
yaw: float
acceleration_x: float
acceleration_y: float
acceleration_z: float
free_acceleration_x: float
free_acceleration_y: float
free_acceleration_z: float
gyroscope_x: float
gyroscope_y: float
gyroscope_z: float
status_word: int32
kafka_key: string
kakfa_ts_type: uint8
kafka_ts: timestamp[ms]
kafka_partition: uint8
kafka_offset: uint64
kafka_topic: string
ds: string
ms: string
-- schema metadata --
parquet.avro.schema: '{"type": "record", "name": "mti300ahrs", "namespace' + 3061
writer.model.name: 'avro'

or

>>> dataset.schema.to_string()
'source_id: string\nsite_id: string\nreadout_time: timestamp[ms, tz=UTC]\npacket_counter: uint16\nroll: float\npitch: float\nyaw: float\nacceleration_x: float\nacceleration_y: float\nacceleration_z: float\nfree_acceleration_x: float\nfree_acceleration_y: float\nfree_acceleration_z: float\ngyroscope_x: float\ngyroscope_y: float\ngyroscope_z: float\nstatus_word: int32\nkafka_key: string\nkakfa_ts_type: uint8\nkafka_ts: timestamp[ms]\nkafka_partition: uint8\nkafka_offset: uint64\nkafka_topic: string\nds: string\nms: string\n-- schema metadata --\nparquet.avro.schema: \'{"type": "record", "name": "mti300ahrs", "namespace\' + 3061\nwriter.model.name: \'avro\''

The code that created this file set the timezone in the schema.

@mapleFU
Copy link
Member

mapleFU commented Aug 12, 2023

>>> dataset_ts.schema.to_string()
'DATE_OF_BIRTH: timestamp[ns, tz=UTC]'
>>> date
datetime.datetime(1990, 1, 1, 0, 0, tzinfo=datetime.timezone.utc)
>>> dcast = pa.scalar(dt, type=pa.timestamp('ms', tz='UTC')
>>> dcast = pa.scalar(date, type=pa.timestamp('ms', tz='UTC'))
>>> filter = (ds.field('DATE_OF_BIRTH')>dcast)
>>> df = dataset_ts.to_table(filter=filter, columns=['DATE_OF_BIRTH'])
>>> df

Emm I've try to use this in 9.0, seems it works, but in 12.0.1 it would failed. Let me checkout whats changed here

@mapleFU
Copy link
Member

mapleFU commented Aug 12, 2023

>>> filter = (ds.field('DATE_OF_BIRTH') > pa.compute.scalar(date).cast(type=pa.timestamp('ms', 'UTC')))
>>> dataset.to_table(filter=filter)

Can you try like this to workaround first?

@mapleFU
Copy link
Member

mapleFU commented Aug 12, 2023

( I guess it might related to b56b91e )

pitrou pushed a commit that referenced this issue Aug 14, 2023
### Rationale for this change

This patch ( #15180 ) adds a `SmallestTypeFor` to handling expression type. However, it lost timezone when handling.

### What changes are included in this PR?

Add `timezone` in `SmallestTypeFor`

### Are these changes tested?

Currently not

### Are there any user-facing changes?

Yeah it's a bugfix

* Closes: #37110

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou pitrou added this to the 14.0.0 milestone Aug 14, 2023
@kou kou changed the title Cannot filter dataset with a timestamp (with timezone) column [C++][Python] Cannot filter dataset with a timestamp (with timezone) column Aug 15, 2023
@brokenjacobs
Copy link
Author

Sorry just got back from vacation, do you still need more testing done on this?

@mapleFU
Copy link
Member

mapleFU commented Aug 30, 2023

@brokenjacobs I've fixed this. but this is not in 13.0.0 release...

loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…pache#37135)

### Rationale for this change

This patch ( apache#15180 ) adds a `SmallestTypeFor` to handling expression type. However, it lost timezone when handling.

### What changes are included in this PR?

Add `timezone` in `SmallestTypeFor`

### Are these changes tested?

Currently not

### Are there any user-facing changes?

Yeah it's a bugfix

* Closes: apache#37110

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants