[C++] Make casting timestamp and duration zero-copy when TimeUnit matches #34210

Closed
icexelloss opened this issue Feb 15, 2023 · 7 comments · Fixed by #34270

Comments

@icexelloss
Contributor

Describe the bug, including details regarding any error messages, version, and platform.

I am testing performance of casting datatypes with pyarrow Table and saw some unexpected performance.

In short, casting a column from "tz-naive" to "tz-utc" is much slower than casting from "tz-naive" to "int64" (about 259 ms versus 231 µs wall time below), which is unexpected because I think both of these should be metadata-only changes.

Here is a partial repro:

In [5]: df = pd.DataFrame({'time': np.arange(100 * 1000  * 1000)})

In [6]: table = pa.Table.from_pandas(df)

In [8]: schema_naive = pa.schema([pa.field('time' , pa.timestamp('ns'))])

In [9]: schema_tz = pa.schema([pa.field('time' , pa.timestamp('ns', tz='UTC'))])

In [10]: table = table.cast(schema_naive)

In [14]: schema_int = pa.schema([pa.field('time' , pa.int64())])

In [16]: %time table.cast(schema_int)
CPU times: user 114 µs, sys: 30 µs, total: 144 µs
Wall time: 231 µs
Out[16]: 
pyarrow.Table
time: int64
----
time: [[0,1,2,3,4,...,99999995,99999996,99999997,99999998,99999999]]

In [17]: %time table.cast(schema_tz)
CPU times: user 119 ms, sys: 140 ms, total: 260 ms
Wall time: 259 ms
Out[17]: 
pyarrow.Table
time: timestamp[ns, tz=UTC]
----
time: [[1970-01-01 00:00:00.000000000,1970-01-01 00:00:00.000000001,1970-01-01 00:00:00.000000002,1970-01-01 00:00:00.000000003,1970-01-01 00:00:00.000000004,...,1970-01-01 00:00:00.099999995,1970-01-01 00:00:00.099999996,1970-01-01 00:00:00.099999997,1970-01-01 00:00:00.099999998,1970-01-01 00:00:00.099999999]]

In [18]: table
Out[18]: 
pyarrow.Table
time: timestamp[ns]
----
time: [[1970-01-01 00:00:00.000000000,1970-01-01 00:00:00.000000001,1970-01-01 00:00:00.000000002,1970-01-01 00:00:00.000000003,1970-01-01 00:00:00.000000004,...,1970-01-01 00:00:00.099999995,1970-01-01 00:00:00.099999996,1970-01-01 00:00:00.099999997,1970-01-01 00:00:00.099999998,1970-01-01 00:00:00.099999999]]

In [19]: pa.__version__
Out[19]: '11.0.0'

Component(s)

Python

@jorisvandenbossche
Member

Indeed, that seems to be because this is not done as a zero-copy cast, since the data types are "different" (while only the unit matters, the timezone can be ignored for this operation).

@rok
Member

rok commented Feb 16, 2023

This would definitely be nice to have.

@icexelloss
Contributor Author

@rok This might be a silly question, but why don't we dynamically dispatch to a zero-copy / no-op function here if the units are the same?

https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_cast_temporal.cc#L158
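The dispatch idea in the question above can be sketched as a toy model in plain Python. This is not Arrow's C++ kernel: `cast_timestamp` and `UNIT_NS` are hypothetical names made up for illustration, and the timezone is deliberately left out because only the unit affects the stored integers.

```python
from array import array

# Nanoseconds per tick for each unit (names follow Arrow's s/ms/us/ns units).
UNIT_NS = {"s": 1_000_000_000, "ms": 1_000_000, "us": 1_000, "ns": 1}


def cast_timestamp(values, from_unit, to_unit):
    """Hypothetical helper: cast int64 ticks between units, ignoring timezone."""
    if from_unit == to_unit:
        # Same unit: the stored integers are already correct, so return a
        # view over the same memory instead of building a new buffer.
        return memoryview(values)
    src, dst = UNIT_NS[from_unit], UNIT_NS[to_unit]
    if src > dst:
        # Coarser -> finer unit: multiply every value (allocates a new buffer).
        factor = src // dst
        return memoryview(array("q", (v * factor for v in values)))
    # Finer -> coarser unit: divide every value (allocates a new buffer).
    # A real safe cast would also check for lost precision; omitted here.
    factor = dst // src
    return memoryview(array("q", (v // factor for v in values)))


ticks = array("q", [0, 1, 2])
same = cast_timestamp(ticks, "ns", "ns")
print(same.obj is ticks)  # whether the result still refers to the input buffer
```

The point of the sketch is only the first branch: when the units match, no per-element work is needed, so the cost no longer depends on the length of the column.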

@rok
Member

rok commented Feb 16, 2023

@icexelloss yeah, no op seems like the thing to do. I wonder if zero copy is possible or does the fact we're working with batches prevent that. @westonpace could you comment?

@westonpace
Member

I don't think batches should be a problem. It seems we only have the comment from @wesm to go on here. Given that it was made during a rather large refactor my guess is this is more of a "todo" and less of a "this is a concern". I think it should be pretty safe to zero copy (and it is concerning that we don't).

@rok
Member

rok commented Feb 20, 2023

Ok, I'll give it a shot.

@rok rok self-assigned this Feb 20, 2023
@rok rok changed the title from "[Python] Unexpected performance when casting from tz-naive to tz-aware timestamps" to "[C++] Make casting timestamp and duration zero-copy when TimeUnit matches" Mar 3, 2023
@rok rok closed this as completed in #34270 Mar 3, 2023
rok added a commit that referenced this issue Mar 3, 2023
…meUnit matches (#34270)

### Rationale for this change

Casting from e.g. `timestamp(s, "UTC")` to `timestamp(s)` could be a metadata only change, but is currently a multiplication operation.

### What changes are included in this PR?

This change adds a zero-copy casting path for durations that have equal units and timestamps that have equal units and potentially different timezones.

### Are these changes tested?

We test for correctness and zero-copy.

### Are there any user-facing changes?

No.
* Closes: #34210

Authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Rok Mihevc <rok@mihevc.org>
@rok rok added this to the 12.0.0 milestone Mar 3, 2023