Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Support Temporal Extraction Functions for duration types #33962

Open
mroeschke opened this issue Jan 31, 2023 · 11 comments
Open

[C++] Support Temporal Extraction Functions for duration types #33962

mroeschke opened this issue Jan 31, 2023 · 11 comments

Comments

@mroeschke
Copy link
Contributor

Describe the enhancement requested

In [2]: import datetime

In [3]: import pyarrow as pa

In [4]: pa.__version__
Out[4]: '10.0.1'

In [5]: pa.compute.day(pa.array([datetime.timedelta(1)]))
ArrowNotImplementedError: Function 'day' has no kernel matching input types (duration[us])

Would be great to support extracting the day, second, microseconds, nanoseconds components of a timedelta

Component(s)

Python

@jorisvandenbossche jorisvandenbossche changed the title ENH: Support Temporal Extraction Functions for duration types [C++] Support Temporal Extraction Functions for duration types Feb 1, 2023
@jorisvandenbossche
Copy link
Member

Note that there is not necessarily a clear definition of what the extracted "second" (or other component) of a duration is.

In pandas, there are two variants: for example Timedelta.seconds and Timedelta.components.seconds (https://pandas.pydata.org/docs/dev/user_guide/timedeltas.html#attributes). I assume we would want the latter here? (since the former is based on an implementation detail of python's datetime.timedelta)

@mroeschke
Copy link
Contributor Author

I assume we would want the latter here?

Yeah I think that would make sense. The context of the request was actually implementing dt.seconds to match what datetime.timedelta returns, but that can be composed from the individual components

@zanmato1984
Copy link
Collaborator

I want to take this but I'm not really much of a Python guy. So could anyone kindly help me to understand more about the Python counter-part of this task?

My questions is that, according to the pandas doc (https://pandas.pydata.org/docs/dev/user_guide/timedeltas.html#timedelta-limitations), the internal representation of pandas Timedelta is an int64 in nanoseconds. Does it mean in C++, we can assume that it always binds to a DurationType with TimeUnit::NANO?

Thanks.

@rok
Copy link
Member

rok commented Dec 17, 2023

@zanmato1984 you're welcome to tackle this!

Does it mean in C++, we can assume that it always binds to a DurationType with TimeUnit::NANO?

Yes, pandas.timedelta appears to be always be duration(TimeUnit::NANOSECOND) as per your link (I did not check further). Although we might want to add component extraction for all possible duration units if we add it.

C++ kernel logic would probably fit here - you might be able to reuse other kernels and just add a WithDurations to register them for duration types. You can add tests here. As for python side of things - you can just add a test like this one.

@zanmato1984
Copy link
Collaborator

@zanmato1984 you're welcome to tackle this!

Does it mean in C++, we can assume that it always binds to a DurationType with TimeUnit::NANO?

Yes, pandas.timedelta appears to be always be duration(TimeUnit::NANOSECOND) as per your link (I did not check further). Although we might want to add component extraction for all possible duration units if we add it.

C++ kernel logic would probably fit here - you might be able to reuse other kernels and just add a WithDurations to register them for duration types. You can add tests here. As for python side of things - you can just add a test like this one.

Thank you @rok for the kind help. I'm making some progress in my local and will come up with a PR later.

@zanmato1984
Copy link
Collaborator

take

@zanmato1984
Copy link
Collaborator

When I'm implementing the kernels, I found some unexpected complexity - though it's actually working.

I want to make sure that this complexity is necessary and I'm doing it the proper way. So I drafted PR #39267 and hope someone familiar with compute kernels could help to confirm it's essentially correct.

If there is a better way, I'm all ears.

Thanks.

@rok
Copy link
Member

rok commented Dec 18, 2023

Thanks for doing this @zanmato1984 !
If you think reusing kernels is unnecessarily complex we might want to avoid it.
Let's discuss on the PR.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Dec 20, 2023

We might need some more discussion about what we actually want here. The current PR adds "day", "second", "milli/micro/nanosecond" and "subsecond" kernels. And I think this is mostly modelled after the Python datetime.timedelta attributes (see also https://pandas.pydata.org/docs/user_guide/timedeltas.html#attributes for some context).

For example the "second" kernel in the PR would return the number of seconds in the duration value that represents the number of seconds for the part of the duration of >= 0 and < 1 day. Equivalent Python example:

>>> import datetime
>>> td = datetime.timedelta(days=2, hours=3, seconds=4, milliseconds=5)
>>> td.seconds
10804
# which is 3 hours (60*60 seconds) + 4 seconds
>>> 3*3600+4
10804

But a reason for Python to have those attributes, is because that is how it is implemented under the hood (it stores separate numbers of days, seconds and microseconds (https://docs.python.org/3/library/datetime.html#timedelta-objects).
In Arrow, we simply store a single value (number of (milli/micro/nano)seconds depending on the unit), so it doesn't necessarily make sense to copy the interface of Python's datetime.timedelta to extract those components (for example, why days and seconds, and not also hours?). Also note that the Python attributes are plural, in contrast to the names for the timestamp/date/time parts.

Checking with some other software about what kind of operations are support for Duration types:

  • Python's datetime.timedelta has an additional method total_seconds(), which always returns all seconds as a float (in the example above, td.total_seconds() returns 183604.005).
    This could be useful to add as an easier way to get the duration in seconds, regardless of the unit (you can already achieve this currently by dividing by a duration of 1 second).
  • As mentioned earlier in this thread, pandas has an additional components attribute, that gives you the different components as they would be displayed, i.e. actually splitted in days/hours/minutes/seconds/milli...)
  • The R lubridate package doesn't seem to have specific methods for its duration type for this type of operations (https://lubridate.tidyverse.org/reference/index.html#durations)
  • The Joda-Time Java package has getStandardDays/getStandardHours/getStandardMinutes/getStandardSeconds methods (https://www.joda.org/joda-time/key_duration.html, https://www.joda.org/joda-time/apidocs/org/joda/time/Duration.html). But in this case, they are not "mutually exclusive", i.e. the seconds still include the days/hours/minutes as well.
  • The Rust chrono crate has a Duration type with num_days/num_hours/num_minutes/.. etc methods (https://docs.rs/chrono/latest/chrono/struct.Duration.html), but again they return the total number of days/hours/minutes/seconds/.., and not e.g. the number of hours after the number of days already has been subtracted (i.e. the number of days is simply "number of seconds / seconds_per_day")

@zanmato1984
Copy link
Collaborator

Thanks @jorisvandenbossche for the informative comment!

Though I'm neutral to adding or not adding these kernels, I think the following

so it doesn't necessarily make sense to copy the interface of Python's datetime.timedelta to extract those components (for example, why days and seconds, and not also hours?). Also note that the Python attributes are plural, in contrast to the names for the timestamp/date/time parts.

makes a great point. I felt the same (unintuitive) way when I was adding the Python test case in my PR:

days = td.dt.days.astype("int64")
...
assert pc.day(tda).equals(pa.array(days))

@zanmato1984
Copy link
Collaborator

Seems so far it's still unclear of what we really want from this request.

Though I may still want to help if the discussion is concluded at some point in the future, I'd excuse myself from assignee and close the PR for now.

@zanmato1984 zanmato1984 removed their assignment Feb 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants