-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-13022: [R] bindings for lubridate's year, isoyear, quarter, month, day, wday, yday, isoweek, hour, minute, and second functions #10507
Conversation
r/R/dplyr-functions.R
Outdated
#' | ||
#' Arrow's `day_of_week` kernel counts from 0 (Monday) to 6 (Sunday), whereas | ||
#' `lubridate::wday` counts from 1 to 7, and allows users to specify which day | ||
#' of the week is first (Sunday by default). This function converts the returned |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was going to mention that we actually also have an "ISO weekday" in the C++ kernels, but only as field in the iso_calendar
kernel, and not as separate one. We could also add iso_day_of_week
if that helps, or even just make day_of_week
follow the ISO conventions. Because the 0 (monday) - 6 (sunday) might be something Python specific.
But then looked at what C++ does, and there the default is actually 0 (sunday) - 6 (saturday), so yet something else ;) (although I see that also Postgres uses that for dow
)
In the end, once you have it in one form, it's an easy conversion to any of the other of course, so it might not matter that much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could add options to the C++ kernel to enable different behaviors there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That probably makes more sense than my workaround here actually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we go that way it would probably be best to have another Jira for "TemporalOptions". It's probably best you proceed with the workaround and we loop back to this later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, will do, and will open a JIRA, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please leave a comment with that JIRA issue somewhere in this code that we expect to remove.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
r/R/expression.R
Outdated
"day" = "day", | ||
"yday" = "day_of_year", | ||
"isoweek" = "iso_week", | ||
"minute" = "minute", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"hour" is missing here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, will add in now.
r/R/expression.R
Outdated
"yday" = "day_of_year", | ||
"isoweek" = "iso_week", | ||
"minute" = "minute", | ||
"second" = "second" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that second might do something different. I think "second" in lubridate is the equivalent of "second + subsecond" in arrow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Hmm, I think this might be more complicated than that actually, as lubridate rounds it to 1 decimal place, using R's odd-even rounding whereas Arrow doesn't do the rounding, so I'm not sure if this can be done without rounding being implemented (see https://issues.apache.org/jira/browse/ARROW-12744)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean exactly with rounding? A quick try gives me:
> second(ymd_hms("2011-06-04 12:00:01.123456"))
[1] 1.123456
which seems to give all decimals
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it does; I think I must have been getting mixed up with something I'd changed when I needed to update a test I wrote, never mind,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about nanoseconds?
> second(ymd_hms("2011-06-04 12:00:01.123456789"))
[1] 1.123457
Arrow would probably return 1.123456789
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think lubridate / R supports nanosecond resolution
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so too. So probably it should be "second = second + round(subsecond, 6)
" to match that behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IDK that we should truncate the data like this. A slight (technical) API difference is probably better than throwing away precision.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I would also say that it's not because lubridate does not support nanoseconds that if you actually have nanoseconds (which is possible since arrow supports it) those should be discarded.
0e0a832
to
660ccfd
Compare
d67a224
to
942e92f
Compare
r/R/dplyr-functions.R
Outdated
Expression$create( | ||
"cast", | ||
Expression$create("divide_checked", e1, e2), | ||
options = cast_options(to_type = int32(), allow_float_truncate = TRUE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you make all of the scalars integers (e.g. 1L), do you still need to cast?
r/R/dplyr-functions.R
Outdated
|
||
e2 = Expression$scalar(7) | ||
|
||
# (e1 - e2 * ( e1 %/% e2 )) + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you try doing exactly this expression? I would think it should just work because Ops.Expression
is defined for all of those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, I had no idea, deleted a lot of unnecessary lines of code now, thanks @nealrichardson !
r/R/expression.R
Outdated
@@ -29,8 +29,17 @@ | |||
# stringr spellings of those | |||
"str_length" = "utf8_length", | |||
"str_to_lower" = "utf8_lower", | |||
"str_to_upper" = "utf8_upper" | |||
"str_to_upper" = "utf8_upper", | |||
# str_trim is defined in dplyr.R |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update this comment to say dplyr-functions.R?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One whitespace suggestion but otherwise LGTM, great work!
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
…r the "day_of_week" temporal kernel This is to resolve [ARROW-13054](https://issues.apache.org/jira/browse/ARROW-13054). This will be needed for casting timezone-naive timestamps [ARROW-13033](https://issues.apache.org/jira/browse/ARROW-13033) and defining [starting day of the week](#10507 (review)). Closes #10598 from rok/ARROW-13054 Lead-authored-by: Rok <rok@mihevc.org> Co-authored-by: Rok Mihevc <rok@mihevc.org> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
No description provided.