Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-13022: [R] bindings for lubridate's year, isoyear, quarter, month, day, wday, yday, isoweek, hour, minute, and second functions #10507

Closed
wants to merge 22 commits into from

Conversation

thisisnic
Copy link
Member

No description provided.

@github-actions
Copy link

#'
#' Arrow's `day_of_week` kernel counts from 0 (Monday) to 6 (Sunday), whereas
#' `lubridate::wday` counts from 1 to 7, and allows users to specify which day
#' of the week is first (Sunday by default). This function converts the returned
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to mention that we actually also have an "ISO weekday" in the C++ kernels, but only as field in the iso_calendar kernel, and not as separate one. We could also add iso_day_of_week if that helps, or even just make day_of_week follow the ISO conventions. Because the 0 (monday) - 6 (sunday) might be something Python specific.

But then looked at what C++ does, and there the default is actually 0 (sunday) - 6 (saturday), so yet something else ;) (although I see that also Postgres uses that for dow)

In the end, once you have it in one form, it's an easy conversion to any of the other of course, so it might not matter that much.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add options to the C++ kernel to enable different behaviors there.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That probably makes more sense than my workaround here actually.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we go that way it would probably be best to have another Jira for "TemporalOptions". It's probably best you proceed with the workaround and we loop back to this later.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, will do, and will open a JIRA, thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please leave a comment with that JIRA issue somewhere in this code that we expect to remove.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

r/R/expression.R Outdated
"day" = "day",
"yday" = "day_of_year",
"isoweek" = "iso_week",
"minute" = "minute",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"hour" is missing here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, will add in now.

r/R/expression.R Outdated
"yday" = "day_of_year",
"isoweek" = "iso_week",
"minute" = "minute",
"second" = "second"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that second might do something different. I think "second" in lubridate is the equivalent of "second + subsecond" in arrow

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Hmm, I think this might be more complicated than that actually, as lubridate rounds it to 1 decimal place, using R's odd-even rounding whereas Arrow doesn't do the rounding, so I'm not sure if this can be done without rounding being implemented (see https://issues.apache.org/jira/browse/ARROW-12744)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean exactly with rounding? A quick try gives me:

> second(ymd_hms("2011-06-04 12:00:01.123456"))
[1] 1.123456

which seems to give all decimals

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it does; I think I must have been getting mixed up with something I'd changed when I needed to update a test I wrote, never mind,

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about nanoseconds?

> second(ymd_hms("2011-06-04 12:00:01.123456789"))
[1] 1.123457

Arrow would probably return 1.123456789.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think lubridate / R supports nanosecond resolution

Copy link
Member

@rok rok Jun 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so too. So probably it should be "second = second + round(subsecond, 6)" to match that behavior?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IDK that we should truncate the data like this. A slight (technical) API difference is probably better than throwing away precision.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I would also say that it's not because lubridate does not support nanoseconds that if you actually have nanoseconds (which is possible since arrow supports it) those should be discarded.

@thisisnic thisisnic changed the title ARROW-13022: [R] bindings for lubridate's year, isoyear, quarter, month, day, wday, yday, isoweek, minute, and second functions ARROW-13022: [R] bindings for lubridate's year, isoyear, quarter, month, day, wday, yday, isoweek, hour, minute, and second functions Jun 22, 2021
Expression$create(
"cast",
Expression$create("divide_checked", e1, e2),
options = cast_options(to_type = int32(), allow_float_truncate = TRUE,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you make all of the scalars integers (e.g. 1L), do you still need to cast?


e2 = Expression$scalar(7)

# (e1 - e2 * ( e1 %/% e2 )) + 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you try doing exactly this expression? I would think it should just work because Ops.Expression is defined for all of those.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, I had no idea, deleted a lot of unnecessary lines of code now, thanks @nealrichardson !

r/R/expression.R Outdated
@@ -29,8 +29,17 @@
# stringr spellings of those
"str_length" = "utf8_length",
"str_to_lower" = "utf8_lower",
"str_to_upper" = "utf8_upper"
"str_to_upper" = "utf8_upper",
# str_trim is defined in dplyr.R
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update this comment to say dplyr-functions.R?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One whitespace suggestion but otherwise LGTM, great work!

Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
jorisvandenbossche pushed a commit that referenced this pull request Jul 7, 2021
…r the "day_of_week" temporal kernel

This is to resolve [ARROW-13054](https://issues.apache.org/jira/browse/ARROW-13054).
This will be needed for casting timezone-naive timestamps [ARROW-13033](https://issues.apache.org/jira/browse/ARROW-13033) and defining [starting day of the week](#10507 (review)).

Closes #10598 from rok/ARROW-13054

Lead-authored-by: Rok <rok@mihevc.org>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants