Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-15098 [R] Add binding for lubridate::duration() and/or as.difftime() #12506

Closed
wants to merge 51 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
51 commits
Select commit Hold shift + click to select a range
af22f2d
first pass at `as.difftime()`
dragosmg Feb 24, 2022
4d997a9
some additions to as.difftime
dragosmg Feb 25, 2022
a1727a5
early stage unit tests
dragosmg Mar 4, 2022
e6b93bc
first pass at `difftime()`
dragosmg Mar 4, 2022
88444f5
pop
dragosmg Mar 4, 2022
ead654b
make tests pass
dragosmg Mar 4, 2022
b0743ae
test without timezones to see if Windows still complains
dragosmg Mar 4, 2022
3953f43
switched to `subtract` kernel for the implementation of `difftime`
dragosmg Mar 7, 2022
f48eb15
skip timezone tests on windows
dragosmg Mar 7, 2022
7a45459
add unit tests passing a regular R object
dragosmg Mar 7, 2022
4f602e9
match unit arg
dragosmg Mar 7, 2022
20ac204
add unit test for difftime + tz + R object
dragosmg Mar 7, 2022
9de212e
add a bit more substance to `as.difftime()`
dragosmg Mar 8, 2022
bc99d53
`as.difftime()` can handle integers
dragosmg Mar 8, 2022
cbcd28f
updated unit tests & error message
dragosmg Mar 8, 2022
22c8839
added comment to a unit test
dragosmg Mar 8, 2022
6a78ea9
another try
dragosmg Mar 8, 2022
2b00aa2
replace `"%X"` with `"%H:%M:%S"` on windows + update comments
dragosmg Mar 9, 2022
b1caedc
`"%H:%M:%S"` for consistency
dragosmg Mar 9, 2022
70d4a0d
comment + `"%H:%M:%S"` for consistency
dragosmg Mar 9, 2022
f13184b
separate `compare_dplyr_binding()` blocks for the unsuported units
dragosmg Mar 10, 2022
a797f9b
typo + simplified logic for integer vs double
dragosmg Mar 10, 2022
047f5d0
style
dragosmg Mar 10, 2022
11d0dec
typo + simplified implementation
dragosmg Mar 11, 2022
b1b3891
switched back to casting as the simplified version was causing some C…
dragosmg Mar 11, 2022
694d4be
one more step back
dragosmg Mar 11, 2022
d8eeafe
go via `int64()` instead of a difference between 2 `time32()`
dragosmg Mar 11, 2022
aa3a366
move `as.difftime()` definition and tests from `...-type.R` to `...-d…
dragosmg Mar 11, 2022
31756b0
separate duration bindings and update `as.difftime()` + tests
dragosmg Mar 14, 2022
0c56cd6
missing closing bracket
dragosmg Mar 14, 2022
779fe91
updated NEWS
dragosmg Mar 14, 2022
ed5c7ed
removed the testing of the actual error message since it's being surf…
dragosmg Mar 15, 2022
60edf1f
simplified `difftime` binding + updated unit tests
dragosmg Mar 16, 2022
3eddb4b
add some times to the test data frame
dragosmg Mar 16, 2022
6a1ca12
remove the `tz` arg
dragosmg Mar 16, 2022
2755e1b
cleaned up some of the tests
dragosmg Mar 16, 2022
b2f7268
`unit` defaults to `"secs"` + improved messages
dragosmg Mar 16, 2022
f8837f8
cast only when needed
dragosmg Mar 21, 2022
245ec89
updated unit tests
dragosmg Mar 21, 2022
448268d
rescue the format tests
dragosmg Mar 21, 2022
7e8a837
clean-up after the rebase mess
dragosmg Mar 21, 2022
d15b36c
improved testing comment
dragosmg Mar 22, 2022
ae9566d
trying the `subtract_checked()` route
dragosmg Mar 22, 2022
638bf21
clean up and replace `subtract_checked` with `-`
dragosmg Mar 22, 2022
e4cd5bb
simplify the implementation when `x` is character + ignore attributes…
dragosmg Mar 23, 2022
b019b92
use `build_expr()`
dragosmg Mar 23, 2022
1941156
revert to subtracting `y = "0:0:0"` when `x` is an hms string
dragosmg Mar 23, 2022
9b6c7fb
implementation with a chain of casting
dragosmg Mar 23, 2022
958e4f9
added Jon's comment for clarity
dragosmg Mar 23, 2022
5492f71
lint
dragosmg Mar 23, 2022
a98f2f9
all branches go through the final cast to `duration("s")` step
dragosmg Mar 24, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
1 change: 1 addition & 0 deletions r/NEWS.md
Expand Up @@ -24,6 +24,7 @@
* component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
* `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations.
* date-time functionality:
* `difftime` and `as.difftime()`
* `as.Date()` to convert to date

# arrow 7.0.0
Expand Down
64 changes: 64 additions & 0 deletions r/R/dplyr-funcs-datetime.R
Expand Up @@ -188,6 +188,9 @@ register_bindings_datetime <- function() {
register_binding("date", function(x) {
build_expr("cast", x, options = list(to_type = date32()))
})
}

register_bindings_duration <- function() {
register_binding("make_datetime", function(year = 1970L,
month = 1L,
day = 1L,
Expand Down Expand Up @@ -236,6 +239,67 @@ register_bindings_datetime <- function() {
tz = "UTC") {
call_binding("make_datetime", year, month, day, hour, min, sec, tz)
})
register_binding("difftime", function(time1,
time2,
tz,
units = "secs") {
if (units != "secs") {
abort("`difftime()` with units other than `secs` not supported in Arrow")
}

if (!missing(tz)) {
warn("`tz` argument is not supported in Arrow, so it will be ignored")
}

# cast to timestamp if time1 and time2 are not dates or timestamp expressions
# (the subtraction of which would output a `duration`)
if (!call_binding("is.instant", time1)) {
time1 <- build_expr("cast", time1, options = cast_options(to_type = timestamp(timezone = "UTC")))
}

if (!call_binding("is.instant", time2)) {
time2 <- build_expr("cast", time2, options = cast_options(to_type = timestamp(timezone = "UTC")))
}

# we need to go build the subtract expression instead of `time1 - time2` to
# prevent complaints when we try to subtract an R object from an Expression
subtract_output <- build_expr("-", time1, time2)
build_expr("cast", subtract_output, options = cast_options(to_type = duration("s")))
})
register_binding("as.difftime", function(x,
format = "%X",
units = "secs") {
# windows doesn't seem to like "%X"
if (format == "%X" & tolower(Sys.info()[["sysname"]]) == "windows") {
format <- "%H:%M:%S"
}

if (units != "secs") {
abort("`as.difftime()` with units other than 'secs' not supported in Arrow")
}

if (call_binding("is.character", x)) {
x <- build_expr("strptime", x, options = list(format = format, unit = 0L))
# complex casting only due to cast type restrictions: time64 -> int64 -> duration(us)
# and then we cast to duration ("s") at the end
x <- x$cast(time64("us"))$cast(int64())$cast(duration("us"))
}

# numeric -> duration not supported in Arrow yet so we use int64() as an
# intermediate step
# TODO revisit if https://issues.apache.org/jira/browse/ARROW-15862 results
# in numeric -> duration support

if (call_binding("is.numeric", x)) {
# coerce x to be int64(). it should work for integer-like doubles and fail
# for pure doubles
dragosmg marked this conversation as resolved.
Show resolved Hide resolved
# if we abort for all doubles, we risk erroring in cases in which
# coercion to int64() would work
x <- build_expr("cast", x, options = cast_options(to_type = int64()))
}

build_expr("cast", x, options = cast_options(to_type = duration(unit = "s")))
})
}

binding_format_datetime <- function(x, format = "", tz = "", usetz = FALSE) {
Expand Down
1 change: 1 addition & 0 deletions r/R/dplyr-funcs.R
Expand Up @@ -106,6 +106,7 @@ create_binding_cache <- function() {
register_bindings_aggregate()
register_bindings_conditional()
register_bindings_datetime()
register_bindings_duration()
register_bindings_math()
register_bindings_string()
register_bindings_type()
Expand Down
137 changes: 137 additions & 0 deletions r/tests/testthat/test-dplyr-funcs-datetime.R
Expand Up @@ -1076,3 +1076,140 @@ test_that("ISO_datetime & ISOdate", {
ignore_attr = TRUE
)
})

test_that("difftime works correctly", {
test_df <- tibble(
time1 = as.POSIXct(
c("2021-02-20", "2021-07-31 0:0:0", "2021-10-30", "2021-01-31 0:0:0")
),
time2 = as.POSIXct(
c("2021-02-20 00:02:01", "2021-07-31 00:03:54", "2021-10-30 00:05:45", "2021-01-31 00:07:36")
),
secs = c(121L, 234L, 345L, 456L)
)

compare_dplyr_binding(
.input %>%
mutate(
secs2 = difftime(time1, time2, units = "secs")
) %>%
collect(),
test_df,
ignore_attr = TRUE
)

# units other than "secs" not supported in arrow
compare_dplyr_binding(
.input %>%
mutate(
mins = difftime(time1, time2, units = "mins")
) %>%
collect(),
test_df,
warning = TRUE,
ignore_attr = TRUE
)

skip_on_os("windows")
test_df_with_tz <- tibble(
time1 = as.POSIXct(
c("2021-02-20", "2021-07-31", "2021-10-30", "2021-01-31"),
tz = "Pacific/Marquesas"
),
time2 = as.POSIXct(
c("2021-02-20 00:02:01", "2021-07-31 00:03:54", "2021-10-30 00:05:45", "2021-01-31 00:07:36"),
tz = "Asia/Kathmandu"
),
secs = c(121L, 234L, 345L, 456L)
)

compare_dplyr_binding(
.input %>%
mutate(secs2 = difftime(time2, time1, units = "secs")) %>%
collect(),
test_df_with_tz
)

compare_dplyr_binding(
.input %>%
mutate(
secs2 = difftime(
as.POSIXct("2022-03-07", tz = "Europe/Bucharest"),
time1,
units = "secs"
)
) %>%
collect(),
test_df_with_tz
)
dragosmg marked this conversation as resolved.
Show resolved Hide resolved

# `tz` is effectively ignored both in R (used only if inputs are POSIXlt) and Arrow
compare_dplyr_binding(
.input %>%
mutate(secs2 = difftime(time2, time1, units = "secs", tz = "Pacific/Marquesas")) %>%
collect(),
test_df_with_tz,
warning = "`tz` argument is not supported in Arrow, so it will be ignored"
)
})

test_that("as.difftime()", {
test_df <- tibble(
hms_string = c("0:7:45", "12:34:56"),
hm_string = c("7:45", "12:34"),
int = c(30L, 75L),
integerish_dbl = c(31, 76),
dbl = c(31.2, 76.4)
)

compare_dplyr_binding(
.input %>%
mutate(hms_difftime = as.difftime(hms_string, units = "secs")) %>%
collect(),
test_df
)

# TODO add test with `format` mismatch returning NA once
# https://issues.apache.org/jira/browse/ARROW-15659 is solved
# for example: as.difftime("07:", format = "%H:%M") should return NA
compare_dplyr_binding(
.input %>%
mutate(hm_difftime = as.difftime(hm_string, units = "secs", format = "%H:%M")) %>%
collect(),
test_df
)

compare_dplyr_binding(
.input %>%
mutate(int_difftime = as.difftime(int, units = "secs")) %>%
collect(),
test_df
)

compare_dplyr_binding(
.input %>%
mutate(integerish_dbl_difftime = as.difftime(integerish_dbl, units = "secs")) %>%
collect(),
test_df
)

# "mins" or other values for units cannot be handled in Arrow
compare_dplyr_binding(
.input %>%
mutate(int_difftime = as.difftime(int, units = "mins")) %>%
collect(),
test_df,
warning = TRUE
)

# only integer (or integer-like) -> duration conversion supported in Arrow.
# double -> duration not supported. we're not testing the content of the
# error message as it is being generated in the C++ code and it might change,
# but we want to make sure that this error is raised in our binding implementation
expect_error(
test_df %>%
arrow_table() %>%
mutate(dbl_difftime = as.difftime(dbl, units = "secs")) %>%
collect()
)
})