[R] Implement lubridate's individual date/time parsers #30030

asfimport · 2021-10-26T12:44:16Z

Parse dates with year, month, and day components:
ymd() ydm() mdy() myd() dmy() dym() yq() ym() my()

Parse date-times with year, month, and day, hour, minute, and second components:
ymd_hms() ymd_hm() ymd_h() dmy_hms() dmy_hm() dmy_h() mdy_hms() mdy_hm() mdy_h() ydm_hms() ydm_hm() ydm_h()

Parse periods with hour, minute, and second components:
ms() hm() hms()

Reporter: Nicola Crane / @thisisnic
Assignee: Rok Mihevc / @rok

Related issues:

[R] Implement lubridate's parsers with year, month and date components (is a parent of)
[R] Implement lubridate's parsers with year, month, and day, hour, minute, and second components (is a parent of)
[R] Implement ym() my() and yq() parsers (is a parent of)
[R] Implement bindings for lubridate's parse_date_time (depends upon)
[R] Implement bindings for lubridate's date_decimal and decimal_date (is depended upon by)

PRs and other links:

GitHub Pull Request #12353

_{Note: This issue was originally created as ARROW-14471. Please see the migration documentation for further details.}

asfimport · 2021-11-08T19:47:13Z

Dewey Dunnington / @paleolimbot:
I did a bit of looking into this...lubridate uses a custom C parser for its order-based datetime parsers. That said, its functionality can be approximated by {}coalesce(strptime(dt_string, "format1"), strptime(dt_string, "format2"), ...){}. Is it worth translating the functions with an approximation that handles most of the use cases?

Some testing that might be useful when putting together a PR:

library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

test_dates <- tibble::tibble(
  string_ymd = c("2021-09-10", "2021/09/10", "20210910", "2021 Sep 10", "2021 September 10", NA),
  string_dmy = c("10-09-2021", "10/09/2021", "10092021", "10 Sep 2021", "10 September 2021", NA),
  string_mdy = c("09-10-2021", "09/10/2021", "09102021", "Sep 10 2021", "September 10 2021", NA),
  date = c(rep(as.Date("2021-09-10"), 5), NA),
  date_midnight = c(rep(as.POSIXct("2021-09-10 00:00:00", tz = "UTC"), 5), NA)
)

# these get dropped by as.POSIXct if the system tz is UTC?
attr(test_dates$date_midnight, "tzone") <- "UTC"

test_datetimes <- tibble::tibble(
  string_ymd_hms = stringr::str_c(test_dates$string_ymd, "01:23:45"),
  string_dmy_hms = stringr::str_c(test_dates$string_dmy, "01:23:45"),
  string_mdy_hms = stringr::str_c(test_dates$string_mdy, "01:23:45"),
  string_ymd_hm = stringr::str_c(test_dates$string_ymd, "01:23"),
  string_dmy_hm = stringr::str_c(test_dates$string_dmy, "01:23"),
  string_mdy_hm = stringr::str_c(test_dates$string_mdy, "01:23"),
  string_ymd_h = stringr::str_c(test_dates$string_ymd, "01"),
  string_dmy_h = stringr::str_c(test_dates$string_dmy, "01"),
  string_mdy_h = stringr::str_c(test_dates$string_mdy, "01"),
  date_second = c(rep(as.POSIXct("2021-09-10 01:23:45", tz = "UTC"), 5), NA),
  date_minute = c(rep(as.POSIXct("2021-09-10 01:23", tz = "UTC"), 5), NA),
  date_hour = c(rep(as.POSIXct("2021-09-10", tz = "UTC") + 60 * 60, 5), NA)
)

# these get dropped by as.POSIXct if the system tz is UTC?
attr(test_datetimes$date_second, "tzone") <- "UTC"
attr(test_datetimes$date_minute, "tzone") <- "UTC"
attr(test_datetimes$date_hour, "tzone") <- "UTC"

# tests with lubridate, R eval
library(testthat, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)

expect_identical(ymd(test_dates$string_ymd), test_dates$date)
expect_identical(dmy(test_dates$string_dmy), test_dates$date)
expect_identical(mdy(test_dates$string_mdy), test_dates$date)

expect_identical(ymd(test_dates$string_ymd, tz = "UTC"), test_dates$date_midnight)
expect_identical(dmy(test_dates$string_dmy, tz = "UTC"), test_dates$date_midnight)
expect_identical(mdy(test_dates$string_mdy, tz = "UTC"), test_dates$date_midnight)

expect_identical(
  ymd_hms(test_datetimes$string_ymd_hms, tz = "UTC"),
  test_datetimes$date_second
)
expect_identical(
  dmy_hms(test_datetimes$string_dmy_hms, tz = "UTC"),
  test_datetimes$date_second
)
expect_identical(
  mdy_hms(test_datetimes$string_mdy_hms, tz = "UTC"),
  test_datetimes$date_second
)

expect_identical(
  ymd_hm(test_datetimes$string_ymd_hm, tz = "UTC"),
  test_datetimes$date_minute
)
expect_identical(
  dmy_hm(test_datetimes$string_dmy_hm, tz = "UTC"),
  test_datetimes$date_minute
)
expect_identical(
  mdy_hm(test_datetimes$string_mdy_hm, tz = "UTC"),
  test_datetimes$date_minute
)

expect_identical(
  ymd_h(test_datetimes$string_ymd_h, tz = "UTC"),
  test_datetimes$date_hour
)
expect_identical(
  dmy_h(test_datetimes$string_dmy_h, tz = "UTC"),
  test_datetimes$date_hour
)
expect_identical(
  mdy_h(test_datetimes$string_mdy_h, tz = "UTC"),
  test_datetimes$date_hour
)

asfimport · 2021-11-18T15:26:29Z

Nicola Crane / @thisisnic:
@paleolimbot When you say 'approximated', how close or far is it?

asfimport · 2021-11-18T16:00:55Z

Dewey Dunnington / @paleolimbot:
lubridate does it by ignoring all non-alpha/numeric characters so it'll happily parse something like "01:::::23::::45" just the same as "012345" or anywhere inbetween. I forget the exact incantations for strptime, but we could provide a list of formats we're willing to consider for each of them or preprocess with a regex to reduce the number of formats we need to consider (i.e., gsub("[^A-Za-z0-9.]", " ")). If that sounds close enough for you I'm happy to make a PR!

asfimport · 2021-11-18T16:06:50Z

Nicola Crane / @thisisnic:
Sounds like it'd be easy enough (famous last words!) to just strip out extra things with regex - so I'd say go ahead!

asfimport · 2022-02-07T11:33:06Z

Dragoș Moldovan-Grünfeld / @dragosmg:
@paleolimbot I don't think we can rely on coalesce() to iterate through the various formats supported for {}ymd(){}. It would need to rely on the assumption that the passed format matches the data or otherwise fail. Sadly, arrow works with a wrong format resulting in weird timestamps:

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(arrow))
suppressPackageStartupMessages(library(lubridate))

df <- tibble(x = c("09-01-01", "09-01-02", "09-01-03"))
df
#> # A tibble: 3 × 1
#>   x       
#>   <chr>   
#> 1 09-01-01
#> 2 09-01-02
#> 3 09-01-03

# lubridate::ymd()
df %>% 
  mutate(y = ymd(x))
#> # A tibble: 3 × 2
#>   x        y         
#>   <chr>    <date>    
#> 1 09-01-01 2009-01-01
#> 2 09-01-02 2009-01-02
#> 3 09-01-03 2009-01-03

# y = short year correct
df %>% 
  record_batch() %>% 
  mutate(y = strptime(x, format = "%y-%m-%d", unit = "us")) %>% 
  collect()
#> # A tibble: 3 × 2
#>   x        y                  
#>   <chr>    <dttm>             
#> 1 09-01-01 2009-01-01 00:00:00
#> 2 09-01-02 2009-01-02 00:00:00
#> 3 09-01-03 2009-01-03 00:00:00

# Y = long year this should fail in order for us to rely on coalesce
df %>% 
  record_batch() %>% 
  mutate(y = strptime(x, format = "%Y-%m-%d", unit = "us")) %>% 
  collect()
#> # A tibble: 3 × 2
#>   x        y                  
#>   <chr>    <dttm>             
#> 1 09-01-01 0008-12-31 23:58:45
#> 2 09-01-02 0009-01-01 23:58:45
#> 3 09-01-03 0009-01-02 23:58:45

Therefore, my early (and somewhat naive) conclusion would be that we cannot implement arrow::ymd() binding as {}coalesce(strptime(x, format1), strptime(x, format2), ...){}. What do you think?

asfimport · 2022-02-07T11:35:41Z

Dragoș Moldovan-Grünfeld / @dragosmg:
We could to some processing to figure out how many characters we have (in the string to be parsed) in between the separators (or how many characters we have in total, in the cases where we have no separator) and only try with the suitable formats. i.e. in the example above not try to parse with {}%Y{}, only {}%y{}.

asfimport · 2022-02-07T12:21:06Z

Dragoș Moldovan-Grünfeld / @dragosmg:
Another alternative would be for strptime to error when the selected format does not match the data (for example, attempting to parse "09-12-31" with {}"%Y-%m-%d"{}should error due to a mismatch in the length of the year). Then we could rely on this behaviour with {}coalesce{}.

asfimport · 2022-02-07T15:01:41Z

Dragoș Moldovan-Grünfeld / @dragosmg:
lubridate uses guess_formats() to identify the likely candidates. We could try something similar, where we have a list of supported formats (something similar to this), which we then narrow down to the most likely ones. Only then use something like {}coalesce(){}.

asfimport closed this as completed Aug 29, 2022

asfimport assigned rok Jan 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[R] Implement lubridate's individual date/time parsers #30030

[R] Implement lubridate's individual date/time parsers #30030

asfimport commented Oct 26, 2021 •

edited

asfimport commented Nov 8, 2021

asfimport commented Nov 18, 2021

asfimport commented Nov 18, 2021

asfimport commented Nov 18, 2021

asfimport commented Feb 7, 2022

asfimport commented Feb 7, 2022

asfimport commented Feb 7, 2022

asfimport commented Feb 7, 2022

[R] Implement lubridate's individual date/time parsers #30030

[R] Implement lubridate's individual date/time parsers #30030

Comments

asfimport commented Oct 26, 2021 • edited

Related issues:

PRs and other links:

asfimport commented Nov 8, 2021

asfimport commented Nov 18, 2021

asfimport commented Nov 18, 2021

asfimport commented Nov 18, 2021

asfimport commented Feb 7, 2022

asfimport commented Feb 7, 2022

asfimport commented Feb 7, 2022

asfimport commented Feb 7, 2022

asfimport commented Oct 26, 2021 •

edited