Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R] Implement lubridate's individual date/time parsers #30030

Closed
Tracked by #30376
asfimport opened this issue Oct 26, 2021 · 8 comments
Closed
Tracked by #30376

[R] Implement lubridate's individual date/time parsers #30030

asfimport opened this issue Oct 26, 2021 · 8 comments
Assignees
Milestone

Comments

@asfimport
Copy link

asfimport commented Oct 26, 2021

Parse dates with year, month, and day components:
ymd() ydm() mdy() myd() dmy() dym() yq() ym() my()

Parse date-times with year, month, and day, hour, minute, and second components:
ymd_hms() ymd_hm() ymd_h() dmy_hms() dmy_hm() dmy_h() mdy_hms() mdy_hm() mdy_h() ydm_hms() ydm_hm() ydm_h()

Parse periods with hour, minute, and second components:
ms() hm() hms()

Reporter: Nicola Crane / @thisisnic
Assignee: Rok Mihevc / @rok

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-14471. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Dewey Dunnington / @paleolimbot:
I did a bit of looking into this...lubridate uses a custom C parser for its order-based datetime parsers. That said, its functionality can be approximated by {}coalesce(strptime(dt_string, "format1"), strptime(dt_string, "format2"), ...){}. Is it worth translating the functions with an approximation that handles most of the use cases?

Some testing that might be useful when putting together a PR:

library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

test_dates <- tibble::tibble(
  string_ymd = c("2021-09-10", "2021/09/10", "20210910", "2021 Sep 10", "2021 September 10", NA),
  string_dmy = c("10-09-2021", "10/09/2021", "10092021", "10 Sep 2021", "10 September 2021", NA),
  string_mdy = c("09-10-2021", "09/10/2021", "09102021", "Sep 10 2021", "September 10 2021", NA),
  date = c(rep(as.Date("2021-09-10"), 5), NA),
  date_midnight = c(rep(as.POSIXct("2021-09-10 00:00:00", tz = "UTC"), 5), NA)
)

# these get dropped by as.POSIXct if the system tz is UTC?
attr(test_dates$date_midnight, "tzone") <- "UTC"

test_datetimes <- tibble::tibble(
  string_ymd_hms = stringr::str_c(test_dates$string_ymd, "01:23:45"),
  string_dmy_hms = stringr::str_c(test_dates$string_dmy, "01:23:45"),
  string_mdy_hms = stringr::str_c(test_dates$string_mdy, "01:23:45"),
  string_ymd_hm = stringr::str_c(test_dates$string_ymd, "01:23"),
  string_dmy_hm = stringr::str_c(test_dates$string_dmy, "01:23"),
  string_mdy_hm = stringr::str_c(test_dates$string_mdy, "01:23"),
  string_ymd_h = stringr::str_c(test_dates$string_ymd, "01"),
  string_dmy_h = stringr::str_c(test_dates$string_dmy, "01"),
  string_mdy_h = stringr::str_c(test_dates$string_mdy, "01"),
  date_second = c(rep(as.POSIXct("2021-09-10 01:23:45", tz = "UTC"), 5), NA),
  date_minute = c(rep(as.POSIXct("2021-09-10 01:23", tz = "UTC"), 5), NA),
  date_hour = c(rep(as.POSIXct("2021-09-10", tz = "UTC") + 60 * 60, 5), NA)
)

# these get dropped by as.POSIXct if the system tz is UTC?
attr(test_datetimes$date_second, "tzone") <- "UTC"
attr(test_datetimes$date_minute, "tzone") <- "UTC"
attr(test_datetimes$date_hour, "tzone") <- "UTC"

# tests with lubridate, R eval
library(testthat, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)

expect_identical(ymd(test_dates$string_ymd), test_dates$date)
expect_identical(dmy(test_dates$string_dmy), test_dates$date)
expect_identical(mdy(test_dates$string_mdy), test_dates$date)

expect_identical(ymd(test_dates$string_ymd, tz = "UTC"), test_dates$date_midnight)
expect_identical(dmy(test_dates$string_dmy, tz = "UTC"), test_dates$date_midnight)
expect_identical(mdy(test_dates$string_mdy, tz = "UTC"), test_dates$date_midnight)

expect_identical(
  ymd_hms(test_datetimes$string_ymd_hms, tz = "UTC"),
  test_datetimes$date_second
)
expect_identical(
  dmy_hms(test_datetimes$string_dmy_hms, tz = "UTC"),
  test_datetimes$date_second
)
expect_identical(
  mdy_hms(test_datetimes$string_mdy_hms, tz = "UTC"),
  test_datetimes$date_second
)

expect_identical(
  ymd_hm(test_datetimes$string_ymd_hm, tz = "UTC"),
  test_datetimes$date_minute
)
expect_identical(
  dmy_hm(test_datetimes$string_dmy_hm, tz = "UTC"),
  test_datetimes$date_minute
)
expect_identical(
  mdy_hm(test_datetimes$string_mdy_hm, tz = "UTC"),
  test_datetimes$date_minute
)

expect_identical(
  ymd_h(test_datetimes$string_ymd_h, tz = "UTC"),
  test_datetimes$date_hour
)
expect_identical(
  dmy_h(test_datetimes$string_dmy_h, tz = "UTC"),
  test_datetimes$date_hour
)
expect_identical(
  mdy_h(test_datetimes$string_mdy_h, tz = "UTC"),
  test_datetimes$date_hour
)

@asfimport
Copy link
Author

Nicola Crane / @thisisnic:
@paleolimbot When you say 'approximated', how close or far is it?

@asfimport
Copy link
Author

Dewey Dunnington / @paleolimbot:
lubridate does it by ignoring all non-alpha/numeric characters so it'll happily parse something like "01:::::23::::45" just the same as "012345" or anywhere inbetween. I forget the exact incantations for strptime, but we could provide a list of formats we're willing to consider for each of them or preprocess with a regex to reduce the number of formats we need to consider (i.e., gsub("[^A-Za-z0-9.]", " ")). If that sounds close enough for you I'm happy to make a PR!

@asfimport
Copy link
Author

Nicola Crane / @thisisnic:
Sounds like it'd be easy enough (famous last words!) to just strip out extra things with regex - so I'd say go ahead!

@asfimport
Copy link
Author

Dragoș Moldovan-Grünfeld / @dragosmg:
@paleolimbot I don't think we can rely on coalesce() to iterate through the various formats supported for {}ymd(){}. It would need to rely on the assumption that the passed format matches the data or otherwise fail. Sadly, arrow works with a wrong format resulting in weird timestamps:

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(arrow))
suppressPackageStartupMessages(library(lubridate))

df <- tibble(x = c("09-01-01", "09-01-02", "09-01-03"))
df
#> # A tibble: 3 × 1
#>   x       
#>   <chr>   
#> 1 09-01-01
#> 2 09-01-02
#> 3 09-01-03

# lubridate::ymd()
df %>% 
  mutate(y = ymd(x))
#> # A tibble: 3 × 2
#>   x        y         
#>   <chr>    <date>    
#> 1 09-01-01 2009-01-01
#> 2 09-01-02 2009-01-02
#> 3 09-01-03 2009-01-03

# y = short year correct
df %>% 
  record_batch() %>% 
  mutate(y = strptime(x, format = "%y-%m-%d", unit = "us")) %>% 
  collect()
#> # A tibble: 3 × 2
#>   x        y                  
#>   <chr>    <dttm>             
#> 1 09-01-01 2009-01-01 00:00:00
#> 2 09-01-02 2009-01-02 00:00:00
#> 3 09-01-03 2009-01-03 00:00:00

# Y = long year this should fail in order for us to rely on coalesce
df %>% 
  record_batch() %>% 
  mutate(y = strptime(x, format = "%Y-%m-%d", unit = "us")) %>% 
  collect()
#> # A tibble: 3 × 2
#>   x        y                  
#>   <chr>    <dttm>             
#> 1 09-01-01 0008-12-31 23:58:45
#> 2 09-01-02 0009-01-01 23:58:45
#> 3 09-01-03 0009-01-02 23:58:45

Therefore, my early (and somewhat naive) conclusion would be that we cannot implement arrow::ymd() binding as {}coalesce(strptime(x, format1), strptime(x, format2), ...){}. What do you think?

@asfimport
Copy link
Author

Dragoș Moldovan-Grünfeld / @dragosmg:
We could to some processing to figure out how many characters we have (in the string to be parsed) in between the separators (or how many characters we have in total, in the cases where we have no separator) and only try with the suitable formats. i.e. in the example above not try to parse with {}%Y{}, only {}%y{}.

@asfimport
Copy link
Author

Dragoș Moldovan-Grünfeld / @dragosmg:
Another alternative would be for strptime to error when the selected format does not match the data (for example, attempting to parse "09-12-31" with {}"%Y-%m-%d"{}should error due to a mismatch in the length of the year). Then we could rely on this behaviour with {}coalesce{}. 

@asfimport
Copy link
Author

Dragoș Moldovan-Grünfeld / @dragosmg:
lubridate uses guess_formats() to identify the likely candidates. We could try something similar, where we have a list of supported formats (something similar to this), which we then narrow down to the most likely ones. Only then use something like {}coalesce(){}.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants