[R] fix behaviour when converting timestamps with "" as tzone #30005

Closed
asfimport opened this issue Oct 22, 2021 · 9 comments

Comments

@asfimport
Collaborator

asfimport commented Oct 22, 2021

From the comments, we've decided to go with option 3:

  • Set the timezone to local time without changing the integer value of the timestamp. We store whatever integer R passes to us (21600), with CST as the timezone set. Display is then "1970-01-01 00:00:00 CST"
    This is surprising because we are asserting the local timezone when that is not specified in R.

    ============================================

    POSIXct in R can have timezones specified as "" which is typically interpreted as the session local timezone.

    This can lead to surprising results like:

    > Sys.timezone()
    [1] "America/Chicago"
    > as.integer(as.POSIXct("1970-01-01"))
    [1] 21600
    > Sys.setenv(TZ = "UTC")
    > as.integer(as.POSIXct("1970-01-01"))
    [1] 0
    > Sys.setenv(TZ = "Australia/Brisbane")
    > as.integer(as.POSIXct("1970-01-01"))
    [1] -36000

    See also: https://stackoverflow.com/questions/69670142/how-can-i-store-timezone-agnostic-dates-for-sharing-between-r-and-python-using-p/69678923#69678923

    This runs counter to what timestamps without timezones are interpreted as in Arrow:

    arrow/format/Schema.fbs

    Lines 333 to 336 in 0366943

    /// stored as a struct with Date and Time fields. However, it may also be
    /// encoded into a Timestamp column with an empty timezone. The timestamp
    /// values should be computed "as if" the timezone of the date-time values
    /// was UTC; for example, the naive date-time "January 1st 1970, 00h00" would

    However, it may also be encoded into a Timestamp column with an empty timezone. The timestamp values should be computed "as if" the timezone of the date-time values was UTC; for example, the naive date-time "January 1st 1970, 00h00" would be encoded as timestamp value 0.

    Critically in R, when as.POSIXct("1970-01-01 00:00:00") is run, the timestamp value is computed "as if" the timezone of the date-time values was the local timezone (and not UTC like the Arrow spec says).
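A minimal base R sketch of that mismatch, assuming only that the listed zone names are available on the system: passing `tz = "UTC"` explicitly reproduces the Arrow convention, while passing another zone explicitly mimics what a session in that zone computes by default.

```r
# Wall-clock string to encode.
naive <- "1970-01-01 00:00:00"

# Interpreted "as if" UTC, which is what the Arrow spec describes.
as_utc <- as.integer(as.POSIXct(naive, tz = "UTC"))

# Interpreted in an explicit zone, mimicking a session whose local
# timezone is that zone (the zone list is only an example).
zones <- c("America/Chicago", "Australia/Brisbane")
as_local <- vapply(
  zones,
  function(z) as.integer(as.POSIXct(naive, tz = z)),
  integer(1)
)

print(as_utc)    # 0
print(as_local)  # 21600 and -36000, matching the session output above
```

Setting `tz` explicitly at creation time is one way to make the integer independent of the session that created it.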

    This can lead to some surprising results when converting these timezoneless timestamps from R to Arrow. Take as.POSIXct("1970-01-01 00:00:00") as an example, and assume US Central time. We have a few options:

  1. Warn when the timezone is "" or not set that the behavior might be surprising.
     We store whatever integer R passes to us (21600), with no timezone set. When someone sees this formatted, the times/dates will be what the time was at UTC ("1970-01-01 06:00:00").

  2. Set the timezone to UTC without changing the integer value of the timestamp. We store whatever integer R passes to us (21600), with UTC as the timezone set. When someone sees this formatted, the times/dates will be in UTC ("1970-01-01 06:00:00 UTC"). This might be surprising / counterintuitive because the timestamps will suddenly be different and will be based in UTC and not local time like people are expecting.

  3. Set the timezone to local time without changing the integer value of the timestamp. We store whatever integer R passes to us (21600), with CST as the timezone set. Display is then "1970-01-01 00:00:00 CST".
     This is surprising because we are asserting the local timezone when that is not specified in R.

    If someone is using a timestamp without tzone in R to represent a timezoneless timestamp, options 2 and 3 above violate that intent once it is put into Arrow. If, instead, someone is using a timestamp that just happens to lack a tzone but assumes it is in local time, option 1 leads to (very) surprising results.
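To make the displays quoted in the options concrete, here is a small base R sketch that renders the same stored integer (21600) against UTC and against US Central time. It only imitates the formatting with `format()`; it does not call Arrow, and `America/Chicago` stands in for "US Central time" as an assumption.

```r
# Integer R produces for as.POSIXct("1970-01-01 00:00:00") in US Central time.
value <- 21600
fmt <- "%Y-%m-%d %H:%M:%S"

# Options 1 and 2: the stored value is read against UTC.
shown_utc <- format(as.POSIXct(value, origin = "1970-01-01", tz = "UTC"), fmt)

# Option 3: the same integer, rendered in the local zone.
shown_local <- format(
  as.POSIXct(value, origin = "1970-01-01", tz = "America/Chicago"),
  fmt
)

print(shown_utc)    # "1970-01-01 06:00:00"
print(shown_local)  # "1970-01-01 00:00:00"
```

In all three options the integer is identical; only the timezone label, and therefore the rendered wall-clock time, differs.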

Reporter: Jonathan Keane / @jonkeane
Assignee: Dragoș Moldovan-Grünfeld / @dragosmg
Watchers: Rok Mihevc / @rok

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-14442. Please see the migration documentation for further details.

@asfimport
Collaborator Author

Weston Pace / @westonpace:
I believe the third option is correct.

This is surprising because we are asserting the local timezone when that is not specified in R.

But that is exactly what was specified in R if I understand these statements correctly:

Critically in R, when as.POSIXct("1970-01-01 00:00:00") is run, the timestamp value is computed "as if" the timezone of the date-time values was the local timezone

POSIXct in R can have timezones specified as "" which is typically interpreted as the session local timezone.

It sounds to me like the user specified "session local timezone" so that is what I would expect to see in the time zone value in Arrow.

@asfimport
Collaborator Author

Nicola Crane / @thisisnic:
Thinking back to when I used R as a consultant (and as someone who only really has learned about the different ways of representing timezones since interacting with the problem working on Arrow), I would have assumed that a timestamp without a timezone was in my timezone (or the timezone of the data's source), but it just hadn't been expressed explicitly. I wouldn't have even had any idea of the concept of a timezoneless timestamp.

Option 3 would represent the most "natural" behaviour from that point of view.

@asfimport
Collaborator Author

Carl Boettiger / @cboettig:
FWIW, I also think option 3 is correct. 

@asfimport
Collaborator Author

Rok Mihevc / @rok:
We already do something like this for strftime (https://github.com/apache/arrow/blob/master/r/R/dplyr-functions.R#L830-L839) and indeed option 3 seems best as R users would expect this behavior.

@asfimport
Collaborator Author

Dragoș Moldovan-Grünfeld / @dragosmg:
If I understand correctly how timestamps (with a missing tz) work in R and how they are converted to arrow, it is not enough to store the integer value R passes to us together with the local timezone, because that timezone is not used during the conversion - it is mostly metadata.

Therefore, "1970-01-01" in "BST" will always be incorrect by an hour (BST is UTC+01:00). I think we need to account for the offset too. Without correcting for the offset, we have the correct timezone, but the wrong time. See below.

> a <- as.POSIXct("1970-01-01")
# the print method adds local tz when it is unspecified
> a 
[1] "1970-01-01 BST"

> attributes(a)
$class
[1] "POSIXct" "POSIXt" 

$tzone
[1] ""

> attr(a, "tzone") <- Sys.timezone()
> attributes(a)
$class
[1] "POSIXct" "POSIXt" 

$tzone
[1] "Europe/London"
# print result looks the same as with an unspecified `tzone` attribute
> a
[1] "1970-01-01 BST"

# yet this is not enough for conversion to arrow, which makes no use of the tzone attribute and converts the equivalent UTC time, but with the desired timezone and, thus, introduces a "mistake".
> Array$create(a)
Array
<timestamp[us, tz=Europe/London]>
[
  1969-12-31 23:00:00.000000
]
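The numbers behind that output can be checked in base R without Arrow. The sketch below fixes the zone explicitly (an assumption, so that it does not depend on the session) and shows that the stored integer is -3600, which renders as 23:00 on the previous day in UTC and as midnight in Europe/London. That suggests the value and the displayed time describe the same instant, and that the display is simply not applying the zone.

```r
# 1970-01-01 00:00 London wall-clock time; the UK was on UTC+1 at that date.
a <- as.POSIXct("1970-01-01", tz = "Europe/London")

stored <- as.integer(a)  # seconds since the epoch
fmt <- "%Y-%m-%d %H:%M:%S"

print(stored)                      # -3600
print(format(a, fmt, tz = "UTC"))  # "1969-12-31 23:00:00"
print(format(a, fmt))              # "1970-01-01 00:00:00"
```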

@asfimport
Collaborator Author

Dragoș Moldovan-Grünfeld / @dragosmg:
I think the solution needs to take the integer value, figure out the offset to UTC, apply the offset, and only then transform it to an arrow timestamp. In this way we will counter arrow's ignoring of the tzone argument during conversion. 
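A rough sketch of the arithmetic described here, in base R. The helper `utc_offset_seconds()` is hypothetical and exists only for illustration; it is not an arrow function, and later comments in this thread move away from modifying the stored value.

```r
# Hypothetical helper: UTC offset (in seconds) of zone `tz` at the instant `x`.
utc_offset_seconds <- function(x, tz) {
  fmt <- "%Y-%m-%d %H:%M:%S"
  # Wall-clock reading of x in the target zone ...
  wall_clock <- format(x, fmt, tz = tz)
  # ... re-read as if it were UTC, minus the true epoch value.
  as.integer(as.POSIXct(wall_clock, format = fmt, tz = "UTC")) - as.integer(x)
}

a <- as.POSIXct("1970-01-01", tz = "Europe/London")
offset <- utc_offset_seconds(a, "Europe/London")

# Applying the offset gives the wall-clock time encoded "as if" UTC.
shifted <- as.integer(a) + offset

print(offset)   # 3600
print(shifted)  # 0
```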

@asfimport
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
That is essentially what the "assume_timezone" kernel does.

@asfimport
Collaborator Author

Dragoș Moldovan-Grünfeld / @dragosmg:
My understanding of this issue has evolved a bit. 

Set the timezone to local time without changing the integer value of the timestamp. We store whatever integer R passes to us (21600), with CST as the timezone set. Display is then "1970-01-01 00:00:00 CST"
This is surprising because we are asserting the local timezone when that is not specified in R.

I think this is a 2 part problem:

  1. If the timezone information is missing in an R POSIXct vector, assume it is the system timezone and pass this info to arrow without modifying the absolute value (seconds since epoch). I think a warning may be a bit too strong as a condition when this happens, so a message might be more suitable.

  2. Adjust the print method so that the displayed time matches the timezone recorded as metadata.

    The first part will be addressed by this Jira, while part 2 will be addressed by https://issues.apache.org/jira/browse/ARROW-14567
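Part 1 could look roughly like the following on the R side. This is a sketch only: `fill_missing_tzone()` is a hypothetical name, not the function added by the eventual pull request, and the choice of `message()` over `warning()` follows the suggestion above.

```r
# Hypothetical helper, not part of the arrow package.
# If `tzone` is missing or "", label the vector with `tz` and leave the
# underlying seconds-since-epoch value untouched.
fill_missing_tzone <- function(x, tz = Sys.timezone()) {
  stopifnot(inherits(x, "POSIXct"))
  tzone <- attr(x, "tzone")
  if (is.null(tzone) || identical(tzone, "")) {
    message("tzone is unset; assuming session timezone '", tz, "'")
    attr(x, "tzone") <- tz
  }
  x
}

x <- as.POSIXct("1970-01-01 00:00:00")  # tzone is ""
# The zone is passed explicitly here so the example does not depend on the session.
y <- fill_missing_tzone(x, tz = "America/Chicago")

print(attr(y, "tzone"))                        # "America/Chicago"
print(identical(as.integer(x), as.integer(y))) # TRUE
```

A vector that already carries a non-empty `tzone` passes through unchanged and produces no message.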

@asfimport
Collaborator Author

Jonathan Keane / @jonkeane:
Issue resolved by pull request 12240
#12240
