-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-36518: [Java] Fix ArrowFlightJdbcTimeStampVectorAccessor to return Timestamp objects with date and time that corresponds with local time instead of UTC date and time #36519
base: main
Are you sure you want to change the base?
Conversation
- Old behaviour treats the calendar object passed into the corresponding methods to be the timezone to convert the data into, which is incorrect according to what is defined in the JDBC API - Fixed: Correct behaviour is that the calendar object is used to extend the data into the timezone of the calendar object, essentially asserting that the data is at the timezone defined in the calendar object - Fixed: for vectors that do store timezone information (ie. TimestampVector), the getter methods will use the timezone defined in vector as the timezone assertion and ignores the calendar object if one was passed in
…Test - Made timezone converting logic more readable in ArrowFlightJdbcTimeStampVectorAccessor - Fixed checkstyle issues
- Refactored getTimestampForVector() to be more concise and readable
|
|
Hey @wgtmac! So I read through the discussion and although I agree on keeping the behaviour as is (ie. keeping the timestamp returned in UTC time) I still think its worth to address the issue where the actual "timezone" field of the object returned is not in UTC, which causes a lot of confusion when trying to utilize the timestamp object (ie. turning it into a formatted date string, converting the timestamp into local time easily, etc.) |
I think based on your description, yes, we should return a timestamp that is properly in UTC. What I then don't understand is what the point of the |
@lidavidm Yeah so I was also confused at first but according to this article: https://medium.com/@williampuk/sql-timestamp-and-jdbc-timestamp-deep-dive-7ae0ea91e237 it seems like the JDBC drivers for some of the major RDBMS, the provided calendar instance is used to assert that the date and time of the timestamp stored on the db is at the timezone specified by the calendar (ie. if the timestamp is 2022-07-13 00:15 and we pass in a calendar at UTC+8, then we are asserting that the timestamp is 2022-07-13 00:15 UTC+8) then when it is returned, we convert that timestamp into either local system time or UTC+0 (ie. we should see 2022-07-12 16:15 UTC+0 in the timestamp object returned by the method) Quote from article findings: |
Ok, thanks for the references. I'll try to look at this more closely when I get a chance but my time for Java related work is limited these days, so it may be a while. |
@@ -177,7 +174,7 @@ protected static TimeZone getTimeZoneForVector(TimeStampVector vector) { | |||
|
|||
String timezoneName = arrowType.getTimezone(); | |||
if (timezoneName == null) { | |||
return TimeZone.getTimeZone("UTC"); | |||
return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. This change seems to be the right one: an Arrow timestamp has no defined timezone if the timezone name is not set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
@@ -91,14 +92,11 @@ private LocalDateTime getLocalDateTime(Calendar calendar) { | |||
long value = holder.value; | |||
|
|||
LocalDateTime localDateTime = this.longToLocalDateTime.fromLong(value); | |||
ZoneId defaultTimeZone = TimeZone.getDefault().toZoneId(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should avoid the default time zone here, since that will be system dependent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the idea here is that if no calendar is supplied, we will use the system timezone as an assertion of what the timezone of the timestamp value from the db is. So during the accessor process, if 2021-03-28 00:15 is contained in the DB and we don't give the getter a calendar, we will then assume 2021-03-28 00:15 is an instant at the system time zone. Given that since the timestamp is already in the local timezone, no conversion is needed and the value returned by the getter is indeed 2021-03-28 00:15 (LOCAL TIMEZONE)
ZoneId sourceTimeZone = nonNull(this.timeZone) ? this.timeZone.toZoneId() : | ||
nonNull(calendar) ? calendar.getTimeZone().toZoneId() : defaultTimeZone; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- can we just compare to null explicitly, and 2) can we write this as an if-else chain?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
TimeZone finalTimeZoneForResultWithoutCalendar = ofNullable(getTimeZoneForVector(vector)) | ||
.orElse(TimeZone.getDefault()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, we should be able to assert what time zone is returned here right? And again, we should avoid using the system timezone for things?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep got it!
…one an if statement and changed all nonNull usage to compare with null explicitly ArrowFlightJdbcTimeStampVectorAccessorTest: Modified test so system timezone is compared more explicitly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic is hard to follow (the original code sure does not help...) But I'm not sure if the changes here are defensible without more explanation of the logic in the code. It might help to make sure we have clear test cases of expected behavior first.
Instant currInstant = Instant.ofEpochMilli(milliseconds); | ||
LocalDateTime getTimestampWithoutTZ = LocalDateTime.ofInstant(currInstant, | ||
TimeZone.getTimeZone("UTC").toZoneId()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we directly use ofEpochSecond instead of bouncing through an Instant and a timezone conversion?
LocalDateTime getTimestampWithoutTZ = LocalDateTime.ofInstant(currInstant, | ||
TimeZone.getTimeZone("UTC").toZoneId()); | ||
ZonedDateTime parsedTime = getTimestampWithoutTZ.atZone(calendar.getTimeZone().toZoneId()); | ||
return parsedTime.toEpochSecond() * 1000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This appears to truncate the milliseconds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parsedTime.toInstant().toEpochMilli() seems like it would avoid this problem.
@@ -36,21 +38,17 @@ private DateTimeUtils() { | |||
} | |||
|
|||
/** | |||
* Subtracts given Calendar's TimeZone offset from epoch milliseconds. | |||
* Apply calendar timezone to epoch milliseconds. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we document this function and what steps it's doing in detail?
@@ -36,21 +38,17 @@ private DateTimeUtils() { | |||
} | |||
|
|||
/** | |||
* Subtracts given Calendar's TimeZone offset from epoch milliseconds. | |||
* Apply calendar timezone to epoch milliseconds. | |||
*/ | |||
public static long applyCalendarOffset(long milliseconds, Calendar calendar) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, looking at how it's used, now I'm thinking both implementations are wrong. For instance, it's used in DateVectorAccessor and is applied to the value from the vector and the user-supplied calendar. This function appears to assume the given milliseconds value is a naive timestamp in the given (or default) timezone, and converts it to a UTC timestamp. This is backwards. The value from the Arrow vector is ALREADY a UTC timestamp!
That said, this function does seem to do the right thing for other places where it is used. So I think we need to name and properly document it, and remove it from places where it should NOT be applied.
if (this.timeZone != null) { | ||
sourceTimeZone = this.timeZone.toZoneId(); | ||
} else if (calendar != null) { | ||
sourceTimeZone = calendar.getTimeZone().toZoneId(); | ||
} else { | ||
sourceTimeZone = defaultTimeZone; | ||
} | ||
return localDateTime; | ||
|
||
return localDateTime.atZone(sourceTimeZone).withZoneSameInstant(defaultTimeZone).toLocalDateTime(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is quite right: if the Arrow vector has a timezone, the underlying timestamp is always relative to UTC. this.timeZone
doesn't reflect that.
I think its worth to clarify what my logic/presumptions are. So essentially, the logic in the change is that:
What ended up happening before was that the LocalDateTime that was returned by the getLocalDateTime method reflects the timestamp being converted from the timezone of the calendar input (or system default when it is null) to UTC (since in the original code, the timezone of the vector defaulted to UTC). But Timestamp.valueOf(LocalDateTime) interprets the LocalDateTime passed in as it is in the system default timezone, thus causing the issue where we did get the correct time and date value but the timezone of the Timestamp object is the system timezone instead of UTC (or whatever the timezone of the Vector is) I could modifiy my code such that it keeps the same behaviour as before but the timezone of the Timestamp object returned is accurately reflected instead |
protected static LongToLocalDateTime getLongToLocalDateTimeForVector(TimeStampVector vector, | ||
TimeZone timeZone) { | ||
String timeZoneID = timeZone.getID(); | ||
protected static LongToLocalDateTime getLongToUTCDateTimeForVector(TimeStampVector vector) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good if we can rename the return type LongToLocalDateTime
as well.
TimeZone timeZone) { | ||
String timeZoneID = timeZone.getID(); | ||
protected static LongToLocalDateTime getLongToUTCDateTimeForVector(TimeStampVector vector) { | ||
String timeZoneID = "UTC"; | ||
|
||
ArrowType.Timestamp arrowType = | ||
(ArrowType.Timestamp) vector.getField().getFieldType().getType(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the switch block below, you may use the overload without timeZoneID, which is much simpler:
DateUtility.getLocalDateTimeFromEpochNano(nanoseconds);
DateUtility.getLocalDateTimeFromEpochMicro(microseconds);
DateUtility.getLocalDateTimeFromEpochMilli(milliseconds);
DateUtility.getLocalDateTimeFromEpochMilli(TimeUnit.SECONDS.toMillis(seconds));
@@ -177,7 +174,7 @@ protected static TimeZone getTimeZoneForVector(TimeStampVector vector) { | |||
|
|||
String timezoneName = arrowType.getTimezone(); | |||
if (timezoneName == null) { | |||
return TimeZone.getTimeZone("UTC"); | |||
return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
ZoneId sourceTimeZone; | ||
|
||
if (this.timeZone != null) { | ||
sourceTimeZone = this.timeZone.toZoneId(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per the comment from @lidavidm, sourceTimeZone should be UTC if the vector has provided a valid timezone.
|
Sorry, I would like to sit down and make a few tests and compare them with other drivers to make sure I have a good handle on what's going on here. But I don't have much time for Arrow Java development beyond simple maintenance PRs. |
@lidavidm Do you know any other expert can join the discussion? I'd like to follow up with this fix but my experience on this is limited. |
No, I don't think any of the contributors that originally worked on the JDBC driver are still active in the project. |
My experience is also limited, hence why I wanted to construct the test cases. |
Thanks, I agree with you @lidavidm |
@jhmannok , could you list the client applications that you've verified this change with? It'd be great to know what's been covered since this API is not that clear. |
I might have missed this as well, but can you verify the behaviors of the Oracle and SQL Server drivers when using time with timezone vs raw time types? |
TimeZone timeZone) { | ||
String timeZoneID = timeZone.getID(); | ||
protected static LongToUTCDateTime getLongToUTCDateTimeForVector(TimeStampVector vector) { | ||
String timeZoneID = "UTC"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Make final
ZonedDateTime sourceTZDateTime = LocalDateTime | ||
.ofInstant(Instant.ofEpochMilli(millis), TimeZone.getTimeZone("UTC").toZoneId()) | ||
.atZone(TimeZone.getTimeZone(timeZone).toZoneId()); | ||
expectedTimestamp = new Timestamp(sourceTZDateTime.toEpochSecond() * 1000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks to truncate millis as well.
Hi, I see that this PR has been without movement for several months. I am interested in continuing with the changes. This PR is over Arrow 13.0.0, but current version is 16. What would be the better way?, creating a new PR with these changes but from current Arrow version, or continuing here? @jhmannok, are you ok if I go on with this work? |
I created a new PR with the changes #43149 |
Rationale for this change
When calling the getTimestamp method from the ArrowFlightJdbcTimeStampVectorAccessor class, the timezone of the Timestamp object returned is incorrect. The timestamp itself appears to be in GMT/UTC time but the timezone field of the Timestamp object is populated with the timezone of the JDBC client instead.
Example
timestamp on db: 2021-03-28T00:15:00.000 (UTC)
timezone of JDBC client: PST/PDT (Vancouver Time)
Calendar cal <- Calendar with UTC timezone
calling getTimestamp(cal) on a result set will return a timestamp like this: 2021-03-28T00:15:00.000 (PST/PDT)
where 2021-03-28T00:15:00.000 appears to be in UTC time but the timezone of the object itself is PST/PDT
What changes are included in this PR?
Are these changes tested?
Many of the time related tests also face this issue and this fixes them all. Also tested the driver jar in a JDBC client and behaviour is fixed.
Are there any user-facing changes?