GH-36518: [Java] Fix ArrowFlightJdbcTimeStampVectorAccessor to return Timestamp objects with date and time that corresponds with local time instead of UTC date and time #36519

Open: wants to merge 6 commits into base: main

@jhmannok jhmannok commented Jul 6, 2023

Rationale for this change

When calling the getTimestamp method from the ArrowFlightJdbcTimeStampVectorAccessor class, the timezone of the Timestamp object returned is incorrect. The timestamp itself appears to be in GMT/UTC time but the timezone field of the Timestamp object is populated with the timezone of the JDBC client instead.

Example

timestamp on db: 2021-03-28T00:15:00.000 (UTC)

timezone of JDBC client: PST/PDT (Vancouver Time)
Calendar cal <- Calendar with UTC timezone
calling getTimestamp(cal) on a result set will return a timestamp like this: 2021-03-28T00:15:00.000 (PST/PDT)
where 2021-03-28T00:15:00.000 appears to be in UTC time but the timezone of the object itself is PST/PDT
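
The mismatch in this example can be reproduced with java.time alone. The sketch below is illustrative and is not the driver's code: the class and method names are hypothetical, and the client zone is assumed to be America/Vancouver. It mimics what Timestamp.valueOf(LocalDateTime) does (interpreting the wall-clock value in the JVM default zone) with an explicit zone, so the result does not depend on the machine it runs on.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

/** Illustrative only: class and method names are hypothetical, not part of the driver. */
class WallClockMismatchDemo {

  /** The instant the database value denotes when it is read as UTC. */
  static Instant intendedInstant(LocalDateTime wallClock) {
    return wallClock.toInstant(ZoneOffset.UTC);
  }

  /**
   * The instant obtained when the same wall-clock value is interpreted in the client zone,
   * which is what Timestamp.valueOf(LocalDateTime) does with the JVM default zone.
   */
  static Instant misreadInstant(LocalDateTime wallClock, ZoneId clientZone) {
    return wallClock.atZone(clientZone).toInstant();
  }

  public static void main(String[] args) {
    LocalDateTime dbValue = LocalDateTime.of(2021, 3, 28, 0, 15);
    ZoneId vancouver = ZoneId.of("America/Vancouver"); // assumed client zone

    Instant intended = intendedInstant(dbValue);
    Instant misread = misreadInstant(dbValue, vancouver);

    System.out.println("intended: " + intended);
    System.out.println("misread:  " + misread);
    System.out.println("shift:    " + Duration.between(intended, misread));
  }
}
```

Vancouver is on PDT (UTC-7) on that date, so the two instants differ by seven hours even though the printed date and time fields look identical.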

What changes are included in this PR?

  • Fixed: The Calendar argument is now used to assert the timezone of the data, i.e. the stored value is treated as being in the Calendar's timezone, and the getter returns a time-related object with the local date, time, and timezone.
  • Fixed: For vectors that do store timezone information (i.e. TimestampVector), the getter methods use the timezone defined in the vector as the timezone assertion and ignore the Calendar if one was passed in.
  • Fixed: applyCalendarOffset now correctly creates the corresponding time-related object using the supplied Calendar's timezone and returns an object with the local date, time, and timezone.

Are these changes tested?

Many of the time-related tests were also affected by this issue, and this change fixes them all. I also tested the driver jar in a JDBC client, and the behaviour is fixed.

Are there any user-facing changes?

 - Old behaviour: the Calendar passed into the corresponding methods was treated as the timezone to convert the data into, which is incorrect according to the JDBC API.
 - Fixed: The Calendar is now used to assert the timezone of the data, i.e. the stored value is treated as being in the timezone defined by the Calendar.
 - Fixed: For vectors that do store timezone information (i.e. TimestampVector), the getter methods use the timezone defined in the vector as the timezone assertion and ignore the Calendar if one was passed in.
…Test

- Made timezone converting logic more readable in ArrowFlightJdbcTimeStampVectorAccessor
- Fixed checkstyle issues
- Refactored getTimestampForVector() to be more concise and readable
@jhmannok jhmannok requested a review from lidavidm as a code owner July 6, 2023 22:24
@github-actions

github-actions bot commented Jul 6, 2023

⚠️ GitHub issue #36518 has been automatically assigned in GitHub to PR creator.

@github-actions

github-actions bot commented Jul 6, 2023

⚠️ GitHub issue #36518 has no components, please add labels for components.

@wgtmac
Member

wgtmac commented Jul 9, 2023

I tried to do a similar thing before: #35139. After discussion with @lidavidm, we agreed on the current behavior. Let me know if you have a different opinion after going through the past discussion.

@jhmannok
Author

Hey @wgtmac! I read through the discussion, and although I agree on keeping the behaviour as is (i.e. keeping the returned timestamp in UTC time), I still think it's worth addressing the issue where the actual "timezone" field of the returned object is not UTC. This causes a lot of confusion when trying to use the Timestamp object (e.g. turning it into a formatted date string, or converting the timestamp into local time easily).

@lidavidm
Member

I think based on your description, yes, we should return a timestamp that is properly in UTC. What I then don't understand is what the point of the Calendar argument is in the first place; I suppose if we pass a non-UTC Calendar, we should localize the UTC timestamp to the given timezone?

@jhmannok
Author

@lidavidm Yeah, I was also confused at first, but according to this article: https://medium.com/@williampuk/sql-timestamp-and-jdbc-timestamp-deep-dive-7ae0ea91e237 the JDBC drivers for some of the major RDBMSs use the provided Calendar instance to assert that the date and time of the timestamp stored in the database is in the timezone specified by the Calendar (i.e. if the timestamp is 2022-07-13 00:15 and we pass in a Calendar at UTC+8, we are asserting that the timestamp is 2022-07-13 00:15 UTC+8). When it is returned, that timestamp is converted into either the local system time or UTC+0 (i.e. we should see 2022-07-12 16:15 UTC+0 in the Timestamp object returned by the method).

Quote from article findings:
Providing a Calendar instance We now look at the values returned from ResultSet::getTimestamp(int, Calendar), i.e. the values denoted by (using NY Cal) . The java.util.Calendar object provided to the call is an instance of Calendar at ‘America/New_York’ time zone. According to the Java docs, the method “uses the given calendar to construct an appropriate millisecond value for the timestamp if the underlying database does not store timezone information.” (https://docs.oracle.com/en/java/javase/11/docs/api/java.sql/java/sql/ResultSet.html#getTimestamp(int,java.util.Calendar)) In other words, by providing a Calendar instance, we are telling the driver that, we know the Date and Time value given by the query result is at the time zone of the provided Calendar instance, so please construct a java.sql.Timestamp storing the instant of time computed by all these pieces of information given. It also suggests it does not make sense to use this method if the target type already contains the time zone information, e.g. TIMESTAMP WITH TIME ZONE .
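
The "assertion" reading of the Calendar argument described in the quote can be sketched as follows. This is a minimal illustration, not the driver's implementation: the class and method names are hypothetical, and Asia/Hong_Kong stands in for the UTC+8 Calendar from the earlier example.

```java
import java.sql.Timestamp;
import java.time.LocalDateTime;
import java.util.Calendar;
import java.util.TimeZone;

/** Illustrative only: the class and method names are hypothetical, not the driver's API. */
class CalendarAssertionDemo {

  /**
   * Treats a zone-less database value as a wall-clock reading in the Calendar's zone
   * and returns the Timestamp for the resulting instant.
   */
  static Timestamp timestampAssertedIn(LocalDateTime dbValue, Calendar calendar) {
    return Timestamp.from(dbValue.atZone(calendar.getTimeZone().toZoneId()).toInstant());
  }

  public static void main(String[] args) {
    LocalDateTime dbValue = LocalDateTime.of(2022, 7, 13, 0, 15);
    // Asia/Hong_Kong is UTC+8 all year, standing in for the UTC+8 Calendar.
    Calendar utcPlus8 = Calendar.getInstance(TimeZone.getTimeZone("Asia/Hong_Kong"));

    Timestamp ts = timestampAssertedIn(dbValue, utcPlus8);
    // Print the underlying instant; Timestamp.toString() would use the JVM default zone.
    System.out.println(ts.toInstant()); // 2022-07-12T16:15:00Z
  }
}
```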

@lidavidm
Member

Ok, thanks for the references. I'll try to look at this more closely when I get a chance but my time for Java related work is limited these days, so it may be a while.

@@ -177,7 +174,7 @@ protected static TimeZone getTimeZoneForVector(TimeStampVector vector) {

String timezoneName = arrowType.getTimezone();
if (timezoneName == null) {
- return TimeZone.getTimeZone("UTC");
+ return null;

Member

OK. This change seems to be the right one: an Arrow timestamp has no defined timezone if the timezone name is not set.

Member

+1

@@ -91,14 +92,11 @@ private LocalDateTime getLocalDateTime(Calendar calendar) {
long value = holder.value;

LocalDateTime localDateTime = this.longToLocalDateTime.fromLong(value);
ZoneId defaultTimeZone = TimeZone.getDefault().toZoneId();

Member

I think we should avoid the default time zone here, since that will be system dependent?

Author

So the idea here is that if no calendar is supplied, we use the system timezone as the assertion of what timezone the timestamp value from the db is in. So during the accessor process, if the DB contains 2021-03-28 00:15 and we don't give the getter a calendar, we assume 2021-03-28 00:15 is a date-time in the system timezone. Since the timestamp is then already in the local timezone, no conversion is needed and the value returned by the getter is 2021-03-28 00:15 (LOCAL TIMEZONE).

Comment on lines 96 to 97
ZoneId sourceTimeZone = nonNull(this.timeZone) ? this.timeZone.toZoneId() :
nonNull(calendar) ? calendar.getTimeZone().toZoneId() : defaultTimeZone;

Member

1) Can we just compare to null explicitly, and 2) can we write this as an if-else chain?

Author

done!

Comment on lines 179 to 180
TimeZone finalTimeZoneForResultWithoutCalendar = ofNullable(getTimeZoneForVector(vector))
.orElse(TimeZone.getDefault());

Member

Hmm, we should be able to assert what time zone is returned here right? And again, we should avoid using the system timezone for things?

Author

yep got it!

@github-actions github-actions bot added the "awaiting changes" label and removed the "awaiting review" label Jul 21, 2023
…one an if statement and changed all nonNull usage to compare with null explicitly

ArrowFlightJdbcTimeStampVectorAccessorTest: Modified test so system timezone is compared more explicitly
@github-actions github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label Jul 24, 2023
Member

@lidavidm lidavidm left a comment

The logic is hard to follow (the original code sure does not help...) But I'm not sure if the changes here are defensible without more explanation of the logic in the code. It might help to make sure we have clear test cases of expected behavior first.

Comment on lines +47 to +49
Instant currInstant = Instant.ofEpochMilli(milliseconds);
LocalDateTime getTimestampWithoutTZ = LocalDateTime.ofInstant(currInstant,
TimeZone.getTimeZone("UTC").toZoneId());

Member

Can we directly use ofEpochSecond instead of bouncing through an Instant and a timezone conversion?
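
One way to read this suggestion is sketched below, assuming the value is epoch milliseconds and the target offset is UTC. The class and method names are hypothetical; LocalDateTime.ofEpochSecond(long, int, ZoneOffset) is the standard java.time factory.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

/** Illustrative only: class and method names are hypothetical. */
class EpochToLocalDateTimeDemo {

  /** The approach in the diff: go through an Instant and a zone lookup. */
  static LocalDateTime viaInstant(long epochMillis) {
    return LocalDateTime.ofInstant(Instant.ofEpochMilli(epochMillis), ZoneOffset.UTC);
  }

  /** Direct construction; floorDiv/floorMod keep pre-1970 (negative) values correct. */
  static LocalDateTime direct(long epochMillis) {
    long seconds = Math.floorDiv(epochMillis, 1000L);
    int nanos = (int) Math.floorMod(epochMillis, 1000L) * 1_000_000;
    return LocalDateTime.ofEpochSecond(seconds, nanos, ZoneOffset.UTC);
  }

  public static void main(String[] args) {
    long millis = 1616890500123L;
    System.out.println(viaInstant(millis));
    System.out.println(direct(millis));
  }
}
```

Both methods should yield the same LocalDateTime; the direct form simply skips the intermediate Instant.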

LocalDateTime getTimestampWithoutTZ = LocalDateTime.ofInstant(currInstant,
TimeZone.getTimeZone("UTC").toZoneId());
ZonedDateTime parsedTime = getTimestampWithoutTZ.atZone(calendar.getTimeZone().toZoneId());
return parsedTime.toEpochSecond() * 1000;

Member

This appears to truncate the milliseconds?

Member

parsedTime.toInstant().toEpochMilli() seems like it would avoid this problem.
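
The truncation and the suggested fix can be compared side by side. This is a sketch with hypothetical class and method names that reuses the conversion from the diff above, with the zone passed in explicitly instead of taken from a Calendar.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

/** Illustrative only: class and method names are hypothetical. */
class MillisTruncationDemo {

  /** Reads the millis as a UTC wall clock, then re-attaches it to the given zone. */
  private static ZonedDateTime reinterpret(long epochMillis, ZoneId zone) {
    LocalDateTime naive = LocalDateTime.ofInstant(Instant.ofEpochMilli(epochMillis), ZoneOffset.UTC);
    return naive.atZone(zone);
  }

  /** As in the diff: toEpochSecond() drops the sub-second part. */
  static long truncating(long epochMillis, ZoneId zone) {
    return reinterpret(epochMillis, zone).toEpochSecond() * 1000;
  }

  /** As suggested in review: toInstant().toEpochMilli() keeps the milliseconds. */
  static long preserving(long epochMillis, ZoneId zone) {
    return reinterpret(epochMillis, zone).toInstant().toEpochMilli();
  }

  public static void main(String[] args) {
    long millis = 1616890500123L; // 2021-03-28T00:15:00.123Z
    ZoneId zone = ZoneId.of("America/Vancouver");
    System.out.println(preserving(millis, zone) - truncating(millis, zone)); // 123
  }
}
```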

@@ -36,21 +38,17 @@ private DateTimeUtils() {
}

/**
- * Subtracts given Calendar's TimeZone offset from epoch milliseconds.
+ * Apply calendar timezone to epoch milliseconds.

Member

Can we document this function and what steps it's doing in detail?

@@ -36,21 +38,17 @@ private DateTimeUtils() {
}

/**
- * Subtracts given Calendar's TimeZone offset from epoch milliseconds.
+ * Apply calendar timezone to epoch milliseconds.
*/
public static long applyCalendarOffset(long milliseconds, Calendar calendar) {

Member

Hmm, looking at how it's used, now I'm thinking both implementations are wrong. For instance, it's used in DateVectorAccessor and is applied to the value from the vector and the user-supplied calendar. This function appears to assume the given milliseconds value is a naive timestamp in the given (or default) timezone, and converts it to a UTC timestamp. This is backwards. The value from the Arrow vector is ALREADY a UTC timestamp!

That said, this function does seem to do the right thing for other places where it is used. So I think we need to name and properly document it, and remove it from places where it should NOT be applied.

Comment on lines +97 to +105
+ if (this.timeZone != null) {
+   sourceTimeZone = this.timeZone.toZoneId();
+ } else if (calendar != null) {
+   sourceTimeZone = calendar.getTimeZone().toZoneId();
+ } else {
+   sourceTimeZone = defaultTimeZone;
+ }
- return localDateTime;

+ return localDateTime.atZone(sourceTimeZone).withZoneSameInstant(defaultTimeZone).toLocalDateTime();

Member

I'm not sure this is quite right: if the Arrow vector has a timezone, the underlying timestamp is always relative to UTC. this.timeZone doesn't reflect that.
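
To illustrate the point: for a timezone-aware vector the stored long already identifies an instant relative to the UTC epoch, and the vector's timezone only affects how that instant is rendered. The sketch below assumes a millisecond unit; the class and method names are hypothetical.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;

/** Illustrative only: class and method names are hypothetical. */
class ZonedVectorValueDemo {

  /** Renders a UTC-based epoch value in the vector's zone without changing the instant. */
  static ZonedDateTime render(long epochMillisUtc, String vectorTimeZone) {
    return Instant.ofEpochMilli(epochMillisUtc).atZone(ZoneId.of(vectorTimeZone));
  }

  public static void main(String[] args) {
    long stored = 1616890500000L; // 2021-03-28T00:15:00Z
    ZonedDateTime rendered = render(stored, "America/Vancouver");
    System.out.println(rendered);
    // The instant is unchanged by rendering it in another zone.
    System.out.println(rendered.toInstant().toEpochMilli() == stored); // true
  }
}
```

Attaching the vector's zone to the UTC wall-clock fields instead (as `sourceTimeZone = this.timeZone.toZoneId()` does) would move the instant by the zone's offset.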

@github-actions github-actions bot added the "awaiting changes" label and removed the "awaiting change review" label Jul 26, 2023
@jhmannok
Author

jhmannok commented Jul 28, 2023

The logic is hard to follow (the original code sure does not help...) But I'm not sure if the changes here are defensible without more explanation of the logic in the code. It might help to make sure we have clear test cases of expected behavior first.

I think it's worth clarifying what my logic/presumptions are.

So essentially, the logic in the change is that:

  1. The timestamp value is retrieved from the database (i.e. the database shows 2013-03-28 00:00:00.000). The database does not store any timezone information, but the timestamp 2013-03-28 00:00:00 has an inherent timezone that is either defined by the database itself or known to the users. The Calendar object that can be passed into the accessor methods lets users assert this information.
  2. When this value is retrieved from the database, the accessor method for getTimestamp first creates a LocalDateTime object from the epoch milliseconds of the timestamp (ArrowFlightJdbcTimeStampVectorAccessor: line 93).
  3. We then determine what timezone the LocalDateTime should be in (ArrowFlightJdbcTimeStampVectorAccessor: lines 97-103).
  4. Since ArrowFlightJdbcTimeStampVectorAccessor#getTimestamp (line 135) uses Timestamp.valueOf(LocalDateTime) to create the Timestamp object returned to the client, ArrowFlightJdbcTimeStampVectorAccessor: line 105 creates a ZonedDateTime from the LocalDateTime plus the timezone determined in step 3, attaching the timezone to the LocalDateTime value (no timezone conversion takes place yet). Then we convert the timestamp to the system default timezone and return it as a LocalDateTime.
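
Steps 2 to 4 can be sketched as a standalone method. This is an approximation of the logic under discussion, not the accessor's actual code: the class name is hypothetical, and the default zone is passed in as a parameter (an assumption made here so the behaviour is reproducible) rather than read from TimeZone.getDefault().

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.util.Calendar;
import java.util.TimeZone;

/** Illustrative only: a standalone approximation of the steps listed above. */
class GetLocalDateTimeSketch {

  static LocalDateTime getLocalDateTime(
      long epochMillis, TimeZone vectorZone, Calendar calendar, ZoneId defaultZone) {
    // Step 2: read the stored value as zone-less wall-clock fields.
    LocalDateTime wallClock =
        LocalDateTime.ofInstant(Instant.ofEpochMilli(epochMillis), ZoneOffset.UTC);

    // Step 3: pick the zone the wall clock is asserted to be in.
    ZoneId sourceZone;
    if (vectorZone != null) {
      sourceZone = vectorZone.toZoneId();
    } else if (calendar != null) {
      sourceZone = calendar.getTimeZone().toZoneId();
    } else {
      sourceZone = defaultZone;
    }

    // Step 4: attach the zone, then convert to the default zone, because
    // Timestamp.valueOf(LocalDateTime) interprets its argument in the default zone.
    return wallClock.atZone(sourceZone).withZoneSameInstant(defaultZone).toLocalDateTime();
  }

  public static void main(String[] args) {
    long stored = LocalDateTime.of(2013, 3, 28, 0, 0).toInstant(ZoneOffset.UTC).toEpochMilli();
    Calendar utc = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
    ZoneId vancouver = ZoneId.of("America/Vancouver");

    System.out.println(getLocalDateTime(stored, null, utc, vancouver));  // 2013-03-27T17:00
    System.out.println(getLocalDateTime(stored, null, null, vancouver)); // 2013-03-28T00:00
  }
}
```

Note that the first branch (using the vector's zone as the source zone) is the part the reviewers question for timezone-aware vectors.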

What happened before was that the LocalDateTime returned by the getLocalDateTime method reflected the timestamp being converted from the timezone of the calendar input (or the system default when it is null) to UTC (since in the original code the timezone of the vector defaulted to UTC). But Timestamp.valueOf(LocalDateTime) interprets the LocalDateTime passed in as if it were in the system default timezone, which caused the issue: we got the correct date and time value, but the timezone of the Timestamp object is the system timezone instead of UTC (or whatever the timezone of the vector is).

I could modify my code so that it keeps the same behaviour as before, but with the timezone of the returned Timestamp object accurately reflected.

- protected static LongToLocalDateTime getLongToLocalDateTimeForVector(TimeStampVector vector,
-     TimeZone timeZone) {
-   String timeZoneID = timeZone.getID();
+ protected static LongToLocalDateTime getLongToUTCDateTimeForVector(TimeStampVector vector) {

Member

It would be good if we can rename the return type LongToLocalDateTime as well.

-     TimeZone timeZone) {
-   String timeZoneID = timeZone.getID();
+ protected static LongToLocalDateTime getLongToUTCDateTimeForVector(TimeStampVector vector) {
+   String timeZoneID = "UTC";

ArrowType.Timestamp arrowType =
(ArrowType.Timestamp) vector.getField().getFieldType().getType();

Member

In the switch block below, you may use the overload without timeZoneID, which is much simpler:

DateUtility.getLocalDateTimeFromEpochNano(nanoseconds);
DateUtility.getLocalDateTimeFromEpochMicro(microseconds);
DateUtility.getLocalDateTimeFromEpochMilli(milliseconds);
DateUtility.getLocalDateTimeFromEpochMilli(TimeUnit.SECONDS.toMillis(seconds));

ZoneId sourceTimeZone;

if (this.timeZone != null) {
sourceTimeZone = this.timeZone.toZoneId();

Member

As per the comment from @lidavidm, sourceTimeZone should be UTC if the vector has provided a valid timezone.

@jhmannok
Author

@lidavidm @wgtmac

@lidavidm
Member

Sorry, I would like to sit down and make a few tests and compare them with other drivers to make sure I have a good handle on what's going on here. But I don't have much time for Arrow Java development beyond simple maintenance PRs.

@wgtmac
Member

wgtmac commented Aug 22, 2023

@lidavidm Do you know of any other expert who can join the discussion? I'd like to follow up on this fix, but my experience with this is limited.

@lidavidm
Member

No, I don't think any of the contributors that originally worked on the JDBC driver are still active in the project.

@lidavidm
Member

My experience is also limited, hence why I wanted to construct the test cases.

@wgtmac
Member

wgtmac commented Aug 23, 2023

Thanks, I agree with you @lidavidm

@jduo
Member

jduo commented Oct 4, 2023

@jhmannok , could you list the client applications that you've verified this change with? It'd be great to know what's been covered since this API is not that clear.

@jduo
Member

jduo commented Oct 4, 2023

I might have missed this as well, but can you verify the behaviors of the Oracle and SQL Server drivers when using time with timezone vs raw time types?

-     TimeZone timeZone) {
-   String timeZoneID = timeZone.getID();
+ protected static LongToUTCDateTime getLongToUTCDateTimeForVector(TimeStampVector vector) {
+   String timeZoneID = "UTC";

Member

Nit: Make final

ZonedDateTime sourceTZDateTime = LocalDateTime
.ofInstant(Instant.ofEpochMilli(millis), TimeZone.getTimeZone("UTC").toZoneId())
.atZone(TimeZone.getTimeZone(timeZone).toZoneId());
expectedTimestamp = new Timestamp(sourceTZDateTime.toEpochSecond() * 1000);

Member

This looks to truncate millis as well.

@scruz-denodo

Hi,

I see that this PR has been without movement for several months. I am interested in continuing with the changes.

This PR is based on Arrow 13.0.0, but the current version is 16. What would be the better way: creating a new PR with these changes on top of the current Arrow version, or continuing here?

@jhmannok, are you ok if I go on with this work?

@scruz-denodo

I created a new PR with the changes #43149

6 participants