Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRILL-5002: Using hive's date functions on top of date column gives w… #937

Closed
wants to merge 1 commit into from

Conversation

Projects
None yet
2 participants
@vdiravka
Copy link
Member

commented Sep 7, 2017

…rong results for local time-zone

@vdiravka vdiravka force-pushed the vdiravka:DRILL-5002 branch from d65ebbd to bfdbf9e Sep 7, 2017

@paul-rogers
Copy link
Contributor

left a comment

This PR compounds existing errors; it does not move us toward a stable date/time solution.

The original TPC-H data is in no time zone: it is just a date (called a "local date" in the ISO 8601 standard." According to the standard, "If no UTC relation information is given with a time representation, the time is assumed to be in local time. While it may be safe to assume local time when communicating in the same time zone, it is ambiguous when used in communicating across different time zones."

Another way to think of it is that, with a timezone, a date has no association with UTC -- it cannot be converted to UTC, even to local time.

Why is that? The data in TPC-H is a date. It is the same date if you are in UTC+11 or UTC-11. If is not "1994-06-01, converted to PST by the server, reinterpreted as local time in Ukraine." Doing it that way changes the date since 1994-06-01T00:00:00 PST will be 1996-06-02T10:00:00.

The point is, dates and times without a tz have no fixed relation to UTC and should never be converted to UTC unless the user provides the tz needed to do so.

The original bug is due to the fact that Drill claims to have date/times with no tz, but, in fact, the implementation uses local time (relative to UTC) to store these values, but without a tz. So, we use a Date (which should be UTC), but have it store a value that represents the time in local time. We then do work on this and hope for the best.

The proper solution is to correct the date/time mechanism so that:

  • Date is a tz-undefined date.
  • Time is a tz-undefined (date undefined) time.
  • Date/time is a tz-undefined date time.
  • Date/time tz is a date/time relative to UTC. (AKA "timestamp")

Rules:

  • timestamp = convert_to_utc(no-tz-date_time, tz)
  • no-tz-date-time = convert_from_utc(timestamp)
  • etc.

The Hive functions appear to be written to work without a timezone. That is, to get the month from 1994-06-01, one does not need a time zone.

The bug here seems to be that we are trying to us the wrong mechanism to produce the result, so we are adding hacks on top of those errors without actually addressing the root cause of the problem.

@@ -204,7 +204,11 @@ public static JBlock getDrillObject(JCodeModel m, ObjectInspector oi,
<#elseif entry.hiveType == "TIMESTAMP">
JVar tsVar = jc._else().decl(m.directClass(java.sql.Timestamp.class.getCanonicalName()), "ts",
castedOI.invoke("getPrimitiveJavaObject").arg(returnValue));
jc._else().assign(returnValueHolder.ref("value"), tsVar.invoke("getTime"));
// Bringing relative timestamp value without timezone info to timestamp value in UTC, since Drill keeps date-time values in UTC

This comment has been minimized.

Copy link
@paul-rogers

paul-rogers Sep 10, 2017

Contributor

Drill does not actually keep values in UTC: Drill keeps values in the server time zone. This is why Drill dates, when shipped to a client with a different time zone, get corrupted.

This comment has been minimized.

Copy link
@vdiravka

vdiravka Sep 11, 2017

Author Member

I'm not fully agreed. Let me explain:

Let's take for example TimeStampVector, Drill keeps date-time values in DrillBuf as millis from epoch. Then timestamp values are extracted via getObject() method (I agree that the code in this method is questionable, but it is not current issue, this jira contains other bug).
For PST timezone machine and query select timestamp '1970-01-01 00:00:00' from (VALUES(1)) the 0 millis timestamp value is stored in DrillBuf and only on extracting stage the value is represented with server timezone. The same result for querying the timestamp data from any data source: '1970-01-01 00:00:00' will be stored as 0 millis timestamp for any timezone. But only for hive functions the logic differs:
Let's consider select to_utc_timestamp('1969-12-31 16:00:00','PST') from (VALUES(1)). The result should be "1970-01-01 00:00:00" even with current Drill date/time logic.
The output from hive's UDF is "java.sql.Timestamp" value (for PST timezone machine -- 28800000 millis internally, for UTC timezone machine -- 0 millis internally). But in favour of above logic of getObject() the 0 millis timestamp value should be storing in the DrillBuf for any timezone.

In the result, with my fix the same timestamp values will be stored in DrillBuf for any timezone machine.
Let me know if I am wrong with something.

@vdiravka
Copy link
Member Author

left a comment

@paul-rogers I am totally agreed that Drill's mechanism of storing and representing of date-time values should be updated according to your notes. But I am trying do not mix the issues mentioned by you and this hive function date-time issues, because this issue is connected to the other manner of storing date-time values in DrillBuf. If date-time values should be stored in DrillBuf as same values for any timezone (timestamp values in UTC timezone [1]) - these changes should be in the code. Other date-time issues can be fixed in context of DRILL-5332 and DRILL-5334.

[1] https://drill.apache.org/docs/date-time-and-timestamp/#date,-time,-and-timestamp

@@ -204,7 +204,11 @@ public static JBlock getDrillObject(JCodeModel m, ObjectInspector oi,
<#elseif entry.hiveType == "TIMESTAMP">
JVar tsVar = jc._else().decl(m.directClass(java.sql.Timestamp.class.getCanonicalName()), "ts",
castedOI.invoke("getPrimitiveJavaObject").arg(returnValue));
jc._else().assign(returnValueHolder.ref("value"), tsVar.invoke("getTime"));
// Bringing relative timestamp value without timezone info to timestamp value in UTC, since Drill keeps date-time values in UTC

This comment has been minimized.

Copy link
@vdiravka

vdiravka Sep 11, 2017

Author Member

I'm not fully agreed. Let me explain:

Let's take for example TimeStampVector, Drill keeps date-time values in DrillBuf as millis from epoch. Then timestamp values are extracted via getObject() method (I agree that the code in this method is questionable, but it is not current issue, this jira contains other bug).
For PST timezone machine and query select timestamp '1970-01-01 00:00:00' from (VALUES(1)) the 0 millis timestamp value is stored in DrillBuf and only on extracting stage the value is represented with server timezone. The same result for querying the timestamp data from any data source: '1970-01-01 00:00:00' will be stored as 0 millis timestamp for any timezone. But only for hive functions the logic differs:
Let's consider select to_utc_timestamp('1969-12-31 16:00:00','PST') from (VALUES(1)). The result should be "1970-01-01 00:00:00" even with current Drill date/time logic.
The output from hive's UDF is "java.sql.Timestamp" value (for PST timezone machine -- 28800000 millis internally, for UTC timezone machine -- 0 millis internally). But in favour of above logic of getObject() the 0 millis timestamp value should be storing in the DrillBuf for any timezone.

In the result, with my fix the same timestamp values will be stored in DrillBuf for any timezone machine.
Let me know if I am wrong with something.

@paul-rogers

This comment has been minimized.

Copy link
Contributor

commented Sep 13, 2017

The original description talks about data with local times. The TPC-H data has no TZ. Now, maybe we made one up in creating the Parquet files, but the original date just has dates without a tz.

The fundamental issue is that if we have a tz-less date, 1994-08-12, say, then this cannot be converted to a UTC timestamp. Which of the 23+ time zones would we use? How would the client and server agree on the arbitrary tz? This is like saying that I have a measurement in miles, but we can store distances only in km, so I'll take my length of 5 miles and store it as 5 km, remembering that I'm using km as an alias for miles. Does not make sense.

Your example uses timestamp constants. A timestamp is defined with a timezone, and so it fits Drill's model well. But, TPC dates don't have a timezone. See the TPC-H spec which says:

Date is a value whose external representation can be expressed as YYYY-MM-DD, where all characters are numeric. A date must be able to express any day within at least 14 consecutive years. There is no requirement specific to the internal representation of a date.

That is, TPC-H dates are not midnight on some date in some timezone, they are just dates. The cannot be converted to UTC. And so, they should not be subject to time zone shifting as tzs shift.

My point here is that Hive (according to the docs) implements functions correctly: using tz-less dates. Drill tries to convert to a (fake) UTC and use time-based functions on that data. This is, at best, a hack, and at worst, leads to great complexity and incorrect results.

That said, if all we have is km, and we can't do the miles-to-km conversion correctly, then we do need a way to know that a particular km value is actually miles. Similarly, using the current implementation, how will we know that a particular arbitrary-local-time-encoded-as-fake-UTC value really is local time vs. being an actual UTC time?

All that said, if you fix makes the current implementation work better, then it is a good improvement.

In the interests of moving ahead, let's table the basic discussion and just look at this one fix.

@paul-rogers
Copy link
Contributor

left a comment

Although I disagree with the underlying design, this fix is consistent with the current (incorrect) implementation.
+1

@asfgit asfgit closed this in 4c99f0c Sep 20, 2017

kkhatua added a commit to kkhatua/drill that referenced this pull request Oct 2, 2017

ilooner pushed a commit to ilooner/drill that referenced this pull request Oct 18, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.