DRILL-5002: Using hive's date functions on top of date column gives w… #937
paul-rogers left a comment
This PR compounds existing errors; it does not move us toward a stable date/time solution.
The original TPC-H data is in no time zone: it is just a date (called a "local date" in the ISO 8601 standard). According to the standard, "If no UTC relation information is given with a time representation, the time is assumed to be in local time. While it may be safe to assume local time when communicating in the same time zone, it is ambiguous when used in communicating across different time zones."
Another way to think of it is that, without a timezone, a date has no association with UTC -- it cannot be converted to UTC, or even to a specific local time.
Why is that? The data in TPC-H is a date. It is the same date if you are in UTC+11 or UTC-11. It is not "1994-06-01, converted to PST by the server, reinterpreted as local time in Ukraine." Doing it that way changes the date since 1994-06-01T00:00:00 PST will be 1994-06-02T10:00:00.
The point is, dates and times without a tz have no fixed relation to UTC and should never be converted to UTC unless the user provides the tz needed to do so.
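To make the shift concrete, here is a minimal sketch using `java.time`. It is not Drill code: the class and method names are hypothetical, and the fixed offsets UTC+11 and UTC-11 are taken from the example above. It pins a zone-less date to midnight at one offset, then reads back the calendar date of the same instant at another offset.

```java
import java.time.LocalDate;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;

// Hypothetical illustration class; not part of Drill or Hive.
final class DateShiftSketch {

    // Pins a zone-less date to midnight at offset `from`, then returns the
    // calendar date that the same instant falls on at offset `to`.
    static LocalDate reinterpret(LocalDate date, ZoneOffset from, ZoneOffset to) {
        OffsetDateTime pinned = date.atStartOfDay().atOffset(from);
        return pinned.withOffsetSameInstant(to).toLocalDate();
    }

    public static void main(String[] args) {
        LocalDate d = LocalDate.of(1994, 6, 1);
        // Midnight at UTC+11 is still the previous day at UTC-11.
        System.out.println(reinterpret(d, ZoneOffset.ofHours(11), ZoneOffset.ofHours(-11)));
        // Midnight at UTC-11 is late evening of the same day at UTC+11.
        System.out.println(reinterpret(d, ZoneOffset.ofHours(-11), ZoneOffset.ofHours(11)));
    }
}
```

Whether the date survives depends entirely on which pair of offsets happens to be involved, which is why the conversion should not happen unless the user supplies the tz.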
The original bug is due to the fact that Drill claims to have date/times with no tz, but, in fact, the implementation uses local time (relative to UTC) to store these values, but without a tz. So, we use a Date (which should be UTC), but have it store a value that represents the time in local time. We then do work on this and hope for the best.
The proper solution is to correct the date/time mechanism so that:
The Hive functions appear to be written to work without a timezone. That is, to get the month from 1994-06-01, one does not need a time zone.
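As an illustration of the tz-less approach (a sketch with `java.time`, not Hive's actual implementation; the class and method names are hypothetical), the month can be read straight from the date fields with no time zone anywhere in the code:

```java
import java.time.LocalDate;

// Hypothetical illustration class; not Hive's or Drill's implementation.
final class MonthOfDateSketch {

    // Parses an ISO-8601 date such as "1994-06-01" and returns its month (1-12).
    // No time zone or UTC conversion is involved at any point.
    static int monthOf(String isoDate) {
        return LocalDate.parse(isoDate).getMonthValue();
    }

    public static void main(String[] args) {
        System.out.println(monthOf("1994-06-01"));
    }
}
```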
The bug here seems to be that we are trying to use the wrong mechanism to produce the result, so we are adding hacks on top of those errors without actually addressing the root cause of the problem.
vdiravka left a comment
@paul-rogers I totally agree that Drill's mechanism for storing and representing date-time values should be updated according to your notes. But I am trying not to mix the issues you mention with this Hive function date-time issue, because this issue is connected to a different manner of storing date-time values in DrillBuf. If date-time values should be stored in DrillBuf as the same values for any timezone (timestamp values in the UTC timezone), then those changes should be made in the code. Other date-time issues can be fixed in the context of DRILL-5332 and DRILL-5334.
The original description talks about data with local times. The TPC-H data has no TZ. Now, maybe we made one up in creating the Parquet files, but the original date just has dates without a tz.
The fundamental issue is that if we have a tz-less date, 1994-08-12, say, then this cannot be converted to a UTC timestamp. Which of the 23+ time zones would we use? How would the client and server agree on the arbitrary tz? This is like saying that I have a measurement in miles, but we can store distances only in km, so I'll take my length of 5 miles and store it as 5 km, remembering that I'm using km as an alias for miles. Does not make sense.
Your example uses
That is, TPC-H dates are not midnight on some date in some timezone, they are just dates. They cannot be converted to UTC. And so, they should not be subject to time zone shifting as tzs shift.
My point here is that Hive (according to the docs) implements functions correctly: using tz-less dates. Drill tries to convert to a (fake) UTC and use time-based functions on that data. This is, at best, a hack, and at worst, leads to great complexity and incorrect results.
That said, if all we have is km, and we can't do the miles-to-km conversion correctly, then we do need a way to know that a particular km value is actually miles. Similarly, using the current implementation, how will we know that a particular arbitrary-local-time-encoded-as-fake-UTC value really is local time vs. being an actual UTC time?
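One way to picture "knowing that a km value is actually miles" is to carry an explicit tag next to the stored number. The sketch below is purely hypothetical: none of these names exist in Drill, and it assumes the stored value is epoch milliseconds. It only shows the idea of refusing to read a value under the wrong interpretation.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

// Hypothetical illustration class; nothing here is an existing Drill type.
final class TaggedMillisSketch {

    // Says how the raw millisecond count must be interpreted.
    enum Kind { UTC_INSTANT, LOCAL_AS_FAKE_UTC }

    final Kind kind;
    final long millis;

    private TaggedMillisSketch(Kind kind, long millis) {
        this.kind = kind;
        this.millis = millis;
    }

    // Encodes a zone-less local date/time as if it were UTC, and tags it as such.
    static TaggedMillisSketch ofLocal(LocalDateTime local) {
        return new TaggedMillisSketch(Kind.LOCAL_AS_FAKE_UTC,
                local.toInstant(ZoneOffset.UTC).toEpochMilli());
    }

    // Stores a real point in time.
    static TaggedMillisSketch ofInstant(Instant instant) {
        return new TaggedMillisSketch(Kind.UTC_INSTANT, instant.toEpochMilli());
    }

    // Decodes the local fields; rejects values that are real UTC instants.
    LocalDateTime asLocal() {
        if (kind != Kind.LOCAL_AS_FAKE_UTC) {
            throw new IllegalStateException("value is a UTC instant, not a local date/time");
        }
        return LocalDateTime.ofInstant(Instant.ofEpochMilli(millis), ZoneOffset.UTC);
    }

    public static void main(String[] args) {
        LocalDateTime local = LocalDateTime.of(1994, 6, 1, 0, 0);
        System.out.println(ofLocal(local).asLocal());
    }
}
```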
All that said, if your fix makes the current implementation work better, then it is a good improvement.
In the interests of moving ahead, let's table the basic discussion and just look at this one fix.