[SPARK-10177] [SQL] fix reading Timestamp in parquet from Hive #8400
Conversation
Conflicts: sql/hive/src/test/scala/org/apache/spark/sql/hive/ParquetHiveCompatibilitySuite.scala
Test build #41466 has finished for PR 8400 at commit
Test build #41474 has finished for PR 8400 at commit
Test build #1688 has finished for PR 8400 at commit
```diff
@@ -37,7 +37,8 @@ object DateTimeUtils {
   type SQLTimestamp = Long

   // see http://stackoverflow.com/questions/466321/convert-unix-timestamp-to-julian
-  final val JULIAN_DAY_OF_EPOCH = 2440587 // and .5
+  // it's 2440587.5, rounding up to compatible with Hive
+  final val JULIAN_DAY_OF_EPOCH = 2440588 //
```
Nit: removing the trailing `//`?
Can you explain what "they are overlapped" means?
I feel like Impala and Hive did the Julian day conversion in the wrong way at first, but left it as is and made the wrong conversion logic the de facto "standard" way. I don't see an intuitive reason why we should store a Julian timestamp 12 hours later than the real Julian timestamp in the first place. But making Spark SQL behave the same as Impala/Hive makes sense, even if they are doing the conversion wrong. Otherwise it would be pretty hard to deal with legacy data, since neither Spark SQL nor Hive writes version information into generated Parquet files.
@yhuai The Julian day of the Unix epoch (1970-01-01 00:00:00) is 2440587.5. If we use an integer for Julian days, the day gets cut off at noon (12pm), but the nanoseconds of the day are counted from midnight (12am), so the two ranges overlap between midnight and noon. Hive creates a calendar from the Julian day (which sets the hour to 12pm) and then re-sets the hours from the nanoseconds. In this patch, we use a trick: shift the Julian day by 0.5 and round 0.5 down to zero (Hive rounds 0.5 up to 1), so the two parts can be added together directly.
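The shift-and-round trick described above can be sketched in a few lines. This is a minimal Python illustration of the arithmetic, not Spark's actual code; the function names are hypothetical:

```python
MICROS_PER_DAY = 24 * 60 * 60 * 1000 * 1000

# 2440587.5 is the true Julian day of the Unix epoch; rounding it up to
# 2440588 shifts the integer day boundary to midnight, so the
# nanoseconds-of-day (which are counted from midnight) add in cleanly.
JULIAN_DAY_OF_EPOCH = 2440588

def to_julian_day(us):
    """Split microseconds since the Unix epoch into (julianDay, timeOfDayNanos)."""
    day, micros = divmod(us, MICROS_PER_DAY)  # floor division handles pre-epoch values
    return day + JULIAN_DAY_OF_EPOCH, micros * 1000

def from_julian_day(day, nanos):
    """Inverse: recombine (julianDay, timeOfDayNanos) into microseconds since the epoch."""
    return (day - JULIAN_DAY_OF_EPOCH) * MICROS_PER_DAY + nanos // 1000
```

With this rounding, the epoch itself maps to `(2440588, 0)` and every value round-trips exactly, including pre-epoch timestamps.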
LGTM now |
Test build #41520 has finished for PR 8400 at commit
Merging to master and branch-1.5.
We misunderstood the Julian days and nanoseconds of the day in parquet (as TimestampType) from Hive/Impala; they overlap, so they can't be added together directly. In order to avoid confusing rounding when doing the conversion, we use `2440588` as the Julian day of the Unix epoch (which should be 2440587.5).

Author: Davies Liu <davies@databricks.com>
Author: Cheng Lian <lian@databricks.com>

Closes #8400 from davies/timestamp_parquet.

(cherry picked from commit 2f493f7)
Signed-off-by: Cheng Lian <lian@databricks.com>
(EDIT: The conclusion made in this comment is WRONG.) Just would like to point out that our original conversion:

```scala
import java.sql._
import java.util._

import org.apache.hadoop.hive.ql.io.parquet.timestamp._
import org.apache.spark.sql.catalyst.util._

TimeZone.setDefault(TimeZone.getTimeZone("GMT"))

val timestamp = Timestamp.valueOf("1970-01-01 00:00:00")
val hiveNanoTime = NanoTimeUtils.getNanoTime(timestamp, false)
val hiveJulianDay = hiveNanoTime.getJulianDay
val hiveTimeOfDayNanos = hiveNanoTime.getTimeOfDayNanos

println(
  s"""Hive converts "$timestamp" to Julian timestamp:
     |(julianDay=$hiveJulianDay, timeOfDayNanos=$hiveTimeOfDayNanos)
   """.stripMargin)

val (sparkJulianDay, sparkTimeOfDayNanos) =
  DateTimeUtils.toJulianDay(DateTimeUtils.fromJavaTimestamp(timestamp))

println(
  s"""Spark converts "$timestamp" to Julian timestamp:
     |(julianDay=$sparkJulianDay, timeOfDayNanos=$sparkTimeOfDayNanos)
   """.stripMargin)
```

The result is:
while the correct answer should be:
So Hive is 12 hours later than the expected Julian timestamp, while we were 24 hours later :)
Asked for more background about Hive's 12 hr offset on HIVE-6394, which is the original JIRA ticket for introducing timestamp support for Parquet.
@liancheng Thanks for the comments, but you may have made a mistake somewhere; I got this WITHOUT this PR:
The result you posted for Spark is totally wrong (you can't get the timestamp back after one round trip); fortunately this is a false alarm :) The Hive one is bad but not wrong (a different rounding method; you can still read the timestamp back). In general, maybe using Julian days here is the root of the bad things; using a signed integer for the number of days since the epoch should be enough (it could be negative).
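The alternative suggested above, a plain signed day count since the Unix epoch instead of a Julian day, could look like the following hypothetical sketch (not anything Spark or Hive actually does):

```python
MICROS_PER_DAY = 86_400_000_000

def to_epoch_day(us):
    # Signed days since 1970-01-01: pre-epoch timestamps are simply negative
    # day numbers, and no 0.5-day offset or rounding convention is involved.
    day, micros = divmod(us, MICROS_PER_DAY)
    return day, micros * 1000

def from_epoch_day(day, nanos):
    # Inverse conversion back to microseconds since the Unix epoch.
    return day * MICROS_PER_DAY + nanos // 1000
```

Because the day boundary is always midnight, the day count and time-of-day parts never overlap and the representation round-trips without any special rounding rule.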
@davies Yeah, you're right. Verified that I was actually running the posted snippet against a local revision which I used to debug the Julian conversion logic. Forgot to rebuild after checking out the earlier master revision. Anyway, the behavior of the current master branch is consistent with Hive.