
[SPARK-10177] [SQL] fix reading Timestamp in parquet from Hive #8400

Closed
wants to merge 6 commits into from

Conversation

davies
Contributor

@davies davies commented Aug 24, 2015

We misinterpreted the Julian day and nanoseconds-of-day fields of Parquet timestamps (TimestampType) written by Hive/Impala: the two fields overlap, so they cannot simply be added together.

To avoid confusing rounding during the conversion, we use 2440588 as the Julian Day of the Unix epoch (which is actually 2440587.5).
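For reference, here is a self-contained sketch of the conversion scheme this patch arrives at. The names mirror the constants in `DateTimeUtils`, but this standalone object is illustrative only, not the actual Spark code:

```scala
object JulianSketch {
  // 2440587.5 is the true Julian Day of the Unix epoch; Hive/Impala
  // effectively round it up to 2440588, and this patch follows suit
  // so that day boundaries align at midnight.
  final val JULIAN_DAY_OF_EPOCH = 2440588L
  final val MICROS_PER_DAY = 24L * 60 * 60 * 1000 * 1000

  // (Julian day, nanoseconds of day) -> microseconds since the Unix epoch.
  def fromJulianDay(day: Int, nanos: Long): Long =
    (day - JULIAN_DAY_OF_EPOCH) * MICROS_PER_DAY + nanos / 1000L

  // Microseconds since the Unix epoch -> (Julian day, nanoseconds of day).
  def toJulianDay(micros: Long): (Int, Long) = {
    val julianUs = micros + JULIAN_DAY_OF_EPOCH * MICROS_PER_DAY
    ((julianUs / MICROS_PER_DAY).toInt, julianUs % MICROS_PER_DAY * 1000L)
  }
}
```

With this scheme the Unix epoch round-trips as (2440588, 0), matching what Hive writes.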

@SparkQA

SparkQA commented Aug 24, 2015

Test build #41466 has finished for PR 8400 at commit e96f92f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 24, 2015

Test build #41474 has finished for PR 8400 at commit 9b1b9ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 25, 2015

Test build #1688 has finished for PR 8400 at commit 9b1b9ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -37,7 +37,8 @@ object DateTimeUtils {
type SQLTimestamp = Long

// see http://stackoverflow.com/questions/466321/convert-unix-timestamp-to-julian
-final val JULIAN_DAY_OF_EPOCH = 2440587 // and .5
+// it's 2440587.5, rounding up to compatible with Hive
+final val JULIAN_DAY_OF_EPOCH = 2440588 //
Contributor

Nit: remove the trailing //?

@yhuai
Contributor

yhuai commented Aug 25, 2015

Can you explain what "they are overlapped" means?

@liancheng
Contributor

I feel like Impala and Hive did the Julian day conversion the wrong way at first, but then left it as is and let the wrong conversion logic become the de facto "standard". I don't see an intuitive reason why we should store a Julian timestamp 12 hours later than the real Julian timestamp in the first place.

But making Spark SQL behave the same as Impala/Hive makes sense, even if they are doing the conversion wrong. Otherwise it would be pretty hard to deal with legacy data, since neither Spark SQL nor Hive writes version information into generated Parquet files.

@davies
Contributor Author

davies commented Aug 25, 2015

@yhuai The Julian day of the Unix epoch (1970-01-01 00:00:00) is 2440587.5. If we store Julian days as integers, the day boundary is cut off at noon (12pm), but the nanoseconds field is counted from midnight (12am), so the two fields overlap between 12am and 12pm. Hive creates a calendar from the Julian day (which puts the hour at 12pm), then resets the hours from the nanoseconds field.

In this patch, we use a trick: shift the Julian day by 0.5 and round that 0.5 down to zero (Hive rounds it up to 1), after which the two parts can simply be added together.
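A minimal numeric illustration of the overlap (assumed arithmetic, with the half day scaled by 2 to stay in integer math):

```scala
val MICROS_PER_DAY = 24L * 60 * 60 * 1000 * 1000

// Hive/Impala write the Unix epoch (1970-01-01 00:00:00) as
// (julianDay = 2440588, timeOfDayNanos = 0).
val day = 2440588L
val nanos = 0L

// Mixing the astronomical convention (days begin at noon, so the epoch
// falls on Julian day 2440587.5) with a nanoseconds field counted from
// midnight double-counts half a day: (day - 2440587.5) * MICROS_PER_DAY
val halfDays = day * 2 - 2 * 2440587 - 1          // (day - 2440587.5) * 2
val overlapped = halfDays * (MICROS_PER_DAY / 2) + nanos / 1000
// overlapped comes out +12 hours instead of 0

// The patch instead rounds 2440587.5 up to 2440588, aligning day
// boundaries at midnight, so the two fields can simply be added:
val fixed = (day - 2440588L) * MICROS_PER_DAY + nanos / 1000
// fixed comes out 0, i.e. exactly the epoch
```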

@liancheng
Contributor

LGTM now

@SparkQA

SparkQA commented Aug 25, 2015

Test build #41520 has finished for PR 8400 at commit 05437e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

Merging to master and branch-1.5.

asfgit pushed a commit that referenced this pull request Aug 25, 2015
We misinterpreted the Julian day and nanoseconds-of-day fields of Parquet timestamps (TimestampType) written by Hive/Impala: the two fields overlap, so they cannot simply be added together.

To avoid confusing rounding during the conversion, we use `2440588` as the Julian Day of the Unix epoch (which is actually 2440587.5).

Author: Davies Liu <davies@databricks.com>
Author: Cheng Lian <lian@databricks.com>

Closes #8400 from davies/timestamp_parquet.

(cherry picked from commit 2f493f7)
Signed-off-by: Cheng Lian <lian@databricks.com>
@asfgit asfgit closed this in 2f493f7 Aug 25, 2015
@liancheng
Contributor

(EDIT: The conclusion drawn in this comment is WRONG.)

Just wanted to point out that our original DateTimeUtils didn't do the Julian date conversion right either. Here is a spark-shell snippet executed against 053d94f:

import java.sql._
import java.util._

import org.apache.hadoop.hive.ql.io.parquet.timestamp._
import org.apache.spark.sql.catalyst.util._

TimeZone.setDefault(TimeZone.getTimeZone("GMT"))
val timestamp = Timestamp.valueOf("1970-01-01 00:00:00")

val hiveNanoTime = NanoTimeUtils.getNanoTime(timestamp, false)
val hiveJulianDay = hiveNanoTime.getJulianDay
val hiveTimeOfDayNanos = hiveNanoTime.getTimeOfDayNanos

println(
  s"""Hive converts "$timestamp" to Julian timestamp:
     |(julianDay=$hiveJulianDay, timeOfDayNanos=$hiveTimeOfDayNanos)
   """.stripMargin)

val (sparkJulianDay, sparkTimeOfDayNanos) =
  DateTimeUtils.toJulianDay(DateTimeUtils.fromJavaTimestamp(timestamp))
println(
  s"""Spark converts "$timestamp" to Julian timestamp:
     |(julianDay=$sparkJulianDay, timeOfDayNanos=$sparkTimeOfDayNanos)
   """.stripMargin)

The result is:

Hive converts "1970-01-01 00:00:00.0" to Julian timestamp:
(julianDay=2440588, timeOfDayNanos=0)

Spark converts "1970-01-01 00:00:00.0" to Julian timestamp:
(julianDay=2440588, timeOfDayNanos=43200000000000)

while the correct answer should be:

(julianDay=2440587, timeOfDayNanos=43200000000000)

So Hive is 12 hours later than the expected Julian timestamp, while we were 24 hours later :)

@liancheng
Contributor

Asked for more background about Hive's 12 hr offset on HIVE-6394, which is the original JIRA ticket for introducing timestamp support for ParquetHiveSerDe.

@davies
Contributor Author

davies commented Aug 25, 2015

@liancheng Thanks for the comments, but you may have made a mistake somewhere; I get this WITHOUT this PR:

scala> TimeZone.setDefault(TimeZone.getTimeZone("GMT"))
scala> val timestamp = Timestamp.valueOf("1970-01-01 00:00:00")
timestamp: java.sql.Timestamp = 1970-01-01 00:00:00.0

scala> val (sparkJulianDay, sparkTimeOfDayNanos) =
     |   DateTimeUtils.toJulianDay(DateTimeUtils.fromJavaTimestamp(timestamp))
sparkJulianDay: Int = 2440587
sparkTimeOfDayNanos: Long = 43200000000000

scala> println(
     |   s"""Spark converts "$timestamp" to Julian timestamp:
     |      |(julianDay=$sparkJulianDay, timeOfDayNanos=$sparkTimeOfDayNanos)
     |    """.stripMargin)
Spark converts "1970-01-01 00:00:00.0" to Julian timestamp:
(julianDay=2440587, timeOfDayNanos=43200000000000)

@davies
Contributor Author

davies commented Aug 25, 2015

The result you posted for Spark is definitely wrong (with it, you couldn't get the timestamp back after a round trip), so fortunately this is a false alarm :)

The Hive one is bad but not wrong (it uses a different rounding method, and you can still read the timestamp back).

In general, using Julian days here may be the root of the trouble; a signed integer counting days since the epoch (which could be negative) should be enough.
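For what it's worth, the days-since-epoch alternative suggested above could look like this (a hypothetical encoding, not anything Spark or Parquet actually uses):

```scala
val MICROS_PER_DAY = 24L * 60 * 60 * 1000 * 1000

// Signed days since 1970-01-01 (negative before the epoch) plus
// nanoseconds within that day -- no Julian-day offset to get wrong.
def encode(micros: Long): (Long, Long) = {
  val day = Math.floorDiv(micros, MICROS_PER_DAY)   // floors toward -infinity
  val nanos = Math.floorMod(micros, MICROS_PER_DAY) * 1000L
  (day, nanos)
}

def decode(day: Long, nanos: Long): Long =
  day * MICROS_PER_DAY + nanos / 1000L
```

Using `floorDiv`/`floorMod` (rather than `/` and `%`) keeps the nanoseconds field non-negative for pre-epoch timestamps, so the round trip works on both sides of 1970.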

@liancheng
Contributor

@davies Yeah, you're right. I verified that I was actually running the posted snippet against a local revision I had used to debug the Julian conversion logic, and forgot to rebuild after checking the code back out to the earlier master revision.

Anyway, the behavior of the current master branch is consistent with Hive.
