[SPARK-27242][SQL] Make formatting TIMESTAMP/DATE literals independent from the default time zone

## What changes were proposed in this pull request?

In the PR, I propose to use the SQL config `spark.sql.session.timeZone` in formatting `TIMESTAMP` literals, and to make formatting of `DATE` literals independent of the time zone. The changes make parsing and formatting of `TIMESTAMP`/`DATE` literals consistent with each other, and independent of the default time zone of the current JVM.

This PR also ports `TIMESTAMP`/`DATE` literal formatting to the Proleptic Gregorian calendar by using `TimestampFormatter`/`DateFormatter`.
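
To make the intended behavior concrete, here is a minimal sketch distilled from the new tests below (not part of the commit; it assumes Spark's internal `Literal`/`SQLConf` APIs and the Java 8 datetime API flag):

```scala
import java.time.{LocalDate, LocalDateTime, ZoneOffset}

import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.internal.SQLConf

// Pin the session time zone and enable java.time-based external types.
SQLConf.get.setConfString(SQLConf.SESSION_LOCAL_TIMEZONE.key, "GMT+01:00")
SQLConf.get.setConfString(SQLConf.DATETIME_JAVA8API_ENABLED.key, "true")

// A TIMESTAMP literal is rendered in the session time zone:
// 2019-03-21 00:02:03.456 UTC becomes 01:02:03.456 in GMT+01:00.
val ts = LocalDateTime.of(2019, 3, 21, 0, 2, 3, 456000000)
  .atZone(ZoneOffset.UTC)
  .toInstant
assert(Literal.create(ts).sql == "TIMESTAMP('2019-03-21 01:02:03.456')")

// A DATE literal renders the same way regardless of any time zone setting.
assert(Literal.create(LocalDate.of(2019, 3, 21)).sql == "DATE '2019-03-21'")
```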

## How was this patch tested?

Added new tests to `LiteralExpressionSuite`

Closes #24181 from MaxGekk/timezone-aware-literals.

Authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
MaxGekk authored and cloud-fan committed Mar 26, 2019
1 parent 05168e7 commit 6903568
Showing 7 changed files with 60 additions and 11 deletions.
4 changes: 4 additions & 0 deletions docs/sql-migration-guide-upgrade.md
@@ -96,13 +96,17 @@ displayTitle: Spark SQL Upgrading Guide
 - The `weekofyear`, `weekday`, `dayofweek`, `date_trunc`, `from_utc_timestamp`, `to_utc_timestamp`, and `unix_timestamp` functions use the java.time API for calculating the week number of the year and the day number of the week, as well as for conversion from/to TimestampType values in the UTC time zone.

 - The JDBC options `lowerBound` and `upperBound` are converted to TimestampType/DateType values in the same way as casting strings to TimestampType/DateType values. The conversion is based on the Proleptic Gregorian calendar and the time zone defined by the SQL config `spark.sql.session.timeZone`. In Spark version 2.4 and earlier, the conversion is based on the hybrid calendar (Julian + Gregorian) and on the default system time zone.

+- Formatting of `TIMESTAMP` and `DATE` literals.
+
 - In Spark version 2.4 and earlier, invalid time zone ids are silently ignored and replaced by the GMT time zone, for example, in the `from_utc_timestamp` function. Since Spark 3.0, such time zone ids are rejected, and Spark throws `java.time.DateTimeException`.

 - In Spark version 2.4 and earlier, the `current_timestamp` function returns a timestamp with millisecond resolution only. Since Spark 3.0, the function can return the result with microsecond resolution if the underlying clock available on the system offers such resolution.

 - In Spark version 2.4 and earlier, when reading a Hive Serde table with Spark native data sources (parquet/orc), Spark infers the actual file schema and updates the table schema in the metastore. Since Spark 3.0, Spark doesn't infer the schema anymore. This should not cause any problems for end users, but if it does, set `spark.sql.hive.caseSensitiveInferenceMode` to `INFER_AND_SAVE`.

+- Since Spark 3.0, `TIMESTAMP` literals are converted to strings using the SQL config `spark.sql.session.timeZone`, and `DATE` literals are formatted using the UTC time zone. In Spark version 2.4 and earlier, both conversions use the default time zone of the Java virtual machine.
+
 ## Upgrading From Spark SQL 2.3 to 2.4

 - In Spark version 2.3 and earlier, the second parameter of the `array_contains` function is implicitly promoted to the element type of the first, array-type parameter. This type promotion can be lossy and may cause `array_contains` to return a wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some changes in behavior, which are illustrated in the table below.
Expand Down
@@ -41,8 +41,9 @@ import org.json4s.JsonAST._
 import org.apache.spark.sql.AnalysisException
 import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow, ScalaReflection}
 import org.apache.spark.sql.catalyst.expressions.codegen._
-import org.apache.spark.sql.catalyst.util.{ArrayData, DateTimeUtils, MapData}
+import org.apache.spark.sql.catalyst.util._
 import org.apache.spark.sql.catalyst.util.DateTimeUtils.instantToMicros
+import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types._
 import org.apache.spark.unsafe.types._
 import org.apache.spark.util.Utils
@@ -370,8 +371,11 @@ case class Literal (value: Any, dataType: DataType) extends LeafExpression {
         case _ => v + "D"
       }
     case (v: Decimal, t: DecimalType) => v + "BD"
-    case (v: Int, DateType) => s"DATE '${DateTimeUtils.toJavaDate(v)}'"
-    case (v: Long, TimestampType) => s"TIMESTAMP('${DateTimeUtils.toJavaTimestamp(v)}')"
+    case (v: Int, DateType) => s"DATE '${DateFormatter().format(v)}'"
+    case (v: Long, TimestampType) =>
+      val formatter = TimestampFormatter.getFractionFormatter(
+        DateTimeUtils.getZoneId(SQLConf.get.sessionLocalTimeZone))
+      s"TIMESTAMP('${formatter.format(v)}')"
     case (v: Array[Byte], BinaryType) => s"X'${DatatypeConverter.printHexBinary(v)}'"
     case _ => value.toString
   }
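
For reference, a hedged sketch of exercising the two new branches directly with the internal formatter APIs from the hunk above (values chosen to match the new tests; not part of the commit):

```scala
import java.time.{LocalDate, LocalDateTime, ZoneOffset}

import org.apache.spark.sql.catalyst.util.{DateFormatter, DateTimeUtils, TimestampFormatter}
import org.apache.spark.sql.internal.SQLConf

// TIMESTAMP values are microseconds since the epoch; the fraction formatter
// renders them in the session time zone.
val micros = DateTimeUtils.instantToMicros(
  LocalDateTime.of(2019, 3, 21, 0, 2, 3, 456000000).atZone(ZoneOffset.UTC).toInstant)
val formatter = TimestampFormatter.getFractionFormatter(
  DateTimeUtils.getZoneId(SQLConf.get.sessionLocalTimeZone))
println(s"TIMESTAMP('${formatter.format(micros)}')")

// DATE values are days since the epoch; DateFormatter() takes no time zone,
// which is exactly what makes DATE literals zone-independent.
val days = LocalDate.of(2019, 3, 21).toEpochDay.toInt
println(s"DATE '${DateFormatter().format(days)}'") // DATE '2019-03-21'
```

The asymmetry is deliberate: a date has no instant semantics, so only `TIMESTAMP` formatting consults `spark.sql.session.timeZone`.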
@@ -18,9 +18,10 @@
 package org.apache.spark.sql.catalyst.expressions

 import java.nio.charset.StandardCharsets
-import java.time.{Instant, LocalDate}
+import java.time.{Instant, LocalDate, LocalDateTime, ZoneOffset}
+import java.util.TimeZone

-import scala.reflect.runtime.universe.{typeTag, TypeTag}
+import scala.reflect.runtime.universe.TypeTag

 import org.apache.spark.SparkFunSuite
 import org.apache.spark.sql.Row
@@ -279,4 +280,40 @@ class LiteralExpressionSuite extends SparkFunSuite with ExpressionEvalHelper {
       checkEvaluation(Literal(Array(instant0, instant1)), Array(instant0, instant1))
     }
   }
+
+  private def withTimeZones(
+      sessionTimeZone: String,
+      systemTimeZone: String)(f: => Unit): Unit = {
+    withSQLConf(
+      SQLConf.SESSION_LOCAL_TIMEZONE.key -> sessionTimeZone,
+      SQLConf.DATETIME_JAVA8API_ENABLED.key -> "true") {
+      val originTimeZone = TimeZone.getDefault
+      try {
+        TimeZone.setDefault(TimeZone.getTimeZone(systemTimeZone))
+        f
+      } finally {
+        TimeZone.setDefault(originTimeZone)
+      }
+    }
+  }
+
+  test("format timestamp literal using spark.sql.session.timeZone") {
+    withTimeZones(sessionTimeZone = "GMT+01:00", systemTimeZone = "GMT-08:00") {
+      val timestamp = LocalDateTime.of(2019, 3, 21, 0, 2, 3, 456000000)
+        .atZone(ZoneOffset.UTC)
+        .toInstant
+      val expected = "TIMESTAMP('2019-03-21 01:02:03.456')"
+      val literalStr = Literal.create(timestamp).sql
+      assert(literalStr === expected)
+    }
+  }
+
+  test("format date literal independently from time zone") {
+    withTimeZones(sessionTimeZone = "GMT-11:00", systemTimeZone = "GMT-10:00") {
+      val date = LocalDate.of(2019, 3, 21)
+      val expected = "DATE '2019-03-21'"
+      val literalStr = Literal.create(date).sql
+      assert(literalStr === expected)
+    }
+  }
 }
@@ -92,7 +92,7 @@ select
 array_contains(timestamp_array, timestamp '2016-11-15 20:54:00.000'), array_contains(timestamp_array, timestamp '2016-01-01 20:54:00.000')
 from primitive_arrays
 -- !query 6 schema
-struct<array_contains(boolean_array, true):boolean,array_contains(boolean_array, false):boolean,array_contains(tinyint_array, 2):boolean,array_contains(tinyint_array, 0):boolean,array_contains(smallint_array, 2):boolean,array_contains(smallint_array, 0):boolean,array_contains(int_array, 2):boolean,array_contains(int_array, 0):boolean,array_contains(bigint_array, 2):boolean,array_contains(bigint_array, 0):boolean,array_contains(decimal_array, 9223372036854775809):boolean,array_contains(decimal_array, CAST(1 AS DECIMAL(19,0))):boolean,array_contains(double_array, 2.0):boolean,array_contains(double_array, 0.0):boolean,array_contains(float_array, CAST(2.0 AS FLOAT)):boolean,array_contains(float_array, CAST(0.0 AS FLOAT)):boolean,array_contains(date_array, DATE '2016-03-14'):boolean,array_contains(date_array, DATE '2016-01-01'):boolean,array_contains(timestamp_array, TIMESTAMP('2016-11-15 20:54:00.0')):boolean,array_contains(timestamp_array, TIMESTAMP('2016-01-01 20:54:00.0')):boolean>
+struct<array_contains(boolean_array, true):boolean,array_contains(boolean_array, false):boolean,array_contains(tinyint_array, 2):boolean,array_contains(tinyint_array, 0):boolean,array_contains(smallint_array, 2):boolean,array_contains(smallint_array, 0):boolean,array_contains(int_array, 2):boolean,array_contains(int_array, 0):boolean,array_contains(bigint_array, 2):boolean,array_contains(bigint_array, 0):boolean,array_contains(decimal_array, 9223372036854775809):boolean,array_contains(decimal_array, CAST(1 AS DECIMAL(19,0))):boolean,array_contains(double_array, 2.0):boolean,array_contains(double_array, 0.0):boolean,array_contains(float_array, CAST(2.0 AS FLOAT)):boolean,array_contains(float_array, CAST(0.0 AS FLOAT)):boolean,array_contains(date_array, DATE '2016-03-14'):boolean,array_contains(date_array, DATE '2016-01-01'):boolean,array_contains(timestamp_array, TIMESTAMP('2016-11-15 20:54:00')):boolean,array_contains(timestamp_array, TIMESTAMP('2016-01-01 20:54:00')):boolean>
 -- !query 6 output
 true false true false true false true false true false true false true false true false true false true false

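
The golden-file changes here and in the files below all come down to the trailing `.0`: the old code rendered literals through `java.sql.Timestamp.toString`, which always prints at least one fractional digit, while the new fraction formatter omits a zero-valued fraction entirely. A plain-JDK illustration of the old behavior (not Spark code):

```scala
import java.sql.Timestamp

// Timestamp.toString always emits at least ".0" for the nanos field, which is
// where the old expected schema strings got their trailing ".0".
println(Timestamp.valueOf("2016-11-15 20:54:00").toString)
// 2016-11-15 20:54:00.0
```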
@@ -301,7 +301,7 @@ select date 'mar 11 2016'
 -- !query 32
 select tImEstAmp '2016-03-11 20:54:00.000'
 -- !query 32 schema
-struct<TIMESTAMP('2016-03-11 20:54:00.0'):timestamp>
+struct<TIMESTAMP('2016-03-11 20:54:00'):timestamp>
 -- !query 32 output
 2016-03-11 20:54:00

@@ -77,7 +77,7 @@ struct<array_join(array(DATE '2016-03-14', DATE '2016-03-13'), , ):string>
 -- !query 9
 SELECT array_join(array(timestamp '2016-11-15 20:54:00.000', timestamp '2016-11-12 20:54:00.000'), ', ')
 -- !query 9 schema
-struct<array_join(array(TIMESTAMP('2016-11-15 20:54:00.0'), TIMESTAMP('2016-11-12 20:54:00.0')), , ):string>
+struct<array_join(array(TIMESTAMP('2016-11-15 20:54:00'), TIMESTAMP('2016-11-12 20:54:00')), , ):string>
 -- !query 9 output
 2016-11-15 20:54:00, 2016-11-12 20:54:00

@@ -17,12 +17,14 @@

 package org.apache.spark.sql.catalyst

-import java.sql.Timestamp
+import java.time.LocalDateTime

 import org.apache.spark.sql.QueryTest
 import org.apache.spark.sql.catalyst.dsl.expressions._
 import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
 import org.apache.spark.sql.hive.test.TestHiveSingleton
+import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.unsafe.types.CalendarInterval

 class ExpressionSQLBuilderSuite extends QueryTest with TestHiveSingleton {
@@ -60,8 +62,10 @@ class ExpressionSQLBuilderSuite extends QueryTest with TestHiveSingleton {
     checkSQL(Literal(Double.NaN), "CAST('NaN' AS DOUBLE)")
     checkSQL(Literal(BigDecimal("10.0000000").underlying), "10.0000000BD")
     checkSQL(Literal(Array(0x01, 0xA3).map(_.toByte)), "X'01A3'")
-    checkSQL(
-      Literal(Timestamp.valueOf("2016-01-01 00:00:00")), "TIMESTAMP('2016-01-01 00:00:00.0')")
+    val timestamp = LocalDateTime.of(2016, 1, 1, 0, 0, 0)
+      .atZone(DateTimeUtils.getZoneId(SQLConf.get.sessionLocalTimeZone))
+      .toInstant
+    checkSQL(Literal(timestamp), "TIMESTAMP('2016-01-01 00:00:00')")
     // TODO tests for decimals
   }

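
Why the rewritten expectation above is stable across environments: the instant is built from the session time zone and the formatter renders it back through that same zone, so the wall-clock fields round-trip unchanged. A hedged sketch of that round-trip, using the same internal APIs as the diff (not part of the commit):

```scala
import java.time.LocalDateTime

import org.apache.spark.sql.catalyst.util.{DateTimeUtils, TimestampFormatter}
import org.apache.spark.sql.internal.SQLConf

// Build an instant from local fields in the session zone, then format it back
// through the same zone: the rendered string matches the local fields no
// matter which zone the session (or the JVM) uses.
val zoneId = DateTimeUtils.getZoneId(SQLConf.get.sessionLocalTimeZone)
val instant = LocalDateTime.of(2016, 1, 1, 0, 0, 0).atZone(zoneId).toInstant
val formatted = TimestampFormatter.getFractionFormatter(zoneId)
  .format(DateTimeUtils.instantToMicros(instant))
assert(formatted == "2016-01-01 00:00:00") // holds for any session time zone
```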
