[SPARK-30808][SQL] Enable Java 8 time API in Thrift server #27552
@cloud-fan I tried to set the SQL config before the collect action in spark/sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala Lines 54 to 59 in aa0d136
Test build #118319 has finished for PR 27552 at commit
Test build #118347 has finished for PR 27552 at commit
It's probably because the
Test build #118352 has finished for PR 27552 at commit
@cloud-fan Is it possible to create/restore a Dataset from an executedPlan?
@MaxGekk it's possible from a logical plan, e.g.
…server-java8-time-api
# Conflicts:
# sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala
# sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLDriver.scala
# sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveComparisonTest.scala
Test build #118509 has finished for PR 27552 at commit
Test build #118508 has finished for PR 27552 at commit
Test build #118510 has finished for PR 27552 at commit
jenkins, retest this, please
Test build #118511 has finished for PR 27552 at commit
Test build #118549 has finished for PR 27552 at commit
Test build #118552 has finished for PR 27552 at commit
So many HiveComparisonTest-related tests failed that I will revert cdb322d.
This reverts commit cdb322d.
Test build #118569 has finished for PR 27552 at commit
@cloud-fan Something wrong is going on here. Commands issued from HiveComparisonTest are executed twice, it seems.
@MaxGekk yea, creating the df again may execute the command again. Let's keep the lazy val.
I would prefer to cut off this PR at this point #27552 (comment), and implement moving settings of
Making the dataset a lazy val doesn't help me, so I am stuck for now.
While debugging the
val result: Seq[Seq[Any]] = Dataset.ofRows(ds.sparkSession, ds.queryExecution.logical)
  .queryExecution
  .executedPlan
  .executeCollectPublic().map(_.toSeq).toSeq
This causes side effects here:
@MaxGekk I see the problem now. We should use
Test build #118579 has finished for PR 27552 at commit
thanks, merging to master/3.0!
### What changes were proposed in this pull request?
- Set `spark.sql.datetime.java8API.enabled` to `true` in `hiveResultString()`, and restore it at the end of the call.
- Convert collected `java.time.Instant` and `java.time.LocalDate` values to `java.sql.Timestamp` and `java.sql.Date` for correct formatting.

### Why are the changes needed?
Because the textual representation of timestamps/dates before the year 1582 is incorrect:
```shell
$ export TZ="America/Los_Angeles"
$ ./bin/spark-sql -S
```
```sql
spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone America/Los_Angeles
spark-sql> SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20');
1001-01-01 00:07:02
```
It must be 1001-01-01 00:**00:00**.

### Does this PR introduce any user-facing change?
Yes. After the changes:
```shell
$ export TZ="America/Los_Angeles"
$ ./bin/spark-sql -S
```
```sql
spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone America/Los_Angeles
spark-sql> SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20');
1001-01-01 00:00:00
```

### How was this patch tested?
By running hive-thriftserver tests. In particular:
```
./build/sbt -Phadoop-2.7 -Phive-2.3 -Phive-thriftserver "hive-thriftserver/test:testOnly *SparkThriftServerProtocolVersionsSuite"
```

Closes #27552 from MaxGekk/hive-thriftserver-java8-time-api.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit afaeb29)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
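For readers unfamiliar with why 1582 matters: the legacy `java.util`/`java.sql` date types default to a hybrid Julian/Gregorian calendar with a cutover in October 1582, while `java.time` uses the proleptic Gregorian calendar throughout. The sketch below is not Spark code and does not reproduce the exact `00:07:02` shift shown in the commit message (the thread does not spell out that mechanism). It only illustrates, with JDK classes, how the same year/month/day fields can map to different instants under the two calendars. The object and method names (`CalendarGap`, `gapDays`, and so on) are hypothetical.

```scala
import java.time.{LocalDate, ZoneOffset}
import java.util.{GregorianCalendar, TimeZone}

// Hypothetical helper, for illustration only.
object CalendarGap {
  // Epoch millis of midnight UTC for the given fields, interpreted in the
  // proleptic Gregorian calendar used by java.time.
  def prolepticMillis(y: Int, m: Int, d: Int): Long =
    LocalDate.of(y, m, d).atStartOfDay(ZoneOffset.UTC).toInstant.toEpochMilli

  // Epoch millis of midnight UTC for the same fields, interpreted in the
  // hybrid Julian/Gregorian calendar that GregorianCalendar uses by default.
  def hybridMillis(y: Int, m: Int, d: Int): Long = {
    val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
    cal.clear()
    cal.set(y, m - 1, d, 0, 0, 0) // Calendar months are zero-based
    cal.getTimeInMillis
  }

  // Whole days by which the hybrid interpretation is ahead of the proleptic one.
  def gapDays(y: Int, m: Int, d: Int): Long =
    (hybridMillis(y, m, d) - prolepticMillis(y, m, d)) / 86400000L
}
```

Calling `CalendarGap.gapDays(1970, 3, 20)` gives zero because both calendars agree on modern dates, whereas `CalendarGap.gapDays(1001, 1, 1)` gives a gap of several days. This is the kind of disagreement that makes formatting pre-1582 values through the legacy types risky.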
sessionWithJava8DatetimeEnabled.withActive {
  // We cannot collect the original dataset because its encoders could be created
  // with disabled Java 8 date-time API.
  val result: Seq[Seq[Any]] = Dataset.ofRows(ds.sparkSession, ds.logicalPlan)
found a problem. `Dataset.ofRows` will set the input session as active, so we should write `Dataset.ofRows(sessionWithJava8DatetimeEnabled, ...` and remove the outer `sessionWithJava8DatetimeEnabled.withActive`.
@@ -512,7 +511,7 @@ class SQLQueryTestSuite extends QueryTest with SharedSparkSession {
 val schema = df.schema.catalogString
 // Get answer, but also get rid of the #1234 expression ids that show up in explain plans
 val answer = SQLExecution.withNewExecutionId(df.queryExecution, Some(sql)) {
-hiveResultString(df.queryExecution.executedPlan).map(replaceNotIncludedMsg)
+hiveResultString(df).map(replaceNotIncludedMsg)
in `ThriftServerQueryTestSuite`, we get the result by JDBC, so there is no DataFrame created.
We should follow pgsql and return java 8 datetime when the config is enabled. https://jdbc.postgresql.org/documentation/head/8-date-time.html
// Convert date-time instances to types that are acceptable by Hive libs
// used in conversions to strings.
val resultRow = row.map {
  case i: Instant => Timestamp.from(i)
There seem to be no Java 8 datetime values added to the row buffer here by `SparkExecuteStatementOperation#addNonNullColumnValue`:
https://github.com/apache/spark/pull/27552/files#diff-72dcd8f81a51c8a815159fdf0332acdcR84-R116
Can you help fix it? I think we should output Java 8 datetime values if the config is enabled.
We are limited by the `hive-jdbc` module, see https://github.com/apache/hive/blob/a7e704c679a00db68db9b9f921d133d79a32cfcc/jdbc/src/java/org/apache/hive/jdbc/HiveBaseResultSet.java#L427-L457. We might need our own JDBC driver implementation to achieve this.
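The row conversion under review in this thread can be sketched in isolation as follows. Only the `Instant` case appears in the hunk above. The `LocalDate` case is an assumption based on the commit message ("Convert collected `java.time.Instant` & `java.time.LocalDate` to `java.sql.Timestamp` and `java.sql.Date`"), and the names `LegacyDateTime` and `toLegacy` are hypothetical. This is a minimal illustration using only JDK classes, not the code that was merged.

```scala
import java.sql.{Date, Timestamp}
import java.time.{Instant, LocalDate}

// Hypothetical helper, for illustration only.
object LegacyDateTime {
  // Replace java.time values with their java.sql counterparts and
  // pass every other value through unchanged.
  def toLegacy(row: Seq[Any]): Seq[Any] = row.map {
    case i: Instant => Timestamp.from(i)     // shown in the reviewed hunk
    case ld: LocalDate => Date.valueOf(ld)   // assumed from the commit message
    case other => other
  }
}
```

A pattern match with a catch-all case keeps the conversion total, so rows containing strings, numbers, or nulls are left untouched.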
case _ =>
  val sessionWithJava8DatetimeEnabled = {
    val cloned = ds.sparkSession.cloneSession()
    cloned.conf.set(SQLConf.DATETIME_JAVA8API_ENABLED.key, true)
why is this always `true`?
the old `Date`/`Timestamp` doesn't follow the new calendar and may produce wrong string for some date/timestamp values.
oh wait, we format `Date`/`Timestamp` by our own formatter, so this should be no problem.
-- !query
set spark.sql.datetime.java8API.enabled
-- !query schema
struct<key:string,value:string>
-- !query output
spark.sql.datetime.java8API.enabled false
-- !query
set set spark.sql.session.timeZone=America/Los_Angeles
-- !query schema
struct<key:string,value:string>
-- !query output
set spark.sql.session.timeZone America/Los_Angeles
-- !query
SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20')
-- !query schema
struct<date_trunc(MILLENNIUM, CAST(DATE '1970-03-20' AS TIMESTAMP)):timestamp>
-- !query output
1001-01-01 00:00:00
I removed this line and ran `SQLQueryTestSuite` with the cases above; the results are the same. Or does this problem only exist for the `spark-sql` script?
> Or does this problem only exists for spark-sql script?
Only when thrift-server is involved in the loop.
I also pass these tests through `ThriftServerQueryTestSuite`
spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone America/Los_Angeles
spark-sql> SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20');
1001-01-01 00:00:00
spark-sql> select version();
3.1.0 b3dcb63a682bc31827a86cf381f157a81e9e07ac
Also correct with bin/spark-sql
Yea I tested it too and looks fine. Maybe some refactor of how to format old `Date`/`Timestamp` fixes it already.
@yaooqinn can you send a PR to revert it? Let's see if all tests pass.
OK
This reverts commit afaeb29.

### What changes were proposed in this pull request?
Based on the result and comment from #27552 (comment).

In the hive module, the server side provides datetime values simply by calling `value.toString`, and the client side regenerates the results in `HiveBaseResultSet` with `java.sql.Date(Timestamp).valueOf`. There will be an inconsistency between client and server if we use the Java 8 APIs.

### Why are the changes needed?
The change is still unclear.

### Does this PR introduce any user-facing change?
no

### How was this patch tested?
Nah

Closes #27733 from yaooqinn/SPARK-30808.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 1fac06c)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
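One way to picture the client/server mismatch described in the revert message: if a server renders values with `toString` and a client rebuilds them with `java.sql.Timestamp.valueOf` or `java.sql.Date.valueOf`, the string produced by a `java.time.Instant` (ISO-8601 with a `T` separator and a `Z` suffix) is not in the format that `Timestamp.valueOf` accepts. The sketch below only exercises JDK classes. It is an assumption about the kind of inconsistency meant, not a reproduction of the Hive JDBC code path, and `TextRoundTrip` and its methods are hypothetical names.

```scala
import java.sql.{Date, Timestamp}
import scala.util.Try

// Hypothetical helper, for illustration only.
object TextRoundTrip {
  // Mimics a client that rebuilds a timestamp from server-provided text.
  def parseAsTimestamp(text: String): Try[Timestamp] = Try(Timestamp.valueOf(text))

  // Mimics a client that rebuilds a date from server-provided text.
  def parseAsDate(text: String): Try[Date] = Try(Date.valueOf(text))
}
```

With this helper, `TextRoundTrip.parseAsTimestamp(java.time.Instant.parse("2020-02-14T10:00:00Z").toString)` is a `Failure`, while the string form of a `java.sql.Timestamp` parses successfully. `java.time.LocalDate.toString` happens to produce `yyyy-MM-dd`, which `Date.valueOf` accepts.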