
[SPARK-30808][SQL] Enable Java 8 time API in Thrift server #27552

Closed
wants to merge 19 commits into from

Conversation

MaxGekk
Member

@MaxGekk MaxGekk commented Feb 12, 2020

What changes were proposed in this pull request?

  • Set spark.sql.datetime.java8API.enabled to true in hiveResultString(), and restore it back at the end of the call.
  • Convert collected java.time.Instant & java.time.LocalDate to java.sql.Timestamp and java.sql.Date for correct formatting.
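The second bullet can be sketched as a small standalone helper (a hypothetical illustration of the conversion described above, not the PR's exact code; `toHiveCompatible` is an invented name):

```scala
import java.sql.{Date, Timestamp}
import java.time.{Instant, LocalDate}

// Map Java 8 date-time values to the java.sql types that the Hive
// string-formatting code expects; leave everything else untouched.
def toHiveCompatible(value: Any): Any = value match {
  case i: Instant   => Timestamp.from(i)
  case d: LocalDate => Date.valueOf(d)
  case other        => other
}
```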

Why are the changes needed?

Because the textual representation of timestamps/dates before the year 1582 is incorrect:

$ export TZ="America/Los_Angeles"
$ ./bin/spark-sql -S
spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone	America/Los_Angeles
spark-sql> SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20');
1001-01-01 00:07:02

It must be 1001-01-01 00:00:00.

Does this PR introduce any user-facing change?

Yes. After the changes:

$ export TZ="America/Los_Angeles"
$ ./bin/spark-sql -S
spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone	America/Los_Angeles
spark-sql> SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20');
1001-01-01 00:00:00

How was this patch tested?

By running hive-thriftserver tests. In particular:

./build/sbt -Phadoop-2.7 -Phive-2.3 -Phive-thriftserver "hive-thriftserver/test:testOnly *SparkThriftServerProtocolVersionsSuite"

@MaxGekk
Member Author

MaxGekk commented Feb 12, 2020

@cloud-fan I tried to set the SQL config before the collect action in

val result: Seq[Seq[Any]] = other.executeCollectPublic().map(_.toSeq).toSeq
// We need the types so we can output struct field names
val types = executedPlan.output.map(_.dataType)
// Reformat to match hive tab delimited output.
result.map(_.zip(types).map(e => toHiveString(e)))
.map(_.mkString("\t"))
but it doesn't work - the SQL config didn't change the behavior :-(

@SparkQA

SparkQA commented Feb 12, 2020

Test build #118319 has finished for PR 27552 at commit 916838a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk MaxGekk changed the title [WIP] Enable Java 8 time API in Thrift server [SPARK-30808][SQL] Enable Java 8 time API in Thrift server Feb 13, 2020
@SparkQA

SparkQA commented Feb 13, 2020

Test build #118347 has finished for PR 27552 at commit d98cdbc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

but it doesn't work - the SQL config didn't change the behavior

It's probably because the Dataset is already there and its encoder has already been created, which decides to return the old timestamp/date types. We can create a new Dataset to fix it, e.g. df.filter(lit("true")).collect

@SparkQA

SparkQA commented Feb 13, 2020

Test build #118352 has finished for PR 27552 at commit 580dd09.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Feb 13, 2020

We can create a new Dataset to fix it ...

@cloud-fan Is it possible to create/restore a Dataset from an executedPlan?

@cloud-fan
Contributor

@MaxGekk it's possible from a logical plan, e.g. Dataset.ofRows(session, df.logicalPlan)

…server-java8-time-api

# Conflicts:
#	sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala
#	sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLDriver.scala
#	sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveComparisonTest.scala
@SparkQA

SparkQA commented Feb 16, 2020

Test build #118509 has finished for PR 27552 at commit 804e709.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 16, 2020

Test build #118508 has finished for PR 27552 at commit deaef58.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 16, 2020

Test build #118510 has finished for PR 27552 at commit 80f4c89.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Feb 16, 2020

jenkins, retest this, please

@SparkQA

SparkQA commented Feb 17, 2020

Test build #118511 has finished for PR 27552 at commit 80f4c89.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 17, 2020

Test build #118549 has finished for PR 27552 at commit ce77628.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 17, 2020

Test build #118552 has finished for PR 27552 at commit f9112b7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Feb 17, 2020

So many HiveComparisonTest related tests failed, I will revert this cdb322d

@SparkQA

SparkQA commented Feb 17, 2020

Test build #118569 has finished for PR 27552 at commit 2129f30.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Feb 17, 2020

@cloud-fan Something is going wrong here. Commands issued from HiveComparisonTest are executed twice, it seems.

@cloud-fan
Contributor

@MaxGekk yea creating the df again may execute the command again. Let's keep the lazy val.

@MaxGekk
Member Author

MaxGekk commented Feb 17, 2020

I would prefer to cut off this PR at this point #27552 (comment), and implement moving the setting of spark.sql.datetime.java8API.enabled into hiveResult() separately.

Making the dataset a lazy val doesn't help, so I am stuck for now.

@MaxGekk
Member Author

MaxGekk commented Feb 17, 2020

While debugging the test ambiguousReferences resolved as hive test from org.apache.spark.sql.hive.execution.HiveResolutionSuite, I see that CreateDataSourceTableCommand is invoked twice for the same table name t1:

  1. In lazy val dataset = Dataset.ofRows(sparkSession, logical) inside of TestHiveQueryExecution
  2. In hiveResultString():
val result: Seq[Seq[Any]] = Dataset.ofRows(ds.sparkSession, ds.queryExecution.logical)
  .queryExecution
  .executedPlan
  .executeCollectPublic().map(_.toSeq).toSeq

This causes side effects here:

LocalRelation(c.output, withAction("command", queryExecution)(_.executeCollect()))
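The double execution can be modeled without Spark: if the "command" result is materialized once (as in the `LocalRelation` case), reusing it is safe, but rebuilding from the raw plan runs the side effect again (a toy sketch with invented names, not Spark's actual classes):

```scala
// Toy model: a side-effecting "command" and a materialized result.
var executions = 0
def runCommand(): Seq[Int] = { executions += 1; Seq(1, 2, 3) }

// Analogous to the LocalRelation holding the already-collected command result.
lazy val materialized: Seq[Int] = runCommand()

materialized   // first access: executes the command once
materialized   // cached: no re-execution
runCommand()   // rebuilding from the original plan re-executes the side effect
```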

@cloud-fan
Contributor

cloud-fan commented Feb 17, 2020

@MaxGekk I see the problem now. We should use ds.logicalPlan instead of ds.queryExecution.logical, as ds.logicalPlan can be a LocalRelation that holds the command result.

@SparkQA

SparkQA commented Feb 17, 2020

Test build #118579 has finished for PR 27552 at commit 880b1de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan closed this in afaeb29 Feb 17, 2020
@cloud-fan
Contributor

thanks, merging to master/3.0!

cloud-fan pushed a commit that referenced this pull request Feb 17, 2020
### What changes were proposed in this pull request?
- Set `spark.sql.datetime.java8API.enabled` to `true` in `hiveResultString()`, and restore it back at the end of the call.
- Convert collected `java.time.Instant` & `java.time.LocalDate` to `java.sql.Timestamp` and `java.sql.Date` for correct formatting.

### Why are the changes needed?
Because the textual representation of timestamps/dates before the year 1582 is incorrect:
```shell
$ export TZ="America/Los_Angeles"
$ ./bin/spark-sql -S
```
```sql
spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone	America/Los_Angeles
spark-sql> SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20');
1001-01-01 00:07:02
```
It must be 1001-01-01 00:**00:00**.

### Does this PR introduce any user-facing change?
Yes. After the changes:
```shell
$ export TZ="America/Los_Angeles"
$ ./bin/spark-sql -S
```
```sql
spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone	America/Los_Angeles
spark-sql> SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20');
1001-01-01 00:00:00
```

### How was this patch tested?
By running hive-thriftserver tests. In particular:
```
./build/sbt -Phadoop-2.7 -Phive-2.3 -Phive-thriftserver "hive-thriftserver/test:testOnly *SparkThriftServerProtocolVersionsSuite"
```

Closes #27552 from MaxGekk/hive-thriftserver-java8-time-api.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit afaeb29)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
sessionWithJava8DatetimeEnabled.withActive {
// We cannot collect the original dataset because its encoders could be created
// with disabled Java 8 date-time API.
val result: Seq[Seq[Any]] = Dataset.ofRows(ds.sparkSession, ds.logicalPlan)
Contributor

found a problem. Dataset.ofRows will set the input session as active, so we should write Dataset.ofRows(sessionWithJava8DatetimeEnabled, ... and remove the outer sessionWithJava8DatetimeEnabled.withActive.
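Session-activation subtleties aside, the clone-and-configure pattern in the snippet above can be modeled without Spark (a toy `Session` class with invented names): cloning snapshots the config, so flipping the flag on the clone cannot leak into the caller's session.

```scala
// Toy model of session cloning: the clone gets its own config copy.
final case class Session(conf: Map[String, String]) {
  def cloneWith(key: String, value: String): Session =
    copy(conf = conf + (key -> value))
}

val original = Session(Map("spark.sql.datetime.java8API.enabled" -> "false"))
val cloned   = original.cloneWith("spark.sql.datetime.java8API.enabled", "true")
// `original` still sees "false"; only `cloned` sees "true".
```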

@@ -512,7 +511,7 @@ class SQLQueryTestSuite extends QueryTest with SharedSparkSession {
val schema = df.schema.catalogString
// Get answer, but also get rid of the #1234 expression ids that show up in explain plans
val answer = SQLExecution.withNewExecutionId(df.queryExecution, Some(sql)) {
hiveResultString(df.queryExecution.executedPlan).map(replaceNotIncludedMsg)
hiveResultString(df).map(replaceNotIncludedMsg)
Contributor

in ThriftServerQueryTestSuite, we get the result by JDBC, so there is no DataFrame created.

We should follow pgsql and return java 8 datetime when the config is enabled. https://jdbc.postgresql.org/documentation/head/8-date-time.html

cc @wangyum @yaooqinn

// Convert date-time instances to types that are acceptable by Hive libs
// used in conversions to strings.
val resultRow = row.map {
case i: Instant => Timestamp.from(i)
Member

@yaooqinn yaooqinn Feb 28, 2020

There seem to be no Java 8 datetime values added to the row buffer here by SparkExecuteStatementOperation#addNonNullColumnValue

https://github.com/apache/spark/pull/27552/files#diff-72dcd8f81a51c8a815159fdf0332acdcR84-R116

Contributor

can you help fix it? I think we should output java8 datetime values if the config is enabled.

Member

We are limited by hive-jdbc module, see https://github.com/apache/hive/blob/a7e704c679a00db68db9b9f921d133d79a32cfcc/jdbc/src/java/org/apache/hive/jdbc/HiveBaseResultSet.java#L427-L457, we might need our own jdbc driver implementation to achieve this
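The hive-jdbc limitation can be seen directly: `java.sql` types round-trip through `toString`/`valueOf`, but `java.time.Instant.toString` produces ISO-8601 text that `Timestamp.valueOf` rejects (a standalone sketch):

```scala
import java.sql.Timestamp
import java.time.Instant

// java.sql.Timestamp round-trips: toString yields "yyyy-mm-dd hh:mm:ss[.f...]",
// exactly the format Timestamp.valueOf parses.
val ts           = Timestamp.valueOf("2020-02-28 12:00:00")
val roundTripped = Timestamp.valueOf(ts.toString)

// java.time.Instant.toString yields ISO-8601 ("...T...Z"),
// which Timestamp.valueOf cannot parse.
val instantParses =
  try { Timestamp.valueOf(Instant.EPOCH.toString); true }
  catch { case _: IllegalArgumentException => false }
```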

case _ =>
val sessionWithJava8DatetimeEnabled = {
val cloned = ds.sparkSession.cloneSession()
cloned.conf.set(SQLConf.DATETIME_JAVA8API_ENABLED.key, true)
Member

why is this always true?

Contributor

the old Date/Timestamp doesn't follow the new calendar and may produce a wrong string for some date/timestamp values.
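The calendar mismatch behind this: the legacy `java.util`/`java.sql` classes use a hybrid Julian/Gregorian calendar with the cutover in October 1582, while `java.time` uses the proleptic Gregorian calendar. A standalone check (the lenient `GregorianCalendar` normalizes the ten skipped days forward):

```scala
import java.time.LocalDate
import java.util.{Calendar, GregorianCalendar}

// Proleptic Gregorian (java.time): 1582-10-05 is a valid date.
val proleptic = LocalDate.of(1582, 10, 5)

// Hybrid calendar (java.util): Oct 5-14, 1582 do not exist; the lenient
// calendar normalizes 1582-10-05 forward to 1582-10-15.
val hybrid    = new GregorianCalendar(1582, Calendar.OCTOBER, 5)
val hybridDay = hybrid.get(Calendar.DAY_OF_MONTH)
```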

Contributor

oh wait, we format Date/Timestamp by our own formatter, so this should be no problem.

Member

-- !query
set spark.sql.datetime.java8API.enabled
-- !query schema
struct<key:string,value:string>
-- !query output
spark.sql.datetime.java8API.enabled	false


-- !query
set set spark.sql.session.timeZone=America/Los_Angeles
-- !query schema
struct<key:string,value:string>
-- !query output
set spark.sql.session.timeZone	America/Los_Angeles


-- !query
SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20')
-- !query schema
struct<date_trunc(MILLENNIUM, CAST(DATE '1970-03-20' AS TIMESTAMP)):timestamp>
-- !query output
1001-01-01 00:00:00

I removed this line and ran SQLQueryTestSuite with the cases above; the results are the same. Or does this problem only exist for the spark-sql script?

Member Author

Or does this problem only exists for spark-sql script?

Only when thrift-server is involved in the loop.

Member

@yaooqinn yaooqinn Feb 28, 2020

These tests also pass through ThriftServerQueryTestSuite

Member

spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone	America/Los_Angeles
spark-sql> SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20');
1001-01-01 00:00:00
spark-sql> select version();
3.1.0 b3dcb63a682bc31827a86cf381f157a81e9e07ac

Also correct with bin/spark-sql

Contributor

Yea, I tested it too and it looks fine. Maybe some refactoring of how old Date/Timestamp values are formatted has already fixed it.

@yaooqinn can you send a PR to revert it? Let's see if all tests pass.

Member

OK

cloud-fan pushed a commit that referenced this pull request Mar 3, 2020
This reverts commit afaeb29.

### What changes were proposed in this pull request?

Based on the result and comment from #27552 (comment)

In the hive module, the server side provides datetime values simply using `value.toString`, and the client side regenerates the results in `HiveBaseResultSet` with `java.sql.Date(Timestamp).valueOf`.
There will be an inconsistency between client and server if we use Java 8 APIs.

### Why are the changes needed?

the benefit of the change is still unclear

### Does this PR introduce any user-facing change?

no
### How was this patch tested?

Nah

Closes #27733 from yaooqinn/SPARK-30808.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Mar 3, 2020
This reverts commit afaeb29. (Cherry picked from commit 1fac06c to branch-3.0; same commit message as the revert above.)

Signed-off-by: Wenchen Fan <wenchen@databricks.com>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
(Same commit message as the merged commit above.)

Closes apache#27552 from MaxGekk/hive-thriftserver-java8-time-api.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
This reverts commit afaeb29. (Same commit message as the revert above.)

Closes apache#27733 from yaooqinn/SPARK-30808.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
MaxGekk added a commit to MaxGekk/spark that referenced this pull request Jun 2, 2020
(Same commit message as the merged commit above.)

Closes apache#27552 from MaxGekk/hive-thriftserver-java8-time-api.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

# Conflicts:
#	sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala
#	sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
@MaxGekk MaxGekk deleted the hive-thriftserver-java8-time-api branch June 5, 2020 19:43