
[SPARK-29783][SQL] Support SQL Standard/ISO_8601 output style for interval type #26418

Closed. Wants to merge 22 commits.

Conversation

yaooqinn
Member

@yaooqinn yaooqinn commented Nov 7, 2019

What changes were proposed in this pull request?

Add three interval output styles, named MULTI_UNITS, SQL_STANDARD, and ISO_8601, controlled by a new conf, spark.sql.dialect.intervalOutputStyle. The MULTI_UNITS style preserves the previous display behavior and remains the default. The newly added SQL_STANDARD and ISO_8601 styles are shown in the following table.

| Format | Style conf | Year-Month Interval | Day-Time Interval | Mixed Interval |
| --- | --- | --- | --- | --- |
| With Time Unit Designators | MULTI_UNITS | 1 year 2 mons | 1 days 2 hours 3 minutes 4.123456 seconds | interval 1 days 2 hours 3 minutes 4.123456 seconds |
| SQL STANDARD | SQL_STANDARD | 1-2 | 3 4:05:06 | -1-2 3 -4:05:06 |
| ISO8601 Basic Format | ISO_8601 | P1Y2M | P3DT4H5M6S | P-1Y-2M3D-4H-5M-6S |
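As an illustrative sketch (not Spark's implementation), the SQL_STANDARD and ISO_8601 rows above can be produced from the same (months, days, microseconds) triple that backs CalendarInterval. All helper names here are hypothetical:

```python
# Hypothetical sketch of the two new output styles, operating on the
# (months, days, microseconds) fields of a CalendarInterval-like value.

def _split_time(micros):
    """Split microseconds into (sign, hours, minutes, seconds, micros)."""
    sign = -1 if micros < 0 else 1
    rest = abs(micros)
    hours, rest = divmod(rest, 3_600_000_000)
    minutes, rest = divmod(rest, 60_000_000)
    seconds, us = divmod(rest, 1_000_000)
    return sign, hours, minutes, seconds, us

def to_iso_8601(months, days, micros):
    """ISO-8601 style, e.g. P1Y2M or P3DT4H5M6S; each field keeps its sign."""
    years = int(months / 12)              # truncate toward zero
    rem_months = months - years * 12
    sign, h, m, s, us = _split_time(micros)
    date_part = ""
    if years:
        date_part += f"{years}Y"
    if rem_months:
        date_part += f"{rem_months}M"
    if days:
        date_part += f"{days}D"
    time_part = ""
    if h:
        time_part += f"{sign * h}H"
    if m:
        time_part += f"{sign * m}M"
    if us:
        frac = f"{us:06d}".rstrip("0")
        time_part += f"{'-' if sign < 0 else ''}{s}.{frac}S"
    elif s:
        time_part += f"{sign * s}S"
    result = "P" + date_part + ("T" + time_part if time_part else "")
    return result if result != "P" else "PT0S"

def to_sql_standard(months, days, micros):
    """SQL standard style, e.g. 1-2, 3 4:05:06, or -1-2 3 -4:05:06."""
    years = int(months / 12)
    rem_months = abs(months - years * 12)
    ym = f"{'-' if months < 0 else ''}{abs(years)}-{rem_months}"
    sign, h, m, s, us = _split_time(micros)
    secs = f"{s:02d}" + (f".{us:06d}" if us else "")
    dt = f"{days} {'-' if sign < 0 else ''}{h}:{m:02d}:{secs}"
    if months and not days and not micros:
        return ym
    if not months:
        return dt
    return f"{ym} {dt}"
```

One caveat: strict ISO-8601 keeps the `T` separator even when the date part is negative, so this sketch renders the mixed example as `P-1Y-2M3DT-4H-5M-6S`; the table cell above shows it without the `T`.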

Why are the changes needed?

For ANSI SQL support.

Does this PR introduce any user-facing change?

Yes, interval output now has three styles.

How was this patch tested?

Added new unit tests.

cc @cloud-fan @maropu @MaxGekk @HyukjinKwon thanks.

@SparkQA

SparkQA commented Nov 7, 2019

Test build #113355 has finished for PR 26418 at commit 88418e0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

can you check with some other databases?

@yaooqinn
Member Author

yaooqinn commented Nov 7, 2019

1. Vertica

https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/DataTypes/Date-Time/DisplayingOrOmittingIntervalUnitsInOutput.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Data%20Types%7CDate%2FTime%20Data%20Types%7CINTERVAL%7C_____1

If DATESTYLE is set to SQL, interval unit display always conforms to the SQL:2008 standard, which omits interval unit display. If DATESTYLE is set to ISO, you can use SET INTERVALSTYLE to omit or display interval unit display.

2. MySQL

MySQL appears to support interval arithmetic but not an interval data type, offering TIME and YEAR types as alternatives.
https://dev.mysql.com/doc/refman/8.0/en/year.html
https://dev.mysql.com/doc/refman/8.0/en/time.html

3. Postgres

https://www.postgresql.org/docs/9.0/datatype-datetime.html#DATATYPE-INTERVAL-OUTPUT

Postgres has four output styles for intervals: sql_standard, postgres, postgres_verbose, and iso_8601.

4. Presto

Not clearly documented in the online doc (https://prestodb.github.io/docs/current/language/types.html#date-and-time); verified manually:

presto> select INTERVAL '3' DAY;
     _col0
----------------
 3 00:00:00.000
(1 row)

Query 20191107_060825_00008_n4kma, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:01 [0 rows, 0B] [0 rows/s, 0B/s]

presto> select INTERVAL '3' MONTH;
 _col0
-------
 0-3
(1 row)

Query 20191107_060834_00009_n4kma, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

@cloud-fan these are the databases I have checked so far. Thanks.

@SparkQA

SparkQA commented Nov 7, 2019

Test build #113368 has finished for PR 26418 at commit 02be22b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class LocalShuffleReaderExec(child: SparkPlan) extends UnaryExecNode
  • class ContinuousRecordEndpoint(buckets: Seq[Seq[UnsafeRow]], lock: Object)

@SparkQA

SparkQA commented Nov 7, 2019

Test build #113366 has finished for PR 26418 at commit 60441d5.

  • This patch fails due to an unknown error code, -9.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@@ -409,6 +409,8 @@ case class Literal (value: Any, dataType: DataType) extends LeafExpression {
DateTimeUtils.getZoneId(SQLConf.get.sessionLocalTimeZone))
s"TIMESTAMP('${formatter.format(v)}')"
case (v: Array[Byte], BinaryType) => s"X'${DatatypeConverter.printHexBinary(v)}'"
case (v: CalendarInterval, CalendarIntervalType) if SQLConf.get.ansiEnabled =>
IntervalUtils.toSqlStandardString(v)
Contributor

This should be something that can be parsed; I think we need to output something like INTERVAL '1 year 2 days'.

Member Author

We don't support SQL standard input for intervals? I missed that; it may cause a user-facing behavior change. But if we keep the multi-units style here, would there be a conflict between literals and other expressions?

Member Author

Changed back, thanks.

@yaooqinn
Member Author

yaooqinn commented Nov 7, 2019

I added a new config to control this output behavior, since the ANSI switch is too broad for this.

@SparkQA

SparkQA commented Nov 7, 2019

Test build #113383 has finished for PR 26418 at commit fff17d5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

.stringConf
.transform(_.toUpperCase(Locale.ROOT))
.checkValues(IntervalStyle.values.map(_.toString))
.createWithDefault(IntervalStyle.MULTI_UNITS.toString)
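The conf builder quoted above uppercases the user's value and then validates it against the enum's values. A minimal Python sketch of that check (names are illustrative, not Spark's):

```python
# Sketch of the validation performed by the quoted conf builder:
# normalize to upper case, then check against the allowed style names.
INTERVAL_STYLES = {"MULTI_UNITS", "SQL_STANDARD", "ISO_8601"}
DEFAULT_STYLE = "MULTI_UNITS"

def resolve_interval_style(value=None):
    if value is None:
        return DEFAULT_STYLE                  # createWithDefault
    normalized = value.upper()                # transform(_.toUpperCase)
    if normalized not in INTERVAL_STYLES:     # checkValues
        raise ValueError(
            f"invalid interval output style {value!r}; "
            f"expected one of {sorted(INTERVAL_STYLES)}")
    return normalized
```

The upper-casing transform makes the conf case-insensitive for users while keeping the validation set small.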
Member

I personally think ansiEnabled is enough for this feature. Any concern?

Member Author

Yes; I suspect some users may already rely on the current output string.

@yaooqinn yaooqinn changed the title [SPARK-29783][SQL] Support SQL Standard output style for interval type [SPARK-29783][SQL] Support SQL Standard/ISO_8601 output style for interval type Nov 11, 2019
@MaxGekk
Member

MaxGekk commented Nov 11, 2019

You have to change the code gen in castToStringCode as well.

@SparkQA

SparkQA commented Nov 11, 2019

Test build #113597 has finished for PR 26418 at commit 005a28d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn
Member Author

You have to change the code gen in castToStringCode as well.

thanks @MaxGekk

@SparkQA

SparkQA commented Nov 12, 2019

Test build #113607 has finished for PR 26418 at commit fa9c41e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

@yaooqinn can you revert the json side changes?

@yaooqinn
Member Author

@yaooqinn can you revert the json side changes?

The changes in JacksonGenerator.scala should be kept for consistency, because the interval's toString has changed. I will remove the others.

@@ -214,10 +218,15 @@ private[sql] class JacksonGenerator(
private def writeMapData(
map: MapData, mapType: MapType, fieldWriter: ValueWriter): Unit = {
val keyArray = map.keyArray()
val keyString = mapType.keyType match {
case CalendarIntervalType =>
Contributor

This can't happen actually. We don't allow writing out interval values. Do you have an example that can hit this code path?

Member Author

Contributor

Ah, to_json does not have the interval type check. Makes sense.

Member

@yaooqinn, how about to_csv?

@@ -26,6 +26,8 @@ import com.fasterxml.jackson.core.{JsonFactory, JsonParser}
import org.apache.spark.internal.Logging
import org.apache.spark.sql.catalyst.util._
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.internal.SQLConf.IntervalStyle
import org.apache.spark.sql.internal.SQLConf.IntervalStyle.IntervalStyle
Contributor

unused import

val keyString = mapType.keyType match {
case CalendarIntervalType =>
(i: Int) => IntervalUtils.toMultiUnitsString(keyArray.getInterval(i))
case _ => (i: Int) => keyArray.get(i, mapType.keyType).toString
Contributor

It's fragile to rely on toString; e.g., UnsafeRow.toString is not human-readable. Shall we recursively write the map key as a JSON object? cc @HyukjinKwon

Member

Ah, I am sorry I missed this cc. In JSON the key should be a string. We should either always render it as a string or explicitly disallow it.

Member

cc @viirya I think we talked about this before.

Member

Yeah, I think the map key is currently not very useful for some types. To make map keys human-readable, we need type-specific serialization for some key types. Maybe I should create a JIRA ticket to follow up?

Member

Yeah .. +1 !

@SparkQA

SparkQA commented Nov 13, 2019

Test build #113717 has finished for PR 26418 at commit 8d18fac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 13, 2019

Test build #113719 has finished for PR 26418 at commit 0c530cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -88,3 +88,46 @@ select justify_interval(interval '1 month -59 day 25 hour');
select justify_days(interval '1 month 59 day -25 hour');
select justify_hours(interval '1 month 59 day -25 hour');
select justify_interval(interval '1 month 59 day -25 hour');

-- interval output style
Contributor

We can create an interval-display.sql test file to hold these test cases, and then an interval-display-sql-standard.sql which does

--SET spark.sql.intervalOutputStyle = SQL_STANDARD;
--import interval-display.sql

and also an interval-display-iso_8601.sql.

Member Author

--SET spark.sql.intervalOutputStyle = SQL_STANDARD
--import interval-display.sql

These two don't fit together well: when regenerating golden files, the SET operations do not take effect. I consider this a bug.
also cc @maropu

if (regenerateGoldenFiles || !isTestWithConfigSets) {
runQueries(queries, testCase, None)
} else {
val configSets = {
val configLines = comments.filter(_.startsWith("--SET")).map(_.substring(5))
val configs = configLines.map(_.split(",").map { confAndValue =>
val (conf, value) = confAndValue.span(_ != '=')
conf.trim -> value.substring(1).trim

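For reference, the parsing the quoted snippet performs can be sketched in Python: collect `--SET k=v[,k2=v2]` comment lines and split each pair on its first `=` (the role of `confAndValue.span`). Names here are illustrative:

```python
def parse_config_sets(comment_lines):
    """Collect '--SET k=v[,k2=v2]' comment lines into config sets,
    splitting each pair on its first '=' (like confAndValue.span)."""
    config_sets = []
    for line in comment_lines:
        if not line.startswith("--SET"):
            continue
        pairs = []
        for conf_and_value in line[5:].split(","):
            conf, _, value = conf_and_value.partition("=")
            pairs.append((conf.strip(), value.strip()))
        config_sets.append(pairs)
    return config_sets
```

Note that when `regenerateGoldenFiles` is true the harness skips this parsing entirely, which is exactly the bug discussed above.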
Member Author

yaooqinn commented Nov 16, 2019

In this PR, the golden files are generated with the fix in #26557. Since that fix has nothing to do with this PR, I raised it as a follow-up to SPARK-29873, which is related to this feature.

@SparkQA

SparkQA commented Nov 16, 2019

Test build #113927 has finished for PR 26418 at commit 7641e5e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 16, 2019

Test build #113936 has finished for PR 26418 at commit 0f54af8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 5cebe58 Nov 18, 2019
val keyString = mapType.keyType match {
case CalendarIntervalType =>
(i: Int) => IntervalUtils.toMultiUnitsString(keyArray.getInterval(i))
case _ => (i: Int) => keyArray.get(i, mapType.keyType).toString
Member

BTW, this code path shouldn't be executed for each map.
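The point above can be sketched as follows: resolve the key formatter once from the (static) schema type, instead of pattern-matching inside every writeMapData call. Names here are illustrative, not Spark's:

```python
def make_key_formatter(key_type):
    """Resolve the key-to-string function once, from the schema type."""
    if key_type == "interval":
        # illustrative stand-in for IntervalUtils.toMultiUnitsString
        return lambda iv: f"{iv[0]} months {iv[1]} days {iv[2]} microseconds"
    return str

def write_map_keys(keys, key_formatter):
    # hot path: no per-map (or per-entry) type dispatch remains
    return [key_formatter(k) for k in keys]
```

Since the key type is fixed by the schema, hoisting the dispatch to generator-construction time avoids repeating it for every row written.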

@@ -409,6 +409,7 @@ case class Literal (value: Any, dataType: DataType) extends LeafExpression {
DateTimeUtils.getZoneId(SQLConf.get.sessionLocalTimeZone))
s"TIMESTAMP('${formatter.format(v)}')"
case (v: Array[Byte], BinaryType) => s"X'${DatatypeConverter.printHexBinary(v)}'"
case (v: CalendarInterval, CalendarIntervalType) => IntervalUtils.toMultiUnitsString(v)
Member

Sorry if this was already asked above, but why didn't we change this?

Member Author

We don't support parsing intervals from the ISO/SQL standard formats yet.

Member

Why didn't we support the ISO/SQL standard formats here as well?

Member Author

sb.append(value).append(' ').append(unit).append(' ');
}
return "CalendarInterval(months= " + months + ", days = " + days + ", microsecond = " +
microseconds + ")";
Member

Why do we use such a string representation now? Was it in order to move the same logic into IntervalUtils? If so, we didn't have to move it; we could have kept this class's toString until this case is fully exposed.

@HyukjinKwon
Member

Let me make a followup by myself.

@HyukjinKwon
Member

Made #26572

@@ -93,7 +93,10 @@ class ThriftServerQueryTestSuite extends SQLQueryTestSuite {
"subquery/in-subquery/in-group-by.sql",
"subquery/in-subquery/simple-in.sql",
"subquery/in-subquery/in-order-by.sql",
"subquery/in-subquery/in-set-operations.sql"
"subquery/in-subquery/in-set-operations.sql",
// SPARK-29783: need to set conf
Member

Was it only because we couldn't set the configurations? If the Thrift server does not respect a configuration like:

--SET spark.sql.intervalOutputStyle = ISO_8601

it looks like a bug to me. @wangyum

Member

We can remove this config:

HyukjinKwon added a commit that referenced this pull request Nov 19, 2019
### What changes were proposed in this pull request?

This is a followup of #26418. This PR removes `CalendarInterval`'s `toString`, which was an unfinished change.

### Why are the changes needed?

1. Ideally we should make each PR isolated and separate targeting one issue without touching unrelated codes.

2. There are some other places where the string formats were exposed to users. For example:

    ```scala
    scala> sql("select interval 1 days as a").selectExpr("to_csv(struct(a))").show()
    ```
    ```
    +--------------------------+
    |to_csv(named_struct(a, a))|
    +--------------------------+
    |      "CalendarInterval...|
    +--------------------------+
    ```

3.  Such fixes:

    ```diff
     private def writeMapData(
        map: MapData, mapType: MapType, fieldWriter: ValueWriter): Unit = {
      val keyArray = map.keyArray()
    + val keyString = mapType.keyType match {
    +   case CalendarIntervalType =>
    +    (i: Int) => IntervalUtils.toMultiUnitsString(keyArray.getInterval(i))
    +   case _ => (i: Int) => keyArray.get(i, mapType.keyType).toString
    + }
    ```

    can cause performance regression due to type dispatch for each map.

### Does this PR introduce any user-facing change?

Yes, see 2. case above.

### How was this patch tested?

Manually tested.

Closes #26572 from HyukjinKwon/SPARK-29783.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
cloud-fan pushed a commit that referenced this pull request Jan 21, 2020
… we decide not to follow ANSI and no round trip

### What changes were proposed in this pull request?

This reverts #26418 and files a new ticket under https://issues.apache.org/jira/browse/SPARK-30546 to better track interval behavior.
### Why are the changes needed?

Revert the interval ISO/ANSI SQL standard output styles, since we decided not to follow ANSI and there is no round trip.

### Does this PR introduce any user-facing change?

No; this was not released yet.

### How was this patch tested?

Existing unit tests.

Closes #27304 from yaooqinn/SPARK-30593.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>