[SPARK-29783][SQL] Support SQL Standard/ISO_8601 output style for interval type #26418
Test build #113355 has finished for PR 26418 at commit
Can you check with some other databases?
1. Vertica
2. MySQL: seems to have intervals but not interval types, using time and year as an alternative.
3. Postgres: https://www.postgresql.org/docs/9.0/datatype-datetime.html#DATATYPE-INTERVAL-OUTPUT — Pg has 4 output styles for intervals: sql_standard, postgres, postgres_verbose, and iso_8601.
4. Presto: not clear from the online doc https://prestodb.github.io/docs/current/language/types.html#date-and-time, verified by manual testing.

@cloud-fan Currently I have checked these DBs, thanks.
Test build #113368 has finished for PR 26418 at commit

Test build #113366 has finished for PR 26418 at commit
@@ -409,6 +409,8 @@ case class Literal (value: Any, dataType: DataType) extends LeafExpression {
         DateTimeUtils.getZoneId(SQLConf.get.sessionLocalTimeZone))
       s"TIMESTAMP('${formatter.format(v)}')"
     case (v: Array[Byte], BinaryType) => s"X'${DatatypeConverter.printHexBinary(v)}'"
+    case (v: CalendarInterval, CalendarIntervalType) if SQLConf.get.ansiEnabled =>
+      IntervalUtils.toSqlStandardString(v)
This should be something that can be parsed; I think we need to output something like INTERVAL '1 year 2 days'.
We don't support SQL standard input for intervals? I missed that; this may cause a user behavior change. But if we keep the multi-unit style here, would there be conflicts between literals and other expressions?
Changed back, thanks.
I added a new config to control this output behavior, since the ANSI switch is too broad for this.
Test build #113383 has finished for PR 26418 at commit
.stringConf
.transform(_.toUpperCase(Locale.ROOT))
.checkValues(IntervalStyle.values.map(_.toString))
.createWithDefault(IntervalStyle.MULTI_UNITS.toString)
I personally think ansiEnabled is enough for this feature. Any concern?
yes, I guess some users may already rely on the output string
You have to change the code gen in
Test build #113597 has finished for PR 26418 at commit

thanks @MaxGekk

Test build #113607 has finished for PR 26418 at commit
@yaooqinn can you revert the json side changes?

the changes in
@@ -214,10 +218,15 @@ private[sql] class JacksonGenerator(
   private def writeMapData(
       map: MapData, mapType: MapType, fieldWriter: ValueWriter): Unit = {
     val keyArray = map.keyArray()
+    val keyString = mapType.keyType match {
+      case CalendarIntervalType =>
This can't happen actually. We don't allow writing out interval values. Do you have an example that can hit this code path?
Ah, to_json does not have the interval type check. Makes sense.
@yaooqinn, how about to_csv?
@@ -26,6 +26,8 @@ import com.fasterxml.jackson.core.{JsonFactory, JsonParser}
 import org.apache.spark.internal.Logging
 import org.apache.spark.sql.catalyst.util._
 import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.internal.SQLConf.IntervalStyle
+import org.apache.spark.sql.internal.SQLConf.IntervalStyle.IntervalStyle
unused import
val keyString = mapType.keyType match {
  case CalendarIntervalType =>
    (i: Int) => IntervalUtils.toMultiUnitsString(keyArray.getInterval(i))
  case _ => (i: Int) => keyArray.get(i, mapType.keyType).toString
}
It's fragile to rely on toString; e.g. UnsafeRow.toString is not human readable. Shall we recursively write the map key as a JSON object? cc @HyukjinKwon
Ah, I am sorry I missed this cc. In JSON the key should be a string. We should either always make it a string or explicitly disallow it.
cc @viirya I think we talked about this before.
Yea, I think currently the map key is not very useful for some types. To make human-readable map keys, we need to do type-specific serialization for some map key types. Maybe I'll create a JIRA ticket to follow it up?
Yeah .. +1 !
Created https://issues.apache.org/jira/browse/SPARK-29946 to follow it up.
Test build #113717 has finished for PR 26418 at commit

Test build #113719 has finished for PR 26418 at commit
@@ -88,3 +88,46 @@ select justify_interval(interval '1 month -59 day 25 hour');
 select justify_days(interval '1 month 59 day -25 hour');
 select justify_hours(interval '1 month 59 day -25 hour');
 select justify_interval(interval '1 month 59 day -25 hour');
+
+-- interval output style
We can create an interval-display.sql test file to put these test cases in, and then an interval-display-sql-standard.sql which does

--SET spark.sql.intervalOutputStyle = SQL_STANDARD;
--import interval-display.sql

and also an interval-display-iso_8601.sql.
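Concretely, the derived file could contain just the two directives (a sketch following this suggestion; the file and conf names come from the comment above, and the --SET/--import comment directives are the SQLQueryTestSuite conventions discussed below):

```sql
-- interval-display-sql-standard.sql
-- Re-run the shared cases from interval-display.sql under one output style.
--SET spark.sql.intervalOutputStyle = SQL_STANDARD
--import interval-display.sql
```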
--SET spark.sql.intervalOutputStyle = SQL_STANDARD
--import interval-display.sql

These two don't fit each other well: when regenerating golden files, the SET operations will not happen. I consider this a bug. Also cc @maropu
spark/sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala
Lines 288 to 295 in 40ea4a1

if (regenerateGoldenFiles || !isTestWithConfigSets) {
  runQueries(queries, testCase, None)
} else {
  val configSets = {
    val configLines = comments.filter(_.startsWith("--SET")).map(_.substring(5))
    val configs = configLines.map(_.split(",").map { confAndValue =>
      val (conf, value) = confAndValue.span(_ != '=')
      conf.trim -> value.substring(1).trim
In this PR, the golden files are generated with the fix from #26557. Because that fix has nothing to do with this PR, I raised it as a follow-up under the ticket SPARK-29873, which introduced this feature.
Test build #113927 has finished for PR 26418 at commit

Test build #113936 has finished for PR 26418 at commit

thanks, merging to master!
val keyString = mapType.keyType match {
  case CalendarIntervalType =>
    (i: Int) => IntervalUtils.toMultiUnitsString(keyArray.getInterval(i))
  case _ => (i: Int) => keyArray.get(i, mapType.keyType).toString
}
BTW, this type dispatch shouldn't happen per each map like this.
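The concern can be illustrated with a hedged sketch (hypothetical names, not Spark's actual JacksonGenerator API): resolve the per-type key formatter once, when the schema is known, rather than matching on the key type inside every writeMapData call.

```java
// Sketch only: "KeyFormatter" and the interval branch are illustrative
// stand-ins, not Spark code.
public class KeyWriterSketch {
    interface KeyFormatter { String format(Object key); }

    // Choose the formatter once per schema (key type), so writing each map
    // only applies a pre-resolved function instead of re-dispatching on type.
    static KeyFormatter makeKeyFormatter(boolean isIntervalType) {
        if (isIntervalType) {
            // stand-in for IntervalUtils.toMultiUnitsString(...)
            return k -> "interval(" + k + ")";
        }
        return Object::toString;
    }
}
```

A writer built this way pays the dispatch cost once per schema; the quoted diff pays it once per map written, which is the regression noted in the follow-up.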
@@ -409,6 +409,7 @@ case class Literal (value: Any, dataType: DataType) extends LeafExpression {
         DateTimeUtils.getZoneId(SQLConf.get.sessionLocalTimeZone))
       s"TIMESTAMP('${formatter.format(v)}')"
     case (v: Array[Byte], BinaryType) => s"X'${DatatypeConverter.printHexBinary(v)}'"
+    case (v: CalendarInterval, CalendarIntervalType) => IntervalUtils.toMultiUnitsString(v)
Sorry if this was already asked above, but why didn't we change this?
We have not supported parsing intervals from the ISO/SQL standard formats yet.
Why did we not support the ISO/SQL standard formats here as well?
  sb.append(value).append(' ').append(unit).append(' ');
}
return "CalendarInterval(months= " + months + ", days = " + days + ", microsecond = " +
    microseconds + ")";
Why do we use such a string representation now? Was it in order to put the same logic into IntervalUtils? If that's the case, we didn't have to move it; we could keep using this class's toString until this case becomes completely exposed.
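For context, the multi-units style being moved into IntervalUtils can be sketched like this (a simplified illustration of the format, not Spark's actual IntervalUtils.toMultiUnitsString; fractional seconds and negative normalization are omitted):

```java
public final class MultiUnitsSketch {
    private static final long MICROS_PER_SECOND = 1_000_000L;
    private static final long MICROS_PER_MINUTE = 60L * MICROS_PER_SECOND;
    private static final long MICROS_PER_HOUR = 60L * MICROS_PER_MINUTE;

    // Render a CalendarInterval-like (months, days, microseconds) triple as
    // "1 years 2 months 3 days 2 hours"-style text; zero-valued units are skipped.
    static String toMultiUnitsString(int months, int days, long micros) {
        StringBuilder sb = new StringBuilder();
        append(sb, months / 12, "years");
        append(sb, months % 12, "months");
        append(sb, days, "days");
        append(sb, micros / MICROS_PER_HOUR, "hours");
        append(sb, (micros % MICROS_PER_HOUR) / MICROS_PER_MINUTE, "minutes");
        append(sb, (micros % MICROS_PER_MINUTE) / MICROS_PER_SECOND, "seconds");
        return sb.length() == 0 ? "0 seconds" : sb.toString().trim();
    }

    private static void append(StringBuilder sb, long value, String unit) {
        if (value != 0) sb.append(value).append(' ').append(unit).append(' ');
    }
}
```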
Let me make a followup by myself.

Made #26572
@@ -93,7 +93,10 @@ class ThriftServerQueryTestSuite extends SQLQueryTestSuite {
     "subquery/in-subquery/in-group-by.sql",
     "subquery/in-subquery/simple-in.sql",
     "subquery/in-subquery/in-order-by.sql",
-    "subquery/in-subquery/in-set-operations.sql"
+    "subquery/in-subquery/in-set-operations.sql",
+    // SPARK-29783: need to set conf
Was it only because we couldn't set the configurations? If the Thrift server does not respect a configuration like

--SET spark.sql.intervalOutputStyle = ISO_8601

it looks like a bug to me. @wangyum
We can remove this config:

Line 78 in 3163b6b

override val isTestWithConfigSets = false
### What changes were proposed in this pull request?

This is a followup of #26418. This PR removed `CalendarInterval`'s `toString` with an unfinished change.

### Why are the changes needed?

1. Ideally we should make each PR isolated and separate, targeting one issue without touching unrelated code.
2. There are some other places where the string formats were exposed to users. For example:

```scala
scala> sql("select interval 1 days as a").selectExpr("to_csv(struct(a))").show()
```

```
+--------------------------+
|to_csv(named_struct(a, a))|
+--------------------------+
|      "CalendarInterval...|
+--------------------------+
```

3. Such fixes:

```diff
 private def writeMapData(
     map: MapData, mapType: MapType, fieldWriter: ValueWriter): Unit = {
   val keyArray = map.keyArray()
+  val keyString = mapType.keyType match {
+    case CalendarIntervalType =>
+      (i: Int) => IntervalUtils.toMultiUnitsString(keyArray.getInterval(i))
+    case _ => (i: Int) => keyArray.get(i, mapType.keyType).toString
+  }
```

can cause performance regression due to type dispatch for each map.

### Does this PR introduce any user-facing change?

Yes, see case 2 above.

### How was this patch tested?

Manually tested.

Closes #26572 from HyukjinKwon/SPARK-29783.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
… we decide not to follow ANSI and no round trip

### What changes were proposed in this pull request?

This reverts #26418 and files a new ticket under https://issues.apache.org/jira/browse/SPARK-30546 for better tracking of interval behavior.

### Why are the changes needed?

Revert the interval ISO/ANSI SQL Standard output since we decided not to follow ANSI and there is no round trip.

### Does this PR introduce any user-facing change?

No, not released yet.

### How was this patch tested?

Existing UTs.

Closes #27304 from yaooqinn/SPARK-30593.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?

Add 3 interval output styles, named SQL_STANDARD, ISO_8601, and MULTI_UNITS, and add a new conf spark.sql.dialect.intervalOutputStyle to select among them. The MULTI_UNITS style displays interval values in the former behavior and is the default. The newly added SQL_STANDARD and ISO_8601 styles can be found in the following table.

Why are the changes needed?

For ANSI SQL support.

Does this PR introduce any user-facing change?

Yes, interval output now has 3 styles.

How was this patch tested?

Added new unit tests.

cc @cloud-fan @maropu @MaxGekk @HyukjinKwon thanks.
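As one concrete illustration of the ISO_8601 style named above, here is a hedged sketch of PnYnMnDTnHnMnS rendering (simplified; this is not the PR's IntervalUtils code, and sub-second precision and negative fields are ignored):

```java
public final class Iso8601Sketch {
    // Render (months, days, microseconds) in ISO 8601 duration form,
    // e.g. 14 months, 3 days, 2 hours of micros -> "P1Y2M3DT2H".
    static String toIso8601String(int months, int days, long micros) {
        long totalSeconds = micros / 1_000_000L;
        long h = totalSeconds / 3600;
        long m = (totalSeconds % 3600) / 60;
        long s = totalSeconds % 60;
        StringBuilder sb = new StringBuilder("P");
        if (months / 12 != 0) sb.append(months / 12).append('Y');
        if (months % 12 != 0) sb.append(months % 12).append('M');
        if (days != 0) sb.append(days).append('D');
        if (h != 0 || m != 0 || s != 0) {
            sb.append('T');  // time part only when any time field is non-zero
            if (h != 0) sb.append(h).append('H');
            if (m != 0) sb.append(m).append('M');
            if (s != 0) sb.append(s).append('S');
        }
        if (sb.length() == 1) sb.append("0D");  // all-zero interval
        return sb.toString();
    }
}
```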