[SPARK-28435][SQL] Support accepting the interval keyword in the schema string #25189

Closed
wangyum wants to merge 4 commits

Member

@wangyum wangyum commented Jul 18, 2019

What changes were proposed in this pull request?

#7355 added support for casting between IntervalType and StringType in the Scala interface:

import org.apache.spark.sql.types._
import org.apache.spark.sql.catalyst.expressions._

Cast(Literal("interval 3 month 1 hours"), CalendarIntervalType).eval()
res0: Any = interval 3 months 1 hours

But the SQL interface does not support it:

scala> spark.sql("SELECT CAST('interval 3 month 1 hour' AS interval)").show
org.apache.spark.sql.catalyst.parser.ParseException:
DataType interval is not supported.(line 1, pos 41)

== SQL ==
SELECT CAST('interval 3 month 1 hour' AS interval)
-----------------------------------------^^^

  at org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitPrimitiveDataType$1(AstBuilder.scala:1931)
  at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:108)
  at org.apache.spark.sql.catalyst.parser.AstBuilder.visitPrimitiveDataType(AstBuilder.scala:1909)
  at org.apache.spark.sql.catalyst.parser.AstBuilder.visitPrimitiveDataType(AstBuilder.scala:52)
...

This PR adds support for accepting the interval keyword in the schema string, so that the SQL interface can support this feature.

How was this patch tested?

unit tests

@wangyum wangyum changed the title from "[SPARK-28435][SQL] Support cast string to interval for SQL interface" to "[SPARK-28435][SQL] Support cast StringType to IntervalType for SQL interface" on Jul 18, 2019
@SparkQA

SparkQA commented Jul 18, 2019

Test build #107826 has finished for PR 25189 at commit e7e2f5b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member Author

wangyum commented Jul 18, 2019

retest this please

@SparkQA

SparkQA commented Jul 18, 2019

Test build #107835 has finished for PR 25189 at commit e7e2f5b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

-- cast string to interval and interval to string
SELECT CAST('interval 3 month 1 hour' AS interval);
SELECT CAST(interval 3 month 1 hour AS string);

Member

nit: Move these tests to the end of this file?

@SparkQA

SparkQA commented Jul 19, 2019

Test build #107892 has finished for PR 25189 at commit a2f3676.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 19, 2019

Test build #107896 has finished for PR 25189 at commit 4f32916.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@maropu maropu left a comment

Looks fine to me cc: @dongjoon-hyun @gatorsmile

Member

@dongjoon-hyun dongjoon-hyun left a comment

Hi, @wangyum. Your contribution is good, but I'd give -1 for this PR (AS-IS) for the following reasons.

  1. This PR silently renames calendarinterval to interval, which causes most of the changes in this PR (is it inevitable?)
  2. However, the PR title and description don't mention it at all.
  3. This situation has been repeated several times in your PRs.

The PR should be consistent across three parts: title, description, and code.

In addition, I'm wondering if this is the only way to support this casting. This is just a design choice. You can double down on your direction, but could you please try to find a minimal way before we go down this path?

cc @gatorsmile

@wangyum
Member Author

wangyum commented Jul 22, 2019

How about splitting it into 2 PRs:

  1. Rename calendarinterval to interval.
  2. Support cast StringType to IntervalType for SQL interface.

@dongjoon-hyun
Member

Otherwise, you can update this PR title and description according to your contribution.

If you suggest that, of course, it's possible, and each of those PRs will look clearer on its own. For (1), the PR should describe the context in its own description instead of pointing to this PR.

@wangyum
Member Author

wangyum commented Jul 22, 2019

Created a new PR (#25225) to make it clearer.

@dongjoon-hyun
Member

#25225 is merged now. Could you rebase this PR?

@SparkQA

SparkQA commented Jul 23, 2019

Test build #108033 has finished for PR 25189 at commit ca111a4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CannotReplaceMissingTableException(
  • case class MakeDate(year: Expression, month: Expression, day: Expression)
  • case class ReplaceTable(
  • case class ReplaceTableAsSelect(
  • case class ReplaceTableStatement(
  • case class ReplaceTableAsSelectStatement(
  • case class ReplaceTableExec(
  • case class AtomicReplaceTableExec(
  • case class AtomicCreateTableAsSelectExec(
  • case class ReplaceTableAsSelectExec(
  • case class AtomicReplaceTableAsSelectExec(

@wangyum
Member Author

wangyum commented Jul 23, 2019

retest this please

@SparkQA

SparkQA commented Jul 23, 2019

Test build #108041 has finished for PR 25189 at commit ca111a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CannotReplaceMissingTableException(
  • case class MakeDate(year: Expression, month: Expression, day: Expression)
  • case class ReplaceTable(
  • case class ReplaceTableAsSelect(
  • case class ReplaceTableStatement(
  • case class ReplaceTableAsSelectStatement(
  • case class ReplaceTableExec(
  • case class AtomicReplaceTableExec(
  • case class AtomicCreateTableAsSelectExec(
  • case class ReplaceTableAsSelectExec(
  • case class AtomicReplaceTableAsSelectExec(

@maropu
Member

maropu commented Jul 24, 2019

Weird log message? It isn't related to this PR, right?

This patch adds the following public classes (experimental):
class CannotReplaceMissingTableException(
case class MakeDate(year: Expression, month: Expression, day: Expression)
case class ReplaceTable(
case class ReplaceTableAsSelect(
case class ReplaceTableStatement(
case class ReplaceTableAsSelectStatement(
case class ReplaceTableExec(
case class AtomicReplaceTableExec(
case class AtomicCreateTableAsSelectExec(
case class ReplaceTableAsSelectExec(
case class AtomicReplaceTableAsSelectExec(

Anyway, could you update the title/description? (It seems the current title doesn't directly describe the PR?)

@wangyum wangyum changed the title from "[SPARK-28435][SQL] Support cast StringType to IntervalType for SQL interface" to "[SPARK-28435][SQL] Add support accepting the interval keyword in the schema string" on Jul 24, 2019
@wangyum
Member Author

wangyum commented Jul 24, 2019

Thank you, @maropu. Updated the title and description.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you, @wangyum and @maropu .
Yes. The message is irrelevant to this PR.

Merged to master.

@dongjoon-hyun dongjoon-hyun changed the title from "[SPARK-28435][SQL] Add support accepting the interval keyword in the schema string" to "[SPARK-28435][SQL] Support accepting the interval keyword in the schema string" on Jul 24, 2019
@wangyum wangyum deleted the SPARK-28435 branch July 24, 2019 02:48
@cloud-fan
Contributor

I'm a little hesitant to add interval-related features and make the interval type more visible to users. AFAIK there are ongoing discussions about if we should revisit the interval type and make it follow SQL standard. cc @MaxGekk @mgaido91

PRs like this will add more backward compatibility concerns if we change the interval type in the future.

@mgaido91
Contributor

mgaido91 commented Aug 1, 2019

there are ongoing discussions about if we should revisit the interval type and make it follow SQL standard

@cloud-fan thanks for pinging me. Could you please point out which discussion you're referring to? I am probably missing it...

@cloud-fan
Contributor

@mgaido91 I really can't recall it, maybe in some PRs or JIRA tickets or dev list emails. I was pinging some people who might be related :P

Also cc @HyukjinKwon

@dongjoon-hyun
Member

Hi, @cloud-fan. This is not work at that level, because it only touches the SQL layer.

I guess you mean this.

There were positive comments from @rxin and @cloud-fan there.

Shall we talk on that PR?

@MaxGekk
Member

MaxGekk commented Aug 1, 2019

... if we should revisit the interval type and make it follow SQL standard. cc @MaxGekk @mgaido91

I think we should support 2 separate data types for intervals as it is defined by SQL standard. I opened 2 JIRA tickets for that: SPARK-27790 (SPARK-27791 & SPARK-27793), but both types must have the same keyword for literals: interval. I would keep the existing type CalendarInterval as is. I don't see any reason to assign a more generic name to it, because it is just a mix of 2 types with not very well-defined semantics.
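
To make the two-type idea concrete, here is a minimal Scala sketch. It is only an illustration under stated assumptions: `StandardInterval`, `YearMonthInterval`, `DayTimeInterval`, and `IntervalExamples` are hypothetical names invented for this example. They are not Spark classes and not the design tracked in the JIRA tickets above.

```scala
// Hypothetical model for illustration only; these are not Spark classes.
sealed trait StandardInterval

// A year-month interval reduces to a whole number of months.
final case class YearMonthInterval(months: Int) extends StandardInterval

// A day-time interval reduces to a fixed duration, stored here in microseconds.
final case class DayTimeInterval(microseconds: Long) extends StandardInterval

object IntervalExamples {
  private val MicrosPerHour: Long = 60L * 60L * 1000000L

  // A literal such as '3 year 2 month' would map to a single year-month value.
  def yearsMonths(years: Int, months: Int): YearMonthInterval =
    YearMonthInterval(years * 12 + months)

  // A literal such as '1 hour' would map to a single day-time value.
  def hours(h: Long): DayTimeInterval =
    DayTimeInterval(h * MicrosPerHour)
}
```

In this model a value such as `interval 3 month 1 hour` has no single representation, because it would need both a month count and a fixed duration.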

@dongjoon-hyun
Member

dongjoon-hyun commented Aug 1, 2019

@MaxGekk. Those are new feature requests. We are always open to new features.

Before those requests, we merged #25225 10 days ago after considering those limitations. Please see the following comments.

Please make a PR with the needed change. I'm looking forward to seeing your PRs.

@cloud-fan
Contributor

@MaxGekk I think one thing we can do right now is to add a restriction on the interval literal: we can't allow an interval with both year-month fields and day-time fields.
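
A rough sketch of what such a check could look like, assuming the literal is a whitespace-separated list of numbers and unit names. The object `IntervalFieldCheck` and the method `mixesFieldGroups` are hypothetical names for this example; this is not how Spark's parser implements, or would implement, the restriction.

```scala
// Hypothetical sketch: none of these names exist in Spark.
object IntervalFieldCheck {
  // Units grouped the way the SQL standard splits interval types.
  private val yearMonthUnits = Set("year", "month")
  private val dayTimeUnits = Set("day", "hour", "minute", "second")

  // Normalize a token: lower-case it and drop a trailing plural "s".
  private def normalize(token: String): String = {
    val t = token.toLowerCase
    if (t.endsWith("s")) t.dropRight(1) else t
  }

  // Returns true when the literal mixes year-month and day-time units.
  def mixesFieldGroups(literal: String): Boolean = {
    val tokens = literal.trim.split("\\s+").toSeq.map(normalize)
    tokens.exists(yearMonthUnits) && tokens.exists(dayTimeUnits)
  }
}
```

With this sketch, `IntervalFieldCheck.mixesFieldGroups("interval 3 month 1 hour")` flags the literal from the PR description as mixed, while `"interval 3 year 2 month"` is not flagged.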

@HyukjinKwon
Member

HyukjinKwon commented Aug 2, 2019

Yea, it's really rather about #25022. But I have to say there are still multiple tasks to fully support interval. For instance, we should map relevant language-native types into internal in Python and R as well. See SPARK-28493 and SPARK-28492. Also, we should consider mapping R <> Arrow and Pandas <> Arrow too.

I think we should officially document that those interval-related features are all experimental and highly unstable, to alleviate the concern.

@mgaido91
Contributor

mgaido91 commented Aug 3, 2019

I agree with @MaxGekk when he says:

I think we should support 2 separate data types for intervals as it is defined by SQL standard.

Hence, I honestly share @cloud-fan's doubts about making this visible to users. Making visible something that we are going to change (hopefully in the same 3.0 timeframe) does not seem a great choice: it may introduce more backward compatibility concerns when we want to reach SQL standard compliance, and/or we may then break users' applications.

Another thing we may do is to transform the CalendarInterval type in order to make it SQL compliant. In this case, I think the only thing to do is changing the type itself in order to add the needed restrictions and the ability to specify its precision. This may cause backward compatibility issues - due to the new restriction introduced - but it may be the cleanest way IMHO.

@cloud-fan
Contributor

I'm OK to expose some parts of the interval that are not going to change, e.g. the interval literal syntax is fine to me: SELECT INTERVAL '3 year 2 month'.

About this interval cast syntax, does Postgres/Presto/SQL Server/Oracle have it?

@dongjoon-hyun
Member

Yep. Of course, this is a supported syntax, @cloud-fan .

@dongjoon-hyun
Member

postgres=# SELECT INTERVAL '3 year 2 month';
    interval
----------------
 3 years 2 mons
(1 row)

@cloud-fan
Contributor

yea I know the interval literal syntax is standard, I was asking about the cast syntax.

@dongjoon-hyun
Member

Oh, I got it.

@maropu
Member

maropu commented Aug 7, 2019

Yea, pg can explicitly cast a text to interval;

postgres=# \d t
                  Table "public.t"
   Column    | Type | Collation | Nullable | Default 
-------------+------+-----------+----------+---------
 intervalstr | text |           |          | 

postgres=# select * from t;
 intervalstr 
-------------
 1 hour
(1 row)

postgres=# select cast(intervalstr as interval) from t;
 intervalstr 
-------------
 01:00:00
(1 row)

postgres=# create temporary view v as select cast(intervalstr as interval) from t;
postgres=# \d v
                   View "pg_temp_3.v"
   Column    |   Type   | Collation | Nullable | Default 
-------------+----------+-----------+----------+---------
 intervalstr | interval |           |          | 

// literal case
postgres=# select cast('1 hour' as interval);
 interval 
----------
 01:00:00
(1 row)

@dongjoon-hyun
Member

What he meant is SELECT CAST('interval 3 month 1 hour' AS interval), isn't it?

postgres=# SELECT CAST('3 month 1 hour' AS interval);
    interval
-----------------
 3 mons 01:00:00
(1 row)

postgres=# SELECT CAST('interval 3 month 1 hour' AS interval);
ERROR:  invalid input syntax for type interval: "interval 3 month 1 hour"
LINE 1: SELECT CAST('interval 3 month 1 hour' AS interval);

@maropu
Member

maropu commented Aug 7, 2019

btw, mysql/presto doesn't have interval as an exposed type for casts.

@dongjoon-hyun
Member

Yep. It's different and also PostgreSQL has additional variants like #25225 (comment) .

@yaooqinn
Member

Hi, this PR seems to expose the interval type in table schemas when creating and altering tables. In 2.4 and earlier, HiveClientImpl.verifyColumnDataType fails those commands when they involve intervals. In #27277, I'd like to restore that behavior and fix the related unexpected behaviors, keeping the interval type internal while preserving the expected cast behavior involved here. Thanks.
