[SPARK-28690][SQL] Add date_part function for timestamps/dates #25410

Closed · wants to merge 28 commits into from

Commits (28)
cc81650
Add the date_part() function
MaxGekk Aug 11, 2019
0e571be
Merge remote-tracking branch 'remotes/origin/master' into date_part
MaxGekk Aug 11, 2019
b856859
Support millennium, century and decade by date_part()
MaxGekk Aug 11, 2019
d23b8ba
Uncomment the first usage of date_part()
MaxGekk Aug 11, 2019
af51e52
Reuse date_part() from extract
MaxGekk Aug 11, 2019
e68611a
Add a description for DatePart
MaxGekk Aug 12, 2019
e1d5a75
Merge remote-tracking branch 'remotes/origin/master' into date_part
MaxGekk Aug 14, 2019
ae353b9
Regen extract.sql.out
MaxGekk Aug 14, 2019
f3b2772
Regen timestamp.sql.out
MaxGekk Aug 14, 2019
b795ebc
Add synonyms for field
MaxGekk Aug 14, 2019
efc3ee0
Regen date_part.sql.out
MaxGekk Aug 14, 2019
188d7de
Merge remote-tracking branch 'remotes/origin/master' into date_part
MaxGekk Aug 14, 2019
9935c01
Add isoyear, milliseconds, microseconds and epoch
MaxGekk Aug 14, 2019
c918eb0
Re-gen extract.sql.out
MaxGekk Aug 14, 2019
bcf73d2
Re-gen date.sql.out
MaxGekk Aug 14, 2019
7c549c7
Uncomment 2 queries in timestamp.sql
MaxGekk Aug 15, 2019
1b2c8d4
List all values of field
MaxGekk Aug 15, 2019
2fa25b1
Put synonyms to ()
MaxGekk Aug 16, 2019
494679c
Remove unneeded backquotes
MaxGekk Aug 16, 2019
5c8c34d
Remove backquotes around year, month, day, hour, minute and second
MaxGekk Aug 16, 2019
05b3746
Revert "Remove backquotes around year, month, day, hour, minute and s…
MaxGekk Aug 17, 2019
74a7fec
Turn off ansi mode for a query
MaxGekk Aug 23, 2019
bc707df
Fix
maropu Aug 16, 2019
d86c092
Re-gen date_part.sql.out
MaxGekk Aug 23, 2019
ade541d
Re-gen extract.sql.out
MaxGekk Aug 23, 2019
43968c2
Change type of errorHandleFunc to Nothing
MaxGekk Sep 4, 2019
b292931
Merge remote-tracking branch 'origin/master' into date_part
MaxGekk Sep 4, 2019
600eee6
Use backquotes around year, month, day, hour, minute and second
MaxGekk Sep 6, 2019
@@ -417,6 +417,7 @@ object FunctionRegistry {
expression[TimeWindow]("window"),
expression[MakeDate]("make_date"),
expression[MakeTimestamp]("make_timestamp"),
expression[DatePart]("date_part"),

// collection functions
expression[CreateArray]("array"),
@@ -1963,3 +1963,91 @@ case class Epoch(child: Expression, timeZoneId: Option[String] = None)
defineCodeGen(ctx, ev, c => s"$dtu.getEpoch($c, $zid)")
}
}

object DatePart {

def parseExtractField(
extractField: String,
source: Expression,
errorHandleFunc: => Unit): Expression = extractField.toUpperCase(Locale.ROOT) match {
case "MILLENNIUM" | "MILLENNIA" | "MIL" | "MILS" => Millennium(source)
case "CENTURY" | "CENTURIES" | "C" | "CENT" => Century(source)
case "DECADE" | "DECADES" | "DEC" | "DECS" => Decade(source)
case "YEAR" | "Y" | "YEARS" | "YR" | "YRS" => Year(source)
case "ISOYEAR" => IsoYear(source)
case "QUARTER" | "QTR" => Quarter(source)
case "MONTH" | "MON" | "MONS" | "MONTHS" => Month(source)
case "WEEK" | "W" | "WEEKS" => WeekOfYear(source)
case "DAY" | "D" | "DAYS" => DayOfMonth(source)
case "DAYOFWEEK" => DayOfWeek(source)
case "DOW" => Subtract(DayOfWeek(source), Literal(1))
case "ISODOW" => Add(WeekDay(source), Literal(1))
case "DOY" => DayOfYear(source)
case "HOUR" | "H" | "HOURS" | "HR" | "HRS" => Hour(source)
case "MINUTE" | "M" | "MIN" | "MINS" | "MINUTES" => Minute(source)
case "SECOND" | "S" | "SEC" | "SECONDS" | "SECS" => Second(source)
case "MILLISECONDS" | "MSEC" | "MSECS" | "MILLISECON" | "MSECONDS" | "MS" =>
Milliseconds(source)
case "MICROSECONDS" | "USEC" | "USECS" | "USECONDS" | "MICROSECON" | "US" =>
Microseconds(source)
case "EPOCH" => Epoch(source)
case other =>
errorHandleFunc.asInstanceOf[Expression]
}
}
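Note the smell the reviewers flag here: `errorHandleFunc` is typed `=> Unit`, which forces the `asInstanceOf[Expression]` cast in the fallback case; commit 43968c2 in this PR later retypes it to `Nothing`, so a throwing block type-checks directly. A rough Java analogue of the same design — the caller supplies how to fail, so one parsing routine can raise a ParseException from the parser and an AnalysisException elsewhere — using a hypothetical `Supplier`-based hook (names are illustrative, not Spark's API):

```java
import java.util.function.Supplier;

public class ErrorHook {
    // Sketch of the callback pattern: the match logic stays shared, while
    // each call site decides which exception type the fallback case throws.
    static String parseField(String field, Supplier<RuntimeException> onError) {
        switch (field.toUpperCase()) {
            case "YEAR": return "Year(source)";
            case "DOY":  return "DayOfYear(source)";
            default:     throw onError.get(); // caller-chosen failure
        }
    }

    public static void main(String[] args) {
        System.out.println(parseField("doy",
            () -> new IllegalArgumentException("unsupported field")));
    }
}
```

In Scala the by-name `=> Nothing` version achieves the same without a lambda, because `Nothing` is a subtype of every type.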

@ExpressionDescription(
usage = "_FUNC_(field, source) - Extracts a part of the date/timestamp.",
arguments = """
Arguments:
* field - selects which part of the source should be extracted. Supported string values are:
["MILLENNIUM", ("MILLENNIA", "MIL", "MILS"),
"CENTURY", ("CENTURIES", "C", "CENT"),
"DECADE", ("DECADES", "DEC", "DECS"),
"YEAR", ("Y", "YEARS", "YR", "YRS"),
"ISOYEAR",
"QUARTER", ("QTR"),
"MONTH", ("MON", "MONS", "MONTHS"),
"WEEK", ("W", "WEEKS"),
"DAY", ("D", "DAYS"),
"DAYOFWEEK",
"DOW",
"ISODOW",
"DOY",
"HOUR", ("H", "HOURS", "HR", "HRS"),
"MINUTE", ("M", "MIN", "MINS", "MINUTES"),
"SECOND", ("S", "SEC", "SECONDS", "SECS"),
"MILLISECONDS", ("MSEC", "MSECS", "MILLISECON", "MSECONDS", "MS"),
"MICROSECONDS", ("USEC", "USECS", "USECONDS", "MICROSECON", "US"),
"EPOCH"]
* source - a date (or timestamp) column from where `field` should be extracted
""",
examples = """
Examples:
> SELECT _FUNC_('YEAR', TIMESTAMP '2019-08-12 01:00:00.123456');
2019
> SELECT _FUNC_('week', timestamp'2019-08-12 01:00:00.123456');
33
> SELECT _FUNC_('doy', DATE'2019-08-12');
224
""",
since = "3.0.0")
case class DatePart(field: Expression, source: Expression, child: Expression)
extends RuntimeReplaceable {

def this(field: Expression, source: Expression) {
this(field, source, {
if (!field.foldable) {
throw new AnalysisException("The field parameter needs to be a foldable string value.")
}
val fieldStr = field.eval().asInstanceOf[UTF8String].toString
DatePart.parseExtractField(fieldStr, source, {
throw new AnalysisException(s"Literals of type '$fieldStr' are currently not supported.")
})
})
}

override def flatArguments: Iterator[Any] = Iterator(field, source)
override def sql: String = s"$prettyName(${field.sql}, ${source.sql})"
override def prettyName: String = "date_part"
}
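The expected values in the `@ExpressionDescription` examples (year → 2019, week → 33, doy → 224 for 2019-08-12) can be sanity-checked with plain java.time, which Spark 3.0's datetime internals build on. A minimal sketch — an illustration of a few fields and the PostgreSQL-style dow/isodow numbering, not Spark's implementation:

```java
import java.time.LocalDate;
import java.time.temporal.WeekFields;

public class DatePartSketch {
    // Resolve a field name (with a few of the PR's aliases) to a value.
    static int datePart(String field, LocalDate d) {
        switch (field.toUpperCase()) {
            case "YEAR": case "Y": case "YEARS": case "YR": case "YRS":
                return d.getYear();
            case "WEEK": case "W": case "WEEKS":
                return d.get(WeekFields.ISO.weekOfWeekBasedYear());
            case "DOY":
                return d.getDayOfYear();
            case "DOW":    // PostgreSQL dow: 0 = Sunday .. 6 = Saturday
                return d.getDayOfWeek().getValue() % 7;
            case "ISODOW": // PostgreSQL isodow: 1 = Monday .. 7 = Sunday
                return d.getDayOfWeek().getValue();
            default:
                throw new IllegalArgumentException(
                    "Literals of type '" + field + "' are currently not supported.");
        }
    }

    public static void main(String[] args) {
        LocalDate d = LocalDate.of(2019, 8, 12); // a Monday
        System.out.println(datePart("year", d)); // 2019
        System.out.println(datePart("week", d)); // 33
        System.out.println(datePart("doy", d));  // 224
    }
}
```

The `% 7` for DOW mirrors the `Subtract(DayOfWeek(source), Literal(1))` rewrite above: both shift a 1-based week-day index to PostgreSQL's Sunday-0 convention.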
@@ -1409,48 +1409,12 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging
* Create a Extract expression.
*/
override def visitExtract(ctx: ExtractContext): Expression = withOrigin(ctx) {
ctx.field.getText.toUpperCase(Locale.ROOT) match {
case "MILLENNIUM" | "MILLENNIA" | "MIL" | "MILS" =>
Millennium(expression(ctx.source))
case "CENTURY" | "CENTURIES" | "C" | "CENT" =>
Century(expression(ctx.source))
case "DECADE" | "DECADES" | "DEC" | "DECS" =>
Decade(expression(ctx.source))
case "YEAR" | "Y" | "YEARS" | "YR" | "YRS" =>
Year(expression(ctx.source))
case "ISOYEAR" =>
IsoYear(expression(ctx.source))
case "QUARTER" | "QTR" =>
Quarter(expression(ctx.source))
case "MONTH" | "MON" | "MONS" | "MONTHS" =>
Month(expression(ctx.source))
case "WEEK" | "W" | "WEEKS" =>
WeekOfYear(expression(ctx.source))
case "DAY" | "D" | "DAYS" =>
DayOfMonth(expression(ctx.source))
case "DAYOFWEEK" =>
DayOfWeek(expression(ctx.source))
case "DOW" =>
Subtract(DayOfWeek(expression(ctx.source)), Literal(1))
case "ISODOW" =>
Add(WeekDay(expression(ctx.source)), Literal(1))
case "DOY" =>
DayOfYear(expression(ctx.source))
case "HOUR" | "H" | "HOURS" | "HR" | "HRS" =>
Hour(expression(ctx.source))
case "MINUTE" | "M" | "MIN" | "MINS" | "MINUTES" =>
Minute(expression(ctx.source))
case "SECOND" | "S" | "SEC" | "SECONDS" | "SECS" =>
Second(expression(ctx.source))
case "MILLISECONDS" | "MSEC" | "MSECS" | "MILLISECON" | "MSECONDS" | "MS" =>
Milliseconds(expression(ctx.source))
case "MICROSECONDS" | "USEC" | "USECS" | "USECONDS" | "MICROSECON" | "US" =>
Microseconds(expression(ctx.source))
case "EPOCH" =>
Epoch(expression(ctx.source))
case other =>
throw new ParseException(s"Literals of type '$other' are currently not supported.", ctx)
}

Review thread on the error handling in the fallback case:

Reviewer: The exception changed from ParseException to AnalysisException? Can we keep the current behaviour?

MaxGekk (author): ParseException requires either ctx: ParserRuleContext or val start: Origin, val stop: Origin, which are not available to me at this point. And it is still a ParseException in the output: https://github.com/apache/spark/pull/25410/files#diff-6f4edc80e2cc973e748705e85a6053b4R514

MaxGekk (author): ... but I can pass ctx: ParserRuleContext to DatePart's constructor. I am just not sure that it is good practice. WDYT?

Reviewer: Instead of passing ctx into DatePart, can't we resolve extractField in AstBuilder and then pass the resolved expression into DatePart?

MaxGekk (author): This will work for extract, but how about the date_part() function itself? The same code should be present for the function, shouldn't it?

Reviewer: Ah, I see! Sorry, I need to be away from a keyboard, so I'll recheck in a few hours.

Reviewer: I've heard that it's a little hard to read parser error messages in Spark. So I personally think we should throw ParseException with ctx for the extract case as much as possible. Then, how about writing it like this? d793f49

MaxGekk (author): I agree it would be nice to point out the exact position of the error, but I don't like this ad-hoc solution. There are a few similar places in datetimeExpressions and elsewhere where we could throw ParseException instead of just AnalysisException. I believe we need a common approach. Why do you think that propagating the parsing context or positions looks bad? We could propagate such parameters to all expression constructors.

maropu (Aug 17, 2019): I don't think the approach is bad; I'm just looking for a better solution to keep the current error handling. If we already have similar logic, can we follow that?

MaxGekk (author): @maropu @dongjoon-hyun If you think the commit d793f49 is necessary for merging the PR, I will apply the patch.
val fieldStr = ctx.field.getText
val source = expression(ctx.source)
val extractField = DatePart.parseExtractField(fieldStr, source, {
throw new ParseException(s"Literals of type '$fieldStr' are currently not supported.", ctx)
})
new DatePart(Literal(fieldStr), expression(ctx.source), extractField)
}

/**
sql/core/src/test/resources/sql-tests/inputs/date_part.sql (new file, 68 additions)
@@ -0,0 +1,68 @@
CREATE TEMPORARY VIEW t AS select '2011-05-06 07:08:09.1234567' as c;

select date_part('millennium', c) from t;
select date_part('millennia', c) from t;
select date_part('mil', c) from t;
select date_part('mils', c) from t;

select date_part('century', c) from t;
select date_part('centuries', c) from t;
select date_part('c', c) from t;
select date_part('cent', c) from t;

select date_part('decade', c) from t;
select date_part('decades', c) from t;
select date_part('dec', c) from t;
select date_part('decs', c) from t;

select date_part('year', c) from t;
select date_part('y', c) from t;
select date_part('years', c) from t;
select date_part('yr', c) from t;
select date_part('yrs', c) from t;

select date_part('quarter', c) from t;
select date_part('qtr', c) from t;

select date_part('month', c) from t;
select date_part('mon', c) from t;
select date_part('mons', c) from t;
select date_part('months', c) from t;

select date_part('week', c) from t;
select date_part('w', c) from t;
select date_part('weeks', c) from t;

select date_part('day', c) from t;
select date_part('d', c) from t;
select date_part('days', c) from t;

select date_part('dayofweek', c) from t;

select date_part('dow', c) from t;

select date_part('isodow', c) from t;

select date_part('doy', c) from t;

select date_part('hour', c) from t;
select date_part('h', c) from t;
select date_part('hours', c) from t;
select date_part('hr', c) from t;
select date_part('hrs', c) from t;

select date_part('minute', c) from t;
select date_part('m', c) from t;
select date_part('min', c) from t;
select date_part('mins', c) from t;
select date_part('minutes', c) from t;

select date_part('second', c) from t;
select date_part('s', c) from t;
select date_part('sec', c) from t;
select date_part('seconds', c) from t;
select date_part('secs', c) from t;

select date_part('not_supported', c) from t;

select date_part(c, c) from t;
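Assuming the PostgreSQL-compatible semantics this PR targets, the coarse calendar fields queried above work out as follows for the view's year 2011; the authoritative expected values are generated into date_part.sql.out. A sketch of the arithmetic, valid for positive (CE) years only:

```java
public class CoarseFields {
    // PostgreSQL-style derivations (assumed here, not taken from Spark's code):
    static int millennium(int year) { return (year + 999) / 1000; } // 2001..3000 -> 3
    static int century(int year)    { return (year + 99) / 100; }   // 2001..2100 -> 21
    static int decade(int year)     { return year / 10; }           // 2011 -> 201
    static int quarter(int month)   { return (month - 1) / 3 + 1; } // May -> 2

    public static void main(String[] args) {
        // For the view's timestamp '2011-05-06 07:08:09.1234567':
        System.out.println(millennium(2011)); // 3
        System.out.println(century(2011));    // 21
        System.out.println(decade(2011));     // 201
        System.out.println(quarter(5));       // 2
    }
}
```

Note the off-by-one convention: year 2000 is still the 2nd millennium and 20th century, which is why the formulas round up from year - 1 rather than dividing the year directly.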
sql/core/src/test/resources/sql-tests/inputs/pgSQL/timestamp.sql (17 additions, 17 deletions)
@@ -186,23 +186,23 @@ SELECT '' AS date_trunc_week, date_trunc( 'week', timestamp '2004-02-29 15:44:17
-- FROM TIMESTAMP_TBL
-- WHERE d1 BETWEEN timestamp '1902-01-01'
-- AND timestamp '2038-01-01';

-- [SPARK-28420] Date/Time Functions: date_part
-- SELECT '' AS "54", d1 as "timestamp",
-- date_part( 'year', d1) AS year, date_part( 'month', d1) AS month,
-- date_part( 'day', d1) AS day, date_part( 'hour', d1) AS hour,
-- date_part( 'minute', d1) AS minute, date_part( 'second', d1) AS second
-- FROM TIMESTAMP_TBL WHERE d1 BETWEEN '1902-01-01' AND '2038-01-01';

-- SELECT '' AS "54", d1 as "timestamp",
-- date_part( 'quarter', d1) AS quarter, date_part( 'msec', d1) AS msec,
-- date_part( 'usec', d1) AS usec
-- FROM TIMESTAMP_TBL WHERE d1 BETWEEN '1902-01-01' AND '2038-01-01';

-- SELECT '' AS "54", d1 as "timestamp",
-- date_part( 'isoyear', d1) AS isoyear, date_part( 'week', d1) AS week,
-- date_part( 'dow', d1) AS dow
-- FROM TIMESTAMP_TBL WHERE d1 BETWEEN '1902-01-01' AND '2038-01-01';
-- [SPARK-28767] ParseException: no viable alternative at input 'year'
set spark.sql.parser.ansi.enabled=false;
Review thread on disabling ansi mode here:

Contributor: What are we doing here? This test is for timestamp, but why do we test the parser?

Reviewer: To use year as an alias name in the query below, it just turns off the ansi mode temporarily; year cannot be used as an alias name with ansi=true because it is a reserved keyword: https://github.com/apache/spark/pull/25410/files#r314599685

Contributor: Can't we just quote it? e.g. select 1 as 'year'

MaxGekk (author): We can quote or set the variable. Please take a look at the comments: https://github.com/apache/spark/pull/25410/files/af51e524d90253d26dc848d4776328c5f8359d88#r314593244 . Do you think it is better to use backquotes instead of setting the variable?

Reviewer: Yea, quoting looks ok to me.

Contributor: I'd like to quote it, to not distract people from the timestamp tests.

Contributor: Does pgSQL quote it in its test?

Reviewer: In pgSQL, year is not reserved, so we can use it as an alias name: https://www.postgresql.org/docs/11/sql-keywords-appendix.html
Even if it were reserved, we could still use it:

postgres=# select 1 as year;
 year 
------
    1
(1 row)

postgres=# create table year(t int);
CREATE TABLE
postgres=# select 1 as select;
 select 
--------
      1
(1 row)

postgres=# create table select(t int);
2019-09-06 14:44:35.490 JST [6166] ERROR:  syntax error at or near "select" at character 14
2019-09-06 14:44:35.490 JST [6166] STATEMENT:  create table select(t int);
ERROR:  syntax error at or near "select"
LINE 1: create table select(t int);

MaxGekk (author): > I'd like to quote it, to not distract people from the timestamp tests

@dongjoon-hyun I hope you will not be too unhappy if I use backquotes again here.

SELECT '' AS `54`, d1 as `timestamp`,
Review thread on backquoting the aliases:

Reviewer: Oops. What I meant was backquoting for a special value like 54. For the other values like year and month, we don't need backquoting. So:

  1. For the places where PostgreSQL uses AS "..", we can use backquoting.
  2. For the places where PostgreSQL uses AS xxx, we don't need backquoting either.

MaxGekk (author): Thanks. I will remove the backquoting.

MaxGekk (author): @dongjoon-hyun When I removed them, I got the error:

-- !query 13
SELECT '' AS `54`, d1 as `timestamp`,
    date_part( 'year', d1) AS year, date_part( 'month', d1) AS month,
    date_part( 'day', d1) AS day, date_part( 'hour', d1) AS hour,
    date_part( 'minute', d1) AS minute, date_part( 'second', d1) AS second
    FROM TIMESTAMP_TBL WHERE d1 BETWEEN '1902-01-01' AND '2038-01-01'
-- !query 13 schema
struct<>
-- !query 13 output
org.apache.spark.sql.catalyst.parser.ParseException

no viable alternative at input 'year'(line 2, pos 30)

== SQL ==
SELECT '' AS `54`, d1 as `timestamp`,
    date_part( 'year', d1) AS year, date_part( 'month', d1) AS month,
------------------------------^^^
    date_part( 'day', d1) AS day, date_part( 'hour', d1) AS hour,
    date_part( 'minute', d1) AS minute, date_part( 'second', d1) AS second
    FROM TIMESTAMP_TBL WHERE d1 BETWEEN '1902-01-01' AND '2038-01-01'

MaxGekk (author, Aug 16, 2019): I would guess year belongs to a function as well. Probably this confuses the parser?

MaxGekk (author): @dongjoon-hyun I removed backquotes only around names that are not functions.

Reviewer: @MaxGekk, the master branch seems to behave the same. Could you push your change?

spark-sql> select 1 as year;
1

MaxGekk (author): I reverted the backquotes and opened a JIRA for the issue: https://issues.apache.org/jira/browse/SPARK-28767

MaxGekk (author): This is because spark.sql.parser.ansi.enabled is set to false by default:

spark-sql> set spark.sql.parser.ansi.enabled=true;
spark.sql.parser.ansi.enabled	true
spark-sql> select 1 as year;
Error in query: 
no viable alternative at input 'year'(line 1, pos 12)

== SQL ==
select 1 as year
------------^^^

maropu (Aug 23, 2019): Have you checked #25114 (comment)? I think that's an expected behaviour in the ansi mode.

date_part( 'year', d1) AS year, date_part( 'month', d1) AS month,
date_part( 'day', d1) AS day, date_part( 'hour', d1) AS hour,
date_part( 'minute', d1) AS minute, date_part( 'second', d1) AS second
FROM TIMESTAMP_TBL WHERE d1 BETWEEN '1902-01-01' AND '2038-01-01';
set spark.sql.parser.ansi.enabled=true;
SELECT '' AS `54`, d1 as `timestamp`,
date_part( 'quarter', d1) AS quarter, date_part( 'msec', d1) AS msec,
date_part( 'usec', d1) AS usec
FROM TIMESTAMP_TBL WHERE d1 BETWEEN '1902-01-01' AND '2038-01-01';

SELECT '' AS `54`, d1 as `timestamp`,
date_part( 'isoyear', d1) AS isoyear, date_part( 'week', d1) AS week,
date_part( 'dow', d1) AS dow
FROM TIMESTAMP_TBL WHERE d1 BETWEEN '1902-01-01' AND '2038-01-01';

-- [SPARK-28137] Data Type Formatting Functions
-- TO_CHAR()