
[SPARK-28690][SQL] Add date_part function for timestamps/dates #25410

Closed
wants to merge 28 commits

Conversation

MaxGekk
Member

@MaxGekk MaxGekk commented Aug 11, 2019

What changes were proposed in this pull request?

In the PR, I propose a new function date_part(). The function is modeled on the traditional Ingres equivalent to the SQL-standard function extract:

date_part('field', source)

and added for feature parity with PostgreSQL (https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT).

The source can have DATE or TIMESTAMP type. Supported string values of 'field' are:

  • millennium - the current millennium for a given date (or a timestamp implicitly cast to a date). For example, years in the 1900s are in the second millennium. The third millennium started January 1, 2001.
  • century - the current century for a given date (or timestamp). The first century starts at 0001-01-01 AD.
  • decade - the current decade for a given date (or timestamp). Actually, this is the year field divided by 10.
  • isoyear - the ISO 8601 week-numbering year that the date falls in. Each ISO 8601 week-numbering year begins with the Monday of the week containing the 4th of January.
  • year, month, day, hour, minute, second
  • week - the number of the ISO 8601 week-numbering week of the year. By definition, ISO weeks start on Mondays and the first week of a year contains January 4 of that year.
  • quarter - the quarter of the year (1 - 4)
  • dayofweek - the day of the week for date/timestamp (1 = Sunday, 2 = Monday, ..., 7 = Saturday)
  • dow - the day of the week as Sunday (0) to Saturday (6)
  • isodow - the day of the week as Monday (1) to Sunday (7)
  • doy - the day of the year (1 - 365/366)
  • milliseconds - the seconds field including fractional parts multiplied by 1,000.
  • microseconds - the seconds field including fractional parts multiplied by 1,000,000.
  • epoch - the number of seconds (with microsecond precision) since 1970-01-01 00:00:00 local time.

Here are examples:

spark-sql> select date_part('year', timestamp'2019-08-12 01:00:00.123456');
2019
spark-sql> select date_part('week', timestamp'2019-08-12 01:00:00.123456');
33
spark-sql> select date_part('doy', timestamp'2019-08-12 01:00:00.123456');
224

I changed the implementation of extract to re-use date_part() internally.
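
Since extract now delegates to date_part(), the two should return the same value for the same field; a minimal sanity check (a sketch, not copied from the added test files):

spark-sql> select extract(week from timestamp'2019-08-12 01:00:00.123456');
33
spark-sql> select date_part('week', timestamp'2019-08-12 01:00:00.123456');
33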

How was this patch tested?

Added date_part.sql and regenerated results of extract.sql.

@SparkQA

SparkQA commented Aug 11, 2019

Test build #108938 has finished for PR 25410 at commit af51e52.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 12, 2019

Test build #108953 has finished for PR 25410 at commit e68611a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Aug 12, 2019

jenkins, retest this, please

@SparkQA

SparkQA commented Aug 12, 2019

Test build #108957 has finished for PR 25410 at commit e68611a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Could you rebase this to the master please, @MaxGekk ?

@SparkQA

SparkQA commented Aug 14, 2019

Test build #109110 has finished for PR 25410 at commit efc3ee0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 14, 2019

Test build #109119 has finished for PR 25410 at commit bcf73d2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-28690][SQL] Add the date_part function for timestamps and dates [SPARK-28690][SQL] Add date_part function for timestamps/dates Aug 14, 2019
@MaxGekk
Member Author

MaxGekk commented Sep 3, 2019

@dongjoon-hyun Please, take a look at the PR when you have time.

@MaxGekk
Member Author

MaxGekk commented Sep 4, 2019

@cloud-fan @HyukjinKwon @srowen Could you take a look at the PR, please.

@SparkQA

SparkQA commented Sep 4, 2019

Test build #110132 has finished for PR 25410 at commit b292931.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class FileCommitProtocol extends Logging
  • trait ResourceAllocator
  • class RpcAbortException(message: String) extends Exception(message)
  • public final class JavaStructuredKerberizedKafkaWordCount
  • public final class JavaDirectKerberizedKafkaWordCount
  • class BindingParquetOutputCommitter(
  • class PathOutputCommitProtocol(
  • class DecisionTreeParams(Params):
  • case class UnresolvedTable(v1Table: CatalogTable) extends Table
  • implicit class IdentifierHelper(identifier: TableIdentifier)
  • class CatalogManager(conf: SQLConf) extends Logging
  • case class ShowTables(
  • trait AlterTableStatement extends ParsedStatement
  • case class ShowTablesStatement(namespace: Option[Seq[String]], pattern: Option[String])
  • case class ReuseAdaptiveSubquery(
  • trait CostEvaluator
  • case class SimpleCost(value: Long) extends Cost
  • case class DescribeTableExec(
  • case class ShowTablesExec(
  • case class AppendDataExecV1(
  • case class OverwriteByExpressionExecV1(
  • sealed trait V1FallbackWriters extends SupportsV1Write
  • protected implicit class toV1WriteBuilder(builder: WriteBuilder)
  • trait SupportsV1Write extends SparkPlan
  • class V2SessionCatalog(sessionState: SessionState) extends TableCatalog with SupportsNamespaces
  • trait V1WriteBuilder extends WriteBuilder

-- date_part( 'dow', d1) AS dow
-- FROM TIMESTAMP_TBL WHERE d1 BETWEEN '1902-01-01' AND '2038-01-01';
-- [SPARK-28767] ParseException: no viable alternative at input 'year'
set spark.sql.parser.ansi.enabled=false;
Contributor

what are we doing here? This test is for timestamp but why do we test the parser?

Member

To allow year to be used as an alias name in the query below, the test just turns off ANSI mode temporarily;
year cannot be used as an alias name with ansi=true because it is a reserved keyword: https://github.com/apache/spark/pull/25410/files#r314599685

Contributor

can't we just quote it? e.g. select 1 as 'year'

Member Author

We can quote or set the variable. Please, take a look at the comments: https://github.com/apache/spark/pull/25410/files/af51e524d90253d26dc848d4776328c5f8359d88#r314593244 . Do you think it is better to use backquotes instead of setting the variable?

Member

yea, quoting looks ok to me.

Contributor

I'd like to quote it, to not distract people from the timestamp tests

Contributor

does pgsql quote it in its test?

Member

In pgSQL, year is not reserved, so we can use it as an alias name.
https://www.postgresql.org/docs/11/sql-keywords-appendix.html
Even if it's reserved, we can still use it, though:

postgres=# select 1 as year;
 year 
------
    1
(1 row)

postgres=# create table year(t int);
CREATE TABLE
postgres=# select 1 as select;
 select 
--------
      1
(1 row)

postgres=# create table select(t int);
2019-09-06 14:44:35.490 JST [6166] ERROR:  syntax error at or near "select" at character 14
2019-09-06 14:44:35.490 JST [6166] STATEMENT:  create table select(t int);
ERROR:  syntax error at or near "select"
LINE 1: create table select(t int);

Member Author

I'd like to quote it, to not distract people from the timestamp tests

@dongjoon-hyun I hope you will not be too unhappy if I use backquotes again here.
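
For reference, a minimal sketch of how the backquoted alias could look in the test (a hypothetical line, not the exact test content):

spark-sql> select date_part('year', timestamp'2019-08-12 01:00:00.123456') as `year`;
2019

With ANSI mode on, year is reserved, so the backquotes make it usable as an alias without touching spark.sql.parser.ansi.enabled.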

@SparkQA

SparkQA commented Sep 6, 2019

Test build #110225 has finished for PR 25410 at commit 600eee6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM.

@maropu do you know why Spark treats year as a reserved keyword? I thought we followed pgsql at that time...

@maropu
Member

maropu commented Sep 6, 2019

As for reserved keywords, we just follow the SQL:2011 standard, and it reserves year. In fact, I don't know why pgSQL doesn't reserve year...

@maropu maropu closed this in 67b4329 Sep 6, 2019
@maropu
Member

maropu commented Sep 6, 2019

Thanks, max! Merged to master.

@MaxGekk
Member Author

MaxGekk commented Sep 6, 2019

@maropu @dongjoon-hyun @cloud-fan Thank you for your review.

maropu pushed a commit that referenced this pull request Sep 12, 2019
### What changes were proposed in this pull request?

In the PR, I propose to extend `ExtractBenchmark` and add new benchmarks for:
- `EXTRACT` and `DATE` as input column
- the `DATE_PART` function and `DATE`/`TIMESTAMP` input column

### Why are the changes needed?

The `EXTRACT` expression was rebased on the `DATE_PART` expression in PR #25410, where some of the sub-expressions take a `DATE` column as the input (`Millennium`, `Year`, etc.) while others require a `TIMESTAMP` column (`Hour`, `Minute`). Separate benchmarks for `DATE` should exclude the overhead of implicit `DATE` <-> `TIMESTAMP` conversions.
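
For illustration, the two kinds of input the benchmark separates (a sketch; the actual benchmark queries may differ):

```sql
-- DATE input: no implicit DATE -> TIMESTAMP conversion is needed for date-based fields
SELECT DATE_PART('year', DATE'2019-08-12');
-- TIMESTAMP input: required for time-based fields such as 'hour'
SELECT DATE_PART('hour', TIMESTAMP'2019-08-12 01:00:00.123456');
```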

### Does this PR introduce any user-facing change?

No, it doesn't.

### How was this patch tested?
- Regenerated results of `ExtractBenchmark`

Closes #25772 from MaxGekk/date_part-benchmark.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
PavithraRamachandran pushed a commit to PavithraRamachandran/spark that referenced this pull request Sep 15, 2019
## What changes were proposed in this pull request?

In the PR, I propose a new function `date_part()`. The function is modeled on the traditional Ingres equivalent to the SQL-standard function `extract`:
```
date_part('field', source)
```
and added for feature parity with PostgreSQL (https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT).

The `source` can have `DATE` or `TIMESTAMP` type. Supported string values of `'field'` are:
- `millennium` - the current millennium for a given date (or a timestamp implicitly cast to a date). For example, years in the 1900s are in the second millennium. The third millennium started _January 1, 2001_.
- `century` - the current century for a given date (or timestamp). The first century starts at 0001-01-01 AD.
- `decade` - the current decade for a given date (or timestamp). Actually, this is the year field divided by 10.
- `isoyear` - the ISO 8601 week-numbering year that the date falls in. Each ISO 8601 week-numbering year begins with the Monday of the week containing the 4th of January.
- `year`, `month`, `day`, `hour`, `minute`, `second`
- `week` - the number of the ISO 8601 week-numbering week of the year. By definition, ISO weeks start on Mondays and the first week of a year contains January 4 of that year.
- `quarter` - the quarter of the year (1 - 4)
- `dayofweek` - the day of the week for date/timestamp (1 = Sunday, 2 = Monday, ..., 7 = Saturday)
- `dow` - the day of the week as Sunday (0) to Saturday (6)
- `isodow` - the day of the week as Monday (1) to Sunday (7)
- `doy` - the day of the year (1 - 365/366)
- `milliseconds` - the seconds field including fractional parts multiplied by 1,000.
- `microseconds` - the seconds field including fractional parts multiplied by 1,000,000.
- `epoch` - the number of seconds (with microsecond precision) since 1970-01-01 00:00:00 local time.

Here are examples:
```sql
spark-sql> select date_part('year', timestamp'2019-08-12 01:00:00.123456');
2019
spark-sql> select date_part('week', timestamp'2019-08-12 01:00:00.123456');
33
spark-sql> select date_part('doy', timestamp'2019-08-12 01:00:00.123456');
224
```

I changed the implementation of `extract` to re-use `date_part()` internally.

## How was this patch tested?

Added `date_part.sql` and regenerated results of `extract.sql`.

Closes apache#25410 from MaxGekk/date_part.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
PavithraRamachandran pushed a commit to PavithraRamachandran/spark that referenced this pull request Sep 15, 2019
### What changes were proposed in this pull request?

In the PR, I propose to extend `ExtractBenchmark` and add new benchmarks for:
- `EXTRACT` and `DATE` as input column
- the `DATE_PART` function and `DATE`/`TIMESTAMP` input column

### Why are the changes needed?

The `EXTRACT` expression was rebased on the `DATE_PART` expression in PR apache#25410, where some of the sub-expressions take a `DATE` column as the input (`Millennium`, `Year`, etc.) while others require a `TIMESTAMP` column (`Hour`, `Minute`). Separate benchmarks for `DATE` should exclude the overhead of implicit `DATE` <-> `TIMESTAMP` conversions.

### Does this PR introduce any user-facing change?

No, it doesn't.

### How was this patch tested?
- Regenerated results of `ExtractBenchmark`

Closes apache#25772 from MaxGekk/date_part-benchmark.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
@MaxGekk MaxGekk deleted the date_part branch October 5, 2019 19:17