
[SPARK-28690][SQL] Add date_part function for timestamps/dates #25410

Closed
wants to merge 28 commits

Conversation

MaxGekk
Member

@MaxGekk MaxGekk commented Aug 11, 2019

What changes were proposed in this pull request?

In the PR, I propose a new function date_part(). The function is modeled on the traditional Ingres equivalent to the SQL-standard function extract:

date_part('field', source)

and added for feature parity with PostgreSQL (https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT).

The source can have DATE or TIMESTAMP type. Supported string values of 'field' are:

  • millennium - the current millennium for a given date (or a timestamp implicitly cast to a date). For example, years in the 1900s are in the second millennium. The third millennium started January 1, 2001.
  • century - the current century for a given date (or timestamp). The first century starts at 0001-01-01 AD.
  • decade - the current decade for a given date (or timestamp). Actually, this is the year field divided by 10.
  • isoyear - the ISO 8601 week-numbering year that the date falls in. Each ISO 8601 week-numbering year begins with the Monday of the week containing the 4th of January.
  • year, month, day, hour, minute, second
  • week - the number of the ISO 8601 week-numbering week of the year. By definition, ISO weeks start on Mondays and the first week of a year contains January 4 of that year.
  • quarter - the quarter of the year (1 - 4)
  • dayofweek - the day of the week for date/timestamp (1 = Sunday, 2 = Monday, ..., 7 = Saturday)
  • dow - the day of the week as Sunday (0) to Saturday (6)
  • isodow - the day of the week as Monday (1) to Sunday (7)
  • doy - the day of the year (1 - 365/366)
  • milliseconds - the seconds field including fractional parts multiplied by 1,000.
  • microseconds - the seconds field including fractional parts multiplied by 1,000,000.
  • epoch - the number of seconds (with microsecond precision) since 1970-01-01 00:00:00 local time.

Here are examples:

spark-sql> select date_part('year', timestamp'2019-08-12 01:00:00.123456');
2019
spark-sql> select date_part('week', timestamp'2019-08-12 01:00:00.123456');
33
spark-sql> select date_part('doy', timestamp'2019-08-12 01:00:00.123456');
224

I changed the implementation of extract to re-use date_part() internally.
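
Since extract now delegates to date_part(), the two should return the same value for the same field; a minimal sanity check (a sketch, not copied from the added test files):

spark-sql> select extract(week from timestamp'2019-08-12 01:00:00.123456');
33
spark-sql> select date_part('week', timestamp'2019-08-12 01:00:00.123456');
33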

How was this patch tested?

Added date_part.sql and regenerated results of extract.sql.

@SparkQA

SparkQA commented Aug 11, 2019

Test build #108938 has finished for PR 25410 at commit af51e52.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 12, 2019

Test build #108953 has finished for PR 25410 at commit e68611a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Aug 12, 2019

jenkins, retest this, please

@SparkQA

SparkQA commented Aug 12, 2019

Test build #108957 has finished for PR 25410 at commit e68611a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Could you rebase this to the master please, @MaxGekk ?

@SparkQA

SparkQA commented Aug 14, 2019

Test build #109110 has finished for PR 25410 at commit efc3ee0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 14, 2019

Test build #109119 has finished for PR 25410 at commit bcf73d2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-28690][SQL] Add the date_part function for timestamps and dates [SPARK-28690][SQL] Add date_part function for timestamps/dates Aug 14, 2019
@MaxGekk
Member Author

MaxGekk commented Sep 3, 2019

@dongjoon-hyun Please, take a look at the PR when you have time.

@MaxGekk
Member Author

MaxGekk commented Sep 4, 2019

@cloud-fan @HyukjinKwon @srowen Could you take a look at the PR, please.

@SparkQA

SparkQA commented Sep 4, 2019

Test build #110132 has finished for PR 25410 at commit b292931.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class FileCommitProtocol extends Logging
  • trait ResourceAllocator
  • class RpcAbortException(message: String) extends Exception(message)
  • public final class JavaStructuredKerberizedKafkaWordCount
  • public final class JavaDirectKerberizedKafkaWordCount
  • class BindingParquetOutputCommitter(
  • class PathOutputCommitProtocol(
  • class DecisionTreeParams(Params):
  • case class UnresolvedTable(v1Table: CatalogTable) extends Table
  • implicit class IdentifierHelper(identifier: TableIdentifier)
  • class CatalogManager(conf: SQLConf) extends Logging
  • case class ShowTables(
  • trait AlterTableStatement extends ParsedStatement
  • case class ShowTablesStatement(namespace: Option[Seq[String]], pattern: Option[String])
  • case class ReuseAdaptiveSubquery(
  • trait CostEvaluator
  • case class SimpleCost(value: Long) extends Cost
  • case class DescribeTableExec(
  • case class ShowTablesExec(
  • case class AppendDataExecV1(
  • case class OverwriteByExpressionExecV1(
  • sealed trait V1FallbackWriters extends SupportsV1Write
  • protected implicit class toV1WriteBuilder(builder: WriteBuilder)
  • trait SupportsV1Write extends SparkPlan
  • class V2SessionCatalog(sessionState: SessionState) extends TableCatalog with SupportsNamespaces
  • trait V1WriteBuilder extends WriteBuilder

-- date_part( 'dow', d1) AS dow
-- FROM TIMESTAMP_TBL WHERE d1 BETWEEN '1902-01-01' AND '2038-01-01';
-- [SPARK-28767] ParseException: no viable alternative at input 'year'
set spark.sql.parser.ansi.enabled=false;
Contributor

what are we doing here? This test is for timestamp but why do we test the parser?

Member

To allow year to be used as an alias name in the query below, the test just turns off ANSI mode temporarily;
year cannot be used as an alias name with ansi=true because it is a reserved keyword: https://github.com/apache/spark/pull/25410/files#r314599685

Contributor

can't we just quote it? e.g. select 1 as 'year'

Member Author

We can quote or set the variable. Please, take a look at the comments: https://github.com/apache/spark/pull/25410/files/af51e524d90253d26dc848d4776328c5f8359d88#r314593244 . Do you think it is better to use backquotes instead of setting the variable?

Member

yea, quoting looks ok to me.

Contributor

I'd like to quote it, to not distract people from the timestamp tests

Contributor

does pgsql quote it in its test?

Member

In pgSQL, year is not reserved, so we can use it as an alias name.
https://www.postgresql.org/docs/11/sql-keywords-appendix.html
Even if it's reserved, we can still use it, though:

postgres=# select 1 as year;
 year 
------
    1
(1 row)

postgres=# create table year(t int);
CREATE TABLE
postgres=# select 1 as select;
 select 
--------
      1
(1 row)

postgres=# create table select(t int);
2019-09-06 14:44:35.490 JST [6166] ERROR:  syntax error at or near "select" at character 14
2019-09-06 14:44:35.490 JST [6166] STATEMENT:  create table select(t int);
ERROR:  syntax error at or near "select"
LINE 1: create table select(t int);

Member Author

I'd like to quote it, to not distract people from the timestamp tests

@dongjoon-hyun I hope you will not be too unhappy if I use backquotes again here.
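
For reference, a minimal sketch of how the backquoted alias could look in the test (a hypothetical line, not the exact test content):

spark-sql> select date_part('year', timestamp'2019-08-12 01:00:00.123456') as `year`;
2019

With ANSI mode on, year is reserved, so the backquotes make it usable as an alias without touching spark.sql.parser.ansi.enabled.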

@SparkQA

SparkQA commented Sep 6, 2019

Test build #110225 has finished for PR 25410 at commit 600eee6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM.

@maropu do you know why Spark treats year as a reserved keyword? I thought we followed pgsql at that time...

@maropu
Member

maropu commented Sep 6, 2019

As for reserved keywords, we just follow the SQL:2011 standard, and it reserves year. In fact, I don't know why pgSQL doesn't reserve year...

@maropu maropu closed this in 67b4329 Sep 6, 2019
@maropu
Member

maropu commented Sep 6, 2019

Thanks, max! Merged to master.

@MaxGekk
Member Author

MaxGekk commented Sep 6, 2019

@maropu @dongjoon-hyun @cloud-fan Thank you for your review.

maropu pushed a commit that referenced this pull request Sep 12, 2019
### What changes were proposed in this pull request?

In the PR, I propose to extend `ExtractBenchmark` and add new benchmarks for:
- `EXTRACT` and `DATE` as input column
- the `DATE_PART` function and `DATE`/`TIMESTAMP` input column

### Why are the changes needed?

The `EXTRACT` expression was rebased on the `DATE_PART` expression in PR #25410, where some of the sub-expressions take a `DATE` column as the input (`Millennium`, `Year`, etc.) while others require a `TIMESTAMP` column (`Hour`, `Minute`). Separate benchmarks for `DATE` should exclude the overhead of implicit `DATE` <-> `TIMESTAMP` conversions.
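
For illustration, the two kinds of input the benchmark separates (a sketch; the actual benchmark queries may differ):

```sql
-- DATE input: no implicit DATE -> TIMESTAMP conversion is needed for date-based fields
SELECT DATE_PART('year', DATE'2019-08-12');
-- TIMESTAMP input: required for time-based fields such as 'hour'
SELECT DATE_PART('hour', TIMESTAMP'2019-08-12 01:00:00.123456');
```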

### Does this PR introduce any user-facing change?

No, it doesn't.

### How was this patch tested?
- Regenerated results of `ExtractBenchmark`

Closes #25772 from MaxGekk/date_part-benchmark.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
PavithraRamachandran pushed a commit to PavithraRamachandran/spark that referenced this pull request Sep 15, 2019
## What changes were proposed in this pull request?

In the PR, I propose a new function `date_part()`. The function is modeled on the traditional Ingres equivalent to the SQL-standard function `extract`:
```
date_part('field', source)
```
and added for feature parity with PostgreSQL (https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT).

The `source` can have `DATE` or `TIMESTAMP` type. Supported string values of `'field'` are:
- `millennium` - the current millennium for a given date (or a timestamp implicitly cast to a date). For example, years in the 1900s are in the second millennium. The third millennium started _January 1, 2001_.
- `century` - the current century for a given date (or timestamp). The first century starts at 0001-01-01 AD.
- `decade` - the current decade for a given date (or timestamp). Actually, this is the year field divided by 10.
- `isoyear` - the ISO 8601 week-numbering year that the date falls in. Each ISO 8601 week-numbering year begins with the Monday of the week containing the 4th of January.
- `year`, `month`, `day`, `hour`, `minute`, `second`
- `week` - the number of the ISO 8601 week-numbering week of the year. By definition, ISO weeks start on Mondays and the first week of a year contains January 4 of that year.
- `quarter` - the quarter of the year (1 - 4)
- `dayofweek` - the day of the week for date/timestamp (1 = Sunday, 2 = Monday, ..., 7 = Saturday)
- `dow` - the day of the week as Sunday (0) to Saturday (6)
- `isodow` - the day of the week as Monday (1) to Sunday (7)
- `doy` - the day of the year (1 - 365/366)
- `milliseconds` - the seconds field including fractional parts multiplied by 1,000.
- `microseconds` - the seconds field including fractional parts multiplied by 1,000,000.
- `epoch` - the number of seconds (with microsecond precision) since 1970-01-01 00:00:00 local time.

Here are examples:
```sql
spark-sql> select date_part('year', timestamp'2019-08-12 01:00:00.123456');
2019
spark-sql> select date_part('week', timestamp'2019-08-12 01:00:00.123456');
33
spark-sql> select date_part('doy', timestamp'2019-08-12 01:00:00.123456');
224
```

I changed the implementation of `extract` to re-use `date_part()` internally.

## How was this patch tested?

Added `date_part.sql` and regenerated results of `extract.sql`.

Closes apache#25410 from MaxGekk/date_part.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
PavithraRamachandran pushed a commit to PavithraRamachandran/spark that referenced this pull request Sep 15, 2019
### What changes were proposed in this pull request?

In the PR, I propose to extend `ExtractBenchmark` and add new benchmarks for:
- `EXTRACT` and `DATE` as input column
- the `DATE_PART` function and `DATE`/`TIMESTAMP` input column

### Why are the changes needed?

The `EXTRACT` expression was rebased on the `DATE_PART` expression in PR apache#25410, where some of the sub-expressions take a `DATE` column as the input (`Millennium`, `Year`, etc.) while others require a `TIMESTAMP` column (`Hour`, `Minute`). Separate benchmarks for `DATE` should exclude the overhead of implicit `DATE` <-> `TIMESTAMP` conversions.

### Does this PR introduce any user-facing change?

No, it doesn't.

### How was this patch tested?
- Regenerated results of `ExtractBenchmark`

Closes apache#25772 from MaxGekk/date_part-benchmark.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
@MaxGekk MaxGekk deleted the date_part branch October 5, 2019 19:17