[SPARK-28690][SQL] Add date_part function for timestamps/dates #25410
Conversation
sql/core/src/test/resources/sql-tests/inputs/pgSQL/timestamp.sql (Outdated)
Test build #108938 has finished for PR 25410 at commit
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
Test build #108953 has finished for PR 25410 at commit
jenkins, retest this, please
Test build #108957 has finished for PR 25410 at commit
Could you rebase this onto master, please, @MaxGekk?
Test build #109110 has finished for PR 25410 at commit
Test build #109119 has finished for PR 25410 at commit
Title changed from "date_part function for timestamps and dates" to "date_part function for timestamps/dates"
sql/core/src/test/resources/sql-tests/inputs/pgSQL/timestamp.sql (Outdated)
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala (Outdated)
@dongjoon-hyun Please, take a look at the PR when you have time.
@cloud-fan @HyukjinKwon @srowen Could you take a look at the PR, please?
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala (Outdated)
Test build #110132 has finished for PR 25410 at commit
```sql
-- date_part('dow', d1) AS dow
-- FROM TIMESTAMP_TBL WHERE d1 BETWEEN '1902-01-01' AND '2038-01-01';
-- [SPARK-28767] ParseException: no viable alternative at input 'year'
set spark.sql.parser.ansi.enabled=false;
```
what are we doing here? This test is for timestamp but why do we test the parser?
To use `year` as an alias name in the query below, it just turns off the ANSI mode temporarily; `year` cannot be used as an alias name with ansi=true because it is a reserved keyword: https://github.com/apache/spark/pull/25410/files#r314599685
can't we just quote it? e.g. `select 1 as 'year'`
We can quote or set the variable. Please, take a look at the comments: https://github.com/apache/spark/pull/25410/files/af51e524d90253d26dc848d4776328c5f8359d88#r314593244 . Do you think it is better to use backquotes instead of setting the variable?
yea, quoting looks ok to me.
I'd like to quote it, to not distract people from the timestamp tests
does pgsql quote it in its test?
In pgSQL, `year` is not reserved, so we can use it as an alias name: https://www.postgresql.org/docs/11/sql-keywords-appendix.html
Even if it's reserved, we can sometimes use it anyway:
```
postgres=# select 1 as year;
 year
------
    1
(1 row)
postgres=# create table year(t int);
CREATE TABLE
postgres=# select 1 as select;
 select
--------
      1
(1 row)
postgres=# create table select(t int);
2019-09-06 14:44:35.490 JST [6166] ERROR:  syntax error at or near "select" at character 14
2019-09-06 14:44:35.490 JST [6166] STATEMENT:  create table select(t int);
ERROR:  syntax error at or near "select"
LINE 1: create table select(t int);
```
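The quoting behavior under discussion is generic SQL, not Spark-specific. A minimal sketch using SQLite as a stand-in (note the assumption: SQLite quotes identifiers with double quotes, whereas Spark SQL uses backquotes; this only illustrates the principle that quoting lets a reserved word serve as an alias):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# An unquoted reserved keyword fails to parse as an alias...
try:
    conn.execute("SELECT 1 AS select")
    print("unquoted alias accepted")
except sqlite3.OperationalError as exc:
    print("unquoted alias rejected:", exc)

# ...while the quoted form is accepted, and the column is really named "select".
cur = conn.execute('SELECT 1 AS "select"')
print(cur.description[0][0], cur.fetchone())
```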
> I'd like to quote it, to not distract people from the timestamp tests

@dongjoon-hyun I hope you will not be too unhappy if I use backquotes again here.
Test build #110225 has finished for PR 25410 at commit
LGTM. @maropu do you know why Spark treats |
As for reserved keywords, we just follow the SQL-2011 standard and it reserves |
Thanks, max! Merged to master.
@maropu @dongjoon-hyun @cloud-fan Thank you for your review.
### What changes were proposed in this pull request?

In the PR, I propose to extend `ExtractBenchmark` and add new ones for:
- `EXTRACT` and `DATE` as input column
- the `DATE_PART` function and `DATE`/`TIMESTAMP` input column

### Why are the changes needed?

The `EXTRACT` expression is rebased on the `DATE_PART` expression by the PR #25410, where some of the sub-expressions take a `DATE` column as the input (`Millennium`, `Year`, etc.) but others require a `TIMESTAMP` column (`Hour`, `Minute`). Separate benchmarks for `DATE` should exclude the overhead of implicit `DATE` <-> `TIMESTAMP` conversions.

### Does this PR introduce any user-facing change?

No, it doesn't.

### How was this patch tested?

- Regenerated results of `ExtractBenchmark`

Closes #25772 from MaxGekk/date_part-benchmark.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
## What changes were proposed in this pull request?

In the PR, I propose a new function `date_part()`. The function is modeled on the traditional Ingres equivalent to the SQL-standard function `extract`:
```
date_part('field', source)
```
and is added for feature parity with PostgreSQL (https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT).

The `source` can have `DATE` or `TIMESTAMP` type. Supported string values of `'field'` are:
- `millennium` - the millennium of the given date (or of a timestamp implicitly cast to a date). For example, years in the 1900s are in the second millennium. The third millennium started January 1, 2001.
- `century` - the century of the given date (or timestamp). The first century starts at 0001-01-01 AD.
- `decade` - the decade of the given date (or timestamp); that is, the year field divided by 10.
- `isoyear` - the ISO 8601 week-numbering year that the date falls in. Each ISO 8601 week-numbering year begins with the Monday of the week containing the 4th of January.
- `year`, `month`, `day`, `hour`, `minute`, `second`
- `week` - the number of the ISO 8601 week-numbering week of the year. By definition, ISO weeks start on Mondays and the first week of a year contains January 4 of that year.
- `quarter` - the quarter of the year (1 - 4)
- `dayofweek` - the day of the week for date/timestamp (1 = Sunday, 2 = Monday, ..., 7 = Saturday)
- `dow` - the day of the week as Sunday (0) to Saturday (6)
- `isodow` - the day of the week as Monday (1) to Sunday (7)
- `doy` - the day of the year (1 - 365/366)
- `milliseconds` - the seconds field, including fractional parts, multiplied by 1,000.
- `microseconds` - the seconds field, including fractional parts, multiplied by 1,000,000.
- `epoch` - the number of seconds since 1970-01-01 00:00:00 local time, in microsecond precision.

Here are examples:
```sql
spark-sql> select date_part('year', timestamp'2019-08-12 01:00:00.123456');
2019
spark-sql> select date_part('week', timestamp'2019-08-12 01:00:00.123456');
33
spark-sql> select date_part('doy', timestamp'2019-08-12 01:00:00.123456');
224
```

I changed the implementation of `extract` to re-use `date_part()` internally.

## How was this patch tested?

Added `date_part.sql` and regenerated results of `extract.sql`.

Closes apache#25410 from MaxGekk/date_part.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
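For intuition, a few of the documented field semantics can be sketched in plain Python with the standard `datetime` module. This is an illustrative re-implementation, not the Spark code; the function name and the subset of fields are mine:

```python
from datetime import datetime

def date_part(field, ts):
    """Illustrative re-implementation of a few date_part fields."""
    if field == "year":
        return ts.year
    if field == "quarter":
        return (ts.month - 1) // 3 + 1           # quarter of the year, 1 - 4
    if field == "doy":
        return ts.timetuple().tm_yday            # day of the year, 1 - 365/366
    if field == "week":
        return ts.isocalendar()[1]               # ISO 8601 week number
    if field == "isodow":
        return ts.isoweekday()                   # Monday (1) .. Sunday (7)
    if field == "dow":
        return ts.isoweekday() % 7               # Sunday (0) .. Saturday (6)
    if field == "dayofweek":
        return ts.isoweekday() % 7 + 1           # Sunday (1) .. Saturday (7)
    if field == "milliseconds":
        return ts.second * 1000 + ts.microsecond / 1000
    raise ValueError(f"unsupported field: {field}")

ts = datetime(2019, 8, 12, 1, 0, 0, 123456)
print(date_part("year", ts))   # 2019
print(date_part("week", ts))   # 33
print(date_part("doy", ts))    # 224
```

The printed values match the spark-sql session above; 2019-08-12 is a Monday, so `isodow` is 1, `dow` is 1, and `dayofweek` is 2.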