
SQL: Implement DATE_TRUNC function #46473

Merged: matriv merged 12 commits into elastic:master from impl-46319 on Sep 11, 2019

Conversation

@matriv (Contributor, Author) commented Sep 8, 2019:

DATE_TRUNC(<truncate field>, <date/datetime>) is a function that allows
the user to truncate a timestamp to the specified field by zeroing out
the rest of the fields. The function is implemented according to the
spec from PostgreSQL: https://www.postgresql.org/docs/current/functions-datetime.html#FUNCTIONS-DATETIME-TRUNC

Closes: #46319

@elasticmachine (Collaborator) commented:
Pinging @elastic/es-search

;

selectDateTruncWithDate
schema::dt_mil:s|dt_cent:s|dt_dec:s|dt_year:s|dt_quarter:s|dt_month:s|dt_week:s|dt_day:s
@matriv (Contributor, Author) commented:

Have to cast as string here for the time being, working on a solution to properly recognise the DATE type.

@matriv (Contributor, Author) commented:

Should be fixed with: 4b119dc

truncateDateDecades
schema::decades:s
// tag::truncateDateDecades
SELECT DATE_TRUNC('decade', CAST('2019-09-04' AS DATE))::string AS decades;
@matriv (Contributor, Author) commented:

Have to cast as string here for the time being, working on a solution to properly recognise the DATE type.

@matriv (Contributor, Author) commented:

Should be fixed with: 4b119dc

truncateDateQuarter
schema::quarter:s
// tag::truncateDateQuarter
SELECT DATE_TRUNC('quarters', CAST('2019-09-04' AS DATE))::string AS quarter;
@matriv (Contributor, Author) commented:

same here

@matriv (Contributor, Author) commented:

Should be fixed with: 4b119dc

@@ -121,6 +121,100 @@ SELECT WEEK(birth_date) week, birth_date FROM test_emp WHERE WEEK(birth_date) >
2 |1953-01-07T00:00:00.000Z
;

selectDateTruncWithDateTime
schema::dt_hour:ts|dt_min:ts|dt_sec:ts|dt_millis:s|dt_micro:s|dt_nano:s
@matriv (Contributor, Author) commented Sep 8, 2019:

Have to cast some columns to string as the .XXX msecs part is not properly read by the CSV infrastructure.

@costin (Member) commented:

It's worth opening another issue to fix this long term (either inside the CSV library or elsewhere).

@matriv (Contributor, Author) commented:

Done: #46511

@costin (Member) reviewed:

Had some minor comments mainly around style but otherwise LGTM.

@@ -240,4 +241,9 @@ static Time asTime(long millis, ZoneId zoneId) {
return new Time(ZonedDateTime.ofInstant(Instant.ofEpochMilli(millis), zoneId)
.toLocalTime().atDate(JdbcTestUtils.EPOCH).atZone(zoneId).toInstant().toEpochMilli());
}

// Used to convert the DATE read from CSV file to a java.sql.Date at the System's timezone (-Dtests.timezone=XXXX)
@costin (Member) commented:

//change the internal timezone of a java.sql.Date class from UTC to that of the JVM
//used by the CsvTest

Considering this method is not used anywhere else, I would just move it to the Assert class and make it private static.
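
For illustration only, here is a minimal sketch of the kind of conversion that comment describes, mirroring the asTime helper above; the method name and exact steps are assumptions, not the code merged in this PR:

import java.sql.Date;
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;

// Sketch (hypothetical): take a DATE read from the CSV file as UTC millis and rebuild it
// as a java.sql.Date anchored at the JVM's timezone (-Dtests.timezone=XXXX).
static Date asDate(long millis, ZoneId zoneId) {
    return new Date(ZonedDateTime.ofInstant(Instant.ofEpochMilli(millis), ZoneId.of("UTC"))
        .toLocalDate()
        .atStartOfDay(zoneId)
        .toInstant()
        .toEpochMilli());
}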

return aliases;
}

public static DatePart resolveTruncate(String truncateTo) {
@costin (Member) commented:

An EnumMap is better than iterating and converting the string to lowercase every time.

@matriv (Contributor, Author) commented Sep 9, 2019:

Hm, I don't get how the EnumMap can be beneficial here. We can receive any string like: w or milliseconds or MILLENNIUM (or NaNOSecond) and we need to find out if it resolves to an Enum name or its aliases. I was thinking of creating a HashMap<String, Enum> for faster resolution to avoid the iteration, but we still need to do the lowerCase().

Can you please explain more your suggested solution? (probably I'm missing something)

@costin (Member) commented:

Right, that is what I meant: a map of enums, Map<String, Enum> (not an EnumMap, since the keys are not enums themselves).
The point of lowercasing is that you don't have to do equalsIgnoreCase or lowercase the enum name at lookup time, since you can do that directly when initializing the map - this is the lowercasing I meant, not that of the argument.
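
As a minimal sketch of the suggested approach (illustrative names only, not the exact code merged in this PR), the map can be built once from the enum constants and their aliases, with all keys lowercased at initialization:

import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public enum DatePart {
    // only a handful of parts and aliases shown here for brevity
    YEAR("years", "yy", "yyyy"),
    MONTH("months", "mm"),
    DAY("days", "dd"),
    MILLISECOND("milliseconds", "ms");

    private static final Map<String, DatePart> NAME_TO_PART = new HashMap<>();

    static {
        // lowercase the enum names and aliases once, when the map is initialized
        for (DatePart part : values()) {
            NAME_TO_PART.put(part.name().toLowerCase(Locale.ROOT), part);
            for (String alias : part.aliases) {
                NAME_TO_PART.put(alias.toLowerCase(Locale.ROOT), part);
            }
        }
    }

    private final String[] aliases;

    DatePart(String... aliases) {
        this.aliases = aliases;
    }

    // resolution becomes a single lookup; only the user-supplied argument is lowercased
    public static DatePart resolveTruncate(String truncateTo) {
        return truncateTo == null ? null : NAME_TO_PART.get(truncateTo.toLowerCase(Locale.ROOT));
    }
}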

}

@Override
public int hashCode() {
@costin (Member) commented:

Small nitpick: generally within the code hashCode appears before equals (though not consistently).

@matriv (Contributor, Author) commented Sep 9, 2019:

I think it's quite a mixture currently in our classes, but can change.

}

@Override
public String getWriteableName() {
@costin (Member) commented:

Please move the serialization methods (getWritable, doWrite) underneath the constructor for StreamInput - it helps keep the IO bit in one place (see the rest of the classes) for both reading and writing data.

@@ -148,4 +150,95 @@ public static int getNanoPrecision(Expression precisionExpression, int nano) {
nano = nano - nano % (int) Math.pow(10, (9 - precision));
return nano;
}

public static ZonedDateTime truncate(ZonedDateTime dateTime, DateTrunc.DatePart datePart) {
@costin (Member) commented:

If the enum is not reused anywhere else (and it doesn't seem to be), I would move the truncate logic inside the enum itself as it looks self-contained.

error("SELECT DATE_TRUNC(int, date) FROM test"));
assertEquals("1:8: second argument of [DATE_TRUNC(keyword, keyword)] must be [date or datetime], found value [keyword] " +
"type [keyword]", error("SELECT DATE_TRUNC(keyword, keyword) FROM test"));
assertEquals("1:8: first argument of [DATE_TRUNC('invalid', keyword)] must be one of [MILLENNIUM, CENTURY, DECADE, " + "" +
@costin (Member) commented:

It's worth checking whether the message can be improved by using the Levenshtein distance to suggest the best appropriate match (did you mean?), same as we do with field names.
It might not work for short inputs though, so it should only be applied to properties with at least 3-4 chars.

@matriv (Contributor, Author) commented:

Nice suggestion, thx! Please check edf42ce in case you see some further improvement there.
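
Purely as an illustration of the "did you mean?" idea (hypothetical names, not the StringUtils.findSimilar implementation actually used in the codebase), a small sketch combining a plain Levenshtein distance with the minimum-length guard mentioned above:

import java.util.Collection;
import java.util.Locale;

final class Similarity {

    // classic dynamic-programming Levenshtein distance between two strings
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // suggest the closest known date part, or null for very short or very distant inputs
    static String didYouMean(String input, Collection<String> knownParts) {
        if (input == null || input.length() < 3) {
            return null; // skip short inputs, as noted in the review
        }
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : knownParts) {
            int distance = levenshtein(input.toLowerCase(Locale.ROOT), candidate.toLowerCase(Locale.ROOT));
            if (distance < bestDistance) {
                bestDistance = distance;
                best = candidate;
            }
        }
        return bestDistance <= 3 ? best : null; // only suggest reasonably close matches
    }
}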

@@ -39,8 +45,9 @@ public Combinations(int n, int k) {

@Override
public Iterator<BitSet> iterator() {
return new Iterator<BitSet>() {
return new Iterator<>() {
@costin (Member) commented:

This auto-formatting is kind of annoying and distracting - make sure to enable auto-formatting only on the lines you actually change, not the whole file.
It's fine to improve the code, however I would do that in a separate PR.

@matriv (Contributor, Author) commented:

It was done manually, on purpose; it's not auto-formatting, just a small code style improvement.
If you agree, I think I'd keep it since it's quite small.

@matriv (Contributor, Author) commented Sep 10, 2019:

@elasticmachine run elasticsearch-ci/1

@astefan (Contributor) reviewed:

LGTM, great work!

Though, I would like to see more tests. For example:

  • DATE_TRUNC(null, date)
  • DATE_TRUNC(field_name, date + INTERVAL 12 YEAR)
  • SELECT DATE_TRUNC(CAST(CHAR(123) AS VARCHAR)

And, if you feel adventurous:

  • SELECT date, part, code, DATE_TRUNC(CASE WHEN code IS NOT NULL THEN CAST(CHAR(code) AS VARCHAR) ELSE part.keyword END, date + INTERVAL 100 YEAR) AS x FROM test WHERE x > '2018-09-04'::date

with this test data:

{"index":{"_id":1}}
{"date":"2004-07-31T11:57:52.000Z","part":"month","code":109}
{"index":{"_id":2}}
{"date":"1992-07-14T08:16:44.444Z","part":"minute","code":110}
{"index":{"_id":3}}
{"date":"2992-02-12T23:16:33.567Z","part":"second","code":115}
{"index":{"_id":4}}
{"date":"992-03-12T23:16:33.567Z","part":"year"}


*Input*:

<1> string expression denoting the unit to which the date/datetime should be truncated
@astefan (Contributor) commented:

Shouldn't this phrase be "string expression denoting the unit which the date/datetime should be truncated to"?

<1> string expression denoting the unit to which the date/datetime should be truncated
<2> date/datetime expression

*Output*: date/datetime, same as datetime_exp
@astefan (Contributor) commented:

`datetime_exp`

@matriv (Contributor, Author) commented Sep 10, 2019:

@astefan Thx for your test proposals! The null truncateTo field exposed a bug indeed.
Added more tests.

@@ -264,7 +264,7 @@ DATE_TRUNC(
<1> string expression denoting the unit to which the date/datetime should be truncated to
<2> date/datetime expression

*Output*: date/datetime, same as `datetime_exp`
*Output*: datetime (even if `datetime_exp` is of type date)
@astefan (Contributor) commented:

I wouldn't mention the even if .... part.

@costin (Member) reviewed:

Left a few more comments.

MICROSECOND("microseconds", "mcs"),
NANOSECOND("nanoseconds", "ns");

private static final Set<String> ALL_DATE_PARTS;
@costin (Member) commented:

I don't think there's a need for this set since RESOLVE_MAP.keySet() returns the same thing (which is a wrapper around the internal map).

NANOSECOND("nanoseconds", "ns");

private static final Set<String> ALL_DATE_PARTS;
private static final Map<String, DatePart> RESOLVE_MAP;
@costin (Member) commented:

The name is a bit confusing - how about NAME_TO_PART or NAME_TO_FIELD?

return StringUtils.findSimilar(match, ALL_DATE_PARTS);
}

public static ZonedDateTime truncate(ZonedDateTime dateTime, DateTrunc.DatePart datePart) {
@costin (Member) commented:

In my previous comment I was suggesting to move this extraction inside the enum itself, so each one would have its own `of` implementation: DateTruncate.NANOSECOND.of(ZonedDateTime)
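
As a rough sketch of what that suggestion could look like (illustrative constant and method names, not the code merged in this PR), each enum constant can carry its own truncation logic behind an abstract method:

import java.time.ZonedDateTime;
import java.time.temporal.ChronoUnit;
import java.time.temporal.TemporalAdjusters;

// Sketch: the truncation logic lives inside the enum, so callers can write
// Part.MONTH.truncate(zonedDateTime) instead of going through a static helper.
public enum Part {
    YEAR {
        @Override
        public ZonedDateTime truncate(ZonedDateTime dateTime) {
            return dateTime.with(TemporalAdjusters.firstDayOfYear()).truncatedTo(ChronoUnit.DAYS);
        }
    },
    MONTH {
        @Override
        public ZonedDateTime truncate(ZonedDateTime dateTime) {
            return dateTime.with(TemporalAdjusters.firstDayOfMonth()).truncatedTo(ChronoUnit.DAYS);
        }
    },
    DAY {
        @Override
        public ZonedDateTime truncate(ZonedDateTime dateTime) {
            return dateTime.truncatedTo(ChronoUnit.DAYS);
        }
    },
    SECOND {
        @Override
        public ZonedDateTime truncate(ZonedDateTime dateTime) {
            return dateTime.truncatedTo(ChronoUnit.SECONDS);
        }
    };

    public abstract ZonedDateTime truncate(ZonedDateTime dateTime);
}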


public class DateTrunc extends BinaryScalarFunction {

public enum DatePart {
@costin (Member) commented:

Considering the enum is bound to DateTrunc, it might make sense to drop Date from its name so that it simply becomes Part or maybe Field.

@matriv merged commit b37e967 into elastic:master on Sep 11, 2019
@matriv deleted the impl-46319 branch on September 11, 2019 18:09
matriv added a commit that referenced this pull request Sep 11, 2019
DATE_TRUNC(<truncate field>, <date/datetime>) is a function that allows
the user to truncate a timestamp to the specified field by zeroing out
the rest of the fields. The function is implemented according to the
spec from PostgreSQL: https://www.postgresql.org/docs/current/functions-datetime.html#FUNCTIONS-DATETIME-TRUNC

Closes: #46319
(cherry picked from commit b37e967)
matriv added a commit to matriv/elasticsearch that referenced this pull request Sep 26, 2019
To be on the safe side in terms of use cases, also add the alias
DATETRUNC to the DATE_TRUNC function.

Follows: elastic#46473
matriv added a commit that referenced this pull request Sep 27, 2019
To be on the safe side in terms of use cases, also add the alias
DATETRUNC to the DATE_TRUNC function.

Follows: #46473
matriv added a commit that referenced this pull request Sep 27, 2019
To be on the safe side in terms of use cases, also add the alias
DATETRUNC to the DATE_TRUNC function.

Follows: #46473

(cherry picked from commit 9ac223c)

Merging this pull request closed: SQL: add DATE_TRUNC function (#46319)