Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL: use calendar interval of 1y instead of fixed interval for grouping by YEAR and HISTOGRAMs #47558

Merged
merged 4 commits into from
Oct 9, 2019
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 10 additions & 0 deletions docs/reference/sql/functions/grouping.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -86,6 +86,16 @@ the multiple of a day. E.g.: for `HISTOGRAM(CAST(birth_date AS DATE), INTERVAL '
actually used will be `INTERVAL '2' DAY`. If the interval specified is less than 1 day, e.g.:
`HISTOGRAM(CAST(birth_date AS DATE), INTERVAL '20' HOUR)` then the interval used will be `INTERVAL '1' DAY`.

[IMPORTANT]
All intervals specified for a date/time HISTOGRAM will use a <<search-aggregations-bucket-datehistogram-aggregation,fixed interval>>
in their `date_histogram` aggregation definition, with the notable exception of `INTERVAL '1' YEAR` where a calendar interval is used.
astefan marked this conversation as resolved.
Show resolved Hide resolved
The choice for a calendar interval was made for having a more intuitive result for YEAR groupings. Calendar intervals consider a one year
bucket as the one starting on January 1st that specific year, whereas a fixed interval one-year-bucket considers one year as a number
of milliseconds (for example, `31536000000ms` corresponding to 365 days, 24 hours per day, 60 minutes per hour etc.). With fixed intervals,
the day of February 5th, 2019 for example, belongs to a bucket that starts on December 20th, 2018 and {es} (and implicitly {es-sql}) would
have returned the year 2018 for a date that's actually in 2019. With calendar interval this behavior is more intuitive, having the day of
February 5th, 2019 actually belonging to the 2019 year bucket.

[IMPORTANT]
Histogram in SQL cannot be applied applied on **TIME** type.
E.g.: `HISTOGRAM(CAST(birth_date AS TIME), INTERVAL '10' MINUTES)` is currently not supported.
103 changes: 51 additions & 52 deletions x-pack/plugin/sql/qa/src/main/resources/agg.csv-spec
Original file line number Diff line number Diff line change
Expand Up @@ -273,47 +273,46 @@ histogramDateTime
schema::h:ts|c:l
SELECT HISTOGRAM(birth_date, INTERVAL 1 YEAR) AS h, COUNT(*) as c FROM test_emp GROUP BY h;

h | c
--------------------+---------------
null |10
1951-04-11T00:00:00Z|1
1952-04-05T00:00:00Z|10
1953-03-31T00:00:00Z|10
1954-03-26T00:00:00Z|7
1955-03-21T00:00:00Z|4
1956-03-15T00:00:00Z|4
1957-03-10T00:00:00Z|6
1958-03-05T00:00:00Z|6
1959-02-28T00:00:00Z|9
1960-02-23T00:00:00Z|7
1961-02-17T00:00:00Z|8
1962-02-12T00:00:00Z|6
1963-02-07T00:00:00Z|7
1964-02-02T00:00:00Z|5

h | c
------------------------+---------------
null |10
1952-01-01T00:00:00.000Z|8
1953-01-01T00:00:00.000Z|11
1954-01-01T00:00:00.000Z|8
1955-01-01T00:00:00.000Z|4
1956-01-01T00:00:00.000Z|5
1957-01-01T00:00:00.000Z|4
1958-01-01T00:00:00.000Z|7
1959-01-01T00:00:00.000Z|9
1960-01-01T00:00:00.000Z|8
1961-01-01T00:00:00.000Z|8
1962-01-01T00:00:00.000Z|6
1963-01-01T00:00:00.000Z|7
1964-01-01T00:00:00.000Z|4
1965-01-01T00:00:00.000Z|1
;

histogramDateTimeWithCountAndOrder
schema::h:ts|c:l
SELECT HISTOGRAM(birth_date, INTERVAL 1 YEAR) AS h, COUNT(*) as c FROM test_emp GROUP BY h ORDER BY h DESC;

h | c
--------------------+---------------
1964-02-02T00:00:00Z|5
1963-02-07T00:00:00Z|7
1962-02-12T00:00:00Z|6
1961-02-17T00:00:00Z|8
1960-02-23T00:00:00Z|7
1959-02-28T00:00:00Z|9
1958-03-05T00:00:00Z|6
1957-03-10T00:00:00Z|6
1956-03-15T00:00:00Z|4
1955-03-21T00:00:00Z|4
1954-03-26T00:00:00Z|7
1953-03-31T00:00:00Z|10
1952-04-05T00:00:00Z|10
1951-04-11T00:00:00Z|1
null |10
h | c
------------------------+---------------
1965-01-01T00:00:00.000Z|1
1964-01-01T00:00:00.000Z|4
1963-01-01T00:00:00.000Z|7
1962-01-01T00:00:00.000Z|6
1961-01-01T00:00:00.000Z|8
1960-01-01T00:00:00.000Z|8
1959-01-01T00:00:00.000Z|9
1958-01-01T00:00:00.000Z|7
1957-01-01T00:00:00.000Z|4
1956-01-01T00:00:00.000Z|5
1955-01-01T00:00:00.000Z|4
1954-01-01T00:00:00.000Z|8
1953-01-01T00:00:00.000Z|11
1952-01-01T00:00:00.000Z|8
null |10
;

histogramDateTimeWithMonthOnTop
Expand Down Expand Up @@ -369,23 +368,23 @@ histogramGroupByWithoutAlias
schema::h:ts|c:l
SELECT HISTOGRAM(birth_date, INTERVAL 1 YEAR) AS h, COUNT(*) as c FROM test_emp GROUP BY HISTOGRAM(birth_date, INTERVAL 1 YEAR) ORDER BY h DESC;

h | c
--------------------+---------------
1964-02-02T00:00:00Z|5
1963-02-07T00:00:00Z|7
1962-02-12T00:00:00Z|6
1961-02-17T00:00:00Z|8
1960-02-23T00:00:00Z|7
1959-02-28T00:00:00Z|9
1958-03-05T00:00:00Z|6
1957-03-10T00:00:00Z|6
1956-03-15T00:00:00Z|4
1955-03-21T00:00:00Z|4
1954-03-26T00:00:00Z|7
1953-03-31T00:00:00Z|10
1952-04-05T00:00:00Z|10
1951-04-11T00:00:00Z|1
null |10
h | c
------------------------+---------------
1965-01-01T00:00:00.000Z|1
1964-01-01T00:00:00.000Z|4
1963-01-01T00:00:00.000Z|7
1962-01-01T00:00:00.000Z|6
1961-01-01T00:00:00.000Z|8
1960-01-01T00:00:00.000Z|8
1959-01-01T00:00:00.000Z|9
1958-01-01T00:00:00.000Z|7
1957-01-01T00:00:00.000Z|4
1956-01-01T00:00:00.000Z|5
1955-01-01T00:00:00.000Z|4
1954-01-01T00:00:00.000Z|8
1953-01-01T00:00:00.000Z|11
1952-01-01T00:00:00.000Z|8
null |10
;

countAll
Expand Down
34 changes: 17 additions & 17 deletions x-pack/plugin/sql/qa/src/main/resources/docs/docs.csv-spec
Original file line number Diff line number Diff line change
Expand Up @@ -811,23 +811,23 @@ schema::h:ts|c:l
SELECT HISTOGRAM(birth_date, INTERVAL 1 YEAR) AS h, COUNT(*) AS c FROM emp GROUP BY h;


h | c
--------------------+---------------
null |10
1951-04-11T00:00:00Z|1
1952-04-05T00:00:00Z|10
1953-03-31T00:00:00Z|10
1954-03-26T00:00:00Z|7
1955-03-21T00:00:00Z|4
1956-03-15T00:00:00Z|4
1957-03-10T00:00:00Z|6
1958-03-05T00:00:00Z|6
1959-02-28T00:00:00Z|9
1960-02-23T00:00:00Z|7
1961-02-17T00:00:00Z|8
1962-02-12T00:00:00Z|6
1963-02-07T00:00:00Z|7
1964-02-02T00:00:00Z|5
h | c
------------------------+---------------
null |10
1952-01-01T00:00:00.000Z|8
1953-01-01T00:00:00.000Z|11
1954-01-01T00:00:00.000Z|8
1955-01-01T00:00:00.000Z|4
1956-01-01T00:00:00.000Z|5
1957-01-01T00:00:00.000Z|4
1958-01-01T00:00:00.000Z|7
1959-01-01T00:00:00.000Z|9
1960-01-01T00:00:00.000Z|8
1961-01-01T00:00:00.000Z|8
1962-01-01T00:00:00.000Z|6
1963-01-01T00:00:00.000Z|7
1964-01-01T00:00:00.000Z|4
1965-01-01T00:00:00.000Z|1

// end::histogramDateTime
;
Expand Down
36 changes: 18 additions & 18 deletions x-pack/plugin/sql/qa/src/main/resources/math.csv-spec
Original file line number Diff line number Diff line change
Expand Up @@ -101,31 +101,31 @@ SELECT MIN(salary) mi, MAX(salary) ma, YEAR(hire_date) year, ROUND(AVG(languages

mi:i | ma:i | year:i |ROUND(AVG(languages), 1):d|TRUNCATE(AVG(languages), 1):d| COUNT(*):l
---------------+---------------+---------------+--------------------------+-----------------------------+---------------
25324 |70011 |1986 |3.0 |3.0 |15
25945 |73578 |1987 |2.9 |2.8 |9
25976 |74970 |1988 |3.0 |3.0 |13
31120 |71165 |1989 |3.1 |3.0 |12
30404 |58715 |1992 |3.0 |3.0 |3
35742 |67492 |1993 |2.8 |2.7 |4
45656 |45656 |1995 |3.0 |3.0 |1
25324 |70011 |1987 |3.0 |3.0 |15
25945 |73578 |1988 |2.9 |2.8 |9
25976 |74970 |1989 |3.0 |3.0 |13
31120 |71165 |1990 |3.1 |3.0 |12
30404 |58715 |1993 |3.0 |3.0 |3
35742 |67492 |1994 |2.8 |2.7 |4
45656 |45656 |1996 |3.0 |3.0 |1
;

minMaxRoundWithHavingRound
SELECT MIN(salary) mi, MAX(salary) ma, YEAR(hire_date) year, ROUND(AVG(languages),1), COUNT(*) FROM test_emp GROUP BY YEAR(hire_date) HAVING ROUND(AVG(languages), 1) > 2.5 ORDER BY YEAR(hire_date);

mi:i | ma:i | year:i |ROUND(AVG(languages),1):d| COUNT(*):l
---------------+---------------+---------------+-------------------------+---------------
26436 |74999 |1984 |3.1 |11
31897 |61805 |1985 |3.5 |11
25324 |70011 |1986 |3.0 |15
25945 |73578 |1987 |2.9 |9
25976 |74970 |1988 |3.0 |13
31120 |71165 |1989 |3.1 |12
32568 |65030 |1990 |3.3 |6
27215 |60781 |1991 |4.1 |8
30404 |58715 |1992 |3.0 |3
35742 |67492 |1993 |2.8 |4
45656 |45656 |1995 |3.0 |1
26436 |74999 |1985 |3.1 |11
31897 |61805 |1986 |3.5 |11
25324 |70011 |1987 |3.0 |15
25945 |73578 |1988 |2.9 |9
25976 |74970 |1989 |3.0 |13
31120 |71165 |1990 |3.1 |12
32568 |65030 |1991 |3.3 |6
27215 |60781 |1992 |4.1 |8
30404 |58715 |1993 |3.0 |3
35742 |67492 |1994 |2.8 |4
45656 |45656 |1996 |3.0 |1
;

groupByAndOrderByTruncateWithPositiveParameter
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,13 @@ public abstract class DateTimeHistogramFunction extends DateTimeFunction {
}

/**
* used for aggregration (date histogram)
* used for aggregation (date histogram)
*/
public abstract long interval();
public long fixedInterval() {
return -1;
}

public String calendarInterval() {
return null;
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -5,20 +5,20 @@
*/
package org.elasticsearch.xpack.sql.expression.function.scalar.datetime;

import org.elasticsearch.search.aggregations.bucket.histogram.DateHistogramInterval;
import org.elasticsearch.xpack.sql.expression.Expression;
import org.elasticsearch.xpack.sql.expression.function.scalar.datetime.DateTimeProcessor.DateTimeExtractor;
import org.elasticsearch.xpack.sql.tree.Source;
import org.elasticsearch.xpack.sql.tree.NodeInfo.NodeCtor2;
import org.elasticsearch.xpack.sql.tree.Source;

import java.time.ZoneId;
import java.util.concurrent.TimeUnit;

/**
* Extract the year from a datetime.
*/
public class Year extends DateTimeHistogramFunction {

private static long YEAR_IN_MILLIS = TimeUnit.DAYS.toMillis(1) * 365L;
public static String YEAR_INTERVAL = DateHistogramInterval.YEAR.toString();

public Year(Source source, Expression field, ZoneId zoneId) {
super(source, field, zoneId, DateTimeExtractor.YEAR);
Expand All @@ -45,7 +45,7 @@ public Expression orderBy() {
}

@Override
public long interval() {
return YEAR_IN_MILLIS;
public String calendarInterval() {
return YEAR_INTERVAL;
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -41,9 +41,11 @@
import org.elasticsearch.xpack.sql.expression.function.scalar.ScalarFunction;
import org.elasticsearch.xpack.sql.expression.function.scalar.datetime.DateTimeFunction;
import org.elasticsearch.xpack.sql.expression.function.scalar.datetime.DateTimeHistogramFunction;
import org.elasticsearch.xpack.sql.expression.function.scalar.datetime.Year;
import org.elasticsearch.xpack.sql.expression.function.scalar.geo.GeoShape;
import org.elasticsearch.xpack.sql.expression.function.scalar.geo.StDistance;
import org.elasticsearch.xpack.sql.expression.gen.script.ScriptTemplate;
import org.elasticsearch.xpack.sql.expression.literal.IntervalYearMonth;
import org.elasticsearch.xpack.sql.expression.literal.Intervals;
import org.elasticsearch.xpack.sql.expression.predicate.Range;
import org.elasticsearch.xpack.sql.expression.predicate.fulltext.MatchQueryPredicate;
Expand Down Expand Up @@ -109,6 +111,7 @@
import org.elasticsearch.xpack.sql.util.ReflectionUtils;

import java.time.OffsetTime;
import java.time.Period;
import java.time.ZonedDateTime;
import java.util.Arrays;
import java.util.LinkedHashMap;
Expand Down Expand Up @@ -279,7 +282,11 @@ static GroupingContext groupBy(List<? extends Expression> groupings) {
// dates are handled differently because of date histograms
if (exp instanceof DateTimeHistogramFunction) {
DateTimeHistogramFunction dthf = (DateTimeHistogramFunction) exp;
key = new GroupByDateHistogram(aggId, nameOf(exp), dthf.interval(), dthf.zoneId());
if (dthf.calendarInterval() != null) {
key = new GroupByDateHistogram(aggId, nameOf(exp), dthf.calendarInterval(), dthf.zoneId());
} else {
key = new GroupByDateHistogram(aggId, nameOf(exp), dthf.fixedInterval(), dthf.zoneId());
}
}
// all other scalar functions become a script
else if (exp instanceof ScalarFunction) {
Expand All @@ -294,19 +301,33 @@ else if (exp instanceof GroupingFunction) {

// date histogram
if (h.dataType().isDateBased()) {
long intervalAsMillis = Intervals.inMillis(h.interval());

// When the histogram in SQL is applied on DATE type instead of DATETIME, the interval
// specified is truncated to the multiple of a day. If the interval specified is less
// than 1 day, then the interval used will be `INTERVAL '1' DAY`.
if (h.dataType() == DATE) {
intervalAsMillis = DateUtils.minDayInterval(intervalAsMillis);
}

if (field instanceof FieldAttribute) {
key = new GroupByDateHistogram(aggId, nameOf(field), intervalAsMillis, h.zoneId());
} else if (field instanceof Function) {
key = new GroupByDateHistogram(aggId, ((Function) field).asScript(), intervalAsMillis, h.zoneId());
Object value = h.interval().value();
if (value instanceof IntervalYearMonth
&& ((IntervalYearMonth) value).interval().equals(Period.of(1, 0, 0))) {
String calendarInterval = Year.YEAR_INTERVAL;

// When the histogram is `INTERVAL '1' YEAR`, the interval used in the ES date_histogram will be
// a calendar_interval with value "1y". All other intervals will be fixed_intervals expressed in ms.
if (field instanceof FieldAttribute) {
key = new GroupByDateHistogram(aggId, nameOf(field), calendarInterval, h.zoneId());
} else if (field instanceof Function) {
key = new GroupByDateHistogram(aggId, ((Function) field).asScript(), calendarInterval, h.zoneId());
}
} else {
long intervalAsMillis = Intervals.inMillis(h.interval());

// When the histogram in SQL is applied on DATE type instead of DATETIME, the interval
// specified is truncated to the multiple of a day. If the interval specified is less
// than 1 day, then the interval used will be `INTERVAL '1' DAY`.
if (h.dataType() == DATE) {
intervalAsMillis = DateUtils.minDayInterval(intervalAsMillis);
}

if (field instanceof FieldAttribute) {
key = new GroupByDateHistogram(aggId, nameOf(field), intervalAsMillis, h.zoneId());
} else if (field instanceof Function) {
key = new GroupByDateHistogram(aggId, ((Function) field).asScript(), intervalAsMillis, h.zoneId());
}
}
}
// numeric histogram
Expand Down
Loading