New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SQL: Introduce HISTOGRAM grouping function #36510
Conversation
Pinging @elastic/es-search |
318b716
to
83d76b4
Compare
Introduce Histogram grouping function for bucketing/grouping data based on a given range. Both date and numeric histograms are supported using the appropriate range declaration (numbers vs intervals). SELECT HISTOGRAM(number, 50) AS h FROM index GROUP BY h SELECT HISTOGRAM(date, INTERVAL 1 YEAR) AS h FROM index GROUP BY h In addition add multiply operator for Intervals Add docs for intervals and histogram Fix elastic#36509
83d76b4
to
0c1c6d8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left few cosmetic comments. Still review-in-progress.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! Left a first batch of comments.
@@ -1,7 +1,99 @@ | |||
[role="xpack"] | |||
[testenv="basic"] | |||
[[sql-functions-datetime]] | |||
=== Date and Time Functions | |||
=== Date/Time and Interval Functions and Operators | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add the beta[]
marker
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
A common requirement when dealing with date/time in general is date/time math which resolves around | ||
the notion of ``interval``s, a topic that is worth exploring in the context of {es} and {es-sql}. | ||
|
||
{es} has comprehensive support for <<date-math, date math>> both inside <<date-math-index-names,index names>> and <<mapping-date-format, queries>>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a space after comma , index names
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
<2> numeric interval | ||
<3> date/time expression (typically a field) | ||
<4> date/time <<sql-functions-datetime-interval, interval>> | ||
... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you need the ...
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left-over.
include-tagged::{sql-specs}/docs.csv-spec[histogramNumeric] | ||
---- | ||
|
||
and date/time fields: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say or
instead of and
since it continues from: either
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
@@ -68,8 +68,8 @@ protected void assertResults(ResultSet expected, ResultSet elastic) throws SQLEx | |||
// | |||
// uncomment this to printout the result set and create new CSV tests | |||
// | |||
//JdbcTestUtils.logLikeCLI(elastic, log); | |||
JdbcAssert.assertResultSets(expected, elastic, log, true); | |||
JdbcTestUtils.logLikeCLI(elastic, log); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
revert this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
SELECT INTERVAL '163:59.163' MINUTE TO SECOND AS interval; | ||
|
||
interval | ||
--------------- | ||
+0 02:43:59.163 | ||
; | ||
|
||
testDatePlusInterval | ||
intervalPlusInterval |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add Local
suffix to all of them since further below there are tests on table columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should remove them from here as they are also tested in the docs csv spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to keep things separate - while there is some duplication the doc lifecycle is different from that of the test. In fact, docs were added to avoid having obscure queries that make sense for testing show up in our documentation.
// | ||
|
||
|
||
histogramNumeric |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can remove these 2 tests since they are also in the docs csv spec.
@@ -108,7 +108,11 @@ public Analyzer(Configuration configuration, FunctionRegistry functionRegistry, | |||
new ResolveAggsInHaving() | |||
//new ImplicitCasting() | |||
); | |||
return Arrays.asList(substitution, resolution); | |||
Batch finish = new Batch("Finish Analysis", | |||
new PruneSubqueryAliases(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we merge them in one rule?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because they have different concerns - I would expect the second one to change when we add full support for subselects.
30404 |58715 |1993 |3.0 |3.0 |3 | ||
35742 |67492 |1994 |2.8 |2.7 |4 | ||
45656 |45656 |1996 |3.0 |3.0 |1 | ||
25324 |70011 |1986 |3.0 |3.0 |15 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where is this -1 year
change come from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Relative vs absolute time computing. ES allows both absolute offsets (24h) or relative (1d) which takes into account if the date/time is within a day.
Based on the intervals given, ES keeps going back and forth but that's surprising and unexpected - in SQL as the intervals vary, I've used the absolute one so that INTERVAL 1 YEAR
and INTERVAL 12 MONTHS
, etc... gives you the same results.
| `INTERVAL '3 4' DAYS TO HOURS` | 3 days and 4 hours | ||
| `INTERVAL '5 6:12' DAYS TO MINUTES` | 5 days, 6 hours and 12 minutes | ||
| `INTERVAL '3 4:56:01' DAY TO SECOND` | 3 days, 4 hours, 56 minutes and 1 second | ||
| `INTERVAL '2 3:45:01.23456789' DAY TO SECOND` | 2 days, 3 hours, 45 minutes, 1 second and 234567890 milliseconds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be nanoseconds, also below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm through with the review, nice work overall!
- Maybe we could add an integ test for interval multiplication on table columns.
- What happens if the histogram is used in the WHERE clause? should we catch this case?
@@ -527,4 +552,4 @@ private static FunctionDefinition def(Class<? extends Function> function, Functi | |||
private interface CastFunctionBuilder<T> { | |||
T build(Location location, Expression expression, DataType dataType); | |||
} | |||
} | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please revert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. git merging.
* operates on a datetime. | ||
*/ | ||
@SuppressWarnings("overloads") // These are ambiguous if you aren't using ctor references but we always do | ||
static <T extends Function> FunctionDefinition def(Class<T> function, DatetimeBinaryFunctionBuilder<T> ctorRef, String... names) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't get why do we need a special overload.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because no other def class matches - the constructor signature of Histogram is different then everything else - it's the only function with 2 Expression
arguments and a TimeZone
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm, thx! I missed something there, sorry.
@@ -415,6 +418,28 @@ public boolean functionExists(String functionName) { | |||
T build(Location location, Expression target, TimeZone tz); | |||
} | |||
|
|||
/** | |||
* Build a {@linkplain FunctionDefinition} for a plur/n-ary function that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you mean binary
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially I wanted to go for multi/variable arguments but that's not the case. Will update it.
|
||
@Override | ||
public ScriptTemplate asScript() { | ||
throw new SqlIllegalArgumentException("Grouping functions cannot be scripted"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not important for now. I think we could introduce an SqlIllegalOperationException
for such cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That implies an operation that somehow was accepted but it's not implemented, which I would consider a bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this is somehow the case here, no? We end up calling this method while we shouldn't. The SqlIllegalArgumentException
marks calling something (e.g.: sql function from the user perspective, or internal method with wrong artuments). Anyway, this is out of this PR's scope.
} | ||
|
||
public void testGroupByDateHistogram() { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing test body ;)
} | ||
|
||
public void testGroupByHistogram() { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here.
@@ -27,7 +27,7 @@ | |||
|
|||
private static final DeprecationLogger deprecationLogger = new DeprecationLogger(LogManager.getLogger(RestSqlClearCursorAction.class)); | |||
|
|||
RestSqlClearCursorAction(Settings settings, RestController controller) { | |||
public RestSqlClearCursorAction(Settings settings, RestController controller) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm, this is needed for our debug project only, should we revert?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this change shouldn't have occurred in the first place but the PR got it before I got a chance to review it.
@@ -38,7 +38,7 @@ | |||
|
|||
private static final DeprecationLogger deprecationLogger = new DeprecationLogger(LogManager.getLogger(RestSqlQueryAction.class)); | |||
|
|||
RestSqlQueryAction(Settings settings, RestController controller) { | |||
public RestSqlQueryAction(Settings settings, RestController controller) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here.
return TypeResolution.TYPE_RESOLVED; | ||
} | ||
|
||
return new TypeResolution(format("[{}] has arguments with incompatible types [{}] and [{}]", symbol(), l, r)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a test for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done - I added one but forgot about it so I created a dedicated test class (at which point I remembered about the previous test).
if (l instanceof Number && r instanceof Number) { | ||
return Arithmetics.mul((Number) l, (Number) r); | ||
} | ||
l = unwrapJodaTime(l); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add some unit tests for the processor for the intervals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See BinaryArithmeticTests
|
||
histogramDateWithCountAndOrder | ||
schema::h:ts|c:l | ||
SELECT HISTOGRAM(birth_date, INTERVAL 1 YEAR) AS h, COUNT(*) as c FROM test_emp GROUP BY h ORDER BY h DESC; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both tests here seem to return a "continuous" list of years... what happens for an interval where there is no document? For example, if we should have had a 1981 birth_date, would the years between 1964 and 1980 be displayed with count 0 or not displayed at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No bucket is returned (as that would break the notion of null group in SQL).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you mention this in the documentation, please? Users used to date_histogram
(where the "missing" buckets are still returned by ES) will probably expect the same from HISTOGRAM.
+1 00:53:00.0 | ||
; | ||
|
||
datePlusIntervalInline |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other tests that don't process tables data have constant
in their name. Should we keep the same trend here? For example constantDatePlusInterval
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this csv, constant
is not used since for intervals that means repeating for every test so it becomes noise. Considering it has its own namespace (dedicated file) I think it's fine as is.
|
||
public class GroupingFunctionAttribute extends FunctionAttribute { | ||
|
||
private final String propertyPath; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a small comment about the role of propertyPath
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've removed it
} | ||
|
||
@Override | ||
public Expression replaceChildren(List<Expression> newChildren) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the replaceChildren(List<Expression>)
here and not defined as final
in GroupingFunction
enforcing the replacement of a single children? GroupingFunction
would then define a protected abstract GroupingFunction replaceChildren(Expression newExpression);
. Are we expecting more grouping functions with variable list of children in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't know what the future will bring - the pattern above is used for Unary[..]Function
s which a grouping function might be, but we don't know yet.
Adding it now results in 2 methods without any reuse so it's arguable whether it makes sense to do it or not. In general, I prefer to add a pattern when it applies to at least 2 instances, otherwise it's more of a premature recognition.
I've added it nevertheless.
boolean negative = false; | ||
|
||
ParserRuleContext parentCtx = interval.getParent(); | ||
if (parentCtx != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this nested if
block doing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because of the structure of the grammar it's a way to get the -
symbol from the immediate parent context in the hierarchy that could possibly contain it.
public boolean equals(Object obj) { | ||
if (super.equals(obj)) { | ||
GroupByNumericHistogram other = (GroupByNumericHistogram) obj; | ||
return Objects.equals(interval, other.interval); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why Objects.equals
and not interval == other.interval
?
} | ||
|
||
public void testGroupByDateHistogram() { | ||
LogicalPlan p = plan("SELECT MAX(int) FROM test GROUP BY HISTOGRAM(int, 1000)"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a situation where the histogram is being grouped by very small intervals? Like 1 second for a date interval.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried a test with 1 second and it worked :). Our data might not be representative but I think it's a side effect of composite aggs which paginates through data instead of creating all the buckets up-front.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Great work!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great stuff! LGTM
Introduce Histogram grouping function for bucketing/grouping data based on a given range. Both date and numeric histograms are supported using the appropriate range declaration (numbers vs intervals). SELECT HISTOGRAM(number, 50) AS h FROM index GROUP BY h SELECT HISTOGRAM(date, INTERVAL 1 YEAR) AS h FROM index GROUP BY h In addition add multiply operator for Intervals Add docs for intervals and histogram Fix #36509 (cherry picked from commit 6ee6bb5)
Introduce Histogram grouping function for bucketing/grouping data based
on a given range. Both date and numeric histograms are supported using
the appropriate range declaration (numbers vs intervals).
In addition add multiply operator for Intervals
Add docs for intervals and histogram
Fix #36509
Note: Histograms don't work as arguments to functions nor inside Having yet. That's planned as a separate PR.