Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL: Introduce HISTOGRAM grouping function #36510

Merged
merged 5 commits into from Dec 14, 2018

Conversation

costin
Copy link
Member

@costin costin commented Dec 11, 2018

Introduce Histogram grouping function for bucketing/grouping data based
on a given range. Both date and numeric histograms are supported using
the appropriate range declaration (numbers vs intervals).

SELECT HISTOGRAM(number, 50) AS h FROM index GROUP BY h
SELECT HISTOGRAM(date, INTERVAL 1 YEAR) AS h FROM index GROUP BY h

In addition add multiply operator for Intervals
Add docs for intervals and histogram

Fix #36509

Note: Histograms don't work as arguments to functions nor inside Having yet. That's planned as a separate PR.

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search

@costin costin force-pushed the introduce-histograms branch 2 times, most recently from 318b716 to 83d76b4 Compare December 11, 2018 23:17
Introduce Histogram grouping function for bucketing/grouping data based
 on a given range. Both date and numeric histograms are supported using
 the appropriate range declaration (numbers vs intervals).

SELECT HISTOGRAM(number, 50) AS h FROM index GROUP BY h
SELECT HISTOGRAM(date, INTERVAL 1 YEAR) AS h FROM index GROUP BY h

In addition add multiply operator for Intervals
Add docs for intervals and histogram

Fix elastic#36509
Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left few cosmetic comments. Still review-in-progress.

docs/reference/sql/concepts.asciidoc Outdated Show resolved Hide resolved
docs/reference/sql/functions/date-time.asciidoc Outdated Show resolved Hide resolved
docs/reference/sql/functions/grouping.asciidoc Outdated Show resolved Hide resolved
docs/reference/sql/functions/grouping.asciidoc Outdated Show resolved Hide resolved
Copy link
Contributor

@matriv matriv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! Left a first batch of comments.

@@ -1,7 +1,99 @@
[role="xpack"]
[testenv="basic"]
[[sql-functions-datetime]]
=== Date and Time Functions
=== Date/Time and Interval Functions and Operators

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add the beta[] marker

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

A common requirement when dealing with date/time in general is date/time math which resolves around
the notion of ``interval``s, a topic that is worth exploring in the context of {es} and {es-sql}.

{es} has comprehensive support for <<date-math, date math>> both inside <<date-math-index-names,index names>> and <<mapping-date-format, queries>>.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a space after comma , index names

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

<2> numeric interval
<3> date/time expression (typically a field)
<4> date/time <<sql-functions-datetime-interval, interval>>
...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you need the ... here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left-over.

include-tagged::{sql-specs}/docs.csv-spec[histogramNumeric]
----

and date/time fields:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say or instead of and since it continues from: either

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@@ -68,8 +68,8 @@ protected void assertResults(ResultSet expected, ResultSet elastic) throws SQLEx
//
// uncomment this to printout the result set and create new CSV tests
//
//JdbcTestUtils.logLikeCLI(elastic, log);
JdbcAssert.assertResultSets(expected, elastic, log, true);
JdbcTestUtils.logLikeCLI(elastic, log);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert this change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

SELECT INTERVAL '163:59.163' MINUTE TO SECOND AS interval;

interval
---------------
+0 02:43:59.163
;

testDatePlusInterval
intervalPlusInterval
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add Local suffix to all of them since further below there are tests on table columns.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should remove them from here as they are also tested in the docs csv spec.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to keep things separate - while there is some duplication the doc lifecycle is different from that of the test. In fact, docs were added to avoid having obscure queries that make sense for testing show up in our documentation.

//


histogramNumeric
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove these 2 tests since they are also in the docs csv spec.

@@ -108,7 +108,11 @@ public Analyzer(Configuration configuration, FunctionRegistry functionRegistry,
new ResolveAggsInHaving()
//new ImplicitCasting()
);
return Arrays.asList(substitution, resolution);
Batch finish = new Batch("Finish Analysis",
new PruneSubqueryAliases(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we merge them in one rule?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because they have different concerns - I would expect the second one to change when we add full support for subselects.

30404 |58715 |1993 |3.0 |3.0 |3
35742 |67492 |1994 |2.8 |2.7 |4
45656 |45656 |1996 |3.0 |3.0 |1
25324 |70011 |1986 |3.0 |3.0 |15
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is this -1 year change come from?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Relative vs absolute time computing. ES allows both absolute offsets (24h) or relative (1d) which takes into account if the date/time is within a day.
Based on the intervals given, ES keeps going back and forth but that's surprising and unexpected - in SQL as the intervals vary, I've used the absolute one so that INTERVAL 1 YEAR and INTERVAL 12 MONTHS, etc... gives you the same results.

| `INTERVAL '3 4' DAYS TO HOURS` | 3 days and 4 hours
| `INTERVAL '5 6:12' DAYS TO MINUTES` | 5 days, 6 hours and 12 minutes
| `INTERVAL '3 4:56:01' DAY TO SECOND` | 3 days, 4 hours, 56 minutes and 1 second
| `INTERVAL '2 3:45:01.23456789' DAY TO SECOND` | 2 days, 3 hours, 45 minutes, 1 second and 234567890 milliseconds
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be nanoseconds, also below.

Copy link
Contributor

@matriv matriv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm through with the review, nice work overall!

  1. Maybe we could add an integ test for interval multiplication on table columns.
  2. What happens if the histogram is used in the WHERE clause? should we catch this case?

@@ -527,4 +552,4 @@ private static FunctionDefinition def(Class<? extends Function> function, Functi
private interface CastFunctionBuilder<T> {
T build(Location location, Expression expression, DataType dataType);
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please revert.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. git merging.

* operates on a datetime.
*/
@SuppressWarnings("overloads") // These are ambiguous if you aren't using ctor references but we always do
static <T extends Function> FunctionDefinition def(Class<T> function, DatetimeBinaryFunctionBuilder<T> ctorRef, String... names) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get why do we need a special overload.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because no other def class matches - the constructor signature of Histogram is different then everything else - it's the only function with 2 Expression arguments and a TimeZone.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, thx! I missed something there, sorry.

@@ -415,6 +418,28 @@ public boolean functionExists(String functionName) {
T build(Location location, Expression target, TimeZone tz);
}

/**
* Build a {@linkplain FunctionDefinition} for a plur/n-ary function that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you mean binary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially I wanted to go for multi/variable arguments but that's not the case. Will update it.


@Override
public ScriptTemplate asScript() {
throw new SqlIllegalArgumentException("Grouping functions cannot be scripted");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not important for now. I think we could introduce an SqlIllegalOperationException for such cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That implies an operation that somehow was accepted but it's not implemented, which I would consider a bug.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this is somehow the case here, no? We end up calling this method while we shouldn't. The SqlIllegalArgumentException marks calling something (e.g.: sql function from the user perspective, or internal method with wrong artuments). Anyway, this is out of this PR's scope.

}

public void testGroupByDateHistogram() {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing test body ;)

}

public void testGroupByHistogram() {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here.

@@ -27,7 +27,7 @@

private static final DeprecationLogger deprecationLogger = new DeprecationLogger(LogManager.getLogger(RestSqlClearCursorAction.class));

RestSqlClearCursorAction(Settings settings, RestController controller) {
public RestSqlClearCursorAction(Settings settings, RestController controller) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, this is needed for our debug project only, should we revert?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this change shouldn't have occurred in the first place but the PR got it before I got a chance to review it.

@@ -38,7 +38,7 @@

private static final DeprecationLogger deprecationLogger = new DeprecationLogger(LogManager.getLogger(RestSqlQueryAction.class));

RestSqlQueryAction(Settings settings, RestController controller) {
public RestSqlQueryAction(Settings settings, RestController controller) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here.

return TypeResolution.TYPE_RESOLVED;
}

return new TypeResolution(format("[{}] has arguments with incompatible types [{}] and [{}]", symbol(), l, r));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a test for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done - I added one but forgot about it so I created a dedicated test class (at which point I remembered about the previous test).

if (l instanceof Number && r instanceof Number) {
return Arithmetics.mul((Number) l, (Number) r);
}
l = unwrapJodaTime(l);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add some unit tests for the processor for the intervals.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See BinaryArithmeticTests


histogramDateWithCountAndOrder
schema::h:ts|c:l
SELECT HISTOGRAM(birth_date, INTERVAL 1 YEAR) AS h, COUNT(*) as c FROM test_emp GROUP BY h ORDER BY h DESC;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both tests here seem to return a "continuous" list of years... what happens for an interval where there is no document? For example, if we should have had a 1981 birth_date, would the years between 1964 and 1980 be displayed with count 0 or not displayed at all?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No bucket is returned (as that would break the notion of null group in SQL).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you mention this in the documentation, please? Users used to date_histogram (where the "missing" buckets are still returned by ES) will probably expect the same from HISTOGRAM.

+1 00:53:00.0
;

datePlusIntervalInline
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other tests that don't process tables data have constant in their name. Should we keep the same trend here? For example constantDatePlusInterval.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this csv, constant is not used since for intervals that means repeating for every test so it becomes noise. Considering it has its own namespace (dedicated file) I think it's fine as is.


public class GroupingFunctionAttribute extends FunctionAttribute {

private final String propertyPath;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a small comment about the role of propertyPath?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed it

}

@Override
public Expression replaceChildren(List<Expression> newChildren) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the replaceChildren(List<Expression>) here and not defined as final in GroupingFunction enforcing the replacement of a single children? GroupingFunction would then define a protected abstract GroupingFunction replaceChildren(Expression newExpression);. Are we expecting more grouping functions with variable list of children in the future?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't know what the future will bring - the pattern above is used for Unary[..]Functions which a grouping function might be, but we don't know yet.
Adding it now results in 2 methods without any reuse so it's arguable whether it makes sense to do it or not. In general, I prefer to add a pattern when it applies to at least 2 instances, otherwise it's more of a premature recognition.
I've added it nevertheless.

boolean negative = false;

ParserRuleContext parentCtx = interval.getParent();
if (parentCtx != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this nested if block doing?

Copy link
Contributor

@matriv matriv Dec 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of the structure of the grammar it's a way to get the - symbol from the immediate parent context in the hierarchy that could possibly contain it.

public boolean equals(Object obj) {
if (super.equals(obj)) {
GroupByNumericHistogram other = (GroupByNumericHistogram) obj;
return Objects.equals(interval, other.interval);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why Objects.equals and not interval == other.interval?

}

public void testGroupByDateHistogram() {
LogicalPlan p = plan("SELECT MAX(int) FROM test GROUP BY HISTOGRAM(int, 1000)");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a situation where the histogram is being grouped by very small intervals? Like 1 second for a date interval.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried a test with 1 second and it worked :). Our data might not be representative but I think it's a side effect of composite aggs which paginates through data instead of creating all the buckets up-front.

Copy link
Contributor

@matriv matriv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Great work!

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great stuff! LGTM

@costin costin merged commit 6ee6bb5 into elastic:master Dec 14, 2018
@costin costin deleted the introduce-histograms branch December 14, 2018 16:20
costin added a commit that referenced this pull request Dec 14, 2018
Introduce Histogram grouping function for bucketing/grouping data based
 on a given range. Both date and numeric histograms are supported using
 the appropriate range declaration (numbers vs intervals).

SELECT HISTOGRAM(number, 50) AS h FROM index GROUP BY h
SELECT HISTOGRAM(date, INTERVAL 1 YEAR) AS h FROM index GROUP BY h

In addition add multiply operator for Intervals
Add docs for intervals and histogram

Fix #36509

(cherry picked from commit 6ee6bb5)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants