
More GROUP BY enhancements #2111

Closed
wants to merge 18 commits into from

Conversation

big-andy-coates
Contributor

@big-andy-coates big-andy-coates commented Nov 1, 2018

Description

Following on from my recent PR improving GROUP BY, this PR does a little refactoring to tidy things up and adds support for more weird and wonderful GROUP BY clauses. You can now:

Also fixes: #2455

Use non-aggregate functions in the select statement:

e.g. the use of SUBSTRING here:

SELECT SUBSTRING(f1, 0, 1), COUNT(*) FROM TEST GROUP BY f1;

GROUP BY arithmetic or string concat using '+'

SELECT f2+ f1, COUNT(*) AS COUNT FROM TEST GROUP BY f2 + f1;

GROUP BY with constants in the SELECT expression

SELECT f1, 'some constant' as f3, COUNT(f1) FROM TEST GROUP BY f1;

HAVING clause that is an aggregate function:

SELECT f1, COUNT(*) FROM TEST GROUP BY f1 HAVING SUM(f1) > 1;

Multiple HAVING conditions

SELECT f1, COUNT(*) FROM TEST GROUP BY f1 HAVING SUM(f1) > 1 AND SUM(f1) < 100;

Not allowed, though possible:

Having clause that is constant

SELECT f2, SUM(f1) FROM TEST GROUP BY f2 HAVING f2='test';

Allowed, though contentious:

GROUP BY Functions where the parameters differ

e.g. Note the different parameters to SUBSTRING below.

SELECT SUBSTRING(source, 0, 1) AS Thing, COUNT(*) AS COUNT FROM TEST GROUP BY SUBSTRING(source, 0, 2);

GROUP BY expressions where the SELECT directly uses their parameters

SELECT x, y, COUNT(*) FROM TEST GROUP BY x + y;
SELECT x, y, COUNT(*) FROM TEST GROUP BY CONCAT(x + y);

The argument to allow non-aggregate non-group-by select columns

When we're doing GROUP BY operations, a normal db implementation would fail if you have a column in the select that isn't part of the group by, e.g.

SELECT x, y, COUNT(*) FROM t GROUP BY x

The DB would complain about y (or most do, anyway). This is because the DB is building a single output row (the aggregate) from a set of source rows.

KSQL, on the other hand, is processing row by row. Each input row updates the aggregate and outputs a new row. So KSQL has access to both the aggregate and the input row. This means we could support copying arbitrary fields from the source row to the output row!

We already copy ROWTIME from the source row to the output row. This is basically an updated_at of the aggregate. But another use-case we don't currently support would be wanting to know the updated_by for the aggregate. We could choose to support something like:

SELECT x, COUNT(*), user as updated_by FROM source GROUP BY x;

input (x, user)    output (x, count, updated_by)
0, bob             0, 1, bob
1, fred            1, 1, fred
0, peter           0, 2, peter
etc.
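The row-by-row behaviour described above can be sketched in Python. This is a hypothetical simulation of the proposed semantics, not KSQL internals; the function name is illustrative only:

```python
from collections import defaultdict

def aggregate_stream(rows):
    """Simulate row-by-row aggregation for something like
    SELECT x, COUNT(*), user AS updated_by FROM source GROUP BY x;
    Each input row updates the aggregate AND contributes its own
    'user' field to the emitted output row."""
    counts = defaultdict(int)
    out = []
    for x, user in rows:
        counts[x] += 1                    # update the COUNT aggregate for this group
        out.append((x, counts[x], user))  # copy 'user' straight from the source row
    return out

rows = [(0, "bob"), (1, "fred"), (0, "peter")]
for line in aggregate_stream(rows):
    print(line)
# (0, 1, 'bob')
# (1, 1, 'fred')
# (0, 2, 'peter')
```

Because the engine sees each triggering row, the non-aggregate field costs nothing extra to emit; a classic batch DB has no single "triggering row" to copy from.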

I'd be interested to hear what people think!

Testing done

Added unit, functional and QueryTranslationTests.

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@big-andy-coates big-andy-coates requested a review from a team as a code owner November 1, 2018 21:04
@apurvam
Contributor

apurvam commented Nov 2, 2018

Amazing, @big-andy-coates. Are the docs being updated to reflect the greater flexibility? These would be great examples to put in our docs, rather than leaving them languishing in PRs.

cc @JimGalasyn can you take that on?

@JimGalasyn
Member

@apurvam Absolutely, I've opened a JIRA to track!

@apurvam
Contributor

apurvam commented Nov 3, 2018

Thanks @JimGalasyn!

@big-andy-coates
Contributor Author

To be honest, I didn't update the docs, as the docs don't currently call out what is not supported by GROUP BY, and most users will just assume GROUP BY supports all the things a normal DB would... which it now does.

There's probably scope to add more advanced examples that use the improved functionality of GROUP BY, and this should undoubtedly be called out in the release notes. Other than that, I don't think the docs need to call out standard SQL functionality. But that's just my opinion.

big-andy-coates and others added 3 commits November 5, 2018 13:03
Conflict on:
  ksql-engine/src/main/java/io/confluent/ksql/planner/plan/AggregateNode.java
@hjafarpour
Contributor

@big-andy-coates thanks for the PR. A couple of comments on the semantics before I start the review:

  • GROUP BY arithmetic or string concat using '+' will result in ambiguity. Consider the example query:
    CREATE STREAM OUTPUT AS SELECT f1, f2, COUNT(*) AS COUNT FROM TEST GROUP BY f2 + f1;
    if we have (f1 = 1, f2 = 2) and (f1 = 2, f2 = 1), the group by value will be 3, but we will have different rows (1, 2) and (2, 1).

  • Having clause that is constant: the HAVING clause should only have predicates on aggregates; predicates on non-aggregate columns should go in the WHERE clause.

@big-andy-coates
Contributor Author

big-andy-coates commented Nov 7, 2018

  • GROUP BY arithmetic or string concat using '+': Good point. I guess string concat is fine, but arithmetic makes sense only if the SELECT contains, at most, the same arithmetic, e.g. the following should both be valid:
CREATE STREAM OUTPUT AS SELECT COUNT(*) AS COUNT FROM TEST GROUP BY f2 + f1;
CREATE STREAM OUTPUT AS SELECT f1 + f2, COUNT(*) AS COUNT FROM TEST GROUP BY f2 + f1;
  • Having clause that is constant: @rodesai - the 'having clause with constant' was one that you, if I understand it, requested / flagged - what was your reasoning behind that?

@hjafarpour
Contributor

I guess String concat is fine,

Even for strings we will have the ambiguity. Consider ('test1', 'test2') and ('tes', 't1test2'). Both will result in the same GROUP BY expression value, test1test2.
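The string-concat collision above can be seen directly; a trivial Python illustration of the ambiguity:

```python
# Two different input rows whose GROUP BY f1 + f2 value collides:
row_a = ("test1", "test2")
row_b = ("tes", "t1test2")

group_a = row_a[0] + row_a[1]
group_b = row_b[0] + row_b[1]

print(group_a, group_b)  # both "test1test2": two distinct rows, one group
assert group_a == group_b == "test1test2"
```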

@rodesai
Contributor

rodesai commented Nov 7, 2018

Having clause that is constant: @rodesai - the 'having clause with constant' was one that you, if I understand it, requested / flagged - what was your reasoning behind that?

SQL supports HAVING clauses that use non-aggregate columns. I don't see a good reason why we wouldn't support them, even though you might as well just put those conditions in the WHERE clause.

@rodesai
Contributor

rodesai commented Nov 7, 2018

Agree that we need to decide the semantics for the case @hjafarpour pointed out. FWIW, different SQL engines treat this case differently. Postgres rejects the statement, and MySQL uses an arbitrary row to fill the non-aggregate SELECT columns not present in the GROUP BY:
http://sqlfiddle.com/#!9/c0652/4

@big-andy-coates
Contributor Author

I've been doing a lot of reviews recently. Will get back to this asap. Thanks for the input. Current plan is to reject such statements, but I'll know more once I start having a look.

@rodesai
Contributor

rodesai commented Nov 14, 2018

Here's the issue with tables illustrated with an example:

Suppose we have the following table aggregation:

CREATE TABLE T1 (K INTEGER, X INTEGER, Y INTEGER) WITH (..., KEY='K');
CREATE TABLE T2 AS SELECT X, Y, COUNT(*) AS C FROM T1 GROUP BY X + Y;

Let's look at what happens when KSQL sees the following series of records in T1 (A -> B denotes a record with key A and value B):

10 -> {"K": 10, "X": 1, "Y": 2}
20 -> {"K": 20, "X": 2, "Y": 1}
10 -> {"K": 10, "X": 2, "Y": 2}

Assuming our semantic for select fields is that they take the value of the latest seen record, this should cause the following series of updates to T2:

3 -> {"X": 1, "Y": 2, "C": 1}
3 -> {"X": 2, "Y": 1, "C": 2}
3 -> {"X": 1, "Y": 2, "C": 1} 
4 -> {"X": 2, "Y": 2, "C": 1}

Internally, streams tries to let us do this by emitting the following records prior to the aggregation:

10 -> {"K": 10, "X": 1, "Y": 2}
20 -> {"K": 20, "X": 2, "Y": 1}
10 -> UNDO({"K": 10, "X": 1, "Y": 2})
10 -> {"K": 10, "X": 2, "Y": 2}

Then, when we get the UNDO record, it's up to our aggregator to undo the effect of that table update. I can't think of a way to implement this (or even the relaxed semantic that X and Y take the values of some row from the table) without buffering all the previously seen values.

We could relax the semantic to just say that X and Y are some previously seen value for that grouping, but at that point what's the point of even supporting it?
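The asymmetry Rohan describes can be sketched in Python (a hypothetical simulation, not Kafka Streams internals): COUNT is trivially invertible under UNDO records, whereas a "latest X/Y" field is not:

```python
from collections import defaultdict

ADD = "add"
UNDO = "undo"

def table_count(changelog):
    """Simulate COUNT(*) ... GROUP BY x + y over a table changelog.
    An UNDO simply subtracts from the running count; note that nothing
    here could restore 'the X/Y of some remaining row' after an undo."""
    counts = defaultdict(int)
    out = []
    for op, rec in changelog:
        group_key = rec["X"] + rec["Y"]
        counts[group_key] += 1 if op == ADD else -1  # COUNT has an inverse
        out.append((group_key, counts[group_key]))
    return out

changelog = [
    (ADD,  {"K": 10, "X": 1, "Y": 2}),
    (ADD,  {"K": 20, "X": 2, "Y": 1}),
    (UNDO, {"K": 10, "X": 1, "Y": 2}),  # old value for key 10 is retracted
    (ADD,  {"K": 10, "X": 2, "Y": 2}),  # new value moves key 10 to group 4
]
for update in table_count(changelog):
    print(update)
# (3, 1)
# (3, 2)
# (3, 1)
# (4, 1)
```

COUNT works because subtraction undoes addition. A LAST/LATEST-style field would need the full ordered history of the group's prior rows to answer "what value remains after the retraction", which is exactly the unbounded buffering problem above.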

@hjafarpour
Contributor

@blueedgenick and @rmoff your feedback on this one is much appreciated! :)

@big-andy-coates
Contributor Author

@rodesai,

I'm not sure I see where there is a problem.

When streams emits the

10 -> UNDO({"K": 10, "X": 1, "Y": 2})

I would have thought KSQL would first repartition the record based on the GROUP BY x + y clause, i.e. it becomes:

3 -> UNDO({"K": 10, "X": 1, "Y": 2})

And then the aggregator will undo this record for the COUNT UDAF by reducing the count for key 3 from 2 to 1.

Or am I missing something?

@big-andy-coates
Contributor Author

big-andy-coates commented Nov 19, 2018

Maybe we could support this using a 'LAST' or 'LATEST' UDAF...? This would make it more SQL like.

@miguno
Contributor

miguno commented Nov 19, 2018

cc @mjsax, see Rohan's comment above: #2111 (comment)

This goes back to our KStreams related discussions on aggregation semantics (and properties like associative / commutative / ... for functions).

@mjsax
Member

mjsax commented Nov 19, 2018

I agree with @hjafarpour that those statements should be rejected as invalid. For constants in HAVING, I see no reason to not support it. It's a matter of optimization to push it as a filter before the aggregation operator IMHO. It's syntactic sugar only of course, as you can put it into WHERE, too.

The argument for timestamps is also not completely applicable IMHO: semantically, the timestamp of the output records is computed as an aggregation over the input records. We want it to be the maximum over all of them (atm this is not implemented in Streams, and KSQL does not need to reimplement it, but can just "copy" the timestamp). But this is just a gap between the implementation and the concept. Conceptually, it's an aggregation.

The nature of the timestamp aggregation is also important, because it makes the computation deterministic. I think KSQL semantics should be built on event-time order, not offset order -- the suggestion to use latest() to pick a random value (for ill-formed statements, instead of rejecting them) seems to introduce non-determinism based on offset order (note that in each repartition step the offset order changes, and thus there is no deterministic offset order for reprocessing).

Btw: MySQL should not be the gold standard here... it takes too many shortcuts -- it's like browsers that render all kinds of ill-formatted HTML to let people "successfully" build their homepage -- MySQL similarly accepts mis-formed SQL statements to be accessible to a broader user base -- we should not follow this example IMHO -- education before shortcuts!

@rodesai
Contributor

rodesai commented Nov 20, 2018

@mjsax I'm not advocating for supporting it. My position is that if we can support it as syntactic sugar for a LAST/LATEST/REDUCE UDAF, then I don't see a problem. But I think there are two gaps we'd need to bridge before doing so:

  • The issue with undo semantics for LAST/LATEST/REDUCE on tables that I pointed out with a test.

  • The issue with the ordering semantics of the LAST/LATEST/REDUCE aggregator itself, which you brought up. The operation should return the value of the column with the latest timestamp. Currently, the way KSQL aggregations and UDAFs are implemented we can't do this - we don't store the rowtime in the record in the state store, and even if we did we'd need some way to access it from the UDAF, which the current interface doesn't really make easy. I guess we could implement it as a special case in the aggregate node, but we'd still need to get to the rowtime somehow.

@mjsax
Member

mjsax commented Nov 20, 2018

Thanks for clarification @rodesai.

I don't think it's a good idea to auto-magically insert a LATEST UDAF for this case. If a user wants this, she should explicitly write CREATE TABLE AGG AS SELECT LATEST(X), LATEST(Y), COUNT(*) AS C FROM TEST GROUP BY X + Y;

I don't see any reason why LATEST should be special (i.e., why pick this one as the default) compared to other UDAFs, and auto-magical behavior is usually not a good idea in my experience---or maybe that's just my personal taste.

Anyway, how to implement a LATEST function is a different beast. I agree that you would need to preserve the complete history of all updates (because there is no time window involved), and this makes it impossible to implement, as this history is unbounded. For a stream input it would be simpler, because there is no retraction and it's sufficient to maintain the value of the record with the largest timestamp seen. The question now is, do we want to allow LATEST for streams but not for tables? Or should it not be added at all (to provide the same functionality for both)?

@big-andy-coates
Contributor Author

big-andy-coates commented Nov 20, 2018

Thanks for everyone's input on this!

I think we've kind of agreed on using explicit syntax for this, e.g. something like LATEST(x).

Though I'm not sure 'LATEST' is the right name, as I don't think it should attempt to get the value from the latest record, i.e. the record with the highest timestamp. Rather, the functionality I'm thinking about is the ability to copy a field from the source row, i.e. the row that's currently being processed and is updating the aggregate, whatever that row may be.

I guess LATEST is what you'd want to use if implementing some last_updated_by functionality, as you'd want the last person who updated the aggregate. So maybe that was a bad example. Let's leave LATEST and its issues with undo on tables aside.

The idea I'm throwing about is probably more of a FROM_ROW UDF, which copies a field from the record updating the aggregate. It promises nothing in terms of time ordering. As such, it can be supported by tables.

Yeah, this would be offset ordered, but if that doesn't cause issues with the use case the user is trying to solve, then is there still a place for this, caveats and all?

I can't help thinking that it may be useful and help solve some use case that wouldn't be possible with KSQL without it. I just can't think of one, which may in itself prove something...

To get this PR moving again, my plan is to only allow matching select clauses, so the following will be valid:

SELECT SUBSTRING(x, 0, 2), COUNT(*) FROM TEST GROUP BY SUBSTRING(x, 0, 2);
SELECT SUBSTRING(x, 0, 2), COUNT(*) FROM TEST GROUP BY x;
SELECT x + y, COUNT(*) FROM TEST GROUP BY x + y;

But the following, unfortunately won't:

-- first column is a subset of the group by clause:
SELECT SUBSTRING(x, 0, 2), COUNT(*) FROM TEST GROUP BY SUBSTRING(x, 0, 10);

@miguno
Contributor

miguno commented Nov 20, 2018

@big-andy-coates:

I modified one of your "valid" examples to align with @rodesai's original example at #2111 (comment).

CREATE TABLE T1 (K INTEGER, X INTEGER, Y INTEGER) WITH (..., KEY='K');
CREATE TABLE T2 AS SELECT x + y AS XY, COUNT(*) AS CNT FROM T1 GROUP BY x + y;

Let's look at what happens when KSQL sees the following series of records in T1 (A -> B denotes a record with key A and value B):

10 -> {"K": 10, "X": 1, "Y": 2}
20 -> {"K": 20, "X": 2, "Y": 1}
10 -> {"K": 10, "X": 2, "Y": 2}

I think Andy is suggesting that this would result in the following series of updates to T2:

3 -> {"XY": 3, "CNT": 1}
3 -> {"XY": 3, "CNT": 2}
4 -> {"XY": 4, "CNT": 1}

However, I think we still have the same problem. The second input record for key 10, that is 10 -> {"K": 10, "X": 2, "Y": 2}, means that the original value for key 10 in the first table T1 was changed:

key   old value                    new value
10    {"K": 10, "X": 1, "Y": 2}   {"K": 10, "X": 2, "Y": 2}

If I understand correctly, then the value of output key 3 would have to be updated; here, the count CNT would need decrementing from 2 to 1. Hence the actual series of updates to T2 would need to be:

3 -> {"XY": 3, "CNT": 1}
3 -> {"XY": 3, "CNT": 2}
3 -> {"XY": 3, "CNT": 1}    <<< we need this one
4 -> {"XY": 4, "CNT": 1}

And we're back to the start with the same problem.

@mjsax
Member

mjsax commented Nov 20, 2018

I can't help thinking that it may be useful and help solve some use case that wouldn't be possible with KSQL without it. I just can't think of one, which may in itself prove something...

Agreed :)

For what is allowed and what is not, I basically agree. However, for the example

SELECT SUBSTRING(x, 0, 2), COUNT(*) FROM TEST GROUP BY SUBSTRING(x, 0, 10);

we could actually be more sophisticated, because we know the semantics of SUBSTRING. The current example could be allowed, because we know that SUBSTRING(x, 0, 2) == SUBSTRING(SUBSTRING(x, 0, 10), 0, 2). This is an advanced optimization, but we could actually support and allow it. Of course, we can't do this for all expressions, only if we know the semantics and they fulfill the property of "containment" (not sure if this is the best term for it---I guess there should be a mathematical term...).

Does this make sense?
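The containment property can be checked concretely. In this Python sketch the substring helper is a hypothetical analogue of SQL SUBSTRING, treating its arguments as start and end offsets, which may not match KSQL's actual signature:

```python
def substring(s: str, start: int, end: int) -> str:
    """Hypothetical analogue of SQL SUBSTRING using start/end offsets,
    only to illustrate the containment property discussed above."""
    return s[start:end]

x = "confluent"
# SUBSTRING(x, 0, 2) == SUBSTRING(SUBSTRING(x, 0, 10), 0, 2):
# the longer prefix used for grouping fully determines the shorter prefix
# used in the projection, so the SELECT value is constant within each group.
assert substring(x, 0, 2) == substring(substring(x, 0, 10), 0, 2)
print("containment holds for prefix substrings")
```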

@miguno What you say makes sense; however, there is no problem, because for the example you describe, we know that the value associated with each group does not change, i.e., in your example we know that X + Y == 3 (we don't know what the old X or Y was, but we also don't need to know it). Or do I miss something?

@miguno
Copy link
Contributor

miguno commented Nov 20, 2018

@mjsax: I think the root issue is that we are doing a table->table operation.

Perhaps this is the shortest way I can describe the issue I am seeing: With tables, we normally recommend log compaction.

So this original, non-compacted input:

10 -> {"K": 10, "X": 1, "Y": 2}
20 -> {"K": 20, "X": 2, "Y": 1}
10 -> {"K": 10, "X": 2, "Y": 2}

could be log-compacted to:

20 -> {"K": 20, "X": 2, "Y": 1}
10 -> {"K": 10, "X": 2, "Y": 2}

If this happens, then Andy's "valid" variant will cause the table->table operation to produce different results because, in his variant, the idea is that we would not be sending a "correction" record when the second record for key 10 is being processed.

For the non-compacted input, the output would be (note: the output itself is subject to compaction because its output type is a table):

3 -> {"XY": 3, "CNT": 1}
3 -> {"XY": 3, "CNT": 2}
4 -> {"XY": 4, "CNT": 1}

For the compacted input, the output would be:

4 -> {"XY": 4, "CNT": 1}
3 -> {"XY": 3, "CNT": 1}

The problem is not that 4 comes before 3 in the second output variant, but that the latest count for 3 is 1 instead of 2.

@mjsax
Member

mjsax commented Nov 20, 2018

I agree that we need to produce the same result, and that we need to send a correction/subtraction record for the non-compacted input. However, why do you think this would not happen? Kafka Streams will generate this record correctly from my understanding. Can you elaborate on why Kafka Streams would compute an incorrect subtraction record (or not emit it at all)?

From my understanding, the output will be:

3 -> {"XY": 3, "CNT": 1}
3 -> {"XY": 3, "CNT": 2}
3 -> {"XY": 3, "CNT": 1} <--- why do you think this record would not be computed?
4 -> {"XY": 4, "CNT": 1}

@big-andy-coates
Contributor Author

big-andy-coates commented Nov 20, 2018

we could actually be more sophisticated, because we know the semantics of SUBSTRING. The current example could be allowed, because we know that SUBSTRING(x, 0, 2) == SUBSTRING(SUBSTRING(x, 0, 10), 0, 2). This is an advance optimization but we could actually support and allow it. Of course, we can't do this for all expressions, only of the know the semantics and they full-fill the property of "containment" (not sure if this is the best term for it---I guess there should be a mathematical term...).

Yeah, I've thought along the same lines, but it's not a simple undertaking to figure this out. It's not too bad if it's the same function, but what if it starts to go across functions while still being 'contained'?

SELECT MIN(MAX(x, 5), 10), COUNT(*) FROM TEST GROUP BY MAX(x, 5);

Even just handling parameter order in functions where it doesn't matter doesn't come for free:

-- won't work as it stands as the parser doesn't know MIN(5, x) === MIN(x, 5)
SELECT MIN(5, x), COUNT(*) FROM TEST GROUP BY MIN(x, 5);

-- contrived, but should also be allowed, but won't be.
SELECT MIN(x, CAST(CAST(5, VARCHAR), INT)), COUNT(*) FROM TEST GROUP BY MIN(x, 5);

That's why I was thinking the FROM_ROW (bad name...) could be useful because it would allow:

SELECT MIN(MAX(FROM_ROW(x), 5), 10), COUNT(*) FROM TEST GROUP BY MAX(x, 5);
SELECT MIN(5, FROM_ROW(x)), COUNT(*) FROM TEST GROUP BY MIN(x, 5);
SELECT SUBSTRING(FROM_ROW(x), 0, 2), COUNT(*) FROM TEST GROUP BY SUBSTRING(x, 0, 10);

Though I guess this is not preferable to the parser knowing these things are allowed; it's just easier to implement.

@mjsax
Member

mjsax commented Nov 20, 2018

@big-andy-coates Totally agreed. It's a very challenging problem... Just wanted to mention it for sake of completeness of the discussion.

@rodesai
Contributor

rodesai commented Nov 20, 2018

CREATE TABLE T2 AS SELECT x + y AS XY, COUNT(*) AS CNT FROM T1 GROUP BY x + y;
I don't think there's any reason this won't work. KStreams will emit the correct updates with old and new values.

Yes, agree here. My concern was more about a LATEST UDAF and tables.

@miguno
Contributor

miguno commented Nov 21, 2018

I synced offline with @big-andy-coates. This helped to clear up a few things. :-)

First, I see now why Andy's following suggestion would indeed work, unlike what I wrote in my two last comments above (#2111 (comment) and #2111 (comment)):

-- This would work indeed.
CREATE TABLE T2 AS SELECT x + y AS XY, COUNT(*) AS CNT FROM T1 GROUP BY x + y;

To clarify for other readers, let me summarize again the problem we are talking about.

DDL for the example:

CREATE TABLE T1 (K INTEGER, X INTEGER, Y INTEGER) WITH (..., KEY='K');
CREATE TABLE T2 AS SELECT X, Y, COUNT(*) AS C FROM T1 GROUP BY x + y;

Updates for table T1:

1. 10 -> {"X": 1, "Y": 2}
2. 20 -> {"X": 2, "Y": 1}
3. 20 -> {"X": 3, "Y": 0} -> this results in an UNDO for #2, but the downstream key stays at 3
4. 20 -> {"X": 2, "Y": 2} -> note how this record causes the downstream key to change from 3 to 4;
                             this will also result downstream in an UNDO for record #3 for key 3

This results in the following updates to table T2:

a. 3 -> {"X": 1, "Y": 2, "C": 1}   OK
b. 3 -> {"X": 2, "Y": 1, "C": 2}   OK
c. 3 -> {"X": 3, "Y": 0, "C": 2}   OK
d. 3 -> {"X": 3, "Y": 0, "C": 2} <<< this is the problematic record, caused by the UNDO above
e. 4 -> {"X": 2, "Y": 2, "C": 1}   OK

The problem is what X and Y are in record #d (it's not about C), and this is why e.g. we talked about issues with "LATEST" behavior for tables being problematic. Record #d (and thus the state of T2) should be the same as #a (X=1, Y=2, C=1). But for tables (which have and need UNDOs, unlike streams), as Rohan pointed out above in #2111 (comment), in a realistic scenario this would require us to retain the full, ordered history of past records (in a simple example such as the one above we could fix the example by having access to just the 1-2 previous records, but that doesn't work in general; roughly speaking: the more UNDOs, the more prior history you need). However, the only known information for X and Y we'd have access to in the current KStreams implementation (correct me if I'm wrong) are:

  1. The prior state of T2, which would tell us X=3, Y=0.
  2. The UNDO record, which would tell us X=2, Y=1.

Neither of which are the values for X and Y that -- in my opinion -- one would intuitively expect in this scenario. This intuition is what the "LATEST" discussion above was about, where (in this example after record 4 was processed) the now latest update in T2 for key 3 is determined by record 1, i.e. 10 -> {"X": 1, "Y": 2}, which means it would be #a = 3 -> {"X": 1, "Y": 2, "C": 1}.

@big-andy-coates's suggestion now is to limit what you can put into the SELECT clause when you use an expression in GROUP BY.

CREATE TABLE T2 AS SELECT x + y AS XY, COUNT(*) AS C FROM T1 GROUP BY x + y;

Here, we are effectively ensuring that all fields in the T2 update records are either (1) a constant across all records (e.g. X+Y aka XY == 3 for all records for the downstream key 3) or (2) the result of an aggregation (like COUNT(*) aka C), which we know how to update correctly.

Is that a fair summary?

@big-andy-coates
Contributor Author

@rodesai

Yes, agree here. My concern was more about a LATEST UDAF and tables

Yep, and I agree that can't work. Even a FROM_ROW UDAF couldn't work with tables. But we already support a mechanism for distinguishing between UDAFs that support table->table group-bys and those that only support stream->table group-bys.

@hjafarpour
Contributor

To get this PR moving again, my plan is to only allow matching select clauses, so the following will be valid:

SELECT SUBSTRING(x, 0, 2), COUNT(*) FROM TEST GROUP BY SUBSTRING(x, 0, 2);
SELECT SUBSTRING(x, 0, 2), COUNT(*) FROM TEST GROUP BY x;
SELECT x + y, COUNT(*) FROM TEST GROUP BY x + y;
But the following, unfortunately won't:

-- first column is a subset of the group by clause:
SELECT SUBSTRING(x, 0, 2), COUNT(*) FROM TEST GROUP BY SUBSTRING(x, 0, 10);

@big-andy-coates I'm +1 for this approach. I see the point from @mjsax about supporting more than an exact match, but it not only makes the implementation more complex, it may also result in a lot of confusion for users. We would have to explain when a non-exact match is permissible and when it is not. Requiring an exact match makes it much easier to describe the query behavior and results in much less ambiguity.

@miguno
Contributor

miguno commented Nov 29, 2018

I agree with @hjafarpour that we shouldn't get blocked by supporting more than an exact match for the initial implementation.

@miguno
Contributor

miguno commented Nov 29, 2018

@blueedgenick : You also had some comments to share if I recall correctly?

#### Conflicting files
`ksql-engine/src/main/java/io/confluent/ksql/QueryEngine.java`
`ksql-engine/src/main/java/io/confluent/ksql/analyzer/QueryAnalyzer.java`
`ksql-engine/src/main/java/io/confluent/ksql/planner/plan/AggregateNode.java`
`ksql-engine/src/main/java/io/confluent/ksql/structured/SchemaKStream.java`
`ksql-engine/src/test/java/io/confluent/ksql/analyzer/AnalyzerTest.java`
`ksql-engine/src/test/java/io/confluent/ksql/analyzer/QueryAnalyzerTest.java`
`ksql-engine/src/test/java/io/confluent/ksql/physical/PhysicalPlanBuilderTest.java`
`ksql-engine/src/test/java/io/confluent/ksql/planner/PlanSourceExtractorVisitorTest.java`
`ksql-engine/src/test/java/io/confluent/ksql/planner/plan/AggregateNodeTest.java`
`ksql-engine/src/test/java/io/confluent/ksql/planner/plan/JoinNodeTest.java`
`ksql-engine/src/test/java/io/confluent/ksql/planner/plan/KsqlBareOutputNodeTest.java`
`ksql-engine/src/test/java/io/confluent/ksql/structured/SchemaKGroupedTableTest.java`
`ksql-engine/src/test/java/io/confluent/ksql/structured/SchemaKStreamTest.java`
`ksql-engine/src/test/java/io/confluent/ksql/structured/SchemaKTableTest.java`
`ksql-engine/src/test/java/io/confluent/ksql/structured/SelectValueMapperTest.java`
`ksql-engine/src/test/java/io/confluent/ksql/structured/SqlPredicateTest.java`
`ksql-engine/src/test/java/io/confluent/ksql/testutils/AnalysisTestUtil.java`
`ksql-engine/src/test/resources/query-validation-tests/group-by.json`
Member

@JimGalasyn left a comment

Ack, 1337 files changed, browser oom, LGTM!

@JimGalasyn
Member

HFS, this PR brought my machine to its knees! ><

 - must have aggregate column
 - disallows UDFs in projection with different params to GROUP BY
 - disallows fields in projection that are params to UDFs in GROUP BY
 - disallows string concat in projection that differs from GROUP BY
 - disallows fields in projection that are involved in string concat in GROUP BY
 - disallows arithmetic in projection that differs from GROUP BY
 - does now allow constant HAVING clause.
@big-andy-coates
Contributor Author

@JimGalasyn investigating - PR shouldn't be that large. Something has gone wrong.

@big-andy-coates
Contributor Author

Closing in favour of #2472

@big-andy-coates big-andy-coates deleted the group_by_2 branch February 21, 2019 11:46
Development

Successfully merging this pull request may close these issues.

Clarify error message
7 participants