Cannot create aggregate without a GROUP BY #430

Closed
rmoff opened this issue Nov 1, 2017 · 17 comments · Fixed by #8986
rmoff (Contributor) commented Nov 1, 2017

KSQL insists on having a GROUP BY clause, even when the query logically makes sense without one.

For example: how many tweets were there in each hourly period?

ksql> select count(*) from twitter window tumbling (size 1 hour);
Aggregate query needs GROUP BY clause.
ksql>
rmoff (Contributor, Author) commented Nov 1, 2017

Workaround:

Step 1: Create a dummy column:

CREATE STREAM TWITTER2 AS SELECT 1 AS FOO,* FROM TWITTER;

Step 2: Group by dummy column:

SELECT FOO,COUNT(*) FROM TWITTER2 WINDOW TUMBLING (SIZE 1 HOUR) GROUP BY FOO;

Optionally, instantiate as a table and view window / aggregate:

CREATE TABLE TWEETS_PER_HOUR AS SELECT FOO,COUNT(*) AS TWEET_COUNT FROM TWITTER2 WINDOW TUMBLING (SIZE 1 HOUR) GROUP BY FOO;
SELECT TIMESTAMPTOSTRING(ROWTIME, 'yyyy-MM-dd HH:mm:ss.SSS') AS WINDOW_START , TWEET_COUNT FROM TWEETS_PER_HOUR;
2017-11-01 15:00:00.000 | 165
2017-11-01 16:00:00.000 | 72
...

miguno (Contributor) commented Feb 5, 2018

My suggestion:

  • Ideally, can we find a way to support the original (non-grouped) query out of the box in some form?
  • Alternatively, can we improve the error message, e.g. by adding information about the workaround that @rmoff mentioned (or whatever else we deem appropriate)? I doubt that the majority of users would be able to come up with that workaround quickly on their own.

rmoff (Contributor, Author) commented Feb 5, 2018

I'd strongly +1 adding workaround info to the error message if this can't be implemented quickly. Seasoned SQL hackers will be used to tricks like this, but most others won't (and will fully expect KSQL to support this, and be disappointed when it doesn't).

dguy (Contributor) commented Feb 7, 2018

One of the issues is that we are dealing with partitioned data, so to make this work we'd need to take all the data from the partitions and pipe it into a single topic partition to perform the count and similar operations. Some operations, e.g. sum and count, could be done in two phases: first on the partitions, and then again as a final aggregate on the single topic partition. However, this won't work for all operations, e.g. calculating an average.
That said, you could do this in Hive. It would just generate a MapReduce job with a single aggregator at the end (which is essentially the same thing we'd need to do to support it).
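
To make the two-phase idea concrete, here is a hedged sketch for COUNT, reusing the TWITTER2 stream from the workaround above. USER_ID is an assumed column, and whether a table can be re-aggregated like this depends on the KSQL/ksqlDB version:

-- Phase 1: partial counts, computed in parallel across partitions; the
-- constant FOO is carried along so phase 2 has a single key to group by.
CREATE TABLE COUNTS_BY_USER AS
  SELECT USER_ID, FOO, COUNT(*) AS PART_COUNT
  FROM TWITTER2
  GROUP BY USER_ID, FOO;

-- Phase 2: funnel all partials through the single key FOO and add them up.
-- This works for COUNT and SUM because the partials combine by addition.
CREATE TABLE TOTAL_COUNT AS
  SELECT FOO, SUM(PART_COUNT) AS TOTAL
  FROM COUNTS_BY_USER
  GROUP BY FOO;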

miguno (Contributor) commented Feb 7, 2018

After talking to @dguy offline, here's my understanding:

  1. There is a workaround as @rmoff showed, but users may not know about it.
  2. For the general case the desired functionality requires, essentially, the input data to be written to a single partition. This works, of course, but is not super efficient. (Good news is that some functions such as SUM and COUNT can be optimized so they are more efficient.)
  3. Both the workaround and a possible OOTB functionality suffer from the same inefficiency described in point 2. The difference is that OOTB has a much better UX. But, to be clear, there is no technical hurdle to providing this OOTB.

Given that we are busy with other work right now, I'd suggest:

  1. Short-term: We provide a more descriptive and actionable error message, e.g. possibly telling users about the workaround.
  2. We'll work on supporting this OOTB (no workaround needed any longer) in the future, i.e. once we can move this ticket to the top of our backlog.

miguno (Contributor) commented Feb 7, 2018

WDYT @rmoff ?

rmoff (Contributor, Author) commented Feb 8, 2018

+1 Sounds like a good approach to me (better error message immediately with workaround, longer term provide the functionality OOTB).

rmoff (Contributor, Author) commented Feb 22, 2018

@miguno @hjafarpour If a verbose error message with the workaround is not an option, then I would advocate just giving a URL to this issue, so that people can find the workaround (and track progress on implementing the feature) for themselves.

So instead of:

ksql> select count(*) from twitter window tumbling (size 1 hour);
Aggregate query needs GROUP BY clause.
ksql>

we would get:

ksql> select count(*) from twitter window tumbling (size 1 hour);
Aggregate query needs GROUP BY clause. See https://github.com/confluentinc/ksql/issues/430.
ksql>

big-andy-coates (Contributor) commented
@dguy what's the issue with calculating an average? Couldn't each partition track a count and a sum, which are then passed to a final aggregation stage?

I'm assuming this would require enhancements to Kafka Streams, but I can't see any reason why it wouldn't work (except issues with numerical overflow, of course).
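
Following the two-phase COUNT sketch earlier in this thread, here is a hedged sketch of the final stage for an average. AVG_PARTIALS is assumed to be built like COUNTS_BY_USER above, but tracking SUM of an assumed numeric column alongside COUNT(*); all names are illustrative:

-- The global average is the SUM of partial sums over the SUM of partial
-- counts; averaging the partial averages would give the wrong answer
-- whenever the partials cover different numbers of rows.
CREATE TABLE GLOBAL_AVG AS
  SELECT FOO, SUM(PART_SUM) / SUM(PART_COUNT) AS AVG_VALUE
  FROM AVG_PARTIALS
  GROUP BY FOO;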

dguy (Contributor) commented Mar 1, 2018

It doesn't need anything on the Streams side; it's just another object you write to a topic that gets consumed and can be used to aggregate. It just isn't currently supported in KSQL.

Jianyi-Ren commented
Was wondering if there's been any update on this issue after half a year?

janicesn0w commented
Will there be any update on this issue? I think it will be really useful to have this soon in KSQL.

a6kme commented Mar 27, 2020

Hey @rmoff, just checking in on this. Is the workaround still the solution to the problem? If so, I can just send a dummy constant value in my original topic and use that for GROUP BY in aggregates.

rmoff (Contributor, Author) commented Mar 28, 2020

/cc @MichaelDrogalis @derekjn ☝️

MichaelDrogalis (Contributor) commented
We don't have this one queued up yet, but hope to soon. Patches always welcome in the meantime. 🙏

tomasfabian (Contributor) commented Jan 14, 2021

Hi, I wanted to create a count or sum aggregation without GROUP BY and I found this workaround.

Is it possible to receive a default initial value at the beginning of each time window?

At the moment it is not possible to process the absence of events. If there was no tweet within an hour (see @rmoff's example above), the subscriber will not be notified about the inactivity and won't be able to reset the aggregate to 0 in, say, a UI text block.

Example:
Hour 1: 22 messages, so the count equals 22.
Hour 2: 0 messages, no activity, so no aggregation is generated. COUNT should emit 0 (for integers) at the beginning of the time window.

In this case the UI would keep showing the previous 22 messages, because it is not aware of the start of the current (or end of the previous) time window. Another type of processing might need to raise an alarm in case of no activity, e.g. an IoT sensor that has probably stopped sending data.

EDIT:
Is it possible to achieve this somehow with the proposed workaround?

Thank you, Tomas
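
One pattern that can approximate this (a sketch only, not something ksqlDB provides out of the box: TWEETS_AND_TICKS and the IS_TICK flag are assumptions, with the tick rows produced by an external scheduler) is to merge periodic heartbeat rows into the stream so every window contains at least one row, then count only the real events:

-- TWEETS_AND_TICKS is an assumed stream interleaving real tweets
-- (IS_TICK = 0) with heartbeat rows from an external ticker (IS_TICK = 1).
CREATE TABLE TWEETS_PER_HOUR_WITH_ZEROES AS
  SELECT FOO,
         SUM(CASE WHEN IS_TICK = 1 THEN 0 ELSE 1 END) AS TWEET_COUNT
  FROM TWEETS_AND_TICKS
  WINDOW TUMBLING (SIZE 1 HOUR)
  GROUP BY FOO;

An hour that contains only a tick then emits a row with TWEET_COUNT = 0, which a subscriber could use to reset a UI or raise an inactivity alarm.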

mjsax added the requires-streams-change, sql-syntax, and streaming-engine (Tickets owned by the ksqlDB Streaming Team) labels on May 16, 2021
outminder commented
Hello @rmoff / @mjsax, what if I need to select a column but do not want to have it as part of the GROUP BY? Example:

SELECT ID, TYPE, SUM(PLAYBACK) AS TOTAL_PLAYBACK GROUP BY ID

I do not want to include TYPE in the GROUP BY but would like the value selected, as it will be used later to merge with a different table, or taken downstream where TYPE is needed. If I include TYPE, then it will SUM per media type, for example, whereas I want the total SUM. Any suggestions on what can be done here?
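
One option in recent ksqlDB versions is to carry the non-grouped column through the aggregation with LATEST_BY_OFFSET instead of adding it to the GROUP BY. A sketch, assuming a hypothetical source stream PLAYS with ID, TYPE, and PLAYBACK columns:

CREATE TABLE TOTAL_PLAYBACK_BY_ID AS
  SELECT ID,
         LATEST_BY_OFFSET(TYPE) AS TYPE,   -- most recent TYPE seen per ID
         SUM(PLAYBACK) AS TOTAL_PLAYBACK   -- total across all types
  FROM PLAYS
  GROUP BY ID;

Because TYPE is no longer a grouping key, the SUM stays a single total per ID, and the TYPE column just reflects the latest value observed.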
