Cannot create aggregate without a GROUP BY #430

Closed
rmoff opened this issue Nov 1, 2017 · 17 comments · Fixed by #8986
rmoff (Contributor) commented Nov 1, 2017

KSQL insists on having a GROUP BY clause, even when the query logically makes sense without one.

For example: how many tweets were there in each hourly period?

ksql> select count(*) from twitter window tumbling (size 1 hour);
Aggregate query needs GROUP BY clause.
ksql>
rmoff (Contributor, Author) commented Nov 1, 2017

Workaround:

Step 1: Create a dummy column:

CREATE STREAM TWITTER2 AS SELECT 1 AS FOO,* FROM TWITTER;

Step 2: Group by dummy column:

SELECT FOO,COUNT(*) FROM TWITTER2 WINDOW TUMBLING (SIZE 1 HOUR) GROUP BY FOO;

Optionally, instantiate as a table and view window / aggregate:

CREATE TABLE TWEETS_PER_HOUR AS SELECT FOO,COUNT(*) AS TWEET_COUNT FROM TWITTER2 WINDOW TUMBLING (SIZE 1 HOUR) GROUP BY FOO;
SELECT TIMESTAMPTOSTRING(ROWTIME, 'yyyy-MM-dd HH:mm:ss.SSS') AS WINDOW_START , TWEET_COUNT FROM TWEETS_PER_HOUR;
2017-11-01 15:00:00.000 | 165
2017-11-01 16:00:00.000 | 72
...

miguno (Contributor) commented Feb 5, 2018

My suggestion:

  • Ideally, can we find a way to support the original (non-grouped) query out of the box in some form?
  • Alternatively, can we improve the error message, e.g. by adding information about the workaround that @rmoff mentioned (or whatever else we deem appropriate)? I doubt that the majority of users would be able to come up with that workaround quickly on their own.

rmoff (Contributor, Author) commented Feb 5, 2018

I'd strongly +1 adding workaround info to the error message if this can't be implemented quickly. Seasoned SQL hackers will be used to tricks like this, but most others won't (and will fully expect KSQL to support this, and be disappointed when it doesn't).

dguy (Contributor) commented Feb 7, 2018

One of the issues is that we are dealing with partitioned data, so to make this work we'd need to take all the data from the partitions and pipe it into a single topic partition to perform the count and similar operations. Some operations, e.g. sum and count, could be done in two phases: first on the partitions, and then again as a final aggregate on the single topic partition. However, this won't work for all operations, e.g. calculating an average.
That said, you could do this in Hive. It would just generate a MapReduce job with a single aggregator at the end (which is essentially the same thing we'd need to do to support it).
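
To make the two-phase idea concrete, here is a hedged sketch for COUNT, reusing the TWITTER2 stream from the workaround above. USER_ID is an assumed column, and whether a table can be re-aggregated like this depends on the KSQL/ksqlDB version:

-- Phase 1: partial counts, computed in parallel across partitions; the
-- constant FOO is carried along so phase 2 has a single key to group by.
CREATE TABLE COUNTS_BY_USER AS
  SELECT USER_ID, FOO, COUNT(*) AS PART_COUNT
  FROM TWITTER2
  GROUP BY USER_ID, FOO;

-- Phase 2: funnel all partials through the single key FOO and add them up.
-- This works for COUNT and SUM because the partials combine by addition.
CREATE TABLE TOTAL_COUNT AS
  SELECT FOO, SUM(PART_COUNT) AS TOTAL
  FROM COUNTS_BY_USER
  GROUP BY FOO;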

miguno (Contributor) commented Feb 7, 2018

After talking to @dguy offline, here's my understanding:

  1. There is a workaround as @rmoff showed, but users may not know about it.
  2. For the general case the desired functionality requires, essentially, the input data to be written to a single partition. This works, of course, but is not super efficient. (Good news is that some functions such as SUM and COUNT can be optimized so they are more efficient.)
  3. Both the workaround and a possible OOTB functionality suffer from the same inefficiency described in point 2. The difference is that OOTB has a much better UX. But, to be clear, there is no technical hurdle to providing this OOTB.

Given that we are busy with other work right now, I'd suggest:

  1. Short-term: We provide a more descriptive and actionable error message, e.g. possibly telling users about the workaround.
  2. We'll work on supporting this OOTB (no workaround needed any longer) in the future, i.e. once we can move this ticket to the top of our backlog.

miguno (Contributor) commented Feb 7, 2018

WDYT @rmoff ?

rmoff (Contributor, Author) commented Feb 8, 2018

+1 Sounds like a good approach to me (better error message immediately with workaround, longer term provide the functionality OOTB).

rmoff (Contributor, Author) commented Feb 22, 2018

@miguno @hjafarpour If a verbose error message with the workaround is not an option, then I would advocate just giving a URL to this issue, so that people can find the workaround (and track progress on implementing the feature) for themselves.

So instead of:

ksql> select count(*) from twitter window tumbling (size 1 hour);
Aggregate query needs GROUP BY clause.
ksql>

we would get:

ksql> select count(*) from twitter window tumbling (size 1 hour);
Aggregate query needs GROUP BY clause. See https://github.com/confluentinc/ksql/issues/430.
ksql>

big-andy-coates (Contributor) commented
@dguy what's the issue with calculating an average? Couldn't each partition track a count and a sum, which are then passed to a final aggregation stage?

I'm assuming this would require enhancements to Kafka Streams, but I can't see any reason why it wouldn't work (except issues with numerical overflow, of course).
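
Following the two-phase COUNT sketch earlier in this thread, here is a hedged sketch of the final stage for an average. AVG_PARTIALS is assumed to be built like COUNTS_BY_USER above, but tracking SUM of an assumed numeric column alongside COUNT(*); all names are illustrative:

-- The global average is the SUM of partial sums over the SUM of partial
-- counts; averaging the partial averages would give the wrong answer
-- whenever the partials cover different numbers of rows.
CREATE TABLE GLOBAL_AVG AS
  SELECT FOO, SUM(PART_SUM) / SUM(PART_COUNT) AS AVG_VALUE
  FROM AVG_PARTIALS
  GROUP BY FOO;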

dguy (Contributor) commented Mar 1, 2018

It doesn't need anything on the Streams side; it's just another object you write to a topic that gets consumed and can be used to aggregate. It just isn't currently supported in KSQL.

Jianyi-Ren commented
Was wondering if there's been any update on this issue after half a year?

janicesn0w commented
Will there be any update on this issue? I think it will be really useful to have this soon in KSQL.

a6kme commented Mar 27, 2020

Hey @rmoff, just checking in on this. Is the workaround still the solution to the problem? If so, I can just send a dummy constant value in my original topic and use that for GROUP BY in aggregates.

rmoff (Contributor, Author) commented Mar 28, 2020

/cc @MichaelDrogalis @derekjn ☝️

MichaelDrogalis (Contributor) commented
We don't have this one queued up yet, but hope to soon. Patches always welcome in the meantime. 🙏

tomasfabian (Contributor) commented Jan 14, 2021

Hi, I wanted to create a count or sum aggregation without GROUP BY and I found this workaround.

Is it possible to receive a default initial value at the beginning of each time window?

At the moment it is not possible to process the absence of events. If there was no tweet within an hour (see @rmoff's example above), the subscriber will not be notified about the inactivity and won't be able to reset the aggregate to 0 in, say, a UI text block.

Example:
Hour 1: 22 messages, so the count equals 22.
Hour 2: 0 messages, no activity, so no aggregation is generated. COUNT should emit 0 (for integers) at the beginning of the time window.

In this case the UI would keep showing the previous 22 messages, because it is not aware of the start of the current (or end of the previous) time window. Another type of processing might need to raise an alarm in case of no activity, e.g. an IoT sensor that has probably stopped sending data.

EDIT:
Is it possible to achieve this somehow with the proposed workaround?

Thank you, Tomas
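
One pattern that can approximate this (a sketch only, not something ksqlDB provides out of the box: TWEETS_AND_TICKS and the IS_TICK flag are assumptions, with the tick rows produced by an external scheduler) is to merge periodic heartbeat rows into the stream so every window contains at least one row, then count only the real events:

-- TWEETS_AND_TICKS is an assumed stream interleaving real tweets
-- (IS_TICK = 0) with heartbeat rows from an external ticker (IS_TICK = 1).
CREATE TABLE TWEETS_PER_HOUR_WITH_ZEROES AS
  SELECT FOO,
         SUM(CASE WHEN IS_TICK = 1 THEN 0 ELSE 1 END) AS TWEET_COUNT
  FROM TWEETS_AND_TICKS
  WINDOW TUMBLING (SIZE 1 HOUR)
  GROUP BY FOO;

An hour that contains only a tick then emits a row with TWEET_COUNT = 0, which a subscriber could use to reset a UI or raise an inactivity alarm.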

mjsax added the requires-streams-change, sql-syntax, and streaming-engine (Tickets owned by the ksqlDB Streaming Team) labels on May 16, 2021
outminder commented
Hello @rmoff / @mjsax, what if I need to select a column but do not want to have it as part of the GROUP BY? Example:

SELECT ID, TYPE, SUM(PLAYBACK) AS TOTAL_PLAYBACK GROUP BY ID

I do not want to include TYPE in the GROUP BY but would like the value selected, as it will be used later to merge with a different table, or taken downstream where TYPE is needed. If I include TYPE, then it will SUM per media type, for example, whereas I want the total SUM. Any suggestions on what can be done here?
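
One option in recent ksqlDB versions is to carry the non-grouped column through the aggregation with LATEST_BY_OFFSET instead of adding it to the GROUP BY. A sketch, assuming a hypothetical source stream PLAYS with ID, TYPE, and PLAYBACK columns:

CREATE TABLE TOTAL_PLAYBACK_BY_ID AS
  SELECT ID,
         LATEST_BY_OFFSET(TYPE) AS TYPE,   -- most recent TYPE seen per ID
         SUM(PLAYBACK) AS TOTAL_PLAYBACK   -- total across all types
  FROM PLAYS
  GROUP BY ID;

Because TYPE is no longer a grouping key, the SUM stays a single total per ID, and the TYPE column just reflects the latest value observed.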
