Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot create aggregate without a GROUP BY #430

Open
rmoff opened this issue Nov 1, 2017 · 12 comments

Comments

@rmoff
Copy link
Contributor

commented Nov 1, 2017

KSQL insists on having a GROUP BY column, even if logically it makes sense without.

For example: how many tweets were there in each hourly period?

ksql> select count(*) from twitter window tumbling (size 1 hour);
Aggregate query needs GROUP BY clause.
ksql>
@rmoff

This comment has been minimized.

Copy link
Contributor Author

commented Nov 1, 2017

Workaround:

Step 1: Create a dummy column:

CREATE STREAM TWITTER2 AS SELECT 1 AS FOO,* FROM TWITTER;

Step 2: Group by dummy column:

SELECT FOO,COUNT(*) FROM TWITTER2 WINDOW TUMBLING (SIZE 1 HOUR) GROUP BY FOO;

Optionally, instantiate as a table and view window / aggregate:

CREATE TABLE TWEETS_PER_HOUR AS SELECT FOO,COUNT(*) AS TWEET_COUNT FROM TWITTER2 WINDOW TUMBLING (SIZE 1 HOUR) GROUP BY FOO;
SELECT TIMESTAMPTOSTRING(ROWTIME, 'yyyy-MM-dd HH:mm:ss.SSS') AS WINDOW_START , TWEET_COUNT FROM TWEETS_PER_HOUR;
2017-11-01 15:00:00.000 | 165
2017-11-01 16:00:00.000 | 72
...
@miguno

This comment has been minimized.

Copy link
Member

commented Feb 5, 2018

My suggestion:

  • Ideally, can we find a way to support the original (non-grouped) query out of the box in some form?
  • Alternatively, can we improve the error message and e.g. add information on the workaround that @rmoff mentioned (or whatever else we deem appropriate)? I'd doubt that the majority of users would easily and quickly be able to come up with that workaround themselves.
@rmoff

This comment has been minimized.

Copy link
Contributor Author

commented Feb 5, 2018

I'd strongly +1 adding workaround info to the error, if this can't be implemented quickly. Seasoned SQL hackers will be used to 'tricks' like this, but most others won't (and will totally expect KSQL to support this, and be disappointed if it doesn't).

@dguy

This comment has been minimized.

Copy link
Contributor

commented Feb 7, 2018

One of the issue is that we are dealing with partitioned data, so in order to make this work we'd need to take all the data from the partitions and pipe it into a single topic partition to perform the count etc operations. For some operations, i.e, sum, count it could be done in two phases, i.e., on the partitions and then again as a final aggregate on the single topic partition, however, this won't work for all operations, i.e, if you are calculating an average.
That said, you could do this on hive. It would just generate the MapReduce job with a single aggregator at the end (which is essentially the same thing we'd need to do to support it).

@miguno

This comment has been minimized.

Copy link
Member

commented Feb 7, 2018

After talking to @dguy offline, here's my understanding:

  1. There is a workaround as @rmoff showed, but users may not know about it.
  2. For the general case the desired functionality requires, essentially, the input data to be written to a single partition. This works, of course, but is not super efficient. (Good news is that some functions such as SUM and COUNT can be optimized so they are more efficient.)
  3. Both the workaround and a possible OOTB functionality suffer from the same inefficiency described in point 2. The difference between the workaround and OOTB would be that, well, OOTB has a much better UX. But there's no technical hurdle to provide this OOTB (just to mention this clearly).

Given that we are busy with other work right now, I'd suggest:

  1. Short-term: We provide a more descriptive and actionable error message, e.g. possibly telling users about the workaround.
  2. We'll work on supporting this OOTB (no workaround needed any longer) in the future, i.e. once we can move this ticket to the top of our backlog.
@miguno

This comment has been minimized.

Copy link
Member

commented Feb 7, 2018

WDYT @rmoff ?

@rmoff

This comment has been minimized.

Copy link
Contributor Author

commented Feb 8, 2018

+1 Sounds like a good approach to me (better error message immediately with workaround, longer term provide the functionality OOTB).

@rmoff

This comment has been minimized.

Copy link
Contributor Author

commented Feb 22, 2018

@miguno @hjafarpour If a verbose error message with the workaround is not an option, then I would advocate just giving a URL to this issue, so that people can find the workaround (and track progress on implementing the feature) for themselves.

So instead of:

ksql> select count(*) from twitter window tumbling (size 1 hour);
Aggregate query needs GROUP BY clause.
ksql>
ksql> select count(*) from twitter window tumbling (size 1 hour);
Aggregate query needs GROUP BY clause. See https://github.com/confluentinc/ksql/issues/430.
ksql>
@big-andy-coates

This comment has been minimized.

Copy link
Member

commented Mar 1, 2018

@dguy what's the issue with calculating average? Can not each partition track a count and a sum, which can then be passed to a final aggregate stage?

I'm assuming this would require enhancements to KStreams, but I can't see any reason why it wouldn't work. (except issues with numerical overflows of course).

@dguy

This comment has been minimized.

Copy link
Contributor

commented Mar 1, 2018

It doesn't need anything on the streams side it is just another object you write to a topic that gets consumed and can be used to aggregate. It just isn't currently supported in KSQL

@Jianyi-Ren

This comment has been minimized.

Copy link

commented Aug 1, 2018

Was wondering if there's any update on this issue after half a year?

@janicesn0w

This comment has been minimized.

Copy link

commented Apr 5, 2019

Will there be any update on this issue? I think it will be really useful to have this soon in KSQL.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants
You can’t perform that action at this time.