-
Notifications
You must be signed in to change notification settings - Fork 13.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SIP-135] Proposal for aggregate dimensions #28610
Comments
I'm not quite understanding the proposal, to be honest. Can you provide some examples (both good and bad) of the use cases and the queries they'd generate? In most charts you can already "GROUP BY user_id" using the Dimension field on any chart, and still aggregate the metric... you say such an operation is expensive, but I'm not grasping the alternative design. |
If you had a denormalized table with a persons information called PersonSightings with a row like: person_id:1,f_name:evan,l_name:r, address:"343232 W. Palm St Jacksonville,Flordia",:sighting_timestamp:2024-01-01 imagine there was a new row for everytime a camera spotted the person You would want to take advantage of the fact that most of the columsn are functionality depenendt on person_id Currently to get a report on distinct sightings you'd need to add most of the columns as a dimension which generates: SELECT person_id,f_name,l_name, address,sighting_timestamp,count(distinct sighting_timestamp) The main problem is the GROUP BY is very inefficient. I'd want SQL Like SELECT person_id,any(f_name),any(l_name), any(address),any(sighting_timestamp),count(distinct sighting_timestamp) To do this I need to turn those dimensions into metrics and add custom sql. But they should be treated as dimensions not metrics. So I proposed a way to put an aggregate around a dimension so superset can apply ANY or FIRST (whatever the database supports) In my use case I actually have a few dozen of these and using clickhouse which can be exponentially faster with smaller GROUP BYs |
Can you propose a bit more about how you would implement these aggregate functions (e.g. ANY or FIRST), maybe some example queries, and how this would be surfaced in the Dataset modal, the chart builder, the DrillBy/DrillToDetail modals, etc? |
Please make sure you are familiar with the SIP process documented
here. The SIP will be numbered by a committer upon acceptance.
[SIP-135] Proposal for aggregate dimensions
Motivation
Adding columns to a report that are functionality dependent on an already added dimension should come at a low cost.
For example:
'GROUP BY person_id'
If we then want to add columns to the report "first_name,last_name,age" we'd have the choice of adding it to the dimension or as a metric.
Simply adding it to the dimension incurs penalty of putting in the GROUP BY. This is an expensive operation as well as a potential error (what if one of the rows actually had the wrong age and we ended up creating a new person)
Simply adding it as a metric is a bit awkward but is the best fit for my use case now. I have dozens of these columns and perfer to just wrap them in an ANY aggregate function.
Proposed Change
I propose we add a toggle to set a dimension as an "aggregate dimension" Instead of adding this column to the group by the user can just pick the aggregate it would like (min,max,any, first, etc). By default we would not have to duplicate this as a new metric column but could keep the same name. I think this would also work with the newer Drill features in a better way than the metrics.
Rejected Alternatives
Currently adding these strings as a metric in an ANY aggregate works best for me. This awkward and confusing to the end user to see dimensional type data as well as a lot of duplicated column naming "any(first_name) as first_name"
The text was updated successfully, but these errors were encountered: