Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SIP-135] Proposal for aggregate dimensions #28610

Open
michaelritsema opened this issue May 20, 2024 · 4 comments
Open

[SIP-135] Proposal for aggregate dimensions #28610

michaelritsema opened this issue May 20, 2024 · 4 comments
Labels
design:proposal Design proposals need:more-info Requires more information from author sip Superset Improvement Proposal

Comments

@michaelritsema
Copy link

michaelritsema commented May 20, 2024

Please make sure you are familiar with the SIP process documented
here. The SIP will be numbered by a committer upon acceptance.

[SIP-135] Proposal for aggregate dimensions

Motivation

Adding columns to a report that are functionality dependent on an already added dimension should come at a low cost.

For example:

'GROUP BY person_id'

If we then want to add columns to the report "first_name,last_name,age" we'd have the choice of adding it to the dimension or as a metric.

Simply adding it to the dimension incurs penalty of putting in the GROUP BY. This is an expensive operation as well as a potential error (what if one of the rows actually had the wrong age and we ended up creating a new person)

Simply adding it as a metric is a bit awkward but is the best fit for my use case now. I have dozens of these columns and perfer to just wrap them in an ANY aggregate function.

Proposed Change

I propose we add a toggle to set a dimension as an "aggregate dimension" Instead of adding this column to the group by the user can just pick the aggregate it would like (min,max,any, first, etc). By default we would not have to duplicate this as a new metric column but could keep the same name. I think this would also work with the newer Drill features in a better way than the metrics.

Rejected Alternatives

Currently adding these strings as a metric in an ANY aggregate works best for me. This awkward and confusing to the end user to see dimensional type data as well as a lot of duplicated column naming "any(first_name) as first_name"

@michaelritsema michaelritsema added the sip Superset Improvement Proposal label May 20, 2024
@dosubot dosubot bot added the design:proposal Design proposals label May 20, 2024
@rusackas rusackas changed the title [SIP] Proposal for aggregate dimensions [SIP-135] Proposal for aggregate dimensions Jun 4, 2024
@rusackas
Copy link
Member

rusackas commented Jun 4, 2024

I'm not quite understanding the proposal, to be honest. Can you provide some examples (both good and bad) of the use cases and the queries they'd generate? In most charts you can already "GROUP BY user_id" using the Dimension field on any chart, and still aggregate the metric... you say such an operation is expensive, but I'm not grasping the alternative design.

@michaelritsema
Copy link
Author

If you had a denormalized table with a persons information called PersonSightings

with a row like:

person_id:1,f_name:evan,l_name:r, address:"343232 W. Palm St Jacksonville,Flordia",:sighting_timestamp:2024-01-01

imagine there was a new row for everytime a camera spotted the person

You would want to take advantage of the fact that most of the columsn are functionality depenendt on person_id

Currently to get a report on distinct sightings you'd need to add most of the columns as a dimension which generates:

SELECT person_id,f_name,l_name, address,sighting_timestamp,count(distinct sighting_timestamp)
from PersonSightings
GROUP BY person_id,f_name,l_name, address,sighting_timestamp

The main problem is the GROUP BY is very inefficient. I'd want SQL Like

SELECT person_id,any(f_name),any(l_name), any(address),any(sighting_timestamp),count(distinct sighting_timestamp)
from PersonSightings
GROUP BY person_id

To do this I need to turn those dimensions into metrics and add custom sql. But they should be treated as dimensions not metrics.

So I proposed a way to put an aggregate around a dimension so superset can apply ANY or FIRST (whatever the database supports)

In my use case I actually have a few dozen of these and using clickhouse which can be exponentially faster with smaller GROUP BYs

@rusackas
Copy link
Member

@yousoph @geido - please check this out and let us know your thoughts :)

@rusackas
Copy link
Member

rusackas commented Jul 2, 2024

Can you propose a bit more about how you would implement these aggregate functions (e.g. ANY or FIRST), maybe some example queries, and how this would be surfaced in the Dataset modal, the chart builder, the DrillBy/DrillToDetail modals, etc?

@rusackas rusackas added the need:more-info Requires more information from author label Jul 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
design:proposal Design proposals need:more-info Requires more information from author sip Superset Improvement Proposal
Projects
Development

No branches or pull requests

2 participants