[SIP-135] Proposal for aggregate dimensions #28610

michaelritsema · 2024-05-20T17:50:21Z

Please make sure you are familiar with the SIP process documented
here. The SIP will be numbered by a committer upon acceptance.

[SIP-135] Proposal for aggregate dimensions

Motivation

Adding columns to a report that are functionality dependent on an already added dimension should come at a low cost.

For example:

'GROUP BY person_id'

If we then want to add columns to the report "first_name,last_name,age" we'd have the choice of adding it to the dimension or as a metric.

Simply adding it to the dimension incurs penalty of putting in the GROUP BY. This is an expensive operation as well as a potential error (what if one of the rows actually had the wrong age and we ended up creating a new person)

Simply adding it as a metric is a bit awkward but is the best fit for my use case now. I have dozens of these columns and perfer to just wrap them in an ANY aggregate function.

Proposed Change

I propose we add a toggle to set a dimension as an "aggregate dimension" Instead of adding this column to the group by the user can just pick the aggregate it would like (min,max,any, first, etc). By default we would not have to duplicate this as a new metric column but could keep the same name. I think this would also work with the newer Drill features in a better way than the metrics.

Rejected Alternatives

Currently adding these strings as a metric in an ANY aggregate works best for me. This awkward and confusing to the end user to see dimensional type data as well as a lot of duplicated column naming "any(first_name) as first_name"

rusackas · 2024-06-04T15:24:36Z

I'm not quite understanding the proposal, to be honest. Can you provide some examples (both good and bad) of the use cases and the queries they'd generate? In most charts you can already "GROUP BY user_id" using the Dimension field on any chart, and still aggregate the metric... you say such an operation is expensive, but I'm not grasping the alternative design.

michaelritsema · 2024-06-04T15:38:02Z

If you had a denormalized table with a persons information called PersonSightings

with a row like:

person_id:1,f_name:evan,l_name:r, address:"343232 W. Palm St Jacksonville,Flordia",:sighting_timestamp:2024-01-01

imagine there was a new row for everytime a camera spotted the person

You would want to take advantage of the fact that most of the columsn are functionality depenendt on person_id

Currently to get a report on distinct sightings you'd need to add most of the columns as a dimension which generates:

SELECT person_id,f_name,l_name, address,sighting_timestamp,count(distinct sighting_timestamp)
from PersonSightings
GROUP BY person_id,f_name,l_name, address,sighting_timestamp

The main problem is the GROUP BY is very inefficient. I'd want SQL Like

SELECT person_id,any(f_name),any(l_name), any(address),any(sighting_timestamp),count(distinct sighting_timestamp)
from PersonSightings
GROUP BY person_id

To do this I need to turn those dimensions into metrics and add custom sql. But they should be treated as dimensions not metrics.

So I proposed a way to put an aggregate around a dimension so superset can apply ANY or FIRST (whatever the database supports)

In my use case I actually have a few dozen of these and using clickhouse which can be exponentially faster with smaller GROUP BYs

rusackas · 2024-06-18T15:26:49Z

@yousoph @geido - please check this out and let us know your thoughts :)

rusackas · 2024-07-02T15:57:08Z

Can you propose a bit more about how you would implement these aggregate functions (e.g. ANY or FIRST), maybe some example queries, and how this would be surfaced in the Dataset modal, the chart builder, the DrillBy/DrillToDetail modals, etc?

michaelritsema added the sip Superset Improvement Proposal label May 20, 2024

dosubot bot added the design:proposal Design proposals label May 20, 2024

rusackas changed the title ~~[SIP] Proposal for aggregate dimensions~~ [SIP-135] Proposal for aggregate dimensions Jun 4, 2024

rusackas added the need:more-info Requires more information from author label Jul 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SIP-135] Proposal for aggregate dimensions #28610

[SIP-135] Proposal for aggregate dimensions #28610

michaelritsema commented May 20, 2024 •

edited by rusackas

Loading

rusackas commented Jun 4, 2024 •

edited

Loading

michaelritsema commented Jun 4, 2024

rusackas commented Jun 18, 2024

rusackas commented Jul 2, 2024

[SIP-135] Proposal for aggregate dimensions #28610

[SIP-135] Proposal for aggregate dimensions #28610

Comments

michaelritsema commented May 20, 2024 • edited by rusackas Loading