Skip to content

Support multi-valued dimensions in the StarTree index #9082

@lksvenoy-r7

Description

@lksvenoy-r7

Currently, the star-tree index can not be used with multi-valued dimensional columns. The problem with supporting multi-valued columns in the split order of the star-tree index is that the behaviour for splitting on multiple values would be require handling all the permutations of the values present in the list, which is not practical.

This ticket is a suggestion to implement support for multi-valued dimensional columns in the split order for the star-tree index, with a simply constraint of only allowing one specific value to be used when splitting for the index.

Suppose we have a table which has the following dimensions:

entityId: String
groupIds: List of strings
status: String

And a query looking like this
SELECT COUNT(*) FROM table WHERE entityId = 'ID' AND groupIds CONTAINS 'group_1' AND status = 'ONLINE'

If there are many rows here, this could be a very intense query to resolve. If the star tree index could pre-aggregate on each value in groupId, we could make this lookup much more performant. This is a problem that we are facing today. A group can contain multiple entities, and an entity can be in multiple groups. This many to many relationship means there is a dynamic constraint on our queries.

Some queries will ignore group (that would be the star node), while some need to access the aggregation on a group basis.

The problem with implementing support for this, is that it wouldn't work for a query defined as this:
SELECT COUNT(*) FROM table WHERE entityId = 'ID' AND groupIds IN ('group_1', 'group_2') AND status = 'ONLINE'

This behaviour would be undefined, as there is no way to create one aggregation per groupId if we also need to handle all the permutations of possible values in the multi-valued column. This is why I propose that we support multi-valued columns for the star-tree index with the constraint that each value would have an aggregation associated with it, but not allowing it to span a selection of multiple values.

For example..

With the following star-tree index:

"starTreeIndexConfigs": [
        {
          "dimensionsSplitOrder": [
            "entityId",
            "groupIds",
            "status"
          ],
          "skipStarNodeCreationForDimensions": [],
          "functionColumnPairs": [
            "COUNT__*"
          ],
          "maxLeafRecords": 10000
        }
      ],

This would be possible

entity_1 -> count(*) = 500
   entity_1 -> group_1 -> count(*) = (200)
       entity_1 -> group_1 -> ONLINE -> count(*) = 100
       entity_1 -> group_1 -> OFFLINE -> count(*) = 100
   entity_1 -> group_2 -> count(*) = (300)
       entity_1 -> group_2 -> ONLINE -> count(*) = 150
       entity_1 -> group_2 -> OFFLINE -> count(*) = 150

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions