[RFC] Implicit statistics from secondary skipping indexes #64210

Open
CurtizJ opened this issue May 21, 2024 · 1 comment

Comments

@CurtizJ
Member

CurtizJ commented May 21, 2024

We currently have two similar concepts: secondary skipping indexes and statistics. Both represent aggregated data that can be used to optimize queries.

The differences are:

  • Secondary indexes are calculated per granule, whereas statistics are calculated per part.
  • Secondary indexes must be read from disk for each query, whereas statistics are kept in memory.
  • Secondary indexes are written into separate files, whereas statistics are written into one file (not yet, but this will be implemented).

The idea is to calculate aggregated skipping indexes for the whole part and use them as statistics, which can later be used for various optimizations:

  • Filter parts according to the per-part aggregated index/statistic at the first stage, to avoid reading the index from disk and analyzing it.
  • Push the minmax index/statistic down as a block-level hint to optimize min/max functions, or aggregation when the [min; max] range is small.
  • Use the set index/statistic to estimate the cardinality of a column. This can be used for aggregation, joins, parallel replicas, etc.
  • Use a hypothesis index to optimize predicate evaluation and filtering. For instance, given a table defined as
    (s String, INDEX ind pred(s) TYPE hypothesis), we can optimize queries with WHERE NOT pred(s) by skipping granules where pred(s) == 1. With a statistic of type hypothesis, we can additionally push down the result of the predicate when it equals 1, and so optimize queries with WHERE pred(s) as well.
  • ... This list is incomplete; there are more possible cases.
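
To make the first and fourth items concrete, here is a minimal Python sketch of how per-granule index values could be folded into one per-part statistic and then consulted before any index file is read. It is only an illustration of the idea, not ClickHouse code: the names GranuleIndex, PartStatistic, aggregate_part, part_may_match_range and plan_pred_filter are all hypothetical, and the column is assumed to be an integer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GranuleIndex:
    """Hypothetical per-granule index values (minmax plus hypothesis)."""
    min_value: int
    max_value: int
    pred_always_true: bool  # True if pred(s) == 1 for every row of the granule


@dataclass
class PartStatistic:
    """Hypothetical per-part statistic with the same shape as one index entry."""
    min_value: int
    max_value: int
    pred_always_true: bool


def aggregate_part(granules: List[GranuleIndex]) -> PartStatistic:
    """Fold all granule-level entries of a part into one part-level entry."""
    if not granules:
        raise ValueError("a part must contain at least one granule")
    return PartStatistic(
        min_value=min(g.min_value for g in granules),
        max_value=max(g.max_value for g in granules),
        pred_always_true=all(g.pred_always_true for g in granules),
    )


def part_may_match_range(stat: PartStatistic, lo: int, hi: int) -> bool:
    """Return False when no row of the part can satisfy lo <= x <= hi."""
    return stat.max_value >= lo and stat.min_value <= hi


def plan_pred_filter(stat: PartStatistic, negated: bool) -> str:
    """Decide how to handle WHERE pred(s) (or WHERE NOT pred(s)) for a part."""
    if stat.pred_always_true:
        # Every row satisfies pred(s): NOT pred(s) matches nothing,
        # and pred(s) itself no longer needs to be evaluated.
        return "skip_part" if negated else "drop_filter"
    return "evaluate"


granules = [GranuleIndex(1, 5, True), GranuleIndex(3, 9, True)]
stat = aggregate_part(granules)
```

With this part, a filter on the range [10; 20] can skip the part using only the in-memory statistic, WHERE NOT pred(s) skips it as well, and WHERE pred(s) can drop the filter entirely.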

Advanced task: we can unify the interfaces of statistics and skipping indexes, because any statistic can be represented as a skipping index with infinite granularity.
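
The "infinite granularity" view can be sketched as follows. This is a toy Python model and not the actual interface: build_minmax_index is a hypothetical helper, and granularity is counted in rows here purely for simplicity. Passing None stands for infinite granularity and yields a single entry covering the whole part, which is exactly the per-part statistic.

```python
from typing import List, Optional, Tuple


def build_minmax_index(
    values: List[int], granularity: Optional[int]
) -> List[Tuple[int, int]]:
    """Build (min, max) entries over consecutive chunks of `granularity` rows.

    granularity=None models infinite granularity: one entry for the whole part.
    """
    if granularity is not None and granularity <= 0:
        raise ValueError("granularity must be positive or None")
    if not values:
        return []
    step = len(values) if granularity is None else granularity
    chunks = (values[i:i + step] for i in range(0, len(values), step))
    return [(min(chunk), max(chunk)) for chunk in chunks]


part_values = [4, 1, 7, 3, 9, 2]
skipping_index = build_minmax_index(part_values, granularity=2)
statistic = build_minmax_index(part_values, granularity=None)
```

Both results come from the same code path; only the chunk size differs, which is the sense in which the two interfaces could be unified.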

@CurtizJ CurtizJ self-assigned this May 21, 2024
@UnamedRus
Contributor

It can also be applied to the LowCardinality data type. (Or LowCardinality can be viewed as a set index.)

#16707
