[enhancement](histogram) optimise aggregate function histogram #15317

weizhengte · 2022-12-23T07:56:08Z

Proposed changes

This PR mainly optimizes the histogram aggregate function introduced in #14910. It includes the following:

Support input parameters sample_rate and max_bucket_num
Add unit tests (UT) and regression tests
Add documentation
Optimize function implementation logic

Parameter description:

sample_rate: Optional. The proportion of the data that is sampled to generate the histogram. The default is 0.2.
max_bucket_num: Optional. The maximum number of histogram buckets. The default is 128.

Example:

MySQL [test]> SELECT histogram(c_float) FROM histogram_test;
+-------------------------------------------------------------------------------------------------------------------------------------+
| histogram(`c_float`)                                                                                                                |
+-------------------------------------------------------------------------------------------------------------------------------------+
| {"sample_rate":0.2,"max_bucket_num":128,"bucket_num":3,"buckets":[{"lower":"0.1","upper":"0.1","count":1,"pre_sum":0,"ndv":1},...]} |
+-------------------------------------------------------------------------------------------------------------------------------------+

MySQL [test]> SELECT histogram(c_string, 0.5, 2) FROM histogram_test;
+-------------------------------------------------------------------------------------------------------------------------------------+
| histogram(`c_string`)                                                                                                               |
+-------------------------------------------------------------------------------------------------------------------------------------+
| {"sample_rate":0.5,"max_bucket_num":2,"bucket_num":2,"buckets":[{"lower":"str1","upper":"str7","count":4,"pre_sum":0,"ndv":3},...]} |
+-------------------------------------------------------------------------------------------------------------------------------------+

Query result description:

{
    "sample_rate": 0.2, 
    "max_bucket_num": 128, 
    "bucket_num": 3, 
    "buckets": [
        {
            "lower": "0.1", 
            "upper": "0.2", 
            "count": 2, 
            "pre_sum": 0, 
            "ndv": 2
        }, 
        {
            "lower": "0.8", 
            "upper": "0.9", 
            "count": 2, 
            "pre_sum": 2, 
            "ndv": 2
        }, 
        {
            "lower": "1.0", 
            "upper": "1.0", 
            "count": 2, 
            "pre_sum": 4, 
            "ndv": 1
        }
    ]
}

Field description:

sample_rate: The sampling rate
max_bucket_num: The maximum number of buckets allowed
bucket_num: The actual number of buckets
buckets: All buckets
- lower: Lower bound of the bucket
- upper: Upper bound of the bucket
- count: The number of elements contained in the bucket
- pre_sum: The total number of elements in all preceding buckets
- ndv: The number of distinct values in the bucket
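
To make the field definitions concrete, here is a minimal Python sketch that derives `lower`, `upper`, `count`, `pre_sum`, and `ndv` for values that have already been split into buckets. The helper name `describe_buckets` and the explicit grouping are assumptions for illustration only. The sketch does not reproduce how Doris samples data or chooses bucket boundaries, and it keeps the bounds as numbers where the query result shows strings.

```python
def describe_buckets(groups):
    """Summarise pre-partitioned, sorted groups of values (hypothetical helper)."""
    buckets = []
    pre_sum = 0  # running total of elements in all earlier buckets
    for values in groups:
        buckets.append({
            "lower": min(values),     # smallest value in the bucket
            "upper": max(values),     # largest value in the bucket
            "count": len(values),     # elements in this bucket
            "pre_sum": pre_sum,       # elements in all preceding buckets
            "ndv": len(set(values)),  # number of distinct values
        })
        pre_sum += len(values)
    return buckets


# Grouping chosen by hand to mirror the sample result in this description.
groups = [[0.1, 0.2], [0.8, 0.9], [1.0, 1.0]]
for bucket in describe_buckets(groups):
    print(bucket)
```

With this grouping the `pre_sum` values come out as 0, 2, and 4, matching the sample result.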

Total number of histogram elements = the number of elements in the last bucket (count) + the total number of elements in all buckets before it (the last bucket's pre_sum).
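
As a rough illustration of that total, the following Python sketch parses a histogram result string and adds the last bucket's `count` to its `pre_sum`. The function name `total_elements` is hypothetical, and the sample string is a compact copy of the result shown in this description. The sketch assumes the JSON layout stays exactly as documented here.

```python
import json


def total_elements(histogram_json):
    """Return count + pre_sum of the last bucket, or 0 if there are no buckets."""
    buckets = json.loads(histogram_json)["buckets"]
    if not buckets:
        return 0
    last = buckets[-1]
    return last["count"] + last["pre_sum"]


# Compact copy of the sample result from this description.
sample = (
    '{"sample_rate":0.2,"max_bucket_num":128,"bucket_num":3,"buckets":['
    '{"lower":"0.1","upper":"0.2","count":2,"pre_sum":0,"ndv":2},'
    '{"lower":"0.8","upper":"0.9","count":2,"pre_sum":2,"ndv":2},'
    '{"lower":"1.0","upper":"1.0","count":2,"pre_sum":4,"ndv":1}]}'
)

print(total_elements(sample))  # 2 + 4 = 6
```

Because `pre_sum` is cumulative, this equals the sum of `count` over all buckets, so only the last bucket needs to be read.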

Issue Number: close #xxx

Problem summary

Describe your changes.

Checklist(Required)

Does it affect the original behavior:
- Yes
- No
- I don't know
Has unit tests been added:
- Yes
- No
- No Need
Has document been added or modified:
- Yes
- No
- No Need
Does it need to update dependencies:
- Yes
- No
Are there any changes that cannot be rolled back:
- Yes (If Yes, please explain WHY)
- No

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

github-actions

clang-tidy made some suggestions

be/test/vec/aggregate_functions/agg_histogram_test.cpp

hello-stephen · 2022-12-24T08:36:20Z

TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 34.98 seconds
load time: 634 seconds
storage size: 17123234149 Bytes
https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20221225141859_clickbench_pr_68479.html

github-actions · 2023-01-06T16:45:44Z

PR approved by at least one committer and no changes requested.

github-actions · 2023-01-06T16:45:46Z

PR approved by anyone and no changes requested.

…statistics (#15490)

Histogram statistics are more expensive to collect, so we collect and persist them separately. This PR does the following work:

1. Add histogram syntax and add keyword `TABLE`
2. Add the task of collecting histogram statistics
3. Persist histogram statistics
4. Replace fastjson with gson
5. Add unit tests...

Relevant syntax examples:

> Refer to some databases such as mysql and add the keyword `TABLE`.

```SQL
-- collect column statistics
ANALYZE TABLE statistics_test;
-- collect histogram statistics
ANALYZE TABLE statistics_test UPDATE HISTOGRAM ON col1,col2;
```

Based on #15317

github-actions bot added the area/vectorization, kind/docs, and kind/test labels Dec 23, 2022

github-actions bot reviewed Dec 23, 2022

be/test/vec/aggregate_functions/agg_histogram_test.cpp

weizhengte mentioned this pull request Dec 23, 2022

[feature-wip](Nereids)(histogram) add aggregate function histogram and collect histogram statistics #14910

Merged

13 tasks

weizhengte marked this pull request as draft December 23, 2022 09:43

weizhengte force-pushed the enh-histogram branch 3 times, most recently from 5399f6b to 5c52be1 December 25, 2022 07:34

weizhengte marked this pull request as ready for review December 25, 2022 07:38

weizhengte force-pushed the enh-histogram branch from 2ae0f44 to 56d3a82 December 25, 2022 07:43

code optimization

31992e5

weizhengte force-pushed the enh-histogram branch from be2d861 to 31992e5 December 25, 2022 12:26

weizhengte added 2 commits December 25, 2022 20:57

bug fix

397a08e

p0 fix

758c19f

weizhengte mentioned this pull request Dec 30, 2022

[enhancement](histogram) add histogram syntax and perstist histogram statistics #15490

Merged

13 tasks

morrySnow approved these changes Jan 6, 2023

github-actions bot added the approved label Jan 6, 2023

github-actions bot added the reviewed label Jan 6, 2023

morrySnow merged commit 76ad599 into apache:master Jan 6, 2023

weizhengte deleted the enh-histogram branch March 8, 2023 16:54


3 participants
