@weizhengte weizhengte commented Dec 23, 2022

Proposed changes

This PR mainly optimizes the histogram aggregation function (👉🏻 #14910). It includes the following:

  1. Support the input parameters sample_rate and max_bucket_num
  2. Add unit tests (UT) and regression tests
  3. Add documentation
  4. Optimize the function's implementation logic

Parameter description:

  • sample_rate: Optional. The proportion of the data that is sampled to generate the histogram. The default is 0.2.
  • max_bucket_num: Optional. The upper limit on the number of histogram buckets. The default is 128.
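
To make the roles of the two parameters concrete, here is a minimal Python sketch of how a sampled, bucket-limited histogram could be built. It is not the implementation in this PR: the function name `build_histogram`, the `seed` argument, and the equal-count splitting strategy are all assumptions made for illustration, and the real function may choose bucket boundaries differently (for example, this sketch can split equal values across two buckets).

```python
import random


def build_histogram(values, sample_rate=0.2, max_bucket_num=128, seed=0):
    """Hypothetical sketch: sample the input, sort it, cut it into equal-count buckets."""
    if not 0 < sample_rate <= 1:
        raise ValueError("sample_rate must be in (0, 1]")
    if max_bucket_num < 1:
        raise ValueError("max_bucket_num must be at least 1")

    # Draw a random sample whose size is proportional to sample_rate.
    rng = random.Random(seed)
    sample_size = max(1, round(len(values) * sample_rate)) if values else 0
    sample = sorted(rng.sample(list(values), sample_size))

    buckets = []
    if sample:
        # Ceiling division guarantees at most max_bucket_num buckets.
        per_bucket = -(-len(sample) // max_bucket_num)
        pre_sum = 0
        for start in range(0, len(sample), per_bucket):
            chunk = sample[start:start + per_bucket]
            buckets.append({
                "lower": str(chunk[0]),
                "upper": str(chunk[-1]),
                "count": len(chunk),
                "pre_sum": pre_sum,       # elements in all earlier buckets
                "ndv": len(set(chunk)),   # distinct values in this bucket
            })
            pre_sum += len(chunk)

    return {
        "sample_rate": sample_rate,
        "max_bucket_num": max_bucket_num,
        "bucket_num": len(buckets),
        "buckets": buckets,
    }


# Sampling everything (sample_rate=1.0) keeps this example deterministic.
hist = build_histogram([0.1, 0.2, 0.8, 0.9, 1.0, 1.0], sample_rate=1.0, max_bucket_num=3)
print(hist["bucket_num"], [b["count"] for b in hist["buckets"]])
```

With a sample rate below 1.0 the buckets describe only the sampled rows, so counts are relative to the sample rather than to the whole column.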

Example:

MySQL [test]> SELECT histogram(c_float) FROM histogram_test;
+-------------------------------------------------------------------------------------------------------------------------------------+
| histogram(`c_float`)                                                                                                                |
+-------------------------------------------------------------------------------------------------------------------------------------+
| {"sample_rate":0.2,"max_bucket_num":128,"bucket_num":3,"buckets":[{"lower":"0.1","upper":"0.1","count":1,"pre_sum":0,"ndv":1},...]} |
+-------------------------------------------------------------------------------------------------------------------------------------+

MySQL [test]> SELECT histogram(c_string, 0.5, 2) FROM histogram_test;
+-------------------------------------------------------------------------------------------------------------------------------------+
| histogram(`c_string`)                                                                                                               |
+-------------------------------------------------------------------------------------------------------------------------------------+
| {"sample_rate":0.5,"max_bucket_num":2,"bucket_num":2,"buckets":[{"lower":"str1","upper":"str7","count":4,"pre_sum":0,"ndv":3},...]} |
+-------------------------------------------------------------------------------------------------------------------------------------+

Query result description:

{
    "sample_rate": 0.2, 
    "max_bucket_num": 128, 
    "bucket_num": 3, 
    "buckets": [
        {
            "lower": "0.1", 
            "upper": "0.2", 
            "count": 2, 
            "pre_sum": 0, 
            "ndv": 2
        }, 
        {
            "lower": "0.8", 
            "upper": "0.9", 
            "count": 2, 
            "pre_sum": 2, 
            "ndv": 2
        }, 
        {
            "lower": "1.0", 
            "upper": "1.0", 
            "count": 2, 
            "pre_sum": 4, 
            "ndv": 1
        }
    ]
}

Field description:

  • sample_rate: The sampling rate
  • max_bucket_num: The maximum number of buckets allowed
  • bucket_num: The actual number of buckets
  • buckets: All buckets
    • lower: Lower bound of the bucket
    • upper: Upper bound of the bucket
    • count: The number of elements contained in the bucket
    • pre_sum: The total number of elements in all preceding buckets
    • ndv: The number of distinct values in the bucket

Total number of histogram elements = number of elements in the last bucket (count) + total number of elements in all buckets before it (the last bucket's pre_sum).
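
As a quick check of that formula, the Python sketch below parses the JSON from the query result description and computes the total. The helper names `histogram_total` and `pre_sums_consistent` are hypothetical and exist only for this illustration.

```python
import json

# The JSON result shown in the query result description above.
RESULT = """
{
    "sample_rate": 0.2,
    "max_bucket_num": 128,
    "bucket_num": 3,
    "buckets": [
        {"lower": "0.1", "upper": "0.2", "count": 2, "pre_sum": 0, "ndv": 2},
        {"lower": "0.8", "upper": "0.9", "count": 2, "pre_sum": 2, "ndv": 2},
        {"lower": "1.0", "upper": "1.0", "count": 2, "pre_sum": 4, "ndv": 1}
    ]
}
"""


def histogram_total(hist):
    """Total elements = count of the last bucket + its pre_sum."""
    buckets = hist["buckets"]
    if not buckets:
        return 0
    last = buckets[-1]
    return last["count"] + last["pre_sum"]


def pre_sums_consistent(hist):
    """Check that each pre_sum equals the running sum of earlier counts."""
    running = 0
    for bucket in hist["buckets"]:
        if bucket["pre_sum"] != running:
            return False
        running += bucket["count"]
    return True


hist = json.loads(RESULT)
print(histogram_total(hist))      # 6
print(pre_sums_consistent(hist))  # True
```

Here the last bucket has count 2 and pre_sum 4, so the histogram covers 6 sampled elements.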

Issue Number: close #xxx

Problem summary

Describe your changes.

Checklist(Required)

  1. Does it affect the original behavior:
    • Yes
    • No
    • I don't know
  2. Have unit tests been added:
    • Yes
    • No
    • No Need
  3. Has documentation been added or modified:
    • Yes
    • No
    • No Need
  4. Does it need to update dependencies:
    • Yes
    • No
  5. Are there any changes that cannot be rolled back:
    • Yes (If Yes, please explain WHY)
    • No

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

@github-actions github-actions bot added area/vectorization kind/docs Categorizes issue or PR as related to documentation. kind/test labels Dec 23, 2022

@github-actions github-actions bot left a comment


clang-tidy made some suggestions

@hello-stephen

hello-stephen commented Dec 24, 2022

TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 34.98 seconds
load time: 634 seconds
storage size: 17123234149 Bytes
https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20221225141859_clickbench_pr_68479.html

@weizhengte weizhengte force-pushed the enh-histogram branch 3 times, most recently from 5399f6b to 5c52be1 Compare December 25, 2022 07:34
@weizhengte weizhengte marked this pull request as ready for review December 25, 2022 07:38
@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Jan 6, 2023
@github-actions

github-actions bot commented Jan 6, 2023

PR approved by at least one committer and no changes requested.

@github-actions

github-actions bot commented Jan 6, 2023

PR approved by anyone and no changes requested.

@morrySnow morrySnow merged commit 76ad599 into apache:master Jan 6, 2023
morrySnow pushed a commit that referenced this pull request Jan 6, 2023
…statistics (#15490)

Histogram statistics are more expensive to collect, so we collect and persist them separately.

This PR does the following work:
1. Add histogram syntax and add keyword `TABLE`
2. Add the task of collecting histogram statistics
3. Persist histogram statistics
4. Replace fastjson with gson
5. Add unit tests...

Relevant syntax examples:
> Following other databases such as MySQL, this adds the keyword `TABLE`.

```SQL
-- collect column statistics
ANALYZE TABLE statistics_test;

-- collect histogram statistics
ANALYZE TABLE statistics_test UPDATE HISTOGRAM ON col1,col2;
```

Based on #15317
@weizhengte weizhengte deleted the enh-histogram branch March 8, 2023 16:54

Labels

approved Indicates a PR has been approved by one committer. area/vectorization kind/docs Categorizes issue or PR as related to documentation. kind/test reviewed

3 participants