feat(query): agg-hashtable-singleton #14524

Merged 17 commits on Feb 1, 2024

Conversation

Freejww
Collaborator

@Freejww Freejww commented Jan 30, 2024

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

Introduce the aggregate hashtable in the query pipeline and support a partitioned hashtable.

With this PR, ClickBench shows improved performance for most queries; the remaining queries perform about the same as main. We also ran ClickBench on DuckDB in the same environment, and the performance was similar.

Enable it with: set enable_experimental_aggregate_hashtable = 1;

Note: this PR currently only works in singleton deployments.

Pros:
better performance for high-cardinality group aggregation, up to 2x
better performance for GROUP BY on string columns
cleaner code than the old implementation

Cons:
It may consume more memory than the old implementation because it uses a two-part struct layout
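
To make the memory trade-off concrete, here is a toy sketch of one possible reading of a "two-part" layout: a probe array of slots that only holds indexes, plus a separate payload area that holds the group key, its cached hash, and the aggregate states. This is an assumption for illustration only, not the data structure in this PR; the class name `TwoPartAggTable` and all of its methods are hypothetical.

```python
class TwoPartAggTable:
    """Toy GROUP BY table computing COUNT and SUM (illustrative, hypothetical)."""

    def __init__(self, capacity=8):
        # Part 1: probe array. Each slot is an index into `payload`, or -1 if empty.
        self.capacity = capacity  # must stay a power of two
        self.slots = [-1] * capacity
        # Part 2: payload rows of [key, cached_hash, count, total].
        self.payload = []

    def _probe(self, key, h):
        """Linear probing: return the slot holding `key`, or the first empty slot."""
        i = h & (self.capacity - 1)
        while self.slots[i] != -1 and self.payload[self.slots[i]][0] != key:
            i = (i + 1) & (self.capacity - 1)
        return i

    def _grow(self):
        """Double the probe array only; payload rows stay where they are."""
        self.capacity *= 2
        self.slots = [-1] * self.capacity
        for idx, row in enumerate(self.payload):
            i = row[1] & (self.capacity - 1)  # reuse the cached hash
            while self.slots[i] != -1:
                i = (i + 1) & (self.capacity - 1)
            self.slots[i] = idx

    def add(self, key, value):
        # Keep the load factor at or below 0.5 so probing always finds an empty slot.
        if (len(self.payload) + 1) * 2 > self.capacity:
            self._grow()
        h = hash(key)
        i = self._probe(key, h)
        if self.slots[i] == -1:
            self.slots[i] = len(self.payload)
            self.payload.append([key, h, 0, 0])
        row = self.payload[self.slots[i]]
        row[2] += 1      # COUNT(*)
        row[3] += value  # SUM(value)

    def results(self):
        return {key: (count, total) for key, _h, count, total in self.payload}


if __name__ == "__main__":
    table = TwoPartAggTable()
    for key, value in [("a", 1), ("b", 2), ("a", 3)]:
        table.add(key, value)
    print(table.results())
```

In this sketch, resizing touches only the small probe array, which is cheap, but every group pays for a slot entry, a cached hash, and a payload row. That extra bookkeeping is the kind of overhead the "Cons" item describes.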

Continues #13548.

  • Fixes #[Link the issue here]

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):


@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Jan 30, 2024
@BohuTANG
Member

Better to give benchmark results rather than a summary :)

@Freejww
Collaborator Author

Freejww commented Jan 31, 2024

Performance tests:
[image: benchmark results screenshot]

Benchmark script:

cat > queries.sql << EOF
SELECT SearchEngineID, SearchPhrase, COUNT(*) AS c FROM hits WHERE SearchPhrase <> '' GROUP BY SearchEngineID, SearchPhrase ORDER BY c DESC LIMIT 10;
SELECT count() from (select userid from hits group by userid);
SELECT UserID, SearchPhrase, COUNT(*) FROM hits GROUP BY UserID, SearchPhrase ORDER BY COUNT(*) DESC LIMIT 10;
SELECT UserID, extract(minute FROM EventTime) AS m, SearchPhrase, COUNT(*) FROM hits GROUP BY UserID, m, SearchPhrase ORDER BY COUNT(*) DESC LIMIT 10;
SELECT SearchEngineID, ClientIP, COUNT(*) AS c, SUM(IsRefresh), AVG(ResolutionWidth) FROM hits WHERE SearchPhrase <> '' GROUP BY SearchEngineID, ClientIP ORDER BY c DESC LIMIT 10;
SELECT WatchID, ClientIP, COUNT(*) AS c, SUM(IsRefresh) , AVG(ResolutionWidth) d FROM hits WHERE SearchPhrase <> '' GROUP BY WatchID, ClientIP ORDER BY d desc, WatchID desc LIMIT 10;
SELECT URL, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= '2013-07-01' AND EventDate <= '2013-07-31' AND DontCountHits = 0 AND IsRefresh = 0 AND URL <> '' GROUP BY URL ORDER BY PageViews DESC LIMIT 10;
SELECT TraficSourceID, SearchEngineID, AdvEngineID, CASE WHEN (SearchEngineID = 0 AND AdvEngineID = 0) THEN Referer ELSE '' END AS Src, URL AS Dst, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= '2013-07-01' AND EventDate <= '2013-07-31' AND IsRefresh = 0 GROUP BY TraficSourceID, SearchEngineID, AdvEngineID, Src, Dst ORDER BY PageViews DESC LIMIT 10 OFFSET 1000;
SELECT URL, COUNT(*) AS c FROM hits GROUP BY URL ORDER BY c DESC LIMIT 10;
SELECT 1, URL, COUNT(*) AS c FROM hits GROUP BY 1, URL ORDER BY c DESC LIMIT 10;
SELECT ClientIP, ClientIP - 1, ClientIP - 2, ClientIP - 3, COUNT(*) AS c FROM hits GROUP BY ClientIP, ClientIP - 1, ClientIP - 2, ClientIP - 3 ORDER BY c DESC LIMIT 10;
EOF
# Baseline: old aggregation path (setting = 0).
while IFS= read -r line; do
        # Cold run: timing captured in $res and discarded.
        res=$(echo "$line" | bendsql --set enable_experimental_aggregate_hashtable=0 --time --output null)
        # Hot run: this timing is printed.
        echo "$line" | bendsql --set enable_experimental_aggregate_hashtable=0 --time --output null
done < queries.sql
# New aggregate hashtable (setting = 1), same cold/hot pattern.
while IFS= read -r line; do
        res=$(echo "$line" | bendsql --set enable_experimental_aggregate_hashtable=1 --time --output null)
        echo "$line" | bendsql --set enable_experimental_aggregate_hashtable=1 --time --output null
done < queries.sql

@sundy-li sundy-li added the ci-cloud Build docker image for cloud test label Feb 1, 2024
Contributor

github-actions bot commented Feb 1, 2024

Docker Image for PR

  • tag: pr-14524-0973f6e

note: this image tag is only available for internal use,
please check the internal doc for more details.

@sundy-li
Member

sundy-li commented Feb 1, 2024

## git clone git@github.com:BohuTANG/wizard.git
export BENDSQL_DSN="xxx"
create warehouse 'bh-agg' warehouse_size='xsmall' with version='pr-14524-0973f6e' cache_size=0;

python3 ./benchsb.py --database tpch_sf100 --runbend --nosuspend
echo "set global enable_experimental_aggregate_hashtable = 1;" | bendsql 
python3 ./benchsb.py --database tpch_sf100 --runbend --nosuspend

xsmall:
main: 261s
PR: 230s

main PR
18.155 16.788
4.165 3.124
10.539 9.524
5.245 4.831
11.002 10.512
8.311 8.424
24.316 11.465
12.889 12.807
20.405 20.207
17.268 11.515
2.365 2.484
9.581 7.778
8.723 9.023
8.344 8.422
8.832 8.723
1.576 1.562
26.506 25.948
13.292 13.382
17.349 14.301
11.376 11.113
18.953 16.886
2.071 2.038

small:
main: 159.216s
PR: 144.014s

main PR
9.837 9.471
3.488 2.779
6.358 5.976
4.247 3.006
6.487 6.123
4.477 4.600
6.844 6.668
8.332 7.746
12.520 11.846
12.244 10.268
1.539 2.250
5.980 5.531
5.774 5.155
5.128 5.048
5.394 5.131
1.147 1.019
16.136 14.884
11.834 7.332
10.734 9.798
7.261 6.766
11.933 11.286
1.522 1.331

The totals listed above each table are the sums of the corresponding column, in seconds.
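
The totals for the small warehouse can be cross-checked, and the per-row speedups computed, with a short script. This is a sketch: the timings are copied from the small table above, the rows are assumed to be in benchmark order, and the helper name `summarize` is made up for illustration.

```python
# (main_seconds, pr_seconds) pairs copied from the "small" table, in row order.
SMALL = [
    (9.837, 9.471), (3.488, 2.779), (6.358, 5.976), (4.247, 3.006),
    (6.487, 6.123), (4.477, 4.600), (6.844, 6.668), (8.332, 7.746),
    (12.520, 11.846), (12.244, 10.268), (1.539, 2.250), (5.980, 5.531),
    (5.774, 5.155), (5.128, 5.048), (5.394, 5.131), (1.147, 1.019),
    (16.136, 14.884), (11.834, 7.332), (10.734, 9.798), (7.261, 6.766),
    (11.933, 11.286), (1.522, 1.331),
]


def summarize(rows):
    """Return column totals and the rows with the largest and smallest speedup."""
    speedups = [main / pr for main, pr in rows]
    best = max(range(len(rows)), key=lambda i: speedups[i])
    worst = min(range(len(rows)), key=lambda i: speedups[i])
    return {
        "main_total": sum(main for main, _ in rows),
        "pr_total": sum(pr for _, pr in rows),
        "best_row": best + 1,    # 1-based row number in the table
        "best_speedup": speedups[best],
        "worst_row": worst + 1,  # a speedup below 1.0 means the PR was slower
        "worst_speedup": speedups[worst],
    }


if __name__ == "__main__":
    s = summarize(SMALL)
    print(f"main total: {s['main_total']:.3f}s, PR total: {s['pr_total']:.3f}s")
    print(f"largest speedup:  row {s['best_row']} ({s['best_speedup']:.2f}x)")
    print(f"smallest speedup: row {s['worst_row']} ({s['worst_speedup']:.2f}x)")
```

The column sums match the reported 159.216s and 144.014s, and the per-row ratios show that the overall gain comes from a few rows with large improvements, while most rows change only slightly and a couple regress.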

@sundy-li sundy-li added this pull request to the merge queue Feb 1, 2024
Merged via the queue into datafuselabs:main with commit 2d81588 Feb 1, 2024
73 checks passed