
[commons] arrow -> arrow2 #1239

Merged — 9 commits merged into datafuselabs:master on Aug 2, 2021
Conversation

@sundy-li (Member)

I hereby agree to the terms of the CLA available at: https://datafuse.rs/policies/cla/

Summary

Migrate the commons crates from arrow to arrow2.

Changelog

  • Improvement

Related Issues

Fixes #1170

Test Plan

Unit Tests

Stateless Tests

@databend-bot (Member)

Thanks for the contribution!
I have applied any labels matching special text in your PR Changelog.

Please review the labels and make any necessary changes.

@BohuTANG BohuTANG mentioned this pull request Jul 31, 2021
2 tasks
@sundy-li (Member, Author)

sundy-li commented Aug 2, 2021

Baseline benchmarks comparing arrow1 vs arrow2 show a large performance improvement.

https://github.com/sundy-li/learn/tree/master/arrow-vs-arrow2/benches

cargo bench -- "arrow2-min 2\^13 f32"
arrow2-min 2^13 f32     time:   [1.5618 us 1.5680 us 1.5752 us]

cargo bench -- "arrow1-min 2\^13 f32"
arrow1-min 2^13 f32     time:   [3.1624 us 3.1732 us 3.1855 us]

After migrating datafuse to arrow2, end-to-end performance shows little difference. Maybe the bottleneck is still upstream, such as in numbers_stream.

(attached image: perf-results)

Arrow2 makes the code much clearer and simpler, so it is still worth adopting.

@sundy-li sundy-li marked this pull request as ready for review August 2, 2021 03:46
@codecov-commenter

Codecov Report

Merging #1239 (8dbd277) into master (38c1c70) will increase coverage from 70% to 71%.
The diff coverage is 59%.

Impacted file tree graph

@@           Coverage Diff            @@
##           master   #1239     +/-   ##
========================================
  Coverage      70%     71%             
========================================
  Files         476     472      -4     
  Lines       27683   29018   +1335     
========================================
+ Hits        19629   20625    +996     
- Misses       8054    8393    +339     
Impacted Files Coverage Δ
common/datablocks/src/kernels/data_block_slice.rs 100% <ø> (ø)
common/datavalues/src/arrays/ops/vec_hash.rs 54% <0%> (ø)
common/datavalues/src/series/comparison.rs 96% <ø> (+6%) ⬆️
common/datavalues/src/series/date_wrap.rs 0% <ø> (ø)
common/datavalues/src/series/series_impl.rs 34% <ø> (+3%) ⬆️
...usequery/query/src/api/rpc/flight_client_stream.rs 0% <0%> (ø)
...sequery/query/src/api/rpc/flight_service_stream.rs 23% <ø> (ø)
fusequery/query/src/sql/sql_common.rs 74% <ø> (+4%) ⬆️
fusestore/store/src/executor/action_handler.rs 61% <0%> (+1%) ⬆️
common/datavalues/src/arrays/ops/fill.rs 63% <7%> (-19%) ⬇️
... and 99 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 38c1c70...8dbd277. Read the comment docs.

@BohuTANG (Member)

BohuTANG commented Aug 2, 2021

Performance on my AMD Ryzen 7 PRO 4750U (arrow2 is still faster in most cases, except for the last two):

| Query | arrow1 (master) | arrow2 |
| --- | --- | --- |
| SELECT avg(number) FROM numbers_mt(100000000000) | 4.35 s. (22.97 billion rows/s., 183.91 GB/s.) | 4.17 s. (23.98 billion rows/s., 191.84 GB/s.) |
| SELECT sum(number) FROM numbers_mt(100000000000) | 4.20 s. (23.79 billion rows/s., 190.50 GB/s.) | 3.72 s. (26.82 billion rows/s., 214.59 GB/s.) |
| SELECT min(number) FROM numbers_mt(100000000000) | 4.92 s. (20.31 billion rows/s., 162.64 GB/s.) | 4.3 s. (22.90 billion rows/s., 183.17 GB/s.) |
| SELECT max(number) FROM numbers_mt(100000000000) | 4.77 s. (20.95 billion rows/s., 167.78 GB/s.) | 4.3 s. (22.90 billion rows/s., 183.17 GB/s.) |
| SELECT count(number) FROM numbers_mt(100000000000) | 2.91 s. (34.33 billion rows/s., 274.90 GB/s.) | 2.31 s. (43.21 billion rows/s., 345.66 GB/s.) |
| SELECT sum(number+number+number) FROM numbers_mt(100000000000) | 19.83 s. (5.04 billion rows/s., 40.37 GB/s.) | 16.16 s. (6.19 billion rows/s., 49.50 GB/s.) |
| SELECT sum(number) / count(number) FROM numbers_mt(100000000000) | 3.90 s. (25.62 billion rows/s., 205.13 GB/s.) | 3.91 s. (25.57 billion rows/s., 204.57 GB/s.) |
| SELECT sum(number) / count(number), max(number), min(number) FROM numbers_mt(100000000000) | 8.28 s. (12.07 billion rows/s., 96.66 GB/s.) | 8.21 s. (12.17 billion rows/s., 97.34 GB/s.) |
| SELECT number FROM numbers_mt(10000000000) ORDER BY number DESC LIMIT 100 | 4.80 s. (2.08 billion rows/s., 16.67 GB/s.) | 5.37 s. (1.86 billion rows/s., 14.88 GB/s.) |
| SELECT max(number), sum(number) FROM numbers_mt(1000000000) GROUP BY number % 3, number % 4, number % 5 | 6.31 s. (158.49 million rows/s., 1.27 GB/s.) | 7.0 s. (141.51 million rows/s., 1.13 GB/s.) |

cc @sundy-li @zhang2014 @jorgecarleitao

@sundy-li (Member, Author)

sundy-li commented Aug 2, 2021

OK, but the baseline sort performance is poor; I'm trying to find the bottleneck in arrow2.

jorgecarleitao/arrow2#245

It's strange that the GROUP BY query is slower while the aggregations are faster; this PR did not change the GROUP BY behavior.

@BohuTANG (Member)

BohuTANG commented Aug 2, 2021

> Ok, but the baseline of sort performance is poor, I'm trying to find the bottleneck in arrow2.
>
> jorgecarleitao/arrow2#245

I have seen that. If it is solved, arrow2 will outperform arrow1 across the board.

@BohuTANG (Member)

BohuTANG commented Aug 2, 2021

For the GROUP BY, maybe the first test was just faster at that time; I think we can measure it again once the sort issue is solved.

@BohuTANG (Member) left a comment

LGTM

@databend-bot (Member)

CI Passed
Reviewer Approved
Let's Merge

@databend-bot databend-bot merged commit 9f5f046 into datafuselabs:master Aug 2, 2021
@jorgecarleitao

I believe the GROUP BY perf can be addressed with the arrow2::compute::hash kernel: hash the arrays one by one and merge the hashes.
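The per-column hash-then-merge idea above can be sketched in plain Rust. This is a minimal illustration, not the arrow2::compute::hash kernel itself: it uses std's DefaultHasher instead of arrow2's vectorized kernel, and `hash_column` / `merge_hashes` are hypothetical helper names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash one column of values into one u64 per row.
fn hash_column<T: Hash>(col: &[T]) -> Vec<u64> {
    col.iter()
        .map(|v| {
            let mut h = DefaultHasher::new();
            v.hash(&mut h);
            h.finish()
        })
        .collect()
}

/// Merge per-column hashes row-wise into a single key per row,
/// using a simple multiply-xor combine.
fn merge_hashes(a: &[u64], b: &[u64]) -> Vec<u64> {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| x.wrapping_mul(0x9E37_79B9_7F4A_7C15) ^ y)
        .collect()
}

fn main() {
    // Two group-by columns over four rows.
    let c1 = vec![1u32, 2, 1, 2];
    let c2 = vec!["a", "a", "b", "a"];

    let keys = merge_hashes(&hash_column(&c1), &hash_column(&c2));

    // Rows 1 and 3 are both the (2, "a") group, so their keys match.
    assert_eq!(keys[1], keys[3]);
    println!("{:?}", keys);
}
```

The point of hashing column-by-column is that each pass stays cache-friendly and vectorizable, instead of hashing row-by-row across heterogeneous types.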

@sundy-li (Member, Author)

sundy-li commented Aug 4, 2021

> arrow2::compute::hash

The current design in datafuse is a compact, collision-free combination key:

  1. Group by String: the combination key is the binary value, as Vec<u8>.
  2. Group by UInt32: the combination key is the UInt32 itself.
  3. Group by (UInt8, UInt8, UInt8): the combination key is a UInt32, with the remaining 8 bits zeroed.
  4. Group by UInt8: we can use a small lookup table to tune the performance.

If the key size is fixed, we can build all keys in one large allocation of size key_size * rows.
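Case 3 above can be sketched as follows. This is a simplified illustration of bit-packing three UInt8 columns into a u32 key per row with a single up-front allocation, not the actual datafuse implementation; `pack_keys` is a hypothetical helper name.

```rust
/// Pack three UInt8 group-by columns into one u32 key per row,
/// leaving the high byte zeroed. All keys are built in a single
/// Vec allocated up front (one allocation of key_size * rows).
fn pack_keys(c0: &[u8], c1: &[u8], c2: &[u8]) -> Vec<u32> {
    assert_eq!(c0.len(), c1.len());
    assert_eq!(c0.len(), c2.len());
    let mut keys = Vec::with_capacity(c0.len()); // single allocation
    for i in 0..c0.len() {
        // Byte layout: [0x00, c2, c1, c0] from high to low.
        let key = (c0[i] as u32) | ((c1[i] as u32) << 8) | ((c2[i] as u32) << 16);
        keys.push(key);
    }
    keys
}

fn main() {
    let keys = pack_keys(&[1, 1], &[2, 2], &[3, 4]);
    // Row 0 packs (1, 2, 3) into 0x0003_0201.
    assert_eq!(keys[0], 0x0003_0201);
    println!("{:?}", keys);
}
```

Because equal group tuples pack to equal integers (and unequal tuples to unequal integers), the packed key can be compared and hashed as a plain u32 with no per-row branching.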

Development

Successfully merging this pull request may close these issues.

Considering migrate to arrow2
5 participants