ARROW-15336: [Go] Implement 'min_max_neon' with Arm64 GoLang Assembly #12163
Change-Id: Id7e2f5dd2db39f95caac9f232903ae29b889b0c2
Signed-off-by: Yuqi Gu <yuqi.gu@arm.com>
Test in
@zeroshade Added min_max_neon for parquet encoding.
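For readers unfamiliar with what a min/max kernel computes, here is a minimal scalar sketch in Go. The function name `minMaxInt32` and its signature are hypothetical and are not taken from the Arrow code base; it only illustrates the single-pass loop that a NEON routine would presumably process several lanes at a time.

```go
package main

import "fmt"

// minMaxInt32 is a hypothetical scalar reference, NOT the Arrow implementation.
// It returns the smallest and largest values of the slice in one pass.
// ok is false when the slice is empty, since no min/max exists in that case.
func minMaxInt32(values []int32) (lo, hi int32, ok bool) {
	if len(values) == 0 {
		return 0, 0, false
	}
	lo, hi = values[0], values[0]
	for _, v := range values[1:] {
		if v < lo {
			lo = v
		}
		if v > hi {
			hi = v
		}
	}
	return lo, hi, true
}

func main() {
	lo, hi, ok := minMaxInt32([]int32{7, -3, 42, 0, 19})
	fmt.Println(lo, hi, ok) // prints "-3 42 true"
}
```

A scalar loop like this is also a convenient reference to compare against when testing an assembly version on the same inputs.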
+1 Awesome thanks! Looking forward to the next one.
Benchmark runs are scheduled for baseline = 3bc1484 and contender = 2a221e1. 2a221e1 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Tests passed.
Benchmark:
Before:
BenchmarkPlainEncodingInt32/len_1024-64 314652 3805 ns/op 1076.62 MB/s
BenchmarkPlainEncodingInt32/len_2048-64 159996 7442 ns/op 1100.72 MB/s
BenchmarkPlainEncodingInt32/len_4096-64 79710 15058 ns/op 1088.09 MB/s
BenchmarkPlainEncodingInt32/len_8192-64 39919 30120 ns/op 1087.92 MB/s
BenchmarkPlainEncodingInt32/len_16384-64 19790 60623 ns/op 1081.04 MB/s
BenchmarkPlainEncodingInt32/len_32768-64 9618 120201 ns/op 1090.44 MB/s
BenchmarkPlainEncodingInt32/len_65536-64 4856 241563 ns/op 1085.20 MB/s
BenchmarkPlainEncodingInt64/len_1024-64 160287 7474 ns/op 1096.13 MB/s
BenchmarkPlainEncodingInt64/len_2048-64 80745 14969 ns/op 1094.49 MB/s
BenchmarkPlainEncodingInt64/len_4096-64 39399 30197 ns/op 1085.12 MB/s
BenchmarkPlainEncodingInt64/len_8192-64 19797 60557 ns/op 1082.23 MB/s
BenchmarkPlainEncodingInt64/len_16384-64 9660 120168 ns/op 1090.74 MB/s
BenchmarkPlainEncodingInt64/len_32768-64 4886 241728 ns/op 1084.46 MB/s
BenchmarkPlainEncodingInt64/len_65536-64 2412 498532 ns/op 1051.66 MB/s
BenchmarkPlainEncodingFloat32/len_1024-64 315712 3804 ns/op 1076.68 MB/s
BenchmarkPlainEncodingFloat32/len_2048-64 160172 7502 ns/op 1091.94 MB/s
BenchmarkPlainEncodingFloat32/len_4096-64 81097 14858 ns/op 1102.74 MB/s
BenchmarkPlainEncodingFloat32/len_8192-64 40135 29861 ns/op 1097.33 MB/s
BenchmarkPlainEncodingFloat32/len_16384-64 19816 60470 ns/op 1083.78 MB/s
BenchmarkPlainEncodingFloat32/len_32768-64 9736 120119 ns/op 1091.19 MB/s
BenchmarkPlainEncodingFloat32/len_65536-64 4778 241758 ns/op 1084.32 MB/s
BenchmarkPlainEncodingFloat64/len_1024-64 156734 7608 ns/op 1076.71 MB/s
BenchmarkPlainEncodingFloat64/len_2048-64 80022 14986 ns/op 1093.26 MB/s
BenchmarkPlainEncodingFloat64/len_4096-64 39451 29765 ns/op 1100.90 MB/s
BenchmarkPlainEncodingFloat64/len_8192-64 19820 60614 ns/op 1081.21 MB/s
BenchmarkPlainEncodingFloat64/len_16384-64 9933 120136 ns/op 1091.03 MB/s
BenchmarkPlainEncodingFloat64/len_32768-64 4904 240777 ns/op 1088.74 MB/s
BenchmarkPlainEncodingFloat64/len_65536-64 2382 492836 ns/op 1063.82 MB/s
After:
BenchmarkPlainEncodingInt32/len_1024-64 1984029 599.6 ns/op 6831.71 MB/s
BenchmarkPlainEncodingInt32/len_2048-64 1153614 1053 ns/op 7782.45 MB/s
BenchmarkPlainEncodingInt32/len_4096-64 578494 1980 ns/op 8276.18 MB/s
BenchmarkPlainEncodingInt32/len_8192-64 259530 4649 ns/op 7048.07 MB/s
BenchmarkPlainEncodingInt32/len_16384-64 108944 10561 ns/op 6205.63 MB/s
BenchmarkPlainEncodingInt32/len_32768-64 46070 24368 ns/op 5378.83 MB/s
BenchmarkPlainEncodingInt32/len_65536-64 22089 50834 ns/op 5156.82 MB/s
BenchmarkPlainEncodingInt64/len_1024-64 1136257 1069 ns/op 7666.01 MB/s
BenchmarkPlainEncodingInt64/len_2048-64 610842 2001 ns/op 8187.11 MB/s
BenchmarkPlainEncodingInt64/len_4096-64 260290 4303 ns/op 7615.33 MB/s
BenchmarkPlainEncodingInt64/len_8192-64 111594 10722 ns/op 6112.56 MB/s
BenchmarkPlainEncodingInt64/len_16384-64 50252 23624 ns/op 5548.31 MB/s
BenchmarkPlainEncodingInt64/len_32768-64 19208 53679 ns/op 4883.51 MB/s
BenchmarkPlainEncodingInt64/len_65536-64 9026 128405 ns/op 4083.09 MB/s
BenchmarkPlainEncodingFloat32/len_1024-64 2000133 613.3 ns/op 6678.84 MB/s
BenchmarkPlainEncodingFloat32/len_2048-64 1039034 1070 ns/op 7657.78 MB/s
BenchmarkPlainEncodingFloat32/len_4096-64 573483 2162 ns/op 7577.36 MB/s
BenchmarkPlainEncodingFloat32/len_8192-64 258648 4479 ns/op 7316.32 MB/s
BenchmarkPlainEncodingFloat32/len_16384-64 111152 10698 ns/op 6126.04 MB/s
BenchmarkPlainEncodingFloat32/len_32768-64 48129 22288 ns/op 5880.91 MB/s
BenchmarkPlainEncodingFloat32/len_65536-64 20336 58841 ns/op 4455.14 MB/s
BenchmarkPlainEncodingFloat64/len_1024-64 1148325 1105 ns/op 7411.95 MB/s
BenchmarkPlainEncodingFloat64/len_2048-64 580255 2067 ns/op 7927.00 MB/s
BenchmarkPlainEncodingFloat64/len_4096-64 260462 4469 ns/op 7331.86 MB/s
BenchmarkPlainEncodingFloat64/len_8192-64 116211 9802 ns/op 6685.97 MB/s
BenchmarkPlainEncodingFloat64/len_16384-64 51475 22753 ns/op 5760.62 MB/s
BenchmarkPlainEncodingFloat64/len_32768-64 21224 52235 ns/op 5018.54 MB/s
BenchmarkPlainEncodingFloat64/len_65536-64 10000 124035 ns/op 4226.93 MB/s
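This gives roughly a 4x to 7.6x speedup in ns/op across these cases, with throughput rising from about 1.1 GB/s to between 4.1 and 8.3 GB/s.
Other parquet-encoding benchmark cases show no performance difference.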
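As a quick check on the figures above, the speedup for any case is the ratio of its "Before" ns/op to its "After" ns/op. The sketch below uses a hypothetical helper, `speedup`, with two pairs of values copied from the listings (one of the largest gains and one of the smallest).

```go
package main

import "fmt"

// speedup is a hypothetical helper: it returns how many times faster the
// "after" run is than the "before" run, given their ns/op values.
func speedup(beforeNs, afterNs float64) float64 {
	return beforeNs / afterNs
}

func main() {
	// ns/op values copied from the Before/After benchmark listings.
	fmt.Printf("Int32/len_4096:  %.2fx\n", speedup(15058, 1980))
	fmt.Printf("Int64/len_65536: %.2fx\n", speedup(498532, 128405))
}
```

The same ratio can be taken from the MB/s column (after divided by before), which should agree up to rounding.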