ARROW-15336: [Go] Implement 'min_max_neon' with Arm64 GoLang Assembly #12163

Closed
guyuqi wants to merge 1 commit

guyuqi commented Jan 17, 2022

Tests passed.
Benchmark:
Before:
BenchmarkPlainEncodingInt32/len_1024-64                   314652              3805 ns/op        1076.62 MB/s
BenchmarkPlainEncodingInt32/len_2048-64                   159996              7442 ns/op        1100.72 MB/s
BenchmarkPlainEncodingInt32/len_4096-64                    79710             15058 ns/op        1088.09 MB/s
BenchmarkPlainEncodingInt32/len_8192-64                    39919             30120 ns/op        1087.92 MB/s
BenchmarkPlainEncodingInt32/len_16384-64                   19790             60623 ns/op        1081.04 MB/s
BenchmarkPlainEncodingInt32/len_32768-64                    9618            120201 ns/op        1090.44 MB/s
BenchmarkPlainEncodingInt32/len_65536-64                    4856            241563 ns/op        1085.20 MB/s
BenchmarkPlainEncodingInt64/len_1024-64                   160287              7474 ns/op        1096.13 MB/s
BenchmarkPlainEncodingInt64/len_2048-64                    80745             14969 ns/op        1094.49 MB/s
BenchmarkPlainEncodingInt64/len_4096-64                    39399             30197 ns/op        1085.12 MB/s
BenchmarkPlainEncodingInt64/len_8192-64                    19797             60557 ns/op        1082.23 MB/s
BenchmarkPlainEncodingInt64/len_16384-64                    9660            120168 ns/op        1090.74 MB/s
BenchmarkPlainEncodingInt64/len_32768-64                    4886            241728 ns/op        1084.46 MB/s
BenchmarkPlainEncodingInt64/len_65536-64                    2412            498532 ns/op        1051.66 MB/s
BenchmarkPlainEncodingFloat32/len_1024-64                 315712              3804 ns/op        1076.68 MB/s
BenchmarkPlainEncodingFloat32/len_2048-64                 160172              7502 ns/op        1091.94 MB/s
BenchmarkPlainEncodingFloat32/len_4096-64                  81097             14858 ns/op        1102.74 MB/s
BenchmarkPlainEncodingFloat32/len_8192-64                  40135             29861 ns/op        1097.33 MB/s
BenchmarkPlainEncodingFloat32/len_16384-64                 19816             60470 ns/op        1083.78 MB/s
BenchmarkPlainEncodingFloat32/len_32768-64                  9736            120119 ns/op        1091.19 MB/s
BenchmarkPlainEncodingFloat32/len_65536-64                  4778            241758 ns/op        1084.32 MB/s
BenchmarkPlainEncodingFloat64/len_1024-64                 156734              7608 ns/op        1076.71 MB/s
BenchmarkPlainEncodingFloat64/len_2048-64                  80022             14986 ns/op        1093.26 MB/s
BenchmarkPlainEncodingFloat64/len_4096-64                  39451             29765 ns/op        1100.90 MB/s
BenchmarkPlainEncodingFloat64/len_8192-64                  19820             60614 ns/op        1081.21 MB/s
BenchmarkPlainEncodingFloat64/len_16384-64                  9933            120136 ns/op        1091.03 MB/s
BenchmarkPlainEncodingFloat64/len_32768-64                  4904            240777 ns/op        1088.74 MB/s
BenchmarkPlainEncodingFloat64/len_65536-64                  2382            492836 ns/op        1063.82 MB/s

After:
BenchmarkPlainEncodingInt32/len_1024-64                  1984029               599.6 ns/op      6831.71 MB/s
BenchmarkPlainEncodingInt32/len_2048-64                  1153614              1053 ns/op        7782.45 MB/s
BenchmarkPlainEncodingInt32/len_4096-64                   578494              1980 ns/op        8276.18 MB/s
BenchmarkPlainEncodingInt32/len_8192-64                   259530              4649 ns/op        7048.07 MB/s
BenchmarkPlainEncodingInt32/len_16384-64                  108944             10561 ns/op        6205.63 MB/s
BenchmarkPlainEncodingInt32/len_32768-64                   46070             24368 ns/op        5378.83 MB/s
BenchmarkPlainEncodingInt32/len_65536-64                   22089             50834 ns/op        5156.82 MB/s
BenchmarkPlainEncodingInt64/len_1024-64                  1136257              1069 ns/op        7666.01 MB/s
BenchmarkPlainEncodingInt64/len_2048-64                   610842              2001 ns/op        8187.11 MB/s
BenchmarkPlainEncodingInt64/len_4096-64                   260290              4303 ns/op        7615.33 MB/s
BenchmarkPlainEncodingInt64/len_8192-64                   111594             10722 ns/op        6112.56 MB/s
BenchmarkPlainEncodingInt64/len_16384-64                   50252             23624 ns/op        5548.31 MB/s
BenchmarkPlainEncodingInt64/len_32768-64                   19208             53679 ns/op        4883.51 MB/s
BenchmarkPlainEncodingInt64/len_65536-64                    9026            128405 ns/op        4083.09 MB/s
BenchmarkPlainEncodingFloat32/len_1024-64                2000133               613.3 ns/op      6678.84 MB/s
BenchmarkPlainEncodingFloat32/len_2048-64                1039034              1070 ns/op        7657.78 MB/s
BenchmarkPlainEncodingFloat32/len_4096-64                 573483              2162 ns/op        7577.36 MB/s
BenchmarkPlainEncodingFloat32/len_8192-64                 258648              4479 ns/op        7316.32 MB/s
BenchmarkPlainEncodingFloat32/len_16384-64                111152             10698 ns/op        6126.04 MB/s
BenchmarkPlainEncodingFloat32/len_32768-64                 48129             22288 ns/op        5880.91 MB/s
BenchmarkPlainEncodingFloat32/len_65536-64                 20336             58841 ns/op        4455.14 MB/s
BenchmarkPlainEncodingFloat64/len_1024-64                1148325              1105 ns/op        7411.95 MB/s
BenchmarkPlainEncodingFloat64/len_2048-64                 580255              2067 ns/op        7927.00 MB/s
BenchmarkPlainEncodingFloat64/len_4096-64                 260462              4469 ns/op        7331.86 MB/s
BenchmarkPlainEncodingFloat64/len_8192-64                 116211              9802 ns/op        6685.97 MB/s
BenchmarkPlainEncodingFloat64/len_16384-64                 51475             22753 ns/op        5760.62 MB/s
BenchmarkPlainEncodingFloat64/len_32768-64                 21224             52235 ns/op        5018.54 MB/s
BenchmarkPlainEncodingFloat64/len_65536-64                 10000            124035 ns/op        4226.93 MB/s

This gives roughly a 4x to 6x performance uplift.

There is no performance difference in the other parquet-encoding benchmark cases.

Change-Id: Id7e2f5dd2db39f95caac9f232903ae29b889b0c2
Signed-off-by: Yuqi Gu <yuqi.gu@arm.com>
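
For readers who want to check the uplift figure against the tables, here is a small sketch that divides the "Before" ns/op by the "After" ns/op for a few rows copied from the benchmark output above. The helper name `speedup` and the choice of rows are illustrative only and are not part of this PR. Individual rows may land slightly outside the quoted range.

```go
package main

import "fmt"

// speedup reports how many times faster the "after" run is,
// given the ns/op figure of each run. (Hypothetical helper.)
func speedup(beforeNs, afterNs float64) float64 {
	return beforeNs / afterNs
}

func main() {
	// ns/op values copied from the Before/After tables in this PR.
	rows := []struct {
		name          string
		before, after float64
	}{
		{"Int32/len_1024", 3805, 599.6},
		{"Int32/len_65536", 241563, 50834},
		{"Int64/len_65536", 498532, 128405},
		{"Float32/len_1024", 3804, 613.3},
	}
	for _, r := range rows {
		fmt.Printf("%-18s %.2fx\n", r.name, speedup(r.before, r.after))
	}
}
```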

guyuqi commented Jan 17, 2022

Tested in arrow/go/parquet/internal/encoding with:
ARM_ENABLE_EXT=NEON go test

PASS
ok      github.com/apache/arrow/go/v7/parquet/internal/encoding 9.788s
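
The `ARM_ENABLE_EXT=NEON` prefix on the command above suggests that the NEON path is chosen from an environment variable at startup. The following is a minimal sketch of that kind of switch, not the code in this PR: the helper `wantsExt`, the comma-separated list format, and the case-insensitive match are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// wantsExt reports whether name appears in a comma-separated list of
// extension names. (Hypothetical helper; the list format is an assumption.)
func wantsExt(list, name string) bool {
	for _, ext := range strings.Split(list, ",") {
		if strings.EqualFold(strings.TrimSpace(ext), name) {
			return true
		}
	}
	return false
}

func main() {
	// Default to the portable implementation and switch only on request.
	impl := "pure Go"
	if wantsExt(os.Getenv("ARM_ENABLE_EXT"), "NEON") {
		impl = "NEON assembly"
	}
	fmt.Println("min/max implementation:", impl)
}
```

In a real package the selection would more likely assign a function variable in `init()` so that callers pay no per-call cost for the check.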

guyuqi commented Jan 17, 2022

@zeroshade Added min_max_neon for parquet encoding.
This gives roughly a 4x to 6x performance uplift in BenchmarkPlainEncodingInt32/Int64/Float32/Float64.
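
As a point of reference for what a min/max kernel computes, here is a scalar, single-pass sketch in plain Go. It is not the assembly from this PR; the function name `minMaxInt32` and the `ok` return value for empty input are assumptions for illustration. A vectorized version would be expected to return the same pair for the same input.

```go
package main

import "fmt"

// minMaxInt32 returns the smallest and largest values in one pass.
// ok is false for an empty slice. (Hypothetical reference implementation.)
func minMaxInt32(values []int32) (lo, hi int32, ok bool) {
	if len(values) == 0 {
		return 0, 0, false
	}
	lo, hi = values[0], values[0]
	for _, v := range values[1:] {
		if v < lo {
			lo = v
		}
		if v > hi {
			hi = v
		}
	}
	return lo, hi, true
}

func main() {
	lo, hi, ok := minMaxInt32([]int32{4, -3, 9, 0})
	fmt.Println(lo, hi, ok)
}
```

A scalar version like this is also handy as the oracle in a test that compares the assembly routine against known-good results.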

zeroshade left a comment

+1 Awesome, thanks! Looking forward to the next one.

zeroshade closed this in 2a221e1 Jan 17, 2022

ursabot commented Jan 17, 2022

Benchmark runs are scheduled for baseline = 3bc1484 and contender = 2a221e1. 2a221e1 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.52% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.3% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
