Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Parquet] Remove AVX512 variants of BYTE_STREAM_SPLIT encoding #40095

Closed
pitrou opened this issue Feb 15, 2024 · 7 comments · Fixed by #40127
Closed

[C++][Parquet] Remove AVX512 variants of BYTE_STREAM_SPLIT encoding #40095

pitrou opened this issue Feb 15, 2024 · 7 comments · Fixed by #40127

Comments

@pitrou
Copy link
Member

pitrou commented Feb 15, 2024

Describe the enhancement requested

According to previous observations, it seems the AVX512 accelerations of BYTE_STREAM_SPLIT perform equal or worse then their AVX2 counterparts.

Besides, the SSE2 and AVX2 accelerations, already performing at 5-10 GB/s or more, are amply fast enough.

We could therefore simply remove the AVX512 accelerations.

Component(s)

Benchmarking, C++, Parquet

@pitrou
Copy link
Member Author

pitrou commented Feb 15, 2024

@cyb70289 @wgtmac @mapleFU @felipecrv Opinions on this?

@felipecrv
Copy link
Contributor

According to previous observations, it seems the AVX512 accelerations of BYTE_STREAM_SPLIT perform equal or worse then their AVX2 counterparts.

Isn't that due to the effects of down-clocking when the CPU is executing too many AVX-512 instructions? It might be better in the future or to users that have a way to avoid that down-clocking.

@pitrou
Copy link
Member Author

pitrou commented Feb 16, 2024

I don't think all CPUs have down-clocking, do they? But regardless, if it doesn't provide a significant benefit, it doesn't make much sense to keep those codepaths, IMHO.

@cyb70289
Copy link
Contributor

I remember I normalized benchmark by cpu frequency, avx512 is still worse than avx2 (maybe sse4) on caslake.

@felipecrv
Copy link
Contributor

I don't think all CPUs have down-clocking, do they? But regardless, if it doesn't provide a significant benefit, it doesn't make much sense to keep those codepaths, IMHO.

Alright, then it might be a case of memory bandwidth being the bottleneck and not CPU uops per second. Let's remove it. ✂️

@mapleFU
Copy link
Member

mapleFU commented Feb 17, 2024

I think BYTE_STREAM_SPLIT encoding/decoding might be memory-bound operations. We can remove the AVX512 impl first. If anyone wants to improve or requires AVX512, we can also revert it back...

@wgtmac
Copy link
Member

wgtmac commented Feb 17, 2024

This removal looks reasonable. I have consulted some people at Intel on this but didn't get any useful answer.

pitrou added a commit to pitrou/arrow that referenced this issue Feb 19, 2024
…SPLIT encoding

Two reasons:
* the SSE2 and AVX2 variants are already fast enough (on the order of 10 GB/s)
* the AVX512 variants do not seem faster, and can even be slower, on tested Intel machines
pitrou added a commit that referenced this issue Feb 19, 2024
…encoding (#40127)

Two reasons:
* the SSE2 and AVX2 variants are already fast enough (on the order of 10 GB/s)
* the AVX512 variants do not seem faster, and can even be slower, on tested Intel machines

* Closes: #40095

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou pitrou added this to the 16.0.0 milestone Feb 19, 2024
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this issue Feb 28, 2024
…SPLIT encoding (apache#40127)

Two reasons:
* the SSE2 and AVX2 variants are already fast enough (on the order of 10 GB/s)
* the AVX512 variants do not seem faster, and can even be slower, on tested Intel machines

* Closes: apache#40095

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
thisisnic pushed a commit to thisisnic/arrow that referenced this issue Mar 8, 2024
…SPLIT encoding (apache#40127)

Two reasons:
* the SSE2 and AVX2 variants are already fast enough (on the order of 10 GB/s)
* the AVX512 variants do not seem faster, and can even be slower, on tested Intel machines

* Closes: apache#40095

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment