cuda : vsubss4 for older versions of ROCm/clang #2942

Engininja2 · 2023-08-31T19:37:13Z

The Clang that comes with ROCm 5.2 doesn't have the builtin __builtin_elementwise_sub_sat that's used to polyfill __vsubss4 for HIP. This adds fallback code to compile in that case.
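To illustrate the shape of such a fallback, here is a minimal host-side C++ sketch, not the code from this PR. It assumes `__vsubss4` means per-byte signed saturating subtraction on four `int8_t` lanes packed into one `int`. The names `vsubss4_sketch`, `pack4`, and `int8x4_t` are hypothetical, and the `__has_builtin` guard is one plausible way to select the fallback path.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

// Some compilers do not provide __has_builtin; treat that as "builtin missing".
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

// Hypothetical helper: pack four int8 lanes into one int.
static inline int pack4(int8_t x0, int8_t x1, int8_t x2, int8_t x3) {
    const int8_t lanes[4] = {x0, x1, x2, x3};
    int packed;
    std::memcpy(&packed, lanes, sizeof(packed));
    return packed;
}

// Hypothetical stand-in for __vsubss4: per-lane signed saturating subtraction.
static inline int vsubss4_sketch(int a, int b) {
#if defined(__clang__) && __has_builtin(__builtin_elementwise_sub_sat)
    // Newer Clang: let the builtin do the saturating subtraction on a 4 x int8 vector.
    typedef int8_t int8x4_t __attribute__((ext_vector_type(4)));
    int8x4_t va, vb;
    std::memcpy(&va, &a, sizeof(a));
    std::memcpy(&vb, &b, sizeof(b));
    const int8x4_t vc = __builtin_elementwise_sub_sat(va, vb);
    int c;
    std::memcpy(&c, &vc, sizeof(c));
    return c;
#else
    // Fallback: subtract each lane in a wider type, then clamp to the int8_t range.
    int8_t va[4], vb[4], vc[4];
    std::memcpy(va, &a, sizeof(a));
    std::memcpy(vb, &b, sizeof(b));
    for (int i = 0; i < 4; ++i) {
        int tmp = va[i] - vb[i];
        if (tmp > std::numeric_limits<int8_t>::max()) tmp = std::numeric_limits<int8_t>::max();
        if (tmp < std::numeric_limits<int8_t>::min()) tmp = std::numeric_limits<int8_t>::min();
        vc[i] = static_cast<int8_t>(tmp);
    }
    int c;
    std::memcpy(&c, vc, sizeof(c));
    return c;
#endif
}
```

Both paths should give the same result for every input; only the generated code differs. For example, `vsubss4_sketch(pack4(100, -100, 5, 0), pack4(-100, 100, 3, 0))` saturates the first two lanes and equals `pack4(127, -128, 2, 0)`.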

JohannesGaessler

This PR looks good to me but please note that I think the implementation could still be made faster. Just tell me when you want to get it merged.

JohannesGaessler · 2023-09-01T07:48:27Z

ggml-cuda.cu

+        if(tmp > std::numeric_limits<int8_t>::max()) tmp = std::numeric_limits<int8_t>::max();
+        if(tmp < std::numeric_limits<int8_t>::min()) tmp = std::numeric_limits<int8_t>::min();


If I understand this code correctly, it clamps the tmp value to the valid range of int8_t. In that case, I think something like the following would be faster:

Suggested change

-        if(tmp > std::numeric_limits<int8_t>::max()) tmp = std::numeric_limits<int8_t>::max();
-        if(tmp < std::numeric_limits<int8_t>::min()) tmp = std::numeric_limits<int8_t>::min();
+        const int smaller_min = tmp < std::numeric_limits<int8_t>::min();
+        const int bigger_max = tmp > std::numeric_limits<int8_t>::max();
+        tmp = smaller_min * std::numeric_limits<int8_t>::min() + bigger_max * std::numeric_limits<int8_t>::max()
+                + !(smaller_min | bigger_max) * tmp;

The reason I think it would be faster is that conditional statements are very slow on GPUs. Please note that I did not test this implementation and that you may need to test int vs. int16_t and | vs || for optimal performance.
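As a rough host-side check of that idea, the sketch below puts the `if`-based clamp next to the arithmetic-select form, with one name used consistently for the upper-bound flag. The function names `clamp_branchy` and `clamp_select` are hypothetical, and the sketch says nothing about which form is faster on a given GPU; it only shows that the two compute the same value.

```cpp
#include <cstdint>
#include <limits>

// Clamp written with conditional statements, as in the reviewed lines.
static inline int clamp_branchy(int tmp) {
    if (tmp > std::numeric_limits<int8_t>::max()) tmp = std::numeric_limits<int8_t>::max();
    if (tmp < std::numeric_limits<int8_t>::min()) tmp = std::numeric_limits<int8_t>::min();
    return tmp;
}

// Clamp written as arithmetic on 0/1 flags: exactly one of the three terms is nonzero.
static inline int clamp_select(int tmp) {
    const int smaller_min = tmp < std::numeric_limits<int8_t>::min();
    const int bigger_max  = tmp > std::numeric_limits<int8_t>::max();
    return smaller_min * std::numeric_limits<int8_t>::min()
         + bigger_max  * std::numeric_limits<int8_t>::max()
         + !(smaller_min | bigger_max) * tmp;
}
```

For any `tmp` that an int8 subtraction can produce (between -255 and 255), both functions return the same result, so a choice between them comes down to the code the compiler generates.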

The idea behind writing it that way is for the compiler to recognize that it's been asked to do saturating subtraction and not emit any conditional instructions at all.
The ROCm Clang versions available on Compiler Explorer emit @llvm.ssub.sat.v4i8: https://godbolt.org/z/azxe38hMs

Engininja2 · 2023-09-01T20:53:38Z

I checked the assembly for that function compiled for gfx803 on rocm-3.5.1, which is probably the oldest version anybody might want to use. It doesn't use conditional instructions either, so I think this is good to merge.

cuda : vsubss4 for older versions of ROCm/clang

75939b4

JohannesGaessler approved these changes Sep 1, 2023


JohannesGaessler merged commit f04d002 into ggml-org:master Sep 1, 2023

Engininja2 deleted the vsubss4-hip-compat branch September 6, 2023 21:00
