Skip to content

Conversation

@Engininja2
Copy link
Contributor

The Clang that comes with ROCm 5.2 doesn't have the builtin __builtin_elementwise_sub_sat that's used to polyfill __vsubss4 for HIP. This adds fallback code to compile in that case.

Copy link
Collaborator

@JohannesGaessler JohannesGaessler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR looks good to me but please note that I think the implementation could still be made faster. Just tell me when you want to get it merged.

Comment on lines +101 to +102
if(tmp > std::numeric_limits<int8_t>::max()) tmp = std::numeric_limits<int8_t>::max();
if(tmp < std::numeric_limits<int8_t>::min()) tmp = std::numeric_limits<int8_t>::min();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand this code correctly it clips the tmp value to be within the legal limits of int8_t. In that case I think something like the following would be faster:

Suggested change
if(tmp > std::numeric_limits<int8_t>::max()) tmp = std::numeric_limits<int8_t>::max();
if(tmp < std::numeric_limits<int8_t>::min()) tmp = std::numeric_limits<int8_t>::min();
const int smaller_min = tmp < std::numeric_limits<int8_t>::min();
const int bigger_max = tmp > std::numeric_limits<int8_t>::max();
tmp = smaller_min * std::numeric_limits<int8_t>::min() + larger_max * std::numeric_limits<int8_t>::max()
+ !(smaller_min | larger_max) * tmp;

The reason I think it would be faster is that conditional statements are very slow on GPUs. Please note that I did not test this implementation and that you may need to test int vs. int16_t and | vs || for optimal performance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea behind writing it that way is for the compiler to recognize that it's been asked to do saturating subtraction and not emit any conditional instructions at all.
The rocm clangs available on compiler explorer emit @llvm.ssub.sat.v4i8: https://godbolt.org/z/azxe38hMs

@Engininja2
Copy link
Contributor Author

I checked the assembly for that function compiled for gfx803 on rocm-3.5.1 which is probably the oldest anybody might want to use and it's not using conditional instructions either so I think this is good to merge.

@JohannesGaessler JohannesGaessler merged commit f04d002 into ggml-org:master Sep 1, 2023
@Engininja2 Engininja2 deleted the vsubss4-hip-compat branch September 6, 2023 21:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants