Cuda performance #2 #266

Closed
damdamce opened this Issue Nov 6, 2014 · 7 comments

Comments

damdamce commented Nov 6, 2014

Hi,
I experimented a bit more with my tests of GLM's CUDA performance. My testing environment is again Linux, CUDA 6.5, and a GeForce GTX 550 Ti. GLM's released version 0.9.5.4 was used, as the current trunk doesn't compile under CUDA.

The result in short

GLM is still 12% behind CUDA's native types and helper_math.h in certain cases.

About the tests made

In the last bug report (issue #257) I used only test 1 (watch out, the link contains the revision number). It was quite synthetic and therefore not very meaningful; in my own code GLM still turned out to be slower.

So I made a sort of minimal example for testing GLM alone, based on my code: test 2a. This is the example showing that GLM is 12% behind CUDA.

One interesting thing is that when the "early exit" is removed from the for loop in glm/cudaKernel, the difference between GLM and native CUDA is much smaller (and the overall performance is better): test 2b. Note that the only difference compared to test 2a is in lines 265ff.

Test results

I have already found some ways to improve GLM's performance (see below); here are the results after the best/fastest changes:

#test 1
CUDA kernel launch with 19532 blocks of 256 threads
time for cuda glm (matrix):         546 milliseconds
time for cuda helper math (matrix): 660 milliseconds
time for cuda glm (dot):            471 milliseconds
time for cuda helper math (dot):    491 milliseconds
time for cuda glm (cross):          246 milliseconds
time for cuda helper math (cross):  246 milliseconds
#test 2a
time for glm:   468 milliseconds
time for cuda:  417 milliseconds
#test 2b
time for glm:   373 milliseconds
time for cuda:  370 milliseconds

I made a file containing all test results.

Code changes

  • One important change is aligning vec4 to 16 bytes; this is the same change as in issue #257.
  • I also aligned vec2 to 8 bytes, but didn't test it.
  • Aligning vec3 to 16 bytes gives improvements in the synthetic test, but makes test 2 slower. The reason is probably different loads and register/memory usage.
  • Removing all const references from the base classes' (mat4, vec3, vec4, etc.) methods gave an improvement in some cases (surprisingly, test 2a didn't change but test 2b did; this might be an error in my testing method, though I checked it several times and couldn't find anything). It seems that passing by value is faster on the GPU, but in test 2a this is shadowed by some other bottleneck. The use of const references is controlled by the "#define GLM_REFERENCE const &" in setup.hpp in my version of GLM.
  • I experimented with operator*(mat4, vec4), which changed the performance only when const references were used.

Conclusion

There is still some issue with GLM's CUDA performance. I would be happy to continue testing if you give me some ideas on what to test. I just hope the cause isn't GLM's elaborate use of templates.

Something that could lead us to the solution is the difference between tests 2a and 2b. It could mean that loading something (instructions?) from memory is more expensive in GLM, and that the early exit causes cache misses or something like that. But I'm just speculating.

@Groovounet Groovounet added this to the GLM 0.9.6 milestone Nov 23, 2014

@Groovounet Groovounet self-assigned this Nov 23, 2014

Groovounet pushed a commit that referenced this issue Nov 23, 2014

damdamce commented Nov 23, 2014

Groovounet wrote:

I created a new extension exposing aligned types. Alignment is definitely not what we always want even if it's extremely useful.

If you want an aligned flavor of a vec4, include and you can use aligned_vec4.

You can also define your own aligned type in a cross platform manner using:
GLM_ALIGNED_TYPEDEF(vec3, my_vec3, 16);

Where my_vec3 is a vec3 aligned to 16 bytes.

Somebody could easily miss that when starting to use GLM with CUDA. Besides, writing glm::aligned_vec4 would be cumbersome, and in CUDA you would always use the aligned version anyway.

I think something like

#ifndef __NVCC__
typedef tvec4.. vec4
#else
GLM_ALIGNED_TYPEDEF(tvec4.., vec4, 16);
#endif

and the same for vec2 would be better.

Thanks for your work anyway :)


Member

Groovounet commented Nov 23, 2014

I don't think this is true. Actually, if you pack your structures correctly and use an aligned memory allocator like _aligned_malloc (with GCC; I don't know about CUDA), you probably don't even need aligned types.

So anyway, I think both are somewhat useful, hence the solution of exposing more types. :p

On the contrary, when requiring a specific alignment, data won't be as well packed, which might consume more bandwidth and cause individual cache line fetches.


@Groovounet Groovounet closed this Nov 23, 2014

damdamce commented Nov 23, 2014

Hehe, this issue is only about CUDA, so GCC doesn't matter here, imho. Neither was issue #257 about GCC.

And I agree that both data types can be useful with the GCC, Intel, or MS compilers, but an unaligned vec4 is not useful in CUDA: performance will always be worse, as shown in the fairly extensive benchmarks. The NVIDIA engineers certainly know what they are doing, and they aligned their native float4 and float2.

If you want to expose the unaligned vec4 in CUDA, an unaligned_vec4 datatype would make more sense, but the default should really be an aligned datatype. Oh, and the same also goes for ivec4, ivec2, etc.

Moreover, this issue is not closed by any means :) Performance is still behind CUDA native types. Maybe you mistook it for issue #257? I should have commented there, but I'm unsure whether you see comments on closed issues.


Member

Groovounet commented Nov 23, 2014

I strongly disagree with your point of view. Alignment is compiler independent, apart from the fact that compilers require a minimum global alignment.

Alignment is a real topic that should not be handled lightly. Sure, in your test cases it always shows better performance, but you are looking at a single scenario that is SoA based. With AoS cases, you will be very happy to use an unaligned vec3 + a float if that fits your real-life scenario, and such scenarios happen all the time: processing vertices in a compute shader, for example.

Thanks,
Christophe


damdamce commented Nov 23, 2014

Right now I was talking only about vec4 and vec2; I agree vec3 shouldn't be aligned to 16 bytes :)

Moreover, I didn't provide only one test case, I provided three different ones: one taken from real life, and two using AoS (granted, small ones). All of them are clearly faster with an aligned vec4.

And granted, it's not about the compiler, it's about the architecture. Unaligned fetches are quite expensive on the GPU: an unaligned vec4 needs two fetches instead of one; multiply that by the 1024 threads that can be in one group and you have a serious bottleneck. There is a reason the native float2 and float4 are aligned in CUDA :)

With the unaligned_vec4 approach there are two issues, in my opinion: first, it will be hard to find, and second, it is a lot to write, code gets longer, etc.

With the proposed solution nothing would change for non-CUDA projects, and CUDA projects would benefit from a faster GLM.

Additionally, it hurts a bit that you closed this issue/enhancement without even commenting on the performance issue. I put a lot of work into developing the tests and documenting the results; right now it seems a bit as if you didn't even read it. I mean, I like GLM a lot, and I would understand if you said that you don't have time or something, but then the issue shouldn't be closed; it should wait for later, imo.


@Groovounet Groovounet added the wontfix label Nov 23, 2014

Member

Groovounet commented Nov 23, 2014

Another example: struct{vec3, vec2, vec3};
I have seen nothing like this in your tests.

I think it's a terrible idea to expose different behaviors on different platforms, especially as alignment is not a platform-specific issue. Furthermore, systematically aligning vec4 is not consistent with the rest of the world in that domain; it is not what people would expect, so it's not a good idea.

aligned_vec4 resolves the misaligned case of SoA, and I am confident that the current resolution is effective for all data structure scenarios.

I won't comment further; this case is closed to me.

Thanks,
Christophe


fhoenig commented Feb 3, 2015

Hey, I know you guys closed this, but I just looked at some PTX output using GLM and it seems rather verbose. Has anybody done more in-depth tests on CUDA/GLM?

