Bad matrix-vector multiplication performance in Cuda #257

Closed
damdamce opened this Issue Oct 28, 2014 · 15 comments

@damdamce

During development I noticed that glm slowed down my project quite a bit, so I made some simple tests with CUDA + glm.

I basically do

result[i] = matrix * vectors[i];

several times in a CUDA kernel, once using glm and once using CUDA's float4 plus a very simple custom mat4.
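
In other words, each variant boils down to a kernel along these lines (a rough sketch of the setup, not the exact code from the Bitbucket repository linked below):

    // Sketch of the glm variant of the benchmark kernel; names are illustrative.
    // Depending on the GLM version, GLM_FORCE_CUDA may need to be defined before
    // including glm so its functions get __host__ __device__ qualifiers.
    #include <glm/glm.hpp>

    __global__ void transform_glm(glm::mat4 matrix,
                                  glm::vec4 const* vectors,
                                  glm::vec4* result,
                                  int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            result[i] = matrix * vectors[i];
    }

    // Launched over the whole array, e.g.:
    // transform_glm<<<(n + 255) / 256, 256>>>(matrix, d_vectors, d_result, n);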

time for cuda glm (matrix): 859 milliseconds
time for cuda helper math (matrix): 225 milliseconds

interestingly dot and cross product are faster with glm in a similar test:

time for cuda glm (dot and cross): 175 milliseconds
time for cuda helper math (dot and cross): 248 milliseconds

I used a GeForce GTX 550 Ti, CUDA 6.5 and GLM 0.9.5.4 on Linux for the test.

I've put the full test on Bitbucket (https://bitbucket.org/adamce/cuda-glm-performance-test/src/).

  • the test is in src/main.cu
  • the custom mat4 and multiplication operator are at the very bottom of helper_math.h
@Groovounet

Groovounet (Member) commented Oct 28, 2014

Very interesting.

It's possible that the strategies in place are a better fit for SSE register-size allocations than for scalar GPUs. Actually, the implementation is really based on SSE registers.

It seems very possible to come up with a better strategy for scalar GPUs. I'll investigate when I find the time for this.

Thanks,
Christophe

@damdamce

damdamce commented Oct 28, 2014

Could you give a rough estimate so that I can decide whether to use glm for my project?

Thanks, adam

@Groovounet

Groovounet (Member) commented Oct 28, 2014

Sadly I can't.

@damdamce

damdamce commented Oct 28, 2014

ok : )

I made one additional test by replacing operator*(..) with my own function:

    __device__ __host__ glm::vec4 mul(glm::mat4 m, glm::vec4 v) {
        return glm::vec4(m[0].x*v.x + m[1].x*v.y + m[2].x*v.z + m[3].x*v.w,
                         m[0].y*v.x + m[1].y*v.y + m[2].y*v.z + m[3].y*v.w,
                         m[0].z*v.x + m[1].z*v.y + m[2].z*v.z + m[3].z*v.w,
                         m[0].w*v.x + m[1].w*v.y + m[2].w*v.z + m[3].w*v.w);
    }

The performance didn't change; it's still 4 times slower. Therefore I think it might have something to do with how the matrices are stored: my custom code uses row-major storage, while glm uses column-major as far as I know. On the other hand, I also tried storing row-major and flipping my custom mul, with no success.

Bottom line: I think it's neither row-major storage nor the multiplication function, but something else.
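
For reference, glm's convention is that m[c] is column c and m[c][r] is row r of that column, e.g. (a tiny host-side illustration, assuming the gtc matrix_transform header):

    #include <glm/glm.hpp>
    #include <glm/gtc/matrix_transform.hpp>

    // glm matrices are column-major: m[c] is a whole column (a vec4 for mat4),
    // and m[c][r] addresses row r of that column.
    glm::mat4 t = glm::translate(glm::mat4(1.0f), glm::vec3(1.0f, 2.0f, 3.0f));
    // t[3] is the fourth column and holds the translation: vec4(1, 2, 3, 1).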

@chbaker0

chbaker0 commented Oct 28, 2014

Looking in glm/detail/type_vec4.hpp I saw the storage type used in vec4 is this:

    template <typename T>
    struct simd
    {
        typedef T type[4];
    };

#   if(GLM_ARCH & GLM_ARCH_SSE2)
        template <>
        struct simd<float>
        {
            typedef __m128 type;
        };
#   endif

#   if(GLM_ARCH & GLM_ARCH_AVX)
        template <>
        struct simd<double>
        {
            typedef __m256d type;
        };
#   endif

And the mat4 is just an array of 4 vec4. It's reasonable to believe that CUDA's builtin float4 has specific alignment, while the regular simd struct doesn't. Just speculation of course, but you might want to try applying alignment and see how that affects performance.
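
Something along these lines would be a quick way to test it; __align__(n) is CUDA's alignment macro from host_defines.h (equivalent to __attribute__((aligned(n))) under gcc), and the struct and kernel names here are just made up for illustration:

    // Sketch: force 16-byte alignment on a plain 4-float struct so the compiler
    // can emit a single 128-bit load/store per element instead of four scalar ones.
    struct vec4_plain                 { float x, y, z, w; };   // default 4-byte alignment
    struct __align__(16) vec4_aligned { float x, y, z, w; };   // 16-byte alignment

    __global__ void copy_elements(vec4_aligned const* __restrict__ in,
                                  vec4_aligned* __restrict__ out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];   // with alignment, one 16-byte transaction per thread
    }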

@damdamce

damdamce commented Oct 28, 2014

yey : D

These are the new benchmarks 8)

time for cuda glm (matrix): 233 milliseconds
time for cuda helper math (matrix): 225 milliseconds
time for cuda glm (dot): 187 milliseconds
time for cuda helper math (dot): 307 milliseconds
time for cuda glm (cross): 45 milliseconds
time for cuda helper math (cross): 163 milliseconds

The matrix multiplication is not totally there yet, but it's very close. The other two are even faster than before.

This is a diff against version 0.9.5.4:

diff -r 653704ab2abc src/glm/detail/type_vec4.hpp
--- a/src/glm/detail/type_vec4.hpp  Tue Oct 28 19:35:36 2014 +0100
+++ b/src/glm/detail/type_vec4.hpp  Tue Oct 28 23:24:52 2014 +0100
@@ -29,6 +29,11 @@
#ifndef glm_core_type_gentype4
#define glm_core_type_gentype4

+#if !defined(__CUDA_LIBDEVICE__)
+#include "builtin_types.h"
+#endif /* !__CUDA_LIBDEVICE__ */
+#include "host_defines.h"
+
//#include "../fwd.hpp"
#include "setup.hpp"
#include "type_vec.hpp"
@@ -45,7 +50,7 @@
namespace detail
{
    template <typename T, precision P>
-   struct tvec4
+    struct __builtin_align__(16) tvec4
    {
        //////////////////////////////////////
        // Implementation detail

But I'm afraid it can't go into master like that. Anyway, it fixes glm for me now and I can use it until the change goes in. Thanks, I'm happy.

@Groovounet

Groovounet (Member) commented Oct 28, 2014

Interesting results.

fvec4SIMD does have this alignment code in place, and yes, it's pretty important for the CPU SSE code paths.

Do you actually need:
+#if !defined(__CUDA_LIBDEVICE__)
+#include "builtin_types.h"
+#endif /* !__CUDA_LIBDEVICE__ */
+#include "host_defines.h"
?

@damdamce

damdamce commented Oct 28, 2014

Hm, it seems that I don't need them. I put them in originally because QtCreator showed me an error.

But at the moment I'm not sure of anything, because after messing around with glm's mat4*vec4 operator, CUDA's helper math matrix operations suddenly take 333 milliseconds (100 more). So I'll reboot and recheck. (Edit: after the reboot it's back to 225.)

Is it possible to use some define or something to enable it, so that I can use vanilla glm?

One more piece of info: I don't know if it's significant, but using

 return typename tmat4x4<T, P>::col_type(
        m[0][0] * v[0] + m[1][0] * v[1] + m[2][0] * v[2] + m[3][0] * v[3],
        m[0][1] * v[0] + m[1][1] * v[1] + m[2][1] * v[2] + m[3][1] * v[3],
        m[0][2] * v[0] + m[1][2] * v[1] + m[2][2] * v[2] + m[3][2] * v[3],
        m[0][3] * v[0] + m[1][3] * v[1] + m[2][3] * v[2] + m[3][3] * v[3]);

instead of the code with Mov0(v[0]) gives an improvement from 233 to 230 msec.
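
For context, the code with Mov0 that I replaced looks roughly like this (from memory of glm/detail/type_mat4x4.inl in 0.9.5): each vector component is first splatted into its own col_type, then the columns are scaled and summed pairwise, which is an SSE-friendly shape rather than a scalar-GPU-friendly one:

    typename tmat4x4<T, P>::col_type const Mov0(v[0]);
    typename tmat4x4<T, P>::col_type const Mov1(v[1]);
    typename tmat4x4<T, P>::col_type const Mul0 = m[0] * Mov0;
    typename tmat4x4<T, P>::col_type const Mul1 = m[1] * Mov1;
    typename tmat4x4<T, P>::col_type const Add0 = Mul0 + Mul1;
    // ...and likewise for components 2 and 3, returning the sum of all four products.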

@Groovounet

Groovounet (Member) commented Oct 28, 2014

Use GLM_ALIGN(16) instead of __builtin_align__(16) and you might be good to go...
Erm, actually I probably need to update setup.hpp that defines GLM_ALIGN to support CUDA...

But that's probably all that is needed here.

@damdamce

damdamce commented Oct 28, 2014

diff -r b492363e33cf src/glm/detail/setup.hpp
--- a/src/glm/detail/setup.hpp  Tue Oct 28 23:28:14 2014 +0100
+++ b/src/glm/detail/setup.hpp  Wed Oct 29 00:23:18 2014 +0100
@@ -755,7 +755,7 @@
#  define GLM_ALIGNED_STRUCT(x) __declspec(align(x)) struct
#  define GLM_RESTRICT
#  define GLM_RESTRICT_VAR __restrict
-#elif(GLM_COMPILER & (GLM_COMPILER_GCC | GLM_COMPILER_CLANG))
+#elif(GLM_COMPILER & (GLM_COMPILER_GCC | GLM_COMPILER_CLANG | GLM_COMPILER_CUDA))
#  define GLM_DEPRECATED __attribute__((__deprecated__))
#  define GLM_ALIGN(x) __attribute__((aligned(x)))
#  define GLM_ALIGNED_STRUCT(x) struct __attribute__((aligned(x)))
diff -r b492363e33cf src/glm/detail/type_vec4.hpp
--- a/src/glm/detail/type_vec4.hpp  Tue Oct 28 23:28:14 2014 +0100
+++ b/src/glm/detail/type_vec4.hpp  Wed Oct 29 00:23:18 2014 +0100
@@ -50,7 +45,7 @@
namespace detail
{
    template <typename T, precision P>
-    struct tvec4
+    GLM_ALIGNED_STRUCT(16) tvec4
    {
        //////////////////////////////////////
        // Implementation detail

This works for me.

@Groovounet Groovounet added the bug label Oct 28, 2014

@Groovounet Groovounet added this to the GLM 0.9.6 milestone Oct 28, 2014

@Groovounet Groovounet self-assigned this Oct 28, 2014

@Groovounet

Groovounet (Member) commented Oct 28, 2014

These changes made it to master for the GLM 0.9.6 release.

Thanks for contributing!
Christophe

@Groovounet Groovounet closed this Oct 28, 2014

@damdamce

damdamce commented Oct 29, 2014

Do you think it would also be possible to add the alignment rules for vec2 and vec3?

It would be align(8) for vec2 and align(16) for vec3.
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses
Edit: cross that for vec3; it's not aligned in CUDA.

I did some tests with the cross kernel:
float4
time for cuda glm (cross): 44 milliseconds
time for cuda helper math (cross): 44 milliseconds

float3
time for cuda glm (cross): 87 milliseconds (this includes align(16); before, it was also at 98 msec)
time for cuda helper math (cross): 98 milliseconds

I don't understand why vec3 aligned to 16 bytes is slower than vec4. It might be the testing method, but the thing is that the CUDA docs recommend align(16), so it should be better...

@Groovounet

Groovounet (Member) commented Nov 12, 2014

The fix for this bug turns out not to be valid, as it would force all vec4 types to be aligned to 16 bytes, including i8vec4 for example.

The fix has been reverted for the moment.

Groovounet pushed a commit that referenced this issue Nov 23, 2014

@Groovounet

Groovounet (Member) commented Nov 23, 2014

I created a new extension exposing aligned types. Alignment is definitely not always what we want, even if it's extremely useful.

If you want an aligned flavor of a vec4, include <glm/gtc/type_aligned.hpp> and you can use aligned_vec4.

You can also define your own aligned type in a cross-platform manner using:
GLM_ALIGNED_TYPEDEF(vec3, my_vec3, 16);

Where my_vec3 is a vec3 aligned to 16 bytes.

Thanks for contributing,
Christophe
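
For the CUDA case in this thread, usage would look something like this (a sketch; the typedef and kernel names are just examples):

    #include <glm/glm.hpp>
    #include <glm/gtc/type_aligned.hpp>

    // Ready-made aligned flavor of vec4 from the extension.
    typedef glm::aligned_vec4 avec4;

    // Or roll your own aligned type with the cross-platform macro:
    GLM_ALIGNED_TYPEDEF(glm::vec3, my_vec3, 16);   // a vec3 aligned to 16 bytes

    // With 16-byte aligned elements the GPU can load each vec4 in a single
    // 128-bit transaction, as in the benchmarks earlier in this thread.
    __global__ void transform(glm::mat4 m, avec4 const* in, avec4* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = m * in[i];
    }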

@Groovounet Groovounet closed this Nov 23, 2014

@Groovounet Groovounet added wontfix enhancement and removed bug labels Nov 23, 2014

@Groovounet

Groovounet (Member) commented Nov 23, 2014

Hi,

What's left to fix here?
Thanks,
Christophe
