OpenCL Error -11 in line163 of clBLAS/src/library/blas/xgemm.cc #205

Closed
akshayc11 opened this issue Dec 22, 2015 · 9 comments

Comments

@akshayc11

Similar to #172.

I am hitting this issue on an Ubuntu 14.04 machine with an NVIDIA GeForce Titan and CUDA 7.5.

This code worked for me before the AutoGemm overhaul, but it fails with the latest version. The error occurs on the very first call to the function.
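For readers decoding the number in the title: in the OpenCL headers, status `-11` is `CL_BUILD_PROGRAM_FAILURE`, which means a kernel failed to compile on the device. The sketch below is a small lookup for a few common status codes. `cl_status_name` is a hypothetical helper written for illustration. It is not part of OpenCL or clBLAS, and it covers only a subset of the codes defined in `CL/cl.h`.

```c
/* Hypothetical helper: map a few OpenCL status codes to their names.
 * The numeric values are the ones defined in CL/cl.h. */
static const char *cl_status_name(int status)
{
    switch (status) {
    case 0:   return "CL_SUCCESS";
    case -1:  return "CL_DEVICE_NOT_FOUND";
    case -2:  return "CL_DEVICE_NOT_AVAILABLE";
    case -3:  return "CL_COMPILER_NOT_AVAILABLE";
    case -4:  return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
    case -5:  return "CL_OUT_OF_RESOURCES";
    case -6:  return "CL_OUT_OF_HOST_MEMORY";
    case -11: return "CL_BUILD_PROGRAM_FAILURE";
    default:  return "UNKNOWN (see CL/cl.h)";
    }
}
```

Calling `cl_status_name(-11)` returns the string `"CL_BUILD_PROGRAM_FAILURE"`. When the program handle is available, the compiler output behind such a failure can usually be read with `clGetProgramBuildInfo` and `CL_PROGRAM_BUILD_LOG`. Whether that is practical here is uncertain, since clBLAS builds its kernels internally.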

Please let me know if you need any additional information.

I have been unsuccessful in compiling the test cases, so I cannot verify that the error is reproducible.

Thanks

Akshay

@pavanky
Contributor

pavanky commented Dec 22, 2015

@akshayc11 Can you check whether this happens in the develop branch?

@akshayc11
Author

This does not happen with the develop branch.

Thanks

Akshay

@akshayc11
Author

There is an additional issue when I move to develop: some of the gemm calls return matrices containing NaNs. These do not appear if I revert to the commit I had been using before:

commit 9731ea2
Merge: a6b3f9d 3f032e7
Author: David Tanner <david.tanner@amd.com>
Date: Wed Jul 1 15:00:31 2015 -0500
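One way to narrow down which gemm calls produce NaNs is to scan the result buffer after it has been read back to the host. The sketch below assumes a row-major `float` matrix with row stride `ld`. `count_nans` is a hypothetical helper for illustration and is not part of clBLAS.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: count NaN entries in a row-major rows x cols
 * float matrix whose consecutive rows are ld elements apart. */
static size_t count_nans(const float *c, size_t rows, size_t cols, size_t ld)
{
    size_t count = 0;
    for (size_t i = 0; i < rows; ++i) {
        for (size_t j = 0; j < cols; ++j) {
            if (isnan(c[i * ld + j])) {
                ++count;
            }
        }
    }
    return count;
}
```

Running such a check after each gemm call on both the develop build and the older commit, with identical inputs, would show which call first diverges and could form the basis of a standalone reproducer.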

@pavanky
Contributor

pavanky commented Dec 22, 2015

@akshayc11 Is this for complex gemm? If so, check out my PR. It has a fix for that.

@pavanky
Contributor

pavanky commented Dec 22, 2015

This is the PR I am referring to: #202

@akshayc11
Author

It's for gemm with real numbers only. I am pretty sure the code I am running never deals with complex numbers.

@pavanky
Contributor

pavanky commented Dec 22, 2015

@akshayc11 Any chance you can link a standalone snippet that reproduces the problem? I am investigating other issues with clBLAS that we face when building it with our library. I'll look into fixing this alongside the other issues.

@akshayc11
Author

@pavanky Sorry for the delayed response. Unfortunately, I do not have a standalone snippet at this point. The codebase where I use this is quite convoluted and goes through multiple nested function calls before reaching the gemm call. For now, I have reverted to a version of master that worked before.

With the clBLAS library built from the develop branch, compiling and running the sample code gives the following:

$ gcc -I/usr/local/cuda/include -I/data-local/akshayc/Workspace/Software/asr-lge-embedded/tools/clBLAS-dynamic/build-linux-dynamic/package/include example_sgemm.c -o gemm -L/data-local/akshayc/Workspace/Software/asr-lge-embedded/tools/clBLAS-dynamic/build-linux-dynamic/package/lib64 -lclBLAS -L/usr/local/cuda/lib64 -lOpenCL 

$ export LD_LIBRARY_PATH=/usr/local/cuda/lib64:`pwd`/package/lib64:$LD_LIBRARY_PATH

$ ./gemm 
Segmentation fault (core dumped)

@akshayc11
Author

On running the command with valgrind, I get:

$ valgrind ./gemm
==13673== Memcheck, a memory error detector
==13673== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==13673== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==13673== Command: ./gemm
==13673== 
==13673== Invalid read of size 4
==13673==    at 0x67B79A9: ??? (in /usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0.0)
==13673==    by 0x67B7F28: ??? (in /usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0.0)
==13673==    by 0x4010139: call_init.part.0 (dl-init.c:78)
==13673==    by 0x4010222: call_init (dl-init.c:36)
==13673==    by 0x4010222: _dl_init (dl-init.c:126)
==13673==    by 0x4001309: ??? (in /lib/x86_64-linux-gnu/ld-2.19.so)
==13673==  Address 0x77ae404 is 20 bytes inside a block of size 23 alloc'd
==13673==    at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==13673==    by 0x67B796A: ??? (in /usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0.0)
==13673==    by 0x67B7F28: ??? (in /usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0.0)
==13673==    by 0x4010139: call_init.part.0 (dl-init.c:78)
==13673==    by 0x4010222: call_init (dl-init.c:36)
==13673==    by 0x4010222: _dl_init (dl-init.c:126)
==13673==    by 0x4001309: ??? (in /lib/x86_64-linux-gnu/ld-2.19.so)
==13673== 
==13673== Warning: noted but unhandled ioctl 0x30000001 with no size/direction hints.
==13673==    This could cause spurious value errors to appear.
==13673==    See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
==13673== Warning: set address range perms: large range [0x200000000, 0x700000000) (noaccess)
==13673== Warning: set address range perms: large range [0x900000000, 0xc00000000) (noaccess)
==13673== Warning: set address range perms: large range [0xc00000000, 0xf00000000) (noaccess)
==13673== Use of uninitialised value of size 8
==13673==    at 0x400FDF2: _dl_signal_error (dl-error.c:94)
==13673==    by 0x400FF7D: _dl_signal_cerror (dl-error.c:155)
==13673==    by 0x400B267: _dl_lookup_symbol_x (dl-lookup.c:779)
==13673==    by 0x400F556: _dl_fixup (dl-runtime.c:111)
==13673==    by 0x4016514: _dl_runtime_resolve (dl-trampoline.S:45)
==13673==    by 0x50D2B82: rwlockInit (rwlock.c:110)
==13673==    by 0x5104D65: clblasFunctorCache<clblasSscalFunctorGeneric, _clblasXscalFunctorGenericData, std::less<_clblasXscalFunctorGenericData> >::clblasFunctorCache() (functor.h:280)
==13673==    by 0x5104B2B: __static_initialization_and_destruction_0(int, int) (functor_xscal_generic.cc:194)
==13673==    by 0x5104C4B: _GLOBAL__sub_I_functor_xscal_generic.cc (functor_xscal_generic.cc:439)
==13673==    by 0x4010139: call_init.part.0 (dl-init.c:78)
==13673==    by 0x4010222: call_init (dl-init.c:36)
==13673==    by 0x4010222: _dl_init (dl-init.c:126)
==13673==    by 0x4001309: ??? (in /lib/x86_64-linux-gnu/ld-2.19.so)
==13673== 
==13673== 
==13673== Process terminating with default action of signal 11 (SIGSEGV)
==13673==  Access not within mapped region at address 0xB
==13673==    at 0x400FDF2: _dl_signal_error (dl-error.c:94)
==13673==    by 0x400FF7D: _dl_signal_cerror (dl-error.c:155)
==13673==    by 0x400B267: _dl_lookup_symbol_x (dl-lookup.c:779)
==13673==    by 0x400F556: _dl_fixup (dl-runtime.c:111)
==13673==    by 0x4016514: _dl_runtime_resolve (dl-trampoline.S:45)
==13673==    by 0x50D2B82: rwlockInit (rwlock.c:110)
==13673==    by 0x5104D65: clblasFunctorCache<clblasSscalFunctorGeneric, _clblasXscalFunctorGenericData, std::less<_clblasXscalFunctorGenericData> >::clblasFunctorCache() (functor.h:280)
==13673==    by 0x5104B2B: __static_initialization_and_destruction_0(int, int) (functor_xscal_generic.cc:194)
==13673==    by 0x5104C4B: _GLOBAL__sub_I_functor_xscal_generic.cc (functor_xscal_generic.cc:439)
==13673==    by 0x4010139: call_init.part.0 (dl-init.c:78)
==13673==    by 0x4010222: call_init (dl-init.c:36)
==13673==    by 0x4010222: _dl_init (dl-init.c:126)
==13673==    by 0x4001309: ??? (in /lib/x86_64-linux-gnu/ld-2.19.so)
==13673==  If you believe this happened as a result of a stack
==13673==  overflow in your program's main thread (unlikely but
==13673==  possible), you can try to increase the size of the
==13673==  main thread stack using the --main-stacksize= flag.
==13673==  The main thread stack size used in this run was 8388608.
==13673== 
==13673== HEAP SUMMARY:
==13673==     in use at exit: 209,565 bytes in 128 blocks
==13673==   total heap usage: 234 allocs, 106 frees, 273,577 bytes allocated
==13673== 
==13673== LEAK SUMMARY:
==13673==    definitely lost: 32,816 bytes in 1 blocks
==13673==    indirectly lost: 0 bytes in 0 blocks
==13673==      possibly lost: 2,312 bytes in 17 blocks
==13673==    still reachable: 174,437 bytes in 110 blocks
==13673==         suppressed: 0 bytes in 0 blocks
==13673== Rerun with --leak-check=full to see details of leaked memory
==13673== 
==13673== For counts of detected and suppressed errors, rerun with: -v
==13673== Use --track-origins=yes to see where uninitialised values come from
==13673== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 1 from 1)
