
CUDA-related tests failed when running make test #922

Closed
QuantumY-CHN opened this issue Nov 18, 2022 · 7 comments
Labels
NVIDIA CUDA Nvidia GPU related issues


@QuantumY-CHN

QuantumY-CHN commented Nov 18, 2022

Thank you for your fantastic work.

I'm on Ubuntu 18.04 with CUDA 10.0, and some CUDA-related tests fail when I run 'make test'. Could you please give some advice on this? Thanks in advance.

96% tests passed, 8 tests failed out of 183

Total Test time (real) = 244.35 sec

The following tests FAILED:
	  1 - cuda_memcheck_dense_qr_test (Failed)
	  2 - cuda_memcheck_dense_cholesky_test (Failed)
	 26 - cuda_dense_cholesky_test (Failed)
	 27 - cuda_dense_qr_test (Failed)
	102 - ba_denseschur_cuda_auto_test (Subprocess aborted)
	123 - ba_denseschur_cuda_auto_threads_test (Subprocess aborted)
	144 - ba_denseschur_cuda_user_test (Subprocess aborted)
	165 - ba_denseschur_cuda_user_threads_test (Subprocess aborted)
Errors while running CTest
Output from these tests are in: /home/ubuntu/catkin_wss/ceres-bin/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
Makefile:70: recipe for target 'test' failed
make: *** [test] Error 8

Here's part of the log:
1/183 Testing: cuda_memcheck_dense_qr_test
1/183 Test: cuda_memcheck_dense_qr_test
Command: "/usr/local/cuda-10.0/bin/cuda-memcheck" "--leak-check" "full" "/home/ubuntu/catkin_wss/ceres-bin/bin/cuda_dense_qr_test"
Directory: /home/ubuntu/catkin_wss/ceres-bin/internal/ceres
"cuda_memcheck_dense_qr_test" start time: Nov 18 07:37 UTC
Output:
----------------------------------------------------------
Running main() from gmock_main.cc
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from CUDADenseQR
[ RUN      ] CUDADenseQR.InvalidOptionOnCreate
[       OK ] CUDADenseQR.InvalidOptionOnCreate (0 ms)
[ RUN      ] CUDADenseQR.QR4x4Matrix
========= CUDA-MEMCHECK
========= Program hit cudaErrorInvalidDeviceFunction (error 8) due to "invalid device function" on CUDA API call to cudaFuncSetAttribute. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x4545f6]
=========     Host Frame:/usr/local/cuda-10.0/lib64/libcublas.so.10.0 [0x79a03c]
=========     Host Frame:/usr/local/cuda-10.0/lib64/libcublas.so.10.0 [0x72c2ab]
=========     Host Frame:/usr/local/cuda-10.0/lib64/libcublas.so.10.0 [0x72c610]
=========     Host Frame:/usr/local/cuda-10.0/lib64/libcublas.so.10.0 (cublasCreate_v2 + 0x1ce7) [0x14b337]
=========     Host Frame:/home/ubuntu/catkin_wss/ceres-bin/bin/cuda_dense_qr_test [0x53762]
=========     Host Frame:/home/ubuntu/catkin_wss/ceres-bin/bin/cuda_dense_qr_test [0x54abb]
=========     Host Frame:/home/ubuntu/catkin_wss/ceres-bin/bin/cuda_dense_qr_test [0x110de]
=========     Host Frame:/home/ubuntu/catkin_wss/ceres-bin/bin/cuda_dense_qr_test [0x5290a]
=========     Host Frame:/home/ubuntu/catkin_wss/ceres-bin/bin/cuda_dense_qr_test [0x3d7f3]
=========     Host Frame:/home/ubuntu/catkin_wss/ceres-bin/bin/cuda_dense_qr_test [0x3d9b8]
=========     Host Frame:/home/ubuntu/catkin_wss/ceres-bin/bin/cuda_dense_qr_test [0x3dc31]
=========     Host Frame:/home/ubuntu/catkin_wss/ceres-bin/bin/cuda_dense_qr_test [0x4489c]
=========     Host Frame:/home/ubuntu/catkin_wss/ceres-bin/bin/cuda_dense_qr_test [0x44be1]
=========     Host Frame:/home/ubuntu/catkin_wss/ceres-bin/bin/cuda_dense_qr_test [0xffea]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21c87]
=========     Host Frame:/home/ubuntu/catkin_wss/ceres-bin/bin/cuda_dense_qr_test [0x1011a]
=========
========= Program hit cudaErrorInvalidDeviceFunction (error 8) due to "invalid device function" on CUDA API call to cudaGetLastError. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x4545f6]
=========     Host Frame:/usr/local/cuda-10.0/lib64/libcublas.so.10.0 [0x79deb3]
=========     Host Frame:/usr/local/cuda-10.0/lib64/libcublas.so.10.0 [0x72c2b8]
=========     Host Frame:/usr/local/cuda-10.0/lib64/libcublas.so.10.0 [0x72c610]
=========     Host Frame:/usr/local/cuda-10.0/lib64/libcublas.so.10.0 (cublasCreate_v2 + 0x1ce7) [0x14b337]
=========     Host Frame:/home/ubuntu/catkin_wss/ceres-bin/bin/cuda_dense_qr_test [0x53762]
=========     Host Frame:/home/ubuntu/catkin_wss/ceres-bin/bin/cuda_dense_qr_test [0x54abb]
=========     Host Frame:/home/ubuntu/catkin_wss/ceres-bin/bin/cuda_dense_qr_test [0x110de]
=========     Host Frame:/home/ubuntu/catkin_wss/ceres-bin/bin/cuda_dense_qr_test [0x5290a]
=========     Host Frame:/home/ubuntu/catkin_wss/ceres-bin/bin/cuda_dense_qr_test [0x3d7f3]
=========     Host Frame:/home/ubuntu/catkin_wss/ceres-bin/bin/cuda_dense_qr_test [0x3d9b8]
=========     Host Frame:/home/ubuntu/catkin_wss/ceres-bin/bin/cuda_dense_qr_test [0x3dc31]
=========     Host Frame:/home/ubuntu/catkin_wss/ceres-bin/bin/cuda_dense_qr_test [0x4489c]
=========     Host Frame:/home/ubuntu/catkin_wss/ceres-bin/bin/cuda_dense_qr_test [0x44be1]
=========     Host Frame:/home/ubuntu/catkin_wss/ceres-bin/bin/cuda_dense_qr_test [0xffea]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21c87]
=========     Host Frame:/home/ubuntu/catkin_wss/ceres-bin/bin/cuda_dense_qr_test [0x1011a]
=========
========= Program hit cudaErrorInvalidDeviceFunction (error 8) due to "invalid device function" on CUDA API call to cudaFuncSetAttribute. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x4545f6]
=========     Host Frame:/usr/local/cuda-10.0/lib64/libcublas.so.10.0 [0x79a03c]
=========     Host Frame:/usr/local/cuda-10.0/lib64/libcublas.so.10.0 [0x72c2ab]
=========     Host Frame:/usr/local/cuda-10.0/lib64/libcublas.so.10.0 [0x72c610]
=========     Host Frame:/usr/local/cuda-10.0/lib64/libcublas.so.10.0 (cublasCreate_v2 + 0x1ce7) [0x14b337]
=========     Host Frame:/home/ubuntu/catkin_wss/ceres-bin/bin/cuda_dense_qr_test [0x53762]
=========     Host Frame:/home/ubuntu/catkin_wss/ceres-bin/bin/cuda_dense_qr_test [0x54abb]
=========     Host Frame:/home/ubuntu/catkin_wss/ceres-bin/bin/cuda_dense_qr_test [0x110de]
=========     Host Frame:/home/ubuntu/catkin_wss/ceres-bin/bin/cuda_dense_qr_test [0x5290a]
=========     Host Frame:/home/ubuntu/catkin_wss/ceres-bin/bin/cuda_dense_qr_test [0x3d7f3]
=========     Host Frame:/home/ubuntu/catkin_wss/ceres-bin/bin/cuda_dense_qr_test [0x3d9b8]
=========     Host Frame:/home/ubuntu/catkin_wss/ceres-bin/bin/cuda_dense_qr_test [0x3dc31]
=========     Host Frame:/home/ubuntu/catkin_wss/ceres-bin/bin/cuda_dense_qr_test [0x4489c]
=========     Host Frame:/home/ubuntu/catkin_wss/ceres-bin/bin/cuda_dense_qr_test [0x44be1]
=========     Host Frame:/home/ubuntu/catkin_wss/ceres-bin/bin/cuda_dense_qr_test [0xffea]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21c87]
=========     Host Frame:/home/ubuntu/catkin_wss/ceres-bin/bin/cuda_dense_qr_test [0x1011a]
=========
========= Program hit cudaErrorInvalidDeviceFunction (error 8) due to "invalid device function" on CUDA API call to cudaGetLastError. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x4545f6]
=========     Host Frame:/usr/local/cuda-10.0/lib64/libcublas.so.10.0 [0x79deb3]
=========     Host Frame:/usr/local/cuda-10.0/lib64/libcublas.so.10.0 [0x72c2b8]
=========     Host Frame:/usr/local/cuda-10.0/lib64/libcublas.so.10.0 [0x72c610]
=========     Host Frame:/usr/local/cuda-10.0/lib64/libcublas.so.10.0 (cublasCreate_v2 + 0x1ce7)

123/183 Testing: ba_denseschur_cuda_auto_threads_test
123/183 Test: ba_denseschur_cuda_auto_threads_test
Command: "/home/ubuntu/catkin_wss/ceres-bin/bin/ba_denseschur_cuda_auto_threads_test" "--test_srcdir" "/home/ubuntu/catkin_wss/ceres-solver-2.1.0/data"
Directory: /home/ubuntu/catkin_wss/ceres-bin/internal/ceres/generated_bundle_adjustment_tests
"ba_denseschur_cuda_auto_threads_test" start time: Nov 18 07:40 UTC
Output:
----------------------------------------------------------
Running main() from gmock_main.cc
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from BundleAdjustmentTest
[ RUN      ] BundleAdjustmentTest.DenseSchur_Cuda_AutomaticOrdering_Threads
E20221118 07:40:32.381206 16256 trust_region_minimizer.cc:93] Terminating: Linear solver failed due to unrecoverable non-numeric causes. Please see the error log for clues. 
F20221118 07:40:32.390861 16256 test_util.h:121] Check failed: summary.termination_type != ceres::FAILURE (2 vs. 2) 
*** Check failure stack trace: ***
    @     0x7f52f03716c6  google::LogMessage::Fail()
    @     0x7f52f0371612  google::LogMessage::SendToLog()
    @     0x7f52f0370e3d  google::LogMessage::Flush()
    @     0x7f52f037482a  google::LogMessageFatal::~LogMessageFatal()
    @     0x557800e1c29a  ceres::internal::BundleAdjustmentTest_DenseSchur_Cuda_AutomaticOrdering_Threads_Test::TestBody()
    @     0x557800e5b3ca  testing::internal::HandleExceptionsInMethodIfSupported<>()
    @     0x557800e46323  testing::Test::Run()
    @     0x557800e464e8  testing::TestInfo::Run()
    @     0x557800e46761  testing::TestSuite::Run()
    @     0x557800e4d3cc  testing::internal::UnitTestImpl::RunAllTests()
    @     0x557800e4d711  testing::UnitTest::Run()
    @     0x557800e1b63a  main
    @     0x7f52e1198c87  __libc_start_main
    @     0x557800e1b7fa  _start
    @              (nil)  (unknown)
<end of output>
Test time =   1.53 sec
----------------------------------------------------------
Test Failed.
"ba_denseschur_cuda_auto_threads_test" end time: Nov 18 07:40 UTC
"ba_denseschur_cuda_auto_threads_test" time elapsed: 00:00:01
----------------------------------------------------------

144/183 Testing: ba_denseschur_cuda_user_test
144/183 Test: ba_denseschur_cuda_user_test
Command: "/home/ubuntu/catkin_wss/ceres-bin/bin/ba_denseschur_cuda_user_test" "--test_srcdir" "/home/ubuntu/catkin_wss/ceres-solver-2.1.0/data"
Directory: /home/ubuntu/catkin_wss/ceres-bin/internal/ceres/generated_bundle_adjustment_tests
"ba_denseschur_cuda_user_test" start time: Nov 18 07:41 UTC
Output:
----------------------------------------------------------
Running main() from gmock_main.cc
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from BundleAdjustmentTest
[ RUN      ] BundleAdjustmentTest.DenseSchur_Cuda_UserOrdering
E20221118 07:41:03.800753 16337 trust_region_minimizer.cc:93] Terminating: Linear solver failed due to unrecoverable non-numeric causes. Please see the error log for clues. 
F20221118 07:41:03.807525 16337 test_util.h:121] Check failed: summary.termination_type != ceres::FAILURE (2 vs. 2) 
*** Check failure stack trace: ***
    @     0x7f98ce0c66c6  google::LogMessage::Fail()
    @     0x7f98ce0c6612  google::LogMessage::SendToLog()
    @     0x7f98ce0c5e3d  google::LogMessage::Flush()
    @     0x7f98ce0c982a  google::LogMessageFatal::~LogMessageFatal()
    @     0x5654f5c1c1fa  ceres::internal::BundleAdjustmentTest_DenseSchur_Cuda_UserOrdering_Test::TestBody()
    @     0x5654f5c5b28a  testing::internal::HandleExceptionsInMethodIfSupported<>()
    @     0x5654f5c461e3  testing::Test::Run()
    @     0x5654f5c463a8  testing::TestInfo::Run()
    @     0x5654f5c46621  testing::TestSuite::Run()
    @     0x5654f5c4d28c  testing::internal::UnitTestImpl::RunAllTests()
    @     0x5654f5c4d5d1  testing::UnitTest::Run()
    @     0x5654f5c1b62a  main
    @     0x7f98beeedc87  __libc_start_main
    @     0x5654f5c1b7ea  _start
    @              (nil)  (unknown)
<end of output>
Test time =   1.50 sec
----------------------------------------------------------
Test Failed.
"ba_denseschur_cuda_user_test" end time: Nov 18 07:41 UTC
"ba_denseschur_cuda_user_test" time elapsed: 00:00:01
----------------------------------------------------------

165/183 Testing: ba_denseschur_cuda_user_threads_test
165/183 Test: ba_denseschur_cuda_user_threads_test
Command: "/home/ubuntu/catkin_wss/ceres-bin/bin/ba_denseschur_cuda_user_threads_test" "--test_srcdir" "/home/ubuntu/catkin_wss/ceres-solver-2.1.0/data"
Directory: /home/ubuntu/catkin_wss/ceres-bin/internal/ceres/generated_bundle_adjustment_tests
"ba_denseschur_cuda_user_threads_test" start time: Nov 18 07:41 UTC
Output:
----------------------------------------------------------
Running main() from gmock_main.cc
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from BundleAdjustmentTest
[ RUN      ] BundleAdjustmentTest.DenseSchur_Cuda_UserOrdering_Threads
E20221118 07:41:36.848340 16378 trust_region_minimizer.cc:93] Terminating: Linear solver failed due to unrecoverable non-numeric causes. Please see the error log for clues. 
F20221118 07:41:36.855412 16378 test_util.h:121] Check failed: summary.termination_type != ceres::FAILURE (2 vs. 2) 
*** Check failure stack trace: ***
    @     0x7f370c32e6c6  google::LogMessage::Fail()
    @     0x7f370c32e612  google::LogMessage::SendToLog()
    @     0x7f370c32de3d  google::LogMessage::Flush()
    @     0x7f370c33182a  google::LogMessageFatal::~LogMessageFatal()
    @     0x556ca201c20a  ceres::internal::BundleAdjustmentTest_DenseSchur_Cuda_UserOrdering_Threads_Test::TestBody()
    @     0x556ca205b29a  testing::internal::HandleExceptionsInMethodIfSupported<>()
    @     0x556ca20461f3  testing::Test::Run()
    @     0x556ca20463b8  testing::TestInfo::Run()
    @     0x556ca2046631  testing::TestSuite::Run()
    @     0x556ca204d29c  testing::internal::UnitTestImpl::RunAllTests()
    @     0x556ca204d5e1  testing::UnitTest::Run()
    @     0x556ca201b63a  main
    @     0x7f36fd155c87  __libc_start_main
    @     0x556ca201b7fa  _start
    @              (nil)  (unknown)
<end of output>
Test time =   1.48 sec
----------------------------------------------------------
Test Failed.
"ba_denseschur_cuda_user_threads_test" end time: Nov 18 07:41 UTC
"ba_denseschur_cuda_user_threads_test" time elapsed: 00:00:01

@sandwichmaker
Contributor

Cc: @joydeep-b

@joydeep-b
Contributor

Uh oh, that looks like an unsupported graphics card. Can you run any of the CUDA tests with the verbose flags --v 3 and --alsologtostderr and share the result? I'd like to see the CUDA info string.


@QuantumY-CHN
Author

QuantumY-CHN commented Nov 19, 2022

@joydeep-b @sandwichmaker

Thank you for your reply. I'm on an NVIDIA RTX 3060.
Here's the output:

ubuntu@dbfb2a9cb529:~/catkin_wss/ceres-bin/bin$ ./cuda_dense_qr_test --v 3
Running main() from gmock_main.cc
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from CUDADenseQR
[ RUN      ] CUDADenseQR.InvalidOptionOnCreate
[       OK ] CUDADenseQR.InvalidOptionOnCreate (0 ms)
[ RUN      ] CUDADenseQR.QR4x4Matrix
[       OK ] CUDADenseQR.QR4x4Matrix (447 ms)
[ RUN      ] CUDADenseQR.QR4x2Matrix
[       OK ] CUDADenseQR.QR4x2Matrix (1 ms)
[ RUN      ] CUDADenseQR.MustFactorizeBeforeSolve
[       OK ] CUDADenseQR.MustFactorizeBeforeSolve (0 ms)
[ RUN      ] CUDADenseQR.Randomized1600x100Tests
/home/ubuntu/catkin_wss/ceres-solver-2.1.0/internal/ceres/cuda_dense_qr_test.cc:162: Failure
The difference between (x_computed - x_expected).norm() / x_expected.norm() and 0.0 is 1.0024927494899278, which exceeds std::numeric_limits<double>::epsilon() * 400, where
(x_computed - x_expected).norm() / x_expected.norm() evaluates to 1.0024927494899278,
0.0 evaluates to 0, and
std::numeric_limits<double>::epsilon() * 400 evaluates to 8.8817841970012523e-14.
[  FAILED  ] CUDADenseQR.Randomized1600x100Tests (5 ms)
[----------] 5 tests from CUDADenseQR (454 ms total)

[----------] Global test environment tear-down
[==========] 5 tests from 1 test suite ran. (454 ms total)
[  PASSED  ] 4 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] CUDADenseQR.Randomized1600x100Tests

 1 FAILED TEST
ubuntu@dbfb2a9cb529:~/catkin_wss/ceres-bin/bin$ ./cuda_dense_qr_test --logtostderr
Running main() from gmock_main.cc
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from CUDADenseQR
[ RUN      ] CUDADenseQR.InvalidOptionOnCreate
[       OK ] CUDADenseQR.InvalidOptionOnCreate (0 ms)
[ RUN      ] CUDADenseQR.QR4x4Matrix
[       OK ] CUDADenseQR.QR4x4Matrix (445 ms)
[ RUN      ] CUDADenseQR.QR4x2Matrix
[       OK ] CUDADenseQR.QR4x2Matrix (1 ms)
[ RUN      ] CUDADenseQR.MustFactorizeBeforeSolve
[       OK ] CUDADenseQR.MustFactorizeBeforeSolve (0 ms)
[ RUN      ] CUDADenseQR.Randomized1600x100Tests
/home/ubuntu/catkin_wss/ceres-solver-2.1.0/internal/ceres/cuda_dense_qr_test.cc:162: Failure
The difference between (x_computed - x_expected).norm() / x_expected.norm() and 0.0 is 0.9986744452855022, which exceeds std::numeric_limits<double>::epsilon() * 400, where
(x_computed - x_expected).norm() / x_expected.norm() evaluates to 0.9986744452855022,
0.0 evaluates to 0, and
std::numeric_limits<double>::epsilon() * 400 evaluates to 8.8817841970012523e-14.
[  FAILED  ] CUDADenseQR.Randomized1600x100Tests (5 ms)
[----------] 5 tests from CUDADenseQR (453 ms total)

[----------] Global test environment tear-down
[==========] 5 tests from 1 test suite ran. (453 ms total)
[  PASSED  ] 4 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] CUDADenseQR.Randomized1600x100Tests

 1 FAILED TEST
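The failing check at cuda_dense_qr_test.cc:162 compares the relative error of the solve, ||x_computed - x_expected|| / ||x_expected||, against 400 machine epsilons, as the failure message above spells out. A minimal sketch of that arithmetic follows. It uses std::vector in place of Eigen so it stays self-contained, and the helper names Norm, RelativeError, Tolerance, and PassesCheck are hypothetical, not Ceres code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Euclidean (L2) norm, the quantity Eigen's .norm() returns for a vector.
double Norm(const std::vector<double>& v) {
  double sum = 0.0;
  for (double x : v) {
    sum += x * x;
  }
  return std::sqrt(sum);
}

// Hypothetical helper: ||computed - expected|| / ||expected||.
double RelativeError(const std::vector<double>& computed,
                     const std::vector<double>& expected) {
  assert(computed.size() == expected.size());
  std::vector<double> diff(expected.size());
  for (std::size_t i = 0; i < expected.size(); ++i) {
    diff[i] = computed[i] - expected[i];
  }
  return Norm(diff) / Norm(expected);
}

// Tolerance printed in the failure message: 400 machine epsilons (~8.88e-14).
double Tolerance() { return std::numeric_limits<double>::epsilon() * 400; }

// Mirrors the NEAR-style check: |relative_error - 0.0| must not exceed the tolerance.
bool PassesCheck(double relative_error) {
  return std::fabs(relative_error - 0.0) <= Tolerance();
}
```

For scale: Tolerance() matches the 8.8817841970012523e-14 in the log, while a computed solution of all zeros gives a relative error of exactly 1.0. The reported values of 1.0025 and 0.9987 therefore mean the error in the solution is about as large as the solution itself, not a few units in the last place.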

@Cristian-wp

Hi, I faced the same problem on an NVIDIA Jetson Nano. This was my solution: #909 (comment)

@sandwichmaker sandwichmaker added the NVIDIA CUDA Nvidia GPU related issues label Nov 25, 2022
@joydeep-b
Contributor

@QuantumY-CHN The first and second errors you posted are different.
The first error you posted hints at an unsupported device:

cudaErrorInvalidDeviceFunction (error 8) due to "invalid device function" on CUDA API call to cudaFuncSetAttribute
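One way to read the "unsupported device" hint: CUDA can load a kernel only if the binary embeds native code (SASS) compatible with the GPU, or PTX that the driver can JIT-compile for it. The sketch below models those two compatibility rules from the CUDA programming guide. The ComputeCapability and FatBinary types and the helper functions are hypothetical illustrations, not part of the CUDA API or of Ceres. An RTX 3060 has compute capability 8.6, and the CUDA 10.0 toolchain cannot generate sm_86 code; whether a missing image is what happened in this report is an assumption, not a confirmed diagnosis.

```cpp
#include <vector>

// Compute capability as (major, minor), e.g. {8, 6} for sm_86.
struct ComputeCapability {
  int major;
  int minor;
};

// Hypothetical description of the code images a binary embeds.
struct FatBinary {
  std::vector<ComputeCapability> sass;  // native cubins, e.g. sm_75
  std::vector<ComputeCapability> ptx;   // PTX, e.g. compute_75
};

// SASS runs only on the same major version with an equal or higher minor version.
bool SassRunsOn(ComputeCapability code, ComputeCapability device) {
  return code.major == device.major && code.minor <= device.minor;
}

// PTX can be JIT-compiled by the driver for an equal or newer device.
bool PtxRunsOn(ComputeCapability code, ComputeCapability device) {
  return code.major < device.major ||
         (code.major == device.major && code.minor <= device.minor);
}

// True if at least one embedded image can be loaded on the device.
bool HasLoadableImage(const FatBinary& bin, ComputeCapability device) {
  for (const auto& cc : bin.sass) {
    if (SassRunsOn(cc, device)) {
      return true;
    }
  }
  for (const auto& cc : bin.ptx) {
    if (PtxRunsOn(cc, device)) {
      return true;
    }
  }
  return false;
}
```

In this model, a library built only with sm_75 SASS has no loadable image on a compute capability 8.6 card, while adding compute_75 PTX makes one available through JIT compilation. The missing-image case lines up with the CUDA documentation's description of cudaErrorInvalidDeviceFunction as a function "not compiled for the proper device architecture".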

The second is a numerical-precision issue, which is being discussed in the #909 thread.

Did something change between the two runs, such as a driver or SDK upgrade?

Unfortunately, I am unable to reproduce this error: I tested on an RTX 3060 with Ubuntu 22.04 and CUDA 11.4 without errors.

Could you share some more details about your system? Please run the following and paste the full output here:

uname -a && nvidia-smi && nvcc --version && cat /etc/lsb-release

@sandwichmaker
Contributor

I am going to close this due to lack of updates. Please reopen as needed.

4 participants