Segfaults in rabit tests #10321

Closed
hcho3 opened this issue May 23, 2024 · 11 comments

@hcho3
Collaborator

hcho3 commented May 23, 2024

https://github.com/dmlc/xgboost/actions/runs/9211613634/job/25341388037?pr=10320
https://github.com/dmlc/xgboost/actions/runs/9211613620/job/25341830117?pr=10320

Rabit tests segfault on the macos-13 platform. The failure occurs consistently when I restart the tests.

@hcho3 hcho3 added the Blocking label May 23, 2024
@trivialfis
Member

I will debug it next week.

@hcho3
Collaborator Author

hcho3 commented May 23, 2024

I can't reproduce the segfault on my Mac mini (M1). I wonder if the issue is only present on Intel Mac?

@hcho3
Collaborator Author

hcho3 commented May 23, 2024

Might be related to #10312 ?

@trivialfis
Member

I don't think this is related to NCCL. I can reproduce that with repeated runs; it's a deadlock inside CTK, called by NCCL.

@trivialfis
Member

> I wonder if the issue is only present on Intel Mac?

That's what we are currently using on the master branch right?

@hcho3
Collaborator Author

hcho3 commented May 23, 2024

> That's what we are currently using on the master branch right?

Yes

@trivialfis
Member

Looking at the x86 instances on AWS, it seems macOS 13 is not available? https://aws.amazon.com/ec2/instance-types/mac/

@hcho3
Collaborator Author

hcho3 commented May 23, 2024

Same error with macos-12: https://github.com/dmlc/xgboost/actions/runs/9212929759/job/25345674003?pr=10320

@hcho3
Collaborator Author

hcho3 commented May 28, 2024

Just ran the gtest with ThreadSanitizer enabled on macOS 12:

ec2-user@ip-172-31-25-190 build % ./testxgboost --gtest_filter=AllgatherTest.VBasic
testxgboost(6279,0x115ffe600) malloc: nano zone abandoned due to inability to preallocate reserved vm space.
Note: Google Test filter = AllgatherTest.VBasic
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from AllgatherTest
[ RUN      ] AllgatherTest.VBasic
[04:55:31] INFO: /Users/ec2-user/xgboost/tests/cpp/collective/test_worker.h:119: Using 7 workers for test.
[04:55:31] Task t:3 got rank 3
[04:55:31] Task t:2 got rank 2
[04:55:31] Task t:5 got rank 5
[04:55:31] Task t:1 got rank 1
[04:55:31] Task t:0 got rank 0
[04:55:31] Task t:6 got rank 6
==================
WARNING: ThreadSanitizer: data race (pid=6279)
  Read of size 8 at 0x7ff85dcc3e30 by thread T11:
    #0 std::__1::ostreambuf_iterator<char, std::__1::char_traits<char> > std::__1::__pad_and_output<char, std::__1::char_traits<char> >(std::__1::ostreambuf_iterator<char, std::__1::char_traits<char> >, char const*, char const*, char const*, std::__1::ios_base&, char) <null>:2 (testxgboost:x86_64+0x100003439)
    #1 std::__1::basic_ostream<char, std::__1::char_traits<char> >& std::__1::__put_character_sequence<char, std::__1::char_traits<char> >(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, char const*, unsigned long) <null>:2 (testxgboost:x86_64+0x10000330c)
    #2 xgboost::LogCallbackRegistry::LogCallbackRegistry()::'lambda'(char const*)::__invoke(char const*) <null>:2 (testxgboost:x86_64+0x1003e4ba0)
    #3 xgboost::ConsoleLogger::~ConsoleLogger() <null>:2 (testxgboost:x86_64+0x1003e4ded)
    #4 xgboost::ConsoleLogger::~ConsoleLogger() <null>:2 (testxgboost:x86_64+0x1003e4f59)
    #5 xgboost::collective::RabitComm::Bootstrap(std::__1::chrono::duration<long long, std::__1::ratio<1l, 1l> >, int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) <null>:2 (testxgboost:x86_64+0x10018a42b)
    #6 xgboost::collective::RabitComm::RabitComm(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1l> >, int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, xgboost::StringView) <null>:2 (testxgboost:x86_64+0x1001889d6)
    #7 xgboost::collective::RabitComm::RabitComm(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1l> >, int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, xgboost::StringView) <null>:2 (testxgboost:x86_64+0x10018b77c)
    #8 xgboost::collective::WorkerForTest::WorkerForTest(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, int, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1l> >, int, int) <null>:2 (testxgboost:x86_64+0x100646c6a)
    #9 void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void xgboost::collective::TestDistributed<xgboost::collective::AllgatherTest_VBasic_Test::TestBody()::$_1>(int, xgboost::collective::AllgatherTest_VBasic_Test::TestBody()::$_1)::'lambda'()> >(void*) <null>:2 (testxgboost:x86_64+0x1006483cf)
    
Previous write of size 8 at 0x7ff85dcc3e30 by thread T8:
    #0 std::__1::ostreambuf_iterator<char, std::__1::char_traits<char> > std::__1::__pad_and_output<char, std::__1::char_traits<char> >(std::__1::ostreambuf_iterator<char, std::__1::char_traits<char> >, char const*, char const*, char const*, std::__1::ios_base&, char) <null>:2 (testxgboost:x86_64+0x100003553)
    #1 std::__1::basic_ostream<char, std::__1::char_traits<char> >& std::__1::__put_character_sequence<char, std::__1::char_traits<char> >(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, char const*, unsigned long) <null>:2 (testxgboost:x86_64+0x10000330c)
    #2 xgboost::LogCallbackRegistry::LogCallbackRegistry()::'lambda'(char const*)::__invoke(char const*) <null>:2 (testxgboost:x86_64+0x1003e4ba0)
    #3 xgboost::ConsoleLogger::~ConsoleLogger() <null>:2 (testxgboost:x86_64+0x1003e4ded)
    #4 xgboost::ConsoleLogger::~ConsoleLogger() <null>:2 (testxgboost:x86_64+0x1003e4f59)
    #5 xgboost::collective::RabitComm::Bootstrap(std::__1::chrono::duration<long long, std::__1::ratio<1l, 1l> >, int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) <null>:2 (testxgboost:x86_64+0x10018a42b)
    #6 xgboost::collective::RabitComm::RabitComm(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1l> >, int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, xgboost::StringView) <null>:2 (testxgboost:x86_64+0x1001889d6)
    #7 xgboost::collective::RabitComm::RabitComm(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1l> >, int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, xgboost::StringView) <null>:2 (testxgboost:x86_64+0x10018b77c)
    #8 xgboost::collective::WorkerForTest::WorkerForTest(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, int, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1l> >, int, int) <null>:2 (testxgboost:x86_64+0x100646c6a)
    #9 void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void xgboost::collective::TestDistributed<xgboost::collective::AllgatherTest_VBasic_Test::TestBody()::$_1>(int, xgboost::collective::AllgatherTest_VBasic_Test::TestBody()::$_1)::'lambda'()> >(void*) <null>:2 (testxgboost:x86_64+0x1006483cf)

  Location is global 'std::__1::cerr' at 0x7ff85dcc3e10 (libc++.1.dylib+0x417d1e30)

  Thread T11 (tid=40370, running) created by main thread at:
    #0 pthread_create <null>:3 (libclang_rt.tsan_osx_dynamic.dylib:x86_64h+0x2dd7f)
    #1 void std::__1::allocator_traits<std::__1::allocator<std::__1::thread> >::construct<std::__1::thread, void xgboost::collective::TestDistributed<xgboost::collective::AllgatherTest_VBasic_Test::TestBody()::$_1>(int, xgboost::collective::AllgatherTest_VBasic_Test::TestBody()::$_1)::'lambda'(), void>(std::__1::allocator<std::__1::thread>&, xgboost::collective::AllgatherTest_VBasic_Test::TestBody()::$_1*, void xgboost::collective::TestDistributed<xgboost::collective::AllgatherTest_VBasic_Test::TestBody()::$_1>(int, xgboost::collective::AllgatherTest_VBasic_Test::TestBody()::$_1)::'lambda'()&&) <null>:2 (testxgboost:x86_64+0x10064825b)
    #2 xgboost::collective::AllgatherTest_VBasic_Test::TestBody() <null>:2 (testxgboost:x86_64+0x1006420b7)
    #3 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) <null>:2 (testxgboost:x86_64+0x1009c823f)
    #4 testing::Test::Run() <null>:2 (testxgboost:x86_64+0x1009c8117)
    #5 testing::TestInfo::Run() <null>:2 (testxgboost:x86_64+0x1009ca6c8)
    #6 testing::TestSuite::Run() <null>:2 (testxgboost:x86_64+0x1009cb8f6)
    #7 testing::internal::UnitTestImpl::RunAllTests() <null>:2 (testxgboost:x86_64+0x1009e22ee)
    #8 bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) <null>:2 (testxgboost:x86_64+0x1009e124f)
    #9 testing::UnitTest::Run() <null>:2 (testxgboost:x86_64+0x1009e1184)
    #10 main <null>:2 (testxgboost:x86_64+0x1008e72d6)

  Thread T8 (tid=40368, running) created by main thread at:
    #0 pthread_create <null>:3 (libclang_rt.tsan_osx_dynamic.dylib:x86_64h+0x2dd7f)
    #1 void std::__1::allocator_traits<std::__1::allocator<std::__1::thread> >::construct<std::__1::thread, void xgboost::collective::TestDistributed<xgboost::collective::AllgatherTest_VBasic_Test::TestBody()::$_1>(int, xgboost::collective::AllgatherTest_VBasic_Test::TestBody()::$_1)::'lambda'(), void>(std::__1::allocator<std::__1::thread>&, xgboost::collective::AllgatherTest_VBasic_Test::TestBody()::$_1*, void xgboost::collective::TestDistributed<xgboost::collective::AllgatherTest_VBasic_Test::TestBody()::$_1>(int, xgboost::collective::AllgatherTest_VBasic_Test::TestBody()::$_1)::'lambda'()&&) <null>:2 (testxgboost:x86_64+0x10064825b)
    #2 xgboost::collective::AllgatherTest_VBasic_Test::TestBody() <null>:2 (testxgboost:x86_64+0x100642002)
    #3 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) <null>:2 (testxgboost:x86_64+0x1009c823f)
    #4 testing::Test::Run() <null>:2 (testxgboost:x86_64+0x1009c8117)
    #5 testing::TestInfo::Run() <null>:2 (testxgboost:x86_64+0x1009ca6c8)
    #6 testing::TestSuite::Run() <null>:2 (testxgboost:x86_64+0x1009cb8f6)
    #7 testing::internal::UnitTestImpl::RunAllTests() <null>:2 (testxgboost:x86_64+0x1009e22ee)
    #8 bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) <null>:2 (testxgboost:x86_64+0x1009e124f)
    #9 testing::UnitTest::Run() <null>:2 (testxgboost:x86_64+0x1009e1184)
    #10 main <null>:2 (testxgboost:x86_64+0x1008e72d6)

SUMMARY: ThreadSanitizer: data race (testxgboost:x86_64+0x100003439) in std::__1::ostreambuf_iterator<char, std::__1::char_traits<char> > std::__1::__pad_and_output<char, std::__1::char_traits<char> >(std::__1::ostreambuf_iterator<char, std::__1::char_traits<char> >, char const*, char const*, char const*, std::__1::ios_base&, char)+0x49

@hcho3
Collaborator Author

hcho3 commented May 28, 2024

Never mind, the error from ThreadSanitizer appears to be a false positive.
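
For context on why this report can plausibly be treated as benign: both stacks in the trace end in a log write to the global `std::cerr` from two worker threads. The C++ standard says concurrent use of the synchronized standard stream objects does not result in a data race, although characters from different threads may interleave. If interleaved log lines (or the sanitizer noise) ever become a problem, one common approach is to serialize whole lines behind a mutex. Below is a minimal sketch of that idea; `LineLogger` and `RunWorkers` are hypothetical names for illustration and are not part of XGBoost or its logging code.

```cpp
#include <cassert>
#include <mutex>
#include <ostream>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Hypothetical helper (not XGBoost code): writes one complete line at a time
// while holding a mutex, so lines from different threads cannot interleave.
class LineLogger {
 public:
  explicit LineLogger(std::ostream& out) : out_(out) {}

  void Log(std::string const& msg) {
    std::lock_guard<std::mutex> guard(mu_);
    out_ << msg << '\n';
  }

 private:
  std::ostream& out_;
  std::mutex mu_;
};

// Spawns n_workers threads; each one logs a single line through the shared
// logger, loosely mimicking the "Task t:N got rank N" messages in the trace.
inline void RunWorkers(int n_workers, std::ostream& out) {
  assert(n_workers > 0);
  LineLogger logger(out);
  std::vector<std::thread> workers;
  for (int rank = 0; rank < n_workers; ++rank) {
    workers.emplace_back([&logger, rank] {
      // Build the full message locally, then emit it in one guarded write.
      std::ostringstream line;
      line << "Task t:" << rank << " got rank " << rank;
      logger.Log(line.str());
    });
  }
  for (auto& t : workers) {
    t.join();
  }
}
```

Calling `RunWorkers(7, std::cerr)` prints seven complete lines in an arbitrary order. Because every write to the stream happens under the same mutex, a build with `-fsanitize=thread` should have nothing to flag in this sketch.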

@hcho3
Collaborator Author

hcho3 commented May 28, 2024

Fixed in #10320

@hcho3 hcho3 closed this as completed May 28, 2024