Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Sporadic crashes in arrow-dataset-scanner-test #20442

Closed
asfimport opened this issue Oct 4, 2022 · 14 comments
Closed

[C++] Sporadic crashes in arrow-dataset-scanner-test #20442

asfimport opened this issue Oct 4, 2022 · 14 comments

Comments

@asfimport
Copy link

Seen e.g. on this PR:
https://github.com/apache/arrow/actions/runs/3181284056/jobs/5185859816#step:11:255

[----------] 8 tests from BasicNewScannerTests/TestScannerBase
[ RUN      ] BasicNewScannerTests/TestScannerBase.ScanOrdered/0fast1f128b
[       OK ] BasicNewScannerTests/TestScannerBase.ScanOrdered/0fast1f128b (33 ms)
[ RUN      ] BasicNewScannerTests/TestScannerBase.ScanOrdered/1fast16f128b
[       OK ] BasicNewScannerTests/TestScannerBase.ScanOrdered/1fast16f128b (582 ms)
[ RUN      ] BasicNewScannerTests/TestScannerBase.ScanOrdered/2slow1f128b
[       OK ] BasicNewScannerTests/TestScannerBase.ScanOrdered/2slow1f128b (65 ms)
[ RUN      ] BasicNewScannerTests/TestScannerBase.ScanOrdered/3slow16f128b
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively

Reporter: Antoine Pitrou / @pitrou
Assignee: Antoine Pitrou / @pitrou

PRs and other links:

Note: This issue was originally created as ARROW-17927. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
This also fails on 32-bit Linux builds, which should hopefully make it easy to reproduce:
https://github.com/ursacomputing/crossbow/actions/runs/3178547651/jobs/5180154274#step:5:2694

@asfimport
Copy link
Author

Percy Camilo Triveño Aucahuasi / @aucahuasi:
I built the latest arrow version on ubuntu 18 for 32 bits architectures (I tried with i386 and armv7l), but could not reproduce the issue.

I used docker buildx on a Macbook Pro M1.

These were the cmake args:

cmake -DCMAKE_BUILD_TYPE=Debug -DCMAKE_INSTALL_PREFIX=/someprefix -DCMAKE_INSTALL_LIBDIR=lib -DARROW_CUDA=OFF -DARROW_NO_DEPRECATED_API=OFF -DARROW_BUILD_STATIC=OFF -DARROW_INSTALL_NAME_RPATH=ON -DARROW_USE_LD_GOLD=ON -DARROW_USE_CCACHE=ON -DARROW_USE_ASAN=OFF -DARROW_USE_TSAN=OFF -DARROW_USE_UBSAN=OFF -DARROW_BUILD_EXAMPLES=ON -DARROW_BUILD_TESTS=ON -DARROW_TEST_MEMCHECK=OFF -DARROW_ENABLE_TIMING_TESTS=ON -DARROW_BUILD_INTEGRATION=ON -DARROW_BUILD_BENCHMARKS=OFF -DARROW_BUILD_BENCHMARKS_REFERENCE=OFF -DARROW_BUILD_OPENMP_BENCHMARKS=OFF -DARROW_BUILD_DETAILED_BENCHMARKS=OFF -DARROW_TEST_LINKAGE=shared -DARROW_FUZZING=OFF -DARROW_LARGE_MEMORY_TESTS=OFF -DARROW_PARQUET=ON -DPARQUET_REQUIRE_ENCRYPTION=ON -DPARQUET_BUILD_EXAMPLES=ON -DPARQUET_BUILD_EXECUTABLES=OFF -DARROW_ORC=OFF -DARROW_PYTHON=ON -DARROW_COMPUTE=ON -DARROW_CSV=ON -DARROW_DATASET=ON -DARROW_EXTRA_ERROR_CONTEXT=ON -DARROW_FILESYSTEM=ON -DARROW_JEMALLOC=OFF -DARROW_JEMALLOC_USE_SHARED=ON -DARROW_JSON=ON -DARROW_S3=OFF -DARROW_HDFS=OFF -DARROW_GCS=OFF -DARROW_FLIGHT=OFF -DARROW_GANDIVA=OFF -DARROW_PLASMA=OFF -DARROW_SUBSTRAIT=OFF -DARROW_BUILD_UTILITIES=OFF -DARROW_EXTRA_ERROR_CONTEXT=ON -DARROW_USE_CCACHE=ON -DCMAKE_UNITY_BUILD=OFF -DARROW_USE_PRECOMPILED_HEADERS=OFF -DBUILD_WARNING_LEVEL=PRODUCTION -DARROW_WITH_OPENTELEMETRY=OFF -DARROW_WITH_BZ2=ON -DARROW_WITH_ZLIB=ON -DARROW_WITH_ZSTD=ON -DARROW_WITH_LZ4=ON -DARROW_WITH_THRIFT=OFF -DARROW_WITH_SNAPPY=ON -DARROW_WITH_BROTLI=ON -DGTest_SOURCE=BUNDLED ..

Both cases printed at the end:

[----------] Global test environment tear-down
[==========] 123 tests from 9 test suites ran. (196893 ms total)
[  PASSED  ] 123 tests.

Tomorrow I'll try to explore this issue on a windows machine.

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
I was able to easily reproduce it using Docker.

First run bash inside the container:

ARCH=i386 DEBIAN=11 archery docker run -e ARROW_S3=OFF -e ARROW_GANDIVA=OFF debian-cpp bash

Then under bash:

# set +e
# /arrow/ci/scripts/cpp_build.sh /arrow /build
# taskset -c 0,2 /build/cpp/debug/arrow-dataset-scanner-test

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
I'll note that more than 400 threads are started in that test. That doesn't sound right at all.
https://gist.github.com/pitrou/6c43e11724179f39d531bfaeb836524d

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
The problem seems to be that SleepAsync and SleepABitAsync detach the underlying thread, meaning timer threads can accumulate depending on scheduling specifics.

@asfimport
Copy link
Author

Raúl Cumplido / @raulcd:
This has also failed on a couple of nightlies:

@asfimport
Copy link
Author

Weston Pace / @westonpace:

I'll note that more than 400 threads are started in that test. That doesn't sound right at all.
https://gist.github.com/pitrou/6c43e11724179f39d531bfaeb836524d

That is more or less intentional. The goal of the test is to have every batch delivered by its own thread for maximum chaos. The test itself has 16 fragments, each with 128 batches so I think 2048 threads will be created during the lifetime of the test.

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
Right, but I meant 400 threads running simultaneously (even when pinning on 2 CPU cores).

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
Issue resolved by pull request 14339
#14339

@asfimport
Copy link
Author

Percy Camilo Triveño Aucahuasi / @aucahuasi:
Just in case there are similar issues in the future, I'll copy here some additional information about this issue.

Yesterday I managed to reproduce this issue using Windows (msys2-mingw32) and I got this stacktrace:

1   ??                                                                                                                               0xfeeefeee 
2   std::__uniq_ptr_impl<void, void ( *)(void *)>::reset                                                           unique_ptr.h  203 0x15b7e36  
3   std::__uniq_ptr_impl<void, void ( *)(void *)>::operator=                                                       unique_ptr.h  183 0x15b7f1f  
4   std::__uniq_ptr_data<void, void ( *)(void *), true, true>::operator=                                           unique_ptr.h  235 0x15acebd  
5   std::unique_ptr<void, void ( *)(void *)>::operator=                                                            unique_ptr.h  406 0x14d884d  
6   arrow::Future<arrow::internal::Empty>::SetResult                                                               future.h      650 0x113578a  
7   arrow::Future<arrow::internal::Empty>::DoMarkFinished                                                          future.h      655 0x1134713  
8   arrow::Future<arrow::internal::Empty>::MarkFinished<arrow::internal::Empty, void>                              future.h      409 0x1134699  
9   operator()                                                                                                     gtest_util.cc 783 0x332fb6   
10  std::__invoke_impl<void, arrow::SleepABitAsync()::<lambda()>>(std::__invoke_other, struct {...} &&)            invoke.h      61  0x338f42   
11  std::__invoke<arrow::SleepABitAsync()::<lambda()>>(struct {...} &&)                                            invoke.h      96  0x338eca   
12  std::thread::_Invoker<std::tuple<arrow::SleepABitAsync()::<lambda()>>>::_M_invoke<0>(std::_Index_tuple<0>)     std_thread.h  252 0x338e5a   
13  std::thread::_Invoker<std::tuple<arrow::SleepABitAsync()::<lambda()>>>::operator()(void)                       std_thread.h  259 0x338e14   
14  std::thread::_State_impl<std::thread::_Invoker<std::tuple<arrow::SleepABitAsync()::<lambda()>>>>::_M_run(void) std_thread.h  210 0x338de0   
15  ??                                                                                                                               0x6b6bfee0 
16  ??                                                                                                                               0x6b8d4e7d 
17  msvcrt!_beginthreadex                                                                                                            0x75716cff 
18  msvcrt!_endthreadex                                                                                                              0x75716dc1 
19  KERNEL32!BaseThreadInitThunk                                                                                                     0x753dfa29 
20  ntdll!RtlGetAppContainerNamedObjectPath                                                                                          0x77307b5e 
... <More>                                                                                                                                      

It seems there was a segmentation fault with the status that we were trying to use for marking the future as finished; sometimes I got this message as well:

Running main() from C:/M/mingw-w64-gtest/src/googletest-release-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = BasicNewScannerTests/TestScannerBase.ScanOrdered/3slow16f128b*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from BasicNewScannerTests/TestScannerBase
[ RUN      ] BasicNewScannerTests/TestScannerBase.ScanOrdered/3slow16f128b
C:/arrow/cpp/src/arrow/util/future.ccC:/arrow/cpp/src/arrow/util/future.cc:125:  Check failed: !IsFutureFinished(state_) :125:  Check failed: !IsFutureFinished(state_) Future already marked finished
FPress <RETURN> to close this window...

Or this other one too:

Running main() from C:/M/mingw-w64-gtest/src/googletest-release-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = BasicNewScannerTests/TestScannerBase.ScanOrdered/3slow16f128b*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from BasicNewScannerTests/TestScannerBase
[ RUN      ] BasicNewScannerTests/TestScannerBase.ScanOrdered/3slow16f128b
C:/arrow/cpp/src/arrow/dataset/scanner_test.cc:538: Failure
Value of: _fut138.Wait(::arrow::kDefaultAssertFinishesWaitSeconds)
  Actual: false
Expected: true
C:/arrow/cpp/src/arrow/dataset/scanner_test.cc:538: Failure
Failed
Future did not finish in a timely fashion

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
@aucahuasi This seems to be different issues, can you check if you can still reproduce with git master?

@asfimport
Copy link
Author

Percy Camilo Triveño Aucahuasi / @aucahuasi:
Definitely, I'll try with the recent changes and will send the outputs here, btw I'm using these cmake args in this windows machine:

cmake -G "MSYS Makefiles" -DARROW_BUILD_TESTS=ON -DARROW_BUILD_SHARED=OFF -DARROW_BUILD_STATIC=ON -DARROW_DATASET=ON -DARROW_COMPUTE=ON -DCMAKE_BUILD_TYPE="Debug" -DBUILD_WARNING_LEVEL="PRODUCTION" -DCMAKE_CXX_STANDARD=17 ..

@asfimport
Copy link
Author

Percy Camilo Triveño Aucahuasi / @aucahuasi:
Using the recent changes (including the PR that solved this issue) I cannot reproduce the issue anymore; this is the output I'm getting:

Running main() from C:/M/mingw-w64-gtest/src/googletest-release-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = BasicNewScannerTests/TestScannerBase.ScanOrdered/3slow16f128b*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from BasicNewScannerTests/TestScannerBase
[ RUN      ] BasicNewScannerTests/TestScannerBase.ScanOrdered/3slow16f128b
[       OK ] BasicNewScannerTests/TestScannerBase.ScanOrdered/3slow16f128b (3085 ms)
[----------] 1 test from BasicNewScannerTests/TestScannerBase (3096 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (3112 ms total)
[  PASSED  ] 1 test.
Press <RETURN> to close this window...

Also, when I run all the tests (without --gtest_filter="BasicNewScannerTests/TestScannerBase.ScanOrdered/3slow16f128b*" ) I cannot reproduce the issue either.

So it seems that the stacktrace and outputs I sent in the previous comment were related to the original issue in this ticket.

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
I see, thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants