[C++][Python] Setting rows per group cause segmentation fault writing parquet with write_dataset #34539

Closed
isvoboda opened this issue Mar 12, 2023 · 13 comments · Fixed by #35075

@isvoboda

Describe the bug, including details regarding any error messages, version, and platform.

Writing a parquet file crashes with a segmentation fault when using write_dataset with min_rows_per_group and max_rows_per_group set.

It seems the crash depends on the particular number; for example, setting 10_000 for both parameters works, while 60_000 crashes.

pyarrow==11.0

See the example and dedicated Docker environment below:

segfault.py

import pyarrow as pa
from pyarrow import dataset as ds
from pyarrow import fs

TOTAL = 2**20

schema = pa.schema(
    [
        pa.field("id", pa.uint64()),
    ]
)

data = (pa.RecordBatch.from_pylist([{"id": i}], schema=schema) for i in range(TOTAL))

file_system, path = fs.FileSystem.from_uri("file:///tmp/data")


ds.write_dataset(
    data=data,
    base_dir=path,
    format="parquet",
    schema=schema,
    filesystem=file_system,
    min_rows_per_group=60_000,
    max_rows_per_group=60_000,
)

Dockerfile

FROM python:3.10.10-slim-buster
RUN pip install pyarrow==11.0
COPY segfault.py /tmp/segfault.py

Run the example

  • docker build . -t pyarrow-segfault
  • docker run --rm -it pyarrow-segfault bash
  • python /tmp/segfault.py

Component(s)

Python

@AlenkaF AlenkaF changed the title Setting rows per group cause segmentation fault writing parquet with write_dataset [Python] Setting rows per group cause segmentation fault writing parquet with write_dataset Mar 13, 2023
@isvoboda
Author

isvoboda commented Mar 14, 2023

It seems use_threads=False prevents the segfault.
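
For reference, the reported workaround is the same call with threading disabled (use_threads is an existing write_dataset keyword; re-create the data generator before re-running, since the original script exhausts it). Note that, per the comments below, this did not help on every platform:

ds.write_dataset(
    data=data,
    base_dir=path,
    format="parquet",
    schema=schema,
    filesystem=file_system,
    min_rows_per_group=60_000,
    max_rows_per_group=60_000,
    use_threads=False,  # reported to avoid the segfault on some platforms
)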

@westonpace westonpace added the Priority: Blocker Marks a blocker for the release label Mar 14, 2023
@AlenkaF
Member

AlenkaF commented Mar 15, 2023

Thank you for reporting @isvoboda!

I am able to reproduce with the docker example you added (awesome, thanks!).
In my case use_threads=False didn't prevent the segfault.

I wasn't able to look deeper into the issue, but it does look like a bug in datasets.

@raulcd raulcd added this to the 12.0.0 milestone Mar 15, 2023
@jorisvandenbossche
Member

FWIW, I can't reproduce this on Linux (Ubuntu), neither in my local development environment nor using Docker with the script and steps described above.

Using docker, reading the created file after running the script looks OK:

root@e21abdd839a3:/# python
Python 3.10.10 (main, Mar 14 2023, 02:53:09) [GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.parquet as pq
>>> pq.read_metadata("/tmp/data/part-0.parquet")
<pyarrow._parquet.FileMetaData object at 0x7fe3c1f489a0>
  created_by: parquet-cpp-arrow version 11.0.0
  num_columns: 1
  num_rows: 1048576
  num_row_groups: 18
  format_version: 1.0
  serialized_size: 2168

@isvoboda what platform are you using?

@isvoboda
Author

isvoboda commented Mar 17, 2023

@jorisvandenbossche I have tried several environments:

  • WSL with Ubuntu 22.04.2 LTS and Python 3.10.6 - cannot reproduce
  • Docker running on WSL Ubuntu 22.04.2 LTS - cannot reproduce
  • Debian bullseye with Python 3.10.10 from pyenv - can reproduce
  • Docker running on Debian bullseye - can reproduce
  • up-to-date Arch Linux with system Python 3.10.10 - can reproduce
  • Docker running on Arch Linux - can reproduce

pyarrow==10.0 does not crash in any of the environments listed above.
I can share a core dump file if it helps.

@AlenkaF
Member

AlenkaF commented Mar 20, 2023

I can add to the list of environments: macOS Monterey 12.6.3 with Python 3.10.10 (with Docker or without).

In my case I get a segfault no matter how I define min/max_rows_per_group. With min_rows_per_group=10_000 and max_rows_per_group=10_000 the lldb backtrace is:

Process 65402 stopped
* thread #12, stop reason = EXC_BAD_ACCESS (code=2, address=0x1702efee8)
    frame #0: 0x000000011805a378 libarrow_dataset.1200.0.0.dylib`arrow::dataset::internal::DatasetWriter::DatasetWriterImpl::DoWriteRecordBatch(this=0x00000001702f0110, batch=<unavailable>, directory="", prefix="") at dataset_writer.cc:578
   575    }
   576 
   577    Future<> DoWriteRecordBatch(std::shared_ptr<RecordBatch> batch,
-> 578                                const std::string& directory, const std::string& prefix) {
   579      ARROW_ASSIGN_OR_RAISE(
   580          auto dir_queue_itr,
   581          ::arrow::internal::GetOrInsertGenerated(
Target 0: (Python) stopped.

Without min_rows_per_group or max_rows_per_group set, the backtrace is very long:

Process 75137 stopped
* thread #25, stop reason = EXC_BAD_ACCESS (code=2, address=0x170a0bfe0)
    frame #0: 0x00000001998704fc libsystem_malloc.dylib`tiny_malloc_should_clear + 8
libsystem_malloc.dylib`tiny_malloc_should_clear:
->  0x1998704fc <+8>:  stp    x28, x27, [sp, #0x60]
    0x199870500 <+12>: stp    x26, x25, [sp, #0x70]
    0x199870504 <+16>: stp    x24, x23, [sp, #0x80]
    0x199870508 <+20>: stp    x22, x21, [sp, #0x90]
Target 0: (Python) stopped.
(lldb) bt 5
* thread #25, stop reason = EXC_BAD_ACCESS (code=2, address=0x170a0bfe0)
  * frame #0: 0x00000001998704fc libsystem_malloc.dylib`tiny_malloc_should_clear + 8
    frame #1: 0x000000019986f3a0 libsystem_malloc.dylib`szone_malloc_should_clear + 92
    frame #2: 0x000000019988b748 libsystem_malloc.dylib`_malloc_zone_malloc + 156
    frame #3: 0x0000000199a1c8b0 libc++abi.dylib`operator new(unsigned long) + 32
    frame #4: 0x000000010272a1d4 lib.cpython-310-darwin.so`void* std::__1::__libcpp_operator_new<unsigned long>(__args=16) at new:235:10
(lldb) bt 15
* thread #25, stop reason = EXC_BAD_ACCESS (code=2, address=0x170a0bfe0)
  * frame #0: 0x00000001998704fc libsystem_malloc.dylib`tiny_malloc_should_clear + 8
    frame #1: 0x000000019986f3a0 libsystem_malloc.dylib`szone_malloc_should_clear + 92
    frame #2: 0x000000019988b748 libsystem_malloc.dylib`_malloc_zone_malloc + 156
    frame #3: 0x0000000199a1c8b0 libc++abi.dylib`operator new(unsigned long) + 32
    frame #4: 0x000000010272a1d4 lib.cpython-310-darwin.so`void* std::__1::__libcpp_operator_new<unsigned long>(__args=16) at new:235:10
    frame #5: 0x000000010272a130 lib.cpython-310-darwin.so`std::__1::__libcpp_allocate(__size=16, __align=8) at new:261:10
    frame #6: 0x0000000102833a00 lib.cpython-310-darwin.so`std::__1::allocator<std::__1::shared_ptr<arrow::Array> >::allocate(this=0x0000000170a0c278, __n=1) at allocator.h:108:38
    frame #7: 0x0000000102833860 lib.cpython-310-darwin.so`std::__1::allocator_traits<std::__1::allocator<std::__1::shared_ptr<arrow::Array> > >::allocate(__a=0x0000000170a0c278, __n=1) at allocator_traits.h:262:20
    frame #8: 0x0000000102833378 lib.cpython-310-darwin.so`std::__1::vector<std::__1::shared_ptr<arrow::Array>, std::__1::allocator<std::__1::shared_ptr<arrow::Array> > >::__vallocate(this=0x0000000170a0c268 size=0, __n=1) at vector:1015:37
    frame #9: 0x0000000126e90fa0 libarrow.1200.0.0.dylib`std::__1::vector<std::__1::shared_ptr<arrow::Array>, std::__1::allocator<std::__1::shared_ptr<arrow::Array> > >::vector(this=0x0000000170a0c268 size=0, __x=size=1) at vector:1280:9
    frame #10: 0x0000000126e90f30 libarrow.1200.0.0.dylib`std::__1::vector<std::__1::shared_ptr<arrow::Array>, std::__1::allocator<std::__1::shared_ptr<arrow::Array> > >::vector(this=0x0000000170a0c268 size=0, __x=size=1) at vector:1273:1
    frame #11: 0x00000001271b3b14 libarrow.1200.0.0.dylib`std::__1::__shared_ptr_emplace<arrow::ChunkedArray, std::__1::allocator<arrow::ChunkedArray> >::__shared_ptr_emplace<std::__1::vector<std::__1::shared_ptr<arrow::Array>, std::__1::allocator<std::__1::shared_ptr<arrow::Array> > >&, std::__1::shared_ptr<arrow::DataType> const&>(this=0x000000014b883640, __a=allocator<arrow::ChunkedArray> @ 0x0000000170a0c2af, __args=size=1, __args=std::__1::shared_ptr<arrow::DataType>::element_type @ 0x0000000103516c48 strong=5977 weak=2) at shared_ptr.h:298:41
    frame #12: 0x00000001271b3a80 libarrow.1200.0.0.dylib`std::__1::__shared_ptr_emplace<arrow::ChunkedArray, std::__1::allocator<arrow::ChunkedArray> >::__shared_ptr_emplace<std::__1::vector<std::__1::shared_ptr<arrow::Array>, std::__1::allocator<std::__1::shared_ptr<arrow::Array> > >&, std::__1::shared_ptr<arrow::DataType> const&>(this=0x000000014b883640, __a=allocator<arrow::ChunkedArray> @ 0x0000000170a0c2ef, __args=size=1, __args=std::__1::shared_ptr<arrow::DataType>::element_type @ 0x0000000103516c48 strong=5977 weak=2) at shared_ptr.h:292:5
    frame #13: 0x00000001271b39c0 libarrow.1200.0.0.dylib`std::__1::shared_ptr<arrow::ChunkedArray> std::__1::allocate_shared<arrow::ChunkedArray, std::__1::allocator<arrow::ChunkedArray>, std::__1::vector<std::__1::shared_ptr<arrow::Array>, std::__1::allocator<std::__1::shared_ptr<arrow::Array> > >&, std::__1::shared_ptr<arrow::DataType> const&, void>(__a=0x0000000170a0c3c7, __args=size=1, __args=std::__1::shared_ptr<arrow::DataType>::element_type @ 0x0000000103516c48 strong=5977 weak=2) at shared_ptr.h:1106:55
    frame #14: 0x00000001271a0920 libarrow.1200.0.0.dylib`std::__1::shared_ptr<arrow::ChunkedArray> std::__1::make_shared<arrow::ChunkedArray, std::__1::vector<std::__1::shared_ptr<arrow::Array>, std::__1::allocator<std::__1::shared_ptr<arrow::Array> > >&, std::__1::shared_ptr<arrow::DataType> const&, void>(__args=size=1, __args=std::__1::shared_ptr<arrow::DataType>::element_type @ 0x0000000103516c48 strong=5977 weak=2) at shared_ptr.h:1115:12

And I had to change TOTAL to 2**10 to avoid the segfault:

<pyarrow._parquet.FileMetaData object at 0x12b79c270>
  created_by: parquet-cpp-arrow version 12.0.0-SNAPSHOT
  num_columns: 1
  num_rows: 1024
  num_row_groups: 1024
  format_version: 1.0
  serialized_size: 97276
Process 75050 exited with status = 0 (0x00000000) 

@AlenkaF AlenkaF changed the title [Python] Setting rows per group cause segmentation fault writing parquet with write_dataset [C++][Python] Setting rows per group cause segmentation fault writing parquet with write_dataset Mar 21, 2023
@AlenkaF
Member

AlenkaF commented Mar 21, 2023

Just to note, I have tried running the script locally with the jemalloc allocator (instead of the system one), but it didn't make a difference.
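
As a side note for anyone else trying the allocator angle, a minimal sketch of switching pyarrow to jemalloc from Python (this assumes a pyarrow build that ships jemalloc, as the Linux wheels do; on builds without it, jemalloc_memory_pool() raises NotImplementedError):

import pyarrow as pa

# Route Arrow allocations through jemalloc instead of the system allocator.
pa.set_memory_pool(pa.jemalloc_memory_pool())
print(pa.default_memory_pool().backend_name)  # expect "jemalloc"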

@snizovtsev

snizovtsev commented Mar 21, 2023

I've faced a similar issue with the C++ / Acero dataset writer. Regardless of options, the pipeline crashes when the dataset "write" node is used.

I found that the crash is caused by a stack overflow. Depending on the ulimit -s setting, GDB shows up to ~698997 stack frames full of repeating future continuations:

backtrace
Thread 5 "builder" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffbabfff6c0 (LWP 955318)]
0x00005555556b18dd in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<arrow::NumericArray<arrow::UInt64Type>, std::allocator<arrow::NumericArray<arrow::UInt64Type> >, std::shared_ptr<arrow::ArrayData> const&> (this=0x0, __p=<error reading variable: Cannot access memory at address 0x0>, __a=...)
    at /usr/include/c++/11/bits/shared_ptr_base.h:643
643     /usr/include/c++/11/bits/shared_ptr_base.h: No such file or directory.
(gdb) up
#1  0x00005555556ae56c in std::__shared_ptr<arrow::NumericArray<arrow::UInt64Type>, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<arrow::NumericArray<arrow::UInt64Type> >, std::shared_ptr<arrow::ArrayData> const&> (this=0x7ffba8000170, __tag=...) at /usr/include/c++/11/bits/shared_ptr_base.h:1342
1342    in /usr/include/c++/11/bits/shared_ptr_base.h
(gdb)
#2  0x00005555556ab913 in std::shared_ptr<arrow::NumericArray<arrow::UInt64Type> >::shared_ptr<std::allocator<arrow::NumericArray<arrow::UInt64Type> >, std::shared_ptr<arrow::ArrayData> const&> (this=0x7ffba8000170, __tag=...) at /usr/include/c++/11/bits/shared_ptr.h:409
409     /usr/include/c++/11/bits/shared_ptr.h: No such file or directory.
(gdb)
#3  0x00005555556a8e09 in std::allocate_shared<arrow::NumericArray<arrow::UInt64Type>, std::allocator<arrow::NumericArray<arrow::UInt64Type> >, std::shared_ptr<arrow::ArrayData> const&> (__a=...) at /usr/include/c++/11/bits/shared_ptr.h:863
863     in /usr/include/c++/11/bits/shared_ptr.h
(gdb)
#4  0x00005555556a5a66 in std::make_shared<arrow::NumericArray<arrow::UInt64Type>, std::shared_ptr<arrow::ArrayData> const&> ()
    at /usr/include/c++/11/bits/shared_ptr.h:879
879     in /usr/include/c++/11/bits/shared_ptr.h
(gdb)
#5  0x000055555569cea4 in arrow::(anonymous namespace)::ArrayDataWrapper::Visit<arrow::UInt64Type> (this=0x7ffba80001f0)
    at external/com_github_apache_arrow/cpp/src/arrow/array/util.cc:65
65          *out_ = std::make_shared<ArrayType>(data_);
(gdb)
#6  0x0000555555699616 in arrow::VisitTypeInline<arrow::(anonymous namespace)::ArrayDataWrapper> (type=..., visitor=0x7ffba80001f0)
    at external/com_github_apache_arrow/cpp/src/arrow/visit_type_inline.h:54
54          ARROW_GENERATE_FOR_ALL_TYPES(TYPE_VISIT_INLINE);
(gdb)
#7  0x00005555556937c7 in arrow::MakeArray (data=std::shared_ptr<arrow::ArrayData> (use count 1, weak count 0) = {...})
    at external/com_github_apache_arrow/cpp/src/arrow/array/util.cc:310
310       DCHECK_OK(VisitTypeInline(*data->type, &wrapper_visitor));
(gdb)
#8  0x0000555555cbb4bf in arrow::SimpleRecordBatch::column (this=0x7ffb89829d40, i=0) at external/com_github_apache_arrow/cpp/src/arrow/record_batch.cc:82
82            result = MakeArray(columns_[i]);
(gdb)
#9  0x0000555555bfde42 in arrow::Table::FromRecordBatches (schema=std::shared_ptr<arrow::Schema> (use count 17, weak count 0) = {...},
    batches=std::vector of length 1, capacity 1 = {...}) at external/com_github_apache_arrow/cpp/src/arrow/table.cc:304
304           column_arrays[j] = batches[j]->column(i);
(gdb)
#10 0x0000555555bfe188 in arrow::Table::FromRecordBatches (batches=std::vector of length 1, capacity 1 = {...})
    at external/com_github_apache_arrow/cpp/src/arrow/table.cc:318
318       return FromRecordBatches(batches[0]->schema(), batches);
(gdb)
#11 0x0000555555f7a9a5 in arrow::dataset::internal::(anonymous namespace)::DatasetWriterFileQueue::PopStagedBatch (this=0x7ffba003b3b0)
    at external/com_github_apache_arrow/cpp/src/arrow/dataset/dataset_writer.cc:179
179         ARROW_ASSIGN_OR_RAISE(std::shared_ptr<Table> table,
(gdb)
#12 0x0000555555f7acc3 in arrow::dataset::internal::(anonymous namespace)::DatasetWriterFileQueue::PopAndDeliverStagedBatch (this=0x7ffba003b3b0)
    at external/com_github_apache_arrow/cpp/src/arrow/dataset/dataset_writer.cc:193
193         ARROW_ASSIGN_OR_RAISE(std::shared_ptr<RecordBatch> next_batch, PopStagedBatch());
(gdb)
#13 0x0000555555f84208 in arrow::dataset::internal::(anonymous namespace)::DatasetWriterFileQueue::Push (this=0x7ffba003b3b0,
    batch=std::shared_ptr<arrow::RecordBatch> (empty) = {...}) at external/com_github_apache_arrow/cpp/src/arrow/dataset/dataset_writer.cc:208
208           ARROW_ASSIGN_OR_RAISE(int64_t rows_popped, PopAndDeliverStagedBatch());
(gdb)
#14 0x0000555555f84824 in arrow::dataset::internal::(anonymous namespace)::DatasetWriterDirectoryQueue::StartWrite (this=0x7ffba003af00,
    batch=std::shared_ptr<arrow::RecordBatch> (use count 3, weak count 0) = {...})
    at external/com_github_apache_arrow/cpp/src/arrow/dataset/dataset_writer.cc:307
307         return latest_open_file_->Push(batch);
(gdb)
#15 0x0000555555f8678c in arrow::dataset::internal::DatasetWriter::DatasetWriterImpl::DoWriteRecordBatch (this=0x555557732190,
    batch=std::shared_ptr<arrow::RecordBatch> (use count 3, weak count 0) = {...}, directory="parts/", prefix="8_")
    at external/com_github_apache_arrow/cpp/src/arrow/dataset/dataset_writer.cc:611
611           RETURN_NOT_OK(dir_queue->StartWrite(next_chunk));
(gdb)
#16 0x0000555555f85869 in arrow::dataset::internal::DatasetWriter::DatasetWriterImpl::WriteAndCheckBackpressure (this=0x555557732190,
    batch=std::shared_ptr<arrow::RecordBatch> (empty) = {...}, directory="", prefix="8_")
    at external/com_github_apache_arrow/cpp/src/arrow/dataset/dataset_writer.cc:526
526           return DoWriteRecordBatch(std::move(batch), write_options_.base_dir, prefix);
(gdb)
#17 0x0000555555f85967 in arrow::dataset::internal::DatasetWriter::DatasetWriterImpl::WriteRecordBatch(std::shared_ptr<arrow::RecordBatch>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::{lambda()#1}::operator()() (__closure=0x7ffb89829b50) at external/com_github_apache_arrow/cpp/src/arrow/dataset/dataset_writer.cc:535
535                   WriteAndCheckBackpressure(std::move(batch), directory, prefix);
(gdb) up
#18 0x0000555555f8e1b7 in arrow::util::AsyncTaskScheduler::SimpleTask<arrow::dataset::internal::DatasetWriter::DatasetWriterImpl::WriteRecordBatch(std::shared_ptr<arrow::RecordBatch>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::{lambda()#1}>::operator()() (this=0x7ffb89829b40)
    at external/com_github_apache_arrow/cpp/src/arrow/util/async_util.h:153
153         Result<Future<>> operator()() override { return callable(); }
(gdb)
#19 0x0000555555b3027f in WrapperTask::operator() (this=0x7ffb8982f270) at external/com_github_apache_arrow/cpp/src/arrow/util/async_util.cc:411
411             ARROW_ASSIGN_OR_RAISE(Future<> inner_fut, (*target)());
(gdb)
#20 0x0000555555b2f8ea in operator() (__closure=0x7ffb4e58e780) at external/com_github_apache_arrow/cpp/src/arrow/util/async_util.cc:341
341               ARROW_ASSIGN_OR_RAISE(Future<> inner_fut, (*inner_task)());
(gdb)
#21 0x0000555555b37281 in arrow::util::AsyncTaskScheduler::SimpleTask<arrow::util::(anonymous namespace)::ThrottledAsyncTaskSchedulerImpl::SubmitTask(std::unique_ptr<arrow::util::AsyncTaskScheduler::Task>, int)::<lambda()> >::operator()(void) (this=0x7ffb4e58e770)
    at external/com_github_apache_arrow/cpp/src/arrow/util/async_util.h:153
153         Result<Future<>> operator()() override { return callable(); }
(gdb)
#22 0x0000555555b2ecde in arrow::util::(anonymous namespace)::AsyncTaskSchedulerImpl::DoSubmitTask (this=0x555557730310,
    task=std::unique_ptr<arrow::util::AsyncTaskScheduler::Task> = {...}) at external/com_github_apache_arrow/cpp/src/arrow/util/async_util.cc:195
195         Result<Future<>> submit_result = (*task)();
(gdb)
#23 0x0000555555b2f15e in arrow::util::(anonymous namespace)::AsyncTaskSchedulerImpl::SubmitTaskUnlocked (this=0x555557730310,
    task=std::unique_ptr<arrow::util::AsyncTaskScheduler::Task> = {...}, lk=...) at external/com_github_apache_arrow/cpp/src/arrow/util/async_util.cc:261
261         return DoSubmitTask(std::move(task));
(gdb)
#24 0x0000555555b2ea24 in arrow::util::(anonymous namespace)::AsyncTaskSchedulerImpl::AddTask (this=0x555557730310,
    task=std::unique_ptr<arrow::util::AsyncTaskScheduler::Task> = {...}) at external/com_github_apache_arrow/cpp/src/arrow/util/async_util.cc:171
171         SubmitTaskUnlocked(std::move(task), std::move(lk));
(gdb) up 80000
#80024 0x0000555555b2ea24 in arrow::util::(anonymous namespace)::AsyncTaskSchedulerImpl::AddTask (this=0x555557730310,
    task=std::unique_ptr<arrow::util::AsyncTaskScheduler::Task> = {...}) at external/com_github_apache_arrow/cpp/src/arrow/util/async_util.cc:171
171         SubmitTaskUnlocked(std::move(task), std::move(lk));
(gdb) up
#80025 0x0000555555b311bc in arrow::util::AsyncTaskScheduler::AddSimpleTask<arrow::util::(anonymous namespace)::ThrottledAsyncTaskSchedulerImpl::SubmitTask(std::unique_ptr<arrow::util::AsyncTaskScheduler::Task>, int)::<lambda()> >(struct {...}, std::string_view) (this=0x555557730310, callable=...,
    name="DatasetWriter::WriteAndCheckBackpressure") at external/com_github_apache_arrow/cpp/src/arrow/util/async_util.h:171
171         return AddTask(std::make_unique<SimpleTask<Callable>>(std::move(callable), name));
(gdb)
#80026 0x0000555555b2fb40 in arrow::util::(anonymous namespace)::ThrottledAsyncTaskSchedulerImpl::SubmitTask (this=0x5555577380f0,
    task=std::unique_ptr<arrow::util::AsyncTaskScheduler::Task> = {...}, latched_cost=1)
    at external/com_github_apache_arrow/cpp/src/arrow/util/async_util.cc:338
338         return target_->AddSimpleTask(
(gdb)
#80027 0x0000555555b2fdcb in arrow::util::(anonymous namespace)::ThrottledAsyncTaskSchedulerImpl::ContinueTasks (this=0x5555577380f0)
    at external/com_github_apache_arrow/cpp/src/arrow/util/async_util.cc:374
374             if (!SubmitTask(std::move(next_task), next_cost)) {
(gdb) up 160000
#240027 0x0000555555b2fdcb in arrow::util::(anonymous namespace)::ThrottledAsyncTaskSchedulerImpl::ContinueTasks (this=0x5555577380f0)
    at external/com_github_apache_arrow/cpp/src/arrow/util/async_util.cc:374
374             if (!SubmitTask(std::move(next_task), next_cost)) {
(gdb) up 160000
#400027 0x0000555555b2fdcb in arrow::util::(anonymous namespace)::ThrottledAsyncTaskSchedulerImpl::ContinueTasks (this=0x5555577380f0)
    at external/com_github_apache_arrow/cpp/src/arrow/util/async_util.cc:374
374             if (!SubmitTask(std::move(next_task), next_cost)) {
(gdb) up 160000
#560027 0x0000555555b2fdcb in arrow::util::(anonymous namespace)::ThrottledAsyncTaskSchedulerImpl::ContinueTasks (this=0x5555577380f0)
    at external/com_github_apache_arrow/cpp/src/arrow/util/async_util.cc:374
374             if (!SubmitTask(std::move(next_task), next_cost)) {
(gdb) up 160000
#698997 0x00007ffff7b20d90 in ?? () from /usr/lib/libc.so.6
(gdb) down
#698996 0x00007ffff7a9ebb5 in ?? () from /usr/lib/libc.so.6
(gdb)
#698995 0x00007ffff7cd72c3 in std::execute_native_thread_routine (__p=0x5555577328a0) at /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/thread.cc:82
82      /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/thread.cc: No such file or directory.
(gdb)
#698994 0x0000555555934c1e in std::thread::_State_impl<std::thread::_Invoker<std::tuple<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::<lambda()> > > >::_M_run(void) (this=0x5555577328a0) at /usr/include/c++/11/bits/std_thread.h:211
211     /usr/include/c++/11/bits/std_thread.h: No such file or directory.
(gdb)
#698993 0x0000555555934c3a in std::thread::_Invoker<std::tuple<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::<lambda()> > >::operator()(void) (
    this=0x5555577328a8) at /usr/include/c++/11/bits/std_thread.h:260
260     in /usr/include/c++/11/bits/std_thread.h


(gdb)
#698992 0x0000555555934c66 in std::thread::_Invoker<std::tuple<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::<lambda()> > >::_M_invoke<0>(std::_Index_tuple<0>) (this=0x5555577328a8) at /usr/include/c++/11/bits/std_thread.h:253
253     in /usr/include/c++/11/bits/std_thread.h

(gdb)
#698991 0x0000555555934cb9 in std::__invoke<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::<lambda()> >(struct {...} &&) (__fn=...)
    at /usr/include/c++/11/bits/invoke.h:96
96      /usr/include/c++/11/bits/invoke.h: No such file or directory.
(gdb)
#698990 0x0000555555934cf6 in std::__invoke_impl<void, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::<lambda()> >(std::__invoke_other, struct {...} &&) (__f=...) at /usr/include/c++/11/bits/invoke.h:61
61      in /usr/include/c++/11/bits/invoke.h
(gdb)
#698989 0x000055555593110d in operator() (__closure=0x5555577328a8) at external/com_github_apache_arrow/cpp/src/arrow/util/thread_pool.cc:430
430           WorkerLoop(state, it);
(gdb)
#698988 0x00005555559303ab in arrow::internal::WorkerLoop (state=std::shared_ptr<arrow::internal::ThreadPool::State> (use count 33, weak count 1) = {...},
    it=...) at external/com_github_apache_arrow/cpp/src/arrow/util/thread_pool.cc:269
269               std::move(task.callable)();
(gdb)
#698987 0x0000555555935e32 in arrow::internal::FnOnce<void ()>::operator()() && (this=0x7ffbabffeda0)
    at external/com_github_apache_arrow/cpp/src/arrow/util/functional.h:140
140         return bye->invoke(std::forward<A&&>(a)...);
(gdb)
#698986 0x00005555558be170 in arrow::internal::FnOnce<void ()>::FnImpl<std::_Bind<arrow::detail::ContinueFuture (arrow::Future<arrow::internal::Empty>, std::function<arrow::Status ()>)> >::invoke() (this=0x7ffbc4002930) at external/com_github_apache_arrow/cpp/src/arrow/util/functional.h:152
152         R invoke(A&&... a) override { return std::move(fn_)(std::forward<A&&>(a)...); }
(gdb)
#698985 0x00005555558be19c in std::_Bind<arrow::detail::ContinueFuture (arrow::Future<arrow::internal::Empty>, std::function<arrow::Status ()>)>::operator()<, void>() (this=0x7ffbc4002938) at /usr/include/c++/11/functional:503
503     /usr/include/c++/11/functional: No such file or directory.
(gdb)
#698984 0x00005555558be223 in std::_Bind<arrow::detail::ContinueFuture (arrow::Future<arrow::internal::Empty>, std::function<arrow::Status ()>)>::__call<void, , 0ul, 1ul>(std::tuple<>&&, std::_Index_tuple<0ul, 1ul>) (this=0x7ffbc4002938, __args=...) at /usr/include/c++/11/functional:420
420     in /usr/include/c++/11/functional
(gdb)
#698983 0x00005555558be305 in std::__invoke<arrow::detail::ContinueFuture&, arrow::Future<arrow::internal::Empty>&, std::function<arrow::Status ()>&>(arrow::detail::ContinueFuture&, arrow::Future<arrow::internal::Empty>&, std::function<arrow::Status ()>&) (__fn=...) at /usr/include/c++/11/bits/invoke.h:96
96      /usr/include/c++/11/bits/invoke.h: No such file or directory.
(gdb)
#698982 0x00005555558be3d8 in std::__invoke_impl<void, arrow::detail::ContinueFuture&, arrow::Future<arrow::internal::Empty>&, std::function<arrow::Status ()>&>(std::__invoke_other, arrow::detail::ContinueFuture&, arrow::Future<arrow::internal::Empty>&, std::function<arrow::Status ()>&) (__f=...)
    at /usr/include/c++/11/bits/invoke.h:61
61      in /usr/include/c++/11/bits/invoke.h
(gdb)
#698981 0x00005555558be485 in arrow::detail::ContinueFuture::operator()<std::function<arrow::Status ()>&, , arrow::Status, arrow::Future<arrow::internal::Empty> >(arrow::Future<arrow::internal::Empty>, std::function<arrow::Status ()>&) const (this=0x7ffbc4002938, next=..., f=...)
    at external/com_github_apache_arrow/cpp/src/arrow/util/future.h:150
150         next.MarkFinished(std::forward<ContinueFunc>(f)(std::forward<Args>(a)...));
(gdb)
#698980 0x00005555558be52f in std::function<arrow::Status ()>::operator()() const (this=0x7ffbc4002940) at /usr/include/c++/11/bits/std_function.h:590
590     /usr/include/c++/11/bits/std_function.h: No such file or directory.
(gdb)
#698979 0x00005555558cf8d3 in std::_Function_handler<arrow::Status(), arrow::compute::(anonymous namespace)::SourceNode::SliceAndDeliverMorsel(const arrow::compute::ExecBatch&)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/11/bits/std_function.h:291
291     /usr/include/c++/11/bits/std_function.h: No such file or directory.
(gdb)
#698978 0x00005555558d12f9 in std::__invoke_r<arrow::Status, arrow::compute::(anonymous namespace)::SourceNode::SliceAndDeliverMorsel(const arrow::compute::ExecBatch&)::<lambda()>&>(struct {...} &) (__fn=...) at /usr/include/c++/11/bits/invoke.h:116
116     /usr/include/c++/11/bits/invoke.h: No such file or directory.
(gdb)
#698977 0x00005555558d2674 in std::__invoke_impl<arrow::Status, arrow::compute::(anonymous namespace)::SourceNode::SliceAndDeliverMorsel(const arrow::compute::ExecBatch&)::<lambda()>&>(std::__invoke_other, struct {...} &) (__f=...) at /usr/include/c++/11/bits/invoke.h:61
61      in /usr/include/c++/11/bits/invoke.h
(gdb)
#698976 0x00005555558ca227 in operator() (__closure=0x7ffbc4002890) at external/com_github_apache_arrow/cpp/src/arrow/compute/exec/source_node.cc:100
100                 ARROW_RETURN_NOT_OK(output_->InputReceived(this, std::move(batch)));
(gdb)
#698975 0x00005555558b737d in arrow::compute::MapNode::InputReceived (this=0x555557731290, input=0x5555577309d0, batch=...)
    at external/com_github_apache_arrow/cpp/src/arrow/compute/exec/map_node.cc:72
72        ARROW_RETURN_NOT_OK(output_->InputReceived(this, std::move(output_batch)));
(gdb)
#698974 0x00005555558b737d in arrow::compute::MapNode::InputReceived (this=0x555557731a40, input=0x555557731290, batch=...)
    at external/com_github_apache_arrow/cpp/src/arrow/compute/exec/map_node.cc:72
72        ARROW_RETURN_NOT_OK(output_->InputReceived(this, std::move(output_batch)));
(gdb)
#698973 0x00005555558c0334 in arrow::compute::(anonymous namespace)::ConsumingSinkNode::InputReceived (this=0x555557722520, input=0x555557731a40, batch=...)
    at external/com_github_apache_arrow/cpp/src/arrow/compute/exec/sink_node.cc:343
343         ARROW_RETURN_NOT_OK(consumer_->Consume(std::move(batch)));
(gdb)
#698972 0x0000555555f5bde6 in arrow::dataset::(anonymous namespace)::DatasetWritingSinkNodeConsumer::Consume (this=0x555557731b90, batch=...)
    at external/com_github_apache_arrow/cpp/src/arrow/dataset/file_base.cc:415
415         return WriteNextBatch(std::move(record_batch), batch.guarantee);
(gdb)
#698971 0x0000555555f5c0b7 in arrow::dataset::(anonymous namespace)::DatasetWritingSinkNodeConsumer::WriteNextBatch (this=0x555557731b90,
    batch=std::shared_ptr<arrow::RecordBatch> (use count 1, weak count 0) = {...}, guarantee=...)
    at external/com_github_apache_arrow/cpp/src/arrow/dataset/file_base.cc:434
434                           });
(gdb)
#698970 0x0000555555f5b5b6 in arrow::dataset::(anonymous namespace)::WriteBatch(std::shared_ptr<arrow::RecordBatch>, arrow::compute::Expression, arrow::dataset::FileSystemDatasetWriteOptions, std::function<arrow::Status(std::shared_ptr<arrow::RecordBatch>, const arrow::dataset::PartitionPathFormat&)>) (
    batch=std::shared_ptr<arrow::RecordBatch> (empty) = {...}, guarantee=..., write_options=..., write=...)
    at external/com_github_apache_arrow/cpp/src/arrow/dataset/file_base.cc:383
383         RETURN_NOT_OK(write(next_batch, destination));
(gdb)
#698969 0x0000555555f666a6 in std::function<arrow::Status (std::shared_ptr<arrow::RecordBatch>, arrow::dataset::PartitionPathFormat const&)>::operator()(std::shared_ptr<arrow::RecordBatch>, arrow::dataset::PartitionPathFormat const&) const (this=0x7ffbabffe570,
    __args#0=std::shared_ptr<arrow::RecordBatch> (empty) = {...}, __args#1=...) at /usr/include/c++/11/bits/std_function.h:590
590     /usr/include/c++/11/bits/std_function.h: No such file or directory.
(gdb)
#698968 0x0000555555f5f4c9 in std::_Function_handler<arrow::Status(std::shared_ptr<arrow::RecordBatch>, const arrow::dataset::PartitionPathFormat&), arrow::dataset::(anonymous namespace)::DatasetWritingSinkNodeConsumer::WriteNextBatch(std::shared_ptr<arrow::RecordBatch>, arrow::compute::Expression)::<lambda(std::shared_ptr<arrow::RecordBatch>, const arrow::dataset::PartitionPathFormat&)> >::_M_invoke(const std::_Any_data &, std::shared_ptr<arrow::RecordBatch> &&, const arrow::dataset::PartitionPathFormat &) (__functor=..., __args#0=..., __args#1=...) at /usr/include/c++/11/bits/std_function.h:291
291     in /usr/include/c++/11/bits/std_function.h
(gdb)
#698967 0x0000555555f603ee in std::__invoke_r<arrow::Status, arrow::dataset::(anonymous namespace)::DatasetWritingSinkNodeConsumer::WriteNextBatch(std::shared_ptr<arrow::RecordBatch>, arrow::compute::Expression)::<lambda(std::shared_ptr<arrow::RecordBatch>, const arrow::dataset::PartitionPathFormat&)>&, std::shared_ptr<arrow::RecordBatch>, const arrow::dataset::PartitionPathFormat&>(struct {...} &) (__fn=...) at /usr/include/c++/11/bits/invoke.h:116
116     /usr/include/c++/11/bits/invoke.h: No such file or directory.
(gdb)
#698966 0x0000555555f6120f in std::__invoke_impl<arrow::Status, arrow::dataset::(anonymous namespace)::DatasetWritingSinkNodeConsumer::WriteNextBatch(std::shared_ptr<arrow::RecordBatch>, arrow::compute::Expression)::<lambda(std::shared_ptr<arrow::RecordBatch>, const arrow::dataset::PartitionPathFormat&)>&, std::shared_ptr<arrow::RecordBatch>, const arrow::dataset::PartitionPathFormat&>(std::__invoke_other, struct {...} &) (__f=...)
    at /usr/include/c++/11/bits/invoke.h:61
61      in /usr/include/c++/11/bits/invoke.h

When I set OMP_NUM_THREADS to 2 or 1, the crash disappears.

Tested versions:
apache-arrow-11.0.0 (system alloc)
apache-arrow-12.0.0.dev (system alloc)
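
For the Python reproducer, a hedged equivalent of that workaround is to shrink Arrow's CPU thread pool before writing (pa.set_cpu_count is an existing pyarrow knob; Arrow also consults OMP_NUM_THREADS when sizing the default pool):

import pyarrow as pa

# Limit Arrow's CPU thread pool to a single thread, mirroring
# OMP_NUM_THREADS=1; avoids the deep scheduler recursion at the
# cost of parallelism.
pa.set_cpu_count(1)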

@westonpace
Member

@snizovtsev

This stack trace is very useful. I have a few hunches, and this is probably enough to go on. I'll need to find a few hours when I can dig into this further.

I think the scheduler expects a "task" to actually correspond to a new thread task being launched, and there are a few places in the dataset writer where we don't do that.

If you are able to reproduce this again, attaching the first 300-400 or so frames so I can see the actual loop a bit more clearly would be helpful.

@assignUser
Member

@westonpace As it has been a while and the release is approaching, do you think we can get a fix ready? And is this a blocker in your eyes?

@westonpace
Member

Yes. I'll look at this now.

@snizovtsev

@westonpace here is the full backtrace under OMP_NUM_THREADS=2 and 128 output partitions. The same code with 32 partitions doesn't overflow the stack.
arrow-bt.txt

@westonpace
Member

@snizovtsev thanks! This confirms what I discovered in #35075. I believe that fix should work for your situation.

@snizovtsev

@westonpace thanks! I've tested your PR on my workload and the stack overflows are gone.

However, the Acero engine still tends to starve and leak memory in some cases. In the latter case I can see that at some random point the write speed degrades to almost zero and the engine starts consuming memory until exhaustion. It has a higher chance of finishing successfully when the number of output partitions is small and the reader is slow (after dropping the system page cache).

westonpace added a commit that referenced this issue Apr 13, 2023
…taset writer (#35075)

### Rationale for this change

Fixes a bug in the throttled scheduler.

### What changes are included in this PR?

The throttled scheduler will no longer recurse in the ContinueTasks loop if the continued task was immediately finished.

### Are these changes tested?

Yes, I added a new stress test that exposed the stack overflow very reliably on a standard Linux system.

### Are there any user-facing changes?

No.
* Closes: #34539

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
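
To illustrate the failure mode the fix above addresses: when a submitted task finishes immediately, a continuation-style scheduler that resubmits the next queued task from inside the completion path consumes one stack frame per task, which is exactly the repeating ContinueTasks/SubmitTask frames in the backtraces above; draining the queue in a loop keeps stack depth constant. A minimal, hypothetical Python sketch of the two shapes (invented names, not Arrow's actual C++ code):

from collections import deque

def run_recursive(tasks: deque) -> None:
    # Continuation style: a task that finishes immediately "continues" the
    # next one by recursing -- one stack frame per queued task.
    if not tasks:
        return
    tasks.popleft()()       # task completes synchronously...
    run_recursive(tasks)    # ...and its continuation recurses

def run_trampolined(tasks: deque) -> None:
    # The fixed shape: immediately-finished tasks are continued in a loop,
    # so stack depth stays constant regardless of queue length.
    while tasks:
        tasks.popleft()()

work = deque([lambda: None] * 1_000_000)
run_trampolined(work)       # completes fine
# run_recursive(work)       # would exhaust the stack (RecursionError)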
@westonpace westonpace modified the milestones: 12.0.0, 13.0.0 Apr 13, 2023
raulcd pushed a commit that referenced this issue Apr 17, 2023
liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this issue May 11, 2023
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
rtpsw pushed a commit to rtpsw/arrow that referenced this issue May 16, 2023