[C++][Acero] Random hangs when joining tables with ExecutePlan #39582
I initially thought this was a Ruby problem, but I have now managed to reproduce it with Python 3.11.6 / Arrow 15.0.0 as well. It doesn't crash when running it on macOS, but maybe I'm just lucky. It crashes randomly when running it on Ubuntu inside a Lima VM:
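For reference, this is roughly the shape of the reproduction loop, as a minimal sketch: it assumes pyarrow's `Table.join`, which runs an Acero hash-join plan internally, and the table contents below are placeholders; the real trigger is the anonymised data in the reproduction repository linked from the issue description.

```python
import pyarrow as pa

# Placeholder tables with a shared join key; the real reproduction needs the
# anonymised data from the linked repository, so this only shows the shape of
# the loop, not a guaranteed trigger.
left = pa.table({"id": list(range(100_000)), "a": list(range(100_000))})
right = pa.table({"id": list(range(100_000)), "b": list(range(100_000))})

for i in range(100):
    # Table.join() builds and runs an Acero hash-join plan under the hood.
    joined = left.join(right, keys="id", join_type="inner")
    print(i, joined.num_rows)
```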
I pushed
I also tried to compile a debug build of Arrow. I'm not sure if I built it correctly, but when running it, the following assertion fails:
Oh, sorry. I missed this. I'll try it.
Hmm. I couldn't reproduce this on my environment...

```diff
diff --git a/cpp/src/arrow/compute/row/grouper.cc b/cpp/src/arrow/compute/row/grouper.cc
index 5e23eda16f..bdf2f52572 100644
--- a/cpp/src/arrow/compute/row/grouper.cc
+++ b/cpp/src/arrow/compute/row/grouper.cc
@@ -533,7 +533,7 @@ struct GrouperFastImpl : public Grouper {
     auto impl = std::make_unique<GrouperFastImpl>();
     impl->ctx_ = ctx;
-    RETURN_NOT_OK(impl->temp_stack_.Init(ctx->memory_pool(), 64 * minibatch_size_max_));
+    RETURN_NOT_OK(impl->temp_stack_.Init(ctx->memory_pool(), 256 * minibatch_size_max_));
     impl->encode_ctx_.hardware_flags =
         arrow::internal::CpuInfo::GetInstance()->hardware_flags();
     impl->encode_ctx_.stack = &impl->temp_stack_;
```
Thanks for looking into this. Your change has no effect; however, this seems to help:

```diff
diff --git a/cpp/src/arrow/acero/query_context.cc b/cpp/src/arrow/acero/query_context.cc
index 9f838508f..f5558f6fc 100644
--- a/cpp/src/arrow/acero/query_context.cc
+++ b/cpp/src/arrow/acero/query_context.cc
@@ -53,7 +53,7 @@ size_t QueryContext::max_concurrency() const { return thread_indexer_.Capacity()
 Result<util::TempVectorStack*> QueryContext::GetTempStack(size_t thread_index) {
   if (!tld_[thread_index].is_init) {
     RETURN_NOT_OK(tld_[thread_index].stack.Init(
-        memory_pool(), 8 * util::MiniBatch::kMiniBatchLength * sizeof(uint64_t)));
+        memory_pool(), 256 * util::MiniBatch::kMiniBatchLength * sizeof(uint64_t)));
     tld_[thread_index].is_init = true;
   }
   return &tld_[thread_index].stack;
```
Hey! Any updates on this? We are still on Arrow 10 because of this bug. Also, this is not Ruby-specific, so perhaps remove the Ruby component label and update the issue title? Thanks!
Oh, sorry. I missed the comments again...
@kou I can open a PR, but how do I know if 256 is a good value? Since I don't understand what is happening, maybe there is a situation where 256 is not enough either? I used the value 256 since that was what you used in your patch, but I see now that I should have used 32 to get the same size (four times larger).
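For reference, the sizing arithmetic behind those multipliers, assuming `util::MiniBatch::kMiniBatchLength` is 1024 (check cpp/src/arrow/compute/util.h for the actual constant):

```python
# Back-of-the-envelope sizes for the per-thread TempVectorStack allocated in
# QueryContext::GetTempStack. Assumes kMiniBatchLength == 1024; verify against
# cpp/src/arrow/compute/util.h.
kMiniBatchLength = 1024
sizeof_uint64 = 8

sizes = {
    "original (8x)": 8 * kMiniBatchLength * sizeof_uint64,        #  64 KiB
    "4x larger (32x)": 32 * kMiniBatchLength * sizeof_uint64,     # 256 KiB, same 4x bump as grouper.cc (64 -> 256)
    "patch above (256x)": 256 * kMiniBatchLength * sizeof_uint64, #   2 MiB
}

for label, nbytes in sizes.items():
    print(f"{label}: {nbytes // 1024} KiB")
```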
Let's discuss that on the PR, too. :-)
Ok, PR created: #40007
Certain Acero execution plans can cause an overflow of the TempVectorStack initialized by the QueryContext, and increasing the size of the stack fixes the problem. I don't know exactly what causes the overflow, so I haven't written a test for it. Fixes apache#39582.
We have had problems for a long time with a specific batch job that combines data from different sources. There is something in the data causing an Acero execution plan to hang or crash at random. The problem has been reproduced since Arrow 11.0.0, originally in Ruby, but it has also been reproduced in Python. There is unfortunately no test case that reliably reproduces the issue in a release build. However, in a debug build we can see that the batch job causes an overflow on the temp stack in arrow/cpp/src/arrow/compute/util.cc:38. Increasing the size of the stack created in the Acero QueryContext works around the issue, but a real fix should be investigated separately.

**This PR contains a "Critical Fix".**

* Closes: #39582

Lead-authored-by: Sten Larsson <sten@burtcorp.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
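To make the failure mode above concrete (a sketch only, not Arrow's actual implementation): TempVectorStack is essentially a fixed-capacity bump allocator for per-thread scratch buffers, and the debug-build check in arrow/cpp/src/arrow/compute/util.cc fires when an allocation would run past the preallocated capacity; in a release build that assertion is presumably compiled out, so an overflow goes undetected. A toy illustration with hypothetical names:

```python
class ToyTempStack:
    """Fixed-capacity scratch allocator; illustrative only, not Arrow's API."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity = capacity_bytes
        self.top = 0

    def alloc(self, nbytes: int) -> int:
        # The debug build asserts on this condition; without the check the
        # allocation would silently overrun the buffer.
        if self.top + nbytes > self.capacity:
            raise AssertionError("temp vector stack overflow")
        offset = self.top
        self.top += nbytes
        return offset

    def release(self, nbytes: int) -> None:
        self.top -= nbytes


# Old default size from QueryContext::GetTempStack: 8 * 1024 * sizeof(uint64_t).
stack = ToyTempStack(capacity_bytes=8 * 1024 * 8)
stack.alloc(6 * 1024 * 8)      # fits within 64 KiB
try:
    stack.alloc(4 * 1024 * 8)  # would exceed 64 KiB
except AssertionError as exc:
    print("overflow detected:", exc)
```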
The milestone for this issue is 15.0.1, but the changes in the corresponding PR don't seem to be included in the 15.0.1 release? 🤔
Ping @raulcd
This was merged after the code freeze but tagged as 15.0.1 when merged; it did not make it into 15.0.1. I am adding it to 15.0.2.
Describe the bug, including details regarding any error messages, version, and platform.
We have had problems for a long time with a specific batch job that combines data from different sources. There is something in the data causing the issue, but I haven't been able to figure out exactly what. I have created a test case where I tried my best to minimise and anonymise the data: https://github.com/stenlarsson/arrow-test
Sometimes it hangs after a random number of iterations:
Sometimes it crashes:
I'm running macOS / Ruby 3.2.2 / Arrow 14.0.2 on my computer, but have also reproduced the error with Linux / Ruby 3.0.6 / Arrow 11.0.0. It doesn't seem to happen with Arrow 10.0.1.
Component(s)
Ruby