[C++][Acero] Random hangs when joining tables with ExecutePlan #39582
I initially thought this was a Ruby problem, but I have now managed to reproduce it with Python 3.11.6 / Arrow 15.0.0 as well. It doesn't crash when running it on macOS, but maybe I'm just lucky. It crashes randomly when running it on Ubuntu inside a Lima VM:
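For reference, this is roughly the shape of the reproduction loop, as a minimal sketch: it assumes pyarrow's `Table.join`, which runs an Acero hash-join plan internally, and the table contents below are placeholders; the real trigger is the anonymised data in the reproduction repository linked from the issue description.

```python
import pyarrow as pa

# Placeholder tables with a shared join key; the real reproduction needs the
# anonymised data from the linked repository, so this only shows the shape of
# the loop, not a guaranteed trigger.
left = pa.table({"id": list(range(100_000)), "a": list(range(100_000))})
right = pa.table({"id": list(range(100_000)), "b": list(range(100_000))})

for i in range(100):
    # Table.join() builds and runs an Acero hash-join plan under the hood.
    joined = left.join(right, keys="id", join_type="inner")
    print(i, joined.num_rows)
```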
I pushed
I also tried to compile a debug build of Arrow. I'm not sure if I built it correctly, but when running it, the following assertion fails:
Oh, sorry. I missed this. I'll try it.
Hmm. I couldn't reproduce this on my environment...

```diff
diff --git a/cpp/src/arrow/compute/row/grouper.cc b/cpp/src/arrow/compute/row/grouper.cc
index 5e23eda16f..bdf2f52572 100644
--- a/cpp/src/arrow/compute/row/grouper.cc
+++ b/cpp/src/arrow/compute/row/grouper.cc
@@ -533,7 +533,7 @@ struct GrouperFastImpl : public Grouper {
     auto impl = std::make_unique<GrouperFastImpl>();
     impl->ctx_ = ctx;
-    RETURN_NOT_OK(impl->temp_stack_.Init(ctx->memory_pool(), 64 * minibatch_size_max_));
+    RETURN_NOT_OK(impl->temp_stack_.Init(ctx->memory_pool(), 256 * minibatch_size_max_));
     impl->encode_ctx_.hardware_flags =
         arrow::internal::CpuInfo::GetInstance()->hardware_flags();
     impl->encode_ctx_.stack = &impl->temp_stack_;
```
Thanks for looking into this. Your change has no effect; however, this seems to help:

```diff
diff --git a/cpp/src/arrow/acero/query_context.cc b/cpp/src/arrow/acero/query_context.cc
index 9f838508f..f5558f6fc 100644
--- a/cpp/src/arrow/acero/query_context.cc
+++ b/cpp/src/arrow/acero/query_context.cc
@@ -53,7 +53,7 @@ size_t QueryContext::max_concurrency() const { return thread_indexer_.Capacity()
 Result<util::TempVectorStack*> QueryContext::GetTempStack(size_t thread_index) {
   if (!tld_[thread_index].is_init) {
     RETURN_NOT_OK(tld_[thread_index].stack.Init(
-        memory_pool(), 8 * util::MiniBatch::kMiniBatchLength * sizeof(uint64_t)));
+        memory_pool(), 256 * util::MiniBatch::kMiniBatchLength * sizeof(uint64_t)));
     tld_[thread_index].is_init = true;
   }
   return &tld_[thread_index].stack;
```
Hey! Any updates on this? We are still on Arrow 10 because of this bug. Also, this is not Ruby-specific, so perhaps remove the Ruby component label and update the issue title? Thanks!
Oh, sorry. I missed the comments again...
@kou I can open a PR, but how do I know if 256 is a good value? Since I don't understand what is happening, maybe there is a situation where 256 is not enough either? I used the value 256 since that was what you used in your patch, but I see now that I should have used 32 to get the same size (four times larger).
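For reference, the sizing arithmetic behind those multipliers, assuming `util::MiniBatch::kMiniBatchLength` is 1024 (check cpp/src/arrow/compute/util.h for the actual constant):

```python
# Back-of-the-envelope sizes for the per-thread TempVectorStack allocated in
# QueryContext::GetTempStack. Assumes kMiniBatchLength == 1024; verify against
# cpp/src/arrow/compute/util.h.
kMiniBatchLength = 1024
sizeof_uint64 = 8

sizes = {
    "original (8x)": 8 * kMiniBatchLength * sizeof_uint64,        #  64 KiB
    "4x larger (32x)": 32 * kMiniBatchLength * sizeof_uint64,     # 256 KiB, same 4x bump as grouper.cc (64 -> 256)
    "patch above (256x)": 256 * kMiniBatchLength * sizeof_uint64, #   2 MiB
}

for label, nbytes in sizes.items():
    print(f"{label}: {nbytes // 1024} KiB")
```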
Let's discuss that on the PR, too. :-)
Ok, PR created: #40007
Certain Acero execution plans can cause an overflow of the TempVectorStack initialized by the QueryContext, and increasing the size of the stack fixes the problem. I don't know exactly what causes the overflow, so I haven't written a test for it. Fixes apache#39582.
We have had problems for a long time with a specific batch job that combines data from different sources. There is something in the data causing an Acero execution plan to hang or crash at random. The problem has been reproduced since Arrow 11.0.0, originally in Ruby, but it has also been reproduced in Python. There is unfortunately no test case that reliably reproduces the issue in a release build. However, in a debug build we can see that the batch job causes an overflow on the temp stack in arrow/cpp/src/arrow/compute/util.cc:38. Increasing the size of the stack created in the Acero QueryContext works around the issue, but a real fix should be investigated separately.

**This PR contains a "Critical Fix".**

* Closes: #39582

Lead-authored-by: Sten Larsson <sten@burtcorp.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
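To make the failure mode above concrete (a sketch only, not Arrow's actual implementation): TempVectorStack is essentially a fixed-capacity bump allocator for per-thread scratch buffers, and the debug-build check in arrow/cpp/src/arrow/compute/util.cc fires when an allocation would run past the preallocated capacity; in a release build that assertion is presumably compiled out, so an overflow goes undetected. A toy illustration with hypothetical names:

```python
class ToyTempStack:
    """Fixed-capacity scratch allocator; illustrative only, not Arrow's API."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity = capacity_bytes
        self.top = 0

    def alloc(self, nbytes: int) -> int:
        # The debug build asserts on this condition; without the check the
        # allocation would silently overrun the buffer.
        if self.top + nbytes > self.capacity:
            raise AssertionError("temp vector stack overflow")
        offset = self.top
        self.top += nbytes
        return offset

    def release(self, nbytes: int) -> None:
        self.top -= nbytes


# Old default size from QueryContext::GetTempStack: 8 * 1024 * sizeof(uint64_t).
stack = ToyTempStack(capacity_bytes=8 * 1024 * 8)
stack.alloc(6 * 1024 * 8)      # fits within 64 KiB
try:
    stack.alloc(4 * 1024 * 8)  # would exceed 64 KiB
except AssertionError as exc:
    print("overflow detected:", exc)
```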
The milestone for this issue is 15.0.1, but the changes in the corresponding PR don't seem to be included in the 15.0.1 release? 🤔
Ping @raulcd
This was merged after the code freeze but tagged as 15.0.1 when merged; it did not make it into 15.0.1. I am adding it to 15.0.2.
Describe the bug, including details regarding any error messages, version, and platform.
We have had problems for a long time with a specific batch job that combines data from different sources. There is something in the data causing the issue, but I haven't been able to figure out exactly what. I have created a test case where I tried my best to minimise and anonymise the data: https://github.com/stenlarsson/arrow-test
Sometimes it hangs after a random number of iterations:
Sometimes it crashes:
I'm running macOS / Ruby 3.2.2 / Arrow 14.0.2 on my computer, but have also reproduced the error with Linux / Ruby 3.0.6 / Arrow 11.0.0. It doesn't seem to happen with Arrow 10.0.1.
Component(s)
Ruby