[C++][Python] Setting rows per group causes segmentation fault writing parquet with write_dataset #34539
Thank you for reporting @isvoboda! I am able to reproduce with the docker example you added (awesome, thanks!). I wasn't able to look deeper into the issue, but it does look like a bug in datasets.
FWIW, I can't reproduce this on Linux (Ubuntu), neither in my local development environment nor using Docker with the script and steps described above. Using Docker, reading the created file after running the script looks OK.

@isvoboda what platform are you using?
@jorisvandenbossche I have tried several environments […]
I can add to the list of environments: macOS Monterey 12.6.3 with Python 3.10.10 (with Docker or without). In my case I get a segfault no matter how I define `min_rows_per_file`/`max_rows_per_file`. Without […]. And I had to change […].
Just to note, I have tried running the script locally with the jemalloc allocator (instead of the system one), but it didn't make a difference.
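(A side note for anyone reproducing this, not from the thread itself: assuming your pyarrow build ships jemalloc, one way to switch allocators is the `ARROW_DEFAULT_MEMORY_POOL` environment variable, set before pyarrow is imported:)

```python
# Assumption: this pyarrow build includes jemalloc. The env var must be
# set before pyarrow (and the Arrow C++ library) is first imported.
import os
os.environ["ARROW_DEFAULT_MEMORY_POOL"] = "jemalloc"

import pyarrow as pa
print(pa.default_memory_pool().backend_name)  # expected: "jemalloc"
```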
I've faced a similar issue with the C++ / Acero dataset writer. Regardless of the options, the pipeline crashes when the dataset "write" node is used. I found that the cause of the crash is a stack overflow. Depending on the backtrace […]

When I set […]. Tested versions: […]
This stack trace is very useful. I have a few hunches, and this is probably enough to go on. I will try to get some time to look at this further; I'll need to find a stretch where I can dig in for a few hours. I think the scheduler is expecting a "task" to actually correspond to a new thread task getting launched, and there are a few places where we don't do that in the dataset writer. If you are able to reproduce this again, attaching the first 300-400 or so frames, so I can see the actual loop a bit more clearly, would be helpful.
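(One generic way to capture a deep backtrace like the one requested, assuming gdb is available inside the container, is to run the script under gdb in batch mode:)

```console
# On the segfault, gdb stops at the faulting frame; "bt 400" then
# prints the first 400 frames of the (very deep) stack.
gdb -batch -ex run -ex "bt 400" --args python /tmp/segfault.py
```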
@westonpace As it has been a while and the release is approaching, do you think we can get a fix ready? And is this a blocker in your eyes?
Yes. I'll look at this now.
@westonpace here is the full backtrace under […]
@snizovtsev thanks! This confirms what I discovered in #35075. I believe that fix should work for your situation.
@westonpace thanks! I've tested your PR on my workload and the stack overflows are gone. However, the Acero engine still tends to starve and leak memory in some cases. In the latter case I can see that at some random point the write speed degrades to almost zero and the engine starts consuming memory until exhaustion. It has a higher chance of finishing successfully when the number of output partitions is small, like […], and the reader is slow (after dropping the system page cache).
…taset writer (#35075)

### Rationale for this change

Fixes a bug in the throttled scheduler.

### What changes are included in this PR?

The throttled scheduler will no longer recurse in the ContinueTasks loop if the continued task was immediately finished.

### Are these changes tested?

Yes, I added a new stress test that exposed the stack overflow very reliably on a standard Linux system.

### Are there any user-facing changes?

No.

* Closes: #34539

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
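For intuition, here is a toy sketch (in Python, not Arrow's actual C++ `ThrottledAsyncTaskScheduler`) of the pattern the fix describes: when a continued task finishes immediately, drain it in a loop rather than recursing, so the stack depth stays constant:

```python
from collections import deque

class ThrottledScheduler:
    """Toy model; illustrative only, not Arrow's implementation."""

    def __init__(self, max_concurrent: int):
        self.max_concurrent = max_concurrent
        self.running = 0
        self.queue = deque()

    def submit(self, task) -> None:
        self.queue.append(task)
        self._continue_tasks()

    def task_finished(self) -> None:
        # Invoked when a task that ran asynchronously completes later.
        self.running -= 1
        self._continue_tasks()

    def _continue_tasks(self) -> None:
        # The buggy variant (per the PR text) effectively re-entered this
        # method for every task that finished immediately, so a long run
        # of synchronous tasks grew the call stack until it overflowed.
        # The fixed variant drains them iteratively: stack depth is O(1).
        while self.queue and self.running < self.max_concurrent:
            task = self.queue.popleft()
            self.running += 1
            if task():              # True => finished synchronously
                self.running -= 1   # free the slot; the while-loop continues

# A million immediately-finishing tasks: fine iteratively, whereas a
# recursive ContinueTasks would blow the stack long before this.
sched = ThrottledScheduler(max_concurrent=8)
for _ in range(1_000_000):
    sched.submit(lambda: True)
```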
Describe the bug, including details regarding any error messages, version, and platform.

Writing a parquet file crashes with a segmentation fault when using `write_dataset` and setting `min_rows_per_file` and `max_rows_per_file`. It seems the crash depends on the particular number: for example, setting `10_000` for both parameters works, while `60_000` crashes.

`pyarrow==11.0`
See the example and dedicated Docker environment below.
segfault.py
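(The script body is collapsed in the original issue. Below is a hedged, minimal reconstruction of what such a reproduction could look like; the schema, row count, and output path are assumptions, and since pyarrow's `write_dataset` exposes `min_rows_per_group`, `max_rows_per_group`, and `max_rows_per_file`, the sketch uses those row-limit keywords with the reported value of 60_000:)

```python
# Hypothetical reconstruction of the collapsed reproduction script.
# Only pyarrow==11.0, the 60_000 row limit, and the use of
# write_dataset come from the report; everything else is illustrative.
import numpy as np
import pyarrow as pa
import pyarrow.dataset as ds

n_rows = 1_000_000  # assumed: large enough to span many files/groups
table = pa.table({
    "x": np.arange(n_rows, dtype=np.int64),
    "y": np.random.default_rng(0).random(n_rows),
})

ds.write_dataset(
    table,
    "/tmp/out",
    format="parquet",
    min_rows_per_group=60_000,
    max_rows_per_group=60_000,
    max_rows_per_file=60_000,  # 10_000 reportedly works; 60_000 crashes
)
```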
Dockerfile
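(The Dockerfile is likewise collapsed in the original issue; here is a minimal sketch consistent with the run steps below, where the base image and paths are assumptions:)

```dockerfile
# Hypothetical reconstruction; only pyarrow==11.0 and the
# /tmp/segfault.py path used in the steps below come from the report.
FROM python:3.10-slim
RUN pip install --no-cache-dir "pyarrow==11.0.0" numpy
COPY segfault.py /tmp/segfault.py
```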
Run the example:

```console
docker build . -t pyarrow-segfault
docker run --rm -it pyarrow-segfault bash
python /tmp/segfault.py
```
Component(s)
Python