-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CI][C++] Flaky segmentation fault on arrow-dataset-file-json-test #34653
Comments
I'm (finally) able to reproduce this locally. For the record, running the alpine service alone won't trigger it, but imposing github's cpu/mem limits will. I'm aiming to rectify the issue without needing to revert anything for 12.0. Unfortunately, there are three recently-introduced APIs at play here so hopefully it's nothing too crazy. |
With this taken into consideration should we keep this as a blocker? Outside of maybe having to rerun some jobs when cutting the rc, this does not seem like a critical issue? |
Probably not, although I think I've identified the cause. It's essentially a stack overflow in the JSON reader which can occur under very specific circumstances. In this case, it's brought about by one of the tests (which uses parameters that aren't representative of typical usage). The details are fairly convoluted since threading is involved. I should have a fix soon regardless, although the JSON reader itself is (in principle) working as intended. I'm still not sure why this only occurs on Alpine... |
Iterating futures can lead to stack blowup (VisitAsyncGenerator and a few other places have specialized code to avoid this). I thought maybe alpine linux might have a smaller stack depth but it seems the default is the same as Ubuntu so it might just be because it is running more slowly. Slower code tends to experience more stack blowup because the futures are likely to be already fulfilled before the continuation is set. |
As there seems to be a PR almost ready for this, I am marking it as |
…alpine-linux-cpp (#35047) ### What changes are included in this PR? Increases the block size used in the `ScanWithParallelDecoding` test to reduce the number of (potentially parallel) parsing/decoding jobs from 1000+ to roughly 60 while increasing the runtime of each job. This should still satisfy the purpose of test without going completely over the top. ### Are these changes tested? Yes, tested locally on the alpine docker image many times after successfully reproducing the original issue. ### Are there any user-facing changes? No ### Notes This doesn't solve the underlying cause (although the testing parameters were arguably far too unusual in the first place), however I do believe that I've identified the issue via a core dump. The problem starts [here](https://github.com/apache/arrow/blob/47a602dbd9b7b7f7720a5e62467e3e6c61712cf3/cpp/src/arrow/json/reader.cc#L362-L369), where a `MappingGenerator` gets stacked on top of a generator that applies readahead. It seems that the underlying futures were completing very quickly, resulting in `AddCallback` being called recursively many, many times - starting [here](https://github.com/apache/arrow/blob/47a602dbd9b7b7f7720a5e62467e3e6c61712cf3/cpp/src/arrow/util/async_generator.h#L240). This leads to a stack overflow under specific circumstances. So, to fully guard against the problem, you'd probably want to change the logic of `MappingGenerator` to use `TryAddCallback` + an inner loop to avoid overflowing the stack. Not entirely sure if doing this would be worthwhile though. * Closes: #34653 Authored-by: benibus <bpharks@gmx.com> Signed-off-by: Weston Pace <weston.pace@gmail.com>
…lt on alpine-linux-cpp (apache#35047) ### What changes are included in this PR? Increases the block size used in the `ScanWithParallelDecoding` test to reduce the number of (potentially parallel) parsing/decoding jobs from 1000+ to roughly 60 while increasing the runtime of each job. This should still satisfy the purpose of test without going completely over the top. ### Are these changes tested? Yes, tested locally on the alpine docker image many times after successfully reproducing the original issue. ### Are there any user-facing changes? No ### Notes This doesn't solve the underlying cause (although the testing parameters were arguably far too unusual in the first place), however I do believe that I've identified the issue via a core dump. The problem starts [here](https://github.com/apache/arrow/blob/47a602dbd9b7b7f7720a5e62467e3e6c61712cf3/cpp/src/arrow/json/reader.cc#L362-L369), where a `MappingGenerator` gets stacked on top of a generator that applies readahead. It seems that the underlying futures were completing very quickly, resulting in `AddCallback` being called recursively many, many times - starting [here](https://github.com/apache/arrow/blob/47a602dbd9b7b7f7720a5e62467e3e6c61712cf3/cpp/src/arrow/util/async_generator.h#L240). This leads to a stack overflow under specific circumstances. So, to fully guard against the problem, you'd probably want to change the logic of `MappingGenerator` to use `TryAddCallback` + an inner loop to avoid overflowing the stack. Not entirely sure if doing this would be worthwhile though. * Closes: apache#34653 Authored-by: benibus <bpharks@gmx.com> Signed-off-by: Weston Pace <weston.pace@gmail.com>
…lt on alpine-linux-cpp (apache#35047) ### What changes are included in this PR? Increases the block size used in the `ScanWithParallelDecoding` test to reduce the number of (potentially parallel) parsing/decoding jobs from 1000+ to roughly 60 while increasing the runtime of each job. This should still satisfy the purpose of test without going completely over the top. ### Are these changes tested? Yes, tested locally on the alpine docker image many times after successfully reproducing the original issue. ### Are there any user-facing changes? No ### Notes This doesn't solve the underlying cause (although the testing parameters were arguably far too unusual in the first place), however I do believe that I've identified the issue via a core dump. The problem starts [here](https://github.com/apache/arrow/blob/47a602dbd9b7b7f7720a5e62467e3e6c61712cf3/cpp/src/arrow/json/reader.cc#L362-L369), where a `MappingGenerator` gets stacked on top of a generator that applies readahead. It seems that the underlying futures were completing very quickly, resulting in `AddCallback` being called recursively many, many times - starting [here](https://github.com/apache/arrow/blob/47a602dbd9b7b7f7720a5e62467e3e6c61712cf3/cpp/src/arrow/util/async_generator.h#L240). This leads to a stack overflow under specific circumstances. So, to fully guard against the problem, you'd probably want to change the logic of `MappingGenerator` to use `TryAddCallback` + an inner loop to avoid overflowing the stack. Not entirely sure if doing this would be worthwhile though. * Closes: apache#34653 Authored-by: benibus <bpharks@gmx.com> Signed-off-by: Weston Pace <weston.pace@gmail.com>
…lt on alpine-linux-cpp (apache#35047) ### What changes are included in this PR? Increases the block size used in the `ScanWithParallelDecoding` test to reduce the number of (potentially parallel) parsing/decoding jobs from 1000+ to roughly 60 while increasing the runtime of each job. This should still satisfy the purpose of test without going completely over the top. ### Are these changes tested? Yes, tested locally on the alpine docker image many times after successfully reproducing the original issue. ### Are there any user-facing changes? No ### Notes This doesn't solve the underlying cause (although the testing parameters were arguably far too unusual in the first place), however I do believe that I've identified the issue via a core dump. The problem starts [here](https://github.com/apache/arrow/blob/47a602dbd9b7b7f7720a5e62467e3e6c61712cf3/cpp/src/arrow/json/reader.cc#L362-L369), where a `MappingGenerator` gets stacked on top of a generator that applies readahead. It seems that the underlying futures were completing very quickly, resulting in `AddCallback` being called recursively many, many times - starting [here](https://github.com/apache/arrow/blob/47a602dbd9b7b7f7720a5e62467e3e6c61712cf3/cpp/src/arrow/util/async_generator.h#L240). This leads to a stack overflow under specific circumstances. So, to fully guard against the problem, you'd probably want to change the logic of `MappingGenerator` to use `TryAddCallback` + an inner loop to avoid overflowing the stack. Not entirely sure if doing this would be worthwhile though. * Closes: apache#34653 Authored-by: benibus <bpharks@gmx.com> Signed-off-by: Weston Pace <weston.pace@gmail.com>
Describe the bug, including details regarding any error messages, version, and platform.
I have seen the following error on the test-alpine-linux-cpp. It does seem flaky because it succeeded yesterday but it failed two days ago https://github.com/ursacomputing/crossbow/actions/runs/4458056700/jobs/7829539126.
Component(s)
C++, Continuous Integration
The text was updated successfully, but these errors were encountered: