Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI][C++] Flaky segmentation fault on arrow-dataset-file-json-test #34653

Closed
raulcd opened this issue Mar 21, 2023 · 5 comments · Fixed by #35047
Closed

[CI][C++] Flaky segmentation fault on arrow-dataset-file-json-test #34653

raulcd opened this issue Mar 21, 2023 · 5 comments · Fixed by #35047

Comments

@raulcd
Copy link
Member

raulcd commented Mar 21, 2023

Describe the bug, including details regarding any error messages, version, and platform.

I have seen the following error on the test-alpine-linux-cpp. It does seem flaky because it succeeded yesterday but it failed two days ago https://github.com/ursacomputing/crossbow/actions/runs/4458056700/jobs/7829539126.

 68/103 Test  #60: arrow-dataset-file-json-test ..............***Failed   59.87 sec
Running arrow-dataset-file-json-test, redirecting output into /build/cpp/build/test-logs/arrow-dataset-file-json-test.txt (attempt 1/1)
/arrow/cpp/build-support/run-test.sh: line 88: 28009 Segmentation fault      (core dumped) $TEST_EXECUTABLE "$@" > $LOGFILE.raw 2>&1
Running main() from /home/buildozer/aports/main/gtest/src/googletest-14aa11db02d9851d957f93ef9fddb110c1aafdc6/googletest/src/gtest_main.cc
[==========] Running 51 tests from 4 test suites.
[----------] Global test environment set-up.
[----------] 6 tests from TestJsonFormat
[ RUN      ] TestJsonFormat.Equals
[       OK ] TestJsonFormat.Equals (0 ms)
[ RUN      ] TestJsonFormat.IsSupported
[       OK ] TestJsonFormat.IsSupported (1159 ms)
[ RUN      ] TestJsonFormat.Inspect
[       OK ] TestJsonFormat.Inspect (1146 ms)
[ RUN      ] TestJsonFormat.FragmentEquals
[       OK ] TestJsonFormat.FragmentEquals (2053 ms)
[ RUN      ] TestJsonFormat.InspectFailureWithRelevantError
[       OK ] TestJsonFormat.InspectFailureWithRelevantError (19 ms)
[ RUN      ] TestJsonFormat.CountRows
[       OK ] TestJsonFormat.CountRows (2690 ms)
[----------] 6 tests from TestJsonFormat (7069 ms total)

[----------] 5 tests from TestJsonFormatV2
[ RUN      ] TestJsonFormatV2.IsSupported
[       OK ] TestJsonFormatV2.IsSupported (1164 ms)
[ RUN      ] TestJsonFormatV2.Inspect
[       OK ] TestJsonFormatV2.Inspect (1163 ms)
[ RUN      ] TestJsonFormatV2.FragmentEquals
[       OK ] TestJsonFormatV2.FragmentEquals (1821 ms)
[ RUN      ] TestJsonFormatV2.InspectFailureWithRelevantError
[       OK ] TestJsonFormatV2.InspectFailureWithRelevantError (18 ms)
[ RUN      ] TestJsonFormatV2.CountRows
[       OK ] TestJsonFormatV2.CountRows (2649 ms)
[----------] 5 tests from TestJsonFormatV2 (6818 ms total)

[----------] 24 tests from TestJsonScan/TestJsonScan
[ RUN      ] TestJsonScan/TestJsonScan.Scan/0Threaded16b1024r
[       OK ] TestJsonScan/TestJsonScan.Scan/0Threaded16b1024r (384 ms)
[ RUN      ] TestJsonScan/TestJsonScan.Scan/1Serial16b1024r
[       OK ] TestJsonScan/TestJsonScan.Scan/1Serial16b1024r (375 ms)
[ RUN      ] TestJsonScan/TestJsonScan.ScanBatchSize/0Threaded16b1024r
[       OK ] TestJsonScan/TestJsonScan.ScanBatchSize/0Threaded16b1024r (1011 ms)
[ RUN      ] TestJsonScan/TestJsonScan.ScanBatchSize/1Serial16b1024r
[       OK ] TestJsonScan/TestJsonScan.ScanBatchSize/1Serial16b1024r (990 ms)
[ RUN      ] TestJsonScan/TestJsonScan.ScanProjected/0Threaded16b1024r
[       OK ] TestJsonScan/TestJsonScan.ScanProjected/0Threaded16b1024r (865 ms)
[ RUN      ] TestJsonScan/TestJsonScan.ScanProjected/1Serial16b1024r
[       OK ] TestJsonScan/TestJsonScan.ScanProjected/1Serial16b1024r (855 ms)
[ RUN      ] TestJsonScan/TestJsonScan.ScanWithDuplicateColumnError/0Threaded16b1024r
[       OK ] TestJsonScan/TestJsonScan.ScanWithDuplicateColumnError/0Threaded16b1024r (0 ms)
[ RUN      ] TestJsonScan/TestJsonScan.ScanWithDuplicateColumnError/1Serial16b1024r
[       OK ] TestJsonScan/TestJsonScan.ScanWithDuplicateColumnError/1Serial16b1024r (0 ms)
[ RUN      ] TestJsonScan/TestJsonScan.ScanWithVirtualColumn/0Threaded16b1024r
[       OK ] TestJsonScan/TestJsonScan.ScanWithVirtualColumn/0Threaded16b1024r (708 ms)
[ RUN      ] TestJsonScan/TestJsonScan.ScanWithVirtualColumn/1Serial16b1024r
[       OK ] TestJsonScan/TestJsonScan.ScanWithVirtualColumn/1Serial16b1024r (865 ms)
[ RUN      ] TestJsonScan/TestJsonScan.ScanWithPushdownNulls/0Threaded16b1024r
[       OK ] TestJsonScan/TestJsonScan.ScanWithPushdownNulls/0Threaded16b1024r (7 ms)
[ RUN      ] TestJsonScan/TestJsonScan.ScanWithPushdownNulls/1Serial16b1024r
[       OK ] TestJsonScan/TestJsonScan.ScanWithPushdownNulls/1Serial16b1024r (6 ms)
[ RUN      ] TestJsonScan/TestJsonScan.ScanProjectedMissingCols/0Threaded16b1024r
[       OK ] TestJsonScan/TestJsonScan.ScanProjectedMissingCols/0Threaded16b1024r (2324 ms)
[ RUN      ] TestJsonScan/TestJsonScan.ScanProjectedMissingCols/1Serial16b1024r
[       OK ] TestJsonScan/TestJsonScan.ScanProjectedMissingCols/1Serial16b1024r (2519 ms)
[ RUN      ] TestJsonScan/TestJsonScan.ScanProjectedNested/0Threaded16b1024r
[       OK ] TestJsonScan/TestJsonScan.ScanProjectedNested/0Threaded16b1024r (7324 ms)
[ RUN      ] TestJsonScan/TestJsonScan.ScanProjectedNested/1Serial16b1024r
[       OK ] TestJsonScan/TestJsonScan.ScanProjectedNested/1Serial16b1024r (7409 ms)
[ RUN      ] TestJsonScan/TestJsonScan.ScanWithBOM/0Threaded16b1024r
[       OK ] TestJsonScan/TestJsonScan.ScanWithBOM/0Threaded16b1024r (7 ms)
[ RUN      ] TestJsonScan/TestJsonScan.ScanWithBOM/1Serial16b1024r
[       OK ] TestJsonScan/TestJsonScan.ScanWithBOM/1Serial16b1024r (5 ms)
[ RUN      ] TestJsonScan/TestJsonScan.ScanWithCustomParseOptions/0Threaded16b1024r
[       OK ] TestJsonScan/TestJsonScan.ScanWithCustomParseOptions/0Threaded16b1024r (5 ms)
[ RUN      ] TestJsonScan/TestJsonScan.ScanWithCustomParseOptions/1Serial16b1024r
[       OK ] TestJsonScan/TestJsonScan.ScanWithCustomParseOptions/1Serial16b1024r (4 ms)
[ RUN      ] TestJsonScan/TestJsonScan.ScanWithCustomBlockSize/0Threaded16b1024r
[       OK ] TestJsonScan/TestJsonScan.ScanWithCustomBlockSize/0Threaded16b1024r (6 ms)
[ RUN      ] TestJsonScan/TestJsonScan.ScanWithCustomBlockSize/1Serial16b1024r
[       OK ] TestJsonScan/TestJsonScan.ScanWithCustomBlockSize/1Serial16b1024r (6 ms)
[ RUN      ] TestJsonScan/TestJsonScan.ScanWithParallelDecoding/0Threaded16b1024r
[       OK ] TestJsonScan/TestJsonScan.ScanWithParallelDecoding/0Threaded16b1024r (2857 ms)
[ RUN      ] TestJsonScan/TestJsonScan.ScanWithParallelDecoding/1Serial16b1024r
[       OK ] TestJsonScan/TestJsonScan.ScanWithParallelDecoding/1Serial16b1024r (3923 ms)
[----------] 24 tests from TestJsonScan/TestJsonScan (32468 ms total)

[----------] 16 tests from TestJsonScanNode/TestJsonScanNode
[ RUN      ] TestJsonScanNode/TestJsonScanNode.Scan/0Threaded16b1024r
[       OK ] TestJsonScanNode/TestJsonScanNode.Scan/0Threaded16b1024r (732 ms)
[ RUN      ] TestJsonScanNode/TestJsonScanNode.Scan/1Serial16b1024r
[       OK ] TestJsonScanNode/TestJsonScanNode.Scan/1Serial16b1024r (683 ms)
[ RUN      ] TestJsonScanNode/TestJsonScanNode.ScanMissingFilterField/0Threaded16b1024r
/arrow/cpp/src/arrow/status.cc:134: Cancelled: Plan was cancelled early.
[       OK ] TestJsonScanNode/TestJsonScanNode.ScanMissingFilterField/0Threaded16b1024r (779 ms)
[ RUN      ] TestJsonScanNode/TestJsonScanNode.ScanMissingFilterField/1Serial16b1024r
[       OK ] TestJsonScanNode/TestJsonScanNode.ScanMissingFilterField/1Serial16b1024r (798 ms)
[ RUN      ] TestJsonScanNode/TestJsonScanNode.ScanProjected/0Threaded16b1024r
[       OK ] TestJsonScanNode/TestJsonScanNode.ScanProjected/0Threaded16b1024r (1384 ms)
[ RUN      ] TestJsonScanNode/TestJsonScanNode.ScanProjected/1Serial16b1024r
[       OK ] TestJsonScanNode/TestJsonScanNode.ScanProjected/1Serial16b1024r (1610 ms)
[ RUN      ] TestJsonScanNode/TestJsonScanNode.ScanProjectedMissingColumns/0Threaded16b1024r
[       OK ] TestJsonScanNode/TestJsonScanNode.ScanProjectedMissingColumns/0Threaded16b1024r (4082 ms)
[ RUN      ] TestJsonScanNode/TestJsonScanNode.ScanProjectedMissingColumns/1Serial16b1024r
[       OK ] TestJsonScanNode/TestJsonScanNode.ScanProjectedMissingColumns/1Serial16b1024r (2371 ms)
[ RUN      ] TestJsonScanNode/TestJsonScanNode.ScanWithBOM/0Threaded16b1024r
[       OK ] TestJsonScanNode/TestJsonScanNode.ScanWithBOM/0Threaded16b1024r (4 ms)
[ RUN      ] TestJsonScanNode/TestJsonScanNode.ScanWithBOM/1Serial16b1024r
[       OK ] TestJsonScanNode/TestJsonScanNode.ScanWithBOM/1Serial16b1024r (4 ms)
[ RUN      ] TestJsonScanNode/TestJsonScanNode.ScanWithCustomParseOptions/0Threaded16b1024r
[       OK ] TestJsonScanNode/TestJsonScanNode.ScanWithCustomParseOptions/0Threaded16b1024r (3 ms)
[ RUN      ] TestJsonScanNode/TestJsonScanNode.ScanWithCustomParseOptions/1Serial16b1024r
[       OK ] TestJsonScanNode/TestJsonScanNode.ScanWithCustomParseOptions/1Serial16b1024r (3 ms)
[ RUN      ] TestJsonScanNode/TestJsonScanNode.ScanWithCustomBlockSize/0Threaded16b1024r
[       OK ] TestJsonScanNode/TestJsonScanNode.ScanWithCustomBlockSize/0Threaded16b1024r (3 ms)
[ RUN      ] TestJsonScanNode/TestJsonScanNode.ScanWithCustomBlockSize/1Serial16b1024r
[       OK ] TestJsonScanNode/TestJsonScanNode.ScanWithCustomBlockSize/1Serial16b1024r (4 ms)
[ RUN      ] TestJsonScanNode/TestJsonScanNode.ScanWithParallelDecoding/0Threaded16b1024r
/build/cpp/src/arrow/dataset

Component(s)

C++, Continuous Integration

@raulcd raulcd added the Priority: Blocker Marks a blocker for the release label Mar 21, 2023
@raulcd raulcd added this to the 12.0.0 milestone Mar 21, 2023
@benibus benibus self-assigned this Mar 21, 2023
@benibus
Copy link
Collaborator

benibus commented Mar 30, 2023

I'm (finally) able to reproduce this locally. For the record, running the alpine service alone won't trigger it, but imposing github's cpu/mem limits will.

I'm aiming to rectify the issue without needing to revert anything for 12.0. Unfortunately, there are three recently-introduced APIs at play here so hopefully it's nothing too crazy.

@assignUser
Copy link
Member

imposing github's cpu/mem limits will.

With this taken into consideration should we keep this as a blocker? Outside of maybe having to rerun some jobs when cutting the rc, this does not seem like a critical issue?

@benibus
Copy link
Collaborator

benibus commented Apr 5, 2023

should we keep this as a blocker?

Probably not, although I think I've identified the cause. It's essentially a stack overflow in the JSON reader which can occur under very specific circumstances. In this case, it's brought about by one of the tests (which uses parameters that aren't representative of typical usage). The details are fairly convoluted since threading is involved.

I should have a fix soon regardless, although the JSON reader itself is (in principle) working as intended. I'm still not sure why this only occurs on Alpine...

@assignUser assignUser removed the Priority: Blocker Marks a blocker for the release label Apr 6, 2023
@westonpace
Copy link
Member

I'm still not sure why this only occurs on Alpine...

Iterating futures can lead to stack blowup (VisitAsyncGenerator and a few other places have specialized code to avoid this). I thought maybe alpine linux might have a smaller stack depth but it seems the default is the same as Ubuntu so it might just be because it is running more slowly. Slower code tends to experience more stack blowup because the futures are likely to be already fulfilled before the continuation is set.

@raulcd
Copy link
Member Author

raulcd commented Apr 12, 2023

As there seems to be a PR almost ready for this, I am marking it as Priority: Blocker to add it to the release.

@raulcd raulcd added the Priority: Blocker Marks a blocker for the release label Apr 12, 2023
westonpace pushed a commit that referenced this issue Apr 12, 2023
…alpine-linux-cpp (#35047)

### What changes are included in this PR?

Increases the block size used in the `ScanWithParallelDecoding` test to reduce the number of (potentially parallel) parsing/decoding jobs from 1000+ to roughly 60 while increasing the runtime of each job. This should still satisfy the purpose of test without going completely over the top.

### Are these changes tested?

Yes, tested locally on the alpine docker image many times after successfully reproducing the original issue.

### Are there any user-facing changes?

No

### Notes

This doesn't solve the underlying cause (although the testing parameters were arguably far too unusual in the first place), however I do believe that I've identified the issue via a core dump.

The problem starts [here](https://github.com/apache/arrow/blob/47a602dbd9b7b7f7720a5e62467e3e6c61712cf3/cpp/src/arrow/json/reader.cc#L362-L369), where a `MappingGenerator` gets stacked on top of a generator that applies readahead. It seems that the underlying futures were completing very quickly, resulting in `AddCallback` being called recursively many, many times - starting [here](https://github.com/apache/arrow/blob/47a602dbd9b7b7f7720a5e62467e3e6c61712cf3/cpp/src/arrow/util/async_generator.h#L240). This leads to a stack overflow under specific circumstances.

So, to fully guard against the problem, you'd probably want to change the logic of `MappingGenerator` to use `TryAddCallback` + an inner loop to avoid overflowing the stack. Not entirely sure if doing this would be worthwhile though.
* Closes: #34653

Authored-by: benibus <bpharks@gmx.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this issue May 11, 2023
…lt on alpine-linux-cpp (apache#35047)

### What changes are included in this PR?

Increases the block size used in the `ScanWithParallelDecoding` test to reduce the number of (potentially parallel) parsing/decoding jobs from 1000+ to roughly 60 while increasing the runtime of each job. This should still satisfy the purpose of test without going completely over the top.

### Are these changes tested?

Yes, tested locally on the alpine docker image many times after successfully reproducing the original issue.

### Are there any user-facing changes?

No

### Notes

This doesn't solve the underlying cause (although the testing parameters were arguably far too unusual in the first place), however I do believe that I've identified the issue via a core dump.

The problem starts [here](https://github.com/apache/arrow/blob/47a602dbd9b7b7f7720a5e62467e3e6c61712cf3/cpp/src/arrow/json/reader.cc#L362-L369), where a `MappingGenerator` gets stacked on top of a generator that applies readahead. It seems that the underlying futures were completing very quickly, resulting in `AddCallback` being called recursively many, many times - starting [here](https://github.com/apache/arrow/blob/47a602dbd9b7b7f7720a5e62467e3e6c61712cf3/cpp/src/arrow/util/async_generator.h#L240). This leads to a stack overflow under specific circumstances.

So, to fully guard against the problem, you'd probably want to change the logic of `MappingGenerator` to use `TryAddCallback` + an inner loop to avoid overflowing the stack. Not entirely sure if doing this would be worthwhile though.
* Closes: apache#34653

Authored-by: benibus <bpharks@gmx.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
…lt on alpine-linux-cpp (apache#35047)

### What changes are included in this PR?

Increases the block size used in the `ScanWithParallelDecoding` test to reduce the number of (potentially parallel) parsing/decoding jobs from 1000+ to roughly 60 while increasing the runtime of each job. This should still satisfy the purpose of test without going completely over the top.

### Are these changes tested?

Yes, tested locally on the alpine docker image many times after successfully reproducing the original issue.

### Are there any user-facing changes?

No

### Notes

This doesn't solve the underlying cause (although the testing parameters were arguably far too unusual in the first place), however I do believe that I've identified the issue via a core dump.

The problem starts [here](https://github.com/apache/arrow/blob/47a602dbd9b7b7f7720a5e62467e3e6c61712cf3/cpp/src/arrow/json/reader.cc#L362-L369), where a `MappingGenerator` gets stacked on top of a generator that applies readahead. It seems that the underlying futures were completing very quickly, resulting in `AddCallback` being called recursively many, many times - starting [here](https://github.com/apache/arrow/blob/47a602dbd9b7b7f7720a5e62467e3e6c61712cf3/cpp/src/arrow/util/async_generator.h#L240). This leads to a stack overflow under specific circumstances.

So, to fully guard against the problem, you'd probably want to change the logic of `MappingGenerator` to use `TryAddCallback` + an inner loop to avoid overflowing the stack. Not entirely sure if doing this would be worthwhile though.
* Closes: apache#34653

Authored-by: benibus <bpharks@gmx.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
rtpsw pushed a commit to rtpsw/arrow that referenced this issue May 16, 2023
…lt on alpine-linux-cpp (apache#35047)

### What changes are included in this PR?

Increases the block size used in the `ScanWithParallelDecoding` test to reduce the number of (potentially parallel) parsing/decoding jobs from 1000+ to roughly 60 while increasing the runtime of each job. This should still satisfy the purpose of test without going completely over the top.

### Are these changes tested?

Yes, tested locally on the alpine docker image many times after successfully reproducing the original issue.

### Are there any user-facing changes?

No

### Notes

This doesn't solve the underlying cause (although the testing parameters were arguably far too unusual in the first place), however I do believe that I've identified the issue via a core dump.

The problem starts [here](https://github.com/apache/arrow/blob/47a602dbd9b7b7f7720a5e62467e3e6c61712cf3/cpp/src/arrow/json/reader.cc#L362-L369), where a `MappingGenerator` gets stacked on top of a generator that applies readahead. It seems that the underlying futures were completing very quickly, resulting in `AddCallback` being called recursively many, many times - starting [here](https://github.com/apache/arrow/blob/47a602dbd9b7b7f7720a5e62467e3e6c61712cf3/cpp/src/arrow/util/async_generator.h#L240). This leads to a stack overflow under specific circumstances.

So, to fully guard against the problem, you'd probably want to change the logic of `MappingGenerator` to use `TryAddCallback` + an inner loop to avoid overflowing the stack. Not entirely sure if doing this would be worthwhile though.
* Closes: apache#34653

Authored-by: benibus <bpharks@gmx.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants