[C++] Flight integration tests fail on verify rc nightly on linux amd64 #20301

Closed
asfimport opened this issue Jun 28, 2022 · 23 comments

asfimport commented Jun 28, 2022

Some of our nightly builds to verify the release are failing:

verify-rc-source-integration-linux-almalinux-8-amd64
verify-rc-source-integration-linux-ubuntu-18.04-amd64
verify-rc-source-integration-linux-ubuntu-20.04-amd64
verify-rc-source-integration-linux-ubuntu-22.04-amd64

with the following:

################# FAILURES #################
FAILED TEST: middleware C++ producing,  C++ consuming
1 failures
  File "/arrow/dev/archery/archery/integration/util.py", line 139, in run_cmd
    output = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
  File "/usr/lib/python3.8/subprocess.py", line 411, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.8/subprocess.py", line 512, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/tmp/arrow-HEAD.PZocX/cpp-build/release/flight-test-integration-client', '-host', 'localhost', '-port=36719', '-scenario', 'middleware']' died with <Signals.SIGABRT: 6>.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/arrow/dev/archery/archery/integration/runner.py", line 379, in _run_flight_test_case
    consumer.flight_request(port, **client_args)
  File "/arrow/dev/archery/archery/integration/tester_cpp.py", line 134, in flight_request
    run_cmd(cmd)
  File "/arrow/dev/archery/archery/integration/util.py", line 148, in run_cmd
    raise RuntimeError(sio.getvalue())
RuntimeError: Command failed: /tmp/arrow-HEAD.PZocX/cpp-build/release/flight-test-integration-client -host localhost -port=36719 -scenario middleware
With output:
--------------
Headers received successfully on failing call.
Headers received successfully on passing call.
free(): double free detected in tcache 2 
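
For readers wondering where the `died with <Signals.SIGABRT: 6>` text comes from: on POSIX, Python's `subprocess` reports a child killed by a signal as a negative return code. The following is a minimal sketch of that behaviour, not archery's actual `run_cmd`. The helper name `run_and_report` is made up for illustration, and the child is a stand-in Python process that calls `os.abort()` rather than the real `flight-test-integration-client`.

```python
import signal
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd and return (returncode, combined stdout/stderr bytes)."""
    try:
        output = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
        return 0, output
    except subprocess.CalledProcessError as exc:
        # On POSIX, a return code of -N means the child was killed by signal N.
        return exc.returncode, exc.output


# Hypothetical stand-in child: prints a line, then aborts, which is how glibc
# terminates a process after detecting a double free.
child = [sys.executable, "-c",
         "import os; print('before abort', flush=True); os.abort()"]

returncode, output = run_and_report(child)
if returncode < 0:
    print("child died with", signal.Signals(-returncode).name)
```

Anything the child printed before aborting is still captured, which is why the log above shows the two "Headers received successfully" lines ahead of the glibc message.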

Reporter: Raúl Cumplido / @raulcd
Assignee: David Li / @lidavidm

Note: This issue was originally created as ARROW-16919. Please see the migration documentation for further details.

David Li / @lidavidm:
I'll take a look if I get a chance.

David Li / @lidavidm:
Hmm, I wasn't able to reproduce this locally with the docker container. (tried with Ubuntu 20.04)

Raúl Cumplido / @raulcd:
I am not able to reproduce locally either (tried with Ubuntu 20.04 and 18.04), but the failures seem to be pretty consistent on the nightly builds for the last few days.

David Li / @lidavidm:
Same here. I still can't reproduce, and yet the nightlies fail consistently. The output also doesn't really give us anything to go off of…

David Li / @lidavidm:
I also tried setting docker to mimic the CPU/RAM configuration of the GitHub runners, but that just resulted in the tests hanging/timing out.

Raúl Cumplido / @raulcd:
It has only failed on verify-rc-source-integration-linux-ubuntu-18.04-amd64 today but has worked on the rest, so it is not a consistent failure there either.

David Li / @lidavidm:
I suppose it's possible to SSH into a GitHub Actions runner, so if we can get this to happen on a personal fork we could debug it that way.

Raúl Cumplido / @raulcd:
I have pushed a branch to be able to ssh to the github actions runner. See: #13478

Once there we can trigger the job via github-actions with: @github-actions crossbow submit verify-rc-source-integration-linux-ubuntu-18.04-amd64

This will spin up a new runner, and in the logs you'll be able to see the connection string like: SSH Session: ssh WHATEVER:WHATEVER=@uptermd.upterm.dev

Once on the terminal you can run the command that we use to verify the RC: archery docker run -e VERIFY_VERSION="" -e VERIFY_RC="" -e TEST_DEFAULT=0 -e TEST_INTEGRATION=1 ubuntu-verify-rc

I think this can give us an initial base to debug a little further.

David Li / @lidavidm:
I tried again with the container and manually invoked the integration command - I can't get it to fail even in a tight loop.
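
A tight loop of this kind can be sketched as below. This is only an illustration of the approach, not the script that was actually used. `run_until_failure` is a hypothetical helper, and the two commands are deterministic stand-ins so the sketch runs anywhere. In practice `cmd` would be the client invocation from the log (`flight-test-integration-client -host localhost -port=<port> -scenario middleware`) with a matching server already listening.

```python
import subprocess
import sys


def run_until_failure(cmd, max_attempts=1000):
    """Run cmd repeatedly; return (attempt, CompletedProcess) for the first
    non-zero exit, or None if every attempt succeeded."""
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT)
        if result.returncode != 0:
            return attempt, result
    return None


# Stand-in commands (assumptions for illustration only).
always_ok = [sys.executable, "-c", "pass"]
always_fail = [sys.executable, "-c", "import sys; sys.exit(3)"]

print(run_until_failure(always_ok, max_attempts=3))
first_failure = run_until_failure(always_fail, max_attempts=3)
print(first_failure[0], first_failure[1].returncode)
```

Keeping the `CompletedProcess` of the failing attempt preserves its output and return code, so a signal death (negative return code) can be told apart from an ordinary test failure.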

David Li / @lidavidm:
We could maybe try building with something like libbacktrace temporarily?

David Li / @lidavidm:
I'm trying libSegFault.so to hopefully get a backtrace: #13515
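
For context, libSegFault.so is a glibc helper that prints a backtrace when the process receives a fatal signal. It is usually enabled by preloading it, with the signals to catch listed in `SEGFAULT_SIGNALS` (it only handles SIGSEGV by default, so `abrt` has to be requested for this failure). The sketch below shows one way to build such an environment from Python. It is not necessarily what #13515 does, `segfault_env` is a hypothetical helper, and the library path in the usage comment is an assumption that varies by distribution.

```python
import os


def segfault_env(lib_path, signals="abrt", base_env=None):
    """Return a copy of the environment that preloads libSegFault.so.

    lib_path: location of libSegFault.so (system dependent, an assumption here).
    signals:  space-separated signal names for SEGFAULT_SIGNALS, e.g. "abrt segv".
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD")
    # Keep any library that was already preloaded.
    env["LD_PRELOAD"] = f"{lib_path}:{existing}" if existing else lib_path
    env["SEGFAULT_SIGNALS"] = signals
    return env


# Usage (path is hypothetical):
#   subprocess.run(cmd, env=segfault_env("/lib/x86_64-linux-gnu/libSegFault.so"))
print(segfault_env("/path/to/libSegFault.so", base_env={}))
```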

David Li / @lidavidm:
This is still happening and I wasn't able to get a backtrace…I'll make another try soon.

David Li / @lidavidm:
I got a backtrace!!!!


Backtrace:
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f8f4a0b900b]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f8f4a098859]
/lib/x86_64-linux-gnu/libc.so.6(+0x8d26e)[0x7f8f4a10326e]
/lib/x86_64-linux-gnu/libc.so.6(+0x952fc)[0x7f8f4a10b2fc]
/lib/x86_64-linux-gnu/libc.so.6(+0x96f6d)[0x7f8f4a10cf6d]
/tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt8_Rb_treeISt10shared_ptrIN5arrow8DataTypeEES3_St9_IdentityIS3_ESt4lessIS3_ESaIS3_EE8_M_eraseEPSt13_Rb_tree_nodeIS3_E+0x39)[0x5557a9a83b19]
/tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt8_Rb_treeISt10shared_ptrIN5arrow8DataTypeEES3_St9_IdentityIS3_ESt4lessIS3_ESaIS3_EE8_M_eraseEPSt13_Rb_tree_nodeIS3_E+0x1f)[0x5557a9a83aff]
/tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt3setISt10shared_ptrIN5arrow8DataTypeEESt4lessIS3_ESaIS3_EED1Ev+0x33)[0x5557a9a83b83]
/lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xce)[0x7f8f4a0bcfde]
/tmp/arrow-HEAD.y8UwB/cpp-build/release/libarrow.so.900(+0x440b67)[0x7f8f47d56b67] 

Given __cxa_finalize, it seems some static is being destructed twice?

David Li / @lidavidm:
_ZNSt3setISt10shared_ptrIN5arrow8DataTypeEESt4lessIS3_ESaIS3_EED1Ev is the destructor of std::set<std::shared_ptr<arrow::DataType>>. The only place such a type occurs in the Arrow source is in the asof join node (ARROW-16083). Can we trace these failures back to after that was introduced?
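
The demangling step can be reproduced with a small script. This is a sketch, assuming the binutils `c++filt` tool is on `PATH` (the helper returns `None` otherwise). `FRAME_RE`, `mangled_symbol`, and `demangle` are names invented for this example. The frame line is copied from the backtrace above.

```python
import re
import shutil
import subprocess

# Matches the "(symbol+0xoffset)" part of a glibc-style backtrace frame.
FRAME_RE = re.compile(r"\((?P<symbol>_Z\w+)\+0x[0-9a-f]+\)")


def mangled_symbol(frame_line):
    """Extract the mangled C++ symbol from one backtrace line, if present."""
    match = FRAME_RE.search(frame_line)
    return match.group("symbol") if match else None


def demangle(symbol):
    """Demangle via c++filt; return None when the tool is not installed."""
    tool = shutil.which("c++filt")
    if tool is None:
        return None
    result = subprocess.run([tool, symbol], capture_output=True,
                            text=True, check=True)
    return result.stdout.strip()


frame = ("/tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client"
         "(_ZNSt3setISt10shared_ptrIN5arrow8DataTypeEESt4lessIS3_ESaIS3_EED1Ev"
         "+0x33)[0x5557a9a83b83]")

symbol = mangled_symbol(frame)
print(symbol)
print(demangle(symbol))
```

Frames that only show a module offset, such as `libc.so.6(+0x8d26e)`, carry no symbol name, so `mangled_symbol` returns `None` for them.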

David Li / @lidavidm:
Oh. The integration test client is 1) linked statically to Arrow and 2) linked dynamically to the test libraries which 3) link Arrow dynamically again. So that could easily explain the double-destructor.
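
To make the suspected mechanism concrete, here is a toy model in Python. It is not real linker or C++ runtime behaviour, only an illustration under one assumption: both linked copies of the library run their static initializer against the same global object, so its destructor ends up registered twice. `ToyHeap` and `simulate` are invented for this sketch.

```python
class ToyHeap:
    """Tracks live blocks and rejects a second free of the same block,
    loosely imitating glibc's double-free check."""

    def __init__(self):
        self._live = set()
        self._next_id = 0

    def malloc(self):
        self._next_id += 1
        self._live.add(self._next_id)
        return self._next_id

    def free(self, block):
        if block not in self._live:
            raise RuntimeError("free(): double free detected")
        self._live.remove(block)


def simulate(copies_of_library):
    """Model process start-up and exit with N linked copies of one library."""
    heap = ToyHeap()
    global_slot = {}     # the single storage location every copy resolves to
    exit_handlers = []   # stands in for the list walked by __cxa_finalize

    def static_init():
        # "Constructor": allocate the container's nodes, register the destructor.
        global_slot["nodes"] = heap.malloc()
        exit_handlers.append(lambda: heap.free(global_slot["nodes"]))

    for _ in range(copies_of_library):
        static_init()    # each copy runs its own static initializer

    for handler in reversed(exit_handlers):
        handler()        # at exit, destructors run in reverse order


simulate(1)              # one copy: clean exit
try:
    simulate(2)          # two copies: the same nodes are freed twice
except RuntimeError as exc:
    print(exc)
```

In the model the second initializer overwrites the first allocation, and both registered destructors then free the same block, which would match an abort surfacing inside the `std::set` destructor called from `__cxa_finalize`.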

David Li / @lidavidm:
I think ARROW-17051 (#13599) incidentally fixes this by properly linking to the testing library statically, though there's one more fix needed to use ARROW_TEST_STATIC_LINK_LIBS explicitly.

David Li / @lidavidm:
Yup, the first failure I see is 06/25, one day after ARROW-16083 was merged; there are sporadic unrelated failures before that.

Krisztian Szucs / @kszucs:
That looks like quite a journey :)

Thanks @lidavidm for figuring it out!

Raúl Cumplido / @raulcd:
https://issues.apache.org/jira/browse/ARROW-17051 was merged yesterday. Should we close this one as a duplicate, wait a couple of nightly runs to validate, or is there something else we think has to be done?

David Li / @lidavidm:
IMO let's make sure nightlies pass before we close this.

Raúl Cumplido / @raulcd:
Ok, I am going to change the fix version though as there's nothing that we are actively doing on this one apart from waiting.

David Li / @lidavidm:
Looks like things have been good for the past couple days?

Raúl Cumplido / @raulcd:
Yes, there haven't been any failures on the verify-rc-source-integration-linux-ubuntu-XX.04-amd64 jobs since ARROW-17051 was merged, and I haven't seen the "double free detected" error anymore.
