Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

several tests segfault in --enable-ares CI build #3961

Closed
jay opened this issue May 29, 2019 · 8 comments

Comments

@jay
Copy link
Member

commented May 29, 2019

--enable-ares CI build segfaults in tests 507 534 1553. Failures are consistent. Seems to have started yesterday with a6183ab.

test 0507...[multi interface get with non-existing host name]
sh: line 1: 45622 Segmentation fault: 11  ./libtest/lib507 http://non-existing-host.haxx.se/ > log/stdout507 2> log/stderr507
lib507 returned 139, when expecting 6
 exit FAILED

test 0534...[FTP RETR twice using multi: non-existing host and non-existing file]
sh: line 1: 46056 Segmentation fault: 11  ./libtest/lib533 ftp://non-existing-host.haxx.se/path/534 ftp://127.0.0.1:8992/path/534 > log/stdout534 2> log/stderr534
 534: protocol FAILED:

sh: line 1: 56791 Segmentation fault: 11  ./libtest/lib1553 imap://non-existing-host.haxx.se:9003/1553 > log/stdout1553 2> log/stderr1553
lib1553 returned 139, when expecting 0
 exit FAILED

TESTFAIL: These test cases failed: 507 534 1553 

@jay jay added the build label May 29, 2019

@bagder

This comment has been minimized.

Copy link
Member

commented May 29, 2019

I can't reproduce locally. Neither with my "normal" c-ares build with or without valgrind, neither with my san build with all sorts of sanitizers switched on.

@bagder

This comment has been minimized.

Copy link
Member

commented May 29, 2019

The travis failures are on macOS. I built 09eef8a (current master) on my mac with c-ares and ran the same tests but they work for me. Very similar component versions as well, although not the exact same OS version:

$ ./runtests.pl 507 534 1553
********* System characteristics ******** 
* curl 7.65.1-DEV (x86_64-apple-darwin17.7.0) 
* libcurl/7.65.1-DEV zlib/1.2.11 c-ares/1.15.0 nghttp2/1.38.0
* Features: AsynchDNS Debug HTTP2 IPv6 Largefile libz TrackMemory UnixSockets
* Host: Daniels-mini2
* System: Darwin Daniels-mini2 17.7.0 Darwin Kernel Version 17.7.0: Wed Feb 27 00:43:23 PST 2019; root:xnu-4570.71.35~1/RELEASE_X86_64 x86_64
* Servers: HTTP-IPv6 HTTP-unix FTP-IPv6 
* Env: 
***************************************** 
test 0507...[multi interface get with non-existing host name]
-------em-- OK (1   out of 3  , remaining: 00:04)
test 0534...[FTP RETR twice using multi: non-existing host and non-existing file]
--p----em-- OK (2   out of 3  , remaining: 00:02)
test 1553...[IMAP cleanup before a connection was created]
-------em-- OK (3   out of 3  , remaining: 00:00)
TESTDONE: 3 tests out of 3 reported OK: 100%
TESTDONE: 3 tests were considered during 7 seconds.
@bagder

This comment has been minimized.

Copy link
Member

commented May 29, 2019

The crashes are flaky and seem to happen independently of what we do. During the afternoon today I created a branch off master and had it only build and run the two c-ares builds on travis. That worked fine.

Prior to that I tried to run different kinds of runs on travis for macos only and have them run with lldb and do a stack trace on a crash, but I didn't manage to get them to reproduce the segfaults in those attempts.

Conclusion: they're really fickle. The commit that started the segfault was also seemingly totally unrelated to c-ares and resolving non-existing hostnames, which backs up the theory that this is something weird going on in the travis infra rather than an actual problem in curl code we landed recently.

@bagder

This comment has been minimized.

Copy link
Member

commented May 29, 2019

Also, the travis jobs for #3967 no longer showed the segfaults!

@jay

This comment has been minimized.

Copy link
Member Author

commented May 29, 2019

I restarted a6183ab for just that failing --enable-ares build, we'll see what happens. I've ran test 507 on Ubuntu using cares-1_15_0 and a6183ab w address and undefined sanitizer 100,000 times without fail.

Conclusion: they're really fickle. The commit that started the segfault was also seemingly totally unrelated to c-ares and resolving non-existing hostnames, which backs up the theory that this is something weird going on in the travis infra rather than an actual problem in curl code we landed recently.

Yeah it's weird. I noticed 48b9ea4 from this morning from which a6183ab is an ancestor passes with --enable-ares in CI as well. Yet why was it the same 3 failed tests in multiple commits, that's pretty strange.

@bagder

This comment has been minimized.

Copy link
Member

commented May 29, 2019

The segfaults only happened on the macos builds on travis though, not the linux ones...

@jay

This comment has been minimized.

Copy link
Member Author

commented May 29, 2019

Sure I got it, but I thought it might be a race condition. Anyway the initial one that failed is restarted let's wait and see.

@jay

This comment has been minimized.

Copy link
Member Author

commented May 30, 2019

I restarted T=debug C=--enable-ares for the two oldest commits that had those test failures and they both passed.
https://travis-ci.org/curl/curl/jobs/538446030
https://travis-ci.org/curl/curl/jobs/538465912

I reviewed log_commit_6961322_job_538465912_yesterday.txt compared to log_commit_6961322_job_538465912_today.txt and I don't see any difference other than the test failures.

Given the occasional random weirdness we've seen in Travis builds I'm closing this for now.

@jay jay closed this May 30, 2019

jay referenced this issue Jun 6, 2019

multi: track users of a socket better
They need to be removed from the socket hash linked list with more care.

When sh_delentry() is called to remove a sockethash entry, remove all
individual transfers from the list first. To enable this, each Curl_easy struct
now stores a pointer to the sockethash entry to know how to remove itself.

Reported-by: Tom van der Woerdt and Kunal Ekawde

Fixes #3952
Fixes #3904
Closes #3953

bagder added a commit that referenced this issue Jun 8, 2019

multi: make sure 'data' can present in several sockhash entries
Since more than one socket can be used by each transfer at a given time,
each sockhash entry how has its own hash table with transfers using that
socket.

In addition, the sockhash entry can now be marked 'blocked = TRUE'"
which then makes the delete function just set 'removed = TRUE' instead
of removing it "for real", as a way to not rip out the carpet under the
feet of a parent function that iterates over the transfers of that same
sockhash entry.

Reported-by: Tom van der Woerdt
Fixes #3961
Fixes #3986
Fixes #3995
Closes #3997

bagder added a commit that referenced this issue Jun 10, 2019

multi: make sure 'data' can present in several sockhash entries
Since more than one socket can be used by each transfer at a given time,
each sockhash entry how has its own hash table with transfers using that
socket.

In addition, the sockhash entry can now be marked 'blocked = TRUE'"
which then makes the delete function just set 'removed = TRUE' instead
of removing it "for real", as a way to not rip out the carpet under the
feet of a parent function that iterates over the transfers of that same
sockhash entry.

Reported-by: Tom van der Woerdt
Fixes #3961
Fixes #3986
Fixes #3995
Fixes #4004
Closes #3997
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.