Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI/azure: disable flaky tests in the msys2 builds #5954

Closed
wants to merge 1 commit into from

Conversation

@bagder
Copy link
Member

@bagder bagder commented Sep 11, 2020

They fail almost every CI job.

@bagder bagder added the CI label Sep 11, 2020
@bagder
Copy link
Member Author

@bagder bagder commented Sep 11, 2020

@mback2k don't think we can just as well disable them until we have a better idea on how to make them somewhat reliable?

@mback2k
Copy link
Member

@mback2k mback2k commented Sep 11, 2020

No, this is not correct. They don't fail every time currently due to the flakiness, but something broke the RTSP test 571 while dropping WinSock 1 support. I wanted to investigate that this week(end), but was occupied until today.

@bagder
Copy link
Member Author

@bagder bagder commented Sep 11, 2020

Isn't it strange that only msys2 builds fail that test if its winsock related?

@mback2k
Copy link
Member

@mback2k mback2k commented Sep 11, 2020

Yes, indeed that is very strange. That is why I triggered some CI re-runs at the beginning of the week to identify the problematic commit via some manual CI bisecting.

Turns out this is the last commit that worked according to the CI history:
https://github.com/curl/curl/runs/1046015772 (8b69ac0)

This is currently the first commit that failed according to the CI history:
https://github.com/curl/curl/runs/1047898399 (a337355)

After that test 571 never seems to have worked again on the msys2 based CI builds. I am guess that the following commit is problematic, but since it was pushed with the commit above there was no dedicated CI run yet: 728f8d3

I will trigger one now to see if this guess is true.

@mback2k
Copy link
Member

@mback2k mback2k commented Sep 11, 2020

I manually bisected with my local msys2 and found commit 728f8d3 to be the root cause of test 571 failing with msys2.

The CI run for that commit also confirms this: https://github.com/curl/curl/runs/1103146941
The previous CI run for PR #5893 also shows that: https://github.com/curl/curl/runs/1045669968

@mback2k
Copy link
Member

@mback2k mback2k commented Sep 11, 2020

This is an example output of the failing test run and the shown diff is consistent over multiple test runs:

startnew: perl -I. ./rtspserver.pl --pidfile ".rtsp_server.pid" --portfile ".rtsp_server.port" --logfile "log/rtsp_server.log" --ipv4 --port 0 --srcdir "."
RUN: ../src/curl.exe --max-time 13 --output log/rtsp_verify.out --silent --verbose --globoff "http://127.0.0.1:61528/verifiedserver" 2>log/rtsp_verify.log
CMD (0): ../src/curl.exe --max-time 13 --output log/rtsp_verify.out --silent --verbose --globoff "http://127.0.0.1:61528/verifiedserver" 2>log/rtsp_verify.log
RUN: RTSP server PID 30369 port 61528
* pid rtsp => 30369 1192
test 0571...[RTSP RTP Interleaving Test]
./libtest/lib571.exe rtsp://127.0.0.1:61528/571 log/protofile571.txt >log/stdout571 2>log/stderr571
CMD (7168): ./libtest/lib571.exe rtsp://127.0.0.1:61528/571 log/protofile571.txt >log/stdout571 2>log/stderr571

 571: stdout FAILED:
--- log/check-expected  2020-09-11 21:24:29.320156800 +0200
+++ log/check-generated 2020-09-11 21:24:29.319144400 +0200
@@ -8,6 +8,3 @@
 RTP: message size 798, channel 0[LF]
 RTP: message size 42, channel 0[LF]
 RTP: message size 30, channel 1[LF]
-RTP: message size 2048, channel 0[LF]
-RTP: message size 85, channel 0[LF]
-RTP: message size 24, channel 1[LF]

 - abort tests
@mback2k
Copy link
Member

@mback2k mback2k commented Sep 11, 2020

Confirmed by reverting 728f8d3 on current master 25a25f4 and getting a working test 571 again.

@bagder
Copy link
Member Author

@bagder bagder commented Sep 11, 2020

That makes me wonder why connection reuse would be problematic there and nowhere else? Perhaps the verbose output differences from a working vs a non-working case says something?

If not, sprinkleextract_if_dead with some debug output to figure out exactly what is triggered that makes the test go boom? The function is supposed to detect if the connection is "dead", and if so remove it from the connection cache and not reuse it.

@mback2k
Copy link
Member

@mback2k mback2k commented Sep 12, 2020

That makes me wonder why connection reuse would be problematic there and nowhere else? Perhaps the verbose output differences from a working vs a non-working case says something?

$ diff log.failing/stderr571 log.working/stderr571
1,2c1,2
< URL: rtsp://127.0.0.1:59644/571
< * STATE: INIT => CONNECT handle 0xd9acd8; line 1797 (connection #-5000)
---
> URL: rtsp://127.0.0.1:59666/571
> * STATE: INIT => CONNECT handle 0xd2acd8; line 1797 (connection #-5000)
5,8c5,8
< *   Trying 127.0.0.1:59644...
< * STATE: CONNECT => WAITCONNECT handle 0xd9acd8; line 1858 (connection #0)
< * Connected to 127.0.0.1 (127.0.0.1) port 59644 (#0)
< * STATE: WAITCONNECT => SENDPROTOCONNECT handle 0xd9acd8; line 1988 (connection #0)
---
> *   Trying 127.0.0.1:59666...
> * STATE: CONNECT => WAITCONNECT handle 0xd2acd8; line 1858 (connection #0)
> * Connected to 127.0.0.1 (127.0.0.1) port 59666 (#0)
> * STATE: WAITCONNECT => SENDPROTOCONNECT handle 0xd2acd8; line 1988 (connection #0)
10,11c10,11
< * STATE: SENDPROTOCONNECT => DO handle 0xd9acd8; line 2012 (connection #0)
< > SETUP rtsp://127.0.0.1:59644/5710001 RTSP/1.0
---
> * STATE: SENDPROTOCONNECT => DO handle 0xd2acd8; line 2012 (connection #0)
> > SETUP rtsp://127.0.0.1:59666/5710001 RTSP/1.0
15,16c15,16
< * STATE: DO => DO_DONE handle 0xd9acd8; line 2083 (connection #0)
< * STATE: DO_DONE => PERFORM handle 0xd9acd8; line 2204 (connection #0)
---
> * STATE: DO => DO_DONE handle 0xd2acd8; line 2083 (connection #0)
> * STATE: DO_DONE => PERFORM handle 0xd2acd8; line 2204 (connection #0)
23c23
< * STATE: PERFORM => DONE handle 0xd9acd8; line 2394 (connection #0)
---
> * STATE: PERFORM => DONE handle 0xd2acd8; line 2394 (connection #0)
26,28c26,28
< * Expire cleared (transfer 0xd9acd8)
< * STATE: INIT => CONNECT handle 0xd9acd8; line 1797 (connection #-5000)
< * Found bundle for host 127.0.0.1: 0xd9eae8 [serially]
---
> * Expire cleared (transfer 0xd2acd8)
> * STATE: INIT => CONNECT handle 0xd2acd8; line 1797 (connection #-5000)
> * Found bundle for host 127.0.0.1: 0xd2eae8 [serially]
30,32c30,32
< * Connected to 127.0.0.1 (127.0.0.1) port 59644 (#0)
< * STATE: CONNECT => DO handle 0xd9acd8; line 1851 (connection #0)
< > PLAY rtsp://127.0.0.1:59644/5710002 RTSP/1.0
---
> * Connected to 127.0.0.1 (127.0.0.1) port 59666 (#0)
> * STATE: CONNECT => DO handle 0xd2acd8; line 1851 (connection #0)
> > PLAY rtsp://127.0.0.1:59666/5710002 RTSP/1.0
36,37c36,37
< * STATE: DO => DO_DONE handle 0xd9acd8; line 2083 (connection #0)
< * STATE: DO_DONE => PERFORM handle 0xd9acd8; line 2204 (connection #0)
---
> * STATE: DO => DO_DONE handle 0xd2acd8; line 2083 (connection #0)
> * STATE: DO_DONE => PERFORM handle 0xd2acd8; line 2204 (connection #0)
44c44
< * STATE: PERFORM => DONE handle 0xd9acd8; line 2394 (connection #0)
---
> * STATE: PERFORM => DONE handle 0xd2acd8; line 2394 (connection #0)
47,49c47,49
< * Expire cleared (transfer 0xd9acd8)
< * STATE: INIT => CONNECT handle 0xd9acd8; line 1797 (connection #-5000)
< * Found bundle for host 127.0.0.1: 0xd9eae8 [serially]
---
> * Expire cleared (transfer 0xd2acd8)
> * STATE: INIT => CONNECT handle 0xd2acd8; line 1797 (connection #-5000)
> * Found bundle for host 127.0.0.1: 0xd2eae8 [serially]
51,53c51,53
< * Connected to 127.0.0.1 (127.0.0.1) port 59644 (#0)
< * STATE: CONNECT => DO handle 0xd9acd8; line 1851 (connection #0)
< > DESCRIBE rtsp://127.0.0.1:59644/5710003 RTSP/1.0
---
> * Connected to 127.0.0.1 (127.0.0.1) port 59666 (#0)
> * STATE: CONNECT => DO handle 0xd2acd8; line 1851 (connection #0)
> > DESCRIBE rtsp://127.0.0.1:59666/5710003 RTSP/1.0
58,59c58,59
< * STATE: DO => DO_DONE handle 0xd9acd8; line 2083 (connection #0)
< * STATE: DO_DONE => PERFORM handle 0xd9acd8; line 2204 (connection #0)
---
> * STATE: DO => DO_DONE handle 0xd2acd8; line 2083 (connection #0)
> * STATE: DO_DONE => PERFORM handle 0xd2acd8; line 2204 (connection #0)
68a69
> * RTP Rewinding 754 (READMORE)
78c79
< * STATE: PERFORM => DONE handle 0xd9acd8; line 2394 (connection #0)
---
> * STATE: PERFORM => DONE handle 0xd2acd8; line 2394 (connection #0)
81,83c82,84
< * Expire cleared (transfer 0xd9acd8)
< * STATE: INIT => CONNECT handle 0xd9acd8; line 1797 (connection #-5000)
< * Found bundle for host 127.0.0.1: 0xd9eae8 [serially]
---
> * Expire cleared (transfer 0xd2acd8)
> * STATE: INIT => CONNECT handle 0xd2acd8; line 1797 (connection #-5000)
> * Found bundle for host 127.0.0.1: 0xd2eae8 [serially]
85,87c86,88
< * Connected to 127.0.0.1 (127.0.0.1) port 59644 (#0)
< * STATE: CONNECT => DO handle 0xd9acd8; line 1851 (connection #0)
< > PLAY rtsp://127.0.0.1:59644/5710004 RTSP/1.0
---
> * Connected to 127.0.0.1 (127.0.0.1) port 59666 (#0)
> * STATE: CONNECT => DO handle 0xd2acd8; line 1851 (connection #0)
> > PLAY rtsp://127.0.0.1:59666/5710004 RTSP/1.0
91,92c92,93
< * STATE: DO => DO_DONE handle 0xd9acd8; line 2083 (connection #0)
< * STATE: DO_DONE => PERFORM handle 0xd9acd8; line 2204 (connection #0)
---
> * STATE: DO => DO_DONE handle 0xd2acd8; line 2083 (connection #0)
> * STATE: DO_DONE => PERFORM handle 0xd2acd8; line 2204 (connection #0)
103c104
< * STATE: PERFORM => DONE handle 0xd9acd8; line 2394 (connection #0)
---
> * STATE: PERFORM => DONE handle 0xd2acd8; line 2394 (connection #0)
106c107
< * Expire cleared (transfer 0xd9acd8)
---
> * Expire cleared (transfer 0xd2acd8)
109,110c110,111
< * STATE: INIT => CONNECT handle 0xd9acd8; line 1797 (connection #-5000)
< * Found bundle for host 127.0.0.1: 0xd9eae8 [serially]
---
> * STATE: INIT => CONNECT handle 0xd2acd8; line 1797 (connection #-5000)
> * Found bundle for host 127.0.0.1: 0xd2eae8 [serially]
112,116c113,116
< * Connected to 127.0.0.1 (127.0.0.1) port 59644 (#0)
< * STATE: CONNECT => DO handle 0xd9acd8; line 1851 (connection #0)
< * STATE: DO => DO_DONE handle 0xd9acd8; line 2083 (connection #0)
< * STATE: DO_DONE => PERFORM handle 0xd9acd8; line 2204 (connection #0)
< * RTP Rewinding 600 (READMORE)
---
> * Connected to 127.0.0.1 (127.0.0.1) port 59666 (#0)
> * STATE: CONNECT => DO handle 0xd2acd8; line 1851 (connection #0)
> * STATE: DO => DO_DONE handle 0xd2acd8; line 2083 (connection #0)
> * STATE: DO_DONE => PERFORM handle 0xd2acd8; line 2204 (connection #0)
123,124c123,124
< * RTP Rewinding 318 (READMORE)
< * STATE: PERFORM => DONE handle 0xd9acd8; line 2394 (connection #0)
---
> * RTP Rewinding 118 (READMORE)
> * STATE: PERFORM => DONE handle 0xd2acd8; line 2394 (connection #0)
127c127
< * Expire cleared (transfer 0xd9acd8)
---
> * Expire cleared (transfer 0xd2acd8)
129,151c129,145
< * STATE: INIT => CONNECT handle 0xd9acd8; line 1797 (connection #-5000)
< * Found bundle for host 127.0.0.1: 0xd9eae8 [serially]
< * Connection 0 seems to be dead!
< * The cache now contains 0 members
< * Closing connection 0
< * Added connection 1. The cache now contains 1 members
< * Hostname 127.0.0.1 was found in DNS cache
< * family0 == v4, family1 == v6
< *   Trying 127.0.0.1:59644...
< * STATE: CONNECT => WAITCONNECT handle 0xd9acd8; line 1858 (connection #1)
< * Connected to 127.0.0.1 (127.0.0.1) port 59644 (#1)
< * STATE: WAITCONNECT => SENDPROTOCONNECT handle 0xd9acd8; line 1988 (connection #1)
< * Marked for [keep alive]: HTTP default
< * STATE: SENDPROTOCONNECT => DO handle 0xd9acd8; line 2012 (connection #1)
< * STATE: DO => DO_DONE handle 0xd9acd8; line 2083 (connection #1)
< * STATE: DO_DONE => PERFORM handle 0xd9acd8; line 2204 (connection #1)
< * Operation timed out after 3012 milliseconds with 0 bytes received
< * Marked for [closure]: Disconnected with pending data
< * multi_done
< * Got an RTP Receive with a CSeq of 0
< * The cache now contains 0 members
< * Closing connection 1
< * Expire cleared (transfer 0xd9acd8)
---
> * STATE: INIT => CONNECT handle 0xd2acd8; line 1797 (connection #-5000)
> * Found bundle for host 127.0.0.1: 0xd2eae8 [serially]
> * Re-using existing connection! (#0) with host 127.0.0.1
> * Connected to 127.0.0.1 (127.0.0.1) port 59666 (#0)
> * STATE: CONNECT => DO handle 0xd2acd8; line 1851 (connection #0)
> * STATE: DO => DO_DONE handle 0xd2acd8; line 2083 (connection #0)
> * STATE: DO_DONE => PERFORM handle 0xd2acd8; line 2204 (connection #0)
> * RTP write channel 0 rtp_length 2048
> packet count is 11
> * RTP write channel 0 rtp_length 85
> packet count is 12
> * RTP write channel 1 rtp_length 24
> packet count is 13
> * STATE: PERFORM => DONE handle 0xd2acd8; line 2394 (connection #0)
> * multi_done
> * Connection #0 to host 127.0.0.1 left intact
> * Expire cleared (transfer 0xd2acd8)

The first difference is this only showing for the working case: > * RTP Rewinding 754 (READMORE)
For the failing case it is shown later as < * RTP Rewinding 600 (READMORE), leading to this difference:

123,124c123,124
< * RTP Rewinding 318 (READMORE)
< * STATE: PERFORM => DONE handle 0xd9acd8; line 2394 (connection #0)
---
> * RTP Rewinding 118 (READMORE)
> * STATE: PERFORM => DONE handle 0xd2acd8; line 2394 (connection #0)

And the trailer of the stderr output logfile is basically completely different after that. The failing case also shows:

< * Operation timed out after 3012 milliseconds with 0 bytes received
< * Marked for [closure]: Disconnected with pending data
@bagder
Copy link
Member Author

@bagder bagder commented Sep 12, 2020

I believe the key is the * Connection 0 seems to be dead! output it one of them that isn't in the other. That's also a strong clue that the extract_if_dead really is to blame... Or perhaps that something on that platform makes extract_if_dead misbehave.

@mback2k
Copy link
Member

@mback2k mback2k commented Sep 13, 2020

Seems like there is really something different with the connection. This is the diff of the RTSP test server logs (without timestamps):

$ diff log.failing/rtsp_server.txt log.working/rtsp_server.txt
1,3c1,3
< Running IPv4 version on port 59644
< Wrote pid 75868 to .rtsp_server.pid
< Wrote port 59644 to .rtsp_server.port
---
> Running IPv4 version on port 59666
> Wrote pid 66668 to .rtsp_server.pid
> Wrote port 59666 to .rtsp_server.port
7c7
< 127.0.0.1:59644
---
> 127.0.0.1:59666
17,18c17,18
< Sent off 68 bytes
< Response sent (68 bytes) and written to log/server.response
---
> Sent off 67 bytes
> Response sent (67 bytes) and written to log/server.response
23c23
< ProcessRequest() called with testno -7 and line [SETUP rtsp://127.0.0.1:59644/5710001 RTSP/1.0
---
> ProcessRequest() called with testno -7 and line [SETUP rtsp://127.0.0.1:59666/5710001 RTSP/1.0
28c28
< Got request: SETUP rtsp://127.0.0.1:59644/5710001 RTSP/1.0
---
> Got request: SETUP rtsp://127.0.0.1:59666/5710001 RTSP/1.0
39c39
< ProcessRequest() called with testno -7 and line [PLAY rtsp://127.0.0.1:59644/5710002 RTSP/1.0
---
> ProcessRequest() called with testno -7 and line [PLAY rtsp://127.0.0.1:59666/5710002 RTSP/1.0
44c44
< Got request: PLAY rtsp://127.0.0.1:59644/5710002 RTSP/1.0
---
> Got request: PLAY rtsp://127.0.0.1:59666/5710002 RTSP/1.0
66c66
< ProcessRequest() called with testno -7 and line [DESCRIBE rtsp://127.0.0.1:59644/5710003 RTSP/1.0
---
> ProcessRequest() called with testno -7 and line [DESCRIBE rtsp://127.0.0.1:59666/5710003 RTSP/1.0
72c72
< Got request: DESCRIBE rtsp://127.0.0.1:59644/5710003 RTSP/1.0
---
> Got request: DESCRIBE rtsp://127.0.0.1:59666/5710003 RTSP/1.0
88c88
< ProcessRequest() called with testno -7 and line [PLAY rtsp://127.0.0.1:59644/5710004 RTSP/1.0
---
> ProcessRequest() called with testno -7 and line [PLAY rtsp://127.0.0.1:59666/5710004 RTSP/1.0
93c93
< Got request: PLAY rtsp://127.0.0.1:59644/5710004 RTSP/1.0
---
> Got request: PLAY rtsp://127.0.0.1:59666/5710004 RTSP/1.0
116,118d115
< recv() returned error: (10054) Unknown error
< ====> Client disconnect
< ====> Client connect
@mback2k
Copy link
Member

@mback2k mback2k commented Sep 13, 2020

@bagder I don't yet get where exactly the difference in behavior is coming from. I am attaching a PCAP with one failed test run and a successful test run afterwards: 517.zip - maybe you can take a look at it?

@mback2k
Copy link
Member

@mback2k mback2k commented Sep 13, 2020

And here is another set of complete log data and corresponding packet dumps stored as dump.pcap in the files:

@bagder
Copy link
Member Author

@bagder bagder commented Sep 14, 2020

I don't think you get the answer to this by looking at PCAPs.

The answer most likely lies within extract_if_dead. In particular I would suspect that SocketIsDead() returns TRUE on this platform/situation, indicating that the socket signals that it is readable - which it really shouldn't if it is indeed still alive. I think this is either a socket/select problem or a bad test case.

@mback2k
Copy link
Member

@mback2k mback2k commented Sep 15, 2020

I have the feeling that the socket is readable, because there is still data pending to be read. I will try to add some more debug output and check if that is the case.

@bagder
Copy link
Member Author

@bagder bagder commented Sep 15, 2020

Yeah, and if that really is the case, the test is flawed...

@mback2k
Copy link
Member

@mback2k mback2k commented Sep 15, 2020

Since I don't know how fast I can further investigate the issue this week, you may also go ahead and just ignore the result of test 571 for the msys2 builds by adding ~571 to the TFLAGS for these. I can then remove that during a new PR.

@mback2k mback2k changed the title azure: disable the 4 msys2 builds CI/azure: disable the 4 msys2 builds Sep 15, 2020
@bagder
Copy link
Member Author

@bagder bagder commented Sep 15, 2020

Good idea!

@bagder bagder force-pushed the bagder/azure-disable-msys2 branch from af92abe to ab94f8b Sep 15, 2020
@bagder bagder changed the title CI/azure: disable the 4 msys2 builds CI/azure: disable flaky tests in the msys2 builds Sep 15, 2020
@bagder
Copy link
Member Author

@bagder bagder commented Sep 15, 2020

Okay, this now instead shows the problem I'm fixing I already fixed in #5962

Closes #5954
@bagder bagder force-pushed the bagder/azure-disable-msys2 branch from ab94f8b to 14bf6b9 Sep 15, 2020
@bagder
Copy link
Member Author

@bagder bagder commented Sep 16, 2020

it looks better. Out of the 4 msys2 builds that often fails, now only two failed. One on test 190 and one on test 1527...

@bagder
Copy link
Member Author

@bagder bagder commented Sep 16, 2020

@mback2k I presume you're 👍 with merging this as at least a step along the way?

@mback2k
Copy link
Member

@mback2k mback2k commented Sep 16, 2020

👍 with me, I would just put the CI/ prefix at the beginning of the commit message, like with past commits to this stuff.

@bagder bagder closed this in 1556b99 Sep 16, 2020
@bagder bagder deleted the bagder/azure-disable-msys2 branch Sep 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

2 participants
You can’t perform that action at this time.