CI networking issues #50730
Also, is it happening only for …
@fkorotkov the curl is called from the …
Created #50731 for the cURL issue to add retries.
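For context, curl has built-in retry support that a fix like #50731 could lean on. A minimal sketch, assuming a placeholder URL (the real invocation lives wherever the Flutter tool fetches the Dart SDK):

```sh
# Retry errors curl considers transient (timeouts, HTTP 5xx) up to 5 times.
# --retry-delay waits a fixed 3s between attempts; --connect-timeout bounds
# how long we wait for the TCP connection itself; --fail makes curl exit
# non-zero on HTTP errors so CI actually notices the failure.
curl --retry 5 \
     --retry-delay 3 \
     --connect-timeout 30 \
     --fail --location \
     --output dart-sdk.zip \
     "https://example.com/dart-sdk.zip"  # placeholder URL
```

Note that plain `--retry` only covers errors curl treats as transient; hard connection failures would need an outer retry loop (see the wrapper sketch later in this thread).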
Repeating my comment from the chat on this: I believe we have some kind of general issue where the macOS machines are intermittently having network connectivity issues, and that it is not specific to any task or server. We see this on connections to pub, Google Maven, JCenter Maven, and CocoaPods. I'm not opposed to retrying a failed download; a failure can happen because of a variety of issues. But I suspect there's some deeper issue with the network infrastructure that the Macs are using, which ends up resulting in failed network connections regardless of the server.
Mac networking is less sophisticated than the Linux and Windows networking that runs on GCP, but I've just seen a Gradle failure in a Linux task in #50731 as well: https://cirrus-ci.com/task/4852194834907136?command=main#L1609. Do you see the same issues in …?
IMO adding retries is essential because none of the third-party dependencies like Google Storage, the Maven repository, CocoaPods, or RubyGems has 100% availability. Of all the points above I'm most interested in getting visibility into hanging …
You can run …
One thing we could consider doing as well is having some way to build a total package needed for Flutter CI and uploading it to a source like CIPD or Isolated (Chromium infra projects). The idea is that we could do all of the flaky stuff once, create a "golden" set of artifacts for this commit, and then have all of the other CI use that golden set. CIPD/Isolated should be very reliable, and at the least it would eliminate questioning which one of the many servers we're contacting is down.
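As a rough sketch of that idea, assuming the standard `cipd` CLI and a hypothetical package name, tag, and directory (nothing here is the actual Flutter infra setup):

```sh
# One "builder" job does the flaky downloads once and publishes the result
# as an immutable CIPD package tagged with the commit under test.
flutter precache  # plus pub get, pod repo update, etc., into flutter-deps
cipd create -in "$HOME/flutter-deps" \
    -name flutter/ci/deps \
    -tag "git_revision:$COMMIT_SHA"

# Every other CI job fetches the golden set from CIPD instead of talking
# to pub/Maven/CocoaPods directly.
cipd install flutter/ci/deps "git_revision:$COMMIT_SHA" \
    -root "$HOME/flutter-deps"
```

The design questions raised below still apply: the builder job itself has to survive the flaky network, and the other jobs have to block until the package exists.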
Related to …
@dnfield Would caching to CIPD be inherently more stable/reliable than the caching @jmagman did in #50496 (which had to be reverted because, in populating the cache, we were still hitting network issues, and the cache didn't have timeout/retry logic yet)? It seems to me like it would have the same problems.
@christopherfujino I'm not sure this works, but something like this: …
And to be clear - this is not an entirely trivial undertaking, and I have no proof that it would actually speed up CI or make it more reliable.
I like this idea; Jenn and I had talked about it (though using the Cirrus cache rather than CIPD, having one "precache" task per platform that all other tasks depend on). Not sure if that would be an improvement over having to deflake possibly multiple tasks. Not sure what using CIPD over the Cirrus cache would buy us (maybe upload/download speed or reliability is better to CIPD than to wherever the Cirrus cache lives).
The main thing it would buy us is that we can precisely control what gets cached and why, without worrying about whatever rules Cirrus has in place. The downside is that we have to implement most of those rules ourselves. If the Cirrus caching API meets this need, then that sounds good. I was under the impression we were running into problems where the cache fails on a networking issue.
Where would we build the CIPD cache? On Cirrus, or LUCI? The problem with caching is that when downloading Gradle/CocoaPods/pub/et al. we would time out while trying to create the cache file. Creating the CIPD cache on Cirrus would have the same problem. If we could create the cache on LUCI and upload to CIPD, that would be more reliable; we'd just have to find a way to block the Cirrus tests until the CIPD cache is populated.
I think we have to do something on Cirrus to prevent the jobs from starting prematurely. If the goal is just to reduce potential flakes, this should help. If the goal is to avoid making network calls on Cirrus at all, then I think we might be out of luck. We could do something crazy like use the BuildBucket API to schedule a LUCI job to build the cache and then collect it. We'd still need, at a minimum, Cirrus to have a reliable enough network connection to schedule the job and poll for its finish. And in that case, how much worse is it to have just one Cirrus job build the cache?
IMO, before jumping into implementing custom caching logic, it makes sense to try to improve the current tools to be more network resilient. This will help in any case. There are a couple of reports of hanging … I don't think there will be much flakiness after …
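In the same spirit, while upstream retries land, a lot of this can be approximated at the CI-script level. A minimal sketch of a generic retry wrapper (names and limits are illustrative, not what Flutter CI actually runs):

```sh
# retry <max_attempts> <command...>
# Re-runs the command with exponential backoff until it succeeds or the
# attempt budget is exhausted.
retry() {
  local attempts=$1; shift
  local delay=2 n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "retry: '$*' failed after $attempts attempts" >&2
      return 1
    fi
    echo "retry: attempt $n failed, sleeping ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    n=$((n + 1))
  done
}

retry 5 pub get
retry 5 pod install
```

This doesn't help with hangs, though; for those, each attempt would also need its own timeout (e.g. by wrapping the command with `timeout`).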
Also, the Cirrus agent just started re-downloading caches in case of a corrupted archive, and today/tomorrow there will be an option to disable re-uploads, requested by @jmagman, in order to re-land caching.
Sounds good @fkorotkov - we definitely want to do whatever we can to make these processes more resilient (both for CI and for users - we get bug reports where it's clear a retry would have helped). We're still concerned about the volume of failures we're seeing on Mac. It anecdotally seems worse in the last week or two - lots of networking failures that just work when retried on a new instance. I'm concerned that sometimes the instance is just having trouble making new outbound network connections to sites that are otherwise up.
@fkorotkov we're now on Gradle 5.6 https://github.com/flutter/flutter/pull/50388/files |
Sorry to hear that. I'm trying to get some visibility into networking on the nodes: I'm concerned it may be related to the recent changes to infra. It's hard for me to believe that removing a single gateway in favor of individual 1Gb connections made the situation worse. Plus, some of the issues are happening not only on Macs. I'm working on gathering statistics on which physical hosts the failures are happening on, to see if the failures are evenly distributed.
@christopherfujino indeed! But I saw a stacktrace from 4.10.2 in some build. So 🤞 that Gradle issue is fixed now.
Instead of using one provided by Anka which will proxy the host one. Might be related to flutter/flutter#50730
Have you noticed any changes in flakiness today, by any chance? I wonder if setting DNS servers explicitly in cirruslabs/macos-image-templates@28ee45b, instead of relying on the virtualization environment, helped.
It does seem better, although we're having other issues today that have reduced the volume of commits anyway.
Got it. I'm implementing some tooling/monitoring to help with diagnosing the issue. Just wanted to check in the meantime.
Only host information for macOS tasks for now. Related to flutter/flutter#50730
Just a quick update that Cirrus now reports a few things to help with data around this issue: …
I'm collecting occurrences of the issue, via some manual analysis of Cirrus Agent logs, in this Google Spreadsheet. Feel free to add links to tasks there.
Changed the title, since the issues occur mostly on macOS but not exclusively.
cc @kf6gpe to follow.
I've backfilled some data and added a new chart on the metrics page which shows that flakes started around January 21st, when there was an issue with virtualization that we discussed in #hacker-infra. Because of that issue (which impacted 10-20% of the tasks) I had to upgrade the Mac infrastructure to Catalina and newer virtualization (this was the only suggestion from virtualization support). But as the newly added chart shows, the upgrade only dropped flakiness to around 2% for … The overall situation across all macOS tasks seems more or less stable: it seems only very network-intensive tasks are experiencing more than normal flakiness. On the bright side, I contacted Anka Virtualization support with all the data, and they are releasing a new version of the virtualization tomorrow with fixes that they claim will resolve the issues with hanging sockets. 🤞
@fkorotkov thanks for looking into this and following up with Anka!
Just an update from Anka: they pushed the bug fix release to Tuesday. 😳🤞
I've upgraded the virtualization last night and saw the issue still occurring. I was able to somewhat reliably reproduce the issue on my local Mac Mini and am debugging it with the Anka folks right now. Created #51411 in the meantime while still debugging the issue with them.
BTW, besides the upgrade I tweaked some things on the Mac nodes, and according to the flaky graph there were no flakes today for … (a "flaked" task is defined as a task that failed and then succeeded when manually re-run by a user). Have you noticed improvements? I also noticed that the amount of tasks is lower than usual; hopefully your peer cycle review is not too exhausting. 😅
Mac network flakiness has definitely gone down yesterday and today.
According to the new flakiness metric, the issue actually seems fixed after the virtualization upgrade and some simplification of the network config on the physical nodes. Will close the issue. Please reopen if you see it appearing again.
Thanks for looking into this, Fedor!
@fkorotkov thanks so much for all this work!
Sorry it wasn't detected automatically before and that it took two weeks to propagate from your side. Now there is an alert on the flakiness rate, and in general Cirrus now collects data about flakiness.
This thread has been automatically locked since there has not been any recent activity after it was closed. If you are still experiencing a similar issue, please open a new bug, including the output of …
This is an epic issue for tracking all the networking issues on CI right now.
So far, a couple of scenarios have been identified:
🟩 - there is a viable mitigation already implemented (like retries)
🟨 - there is an alleged mitigation not yet implemented
🟥 - things are unclear
They are also roughly sorted by importance.
1. Hanging `setup` while `pub get`. 🟨
At the moment it's unclear why it's happening: there are no verbose logs from `pub`, and the Cirrus agent only reports occasional `SIGPIPE` errors that are re-tried by the Cirrus agent. Related issues created by @jmagman:
cirruslabs/cirrus-ci-docs#566
cirruslabs/cirrus-ci-docs#567
2. Gradle fails to download files. 🟩
A Gradle upgrade should help with that, since Gradle started re-trying downloads. This was a huge pain for a lot of people using ephemeral CI environments, and it was solved in more recent Gradle versions. Flutter uses Gradle 4.10.2, which was released on Sep 19, 2018.
Fixed in #50388
3. Git clone failure. 🟩
In that case Cirrus automatically re-runs tasks.
4. CocoaPods failures. 🟨
Here is an error from the logs: …
Seems like the same issue as 1.
5. cURL issue while downloading the Dart SDK. 🟩
Here is an error from the logs: …
Can someone point to the exact place where the curl is called?