
CI networking issues #50730

Closed
fkorotkov opened this issue Feb 13, 2020 · 38 comments
Labels
platform-mac (Building on or for macOS specifically) · team (Infra upgrades, team productivity, code health, technical debt) · team-infra (Owned by Infrastructure team)

Comments

@fkorotkov
Contributor

fkorotkov commented Feb 13, 2020

This is an epic issue to track all the networking issues on CI right now.

So far a couple of scenarios have been identified.

🟩 - there is a viable mitigation already implemented (like retries)
🟨 - there is an alleged mitigation not yet implemented.
🟥 - things are unclear

I've also tried to sort them by importance.

1. Hanging setup during pub get. 🟨

At the moment it's unclear why this is happening: there are no verbose logs from pub, and the Cirrus agent only reports occasional SIGPIPE errors that it retries.

Related issues created by @jmagman:

cirruslabs/cirrus-ci-docs#566
cirruslabs/cirrus-ci-docs#567

2. Gradle fails to download files. 🟩

A Gradle upgrade should help with that, since newer Gradle versions retry downloads. Flaky downloads were a huge pain for a lot of people using ephemeral CI environments and were fixed in more recent Gradle releases. Flutter uses Gradle 4.10.2, which was released on Sep 19, 2018.

Fixed in #50388
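
For reference, a minimal sketch of how such a wrapper bump is usually done, targeting the 5.6 line mentioned later in this thread (the exact version and flags used in #50388 are not restated here):

    # Regenerate the Gradle wrapper against a release line that retries flaky downloads.
    # Run from the directory that owns gradle/wrapper/gradle-wrapper.properties.
    ./gradlew wrapper --gradle-version 5.6
    # Verify the pinned distribution URL afterwards.
    cat gradle/wrapper/gradle-wrapper.properties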

3. Git clone failure. 🟩

In this case Cirrus automatically re-runs the task.

4. CocoaPods failures 🟨

Here is an error from the logs:

[!] Couldn't determine repo type for URL: `https://cdn.cocoapods.org/`: Failed to open TCP connection to cdn.cocoapods.org:443 (getaddrinfo: nodename nor servname provided, or not known)

Seems like the same issue as #1.

5. cURL issue while downloading the Dart SDK 🟩

Here is an error from the logs:

curl: (56) LibreSSL SSL_read: SSL_ERROR_SYSCALL, errno 54
Failed to retrieve the Dart SDK from: https://storage.googleapis.com/flutter_infra/flutter/e0ebaea59071b35a44dbe1e0830ee15fb7563486/dart-sdk-darwin-x64.zip

Can someone point to the exact place where curl is called?

@fkorotkov changed the title from "CI networking issue" to "macOS CI networking issue" Feb 13, 2020
@fkorotkov
Contributor Author

Also, is it happening only for macOS tasks? Last week's infrastructure migration removed the single firewall that all macOS nodes were using; now each node has a 1Gb connection instead of going through a shared switch.

@christopherfujino
Member

@fkorotkov the curl call is in the bin/internal/update_dart_sdk.sh script (may not be exact, I'm on my phone on the bus).

@fkorotkov mentioned this issue Feb 13, 2020
@VladyslavBondarenko added the team, team-infra, and platform-mac labels Feb 13, 2020
@fkorotkov
Contributor Author

Created #50731 for the cURL issue to add retries.
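
For illustration, a retry around the Dart SDK download could look roughly like the sketch below - a plain shell loop (curl's own --retry flag is another option). This is not a quote of #50731, and the placeholder URL stands in for the real engine-hash path:

    # Hypothetical retry wrapper around the SDK download; not the actual patch.
    url="https://storage.googleapis.com/flutter_infra/flutter/<engine-hash>/dart-sdk-darwin-x64.zip"
    for attempt in 1 2 3; do
      curl --continue-at - --location --fail --output /tmp/dart-sdk.zip "$url" && break
      echo "Download failed (attempt $attempt of 3), retrying in 5s..." >&2
      sleep 5
    done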

@dnfield
Contributor

dnfield commented Feb 13, 2020

Repeating my comment from the chat on this: I believe we have some kind of general issue where the macOS machines are intermittently having network connectivity issues, and that it is not specific to any task or server. We see this on connections to pub, Google's Maven repository, JCenter, and CocoaPods.

I'm not opposed to retrying a failed download; downloads can fail for a variety of reasons. But I suspect there's some deeper issue with the network infrastructure that the Macs are using, which results in failed network connections regardless of the server.

@fkorotkov
Contributor Author

Mac networking is less sophisticated than the Linux and Windows setups that run on GCP, but I've just seen a Gradle failure in a Linux task in #50731 as well: https://cirrus-ci.com/task/4852194834907136?command=main#L1609.

Do you see the same issues in plugins repo?

Flutter CI is heavily network-intensive. When you run locally or in a non-clean environment (where artifacts from previous runs are persisted) you don't download that many things every time; you usually have most of them already available.

IMO adding retries is essential because none of the third-party dependencies like Google Storage, Maven repositories, CocoaPods, or RubyGems has 100% availability.

Of all the points above I'm most interested in getting visibility into the hanging pub issue, because IMO it's the most disturbing one. Do you have any ideas on how to get more verbose output from pub?

@dnfield
Contributor

dnfield commented Feb 13, 2020

You can run pub -v (or flutter pub get -v).
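
On CI that could be wired up roughly as in the sketch below; the timeout and log path are illustrative additions, not existing Flutter tooling:

    # Illustrative: capture verbose pub output so a hang leaves a trace, and bound the wait.
    # timeout(1) is GNU coreutils; on stock macOS use gtimeout from `brew install coreutils`.
    timeout 600 flutter pub get -v 2>&1 | tee /tmp/pub_get_verbose.log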

@dnfield
Contributor

dnfield commented Feb 13, 2020

One thing we could consider doing as well is having some way to build a total package needed for Flutter CI and uploading it to a source like CIPD or Isolated (Chromium infra projects).

The idea is that we could do all of the flaky stuff once, create a "golden" set of artifacts for this commit, and then have all of the other CI use that golden set. CIPD/Isolated should be very reliable - and at the least, it would eliminate questioning which one of the many servers we're contacting is down.

@fkorotkov
Contributor Author

Related to the pub issue: @Hixie hit it in a Windows task as well: dart-lang/pub#2257

@christopherfujino
Member

One thing we could consider doing as well is having some way to build a total package needed for Flutter CI and uploading it to a source like CIPD or Isolated (Chromium infra projects).

The idea is that we could do all of the flaky stuff once, create a "golden" set of artifacts for this commit, and then have all of the other CI use that golden set. CIPD/Isolated should be very reliable - and at the least, it would eliminate questioning which one of the many servers we're contacting is down.

@dnfield Would caching to CIPD be inherently more stable/reliable than the caching @jmagman did in #50496 (which had to be reverted because in populating the cache we were still hitting network issues, but the cache didn't have timeout/retry logic yet)? It seems to me like it would have the same problems.

@dnfield
Contributor

dnfield commented Feb 13, 2020

@christopherfujino I'm not sure this works but something like this:

  1. We have a Cirrus task that all others depend on that prepares the cached artifacts.
  2. Once it is done, all the others use the output of that task.
  3. It can be retried as often as we like, and it's not tied to any Cirrus specific caching rules - it's basically just our own custom caching task.
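
A rough sketch of what such a cache-producing task might run, under the assumption that a tarball of the warmed caches gets published somewhere the other tasks can fetch it; the bucket name and cache directories are illustrative, not an existing Flutter setup:

    # Hypothetical "golden cache" task: do the flaky downloads once, then publish one archive
    # that every other task consumes instead of talking to pub/Maven/CocoaPods directly.
    set -euo pipefail
    flutter precache            # engine artifacts
    flutter update-packages     # pub packages for the framework repo, into ~/.pub-cache
    tar -czf golden-cache.tgz -C "$HOME" .pub-cache .gradle
    # Destination is made up for illustration; CIPD, Isolated, or a GCS bucket would all serve.
    gsutil cp golden-cache.tgz "gs://example-flutter-ci-cache/${CIRRUS_CHANGE_IN_REPO}/golden-cache.tgz"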

@dnfield
Contributor

dnfield commented Feb 13, 2020

And to be clear - this is not an entirely trivial undertaking, and I have no proof that it would actually speed up CI or make it more reliable.

@christopherfujino
Member

@christopherfujino I'm not sure this works but something like this:

  1. We have a Cirrus task that all others depend on that prepares the cached artifacts.
  2. Once it is done, all the others use the output of that task.
  3. It can be retried as often as we like, and it's not tied to any Cirrus specific caching rules - it's basically just our own custom caching task.

I like this idea; Jenn and I had talked about it (though using the Cirrus cache rather than CIPD, having one "precache" task per platform that all other tasks depend on). Not sure if that would be an improvement over having to deflake possibly multiple tasks. Not sure what using CIPD over the Cirrus cache would buy us (maybe up/down speed or reliability is better to CIPD than wherever the Cirrus cache lives).

@dnfield
Contributor

dnfield commented Feb 13, 2020

The main thing it would buy us is that we can precisely control what gets cached and why, without worrying about whatever rules Cirrus has in place.

The downside is that we have to implement most of those rules ourselves. If the Cirrus caching API meets this need then that sounds good. I was under the impression we were running into problems where the cache fails on a networking issue.

@christopherfujino
Member

Where would we build the CIPD cache? On Cirrus, or LUCI?

The problem with caching is that when downloading gradle/cocoapods/pub/et al. we would time out while trying to create the cache file. Creating the CIPD cache on Cirrus would have this same problem. If we could create the cache on LUCI and upload to CIPD, that would be more reliable; we'd just have to find a way to block the Cirrus tests until the CIPD cache is populated.

@dnfield
Contributor

dnfield commented Feb 13, 2020

I think we have to do something on Cirrus to prevent the jobs from starting prematurely.

If the goal is just to reduce potential flakes, this should help.

If the goal is to avoid making network calls on Cirrus at all, then I think we might be out of luck. We could do something crazy like use the BuildBucket API to schedule a LUCI job to build the cache and then collect it. We'd still need, at a minimum, Cirrus to have a reliable enough network connection to schedule the job and poll for its completion. And in that case, how much worse is it to have just one Cirrus job build the cache?
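
One crude way to do the blocking, sketched under the assumption that the externally built cache drops an object at a known, pollable location (the bucket and path here are invented):

    # Hypothetical: wait for the externally built cache for this commit before running tests.
    marker="gs://example-flutter-ci-cache/${CIRRUS_CHANGE_IN_REPO}/golden-cache.tgz"
    for i in $(seq 1 60); do
      gsutil -q stat "$marker" && break
      echo "Cache not ready yet ($i/60); sleeping 30s..."
      sleep 30
    done
    gsutil -q stat "$marker"    # fail the task if the cache never showed up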

@fkorotkov
Contributor Author

IMO, before jumping into implementing custom caching logic it makes sense to try to make the current tools more network-resilient. This will help in any case.

There are a couple of reports of hanging pub unrelated to Cirrus. Upgrading Gradle to 5.x or 6.x will help a lot (in my personal experience the 4.x versions were pretty unreliable, especially with public artifact repositories).

I don't think there will be much flakiness left after the pub and Gradle situations are improved.

@fkorotkov
Contributor Author

Also, the Cirrus agent just started re-downloading caches in case of a corrupted archive, and today/tomorrow there will be an option to disable re-uploads, requested by @jmagman in order to re-land caching.

@dnfield
Contributor

dnfield commented Feb 13, 2020

Sounds good @fkorotkov - we definitely want to do whatever we can to make these processes more resilient (both for CI and for users - we get bug reports where it's clear a retry would have helped).

We're still concerned about the volume of failures we're seeing on Mac. It anecdotally seems worse in the last week or two - lots of networking failures that just work when retried on a new instance. I'm concerned that sometimes the instance is just having trouble making new outbound network connections to sites that are otherwise up.

@christopherfujino
Member

@fkorotkov we're now on Gradle 5.6 https://github.com/flutter/flutter/pull/50388/files

@fkorotkov
Contributor Author

Sorry to hear that. I'm trying to get some visibility into networking on the nodes: I'm concerned it might be related to the recent infra changes. It's hard for me to believe that removing a single gateway in favor of individual 1Gb connections made the situation worse. Plus some of the issues are happening not only on Macs.

I'm working on gathering statistics on which physical hosts the failures are happening on, to see whether they are evenly distributed.

@fkorotkov
Contributor Author

@christopherfujino indeed! But I saw a stack trace from 4.10.2 in some build. So 🤞 that the Gradle issue is fixed now.

fkorotkov added a commit to cirruslabs/macos-image-templates that referenced this issue Feb 14, 2020
Instead of using one provided by Anka which will proxy the host one.

Might be related to flutter/flutter#50730
@fkorotkov
Contributor Author

Have you noticed any changes in flakiness today by any chance? I wonder if setting DNS servers explicitly in cirruslabs/macos-image-templates@28ee45b instead of relying on the virtualization environment helped.
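
For reference, pinning resolvers on a macOS guest looks roughly like this; the network service name and DNS addresses below are examples, not necessarily what the linked commit uses:

    # Illustrative: set explicit DNS servers instead of inheriting them from the virtualization layer.
    networksetup -setdnsservers "Ethernet" 8.8.8.8 8.8.4.4
    # Confirm what the resolver configuration ended up being.
    networksetup -getdnsservers "Ethernet"
    scutil --dns | head -n 20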

@dnfield
Contributor

dnfield commented Feb 14, 2020

It does seem better, although we're having other issues today that have reduced the volume of commits anyway.

@fkorotkov
Contributor Author

Got it. I'm implementing some tooling/monitoring to help with diagnosing the issue. Just wanted to check in the meantime.

fkorotkov added a commit to cirruslabs/cirrus-ci-web that referenced this issue Feb 15, 2020
Only host information for macOS tasks for now. Related to flutter/flutter#50730
@fkorotkov
Contributor Author

Just a quick update that Cirrus now reports a few things to help with data around this issue:

  • Metrics and reporting of flakes (when a manual re-run succeeds).
  • Metrics and reporting of missing/chunked logs.
  • For macOS tasks Cirrus now shows the physical host where the task was executed, to see whether failures tend to happen on a particular subset of hosts.

I'm collecting occurrences of the issue with some manual analysis of Cirrus Agent logs in this Google Spreadsheet. Feel free to add links to tasks there.

@fkorotkov changed the title from "macOS CI networking issue" to "CI networking issues" Feb 19, 2020
@fkorotkov
Contributor Author

Changed the title since the issues occur mostly on macOS but not exclusively.

@kf6gpe
Contributor

kf6gpe commented Feb 19, 2020

cc @kf6gpe to follow.

@fkorotkov
Contributor Author

I've backfilled some data and added a new chart on the metrics page, which shows that flakes started around January 21st, when there was an issue with virtualization that we discussed in #hacker-infra:

[chart screenshot]

Because of that issue (which impacted 10-20% of the tasks) I had to upgrade the Mac infrastructure to Catalina and a newer virtualization version (this was the only suggestion from virtualization support). But as the newly added chart shows, the upgrade only dropped flakiness to around 2% for the flutter/flutter repository:

[chart screenshot]

The overall situation across all macOS tasks seems more or less stable:

[chart screenshot]

It seems only very network-intensive tasks are experiencing more than normal flakiness.

On the bright side, I contacted Anka virtualization support with all the data and they are releasing a new version of the virtualization tomorrow with fixes that they claim will resolve the issues with hanging sockets. 🤞

@christopherfujino
Member

@fkorotkov thanks for looking into this and following up with Anka!

@fkorotkov
Contributor Author

Just an update from Anka: they pushed the bug-fix release to Tuesday. 😳🤞

@fkorotkov
Contributor Author

I upgraded the virtualization last night and saw the issue still occurring. I was able to somewhat reliably reproduce the issue on my local Mac mini and am debugging it with the Anka folks right now.

Created #51411 in the meantime while still debugging the issue with them.

@fkorotkov
Contributor Author

fkorotkov commented Feb 26, 2020

BTW, besides the upgrade I tweaked some things on the Mac nodes, and according to the flaky-tasks graph there were no flakes today for flutter/flutter: https://cirrus-ci.com/metrics/repository/flutter/flutter

A flaked task is defined as a failed task whose manual re-run, triggered by a user, succeeded.

[chart screenshot]

Have you noticed improvements? I also noticed that the number of tasks is lower than usual; hopefully your peer review cycle is not too exhausting. 😅

@jmagman
Member

jmagman commented Feb 26, 2020

Mac network flakiness has definitely gone down yesterday and today.

@fkorotkov
Contributor Author

According to the new flakiness metric the issue actually seems fixed after the virtualization upgrade and some simplification of the network config on the physical nodes. I will close the issue. Please reopen it if you see it appearing again.

@dnfield
Contributor

dnfield commented Mar 2, 2020

Thanks for looking into this Fedor!

@christopherfujino
Member

@fkorotkov thanks so much for all this work!

@fkorotkov
Contributor Author

Sorry it wasn't detected automatically before and that it took two weeks of reports from your side. Now there is an alert on the flakiness rate, and in general Cirrus now collects data about flakiness.

@github-actions

This thread has been automatically locked since there has not been any recent activity after it was closed. If you are still experiencing a similar issue, please open a new bug, including the output of flutter doctor -v and a minimal reproduction of the issue.

@github-actions bot locked as resolved and limited conversation to collaborators Aug 22, 2021