
CI networking issues #50730

Closed
fkorotkov opened this issue Feb 13, 2020 · 38 comments
Labels
platform-mac (Building on or for macOS specifically) · team (Infra upgrades, team productivity, code health, technical debt) · team-infra (Owned by Infrastructure team)

Comments

@fkorotkov
Contributor

fkorotkov commented Feb 13, 2020

This is an epic issue to track all the networking issues on CI right now.

So far a couple of scenarios have been identified.

🟩 - there is a viable mitigation already implemented (like retries)
🟨 - there is an alleged mitigation not yet implemented.
🟥 - things are unclear

I've also tried to sort them by importance.

1. Hanging setup during pub get. 🟨

At the moment it's unclear why this is happening: there are no verbose logs from pub, and the Cirrus agent only reports occasional SIGPIPE errors that it retries.

Related issues created by @jmagman:

cirruslabs/cirrus-ci-docs#566
cirruslabs/cirrus-ci-docs#567

2. Gradle fails to download files. 🟩

A Gradle upgrade should help with that, since newer Gradle versions retry downloads. Flaky downloads were a huge pain for a lot of people using ephemeral CI environments and were fixed in more recent Gradle releases. Flutter uses Gradle 4.10.2, which was released on Sep 19, 2018.

Fixed in #50388
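
For reference, a minimal sketch of how such a wrapper bump is usually done, targeting the 5.6 line mentioned later in this thread (the exact version and flags used in #50388 are not restated here):

    # Regenerate the Gradle wrapper against a release line that retries flaky downloads.
    # Run from the directory that owns gradle/wrapper/gradle-wrapper.properties.
    ./gradlew wrapper --gradle-version 5.6
    # Verify the pinned distribution URL afterwards.
    cat gradle/wrapper/gradle-wrapper.properties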

3. Git clone failure. 🟩

In this case Cirrus automatically re-runs the task.

4. CocoaPods failures 🟨

Here is an error from the logs:

[!] Couldn't determine repo type for URL: `https://cdn.cocoapods.org/`: Failed to open TCP connection to cdn.cocoapods.org:443 (getaddrinfo: nodename nor servname provided, or not known)

Seems like the same issue as #1.

5. cURL issue while downloading the Dart SDK 🟩

Here is an error from the logs:

curl: (56) LibreSSL SSL_read: SSL_ERROR_SYSCALL, errno 54
Failed to retrieve the Dart SDK from: https://storage.googleapis.com/flutter_infra/flutter/e0ebaea59071b35a44dbe1e0830ee15fb7563486/dart-sdk-darwin-x64.zip

Can someone point to the exact place where curl is called?

@fkorotkov changed the title from "CI networking issue" to "macOS CI networking issue" Feb 13, 2020
@fkorotkov
Contributor Author

Also, is it happening only for macOS tasks? Last week's infrastructure migration removed the single firewall that all macOS nodes were using; now each node has a 1Gb connection instead of going through a shared switch.

@christopherfujino
Member

@fkorotkov the curl call is in the bin/internal/update_dart_sdk.sh script (may not be exact, I'm on my phone on the bus).

@fkorotkov mentioned this issue Feb 13, 2020
@VladyslavBondarenko added the team, team-infra, and platform-mac labels Feb 13, 2020
@fkorotkov
Contributor Author

Created #50731 for the cURL issue to add retries.
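
For illustration, a retry around the Dart SDK download could look roughly like the sketch below - a plain shell loop (curl's own --retry flag is another option). This is not a quote of #50731, and the placeholder URL stands in for the real engine-hash path:

    # Hypothetical retry wrapper around the SDK download; not the actual patch.
    url="https://storage.googleapis.com/flutter_infra/flutter/<engine-hash>/dart-sdk-darwin-x64.zip"
    for attempt in 1 2 3; do
      curl --continue-at - --location --fail --output /tmp/dart-sdk.zip "$url" && break
      echo "Download failed (attempt $attempt of 3), retrying in 5s..." >&2
      sleep 5
    done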

@dnfield
Contributor

dnfield commented Feb 13, 2020

Repeating my comment from the chat on this: I believe we have some kind of general issue where the macOS machines are intermittently having network connectivity issues, and that it is not specific to any task or server. We see this on connections to pub, Google's Maven repository, JCenter, and CocoaPods.

I'm not opposed to retrying a failed download; downloads can fail for a variety of reasons. But I suspect there's some deeper issue with the network infrastructure that the Macs are using, which results in failed network connections regardless of the server.

@fkorotkov
Contributor Author

Mac networking is less sophisticated than the Linux and Windows setups that run on GCP, but I've just seen a Gradle failure in a Linux task in #50731 as well: https://cirrus-ci.com/task/4852194834907136?command=main#L1609.

Do you see the same issues in plugins repo?

Flutter CI is heavily network-intensive. When you run locally or in a non-clean environment (where artifacts from previous runs are persisted) you don't download that many things every time; you usually have most of them already available.

IMO adding retries is essential because none of the third-party dependencies like Google Storage, Maven repositories, CocoaPods, or RubyGems has 100% availability.

Of all the points above I'm most interested in getting visibility into the hanging pub issue, because IMO it's the most disturbing one. Do you have any ideas on how to get more verbose output from pub?

@dnfield
Contributor

dnfield commented Feb 13, 2020

You can run pub -v (or flutter pub get -v).
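
On CI that could be wired up roughly as in the sketch below; the timeout and log path are illustrative additions, not existing Flutter tooling:

    # Illustrative: capture verbose pub output so a hang leaves a trace, and bound the wait.
    # timeout(1) is GNU coreutils; on stock macOS use gtimeout from `brew install coreutils`.
    timeout 600 flutter pub get -v 2>&1 | tee /tmp/pub_get_verbose.log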

@dnfield
Contributor

dnfield commented Feb 13, 2020

One thing we could consider doing as well is having some way to build a total package needed for Flutter CI and uploading it to a source like CIPD or Isolated (Chromium infra projects).

The idea is that we could do all of the flaky stuff once, create a "golden" set of artifacts for this commit, and then have all of the other CI use that golden set. CIPD/Isolated should be very reliable - and at the least, it would eliminate questioning which one of the many servers we're contacting is down.

@fkorotkov
Contributor Author

Related to the pub issue: @Hixie hit it in a Windows task as well: dart-lang/pub#2257

@christopherfujino
Member

One thing we could consider doing as well is having some way to build a total package needed for Flutter CI and uploading it to a source like CIPD or Isolated (Chromium infra projects).

The idea is that we could do all of the flaky stuff once, create a "golden" set of artifacts for this commit, and then have all of the other CI use that golden set. CIPD/Isolated should be very reliable - and at the least, it would eliminate questioning which one of the many servers we're contacting is down.

@dnfield Would caching to CIPD be inherently more stable/reliable than the caching @jmagman did in #50496 (which had to be reverted because in populating the cache we were still hitting network issues, but the cache didn't have timeout/retry logic yet)? It seems to me like it would have the same problems.

@dnfield
Contributor

dnfield commented Feb 13, 2020

@christopherfujino I'm not sure this works but something like this:

  1. We have a Cirrus task that all others depend on that prepares the cached artifacts.
  2. Once it is done, all the others use the output of that task.
  3. It can be retried as often as we like, and it's not tied to any Cirrus specific caching rules - it's basically just our own custom caching task.
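
A rough sketch of what such a cache-producing task might run, under the assumption that a tarball of the warmed caches gets published somewhere the other tasks can fetch it; the bucket name and cache directories are illustrative, not an existing Flutter setup:

    # Hypothetical "golden cache" task: do the flaky downloads once, then publish one archive
    # that every other task consumes instead of talking to pub/Maven/CocoaPods directly.
    set -euo pipefail
    flutter precache            # engine artifacts
    flutter update-packages     # pub packages for the framework repo, into ~/.pub-cache
    tar -czf golden-cache.tgz -C "$HOME" .pub-cache .gradle
    # Destination is made up for illustration; CIPD, Isolated, or a GCS bucket would all serve.
    gsutil cp golden-cache.tgz "gs://example-flutter-ci-cache/${CIRRUS_CHANGE_IN_REPO}/golden-cache.tgz"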

@dnfield
Contributor

dnfield commented Feb 13, 2020

And to be clear - this is not an entirely trivial undertaking, and I have no proof that it would actually speed up CI or make it more reliable.

@christopherfujino
Member

@christopherfujino I'm not sure this works but something like this:

  1. We have a Cirrus task that all others depend on that prepares the cached artifacts.
  2. Once it is done, all the others use the output of that task.
  3. It can be retried as often as we like, and it's not tied to any Cirrus specific caching rules - it's basically just our own custom caching task.

I like this idea; Jenn and I had talked about it (though using the Cirrus cache rather than CIPD, having one "precache" task per platform that all other tasks depend on). Not sure if that would be an improvement over having to deflake possibly multiple tasks. Not sure what using CIPD over the Cirrus cache would buy us (maybe up/down speed or reliability is better to CIPD than wherever the Cirrus cache lives).

@dnfield
Contributor

dnfield commented Feb 13, 2020

The main thing it would buy us is that we can precisely control what gets cached and why, without worrying about whatever rules Cirrus has in place.

The downside is that we have to implement most of those rules ourselves. If the Cirrus caching API meets this need then that sounds good. I was under the impression we were running into problems where the cache fails on a networking issue.

@christopherfujino
Member

Where would we build the CIPD cache? On Cirrus, or LUCI?

The problem with caching is that when downloading gradle/cocoapods/pub/et al. we would time out while trying to create the cache file. Creating the CIPD cache on Cirrus would have this same problem. If we could create the cache on LUCI and upload to CIPD, that would be more reliable; we'd just have to find a way to block the Cirrus tests until the CIPD cache is populated.

@dnfield
Contributor

dnfield commented Feb 13, 2020

I think we have to do something on Cirrus to prevent the jobs from starting prematurely.

If the goal is just to reduce potential flakes, this should help.

If the goal is to avoid making network calls on Cirrus at all, then I think we might be out of luck. We could do something crazy like use the BuildBucket API to schedule a LUCI job to build the cache and then collect it. We'd still need, at a minimum, Cirrus to have a reliable enough network connection to schedule the job and poll for its completion. And in that case, how much worse is it to have just one Cirrus job build the cache?
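
One crude way to do the blocking, sketched under the assumption that the externally built cache drops an object at a known, pollable location (the bucket and path here are invented):

    # Hypothetical: wait for the externally built cache for this commit before running tests.
    marker="gs://example-flutter-ci-cache/${CIRRUS_CHANGE_IN_REPO}/golden-cache.tgz"
    for i in $(seq 1 60); do
      gsutil -q stat "$marker" && break
      echo "Cache not ready yet ($i/60); sleeping 30s..."
      sleep 30
    done
    gsutil -q stat "$marker"    # fail the task if the cache never showed up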

@fkorotkov
Contributor Author

IMO, before jumping into implementing custom caching logic it makes sense to try to make the current tools more network-resilient. This will help in any case.

There are a couple of reports of hanging pub unrelated to Cirrus. Upgrading Gradle to 5.x or 6.x will help a lot (in my personal experience the 4.x versions were pretty unreliable, especially with public artifact repositories).

I don't think there will be much flakiness left after the pub and Gradle situations are improved.

@fkorotkov
Contributor Author

Also, the Cirrus agent just started re-downloading caches in case of a corrupted archive, and today/tomorrow there will be an option to disable re-uploads, requested by @jmagman in order to re-land caching.

@dnfield
Contributor

dnfield commented Feb 13, 2020

Sounds good @fkorotkov - we definitely want to do whatever we can to make these processes more resilient (both for CI and for users - we get bug reports where it's clear a retry would have helped).

We're still concerned about the volume of failures we're seeing on Mac. It anecdotally seems worse in the last week or two - lots of networking failures that just work when retried on a new instance. I'm concerned that sometimes the instance is just having trouble making new outbound network connections to sites that are otherwise up.

@christopherfujino
Member

@fkorotkov we're now on Gradle 5.6 https://github.com/flutter/flutter/pull/50388/files

@fkorotkov
Contributor Author

Sorry to hear that. I'm trying to get some visibility into networking on the nodes: I'm concerned it might be related to the recent infra changes. It's hard for me to believe that removing a single gateway in favor of individual 1Gb connections made the situation worse. Plus some of the issues are happening not only on Macs.

I'm working on gathering statistics on which physical hosts the failures are happening on, to see whether they are evenly distributed.

@fkorotkov
Contributor Author

@christopherfujino indeed! But I saw a stack trace from 4.10.2 in some build. So 🤞 that the Gradle issue is fixed now.

fkorotkov added a commit to cirruslabs/macos-image-templates that referenced this issue Feb 14, 2020
Instead of using one provided by Anka which will proxy the host one.

Might be related to flutter/flutter#50730
@fkorotkov
Contributor Author

Have you noticed any changes in flakiness today by any chance? I wonder if setting DNS servers explicitly in cirruslabs/macos-image-templates@28ee45b instead of relying on the virtualization environment helped.
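
For reference, pinning resolvers on a macOS guest looks roughly like this; the network service name and DNS addresses below are examples, not necessarily what the linked commit uses:

    # Illustrative: set explicit DNS servers instead of inheriting them from the virtualization layer.
    networksetup -setdnsservers "Ethernet" 8.8.8.8 8.8.4.4
    # Confirm what the resolver configuration ended up being.
    networksetup -getdnsservers "Ethernet"
    scutil --dns | head -n 20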

@dnfield
Contributor

dnfield commented Feb 14, 2020

It does seem better, although we're having other issues today that have reduced the volume of commits anyway.

@fkorotkov
Contributor Author

Got it. I'm implementing some tooling/monitoring to help with diagnosing the issue. Just wanted to check in the meantime.

fkorotkov added a commit to cirruslabs/cirrus-ci-web that referenced this issue Feb 15, 2020
Only host information for macOS tasks for now. Related to flutter/flutter#50730
@fkorotkov
Contributor Author

Just a quick update that Cirrus now reports a few things to help with data around this issue:

  • Metrics and reporting of flakes (when a manual re-run succeeds).
  • Metrics and reporting of missing/chunked logs.
  • For macOS tasks Cirrus now shows the physical host where the task was executed, to see whether failures tend to happen on a particular subset of hosts.

I'm collecting occurrences of the issue with some manual analysis of Cirrus Agent logs in this Google Spreadsheet. Feel free to add links to tasks there.

@fkorotkov changed the title from "macOS CI networking issue" to "CI networking issues" Feb 19, 2020
@fkorotkov
Contributor Author

Changed the title since the issues occur mostly on macOS but not exclusively.

@kf6gpe
Contributor

kf6gpe commented Feb 19, 2020

cc @kf6gpe to follow.

@fkorotkov
Contributor Author

I've backfilled some data and added a new chart on the metrics page, which shows that flakes started around January 21st, when there was an issue with virtualization that we discussed in #hacker-infra:

[chart screenshot]

Because of that issue (which impacted 10-20% of the tasks) I had to upgrade the Mac infrastructure to Catalina and a newer virtualization version (this was the only suggestion from virtualization support). But as the newly added chart shows, the upgrade only dropped flakiness to around 2% for the flutter/flutter repository:

[chart screenshot]

The overall situation across all macOS tasks seems more or less stable:

[chart screenshot]

It seems only very network-intensive tasks are experiencing more than normal flakiness.

On the bright side, I contacted Anka virtualization support with all the data and they are releasing a new version of the virtualization tomorrow with fixes that they claim will resolve the issues with hanging sockets. 🤞

@christopherfujino
Member

@fkorotkov thanks for looking into this and following up with Anka!

@fkorotkov
Contributor Author

Just an update from Anka: they pushed the bug-fix release to Tuesday. 😳🤞

@fkorotkov
Contributor Author

I upgraded the virtualization last night and saw the issue still occurring. I was able to somewhat reliably reproduce the issue on my local Mac mini and am debugging it with the Anka folks right now.

Created #51411 in the meantime while still debugging the issue with them.

@fkorotkov
Contributor Author

fkorotkov commented Feb 26, 2020

BTW, besides the upgrade I tweaked some things on the Mac nodes, and according to the flaky-tasks graph there were no flakes today for flutter/flutter: https://cirrus-ci.com/metrics/repository/flutter/flutter

A flaked task is defined as a failed task whose manual re-run, triggered by a user, succeeded.

[chart screenshot]

Have you noticed improvements? I also noticed that the number of tasks is lower than usual; hopefully your peer review cycle is not too exhausting. 😅

@jmagman
Member

jmagman commented Feb 26, 2020

Mac network flakiness has definitely gone down yesterday and today.

@fkorotkov
Contributor Author

According to the new flakiness metric the issue actually seems fixed after the virtualization upgrade and some simplification of the network config on the physical nodes. I will close the issue. Please reopen it if you see it appearing again.

@dnfield
Contributor

dnfield commented Mar 2, 2020

Thanks for looking into this Fedor!

@christopherfujino
Member

@fkorotkov thanks so much for all this work!

@fkorotkov
Contributor Author

Sorry it wasn't detected automatically before and that it took two weeks of reports from your side. Now there is an alert on the flakiness rate, and in general Cirrus now collects data about flakiness.

@github-actions

This thread has been automatically locked since there has not been any recent activity after it was closed. If you are still experiencing a similar issue, please open a new bug, including the output of flutter doctor -v and a minimal reproduction of the issue.

@github-actions bot locked as resolved and limited conversation to collaborators Aug 22, 2021