
Networking is Flaky on GitHub Hosted Runners #1187

Closed
3 of 6 tasks
iamrecursion opened this issue Jul 7, 2020 · 28 comments
Labels
Area: Image administration, investigate

Comments

@iamrecursion

Describe the bug
Situations that require networking, such as downloading a file using curl or running tests that utilise localhost, are proving to be very flaky, with spurious timeouts occurring often.

Area for Triage:
Servers

Question, Bug, or Feature?:
Bug

Virtual environments affected

  • macOS 10.15
  • Ubuntu 16.04 LTS
  • Ubuntu 18.04 LTS
  • Ubuntu 20.04 LTS
  • Windows Server 2016 R2
  • Windows Server 2019

Expected behavior
Networking behaviour should be consistent, and not cause spurious test failures due to timeouts that are impossible to reproduce on a non-CI machine.

Actual behavior
As the issue is flakiness, it is difficult to provide a consistent reproduction. The issues primarily occur with the Engine CI workflow in the Enso repository, and manifest as spurious test failures, failures to download files using curl, and the like.

  1. Execute the workflow.
  2. Wait for a failure to happen.

Virtually all of the failures (e.g. this one) are spurious and seem to occur due to networking timeouts.

@Darleev added the Area: Image administration and investigate labels and removed the needs triage label on Jul 7, 2020
@dsame
Contributor

dsame commented Jul 9, 2020

We tried to reproduce the problem from a few Azure and Digital Ocean data centres in different regions, and none of them confirmed a network issue with the agent itself, but the destination piccolo.link might have limited bandwidth.

To diagnose the exact problem, please add tracepath piccolo.link (on Ubuntu) to a shell task and provide its output to your Internet/hosting provider.
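
A minimal sketch of such a step in a GitHub Actions workflow (the step name and the placement before the build are my own illustration):

  # Illustrative diagnostic step; run it before the build/test steps.
  - name: Trace route to piccolo.link
    if: runner.os == 'Linux'
    run: tracepath piccolo.link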

@iamrecursion
Author

We only very rarely see a failure due to that link. Primarily we are seeing failures where connections to localhost time out.

@dsame
Contributor

dsame commented Jul 28, 2020

Can you please provide the exact URLs which time out?
And can you still add a task with tracepath localhost and provide a log? This looks like a DNS problem that leads to incorrect name resolution; tracepath can help detect this situation in your build.
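
A sketch of such a task for the Ubuntu runners (the extra getent call is my own addition, to show what localhost actually resolves to):

  # Illustrative diagnostic step for the name-resolution angle.
  - name: Check localhost resolution and route
    if: runner.os == 'Linux'
    run: |
      getent hosts localhost   # prints the address the name resolves to
      tracepath localhost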

@dsame
Contributor

dsame commented Aug 5, 2020

We have not had a response for a long time, so we are closing the issue.

@iamrecursion please open a new issue or reopen this one if you still have the problem

@dsame dsame closed this as completed Aug 5, 2020
@iamrecursion
Author

Sorry, I never saw the reply from 9 days ago.

  • I can't provide URLs as they're randomly generated on localhost. The tests are talking to a local server.
  • Logs as follows for Ubuntu. tracepath doesn't seem to exist on the macOS and Windows runners where the problems tend to occur.
Ubuntu Runner
 1?: [LOCALHOST]                        0.006ms pmtu 65536
 1:  localhost                                             0.159ms reached
 1:  localhost                                             0.032ms reached
     Resume: pmtu 65536 hops 1 back 1 

@iamrecursion
Author

I can't seem to re-open this myself.

@dsame
Contributor

dsame commented Aug 10, 2020

In order to make sure the problem is with networking and not with the local service, please add time ping -o localhost on Ubuntu and macOS before running the tests, and send the build logs to us.

Actually, having the same issue on all three runners suggests a problem with the local server itself rather than with the network infrastructure.
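
A sketch of such a pre-test step (-o, exit after the first reply, is a BSD/macOS flag; the Linux iputils ping uses -c 1 for a comparable single-probe check, so the sketch branches on the runner OS):

  # Illustrative step; add it immediately before the test step.
  - name: Time a ping to localhost
    if: runner.os != 'Windows'
    shell: bash
    run: |
      if [ "$RUNNER_OS" = "macOS" ]; then
        time ping -o localhost     # BSD/macOS ping: exit after the first reply
      else
        time ping -c 1 localhost   # Linux iputils ping: send a single probe
      fi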

@iamrecursion
Author

Quite possibly, yes, but it occurs far more often during periods when I would expect the Actions machines to be under heavier load.

Do you still want the time stats?

@dsame
Contributor

dsame commented Aug 14, 2020

Do you notice the total build duration increasing as well? Can you send the successful and failed logs?

@iamrecursion
Author

The total build duration doesn't seem to increase, not noticeably in any case. Please find logs attached from a successful and a failing run on Ubuntu.

logs_failure_ubuntu-latest.zip
logs_success_ubuntu-latest.zip

@Darleev
Contributor

Darleev commented Aug 24, 2020

Hello @iamrecursion,
As far as I can see, the Ubuntu "Engine CI" runs currently work fine. It seems one of the latest Ubuntu updates did the trick.
However, your pipeline was affected by the issue described here, which has been fixed.
Is the issue still occurring?

@iamrecursion
Author

Yes, the issue is still present, especially on Windows. Please find Windows logs attached.

windows_falure.zip
windows_success.zip

@LouisCAD

I'm also encountering network flakiness with timeouts. It happens on almost every run.
Here's a workflow run where you can search for "failed" and see the error happens several times despite logic that retries twice after the first failure: https://github.com/LouisCAD/Splitties/runs/948101638?check_suite_focus=true

@Darleev
Contributor

Darleev commented Sep 1, 2020

@LouisCAD as far as I can see there was a temporary issue and now everything looks good. Am I correct?
@iamrecursion I have checked the current failures for GitHub Actions jobs and it seems the network issue is no longer present. Could you please confirm this?

@LouisCAD

LouisCAD commented Sep 1, 2020

@Darleev The last time it happened to me was 2-3 days ago, but the run has been erased since I retried it, so there's no history of failures in GitHub Actions for now.

Yesterday, I didn't encounter network issues. I'll comment back here if it surfaces again, but it seems it's no longer reproducing.

@LouisCAD

LouisCAD commented Sep 2, 2020

@Darleev It looks like it keeps happening:
Screenshot 2020-09-02 at 02 38 53

For the run that I used for the screenshot, the failure happened on both Windows and macOS.

@LeonidLapshin
Contributor

Hi @LouisCAD, @iamrecursion,
It seems that your problem may be tied to network offloading, which is turned on by default.
Could you please try disabling TCP/UDP offload before the build? It can be done with a single line:
sudo ethtool -K eth0 tx off rx off
Reloads/restarts are not required, and performance should not be affected.
Thank you, we are looking forward to your reply.
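
As a workflow step on an Ubuntu runner this could look roughly like the following (eth0 is assumed to be the default interface on the hosted Ubuntu images; the step name is illustrative):

  # Illustrative step; run it before any network-heavy work.
  - name: Disable TCP/UDP offload (Linux)
    if: runner.os == 'Linux'
    run: sudo ethtool -K eth0 tx off rx off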

@LouisCAD

LouisCAD commented Sep 4, 2020

@LeonidLapshin Is that command Linux-only? I had timeouts happen on macOS runners too. Also, does Windows have something similar that'd need to be disabled with another command?

@LeonidLapshin
Contributor

Hey, @LouisCAD!

Yes, you're right, ethtool is available on Ubuntu, but not on macOS.

On Windows the proper command (PowerShell) is:
Disable-NetAdapterChecksumOffload -Name * -TcpIPv4 -UdpIPv4 -TcpIPv6 -UdpIPv6

For now I cannot provide a way to change these settings at runtime on macOS, but I'll definitely try to find a solution.
Could you please test the Windows/Ubuntu scenarios?
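
A sketch of the equivalent workflow step for a Windows runner (step name illustrative):

  # Illustrative step; run it before any network-heavy work.
  - name: Disable TCP/UDP checksum offload (Windows)
    if: runner.os == 'Windows'
    shell: pwsh
    run: Disable-NetAdapterChecksumOffload -Name * -TcpIPv4 -UdpIPv4 -TcpIPv6 -UdpIPv6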

@LouisCAD

LouisCAD commented Sep 5, 2020

@LeonidLapshin I can try it the next time I release a dev or stable version of the affected project, but it's much less likely to show any difference since I didn't witness these timeouts on Windows or Linux in the first place, which is expected since they have significantly less upload work to do than the job running on macOS.

It'd be best to test with the corresponding setting on macOS to see if there's any difference for the next development releases that trigger a lot of uploads.

@iamrecursion
Author

We also see the instability on macOS, but less often than on Windows. Configuring this for Windows and Linux seems to have reduced the incidence of the network-related failures significantly. I'll be keeping an eye on it, but at least initially it seems to have helped!

I'm definitely interested in an equivalent setting for macOS.

@smorimoto
Contributor

Hope this helps someone: https://github.com/smorimoto/tune-github-hosted-runner-network
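
For reference, a usage sketch of that action as an early step in a job (the @v1 tag is an assumption of mine; check the repository for the current release):

  # Assumed tag; pin to whatever release the repository currently documents.
  - uses: smorimoto/tune-github-hosted-runner-network@v1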

gitlab-dfinity pushed a commit to dfinity/ic that referenced this issue Aug 14, 2023
try to fix github runners network flakiness

According to actions/runner-images#1187 (comment) disabling network offloading can help with github runner network flakiness. 

See merge request dfinity-lab/public/ic!14097
bkoelman added a commit to json-api-dotnet/JsonApiDotNetCore that referenced this issue Nov 22, 2023
bkoelman added a commit to json-api-dotnet/JsonApiDotNetCore that referenced this issue Nov 23, 2023
bkoelman added a commit to json-api-dotnet/JsonApiDotNetCore that referenced this issue Nov 23, 2023
bkoelman added a commit to json-api-dotnet/JsonApiDotNetCore that referenced this issue Nov 23, 2023
bkoelman added a commit to json-api-dotnet/JsonApiDotNetCore that referenced this issue Nov 25, 2023
bkoelman added a commit to json-api-dotnet/JsonApiDotNetCore that referenced this issue Dec 8, 2023
bkoelman added a commit to json-api-dotnet/JsonApiDotNetCore that referenced this issue Dec 10, 2023
IharYakimush added a commit to epam/epam-kafka that referenced this issue Apr 4, 2024
commonism added a commit to commonism/aiopenapi3_redfish that referenced this issue May 27, 2024