
Networking is Flaky on GitHub Hosted Runners #1187

Closed
3 of 6 tasks
iamrecursion opened this issue Jul 7, 2020 · 28 comments
Labels
Area: Image administration, investigate

Comments

@iamrecursion

Describe the bug
Situations that require networking, such as downloading a file using curl or running tests that utilise localhost, are proving to be very flaky, with spurious timeouts occurring often.

Area for Triage:
Servers

Question, Bug, or Feature?:
Bug

Virtual environments affected

  • macOS 10.15
  • Ubuntu 16.04 LTS
  • Ubuntu 18.04 LTS
  • Ubuntu 20.04 LTS
  • Windows Server 2016 R2
  • Windows Server 2019

Expected behavior
Networking behaviour should be consistent, and not cause spurious test failures due to timeouts that are impossible to reproduce on a non-CI machine.

Actual behavior
As the issue is flakiness, it is difficult to provide a consistent reproduction. The issues primarily occur with the Engine CI workflow in the Enso repository, and manifest as spurious test failures, failures to download files using curl, and the like.

  1. Execute the workflow.
  2. Wait for a failure to happen.

Virtually all of the failures (e.g. this one) are spurious and seem to occur due to networking timeouts.

@Darleev added the Area: Image administration and investigate labels and removed the needs triage label on Jul 7, 2020
@dsame
Contributor

dsame commented Jul 9, 2020

We tried to reproduce the problem from a few Azure and Digital Ocean data centres in different regions, and none of them confirmed a network issue with the agent itself, but the destination piccolo.link might have limited bandwidth.

To diagnose the exact problem, please add tracepath piccolo.link (on Ubuntu) to a shell task and provide its output to your Internet/hosting provider.
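
A minimal sketch of such a step in a GitHub Actions workflow (the step name and the placement before the build are my own illustration):

  # Illustrative diagnostic step; run it before the build/test steps.
  - name: Trace route to piccolo.link
    if: runner.os == 'Linux'
    run: tracepath piccolo.link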

@iamrecursion
Author

We only very rarely see a failure due to that link. Primarily we are seeing failures where connections to localhost time out.

@dsame
Contributor

dsame commented Jul 28, 2020

Can you please provide the exact URLs which time out?
And can you still add a task with tracepath localhost and provide a log? This looks like a DNS problem that leads to incorrect name resolution; tracepath can help detect this situation in your build.
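
A sketch of such a task for the Ubuntu runners (the extra getent call is my own addition, to show what localhost actually resolves to):

  # Illustrative diagnostic step for the name-resolution angle.
  - name: Check localhost resolution and route
    if: runner.os == 'Linux'
    run: |
      getent hosts localhost   # prints the address the name resolves to
      tracepath localhost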

@dsame
Contributor

dsame commented Aug 5, 2020

We have not had a response for a long time, so we are closing the issue.

@iamrecursion please open a new issue or reopen this one if you still have the problem

@dsame dsame closed this as completed Aug 5, 2020
@iamrecursion
Author

Sorry, I never saw the reply from 9 days ago.

  • I can't provide URLs as they're randomly generated on localhost. The tests are talking to a local server.
  • Logs as follows for Ubuntu. tracepath doesn't seem to exist on the macOS and Windows runners where the problems tend to occur.
Ubuntu Runner
 1?: [LOCALHOST]                        0.006ms pmtu 65536
 1:  localhost                                             0.159ms reached
 1:  localhost                                             0.032ms reached
     Resume: pmtu 65536 hops 1 back 1 

@iamrecursion
Author

I can't seem to re-open this myself.

@dsame
Contributor

dsame commented Aug 10, 2020

In order to make sure the problem is with networking and not with the local service, please add time ping -o localhost on Ubuntu and macOS before running the tests, and send the build logs to us.

Actually, having the same issue on all three runners suggests a problem with the local server itself rather than with the network infrastructure.
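
A sketch of such a pre-test step (-o, exit after the first reply, is a BSD/macOS flag; the Linux iputils ping uses -c 1 for a comparable single-probe check, so the sketch branches on the runner OS):

  # Illustrative step; add it immediately before the test step.
  - name: Time a ping to localhost
    if: runner.os != 'Windows'
    shell: bash
    run: |
      if [ "$RUNNER_OS" = "macOS" ]; then
        time ping -o localhost     # BSD/macOS ping: exit after the first reply
      else
        time ping -c 1 localhost   # Linux iputils ping: send a single probe
      fi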

@iamrecursion
Author

Quite possibly, yes, but it occurs far more often during periods when I would expect the Actions machines to be under heavier load.

Do you still want the time stats?

@dsame
Contributor

dsame commented Aug 14, 2020

Do you notice the total build duration increasing as well? Can you send the successful and failed logs?

@iamrecursion
Author

The total build duration doesn't seem to increase, not noticeably in any case. Please find logs attached from a successful and a failing run on Ubuntu.

logs_failure_ubuntu-latest.zip
logs_success_ubuntu-latest.zip

@Darleev
Contributor

Darleev commented Aug 24, 2020

Hello @iamrecursion,
As far as I can see, the Ubuntu "Engine CI" runs currently work fine. It seems one of the latest Ubuntu updates did the trick.
However, your pipeline was affected by the issue described here, which has been fixed.
Is the issue still occurring?

@iamrecursion
Author

Yes, the issue is still present, especially on Windows. Please find Windows logs attached.

windows_falure.zip
windows_success.zip

@LouisCAD

I'm also encountering network flakiness with timeouts. It happens on almost every run.
Here's a workflow run where you can search for "failed" and see the error happens several times despite logic that retries twice after the first failure: https://github.com/LouisCAD/Splitties/runs/948101638?check_suite_focus=true

@Darleev
Contributor

Darleev commented Sep 1, 2020

@LouisCAD as far as I can see there was a temporary issue and now everything looks good. Am I correct?
@iamrecursion I have checked the current failures for GitHub Actions jobs and it seems the network issue is no longer present. Could you please confirm this?

@LouisCAD

LouisCAD commented Sep 1, 2020

@Darleev The last time it happened to me was 2-3 days ago, but the run has been erased since I retried it, so there's no history of failures in GitHub Actions for now.

Yesterday, I didn't encounter network issues. I'll comment back here if it surfaces again, but it seems it's no longer reproducing.

@LouisCAD

LouisCAD commented Sep 2, 2020

@Darleev It looks like it keeps happening:
Screenshot 2020-09-02 at 02 38 53

For the run that I used for the screenshot, the failure happened on both Windows and macOS.

@LeonidLapshin
Contributor

Hi @LouisCAD, @iamrecursion,
It seems that your problem may be tied to network offloading, which is turned on by default.
Could you please try disabling TCP/UDP offload before the build? It can be done with a single line:
sudo ethtool -K eth0 tx off rx off
Reloads/restarts are not required, and performance should not be affected.
Thank you, we are looking forward to your reply.
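
As a workflow step on an Ubuntu runner this could look roughly like the following (eth0 is assumed to be the default interface on the hosted Ubuntu images; the step name is illustrative):

  # Illustrative step; run it before any network-heavy work.
  - name: Disable TCP/UDP offload (Linux)
    if: runner.os == 'Linux'
    run: sudo ethtool -K eth0 tx off rx off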

@LouisCAD

LouisCAD commented Sep 4, 2020

@LeonidLapshin Is that command Linux-only? I had timeouts happen on macOS runners too. Also, does Windows have something similar that'd need to be disabled with another command?

@LeonidLapshin
Contributor

Hey, @LouisCAD!

Yes, you're right, ethtool is available on Ubuntu, but not on macOS.

On Windows the proper command (PowerShell) is:
Disable-NetAdapterChecksumOffload -Name * -TcpIPv4 -UdpIPv4 -TcpIPv6 -UdpIPv6

For now I cannot provide a way to change these settings at runtime on macOS, but I'll definitely try to find a solution.
Could you please test the Windows/Ubuntu scenarios?
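
A sketch of the equivalent workflow step for a Windows runner (step name illustrative):

  # Illustrative step; run it before any network-heavy work.
  - name: Disable TCP/UDP checksum offload (Windows)
    if: runner.os == 'Windows'
    shell: pwsh
    run: Disable-NetAdapterChecksumOffload -Name * -TcpIPv4 -UdpIPv4 -TcpIPv6 -UdpIPv6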

@LouisCAD

LouisCAD commented Sep 5, 2020

@LeonidLapshin I can try it the next time I release a dev or stable version of the affected project, but it's much less likely to show any difference since I didn't witness these timeouts on Windows or Linux in the first place, which is expected since they have significantly less upload work to do than the job running on macOS.

It'd be best to test with the corresponding setting on macOS to see if there's any difference for the next development releases that trigger a lot of uploads.

@iamrecursion
Author

We also see the instability on macOS, but less often than on Windows. Configuring this for Windows and Linux seems to have reduced the incidence of the network-related failures significantly. I'll be keeping an eye on it, but at least initially it seems to have helped!

I'm definitely interested in an equivalent setting for macOS.

@smorimoto
Contributor

Hope this helps someone: https://github.com/smorimoto/tune-github-hosted-runner-network
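
For reference, a usage sketch of that action as an early step in a job (the @v1 tag is an assumption of mine; check the repository for the current release):

  # Assumed tag; pin to whatever release the repository currently documents.
  - uses: smorimoto/tune-github-hosted-runner-network@v1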

gitlab-dfinity pushed a commit to dfinity/ic that referenced this issue Aug 14, 2023
try to fix github runners network flakiness

According to actions/runner-images#1187 (comment) disabling network offloading can help with github runner network flakiness. 

See merge request dfinity-lab/public/ic!14097
bkoelman added a commit to json-api-dotnet/JsonApiDotNetCore that referenced this issue Nov 22, 2023
bkoelman added a commit to json-api-dotnet/JsonApiDotNetCore that referenced this issue Nov 23, 2023
bkoelman added a commit to json-api-dotnet/JsonApiDotNetCore that referenced this issue Nov 23, 2023
bkoelman added a commit to json-api-dotnet/JsonApiDotNetCore that referenced this issue Nov 23, 2023
bkoelman added a commit to json-api-dotnet/JsonApiDotNetCore that referenced this issue Nov 25, 2023
bkoelman added a commit to json-api-dotnet/JsonApiDotNetCore that referenced this issue Dec 8, 2023
bkoelman added a commit to json-api-dotnet/JsonApiDotNetCore that referenced this issue Dec 10, 2023
IharYakimush added a commit to epam/epam-kafka that referenced this issue Apr 4, 2024
commonism added a commit to commonism/aiopenapi3_redfish that referenced this issue May 27, 2024