Networking is Flaky on GitHub Hosted Runners #1187
Comments
We tried to reproduce the problem from a few data centres of Azure and Digital Ocean in different regions, and none of them confirmed network issues caused by the agent itself, but rather by the destination. To diagnose the exact problem, please add …
We only very rarely see a failure due to that link. Primarily we are seeing failures for connections using …
Can you please provide the exact URLs which time out?
We have not had a response for a long time, so we are closing the issue. @iamrecursion please open a new issue or reopen this one in case you still have the problem.
Sorry, I never saw the reply 9 days ago.
Ubuntu Runner
I can't seem to actively re-open this.
In order to make sure the problem is with networking and not with the local service, please add …
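The request above appears truncated in this thread. As a rough illustration of the kind of timing data that would separate a slow local service from a slow network, a step like the following (the URL is a placeholder, not from the original report) prints curl's per-phase timings for a request that sometimes times out:

```yaml
# Illustrative diagnostic step, not from the original thread: print per-phase
# timings so DNS/connect delays can be separated from slow transfers.
- name: Time a flaky request (illustrative)
  run: |
    curl -o /dev/null -sS \
      -w 'dns: %{time_namelookup}s  connect: %{time_connect}s  tls: %{time_appconnect}s  ttfb: %{time_starttransfer}s  total: %{time_total}s\n' \
      https://example.com/some-artifact   # placeholder URL
```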
Actually, having the same issue on all 3 runners suggests the problem is with the local server itself rather than with the network infrastructure.
Quite possibly, yes, but it occurs far more often during periods when I would expect the Actions machines to be under heavier load. Do you still want the time stats?
Do you notice that the total build duration increases as well? Can you send the success and failed logs?
The total build duration doesn't seem to increase, not noticeably in any case. Please find logs attached from a successful and a failing run on Ubuntu. logs_failure_ubuntu-latest.zip
Hello @iamrecursion,
Yes, the issue is still present, especially on Windows. Please find Windows logs attached.
I'm also encountering network flakiness with timeouts. It happens on almost every run.
@LouisCAD as far as I can see, there was a temporary issue and now everything looks good. Am I correct?
@Darleev Last time it happened to me was 2-3 days ago, but the action has been erased as I retried it, and there's no history of failures for now in GitHub Actions. Yesterday, I didn't encounter network issues. I'll comment back here if it surfaces again, but it seems it's no longer reproducing.
@Darleev It looks like it keeps happening: For the run that I used for the screenshot, the failure happened on both Windows and macOS.
Hi @LouisCAD, @iamrecursion, …
@LeonidLapshin Is that command Linux-only? I had timeouts happen on macOS runners too. Also, does Windows have something similar that'd need to be disabled with another command?
Hey, @LouisCAD! Yes, you're right, ethtool is available on Ubuntu, but not on macOS. On Windows the proper command (PowerShell) is: … For now I cannot provide a way to change these settings at runtime on macOS, but I'll definitely try to find a solution.
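The PowerShell command referenced above was not captured in this thread. A plausible equivalent, assuming the intent is to disable checksum and large-send offload on all adapters (the cmdlets come from Windows' built-in NetAdapter module, and this is not necessarily the command originally posted):

```yaml
# Assumed Windows counterpart of the Linux ethtool workaround; verify on a
# runner before relying on it.
- name: Disable network offloading (Windows)
  if: runner.os == 'Windows'
  shell: pwsh
  run: |
    Disable-NetAdapterChecksumOffload -Name '*'
    Disable-NetAdapterLso -Name '*'
```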
@LeonidLapshin I can try for the next time I release a dev or stable version of the affected project, but it's much less likely to show any difference since I didn't witness these timeouts on Windows or Linux in the first place, which is expected since they have significantly less upload work to do than the job running on macOS. It'd be best to test with the corresponding setting on macOS to see if there's any difference for the next development version releases that trigger a lot of uploads.
We also see the instability on macOS, but less often than on Windows. Configuring this for Windows and Linux seems to have reduced the incidence of the network-related failures significantly. I'll be keeping an eye on it, but at least initially it seems to have helped! I'm definitely interested in an equivalent setting for macOS.
Still seeing sporadic read timeouts on GitHub Actions macos test runs. Let's try: actions/runner-images#1187 (comment) Addendum to #464
Hope this helps someone: https://github.com/smorimoto/tune-github-hosted-runner-network |
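For reference, using that action in a workflow would presumably look like the sketch below; the version tag is an assumption, so check the repository for the current release before copying it:

```yaml
# Assumed usage of the community action linked above; the @v1 tag is a guess.
- name: Tune GitHub-hosted runner network
  uses: smorimoto/tune-github-hosted-runner-network@v1
```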
try to fix github runners network flakiness According to actions/runner-images#1187 (comment) disabling network offloading can help with github runner network flakiness. See merge request dfinity-lab/public/ic!14097
try adding the workarounds mentioned in actions/runner-images#1187
…gger_release.yml

```yaml
on:
  schedule:
    # run every day at 23:15
    - cron: '15 23 * * *'
  workflow_dispatch:
    inputs:
      releaseType:
        description: stable, canary, or release candidate?
        required: true
        type: choice
        options:
          - canary
          - stable
          - release-candidate
      semverType:
        description: semver type?
        type: choice
        options:
          - patch
          - minor
          - major
      force:
        description: create a new release even if there are no new commits
        default: false
        type: boolean
    secrets:
      RELEASE_BOT_GITHUB_TOKEN:
        required: true

name: Trigger Release

env:
  NAPI_CLI_VERSION: 2.14.7
  TURBO_VERSION: 2.0.9
  NODE_LTS_VERSION: 20

jobs:
  start:
    if: github.repository_owner == 'vercel'
    runs-on: ubuntu-latest
    env:
      NEXT_TELEMETRY_DISABLED: 1
      # we build a dev binary for use in CI so skip downloading
      # canary next-swc binaries in the monorepo
      NEXT_SKIP_NATIVE_POSTINSTALL: 1
    environment: release-${{ github.event.inputs.releaseType || 'canary' }}
    steps:
      - name: Setup node
        uses: actions/setup-node@v4
        with:
          node-version: 18
          check-latest: true
      - name: Clone Next.js repository
        run: git clone https://github.com/vercel/next.js.git --depth=25 --single-branch --branch ${GITHUB_REF_NAME:-canary} .
      - name: Check token
        run: gh auth status
        env:
          GITHUB_TOKEN: ${{ secrets.RELEASE_BOT_GITHUB_TOKEN }}
      - name: Get commit of the latest tag
        run: echo "LATEST_TAG_COMMIT=$(git rev-list -n 1 $(git describe --tags --abbrev=0))" >> $GITHUB_ENV
      - name: Get latest commit
        run: echo "LATEST_COMMIT=$(git rev-parse HEAD)" >> $GITHUB_ENV
      - name: Check if new commits since last tag
        if: ${{ github.event.inputs.releaseType != 'stable' && github.event.inputs.force != true }}
        run: |
          if [ "$LATEST_TAG_COMMIT" = "$LATEST_COMMIT" ]; then
            echo "No new commits. Exiting..."
            exit 1
          fi
      # actions/runner-images#1187
      - name: tune linux network
        run: sudo ethtool -K eth0 tx off rx off
      - run: corepack enable && pnpm --version
      - id: get-store-path
        run: echo STORE_PATH=$(pnpm store path) >> $GITHUB_OUTPUT
      - uses: actions/cache@v4
        timeout-minutes: 5
        id: cache-pnpm-store
        with:
          path: ${{ steps.get-store-path.outputs.STORE_PATH }}
          key: pnpm-store-${{ hashFiles('pnpm-lock.yaml') }}
          restore-keys: |
            pnpm-store-
            pnpm-store-${{ hashFiles('pnpm-lock.yaml') }}
      - run: pnpm install
      - run: pnpm run build
      - run: node ./scripts/start-release.js --release-type ${{ github.event.inputs.releaseType || 'canary' }} --semver-type ${{ github.event.inputs.semverType }}
        env:
          RELEASE_BOT_GITHUB_TOKEN: ${{ secrets.RELEASE_BOT_GITHUB_TOKEN }}
```
Describe the bug
Situations that require networking, such as downloading a file using `curl` or running tests that utilise localhost, are proving to be very flaky, with spurious timeouts occurring often.

Area for Triage:
Servers
Question, Bug, or Feature?:
Bug
Virtual environments affected
Expected behavior
Networking behaviour should be consistent, and not cause spurious test failures due to timeouts that are impossible to reproduce on a non-CI machine.
Actual behavior
As the issue is flakiness, it is difficult to provide a consistent reproduction. The issues primarily occur with the Engine CI in the Enso repository, and manifest as spurious test failures, or failures to download things using `curl`, and the like.

Virtually all of the failures (e.g. this one) are spurious and seem to occur due to networking timeouts.
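As a consumer-side mitigation (not a fix for the runner network itself), downloads like the ones failing here can be made more tolerant of transient timeouts with curl's retry options; a minimal sketch, with a placeholder URL:

```yaml
# Sketch: retry transient download failures instead of failing the job outright.
- name: Download with retries (illustrative)
  run: |
    curl -fSL --retry 5 --retry-delay 2 --connect-timeout 30 --max-time 300 \
      -o artifact.bin https://example.com/artifact.bin   # placeholder URL
```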