Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support self-hosted Cirrus workers on forks #29274

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

Sjors
Copy link
Member

@Sjors Sjors commented Jan 18, 2024

I find myself making pull requests against my fork (mostly on top of #28983), or asking others to do so. Currently only the Github actions are run on forks, because we use self-hosted runners for the Cirrus tasks.

While setting up my own self-hosted runners for my fork, I ran into a number of issues. This PR addresses those.

Some issues are related to me using multiple regular users on the same physical machine, rather than running every worker in a VM. Update: these fixes have been absorbed by #29441.

Replaces #29259 using the self hosted workers via Cirrus instead of Github.

You can see this PR in action on this pull request to my fork: Sjors#30

To test it yourself:

  1. spin up at least two self hosted runners. Either use a seperate VM for each, or give them their own user.
  2. Install Podman and other CI dependencies (see .cirrus.yml)
  3. Give Cirrus access to your fork at https://cirrus-ci.com/settings/github/YOU
  4. Get a token from Cirrus and use it to start your worker(s)
  5. Optionally set NO_BRANCH=true, NO_ARM=true and NO_NOBLE=true env variables (see .cirrus.yml)
  6. make a pull request to your own fork, with this PR as the base branch

Security wise: when dealing with code from strangers on the internet, review it first before running the CI. There's a Cirrus check-box that requires approval for people without write access to trigger CI.

One side-effect of this PR is that on forks Github CI is no longer runs on branch pushes. It does still run for pull requests. I could undo that, but the downside is that is Github CI would run twice in a typical pull request workflow.

@DrahtBot
Copy link
Contributor

DrahtBot commented Jan 18, 2024

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.

Type Reviewers
Stale ACK maflcko

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

No conflicts as of last run.

@@ -161,11 +172,13 @@ task:

task:
name: 'ASan + LSan + UBSan + integer, no depends, USDT'
# Skip if no Noble worker exists (for a fork)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed? Won't this be skipped already by default if the worker is offline?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It stays pending, at least in the Cirrus interface.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It stays pending, at least in the Cirrus interface.

But that seems preferable to me, as it communicates clearly that the task didn't run.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying that UX in Sjors#36 ...

Copy link
Member Author

@Sjors Sjors Feb 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the first push of a branch, due to the UI glitched described below (#29274 (comment)) they show as skipped:

Scherm­afbeelding 2024-02-27 om 17 40 10

That seems fine. On the second push they are put in "queued sate" (forever?).

Scherm­afbeelding 2024-02-27 om 17 56 11

I agree that hiding the task altogether is not ideal (as ecf01f4 does). Perhaps marking it as skipped is better.

Copy link
Member Author

@Sjors Sjors Feb 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well for one thing it's way faster (using just an idle desktop computer).

Copy link
Member Author

@Sjors Sjors Feb 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also not this black and white. Sure, Microsoft and Cirrus can screw us over by faking the CI result. But by not relying on them for hosting for the runners, it's both easier to catch them doing shenanigans and it's easier to move elsewhere.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

easier to catch them

Why? Someone will have to run a fully local backup either way, whose cost is independent of what is running on the other side.

easier to move elsewhere

Why? Moving elsewhere also requires moving the repo off of the GitHub mirror, or writing a GH integration and a CI framework from scratch, both of whose costs are largely independent of where the CI CPUs happen to sit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I run a self-hosted CI I can see what it's doing. If it fails on my machine and is marked as successful on Github, I can see that. If I collude with Github that's a different story of course.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CI cleans up after itself, latest when the next run starts, so I don't think it is reliably possible to inspect a failure of a remote-triggered CI run locally, even assuming no "attack" from GH.

More importantly, the CI machine is literally doing RCE of downloaded binary executables and random code, so reliably detecting a test failure (assuming a GH attack) seems questionable at best.

# Do not run branch pushes on forks, because they are typically redundant with
# pull requests to the main repo.
- 'bitcoin/**'
- 'bitcoin-core/**'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No objection, but I presume this will cause issues on forks (Inquisition, Knots, ...)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, this probably needs an ENV flag.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Knots pull requests seem to be typically made against the 25.x-knots branch. Those would run normally. But with the above code CI would not run when e.g. a future 26.x.knots is created (or rebased).

Inquisition seems to follow the some pattern.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aaargh, afaik Github CI doesn't support custom ENV variables for forks. So this is solved, but on Github forks are unconditionally skipped.

.cirrus.yml Outdated Show resolved Hide resolved
.cirrus.yml Outdated Show resolved Hide resolved
ci/test/02_run_container.sh Outdated Show resolved Hide resolved
@@ -11,7 +11,7 @@ set -ex
if [ -n "$LOCAL_BRANCH" ]; then
# To faithfully recreate CI linting locally, specify all commits on the current
# branch.
COMMIT_RANGE="$(git merge-base HEAD master)..HEAD"
COMMIT_RANGE="$(git merge-base HEAD origin/master)..HEAD"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preserving @maflcko's comment on the previous PR:

unrelated: Generally I'd find it better if we get rid of the commit range completely. For example, when running the whitespace linter, it seems easier to just lint all code at HEAD, instead of trying to guess which code was modified and only lint that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #29487

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case, I don't understand the change here. This is only run when LOCAL_BRANCH is set, which does not happen in CI, so I doubt that your "fix" fixes anything.

Copy link
Member Author

@Sjors Sjors Feb 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember 0ac57f8 fixed an error complaining that there's no master branch. But I can retest.

Copy link
Member Author

@Sjors Sjors Feb 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC only the cirrus linter CI calls this script. That's not self-hosted even on forks. So it's possible this commit was only useful in #29259, where I added that linter to a Github Action. I'll drop it if CI passes on Sjors#30

Copy link
Member

@maflcko maflcko Feb 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember 0ac57f8 fixed an error complaining that there's no master branch. But I can retest.

Sure, but I wonder if it is worth it. Given that COMMIT_RANGE is only really required for the scripted-diff check, I think it is fine to force devs to locally specify the range.

There are endless ways in which this can fail. I don't have a master branch locally either. origin/master isn't guaranteed to exist either. And finally, all of them (if they exist) may point to different commits, leading to even more bugs

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC only the cirrus linter CI calls this script. That's not self-hosted even on forks. So it's possible this commit was only useful in #29259, where I added that linter to a Github Action. I'll drop it if CI passes on Sjors#30

Yeah, obviously it will be passing. As I said, this code in this line is never executed in CI. For pulls, the branch below will be run.

https://cirrus-ci.com/task/6340412349087744?logs=lint#L263

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dropped 0ac57f8.

Someone can drop COMMIT_RANGE altogether in a separate PR. Maybe move the incantation to a README, because it's certainly useful to know how to run the linter on each commit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in #29660

@@ -29,7 +29,10 @@ jobs:
test-each-commit:
name: 'test each commit'
runs-on: ubuntu-22.04
if: github.event_name == 'pull_request' && github.event.pull_request.commits != 1
# Only run on pull requests, skip if there's only commit.
# Do not run on forks.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nicer to fix the actual problem, but I don't understand what this job is doing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hebasto @ryanofsky any thoughts on how to make the test-each-commit task work for a scenario like in Sjors#30 where the base branch is fork/some-branch and the PR branch is fork/some-pr?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example of earlier failure: https://github.com/Sjors/bitcoin/actions/runs/7556559882/job/20573828392?pr=28#step:4:42

 shell: /usr/bin/bash -e {0}
  env:
    DANGER_RUN_CI_ON_HOST: 1
    CI_FAILFAST_TEST_LEAVE_DANGLING: 1
    MAKEJOBS: -j10
    MAX_COUNT: 6
    FETCH_DEPTH: 8
Warning: you are leaving 1 commit behind, not connected to
any of your branches:

  3b5c5d4 ci: add self-hosted runners for GitHub

If you want to keep it by creating a new branch, this may be a good time
to do so with:

 git branch <new-branch-name> 3b5c5d4

HEAD is now at cbe740b ci: vary /tmp/env
fatal: bad revision '^^@'

@Sjors Sjors force-pushed the 2023/01/ci-fork branch 2 times, most recently from f79ca32 to e44e9d2 Compare January 18, 2024 17:20
@maflcko
Copy link
Member

maflcko commented Jan 18, 2024

lgtm ACK c65fde4

Didn't review the other commits.

@maflcko
Copy link
Member

maflcko commented Jan 19, 2024

As for the other commits, as mentioned previously, I am not sure if this is really needed. But no objection, of course. (#29259 (comment))

@Sjors
Copy link
Member Author

Sjors commented Jan 22, 2024

@maflcko that comment refers to running CI for checking my own work. But what this pull request enables is running CI for pull requests, which may be from myself but also from others.

I'm somewhat puzzled why Knots (cc @luke-jr) and Inquisition (cc @ajtowns) are not running into the same issues. But I guess there's at least two differences:

  1. Presumably they run self hosted (virtual) machines via Cirrus but with sudo permissions
  2. They're both still building on the v25.0 tag, i.e. with somewhat different CI config

@ajtowns
Copy link
Contributor

ajtowns commented Jan 22, 2024

I'm somewhat puzzled why Knots (cc @luke-jr) and Inquisition (cc @ajtowns) are not running into the same issues. But I guess there's at least two differences:

Inquisition disabled all the automatic tests except the trivial linter when cirrus imposed limits earlier in the year, and tests are only run manually by reviewers. Haven't worked out what it should be doing, so don't have suggestions to offer, but interested to see what the resolution here is.

@luke-jr
Copy link
Member

luke-jr commented Jan 23, 2024

I'm somewhat puzzled why Knots and Inquisition are not running into the same issues

Maybe because I switch the self-hosted CI to not-self-hosted?

@@ -30,6 +30,10 @@ if [ -z "$DANGER_RUN_CI_ON_HOST" ]; then
docker volume create "${CONTAINER_NAME}_previous_releases" || true

if [ -n "${RESTART_CI_DOCKER_BEFORE_RUN}" ] ; then
# Kill harder, prevents "The container name "/ci_..." is already in use by container"
docker stop "$(docker ps -a -q)" || true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is "kill harder"? Containers are already being --all --force removed below:

-a, --all                   Remove all containers
-f, --force                 Force removal of a running or unusable container

so they shouldn't need to be "stopped" beforehand?

Also, shouldn't this be podman?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative (with podman 4.8+) would be to use run --replace, see containers/podman@f21c1d2

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was intentionally vague, because I might just be doing something else wrong. I'll read @maflcko's suggestion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like https://packages.ubuntu.com/noble/podman and https://packages.debian.org/trixie/podman includes the fixed --replace, so might try that at one point.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, this is only available in Noble? In that case I might just improve the code comment (including sleep) and add a note that can go away with Noble.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we add a dependency on podman at all?

A dependency on a container engine is needed, because this repo is the wrong place to implement a container engine from scratch.

Is the plan to move to podman entirely?

For running the CI locally, any container engine should work. Inside the RESTART_CI_DOCKER_BEFORE_RUN, only podman is supported.

If you want to switch to docker, the podman commands would have to replaced by docker commands inside the RESTART_CI_DOCKER_BEFORE_RUN block. Otherwise, both podman and docker are required to run on docker, which seems confusing.

Copy link
Member Author

@Sjors Sjors Jan 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see where my confusion comes from. I thought podman was just a wrapper for Docker. But it's actually an alternative which can run the same containers. Why not switch over entirely to Podman? It's quite confusing that we're using both.

Since Podman doesn't need root either, I might just nuke Docker and give it a try.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I nuked Docker in favor of Podman. So far it seems to work. I dropped 7cdb75d.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not switch over entirely to Podman? It's quite confusing that we're using both.

I don't see the benefit of requiring that docker is uninstalled. If someone wants to use their container engine of their choice locally, just let them do it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not specifically related to the issue at hand: the new podman desktop app is pretty smart, it can show you containers running via docker and podman (and kubernetes clusters) all in the same UI: https://podman-desktop.io/

@Sjors
Copy link
Member Author

Sjors commented Jan 25, 2024

I switched my own machine from Docker to Podman. This removes the need for 7cdb75d so that's dropped.

Added a commit that explains the process a bit more clearly.

fanquake added a commit that referenced this pull request Feb 20, 2024
fa91bf2 ci: Skip git install if it is already installed (MarcoFalke)
c65fde4 ci: vary /tmp/env (Sjors Provoost)

Pull request description:

  * Currently, running separate CI tasks at the same time may intermittently fail, because they race to read/write `/tmp/env`. Fix this by adding `$CONTAINER_NAME` to the file name.

  * Also, add `$USER`, while touching the line, to allow different users to run the same CI task at the same time.

  * Also, skip the git install if there is no need.

  Ref: #29274

ACKs for top commit:
  Sjors:
    ACK fa91bf2
  BrandonOdiwuor:
    ACK fa91bf2
  hebasto:
    ACK fa91bf2.

Tree-SHA512: 9a8479255a2afb6618f9d0796488d9430ba95266b90ce39536a9817c1974ca4049beeaab5355a38b25171f76fc386dbec06b1919aaa079f08a5a0c0a146232c8
@Sjors
Copy link
Member Author

Sjors commented Feb 20, 2024

Rebased after #29441 landed which contains two commits from this PR.

@Sjors Sjors changed the title Support self-hosted Cirrus workers on forks (and multi-user) Support self-hosted Cirrus workers on forks Feb 20, 2024
@Sjors Sjors mentioned this pull request Feb 20, 2024
@Sjors
Copy link
Member Author

Sjors commented Feb 27, 2024

Dropped 0ac57f8 since that wasn't doing anything. Also rebased.

@Sjors
Copy link
Member Author

Sjors commented Feb 27, 2024

One annoying side-effect of 80f338c is that if you open a pull request (after pushing the branch) the Cirrus tasks initially all show as skipped. The tasks are actually running though, and as they finish that's reflected on Github.

This only happens for Cirrus workers, not for Github workers (3048700).

See screenshots: Sjors#36 (comment)

This seems like a bug in Github or in Cirrus's integration. Dropping 80f338c has two undesirable effects:

  1. It runs CI twice for every push
  2. It runs my self-hosted CI for every push to pull requests on the main repo

The new behaviour is opt-in by setting NO_BRANCH=true, so everyone can decide what they prefer. But perhaps there's a more elegant solution here?

(marked draft to try a few things)

Possibly related: cirruslabs/cirrus-ci-docs#1233

@Sjors Sjors marked this pull request as draft February 27, 2024 16:58
@Sjors
Copy link
Member Author

Sjors commented Feb 27, 2024

I pushed two changes:

  1. 80f338c -> 0d08429: use only_if: instead of skip:: this slightly improves the experience of when you open a new PR.

  2. ecf01f4 -> 7f7f87b: unsupported architecture are marked skipped instead of hidden (except for the first time you open a PR, which hopefully is fixed with Requesting not to update the CIRRUS_LAST_GREEN_CHANGE after builds with no tasks, nor after builds with all tasks skipped cirruslabs/cirrus-ci-docs#1233)

As demonstrated in Sjors#37 (comment), the first push looks like this (with ARM and Asan missing):

Scherm­afbeelding 2024-02-27 om 19 18 04

And then second: (ARM and Asan shown as skipped)

Scherm­afbeelding 2024-02-27 om 19 21 23

@Sjors Sjors marked this pull request as ready for review February 27, 2024 18:10
.cirrus.yml Outdated Show resolved Hide resolved
Copy link
Member

@luke-jr luke-jr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems more like "don't support Cirrus on forks", since it disables it to a large extent. Maybe that should also be behind env vars?

@Sjors
Copy link
Member Author

Sjors commented Mar 4, 2024

Maybe that should also be behind env vars?

It is. Nothing is disabled for Cirrus on forks, unless you set NO_BRANCH, NO_ARM and/or NO_NOBLE.

On Github CI there's no way (afaik) to set env vars, so there a few things are disabled for forks.

.cirrus.yml Outdated
@@ -8,24 +8,41 @@ env: # Global defaults
CCACHE_DIR: "/tmp/ccache_dir"
CCACHE_NOHASHDIR: "1" # Debug info might contain a stale path if the build dir changes, but this is fine

# A self-hosted machine(s) can be used via Cirrus CI. It can be configured with a
# single user with sudo access, or multiple users without by setting USER_MODE=true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What acts on this env var?

Copy link
Member Author

@Sjors Sjors Mar 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, nothing does. An earlier version 3044813 had this variable. I removed it after this comment: #29274 (comment), but forgot to change the text.

I've now changed it from:

It can be configured with a
single user with sudo access, or multiple users without by setting USER_MODE=true

To:

It can be configured with
multiple users to run tasks in parallel. No sudo permission is required.

.cirrus.yml Outdated Show resolved Hide resolved
fanquake added a commit that referenced this pull request Mar 15, 2024
5555395 lint: Use git --no-pager to print any output in one go (MarcoFalke)
fa57294 lint: Fix lint-whitespace issues (MarcoFalke)

Pull request description:

  The lint check has many issues:

  * It uses `COMMIT_RANGE`, which is brittle code, apparently making it harder to run the CI locally, or self-hosted. See #29274 (comment)
  * The result depends on `COMMIT_RANGE`, or the number of commits passed to the script, which can cause false negatives or false positives.
  * It is based on the diff output, parsing it, and printing it again, which is brittle as well.
  * The output does not include line number, making it harder to act on a lint error.

  Fix all issues by removing the script and replacing it with a simple call to `git grep -I --line-number ...`.

ACKs for top commit:
  TheCharlatan:
    Re-ACK 5555395

Tree-SHA512: 71ea8b6382af064beb72fb17f21a0ae9e9238c97e7fa43c2ec353fd1dd73a7bbd696ba0f0a9f65d1eff7c86cbf6cc104a992cb5450a3d50f122955e835270065
.cirrus.yml Outdated
#
# ```
# apt update && apt install git screen python3 bash podman-docker curl -y && curl -L -o cirrus "https://github.com/cirruslabs/cirrus-cli/releases/latest/download/cirrus-linux-$(dpkg --print-architecture)" && mv cirrus /usr/local/bin/cirrus && chmod +x /usr/local/bin/cirrus
# ```
#
# - There are no strict requirements on the hardware, because having less CPUs
# - There are no strict requirements on the hardware, because having less CPU cores
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this change is correct. With CPU I meant vCPU, or CPU thread.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to "having fewer CPU threads run".

Also rebased.

fanquake added a commit that referenced this pull request Mar 25, 2024
fa1146d lint: Fix COMMIT_RANGE issues (MarcoFalke)

Pull request description:

  `COMMIT_RANGE` has problems on forks or local branches:

  * When `LOCAL_BRANCH` is set, it assumes the presence of a `master` branch, and that the `master` branch is up-to-date. Both of which may be false. (See also discussion in #29274 (comment))
  * When `COMMIT_RANGE` isn't set in `lint-git-commit-check.py`, and `--prev-commits` isn't set either, it has the same (broken) assumptions.

  Fix all issues by simply assuming a merge commit exists. This allows to drop `LOCAL_BRANCH`. It also allows to drop `SKIP_EMPTY_NOT_A_PR`, because scripts will already skip an empty range. Finally, it allows to drop `--prev-commits n`, because one can simply say `COMMIT_RANGE='HEAD~n..HEAD'` to achieve the same.

ACKs for top commit:
  Sjors:
    tACK fa1146d

Tree-SHA512: f1477a38267fd4fdb8d396211a5d6bed5f418798c7edaba43487957aaf726cd45244ccf15187b3dd896d398fa1df3fe0a37323e49cf44d60a2018786ed41e5ba
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants