New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support self-hosted Cirrus workers on forks #29274
base: master
Are you sure you want to change the base?
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. Code CoverageFor detailed information about the code coverage, see the test coverage report. ReviewsSee the guideline for information on the review process.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update. ConflictsNo conflicts as of last run. |
@@ -161,11 +172,13 @@ task: | |||
|
|||
task: | |||
name: 'ASan + LSan + UBSan + integer, no depends, USDT' | |||
# Skip if no Noble worker exists (for a fork) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed? Won't this be skipped already by default if the worker is offline?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It stays pending, at least in the Cirrus interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It stays pending, at least in the Cirrus interface.
But that seems preferable to me, as it communicates clearly that the task didn't run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trying that UX in Sjors#36 ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the first push of a branch, due to the UI glitched described below (#29274 (comment)) they show as skipped:
That seems fine. On the second push they are put in "queued sate" (forever?).
I agree that hiding the task altogether is not ideal (as ecf01f4 does). Perhaps marking it as skipped is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well for one thing it's way faster (using just an idle desktop computer).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's also not this black and white. Sure, Microsoft and Cirrus can screw us over by faking the CI result. But by not relying on them for hosting for the runners, it's both easier to catch them doing shenanigans and it's easier to move elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
easier to catch them
Why? Someone will have to run a fully local backup either way, whose cost is independent of what is running on the other side.
easier to move elsewhere
Why? Moving elsewhere also requires moving the repo off of the GitHub mirror, or writing a GH integration and a CI framework from scratch, both of whose costs are largely independent of where the CI CPUs happen to sit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I run a self-hosted CI I can see what it's doing. If it fails on my machine and is marked as successful on Github, I can see that. If I collude with Github that's a different story of course.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CI cleans up after itself, latest when the next run starts, so I don't think it is reliably possible to inspect a failure of a remote-triggered CI run locally, even assuming no "attack" from GH.
More importantly, the CI machine is literally doing RCE of downloaded binary executables and random code, so reliably detecting a test failure (assuming a GH attack) seems questionable at best.
# Do not run branch pushes on forks, because they are typically redundant with | ||
# pull requests to the main repo. | ||
- 'bitcoin/**' | ||
- 'bitcoin-core/**' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No objection, but I presume this will cause issues on forks (Inquisition, Knots, ...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, this probably needs an ENV flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Knots pull requests seem to be typically made against the 25.x-knots
branch. Those would run normally. But with the above code CI would not run when e.g. a future 26.x.knots
is created (or rebased).
Inquisition seems to follow the some pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aaargh, afaik Github CI doesn't support custom ENV variables for forks. So this is solved, but on Github forks are unconditionally skipped.
40bfae2
to
baa7109
Compare
ci/lint/06_script.sh
Outdated
@@ -11,7 +11,7 @@ set -ex | |||
if [ -n "$LOCAL_BRANCH" ]; then | |||
# To faithfully recreate CI linting locally, specify all commits on the current | |||
# branch. | |||
COMMIT_RANGE="$(git merge-base HEAD master)..HEAD" | |||
COMMIT_RANGE="$(git merge-base HEAD origin/master)..HEAD" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Preserving @maflcko's comment on the previous PR:
unrelated: Generally I'd find it better if we get rid of the commit range completely. For example, when running the whitespace linter, it seems easier to just lint all code at
HEAD
, instead of trying to guess which code was modified and only lint that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #29487
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In any case, I don't understand the change here. This is only run when LOCAL_BRANCH
is set, which does not happen in CI, so I doubt that your "fix" fixes anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember 0ac57f8 fixed an error complaining that there's no master
branch. But I can retest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember 0ac57f8 fixed an error complaining that there's no master branch. But I can retest.
Sure, but I wonder if it is worth it. Given that COMMIT_RANGE
is only really required for the scripted-diff check, I think it is fine to force devs to locally specify the range.
There are endless ways in which this can fail. I don't have a master
branch locally either. origin/master
isn't guaranteed to exist either. And finally, all of them (if they exist) may point to different commits, leading to even more bugs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC only the cirrus linter CI calls this script. That's not self-hosted even on forks. So it's possible this commit was only useful in #29259, where I added that linter to a Github Action. I'll drop it if CI passes on Sjors#30
Yeah, obviously it will be passing. As I said, this code in this line is never executed in CI. For pulls, the branch below will be run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dropped 0ac57f8.
Someone can drop COMMIT_RANGE
altogether in a separate PR. Maybe move the incantation to a README, because it's certainly useful to know how to run the linter on each commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in #29660
@@ -29,7 +29,10 @@ jobs: | |||
test-each-commit: | |||
name: 'test each commit' | |||
runs-on: ubuntu-22.04 | |||
if: github.event_name == 'pull_request' && github.event.pull_request.commits != 1 | |||
# Only run on pull requests, skip if there's only commit. | |||
# Do not run on forks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nicer to fix the actual problem, but I don't understand what this job is doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hebasto @ryanofsky any thoughts on how to make the test-each-commit
task work for a scenario like in Sjors#30 where the base branch is fork/some-branch
and the PR branch is fork/some-pr
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Example of earlier failure: https://github.com/Sjors/bitcoin/actions/runs/7556559882/job/20573828392?pr=28#step:4:42
shell: /usr/bin/bash -e {0}
env:
DANGER_RUN_CI_ON_HOST: 1
CI_FAILFAST_TEST_LEAVE_DANGLING: 1
MAKEJOBS: -j10
MAX_COUNT: 6
FETCH_DEPTH: 8
Warning: you are leaving 1 commit behind, not connected to
any of your branches:
3b5c5d4 ci: add self-hosted runners for GitHub
If you want to keep it by creating a new branch, this may be a good time
to do so with:
git branch <new-branch-name> 3b5c5d4
HEAD is now at cbe740b ci: vary /tmp/env
fatal: bad revision '^^@'
f79ca32
to
e44e9d2
Compare
lgtm ACK c65fde4 Didn't review the other commits. |
As for the other commits, as mentioned previously, I am not sure if this is really needed. But no objection, of course. (#29259 (comment)) |
@maflcko that comment refers to running CI for checking my own work. But what this pull request enables is running CI for pull requests, which may be from myself but also from others. I'm somewhat puzzled why Knots (cc @luke-jr) and Inquisition (cc @ajtowns) are not running into the same issues. But I guess there's at least two differences:
|
Inquisition disabled all the automatic tests except the trivial linter when cirrus imposed limits earlier in the year, and tests are only run manually by reviewers. Haven't worked out what it should be doing, so don't have suggestions to offer, but interested to see what the resolution here is. |
Maybe because I switch the self-hosted CI to not-self-hosted? |
ci/test/02_run_container.sh
Outdated
@@ -30,6 +30,10 @@ if [ -z "$DANGER_RUN_CI_ON_HOST" ]; then | |||
docker volume create "${CONTAINER_NAME}_previous_releases" || true | |||
|
|||
if [ -n "${RESTART_CI_DOCKER_BEFORE_RUN}" ] ; then | |||
# Kill harder, prevents "The container name "/ci_..." is already in use by container" | |||
docker stop "$(docker ps -a -q)" || true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is "kill harder"? Containers are already being --all --force
removed below:
-a, --all Remove all containers
-f, --force Force removal of a running or unusable container
so they shouldn't need to be "stopped" beforehand?
Also, shouldn't this be podman
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An alternative (with podman 4.8+) would be to use run --replace
, see containers/podman@f21c1d2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was intentionally vague, because I might just be doing something else wrong. I'll read @maflcko's suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like https://packages.ubuntu.com/noble/podman and https://packages.debian.org/trixie/podman includes the fixed --replace
, so might try that at one point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, this is only available in Noble? In that case I might just improve the code comment (including sleep
) and add a note that can go away with Noble.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we add a dependency on podman at all?
A dependency on a container engine is needed, because this repo is the wrong place to implement a container engine from scratch.
Is the plan to move to podman entirely?
For running the CI locally, any container engine should work. Inside the RESTART_CI_DOCKER_BEFORE_RUN
, only podman is supported.
If you want to switch to docker, the podman commands would have to replaced by docker commands inside the RESTART_CI_DOCKER_BEFORE_RUN
block. Otherwise, both podman and docker are required to run on docker, which seems confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see where my confusion comes from. I thought podman was just a wrapper for Docker. But it's actually an alternative which can run the same containers. Why not switch over entirely to Podman? It's quite confusing that we're using both.
Since Podman doesn't need root either, I might just nuke Docker and give it a try.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I nuked Docker in favor of Podman. So far it seems to work. I dropped 7cdb75d.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not switch over entirely to Podman? It's quite confusing that we're using both.
I don't see the benefit of requiring that docker is uninstalled. If someone wants to use their container engine of their choice locally, just let them do it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not specifically related to the issue at hand: the new podman desktop app is pretty smart, it can show you containers running via docker and podman (and kubernetes clusters) all in the same UI: https://podman-desktop.io/
e44e9d2
to
1c05b6e
Compare
I switched my own machine from Docker to Podman. This removes the need for 7cdb75d so that's dropped. Added a commit that explains the process a bit more clearly. |
fa91bf2 ci: Skip git install if it is already installed (MarcoFalke) c65fde4 ci: vary /tmp/env (Sjors Provoost) Pull request description: * Currently, running separate CI tasks at the same time may intermittently fail, because they race to read/write `/tmp/env`. Fix this by adding `$CONTAINER_NAME` to the file name. * Also, add `$USER`, while touching the line, to allow different users to run the same CI task at the same time. * Also, skip the git install if there is no need. Ref: #29274 ACKs for top commit: Sjors: ACK fa91bf2 BrandonOdiwuor: ACK fa91bf2 hebasto: ACK fa91bf2. Tree-SHA512: 9a8479255a2afb6618f9d0796488d9430ba95266b90ce39536a9817c1974ca4049beeaab5355a38b25171f76fc386dbec06b1919aaa079f08a5a0c0a146232c8
1c05b6e
to
2ff8c34
Compare
Rebased after #29441 landed which contains two commits from this PR. |
2ff8c34
to
0905db4
Compare
Dropped 0ac57f8 since that wasn't doing anything. Also rebased. |
One annoying side-effect of 80f338c is that if you open a pull request (after pushing the branch) the Cirrus tasks initially all show as skipped. The tasks are actually running though, and as they finish that's reflected on Github. This only happens for Cirrus workers, not for Github workers (3048700). See screenshots: Sjors#36 (comment) This seems like a bug in Github or in Cirrus's integration. Dropping 80f338c has two undesirable effects:
The new behaviour is opt-in by setting (marked draft to try a few things) Possibly related: cirruslabs/cirrus-ci-docs#1233 |
0905db4
to
bfe2f35
Compare
I pushed two changes:
As demonstrated in Sjors#37 (comment), the first push looks like this (with ARM and Asan missing): And then second: (ARM and Asan shown as skipped) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems more like "don't support Cirrus on forks", since it disables it to a large extent. Maybe that should also be behind env vars?
It is. Nothing is disabled for Cirrus on forks, unless you set On Github CI there's no way (afaik) to set env vars, so there a few things are disabled for forks. |
.cirrus.yml
Outdated
@@ -8,24 +8,41 @@ env: # Global defaults | |||
CCACHE_DIR: "/tmp/ccache_dir" | |||
CCACHE_NOHASHDIR: "1" # Debug info might contain a stale path if the build dir changes, but this is fine | |||
|
|||
# A self-hosted machine(s) can be used via Cirrus CI. It can be configured with a | |||
# single user with sudo access, or multiple users without by setting USER_MODE=true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What acts on this env var?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, nothing does. An earlier version 3044813 had this variable. I removed it after this comment: #29274 (comment), but forgot to change the text.
I've now changed it from:
It can be configured with a
single user with sudo access, or multiple users without by setting USER_MODE=true
To:
It can be configured with
multiple users to run tasks in parallel. No sudo permission is required.
5555395 lint: Use git --no-pager to print any output in one go (MarcoFalke) fa57294 lint: Fix lint-whitespace issues (MarcoFalke) Pull request description: The lint check has many issues: * It uses `COMMIT_RANGE`, which is brittle code, apparently making it harder to run the CI locally, or self-hosted. See #29274 (comment) * The result depends on `COMMIT_RANGE`, or the number of commits passed to the script, which can cause false negatives or false positives. * It is based on the diff output, parsing it, and printing it again, which is brittle as well. * The output does not include line number, making it harder to act on a lint error. Fix all issues by removing the script and replacing it with a simple call to `git grep -I --line-number ...`. ACKs for top commit: TheCharlatan: Re-ACK 5555395 Tree-SHA512: 71ea8b6382af064beb72fb17f21a0ae9e9238c97e7fa43c2ec353fd1dd73a7bbd696ba0f0a9f65d1eff7c86cbf6cc104a992cb5450a3d50f122955e835270065
.cirrus.yml
Outdated
# | ||
# ``` | ||
# apt update && apt install git screen python3 bash podman-docker curl -y && curl -L -o cirrus "https://github.com/cirruslabs/cirrus-cli/releases/latest/download/cirrus-linux-$(dpkg --print-architecture)" && mv cirrus /usr/local/bin/cirrus && chmod +x /usr/local/bin/cirrus | ||
# ``` | ||
# | ||
# - There are no strict requirements on the hardware, because having less CPUs | ||
# - There are no strict requirements on the hardware, because having less CPU cores |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this change is correct. With CPU
I meant vCPU
, or CPU thread
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to "having fewer CPU threads run".
Also rebased.
fa1146d lint: Fix COMMIT_RANGE issues (MarcoFalke) Pull request description: `COMMIT_RANGE` has problems on forks or local branches: * When `LOCAL_BRANCH` is set, it assumes the presence of a `master` branch, and that the `master` branch is up-to-date. Both of which may be false. (See also discussion in #29274 (comment)) * When `COMMIT_RANGE` isn't set in `lint-git-commit-check.py`, and `--prev-commits` isn't set either, it has the same (broken) assumptions. Fix all issues by simply assuming a merge commit exists. This allows to drop `LOCAL_BRANCH`. It also allows to drop `SKIP_EMPTY_NOT_A_PR`, because scripts will already skip an empty range. Finally, it allows to drop `--prev-commits n`, because one can simply say `COMMIT_RANGE='HEAD~n..HEAD'` to achieve the same. ACKs for top commit: Sjors: tACK fa1146d Tree-SHA512: f1477a38267fd4fdb8d396211a5d6bed5f418798c7edaba43487957aaf726cd45244ccf15187b3dd896d398fa1df3fe0a37323e49cf44d60a2018786ed41e5ba
I find myself making pull requests against my fork (mostly on top of #28983), or asking others to do so. Currently only the Github actions are run on forks, because we use self-hosted runners for the Cirrus tasks.
While setting up my own self-hosted runners for my fork, I ran into a number of issues. This PR addresses those.
Some issues are related to me using multiple regular users on the same physical machine, rather than running every worker in a VM. Update: these fixes have been absorbed by #29441.
Replaces #29259 using the self hosted workers via Cirrus instead of Github.
You can see this PR in action on this pull request to my fork: Sjors#30
To test it yourself:
.cirrus.yml
)https://cirrus-ci.com/settings/github/YOU
NO_BRANCH=true
,NO_ARM=true
andNO_NOBLE=true
env variables (see.cirrus.yml
)Security wise: when dealing with code from strangers on the internet, review it first before running the CI. There's a Cirrus check-box that requires approval for people without write access to trigger CI.
One side-effect of this PR is that on forks Github CI is no longer runs on branch pushes. It does still run for pull requests. I could undo that, but the downside is that is Github CI would run twice in a typical pull request workflow.