Overhauled `startup.sh` Script #1454

Fleshgrinder · 2022-05-17T07:54:56Z

This overhaul turns it into a shellcheck valid script with explicit error handling for all possible situations I could think of. This change takes #1409 into account and things can be merged in any order. There are a few important changes here to the logic:

The wait logic for checking if docker comes up was fundamentally flawed because it checks for the PID. Docker will always come up and thus become visible in the process list, just to immediately die when it encounters an issue, after which supervisor starts it again. This means that our check so far is flaky due to the sleep 1 it might encounter a PID, or it might not, and the existence of the PID does not mean anything. The docker ps check we have in the entrypoint.sh script does not suffer from this as it checks for a feature of docker and not a PID. I thus entirely removed the PID check, and instead I am handing things over to our entrypoint.sh script by setting the environment variables correctly.
This change has an influence on the docker0 interface MTU configuration, because the interface might or might not exist after we started docker. Hence, I changed this to a time boxed loop that tries for one minute to set up the interface's MTU. In case the command fails we log an error and continue with the run.
I changed the entire MTU handling by validating its value before configuring it, logging an error and continuing without if it is set incorrectly. This ensures that we are not going to send our users on a bug hunt.
The way we started supervisord did not make much sense to me. It sends itself into the background automatically, there is no need for us to do so with Bash.

The decision to not fail on errors but continue is a deliberate choice, because I believe that running a build is more important than having a perfectly configured system. However, this strategy might also hide issues for all users who are not properly checking their logs. It also makes testing harder. Hence, we could change all error conditions from graceful to panicking. We should then align the exit codes across startup.sh and entrypoint.sh to ensure that every possible error condition has its own unique error code for easy debugging.

Fleshgrinder · 2022-05-17T07:57:49Z

runner/startup.sh

 done

-log.debug 'Starting supervisor daemon'
-sudo /usr/bin/supervisord -n >> /dev/null 2>&1 &


We forcefully start supervisord in the foreground to then directly send it to the background again. We also set stdout to append to /dev/null and redirect its stderr to stdout. I really don't understand why we do what we do here. 😕

This overhaul turns it into a shellcheck valid script with explicit error handling for all possible situations I could think of. This change takes #1409 into account and things can be merged in any order. There are a few important changes here to the logic: - The wait logic for checking if docker comes up was fundamentally flawed because it checks for the PID. Docker will always come up and thus become visible in the process list, just to immediately die when it encounters an issue, after which supervisor starts it again. This means that our check so far is flaky due to the `sleep 1` it might encounter a PID, or it might not, and the existence of the PID does not mean anything. The `docker ps` check we have in the `entrypoint.sh` script does not suffer from this as it checks for a feature of docker and not a PID. I thus entirely removed the PID check, and instead I am handing things over to our `entrypoint.sh` script by setting the environment variables correctly. - This change has an influence on the `docker0` interface MTU configuration, because the interface might or might not exist after we started docker. Hence, I changed this to a time boxed loop that tries for one minute to set up the interface's MTU. In case the command fails we log an error and continue with the run. - I changed the entire MTU handling by validating its value before configuring it, logging an error and continuing without if it is set incorrectly. This ensures that we are not going to send our users on a bug hunt. - The way we started supervisord did not make much sense to me. It sends itself into the background automatically, there is no need for us to do so with Bash. The decision to not fail on errors but continue is a deliberate choice, because I believe that running a build is more important than having a perfectly configured system. However, this strategy might also hide issues for all users who are not properly checking their logs. It also makes testing harder. Hence, we could change all error conditions from graceful to panicking. We should then align the exit codes across `startup.sh` and `entrypoint.sh` to ensure that every possible error condition has its own unique error code for easy debugging.

mumoshu · 2022-06-07T02:04:48Z

@Fleshgrinder Hey! Thanks for your contribution. Just curious, but how did you test this to ensure it's not breaking anything? I was going to follow your test procedures but you touched many places in the code so it wasn't immediately clear to me what to cover.

Fleshgrinder · 2022-06-07T06:35:47Z

I only touched one script, nothing else. I ran our image locally in privileged mode which can hardly be considered as a proper integration test. I don't have a real dind setup where I could run this.

mumoshu · 2022-06-07T07:04:11Z

@Fleshgrinder Thanks! Okay then, I'll try to figure it out.

At glance this change touches the branching on MTU and the existence of a pre-baked /etc/docker/daemon.json, and it lacks the logic to wait until the dockerd process to start. So at least we should test it with and without a custom MTU, with and without a custom daemon.json, and see if it's safe to not wait for the dockerd to start up before execing entrypoint.sh.

Fleshgrinder · 2022-06-07T07:16:01Z

Correct, the existing entrypoint.sh is going to wait for the daemon to start.

If the docker daemon is in a crash loop the existing check might or might not work since there is a race between the daemon starting and thus getting a PID + dying and our PID check. Depending on how our sleeps work out we might see a PID for the process at some point or not. It is not a reliable way to check that the deaemon is up.

The check we have in our entrypoint.sh is safe, since it does not check for a running process but that some real functionality is working and thus yields the result we are seeking. With this change users also have the possibility to disable the wait via the same environment variables as they are able without dind. This gives us proper symmetry for all possible configuration nobs.

mumoshu · 2022-06-07T07:25:35Z

@Fleshgrinder Thanks! Your theory does make sense. If only we had an unit tests for startup.sh 😂 I just reviewed https://github.com/actions-runner-controller/actions-runner-controller/tree/master/test/entrypoint and realized that we only had ones for entrypoint.sh only so I either need to write tests or test this manually before merging anyway.

toast-gear · 2022-06-22T19:33:58Z

runner/startup.sh

+# We leave the wait for docker to come up to our entrypoint.sh script, since its
+# check properly handles the case where the docker service is in a crash loop.
+# Users still have the ability to disable the wait.
+export DOCKER_ENABLED=true


why are we exporting DOCKER_ENABLED in the startup.sh? The controller is adding this env to the runner container regardless

It could be set to DOCKER_ENABLED=false which would be incorrect in this case since we just started it and know that it is enabled. We want the entrypoint.sh script to verify that it is running before starting the run itself.

That said, I have to admit that I do not fully understand why we have two variables for the seemingly same thing in our entrypoint.sh script, but maybe you know more and can tell, since that information might change the implementation here.

This startup.sh doesn't have proper support for DOCKER_ENABLED=false. If it supported it, it should have started dockerd only when DOCKER_ENABLED=true.

As this script doesn't have a proper support for DOCKER_ENABLED=false, exporting either DOCKER_ENABLED=false or DOCKER_ENABLED=true is fine. The latter just makes the runner startup/entrypoint script to silently ignore DOCKER_ENABLED=false provided by the controller and treat it as DOCKER_ENABLED=true, which isn't worse or better than the current behavior IMHO 😃 So, maybe this change is practically safe.

toast-gear

Apart from 1 comment this LGTM, the changes make sense to me and the PR tested OK with my fairly light testing. Myself or @mumoshu should merge this in the morning so if there is a problem we can revert if needed as this is a fairly invasive change without any testing around it.

mumoshu · 2022-06-23T00:36:05Z

This kind of change should really be covered by an automated test suite in the future!
Let's go head for this time and see how it goes.

mumoshu · 2022-06-23T00:36:20Z

🤞

This reverts commit 071898c.

I am about to revisit #1517, #1454, #1561, and #1560 as a part of our on-going effort to a major enhancement to the runner entrypoints being made in #1759. This commit adds the makefile target to run shellformat locally, so that any contributor can use it before submitting a pull request.

I am about to revisit #1517, #1454, #1561, and #1560 as a part of our on-going effort to a major enhancement to the runner entrypoints being made in #1759. This change is a counterpart of #1852. #1852 enables you to easily run shellcheck locally. This enables you to automatically run shellcheck on every pull request. Currently, any shellcheck error does not result in failing the workflow job. Once we addressed all the shellcheck findings, we can flip the fail_on_error option to true and let jobs start failing on pull requests that introduces invalid or suspicious bash code.

@CASABECI

I am about to revisit #1517, #1454, #1561, and #1560 as a part of our on-going effort for a major enhancement to the runner entrypoints being made in #1759. This change updates and reintroduces #1517 contributed by @CASABECI in a way it becomes applicable to today's code-base.

I am about to revisit #1517, #1454, #1561, and #1560 as a part of our on-going effort for a major enhancement to the runner entrypoints being made in #1759. This commit adds the makefile target to run shellformat locally, so that any contributor can use it before submitting a pull request.

I am about to revisit #1517, #1454, #1561, and #1560 as a part of our on-going effort to a major enhancement to the runner entrypoints being made in #1759. This change is a counterpart of #1852. #1852 enables you to easily run shellcheck locally. This enables you to automatically run shellcheck on every pull request. Currently, any shellcheck error does not result in failing the workflow job. Once we addressed all the shellcheck findings, we can flip the fail_on_error option to true and let jobs start failing on pull requests that introduces invalid or suspicious bash code.

@CASABECI

I am about to revisit #1517, #1454, #1561, and #1560 as a part of our on-going effort for a major enhancement to the runner entrypoints being made in #1759. This change updates and reintroduces #1517 contributed by @CASABECI in a way it becomes applicable to today's code-base.

I am about to revisit #1517, #1454, #1561, and #1560 as a part of our on-going effort for a major enhancement to the runner entrypoints being made in #1759. This commit adds the makefile target to run shellformat locally, so that any contributor can use it before submitting a pull request.

* Add workflow for validating runner scripts with shellcheck I am about to revisit #1517, #1454, #1561, and #1560 as a part of our on-going effort to a major enhancement to the runner entrypoints being made in #1759. This change is a counterpart of #1852. #1852 enables you to easily run shellcheck locally. This enables you to automatically run shellcheck on every pull request. Currently, any shellcheck error does not result in failing the workflow job. Once we addressed all the shellcheck findings, we can flip the fail_on_error option to true and let jobs start failing on pull requests that introduce invalid or suspicious bash code.

Fleshgrinder requested review from mumoshu and toast-gear as code owners May 17, 2022 07:54

Fleshgrinder commented May 17, 2022

View reviewed changes

toast-gear reviewed Jun 22, 2022

View reviewed changes

mumoshu approved these changes Jun 23, 2022

View reviewed changes

mumoshu merged commit 071898c into actions:master Jun 23, 2022

Fleshgrinder deleted the startup.sh branch June 23, 2022 06:40

toast-gear added a commit that referenced this pull request Jun 23, 2022

Revert "Overhauled startup.sh Script (#1454)"

949b614

This reverts commit 071898c.

toast-gear mentioned this pull request Jun 23, 2022

Revert "Overhauled startup.sh Script" #1561

Merged

toast-gear added a commit that referenced this pull request Jun 23, 2022

revert: "Overhauled startup.sh Script (#1454)" (#1561)

84d16c1

This reverts commit 071898c.

mumoshu mentioned this pull request Jun 28, 2022

Fix startup.sh of runner from exiting Bad file descriptor error #1560

Closed

mumoshu mentioned this pull request Sep 25, 2022

makefile: Add shellcheck installation and run targets #1852

Merged

mumoshu mentioned this pull request Sep 25, 2022

Add workflow for validating runner scripts with shellcheck #1853

Merged

mumoshu mentioned this pull request Sep 25, 2022

runner: Address all shellcheck findings #1854

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Overhauled `startup.sh` Script #1454

Overhauled `startup.sh` Script #1454

Fleshgrinder commented May 17, 2022

Fleshgrinder May 17, 2022

mumoshu commented Jun 7, 2022

Fleshgrinder commented Jun 7, 2022

mumoshu commented Jun 7, 2022 •

edited

Loading

Fleshgrinder commented Jun 7, 2022 •

edited

Loading

mumoshu commented Jun 7, 2022

toast-gear Jun 22, 2022 •

edited

Loading

Fleshgrinder Jun 22, 2022

mumoshu Jun 23, 2022

toast-gear left a comment •

edited

Loading

mumoshu commented Jun 23, 2022

mumoshu commented Jun 23, 2022

Overhauled startup.sh Script #1454

Overhauled startup.sh Script #1454

Conversation

Fleshgrinder commented May 17, 2022

Fleshgrinder May 17, 2022

Choose a reason for hiding this comment

mumoshu commented Jun 7, 2022

Fleshgrinder commented Jun 7, 2022

mumoshu commented Jun 7, 2022 • edited Loading

Fleshgrinder commented Jun 7, 2022 • edited Loading

mumoshu commented Jun 7, 2022

toast-gear Jun 22, 2022 • edited Loading

Choose a reason for hiding this comment

Fleshgrinder Jun 22, 2022

Choose a reason for hiding this comment

mumoshu Jun 23, 2022

Choose a reason for hiding this comment

toast-gear left a comment • edited Loading

Choose a reason for hiding this comment

mumoshu commented Jun 23, 2022

mumoshu commented Jun 23, 2022

Overhauled `startup.sh` Script #1454

Overhauled `startup.sh` Script #1454

mumoshu commented Jun 7, 2022 •

edited

Loading

Fleshgrinder commented Jun 7, 2022 •

edited

Loading

toast-gear Jun 22, 2022 •

edited

Loading

toast-gear left a comment •

edited

Loading