
Access to the path '/proc/<ID>/oom_score_adj' is denied #3132

Closed
4 tasks done
romanvogman opened this issue Dec 6, 2023 · 27 comments
Labels
gha-runner-scale-set Related to the gha-runner-scale-set mode question Further information is requested

Comments


romanvogman commented Dec 6, 2023

Controller Version

0.7.0

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

Trigger a jenkins-action job with a self-hosted GitHub runner

Describe the bug

Trying to set up a self-hosted runner in Kubernetes mode that will replace a self-hosted runner which currently runs on a dedicated VM.
When running a GitHub Action that runs a jenkins action, I'm seeing a permissions error: Access to the path '/proc/<ID>/oom_score_adj' is denied

Describe the expected behavior

I expect it to run the same as it does on the self-hosted runner on a dedicated VM, rather than failing in the Kubernetes cluster.

Additional Context

Followed this video to set it up.

This is the runner config I'm currently using for the helm chart:

containerMode:
  type: "kubernetes"  ## type can be set to dind or kubernetes
  ## the following is required when containerMode.type=kubernetes
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "standard-rwo"
    resources:
      requests:
        storage: 1Gi

template:
  spec:
    securityContext:
      fsGroup: 123
      runAsUser: 1001
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
          - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
            value: "false"
        resources:
          limits:
            memory: "512Mi"
            cpu: "500m"
          requests:
            memory: "128Mi"
            cpu: "100m"

Controller Logs

same

Runner Pod Logs

[WORKER 2023-12-06 12:45:00Z INFO ProcessInvokerWrapper] Starting process:
[WORKER 2023-12-06 12:45:00Z INFO ProcessInvokerWrapper]   File name: '/home/runner/externals/node16/bin/node'
[WORKER 2023-12-06 12:45:00Z INFO ProcessInvokerWrapper]   Arguments: '/home/runner/k8s/index.js'
[WORKER 2023-12-06 12:45:00Z INFO ProcessInvokerWrapper]   Working directory: '/home/runner/_work/end2end/end2end'
[WORKER 2023-12-06 12:45:00Z INFO ProcessInvokerWrapper]   Require exit code zero: 'False'
[WORKER 2023-12-06 12:45:00Z INFO ProcessInvokerWrapper]   Encoding web name:  ; code page: ''
[WORKER 2023-12-06 12:45:00Z INFO ProcessInvokerWrapper]   Force kill process on cancellation: 'False'
[WORKER 2023-12-06 12:45:00Z INFO ProcessInvokerWrapper]   Redirected STDIN: 'True'
[WORKER 2023-12-06 12:45:00Z INFO ProcessInvokerWrapper]   Persist current code page: 'False'
[WORKER 2023-12-06 12:45:00Z INFO ProcessInvokerWrapper]   Keep redirected STDIN open: 'False'
[WORKER 2023-12-06 12:45:00Z INFO ProcessInvokerWrapper]   High priority process: 'False'
[WORKER 2023-12-06 12:45:00Z INFO ProcessInvokerWrapper] Failed to update oom_score_adj for PID: 57.
[WORKER 2023-12-06 12:45:00Z INFO ProcessInvokerWrapper] System.UnauthorizedAccessException: Access to the path '/proc/57/oom_score_adj' is denied.
[WORKER 2023-12-06 12:45:00Z INFO ProcessInvokerWrapper]  ---> System.IO.IOException: Permission denied
[WORKER 2023-12-06 12:45:00Z INFO ProcessInvokerWrapper]    --- End of inner exception stack trace ---
[WORKER 2023-12-06 12:45:00Z INFO ProcessInvokerWrapper]    at System.IO.RandomAccess.WriteAtOffset(SafeFileHandle handle, ReadOnlySpan`1 buffer, Int64 fileOffset)
[WORKER 2023-12-06 12:45:00Z INFO ProcessInvokerWrapper]    at System.IO.Strategies.OSFileStreamStrategy.Write(ReadOnlySpan`1 buffer)
[WORKER 2023-12-06 12:45:00Z INFO ProcessInvokerWrapper]    at System.IO.Strategies.BufferedFileStreamStrategy.Flush(Boolean flushToDisk)
[WORKER 2023-12-06 12:45:00Z INFO ProcessInvokerWrapper]    at System.IO.Strategies.BufferedFileStreamStrategy.Dispose(Boolean disposing)
[WORKER 2023-12-06 12:45:00Z INFO ProcessInvokerWrapper]    at System.IO.StreamWriter.CloseStreamFromDispose(Boolean disposing)
[WORKER 2023-12-06 12:45:00Z INFO ProcessInvokerWrapper]    at System.IO.StreamWriter.Dispose(Boolean disposing)
[WORKER 2023-12-06 12:45:00Z INFO ProcessInvokerWrapper]    at System.IO.File.WriteAllText(String path, String contents)
[WORKER 2023-12-06 12:45:00Z INFO ProcessInvokerWrapper]    at GitHub.Runner.Sdk.ProcessInvoker.WriteProcessOomScoreAdj(Int32 processId, Int32 oomScoreAdj)
[WORKER 2023-12-06 12:45:00Z INFO ProcessInvokerWrapper] Process started with process id 57, waiting for process exit.
@romanvogman romanvogman added bug Something isn't working gha-runner-scale-set Related to the gha-runner-scale-set mode needs triage Requires review from the maintainers labels Dec 6, 2023

github-actions bot commented Dec 6, 2023

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

@nikola-jokic

Hey @romanvogman,

This issue is related to the runner. However, can you please confirm that the job executes without issues? I know the runner raises this exception, but usually it does not influence the execution of the job. I'm curious whether this exception is affecting your job, or are you reporting that the runner throws the exception?

@nikola-jokic nikola-jokic added question Further information is requested and removed bug Something isn't working needs triage Requires review from the maintainers labels Dec 6, 2023
@romanvogman

Hi @nikola-jokic!
Sadly it fails with the following error, which also causes the runner pod to be terminated and a new one to be launched afterwards (min instances is set to 1, so perhaps that's the reason for scaling a new one):

[WORKER 2023-12-06 13:45:23Z ERR  StepsRunner] Caught exception from step: System.Exception: Executing the custom container implementation failed. Please contact your self hosted runner administrator.
[WORKER 2023-12-06 13:45:23Z ERR  StepsRunner]  ---> System.Exception: The hook script at '/home/runner/k8s/index.js' running command 'RunContainerStep' did not execute successfully
[WORKER 2023-12-06 13:45:23Z ERR  StepsRunner]    at GitHub.Runner.Worker.Container.ContainerHooks.ContainerHookManager.ExecuteHookScript[T](IExecutionContext context, HookInput input, ActionRunStage stage, String prependPath)
[WORKER 2023-12-06 13:45:23Z ERR  StepsRunner]    --- End of inner exception stack trace ---
[WORKER 2023-12-06 13:45:23Z ERR  StepsRunner]    at GitHub.Runner.Worker.Container.ContainerHooks.ContainerHookManager.ExecuteHookScript[T](IExecutionContext context, HookInput input, ActionRunStage stage, String prependPath)
[WORKER 2023-12-06 13:45:23Z ERR  StepsRunner]    at GitHub.Runner.Worker.Container.ContainerHooks.ContainerHookManager.RunContainerStepAsync(IExecutionContext context, ContainerInfo container, String dockerFile)
[WORKER 2023-12-06 13:45:23Z ERR  StepsRunner]    at GitHub.Runner.Worker.Handlers.ContainerActionHandler.RunAsync(ActionRunStage stage)
[WORKER 2023-12-06 13:45:23Z ERR  StepsRunner]    at GitHub.Runner.Worker.ActionRunner.RunAsync()
[WORKER 2023-12-06 13:45:23Z ERR  StepsRunner]    at GitHub.Runner.Worker.StepsRunner.RunStepAsync(IStep step, CancellationToken jobCancellationToken)
[WORKER 2023-12-06 13:45:23Z INFO StepsRunner] Step result: Failed


[RUNNER 2023-12-06 13:45:26Z INFO Terminal] WRITE LINE: 2023-12-06 13:45:26Z: Job execute tests completed with result: Failed
2023-12-06 13:45:26Z: Job execute tests completed with result: Failed

√ Removed .credentials
√ Removed .runner
[RUNNER 2023-12-06 13:45:27Z INFO Listener] Runner execution has finished with return code 0
Runner listener exit with 0 return code, stop the service, no retry needed.
Exiting runner...

@nikola-jokic

Oh, from this report, the oom_score_adj exception is definitely not what is causing the job to fail.

The output you provided shows that the hook execution failed. We should include better error reporting in the hook; a bare "HTTP request failed" is not nearly enough for users to troubleshoot configuration issues.

It is possible that node pressure is causing this kind of issue. The job pod needs to land on the runner's node, so that may be causing issues with the hook implementation.

I will close this issue here, since it is not ARC related, but feel free to comment on it!


zerola commented Jan 24, 2024

Hi @romanvogman , we have encountered the same issue as you described (using our GKE cluster for ARC). Just wondering - have you managed to solve it?

@dmalone-keebo

Same here GKE + ARC


Nuru commented Jan 27, 2024

@nikola-jokic wrote:

This issue is related to the runner.
...
I will close this issue here, since it is not ARC related, but feel free to comment on it!

@nikola-jokic Where is the right place to open this issue so it gets addressed? I remain confused about where the source code is for the runners used for Runner Controller Sets and where to open issues about them.

This is still happening in version 0.8.2

System.UnauthorizedAccessException: Access to the path '/proc/224/oom_score_adj' is denied

@nikola-jokic

Hey,

Just to clarify, the "access to the path is denied" issue should not influence the workings of the runner at all. It is just an annoying exception that the runner throws. If you want to report it, you can create an issue in the runner repo.

As far as the error reporting goes with the hook, we are hoping to publish a new 0.5.1 release soon and re-publish the image. That can help troubleshoot the hook setup. However, the System.UnauthorizedAccessException has nothing to do with the hook's HTTP error.


Nuru commented Jan 29, 2024

@nikola-jokic wrote:

If you want to submit it, you can create an issue in the runner repo.

See, this is what I'm talking about with regard to confusion. That repo (actions/runner), as far as I can tell, is for the Summerwind runner only (current version v2.312.0), but this issue is for the GitHub self-hosted runner image (version 0.7.0 as of this issue; the current version is v0.8.2). I don't know where to report issues on that runner (as opposed to the controller).


zerola commented Jan 31, 2024

@nikola-jokic Could you advise how to troubleshoot the hook problems? As described above, we are both running in GKE (standard, not Autopilot); regular GitHub jobs work fine, the problem is with the containerized ones and Kubernetes mode in ARC. The runner pod should start a second workflow pod for the container, but this is not happening. I can see inside the runner pod that the hook process is running, however I do not see any relevant logs, even when I tried to provide the RUNNER_DEBUG variable. I have also checked the Kubernetes API for authorization problems with regard to the service accounts used, but there was no problem. At the same time, Kubernetes events do not show anything suspicious. Thank you.

@nikola-jokic

Hey @zerola,

Of course. Currently, debugging the hook is almost impossible since the information about the error is hidden in the exception and not logged anywhere. This has changed starting with the 0.5.0 release, but that release introduced a bug on alpine containers, so we ended up rolling back the hook version shipped with runner version 2.312.0. We have a PR ready that should be released, but for now, you would have to build your own hook and provide it to the runner. If you decide to go with that approach, please use the branch where this PR is, or, if you don't use alpine containers in your workflow, you can probably safely use the 0.5.0 release.
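
For anyone trying that, the runner locates the hook through the ACTIONS_RUNNER_CONTAINER_HOOKS environment variable (the ARC runner image points it at /home/runner/k8s/index.js), so a locally built hook can be baked into the image or mounted and referenced from the scale set values. A minimal sketch; the custom-k8s path is a placeholder for wherever you put your own build:

template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
          ## Point the runner at a custom-built container hook instead of the bundled one.
          ## /home/runner/custom-k8s/index.js is a placeholder path for your own build.
          - name: ACTIONS_RUNNER_CONTAINER_HOOKS
            value: /home/runner/custom-k8s/index.js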


zerola commented Jan 31, 2024

Hi @nikola-jokic, thanks for the instructions. I have built my own runner image based on your https://github.com/actions/runner/blob/main/images/Dockerfile and provided RUNNER_VERSION=2.312.0 with RUNNER_CONTAINER_HOOKS_VERSION=0.5.0.
However, the logs from the runner container still contain only this sort of information:

[WORKER 2024-01-31 16:38:36Z ERR  StepsRunner] Caught exception from step: System.Exception: Executing the custom container implementation failed. Please contact your self hosted runner administrator.
[WORKER 2024-01-31 16:38:36Z ERR  StepsRunner]  ---> System.Exception: The hook script at '/home/runner/k8s/index.js' running command 'PrepareJob' did not execute successfully
[WORKER 2024-01-31 16:38:36Z ERR  StepsRunner]    at GitHub.Runner.Worker.Container.ContainerHooks.ContainerHookManager.ExecuteHookScript[T](IExecutionContext context, HookInput input, ActionRunStage stage, String prependPath)
[WORKER 2024-01-31 16:38:36Z ERR  StepsRunner]    --- End of inner exception stack trace ---
[WORKER 2024-01-31 16:38:36Z ERR  StepsRunner]    at GitHub.Runner.Worker.Container.ContainerHooks.ContainerHookManager.ExecuteHookScript[T](IExecutionContext context, HookInput input, ActionRunStage stage, String prependPath)
[WORKER 2024-01-31 16:38:36Z ERR  StepsRunner]    at GitHub.Runner.Worker.Container.ContainerHooks.ContainerHookManager.PrepareJobAsync(IExecutionContext context, List`1 containers)
[WORKER 2024-01-31 16:38:36Z ERR  StepsRunner]    at GitHub.Runner.Worker.ContainerOperationProvider.StartContainersAsync(IExecutionContext executionContext, Object data)
[WORKER 2024-01-31 16:38:36Z ERR  StepsRunner]    at GitHub.Runner.Worker.JobExtensionRunner.RunAsync()
[WORKER 2024-01-31 16:38:36Z ERR  StepsRunner]    at GitHub.Runner.Worker.StepsRunner.RunStepAsync(IStep step, CancellationToken jobCancellationToken)
[WORKER 2024-01-31 16:38:36Z INFO StepsRunner] Step result: Failed
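
For reference, a custom-built image along those lines would typically be wired in by overriding the runner container image in the scale set values; a sketch where the registry and tag are placeholders, not the exact names used here:

template:
  spec:
    containers:
      - name: runner
        ## Placeholder reference to an image built from images/Dockerfile with
        ## RUNNER_VERSION=2.312.0 and RUNNER_CONTAINER_HOOKS_VERSION=0.5.0.
        image: registry.example.com/custom-actions-runner:2.312.0-hooks-0.5.0
        command: ["/home/runner/run.sh"]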

@nikola-jokic

Can you please turn on debugging and see the output in the workflow?


zerola commented Feb 1, 2024

Can you please turn on debugging and see the output in the workflow?

Could you advise how to turn on debugging, please? I found only the RUNNER_DEBUG env variable, which is set.

@nikola-jokic

Does the step output that can be seen in the UI show the reason for the failure? Based on this issue, it does seem to help, so I'm trying to understand how we are missing the HTTP response log on the latest 0.5.0 version.


zerola commented Feb 1, 2024

No, the output in the UI is still this:

Error: Error: Client network socket disconnected before secure TLS connection was established
Error: Process completed with exit code 1.
Error: Executing the custom container implementation failed. Please contact your self hosted runner administrator.

@caiocsgomes

Have you guys managed to find a solution for this? I'm running into the same problem.

@caiocsgomes

I'm not able to figure out what the problem is; I'm getting the same logs as @zerola.


zerola commented Feb 28, 2024

@caiocsgomes - Unfortunately no; in the end we decided to use Docker-in-Docker mode for containerized workflows, and that works. In any case, I will keep an eye on this PR in case someone manages to solve it.


MPV commented Mar 12, 2024

If you want to submit it, you can create an issue in the runner repo.

So, is this related to / caused by this upstream issue?

@romanvogman

Hey @zerola, sorry for the late reply.

As @nikola-jokic mentioned, the issue wasn't related to the Access to the path '/proc/57/oom_score_adj' is denied error. We assumed it was related because it was the main exception we saw in the logs.

On our side we were running containerized tasks, which required ARC to run in dind mode. After changing to dind (along with a few other unrelated fixes) the issue was resolved; see the values sketch below.

Hope it helps anyone who encounters this issue.
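
As a minimal sketch of that switch (only the containerMode block changes; dind does not need the work volume claim):

containerMode:
  ## Docker-in-Docker mode runs job containers inside the runner pod,
  ## so kubernetesModeWorkVolumeClaim is not required here.
  type: "dind"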

@remidebette

Hi, can someone definitively confirm that Kubernetes mode does not support containerized tasks?

@nikola-jokic

We should probably report the error better on the hook side. There is definitely room to improve.

@remidebette, I'm sorry, I don't understand: what do you mean by containerized task? Are you referring to a container step?


remidebette commented Apr 12, 2024

Hi @nikola-jokic, we have been trying a "vanilla" install of the scale set helm chart in Kubernetes mode, switched our CI jobs to containers, and are encountering the issue that I see in several tickets:

##[debug] ---> System.Exception: The hook script at '/home/runner/k8s/index.js' running command 'PrepareJob' did not execute successfully

In my understanding, this script is not stable; people in online discussions have issues with it and switch back to dind instead.

For example
actions/runner-container-hooks#128 (comment)
actions/runner-container-hooks#103

What is specific to us is that we are using an on-premises Rancher cluster, the PVC class is ceph-rbd, and the Helm charts are installed with Flux.

@nikola-jokic

The script should be fine, but the error reported does not give you any clue about what is going on.
Maybe if you turn on debugging for the workflow, you can see it?
There are e2e tests confirming that the container hook is running the job, and I'm using it regularly, so I'm wondering if there is something wrong either with the configuration or with the image (i.e. it fails to pull, or something else).

If you can, please let me know if the workflow pod is created but something is incorrect there. If there is an example workflow I can run to see what is going on, that would also be helpful. One thing to note: if you are using private images, the container hook will not inherit the pull policy of the runner pod (see the sketch below).
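
For the private-image case, one approach (a hedged sketch, assuming the hook's pod-template extension via the ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE environment variable; the ConfigMap, file path, and secret names below are placeholders) is to mount an extra pod template that carries imagePullSecrets for the workflow pod:

template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
          ## Extra pod template that the container hook merges into the workflow pod it creates.
          ## The mounted file is expected to carry spec-level fields such as imagePullSecrets.
          - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
            value: /home/runner/pod-templates/default.yaml
        volumeMounts:
          - name: pod-templates
            mountPath: /home/runner/pod-templates
            readOnly: true
    volumes:
      - name: pod-templates
        configMap:
          name: hook-pod-template   ## placeholder ConfigMap containing default.yaml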

@noamgreen

@nikola-jokic Hi, this error comes from the runner itself; this is the code that writes the value:

#if OS_LINUX
        private void WriteProcessOomScoreAdj(int processId, int oomScoreAdj)
        {
            try
            {
                string procFilePath = $"/proc/{processId}/oom_score_adj";
                if (File.Exists(procFilePath))
                {
                    File.WriteAllText(procFilePath, oomScoreAdj.ToString());
                    Trace.Info($"Updated oom_score_adj to {oomScoreAdj} for PID: {processId}.");
                }
            }
            catch (Exception ex)
            {
                // Best effort: a failed write is only traced, not rethrown.
                Trace.Info($"Failed to update oom_score_adj for PID: {processId}.");
                Trace.Info(ex.ToString());
            }
        }
#endif

https://github.com/actions/runner/blob/2979fbad9460c32bea9419595d8c3eacc8f4930d/src/Runner.Sdk/ProcessInvoker.cs#L657

Not sure why it fails.

@Nek0trkstr

Hi, I've encountered the same error while trying to run ARC in Kubernetes mode, following the same guide as @romanvogman.
Adding the following lines resolved the issue:

containerMode:
  type: "kubernetes"  ## type can be set to dind or kubernetes
  ## the following is required when containerMode.type=kubernetes
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    # For local testing, use https://github.com/openebs/dynamic-localpv-provisioner/blob/develop/docs/quickstart.md to provide dynamic provision volume with storageClassName: openebs-hostpath
    storageClassName: "openebs-hostpath"
    resources:
      requests:
        storage: 1Gi
+  kubernetesModeServiceAccount:
+    annotations:

This wasn't part of the video that we both followed. I see this option already existed in 0.7.0, so it was introduced somewhere between 0.4.0 and 0.7.0.
