Document difference between public/private repo's for organizational runners #732

Open
consideRatio opened this issue Aug 18, 2021 · 26 comments
Labels
documentation Improvements or additions to documentation

Comments

@consideRatio
Contributor

consideRatio commented Aug 18, 2021

UPDATE

An organizational runner won't pick up a job from a public repo unless it's explicitly allowed to via a rather hard-to-find checkbox: see #732 (comment).

Action point to close issue: document it.


There is a troubleshooting section to fix common mistakes, but there is no debugging section to help users figure out how to identify what could have gone wrong.

I'd be happy to contribute such documentation if someone can help me debug my situation; I've been stuck on this for about 6 hours now.

My specific situation in need of debugging advice

  1. I've created an app within Org A, and only granted the repository level permission listed for organizational runners.
  2. I've installed the app in Org B
  3. I've created a k8s secret with credentials to use
  4. I've got pods running and ready, created from a RunnerDeployment, and they are listed in Org B as idle.
  5. I've created a PR that references self-hosted, but the job isn't picked up by the org runner, and my runner is stuck with the following logs, as read via kubectl:
√ Connected to GitHub

2021-08-18 20:29:02Z: Listening for Jobs

Version details

  • Chart 0.19.0
  • summerwind/actions-runner:latest (translates to v2.280.2-ubuntu-20.04-b6465c5 thanks to imagePullPolicy: Always)
  • A k8s cluster based on k3s, running on Raspberry Pi 4B computers with an Arm64 architecture.
@mumoshu
Collaborator

mumoshu commented Aug 19, 2021

@consideRatio Hey! I read about your situation and wasn't sure what your goal was.

So you seem to have correctly set up actions-runner-controller to successfully register organizational runners for Org B. That looks good.

Now, in step 5, which repository are you submitting a PR to? Is that repo in Org B? Then it should trigger a workflow run whose jobs will be scheduled onto Org B's runners. If it doesn't, it may be a bug in GitHub, not us, as all we do is to configure and deploy runners as you've specified.

If you're submitting a PR against a repo in Org A and asking why the jobs aren't scheduled onto Org B's runners, I don't understand how that could work. If that's the case, could you clarify a bit more about your goal?

@consideRatio
Contributor Author

Now, in step 5, which repository are you submitting a PR to? Is that repo in Org B?

Ah yes, I'm submitting a PR to a repo in Org B where the runners are observed as registered as running.

[...] it may be a bug in GitHub, not us, as all we do is to configure and deploy runners as you've specified.

Absolutely. Not having a deep understanding of this or actions/runner or related code bases, I find it hard to draw a conclusion on where to focus my attention to resolve the issue I have. Is it your belief that if the following conditions are met, then it probably is a bug in GitHub somehow rather than this repo?

  1. Runners are registered to an org according to GitHub's UI
  2. Runners report "Listening for Jobs"
  3. A repo in the org receives a pull request that has a job that runs-on: self-hosted, and the runner has a label of self-hosted.

A checklist like the one above would actually be helpful when debugging an unknown issue!
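
The first and third items of this checklist can also be checked against the GitHub REST API rather than the web UI. The following is a minimal sketch, assuming a token that is allowed to read the organization's self-hosted runners; the function names are hypothetical and not part of actions-runner-controller. It calls the "list self-hosted runners for an organization" endpoint and keeps the runners that are online, not busy, and carry every requested label. Passing this check is necessary but not sufficient for a job to be picked up.

```python
import json
import urllib.request

API = "https://api.github.com"


def fetch_org_runners(org, token):
    """Call GET /orgs/{org}/actions/runners and return the list of runner dicts.

    Only the first page of results is fetched; an org with many runners
    would need pagination, which this sketch leaves out.
    """
    req = urllib.request.Request(
        f"{API}/orgs/{org}/actions/runners",
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["runners"]


def idle_runners_with_labels(runners, wanted):
    """Return names of runners that are online, not busy, and carry every wanted label."""
    wanted = set(wanted)
    names = []
    for runner in runners:
        labels = {label["name"] for label in runner.get("labels", [])}
        if runner.get("status") == "online" and not runner.get("busy") and wanted <= labels:
            names.append(runner["name"])
    return names
```

A call such as `idle_runners_with_labels(fetch_org_runners("my-org", token), ["self-hosted"])` (with `my-org` as a placeholder) returning an empty list would point at registration or labels; a non-empty list means the problem lies elsewhere.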

@consideRatio
Contributor Author

consideRatio commented Aug 19, 2021

My next debugging ideas

  • The org I've tried against doesn't grant the GITHUB_TOKEN any notable permissions by default and instead requires them to be explicitly requested.
    • Action point on my end: verify this doesn't matter by testing against another org without this setting
  • The PR I've tried with updated the GitHub workflow file itself, so the job was triggered by a PR that opts in to using the self-hosted runners.
    • Action point on my end: verify this doesn't matter by triggering jobs to execute by push events to the default branch of some repo instead
  • I haven't granted the GitHub App the read/write administration permission on the repository level, which is needed for repository runners and was pre-filled when I clicked the link for organizational runners. I am only interested in organizational runners though.
    • Action point on my end: create a GitHub App requesting that permission as well, install it in a GitHub org, and try it there.

Btw thank you @mumoshu for your work on this project and responding to this potentially support-like issue. I hope to make it a contribution rather than just become a support errand for you maintainers by focusing on identifying advice to document under a "debugging" topic or similar.

@mumoshu
Collaborator

mumoshu commented Aug 19, 2021

if the following conditions are met, then it probably is a bug in GitHub somehow rather than this repo?

Yes, I believe so. Thanks for summarizing it nicely!

To be extra sure, can you share your RunnerDeployment manifest here? You should replace some concrete values in your manifest, like spec.organization to Org B to make it comparable with your description made above.

@mumoshu
Collaborator

mumoshu commented Aug 19, 2021

@consideRatio I saw your comment and although I see nothing suspicious in your setup right now, you could try:

  • Give more permissions to your GitHub app
  • Do use a private key, not GITHUB_TOKEN, for a GitHub App based deployment of actions-runner-controller, as explained in our README. (You mention GITHUB_TOKEN, but I don't understand why you need to mention it when you're deploying with a GitHub App.)

@consideRatio
Contributor Author

# kubectl apply -f actions-runner-controller-runnerdeployment.yaml
#
# reference example: https://github.com/actions-runner-controller/actions-runner-controller#additional-tweaks
#
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: jupyterhub-org
  namespace: actions-runner-controller
spec:
  replicas: 1
  template:
    spec:
      organization: jupyterhub
      labels:
        - self-hosted

I'm using a private key created from the GitHub App's config page, downloaded and added to a k8s secret read by the actions-runner-controller pod, so I'm not using a GitHub token. I also doubt that the organization's default permissions for the GITHUB_TOKEN injected into jobs are relevant, but it was one of the things I know could make my configuration stand out from others. I'm just making wild guesses at this point =/

@mumoshu
Collaborator

mumoshu commented Aug 19, 2021

@consideRatio What do you see in the Actions tab of your repo? Does the triggered job fail with a "no runner" error, or something else?

In your manifest, why do you explicitly specify the self-hosted runner label? I thought self-hosted was automatically added by GitHub to any runner on registration. I have never tried specifying it from my end so that might make some difference.

@consideRatio
Contributor Author

consideRatio commented Aug 19, 2021

Now it says "starting job...", but it has sometimes said that no org- or repo-level runner matched the label "self-hosted".

I specified self-hosted explicitly as the error mentioned it, even though I saw they already had such a label automatically registered.

I don't know why it sometimes says it's starting up and other times says there's no runner with a matching label.

@consideRatio
Contributor Author

Screenshot_20210819-041807

@mumoshu
Collaborator

mumoshu commented Aug 19, 2021

After a few moments, it shows this error. So, I believe this is due to some discrepancy between your expectation and the actual config.

https://github.com/jupyterhub/zero-to-jupyterhub-k8s/runs/3365601314?check_suite_focus=true

CleanShot 2021-08-19 at 11 28 42@2x

@mumoshu
Collaborator

mumoshu commented Aug 19, 2021

I specified self-hosted explicitly as the error mentioned it, even though I saw they already had such a label automatically registered.

Anyway, AFAIK, you don't need it. Can you try removing it from RunnerDeployment yaml?

@mumoshu
Collaborator

mumoshu commented Aug 19, 2021

Chart 0.19.0

To be extra sure, you should recheck your chart version. There's no chart with that version. 0.19.0 might be the controller version.

@consideRatio
Contributor Author

consideRatio commented Aug 19, 2021

NAME                     	NAMESPACE                	REVISION	UPDATED                                	STATUS  	CHART                           	APP VERSION
actions-runner-controller	actions-runner-controller	1       	2021-08-18 00:40:12.75749028 +0200 CEST	deployed	actions-runner-controller-0.12.7	0.19.0

Woops, okay, it is Chart version 0.12.7 - sorry for the confusion.

Anyway, AFAIK, you don't need it. Can you try removing it from RunnerDeployment yaml?

Absolutely, I've already done it. That was how I started, but I'll continue without having it explicitly listed.


Thank you for your attention and help to debug this @mumoshu, I have some work to do to investigate this further and will try to summarize findings after that!

@consideRatio
Contributor Author

consideRatio commented Aug 20, 2021

Yikes okay, I've tried all the ideas we discussed, with no change in the outcome. I've also tried the summerwind/actions-runner images with versions 2.277.1, 2.278.0, 2.279.0, 2.280.2, and 2.280.3 without any change in the outcome.

It seems some people report something similar from time to time on their Discourse forum without a clear resolution. I couldn't identify a sticky issue about this in https://github.com/actions/runner/issues either.

Overall, I remain clueless and not sure at all how to proceed.

@mumoshu
Collaborator

mumoshu commented Aug 23, 2021

@consideRatio Hey! Thanks for reporting. Unfortunately, I have no idea what would be the answer to your issue yet.

If I were you, I would try to isolate the cause by creating the GitHub App in the same organization you install it into, that is, both creating and installing the app in either Org A or Org B rather than across the two.

I would also try verifying that all other settings are correct by trying to make it work with a personal access token instead of a GitHub App.

Another possibility: try repository runners instead of the organizational runners you have already tried.

@mumoshu
Collaborator

mumoshu commented Aug 23, 2021

@consideRatio I was rereading your original issue details and was caught by this:

I've created an app within Org A, and only granted the repository level permission listed for organizational runners.

Could you share the exact list of permissions you've provided to your app?

Are you sure you also provided the Self-hosted runners (read / write) organizational permission?

https://github.com/actions-runner-controller/actions-runner-controller#deploying-using-github-app-authentication

@consideRatio
Contributor Author

I've reduced the complexity by using a single GitHub Organization where the app is defined and installed as well already. It made no difference.

Are you sure you also provided the Self-hosted runners (read / write) organizational permission?

Yepp!

Could you share the exact list of permissions you've provided to your app?

Repo Org
image image

The installed application within the organization describes the following permissions granted. It does not mention repository related permissions.

image

@consideRatio
Contributor Author

Do you think these points are of relevance?

  1. I didn't configure a webhook URL, and explicitly unchecked "active" in this section so that I could create the app without a webhook URL configured. My Helm chart installation has the webhook setup disabled as well.

    image

  2. I have not created a client secret for the GitHub App, but instead relied on a private key that I did create according to this repo's README.

    image

  3. I observe there is an opt-out'able feature enabled for the GitHub App. I didn't explicitly opt in or similar; it was a default.

    image

  4. I made the app installable by any org when creating the app.

    image

@mumoshu
Collaborator

mumoshu commented Aug 23, 2021

@consideRatio You aren't using actions-runner-controller's webhook-based autoscale, right? Then point 1 seems ok.

For 2, honestly, I haven't tried using client secrets so it may make some difference. Sorry for answering a question with a question, but could a client secret be used as an alternative to a personal access token?

For 3 and 4, I have not tried changing the defaults while testing actions-runner-controller (sry but developing all the features and testing all the combinations of settings myself isn't sustainable), so I can't say for sure how they affect the setup.

@consideRatio
Contributor Author

[...] could a client secret be used as an alternative to a personal access token?

I don't think so, I'm just confused in general. I remember reacting to the fact that I could not delete a client secret after creating one without creating a new one first. So, it's like they require you to have one, but at the same time, we are not using one. That hint from GitHub left me a bit confused about the situation, without any proper understanding of it.

sry but developing all the features and testing all the combinations of settings myself isn't sustainable

Absolutely understand that. I've bashed my head against this a bit more and tried various permutations. Still stuck with the same issue. I'll raise a question in the GitHub forums to ask how to debug a situation where I have a runner that is registered and idle according to GitHub, but that doesn't pick up any jobs even though it has matching labels.

My current hypothesis is that they are sending some request to my runner, but the runner either doesn't receive it due to some networking issue or ignores it without logging an error.

@consideRatio
Contributor Author

I installed tcpdump and analyzed the traffic coming to my runner pod. It seems like it's just a repeat of some form of keep-alive packets coming.

https://github.community/t/self-hosted-runner-registered-as-idle-but-not-picking-up-jobs/198240/2?u=consideratio

@mumoshu
Collaborator

mumoshu commented Aug 30, 2021

@consideRatio Thanks for the update and your patience. I just saw the response you got on the forum. Glad to see you've found the solution.

I wish I could have pointed it out myself. I'm feeling very sorry about that 😢

I think I've tested my GitHub App-based deployment by triggering jobs on a private repo. Apparently, that was the difference.

Based on this experience, we should at least add some notes about how to use organizational runners on public repositories. Would you agree?

The biggest gotcha from my perspective was that you need to configure the Default runner group even if you aren't going to use runner groups with your organizational runners. An organizational runner implicitly belongs to the Default runner group but who cares.

CleanShot 2021-08-30 at 09 14 41@2x
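
For anyone who prefers checking this setting without hunting for the checkbox, here is a minimal sketch using the GitHub REST API's "list self-hosted runner groups for an organization" endpoint. It assumes a token with permission to read the organization's runner groups, and the function names are hypothetical. It reports which runner groups have `allows_public_repositories` turned off, which is the condition that kept the runner idle in this thread.

```python
import json
import urllib.request

API = "https://api.github.com"


def fetch_runner_groups(org, token):
    """Call GET /orgs/{org}/actions/runner-groups and return the list of group dicts."""
    req = urllib.request.Request(
        f"{API}/orgs/{org}/actions/runner-groups",
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["runner_groups"]


def groups_blocking_public_repos(groups):
    """Return names of runner groups that do not accept jobs from public repositories."""
    return [
        group["name"]
        for group in groups
        # A missing field is treated as "not allowed" to stay on the cautious side.
        if not group.get("allows_public_repositories", False)
    ]
```

If `groups_blocking_public_repos(fetch_runner_groups("my-org", token))` (with `my-org` as a placeholder) includes `Default`, organizational runners in that group will not take jobs from public repos until the setting is changed.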

@consideRatio
Contributor Author

consideRatio commented Aug 30, 2021

Based on this experience, we should at least add some notes about how to use organizational runners on public repositories. Would you agree?

Haha yes, it can save some workdays of debugging ;D Excellent image!

I wish I could have pointed it out myself. I'm feeling very sorry about that 😢

Oh no worries at all, I'm very thankful for your help considering this with me! 🙇 ❤️

@stale

stale bot commented Sep 29, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Sep 29, 2021
@consideRatio consideRatio changed the title Debugging advice documentation Document difference between public/private repo's for organizational runners Sep 29, 2021
@stale stale bot removed the stale label Sep 29, 2021
@stale

stale bot commented Oct 29, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Oct 29, 2021
@mumoshu mumoshu added documentation Improvements or additions to documentation and removed stale labels Nov 10, 2021
@dongho-jung
Contributor

you saved my day indeed
