Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: webhook does not return error on 'instance not found'. #28

Merged
merged 1 commit into from
Oct 7, 2022

Conversation

maigl
Copy link
Contributor

@maigl maigl commented Oct 6, 2022

This case is probably caused by a webhook event that was meant for another runner controller / manager. No need to report this as error, we can simply ignore this and avoid noise in the logs.

Michael Kuhnt michael.kuhnt@mercedes-benz.com Mercedes-Benz Tech Innovation GmbH (ProviderInformation)

@maigl
Copy link
Contributor Author

maigl commented Oct 6, 2022

Hi @gabriel-samfira, here is another nit. I don't know if you're interested in those little things .. we might add bigger features in future :)

@gabriel-samfira
Copy link
Member

That's right. We get all webhooks, including those that target other runner types (Github hosted for example). Garm currently returns an error in all cases (queued, in_progress and completed) if the action targets tags we don't care about, or if the webhook targets a worker that we don't manage (like a github hosted one).

In all cases, the store will return a runnerErrors.ErrNotFound. The only potentially undesirable effects of this are:

  • log spam
  • you see a 404 error in github webhooks under "Recent deliveries"

I am a bit torn between knowing it's pointless to log/react to these events and correctly reporting back the fact that the webhook we got, was not meant for us 😄.

I think we can just not report an error if we get an ErrNotFound when a webhook is fired. Will leave a comment on the changed lines in a sec.

@@ -103,6 +104,11 @@ func (a *APIController) handleWorkflowJobEvent(w http.ResponseWriter, r *http.Re

if err := a.r.DispatchWorkflowJob(hookType, signature, body); err != nil {
log.Printf("failed to dispatch work: %s", err)
// in this case we probably got a webhook that was not meant for us
// no need to return an error
if strings.Contains(err.Error(), "fetching instance by name: not found") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We try to return meaningful errors we can test for, as much as we can. This case can be handled with:

if errors.Is(err, gErrors.ErrNotFound) {
    log.Printf("got not found error from DispatchWorkflowJob. webhook not meant for us?: %q", err)
    return nil
}

It would also be useful to move the log.Printf("failed to dispatch work: %s", err) statement under the if statement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure .. I'd love to use errors.Is but I was under the impression that there are multiple cases where we return ErrNotFound but only one of this cases means that its not our runner ("fetching instance by name: not found"). Other cases still might point to a real error. Is that wrong ?

Copy link
Member

@gabriel-samfira gabriel-samfira Oct 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TL;DR: It should be safe to ignore any ErrNotFound in this particular code path (webhook handling).

TS; NM: I will elaborate ( bare with me, I sometimes tend to be verbose 😄 ).

The web hook handler is how we consume GitHub events and react to them in 3 particular ways:

  • When a new job is queued (a queued event is sent)
  • When a job is started (an in_progress event is sent)
  • When a job is completed (a completed event is sent)

In the case of a queued event, garm will look at the job info and using the tags that were requested by the workflow, will attempt to find a pool where it will try to spawn a worker. Garm looks for a pool that matches all requested tags. If a pool is not found, then an ErrNotFound is returned.

This is not necessarily an error. It's just that we don't have a pool to handle that particular workflow, so for us, that webhook is a noop.

In the case of an in_progress event, garm will attempt to determine the worker name that is handling the job we were just notified about. When a worker name is found, garm will look for that particular worker and attempt to change it's status from idle to active. This is important because once that happens, if we have min_idle_runners set on the pool, garm will attempt to automatically create a new worker, if one was not already created during the queued event (which can happen if max_runners is reached).

If garm can't find the runner in the database, it will return ErrNotFound. But this means that we never managed the runner we got a hook for. So this may be handled by another controller or by github itself. So it is safe to ignore.

In the case of a completed event, we do pretty much the same thing as in the in_progress event, but we mark the runner as pending_delete. At this point the runner was already removed from github, so all that's left for us to do is remove it from our provider. If we can't find the runner, then it's either a runner we don't manage, or it was already deleted. Desired state is current state, so this too will be a noop.

There are a few things to consider here. This change only affects the webhook API endpoint that GitHub sends hook events to. The only consumer of this endpoint is GitHub itself. We don't interact with this at all. Ignoring the ErrNotFound error here, will only have the side effect of not seeing an error in the GitHub console. But there is no further handling of the error.

We should still log that error condition. We may want to know that GitHub requested something and we generated an ErrNotFound, even though we're ignoring it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, thanks for taking the time to create this PR! I was on the fence about this issue, but considering it bugs other people too, we might as well change it. 😅

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx for the elaborate explanation :) .. I changed the code as you have suggested.

This case is probably caused by a webhook event that was meant for
another runner controller / manager. No need to report this as
error, we can simply ignore this and avoid noise in the logs.
@gabriel-samfira
Copy link
Member

Thanks for the PR!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants