
Handle GitLab job cancellation #15

Merged
merged 1 commit into main from wip/refi64/cancellation on Oct 17, 2022

Conversation

refi64
Collaborator

@refi64 refi64 commented Sep 9, 2022

When a job is cancelled on GitLab, the corresponding handler futures will be dropped. If the handler was already created, the cleanup() method will still be called, giving the opportunity to clean up after anything that may have been interrupted.

Signed-off-by: Ryan Gonzalez ryan.gonzalez@collabora.com

@refi64 refi64 force-pushed the wip/refi64/cancellation branch 2 times, most recently from 6eb243f to bdba64c on September 9, 2022 15:59
@eds-collabora
Collaborator

I think an API with an explicit cancellation token would be preferable, as per our discussion. Writing code that can cease execution at any point and remain consistent can be difficult, and storing all the state needed to unwind from wherever execution stopped in a final cleanup() call is probably not ideal. Conversely, simple clients that don't care about cancellation can safely do nothing when it occurs; when they next try to interact with GitLab, an error will be signalled. All cancellation really buys you is time, and in most situations a cancellation token is ideal for giving you some control over how you abort.

@sjoerdsimons
Collaborator

> I think an API with an explicit cancellation token would be preferable, as per our discussion. Writing code that can cease execution at any point and remain consistent can be difficult, and storing all the state needed to unwind from wherever execution stopped in a final cleanup() call is probably not ideal. Conversely, simple clients that don't care about cancellation can safely do nothing when it occurs; when they next try to interact with GitLab, an error will be signalled. All cancellation really buys you is time, and in most situations a cancellation token is ideal for giving you some control over how you abort.

Well, cancellation is not just about time; it's also about load on either the local system (not typical with runners) or the remote system (more typical). And of course it's the only way for a user to really stop a job (either due to a timeout from GitLab, IIRC, or the user pressing cancel). So I do think it's important for runner implementations to actually stop when cancelled.

However, I would like it to be simple for simple runners to be correct without being forced to deal with e.g. cancellation, especially since for things like reqwest all you can do in the end is drop the future. So for those cases I'd expect dropping the step future to actually be the right thing, potentially followed by the cleanup function cancelling e.g. a remote task.

Where explicit access to the cancellation token is more likely to be required is for implementations that spawn tasks from within their step function, or for those that want some kind of transactional safety (in other words, explicitly opting out of cancellation for some section of code).

I wonder if, to handle both cases nicely, we should add an async fn step_with_cancellation to the trait which passes the cancellation token and has a default implementation that is simply a select! on the cancellation and the current step function. That means simple runners, which tbh I expect our main current runners to be, don't have to bother, and there is an option for more complex runners to take cancellation into account.
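
A rough sketch of that idea, assuming tokio, async-trait, and tokio_util's CancellationToken; the method name and the Phase/JobResult placeholders here just mirror the discussion, not necessarily the crate's final API:

```rust
use async_trait::async_trait;
use tokio_util::sync::CancellationToken;

// Placeholder types standing in for the crate's real ones.
pub enum Phase { Step, AfterScript }
pub type JobResult = Result<(), ()>;

#[async_trait]
pub trait JobHandler: Send {
    /// Existing per-step entry point.
    async fn step(&mut self, script: &[String], phase: Phase) -> JobResult;

    /// Proposed addition: like step(), but with access to the cancellation
    /// token. The default simply races step() against cancellation, so
    /// simple runners don't need to think about it at all.
    async fn step_with_cancellation(
        &mut self,
        script: &[String],
        phase: Phase,
        cancel: &CancellationToken,
    ) -> JobResult {
        tokio::select! {
            result = self.step(script, phase) => result,
            _ = cancel.cancelled() => Err(()),
        }
    }

    /// Still called after cancellation, to release anything left behind.
    async fn cleanup(&mut self) {}
}
```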

@refi64
Collaborator Author

refi64 commented Sep 13, 2022

> I wonder if, to handle both cases nicely, we should add an async fn step_with_cancellation to the trait which passes the cancellation token and has a default implementation that is simply a select! on the cancellation and the current step function

While implementing this, I noticed a particular downside to adding step_with_cancellation to the same trait: if one wants to override step_with_cancellation, they also have to override step, because the latter has no default.

The two main fixes would be:

  • split the *_with_cancellation functions out into their own separate trait, named RunWithCancellation, JobHandlerWithCancellation, or similar
  • provide a dummy default for step()

Personally, I'm leaning towards the former, because it feels a bit cleaner overall wrt separation of concerns (despite being a bit more verbose).
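
A sketch of what that first option could look like, reusing the hypothetical names from the earlier snippet (with JobHandler keeping only step() and cleanup()): the cancellation-aware method lives in its own trait, and a blanket implementation gives every plain JobHandler the drop-on-cancel behaviour for free.

```rust
#[async_trait]
pub trait CancellableJobHandler: Send {
    /// Cancellation-aware variant of step(); implementors decide where and
    /// how to react to the token.
    async fn step_with_cancellation(
        &mut self,
        script: &[String],
        phase: Phase,
        cancel: &CancellationToken,
    ) -> JobResult;

    async fn cleanup(&mut self) {}
}

// Blanket impl: any plain JobHandler automatically gets the "just drop the
// step future on cancellation" behaviour, so simple runners stay unchanged.
#[async_trait]
impl<H: JobHandler> CancellableJobHandler for H {
    async fn step_with_cancellation(
        &mut self,
        script: &[String],
        phase: Phase,
        cancel: &CancellationToken,
    ) -> JobResult {
        tokio::select! {
            result = self.step(script, phase) => result,
            _ = cancel.cancelled() => Err(()),
        }
    }

    async fn cleanup(&mut self) {
        JobHandler::cleanup(self).await
    }
}
```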

@eds-collabora
Collaborator

> Well, cancellation is not just about time; it's also about load on either the local system (not typical with runners) or the remote system (more typical). And of course it's the only way for a user to really stop a job (either due to a timeout from GitLab, IIRC, or the user pressing cancel). So I do think it's important for runner implementations to actually stop when cancelled.

But as you point out, the load is generally not from the runner at present, so dropping the future doesn't decrease load at all. The remote load is only decreased when the runner proactively takes some step to abort whatever remote work it is managing. That means to my mind that the pattern of "well I'll just stop executing it" isn't very helpful [I also think it's a future minefield as I'll come to].

> However, I would like it to be simple for simple runners to be correct without being forced to deal with e.g. cancellation, especially since for things like reqwest all you can do in the end is drop the future. So for those cases I'd expect dropping the step future to actually be the right thing, potentially followed by the cleanup function cancelling e.g. a remote task.

I'm not clear on what those simple runners would look like. Our current runners manage work on a remote resource, so they're well placed to look at a cancellation token and decide what to do next at safe and convenient moments. They would both have to proactively make a cancellation call, wouldn't they, to respond appropriately to cancellation? A very noticeable case for LAVA, for example, is the need to unqueue jobs that have not started when cancellation occurs.

In the case of a local executor that is not managing some remote resource, surely it'd be better to have a choice of when you stop executing. It seems very difficult to always be ready to transition to cleanup(), and to have recorded enough state in yourself to do so (e.g. in LAVA, at the least one has to record whether the job was submitted, what its id was, and of course whether it completed).

> Where explicit access to the cancellation token is more likely to be required is for implementations that spawn tasks from within their step function, or for those that want some kind of transactional safety (in other words, explicitly opting out of cancellation for some section of code).

But surely the "simple" runner writer shouldn't have to worry about whether transactional safety is going to be violated. It's "safe" to ignore cancellation completely; we've gotten a long way without worrying about it at all. It doesn't have weird edge cases or leave things in unexpected states. Just checking a cancellation token to see if you need to proactively cancel doesn't seem onerous to me at all.

> I wonder if, to handle both cases nicely, we should add an async fn step_with_cancellation to the trait which passes the cancellation token and has a default implementation that is simply a select! on the cancellation and the current step function. That means simple runners, which tbh I expect our main current runners to be, don't have to bother, and there is an option for more complex runners to take cancellation into account.

I don't really understand the need here. I don't know why it's desirable to have a default behaviour of "any await can cease your execution, and if you need to clean up, you must do a blanket cleanup at the end, storing sufficient state to do so". That just seems like a really hard model to make reliable.

A much simpler model is just "here's a thing that tells you if you were cancelled, take action if you need to". It has no extra preconditions, it's easy to explain, it makes it obvious what happens if you ignore it, and it imposes no new behaviours throughout your runner (bear in mind it's not just your runner that's treated this way; all referenced libraries need to behave sensibly too when their futures are dropped).

And it matches the LAVA case well: you need to proactively cancel your LAVA job if cancellation occurs, so why would I want to do that from a point where I don't know what dynamic state I had (i.e. from cleanup() instead of inline, where it's much easier to think about)?
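
For contrast, a minimal sketch of the inline-token pattern being argued for here, with a hypothetical RemoteJob client standing in for e.g. a LAVA API wrapper: the monitoring loop checks the token at convenient points and proactively cancels the remote work while it still has full context.

```rust
use std::time::Duration;
use tokio_util::sync::CancellationToken;

// Hypothetical client for a remotely managed job (e.g. a LAVA submission).
struct RemoteJob {
    id: u64,
}

impl RemoteJob {
    async fn is_finished(&self) -> bool {
        // Poll the remote service here.
        false
    }

    async fn cancel(&self) {
        // Tell the remote service to abort job `self.id`.
    }
}

// Inline cancellation handling: the monitor loop knows exactly which job it
// submitted, so aborting the remote work happens right where that context is
// available, not in a catch-all cleanup().
async fn monitor(job: &RemoteJob, cancel: &CancellationToken) -> Result<(), ()> {
    loop {
        if cancel.is_cancelled() {
            job.cancel().await;
            return Err(());
        }
        if job.is_finished().await {
            return Ok(());
        }
        // A coarse polling interval is fine; cancellation only needs to be
        // noticed within seconds, not instantly.
        tokio::time::sleep(Duration::from_secs(10)).await;
    }
}
```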

@refi64 refi64 force-pushed the wip/refi64/cancellation branch 2 times, most recently from 5b79d03 to 16721f4 on September 13, 2022 18:36
@refi64
Collaborator Author

refi64 commented Sep 13, 2022

PR updated to add a new trait to separate things out better, as well as actual documentation.

> I don't know why it's desirable to have a default behaviour of "any await can cease your execution, and if you need to clean up, you must do a blanket cleanup at the end, storing sufficient state to do so". That just seems like a really hard model to make reliable.

Honestly, in the async world I'd argue that you generally need to be aware that your tasks can be dropped at any time, and the amount of state that can't easily be cleaned up in any scenario should ideally be quite minimal. Network connections drop, servers go down, containers get booted, etc. There's definitely some precedent here (everything Erlang, async exceptions in Haskell, a lot of other stuff about monads), though admittedly I'm not sure about the Rust world...

In practice, I have a hard time imagining where the new behavior would be particularly worse than before: e.g. in the OBS runner, if it gets terminated mid-upload, that's certainly unfortunate, but nothing really goes wrong. If it gets cancelled while monitoring, that's again unfortunate, but not particularly wrong. Generally, in the worst case this isn't going to be much worse than the previous behavior, but it has the distinct advantage that people don't need to remember to handle cancellation. Because:

> A much simpler model is just "here's a thing that tells you if you were cancelled, take action if you need to". It has no extra preconditions, it's easy to explain, it makes it obvious what happens if you ignore it, and it imposes no new behaviours throughout your runner

and it's also quite easy to forget 😅 IME people will wire contexts / cancellation tokens / notifications through some places but not others, functions that can hang forever will never be given one and will never wait appropriately for cancellation to propagate, etc. Passing down important contextual state can definitely be an error-prone hassle, and it's one that IMO isn't needed in the majority of cases.

@eds-collabora
Collaborator

I think the updated API is fine.

The default behaviour with all three options (no change, auto cancel, cancellable API) is to do nothing useful on cancellation. The important resource to release is the managed resource, not the runner's resources, and in every case that requires you to take some positive action. I think it's strange to voluntarily give up the context of what you were doing when you come to cancel; it seems to make your recovery code much more heavyweight for no real gain in this situation.

It's arguably easier to forget to write cleanup() (and to restructure your code such that a heavyweight cleanup routine can correctly identify all remote state you might need to manage, no matter where your code was executing) than it is to forget to use a cancellation token that's handed to you. It's also arguable that, without good docs, you might not be aware in the former case that cancellation could happen at all.

Cancellation is not a pervasive concern, and it is not necessary to wire it through to every call and every sleep. We are not talking about something that needs to happen extremely quickly; it happens in response to a user clicking a button, so we have seconds to respond, and we won't even notice the button was hit until our next keepalive occurs. So I think the idea of plumbing the token in everywhere is probably overkill. But knowing what command you were executing when cancel happened seems like a lot to give up.

The updated API looks fine to me; I'll rework the LAVA runner against this PR.

@sjoerdsimons
Collaborator

> I think the updated API is fine.
>
> The default behaviour with all three options (no change, auto cancel, cancellable API) is to do nothing useful on cancellation. The important resource to release is the managed resource, not the runner's resources, and in every case that requires you to take some positive action. I think it's strange to voluntarily give up the context of what you were doing when you come to cancel; it seems to make your recovery code much more heavyweight for no real gain in this situation.
>
> It's arguably easier to forget to write cleanup() (and to restructure your code such that a heavyweight cleanup routine can correctly identify all remote state you might need to manage, no matter where your code was executing) than it is to forget to use a cancellation token that's handed to you. It's also arguable that, without good docs, you might not be aware in the former case that cancellation could happen at all.
>
> Cancellation is not a pervasive concern, and it is not necessary to wire it through to every call and every sleep. We are not talking about something that needs to happen extremely quickly; it happens in response to a user clicking a button, so we have seconds to respond, and we won't even notice the button was hit until our next keepalive occurs. So I think the idea of plumbing the token in everywhere is probably overkill. But knowing what command you were executing when cancel happened seems like a lot to give up.
>
> The updated API looks fine to me; I'll rework the LAVA runner against this PR.

FWIW, the point of the now two traits is that the user has options. If you keep using JobHandler, your step's future will just be dropped (which for our current runners should be fine), but if you want or need to (e.g. the API you're using is not cancellation safe), you can take matters into your own hands and use specific cancellation points and/or handle rollback yourself, just like you can when an error bubbles up from your own implementation.

@eds-collabora
Collaborator

I'm happy for this to merge in its present state; I've rebased my changes on top of it.

@refi64 refi64 force-pushed the wip/refi64/cancellation branch 2 times, most recently from 70a3f70 to 9674ede on October 5, 2022 16:10
gitlab-runner/src/lib.rs (review comments outdated, resolved)
gitlab-runner/src/run.rs (review comments outdated, resolved)
When a job is cancelled on GitLab, the corresponding handler futures
will be dropped. If the handler was already created, the cleanup()
method will still be called, giving the opportunity to clean up after
anything that may have been interrupted.

Signed-off-by: Ryan Gonzalez <ryan.gonzalez@collabora.com>
@sjoerdsimons sjoerdsimons merged commit 9107456 into main Oct 17, 2022