Handle GitLab job cancellation #15
Conversation
Force-pushed from 6eb243f to bdba64c
I think an API with an explicit cancellation token would be preferable, as per our discussion. Writing code that can cease execution at any point and remain consistent can be difficult: you have to store all the state necessary to unwind from wherever execution stopped in a final cleanup.
Well, cancellation is not just about time; it's also about load, on either the local system (not typical with runners) or the remote system (more typical). And of course it's the only way for a user to really stop a job (either due to a timeout from GitLab, iirc, or by the user pressing cancel). So I do think it's important for runner implementations to actually stop when cancelled. However, I would like it to be simple for simple runners to be correct, without being forced to deal with e.g. cancellation. Especially since for things like reqwest, in the end all you can do is drop the future, so for those cases I'd expect dropping the future to be enough.

Where explicit access to the cancellation token is more likely to be required is for implementations that spawn tasks from within their step function, or for those that want some kind of transactional safety (in other words, to explicitly opt out of cancellation for some section of code). I wonder whether, to handle both cases nicely, we should add a separate trait.
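The token-checking model under discussion can be sketched with a minimal, std-only stand-in for a cancellation token (a real implementation would more likely use something like tokio_util's CancellationToken; `CancelToken`, `run_steps`, and the command list here are all made up for illustration):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Minimal stand-in for a cancellation token.
#[derive(Clone, Default)]
struct CancelToken(Arc<AtomicBool>);

impl CancelToken {
    fn cancel(&self) {
        self.0.store(true, Ordering::SeqCst);
    }
    fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

/// A runner step that polls the token at safe points instead of having
/// its future dropped mid-execution.
fn run_steps(commands: &[&str], token: &CancelToken) -> Result<usize, String> {
    let mut completed = 0;
    for cmd in commands {
        if token.is_cancelled() {
            // Safe point: we still know exactly which command we were
            // about to run, so rollback/unqueueing can use that context.
            return Err(format!("cancelled before running {cmd}"));
        }
        // ... actually execute the command here ...
        completed += 1;
    }
    Ok(completed)
}

fn main() {
    let token = CancelToken::default();
    assert_eq!(run_steps(&["build", "test"], &token), Ok(2));

    token.cancel();
    assert!(run_steps(&["deploy"], &token).is_err());
}
```

The point of the sketch is that the handler, not the executor, decides where cancellation may take effect.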
While implementing this, I noticed a particular downside to adding the cancellation token to the existing trait. The two main fixes would be a separate trait for cancellation-aware handlers, or passing the token into the existing step.
Personally, I'm leaning towards the former, because it feels a bit cleaner overall wrt separation of concerns (despite being a bit more verbose).
But as you point out, the load is generally not from the runner at present, so dropping the future doesn't decrease load at all. The remote load only decreases when the runner proactively takes some step to abort whatever remote work it is managing. To my mind that means the pattern of "well, I'll just stop executing" isn't very helpful (I also think it's a future minefield, as I'll come to).
I'm not clear on what those simple runners would look like. Our current runners manage work on a remote resource. They're well placed to look at a cancellation token and decide what to do next at safe and convenient moments. They would both have to proactively make a cancellation call, wouldn't they, to respond appropriately to cancellation? A very noticeable case for LAVA, for example, is the need to unqueue jobs that have not started when cancellation occurs. In the case of a local executor that is not managing some remote resource, surely it'd be better to have a choice of when you stop executing. It seems very difficult to always be ready to transition to cleanup at any point.
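The "proactive cancellation call" point above can be illustrated with a hypothetical handle for a remotely managed job (the `RemoteJob` type and its method are stand-ins, not the real LAVA client API):

```rust
/// Hypothetical handle for a remotely managed job, e.g. a LAVA
/// submission; the real client API differs.
struct RemoteJob {
    id: u64,
    started: bool,
}

impl RemoteJob {
    /// Called at a safe, convenient moment once the runner notices
    /// cancellation: a job that never started is unqueued, a running
    /// one is aborted. Both require a proactive remote call; merely
    /// dropping the local future would leave the remote job in place.
    fn cancel(&self) -> String {
        if self.started {
            format!("abort job {}", self.id)
        } else {
            format!("unqueue job {}", self.id)
        }
    }
}

fn main() {
    let queued = RemoteJob { id: 7, started: false };
    assert_eq!(queued.cancel(), "unqueue job 7");

    let running = RemoteJob { id: 8, started: true };
    assert_eq!(running.cancel(), "abort job 8");
}
```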
But surely the "simple" runner writer shouldn't have to worry about whether transactional safety is going to be violated. It's "safe" to ignore cancellation completely; we've gotten a long way without worrying about it at all. It doesn't have weird edge cases or leave things in unexpected states. Just checking a cancellation token to see if you need to proactively cancel doesn't seem onerous to me at all.
I don't really understand the need here. I don't know why it's desirable to have a default behaviour of "any await can cease your execution, and if you need to clean up, you must do a blanket cleanup at the end, storing sufficient state to do so". That just seems like a really hard model to get reliable.

A much simpler model is just "here's a thing that tells you if you were cancelled; take action if you need to". It has no extra preconditions, it's easy to explain, it makes it obvious what happens if you ignore it, and it imposes no new behaviours throughout your runner (bear in mind it's not just your runner that's treated this way; all referenced libraries need to behave sensibly too when their futures are dropped). And it matches the LAVA case well: you need to proactively cancel your LAVA job when cancellation occurs, so why would I want to do that from a point where I no longer know what dynamic state I had (i.e. from a blanket cleanup)?
Force-pushed from 5b79d03 to 16721f4
PR updated to add on a new trait to separate things out better, as well as actual documentation.
Honestly, in the async world in general, I'd argue that you need to be aware that your tasks can be dropped at any time, and the amount of state that can't easily be cleaned up in any scenario should ideally be quite minimal. Network connections drop, servers go down, containers get booted, etc. There's definitely some precedent here (everything Erlang, async exceptions in Haskell, a lot of other stuff about monads), though admittedly I'm not sure about the Rust world...

In practice, I have a hard time imagining where the new behavior would be particularly worse than before: e.g. in the OBS runner, if it gets terminated mid-upload, that's certainly unfortunate, but nothing really goes wrong. If it gets cancelled while monitoring, that's again unfortunate, but it's not particularly wrong. Generally, in the worst case, this isn't going to be much worse than the previous behavior, but it has the distinct advantage that people don't need to remember to handle cancellation. Because:
and it's also quite easy to forget 😅 IME people will wire contexts / cancellation tokens or notifications around in some places but not others, functions that can hang forever will never be given one and will never wait appropriately for cancellation to propagate, etc. Passing down important contextual state can definitely be an error-prone hassle, and it's one that IMO isn't needed in the majority of cases.
I think the updated API is fine. The default behaviour with all three options (no change, auto-cancel, cancellable API) is to do nothing useful on cancellation. The important resource to release is the managed resource, not the runner's resources; in every case that requires you to take some positive action. I think it's strange to voluntarily give up the context of what you were doing when you come to cancel; it seems to make your recovery code much more heavyweight for no real gain in this situation. It's arguably easier to forget to write the blanket cleanup correctly.

Cancellation is not a pervasive concern, and it is not necessary to wire it through to every call and every sleep. We are not talking about something that needs to happen extremely quickly; it happens in response to a user clicking a button, so we have seconds to respond, and we won't even notice the button was hit until our next keepalive occurs. So I think the idea of plumbing the token in everywhere is probably overkill. But knowing what command you were executing when cancel happened seems like a lot to give up.

The updated API looks fine to me; I'll rework the LAVA runner against this PR.
Force-pushed from 16721f4 to a03edb0
FWIW, the point of the now-two traits is that the user has options. If you keep using JobHandler, your step's future will just be dropped (which should be fine for our current runners), but if you want or need to (e.g. the API you're using is not cancellation safe), you can take matters into your own hands: use explicit cancellation points and/or handle rollback yourself, just like you can when an error bubbles up from your own implementation.
Force-pushed from a03edb0 to f1c674c
I'm happy for this to merge in its present state; I've rebased my changes on top of it.
Force-pushed from 70a3f70 to 9674ede
Force-pushed from 9674ede to 315a07f
When a job is cancelled on GitLab, the corresponding handler futures will be dropped. If the handler was already created, the cleanup() method will still be called, giving the opportunity to clean up after anything that may have been interrupted.

Signed-off-by: Ryan Gonzalez <ryan.gonzalez@collabora.com>
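The drop-then-cleanup flow the commit message describes can be sketched with a synchronous stand-in (the real runner awaits async handler futures; `Handler`, `step`, and `run` here are illustrative, not the crate's API):

```rust
/// Hypothetical handler accumulating work that may need undoing.
struct Handler {
    artifacts: Vec<&'static str>,
}

impl Handler {
    /// Pretend step: records partial work, and is "interrupted"
    /// (future dropped) when the job is cancelled.
    fn step(&mut self, cancelled: bool) -> bool {
        self.artifacts.push("partial-upload");
        !cancelled
    }

    /// Called by the runner regardless of how step ended, so
    /// interrupted work can be rolled back here.
    fn cleanup(&mut self) {
        self.artifacts.clear();
    }
}

fn run(cancelled: bool) -> usize {
    let mut h = Handler { artifacts: Vec::new() };
    let _finished = h.step(cancelled);
    // cleanup() runs whether the step completed or was abandoned.
    h.cleanup();
    h.artifacts.len()
}

fn main() {
    assert_eq!(run(false), 0); // normal completion, nothing left over
    assert_eq!(run(true), 0);  // cancelled mid-step, still cleaned up
}
```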
Force-pushed from 315a07f to 225e38b