Implemented linear/exponential back off recovery strategy #156
Conversation
Do I need to add an example with the exponential back-off recovery strategy?
First, thank you. And not an ordinary thank-you, because this is one of the cleanest PRs this repo has seen. Second, if you add an example that would be good. I am commuting rn; I will review when I get back home.
Wow, what a high quality Pull Request!
I really like the overall direction of this work, and I really enjoyed reading the comments you wrote. They show you really care about the people who read the code, and I had a great time reading it! 🎉
I've added a couple of inline comments and questions that I hope can prove useful.
@vertexclique here are a couple of overall thoughts that aren't relevant for this pull request, but might come in handy later:
- It might be cool to someday define a LinearBackOff strategy.
- We might want to reset the restart_count (or maybe another counter?) if an actor has behaved correctly for a while, so we don't suffer an overly long waiting time when it isn't required
- I'm not experienced enough with these kinds of strategies, but maybe we will someday want to be able to bail and enter a "failure mode" or something, or even compose several strategies to build a cool one (like strategy.with_retries().with_timeout() or so; see the sketch after this list)
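To make the composition idea a bit more concrete, here is a purely hypothetical sketch of what such a builder could look like. None of these types or methods exist in bastion; they only illustrate the shape of the API being imagined:

use std::time::Duration;

// Hypothetical decorator around a base delay; not part of bastion's API.
struct ComposedStrategy {
    base_delay: Duration,
    max_retries: Option<u32>,
    give_up_after: Option<Duration>,
}

impl ComposedStrategy {
    fn new(base_delay: Duration) -> Self {
        ComposedStrategy { base_delay, max_retries: None, give_up_after: None }
    }

    // Stop restarting after `n` failed attempts.
    fn with_retries(mut self, n: u32) -> Self {
        self.max_retries = Some(n);
        self
    }

    // Bail into a "failure mode" once this much time has been spent retrying.
    fn with_timeout(mut self, limit: Duration) -> Self {
        self.give_up_after = Some(limit);
        self
    }
}

fn main() {
    let _strategy = ComposedStrategy::new(Duration::from_secs(1))
        .with_retries(5)
        .with_timeout(Duration::from_secs(60));
}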
Again congratulations and thank you for such a great Pull Request!
bastion/src/supervisor.rs
Outdated
    multiplier,
} => {
    let start_in =
        timeout.as_secs() + (timeout.as_secs() * multiplier * restart_count);
That means that if we someday want to declare a LinearBackOff, it can be built by declaring an ExponentialBackOff with the expected timeout and a multiplier of 0. Pretty cool! 🎉
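To make that concrete, here is a small standalone sketch of the delay formula from this diff, with made-up numbers, showing the multiplier-zero corner case next to a non-zero multiplier:

use std::time::Duration;

fn main() {
    let timeout = Duration::from_secs(5);
    let restart_count: u64 = 3;

    // multiplier = 0: the delay stays at `timeout` no matter how often the actor failed.
    let constant = timeout.as_secs() + (timeout.as_secs() * 0 * restart_count);
    assert_eq!(constant, 5);

    // multiplier = 2: the delay grows with the restart count (5 + 5 * 2 * 3 = 35 seconds).
    let growing = timeout.as_secs() + (timeout.as_secs() * 2 * restart_count);
    assert_eq!(growing, 35);
}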
Perhaps it makes sense to mention that in the docs as well, what do you think?
I think it would! :)
Is it possible to add linear backoff too? If we can do that inside this PR that would be nice.
I could add the LinearBackOff struct into the ActorRestartStrategy enum with a similar signature (it will contain only the timeout). Also, I'd like to point out that if we pass multiplier: 0 and any timeout value for the exponential back-off, the supervisor will try to restart the failed actor at regular intervals. It's just a corner use case for this type of exponential back-off :)
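A rough sketch of how that variant could sit next to the existing ones (the exact shape in the final code may differ; this only mirrors the signature described above):

use std::time::Duration;

pub enum ActorRestartStrategy {
    /// Restart the failed actor as soon as possible.
    Immediate,
    /// Proposed variant: restart the failed actor after a delay derived from `timeout` alone.
    LinearBackOff { timeout: Duration },
    /// Restart the failed actor after a delay scaled by `multiplier` and the restart count.
    ExponentialBackOff { timeout: Duration, multiplier: u64 },
}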
Another thing that I thought about is adding a limit/attempts option of type Option<u32> to the ExponentialBackOff struct. It would give us a way to stop recovering a failed actor when we actually can't. However, I'm not sure whether this feature (stopping recovery after N attempts) is supported right now.
Yes, in other frameworks and languages it works like that.
https://github.com/trendmicro/backoff-python#backoffon_exception
It would be nice to have max_retries, which would enable us to resolve #105 :)
I could add the LinearBackOff struct into the ActorRestartStrategy enum with a similar signature (it will contain only the timeout).
Yes please, that will make it clear what the difference is. It is a corner case, but let's be explicit rather than implicit. :)
@vertexclique I'm thinking about adding the max_retries in code. Does it make sense to add a max_restarts option for the Supervisor/Children types instead of reorganizing it in the strategy struct? (like in the code below)
pub struct RestartStrategy {
    max_restarts: Option<usize>,
    strategy: ActorRestartStrategy,
}
Any suggestions or better names for those things are welcome :)
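For illustration, using the wrapper sketched above might look roughly like this. The names RestartStrategy, max_restarts and the variant fields are taken from this thread, not from merged code:

use std::time::Duration;

// Types as sketched in the comment above; hypothetical, not bastion's current API.
pub enum ActorRestartStrategy {
    Immediate,
    ExponentialBackOff { timeout: Duration, multiplier: u64 },
}

pub struct RestartStrategy {
    max_restarts: Option<usize>,
    strategy: ActorRestartStrategy,
}

fn main() {
    // Hypothetically: give up after 3 restarts, with a delay that grows by 2s per restart.
    let _strategy = RestartStrategy {
        max_restarts: Some(3),
        strategy: ActorRestartStrategy::ExponentialBackOff {
            timeout: Duration::from_secs(2),
            multiplier: 1,
        },
    };
}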
@Relrin Actually what you are suggesting is better, since they are logically separate. The naming also looks good: max_restarts. If you want to do that in this PR, just ping us; we can close two issues at the same time. Or we can implement this in another PR. This is your call…
Apart from the items pointed out by @o0Ignition0o and me, I don't see any blocker to merging this.
Also, please address the cargo fmt and clippy checks, since they are failing right now.
bastion/src/supervisor.rs
Outdated
/// failed actor as soon as possible.
/// - [`ActorRestartStrategy::ExponentialBackOff`] would restart the
/// failed actor with the delay (in milliseconds), multiplied on the
/// some coefficient.
Suggested change:
- /// some coefficient.
+ /// given coefficient.
Fixed
bastion/src/supervisor.rs
Outdated
    multiplier,
} => {
    let start_in =
        timeout.as_secs() + (timeout.as_secs() * multiplier * restart_count);
I could add the LinearBackOff struct into the ActorRestartStrategy enum with a similar signature (it will contain only the timeout).
Yes please, that will make it clear what the difference is. It is a corner case, but let's be explicit rather than implicit. :)
Regarding those comments:
For this case, I think, the child process could send a message to a supervisor of the
For the idea of composing/custom strategies, something like this could work:

use std::time::Duration;

pub trait RestartStrategy {
    fn calculate(&self) -> Duration;
}

pub enum ActorRestartStrategy {
    Immediate,
    LinearBackOff {
        timeout: Duration,
    },
    ExponentialBackOff {
        timeout: Duration,
        multiplier: u64,
    },
    Custom(Box<dyn RestartStrategy>),
}

struct MyConstantRestartStrategy;

impl MyConstantRestartStrategy {
    pub fn new() -> Self {
        MyConstantRestartStrategy {}
    }
}

impl RestartStrategy for MyConstantRestartStrategy {
    fn calculate(&self) -> Duration {
        Duration::new(5, 0) // 5 seconds timeout
    }
}

fn main() {
    let custom_strategy = Box::new(MyConstantRestartStrategy::new());
    let result = ActorRestartStrategy::Custom(custom_strategy);
}
That makes total sense!
It sounds like a great idea, maybe for a follow-up PR. You have done an amazing job so far; let's fix the couple of nits and the CI lints and then LGTM!
An amazing pull request, I had a great time reviewing it and talking with you! 🎉
Let's implement the …
@vertexclique Sure. I will open another PR a little bit later :)
This pull request adds the ability to use an exponential back-off strategy for restarting/recovering failed actors. I also made it so that, by default, it uses the old logic of restoring the actor to its initial state.
Checklist
- cargo test