Implemented linear/exponential back off recovery strategy #156
Conversation
Do I need to add an example with the exponential back-off recovery strategy?
First, thank you. And not an ordinary thank-you, because this is one of the cleanest PRs this repo has seen. Second, if you add an example that would be good. I am commuting rn; I will review when I get back home.
Wow, what a high quality Pull Request!
I really like the overall direction of this work, and I really enjoyed reading the comments you wrote. They show you really care about the people who read the code, and I had a great time reading it! 🎉
I've added a couple of inline comments and questions that I hope can prove useful.
@vertexclique here are a couple of overall thoughts that aren't relevant for this pull request, but might come in handy later:
- It might be cool to someday define a LinearBackOff strategy.
- We might want to reset the restart_count (or maybe another counter?) if an actor has behaved correctly for a while, so we don't suffer an overly long waiting time when it isn't required
- I'm not experienced enough with these kinds of strategies, but maybe we will someday want to be able to bail and enter a "failure mode" or something, or even compose several strategies to build a cool one (like strategy.with_retries().with_timeout() or so; see the sketch after this list)
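To make the composition idea a bit more concrete, here is a purely hypothetical sketch of what such a builder could look like. None of these types or methods exist in bastion; they only illustrate the shape of the API being imagined:

use std::time::Duration;

// Hypothetical decorator around a base delay; not part of bastion's API.
struct ComposedStrategy {
    base_delay: Duration,
    max_retries: Option<u32>,
    give_up_after: Option<Duration>,
}

impl ComposedStrategy {
    fn new(base_delay: Duration) -> Self {
        ComposedStrategy { base_delay, max_retries: None, give_up_after: None }
    }

    // Stop restarting after `n` failed attempts.
    fn with_retries(mut self, n: u32) -> Self {
        self.max_retries = Some(n);
        self
    }

    // Bail into a "failure mode" once this much time has been spent retrying.
    fn with_timeout(mut self, limit: Duration) -> Self {
        self.give_up_after = Some(limit);
        self
    }
}

fn main() {
    let _strategy = ComposedStrategy::new(Duration::from_secs(1))
        .with_retries(5)
        .with_timeout(Duration::from_secs(60));
}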
Again congratulations and thank you for such a great Pull Request!
bastion/src/supervisor.rs
Outdated
    multiplier,
} => {
    let start_in =
        timeout.as_secs() + (timeout.as_secs() * multiplier * restart_count);
That means that if we someday want to declare a LinearBackOff, it can be built by declaring an ExponentialBackOff with the expected timeout and a multiplier of 0. Pretty cool! 🎉
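To make that concrete, here is a small standalone sketch of the delay formula from this diff, with made-up numbers, showing the multiplier-zero corner case next to a non-zero multiplier:

use std::time::Duration;

fn main() {
    let timeout = Duration::from_secs(5);
    let restart_count: u64 = 3;

    // multiplier = 0: the delay stays at `timeout` no matter how often the actor failed.
    let constant = timeout.as_secs() + (timeout.as_secs() * 0 * restart_count);
    assert_eq!(constant, 5);

    // multiplier = 2: the delay grows with the restart count (5 + 5 * 2 * 3 = 35 seconds).
    let growing = timeout.as_secs() + (timeout.as_secs() * 2 * restart_count);
    assert_eq!(growing, 35);
}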
Perhaps it makes sense to mention that in the docs as well, what do you think?
I think it would! :)
Is it possible to add linear backoff too? If we can do that inside this PR that would be nice.
I could add the LinearBackOff struct into the ActorRestartStrategy enum with a similar signature (it will contain only the timeout). Also, I'd like to point out that if we pass multiplier: 0 and any timeout value for the exponential back-off, the supervisor will try to restart the failed actor at regular intervals. It's just a corner use case for this type of exponential back-off :)
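A rough sketch of how that variant could sit next to the existing ones (the exact shape in the final code may differ; this only mirrors the signature described above):

use std::time::Duration;

pub enum ActorRestartStrategy {
    /// Restart the failed actor as soon as possible.
    Immediate,
    /// Proposed variant: restart the failed actor after a delay derived from `timeout` alone.
    LinearBackOff { timeout: Duration },
    /// Restart the failed actor after a delay scaled by `multiplier` and the restart count.
    ExponentialBackOff { timeout: Duration, multiplier: u64 },
}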
Another thing that I thought about is adding a limit/attempts option of type Option<u32> to the ExponentialBackOff struct. It would give us a way to stop recovering a failed actor when we actually can't. However, I'm not sure whether this feature (stopping recovery after N attempts) is supported right now.
Yes, in other frameworks and languages it works like that.
https://github.com/trendmicro/backoff-python#backoffon_exception
It would be nice to have max_retries, which would enable us to resolve #105 :)
I could add the LinearBackOff struct into the ActorRestartStrategy enum with a similar signature (it will contain only the timeout).
Yes please, that will make it clear what the difference is. It is a corner case, but let's be explicit rather than implicit. :)
@vertexclique I'm thinking about adding the max_retries in code. Does it make sense to add a max_restarts option for the Supervisor/Children types instead of reorganizing it in the strategy struct? (like in the code below)
pub struct RestartStrategy {
    max_restarts: Option<usize>,
    strategy: ActorRestartStrategy,
}
Any suggestions or better names for those things are welcome :)
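For illustration, using the wrapper sketched above might look roughly like this. The names RestartStrategy, max_restarts and the variant fields are taken from this thread, not from merged code:

use std::time::Duration;

// Types as sketched in the comment above; hypothetical, not bastion's current API.
pub enum ActorRestartStrategy {
    Immediate,
    ExponentialBackOff { timeout: Duration, multiplier: u64 },
}

pub struct RestartStrategy {
    max_restarts: Option<usize>,
    strategy: ActorRestartStrategy,
}

fn main() {
    // Hypothetically: give up after 3 restarts, with a delay that grows by 2s per restart.
    let _strategy = RestartStrategy {
        max_restarts: Some(3),
        strategy: ActorRestartStrategy::ExponentialBackOff {
            timeout: Duration::from_secs(2),
            multiplier: 1,
        },
    };
}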
@Relrin Actually what you are suggesting is better, since they are logically separate. The naming also looks good: max_restarts. If you want to do that in this PR, just ping us; we can close two issues at the same time. Or we can implement this in another PR. This is your call…
Apart from the items pointed out by @o0Ignition0o and me, I don't see any blocker to merging this.
Also, please address the cargo fmt and clippy checks, since they are failing right now.
bastion/src/supervisor.rs
Outdated
/// failed actor as soon as possible.
/// - [`ActorRestartStrategy::ExponentialBackOff`] would restart the
/// failed actor with the delay (in milliseconds), multiplied on the
/// some coefficient.
Suggested change:
- /// some coefficient.
+ /// given coefficient.
Fixed
bastion/src/supervisor.rs
Outdated
    multiplier,
} => {
    let start_in =
        timeout.as_secs() + (timeout.as_secs() * multiplier * restart_count);
I could add the LinearBackOff struct into the ActorRestartStrategy enum with a similar signature (it will contain only the timeout).
Yes please, that will make it clear what the difference is. It is a corner case, but let's be explicit rather than implicit. :)
Regarding those comments:
For this case, I think, the child process could send a message to a supervisor of the
For the idea of composing/custom strategies, something like this could work:

use std::time::Duration;

pub trait RestartStrategy {
    fn calculate(&self) -> Duration;
}

pub enum ActorRestartStrategy {
    Immediate,
    LinearBackOff {
        timeout: Duration,
    },
    ExponentialBackOff {
        timeout: Duration,
        multiplier: u64,
    },
    Custom(Box<dyn RestartStrategy>),
}

struct MyConstantRestartStrategy;

impl MyConstantRestartStrategy {
    pub fn new() -> Self {
        MyConstantRestartStrategy {}
    }
}

impl RestartStrategy for MyConstantRestartStrategy {
    fn calculate(&self) -> Duration {
        Duration::new(5, 0) // 5 seconds timeout
    }
}

fn main() {
    let custom_strategy = Box::new(MyConstantRestartStrategy::new());
    let result = ActorRestartStrategy::Custom(custom_strategy);
}
That makes total sense!
It sounds like a great idea, maybe for a follow-up PR. You have done an amazing job so far; let's fix the couple of nits and the CI lints and then LGTM!
An amazing pull request, I had a great time reviewing it and talking with you! 🎉
Let's implement the …
@vertexclique Sure. I will open another PR a little bit later :)
This pull request adds the ability to use an exponential back-off strategy for restarting/recovering failed actors. I also made it so that, by default, it uses the old logic of restoring the actor to its initial state.
Checklist
- cargo test