
Implemented linear/exponential back off recovery strategy #156

Merged (3 commits, Jan 21, 2020)

Conversation

@Relrin (Member) commented Jan 20, 2020

This pull request gives us the opportunity to use an exponential back-off strategy for restarting/recovering failed actors. By default it keeps the old logic of restoring the actor to its initial state.
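
For readers unfamiliar with the mechanism, here is a minimal, self-contained sketch of the idea (an illustration only, not the exact bastion API; the names and the delay formula follow the discussion and the diff below):

use std::time::Duration;

// Simplified restart strategy: either restart immediately (the old default
// behaviour) or back off, waiting longer after each consecutive failure.
pub enum ActorRestartStrategy {
    Immediate,
    ExponentialBackOff { timeout: Duration, multiplier: u64 },
}

impl ActorRestartStrategy {
    // Delay in seconds before the next restart, using the formula from the
    // reviewed diff: timeout + timeout * multiplier * restart_count.
    pub fn delay_secs(&self, restart_count: u64) -> u64 {
        match self {
            ActorRestartStrategy::Immediate => 0,
            ActorRestartStrategy::ExponentialBackOff { timeout, multiplier } => {
                timeout.as_secs() + timeout.as_secs() * multiplier * restart_count
            }
        }
    }
}

fn main() {
    let strategy = ActorRestartStrategy::ExponentialBackOff {
        timeout: Duration::from_secs(1),
        multiplier: 2,
    };
    // Waits 1s, 3s, 5s, 7s, ... as the restart count grows.
    for restart_count in 0..4 {
        println!("restart #{}: wait {}s", restart_count, strategy.delay_secs(restart_count));
    }
}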

Checklist
  • tests are passing with cargo test.
  • tests and/or benchmarks are included
  • documentation is changed or added
  • commit message is clear

@Relrin (Member Author) commented Jan 20, 2020

Do I need to add an example with the exponential back-off recovery strategy?

@vertexclique (Member)

First, thank you. And not an ordinary thank you, because this is one of the cleanest PRs this repo has seen.

Second, it would be good if you added an example.

I am commuting right now; I will review when I get back home.

@o0Ignition0o (Contributor) left a comment

Wow, what a high quality Pull Request!

I really like the overall direction of this work, and I really enjoyed reading the comments you wrote. They show that you really care about the people who read the code, and I had a great time reading it! 🎉

I've added a couple of inline comments and questions that I hope can prove useful.

@vertexclique here are a couple of overall thoughts that aren't relevant to this pull request but might come in handy later:

  • It might be cool to someday define a LinearBackOff strategy.
  • We might want to reset the restart_count (or maybe another counter?) if an actor has behaved correctly for a while, so we don't suffer an unnecessarily long waiting time when it's not required
  • I'm not experienced enough with these kinds of strategies, but maybe we will someday want to be able to bail and enter a "failure mode" or something, or even compose several strategies to build a cool one (like strategy.with_retries().with_timeout() or so)

Again congratulations and thank you for such a great Pull Request!

bastion/src/supervisor.rs: 5 resolved review threads (4 outdated)
            multiplier,
        } => {
            let start_in =
                timeout.as_secs() + (timeout.as_secs() * multiplier * restart_count);
@o0Ignition0o (Contributor)

That means that if we someday want to declare a LinearBackOff, it can be built by declaring an ExponentialBackOff with the expected timeout and a multiplier of 0. Pretty cool! 🎉

@Relrin (Member Author)

Perhaps it makes sense to mention this in the docs as well, what do you think?

@o0Ignition0o (Contributor)

I think it would! :)

@vertexclique (Member)

Is it possible to add linear back-off too? If we can do that inside this PR, that would be nice.

@Relrin (Member Author)

I could add the LinearBackOff struct into the ActorRestartStrategy enum with a similar signature (containing only the timeout). Also, I'd like to note that if we pass multiplier: 0 with any timeout value for the exponential back-off, the supervisor will try to restart the failed actor at regular intervals. It's just a corner use case for this type of exponential back-off :)
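
A quick worked example of that corner case (a standalone sketch; delay_secs below is just the formula from the reviewed diff, not a bastion function):

fn delay_secs(timeout_secs: u64, multiplier: u64, restart_count: u64) -> u64 {
    // Formula from the diff above.
    timeout_secs + timeout_secs * multiplier * restart_count
}

fn main() {
    // With multiplier = 0 the delay never grows: the supervisor retries at a
    // constant interval (here every 2 seconds) no matter how many restarts
    // have already happened.
    for restart_count in 0..4 {
        assert_eq!(delay_secs(2, 0, restart_count), 2);
    }
    println!("constant 2s delay when multiplier = 0");
}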

@Relrin (Member Author), Jan 20, 2020

Another thing that I thought about is adding a limit/attempts option of type Option<u32> to the ExponentialBackOff struct. It gives us a way to stop recovering a failed actor if we actually can't.
However, I'm not sure whether this feature (stopping recovery after N attempts) is supported right now.

@vertexclique (Member)

Yes, in other frameworks and languages it works like that:
https://github.com/trendmicro/backoff-python#backoffon_exception
It would be nice to have max_retries, which would enable us to resolve #105 :)

@vertexclique (Member)

I could add the LinearBackOff struct into the ActorRestartStrategy enum with a similar signature (containing only the timeout).

Yes please, that will make the difference clear. It is a corner case, but let's be explicit rather than implicit. :)

@Relrin (Member Author), Jan 21, 2020

@vertexclique I'm thinking about how to add max_retries in code. Does it make sense to add max_restarts to the Supervisor/Children types instead of reorganizing it into a struct (like in the code below)?

pub struct RestartStrategy {
    max_restarts: Option<usize>,
    strategy: ActorRestartStrategy,
}

Any suggestions, better names for those things are welcome :)
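
To make the proposal above concrete, here is a small self-contained sketch (hypothetical; neither max_restarts nor the exceeded check exist in the PR as merged) of how a supervisor could consult such a wrapper before scheduling another restart:

use std::time::Duration;

pub enum ActorRestartStrategy {
    Immediate,
    ExponentialBackOff { timeout: Duration, multiplier: u64 },
}

pub struct RestartStrategy {
    pub max_restarts: Option<usize>,
    pub strategy: ActorRestartStrategy,
}

impl RestartStrategy {
    // Returns true once the actor has already been restarted max_restarts
    // times; None means "keep retrying forever".
    pub fn exceeded(&self, restart_count: usize) -> bool {
        self.max_restarts
            .map(|max| restart_count >= max)
            .unwrap_or(false)
    }
}

fn main() {
    let policy = RestartStrategy {
        max_restarts: Some(3),
        strategy: ActorRestartStrategy::ExponentialBackOff {
            timeout: Duration::from_secs(1),
            multiplier: 1,
        },
    };
    assert!(!policy.exceeded(2)); // third restart is still allowed
    assert!(policy.exceeded(3));  // fourth restart would be refused
    if let ActorRestartStrategy::ExponentialBackOff { timeout, .. } = &policy.strategy {
        println!("backing off, starting at {:?}", timeout);
    }
}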

@vertexclique (Member)

@Relrin Actually, what you are suggesting is better, since they are logically separate. The naming also looks good: max_restarts. If you want to do that in this PR, just ping us; we can close two issues at the same time. Or we can implement this in another PR. This is your call…

bastion/src/supervisor.rs: resolved review thread (outdated)
@vertexclique (Member) left a comment

Except for the items pointed out by @o0Ignition0o and me, I don't see any blocker to merging this.
Also, please run the cargo fmt and clippy checks, since they are failing right now.

bastion/src/supervisor.rs: resolved review thread (outdated)
/// failed actor as soon as possible.
/// - [`ActorRestartStrategy::ExponentialBackOff`] would restart the
/// failed actor with the delay (in milliseconds), multiplied on the
/// some coefficient.
@vertexclique (Member)

Suggested change
/// some coefficient.
/// given coefficient.

@Relrin (Member Author)

Fixed


@Relrin (Member Author) commented Jan 20, 2020

Regarding those comments:

We might want to reset the restart_count (or maybe another counter?) if an actor has behaved correctly for a while, so we don't suffer an unnecessarily long waiting time when it's not required

For this case, I think the child process could send a message of the Envelope::Reset type (or something similar) to the supervisor, so that the supervisor will update the record in self.launched.

I'm not experienced enough with these kinds of strategies, but maybe we will someday want to be able to bail and enter a "failure mode" or something, or even compose several strategies to build a cool one (like strategy.with_retries().with_timeout() or so)

For the (like strategy.with_retries().with_timeout() or so) part, I believe that LinearBackOff / ExponentialBackOff will cover these cases.
For the rare cases when a developer wants to define their own restart strategy, I would like to propose using a custom-implemented strategy. Potentially this feature could be added as an additional variant of the enum. It could be implemented something like this (a first thought off the top of my head):

use std::time::Duration;

pub trait RestartStrategy {
    fn calculate(&self) -> Duration;
}

pub enum ActorRestartStrategy {
    Immediate,
    LinearBackOff {
        timeout: Duration,  
    },
    ExponentialBackOff {
        timeout: Duration,
        multiplier: u64,
    },
    Custom(Box<dyn RestartStrategy>)
} 

struct MyConstantRestartStrategy;

impl MyConstantRestartStrategy {
    pub fn new() -> Self {
        MyConstantRestartStrategy {}
    }
}

impl RestartStrategy for MyConstantRestartStrategy {
    fn calculate(&self) -> Duration {
        Duration::new(5, 0) // 5 seconds timeout
    }
}

fn main() {
    let custom_strategy = Box::new(MyConstantRestartStrategy::new());
    let result = ActorRestartStrategy::Custom(custom_strategy);
}
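
A possible refinement of the sketch above (purely hypothetical, building only on the RestartStrategy trait proposed in this comment, not on anything in bastion): a blanket impl would let users pass a plain closure instead of defining a new struct.

use std::time::Duration;

pub trait RestartStrategy {
    fn calculate(&self) -> Duration;
}

// Blanket impl: any closure (or function) returning a Duration can be used
// as a restart strategy directly.
impl<F> RestartStrategy for F
where
    F: Fn() -> Duration,
{
    fn calculate(&self) -> Duration {
        self()
    }
}

fn main() {
    // No dedicated struct needed; the closure itself is the strategy.
    let constant: Box<dyn RestartStrategy> = Box::new(|| Duration::from_secs(5));
    assert_eq!(constant.calculate(), Duration::from_secs(5));
}

The Custom(Box<dyn RestartStrategy>) variant would stay exactly the same; only the ergonomics of building a strategy would change.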

@o0Ignition0o (Contributor)

For this case, I think the child process could send a message of the Envelope::Reset type (or something similar) to the supervisor, so that the supervisor will update the record in self.launched.

That makes total sense!

Potentially this feature could be added as an additional variant of the enum. It could be implemented something like this (a first thought off the top of my head)

It sounds like a great idea, maybe for a follow-up PR.

You have done an amazing job so far; let's fix the couple of nits and the CI lints, and then LGTM! :shipit:

Diff in /home/runner/work/bastion/bastion/bastion/src/lib.rs at line 96:
     pub use crate::message::{Answer, AnswerSender, Message, Msg};
     pub use crate::msg;
     pub use crate::path::{BastionPath, BastionPathElement};
-    pub use crate::supervisor::{ActorRestartStrategy, SupervisionStrategy, Supervisor, SupervisorRef};
+    pub use crate::supervisor::{
+        ActorRestartStrategy, SupervisionStrategy, Supervisor, SupervisorRef,
+    };
     pub use crate::{blocking, children, run, spawn, supervisor};
 }

@Relrin changed the title from "Implemented exponential back off recovery strategy" to "Implemented linear/exponential back off recovery strategy" on Jan 21, 2020
@o0Ignition0o (Contributor) left a comment

An amazing pull request, I had a great time reviewing it and talking with you! 🎉

@vertexclique (Member)

Let's implement the max retries in another PR, shall we, @Relrin?
This is done perfectly. Thank you again!

@vertexclique merged commit f3d3ed0 into bastion-rs:master on Jan 21, 2020
@Relrin deleted the feature-exponential-backoff branch on January 21, 2020, 13:06
@Relrin (Member Author) commented Jan 21, 2020

@vertexclique Sure. I will open another PR a little bit later :)
