Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow clean uninitialization of silos which failed to start #2119

Closed
ReubenBond opened this issue Sep 6, 2016 · 11 comments
Closed

Allow clean uninitialization of silos which failed to start #2119

ReubenBond opened this issue Sep 6, 2016 · 11 comments
Assignees
Labels

Comments

@ReubenBond
Copy link
Member

Often I see errors amounting to trying to re-initialize some singletons in Orleans which have already been initialized. Examples include GrainTypeManager and UnobservedExceptionHandler.

Usually when a user reports an error like this, the error is actually something else which caused their silo to fail initialization, but their code retries a few times and hides the original exception further up the log.

Currently it's not clear how to uninitialize a silo so that we can have a clean attempt at starting it.
We should fix this.

While looking into the issue, I see this code in Silo.Terminate() and hope someone can explain it to me. Why can we not terminate a silo which is not in the "Running" state?

if (SystemStatus.Current.Equals(SystemStatus.Stopping) || 
    SystemStatus.Current.Equals(SystemStatus.ShuttingDown) || 
    SystemStatus.Current.Equals(SystemStatus.Terminated))
{
    stopAlreadyInProgress = true;
    // Drop through to wait below
}
else if (!SystemStatus.Current.Equals(SystemStatus.Running))
{
    throw new InvalidOperationException(String.Format("Calling Silo.{0} on a silo which is not in the Running state. This silo is in the {1} state.", operation, SystemStatus.Current));
}
else
{
    if (gracefully)
        SystemStatus.Current = SystemStatus.ShuttingDown;
    else
        SystemStatus.Current = SystemStatus.Stopping;
}

If we could just remove that exception, then calls to Silo.Stop()/Silo.Shutdown() could successfully clean up the static members.

Of course, another way to fix this problem is to use DI pervasively and kill all stateful statics.

@ReubenBond
Copy link
Member Author

Another question: Silo.DoStart() has a try/catch around about half of its body, but why not the other half? Extending the scope of the try/catch would fix these initialization errors by calling FastKill() in the catch block.

@sergeybykov
Copy link
Contributor

Why can we not terminate a silo which is not in the "Running" state?

We generally don't want to terminate a silo which is going through the graceful shutdown process. Why would we?

Another question: Silo.DoStart() has a try/catch around about half of its body, but why not the other half?

I think the difference is that the try starts after the silo proceeded to join the cluster, and we need it to mark itself as dead upon a further failure to start (from within FastKill()) to speed up cluster reconfiguration. While if we fail before the silo joined the cluster, we can simply fail by re-throwing the exception in Start().

@ReubenBond
Copy link
Member Author

@sergeybykov do you know of a way to uninitialize everything after an initialization failure so that I can try again rather than having to tear down the whole process?

Do you see a problem with tweaking the Terminate() logic to allow transitioning from Starting to Stopping ?

@sergeybykov
Copy link
Contributor

Do you see a problem with tweaking the Terminate() logic to allow transitioning from Starting to Stopping ?

Conceptually, I don't see a problem wit trying that. I suspect the complications will be around the races from trying to stop system components that haven't initialized yet and new kinds of non-deterministic failures (NREs?) as a result. I wonder if @gabikliot can think of other reasons.

@gabikliot
Copy link
Contributor

Yes, what Sergey wrote is correct.
It is MUCH easier to think and reason about transitions between atomic states vs. dealing with transitions from within a half state to another state. Draw a state diagram and transitions between them and now reason about what each state means w.r.t. both global-distributed and local silo state.
That is exactly what I tried to do here and the diagram is in some internal doc, which I don't have access to any longer.

If we have some state where we get stuck forever (due to some error or what not) - that is obviously a bug, which needs to be fixed. Except for such potential bugs (which I did not see any evidence of existing so far), I see conceptually no problem with this atomic state transition handling.

If you can explain the problem again, I can try to help find a solution.

@gabikliot
Copy link
Contributor

gabikliot commented Sep 7, 2016

As further clarification:

Currently it's not clear how to uninitialize a silo so that we can have a clean attempt at starting it.

I think any call to silo.Terminate and silo.Stop should unitialize a silo, obviously once the call returns. Is not it what happens now? Multi-threaded simultaneous calls to those methods are queued/ignored.

I actually don't remember that we intended ever for silo to be restartable - start, stop, start. I think I assumed silo is one time. The state transition diagram I think, as far as I recall it, did not have a back arrow or loops. There was simply never a need to do that. Should be easy to fix if needed I think.

@ReubenBond
Copy link
Member Author

Thanks, Gabi. It's not that I want to restart a silo, I want to retry
starting it.

The only viable policy right now is to crash the process.

On 7 Sep 2016 4:49 PM, "Gabriel Kliot" notifications@github.com wrote:

As further clarification:

Currently it's not clear how to uninitialize a silo so that we can have a
clean attempt at starting it.

I think any call to silo.Terminate and silo.Stop should unitialize a silo,
obviously once the call returns. Is not it what happens now? Multi-threaded
simultaneous calls to those methods are queued/ignored.

I actually don't remember that we intended ever for silo to be restartable

  • start, stop, start. I think I assumed silo is one time. There was simply
    never a need to do that. Should be easy to fix if needed I think.


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#2119 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAMcPwFP5XvrSGAzjZoVDHWuRfqtjRzeks5qnl59gaJpZM4J1fCZ
.

@sergeybykov
Copy link
Contributor

Thanks, Gabi. It's not that I want to restart a silo, I want to retry
starting it.

Why not create and start a new one instead? What's the benefit of trying to resuscitate a failed one?

@gabikliot
Copy link
Contributor

Can you recreate this in test, or at least show the exception? I am sure it will be easy to fix, just undo something static.

@ReubenBond
Copy link
Member Author

@sergeybykov that does not work, if you try to recreate your SiloHost, you will receive exceptions caused by the classes I mentioned above: GrainTypeManager and UnobservedExceptionHandler.

  • GrainTypeManager will throw InvalidOperationException("An attempt to create a second insance of GrainTypeManager.") in its constructor.
  • UnobservedExceptionHandler will throw InvalidOperationException("Calling SetUnobservedExceptionHandler the second time.").

@gabikliot I'll submit a PR with a test & a fix. It turns out that the fix is simple: I just needed to add a call to GrainTypeManager.Stop to SiloHost.UnInitializeOrleansSilo

@gabikliot
Copy link
Contributor

Yep, that's exactly what I thought. It should be trivial to fix.

sergeybykov added a commit that referenced this issue Sep 8, 2016
Fix #2119 by allowing full uninitialization in SiloHost
@ghost ghost locked as resolved and limited conversation to collaborators Sep 29, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

3 participants