Allow clean uninitialization of silos which failed to start #2119

ReubenBond · 2016-09-06T05:52:17Z

Often I see errors amounting to trying to re-initialize some singletons in Orleans which have already been initialized. Examples include GrainTypeManager and UnobservedExceptionHandler.

Usually when a user reports an error like this, the error is actually something else which caused their silo to fail initialization, but their code retries a few times and hides the original exception further up the log.

Currently it's not clear how to uninitialize a silo so that we can have a clean attempt at starting it.
We should fix this.

While looking into the issue, I see this code in Silo.Terminate() and hope someone can explain it to me. Why can we not terminate a silo which is not in the "Running" state?

if (SystemStatus.Current.Equals(SystemStatus.Stopping) || 
    SystemStatus.Current.Equals(SystemStatus.ShuttingDown) || 
    SystemStatus.Current.Equals(SystemStatus.Terminated))
{
    stopAlreadyInProgress = true;
    // Drop through to wait below
}
else if (!SystemStatus.Current.Equals(SystemStatus.Running))
{
    throw new InvalidOperationException(String.Format("Calling Silo.{0} on a silo which is not in the Running state. This silo is in the {1} state.", operation, SystemStatus.Current));
}
else
{
    if (gracefully)
        SystemStatus.Current = SystemStatus.ShuttingDown;
    else
        SystemStatus.Current = SystemStatus.Stopping;
}

If we could just remove that exception, then calls to Silo.Stop()/Silo.Shutdown() could successfully clean up the static members.

Of course, another way to fix this problem is to use DI pervasively and kill all stateful statics.

The text was updated successfully, but these errors were encountered:

ReubenBond · 2016-09-06T06:00:34Z

Another question: Silo.DoStart() has a try/catch around about half of its body, but why not the other half? Extending the scope of the try/catch would fix these initialization errors by calling FastKill() in the catch block.

sergeybykov · 2016-09-06T15:13:38Z

Why can we not terminate a silo which is not in the "Running" state?

We generally don't want to terminate a silo which is going through the graceful shutdown process. Why would we?

Another question: Silo.DoStart() has a try/catch around about half of its body, but why not the other half?

I think the difference is that the try starts after the silo proceeded to join the cluster, and we need it to mark itself as dead upon a further failure to start (from within FastKill()) to speed up cluster reconfiguration. While if we fail before the silo joined the cluster, we can simply fail by re-throwing the exception in Start().

ReubenBond · 2016-09-06T22:56:33Z

@sergeybykov do you know of a way to uninitialize everything after an initialization failure so that I can try again rather than having to tear down the whole process?

Do you see a problem with tweaking the Terminate() logic to allow transitioning from Starting to Stopping ?

sergeybykov · 2016-09-06T23:13:02Z

Do you see a problem with tweaking the Terminate() logic to allow transitioning from Starting to Stopping ?

Conceptually, I don't see a problem wit trying that. I suspect the complications will be around the races from trying to stop system components that haven't initialized yet and new kinds of non-deterministic failures (NREs?) as a result. I wonder if @gabikliot can think of other reasons.

gabikliot · 2016-09-07T06:45:16Z

Yes, what Sergey wrote is correct.
It is MUCH easier to think and reason about transitions between atomic states vs. dealing with transitions from within a half state to another state. Draw a state diagram and transitions between them and now reason about what each state means w.r.t. both global-distributed and local silo state.
That is exactly what I tried to do here and the diagram is in some internal doc, which I don't have access to any longer.

If we have some state where we get stuck forever (due to some error or what not) - that is obviously a bug, which needs to be fixed. Except for such potential bugs (which I did not see any evidence of existing so far), I see conceptually no problem with this atomic state transition handling.

If you can explain the problem again, I can try to help find a solution.

gabikliot · 2016-09-07T06:49:32Z

As further clarification:

Currently it's not clear how to uninitialize a silo so that we can have a clean attempt at starting it.

I think any call to silo.Terminate and silo.Stop should unitialize a silo, obviously once the call returns. Is not it what happens now? Multi-threaded simultaneous calls to those methods are queued/ignored.

I actually don't remember that we intended ever for silo to be restartable - start, stop, start. I think I assumed silo is one time. The state transition diagram I think, as far as I recall it, did not have a back arrow or loops. There was simply never a need to do that. Should be easy to fix if needed I think.

ReubenBond · 2016-09-07T07:18:00Z

Thanks, Gabi. It's not that I want to restart a silo, I want to retry
starting it.

The only viable policy right now is to crash the process.

On 7 Sep 2016 4:49 PM, "Gabriel Kliot" notifications@github.com wrote:

As further clarification:

Currently it's not clear how to uninitialize a silo so that we can have a
clean attempt at starting it.

I think any call to silo.Terminate and silo.Stop should unitialize a silo,
obviously once the call returns. Is not it what happens now? Multi-threaded
simultaneous calls to those methods are queued/ignored.

I actually don't remember that we intended ever for silo to be restartable

start, stop, start. I think I assumed silo is one time. There was simply
never a need to do that. Should be easy to fix if needed I think.

—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#2119 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAMcPwFP5XvrSGAzjZoVDHWuRfqtjRzeks5qnl59gaJpZM4J1fCZ
.

sergeybykov · 2016-09-07T14:24:58Z

Thanks, Gabi. It's not that I want to restart a silo, I want to retry
starting it.

Why not create and start a new one instead? What's the benefit of trying to resuscitate a failed one?

gabikliot · 2016-09-07T15:01:11Z

Can you recreate this in test, or at least show the exception? I am sure it will be easy to fix, just undo something static.

ReubenBond · 2016-09-07T23:49:01Z

@sergeybykov that does not work, if you try to recreate your SiloHost, you will receive exceptions caused by the classes I mentioned above: GrainTypeManager and UnobservedExceptionHandler.

GrainTypeManager will throw InvalidOperationException("An attempt to create a second insance of GrainTypeManager.") in its constructor.
UnobservedExceptionHandler will throw InvalidOperationException("Calling SetUnobservedExceptionHandler the second time.").

@gabikliot I'll submit a PR with a test & a fix. It turns out that the fix is simple: I just needed to add a call to GrainTypeManager.Stop to SiloHost.UnInitializeOrleansSilo

gabikliot · 2016-09-08T04:28:37Z

Yep, that's exactly what I thought. It should be trivial to fix.

Fix #2119 by allowing full uninitialization in SiloHost

gabikliot added the question label Sep 7, 2016

ReubenBond pushed a commit to ReubenBond/orleans that referenced this issue Sep 7, 2016

Fix dotnet#2119 by allowing full uninitialization in SiloHost.

84b2a50

ReubenBond mentioned this issue Sep 7, 2016

Fix #2119 by allowing full uninitialization in SiloHost #2127

Merged

ReubenBond assigned ReubenBond Sep 7, 2016

sergeybykov closed this as completed in #2127 Sep 8, 2016

sergeybykov added a commit that referenced this issue Sep 8, 2016

Merge pull request #2127 from ReubenBond/fix-silohost-unititialization

237af7e

Fix #2119 by allowing full uninitialization in SiloHost

ghost locked as resolved and limited conversation to collaborators Sep 29, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow clean uninitialization of silos which failed to start #2119

Allow clean uninitialization of silos which failed to start #2119

ReubenBond commented Sep 6, 2016

ReubenBond commented Sep 6, 2016

sergeybykov commented Sep 6, 2016

ReubenBond commented Sep 6, 2016

sergeybykov commented Sep 6, 2016

gabikliot commented Sep 7, 2016

gabikliot commented Sep 7, 2016 •

edited

ReubenBond commented Sep 7, 2016

sergeybykov commented Sep 7, 2016

gabikliot commented Sep 7, 2016

ReubenBond commented Sep 7, 2016

gabikliot commented Sep 8, 2016

Allow clean uninitialization of silos which failed to start #2119

Allow clean uninitialization of silos which failed to start #2119

Comments

ReubenBond commented Sep 6, 2016

ReubenBond commented Sep 6, 2016

sergeybykov commented Sep 6, 2016

ReubenBond commented Sep 6, 2016

sergeybykov commented Sep 6, 2016

gabikliot commented Sep 7, 2016

gabikliot commented Sep 7, 2016 • edited

ReubenBond commented Sep 7, 2016

sergeybykov commented Sep 7, 2016

gabikliot commented Sep 7, 2016

ReubenBond commented Sep 7, 2016

gabikliot commented Sep 8, 2016

gabikliot commented Sep 7, 2016 •

edited