
add async startup #5436

Closed · wants to merge 19 commits

Conversation

@Zetanova (Contributor) commented Dec 16, 2021

No description provided.

@Aaronontheweb (Member)

The breaking changes in here for sure break compat with Phobos, but we'll need to test to see if this resolves the regression in 1.4.29.

@Zetanova (Contributor, author)

If there are conflicts with Phobos, we can resolve them later on.

@Aaronontheweb (Member)

Yep, it's not a deal breaker. Fixing clustering is the bigger priority.

@Zetanova (Contributor, author)

@Aaronontheweb Why did the test fail?
The log is a bit confusing; no unit tests failed.

@Aaronontheweb (Member)

Part of the test suite locked up and timed out after 30 minutes, which means we don't get the report indicating which test failed. And wherever it failed, it failed consistently. You need to click through to the build log to see it:

https://dev.azure.com/dotnet/Akka.NET/_build/results?buildId=62800&view=logs&j=d3d8bb3a-a87f-5af3-6a16-90b99d49172b&t=1d19db56-c24e-5fbb-3f13-e91c53ee1789&l=3886

Looks like it's the Akka.FSharp specs.

@Zetanova (Contributor, author)

Yes, I saw this too, but I don't understand why it failed or how to resolve it.
The FSharpSpecs run successfully locally.

@Aaronontheweb (Member)

Looks like the remoting system deadlocked - see all of the Unhandled warnings that go on for a while before the test times out?

@Zetanova (Contributor, author)

I think I found it; it's something in RemoteActorRefProvider.CreateInternals().
The components in Internals get executed on the dispatcher before CreateInternals() returns.

@Zetanova (Contributor, author) commented Dec 16, 2021

One other major problem is the logging:
if something throws at startup, there is no output.

One other suspicious thing is the terminator routine;
I think it can deadlock on a startup exception.

I think in the last test run (the one before the last commit) there was an NRE on startup and the start finalizer deadlocked.
Effect: deadlock and no output.

@Aaronontheweb (Member)

I think this is great progress - just need to keep whittling down the issues the test suite raises.

@Aaronontheweb (Member)

> One other major problem is the logging:
> if something throws at startup, there is no output.

Is your issue related to this? #4424

@Aaronontheweb (Member)

It's still an issue with remoting.

If I had to guess, the problem here is that the /system actors that run remoting have to start before the Init call on the RemoteActorRefProvider exits. Something about the way this method has been updated prevents them from doing that consistently.

    {
        var extensions = LoadExtensions();
        foreach (var init in extensions.OfType<IInitializable>())
            await init.InitializeAsync(cancellationToken);
@Aaronontheweb (Member)

I think this is part of your problem - the old system had ordering in place as a side effect of lazy loading. If Cluster depends on the RARP extension, Cluster would start RARP if it wasn't created already, or load the cached value.

Scanning the assembly and arbitrarily front-loading the extensions is non-deterministic and will result in the types of deadlocks you're seeing now.

I wouldn't front-load any of the extensions at all - lazy loading until they're needed on-demand is still the better approach.
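
For illustration, a minimal sketch of the ordering that lazy loading gives for free; the ExtensionRegistry/Get shapes here are simplified stand-ins, not Akka.NET's actual extension API:

    using System;
    using System.Collections.Concurrent;

    // Simplified stand-in for a lazily loading extension registry: each
    // extension is created on first access and cached, so anything an
    // extension's factory touches is fully initialized before the
    // dependent extension finishes constructing.
    class ExtensionRegistry
    {
        private readonly ConcurrentDictionary<Type, Lazy<object>> _cache = new();

        public T Get<T>(Func<ExtensionRegistry, T> factory) where T : class
        {
            var lazy = _cache.GetOrAdd(
                typeof(T),
                _ => new Lazy<object>(() => factory(this)));
            // First caller creates the extension; everyone else gets the cached value.
            return (T)lazy.Value;
        }
    }

If Cluster's factory asks the registry for RARP, RARP is constructed before Cluster's factory returns, so the dependency order falls out of the access order; eager assembly scanning has no such guarantee.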

@Aaronontheweb (Member)

BTW, one simple fix here to see if it resolves the deadlock - don't await each individual plugin starting up. Compose them into a Task.WhenAll and await on the group.
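
A sketch of that suggestion against the loop in the diff above, assuming the same extensions/IInitializable shapes (and System.Linq / System.Threading.Tasks in scope):

    // Kick off every IInitializable extension, then await the group
    // instead of awaiting each one in sequence, so one extension
    // blocking on another cannot stall the loop itself.
    var initTasks = extensions
        .OfType<IInitializable>()
        .Select(init => init.InitializeAsync(cancellationToken))
        .ToList(); // materialize so every InitializeAsync has actually started
    await Task.WhenAll(initTasks);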

@Zetanova (Contributor, author)

Only Cluster and the RemoteActorRefProvider support it;
yes, the issue is WHO and WHEN the Cluster extension gets called and exits.


    // This is effectively a write-once variable similar to a lazy val. The reason
    // for not using a lazy val is exception handling.
    private volatile HashSet<Address> _addresses;
@Aaronontheweb (Member)

Why replace this with a Volatile.Write down below?

@Zetanova (Contributor, author)

It's a multi-read, single-write field;
volatile has perf issues if something accesses it multiple times.
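
A minimal sketch of the trade-off being described; the AddressCache class and its members are illustrative, not the actual Akka.NET code:

    using System.Collections.Generic;
    using System.Threading;

    // Write-once, read-many field: declaring it volatile fences every
    // read, while an explicit Volatile.Write on the single publishing
    // store (plus Volatile.Read where freshness matters) keeps the hot
    // read path cheap.
    class AddressCache<TAddress>
    {
        private HashSet<TAddress> _addresses;

        public HashSet<TAddress> Current => Volatile.Read(ref _addresses);

        public void Publish(HashSet<TAddress> addresses) =>
            Volatile.Write(ref _addresses, addresses);
    }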

    _provider.Init(this);
    if (_provider is IInitializable init)
        await init.InitializeAsync(cancellationToken);
    }
@Aaronontheweb (Member)

Why would we call Init and InitializeAsync?

@Zetanova (Contributor, author)

IInitializable is a feature facet; it does not replace the old API.

    @@ -302,13 +337,13 @@ private void StopScheduler()
            sched?.Dispose();
        }

    -   private void LoadExtensions()
    +   private List<object> LoadExtensions()
@Aaronontheweb (Member)

Can we type this usefully in order to avoid boxing?

@Zetanova (Contributor, author)

These are only the extensions loaded from config, excluding lazy-loaded extensions.

@Zetanova (Contributor, author)

It was simply the requirement that the Cluster extension needs to be initialized inside/right after the ActorRefProvider.
It does not support "lazy" loading, but it is used like it does.

The whole extension system is non-deterministic, because many system actors lazy-load extensions and their execution depends on the Dispatcher.

Changing the Dispatcher can change the extension loading order.

@Zetanova (Contributor, author)

I cannot reproduce the lock locally.

Something with the unit-test dispatcher and most likely the extensions' Lazy<>.Value is locking somewhere/somehow.

I'll mark this one as WIP.

@Aaronontheweb (Member)

I think it'd be worth writing up a spec on this first and reworking the problem from the top down that way.

@Zetanova (Contributor, author)

If I could reproduce the behavior, then we could write a spec for it.
Sorry that I abuse the build server; all unit tests are successful on my local machine.

I think that I have now found the issue: the addressPromise TCS of Remoting.StartAsync() leaked the dispatcher thread.

@Zetanova (Contributor, author)

@Aaronontheweb I would need help:

  1. I don't know how to activate more log output.
  2. I have no idea how to reproduce the deadlock locally under the debugger to find the deadlock.

I removed basically everything from the Cluster ctor and moved it into InitializeAsync(),
and did the same with Transport.StartAsync().

The call sequence didn't change, and still there is some deadlock
that I cannot find.

The sequence is Transport.StartAsync() and right after it Cluster.Get(sys).InitializeAsync(),
and nothing accesses the Cluster extension before that.

I could even remove the Cluster.ClusterCore late init;
it is not required anymore, and if it is not set it just throws.

The only idea I have left is inside Cluster.Get(sys).InitializeAsync():

    var clusterCore = await _clusterDaemons.Ask<IActorRef>(
        new InternalClusterAction.GetClusterCoreRef(this),
        System.Settings.CreationTimeout, cancellationToken).ConfigureAwait(false);

which just deadlocks instantly, as the answer mailbox is not processed.
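
For illustration, a stand-alone sketch of that failure mode using a plain channel and TaskCompletionSource instead of Akka's Ask (none of the names below are Akka.NET APIs): the await can never complete because nothing is draining the mailbox that would produce the reply.

    using System;
    using System.Threading.Channels;
    using System.Threading.Tasks;

    class AskDeadlockSketch
    {
        static async Task Main()
        {
            // Stand-in for an actor mailbox: replies only appear if some
            // consumer is actually draining the channel.
            var mailbox = Channel.CreateUnbounded<TaskCompletionSource<string>>();

            var reply = new TaskCompletionSource<string>();
            await mailbox.Writer.WriteAsync(reply);

            // No consumer was ever started (the "dispatcher" is not
            // running), so reply.Task can never complete -- the same
            // shape as awaiting Ask before the daemon's mailbox runs.
            var winner = await Task.WhenAny(reply.Task, Task.Delay(1000));
            Console.WriteLine(winner == reply.Task
                ? $"got reply: {await reply.Task}"
                : "timed out: nobody processed the mailbox");
        }
    }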

@Zetanova (Contributor, author)

@Aaronontheweb I moved the fixes out to a new PR and opened a discussion for the startup here:
#5447

@Zetanova closed this Dec 18, 2021