
add async startup #5436

Closed · wants to merge 19 commits

Conversation

@Zetanova (Contributor) commented Dec 16, 2021

No description provided.

@Aaronontheweb (Member)

The breaking changes in here for sure break compat with Phobos, but we'll need to test to see if this resolves the regression in 1.4.29.

@Zetanova (Contributor, author)

If there are conflicts with Phobos, we can resolve them later on.

@Aaronontheweb (Member)

Yep, it's not a deal breaker. Fixing clustering is the bigger priority.

@Zetanova (Contributor, author)

@Aaronontheweb Why did the test fail?
The log is a bit confusing; no unit tests failed.

@Aaronontheweb (Member)

Part of the test suite locked up and timed out after 30 minutes, which means we don't get the report indicating which test failed. And wherever it failed, it failed consistently. You need to click through to the build log to see it:

https://dev.azure.com/dotnet/Akka.NET/_build/results?buildId=62800&view=logs&j=d3d8bb3a-a87f-5af3-6a16-90b99d49172b&t=1d19db56-c24e-5fbb-3f13-e91c53ee1789&l=3886

Looks like it's the Akka.FSharp specs.

@Zetanova (Contributor, author)

Yes, I saw this too, but I don't understand why it failed or how to resolve it.
The FSharpSpecs run successfully locally.

@Aaronontheweb (Member)

Looks like the remoting system deadlocked - see all of the Unhandled warnings that go on for a while before the test times out?

@Zetanova (Contributor, author)

I think I found it; it's something in RemoteActorRefProvider.CreateInternals().
The components in Internals get executed on the dispatcher before CreateInternals() returns.

@Zetanova (Contributor, author) commented Dec 16, 2021

One other major problem is the logging:
if something throws at startup, there is no output.

One other suspicious thing is the terminator routine;
I think it can deadlock on a startup exception.

I think in the last test run (the one before the last commit) there was an NRE on startup and the start finalizer deadlocked.
Effect: deadlock and no output.

@Aaronontheweb (Member)

I think this is great progress - just need to keep whittling down the issues the test suite raises.

@Aaronontheweb (Member)

> One other major problem is the logging:
> if something throws at startup, there is no output.

Is your issue related to this? #4424

@Aaronontheweb (Member)

It's still an issue with remoting.

If I had to guess, the problem here is that the /system actors that run remoting have to start before the Init call on the RemoteActorRefProvider exits. Something about the way this method has been updated prevents them from doing that consistently.

    {
        var extensions = LoadExtensions();
        foreach (var init in extensions.OfType<IInitializable>())
            await init.InitializeAsync(cancellationToken);
@Aaronontheweb (Member)

I think this is part of your problem - the old system had ordering in place as a side effect of lazy loading. If Cluster depends on the RARP extension, Cluster would start RARP if it wasn't created already, or load the cached value.

Scanning the assembly and arbitrarily front-loading the extensions is non-deterministic and will result in the types of deadlocks you're seeing now.

I wouldn't front-load any of the extensions at all - lazy loading until they're needed on-demand is still the better approach.
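
For illustration, a minimal sketch of the ordering that lazy loading gives for free; the ExtensionRegistry/Get shapes here are simplified stand-ins, not Akka.NET's actual extension API:

    using System;
    using System.Collections.Concurrent;

    // Simplified stand-in for a lazily loading extension registry: each
    // extension is created on first access and cached, so anything an
    // extension's factory touches is fully initialized before the
    // dependent extension finishes constructing.
    class ExtensionRegistry
    {
        private readonly ConcurrentDictionary<Type, Lazy<object>> _cache = new();

        public T Get<T>(Func<ExtensionRegistry, T> factory) where T : class
        {
            var lazy = _cache.GetOrAdd(
                typeof(T),
                _ => new Lazy<object>(() => factory(this)));
            // First caller creates the extension; everyone else gets the cached value.
            return (T)lazy.Value;
        }
    }

If Cluster's factory asks the registry for RARP, RARP is constructed before Cluster's factory returns, so the dependency order falls out of the access order; eager assembly scanning has no such guarantee.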

@Aaronontheweb (Member)

BTW, one simple fix here to see if it resolves the deadlock - don't await each individual plugin starting up. Compose them into a Task.WhenAll and await on the group.
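
A sketch of that suggestion against the loop in the diff above, assuming the same extensions/IInitializable shapes (and System.Linq / System.Threading.Tasks in scope):

    // Kick off every IInitializable extension, then await the group
    // instead of awaiting each one in sequence, so one extension
    // blocking on another cannot stall the loop itself.
    var initTasks = extensions
        .OfType<IInitializable>()
        .Select(init => init.InitializeAsync(cancellationToken))
        .ToList(); // materialize so every InitializeAsync has actually started
    await Task.WhenAll(initTasks);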

@Zetanova (Contributor, author)

Only Cluster and the RemoteActorRefProvider support it;
yes, the issue is WHO and WHEN the Cluster extension gets called and exits.


    // This is effectively a write-once variable similar to a lazy val. The reason
    // for not using a lazy val is exception handling.
    private volatile HashSet<Address> _addresses;
@Aaronontheweb (Member)

Why replace this with a Volatile.Write down below?

@Zetanova (Contributor, author)

It's a multi-read, single-write field;
volatile has perf issues if something accesses it multiple times.
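
A minimal sketch of the trade-off being described; the AddressCache class and its members are illustrative, not the actual Akka.NET code:

    using System.Collections.Generic;
    using System.Threading;

    // Write-once, read-many field: declaring it volatile fences every
    // read, while an explicit Volatile.Write on the single publishing
    // store (plus Volatile.Read where freshness matters) keeps the hot
    // read path cheap.
    class AddressCache<TAddress>
    {
        private HashSet<TAddress> _addresses;

        public HashSet<TAddress> Current => Volatile.Read(ref _addresses);

        public void Publish(HashSet<TAddress> addresses) =>
            Volatile.Write(ref _addresses, addresses);
    }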

    _provider.Init(this);
    if (_provider is IInitializable init)
        await init.InitializeAsync(cancellationToken);
    }
@Aaronontheweb (Member)

Why would we call Init and InitializeAsync?

@Zetanova (Contributor, author)

IInitializable is a feature facet; it does not replace the old API.

    @@ -302,13 +337,13 @@ private void StopScheduler()
            sched?.Dispose();
        }

    -   private void LoadExtensions()
    +   private List<object> LoadExtensions()
@Aaronontheweb (Member)

Can we type this usefully in order to avoid boxing?

@Zetanova (Contributor, author)

These are only the extensions loaded from config, excluding lazy-loaded extensions.

@Zetanova (Contributor, author)

It was simply the requirement that the Cluster extension needs to be initialized inside/right after the ActorRefProvider.
It does not support "lazy" loading, but it is used like it does.

The whole extension system is non-deterministic, because many system actors lazy-load extensions and their execution depends on the Dispatcher.

Changing the Dispatcher can change the extension loading order.

@Zetanova (Contributor, author)

I cannot reproduce the lock locally.

Something with the unit-test dispatcher and most likely the extensions' Lazy<>.Value is locking somewhere/somehow.

I'll mark this one as WIP.

@Aaronontheweb (Member)

I think it'd be worth writing up a spec on this first and reworking the problem from the top down that way.

@Zetanova (Contributor, author)

If I could reproduce the behavior, then we could write a spec for it.
Sorry that I abuse the build server; all unit tests are successful on my local machine.

I think that I have now found the issue: the addressPromise TCS of Remoting.StartAsync() leaked the dispatcher thread.

@Zetanova (Contributor, author)

@Aaronontheweb I would need help:

  1. I don't know how to activate more log output.
  2. I have no idea how to reproduce the deadlock locally under the debugger to find the deadlock.

I removed basically everything from the Cluster ctor and moved it into InitializeAsync(),
and did the same with Transport.StartAsync().

The call sequence didn't change, and still there is some deadlock
that I cannot find.

The sequence is Transport.StartAsync() and right after it Cluster.Get(sys).InitializeAsync(),
and nothing accesses the Cluster extension before that.

I could even remove the Cluster.ClusterCore late init;
it is not required anymore, and if it is not set it just throws.

The only idea I have left is inside Cluster.Get(sys).InitializeAsync():

    var clusterCore = await _clusterDaemons.Ask<IActorRef>(
        new InternalClusterAction.GetClusterCoreRef(this),
        System.Settings.CreationTimeout, cancellationToken).ConfigureAwait(false);

which just deadlocks instantly, as the answer mailbox is not processed.
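
For illustration, a stand-alone sketch of that failure mode using a plain channel and TaskCompletionSource instead of Akka's Ask (none of the names below are Akka.NET APIs): the await can never complete because nothing is draining the mailbox that would produce the reply.

    using System;
    using System.Threading.Channels;
    using System.Threading.Tasks;

    class AskDeadlockSketch
    {
        static async Task Main()
        {
            // Stand-in for an actor mailbox: replies only appear if some
            // consumer is actually draining the channel.
            var mailbox = Channel.CreateUnbounded<TaskCompletionSource<string>>();

            var reply = new TaskCompletionSource<string>();
            await mailbox.Writer.WriteAsync(reply);

            // No consumer was ever started (the "dispatcher" is not
            // running), so reply.Task can never complete -- the same
            // shape as awaiting Ask before the daemon's mailbox runs.
            var winner = await Task.WhenAny(reply.Task, Task.Delay(1000));
            Console.WriteLine(winner == reply.Task
                ? $"got reply: {await reply.Task}"
                : "timed out: nobody processed the mailbox");
        }
    }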

@Zetanova (Contributor, author)

@Aaronontheweb I moved the fixes out to a new PR and opened a discussion for the startup here:
#5447

@Zetanova closed this Dec 18, 2021