deadlock when initializing Akka cluster: document minimum thread pool size #17253
The problem is not in the … What is your configuration? Can you share a way to reproduce it (random frequency is alright)?
My configuration:

```
extensions = [
  "akka.contrib.datareplication.DataReplication$"
]
akka {
  cluster {
    roles = ["master"]
    auto-down-unreachable-after = 15s
  }
  loglevel = "INFO"
  log-dead-letters = off
  log-dead-letters-during-shutdown = off
  actor {
    ## Master forms an Akka cluster
    provider = "akka.cluster.ClusterActorRefProvider"
    creation-timeout = 60s
    provider = "akka.remote.RemoteActorRefProvider"
    default-mailbox {
      mailbox-type = "akka.dispatch.SingleConsumerOnlyUnboundedMailbox"
    }
    default-dispatcher {
      mailbox-type = "akka.dispatch.SingleConsumerOnlyUnboundedMailbox"
      throughput = 10
      fork-join-executor {
        parallelism-factor = 2
        parallelism-max = 2
        parallelism-min = 1
      }
    }
  }
  remote {
    log-remote-lifecycle-events = on
    # use-dispatcher = ""
    use-dispatcher = "akka.remote.default-remote-dispatcher"
    enabled-transports = ["akka.remote.netty.tcp"]
    netty.tcp {
      port = 0
      hostname = "127.0.0.1"
      server-socket-worker-pool {
        pool-size-min = 1
        pool-size-max = 2
      }
      client-socket-worker-pool {
        pool-size-min = 1
        pool-size-max = 2
      }
    }
    default-remote-dispatcher {
      throughput = 5
      type = Dispatcher
      mailbox-type = "akka.dispatch.SingleConsumerOnlyUnboundedMailbox"
      executor = "fork-join-executor"
      fork-join-executor {
        parallelism-min = 1
        parallelism-max = 2
      }
    }
    startup-timeout = 600 s
    shutdown-timeout = 600 s
    flush-wait-on-shutdown = 2 s
    command-ack-timeout = 600 s
    transport-failure-detector {
      heartbeat-interval = 600 s
      acceptable-heartbeat-pause = 2000 s
    }
    watch-failure-detector {
      heartbeat-interval = 600 s
      acceptable-heartbeat-pause = 10 s
      unreachable-nodes-reaper-interval = 600s
      expected-response-after = 3 s
    }
    retry-gate-closed-for = 5 s
    prune-quarantine-marker-after = 5 d
    system-message-ack-piggyback-timeout = 600 s
    resend-interval = 600 s
    initial-system-message-delivery-timeout = 3 m
    enabled-transports = ["akka.remote.netty.tcp"]
    netty.tcp.connection-timeout = 600 s
  }
}
```
Test code is very simple:

```scala
import akka.actor.ActorSystem
import org.apache.gearpump.cluster.TestUtil

object BB extends App {
  val config = TestUtil.MASTER_CONFIG
  (0 until 1000).foreach { i =>
    val system = ActorSystem(s"test$i", config)
    system.shutdown()
  }
}
```

I will try to get some stack trace.
I now have the stack trace. It is clear now why the deadlock happens.
I have configured the default-dispatcher pool size to be 2. From the stack trace, one dispatcher thread is blocked by the initialization process of ClusterCoreDaemon, and another is blocked by the constructor of ClusterHeartbeatReceiver. As ClusterCoreDaemon and ClusterHeartbeatReceiver have occupied both slots of the dispatcher pool, ClusterDaemon.receive cannot get scheduled to respond. So the main thread is waiting for ClusterDaemon to be scheduled, which means waiting for ClusterCoreDaemon or ClusterHeartbeatReceiver to release a thread. Meanwhile, ClusterCoreDaemon and ClusterHeartbeatReceiver are waiting for the Cluster extension to be initialized, which means waiting for the main thread to complete the initialization of the Cluster extension. That creates a dependency cycle: A (main thread) -> B (ClusterDaemon) -> C (ClusterCoreDaemon), and C (ClusterCoreDaemon) -> A (main thread). A typical deadlock.
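The cycle described above is ordinary thread-pool starvation, and it can be modeled without Akka at all. A minimal sketch in plain Java — the class names in the comments map onto the analysis above as an analogy, nothing here is actual Akka code:

```java
import java.util.concurrent.*;

public class PoolStarvationDemo {

    /** Returns true if the queued "reply" task is starved by the blocked pool. */
    static boolean runDemo() throws Exception {
        // A 2-thread pool stands in for the default-dispatcher with parallelism-max = 2.
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CountDownLatch extensionReady = new CountDownLatch(1);

        // Two "actors" (ClusterCoreDaemon / ClusterHeartbeatReceiver in the analysis)
        // block on startup until the "extension" is initialized, occupying both threads.
        for (int i = 0; i < 2; i++) {
            pool.submit(() -> {
                try { extensionReady.await(); } catch (InterruptedException e) { }
            });
        }

        // A third task (ClusterDaemon answering the main thread) can only queue:
        // no pool thread is free to run it.
        Future<String> reply = pool.submit(() -> "core-ref");

        boolean starved;
        try {
            reply.get(1, TimeUnit.SECONDS); // the main thread's blocking wait
            starved = false;
        } catch (TimeoutException e) {
            starved = true;
        }

        // Open the latch so the demo terminates instead of hanging forever.
        extensionReady.countDown();
        pool.shutdown();
        return starved;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runDemo() ? "deadlocked (starved)" : "completed");
    }
}
```

With three or more pool threads the queued reply would get a free thread and `runDemo()` would return false, which lines up with the observation later in the thread that a larger pool only makes the deadlock less likely, not impossible.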
heh, the few blocking calls that we have in startup. We should be able to run with one thread in principle, so we need to think about these.
I also tried configuring the thread pool size to 3; the issue still exists, but is less likely to happen. I found another class … There may exist other classes, so simply changing the default-dispatcher size may not work; we need a complete fix for this. This issue is bothering me every day: suddenly a nightly UT fails :)
Yes, it needs to be fixed; I think it should work with one thread. Easy to add a test to enforce this, though. Thanks for the reproducer!
…then make sure Akka could run on a single-thread pool.
That is what I said :)
Whether the ClusterCoreDaemon blocks in its initialization or not does not affect the ClusterCoreSupervisor, since that happens on a different thread. If you are observing a real deadlock, I suspect that you have configured the cluster dispatcher with too few threads (the default works just fine). What is not entirely clear to me is what your observation is; in case of a deadlock, please post a full JVM stack trace.
Thank you for the reply. Yes, it is due to not enough dispatcher threads (verified that a default dispatcher thread pool size <= 3 will cause the deadlock). I have updated the analysis on #17253 (comment). I think the stack trace is clear enough when combined with the comment at #17253 (comment).
Can you give us a motivation why it is important to be able to run Akka with fewer than 3 threads? We could have a startup error message for this kind of misconfiguration, though.
Yes, we should document this.
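For reference, the finding later in this thread was that a dispatcher with fewer than 5 threads could deadlock during Cluster extension initialization, so a documented minimum for a configuration like the one above might look something like this — treat it as a sketch, with the value 5 taken from that finding rather than from official guidance:

```
akka.actor.default-dispatcher.fork-join-executor {
  # keep enough threads available during Cluster extension startup
  parallelism-min = 5
}
```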
I am not sure. Why can't we run on 1 thread? I mean, we can work around this by documenting it, but I still feel uneasy about it.
3 threads is just the test environment I used to reproduce this problem. It is possible that the deadlock still exists after allocating more threads in the default-dispatcher. I did a source code search in akka-cluster; there are 12 actor class definitions that wait for the extension to be initialized when created, which makes me believe the deadlock may still exist with 4 dispatcher threads or more. So I am not sure whether documenting is the right approach to solve this. Here is a list of classes that wait for the Cluster extension in their constructors: …
@clockfly I see what you mean. I will take a look at the hierarchy of those actors and check that not more than necessary are started while initializing the cluster extension.
I wonder whether there has been any update regarding this issue... I'm trying to use a CallingThreadDispatcher as the default dispatcher.
The CallingThreadDispatcher is not a general-purpose dispatcher (as is documented); using it as the default dispatcher is not supported and will lead to deadlocks, if not in the Cluster startup then elsewhere.
What is your reason for using it that way?
@rkuhn I was hoping it could make it easier to make sure my tests execute in a deterministic order. I am already using … I was commenting on this issue because the stack trace is somewhat similar (ending at …). Thanks for your advice!
Running the tests in a completely different fashion than the production code may give you good test results, but be aware that then you are not really testing what you deploy. If tests fail due to non-deterministic timing or execution order then that points out weaknesses in either your tests (assuming too much) or your code (providing too little). In any case it will lead to better code to fix these issues rather than trying to “avoid” them in tests.
When using a dispatcher (default or a separate cluster dispatcher) with less than 5 threads, the Cluster extension initialization could deadlock. It was reproducible by adding a sleep before the Await of GetClusterCoreRef in the Cluster extension constructor. The reason was that other cluster actors were started too early, and they also tried to get the Cluster extension, thereby blocking dispatcher threads. Note that the Cluster extension is started via ClusterActorRefProvider before ActorSystem.apply returns. The improvement is to start the cluster child actors lazily, when GetClusterCoreRef is received.
I have finally got around to improving this. See description in PR #18332. |
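A rough, Akka-free sketch of the idea behind the fix, again in plain Java — the mapping to ClusterCoreSupervisor and GetClusterCoreRef is drawn from the PR description above, not from the actual Akka source. If the pending request is answered before any child that might block is started, even a single-thread pool gets through startup:

```java
import java.util.concurrent.*;

public class LazyStartDemo {

    /** Models the fixed startup order: reply first, start blocking children later. */
    static String runDemo() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1); // one dispatcher thread
        CountDownLatch extensionReady = new CountDownLatch(1);

        // The eager (pre-fix) order would submit a child here that blocks on
        // extensionReady, consuming the only thread and starving the reply below.

        // Lazy (post-fix) order: serve the GetClusterCoreRef-style request first.
        Future<String> reply = pool.submit(() -> "core-ref");
        String ref = reply.get(1, TimeUnit.SECONDS); // completes: nothing is blocked yet

        // Now "initialization" is done; a blocking child can start safely.
        extensionReady.countDown();
        pool.submit(() -> {
            try { extensionReady.await(); } catch (InterruptedException e) { }
        });

        pool.shutdown();
        pool.awaitTermination(2, TimeUnit.SECONDS);
        return ref;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runDemo());
    }
}
```

The design point is purely about ordering: the same tasks run either way, but deferring the ones that block on initialization until after the reply removes the cycle.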
=clu #17253 Improve cluster startup thread usage
patriknw =clu #17253 Warn about too small cluster dispatcher
Problem: timeout when an Akka cluster is trying to create the Cluster extension.
Frequency: random
Investigation
The timeout happens in class Cluster, when trying to GetClusterCoreRef,
which leads to …
ClusterCoreSupervisor will reply with ClusterCoreDaemon.
The problem is that ClusterCoreDaemon also requires the Cluster extension to be initialized.
Check the definition of ClusterCoreDaemon:
val cluster = Cluster(context.system) will call system.registerExtension, which is where the exception stack shows the thread waiting.
update on 4/22
The updated analysis is in comment #17253 (comment).