ERL-885: Calls to global:register_name hang in whole cluster for hours #3923
See also erlang GH-4448 and erlang GH-3923. A race between locker processes on different nodes has been resolved by using global_name_server as a proxy.
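Since the bug manifests as `global:register_name/2` blocking the caller indefinitely while the cluster-wide lock is deadlocked, a workaround some applications use is to run the registration in a separate process and impose a timeout. The sketch below is ours, not from the linked fix; the module and function names (`global_reg_guard`, `register_with_timeout/3`) are invented for illustration:

```erlang
-module(global_reg_guard).
-export([register_with_timeout/3]).

%% Run global:register_name/2 in a monitored helper process so that a
%% cluster-wide lock deadlock surfaces as {error, timeout} instead of
%% hanging the caller forever.
register_with_timeout(Name, Pid, TimeoutMs) ->
    {Helper, Ref} = spawn_monitor(fun() ->
        exit({result, global:register_name(Name, Pid)})
    end),
    receive
        {'DOWN', Ref, process, Helper, {result, R}} -> {ok, R};
        {'DOWN', Ref, process, Helper, Reason}      -> {error, Reason}
    after TimeoutMs ->
        exit(Helper, kill),
        %% Flush the 'DOWN' message so it doesn't leak into our mailbox.
        receive {'DOWN', Ref, process, Helper, _} -> ok end,
        {error, timeout}
    end.
```

Note this only protects the caller; the underlying lock in `global` may still be stuck cluster-wide.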
I'm not sure, but it's likely that GH-4912 fixes this particular issue.
Sorry for raising this issue again, but I suppose it might need to be reopened. I had a new case of a similar problem yesterday in a completely different project which doesn't use any hidden nodes but does use "auto connect" and "prevent overlapping partitions". This time the cluster consists of just 13 nodes, although they are executed in Kubernetes pods and occasionally they seem to be rescheduled to different hosts.

Unfortunately, this time I forgot to follow what I described here, and I'm not sure the […]

I tried to unblock global using the approach found in RabbitMQ (rabbitmq/rabbitmq-server@fba455c), but it didn't help. I tried to compare the versions attached to […] Then I noticed that the […]

While trying to repeat this operation on the rest of the nodes, I noticed that "prevent overlapping partitions" kicked in and started disconnecting other nodes. As we use the Kubernetes API to discover the nodes which should be connected, and libcluster to connect them all together, the nodes were reconnected back, so the "overlapping partitions prevention" resulted in 2366 disconnections between just 10 different nodes. As I'm sure the nodes could actually connect to each other, I wonder if you think "prevent overlapping partitions" worked correctly in this case?

Anyway, after I disconnected all the peer nodes found as […]

Does this help in diagnosing what is actually misbehaving? I'm planning to add some diagnostic code to run before the fix would be applied automatically, so please let me know what parts of the state(s) I should dump for further research.
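For reference, a hedged sketch of the kind of state one might dump from a remote shell on a stuck node. Everything here is standard OTP introspection; `global_locks` is an internal, undocumented ETS table used by the `global` module, so its name and contents may differ between OTP releases:

```erlang
%% global_name_server is a gen_server, so sys:get_status/1 returns
%% its internal state, including the known/synced node lists:
sys:get_status(global_name_server).

%% The global module keeps its bookkeeping in named ETS tables;
%% global_locks shows which cluster-wide locks are currently held:
ets:tab2list(global_locks).

%% Names currently visible to this node, and the connected nodes:
global:registered_names().
nodes().
```

Comparing the `sys:get_status/1` output (in particular the synced-nodes bookkeeping) across all nodes is usually the quickest way to spot which node disagrees with the rest.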
Original reporter: rumataestor
Affected version: OTP-19.3.6
Component: kernel
Migrated from: https://bugs.erlang.org/browse/ERL-885