Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check child pid when disowning processes #174

Closed
wants to merge 3 commits into from
Closed

Check child pid when disowning processes #174

wants to merge 3 commits into from

Conversation

pedro-gutierrez
Copy link
Contributor

This PR adds a check on the value returned by Map.pop/2 when disowning child processes given their ID. This should prevent the supervisor from crashing with errors such as:

12:41:20.080 [error] GenServer DistributedSupervisor terminating
** (MatchError) no match of right hand side value: {nil, %{44653510861120721877777239736197567562 => {{DistributedSupervisor, :"horde2@127.0.0.1"}, %{id: 44653510861120721877777239736197567562, restart: :transient, start: {<mod>, :start_link, [<args>]}}, #PID<0.714.0>}, 93025656845656942491727595806594601344 => ...}}
    (horde) lib/horde/dynamic_supervisor_impl.ex:255: Horde.DynamicSupervisorImpl.handle_cast/2
    (stdlib) gen_server.erl:637: :gen_server.try_dispatch/4
    (stdlib) gen_server.erl:711: :gen_server.handle_msg/6
    (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: {:"$gen_cast", {:disown_child_process, 290756678567143339301485527515936608587}}

(I removed specifics from my project).

Note: this fix feels a bit like sweeping under the carpet. I understand the root cause might be somewhere else but so far I haven't been able to reproduce this error in a consistent way.

@pedro-gutierrez
Copy link
Contributor Author

pedro-gutierrez commented Nov 5, 2019

Added a test to this PR. It does not reproduce the issue that I am tracking down, but it shows a net split that, if run in a loop, will eventually fail. When this happens, I've observed that the key for one of the workers is simply removed from all registries for no apparent obvious reasons. In other occasions, I've noticed multiple name conflicts on the same key, and at the end, all registered pids are terminated (ie, we end up in a situation where no winner pid is kept). Hope this helps to track potential race conditions in Horde.

@derekkraan
Copy link
Owner

Hi @pedro-gutierrez, thanks for this PR. I think I agree with your statement that this is sweeping it under the rug. As such I think I'd like to leave Horde.DynamicSupervisorImpl the way it is, but I would like your test so that I can investigate this further when I have some time. Could you revert the changes to the library code for me?

@derekkraan
Copy link
Owner

@pedro-gutierrez I think this might be related to #188

@derekkraan derekkraan mentioned this pull request Dec 24, 2019
@derekkraan
Copy link
Owner

@pedro-gutierrez I think I have solved the problem in #190, can you please test & confirm?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants