Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option process_distribution: :active which rebalances processes on node joining / leaving. #164

Merged
merged 31 commits into from Nov 8, 2019

Conversation

sikanrong
Copy link
Contributor

@sikanrong sikanrong commented Sep 23, 2019

So the use-case is:

If you want to use Horde to dynamically add members to your DynamicSupervisor cluster at runtime (which in itself works just fine), there's no way in the library to manually trigger a re-distribution of processes so that they spread to the newly-added member(s) in the cluster according to your chosen :distribution_strategy. As a Horde user, all you're really left with is killing and re-starting all of your processes and hoping that they land on other nodes.

To solve this issue I've written DynamicSupervisor.rebalance/1. Basically what this does is run each child spec through the distribution_strategy.choose_node/2 function, detect which processes should be running on other nodes, and specifically terminate/restart those particular processes such that they'll spin up on the new node where they should be according to the new cluster configuration.

In the future perhaps this could take more options to make it so that we specifically find processes that should be on a particular node or list of nodes (e.g. the newly-added members of the cluster) and only re-assign those processes... For now, it's just a blunt uniform redistribution tool. In any case, I know this has been really useful for me and I imagine there are other users that are looking to do the same thing in a clean way.

I've also added a (very) comprehensive test for the new rebalance/1 function in dynamic_supervisor_test.exs and ensured that no type-checking (dialyzer) errors are thrown from the new code.

...I hope you like it! I'm obviously a big fan of Horde, and my team is using it in a big project 馃憤

Alexander Pilafian added 4 commits Sep 20, 2019
鈥 force redistribution of processes between existing nodes in the cluster according to the configured distribution strategy.
鈥rks with members (Horde.DynamicSupervisor.Member) and not just erlang nodes. Still should verify that after the rebalancing operation all the pieces fall where they should fall
鈥 and verifies that all of the pids which should have been redistributed to node 2 (n2) are actually present on that node according to the node's own state
鈥 start_child are both suceptible to the error patterns returned by proxy_to_node, as both return a call to this function under determined circumstances. The @SPEC needs to reflect this so that type-checking doesn't throw errors
@sikanrong
Copy link
Contributor Author

sikanrong commented Sep 23, 2019

oh fooey. Looks like I accidentally left a (single) compiler warning in the build. Not sure exactly how to update a PR on github because I've never done it before. I'll look into it.

In any case; not such a big deal.

@sikanrong
Copy link
Contributor Author

sikanrong commented Sep 23, 2019

Alright I get how this works now; added a commit to fix the compiler warning (unused var) and now circle is giving me formatting errors. Will also fix these post-haste!

@sikanrong
Copy link
Contributor Author

sikanrong commented Sep 23, 2019

...all checks passed!! 馃挴馃

@derekkraan
Copy link
Owner

derekkraan commented Sep 24, 2019

Hi @sikanrong, thanks for the PR! I'm going to read through it and then I'll let it marinade for a few days. This is a (relatively) big feature, and I want to make sure we're happy with it before we hit merge.

@sikanrong
Copy link
Contributor Author

sikanrong commented Sep 24, 2019

Hey @derekkraan, I totally understand. You marinade it up 馃憤

There was a lot working against me to get to this implementation. I mean at first, I had written this feature in the code for my own application using :sys.get_state/1 and generally "monkey-patching" Horde without actually opening it up or forking the repo..

The implementation could be a bit cleaner, but you would have to change the DynamicSupervisorImpl state struct a bit to make the data the Rebalancer depends on more accessible.

Specifically, it really would be much better if the process :name didn't have to be parsed out of child_spec.start, but since you randomize all the child_spec.ids instead of keeping the user-provided values, it's hard to keep track in any meaningful way (to the user) of what exactly is being rebalanced.

So idk; this PR is by no means feature-complete, and the implementation could be better, but not without making some deeper structural changes to DynamicSupervisorImpl. I look at it as a good first step on which to build { more | better | more elegant } rebalancing logic.

In any case; this is a major missing feature; Horde is unusable in my project without it. Without this you're just totally unable to create a truly elastic Horde cluster, because when you add new nodes there's no way to reassign existing processes to them.

@sikanrong
Copy link
Contributor Author

sikanrong commented Sep 24, 2019

Also it could lead to more elegant patterns if the Rebalancer GenServer would maintain its own state struct depending on its own needs, such that it wouldn't have to make a transformation of the DynamicSupervisorImpl state. However, that would take more coordination with DynamicSupervisorImpl and my goal here was to try and build something we could "drop-in" with minimal changes to other modules. In that, I feel like I've succeeded 馃

@sikanrong
Copy link
Contributor Author

sikanrong commented Oct 1, 2019

hey @derekkraan; it's been a week so I just thought I'd chime in and perhaps try to start a dialogue. I'm not sure how you're feeling about the implementation but I just wanted to say that if you were willing to be quite descriptive in telling me how you'd ideally like this feature to be implemented; my team and I would be more than happy to put in the necessary work to make it happen.

I suppose I just feel like the best thing would be for us to coordinate so that we can end up with the best possible solution, and put that in the hands of the community as fast as possible.

I wonder if you'd maybe like to be in more direct correspondence so we could talk about the best way forward? One thing I think that could really help rebalance/1 a lot (in general) might be the conservation of the child_spec IDs (instead of randomizing them as per current Horde implementation). Anyway idk, maybe there's a better forum for discussing the deeper Horde implementation details? Or perhaps you'd like to talk one-on-one? Just let me know :)

@derekkraan
Copy link
Owner

derekkraan commented Oct 2, 2019

Hi @sikanrong,

The problem is that I am myself not quite certain what I would accept in this PR. There are some initial questions to consider:

  1. How to signal to processes that they are being shut down for redistribution (see: exit signal when Registry does conflict resolution)?
  2. When should this be triggered?
  3. Should this be configurable (or perhaps pluggable functionality ala distribution_strategy)?
  4. How to make sure that this remains performant? The current approach (sending basically all the state to a separate process) will not be performant.

I think github issues is a good place to talk. Keep in mind that I have a day job, and can't make myself available 24/7 for this project.

I'm also going to ping @beardedeagle in here because he was asking about this before and might have some insights to share with us.

@sikanrong
Copy link
Contributor Author

sikanrong commented Oct 3, 2019

Alright so I've given this quite a bit of thought, and surely what I'm about to write is going to be a few pages long, so apologies in advance for that...

So first off @derekkraan I totally get it; I also have a day job where we're using Horde in a commercial project, so ensuring that it'll work for everything we'll need it to do is my day job. I hope this explains why I'm so enthusiastic about figuring out the rebalance logic in the best possible way. While I'm aware that I could've used another more production-ready solution, I just really liked the Horde base implementation (we're also using your MerkleMap and CRDT implementations in other projects) so I thought that the best thing to do would just be to use it and add anything that we find missing.

I'll try to respond to your enumerated points with my own enumerated points, but probably there will be some things that don't really fit into the pattern.

  1. How to signal to processes that they are being shut down for redistribution (see: exit signal when Registry does conflict resolution)?

So for this part, I think probably the best solution would be basically to just pass a different exit signal to the process in ProcessSupervisor.terminate_children/2 so as to give the user a chance to respond to the exit signal and perform any pre-redistribution logic that it may need to. Basically the same way that you resolve the eventual-consistency :name_conflict problem in RegistryImpl.process_diff/2.

Evidently for this to work we would have to flag a given process as "shutting down for redistribution" so that terminate_children/2 knows to pass the correct exit signal, but this should be easy. We already have logic to deal with the :shutting_down state, and we could just add a shutdown reason (which could be :redistributed).

  1. When should this be triggered?

As for when this should be triggered, I would think any time the membership has changed, either by using Cluster.set_members/2 or when triggered by :crdt_update should be handled in the same way.

Currently you use has_membership_change? to detect this state, which you respond to with a series of calls, one of which is handle_dead_nodes/1. Effectively handle_dead_nodes/1 and check_processes/3 are the two functions currently responsible for handling process redistribution in the case of a dead/shutdown cluster member node. I think instead of handle_dead_nodes/1 we should just have the call to rebalance/1 in its place, which would effectively redistribute the processes in a deterministic way under any membership-change circumstances.

Also (as a final thought) I think additionally that redistribution should be something that the user can access via the API and trigger on-demand if needed.

  1. Should this be configurable (or perhaps pluggable functionality ala distribution_strategy)?

Everything should be configurable! We should make no assumptions about how the user may want to use the tool. That being said, rebalance should definitely also be configurable as well. However I don't think that the DistributionStrategy is a good place for the rebalance logic to go, in the sense that redistribution depends too much on the (distributed) state in DynamicSupervisorImpl. As well, it would depend on calling private functions like DynamicSupervisorImpl.add_child/2 from the distribution strategy, and I just don't think that's going to end up being a very clean way to do things.

However, we could use the distribution_strategy to simply configure which events should trigger a redistribution (like :member_added or :member_removed). Though I wonder if maybe this is something that would be better put in a config/ file?

Note: for the same reason I find it hard to imagine a "pluggable" solution. I mean we already have DistributionStrategy, and one would assume that any logic about redistribution should fit under that scheme. It's hard to imagine having another pluggable solution (RedistributionStrategy?) just to do specifically this.

  1. How to make sure that this remains performant? The current approach (sending basically all the state to a separate process) will not be performant.

In terms of the current implementation:

  • I'm not too big a fan of calling DynamicSupervisor.start_child and DynamicSupervisor.terminate_child from the Rebalancer (though it does get the job done); I'd rather not have the Rebalancer be it's own process at all. I'd much rather handle things in a similar way to handle_dead_nodes/1 where the actual rebalance logic would be a function in DynamicSupervisorImpl which is called by all nodes on membership change event: members that detect they have processes which should be terminated would terminate them, and members which detect that they have new processes would start them up.
  • I really hate the current method of sending the state to a separate process and then traversing it (looping over things, possibly various times) in unnecessary ways. I would want to get rid of all the unecessary transformations of the DynamicSupervisorImpl state. All there should be is a single O(n) loop over DynamicSupervisorImpl.processes_by_id to run distribution_strategy.choose_node on each child_spec.
  • I really don't like pattern-matching the process name out of the child spec (which won't always work in every case); I've only done this so that I can output something useful to the user in terms of trying to communicate which processes are being redistributed, and to which members of the cluster.
    • Ideally, we would just keep the child_spec.id fields that the user provides instead of forcibly replacing them with a random number. In that way, we could just use the child_spec.id to output to the log on redistribution. Otherwise we would have to output the entire child_spec in order for the user to have something meaningful to look at. In any case, since this just concerns logging and debug info maybe it's not a big deal.
    • In a more over-arching general sense, it would be really nice as a Horde user to be able to have a better idea of which processes are running on which members; calling which_children doesn't return anything useful to that end. What is the logic behind replacing the child_spec.id with a random number?? Is it to avoid eventual-consistency :name_conflict type of errors? Why not just handle them in the same way that they are handled in Horde.Registry?

@sikanrong
Copy link
Contributor Author

sikanrong commented Oct 3, 2019

Anyway tell me what you think of all that when you get a chance, as well @beardedeagle should chime in with any tips! If all of this looks fine, I'd be happy to update my code to make it conform to these new design principles.

@derekkraan
Copy link
Owner

derekkraan commented Oct 7, 2019

Hi @sikanrong, Thanks for your message and your enthusiasm.

So the reason we randomize child ids is because this is what DynamicSupervisor does. Additionally, it doesn't work nicely when both Horde.DynamicSupervisor and Horde.Registry are checking the uniqueness of a process's registration, what you can end up with (and this is how Horde used to work) is a race condition where the supervisor chooses a different process to survive than the registry, and you end up with 0 running processes instead of 1.

I agree generally with your proposed way forward. I think there will be some tricky things in Horde.ProcessesSupervisor that you'll need to take care of, but let's cross that bridge perhaps when we get to it. Concentrating most of the logic in Horde.DynamicSupervisorImpl will solve a lot of the issues with this PR already.

The only part I'm somewhat unclear on is how we can best make this pluggable and or configurable. But perhaps we can start with just enabling / disabling with a flag (default to off) and go from there.

Let me know if you have any other questions before you get started.

@sikanrong
Copy link
Contributor Author

sikanrong commented Oct 8, 2019

@derekkraan alright great; thanks so much for reading over my changes and considering them! I'll update the code sometime later this week and then we can have another look at it and see if there's any other tweaks that need to be made.

Alexander Pilafian added 11 commits Oct 12, 2019
鈥sed in pull request 164. This gets rid of handle_dead_nodes and replaces it with a more general redistribute method which is automatically executed when the cluster configuration is changed. Updates the tests for redistribute (which pass), as well as maintiaining the manual DynamicSupervisor.redistribute/1 method. Not quite done as it must be made configurable and the standalone redistribute method must also have an accompanying test. As well, one other test is now failing and I must find out why...
鈥istribution has its own triggering mechanism which only will trigger when a node transitions into :alive, :dead, or :redistribute state(s). This makes it so that it will effectively wait for process gracefulshutdown before restarting those processes on their new nodes. This is either triggered via update_process or update_member info when a horde member node is marked :dead. Now the only remaining problem is that sometimes the DynamicSupervisorImpl genserver is too busy to respond to the :horde_shutting_down call
鈥an be more specific about exactly which state changes should trigger the redistribution, instead of just taking the final indicated state and trying to base it only on that. Like now I can say things like 'when a node transitions from :uninitialized to :alive or from :shutting_down to :down, do a rebalancing'
鈥own test; consolidating the choose_node logic into a private function which takes a Horde.DynamicSupervisorImpl.Member and correctly prepares it for distribution such that the distribution is effectively deterministic and no matter how the nodes are (re-)distributed they will always be (re-)distributed deterministically.
鈥te_on callback and associated events. Disable the :up redistribution for the graceful shutdown test, if not then the redistributor would fire when :horde_2_graceful node would join the cluster (as it should). All tests passing; also fixed remaining compilation errors and removed debug output
鈥wn then we should also gracefully NOT restart those processes which have been gracefully shutdown
鈥 to :dead are valid reasons for redistributionw
鈥 pass any termination reason to ProcessesSupervisor.terminate_child_by_id and have that propagate through the supervisor logic until it gets to monitor_children/exit_child. Uses this to add a :redistribute exit signal when a process is shut down by horde for redistribution. Also adding tests for this
鈥 such that I can add more tests regarding all sorts of redistribute related stuff without repeating too much code.
Alexander Pilafian added 6 commits Oct 16, 2019
鈥 as well as redistribute_processes. Also ensure that the tests properly clean up after themselves. Despite not all currently passing, the test suite is complete.
鈥e is responsible for the looping instead of doing it with pattern matching and various definitions
鈥 the worst fucking thing I've ever done in 15 years of programming I think. In the end the solution makes perfect sense. If (during redistribution) the node tries to start the nodes that it should be running only to find that those PIDs are already running (on another node) then it will monitor those processes and wait for them to gracefully terminate before adding those children on the destination node. All in all, a very robust system, allowing for the nodes to take as long as they need to
鈥fore doing the final assertion - it allows for n2 to catch up and spin-up its processes
鈥cking on the member status (which is hard to manage) it's now going to be its own key in the CRDT which is set with a Kernel.make_ref() and then that ref is stored in the state after redistribution such that it only triggers redistribution if the refs don't match. This makes it so that all connected nodes who receive the message trigger redistribution
@sikanrong
Copy link
Contributor Author

sikanrong commented Oct 18, 2019

@derekkraan w00t! It's done. I'm really happy with the way this turned out; now it really feels like a part of the library that was meant to be there from the beginning. The end result is really very elegant.

It conforms to the principles we discussed:

  • It replaces handle_dead_nodes with new the new redistribution logic in :crdt_update. The most central part of the implementation is in needs_redistribution?/2 and redistribute_processes/1; this is where horde reacts to the CRDT diffs and figures out if redistribution should be triggered or not.
  • It makes redistribution configurable; creates a new redistribute_on/1 method in UniformDistribution which at the moment just returns the default :all, but this can be also be overridden by using Config. In the DynamicSupervisor, all of the redistribution configuration is sourced from state.distribution_strategy, as discussed. Effectively, the redistribution configuration is both pluggable and configurable.
    • The available options are :all | :none | :up | :down. With :up and :down the user can specify that the redistribution should only occur on member :down or member :alive. The :down setting is how horde currently behaves on derekkraan/horde - meaning it only redistributes processes when their node goes down (Note: if :up is specified, this also implies that process shouldn't be restarted on graceful shutdown but instead terminated).
    • The user can still manually trigger redistribution via redistribute/1, but it only really makes sense anymore in the context of automatic redistribution being disabled by using the configuration option :none. Still, this is something users may want to do.
  • As discussed, when processes are shut down for redistribution, they are sent the :redistribute flag as the reason parameter in Process.exit/2 such that they can use Process.flag(:trap_exit, true) to specifically react to the graceful redistribution shutdown event.
  • The redistribution logic behaves well with graceful shutdown, other nodes will wait for the target processes to (gracefully) shut down before starting them back up.
  • All of these features are backed up by a lot of new tests; I really put a lot of effort into making the tests easy to write and read; partially by using helper methods and partially by using proper startup and teardown logic in ExUnit. The end result is that all of these new features are tested, and that all of the redistribution-related tests are grouped together in their own describe block within dynamic_supervisor_test.exs.

@derekkraan
Copy link
Owner

derekkraan commented Oct 22, 2019

Hi @sikanrong, I haven't had a chance to look at the actual code yet, but I do have a comment. The redistribute_on/1 callback seems unnecessary. We always want to redistribute on node down, since this is a failure condition that we have to react to (this is inherent to a supervisor). If you want a process to not come back up after a graceful shutdown, then you should use restart: :temporary or restart: :transient to trigger the desired behaviour that way.

So if we take away the option to redistribute on :down (and just always do it), then we are left with two ways to configure the same behaviour. In that case, I am sort of divided on whether it belongs in the distribution_strategy or in the configuration. I would perhaps lean towards configuration, since it's really just a simple boolean, and there is not really a need to be able to provide a callback that could implement its own logic.

Of course if you have arguments for why it should work the way you have done it in your PR then I'm open to hearing those.

The rest looks good (in principle), I'll take a look at the actual code if we can agree on the right way to proceed with the above issue.

Derek

@sikanrong
Copy link
Contributor Author

sikanrong commented Oct 22, 2019

hey @derekkraan; your comments make sense. I also thought that maybe redistribute_on and the various configuration types were overkill for this, it's hard to imagine anybody ever wanting to use :down and :up. So, I'll tweak it such that:

  • The redistribution config is just a boolean, if it's true then we redistribute nodes when they come :alive, otherwise we don't.
  • Just have a simple Config variable for this, instead of using the distribution_strategy
  • Remove logic for :up and :down, make it so that we always redistribute on :down (via graceful shutdown), and so that we redistribute on :up only if redistribution is configured as enabled.

I can have this updated by tomorrow
//Alex

鈥ve the redistribute_on method from the distribution strategy. Instead, establishes a much simpler boolean configuration key :redistribute_on_alive which determines (as the name suggests) whether or not Horde will automatically redistribute processes when it detects a new node has been added to the cluster. There is no longer any configuraion option which makes it so that horde will NOT redistribute processes on node :down (or on graceful shutdown).
@sikanrong
Copy link
Contributor Author

sikanrong commented Oct 23, 2019

@derekkraan - alright, those changes were really easy to make; I'm excited for you to review the code! If there's any further changes or tweaks to be made that's no problem at all, just let me know.

//Alex

@derekkraan
Copy link
Owner

derekkraan commented Oct 28, 2019

Hi Alex,

I just took a look at the code. I am unsure of the approach taken to track processes that should be supervised by the current node.

For some background information, the way we track processes in the CRDT is with a key {:process, child.id} and the value is {node_name, child, child_pid}.

The approach I would prefer is:

  1. When cluster membership changes, check the list of processes for any unclaimed processes that belong on your node, then start them.
  2. When cluster membership changes, check the list of your own processes to find out which ones should be moved, and move them (terminate them, but only report them as having no node, do not remove them from the CRDT entirely).
  3. When a cluster gets updated and has no node, check to see if it belongs on the current node, and start it.

Does this make sense? This way there is no requirement to coordinate redistribution. It also means that the user can't trigger redistribution globally, but I'm not sure this is a big problem. We could remove that option or just have it trigger a redistribution locally (so the user would be responsible for coordinating redistribution across the entire cluster if that's what she wants).

So basically, nodes terminate and "release" processes when they detect that they should no longer be the owners, and when updates come in, check to see if you should claim an unclaimed process.

I think this will necessitate a small change in ProcessesSupervisor, since it will need to respond differently to a :redistribute shutdown message. (Similar to how we disown temporary or transient processes).

I also want to check with you about the terminology. People are used to swarm's "process handoff", is "redistribute" what we want to call it? Think about it and let me know what you think.

Hope this feedback is useful. Let me know if you have any questions.

Cheers,
Derek

@sikanrong
Copy link
Contributor Author

sikanrong commented Oct 28, 2019

Hey @derekkraan !

All of this is really great feedback; it's all very well-explained. Your comments make perfect sense. The preferred approach is crystal clear; I can see how this will simplify the code and make for a more robust system.

I think I can probably tweak the implementation to conform to those ideas in the next two or three days.

  • For now, I'll just remove the ability to globally trigger a redistribution. I'm not bothered by this.
  • In terms of terminology we could change the :redistribute shutdown reason to :handoff or :horde_handoff to be more clear about what's happening, and to try and approximate the same language used by swarm. I originally started with the word "rebalance" and changed it to "redistribute" because that's what you and @beardedeagle were calling it. If the goal is clarity then probably :handoff or :horde_handoff are the best options.

It's possible that I'll have some more questions during development, so please watch this space!

//Alex

鈥ples laid out by @derekkraan such that each Horde.DynamicSupervisor be as stateless as possible such that redistribution really doesn't have to be coordinated between nodes. Rather, the whole process is now one of nodes 'relinquising' processes to the greater Horde, which are later 'picked up' by the nodes that they should be on as those new nodes intercept the message that a (child) process no longer has a (parent) node. This 'relinquish' message is the very same CRDT message used by the GracefulShutdownManager to indicate to member nodes that they need to pick up those parent-less children.
@sikanrong
Copy link
Contributor Author

sikanrong commented Oct 29, 2019

@derekkraan alright it's all done! Implementation ended up being faster than expected; I'm really happy with the changes made. It makes each Horde node as stateless as possible, and avoids the need for the nodes to do any coordination for process redistribution! Perfect 馃憤

Now each node will simply "relinquish" it's processes ( to the Horde! ), and then as the other nodes receive the signal that a given process is without a parent node, that node will check to see if it should own that process and if it finds that to be true it will start that process up via add_child.

I went with :horde_handoff for the shutdown reason.

Tell me what you think of the new changes! As ever, I'm still very open to tweaks, changes, and improvements.

Copy link
Owner

@derekkraan derekkraan left a comment

Hi @sikanrong,

Thanks for all the work on this PR. I think we are getting closer! Please see my individual comments for the necessary changes.

lib/horde/dynamic_supervisor_impl.ex Outdated Show resolved Hide resolved
lib/horde/dynamic_supervisor.ex Show resolved Hide resolved
lib/horde/dynamic_supervisor_impl.ex Show resolved Hide resolved
lib/horde/dynamic_supervisor_impl.ex Outdated Show resolved Hide resolved
config/config.exs Outdated Show resolved Hide resolved
lib/horde/dynamic_supervisor_impl.ex Outdated Show resolved Hide resolved
lib/horde/dynamic_supervisor_impl.ex Outdated Show resolved Hide resolved
lib/horde/graceful_shutdown_manager.ex Show resolved Hide resolved
@sikanrong
Copy link
Contributor Author

sikanrong commented Nov 1, 2019

Hey @derekkraan,

It's a holiday here today, but I'll spend some time later in the day looking over these change requests and implementing them (and/or responding to them).

I definitely did run horde through dialyzer (it's part of the circleCI integration tests too) and I don't know why it didn't pick up on the terminate_child typings

Alexander Pilafian added 2 commits Nov 2, 2019
鈥 the code around process-termination and signaling child-relinquish to the Horde. This is now implemented according to @derekkraan's proposed method of just calling Process.exit(child_pid, {:shutdown, :process_redistribution}) and then intercepting that message in ProcessesSupervisor.maybe_restart_child/5, which in turn notifies the DynamicSupervisorImpl that it should send out the process-without-parent message to the rest of the horde via the CRDT. Also moves away from using Application config and instead uses an options-based implementation where :process_redistribution is one of :active or :passive. Restructuring the tests to use this new active/passive init-options format. Resets the default behavior to :passive
鈥ich for some reason doesn't compile when I run mix compile on my local machine
@sikanrong
Copy link
Contributor Author

sikanrong commented Nov 2, 2019

Hey again @derekkraan ;

So I've made the requested changes and in general everything went totally fine.

The single snag that I ran into has to do with some nuance from Elixir's GenServer implementation; If you look at the documentation for terminate/2 it reads:

the GenServer traps exits (using Process.flag/2) and the parent process sends an exit signal

The bold emphasis is mine. So what it's trying to describe here is that when you set Process.flag(:trap_exits, true) in your GenServer, you're supposed to be able to catch those exits in terminate/2 without needing to implement handle_info({:EXIT, pid, reason}, state).

The key words here are "parent process". In Horde, that parent process would be Horde.ProcessesSupervisor. If it receives the exit signal from any other non-parent process (like Horde.DynamicSuperviosrImpl), you'll have to basically do an exit signal pass-through with handle_info:

  @impl true
  def handle_info({:EXIT, _pid, reason}, state) do
    {:stop, reason, state}
  end

If you don't do this, it will error about receiving an unexpected :EXIT info messages that you have to handle.

So to avoid exactly this: instead of calling Process.exit/2 from the DynamicSupervisorImpl process directly I've created a handle_call({:send_exit_signal, pid, reason}, _f, state) in ProcessesSupervisor which just calls Process.exit/2.

@sikanrong
Copy link
Contributor Author

sikanrong commented Nov 2, 2019

Oh also I don't understand how to mark your change requests as "done"? I thought I was just supposed to hit "resolve conversation" as I closed out each item, but it still says "1 review requesting changes" and I don't seem to be able to mark it as "finished" anywhere. Maybe you can help me understand what GitHub wants me to do there (if I'm supposed to do anything?).

sikanrong added 2 commits Nov 2, 2019
鈥y called from the ProcessesSupervisor process which eliminates the need for the ugly handle_info exit passthrough
鈥_exit_signal to a call (from a cast) which seems to work this time
@sikanrong
Copy link
Contributor Author

sikanrong commented Nov 2, 2019

alright @derekkraan , I made a few commits to straighten out some implementation snags that were bothering me (detailed in the much-edited comment above) but now I'm calling this officially "done"! The final version is perfect 馃槂I'm super happy with it.

@sikanrong sikanrong requested a review from derekkraan Nov 2, 2019
@derekkraan
Copy link
Owner

derekkraan commented Nov 8, 2019

@sikanrong, this is looking great. I am very happy with the structure of this feature and how it has turned out, and there are a lot of little things you were able to clean up along the way.

Sending the exit signal through the parent process (the supervisor) is a great idea, I'm glad you thought of that.

@derekkraan derekkraan merged commit d7dead0 into derekkraan:master Nov 8, 2019
1 check passed
@derekkraan derekkraan changed the title Add DynamicSupervisor.rebalance/1 to redistribute processes on-demand Add option process_distribution: :active which rebalances processes on node joining / leaving. Nov 8, 2019
@sikanrong
Copy link
Contributor Author

sikanrong commented Nov 13, 2019

Hey @derekkraan I just wanted to thank you for your kind words and for making yourself available to work with me on this and review my code. Many thanks! I've really enjoyed this and I hope you have too; I'm stoked that this is a part of Horde now :)

Since we're using a lot of other Kraan-brand鈩笍 libraries in our projects maybe we can have more collaborations like this in the future 馃槂

//Alex

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants