
{:EXIT, #PID<0.2945.0>, :normal} #79

Closed
garthk opened this issue Sep 23, 2020 · 6 comments

garthk commented Sep 23, 2020

G'day!

Replicated cache operations performed while trapping exits result in exit messages for processes you didn't start yourself, in turn causing FunctionClauseError crashes and other misbehaviour.

iex(311)> flush()
:ok
iex(312)> Process.flag(:trap_exit, true)
true
iex(313)> flush()
:ok
iex(314)> MyApp.Cache.flush()
:ok
iex(315)> flush()
{:EXIT, #PID<0.2945.0>, :normal}
:ok
iex(316)> MyApp.Cache.set(1, 1)
1
iex(317)> flush
{:EXIT, #PID<0.2950.0>, :normal}
:ok

I've confirmed with :dbg that those processes were started by the cache's task supervisor:

iex(343)> Process.info(pid("0.705.0"))
[
  registered_name: MyApp.Cache.TaskSupervisor,
  current_function: {:gen_server, :loop, 7},

… and, by modifying the code, that it was Nebulex.RPC.multi_call/3 in particular:

@spec multi_call(Supervisor.supervisor(), node_group, Keyword.t()) :: term
def multi_call(supervisor, node_group, opts \\ []) do
  node_group
  |> Enum.map(fn {node, {mod, fun, args}} ->
    Task.Supervisor.async({supervisor, node}, mod, fun, args)
  end)
  |> handle_multi_call(node_group, opts)
end

If you switch to Task.Supervisor.async_nolink/4, you won't cause those messages. I think your use of Task.yield_many/2 below should still catch them exiting?
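For illustration, here's a minimal, self-contained sketch of that suggestion (the module name and timeout are made up for the example, this isn't the actual Nebulex.RPC patch): async_nolink/4 starts the tasks under the task supervisor without linking them to the caller, and Task.yield_many/2 should still collect their results.

defmodule MultiCallSketch do
  # Spawn one unlinked task per node and collect the replies; the caller no
  # longer receives {:EXIT, pid, :normal} messages for these tasks.
  def multi_call(supervisor, node_group, timeout \\ 5_000) do
    node_group
    |> Enum.map(fn {node, {mod, fun, args}} ->
      Task.Supervisor.async_nolink({supervisor, node}, mod, fun, args)
    end)
    |> Task.yield_many(timeout)
  end
end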

garthk commented Oct 12, 2020

G'day! I've tried it out, but despite elixir-lang/elixir#5554, async_nolink doesn't seem to solve the problem. In one scenario I end up checking the cache a thousand times, saturating my schedulers. Reckon it'd help to direct my get/2 calls to the primary, as if it were a read replica?

cabol commented Oct 12, 2020

Hey!! Sorry, I haven't been able to look into it yet; I'll try to check it out this week. On the other hand, out of curiosity, have you tried the partitioned adapter Nebulex.Adapters.Partitioned? Because if the problem is related to Nebulex.RPC, other adapters could be affected too.
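A partitioned cache would be declared much like a replicated one, something like this sketch (the module name is just illustrative):

defmodule MyApp.PartitionedCache do
  use Nebulex.Cache,
    otp_app: :my_app,
    adapter: Nebulex.Adapters.Partitioned
end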

garthk commented Oct 13, 2020

Not yet, no, but your hunch strikes me as reasonable.

I've figured out most of our recent trouble was coming from all/1, via :shards_local.select/3 and :proc_lib.spawn_link/3. Luckily, it turned out stream/1 uses :shards_local.select/4, which doesn't spawn tasks… at least for that particular usage pattern. I haven't checked the code.

Next worst was add/2, which I've replaced with set_many/2. Not quite the same semantics, but it'll do, and because of my usage pattern that cut down the exit message volume by ~99%.
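Roughly, the two changes look like this sketch (cache calls follow the 1.x API used in this thread; the entries are made-up examples):

# all/1 goes through :shards_local.select/3, which spawns linked processes;
# stream/1 goes through :shards_local.select/4, which, for this usage, doesn't.
keys = MyApp.Cache.stream() |> Enum.to_list()

# Batch the writes: one set_many/2 call instead of many add/2 calls, so far
# fewer replication tasks (and exit messages) are generated. Note set_many/2
# overwrites existing keys, which add/2 would not.
MyApp.Cache.set_many([{:a, 1}, {:b, 2}, {:c, 3}])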

Together, I think those changes mitigate most of our performance impact from this issue. We still need extra handle_info/2 clauses in four modules to avoid FunctionClauseError, but the fires are out for now.
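For reference, the extra clause is along these lines (an illustrative GenServer, not our actual modules):

defmodule MyApp.SomeServer do
  use GenServer

  def init(state) do
    Process.flag(:trap_exit, true)
    {:ok, state}
  end

  # Ignore normal exits from processes we didn't start ourselves, such as the
  # replicated cache's task-supervisor children, instead of crashing with a
  # FunctionClauseError.
  def handle_info({:EXIT, _pid, :normal}, state), do: {:noreply, state}
end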

cabol commented Oct 13, 2020

Right, I think it could be related to :shards, as you mention. For the query-based operations, :shards runs in parallel, spawning processes via :proc_lib.spawn_link/3, and those processes run under the caller process (in your example above, the IEx shell process); there is no dynamic supervisor for them (like Elixir's Task.Supervisor, for example). :shards has been refactored and improved significantly, and v1 will come very soon, AND I will include this fix, adding a kind of supervisor like Task.Supervisor for parallel executions (BTW, in the new version the query operations run sequentially by default; parallel execution is now a config parameter).

In the meantime, I'd suggest you run the same tests but with the master branch (which is now on v2). The cache would be something like:

defmodule MyApp.ReplicatedCache do
  use Nebulex.Cache,
    otp_app: :my_app,
    adapter: Nebulex.Adapters.Replicated
end

And the config, for example:

config :my_app, MyApp.ReplicatedCache,
  primary: [
    gc_interval: :timer.seconds(3600),
    backend: :ets
  ]

As you'll notice, this tests the adapter but with the :ets backend, to see whether the issue is definitely related to :shards or not. Then I think you can also try the :shards backend on this new version, along the lines of the sketch below. Let me know what you think, stay tuned!
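That second run would just swap the backend in the same config (a sketch, assuming everything else stays as above):

config :my_app, MyApp.ReplicatedCache,
  primary: [
    gc_interval: :timer.seconds(3600),
    backend: :shards
  ]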

Thanks!!

cabol commented Oct 29, 2020

@garthk I have updated Nebulex to use the latest version of :shards (along with other improvements and fixes). Have you had the chance to test whether this issue still happens with the master branch?

garthk commented Oct 29, 2020

Looking good! I started those tests off in 1.2.2, made sure they failed, switched to master, adapted them to the new API, saw they still failed, remembered to pull from upstream, and 🎉
