
{:EXIT, #PID<0.2945.0>, :normal} #79

Closed
garthk opened this issue Sep 23, 2020 · 6 comments

garthk commented Sep 23, 2020

G'day!

Replicated cache operations performed while trapping exits result in exit messages for processes you didn't start yourself, in turn causing FunctionClauseError crashes and other misbehaviour.

iex(311)> flush()
:ok
iex(312)> Process.flag(:trap_exit, true)
true
iex(313)> flush()
:ok
iex(314)> MyApp.Cache.flush()
:ok
iex(315)> flush()
{:EXIT, #PID<0.2945.0>, :normal}
:ok
iex(316)> MyApp.Cache.set(1, 1)
1
iex(317)> flush
{:EXIT, #PID<0.2950.0>, :normal}
:ok

I've confirmed with :dbg that those processes were started by the cache's task supervisor:

iex(343)> Process.info(pid("0.705.0"))
[
  registered_name: MyApp.Cache.TaskSupervisor,
  current_function: {:gen_server, :loop, 7},

… and, by modifying the code, that it was Nebulex.RPC.multi_call/3 in particular:

@spec multi_call(Supervisor.supervisor(), node_group, Keyword.t()) :: term
def multi_call(supervisor, node_group, opts \\ []) do
  node_group
  |> Enum.map(fn {node, {mod, fun, args}} ->
    Task.Supervisor.async({supervisor, node}, mod, fun, args)
  end)
  |> handle_multi_call(node_group, opts)
end

If you switch to Task.Supervisor.async_nolink/4, you won't cause those messages. I think your use of Task.yield_many/2 below should still catch them exiting?
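For illustration, here's a minimal, self-contained sketch of that suggestion (the module name and timeout are made up for the example, this isn't the actual Nebulex.RPC patch): async_nolink/4 starts the tasks under the task supervisor without linking them to the caller, and Task.yield_many/2 should still collect their results.

defmodule MultiCallSketch do
  # Spawn one unlinked task per node and collect the replies; the caller no
  # longer receives {:EXIT, pid, :normal} messages for these tasks.
  def multi_call(supervisor, node_group, timeout \\ 5_000) do
    node_group
    |> Enum.map(fn {node, {mod, fun, args}} ->
      Task.Supervisor.async_nolink({supervisor, node}, mod, fun, args)
    end)
    |> Task.yield_many(timeout)
  end
end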

garthk commented Oct 12, 2020

G'day! I've tried it out, but despite elixir-lang/elixir#5554, async_nolink doesn't seem to solve the problem. In one scenario I end up checking the cache a thousand times, saturating my schedulers. Reckon it'd help to direct my get/2 calls to the primary, as if it were a read replica?

cabol commented Oct 12, 2020

Hey!! Sorry, I haven't been able to look into it yet; I'll try to check it out this week. On the other hand, out of curiosity, have you tried the partitioned adapter Nebulex.Adapters.Partitioned? Because if the problem is related to Nebulex.RPC, other adapters could be affected too.
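A partitioned cache would be declared much like a replicated one, something like this sketch (the module name is just illustrative):

defmodule MyApp.PartitionedCache do
  use Nebulex.Cache,
    otp_app: :my_app,
    adapter: Nebulex.Adapters.Partitioned
end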

garthk commented Oct 13, 2020

Not yet, no, but your hunch strikes me as reasonable.

I've figured out most of our recent trouble was coming from all/1, via :shards_local.select/3 and :proc_lib.spawn_link/3. Luckily, it turned out stream/1 uses :shards_local.select/4, which doesn't spawn tasks… at least for that particular usage pattern. I haven't checked the code.

Next worst was add/2, which I've replaced with set_many/2. Not quite the same semantics, but it'll do, and because of my usage pattern that cut down the exit message volume by ~99%.
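Roughly, the two changes look like this sketch (cache calls follow the 1.x API used in this thread; the entries are made-up examples):

# all/1 goes through :shards_local.select/3, which spawns linked processes;
# stream/1 goes through :shards_local.select/4, which, for this usage, doesn't.
keys = MyApp.Cache.stream() |> Enum.to_list()

# Batch the writes: one set_many/2 call instead of many add/2 calls, so far
# fewer replication tasks (and exit messages) are generated. Note set_many/2
# overwrites existing keys, which add/2 would not.
MyApp.Cache.set_many([{:a, 1}, {:b, 2}, {:c, 3}])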

Together, I think those changes mitigate most of our performance impact from this issue. We still need extra handle_info/2 clauses in four modules to avoid FunctionClauseError, but the fires are out for now.
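For reference, the extra clause is along these lines (an illustrative GenServer, not our actual modules):

defmodule MyApp.SomeServer do
  use GenServer

  def init(state) do
    Process.flag(:trap_exit, true)
    {:ok, state}
  end

  # Ignore normal exits from processes we didn't start ourselves, such as the
  # replicated cache's task-supervisor children, instead of crashing with a
  # FunctionClauseError.
  def handle_info({:EXIT, _pid, :normal}, state), do: {:noreply, state}
end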

cabol commented Oct 13, 2020

Right, I think it could be related to :shards, as you mention. For the query-based operations, :shards runs in parallel, spawning processes via :proc_lib.spawn_link/3, and those processes run under the caller process (in your example above, the IEx shell process); there is no dynamic supervisor for them (like Elixir's Task.Supervisor, for example). :shards has been refactored and improved significantly, and v1 will come very soon, AND I will include this fix, adding a kind of supervisor like Task.Supervisor for parallel executions (BTW, in the new version the query operations run sequentially by default; parallel execution is now a config parameter).

In the meantime, I'd suggest you run the same tests but with the master branch (which is now on v2). The cache would be something like:

defmodule MyApp.ReplicatedCache do
  use Nebulex.Cache,
    otp_app: :my_app,
    adapter: Nebulex.Adapters.Replicated
end

And the config, for example:

config :my_app, MyApp.ReplicatedCache,
  primary: [
    gc_interval: :timer.seconds(3600),
    backend: :ets
  ]

As you'll notice, this tests the adapter but with the :ets backend, to see whether the issue is definitely related to :shards or not. Then I think you can also try the :shards backend on this new version, along the lines of the sketch below. Let me know what you think, stay tuned!
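That second run would just swap the backend in the same config (a sketch, assuming everything else stays as above):

config :my_app, MyApp.ReplicatedCache,
  primary: [
    gc_interval: :timer.seconds(3600),
    backend: :shards
  ]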

Thanks!!

cabol commented Oct 29, 2020

@garthk I have updated Nebulex to use the latest version of :shards (along with other improvements and fixes). Have you had the chance to test whether this issue still happens with the master branch?

garthk commented Oct 29, 2020

Looking good! I started those tests off in 1.2.2, made sure they failed, switched to master, adapted them to the new API, saw they still failed, remembered to pull from upstream, and 🎉
