
Fix badmatch in find_next_node/0 #5193

Merged

nickva merged 1 commit into main from fix-badmatch-in-find-next-node on Aug 22, 2024

Conversation

@nickva (Contributor) commented Aug 21, 2024

Previously, `find_next_node/0` didn't handle odd cases, such as the current node not being in `mem3:nodes()` or `mem3:nodes()` being empty, and crashed with a badmatch.

Make sure to handle those cases and return `node()` for them. We already handled a `node()` target as a no-op in `push/2`; we just have to also add it to `maybe_resubmit/2`.

Also, to avoid confusion between "live" nodes and "all" nodes, opt to use the `Mem3Nodes` variable name in the helper function to make it clear what we're dealing with.

Fix #5191
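The fixed lookup is easy to model outside Erlang. Below is a Python sketch of the ring walk with the new fallback (the function name and arguments are illustrative, not the actual `mem3_sync` code): any edge case returns self, which callers already treat as a no-op.

```python
def find_next_node(self_node, live_nodes, mem3_nodes):
    """Model of the fixed logic: fall back to self_node on any
    edge case instead of crashing."""
    nodes = sorted(n for n in mem3_nodes if n in live_nodes)
    if self_node not in nodes or len(nodes) < 2:
        # Current node missing from the mem3 nodes, or no other node
        # to sync with: return self, which callers treat as a no-op.
        return self_node
    ring = nodes + [nodes[0]]            # wrap around the sorted ring
    return ring[ring.index(self_node) + 1]

print(find_next_node("n", ["a", "n"], ["a"]))       # "n" (self not in mem3 nodes)
print(find_next_node("a", ["a", "b"], ["a", "b"]))  # "b"
print(find_next_node("b", ["a", "b"], ["a", "b"]))  # "a" (wraps around)
```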

@jaydoane (Contributor) left a comment:

Nice fix!

@nickva nickva force-pushed the fix-badmatch-in-find-next-node branch from 1f3b34d to d1e0988 Compare August 22, 2024 01:00
@nickva nickva force-pushed the fix-badmatch-in-find-next-node branch from d1e0988 to dd4870f Compare August 22, 2024 01:01
@nickva nickva merged commit 637fb79 into main Aug 22, 2024
@nickva nickva deleted the fix-badmatch-in-find-next-node branch August 22, 2024 01:58
@rnewson (Member) commented Aug 22, 2024

An explicit `nonode` result would be clearer; relying on the caller to decide to do nothing when it gets back `node()` is a bit cute. Still, we need to handle these edge cases. It's interesting that we've never encountered this before.

@nickva (Contributor, Author) commented Aug 22, 2024

It is a bit cute to use replicate-to-self as a fallback. I was mainly relying on the existing "replicate to self" behavior being a no-op in:

```erlang
push(#job{node = Node} = Job) when Node =/= node() ->
    gen_server:cast(?MODULE, {push, Job});
push(_) ->
    ok.
```
With `nonode` we would have had to add both special cases, one for self and one for `nonode`, in various places.
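That guard clause can be modeled in a few lines of Python (hypothetical names, standing in for the Erlang above) to show why a self target needs no extra branch: the first clause's guard already drops it through to the no-op.

```python
def push(job_node, self_node):
    # Model of the Erlang guard clause: a job targeting another node is
    # enqueued; a job targeting ourselves falls through to the no-op clause.
    if job_node != self_node:
        return ("cast", job_node)  # stands in for gen_server:cast/2
    return "ok"                    # replicate-to-self: silently ignored

print(push("b", "a"))  # ("cast", "b")
print(push("a", "a"))  # "ok"
```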

I built a small test module to play with the previous logic, wondering the same thing: why didn't we see this before?

```erlang
-module(nn).

-export([
    find_next_node/3
]).

find_next_node(Self, LiveNodes, Mem3Nodes) ->
    AllNodes0 = lists:sort(Mem3Nodes),
    AllNodes1 = [X || X <- AllNodes0, lists:member(X, LiveNodes)],
    AllNodes = AllNodes1 ++ [hd(AllNodes1)],
    [_Self, Next | _] = lists:dropwhile(fun(N) -> N =/= Self end, AllNodes),
    Next.
```
```erlang
> c(nn).

> nn:find_next_node(n, [a,n], [a]).
** exception error: no match of right hand side value []
     in function  nn:find_next_node/3 (nn.erl, line 11)
```
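The same failure reproduces in a Python model of the old logic (names are illustrative): when the current node is absent from the filtered list, `dropwhile` consumes everything and the two-element unpack fails, the analogue of the Erlang badmatch.

```python
from itertools import dropwhile

def find_next_node_old(self_node, live_nodes, mem3_nodes):
    # Model of the pre-fix logic: sort, filter by liveness, wrap the ring...
    nodes = sorted(n for n in mem3_nodes if n in live_nodes)
    ring = nodes + [nodes[0]]  # IndexError if the intersection is empty
    # ...then assume self is in the ring; the unpack below fails (like the
    # badmatch) when dropwhile never finds self and yields nothing.
    _self, nxt, *_ = dropwhile(lambda n: n != self_node, ring)
    return nxt

print(find_next_node_old("a", ["a", "b"], ["a", "b"]))  # "b"
# find_next_node_old("n", ["a", "n"], ["a"]) raises ValueError:
# dropwhile never finds "n", so there is nothing left to unpack.
```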

So one case where we'd trigger this is if the node we're on removes itself from the nodes list. The user reported the logs filling up and the machine being "frozen". It must have happened after initial sync started with the node still in the `mem3:nodes()` list; then the node was removed, the error happened, and `initial_sync` crashed:

```
[error] 2024-08-21T19:02:06.680089Z couchdb@192.168.10.235 emulator -------- Error in process <0.30202.42> on node 'couchdb@192.168.10.235' with exit value:
{{badmatch,[]},[{mem3_sync,find_next_node,0,[{file,"src/mem3_sync.erl"},{line,309}]},{mem3_sync,sync_nodes_and_dbs,0,[{file,"src/mem3_sync.erl"},{line,265}]},{mem3_sync,initial_sync,1,[{file,"src/mem3_sync.erl"},{line,272}]}]}
```

On a crash we end up restarting it:

```erlang
handle_info({'DOWN', _, _, _, {sync_error, Nodes}}, #st{tid = Tid} = St) ->
    Pid = start_sync(Nodes),
    ets:insert(Tid, #job{nodes = Nodes, pid = Pid, retry = false}),
    {noreply, St};
```

and from then on it will just keep crashing. So I opted to fold both the self case and the "odd" cases into the already existing no-op case, so this crash cycle can't happen.



Successfully merging this pull request may close these issues:

- Error loop with system freeze when removing a node from a cluster
