
Fix badmatch in find_next_node/0 #5193

Merged

nickva merged 1 commit into main from fix-badmatch-in-find-next-node on Aug 22, 2024

Conversation

@nickva (Contributor) commented Aug 21, 2024

Previously, `find_next_node/0` didn't handle odd cases, such as the current node not being in `mem3:nodes()` or `mem3:nodes()` being empty, and crashed with a badmatch.

Make sure to handle those cases and return `node()` for them. We already handled a `node()` target as a no-op in `push/2`; we just have to also add it to `maybe_resubmit/2`.

Also, to avoid confusion between "live" nodes and "all" nodes, opt to use the `Mem3Nodes` variable name in the helper function to make it clear what we're dealing with.

Fix #5191
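The fixed lookup is easy to model outside Erlang. Below is a Python sketch of the ring walk with the new fallback (the function name and arguments are illustrative, not the actual `mem3_sync` code): any edge case returns self, which callers already treat as a no-op.

```python
def find_next_node(self_node, live_nodes, mem3_nodes):
    """Model of the fixed logic: fall back to self_node on any
    edge case instead of crashing."""
    nodes = sorted(n for n in mem3_nodes if n in live_nodes)
    if self_node not in nodes or len(nodes) < 2:
        # Current node missing from the mem3 nodes, or no other node
        # to sync with: return self, which callers treat as a no-op.
        return self_node
    ring = nodes + [nodes[0]]            # wrap around the sorted ring
    return ring[ring.index(self_node) + 1]

print(find_next_node("n", ["a", "n"], ["a"]))       # "n" (self not in mem3 nodes)
print(find_next_node("a", ["a", "b"], ["a", "b"]))  # "b"
print(find_next_node("b", ["a", "b"], ["a", "b"]))  # "a" (wraps around)
```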

@jaydoane (Contributor) left a comment:

Nice fix!

@nickva nickva force-pushed the fix-badmatch-in-find-next-node branch from 1f3b34d to d1e0988 Compare August 22, 2024 01:00
@nickva nickva force-pushed the fix-badmatch-in-find-next-node branch from d1e0988 to dd4870f Compare August 22, 2024 01:01
@nickva nickva merged commit 637fb79 into main Aug 22, 2024
@nickva nickva deleted the fix-badmatch-in-find-next-node branch August 22, 2024 01:58
@rnewson (Member) commented Aug 22, 2024

An explicit `nonode` result would be clearer; relying on the caller to decide to do nothing when it gets back `node()` is a bit cute. Still, we need to handle these edge cases. It's interesting that we've never encountered this before.

@nickva (Contributor, Author) commented Aug 22, 2024

It is a bit cute to use replicate-to-self as a fallback. I was mainly relying on the existing "replicate to self" behavior being a no-op in:

```erlang
push(#job{node = Node} = Job) when Node =/= node() ->
    gen_server:cast(?MODULE, {push, Job});
push(_) ->
    ok.
```
With `nonode` we would have had to add both special cases, one for self and one for `nonode`, in various places.
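That guard clause can be modeled in a few lines of Python (hypothetical names, standing in for the Erlang above) to show why a self target needs no extra branch: the first clause's guard already drops it through to the no-op.

```python
def push(job_node, self_node):
    # Model of the Erlang guard clause: a job targeting another node is
    # enqueued; a job targeting ourselves falls through to the no-op clause.
    if job_node != self_node:
        return ("cast", job_node)  # stands in for gen_server:cast/2
    return "ok"                    # replicate-to-self: silently ignored

print(push("b", "a"))  # ("cast", "b")
print(push("a", "a"))  # "ok"
```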

I built a small test module to play with the previous logic, wondering the same thing: why didn't we see this before?

```erlang
-module(nn).

-export([
    find_next_node/3
]).

find_next_node(Self, LiveNodes, Mem3Nodes) ->
    AllNodes0 = lists:sort(Mem3Nodes),
    AllNodes1 = [X || X <- AllNodes0, lists:member(X, LiveNodes)],
    AllNodes = AllNodes1 ++ [hd(AllNodes1)],
    [_Self, Next | _] = lists:dropwhile(fun(N) -> N =/= Self end, AllNodes),
    Next.
```
```erlang
> c(nn).

> nn:find_next_node(n, [a,n], [a]).
** exception error: no match of right hand side value []
     in function  nn:find_next_node/3 (nn.erl, line 11)
```
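The same failure reproduces in a Python model of the old logic (names are illustrative): when the current node is absent from the filtered list, `dropwhile` consumes everything and the two-element unpack fails, the analogue of the Erlang badmatch.

```python
from itertools import dropwhile

def find_next_node_old(self_node, live_nodes, mem3_nodes):
    # Model of the pre-fix logic: sort, filter by liveness, wrap the ring...
    nodes = sorted(n for n in mem3_nodes if n in live_nodes)
    ring = nodes + [nodes[0]]  # IndexError if the intersection is empty
    # ...then assume self is in the ring; the unpack below fails (like the
    # badmatch) when dropwhile never finds self and yields nothing.
    _self, nxt, *_ = dropwhile(lambda n: n != self_node, ring)
    return nxt

print(find_next_node_old("a", ["a", "b"], ["a", "b"]))  # "b"
# find_next_node_old("n", ["a", "n"], ["a"]) raises ValueError:
# dropwhile never finds "n", so there is nothing left to unpack.
```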

So one case where we'd trigger this is if the node we're on removes itself from the nodes list. The user reported the logs filling up and the machine being "frozen". It must have happened after initial sync started with the node still in the `mem3:nodes()` list; then the node was removed, the error happened, and `initial_sync` crashed:

```
[error] 2024-08-21T19:02:06.680089Z couchdb@192.168.10.235 emulator -------- Error in process <0.30202.42> on node 'couchdb@192.168.10.235' with exit value:
{{badmatch,[]},[{mem3_sync,find_next_node,0,[{file,"src/mem3_sync.erl"},{line,309}]},{mem3_sync,sync_nodes_and_dbs,0,[{file,"src/mem3_sync.erl"},{line,265}]},{mem3_sync,initial_sync,1,[{file,"src/mem3_sync.erl"},{line,272}]}]}
```

On a crash we end up restarting it:

```erlang
handle_info({'DOWN', _, _, _, {sync_error, Nodes}}, #st{tid = Tid} = St) ->
    Pid = start_sync(Nodes),
    ets:insert(Tid, #job{nodes = Nodes, pid = Pid, retry = false}),
    {noreply, St};
```

and from then on it will just keep crashing. So I opted to fold both the self case and the "odd" cases into the already existing no-op case, so this crash cycle can't happen.



Successfully merging this pull request may close these issues:

- Error loop with system freeze when removing a node from a cluster
