Prevent delayed opener error from crashing index servers#4811
Conversation
Previously, an index and opener process dying could have resulted in the index gen_server crashing. This was observed in the CI test as described in: #4809 The process in more detail was as follows: * When an async opener result is handled in the index server, there is a period of time when the index server is linked to both the index and the opener process. * After we reply early to the waiting clients, a client may do something to cause the indexer to crash, which would crash the opener along with it. That would generate two `{'EXIT', Pid, _}` messages queued in the indexer process' mailbox. * The index gen_server, is still processing the async opener result callback, where it would remove the openener from the `openers` table, then it returns `ok` to the async opener. * Index gen_server continues processing queued `EXIT` messages in `handle_info`: - The one for the indexer Pid is handled appropriately - The one for the opener leads to an eexit(...)` clause since we ended with an unknown process exiting. To avoid the race condition, and the extra opener `EXIT` message, unlink and reply early to the opener, as soon we linked to the indexer or had received the error. To avoid the small chance of still getting an `EXIT` message from the opener, in case it crashed right before we unlinked, flush any exit messages from it. We do a similar flushing in two other places so create small utility function to avoid duplicating the code too much. Fix: #4809
| [gen_server:reply(From, {ok, Pid}) || From <- Waiters], | ||
| % Flush opener exit messages in case it died before we unlinked it | ||
| ok = flush_exit_messages_from(OpenerPid), | ||
| {noreply, State}; |
There was a problem hiding this comment.
I noticed that both here and in below you changed from {reply, ok, State} to {noreply, State}. Can you explain why?
There was a problem hiding this comment.
Good question, @jaydoane. That's because we already replied early to the opener with gen_server:reply(FromOpener, ok). The "opener" call into the index_server then gets to end early and it doesn't have to wait for us to finish the rest of the call.
These two calls can now end a bit quicker and we should be able to garbage collect them sooner.
new_index({Mod, IdxState, DbName, Sig}) ->
DDocId = Mod:get(idx_name, IdxState),
case couch_index:start_link({Mod, IdxState}) of
{ok, Pid} ->
ok = gen_server:call(
server_name(DbName), {async_open, {DbName, DDocId, Sig}, {ok, Pid}}
),
unlink(Pid);
Error ->
ok = gen_server:call(
server_name(DbName), {async_error, {DbName, DDocId, Sig}, Error}
)
end.The most importantly, however is we get the opener out of the way once we already linked to the main indexer process, or received the error.
There was a problem hiding this comment.
Ah, nice, I missed those early replies.
Previously, an index and opener process dying could have resulted in the index gen_server crashing. This was observed in the CI test as described in: #4809
The process in more detail was as follows:
When an async opener result is handled in the index server, there is a period of time when the index server is linked to both the index and the opener process.
After we reply early to the waiting clients, a client may do something to cause the indexer to crash, which would crash the opener along with it. That would generate two
{'EXIT', Pid, _}messages queued in the indexer process' mailbox.The index gen_server, is still processing the async opener result callback, where it would remove the opener from the
openerstable, then it returnsokto the async opener.Index gen_server continues processing queued
EXITmessages inhandle_info:exit(...)clause since we ended with an unknown process exiting.To avoid the race condition, and the extra opener
EXITmessage, unlink and reply early to the opener, as soon we linked to the indexer or had received the error. To avoid the small chance of still getting anEXITmessage from the opener, in case it crashed right before we unlinked, flush any exit messages from it. We do a similar flushing in two other places so create small utility function to avoid duplicating the code too much.Fix: #4809